Impala 自用使用手册_impala 空值判断-CSDN博客

本文档详细介绍了Impala的使用，包括数据类型转换函数、如何计算不重复值、利用Lag和Lead函数处理时间序列数据、复杂数据类型如map和array的操作，以及排序和窗口函数的应用，特别是空值排序和计算销售额占比、环比的方法。

Impala 自用使用手册

一、函数
二、数据类型
- 1. map & array
三、排序
- 1. 空值排序问题
四、一些优雅写法
- 1. 计算销售额占比
- 2. 计算销售额环比
五、窗口函数合集
- 1. 窗口函数计算顺序

一、函数

1. 数据类型转换函数

cast(a as T)
/*
Converts the results of the expression expr to type T. For example, cast('1' as BIGINT) will convert the string '1' to its integral representation. 
A null is returned if the conversion does not succeed. 
If cast(expr as boolean) Hive returns true for a non-empty string.
即 CAST(字段名 as 要转换成的类型)
*/

# eg.
CAST(id as string)

2. 计算某列不重复值

count([DISTINCT|ALL] col)
/* An aggregate function that returns the number of rows, or the number of non-NULL rows. */

# eg.
COUNT(DISTINCT [brand_name])

3. 取出某一字段的前N行的数据(Lag)或后N行的数据(Lead)

lag(expr [, offset] [, default]) OVER ([partition_by_clause] order_by_clause)

/*
This function returns the value of an expression using column values from a preceding row. 
You specify an integer offset, which designates a row position some number of rows previous to the current row. 
Any column references in the expression argument refer to column values from that prior row.

Offset是偏移量，即是上1个或上N个的值，假设当前行在表中排在第5行，offset为3，则表示我们所要找的数据行就是表中的第2行（即5-3=2）
default默认值，当两个函数取上N/下N个值，当在表中从当前行位置向前数N行已经超出了表的范围时，函数将default这个参数值作为函数的返回值，若没有指定默认值，则返回NULL
*/

# eg.
round(lag(sum(sales),1)over(PARTITION BY platform ORDER BY dt)


lead(expr [, offset] [, default]) OVER ([partition_by_clause] order_by_clause)
/*
This function returns the value of an expression using column values from a following row.
You specify an integer offset, which designates a row position some number of rows after to the current row. 
Any column references in the expression argument refer to column values from that later row.

Offset是偏移量，即是上1个或上N个的值，假设当前行在表中排在第5行，offset为3，则表示我们所要找的数据行就是表中的第2行（即5-3=2）
default默认值，当两个函数取上N/下N个值，当在表中从当前行位置向前数N行已经超出了表的范围时，函数将default这个参数值作为函数的返回值，若没有指定默认值，则返回NULL
*/

# eg.
round(lead(sum(sales),1)over(PARTITION BY platform ORDER BY dt)

注：这两个函数可以很方便地计算同比和环比

二、数据类型

1. map & array

含义
array 类型视为一张表, 其列名为 item
map类型有两个列, 一个是key, 一个是value
建表

create table market_review(
    id int,
    platform array<string> comment 'array[tmall,jd]'
    tag map<string,string>  comment 'map<产品名称，产地>'
)
stored as parquet

示例
+-----------------+---------------------+----------------------------+--+
| product_id      |       platform      |           tag              |
+-----------------+---------------------+----------------------------+--+
| 1               |   ["pdd","vip"]     |     {"薇诺娜":"中国"}       |
| 2               |   ["tmall","jd"]    |     {"兰蔻":"法国"}         |
+-----------------+---------------------+----------------------------+--+

#注：Impala 只用parquet格式存储时，才能使用复杂数据类型

查询
impala 不允许直接查复杂数据类型，因此基本的处理方式是把复杂类型当作一张表来查。

查全量
select * from market_review a, a.platform b, a.tag c

or

select product_id , platform_array.item, tag_map.key, tag_map.value 
from market_review,market_review.platform as platform_array,market_review.tag as tag_map;

+----+------------+----------+--------+
| product_id |  platform  | key      | value  |
+----+------------+----------+--------+
|     1      |    pdd     |  薇诺娜  |  中国   |
|     1      |    vip     |  薇诺娜  |  中国   |
|     2      |    tmall   |   兰蔻   |  法国   |
|     2      |     jd     |   兰蔻   |  法国   |
+----+------------+----------+--------+

加筛选器
where
platform_array.item = 'pdd'
tag_map.key = '薇诺娜'
favorite_array.POS = 0 (取数组第一列）

三、排序

1. 空值排序问题

空值放在最前面/后面

order by 字段名 [ASC]/DESC NULLS LAST
# eg:
order by sales DESC NULLS LAST

四、一些优雅写法

1. 计算销售额占比

改进前：
with a as(
SELECT sum(sales) as total_sales from market_review
),
b as (
SELECT platform,sum(sales) as p_sales from market_review
GROUP BY 1
)
select platform,round(p_sales/total_sales,2) as ratio from b left join a 
on 1=1
order by 2 desc;

改进后：
SELECT platform,round(sum(sales),2),round(sum(sales)/sum(sum(sales))over(),2) as ratio from market_review
GROUP BY 1
order by 2 desc;

2. 计算销售额环比

注：需在整个表数据完整的情况下才可使用该方法

SELECT platform,dt,sum(sales),round(lag(sum(sales),1)over(PARTITION BY platform ORDER BY dt)/sum(sales)-1,2) as growth_rate from market_review
GROUP BY 1,2 
order by 1,2

五、窗口函数合集

1. 窗口函数计算顺序

select 后面的窗口函数会按前面几列叠加执行，即先执行前面几列，得出结果之后再执行窗口函数

解说：
SELECT platform,round(sum(sales),2),round(sum(sales)/sum(sum(sales))over(),2) as ratio from market_review
GROUP BY 1
order by 2 desc;

/*
原理：之所以要用两个sum，是由于窗口函数的执行顺序导致的，执行窗口函数前会先执行sum(sku_value_sales)，
所以要把sum(sku_value_sales)看成一个整体，再sum一次，即sum(sum(sku_value_sales))
*/