Spark1.4发布,支持了窗口分析函数(window functions)。在离线平台中,90%以上的离线分析任务都是使用Hive实现,其中必然会使用很多窗口分析函数,如果SparkSQL支持窗口分析函数,
那么对于后面Hive向SparkSQL中的迁移的工作量会大大降低,使用方式如下:
1、初始化数据
创建表
- create table window_test2 (url string, rate int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
准备测试数据
- url1,12
- url2,11
- url1,23
- url2,25
- url1,58
- url3,11
- url2,25
- url3,58
- url2,11
加载数据:
- load data local inpath '/opt/bin/short_opt/windows2.data' overwrite into table window_test2 ;
2、窗口函数测试
查询所有数据
- select * from window_test2;
+-------+-------+| url | rate |+-------+-------+| url1 | 12 || url2 | 11 || url1 | 23 || url2 | 25 || url1 | 58 || url3 | 11 || url2 | 25 || url3 | 58 || url2 | 11 |+-------+-------+分组排序:
- select url,rate,row_number() over(partition by url order by rate desc) as r from window_test2;
+-------+-------+----+| url | rate | r |+-------+-------+----+| url1 | 58 | 1 || url1 | 23 | 2 || url1 | 12 | 3 || url2 | 25 | 1 || url2 | 25 | 2 || url2 | 11 | 3 || url2 | 11 | 4 || url3 | 58 | 1 || url3 | 11 | 2 |+-------+-------+----+分组统计sum
- select url,rate,sum(rate) over(partition by url ) as r from window_test2;
+-------+-------+-----+| url | rate | r |+-------+-------+-----+| url1 | 12 | 93 || url1 | 23 | 93 || url1 | 58 | 93 || url2 | 11 | 72 || url2 | 25 | 72 || url2 | 25 | 72 || url2 | 11 | 72 || url3 | 11 | 69 || url3 | 58 | 69 |+-------+-------+-----+分组统计avg
- select url,rate,avg(rate) over(partition by url ) as r from window_test2;
+-------+-------+-------+| url | rate | r |+-------+-------+-------+| url1 | 12 | 31.0 || url1 | 23 | 31.0 || url1 | 58 | 31.0 || url2 | 25 | 18.0 || url2 | 11 | 18.0 || url2 | 11 | 18.0 || url2 | 25 | 18.0 || url3 | 11 | 34.5 || url3 | 58 | 34.5 |+-------+-------+-------+分组统计count
- select url,rate,count(rate) over(partition by url ) as r from window_test2;
+-------+-------+----+| url | rate | r |+-------+-------+----+| url1 | 12 | 3 || url1 | 23 | 3 || url1 | 58 | 3 || url2 | 11 | 4 || url2 | 25 | 4 || url2 | 25 | 4 || url2 | 11 | 4 || url3 | 11 | 2 || url3 | 58 | 2 |+-------+-------+----+分组lag
- select url,rate,lag(rate) over(partition by url ) as r from window_test2;
+-------+-------+-------+| url | rate | r |+-------+-------+-------+| url1 | 12 | NULL || url1 | 23 | 12 || url1 | 58 | 23 || url2 | 25 | NULL || url2 | 11 | 25 || url2 | 11 | 11 || url2 | 25 | 11 || url3 | 11 | NULL || url3 | 58 | 11 |+-------+-------+-------+
3、spark-1.4以后,支持所有的窗口函数了,有利用于hive作业向spark-sql来转换。
posted on 2017-10-23 22:04
xzc 阅读(680)
评论(0) 编辑 收藏 所属分类:
hadoop