部分内容出处:
http://www.atatech.org/article/detail/5617/0
http://www.atatech.org/article/detail/4392/515
一.UDFS函数介绍
1. 基本UDF
(1)SHOWFUNCTIONS:这个用来熟悉未知函数。
DESCRIBE FUNCTION<function_name>;
(2)A IS NULL
A IS NOT NULL
(3)A LIKE B 普通sql匹配如 like “a%”
A RLIKE B通过正则表达式匹配
A REGEXP B 通过正则表达式匹配
(4)round(double a):四舍五入
(5)rand(),rand(int seed):返回在(0,1)平均分布的随机数
(6)COALESCE(pv, 0):将 pv 为 null 的行转为0,很实用
2. 日期函数
(1)datediff(string enddate, stringstartdate):
返回enddate和startdate的天数的差,例如datediff('2009-03-01','2009-02-27') = 2
(2)date_add(stringstartdate, int days):
加days天数到startdate:date_add('2008-12-31', 1) ='2009-01-01'
(3)date_sub(stringstartdate, int days):
减days天数到startdate:date_sub('2008-12-31', 1) ='2008-12-30'
(4)date_format(date,date_pattern)
CREATETEMPORARY FUNCTION date_format AS'com.taobao.hive.udf.UDFDateFormat';
根据格式串format 格式化日期和时间值date,返回结果串。
date_format('2010-10-10','yyyy-MM-dd','yyyyMMdd')
(5)str_to_date(str,format)
将字符串转化为日期函数
CREATE TEMPORARY FUNCTIONstr_to_date AS 'com.taobao.hive.udf.UDFStrToDate';
str_to_date('09/01/2009','MM/dd/yyyy')
3. 字符串函数
(1)length(stringA):返回字符串长度
(2)concat(stringA, string B...):
合并字符串,例如concat('foo','bar')='foobar'。注意这一函数可以接受任意个数的参数
(3)substr(stringA, int start) substring(string A,int start):
返回子串,例如substr('foobar',4)='bar'
(4)substring(string A, int start,int len):
返回限定长度的子串,例如substr('foobar',4, 1)='b'
(5)split(stringstr, string pat):
返回使用pat作为正则表达式分割str字符串的列表。例如,split('foobar','o')[2] = 'bar'。
(6)getkeyvalue(str,param):
从字符串中获得指定 key 的 value 值 UDFKeyValue
CREATE TEMPORARY FUNCTION getkeyvalue AS 'com.taobao.hive.udf.UDFKeyValue';
4. 自定义函数
(1)row_number
- CREATE TEMPORARY FUNCTION row_number AS 'com.taobao.ad.data.search.udf.UDFrow_number';
- select ip,uid,row_number(ip,uid) from (
- select ip,uid,logtime from atpanel
- distribute by ip,uid
- sort by ip,uid,logtime desc
- ) a
(2)拆分key_value键值对
- CREATE TEMPORARY FUNCTION ExplodeEX AS 'com.taobao.hive.udtf.UDTFExplodeEX';
- select
- split(kvs,'_')[0] as key,
- split(kvs,'_')[1] as key,
- from ( select 'a-1|b-2' as kv from dual ) t
- lateral view explode (split(kv,'\\|')) result as kvs
二. HIVE新特性
1. 支持多列的COUNT(*)和COUNT DISTINCT查询
select count(distinct col1, col2) from table_name;select count(*) from table_name;
2. 提供以本地模式运行Hive的选项
设置mapred.job.tracker=local可开启本地运行模式
3. 增强的列重命名语法
增加 ALTERTABLE table_name CHANGE old_name new_name语法。
4. 支持UNIQUE JOIN HIVE-591
select .. from JOINTABLES (A,B,C) WITH KEYS (A.key, B.key, C.key) where ....
5. 增加检测表和分区状态的语法HIVE-667
使用show table_name语法,检查表和分区的状态,包括大小和创建、访问时间戳。6. 增加建表时支持STRUCT,结构体
7. 增加选择驱动表的提示
8. 增加/*+STREAMTABLE(tb_alias)*/ HINT,以在Join操作时指定驱动表:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.valFROM a
JOIN b ON (a.key = b.key1)JOIN c ON (c.key = b.key1)
指定此HINT后,原先默认的右表驱动会失效。
9. left Semi-Join HIVE-870
Left Semi-Join是可以高效实现IN/EXISTS子查询的语义。以下SQL语义:
(1)SELECT a.key, a.value FROM a WHERE a.key in (SELECT b.key FROM b);
未实现Left Semi-Join之前,Hive实现上述语义的语句是:
SELECT t1.key, t1.value FROM a t1
left outer join (SELECT distinct key from b) t2
on t1.id = t2.id where t2.id is not null;
(2)可被替换为Left Semi-Join如下:
SELECT a.key, a.valFROM a LEFT SEMI JOIN b on (a.key = b.key)
这一实现减少至少1次MapReduce过程,注意Left Semi-Join的Join条件必须是等值。
10.Skew Join优化 HIVE-964 ,数据倾斜
优化skewed join key为map join。开启hive.optimize.skewjoin=true可优化倾斜的数据。Skew Join优化需要额外的mapjoin操作,且不能节省shuffle的代价。
11.Sorted merge (map) join HIVE-1194
(对关键表key排序)
如果MapJoin中的表都是有序的,这一特性使得Join操作无需扫描整个表,这将大大加速Join操作。可通过hive.optimize.bucketmapjoin.sortedmerge=true开启这个功能,获得高的性能提升。
12.支持ALTER TABLE修改分区的InputFormat/OutputFormat定义
这一特性使得我们可以用压缩方式(SequenceFileInputFormat)存储后续表分区的数据,同时又不需要对以前的表分区做修改,即透明切换到压缩格式。
13.支持并发提交没有依赖关系的MR过程HIVE-549
此前的Hive仅仅顺序提交MR任务。这一增强使得没有依赖关系的多次MR过程(例如Union all语义中的多个子查询)可以并发提交。某些情况下可以提高单条HQL命令的响应速度。以下参数对并发提交功能启作用:
hive.exec.parallel[=false]
hive.exec.parallel.thread.number[=8]
14.Sorted Group byHIVE-931
(中间表的预处理)
对已排序的字段做Group by可以不再额外提交一次MR过程。这种情况下可以提高执行效率。
15.UDTF支持
UDTF即User defined table function,是一种UDF,区别是这种UDF可以返回多条记录。这一修改使得当前很多Transform脚本可以被替换为更通用、更高效、更用户友好的UDTF实现。UDTF是一种1:n输出,可用于行转列等。
UDTF不支持UDTF/列混合的select、不支持嵌套、不支持相同子查询中的GROUP BY / CLUSTER BY /DISTRIBUTE BY / SORT BY。
UDTF可与Lateral View相结合。
16.支持动态分区HIVE-1002HIVE-1238
动态分区可通过设定hive.exec.dynamic.partition=true打开DP特性。使用方法:
INSERT OVERWRITETABLE tbl partition (col1[=value][, col2[=value] …])
使用hive.exec.dynamic.partition.mode = nonstrict动态分区有一定风险,包括小文件、覆盖数据等。默认分区开关:
hive.exec.default.dynamic.partition.name
17.插入强制排序HIVE-1193
只需要打开hive.enforce.sorting选项即可。这一特性对Sorted merge bucket (map) join非常有用
18.支持视图功能
可用于字段级别的权限控制
19.支持持笛卡尔积join(1.0特性 SELECT a.*, b.*FROM aCROSS JOIN b
- CREATE VIEW [IF NOT EXISTS] view_name
- [ (column_name [COMMENT column_comment], … ) ]
- [COMMENT ‘view_comment’]
- AS SELECT …
- [ ORDER BY … LIMIT … ]
三. hive优化方式总结
1. 多表join优化代码结构:
select .. from JOINTABLES (A,B,C) WITH KEYS (A.key, B.key, C.key) where ....
关联条件相同多表join会优化成一个job
2. LeftSemi-Join是可以高效实现IN/EXISTS子查询的语义
SELECT a.key,a.value FROM a WHERE a.key in (SELECT b.key FROM b);
(1)未实现Left Semi-Join之前,Hive实现上述语义的语句是:
SELECT t1.key, t1.valueFROM a t1
left outer join (SELECT distinctkey from b) t2 on t1.id = t2.id
where t2.id is not null;
(2)可被替换为Left Semi-Join如下:
SELECT a.key, a.valFROM a LEFT SEMI JOIN b on (a.key = b.key)
这一实现减少至少1次MR过程,注意Left Semi-Join的Join条件必须是等值。
3. 预排序减少map join和group by扫描数据HIVE-1194
(1)重要报表预排序,打开hive.enforce.sorting选项即可
(2)如果MapJoin中的表都是有序的,这一特性使得Join操作无需扫描整个表,这将大大加速Join操作。可通过
hive.optimize.bucketmapjoin.sortedmerge=true开启这个功能,获得高的性能提升。
- set hive.mapjoin.cache.numrows=10000000;
- set hive.mapjoin.size.key=100000;
- Insert overwrite table pv_users
- Select /*+MAPJOIN(pv)*/ pv.pageid,u.age
- from page_view pv
- join user u on (pv.userid=u.userid;
(3)Sorted Group byHIVE-931
对已排序的字段做Group by可以不再额外提交一次MR过程。这种情况下可以提高执行效率。
4. 次性pv uv计算框架
(1)多个mr任务批量提交
hive.exec.parallel[=false]
hive.exec.parallel.thread.number[=8]
(2) 一次性计算框架,结合multi group by
如果少量数据多个union会优化成一个job;
反之计算量过大可以开启批量mr任务提交减少计算压力;
利用两次group by 解决count distinct 数据倾斜问题
- Set hive.exec.parallel=true;
- Set hive.exec.parallel.thread.number=2;
- From(
- Select
- Yw_type,
- Sum(case when type=’pv’ then ct end) as pv,
- Sum(case when type=’pv’ then 1 end) as uv,
- Sum(case when type=’click’ then ct end) as ipv,
- Sum(case when type=’click’ then 1 end) as ipv_uv
- from (
- select
- yw_type,log_type,uid,count(1) as ct
- from (
- select ‘total’ yw_type,‘pv’ log_type,uid from pv_log
- union all
- select ‘cat’ yw_type,‘click’ log_type,uid from click_log
- ) t group by yw_type,log_type
- ) t group by yw_type
- ) t
- Insert overwrite table tmp_1
- Select pv,uv,ipv,ipv_uv
- Where yw_type=’total’
-
- Insert overwrite table tmp_2
- Select pv,uv,ipv,ipv_uv
- Where yw_type=’cat’;
5. 控制hive中的map和reduce数
(1)合并小文件
- set mapred.max.split.size=100000000;
- set mapred.min.split.size.per.node=100000000;
- set mapred.min.split.size.per.rack=100000000;
- set hive.input.format=
- org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive.input.format=……表示合并小文件。大于文件块大小128m的,按照128m来分隔,小于128m,大于100m的,按照100m来分隔,把那些小于100m的(包括小文件和分隔大文件剩下的),进行合并,最终生成了74个块
(2)耗时任务增大map数
setmapred.reduce.tasks=10;
6. 利用随机数减少数据倾斜
大表之间join容易因为空值产生数据倾斜
- select
- a.uid
- from big_table_a a
- left outer join big_table_b b
- on b.uid = case when a.uid is null or length(a.uid)=0
- then concat('rd_sid',rand()) else a.uid end;
四. 小技巧
1.空值处理, 结果表\N用空字符串代替
ALTER TABLE a SETSERDEPROPERTIES('serialization.null.format' = '');
2. 避免暴力扫描分区
今日全量=昨日全量+今日增量
30数据=前一个30日数据-31日数据+今日数据
适用场景:需求稳定,需要访问30天或1年数据
3. 利用动态分区减少任务执行时间
五. 通过JobTracker 源数据找出低效代码
1. On条件没写或者扫描过多分区情况
Uv计算参考一次性pv uv计算框架解决方案,on或者分区条件没写去掉即可
- select
- id as 天网id,prgname as 任务路径,viewname as 显示名称,job_id ,job_name,job_value,
- length(trim(inputdir))-length(replace(trim(inputdir),',',''))+1 as pathcnt
- from (
- select
- t1.id,t1.prgname,t1.viewname,
- t3.job_id,t3.job_name ,
- t3.job_value,
- DBMS_LOB.SUBSTR(t3.job_value,4000) as inputdir
- from(
- select
- id,prgname,paravalue,viewname from dwa.etl_task_program t
- where priority in('xx','xxx')
- and appflag=0
- ) t1,
- dwa.hdp_job_map t2,
- dwa.hdp_job_conf t3
- where t1.id = t2.id
- and t2.job_id = t3.job_id
- and t2.gmtdate = trunc(sysdate-1)
- and t3.gmtdate = trunc(sysdate-1)
- and t3.job_name = 'mapred.input.dir'
- )
- where length(trim(inputdir))-length(replace(trim(inputdir),',','')) > 10;
2. 同一个脚本相同单表被扫描多次
尽量把所需要的数据一次性读出来
- select sky_id as 天网id,viewname as 天网显示名称,
- tab_name as 被扫描表,on_duty as 负责人,count(1) as 扫描次数
- from(
- select distinct a.tab_name,c.sql_id,a.sub_sql_id,c.sky_id,e.viewname,e.on_duty
- from dwa.meta_tab a,
- dwa.meta_sqlsub b,
- (select * from
- (select sky_id,sql_id,sql_src,
- row_number() over(partition by sky_id,length(sql_src) order by sql_id) rn
- from dwa.meta_sqlfull
- )where rn=1) c,
- dwa.meta_col d,dwa.etl_task_program e
- where e.priority in('xx','xxx')
- and e.appflag=0 and e.id=c.sky_id
- and a.sub_sql_id=b.sub_sql_id and a.tab_id=d.tab_id and a.sub_sql_id=d.sub_sql_id and b.sqlfull_id=c.sql_id
- and a.tab_name not like '%-%' and b.sql_type='select'
- order by c.sky_id,c.sql_id,a.sub_sql_id
- )group by sky_id,viewname,tab_name,on_duty
- having count(1) >1
- order by cnt desc;
3. Job数过多
尽量一次性读取所需数据
才有union方式合并任务
Left outer join on条件相同会合并成一个job
- SELECT /*+ parallel(t,32) */
- groupname,
- id,
- BIZ_SORTID,
- ON_DUTY,
- PRGNAME,
- job_cnt,
- JOB_TOTAL_MAPS,
- JOB_TOTAL_REDUCES,
- TOTAL_TIME,
- HDFS_BYTES_READ,
- HDFS_BYTES_WRITTEN,
- TOTAL_MAP_TIME,
- TOTAL_REDUCE_TIME,
- MAP_INPUT_RECORDS,
- MAP_OUTPUT_RECORDS,
- REDUCE_INPUT_RECORDS,
- REDUCE_OUTPUT_RECORDS,
- time,
- row_number() over(partition by groupname order by TIME desc) rn_time,
- row_number() over(partition by groupname order by TOTAL_MAP_TIME+TOTAL_REDUCE_TIME desc) rn_slots
- rom(
- select
- DWA.ETL_TASK_BASELINE.name as groupname,
- DWA.HDP_JOB_MAP.ID,
- DWA.ETL_TASK_PROGRAM.BIZ_SORTID,
- DWA.ETL_TASK_PROGRAM.ON_DUTY,
- DWA.ETL_TASK_LOG.PRGNAME,
- count(DWA.HDP_JOB_MAP.job_id) job_cnt,
- sum(DWA.HDP_JOB_STAT.JOB_TOTAL_MAPS) JOB_TOTAL_MAPS,
- sum(DWA.HDP_JOB_STAT.JOB_TOTAL_REDUCES) JOB_TOTAL_REDUCES,
- sum(DWA.HDP_JOB_STAT.TOTAL_TIME) TOTAL_TIME,
- sum(DWA.HDP_JOB_STAT.HDFS_BYTES_READ) HDFS_BYTES_READ,
- sum(DWA.HDP_JOB_STAT.HDFS_BYTES_WRITTEN) HDFS_BYTES_WRITTEN,
- sum(DWA.HDP_JOB_STAT.TOTAL_MAP_TIME) TOTAL_MAP_TIME,
- sum(DWA.HDP_JOB_STAT.TOTAL_REDUCE_TIME) TOTAL_REDUCE_TIME,
- sum(DWA.HDP_JOB_STAT.MAP_INPUT_RECORDS) MAP_INPUT_RECORDS,
- sum(DWA.HDP_JOB_STAT.MAP_OUTPUT_RECORDS) MAP_OUTPUT_RECORDS,
- sum(DWA.HDP_JOB_STAT.REDUCE_INPUT_RECORDS) REDUCE_INPUT_RECORDS,
- sum(DWA.HDP_JOB_STAT.REDUCE_OUTPUT_RECORDS) REDUCE_OUTPUT_RECORDS,
- trunc((DWA.ETL_TASK_LOG.edate-DWA.ETL_TASK_LOG.sdate)*24*60) time
- FROM
- DWA.HDP_JOB_MAP,
- DWA.ETL_TASK_PROGRAM,
- DWA.ETL_TASK_LOG,
- DWA.HDP_JOB_STAT,
- DWA.ETL_TASK_BASELINE
- WHERE
- ( DWA.HDP_JOB_STAT.JOB_ID=DWA.HDP_JOB_MAP.JOB_ID )
- AND ( DWA.HDP_JOB_MAP.ID=DWA.ETL_TASK_LOG.ID )
- AND ( DWA.ETL_TASK_LOG.ID=DWA.ETL_TASK_PROGRAM.ID )
- AND ( DWA.ETL_TASK_PROGRAM.BASELINE_ID=DWA.ETL_TASK_BASELINE.ID )
- AND
- (
- ( ( DWA.HDP_JOB_STAT.GMTDATE ) = trunc(sysdate) )
- AND
- ( ( DWA.HDP_JOB_MAP.GMTDATE ) = trunc(sysdate) )
- AND
- ( ( DWA.ETL_TASK_LOG.GMTDATE ) = trunc(sysdate) )
- AND DWA.ETL_TASK_PROGRAM.priority in('xx','xxx')
- )
- GROUP BY
- DWA.ETL_TASK_BASELINE.name,
- DWA.HDP_JOB_MAP.ID,
- DWA.ETL_TASK_PROGRAM.BIZ_SORTID,
- DWA.ETL_TASK_PROGRAM.ON_DUTY,
- DWA.ETL_TASK_LOG.PRGNAME,
- (DWA.ETL_TASK_LOG.edate-DWA.ETL_TASK_LOG.sdate)*24*60
- ) t
- where time is not null and job_cnt>10
4. From表个数过多(节点入度过高)
- select sky_id as 天网id,viewname as 显示名称,
- sum(cnt) as 来源表使用次数,count(cnt) as 来源表个数
- from(
- select sky_id,viewname,tab_name,on_duty,count(1) cnt
- from(
- select distinct a.tab_name,c.sql_id,a.sub_sql_id,c.sky_id,e.viewname,e.on_duty
- from dwa.meta_tab a,dwa.meta_sqlsub b,
- (
- select *
- from(
- select sky_id,sql_id,sql_src,
- row_number() over(partition by sky_id,length(sql_src) order by sql_id) rn
- from dwa.meta_sqlfull)
- where rn=1
- ) c,
- dwa.meta_col d,dwa.etl_task_program e
- where e.priority in('xx','xxx')
- and e.appflag=0 and e.id=c.sky_id
- and a.sub_sql_id=b.sub_sql_id
- and a.tab_id=d.tab_id and a.sub_sql_id=d.sub_sql_id and b.sqlfull_id=c.sql_id
- and a.tab_name not like '%-%' and b.sql_type='select'
- order by c.sky_id,c.sql_id,a.sub_sql_id
- )
- group by sky_id,viewname,tab_name,on_duty
- order by cnt desc
- )
- group by sky_id,viewname
- order by sum(cnt) desc;
5. Job倾斜情况
空值处理方法:
(1)直接过滤掉
(2)空值加上随机数分散到不同的reduce
解决方法一job2,方法二job1
- select a11.GMTDATE as 任务执行日期,
- a11.GROUP_NAME as 业务线名称,
- a11.ID as 天网id,
- a11.SORT_ID as 云梯优先级,
- a11.NAME as 天网显示名称,
- a11.JOB_ID as job_id,
- a11.KEY_FLAG 是否关键节点任务,
- a11.USER_NAME 用户名,
- sum(a11.JOB_AVG_TIME) WJXBFS1,
- sum(a11.JOB_MAX_TIME) WJXBFS2,
- sum(a11.JOB_AVG_RECORDS) WJXBFS3,
- sum(a11.JOB_MAX_RECORDS) WJXBFS4
- from DWA.VIEW_HDP_JOB_STAT a11
- where gmtdate=date'2012-09-27'
- and group_name in ('xxxxx')
-
- group by a11.GMTDATE,
- a11.GROUP_NAME,
- a11.ID,
- a11.SORT_ID,
- a11.NAME,
- a11.JOB_ID,
- a11.KEY_FLAG,
- a11.USER_NAME ;
6. 相同输入字节数的任务抽取与合并
数据源相同的任务,抽取相同的job进行合并
- drop table gv_job_mapinput;
- create table gv_job_mapinput as
- select
- id,prgname,job_id,MAP_INPUT_BYTES
- from
- (
- select
- DWA.ETL_TASK_BASELINE.name groupname,
- DWA.HDP_JOB_MAP.ID,
- DWA.ETL_TASK_PROGRAM.BIZ_SORTID,
- DWA.ETL_TASK_PROGRAM.ON_DUTY,
- DWA.ETL_TASK_LOG.PRGNAME,
- DWA.HDP_JOB_MAP.job_id,
- sum(DWA.HDP_JOB_STAT.JOB_TOTAL_MAPS) JOB_TOTAL_MAPS,
- sum(DWA.HDP_JOB_STAT.JOB_TOTAL_REDUCES) JOB_TOTAL_REDUCES,
- sum(DWA.HDP_JOB_STAT.TOTAL_TIME) TOTAL_TIME,
- sum(DWA.HDP_JOB_STAT.HDFS_BYTES_READ) HDFS_BYTES_READ,
- sum(DWA.HDP_JOB_STAT.HDFS_BYTES_WRITTEN) HDFS_BYTES_WRITTEN,
- sum(DWA.HDP_JOB_STAT.TOTAL_MAP_TIME) TOTAL_MAP_TIME,
- sum(DWA.HDP_JOB_STAT.TOTAL_REDUCE_TIME) TOTAL_REDUCE_TIME,
- sum(DWA.HDP_JOB_STAT.MAP_INPUT_RECORDS) MAP_INPUT_RECORDS,
- sum(DWA.HDP_JOB_STAT.MAP_INPUT_BYTES) MAP_INPUT_BYTES,
- sum(DWA.HDP_JOB_STAT.MAP_OUTPUT_RECORDS) MAP_OUTPUT_RECORDS,
- sum(DWA.HDP_JOB_STAT.REDUCE_INPUT_RECORDS) REDUCE_INPUT_RECORDS,
- sum(DWA.HDP_JOB_STAT.REDUCE_OUTPUT_RECORDS) REDUCE_OUTPUT_RECORDS,
- trunc((DWA.ETL_TASK_LOG.edate-DWA.ETL_TASK_LOG.sdate)*24*60) time
- FROM
- DWA.HDP_JOB_MAP,
- DWA.ETL_TASK_PROGRAM,
- DWA.ETL_TASK_LOG,
- DWA.HDP_JOB_STAT,
- DWA.ETL_TASK_BASELINE
- WHERE
- ( DWA.HDP_JOB_STAT.JOB_ID=DWA.HDP_JOB_MAP.JOB_ID )
- AND ( DWA.HDP_JOB_MAP.ID=DWA.ETL_TASK_LOG.ID )
- AND ( DWA.ETL_TASK_LOG.ID=DWA.ETL_TASK_PROGRAM.ID )
- AND ( DWA.ETL_TASK_PROGRAM.BASELINE_ID=DWA.ETL_TASK_BASELINE.ID )
- AND
- (
- ( ( DWA.HDP_JOB_STAT.GMTDATE ) = trunc(sysdate) )
- AND
- ( ( DWA.HDP_JOB_MAP.GMTDATE ) = trunc(sysdate) )
- AND
- ( ( DWA.ETL_TASK_LOG.GMTDATE ) = trunc(sysdate) )
- AND
- DWA.ETL_TASK_PROGRAM.priority in('xx','xxx')
-
- )
- GROUP BY
- DWA.ETL_TASK_BASELINE.name,
- DWA.HDP_JOB_MAP.ID,
- DWA.ETL_TASK_PROGRAM.BIZ_SORTID,
- DWA.ETL_TASK_PROGRAM.ON_DUTY,
- DWA.ETL_TASK_LOG.PRGNAME,
- DWA.HDP_JOB_MAP.job_id,
- (DWA.ETL_TASK_LOG.edate-DWA.ETL_TASK_LOG.sdate)*24*60
- )
- order by MAP_INPUT_RECORDS desc ,job_id;
-
- select * from gv_job_mapinput
- where id exists (
- select id from
- (select id,prgname,count(job_id) cnt from gv_job_mapinput group by id,prgname)
- where cnt =1
- )
- order by MAP_INPUT_BYTES desc;
7. 多个任务只有一个共同的父任务
- drop table gvora_view_relation;
- create table gvora_view_relation as
- select a.id,a.viewname,a.on_duty,a.sourceid,a.priority,a.parentid,
- b.viewname parentviewname,b.on_duty pon_duty,b.sourceid psourceid,b.priority p_priority
- from(
- select a.id,b.viewname,b.on_duty,b.sourceid,b.priority,a.parentid from
- dwa.etl_task_relation a,
- dwa.etl_task_program b
- where a.id=b.id
- ) a,
- dwa.etl_task_program b
- where a.parentid=b.id;
-
- select a.id as 天网id,a.viewname as 显示名称,rudu,chudu
- from(
- select id,viewname,count(1) rudu from gvora_view_relation
- where priority in('xx','xxx')
-
- group by id,viewname
- ) a,
- (
- select parentid,parentviewname,count(1) chudu from gvora_view_relation
- where priority in('xx','xxx')
-
- group by parentid,parentviewname
- ) b
- where a.id=b.parentid
- order by rudu +chudu desc;