Reflections Prompted by a Production Issue

Background

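One day, the merge step of an import job suddenly started running far longer than usual: a step that used to finish in about 1 hour now took about 10 hours.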

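Comparing the YARN job information for the slow run against a normal run showed that the numbers of map tasks and reduce tasks had both decreased.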

Analysis

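Inspecting the table's files showed that the number of files in the partition differed from the previous day's.
The partition for the 23rd has only 1 file, a little over 800 MB, whereas the partition for the 22nd has 200+ files of a little over 40 MB each.

By that arithmetic, the partition for the 22nd occupies roughly 200 × 40 MB ≈ 8 GB, about ten times the size of the 23rd's single 800 MB file, even though the two partitions hold the same amount of data.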
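The estimate can be checked with a few lines of arithmetic. This is only a sketch using the rounded figures quoted above (the text says "200+" files and "40+" MB, so the real totals are somewhat higher); the variable names are illustrative.

```python
# Rounded lower-bound figures taken from the partition listing described above.
files_22 = 200          # number of files in the partition for the 22nd
size_per_file_mb = 40   # approximate size of each of those files, in MB
size_23_mb = 800        # the single file in the partition for the 23rd, in MB

total_22_mb = files_22 * size_per_file_mb  # total size of the 22nd's partition
ratio = total_22_mb / size_23_mb           # how much larger the 22nd is

print(f"22nd: ~{total_22_mb} MB, 23rd: ~{size_23_mb} MB, ratio: ~{ratio:.0f}x")
```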

Strange. Where was the problem?

File compression

The table turned out to be stored in ORC format, a columnar format that is optimized for data storage.

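A comparison of the same set of tables stored in various formats, after compression:
[Figure: orc]

ORC has the lowest ratio of compressed size to original size, that is, it produces the smallest files.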

With columnar storage, compression works especially well when homogeneous data (identical values) sits close together.
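The effect is easy to reproduce on a toy scale. The sketch below is not ORC: ORC applies its own column encodings before general-purpose compression, while this example only runs Python's zlib over a plain byte string. The column contents (50 made-up identifiers repeated 100,000 times) and the helper compressed_size are hypothetical and exist only for illustration.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical low-cardinality column: 100,000 values drawn from 50 distinct strings.
distinct_values = [f"com.example.app{i:02d}" for i in range(50)]
column = [random.choice(distinct_values) for _ in range(100_000)]

def compressed_size(values):
    """Serialize one value per line and return the zlib-compressed byte count."""
    raw = "\n".join(values).encode("utf-8")
    return len(zlib.compress(raw))

scattered = compressed_size(column)          # values in random arrival order
clustered = compressed_size(sorted(column))  # identical values placed together

print(f"scattered: {scattered} bytes, clustered: {clustered} bytes")
```

The same values compress to far fewer bytes once identical entries are adjacent, which is the behaviour the hypothesis below relies on.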

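This suggests a hypothesis: the imported table may contain a large amount of homogeneous data, and that is what produces such a large difference in compression ratio.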

The experiments below test this hypothesis against data.

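Experiments

Setup

Prepare two tables that mimic the production data.
One is a single-file table, ods.t_single_file (its 2020-03-30 partition contains only one file).
The other is a multi-file table, ods.t_multiple_file (its 2020-03-30 partition contains multiple files).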

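Experiment 1

Goal

Re-import both tables into uncompressed text tables.

Expected result

The two text tables have the same storage size.

Preparation

Create two test tables stored as text: userdb.t_single_file_text and userdb.t_multiple_file_text.

Code

set hive.exec.compress.output=false; -- disable output compression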
INSERT OVERWRITE TABLE userdb.t_single_file_text PARTITION(dt = '2020-03-30')
select
`AutoId`,
`gmt51_create`,
`gmt51_modify`,
`BundleId`,
`Platform`,
`Date`,
`Source`,
`Count`
from
ods.t_single_file
where dt = '2020-03-30'
;


set hive.exec.compress.output=false; -- disable output compression
INSERT OVERWRITE TABLE userdb.t_multiple_file_text PARTITION(dt = '2020-03-30')
select
`AutoId`,
`gmt51_create`,
`gmt51_modify`,
`BundleId`,
`Platform`,
`Date`,
`Source`,
`Count`
from
ods.t_multiple_file
where dt = '2020-03-30'
;

Data

-bash-4.1$ hadoop fs -du -h hdfs://nameservice1/user/hive/warehouse/userdb.db/t_single_file_text
368.4 G  1.1 T  hdfs://nameservice1/user/hive/warehouse/userdb.db/t_single_file_text/dt=2020-03-30
-bash-4.1$ hadoop fs -du -h hdfs://nameservice1/user/hive/warehouse/userdb.db/t_multiple_file_text
368.4 G  1.1 T  hdfs://nameservice1/user/hive/warehouse/userdb.db/t_multiple_file_text/dt=2020-03-30

Result

The stored sizes are identical, as expected.
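In the `hadoop fs -du` listings, the first size is the logical size of the data and the second is the space consumed across all replicas. Two lines are easy to compare by eye, but the check can also be scripted. The following is a minimal sketch: parse_du_line is a hypothetical helper, and it only handles lines where both sizes carry a unit suffix, as in the listings above.

```python
# Unit multipliers for the single-letter suffixes printed by `-du -h`.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_du_line(line):
    """Parse 'SIZE UNIT  SIZE UNIT  PATH' into (bytes, bytes_with_replicas, path)."""
    size, size_unit, total, total_unit, path = line.split()
    return float(size) * UNITS[size_unit], float(total) * UNITS[total_unit], path

single = parse_du_line(
    "368.4 G  1.1 T  hdfs://nameservice1/user/hive/warehouse/userdb.db/t_single_file_text/dt=2020-03-30"
)
multiple = parse_du_line(
    "368.4 G  1.1 T  hdfs://nameservice1/user/hive/warehouse/userdb.db/t_multiple_file_text/dt=2020-03-30"
)

same_size = single[0] == multiple[0]         # compare the logical sizes
replication = round(single[1] / single[0])   # replicated size / logical size

print(f"same logical size: {same_size}, implied replication factor: ~{replication}")
```

Because the printed values are rounded, the implied replication factor is only approximate.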

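Experiment 2

Goal

Re-import both tables into ORC tables with the default compression, sorting the data on every column except AutoId so that identical values end up next to each other.

Expected result

The two ORC tables have the same storage size, and that size is smaller than that of a table written normally, without sorting.

Preparation

Create two test tables stored as ORC: userdb.t_single_file_orc and userdb.t_multiple_file_orc.

Code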

INSERT OVERWRITE TABLE userdb.t_single_file_orc PARTITION(dt = '2020-03-30')
select
`AutoId`,
`gmt51_create`,
`gmt51_modify`,
`BundleId`,
`Platform`,
`Date`,
`Source`,
`Count`
from
ods.t_single_file
where dt = '2020-03-30'
-- sort so that identical values are written next to each other
order by `BundleId`, `Platform`, `Date`, `Source`, `Count`, `gmt51_create`, `gmt51_modify`
;


INSERT OVERWRITE TABLE userdb.t_multiple_file_orc PARTITION(dt = '2020-03-30')
select
`AutoId`,
`gmt51_create`,
`gmt51_modify`,
`BundleId`,
`Platform`,
`Date`,
`Source`,
`Count`
from
ods.t_multiple_file
where dt = '2020-03-30'
-- sort so that identical values are written next to each other
order by `BundleId`, `Platform`, `Date`, `Source`, `Count`, `gmt51_create`, `gmt51_modify`
;

Data

-bash-4.1$ hadoop fs -du -h hdfs://nameservice1/user/hive/warehouse/userdb.db/t_single_file_orc
7.5 G  22.5 G  hdfs://nameservice1/user/hive/warehouse/userdb.db/t_single_file_orc/dt=2020-03-30
-bash-4.1$ hadoop fs -du -h hdfs://nameservice1/user/hive/warehouse/userdb.db/t_multiple_file_orc
7.5 G  22.5 G  hdfs://nameservice1/user/hive/warehouse/userdb.db/t_multiple_file_orc/dt=2020-03-30

Result

The stored sizes are identical.

For comparison, the original table ods.t_multiple_file:

hadoop fs -du -h hdfs://nameservice1/user/hive/warehouse/ods.db/t_multiple_file
9.6 G  28.7 G  hdfs://nameservice1/user/hive/warehouse/ods.db/t_multiple_file/dt=2020-03-30

The size drops from 9.6 G to 7.5 G, so writing homogeneous data together into an ORC table does yield a better compression ratio.
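To put numbers on the improvement, the sizes reported by `hadoop fs -du -h` in the two experiments can be compared directly. This is a rough calculation on the rounded values shown in the listings, so the results are approximate.

```python
# Logical sizes in G, as printed by `hadoop fs -du -h` in the experiments.
text_gb = 368.4         # uncompressed text copy of the partition
orc_unsorted_gb = 9.6   # original ORC partition of ods.t_multiple_file
orc_sorted_gb = 7.5     # ORC partition rewritten with sorted rows

# Relative saving from sorting before writing to ORC.
saving = (orc_unsorted_gb - orc_sorted_gb) / orc_unsorted_gb

print(f"sorted ORC is {saving:.1%} smaller than the original ORC partition")
print(f"text / original ORC: {text_gb / orc_unsorted_gb:.1f}x")
print(f"text / sorted ORC:   {text_gb / orc_sorted_gb:.1f}x")
```

On these figures, sorting saves a little over a fifth of the ORC storage for this partition.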

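Conclusion

Storing a table as ORC greatly reduces its storage footprint: in these experiments the same partition took 368.4 G as uncompressed text and 9.6 G as ORC.
For tables with a lot of homogeneous data, writing similar rows together improves the compression ratio further.

Starting from a single production observation, this article checked widely accepted conclusions against measured data.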

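Aside

The other question raised by this incident is the number of map and reduce tasks. That comes down to how an MR job splits its work into map tasks and reduce tasks; interested readers can look up the relevant material.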
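As one starting point for that reading: when the reducer count is not set explicitly, Hive estimates it from the input size using hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max. The sketch below is a simplified version of that rule. The function estimate_reducers is hypothetical, the default values (256,000,000 bytes per reducer, at most 1009 reducers) are those of recent Hive releases and are an assumption about the cluster in this article, and the input sizes are the rounded partition sizes from the analysis above.

```python
def estimate_reducers(input_bytes,
                      bytes_per_reducer=256_000_000,  # assumed hive.exec.reducers.bytes.per.reducer
                      max_reducers=1009):             # assumed hive.exec.reducers.max
    """Simplified estimate: ceil(input / bytes_per_reducer), clamped to [1, max_reducers]."""
    reducers = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer  # ceiling division
    return min(max_reducers, max(1, reducers))

MB = 1024 * 1024

small_input = estimate_reducers(800 * MB)    # roughly the single-file partition
large_input = estimate_reducers(8000 * MB)   # roughly the 200-file partition

print(f"~800 MB input: {small_input} reducers, ~8000 MB input: {large_input} reducers")
```

Under these assumptions, a much smaller compressed input leads to far fewer reducers, which is consistent with the drop in task counts seen in YARN. The number of map tasks depends on input split computation, which this sketch does not cover.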