Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
960 views
in Technique[技术] by (71.8m points)

scala - merge multiple small files in to few larger files in Spark

I using hive through Spark. I have a Insert into partitioned table query in my spark code. The input data is in 200+gb. When Spark is writing to a partitioned table, it is spitting very small files(files in kb's). so now the output partitioned table folder have 5000+ small kb files. I want to merge these in to few large MB files, may be about few 200mb files. I tired using hive merge settings, but they don't seem to work.

'val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")

 val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")

val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")

val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")

val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")

val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")

val result7G = hiveContext.sql("set hive.aux.jars.path=c:\Applications\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")

val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")'

The above hive settings work in a mapreduce hive execution and spits out files of specified size. Is there any option to do this Spark or Scala?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I had the same issue. Solution was to add DISTRIBUTE BY clause with the partition columns. This ensures that data for one partition goes to single reducer. Example in your case:

INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table DISTRIBUTE BY date

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
...