Hi there!
Merging can be done either inside the job or asynchronously afterward. If you merge inside the job, call repartition before writing: estimate the size of the data to be processed and set the number of output files to a reasonable value. Merging this way barely affects execution efficiency. Alternatively, merge after the job finishes: for example, the shuffle partition number defaults to 200, so a shuffle leaves 200 small files; you can then compact the data offline while it is not being read. There are many ways to merge, but they all boil down to reading the small files and rewriting them as larger ones. Note the following points:
- Before moving the merged files into the original directory, delete the files that were read, to avoid duplicate data, but do not delete everything in the directory.
- Check whether other programs are writing new files while the merge is generating the large files.
- Watch out for tmp and success marker files.
- The merged files must use the same storage format as the originals.
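The two practical points above can be sketched in plain Python: estimating a reasonable output file count from the data size, and filtering out the in-flight tmp/success files that an offline merge must not touch. This is a minimal sketch, not Spark code; the 128 MB target file size and the function names are my own assumptions.

```python
import math

def plan_output_files(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Estimate a reasonable repartition count from the total data size.

    target_file_bytes (128 MB here) is an assumed target; tune it for
    your cluster and storage format.
    """
    return max(1, math.ceil(total_bytes / target_file_bytes))

def mergeable_files(file_names):
    """Filter out files an offline merge job must NOT read or delete:
    in-flight temporary files and marker files such as _SUCCESS."""
    return [
        f for f in file_names
        if "tmp" not in f
        and not f.startswith("_")
        and "success" not in f.lower()
    ]

# 1 GB of small files -> 8 output files of ~128 MB each
print(plan_output_files(1 * 1024**3))
# Only the finished part file survives the filter
print(mergeable_files(["part-0001.parquet", "part-0002.parquet.tmp", "_SUCCESS"]))
```

With the count in hand, the actual merge is just a read of the surviving files followed by a repartition(n) write, as described above.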
-------------------------
Are you asking about Delta's small files, or Spark SQL's own small files? Or something else? Currently, Spark SQL itself does not have this function, so let's look at the actual requirement.
-------------------------
If you use Hive transaction tables, Hive has a built-in function for merging small files; open-source Spark has no equivalent. Once transactions are enabled in Spark SQL, delete and update operations generate small files; the handling is the same: compact them periodically. Note: open-source Spark does not support transactions, and EMR Spark's transaction support is still weak. If you have requirements in this area, you are welcome to submit them.
-------------------------
Spark SQL generates a large number of small files. You can query the metastore for the data size and file count of each partition of each table to decide whether small files should be merged or large files should be split. Then have Spark SQL read the data and write it back with the computed partition number.
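The merge-or-split decision described above can be sketched as a pure function over the per-partition stats pulled from the metastore. This is only a sketch under assumed thresholds (128 MB target file size, 16 MB "small file" cutoff); the function and dictionary names are hypothetical.

```python
import math

# Assumed thresholds; tune per cluster and storage format.
TARGET_FILE_BYTES = 128 * 1024 * 1024
SMALL_FILE_BYTES = 16 * 1024 * 1024

def rewrite_plan(partition_stats):
    """Given per-partition (total_bytes, file_count) from the metastore,
    decide which partitions to rewrite and with how many output files."""
    plan = {}
    for partition, (total_bytes, file_count) in partition_stats.items():
        ideal = max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))
        avg = total_bytes / max(1, file_count)
        if file_count > ideal and avg < SMALL_FILE_BYTES:
            plan[partition] = ideal   # too many tiny files: merge
        elif file_count < ideal:
            plan[partition] = ideal   # oversized files: split
    return plan

stats = {
    "dt=2020-01-01": (1 * 1024**3, 200),   # many tiny files -> merge
    "dt=2020-01-02": (2 * 1024**3, 4),     # few huge files  -> split
    "dt=2020-01-03": (256 * 1024**2, 2),   # already fine    -> skip
}
print(rewrite_plan(stats))
```

Each planned partition is then rewritten by having Spark SQL read it and write it back repartitioned to the planned file count.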
-------------------------
You use Spark Streaming to process data and write it into Parquet files in real time, and the recommendation system then uses these real-time data, is that right? (Yes, that is the requirement.) How does the recommendation system use the data, and with what tool does it read it? (It reads the HDFS files in real time with Spark ML.)
Got it. This is the Spark technology stack, and the application scenario is a data pipeline. Some time ago, open-source Delta was used to solve exactly this scenario.