
Is there any optimization method for Spark to process small files?

Created: Jul 30, 2019 03:14:54 | Latest reply: Jul 30, 2019 03:27:15
  Rewarded HiCoins: 5 (problem resolved)

Is there any optimization method for Spark to process small files?

Featured Answers
TingtingGG
Created Jul 30, 2019 03:27:15

Hi there!

The files can be merged asynchronously, either before the job runs or after processing. At write time, repartition lets you set the number of output files: estimate the size of the partition or files to be processed and set the count to a reasonable number. Merging after the job completes does not affect job efficiency, so both methods work. After execution, for example, shuffle defaults to 200 partitions, which leaves 200 small files; in that case, merge the data offline while it is not in use. There are many ways to merge, but they basically all read the small files and write them back as larger ones. Note the following points:
• Before putting the merged file into the original directory, delete the files that were previously read to avoid duplicate data, but do not delete all the files in the directory.
• Check whether other programs write new files into the directory while the merge is generating the large file.
• Check whether the directory contains tmp and _SUCCESS marker files.
• The file storage formats must be the same.
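
A minimal sketch of both approaches in Scala (the paths, the target of 16 files, and the 128 MB sizing rule are illustrative assumptions, not values from this thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("small-file-merge").getOrCreate()

    // Approach 1: control the output file count at write time with repartition.
    // targetFiles is assumed: estimated total size / desired file size (~128 MB).
    val df = spark.read.parquet("/data/input")        // hypothetical input path
    val targetFiles = 16
    df.repartition(targetFiles)
      .write
      .mode("overwrite")
      .parquet("/data/output")                        // hypothetical output path

    // Approach 2: offline merge after the job ("read it and write it back").
    // Write the merged copy to a staging directory first, then swap it into
    // the original directory, deleting only the files that were read (see the
    // cautions above).
    val small = spark.read.parquet("/data/output")
    small.coalesce(4)                                 // assumed merged file count
      .write
      .mode("overwrite")
      .parquet("/data/output_merged_tmp")             // staging directory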

-------------------------

Are you asking about Delta's small files, or Spark SQL's own small files? Or something else? Currently, Spark SQL does not have a built-in small-file merge function. Let's look at the actual requirement.

-------------------------

If Hive transactional tables are used, Hive has a built-in function for merging small files; open-source Spark does not. After delete/update is enabled in Spark SQL, those operations generate small files, and the handling is the same: merge them periodically. Remarks: open-source Spark does not support transactions, and EMR Spark's transaction support is still weak. If you have requirements, you are welcome to submit them.

-------------------------

Spark SQL generates a large number of small files. You can obtain the data size and file count of each partition of each table from the metastore to decide whether small files should be merged (or large files split). Then have Spark SQL read the data and write it back with the partition number computed from that size.
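
As an illustration, instead of querying the metastore directly, a sketch along these lines can list the partition directory with the Hadoop FileSystem API to get the same size and file-count information (the warehouse path, the 128 MB target, and the fragmentation check are assumptions):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partition-compaction").getOrCreate()

    // Hypothetical partition directory of a Hive-style table.
    val partitionPath = new Path("/warehouse/db.db/events/dt=2019-07-30")
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // Total bytes and file count for the partition.
    val statuses   = fs.listStatus(partitionPath)
    val totalBytes = statuses.map(_.getLen).sum
    val fileCount  = statuses.length

    // Derive the rewrite's partition number from the data size (assumed 128 MB target).
    val targetFileSize = 128L * 1024 * 1024
    val targetFiles    = math.max(1, (totalBytes / targetFileSize).toInt)

    // Rewrite only when the partition is clearly fragmented.
    if (fileCount > targetFiles * 2) {
      spark.read.parquet(partitionPath.toString)
        .repartition(targetFiles)
        .write
        .mode("overwrite")
        .parquet("/warehouse/tmp/events/dt=2019-07-30") // staging dir; swap in after checks
    }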

-------------------------

You use Spark Streaming to process data and write it into Parquet files in real time, and the recommendation system then uses this real-time data, correct? (Yes, that is the requirement.) How does the recommendation system use the data, and with what tool does it read it? (It reads the HDFS files in real time, with Spark ML.)
Got it. So the technology stack is Spark and the application scenario is a data pipeline. Delta, open-sourced a while ago, was built to solve exactly this scenario.
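
For reference, Delta's documentation describes an in-place compaction pattern along these lines; because the table is transactional, a compaction job can run while the streaming job keeps appending. A sketch (the path and file count are assumptions, and it requires the delta-core library on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("delta-compaction").getOrCreate()

    // Rewrite the table into fewer files; dataChange=false marks the commit as
    // compaction-only so concurrent readers and streams are not disturbed.
    val path     = "/data/events_delta"               // hypothetical Delta table path
    val numFiles = 16                                 // assumed target file count

    spark.read
      .format("delta")
      .load(path)
      .repartition(numFiles)
      .write
      .option("dataChange", "false")
      .format("delta")
      .mode("overwrite")
      .save(path)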

All Answers
Spark is an open-source project and does not concern itself with these fragment files. It is recommended that the platform handle this after the Spark task completes: for example, obtain the data source information and merge the fragment files.


This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<RequestId>5D3FB8C2DA7EE15CCFE2C79A</RequestId>
<HostId>yqfiles.img-cn-hangzhou.aliyuncs.com</HostId>
<Key>0702bce22ca8bf9f381a2860b1aa8f1232cf0534.png)</Key>
</Error>
View more
  • x
  • convention:

Hi there!

It can be merged in asynchronous mode. If a small file needs to be merged or merged, merge it before the implementation and merge the file after processing. After you write repartition, you can set the number of files. When you estimate the size of a partition or file to be processed, you can set the number of files to a parameter, which is a reasonable number. It can be combined after implementation. This does not affect the implementation efficiency. However, two methods can be used. After the implementation, for example, the shuffle is 200 by default, there are 200 small files. In this case, you can combine the data offline when the data is not used. There are many methods for merging, basically, read it and write it in. Note the following points: Before the merged file is put into the original directory, delete the files that are previously read to avoid duplicate data, but cannot delete all the files. Check whether other programs write new files in the process of merging and generating large files. Check whether the files contain tmp and success. The file storage formats must be the same.

-------------------------

Are you asking about delta's small files, or spark sql's own small files? Or what? Currently, Spark SQL does not have this function. Let's look at the requirement.

-------------------------

If the Hive transaction table is used, the Hive has the function of merging small files. There is no open source for Spark. After spark sql is enabled, the delete update generates small files. The handling methods are the same and are combined periodically. Remarks: Open-source Spark does not support transactions. emr spark's transaction support is still weak. If you have any requirements, you are welcome to submit them.

-------------------------

The Spark SQL generates a large number of small files. You can obtain the data size and number of files in each partition of each table by using the metabase to check whether to merge small files or split large files. Then, Spark SQL reads the data and then writes the data into partition number.

-------------------------

You use spark streaming to process data and write it into a parquet file in real time. Then the recommendation system uses these real-time data. Is this true? (Yes, the requirement is true.) How does the recommendation system use the data? What tool is used to know the data? (Read HDFS files in real time, spark ML)
Got it. This is the technology stack of Spark. The application scenario is data pipeline. In the previous period, the open-source delta was used to solve this scenario.
View more
  • x
  • convention:
