Fault handling
1. Manually delete the eventlog files to restore service availability.
2. Increase the value of spark.eventLog.group.size to reduce the frequency at which eventlog files are generated.
3. Increase the JobHistory memory parameter SPARK_DAEMON_MEMORY to 10 GB or more, and decrease the value of spark.history.fs.cleaner.maxAge to reduce the number of retained files.
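The three adjustments above map to the following settings. (The values shown are the examples used later in this guide; spark.eventLog.group.size is set in the client's spark-defaults.conf, while SPARK_DAEMON_MEMORY and spark.history.fs.cleaner.maxAge are set in the JobHistory configuration on Manager.)

```
# Client side: $client_home/Spark/spark/conf/spark-defaults.conf
spark.eventLog.group.size=3000        # fewer, larger eventlog files
spark.history.fs.cleaner.maxAge=4d    # shorter retention, fewer files

# JobHistory instance configuration (Manager)
SPARK_DAEMON_MEMORY=10GB              # more heap for the eventlog cleaner
```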
Operation guide
Check whether all running Spark services can be stopped.
- If yes, go to Solution 1. (This solution deletes the JobHistory directory. As a result, the history of all Spark tasks is lost, and Spark tasks that are running may fail.)
- If no, go to Solution 2. (This solution does not affect running tasks, but you must ensure that no new Spark tasks are submitted while it is being carried out. The manual operations are complex and the workload is heavy; contact R&D engineers.)
Solution 1
1. On the Manager page, go to the Spark service page and stop the JobHistory instance.
Go to the Spark service configuration page and change the value of SPARK_DAEMON_MEMORY under JobHistory to 10 GB. (This step is mandatory: it prevents the JobHistory eventlog-cleaning thread from failing due to insufficient memory.)
Change the value of spark.history.fs.cleaner.maxAge to 4d. (This step is optional. The parameter sets the maximum retention period for logs in JobHistory; a smaller value means fewer files in the JobHistory directory.) Save the configuration.
2. Log in to the client and run the kinit admin command to authenticate as user admin. Then run the following command to move the /sparkJobHistory directory aside as a backup:
hdfs dfs -mv /sparkJobHistory /sparkJobHistory-bak
3. Re-create the JobHistory directory and set its owner and permissions:
hdfs dfs -mkdir /sparkJobHistory
hdfs dfs -chown spark:hadoop /sparkJobHistory
hdfs dfs -chmod 777 /sparkJobHistory
4. Start the JobHistory instance.
5. After JobHistory has started, delete the backed-up JobHistory directory by running the following command on the client:
hdfs dfs -rm -r /sparkJobHistory-bak
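The backup / re-create / clean-up sequence in steps 2 to 5 can be sketched as follows. This is an illustration only: it mirrors the sequence on the local filesystem, whereas the real procedure runs `hdfs dfs` commands against HDFS paths, and the chown to spark:hadoop is omitted here because it requires root locally.

```shell
set -e
root=$(mktemp -d)                      # stand-in for the HDFS root "/"
mkdir "$root/sparkJobHistory"          # pre-existing history directory
mv "$root/sparkJobHistory" "$root/sparkJobHistory-bak"  # step 2: back up
mkdir "$root/sparkJobHistory"          # step 3: re-create
chmod 777 "$root/sparkJobHistory"      # step 3: open permissions
rm -r "$root/sparkJobHistory-bak"      # step 5: remove the backup
stat -c '%a' "$root/sparkJobHistory"   # prints the new mode, 777
```

Moving the directory aside first (rather than deleting it outright) means the history can still be restored until step 5 confirms that JobHistory starts cleanly.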
Solution 2
Set the client parameter spark.eventLog.group.size for the Spark Streaming service in $client_home/Spark/spark/conf/spark-defaults.conf: change the default value 30 to 3000. Do not change spark.eventLog.group.size for non-Spark-Streaming jobs.
After changing the parameter value, submit the job again.
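A minimal sketch of the edit, assuming the property is written as key=value in spark-defaults.conf (Spark also accepts whitespace-separated pairs, so check the file's actual format first). The block works on a throwaway copy so it can be run safely; in practice you would edit $client_home/Spark/spark/conf/spark-defaults.conf directly.

```shell
set -e
conf=$(mktemp)                                   # throwaway stand-in for spark-defaults.conf
printf 'spark.eventLog.group.size=30\n' > "$conf"
cp "$conf" "$conf.bak"                           # keep a backup before editing
# Replace the default 30 with 3000, matching the whole line to avoid
# touching unrelated properties.
sed -i 's/^\(spark\.eventLog\.group\.size\)=30$/\1=3000/' "$conf"
grep '^spark\.eventLog\.group\.size=3000$' "$conf"
```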