[SPARK-48309][YARN] Stop AM retry in situations where some errors mean retries may not be successful #46620
Conversation
While this PR does not include it, I am wondering if we can leverage Spark's existing error classes for this instead. +CC @MaxGekk in case this idea makes sense!
Agreed, this could be used to handle existing exceptions. Maybe it's a good idea to include the several highest-frequency error classes from your production environment.
Thank you both for helping me review this PR. In fact, I initially considered reusing Spark's existing exception classes. But without new exception information it may not work here, because in yarn-cluster mode the ApplicationMaster decides whether a retry is needed based on the current exception, e.g. the `e.getCause match { ... }` code in ApplicationMaster. If the user throws their own exception, such as `throw new MyTestException("this is a test exception, I want to stop AM retry.")`, the ApplicationMaster cannot recognize the user-defined exception.
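The limitation described above can be sketched without any Spark dependency. This is a minimal stand-in for the ApplicationMaster's cause-matching pattern, not the real code: `StopRetryException` and `MyTestException` are hypothetical classes defined locally to show why an unknown user exception always falls through to the retry path.

```scala
// StopRetryException: a marker type the handler knows about (hypothetical).
class StopRetryException(msg: String) extends RuntimeException(msg)
// MyTestException: a user-defined exception the handler has never seen (hypothetical).
class MyTestException(msg: String) extends RuntimeException(msg)

// Sketch of the AM-style decision: branch on the cause of the wrapping exception.
// Only types listed in the match can suppress the retry; everything else retries.
def shouldRetry(e: Throwable): Boolean = e.getCause match {
  case _: StopRetryException => false // known marker type: stop retrying
  case _                     => true  // unknown cause (incl. MyTestException): retry
}

val wrapped1 = new RuntimeException("user class threw", new StopRetryException("stop"))
val wrapped2 = new RuntimeException("user class threw", new MyTestException("unknown"))
```

Here `shouldRetry(wrapped1)` is `false` while `shouldRetry(wrapped2)` is `true`, which is the gap the PR fills: user-defined exceptions cannot opt out of the retry.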
In the example code, if it throws
```scala
finish(FinalApplicationStatus.FAILED,
  ApplicationMaster.EXIT_STOP_AM_RETYR,
  "User class threw exception: "
    + StringUtils.stringifyException(stopAmRetry.getCause))
```
Avoid NPE?
@yaooqinn Yes, I changed it.
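The NPE flagged above comes from `Throwable.getCause`, which returns null when the exception has no cause. One null-safe shape is sketched below; this is an assumption about the fix, not the PR's actual diff, and `describe` is a hypothetical helper name:

```scala
// Sketch: build the message from the cause only when one is present,
// falling back to the exception itself so stringification never sees null.
def describe(e: Throwable): String =
  "User class threw exception: " + Option(e.getCause).getOrElse(e).toString
```

With a cause set, the message reports the cause; without one, it reports the outer exception instead of throwing a NullPointerException.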
@LuciferYang Thank you for your review. I made the modifications as you said.
Hi, @guixiaowen I need some time to think about it as it might break some existing workloads. Meantime, you can
@yaooqinn OK, thank you.
What changes were proposed in this pull request?
In yarn-cluster mode, spark.yarn.maxAppAttempts is configured; in our production environment it is set to 2. If the first attempt fails, the AM retries. However, in some scenarios the second attempt will fail for the same reason as the first.
For example:
org.apache.spark.sql.AnalysisException: Table or view not found: test.testxxxx_xxxxx; line 1 pos 14;
'Project
+- 'UnresolvedRelation [bigdata_qa, testxxxxx_xxxxx], [], false
Another example:
Caused by: org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: The NameSpace quota (directories and files) of directory /tmp/xxx_file/xxxx is exceeded: quota=1000000 file count=1000001
Would it be more appropriate to try capturing these exceptions and stopping retry?
Why are the changes needed?
In some scenarios, even a second attempt will fail, so the retry only wastes cluster resources.
Does this PR introduce any user-facing change?
Yes. The user can throw a SparkStopAMRetryException; the ApplicationMaster will catch it and stop retrying.
Set spark.yarn.maxAppAttempts=2.
For example:
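A self-contained sketch of the intended usage follows. `SparkStopAMRetryException` is defined locally here as a stand-in so the example compiles without Spark (the real class is the one this PR adds), and `UserApp`, `run`, and the table name are hypothetical:

```scala
// Stand-in for the exception this PR introduces; the real one lives in Spark.
class SparkStopAMRetryException(msg: String, cause: Throwable)
  extends RuntimeException(msg, cause)

// Hypothetical user class: on an error that a second attempt cannot fix
// (e.g. a missing table), rethrow as the stop-retry exception so the AM
// finishes with FAILED instead of launching attempt 2.
object UserApp {
  def run(tableExists: Boolean): String =
    try {
      if (!tableExists)
        throw new IllegalArgumentException("Table or view not found: test.some_table")
      "ok"
    } catch {
      case e: IllegalArgumentException =>
        throw new SparkStopAMRetryException("unrecoverable, stop AM retry", e)
    }
}
```

With spark.yarn.maxAppAttempts=2, a normal failure would trigger a second attempt, but an exception of this type signals the AM to stop immediately.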
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?