
[SPARK-48309][YARN] Stop AM retry in situations where some errors mean retries may not be successful #46620

Open

guixiaowen wants to merge 6 commits into base: master
Conversation

@guixiaowen (Contributor) commented May 16, 2024


What changes were proposed in this pull request?

In YARN cluster mode, spark.yarn.maxAppAttempts will be configured; in our production environment it is set to 2. If the first attempt fails, the AM will retry. However, in some scenarios even a second attempt will fail.

For example:

org.apache.spark.sql.AnalysisException: Table or view not found: test.testxxxx_xxxxx; line 1 pos 14;
'Project
+- 'UnresolvedRelation [bigdata_qa, testxxxxx_xxxxx], [], false

Another example:

Caused by: org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: The NameSpace quota (directories and files) of directory /tmp/xxx_file/xxxx is exceeded: quota=1000000 file count=1000001

Would it be more appropriate to catch these exceptions and stop retrying?

Why are the changes needed?

In some scenarios, even a second attempt will fail for the same reason, so retrying cannot succeed.

Does this PR introduce any user-facing change?

The user can throw a SparkStopAMRetryException; the ApplicationMaster will catch the exception and stop retrying.

Set spark.yarn.maxAppAttempts=2.

For example:

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .enableHiveSupport()
  .getOrCreate()

try {
  spark.sql("select * from test.testxxxx_xxxxx;").show
} catch {
  case e: AnalysisException => throw new SparkStopAMRetryException("this is a test", e)
} finally {
  spark.stop()
}
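For context, here is a minimal sketch of how the AM side of this change could look inside ApplicationMaster (where finish is defined). This is an illustration, not the exact code of this PR: handleUserClassFailure is a hypothetical helper, EXIT_STOP_AM_RETRY and SparkStopAMRetryException are the names this PR proposes, and EXIT_EXCEPTION_USER_CLASS is the existing exit code.

```scala
import org.apache.hadoop.util.StringUtils
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

// Sketch only: translate the user class's failure into either a
// retryable or a non-retryable final status.
private def handleUserClassFailure(cause: Throwable): Unit = cause match {
  case stopAmRetry: SparkStopAMRetryException =>
    // A dedicated exit code signals YARN not to launch another attempt.
    finish(FinalApplicationStatus.FAILED,
      ApplicationMaster.EXIT_STOP_AM_RETRY,
      "User class threw exception: " + StringUtils.stringifyException(stopAmRetry))
  case _ =>
    // Existing behavior: fail normally and let YARN retry
    // up to spark.yarn.maxAppAttempts.
    finish(FinalApplicationStatus.FAILED,
      ApplicationMaster.EXIT_EXCEPTION_USER_CLASS,
      "User class threw exception: " + StringUtils.stringifyException(cause))
}
```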

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

github-actions bot added the YARN label May 16, 2024
guixiaowen changed the title [SPARK-48309][YARN]Stop am retry, in situations where some errors and… [SPARK-48309][YARN] Stop AM retry in situations where some errors mean retries may not be successful May 16, 2024
@mridulm (Contributor) commented May 23, 2024

While this PR does not include it, in order to leverage the change introduced for SparkStopAMRetryException, existing exception handling will need to be changed (to throw SparkStopAMRetryException instead of whatever is thrown now), which would be a backward-incompatible change.

I am wondering if we can leverage SparkException.errorClass instead, since SparkException is thrown by Spark. Return EXIT_STOP_AM_RETRY for some specific subset of error classes?

+CC @MaxGekk in case this idea makes sense!
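A rough sketch of that alternative, assuming SparkThrowable's getErrorClass accessor; the error classes listed are placeholders, not an agreed set:

```scala
import org.apache.spark.SparkException

// Hypothetical: decide from the error class whether a second YARN
// attempt could plausibly succeed. The set below is illustrative only.
val nonRetryableErrorClasses: Set[String] =
  Set("TABLE_OR_VIEW_NOT_FOUND", "UNRESOLVED_COLUMN")

def shouldStopAmRetry(e: Throwable): Boolean = e match {
  case se: SparkException =>
    // getErrorClass may return null for exceptions without an error class.
    Option(se.getErrorClass).exists(nonRetryableErrorClasses.contains)
  case _ => false
}
```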

@summaryzb (Contributor) commented

> I am wondering if we can leverage SparkException.errorClass instead, since SparkException is thrown by Spark. Return EXIT_STOP_AM_RETRY for some specific subset of error classes?

Agree, this can be used to handle existing exceptions. Maybe it's a good idea to include the few highest-frequency error classes from your production environment, while SparkStopAMRetryException can be used to handle new error scenarios after this PR.

+CC @LuciferYang

@guixiaowen (Contributor, Author) commented

@mridulm @summaryzb

Thank you both for helping me review this PR.

In fact, I initially considered reusing Spark's existing exception classes.

But without a new exception type this may not work here, because in yarn-cluster mode the ApplicationMaster decides whether a retry is needed based on the exception it catches, for example:

e.getCause match {
  case _: InterruptedException =>
  case SparkUserAppException(exitCode) =>
    // to reuse existing exceptions, this match would have to check whether
    // e.getMessage contains "Table or view not found:" or another
    // high-frequency error message

This code is in ApplicationMaster.

But if the user throws their own user-defined exception, such as:

throw new MyTestExecption("this is a test exception, I want to stop am retry.")

then in the ApplicationMaster there is no way to match that user-defined exception type.
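Since the proposed exception would live in Spark itself, the AM can match on it directly. A minimal sketch of such a class follows; the constructor shape is an assumption, not necessarily what this PR defines:

```scala
import org.apache.spark.SparkException

// Hypothetical definition: an exception a user application throws to
// tell the ApplicationMaster that retrying cannot succeed.
class SparkStopAMRetryException(message: String, cause: Throwable)
  extends SparkException(message, cause) {
  def this(message: String) = this(message, null)
}
```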

@LuciferYang (Contributor) commented

In the example code, if it throws SparkUserAppException(18) instead of SparkStopAMRetryException, will it also trigger a retry with this PR?

finish(FinalApplicationStatus.FAILED,
  ApplicationMaster.EXIT_STOP_AM_RETRY,
  "User class threw exception: "
    + StringUtils.stringifyException(stopAmRetry.getCause))
@yaooqinn (Member) commented on this code:

Avoid NPE?

@guixiaowen (Contributor, Author) replied
@yaooqinn Yes, I changed it.
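For reference, the NPE risk is that stopAmRetry.getCause can be null when the exception is constructed without a cause. A defensive version might look like the sketch below (not necessarily the exact committed fix):

```scala
// Fall back to the exception itself when no cause was attached,
// so stringifyException never receives null.
val toReport: Throwable = Option(stopAmRetry.getCause).getOrElse(stopAmRetry)
finish(FinalApplicationStatus.FAILED,
  ApplicationMaster.EXIT_STOP_AM_RETRY,
  "User class threw exception: " + StringUtils.stringifyException(toReport))
```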

@guixiaowen (Contributor, Author) commented

> In the example code, if it throws SparkUserAppException(18) instead of SparkStopAMRetryException, will it also trigger a retry with this PR?

@LuciferYang Thank you for your review. I made the modifications as you suggested.

guixiaowen requested a review from yaooqinn May 24, 2024 10:05
@yaooqinn (Member) commented
Hi @guixiaowen, I need some time to think about it, as it might break some existing workloads.

Meantime, you can:

  • Update the PR description for better readability
  • Update the PR description according to the latest change
  • Revise the doc of spark.yarn.maxAppAttempts

@guixiaowen (Contributor, Author) commented

@yaooqinn OK, thank you.
