
[SPARK-48309][YARN] Stop AM retry in situations where some errors mean retries may not be successful #46620

Open

guixiaowen wants to merge 6 commits into base: master
Conversation

@guixiaowen (Contributor) commented May 16, 2024


What changes were proposed in this pull request?

In YARN cluster mode, spark.yarn.maxAppAttempts will be configured; in our production environment it is set to 2. If the first attempt fails, the AM will retry. However, in some scenarios even a second attempt will fail.

For example:

org.apache.spark.sql.AnalysisException: Table or view not found: test.testxxxx_xxxxx; line 1 pos 14;
'Project
+- 'UnresolvedRelation [bigdata_qa, testxxxxx_xxxxx], [], false

Another example:

Caused by: org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: The NameSpace quota (directories and files) of directory /tmp/xxx_file/xxxx is exceeded: quota=1000000 file count=1000001

Would it be more appropriate to catch these exceptions and stop retrying?

Why are the changes needed?

In some scenarios, even a second attempt will fail for the same reason, so retrying cannot succeed.

Does this PR introduce any user-facing change?

The user can throw a SparkStopAMRetryException; the ApplicationMaster will catch the exception and stop retrying.

Set spark.yarn.maxAppAttempts=2.

For example:

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .enableHiveSupport()
  .getOrCreate()

try {
  spark.sql("select * from test.testxxxx_xxxxx;").show
} catch {
  case e: AnalysisException => throw new SparkStopAMRetryException("this is a test", e)
} finally {
  spark.stop()
}
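For context, here is a minimal sketch of how the AM side of this change could look inside ApplicationMaster (where finish is defined). This is an illustration, not the exact code of this PR: handleUserClassFailure is a hypothetical helper, EXIT_STOP_AM_RETRY and SparkStopAMRetryException are the names this PR proposes, and EXIT_EXCEPTION_USER_CLASS is the existing exit code.

```scala
import org.apache.hadoop.util.StringUtils
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

// Sketch only: translate the user class's failure into either a
// retryable or a non-retryable final status.
private def handleUserClassFailure(cause: Throwable): Unit = cause match {
  case stopAmRetry: SparkStopAMRetryException =>
    // A dedicated exit code signals YARN not to launch another attempt.
    finish(FinalApplicationStatus.FAILED,
      ApplicationMaster.EXIT_STOP_AM_RETRY,
      "User class threw exception: " + StringUtils.stringifyException(stopAmRetry))
  case _ =>
    // Existing behavior: fail normally and let YARN retry
    // up to spark.yarn.maxAppAttempts.
    finish(FinalApplicationStatus.FAILED,
      ApplicationMaster.EXIT_EXCEPTION_USER_CLASS,
      "User class threw exception: " + StringUtils.stringifyException(cause))
}
```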

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

github-actions bot added the YARN label May 16, 2024
guixiaowen changed the title [SPARK-48309][YARN]Stop am retry, in situations where some errors and… [SPARK-48309][YARN] Stop AM retry in situations where some errors mean retries may not be successful May 16, 2024
@mridulm (Contributor) commented May 23, 2024

While this PR does not include it, in order to leverage the change introduced for SparkStopAMRetryException, existing exception handling will need to be changed (to throw SparkStopAMRetryException instead of whatever is thrown now), which would be a backward-incompatible change.

I am wondering if we can leverage SparkException.errorClass instead, since SparkException is thrown by Spark. Return EXIT_STOP_AM_RETRY for some specific subset of error classes?

+CC @MaxGekk in case this idea makes sense!
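A rough sketch of that alternative, assuming SparkThrowable's getErrorClass accessor; the error classes listed are placeholders, not an agreed set:

```scala
import org.apache.spark.SparkException

// Hypothetical: decide from the error class whether a second YARN
// attempt could plausibly succeed. The set below is illustrative only.
val nonRetryableErrorClasses: Set[String] =
  Set("TABLE_OR_VIEW_NOT_FOUND", "UNRESOLVED_COLUMN")

def shouldStopAmRetry(e: Throwable): Boolean = e match {
  case se: SparkException =>
    // getErrorClass may return null for exceptions without an error class.
    Option(se.getErrorClass).exists(nonRetryableErrorClasses.contains)
  case _ => false
}
```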

@summaryzb (Contributor) commented

> I am wondering if we can leverage SparkException.errorClass instead, since SparkException is thrown by Spark. Return EXIT_STOP_AM_RETRY for some specific subset of error classes?

Agree, this can be used to handle existing exceptions. Maybe it's a good idea to include the few highest-frequency error classes from your production environment, while SparkStopAMRetryException can be used to handle new error scenarios after this PR.

+CC @LuciferYang

@guixiaowen (Contributor, Author) commented

@mridulm @summaryzb

Thank you both for helping me review this PR.

In fact, I initially considered reusing Spark's existing exception classes.

But without a new exception type this may not work here, because in yarn-cluster mode the ApplicationMaster decides whether a retry is needed based on the exception it catches, for example:

e.getCause match {
  case _: InterruptedException =>
  case SparkUserAppException(exitCode) =>
    // to reuse existing exceptions, this match would have to check whether
    // e.getMessage contains "Table or view not found:" or another
    // high-frequency error message

This code is in ApplicationMaster.

But if the user throws their own user-defined exception, such as:

throw new MyTestExecption("this is a test exception, I want to stop am retry.")

then in the ApplicationMaster there is no way to match that user-defined exception type.
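Since the proposed exception would live in Spark itself, the AM can match on it directly. A minimal sketch of such a class follows; the constructor shape is an assumption, not necessarily what this PR defines:

```scala
import org.apache.spark.SparkException

// Hypothetical definition: an exception a user application throws to
// tell the ApplicationMaster that retrying cannot succeed.
class SparkStopAMRetryException(message: String, cause: Throwable)
  extends SparkException(message, cause) {
  def this(message: String) = this(message, null)
}
```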

@LuciferYang (Contributor) commented

In the example code, if it throws SparkUserAppException(18) instead of SparkStopAMRetryException, will it also trigger a retry with this PR?

finish(FinalApplicationStatus.FAILED,
  ApplicationMaster.EXIT_STOP_AM_RETRY,
  "User class threw exception: "
    + StringUtils.stringifyException(stopAmRetry.getCause))
@yaooqinn (Member) commented on this code:

Avoid NPE?

@guixiaowen (Contributor, Author) replied
@yaooqinn Yes, I changed it.
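For reference, the NPE risk is that stopAmRetry.getCause can be null when the exception is constructed without a cause. A defensive version might look like the sketch below (not necessarily the exact committed fix):

```scala
// Fall back to the exception itself when no cause was attached,
// so stringifyException never receives null.
val toReport: Throwable = Option(stopAmRetry.getCause).getOrElse(stopAmRetry)
finish(FinalApplicationStatus.FAILED,
  ApplicationMaster.EXIT_STOP_AM_RETRY,
  "User class threw exception: " + StringUtils.stringifyException(toReport))
```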

@guixiaowen (Contributor, Author) commented

> In the example code, if it throws SparkUserAppException(18) instead of SparkStopAMRetryException, will it also trigger a retry with this PR?

@LuciferYang Thank you for your review. I made the modifications as you suggested.

guixiaowen requested a review from yaooqinn May 24, 2024 10:05
@yaooqinn (Member) commented
Hi @guixiaowen, I need some time to think about it, as it might break some existing workloads.

Meantime, you can:

  • Update the PR description for better readability
  • Update the PR description according to the latest change
  • Revise the doc of spark.yarn.maxAppAttempts

@guixiaowen (Contributor, Author) commented

@yaooqinn OK, thank you.
