
Recovery for lost executors 364 #389

Closed · wants to merge 3 commits

Conversation

KiranP-d11 (Contributor)

Pull request for the bug described in issue #364.

@kira-lin (Collaborator) left a comment

Thanks for your PR!
I have a concern, though: how will this interact with fault tolerance mode? In fault tolerance mode, if an executor (actor) fails, it gets restarted and re-registers. I think this PR does pretty much the same thing, but on the Spark side. If users turn on fault tolerance mode, will executors be requested multiple times?

Maybe I forgot the context, but why does fault tolerance mode fail to meet your need?

logDebug(s"Checking if any of the executor handlers are unreachable for ${allExecutors.mkString(", ")}")
allExecutors.foreach { executorId =>
val handlerOpt = appInfo.getExecutorHandler(executorId)
if (handlerOpt.isEmpty) {
Collaborator

This is not likely to happen, right?

@kira-lin (Collaborator) commented Nov 9, 2023

Please also fix the style lint errors, thanks.

@KiranP-d11 (Contributor, Author)

Correct me if I am wrong, but fault tolerance mode only applies when converting a Spark DataFrame to a Ray dataset. If the use case is just to run a Spark job and write the output to a destination (like S3), then enabling fault tolerance mode has no effect.
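
For illustration, a minimal sketch of the conversion path this mode protects, assuming RayDP's documented fault_tolerance_mode flag on init_spark (the app name and resource values are placeholders, not taken from this thread):

import ray
import raydp

ray.init(address="auto")

# Assumption: fault_tolerance_mode is RayDP's flag for making executor
# actors restartable so that converted datasets can be recovered.
spark = raydp.init_spark(
    app_name="fault_tolerance_demo",  # placeholder app name
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
    fault_tolerance_mode=True,
)

df = spark.range(0, 1000)
# The Spark-to-Ray dataset conversion is the path this mode covers.
ds = ray.data.from_spark(df)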

That being said, I understand that when the executor actors die, whether from OOM issues or node failures, Ray restarts the actors, and the restarted executors should be added back to Spark. However, currently this is not happening. When an executor dies, it gets restarted, but it immediately dies again before adding itself back to the Spark application. This leaves the Spark application stuck, waiting indefinitely for the executors to come up without doing anything.

I will try to debug why this issue is happening and will attempt to raise a PR for it (in which case this current PR, with its changes on the Spark side, is not needed).

@kira-lin (Collaborator)

However, currently this is not happening. When an executor dies, it gets restarted, but it immediately dies again before adding itself back to the Spark application. This leaves the Spark application stuck, waiting indefinitely for the executors to come up without doing anything.

Do you mean that even when fault tolerance mode is on, executors are not added back as expected? I see; this might be a bug.

Indeed, we added fault tolerance mode to recover DataFrames, and it introduces some behavior that might not be wanted. We are happy to have this feature, but we should make sure it can be turned off when fault tolerance mode is on.

@KiranP-d11 (Contributor, Author)

Do you mean that even when fault tolerance mode is on, executors are not added back as expected? I see; this might be a bug.

Yes, even when fault tolerance mode is on, executors are not added back.

@kira-lin (Collaborator)

I see.

We are happy to have this feature, but we should make sure it can be turned off when fault tolerance mode is on.

Is it possible to check whether fault tolerance mode is on and, if so, disable the periodically sent RPC?

@pang-wu, does this solve your issue?
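
One possible shape for that gating, as a hypothetical sketch (the class, flag, and method names here are invented for illustration and do not come from this PR):

import threading

class ExecutorLivenessMonitor:
    """Hypothetical: periodically re-requests lost executors from Spark,
    unless fault tolerance mode already handles recovery on the Ray side."""

    def __init__(self, fault_tolerance_enabled: bool, interval_s: float = 10.0):
        self.fault_tolerance_enabled = fault_tolerance_enabled
        self.interval_s = interval_s

    def start(self) -> None:
        # With fault tolerance mode on, Ray restarts dead executor actors
        # itself, so the periodic Spark-side RPC would duplicate that work.
        if self.fault_tolerance_enabled:
            return
        self._schedule()

    def _schedule(self) -> None:
        threading.Timer(self.interval_s, self._check).start()

    def _check(self) -> None:
        # Placeholder for the "are any executor handlers unreachable" check
        # from this PR; re-request replacements here, then re-arm the timer.
        self._schedule()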

@KiranP-d11 (Contributor, Author)

Is it possible to check whether fault tolerance mode is on and, if so, disable the periodically sent RPC?

I will check and get back on this.

@pang-wu (Contributor) commented Nov 10, 2023

@kira-lin I need to check, but I am very excited about this feature; it solves a big problem in our production. However, I have some concerns about turning this feature off automatically. If possible, I would rather have an option for the user to turn it off. The reasons are below:

Is it possible to check whether fault tolerance mode is on and, if so, disable the periodically sent RPC?

1. Let's say fault tolerance mode is on, but I am not converting a Spark DataFrame to a Ray dataset. Will RayDP still recover failed executors?
2. Also, we have other use cases where the Spark context is not initialized from Python code but via spark-submit/raydp submit. In this case, we don't have a way to turn fault tolerance on even if it could solve the problem.

@kira-lin (Collaborator)

Will RayDP still recover failed executors?

With fault tolerance mode on, we basically turn on Ray's failed-actor restart, so yes, it should recover the executors by design.
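
For reference, a minimal sketch of that Ray mechanism using Ray's public actor API (the Executor class here is a stand-in for illustration, not RayDP's actual executor actor):

import ray

ray.init()

# max_restarts=-1 tells Ray to restart this actor indefinitely after it
# crashes; this is the failed-actor restart that fault tolerance mode enables.
@ray.remote(max_restarts=-1)
class Executor:  # stand-in for RayDP's executor actor
    def ping(self) -> str:
        return "alive"

executor = Executor.remote()
print(ray.get(executor.ping.remote()))  # prints "alive", even after restarts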

Also, we have other use cases where the Spark context is not initialized from Python code but via spark-submit/raydp submit. In this case, we don't have a way to turn fault tolerance on even if it could solve the problem.

That's right.

@KiranP-d11 (Contributor, Author)

@kira-lin I was able to debug and fix the issue with Ray's fault tolerance.
I have raised another PR for this.

Closing this PR as it is not required.
