Changes for executor actor recovery using ray's fault tolerance. #391

Merged (2 commits) on Dec 15, 2023

KiranP-d11 (Contributor) commented Nov 29, 2023

This pull request fixes the bug described in issue #364.

It addresses Ray executors failing to recover from OOM and other failures.
The problem is caused by a race condition:

 - Executor E1 dies, for example because of OOM.
 - We try to kill it by firing a stop call on the E1 actor.
 - Since the actor is unavailable, the stop task fails for E1.
 - Meanwhile, Ray brings the lost executor E1 back up.
 - The failed stop task is retried because task retries are configured.
 - The retried stop task fires on the newly recovered executor.
 - The recovered executor exits with a status indicating a user-intended exit.
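
The sequence above can be reproduced with a small toy model. This is only a sketch in plain Python: `ToyActor`, `call_stop`, and `simulate` are hypothetical names invented for illustration, and none of this is Ray's or RayDP's actual API. It assumes the runtime restarts the dead actor between the failed stop attempt and its retry, which is the ordering described in the list. Running with zero retries is shown only for contrast and is not necessarily how this pull request resolves the issue.

```python
class ToyActor:
    """Hypothetical stand-in for an executor actor (not Ray's API)."""

    def __init__(self, name):
        self.name = name
        self.incarnation = 1      # bumped every time the actor is restarted
        self.alive = True
        self.exit_reason = None

    def crash(self, reason):
        # Models an unexpected death such as OOM.
        self.alive = False
        self.exit_reason = reason

    def restart(self):
        # Models the runtime bringing the lost actor back up.
        self.incarnation += 1
        self.alive = True
        self.exit_reason = None

    def stop(self):
        # A stop call can only be served by a live actor.
        if not self.alive:
            raise RuntimeError("actor unavailable")
        self.alive = False
        self.exit_reason = "intended exit"


def call_stop(actor, max_task_retries, on_failure=None):
    """Call stop(), retrying up to max_task_retries times after a failure.

    on_failure models what the runtime does between attempts
    (here: restarting the actor).
    """
    for _ in range(max_task_retries + 1):
        try:
            actor.stop()
            return True
        except RuntimeError:
            if on_failure is not None:
                on_failure(actor)
    return False


def simulate(max_task_retries):
    e1 = ToyActor("E1")
    e1.crash("OOM")                                   # E1 dies
    call_stop(e1, max_task_retries,
              on_failure=lambda a: a.restart())       # stop fails, E1 is restarted
    return e1.incarnation, e1.alive, e1.exit_reason


if __name__ == "__main__":
    print(simulate(max_task_retries=1))  # retried stop hits the recovered actor
    print(simulate(max_task_retries=0))  # no retry: the recovered actor survives
```

With one retry, the second stop attempt lands on the second incarnation of E1 and shuts it down as an intended exit, which matches the last step of the list. With no retry, the recovered incarnation stays alive.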

kira-lin (Collaborator) commented Dec 4, 2023

Hi @KiranP-d11
This is great. Thanks for your work!
I have already merged the PR that fixes raydp-submit. Can you please merge the main branch and try CI again?
The file changes LGTM.

KiranP-d11 (Contributor, Author) commented

@kira-lin Merged the master branch and CI checks are passing.
