Jobs stuck in PENDING because StartExecutionManager invokes worker with too large payload #1689

Closed
jtherrmann opened this issue Jun 26, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@jtherrmann
Contributor

jtherrmann commented Jun 26, 2023

The last release #1685 introduced a bug causing jobs to get stuck in PENDING. I have only diagnosed the problem in the following two deployments, so I'm not sure whether it's affecting any of the others:

  • hyp3-pdc: 10,000 jobs in PENDING
  • hyp3-edc-prod: 9,210 jobs in PENDING

From the most recent log stream for StartExecutionManager in hyp3-pdc:

[ERROR]	2023-06-26T17:45:19.186Z	3847acab-cab6-42a2-b6ac-aed21b1a63f3	Unhandled exception
Traceback (most recent call last):
  File "/var/task/lambda_logging/__init__.py", line 18, in wrapper
    lambda_handler(event, context)
  File "/var/task/start_execution_manager.py", line 35, in lambda_handler
    response = invoke_worker(worker_function_arn, jobs)
  File "/var/task/start_execution_manager.py", line 16, in invoke_worker
    return LAMBDA_CLIENT.invoke(
  File "/var/task/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/task/botocore/client.py", line 964, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (RequestEntityTooLargeException) when calling the Invoke operation: 279390 byte payload is too large for the Event invocation type (limit 262144 bytes)

Most recent log stream for hyp3-edc-prod shows the same error.

Per @asjohnston-asf:

"I think the short term fix would be to submit two batches of 300 jobs, rather than two batches of 450 (payload too large error) or three batches of 300 (too many submitjob requests error)"

For context, see #1676 and #1272
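
To illustrate the batching idea, here is a minimal sketch of splitting pending jobs into batches that respect both a job-count cap and the 262,144-byte limit quoted in the error above. This is not the actual `start_execution_manager.py` code: `payload_size`, `batch_jobs`, and the `{'jobs': [...]}` payload shape are hypothetical names and assumptions made for illustration. As a rough sanity check, if the 279,390-byte payload in the log was a batch of 450 jobs (an assumption; the log doesn't say), that is about 620 bytes per job, so 300 jobs would come to roughly 186,000 bytes, comfortably under the limit.

```python
import json

# Limit quoted in the RequestEntityTooLargeException message above.
EVENT_PAYLOAD_LIMIT_BYTES = 262_144


def payload_size(jobs):
    """Size in bytes of the JSON payload for one batch.

    Assumption: the worker payload looks like {'jobs': [...]}; the real
    payload shape and serialization may differ.
    """
    return len(json.dumps({'jobs': jobs}).encode('utf-8'))


def batch_jobs(jobs, max_jobs=300, limit=EVENT_PAYLOAD_LIMIT_BYTES):
    """Yield batches with at most max_jobs jobs and payloads within limit."""
    batch = []
    for job in jobs:
        if payload_size([job]) > limit:
            raise ValueError('a single job exceeds the payload limit')
        candidate = batch + [job]
        if len(candidate) > max_jobs or payload_size(candidate) > limit:
            # Adding this job would break a constraint, so close the batch.
            yield batch
            batch = [job]
        else:
            batch = candidate
    if batch:
        yield batch
```

Each batch could then be passed to something like `invoke_worker(worker_function_arn, batch)`. Checking the serialized size as well as the job count means that unusually large job records shrink the batch instead of triggering the same error again. The number of batches submitted per run would still need to be capped separately to avoid the "too many submitjob requests" error.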

@jtherrmann added the bug label on Jun 26, 2023
@jtherrmann
Contributor Author

Submitted 3 jobs to hyp3-test and confirmed that they got started successfully.

@jtherrmann
Contributor Author

The manager Lambda in hyp3-edc-prod is successfully submitting 2 batches of 300 jobs at a time and the number of pending jobs is down to 8,542.

@jtherrmann
Contributor Author

Same for hyp3-pdc, pending jobs down to 9,840.

@jtherrmann
Contributor Author

We will want to continue monitoring to ensure that the number of pending jobs is approaching 0 and that jobs are succeeding.

@jtherrmann
Contributor Author

Looking at the Step Function console in hyp3-edc-prod, there's a long list of succeeded executions that were started after we deployed the fix. There are only two failed executions that were started after we deployed the fix, and they failed during the plugin processing.

Looking at the hyp3-pdc Jobs table, there are 0 failed jobs and 515 succeeded jobs.

The number of pending jobs will take a while to decrease substantially, because the Batch job has to start running before the job status changes.
