activation is too slow for large clusters #167

Open
hnfgns opened this issue Oct 12, 2024 · 5 comments

Comments


hnfgns commented Oct 12, 2024

Hello,

Thanks to the community for the great work. I am experimenting with hollywood in a large-scale deployment of N = thousands of machines.

I noticed that activating an actor at O(N) nodes floods the network with O(N^2) messages. This causes activation requests to be dropped. Setting an extremely high request timeout works for now, but even then the entire activation takes many minutes.
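To make the scaling concrete, here is a back-of-the-envelope count. It assumes a simplified model, not a measurement of hollywood: each of N activations is announced once to each of the other N-1 members. The function name broadcastMessages is made up for illustration.

```go
package main

import "fmt"

// broadcastMessages estimates how many activation announcements cross the
// network when `activations` actors are activated and each activation is
// sent to every other member of a cluster with `members` nodes.
// Simplified model (assumption): one message per activation per remote member.
func broadcastMessages(activations, members int) int {
	if members < 1 {
		return 0
	}
	return activations * (members - 1)
}

func main() {
	// One activation per node, so activations == members == N.
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("N=%d: %d broadcast messages\n", n, broadcastMessages(n, n))
	}
}
```

Under this model, going from 10 to 1000 nodes multiplies the node count by 100 but the message count by roughly 10,000.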

I was able to identify two issues:

i) The agent makes a blocking call, which slows down the entire activation:

resp, err := a.cluster.engine.Request(activatorPID, req, a.cluster.config.requestTimeout).Result()

ii) The agent broadcasts each activation to the entire network, leading to a quadratic number of messages, O(N^2):

a.bcast(&Activation{

I propose (i) making activation non-blocking, so that the agent does not wait for a response from the remote actor, and (ii) making broadcasts optional, so that the agent does not flood the network with O(N^2) messages. Note that (i) could be a separate method, cluster#ActivateNonBlocking(...), so as not to break cluster#Activate, and (ii) could be an optional flag, potentially passed as part of the cluster config.

Any thoughts?

Author

hnfgns commented Oct 12, 2024

Would love to get your feedback @anthdm @perbu

Collaborator

perbu commented Oct 12, 2024

I'm not really familiar with the clustering bits. Clustering is cool, but it is a slight departure from the actor model, which doesn't really give you any promises.

I can't really say much about the impact of the changes you're suggesting, sorry.

Author

hnfgns commented Oct 14, 2024

Thanks. IMO this comes down to the question of whether to support the cluster option. If hollywood continues to support it, then we might as well make it scale for production-grade deployments. I'm unsure whether anyone else is experimenting with thousands of nodes as we are. We may well be the first ones out there. This has been a real pain for us.

cc: @anthdm

Collaborator

perbu commented Oct 14, 2024

I'm pretty sure you're the first. We have been discussing clusters of up to 10-15 nodes, but I haven't really heard of anyone pushing beyond that.

If the changes make your install workable, that is a pretty powerful testimonial, however. I'm just not sure about the downsides, if there are any.

Owner

anthdm commented Dec 19, 2024

@hnfgns Thanks for sharing this.

I'm looking into this ATM. I will keep you posted.
