activation is too slow for large clusters #167

hnfgns · 2024-10-12T05:41:45Z

Hello,

Thanks to the community for the great work. I am experimenting with hollywood in a large-scale deployment of thousands of machines (N in the 1000s).

I noticed that activating an actor on O(N) nodes floods the network with O(N^2) messages, which causes activation requests to be dropped. Setting an extremely high request timeout works for now, but even then the entire activation takes many minutes.
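To put rough numbers on the quadratic growth, here is a minimal sketch. It assumes each of the N nodes activates one actor and announces it to every other member (N-1 messages per activation); whether the agent also sends to itself is not something I have checked, and `broadcastMessages` is just an illustrative helper, not part of hollywood.

```go
package main

import "fmt"

// broadcastMessages returns the number of point-to-point messages sent when
// each of n nodes activates one actor and announces it to every other node.
// Assumption: one activation per node, one message per remote member.
func broadcastMessages(n int) int {
	if n < 2 {
		return 0
	}
	return n * (n - 1)
}

func main() {
	// Compare a small cluster with clusters in the thousands of nodes.
	for _, n := range []int{15, 1000, 5000} {
		fmt.Printf("N=%d -> %d broadcast messages\n", n, broadcastMessages(n))
	}
}
```

Under these assumptions a 15-node cluster sends a few hundred messages, while a 1000-node cluster sends close to a million for a single round of activations.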

I was able to identify two issues:

i) The agent makes a blocking call, which slows down the entire activation:

hollywood/cluster/agent.go

Line 148 in d199384

    
           resp, err := a.cluster.engine.Request(activatorPID, req, a.cluster.config.requestTimeout).Result()

ii) The agent broadcasts each activation to the entire network, leading to a quadratic number of messages, O(N^2):

hollywood/cluster/agent.go

Line 165 in d199384

a.bcast(&Activation{

I propose (i) making activation non-blocking, so that the agent does not wait for a response from the remote actor, and (ii) making broadcasts optional, so that the agent does not flood the network with O(N^2) messages. Note that (i) could be a separate method, e.g. cluster#ActivateNonBlocking(...), so as not to break cluster#Activate, and (ii) could be an optional flag, possibly passed as part of the cluster config.

Any thoughts?


hnfgns · 2024-10-12T05:46:47Z

Would love to get your feedback, @anthdm @perbu.

perbu · 2024-10-12T11:39:02Z

I'm not really familiar with the clustering bits. Clustering is cool, but it is a slight departure from the actor model, which doesn't really give you any promises.

I can't really say much about the impact of the changes you're suggesting, sorry.

hnfgns · 2024-10-14T05:08:26Z

Thanks. IMO this comes down to the question of whether to support the cluster option. If hollywood will continue to support it, then we might as well make it scale for production-grade deployments. I'm unsure whether anyone else is experimenting with thousands of nodes as we are; we may well be the first. This has been a real pain for us.

cc: @anthdm

perbu · 2024-10-14T06:21:03Z

I'm pretty sure you're the first. We have been discussing clusters up to 10-15 nodes, but I haven't really heard anyone pushing anything beyond that.

If the changes make your install workable, that is a pretty powerful testimonial, however. I'm just not sure about the downsides, if there are any.

anthdm · 2024-12-19T10:27:36Z

@hnfgns Thanks for sharing this.

I'm looking into this ATM. I will keep you posted.
