I'm using ClusterManagers.jl on a cluster where the compute nodes (which run workers) can connect to the login nodes (which run the master process), but the other direction is blocked by firewalls. The current code which sets up an ElasticManager first has the workers connecting to master (which works), but then, via code in Base.Distributed, the master establishes a second connection to the workers (which in my case is blocked). I'm wondering:
What's the reason for the extra connection here? (Naively, it seems unnecessary to me, since we already have one), and
Is there any reason we can't just use the original worker->master connection for all communication? I've been hacking at the code a bit and it kind of seems to work, but was wondering if there's a showstopper I should know about before spending more energy getting it fully working.
Thanks.
Hm, I think Distributed assumes that we can have a connection for each direction. I don't think there is a hard reason for that implementation detail, except that it is easier to think about a "receive" channel vs. a "send" channel. So yeah, I assume you can make that work.
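For what it's worth, here is a minimal sketch of the general idea that one worker-initiated TCP connection can carry traffic in both directions. It is written in Python purely as a language-neutral illustration. It is not how Distributed or ElasticManager are implemented, and the names `run_master`, `run_worker`, `demo`, and the message strings are all made up for the example.

```python
import socket
import threading


def run_master(server_sock, log):
    """Master side: accepts one inbound connection and never dials out."""
    conn, _ = server_sock.accept()
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        # worker -> master
        log.append(("master got", reader.readline().strip()))
        # master -> worker, over the same socket the worker opened
        conn.sendall(b"task: 1+1\n")
        # worker -> master again
        log.append(("master got", reader.readline().strip()))


def run_worker(port, log):
    """Worker side: opens the only connection, then answers requests on it."""
    with socket.create_connection(("127.0.0.1", port)) as conn, \
            conn.makefile("r", encoding="utf-8") as reader:
        conn.sendall(b"hello from worker\n")
        log.append(("worker got", reader.readline().strip()))
        conn.sendall(b"result: 2\n")  # hypothetical reply to the task


def demo():
    """Run both sides on localhost and return the ordered message log."""
    log = []
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server_sock.listen(1)
        port = server_sock.getsockname()[1]
        master = threading.Thread(target=run_master, args=(server_sock, log))
        master.start()
        run_worker(port, log)
        master.join()
    return log
```

Calling `demo()` returns the messages each side received. The master only ever accepts, which is the situation a firewall that blocks master->worker connections would force. Whether Distributed's internals can be adapted to reuse the existing socket this way is a separate question that this sketch does not answer.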
DilumAluthge changed the title from "Uni-directional communication in ElasticManager?" to "ElasticManager: Uni-directional communication in ElasticManager?" on Jan 2, 2025