Adds distributed row gatherer #1589

Open
wants to merge 8 commits into
base: neighborhood-communicator

Conversation

MarcelKoch
Member

@MarcelKoch MarcelKoch commented Apr 4, 2024

This PR adds a distributed row gatherer. This operator essentially provides the communication required in our matrix apply.

Besides the normal apply (which is blocking), it also provides two asynchronous calls. One version takes an additional workspace parameter, which is used as the send buffer. This version can be called multiple times without restrictions, provided a different workspace is used for each call. The other version has no workspace parameter and instead uses an internal buffer. As a consequence, it can only be called again once the request of the previous call has been waited on; otherwise, it throws.
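
The difference between the two asynchronous variants can be sketched without Ginkgo or MPI. The following is only a toy model under stated assumptions: `ToyRowGatherer`, its members, and the use of a deferred `std::future` in place of a real communication request are all hypothetical and are not taken from the PR.

```cpp
#include <future>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the row gatherer; NOT the Ginkgo class.
// "Communication" is modelled as a deferred copy that only runs on wait().
class ToyRowGatherer {
public:
    // Variant with a caller-provided workspace (the send buffer).
    // Calls may overlap as long as each one uses its own workspace.
    std::future<void> apply_async(const std::vector<double>& b,
                                  std::vector<double>& x,
                                  std::vector<double>& workspace) const
    {
        workspace = b;  // pack the send buffer
        return std::async(std::launch::deferred,
                          [&x, &workspace] { x = workspace; });
    }

    // Variant that uses the internal buffer. A second call throws unless
    // the previous request has been waited on.
    std::future<void> apply_async(const std::vector<double>& b,
                                  std::vector<double>& x)
    {
        if (*busy_) {
            throw std::runtime_error(
                "previous request has not been waited on");
        }
        *busy_ = true;
        buffer_ = b;  // pack the internal send buffer
        return std::async(std::launch::deferred,
                          [busy = busy_, buffer = &buffer_, &x] {
                              x = *buffer;
                              *busy = false;  // buffer may be reused now
                          });
    }

private:
    std::shared_ptr<bool> busy_ = std::make_shared<bool>(false);
    std::vector<double> buffer_;
};

// Returns true if the second call without a wait throws and a call after
// the wait succeeds.
inline bool second_call_without_wait_throws()
{
    ToyRowGatherer rg;
    std::vector<double> b{1.0, 2.0}, x1, x2;
    auto first = rg.apply_async(b, x1);
    bool threw = false;
    try {
        auto second = rg.apply_async(b, x2);  // internal buffer still in use
    } catch (const std::runtime_error&) {
        threw = true;
    }
    first.wait();                        // completes the first request
    auto third = rg.apply_async(b, x2);  // allowed again after the wait
    third.wait();
    return threw && x1 == b && x2 == b;
}
```

In this model the output vector holds no meaningful data until the returned future has been waited on, and the internal-buffer variant trades flexibility for not having to manage a workspace.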

This is the second part of splitting up #1546.

It also introduces some intermediate changes, which could be extracted out beforehand:

PR Stack:

@MarcelKoch MarcelKoch self-assigned this Apr 4, 2024
@ginkgo-bot ginkgo-bot added reg:build This is related to the build system. reg:testing This is related to testing. mod:core This is related to the core module. type:matrix-format This is related to the Matrix formats labels Apr 4, 2024
@MarcelKoch MarcelKoch requested a review from pratikvn April 4, 2024 10:49
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from 6b4521b to ae60198 Compare April 4, 2024 11:00
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 6acf7c4 to 8aa6ab9 Compare April 4, 2024 11:00
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch 2 times, most recently from 49557f1 to 4a79442 Compare April 5, 2024 08:18
@MarcelKoch MarcelKoch modified the milestone: Ginkgo 1.8.0 Apr 5, 2024
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 8aa6ab9 to 77398bd Compare April 17, 2024 16:28
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from 4a79442 to 172eb7d Compare April 17, 2024 16:28
@MarcelKoch MarcelKoch requested a review from upsj April 19, 2024 09:20
@MarcelKoch MarcelKoch mentioned this pull request Apr 19, 2024
7 tasks
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 77398bd to d278cad Compare April 19, 2024 14:39
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch 2 times, most recently from 98fa10a to 79de4c3 Compare April 19, 2024 16:19
@MarcelKoch
Member Author

One issue that I have is the constructor. It takes a collective_communicator and an index_map. The index_map already defines the communication pattern, so the collective_communicator has to match that.
One option might be to have a virtual function like

std::unique_ptr<collective_communicator> create_with_same_type(communicator, index_map);

If I can't come up with anything better, I guess I will use that.
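
A minimal sketch of how such a virtual factory could look, assuming simplified stand-in types: `ToyComm`, `ToyIndexMap`, the `Toy...Communicator` classes, and `rebuild_for` are all hypothetical and only illustrate the pattern, not the Ginkgo interface.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; the real communicator and index_map are richer.
struct ToyComm {
    int rank;
};
struct ToyIndexMap {
    std::vector<int> remote_ranks;  // defines the communication pattern
};

class ToyCollectiveCommunicator {
public:
    virtual ~ToyCollectiveCommunicator() = default;

    // Creates a communicator of the same dynamic type as *this, but with
    // the communication pattern taken from the given index map.
    virtual std::unique_ptr<ToyCollectiveCommunicator> create_with_same_type(
        ToyComm comm, const ToyIndexMap& imap) const = 0;

    virtual std::string name() const = 0;

    std::size_t num_neighbors() const { return neighbors_.size(); }
    int rank() const { return comm_.rank; }

protected:
    ToyCollectiveCommunicator(ToyComm comm, const ToyIndexMap& imap)
        : comm_(comm), neighbors_(imap.remote_ranks)
    {}

private:
    ToyComm comm_;
    std::vector<int> neighbors_;
};

class ToyDenseCommunicator : public ToyCollectiveCommunicator {
public:
    ToyDenseCommunicator(ToyComm comm, const ToyIndexMap& imap)
        : ToyCollectiveCommunicator(comm, imap)
    {}
    std::unique_ptr<ToyCollectiveCommunicator> create_with_same_type(
        ToyComm comm, const ToyIndexMap& imap) const override
    {
        return std::make_unique<ToyDenseCommunicator>(comm, imap);
    }
    std::string name() const override { return "dense"; }
};

class ToyNeighborhoodCommunicator : public ToyCollectiveCommunicator {
public:
    ToyNeighborhoodCommunicator(ToyComm comm, const ToyIndexMap& imap)
        : ToyCollectiveCommunicator(comm, imap)
    {}
    std::unique_ptr<ToyCollectiveCommunicator> create_with_same_type(
        ToyComm comm, const ToyIndexMap& imap) const override
    {
        return std::make_unique<ToyNeighborhoodCommunicator>(comm, imap);
    }
    std::string name() const override { return "neighborhood"; }
};

// What a constructor could do: keep the user's choice of communicator type,
// but rebuild it so that its pattern is guaranteed to match the index map.
inline std::unique_ptr<ToyCollectiveCommunicator> rebuild_for(
    const ToyCollectiveCommunicator& tmpl, ToyComm comm,
    const ToyIndexMap& imap)
{
    return tmpl.create_with_same_type(comm, imap);
}
```

With this shape, the communicator passed in acts only as a type tag, so a mismatch between its pattern and the index map cannot occur.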

@pratikvn
Member

Do we need to have the std::future setup for the release? Can we remove that for now and just use a normal synchronous approach? I think that is a significant change that maybe needs more thought and probably a separate PR.

@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 0ad4ee8 to 1f49b91 Compare August 16, 2024 15:21
@MarcelKoch MarcelKoch requested review from upsj and removed request for upsj August 27, 2024 12:05
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from 8697971 to 341e781 Compare October 7, 2024 13:06
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 1f49b91 to 4db050c Compare October 7, 2024 13:06
send_sizes.data(), send_offsets.data(), type, recv_ptr,
recv_sizes.data(), recv_offsets.data(), type);
coll_comm
->i_all_to_all_v(use_host_buffer ? exec->get_master() : exec, send_ptr,
Member

Is there any difference between using all_to_all_v and i_all_to_all_v? I assume all_to_all_v also updates the interface.

Member Author

all_to_all_v is a blocking call, while i_all_to_all_v is non-blocking. Right now the collective_communicator only provides the non-blocking interface, since it is more general.
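
Why the non-blocking interface is the more general one can be sketched with a single-process toy: a blocking call is just the non-blocking call followed by a wait. Everything here is an assumption for illustration. `ToyRank`, `toy_i_all_to_all_v`, and `toy_all_to_all_v` are hypothetical, the "ranks" live in one process, and a deferred `std::future` stands in for a communication request.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical per-rank buffers. Sizes and offsets are indexed by the peer
// rank. recv_sizes is listed only for symmetry; it must match the senders'
// send_sizes and recv must already be large enough.
struct ToyRank {
    std::vector<int> send, send_sizes, send_offsets;
    std::vector<int> recv, recv_sizes, recv_offsets;
};

// Non-blocking variant: returns a request; data moves when it is waited on.
inline std::future<void> toy_i_all_to_all_v(std::vector<ToyRank>& ranks)
{
    return std::async(std::launch::deferred, [&ranks] {
        for (std::size_t src = 0; src < ranks.size(); ++src) {
            for (std::size_t dst = 0; dst < ranks.size(); ++dst) {
                std::copy_n(
                    ranks[src].send.begin() + ranks[src].send_offsets[dst],
                    ranks[src].send_sizes[dst],
                    ranks[dst].recv.begin() + ranks[dst].recv_offsets[src]);
            }
        }
    });
}

// Blocking variant, expressed through the non-blocking one.
inline void toy_all_to_all_v(std::vector<ToyRank>& ranks)
{
    toy_i_all_to_all_v(ranks).wait();
}

// Rank 0 sends {1} to itself and {2, 3} to rank 1;
// rank 1 sends {4} to rank 0 and {5} to itself.
inline std::vector<ToyRank> make_two_rank_example()
{
    ToyRank r0{{1, 2, 3}, {1, 2}, {0, 1}, {0, 0}, {1, 1}, {0, 1}};
    ToyRank r1{{4, 5}, {1, 1}, {0, 1}, {0, 0, 0}, {2, 1}, {0, 2}};
    return {r0, r1};
}
```

The reverse direction does not hold: a purely blocking interface cannot offer overlap of communication and computation, which is what the asynchronous apply relies on.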

include/ginkgo/core/distributed/row_gatherer.hpp Outdated
* auto x = matrix::Dense<double>::create(...);
*
* auto future = rg->apply_async(b, x);
* // do some computation that doesn't modify b, or access x
Member

I think it accesses x, but it is unclear when x will be accessed before the wait.

Member Author

I guess this is just meant to say that you can't expect any meaningful data when accessing x before the wait has completed.

Member

I think I got it wrong.
Is the comment here meant to describe what the user can safely do after the call, or the behavior of apply_async?
My comment assumed it describes the behavior of apply_async, because apply_async definitely accesses x.
If it describes what the user may do between the async call and the wait, then it is correct.

core/distributed/row_gatherer.cpp Outdated
Comment on lines +98 to +102
workspace.set_executor(mpi_exec);
if (send_size_in_bytes > workspace.get_size()) {
    workspace.resize_and_reset(sizeof(ValueType) *
                               send_size[0] * send_size[1]);
}
Member

Combine them to assign the workspace directly?

Member Author

Combine how? Do you mean like

workspace = array<char>(mpi_exec, sizeof(ValueType) * send_size[0] * send_size[1]);
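
The trade-off between the two forms can be sketched with a plain `std::vector<char>` instead of Ginkgo's array. This is only an illustration under stated assumptions: `ToyWorkspace`, `ensure_size`, and `reassign` are hypothetical names, and the counter merely makes the number of (re)allocations visible.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical workspace wrapper; NOT Ginkgo's array<char>.
struct ToyWorkspace {
    std::vector<char> bytes;
    int allocations = 0;  // counts how often the buffer was rebuilt

    // Grow-only: rebuild the buffer only if it is too small, mirroring the
    // "if (needed > size) resize" form from the diff.
    void ensure_size(std::size_t needed)
    {
        if (needed > bytes.size()) {
            bytes.assign(needed, 0);
            ++allocations;
        }
    }

    // Direct assignment: always replaces the buffer, even if the old one
    // would have been large enough.
    void reassign(std::size_t needed)
    {
        bytes = std::vector<char>(needed);
        ++allocations;
    }
};
```

Under this model, repeated calls with the same or a smaller send size cost nothing with the grow-only form, while direct assignment rebuilds the buffer on every call.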

Comment on lines +118 to +119
req = coll_comm_->i_all_to_all_v(
    mpi_exec, send_ptr, type.get(), recv_ptr, type.get());
Member

send_buffer might be on the host, but recv_ptr (x_local) might be on the device.

Member Author

I have a check above to ensure that the memory space of the recv buffer is accessible from the MPI executor. So if GPU-aware MPI is used, it should work (even if the send buffer is on the host and the recv buffer on the device, or vice versa). Otherwise, an exception will be thrown.

core/test/mpi/distributed/row_gatherer.cpp Outdated
core/test/mpi/distributed/row_gatherer.cpp Outdated
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 4db050c to 1ebe59f Compare October 23, 2024 13:32
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from b2025a8 to f77cb6c Compare October 23, 2024 14:17
@MarcelKoch MarcelKoch requested a review from yhmtsai October 24, 2024 10:47
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from f77cb6c to c827b23 Compare October 30, 2024 15:10
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 1ebe59f to e7d32a1 Compare October 30, 2024 15:10
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from c827b23 to 2a54c3e Compare October 30, 2024 15:30
@MarcelKoch MarcelKoch modified the milestones: Ginkgo 1.9.0, Ginkgo 1.10.0 Dec 9, 2024
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 1216932 to ceb6f2e Compare December 17, 2024 13:59
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from 2a54c3e to bd358fc Compare December 18, 2024 09:19
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from ceb6f2e to 807118c Compare December 18, 2024 09:27
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch 2 times, most recently from a52ba0d to 08c1f4e Compare December 19, 2024 14:15
MarcelKoch and others added 8 commits January 8, 2025 16:47
- only allocate if necessary
- synchronize correct executor

Co-authored-by: Pratik Nayak <[email protected]>
- split tests into core and backend part
- fix formatting
- fix openmpi pre 4.1.x macro

Co-authored-by: Pratik Nayak <[email protected]>
Co-authored-by: Yu-Hsiang M. Tsai <[email protected]>
Signed-off-by: Marcel Koch <[email protected]>
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 6d548e6 to cf55d8d Compare January 8, 2025 15:49
@MarcelKoch MarcelKoch force-pushed the distributed-row-gatherer branch from 08c1f4e to b3cab68 Compare January 8, 2025 15:49
Labels
1:ST:ready-for-review This PR is ready for review mod:core This is related to the core module. reg:build This is related to the build system. reg:testing This is related to testing. type:matrix-format This is related to the Matrix formats
4 participants