Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ch4: separate multi-vci/nic initialization #7255

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

hzhou
Copy link
Contributor

@hzhou hzhou commented Jan 3, 2025

Pull Request Description

There are two goals of this PR:

  1. Separate setup of multiple vci/nic from the regular initialization. When multiple vci/nic are not enabled by the environment CVARs, the multi-vci/nic components should be skipped completely. This means, we should not incur any cost of multiple vci/nic unless it is opted in by user/application.

  2. Allow per-comm setup of multiple vci/nic. This is necessary to support true session model where comm_world may never get created nor initialized.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your companies PR approval manager.

hzhou added 5 commits January 3, 2025 15:19
Replace the rather ambiguous field is_tainted with vcis_enabled, thus
allowing vci activation on a per-comm basis.

It is inherited if the new comm is created from within an parent comm
that has vcis_enabled. If the parent comm vcis_enabled=false, then all
its descendents will have vcis_enabled until they are turned via separate
APIs (to be added in the future).

Intercomm and intercomm_merge may include processes outside originating
comm, thus vcis_enabled=false by default. MPI_Comm_create may create an
intercomm that inherits vcis_enabled=true. This is an exception because
both local processes and remote processes are from within originating
comm that has vcis_enabled.

For now, we switch on vcis_enabled in comm_wrold after post_init. With
future extension, it is possible to allow user explicitly set up
multi-vics on a smaller comm than comm world.
Rather than initialize per-vci mutexes in ch4 and register with request
pools, directly use MPIR-layer request pool mutexes.
Move multiple vci related init/finalize code into ch4_vci.c. Wrap
per-vci code into a function and only deal with vci 0 in ch4_init.c and
additional vcis in ch4_vci.c.
@hzhou hzhou force-pushed the 2412_comm_init branch 5 times, most recently from ab39110 to 111ce24 Compare January 6, 2025 17:48
@hzhou
Copy link
Contributor Author

hzhou commented Jan 6, 2025

test:mpich/ch3/most ✔️
test:mpich/ch4/most ❌

Gather all multiple vci init code in MPIDI_Comm_set_vcis. So -

1. The rest of the init code only deal with root vci.

2. Prepare for future dynamic and per-comm vci.
@hzhou
Copy link
Contributor Author

hzhou commented Jan 8, 2025

test:mpich/ch4/ucx ✔️
test;mpich/ch4/ofi ❌

hzhou added 8 commits January 8, 2025 11:46
We may relax the av insertion order which may require the full
(vci_local, nic_local, vci_remote, nic_remote) to look up an actual
destination address.

Add MPIDI_OFI_av_to_phys_root for convenience and quick survey on where
we restrict in only root vci (such as the init and spawn paths).

Remove MPIDI_OFI_comm_to_phys and prefer an explicit
MPIDIU_comm_rank_to_av and then MPIDI_OFI_av_to_phys.

Refactor MPIDI_OFI_SET_AM_HDR_COMMON in ofi_am_impl.h to directly use
dst_addr (as remote_id) rather than to recalculate it.
Let MPIDI_OFI_addr_t only contain field for root vci address, and only
allocate more space for additional addresses when multiple vci and nic is
enabled -- potentially at runtime. This avoids wasting memory for
multiple vcis unelss it is actually needed.
Move code that are related to multiple-vci setup to ofi_vci.c.
Add the flexibility of perform multiple vci address exchange within a
comm other than the comm world. For one, this is important in a session
where the comm_world may not exist. For two, this provides a mechanism
to save resource when applications don't need multiple vcis for the
entire comm world.
We exchange non-root endpoints in comm_set_vcis. Because we can't use
multiple nics before the address exchange, we only can activate multiple
nics in comm_set_vcis.
MPIDI_OFI_global.num_nics affects runtime paths such as ofi progress and
large message striping. Only set it in MPIDI_OFI_init_vcis so we won't
have complications when multi-nics is not ready.
@hzhou hzhou force-pushed the 2412_comm_init branch 2 times, most recently from b5fa39d to f80dcfc Compare January 8, 2025 20:34
hzhou added 6 commits January 8, 2025 21:33
This has been superseded by MPIDI_NM_comm_set_vcis.
Consolidate the shmem allocations in iqueue to 2 slabs. One root slab
that is initialized at world_init. The other all_slab for per-vci
transport, initialized at the time of init vcis.

The goal is to eventually allow more flexible shm creation, potentially
allow init within a non-world communicator.
Transition from world init to per-comm vci init.
MPIDU_shm_seg_t was used by mpidu_shm_alloc.c, mpidu_init_shm.c, and
mpidu_init_shm_alloc.c. However, the usages are all slightly different
and some fields are only used in one but not the other. It is simpler to
locally define it or, in the case of mpidu_init_shm.c, just use static
globals.
hzhou added 5 commits January 8, 2025 21:33
Add routine to support allocating a shared memory by a comm, which allows -

* create shared memory by a smaller comm than a comm_world
* attach the shared memory by later processes
* potentially allowing shm communication with dynamic processes - we
need a way to discover and attach to Init_shm (via intercomm) and the
initial shared memory need pre-allocate to account for new processes.

For now, we need this to support MPIDI_POSIX_comm_set_vcis.
We need ensure the extra fields, such as MPIDI_OFI_AV(av, all_dest), are
initialized to NULL.
Because we insert all remote endpoints to all local endpoints at the
same time, thus follow the exact same insertion order, they will share
the same av table index except for the local root endpoint because it
has inserted other remote root endpoints at init time. The local root to
remote non-root endpoints will have a fixed offset from that of local
non-root.
@hzhou
Copy link
Contributor Author

hzhou commented Jan 9, 2025

test:mpich/ch4/ofi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant