[RFC][Discussion] Automatic Parallelization #335

Closed
soodoshll opened this issue Jul 28, 2023 · 8 comments
Labels
enhancement New feature or request rfc Discussion of potential rfc

Comments

@soodoshll
Collaborator

soodoshll commented Jul 28, 2023

rendered rfc

@soodoshll soodoshll added the enhancement New feature or request label Jul 28, 2023
@yaoyaoding
Member

Hi @soodoshll, thanks for the draft!

It looks good as a first-version RFC draft!

I have several suggestions:

  1. The 0001 and 0002 RFC slots have already been used. Consider using 0003.
  2. It might be better to give a concrete example of distributed_config and optimization_config in the guide-level explanation.
  3. The reference-level explanation is a good place to show which configs are available for distributed_config and optimization_config, and what each config specifies. It's okay to list only the ones that are known now, then update the draft and add more configs during implementation and future refactoring.
  4. I prefer making out_dir a separate function parameter instead of an attribute in the config.
  5. Add a reference to Alpa and use something like "Alpa[1]" in the main text.
  6. This part "For example, for a 4x4 multi-machine-multi-GPU cluster, the possible sharding specifications are (4x4, 1), (4(machine), 4(gpus)), (4(gpus), 4(machines)), (1, 4x4). We do not consider (2, 8) or (8, 2). Therefore, using R or Si is sufficient since the number of shards is determined by the number of devices. " is a little vague. Consider adding an example to illustrate what a specific sharding specification means (e.g., (4 gpus, 4 machines)), and explain the meaning of "R" and "Si".
  7. Consider using mesh_axes_per_dim in TensorShardSpec.
  8. The math formula in "Operator Sharding Specification" has some typesetting flaws.
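To make point 6 concrete, here is a minimal sketch of how the four specifications in the quoted passage could be enumerated. It assumes the Alpa-style reading where "R" means a tensor dimension is replicated and "S" followed by mesh axis indices means the dimension is sharded along those mesh axes. The function names and the tuple-of-mesh-axes representation are hypothetical illustrations (loosely in the spirit of mesh_axes_per_dim), not the RFC's actual data structure.

```python
from itertools import product

# Hypothetical 2D device mesh: axis 0 = machines, axis 1 = GPUs per machine.
MESH_SHAPE = (4, 4)

def enumerate_specs(num_tensor_dims, mesh_shape=MESH_SHAPE):
    """Yield every assignment of mesh axes to tensor dims.

    A spec has one entry per tensor dim; each entry is the tuple of mesh
    axes that shard that dim. An empty tuple means replicated ("R").
    Each mesh axis is either unused or assigned to exactly one tensor dim.
    """
    choices = [None] + list(range(num_tensor_dims))
    for assignment in product(choices, repeat=len(mesh_shape)):
        spec = [[] for _ in range(num_tensor_dims)]
        for mesh_axis, tensor_dim in enumerate(assignment):
            if tensor_dim is not None:
                spec[tensor_dim].append(mesh_axis)
        yield tuple(tuple(axes) for axes in spec)

def num_shards(spec, mesh_shape=MESH_SHAPE):
    """Number of pieces each tensor dim is split into under `spec`."""
    counts = []
    for axes in spec:
        n = 1
        for axis in axes:
            n *= mesh_shape[axis]
        counts.append(n)
    return tuple(counts)

def fmt(spec):
    """Render a spec in R / S<axes> notation, e.g. (S0, S1)."""
    parts = ["R" if not axes else "S" + "".join(map(str, axes)) for axes in spec]
    return "(" + ", ".join(parts) + ")"

all_specs = list(enumerate_specs(num_tensor_dims=2))
# Keep only the specs that use every mesh axis (the four cases in the quote).
fully_sharded = [s for s in all_specs if sum(len(a) for a in s) == len(MESH_SHAPE)]
for spec in fully_sharded:
    print(fmt(spec), num_shards(spec))
```

Because a tensor dimension can only be split by whole mesh axes, shard counts such as (2, 8) or (8, 2) never appear, which is why naming the mesh axes is enough to pin down the number of shards.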

@yaoyaoding
Member

  1. Could add a section to describe the ILP formulation.

The design looks good to me. Hi @soodoshll and @xinli-git, could you also discuss how to separate the whole feature into relatively small steps to implement? We can use this issue to track the PRs related to this RFC, something like apache/tvm#15319. Thanks!

@soodoshll soodoshll added the rfc Discussion of potential rfc label Jul 29, 2023
@soodoshll
Collaborator Author

Hi @yaoyaoding, thanks for your suggestions. I've fixed the draft.

The whole feature can be decomposed into the following steps:

  1. Design and implement the data structure for tensor and op sharding specifications
  2. connect function, which relies on (1)
  3. Sharding rule generation, which relies on (1)
  4. weight sharding and comm op injection, which relies on (2)
  5. auto-parallelization algorithm, which relies on (2) and (3)
  6. Run end-to-end tests

I'm working on 1; after it is done, we can start 2 and 3. I have a prototype of 3, which I will integrate later.

Hi @xinli-git, let's work in the auto-parallel branch.

@soodoshll
Collaborator Author

I found that resharding (tensor conversion between ops with different specifications) sometimes requires the collective communication primitive all-to-all. For example, it happens when an MxN matrix is sharded along axis M and we want to convert it to be sharded along axis N.

Although NCCL does not directly support all-to-all, it can be implemented with send and recv. Without all-to-all, a workaround is to use all-gather and then slice the result, which achieves the same layout at the cost of suboptimal performance.
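The two resharding routes can be compared with a small pure-Python simulation. This is only a sketch under simplifying assumptions: devices are modelled as list entries, every transfer is a point-to-point copy, and "traffic" is just the number of elements that cross between devices (real NCCL collectives use ring or tree algorithms, so this is a volume estimate, not a timing model). All function names are hypothetical.

```python
def shard_rows(matrix, p):
    """Split an M x N matrix into p row blocks (sharded along axis M)."""
    rows = len(matrix) // p
    return [matrix[d * rows:(d + 1) * rows] for d in range(p)]

def shard_cols(matrix, p):
    """Split an M x N matrix into p column blocks (sharded along axis N)."""
    cols = len(matrix[0]) // p
    return [[row[d * cols:(d + 1) * cols] for row in matrix] for d in range(p)]

def reshard_all_to_all(row_shards):
    """Device src sends only column block dst of its row shard to device dst."""
    p = len(row_shards)
    cols = len(row_shards[0][0]) // p
    traffic, out = 0, []
    for dst in range(p):
        block = []
        for src in range(p):
            piece = [row[dst * cols:(dst + 1) * cols] for row in row_shards[src]]
            if src != dst:  # local data does not cross the network
                traffic += sum(len(r) for r in piece)
            block.extend(piece)
        out.append(block)
    return out, traffic

def reshard_all_gather_slice(row_shards):
    """Every device gathers the full matrix, then keeps its column block."""
    p = len(row_shards)
    cols = len(row_shards[0][0]) // p
    traffic, out = 0, []
    for dst in range(p):
        full = []
        for src in range(p):
            if src != dst:  # the whole remote shard is transferred
                traffic += sum(len(r) for r in row_shards[src])
            full.extend(row_shards[src])
        out.append([row[dst * cols:(dst + 1) * cols] for row in full])
    return out, traffic

P = 4
matrix = [[r * 8 + c for c in range(8)] for r in range(8)]  # 8 x 8 example
row_shards = shard_rows(matrix, P)
a2a, a2a_traffic = reshard_all_to_all(row_shards)
ag, ag_traffic = reshard_all_gather_slice(row_shards)
print(a2a == ag == shard_cols(matrix, P))
print(a2a_traffic, ag_traffic)
```

In this model both routes produce the same column-sharded layout, but the all-gather route moves P times as many elements, since each device receives full remote shards and discards most of them.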

I'd suggest treating it as a low-priority TODO item and seeing whether it really causes a performance issue. We can fix it after finishing the backbone of the whole pipeline.

@xinli-git
Collaborator

Thanks! @soodoshll. The RFC is very detailed.

For modelling computation, it seems that Alpa assumes that all tensor contraction ops (MM, Conv) must be fully sharded, so all such ops have the same computation cost under different sharding strategies. They also observe that other ops have negligible computation cost (I verified this as well). As a result, they concluded there was no need to model computation.
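A quick arithmetic check of that argument, as a sketch only: for a matmul C[M, N] = A[M, K] x B[K, N] on 16 devices, any strategy that splits the (M, N, K) iteration space across all 16 devices leaves each device with the same number of multiply-adds. The shapes and the list of strategies below are made-up illustration values, not taken from Alpa or the RFC.

```python
def per_device_flops(m, n, k, split_m, split_n, split_k):
    """FLOPs per device when the M, N, K loops are split evenly across devices."""
    return 2 * (m // split_m) * (n // split_n) * (k // split_k)

M = N = K = 1024
NUM_DEVICES = 16

# Hypothetical fully sharded strategies: split_m * split_n * split_k == 16.
strategies = [
    (16, 1, 1),  # shard rows of A only
    (1, 16, 1),  # shard columns of B only
    (1, 1, 16),  # shard the contraction dim (needs a reduction afterwards)
    (4, 4, 1),
    (4, 1, 4),
    (1, 4, 4),
]

costs = {s: per_device_flops(M, N, K, *s) for s in strategies}
for strategy, flops in costs.items():
    print(strategy, flops)
```

Every strategy gives 2·M·N·K / 16 FLOPs per device, so the strategies differ only in the communication they require (for example, the reduction needed when K is split), which is the part left for the cost model.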

Since this feature probably requires a month of work for multiple people (currently me and Qidong), I was thinking maybe we can leverage GitHub Projects (https://github.com/hidet-org/hidet/projects?query=is%3Aopen).

@yaoyaoding, if you think that's a good idea, I will take the lead on this.

@yaoyaoding
Member

Hi @xinli-git, sounds good to me. I have not used the GitHub Projects feature before, but you can give it a try and let's see whether it helps with organization and planning.

@vadiklyutiy
Collaborator

@yaoyaoding
Is this still relevant right now? What do you think?

@yaoyaoding
Member

We can close this issue, but the contents of the RFC are still valuable and can be used as a reference for hidet developers working on this part.

4 participants