Build sdist and wheel in CI #1865

Open · wants to merge 12 commits into master
Conversation

@calebho commented Dec 17, 2024

summary

People often run into issues building from source, so it would help if there were pre-built wheels with sensible defaults. This PR adds a GitHub Actions workflow that builds wheels for various Python and CUDA versions and uploads them as workflow artifacts on each push.
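
For reference, the workflow is in the spirit of the sketch below. This is a hypothetical, simplified sketch, not the actual diff: the file name, image tags, matrix entries, and build flags are illustrative (the real matrix and the apex-specific extension flags live in the PR's workflow files).

```yaml
# .github/workflows/build-wheels.yml (illustrative name, not the actual file)
name: build-wheels

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python: ["3.10", "3.11", "3.12"]   # illustrative entries
        cuda: ["12.1.1", "12.4.1"]
    container:
      # Ubuntu 20.04 base image, hence the glibc 2.31 floor in the Notes below
      image: nvidia/cuda:${{ matrix.cuda }}-devel-ubuntu20.04
    steps:
      - uses: actions/checkout@v4
      - name: Build the wheel
        run: |
          # Installing Python ${{ matrix.python }} into the container is elided.
          # torch 2.5 is pinned: extension ABI is not guaranteed across versions.
          pip install "torch==2.5.*" build
          python -m build --wheel   # apex's --cpp_ext/--cuda_ext flags elided
      - uses: actions/upload-artifact@v4
        with:
          name: dist-py${{ matrix.python }}-cu${{ matrix.cuda }}
          path: dist/
```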

To get a wheel, open the workflow run page and download the matching artifact, following https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/downloading-workflow-artifacts. For example:

gh -R calebho/apex run download 12365956032 -n dist-py3.10-cu12.1.1

Still clunky, but much faster than building from source.

The specific versions were chosen to match what PyTorch stable (2.5 at the time of writing) currently supports.

Notes

  • The build containers run Ubuntu 20.04, which ships glibc 2.31, so the runtime environment needs glibc 2.31 or newer
  • PyTorch 2.5 is hardcoded, and I don't think there are ABI guarantees across versions, so the runtime environment will probably also need PyTorch 2.5

Good follow-ups

  • Do periodic GitHub releases and attach the wheels to the release; that way people can pip install https://github.com/NVIDIA/apex/releases/... for the appropriate wheel (a minimal sketch of such a job follows this list)
  • Support PyPI #209: this PR doesn't publish sdist or wheel to PyPI; someone from NVIDIA ought to own that process
  • Add PyTorch versions to the build matrix
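
A hypothetical sketch of that release follow-up, written as a job appended to the build workflow sketched in the summary. It assumes the workflow also triggers on version tags; the tag pattern and job name are illustrative, not anything this PR implements.

```yaml
  # Runs in the same workflow as the `build` job above, so this run's
  # artifacts are directly downloadable. Assumes `on.push.tags: ["v*"]`
  # is added to the workflow triggers.
  release:
    needs: build
    if: startsWith(github.ref, 'refs/tags/v')
    runs-on: ubuntu-latest
    permissions:
      contents: write            # `gh release` needs write access to attach assets
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: dist-*        # collect every matrix job's wheel artifact
          merge-multiple: true
          path: dist
      - name: Create the release and attach wheels
        env:
          GH_TOKEN: ${{ github.token }}
        run: gh release create "$GITHUB_REF_NAME" --generate-notes dist/*.whl
```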

test plan

builds are green: https://github.com/calebho/apex/actions/runs/12365956032/job/34511835345

(screenshot, 2024-12-16: all matrix build jobs passing)

ran L0 tests on Python 3.10 + CUDA 12.1 on 2xA100 40GB

❯ pip list
Package                  Version
------------------------ -----------
apex                     0.1
cxxfilt                  0.3.0
exceptiongroup           1.2.2
expecttest               0.3.0
filelock                 3.13.1
fsspec                   2024.2.0
iniconfig                2.0.0
Jinja2                   3.1.3
MarkupSafe               2.1.5
mpmath                   1.3.0
networkx                 3.2.1
numpy                    2.2.0
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        9.1.0.70
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.21.5
nvidia-nvjitlink-cu12    12.1.105
nvidia-nvtx-cu12         12.1.105
packaging                24.2
pip                      24.3.1
pluggy                   1.5.0
pytest                   8.3.4
PyYAML                   6.0.2
setuptools               75.6.0
sympy                    1.13.1
tomli                    2.2.1
torch                    2.5.1+cu121
tqdm                     4.67.1
triton                   3.1.0
typing_extensions        4.9.0

❯ nvidia-smi
Tue Dec 17 05:00:05 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   30C    P0              55W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   30C    P0              52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+


❯ python tests/L0/run_test.py
testGradScaler (test_adam.AdamTest) ... ok
testGradScalerCapturable (test_adam.AdamTest) ... /scratch/slurm_tmpdir/38660/env/lib/python3.10/site-packages/torch/amp/grad_scaler.py:415: FutureWarning: GradScaler is going to stop passing itself as a keyword argument to the passed optimizer. In the near future GradScaler registers `grad_scale: Tensor` and `found_inf: Tensor` to the passed optimizer and let the optimizer use them directly.
  warnings.warn(
ok
testGradScalerCapturableMaster (test_adam.AdamTest) ... ok
testLargeTensor (test_adam.AdamTest) ... skipped 'Insufficient cuda memory'
testNative (test_adam.AdamTest) ... ok
test_float (test_fused_novograd.TestFusedNovoGrad) ... /scratch/slurm_tmpdir/38660/env/lib/python3.10/site-packages/apex/optimizers/fused_novograd.py:176: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:78.)
  group['exp_avg_sq'][0] = torch.cuda.FloatTensor(v_16, device=self.param_groups[0]["params"][0].device)
ok
test_half (test_fused_novograd.TestFusedNovoGrad) ... ok
test_multi_device (test_fused_novograd.TestFusedNovoGrad) ... ok
test_multi_params (test_fused_novograd.TestFusedNovoGrad) ... ok
test_adagrad_option (test_fused_optimizer.TestFusedAdagrad) ... ok
test_float (test_fused_optimizer.TestFusedAdagrad) ... ok
test_half (test_fused_optimizer.TestFusedAdagrad) ... skipped 'PyTorch optimizer is not numerically correct for fp16'
test_multi_device (test_fused_optimizer.TestFusedAdagrad) ... ok
test_multi_params (test_fused_optimizer.TestFusedAdagrad) ... ok
test_multi_params_different_devices_throws (test_fused_optimizer.TestFusedAdagrad) ... ok
test_adam_option (test_fused_optimizer.TestFusedAdam) ... ok
test_bfloat16 (test_fused_optimizer.TestFusedAdam) ... ok
test_float (test_fused_optimizer.TestFusedAdam) ... ok
test_fp16_output (test_fused_optimizer.TestFusedAdam) ... skipped 'No longer support output fp16 param'
test_frozen_model (test_fused_optimizer.TestFusedAdam) ... ok
test_half (test_fused_optimizer.TestFusedAdam) ... ok
test_multi_device (test_fused_optimizer.TestFusedAdam) ... ok
test_multi_params (test_fused_optimizer.TestFusedAdam) ... skipped 'Disable until 8/1/2019 adam/adamw upstream picked'
test_scale (test_fused_optimizer.TestFusedAdam) ... skipped 'No longer support fuse scaling'
test_float (test_fused_optimizer.TestFusedSGD) ... ok
test_half (test_fused_optimizer.TestFusedSGD) ... ok
test_multi_device (test_fused_optimizer.TestFusedSGD) ... ok
test_float (test_lamb.TestFusedLAMB) ... ok
test_half (test_lamb.TestFusedLAMB) ... skipped 'PyTorch optimizer is not numerically correct for fp16'
test_lamb_option (test_lamb.TestFusedLAMB) ... ok
test_multi_device (test_lamb.TestFusedLAMB) ... ok
test_multi_params (test_lamb.TestFusedLAMB) ... ok
test_bfloat16 (test_lamb.TestFusedMixedPrecisionLamb) ... ok
test_float (test_lamb.TestFusedMixedPrecisionLamb) ... ok
test_half (test_lamb.TestFusedMixedPrecisionLamb) ... ok
test_lamb_option (test_lamb.TestFusedMixedPrecisionLamb) ... ok
test_multi_device (test_lamb.TestFusedMixedPrecisionLamb) ... ok
test_multi_params (test_lamb.TestFusedMixedPrecisionLamb) ... ok

----------------------------------------------------------------------
Ran 38 tests in 6.665s

OK (skipped=6)
test_autocast_fused_layer_norm_bfloat16_elementwise_affine_False_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... /scratch/slurm_tmpdir/38660/env/lib/python3.10/site-packages/apex/_autocast_utils.py:26: FutureWarning: `torch.cuda.amp.autocast_mode._cast(value, dtype)` is deprecated. Please use `torch.amp.autocast_mode._cast(value, 'cuda', dtype)` instead.
  return torch.cuda.amp.autocast_mode._cast(args, torch.get_autocast_gpu_dtype())
ok
test_autocast_fused_layer_norm_bfloat16_elementwise_affine_False_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_bfloat16_elementwise_affine_True_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_bfloat16_elementwise_affine_True_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_float16_elementwise_affine_False_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_float16_elementwise_affine_False_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_float16_elementwise_affine_True_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_float16_elementwise_affine_True_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_bfloat16_elementwise_affine_False_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_bfloat16_elementwise_affine_False_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_bfloat16_elementwise_affine_True_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_bfloat16_elementwise_affine_True_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_float16_elementwise_affine_False_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_float16_elementwise_affine_False_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_float16_elementwise_affine_True_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_float16_elementwise_affine_True_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_compile_fused_layer_norm_elementwise_affine_False_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_compile_fused_layer_norm_elementwise_affine_True_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_compile_fused_rms_norm_elementwise_affine_False_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_compile_fused_rms_norm_elementwise_affine_True_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_bfloat16_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_bfloat16_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_bfloat16_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_bfloat16_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_export_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... /scratch/slurm_tmpdir/38660/env/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:632: FutureWarning: 'torch.onnx.utils.export_to_pretty_string' is deprecated in version 2.5 and will be removed in the future. Please use onnx.printer.to_text() instead.
  return fn(*args, **kwargs)
ok
test_layer_norm_half_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_half_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_half_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_half_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_16_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_16_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_16_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_16_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_65536_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_65536_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_65536_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_65536_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_export_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_bfloat16_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_bfloat16_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_bfloat16_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_bfloat16_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_half_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_half_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_half_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_half_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_16_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_16_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_16_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_16_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_65536_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_65536_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_65536_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_65536_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok

----------------------------------------------------------------------
Ran 86 tests in 161.233s

OK
test_creation_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_autocast_fp16_use_activation_none_bias_False_cuda (test_mlp.TestMLPCUDA) ... /home/calebh/rsc/apex/tests/L0/run_mlp/test_mlp.py:79: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast_mode.autocast(enabled=enable_autocast):
/scratch/slurm_tmpdir/38660/env/lib/python3.10/site-packages/apex/_autocast_utils.py:26: FutureWarning: `torch.cuda.amp.autocast_mode._cast(value, dtype)` is deprecated. Please use `torch.amp.autocast_mode._cast(value, 'cuda', dtype)` instead.
  return torch.cuda.amp.autocast_mode._cast(args, torch.get_autocast_gpu_dtype())
ok
test_mlp_autocast_fp16_use_activation_none_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_autocast_fp16_use_activation_relu_bias_False_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_autocast_fp16_use_activation_relu_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_autocast_fp16_use_activation_sigmoid_bias_False_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_autocast_fp16_use_activation_sigmoid_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_none_bias_False_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_none_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_relu_bias_False_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_relu_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_sigmoid_bias_False_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_sigmoid_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_no_grad_cuda (test_mlp.TestMLPCUDA) ... ok
test_numeric_cuda (test_mlp.TestMLPCUDA) ... ok
test_performance_half_cuda (test_mlp.TestMLPCUDA) ... ok

----------------------------------------------------------------------
Ran 16 tests in 0.979s

OK
Fail to import hypothesis in common_utils, tests are not derandomized

Executing tests from /home/calebh/rsc/apex/tests/L0/run_optimizers

Executing tests from /home/calebh/rsc/apex/tests/L0/run_fused_layer_norm

Executing tests from /home/calebh/rsc/apex/tests/L0/run_mlp

Pytorch MLP time 0.9917 ms
C++ MLP time 0.5324 ms

@calebho marked this pull request as ready for review December 17, 2024 05:03

@calebho commented Dec 17, 2024

cc @crcrpar