Lots of deprecations with example and README adjustments
jaywonchung committed Apr 29, 2024
1 parent d0c415f commit 0b82a28
Showing 83 changed files with 426 additions and 756 deletions.
20 changes: 9 additions & 11 deletions README.md
@@ -85,7 +85,7 @@ Total energy (J):
 ```console
 $ python -m zeus.monitor energy
 [2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
-[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
+[2023-08-22 22:44:46,210] [zeus.utils.framework](framework.py:38) PyTorch with CUDA support is available.
 [2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
 ^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
 Total energy (J):
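
The `zeus.util` → `zeus.utils` rename above only changes the logger name in the CLI output; measurement itself works the same way. For reference, a minimal sketch of measuring a code block programmatically with `ZeusMonitor` — the GPU index and window name are illustrative, and the `time`/`total_energy` field names on the returned measurement are our assumption:

```python
# Minimal sketch: measure the GPU time and energy of a code block.
# GPU index, window name, and the measurement field names are assumed for illustration.
from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0])  # measure only GPU 0

monitor.begin_window("training")
for _ in range(100):
    pass  # placeholder for real training steps
measurement = monitor.end_window("training")

print(f"Time (s):   {measurement.time}")
print(f"Energy (J): {measurement.total_energy}")
```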
@@ -102,16 +102,14 @@ Zeus is part of [The ML.ENERGY Initiative](https://ml.energy).
 ```
 .
 ├── zeus/ # ⚡ Zeus Python package
-│   ├── optimizer/ # - GPU energy and time optimizers
-│   ├── run/ # - Tools for running Zeus on real training jobs
-│   ├── policy/ # - Optimization policies and extension interfaces
-│   ├── util/ # - Utility functions and classes
-│   ├── monitor.py # - `ZeusMonitor`: Measure GPU time and energy of any code block
-│   ├── controller.py # - Tools for controlling the flow of training
-│   ├── callback.py # - Base class for Hugging Face-like training callbacks.
-│   ├── simulate.py # - Tools for trace-driven simulation
-│   ├── analyze.py # - Analysis functions for power logs
-│   └── job.py # - Class for job specification
+│   ├── optimizer/ # - A collection of optimizers for time and energy
+│   ├── monitor/ # - Programmatic power and energy measurement tools
+│   ├── utils/ # - Utility functions and classes
+│   ├── _legacy/ # - Legacy code mostly to keep our papers reproducible
+│   ├── device.py # - Abstraction layer over compute devices
+│   └── callback.py # - Base class for HuggingFace-like training callbacks
 ├── docker/ # 🐳 Dockerfiles and Docker Compose files
 ├── examples/ # 🛠️ Examples of integrating Zeus
3 changes: 1 addition & 2 deletions capriccio/README.md
@@ -30,5 +30,4 @@ data_path = dict(train="9_train.json", validation="9_val.json")
 raw_datasets = datasets.load_dataset("json", data_files=data_path)
 ```
 
-For a full example, you can use [`examples/ZeusDataLoader/capriccio/train.py`](../examples/ZeusDataLoader/capriccio/train.py) to fine-tune a Huggingface pre-trained language model on a slice of Capriccio.
-Parts relevant to using Capriccio are marked with `# CAPRICCIO` in the script.
+For a full example, please refer to [`examples/batch_size_optimizer/capriccio/train.py`](../examples/batch_size_optimizer/capriccio/train.py).
12 changes: 0 additions & 12 deletions docs/extend.md
@@ -17,17 +17,5 @@ You can find examples of policy implementations in [`zeus._legacy.policy.optimiz

 ## Plugging it into Zeus
 
-There are two ways to run Zeus: trace-driven and end-to-end.
-
-### Trace-driven Zeus
-
-The Zeus simulator ([`Simulator`][zeus._legacy.simulate.Simulator]) accepts one [`BatchSizeOptimizer`][zeus._legacy.policy.BatchSizeOptimizer] and one [`PowerLimitOptimizer`][zeus._legacy.policy.PowerLimitOptimizer] in its constructor.
-A full example can be found in [`examples/trace_driven`](https://github.com/ml-energy/zeus/tree/master/examples/trace_driven/).
-
-### End-to-end Zeus
-
-There are two central components in end-to-end Zeus: [`ZeusMaster`][zeus.run.ZeusMaster] and [`ZeusDataLoader`][zeus.run.ZeusDataLoader].
-The former takes charge of driving the entire optimization over recurring jobs, and accepts an instance of [`BatchSizeOptimizer`][zeus._legacy.policy.BatchSizeOptimizer] in its constructor.
-The latter takes charge of JIT-profiling power in the background, determining the optimal power limit, and setting it.
-Hence, the functionality of [`JITPowerLimitOptimizer`][zeus._legacy.policy.optimizer.JITPowerLimitOptimizer] is already tightly integrated into `ZeusDataLoader`.
-Users will have to implement their own [`ZeusDataLoader`][zeus.run.ZeusDataLoader] in order to test another [`PowerLimitOptimizer`][zeus._legacy.policy.PowerLimitOptimizer] policy.
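
For readers extending Zeus after this cleanup, here is a hypothetical sketch of what a custom power limit policy could look like. The `observe`/`predict` method names and signatures are assumptions for illustration only; the actual abstract interfaces live in `zeus._legacy.policy`.

```python
# Hypothetical sketch of a custom power limit policy.
# Method names (`observe`, `predict`) are assumed for illustration;
# check the abstract base classes in zeus._legacy.policy for the real interface.
class LowestCostPowerLimitOptimizer:
    """Proposes the power limit with the lowest observed cost so far."""

    def __init__(self, power_limits: list[int]) -> None:
        self.power_limits = sorted(power_limits)
        self.best_pl: int | None = None
        self.best_cost = float("inf")

    def observe(self, power_limit: int, cost: float) -> None:
        # Record the profiling outcome of one power limit.
        if cost < self.best_cost:
            self.best_pl, self.best_cost = power_limit, cost

    def predict(self) -> int:
        # Before any observation, default to the maximum power limit.
        return self.best_pl if self.best_pl is not None else self.power_limits[-1]
```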
13 changes: 7 additions & 6 deletions docs/getting_started/index.md
@@ -127,11 +127,12 @@ We created [Perseus](../perseus/index.md), which can optimize the energy consump

 ## Recurring jobs
 
-The cost-optimal batch size is located *across* multiple job runs using a Multi-Armed Bandit algorithm.
-First, go through the steps for non-recurring jobs.
-[`ZeusDataLoader`][zeus.run.ZeusDataLoader] will transparently optimize the GPU power limit for any given batch size.
-Then, you can use [`ZeusMaster`][zeus.run.ZeusMaster] to drive recurring jobs and batch size optimization.
+In production, it's likely that a DNN is trained and re-trained repeatedly to keep it up to date.
+For these kinds of recurring jobs, we can take the recurrences as exploration opportunities to find the cost-optimal training batch size.
+This is done with a Multi-Armed Bandit algorithm.
+See [`BatchSizeOptimizer`][zeus.optimizer.batch_size.client.BatchSizeOptimizer].
 
-This example will come in handy:
+Two full examples are given for the batch size optimizer:
 
-- [Running trace-driven simulation on single recurring jobs and the Alibaba GPU cluster trace](https://github.com/ml-energy/zeus/tree/master/examples/trace_driven){.external}
+- [MNIST](https://github.com/ml-energy/zeus/tree/master/examples/batch_size_optimizer/mnist/): Single-GPU and data parallel training, with integration examples with Kubeflow
+- [Sentiment Analysis](https://github.com/ml-energy/zeus/tree/master/examples/batch_size_optimizer/capriccio/): Full training example with HuggingFace transformers using the Capriccio dataset, a sentiment analysis dataset with data drift.
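
As a rough sketch of the client side of one such recurrence, the flow is: ask the optimizer for a batch size, train, and report progress. Every name below (the import path, constructor arguments, `JobSpec` fields, and callback methods) is an assumption for illustration; consult the `BatchSizeOptimizer` class reference above for the actual API.

```python
# Rough sketch of one recurrence with the batch size optimizer client.
# Import path, constructor arguments, JobSpec fields, and callback method
# names are all assumptions for illustration; see the class reference.
from zeus.monitor import ZeusMonitor
from zeus.optimizer.batch_size import BatchSizeOptimizer, JobSpec  # assumed import path

def train_one_epoch(batch_size: int) -> float:
    return 0.0  # placeholder: run one epoch and return the validation metric

monitor = ZeusMonitor(gpu_indices=[0])
bso = BatchSizeOptimizer(
    monitor=monitor,
    server_url="http://localhost:8000",           # assumed server endpoint
    job=JobSpec(batch_sizes=[32, 64, 128, 256]),  # assumed job specification
)

batch_size = bso.get_batch_size()  # the MAB's choice for this recurrence
bso.on_train_begin()
for _ in range(10):
    metric = train_one_epoch(batch_size)
    bso.on_evaluate(metric)  # report progress; the server tracks convergence
```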
2 changes: 1 addition & 1 deletion docs/index.md
@@ -89,7 +89,7 @@ Total energy (J):
 ```console
 $ python -m zeus.monitor energy
 [2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
-[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
+[2023-08-22 22:44:46,210] [zeus.utils.framework](framework.py:38) PyTorch with CUDA support is available.
 [2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
 ^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
 Total energy (J):
4 changes: 2 additions & 2 deletions docs/overview/index.md
@@ -102,8 +102,8 @@ Fortunately, DNN training jobs often **recur** in production GPU clusters,[^9] a

 This results in two main components in Zeus:
 
-- **JIT energy profiler** ([`ZeusDataLoader`][zeus.run.dataloader.ZeusDataLoader]): Finds the optimal power limit via online profiling.
-- **MAB + Thompson Sampling** ([`ZeusMaster`][zeus.run.master.ZeusMaster]): Finds the optimal batch size across recurrences.
+- **JIT energy profiler**: Finds the optimal power limit via online profiling.
+- **MAB + Thompson Sampling**: Finds the optimal batch size across recurrences.
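
The second bullet is ordinary Thompson Sampling over candidate batch sizes. A self-contained sketch of the idea — the Gaussian reward model and candidate list are illustrative, not Zeus's exact implementation:

```python
# Self-contained sketch of Thompson Sampling over candidate batch sizes.
# The Gaussian posterior and candidate list are illustrative only.
import random

batch_sizes = [32, 64, 128, 256]
observed_costs: dict[int, list[float]] = {bs: [] for bs in batch_sizes}

def choose_batch_size() -> int:
    """Sample a cost from each arm's posterior and pick the cheapest."""
    sampled = {}
    for bs, costs in observed_costs.items():
        if not costs:
            return bs  # try every arm at least once
        mean = sum(costs) / len(costs)
        std = 1.0 / len(costs) ** 0.5  # posterior narrows with more data
        sampled[bs] = random.gauss(mean, std)
    return min(sampled, key=sampled.get)

def report_cost(bs: int, cost: float) -> None:
    observed_costs[bs].append(cost)
```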


<!-- Abbreviation definitions -->
41 changes: 5 additions & 36 deletions examples/batch_size_optimizer/README.md
@@ -1,38 +1,7 @@
-# Batch Size Optimizer in Zeus
+# Batch Size Optimizer
 
-The batch size optimizer is composed of two parts: a server and a client. The client runs in your training script, just like the power limit optimizer or the monitor. It sends training results to the BSO server, and the server tells the client the best batch size to use. Refer to `docs/batch_size_optimizer/server.md` for how to get started.
+The batch size optimizer finds the optimal training batch size for DNN jobs that recur over time.
+For more details, see the [docs](https://ml.energy/zeus/optimizers/batch_size_optimizer).
 
-## Data parallel training with Zeus
-
-In data parallel training, the batch size optimizer should give a consistent batch size to all GPUs. Since the server cannot tell the difference between concurrent job submissions and multi-GPU training, we ask users to send the request from a single GPU and broadcast the result (batch size and trial number) to the other GPUs. Reporting the result of the trial (training failed or succeeded) needs no broadcast or inter-GPU communication, since the server can identify the trial by its `trial_number`.
-Refer to `examples/batch_size_optimizer/mnist_dp.py` for the use case.
-
-## Kubeflow
-
-Kubeflow is a tool to easily deploy your ML workflows to Kubernetes. We provide some examples of using Kubeflow with Zeus. In order to run your training in Kubeflow with Zeus, follow `docs/batch_size_optimizer/server.md` to deploy the batch size optimizer to Kubernetes. After that, you can deploy your training script using Kubeflow.
-
-1. Set up Kubernetes and install the Kubeflow training operator.
-
-    See [minikube](https://minikube.sigs.k8s.io/docs/start/) for local development with Kubernetes, and the [Kubeflow training operator](https://github.com/kubeflow/training-operator) for how to install Kubeflow.
-
-2. Run the batch size optimizer server on Kubernetes.
-
-    Refer to the [Quick start](../../docs/batch_size_optimizer/index.md) docs to start the server.
-
-3. Build the MNIST example Docker image.
-
-    ```Shell
-    # From project root directory
-    docker build -f ./examples/batch_size_optimizer/mnist.Dockerfile -t mnist-example .
-    ```
-
-    If you are using a cloud such as AWS, modify the `image` and `imagePullPolicy` in `mnist_dp.yaml` to pull it from the corresponding registry.
-
-4. Deploy the training script.
-
-    ```Shell
-    cd examples/batch_size_optimizer
-    kubectl apply -f mnist_dp.yaml          # For the distributed training example
-    kubectl apply -f mnist_single_gpu.yaml  # For the single GPU training example
-    ```
+- The MNIST example shows single GPU or data parallel training + Kubeflow deployment.
+- The Capriccio example shows a slowly drifting sentiment analysis dataset integrated with the batch size optimizer.
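
The data parallel caveat in the removed section above — query the server from one rank, then broadcast the result — lives on in the new MNIST example. The pattern itself is roughly the following sketch, where only the `torch.distributed` calls are concrete and the BSO server query is a placeholder:

```python
# Sketch of the rank-0-queries-then-broadcast pattern for data parallel training.
# Only the torch.distributed calls are concrete; the BSO server query is a placeholder.
# Assumes the default process group has already been initialized.
import torch.distributed as dist

def get_global_batch_size() -> int:
    payload: list = [None]
    if dist.get_rank() == 0:
        # Placeholder: only rank 0 asks the batch size optimizer server.
        payload[0] = {"batch_size": 64, "trial_number": 1}
    dist.broadcast_object_list(payload, src=0)  # all ranks receive rank 0's result
    return payload[0]["batch_size"]
```

Reporting the trial outcome back to the server needs no such broadcast, since the server identifies the trial by its `trial_number`.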
106 changes: 10 additions & 96 deletions examples/batch_size_optimizer/capriccio/README.md
@@ -1,111 +1,25 @@
-# Integrating Zeus with Huggingface and Capriccio
+# Capriccio + BSO
 
-This example will demonstrate how to integrate Zeus with [Capriccio](../../capriccio), a drifting sentiment analysis dataset.
+This example will demonstrate how to integrate Zeus with [Capriccio](../../../capriccio), a drifting sentiment analysis dataset.
 
-You can search for `# ZEUS` in [`train.py`](train.py) for noteworthy places that require modification from conventional training scripts.
-Parts relevant to using Capriccio are also marked with `# CAPRICCIO`.
+## Dependencies
 
-**Usages**
-
-- Zeus
-  - [Running Zeus for a single job](#running-zeus-for-a-single-job)
-  - [Running Zeus over multiple recurrences](#running-zeus-over-multiple-recurrences)
-- Extra
-  - [Fine-tuning a Huggingface language model on one slice](#fine-tuning-a-huggingface-language-model-on-one-slice)
-
-## Running Zeus for a single job
-
-While our paper is about optimizing the batch size and power limit over multiple recurrences of the job, it is also possible to use just [`ZeusDataLoader`](https://ml.energy/zeus/reference/run/dataloader/#zeus.run.dataloader.ZeusDataLoader) to JIT-profile and optimize the power limit.
-
-### Dependencies
-
-1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
+1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../../capriccio/).
+1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` following our [Getting Started](https://ml.energy/zeus/getting_started/) guide.
 1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
    ```
 
-### Example command
-
-[`ZeusDataLoader`](https://ml.energy/zeus/reference/run/dataloader/#zeus.run.dataloader.ZeusDataLoader) interfaces with the outside world via environment variables.
-Check out the [class reference](https://ml.energy/zeus/reference/run/dataloader/#zeus.run.dataloader.ZeusDataLoader) for details.
-
-Only `ZEUS_TARGET_METRIC` is required; the other environment variables below show their default values when omitted.
-
-```bash
-export ZEUS_TARGET_METRIC="0.84"    # Stop training when target val metric is reached
-export ZEUS_LOG_DIR="zeus_log"      # Directory to store profiling logs
-export ZEUS_JOB_ID="zeus"           # Used to distinguish recurrences, so not important
-export ZEUS_COST_THRESH="inf"       # Kill training when cost (Equation 2) exceeds this
-export ZEUS_ETA_KNOB="0.5"          # Knob to tradeoff energy and time (Equation 2)
-export ZEUS_MONITOR_PATH="/workspace/zeus/zeus_monitor/zeus_monitor"  # Path to power monitor
-export ZEUS_PROFILE_PARAMS="10,40"  # warmup_iters,profile_iters for each power limit
-export ZEUS_USE_OPTIMAL_PL="True"   # Whether to actually use the optimal PL found
-python train.py \
-    --zeus \
-    --data_dir data \
-    --slice_number 9 \
-    --model_name_or_path bert-base-uncased \
-    --batch_size 128
-```
-
-
-## Running Zeus over multiple recurrences
-
-This example shows how to integrate [`ZeusDataLoader`](https://ml.energy/zeus/reference/run/dataloader/#zeus.run.dataloader.ZeusDataLoader) and drive batch size and power optimizations with [`ZeusMaster`](https://ml.energy/zeus/reference/run/master/#zeus.run.master.ZeusMaster).
-
-### Dependencies
-
-1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
-1. Install python dependencies for this example:
-    ```sh
-    pip install -r requirements.txt
-    ```
-
-### Example command
-
-```sh
-# All arguments shown below are default values.
-python run_zeus.py \
-    --seed 123 \
-    --b_0 128 \
-    --lr_0 4.00e-7 \
-    --b_min 8 \
-    --b_max 128 \
-    --num_recurrence 38 \
-    --eta_knob 0.5 \
-    --beta_knob 2.0 \
-    --target_metric 0.84 \
-    --max_epochs 10 \
-    --window_size 10
-```
-
-
-## Fine-tuning a Huggingface language model on one slice
-
-`train.py` can also be used to fine-tune a pretrained language model on one slice of Capriccio, without having to do anything with Zeus.
-
-### Dependencies
-
-1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
-1. Only for those not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install PyTorch separately:
-    ```sh
-    conda install -c pytorch pytorch==1.10.1
-    ```
-1. Install python dependencies for this example:
-    ```sh
-    pip install -r requirements.txt
-    ```
+## Running training
 
-### Example command
+As described in the [MNIST example](../mnist/), set up the Zeus batch size optimizer server, and set the `ZEUS_SERVER_URL` environment variable.
+On the first recurrence of the job, the batch size optimizer will register the job with the server and print out the job ID.
+From the second recurrence on, set the `ZEUS_JOB_ID` environment variable so that the recurrence is recognized as part of the same recurring job.
 
 ```sh
 python train.py \
     --data_dir data \
     --slice_number 9 \
-    --model_name_or_path bert-base-uncased \
-    --batch_size 128
+    --model_name_or_path bert-base-uncased
 ```