Commit 0b82a28 (parent: d0c415f)

Lots of deprecations with example and README adjustments

Showing 83 changed files with 426 additions and 756 deletions.
@@ -1,38 +1,7 @@
# Batch Size Optimizer in Zeus
# Batch Size Optimizer

The batch size optimizer is composed of two parts: a server and a client. The client runs in your training script, just like the power limit optimizer or the monitor. It sends training results to the BSO server, and the server replies with the best batch size to use. Refer to `docs/batch_size_optimizer/server.md` to get started.
The batch size optimizer finds the optimal training batch size for DNN jobs that recur over time.
For more details, see the [docs](https://ml.energy/zeus/optimizers/batch_size_optimizer).
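
To make the client/server round trip above concrete, here is a minimal, purely illustrative sketch in Python. Every name in it (`request_batch_size_from_server`, `report_result_to_server`, `run_training`, the `example-job` ID) is a hypothetical placeholder, not the actual Zeus client API; see the linked docs for the real interface.

```python
# Purely illustrative sketch of the client/server round trip described above.
# All names here are hypothetical placeholders, not the actual Zeus API.

def request_batch_size_from_server(job_id: str) -> int:
    """Placeholder: the real client asks the BSO server for the batch size to try."""
    return 128

def report_result_to_server(job_id: str, converged: bool) -> None:
    """Placeholder: the real client reports the training outcome back to the server."""
    print(f"[{job_id}] converged={converged}")

def run_training(batch_size: int) -> float:
    """Placeholder training loop; returns the final validation metric."""
    return 0.85

job_id = "example-job"                                # assigned by the server in reality
batch_size = request_batch_size_from_server(job_id)   # server picks the batch size to try
accuracy = run_training(batch_size)                   # train as usual with that batch size
report_result_to_server(job_id, converged=accuracy >= 0.84)
```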

## Data parallel training with Zeus

In data parallel training, the batch size optimizer must give a consistent batch size to all GPUs. Since the server cannot tell the difference between concurrent job submissions and multi-GPU training, we ask users to send the request from a single GPU and broadcast the result (batch size and trial number) to the other GPUs. Reporting the training result (failed or succeeded) back to the batch size optimizer server can be handled by the server using the `trial_number`, so reporting does not require any broadcast or communication with other GPUs.
Refer to `examples/batch_size_optimizer/mnist_dp.py` for a usage example.
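
The sketch below illustrates the single-request-plus-broadcast pattern with `torch.distributed`. The broadcast itself is standard PyTorch distributed usage; `request_batch_size_from_server` is a hypothetical placeholder for the real BSO client call, and the real pattern lives in `mnist_dp.py`.

```python
# Sketch of the "request from a single GPU, broadcast to the rest" pattern.
# `request_batch_size_from_server` is a hypothetical placeholder for the real
# BSO client call; the broadcast itself is standard torch.distributed usage.
import torch.distributed as dist

def request_batch_size_from_server() -> tuple[int, int]:
    """Placeholder: contact the BSO server and return (batch_size, trial_number)."""
    return 128, 0

def get_batch_size_on_all_ranks() -> tuple[int, int]:
    payload = [None]
    if dist.get_rank() == 0:
        # Only rank 0 talks to the BSO server, so concurrent multi-GPU ranks
        # are not mistaken for separate job submissions.
        payload[0] = request_batch_size_from_server()
    # Broadcast (batch_size, trial_number) from rank 0 to every other rank.
    dist.broadcast_object_list(payload, src=0)
    return payload[0]

# After dist.init_process_group(...) in the training script:
# batch_size, trial_number = get_batch_size_on_all_ranks()
```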

## Kubeflow

Kubeflow is a tool for easily deploying your ML workflows to Kubernetes. We provide some examples of using Kubeflow with Zeus. To run your training in Kubeflow with Zeus, follow `docs/batch_size_optimizer/server.md` to deploy the batch size optimizer to Kubernetes. After that, you can deploy your training script using Kubeflow.

1. Set up Kubernetes and install the Kubeflow training operator.

    Refer to [minikube](https://minikube.sigs.k8s.io/docs/start/) for local Kubernetes development.
    Refer to the [Kubeflow training operator](https://github.com/kubeflow/training-operator) for how to install Kubeflow.

2. Run the batch size optimizer server on Kubernetes.

    Refer to the [Quick start](../../docs/batch_size_optimizer/index.md) docs to start the server.

3. Build the MNIST example Docker image.

    ```Shell
    # From the project root directory
    docker build -f ./examples/batch_size_optimizer/mnist.Dockerfile -t mnist-example .
    ```

    If you are using a cloud provider such as AWS, modify `image` and `imagePullPolicy` in `mnist_dp.yaml` to pull the image from the corresponding registry.

4. Deploy the training script.

    ```Shell
    cd examples/batch_size_optimizer
    kubectl apply -f mnist_dp.yaml          # Distributed training example
    kubectl apply -f mnist_single_gpu.yaml  # Single-GPU training example
    ```

- The MNIST example shows single GPU or data parallel training + Kubeflow deployment.
- The Capriccio example shows a slowly drifting sentiment analysis dataset integrated with the batch size optimizer.
@@ -1,111 +1,25 @@
# Integrating Zeus with Huggingface and Capriccio
# Capriccio + BSO

This example demonstrates how to integrate Zeus with [Capriccio](../../capriccio), a drifting sentiment analysis dataset.
This example demonstrates how to integrate Zeus with [Capriccio](../../../capriccio), a drifting sentiment analysis dataset.

You can search for `# ZEUS` in [`train.py`](train.py) for the noteworthy places that differ from a conventional training script.
Parts relevant to using Capriccio are also marked with `# CAPRICCIO`.

## Dependencies

**Usage**

- Zeus
    - [Running Zeus for a single job](#running-zeus-for-a-single-job)
    - [Running Zeus over multiple recurrences](#running-zeus-over-multiple-recurrences)
- Extra
    - [Fine-tuning a Huggingface language model on one slice](#fine-tuning-a-huggingface-language-model-on-one-slice)

## Running Zeus for a single job

While our paper is about optimizing the batch size and power limit over multiple recurrences of the job, it is also possible to use just [`ZeusDataLoader`](https://ml.energy/zeus/reference/run/dataloader/#zeus.run.dataloader.ZeusDataLoader) to JIT-profile and optimize the power limit.

### Dependencies

1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../../capriccio/).
1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` following our [Getting Started](https://ml.energy/zeus/getting_started/) guide.
1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
    ```

### Example command

[`ZeusDataLoader`](https://ml.energy/zeus/reference/run/dataloader/#zeus.run.dataloader.ZeusDataLoader) interfaces with the outside world via environment variables.
Check out the [class reference](https://ml.energy/zeus/reference/run/dataloader/#zeus.run.dataloader.ZeusDataLoader) for details.

Only `ZEUS_TARGET_METRIC` is required; the other environment variables below show their default values when omitted.

```bash
export ZEUS_TARGET_METRIC="0.84"   # Stop training when the target validation metric is reached
export ZEUS_LOG_DIR="zeus_log"     # Directory to store profiling logs
export ZEUS_JOB_ID="zeus"          # Used to distinguish recurrences, so not important here
export ZEUS_COST_THRESH="inf"      # Kill training when the cost (Equation 2) exceeds this
export ZEUS_ETA_KNOB="0.5"         # Knob to trade off energy and time (Equation 2)
export ZEUS_MONITOR_PATH="/workspace/zeus/zeus_monitor/zeus_monitor"  # Path to the power monitor
export ZEUS_PROFILE_PARAMS="10,40" # warmup_iters,profile_iters for each power limit
export ZEUS_USE_OPTIMAL_PL="True"  # Whether to actually use the optimal power limit found
python train.py \
    --zeus \
    --data_dir data \
    --slice_number 9 \
    --model_name_or_path bert-base-uncased \
    --batch_size 128
```
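
For context on `ZEUS_COST_THRESH` and `ZEUS_ETA_KNOB`: Equation 2 is the Zeus cost metric, which, as I understand it from the paper, mixes energy and time as `eta * energy + (1 - eta) * max_power * time`. The sketch below only illustrates that trade-off and is not code from this repository; verify the exact formula against the paper.

```python
def zeus_cost(energy_j: float, time_s: float, eta_knob: float, max_power_w: float) -> float:
    """Sketch of the Equation 2 cost: eta * energy + (1 - eta) * max_power * time.

    eta_knob = 1.0 optimizes purely for energy, eta_knob = 0.0 purely for time
    (scaled by the GPU's maximum power so both terms are in Joule-like units).
    """
    return eta_knob * energy_j + (1 - eta_knob) * max_power_w * time_s

# Example: eta_knob = 0.5, a 300 W GPU, 1.8 MJ of energy over a 2-hour run.
print(zeus_cost(energy_j=1.8e6, time_s=7200.0, eta_knob=0.5, max_power_w=300.0))
# -> 0.5 * 1.8e6 + 0.5 * 300 * 7200 = 1,980,000
```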

## Running Zeus over multiple recurrences

This example shows how to integrate [`ZeusDataLoader`](https://ml.energy/zeus/reference/run/dataloader/#zeus.run.dataloader.ZeusDataLoader) and drive batch size and power optimizations with [`ZeusMaster`](https://ml.energy/zeus/reference/run/master/#zeus.run.master.ZeusMaster).

### Dependencies

1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
    ```

### Example command

```sh
# All arguments shown below are default values.
python run_zeus.py \
    --seed 123 \
    --b_0 128 \
    --lr_0 4.00e-7 \
    --b_min 8 \
    --b_max 128 \
    --num_recurrence 38 \
    --eta_knob 0.5 \
    --beta_knob 2.0 \
    --target_metric 0.84 \
    --max_epochs 10 \
    --window_size 10
```

## Fine-tuning a Huggingface language model on one slice

`train.py` can also be used to fine-tune a pretrained language model on one slice of Capriccio, without having to do anything with Zeus.

### Dependencies

1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../capriccio/).
1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install PyTorch separately:
    ```sh
    conda install -c pytorch pytorch==1.10.1
    ```
1. Install python dependencies for this example:
    ```sh
    pip install -r requirements.txt
    ```

## Running training

### Example command
As described in the [MNIST example](../mnist/), set up the Zeus batch size optimizer server and set the `ZEUS_SERVER_URL` environment variable.
On the first recurrence of the job, the batch size optimizer will register the job with the server and print out the job ID.
From the second recurrence onward, set the `ZEUS_JOB_ID` environment variable so that the run is recognized as part of the same recurring job.

```sh
python train.py \
    --data_dir data \
    --slice_number 9 \
    --model_name_or_path bert-base-uncased
# Note: `--batch_size 128` is no longer passed here; the batch size optimizer
# chooses the batch size for each recurrence.
```
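
The snippet below is a minimal sketch of how a training script might pick up these two environment variables. Only the variable names `ZEUS_SERVER_URL` and `ZEUS_JOB_ID` come from the instructions above; the surrounding logic is an assumption for illustration.

```python
# Minimal sketch of reading the BSO-related environment variables described
# above. Only the variable names come from the docs; the handling logic here
# is an illustrative assumption.
import os

server_url = os.environ["ZEUS_SERVER_URL"]   # must be set on every recurrence
job_id = os.environ.get("ZEUS_JOB_ID")       # unset on the very first recurrence

if job_id is None:
    # First recurrence: the batch size optimizer registers the job and prints
    # the job ID, which should then be exported as ZEUS_JOB_ID.
    print(f"Registering a new job with the BSO server at {server_url}")
else:
    print(f"Resuming recurring job {job_id} against {server_url}")
```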