
Fixing instruction
Signed-off-by: Lazar Cvetković <[email protected]>
cvetkovic committed Aug 19, 2024
1 parent 6f4c470 commit 99390bf
Showing 5 changed files with 42 additions and 29 deletions.
9 changes: 7 additions & 2 deletions artifact_evaluation/README.md
@@ -5,7 +5,7 @@ The following experiments aim to repeat results from Figures 7, 9, and 10, i.e.,
Time burden: We expect you will need at most a day of active work to run all the experiments.

Prerequisites:
- Cloudlab cluster of 20 xl170 machines instantiated using `maestro_sosp24ae` Cloudlab profile (`https://www.cloudlab.us/p/faas-sched/maestro_sosp24ae`).
- Cloudlab cluster of at least 20 xl170 machines instantiated using `maestro_sosp24ae` Cloudlab profile (`https://www.cloudlab.us/p/faas-sched/maestro_sosp24ae`).
- Chrome Cloudlab extension - install from https://github.com/eth-easl/cloudlab_extension

Order in which to run the experiments:
@@ -47,4 +47,9 @@ Instructions to set up Knative/K8s baseline cluster:
- Clone Invitro locally and checkout to `ha_k8s` branch (`git clone --branch=ha_k8s https://github.com/vhive-serverless/invitro`)
- Open the Cloudlab experiment, open the Cloudlab extension, and copy the list of all addresses (RAW) using the extension. This puts the list of all nodes in your clipboard in the format expected by the scripts below.
- Set up a Knative/K8s cluster by locally running `./scripts/setup/create_multinode.sh`. Arguments should be the copied list of addresses from the previous step. For example, `./scripts/setup/create_multinode.sh user@node0 user@node1 user@node2`. This script should be executed only once.
- After a couple of minutes, once the script has finished executing, the cluster should be running, and you can SSH into `node0`. Execute `kubectl get pods -A` and verify that the installation has completed successfully by checking that all pods are in `Running` or `Completed` state (a condensed command sketch follows below).
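
For reference, a minimal sketch of the baseline cluster bring-up, assuming the load generator is cloned locally and that `user@node0 user@node1 user@node2` stand in for the RAW node list copied from the Cloudlab extension:

```bash
# Clone Invitro locally on the ha_k8s branch.
git clone --branch=ha_k8s https://github.com/vhive-serverless/invitro
cd invitro

# Paste the RAW node list copied with the Cloudlab extension as arguments.
# Run this script only once per cluster.
./scripts/setup/create_multinode.sh user@node0 user@node1 user@node2

# After the script completes, verify the installation from node0:
ssh user@node0 'kubectl get pods -A'
# All pods should be in Running or Completed state.
```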

Results expectation/interpretation:
- Since we cannot guarantee artifact evaluators access to a 100-node cluster over the 2-week artifact evaluation period, the results will show some performance degradation compared to what we report in the paper.
- For the cold start sweep, the throughput we show in Figure 7 will be reduced, as worker nodes become the bottleneck. What you should verify is that the cold start throughput conforms to the following inequalities -- `Knative/K8s throughput << Maestro - containerd throughput < Maestro - Firecracker throughput` and `Knative/K8s latency >> Maestro latency`.
- For the Azure 500 trace experiments, the per-function slowdown of containerd and Firecracker should be almost identical. The slowdown on Knative/K8s should be worse and should suffer from a long tail. The per-invocation scheduling latency of Dirigent should be better almost all the time, and the average per-function scheduling latency of Dirigent should be a couple of orders of magnitude lower than with Knative/K8s.
17 changes: 10 additions & 7 deletions artifact_evaluation/azure_500/dirigent/INSTRUCTIONS.md
@@ -1,13 +1,16 @@
## Azure 500 on Dirigent

Time required: 10 min to set up environment and 30 min per experiment

Description: This experiment runs the downsampled Azure trace with 500 functions. First run all the experiments with containerd, as given in the main `README.md`, and then deploy the cluster again, just that time with Firecracker. The procedure for running experiments is the same, just the trace with suffix `_firecracker` should be used.
Description: This experiment runs the downsampled Azure trace with 500 functions. The instructions for running the trace with containerd and Firecracker are the same, except that the cluster is deployed differently, as described in the `README.md` in the root folder of the artifact evaluation. For Firecracker, make sure to use the trace with the `_firecracker` suffix. We recommend following the order of experiments given in the `README.md`.

Instructions:
- Start Dirigent cluster as per instructions located in the root folder of artifact evaluation instructions
- On the `node0` execute `mkdir -p ~/invitro/data/traces/azure_500` and `mkdir -p ~/invitro/data/traces/azure_500_firecracker` Copy traces `scp azure_500/* user@node0:~/invitro/data/traces/azure_500/` and `scp azure_500_firecracker/* user@node0:~/invitro/data/traces/azure_500_firecracker/`
- Make sure `~/invitro` branch is `rps_mode`. With text editor open `cmd/config_dirigent_trace.json` and change TracePath to match `azure_500` or `azure_500_firecracker`
- Run locally `./scripts/start_resource_monitoring.sh user@node1 user@node2 user@node3`.
- Run the load generator in `screen` on `node0` with `cd ~/invitro; go run cmd/loader.go --config cmd/config_dirigent_trace.json`. Wait for 30 minutes. There should be ~170K invocations, with a negligible failure rate.
- Gather experiment results. Make sure you do not overwrite data from the other experiment.
- Start Dirigent cluster as per instructions located in the root folder of artifact evaluation instructions.
- On `node0`, execute `mkdir -p ~/invitro/data/traces/azure_500` or `mkdir -p ~/invitro/data/traces/azure_500_firecracker`, depending on which runtime you use.
- Copy traces to `node0` using `scp azure_500/* user@node0:~/invitro/data/traces/azure_500/` or `scp azure_500_firecracker/* user@node0:~/invitro/data/traces/azure_500_firecracker/`.
- Make sure the `~/invitro` branch on `node0` is `rps_mode`. With a text editor, open `cmd/config_dirigent_trace.json` and change `TracePath` to match `azure_500` or `azure_500_firecracker`.
- On your local machine run `./scripts/start_resource_monitoring.sh user@node0 user@node1 user@node2`.
- Run the load generator in `screen` on `node0` with `cd ~/invitro; go run cmd/loader.go --config cmd/config_dirigent_trace.json`. Wait until the experiment completes (~30 minutes). There should be ~170K invocations, with a negligible failure rate.
- Gather experiment results. Make sure you do not overwrite data from the other experiment and that you place the results in the correct folders (a condensed sketch of the whole sequence follows the list).
- Copy load generator output with `scp user@node0:~/invitro/data/out/experiment_duration_30.csv results_azure_500/`
- Copy resource utilization data with `mkdir -p ./artifact_evaluation/azure_500/dirigent/results_azure_500/cpu_mem_usage && ./scripts/collect_resource_monitoring.sh ./artifact_evaluation/azure_500/dirigent/results_azure_500/cpu_mem_usage user@node0 user@node1 user@node2`.
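
The steps above condensed into one command listing for the containerd run; hostnames are placeholders, and the Firecracker run is identical apart from the `azure_500_firecracker` paths:

```bash
# On node0: create the trace directory (use azure_500_firecracker for the Firecracker run).
ssh user@node0 'mkdir -p ~/invitro/data/traces/azure_500'

# From your local machine: copy the trace files.
scp azure_500/* user@node0:~/invitro/data/traces/azure_500/

# From your local machine: start resource monitoring.
./scripts/start_resource_monitoring.sh user@node0 user@node1 user@node2

# On node0, inside a screen session: run the load generator (~30 minutes, ~170K invocations).
cd ~/invitro && go run cmd/loader.go --config cmd/config_dirigent_trace.json

# Back on your local machine: collect the results.
scp user@node0:~/invitro/data/out/experiment_duration_30.csv results_azure_500/
mkdir -p ./artifact_evaluation/azure_500/dirigent/results_azure_500/cpu_mem_usage
./scripts/collect_resource_monitoring.sh ./artifact_evaluation/azure_500/dirigent/results_azure_500/cpu_mem_usage user@node0 user@node1 user@node2
```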
18 changes: 11 additions & 7 deletions artifact_evaluation/azure_500/knative/INSTRUCTIONS.md
@@ -1,13 +1,17 @@
## Azure 500 on Knative/K8s

Time required: 10 min to set up environment and 30-60 min for the experiment

Description: This experiment runs the downsampled Azure trace with 500 functions. Do not reuse Knative/K8s cluster if you configured the cluster for cold start sweep experiment.

Important: Do not reuse a Knative/K8s cluster if you previously ran cold start sweep experiments, as the autoscaling configuration was changed and could severely affect the results.

Instructions:
- SSH into `node0`and on that node clone the load generator repo. Then checkout to `rps_mode` branch. The command is `git clone --branch=rps_mode https://github.com/vhive-serverless/invitro`.
- On `node0` create a directory where trace will be stored `cd invitro; mkdir data/traces/azure_500`
- Copy the trace from this folder to `node0` using the following command `scp azure_500/*.csv user@node0:~/invitro/data/traces/azure_500`
- Run locally `./scripts/start_resource_monitoring.sh user@node1 user@node2 user@node3`.
- On `node0` run `screen` and inside the screen run `go run cmd/loader.go --config cmd/config_knative.json`. Function deployment will take 10-20 minutes, and then experiment for additional 30 minutes.
- Gather experiment results. Make sure you do not overwrite data from the other experiment.
- SSH into `node0` and clone the load generator repo on that node, then check out the `rps_mode` branch. The command is `git clone --branch=rps_mode https://github.com/vhive-serverless/invitro`.
- On `node0`, create a directory where the trace will be stored: `cd invitro; mkdir data/traces/azure_500`.
- Copy the trace from the folder where this instruction file is located to the folder you previously created on `node0` using the following command: `scp azure_500/*.csv user@node0:~/invitro/data/traces/azure_500`.
- On your local machine run `./scripts/start_resource_monitoring.sh user@node0 user@node1 user@node2`.
- On `node0` run `screen`, and inside the `screen` session run `go run cmd/loader.go --config cmd/config_knative.json`. Function deployment will take 10-20 minutes, and the experiment will then run for an additional 30 minutes.
- Gather experiment results. Make sure you do not overwrite data from the other experiment and that you place the results in the correct folders (a condensed sketch of the whole sequence follows the list).
- Copy load generator output with `scp user@node0:~/invitro/data/out/experiment_duration_30.csv results_azure_500/`
- Copy resource utilization data with `mkdir -p ./artifact_evaluation/azure_500/knative/results_azure_500/cpu_mem_usage && ./scripts/collect_resource_monitoring.sh ./artifact_evaluation/azure_500/knative/results_azure_500/cpu_mem_usage user@node0 user@node1 user@node2`.
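
A condensed sketch of the sequence above; `user@node0` and the other hostnames are placeholders for the actual cluster nodes:

```bash
# On node0: clone the load generator on the rps_mode branch and create the trace directory.
git clone --branch=rps_mode https://github.com/vhive-serverless/invitro
cd invitro && mkdir -p data/traces/azure_500

# From your local machine: copy the trace and start resource monitoring.
scp azure_500/*.csv user@node0:~/invitro/data/traces/azure_500
./scripts/start_resource_monitoring.sh user@node0 user@node1 user@node2

# On node0, inside a screen session: run the experiment
# (10-20 min for function deployment + ~30 min for the experiment itself).
go run cmd/loader.go --config cmd/config_knative.json

# Back on your local machine: collect the results.
scp user@node0:~/invitro/data/out/experiment_duration_30.csv results_azure_500/
mkdir -p ./artifact_evaluation/azure_500/knative/results_azure_500/cpu_mem_usage
./scripts/collect_resource_monitoring.sh ./artifact_evaluation/azure_500/knative/results_azure_500/cpu_mem_usage user@node0 user@node1 user@node2
```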
17 changes: 8 additions & 9 deletions artifact_evaluation/cold_start_sweep/dirigent/INSTRUCTIONS.md
@@ -1,14 +1,13 @@
## Cold start sweep on Dirigent

Time required: 10 min to set up environment and 2-3 min per data point

Description: This experiment triggers cold start in Maestro cluster. You should sweep the load until the cluster saturates, which will be visible on the latency plot. We suggest running experiments with 1, 10, 100, 500, 1000, 1250, 1500, ... RPS and observing the latency after conducting experiment for each data point. Low RPS (<10 RPS) rates should be run for 3-5 minutes, because of warmup, while all other loads can be run for just 1 minute. Always discard the results of the first experiment when starting a new cluster, as these measurements include image pull latency, which we should not include in the results.
Description: This experiment triggers cold starts in the Maestro cluster. You should sweep the load until the cluster saturates, which will be visible on the latency plot. We suggest running experiments with 1, 10, 100, 250, 500, 750, 1000, ... RPS and observing the latency after conducting the experiment for each data point. Low RPS rates (<10 RPS) should be run for 3-5 minutes because of warmup, while any higher load can be run for just a minute. Always discard the results of the first experiment when starting a new cluster, as these measurements include image pull latency, which pollutes the measurements (visible as a high p99 at low RPS). The instructions for running experiments are the same for containerd and Firecracker, except for the deployment method explained in `README.md` and the `RpsImage` load generator field.

Instructions:
- Start Dirigent cluster according to instructions located in the root folder of artifact evaluation instructions. You can reuse the existing cluster running Dirigent containerd.
- On remote machine `node0` open `~/invitro/cmd/config_dirigent_rps.json`. Set `RpsColdStartRatioPercentage` to `100`, and sweep the load with `RpsTarget` while configuring `ExperimentDuration` according to instructions above. For higher RPS, it might be necessary to increase `RpsCooldownSeconds`, which controls the number of functions that are deployed in the cluster to achieve the requested RPS. Set `GRPCFunctionTimeoutSeconds` to `15`. For containerd experiments make sure `RpsImage` is set to `docker.io/cvetkovic/dirigent_empty_function:latest`, whereas for Firecracker experiments this field should be set to `empty`.
- Start RPS experiment by running `cd ~/invitro; go run cmd/loader.go --config cmd/config_dirigent_rps.json`
- Create folder storing results with `mkdir -p ./artifact_evaluation/cold_start_sweep/dirigent/results_containerd`
- Start Dirigent cluster according to instructions located in the root folder of artifact evaluation instructions (`README.md`). You can reuse the existing cluster running Dirigent containerd.
- On the remote machine `node0`, open `~/invitro/cmd/config_dirigent_rps.json`. Set `RpsColdStartRatioPercentage` to `100`, and sweep the load with `RpsTarget` while configuring `ExperimentDuration` according to the instructions above. For higher RPS (>1000), it might be necessary to increase `RpsCooldownSeconds`, which controls the number of functions deployed in the cluster to achieve the requested RPS. Set `GRPCFunctionTimeoutSeconds` to `15`. For containerd experiments, make sure `RpsImage` is set to `docker.io/cvetkovic/dirigent_empty_function:latest`, whereas for Firecracker experiments this field should be set to `empty`. A sketch of these fields follows the list.
- Start RPS experiment by running `cd ~/invitro; go run cmd/loader.go --config cmd/config_dirigent_rps.json`.
- Create folder storing results with `mkdir -p ./artifact_evaluation/cold_start_sweep/dirigent/results_containerd` or `mkdir -p ./artifact_evaluation/cold_start_sweep/dirigent/results_firecracker`.
- Gather the results located in `data/out/experiment_duration_X.csv` and copy them to your local machine as `rps_X.csv`, into the folder you created in the previous step.
- Repeat for different RPS values until the cluster saturates, which you can see by plotting the data with the provided script

Results expectation/interpretation:
- Since we cannot provide access to a 100-node cluster over a 2-week artifact evaluation period, the throughput we show in Figure 7 is lower on smaller cluster, as worker nodes become the bottleneck. However, it is important to note that cold start throughput of Knative/K8s << Maestro - containerd < Maestro - Firecracker.
- Repeat for different RPS values until the cluster saturates, which you can see by plotting the data with the provided script.
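
For illustration, a sketch of how one data point in `~/invitro/cmd/config_dirigent_rps.json` could be configured for a containerd run at 500 RPS. Only the fields named above are shown; the concrete values other than `RpsColdStartRatioPercentage`, `GRPCFunctionTimeoutSeconds`, and `RpsImage` are examples, and all other fields stay as they are in the repository:

```json
{
  "RpsTarget": 500,
  "RpsColdStartRatioPercentage": 100,
  "ExperimentDuration": 1,
  "RpsCooldownSeconds": 10,
  "GRPCFunctionTimeoutSeconds": 15,
  "RpsImage": "docker.io/cvetkovic/dirigent_empty_function:latest"
}
```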
10 changes: 6 additions & 4 deletions artifact_evaluation/cold_start_sweep/knative/INSTRUCTIONS.md
@@ -1,14 +1,16 @@
## Cold start sweep on Knative/K8s

Time required: 10 min to set up environment and 2-3 min per data point

Description: This experiment triggers cold start in Maestro cluster. You should sweep the load until the cluster saturates, which will be visible on the latency plot and should happen around 3 RPS. We suggest running experiments with 1, 2, 3 RPS and observing the latency after conducting experiment for each data point. Always discard the results of the first experiment when starting a new cluster, as these measurements include image pull latency, which we should not include in the results.
Description: This experiment triggers cold starts in the Knative/K8s cluster. You should sweep the load until the cluster saturates, which will be visible on the latency plot and should happen at around 3 RPS. We suggest running experiments with 1, 2, and 3 RPS and observing the latency after conducting the experiment for each data point. Always discard the results of the first experiment when starting a new cluster, as these measurements include image pull latency, which pollutes the results.

Instructions:
- Start Knative/K8s cluster according to instructions located in the root folder of artifact evaluation instructions. You can reuse the existing cluster running Maestro containerd.
- Start the Knative/K8s cluster according to the instructions located in the root folder of the artifact evaluation instructions (`README.md`). You can reuse the existing cluster running Knative/K8s, but after executing the instructions below, do not use that cluster for running the Azure 500 trace again.
- On `node0` execute the following commands:
- Open `~/invitro/workloads/container/trace_func_go.yaml`, set `autoscaling.knative.dev/max-scale` to `1`, and then set the image to `docker.io/cvetkovic/dirigent_empty_function:latest`.
- Run `kubectl patch configmap config-autoscaler -n knative-serving -p '{"data":{"scale-to-zero-grace-period":"1s","scale-to-zero-pod-retention-period":"1s","stable-window":"6s"}}'`
- In `cmd/config_knative_rps.json` set `ExperimentDuration` to 2.
- The command for running experiment for each data point is `go run cmd/loader.go --config cmd/config_knative_rps.json`. Use the following data point settings in `cmd/config_knative_rps.json` for experiments.
- In `~/invitro/cmd/config_knative_rps.json` set `ExperimentDuration` to `2` and `RpsColdStartRatioPercentage` to `100`.
- The command for running the experiment for each data point is `cd invitro; go run cmd/loader.go --config cmd/config_knative_rps.json`. Use the following data point settings in `cmd/config_knative_rps.json` (a sketch of the config follows the list):
- `RpsTarget=1` with `RpsCooldownSeconds=10`
- `RpsTarget=2` with `RpsCooldownSeconds=15`
- `RpsTarget=3` with `RpsCooldownSeconds=20`
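
Putting these settings together, a sketch of the relevant fields in `cmd/config_knative_rps.json` for the 1 RPS data point; the field names mirror those used above, and the remaining fields in the file stay as they are:

```json
{
  "RpsTarget": 1,
  "RpsCooldownSeconds": 10,
  "RpsColdStartRatioPercentage": 100,
  "ExperimentDuration": 2
}
```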
