GPT-in-a-Box VM: Split Inference and Management requests (#52)

* split inference and management requests
* minor changes

Commit dcf66a7, 1 parent 3de1ccf. 5 changed files with 112 additions and 79 deletions.

# Management Requests
The Inference Server can be managed through the TorchServe Management API. Find out more in the official [TorchServe Management API](https://pytorch.org/serve/management_api.html) documentation.

**Server Configuration**

| Variable | Value |
| --- | --- |
| inference_server_endpoint | localhost |
| management_port | 8081 |

The following are example cURL commands for sending management requests to the Inference Server.

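Before running the commands in the sections below, you can confirm the Management API is reachable by listing every registered model; `GET /models` is part of the TorchServe Management API, and the host and port values assume the configuration table above.
```
# List all models currently registered on the Inference Server
curl http://localhost:8081/models
```
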
## Describe Registered Model
Once a model is loaded on the Inference Server, we can use the following request to describe the model and its configuration.

The following is the template command:
```
curl http://{inference_server_endpoint}:{management_port}/models/{model_name}
```

### Examples
For the MPT-7B model:
```
curl http://localhost:8081/models/mpt_7b
```
For the Falcon-7B model:
```
curl http://localhost:8081/models/falcon_7b
```
For the Llama2-7B model:
```
curl http://localhost:8081/models/llama2_7b
```

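As a rough guide to what the describe request returns, the sketch below shows an abridged, illustrative response for the MPT-7B example; the exact fields and values depend on your TorchServe version and deployment.
```
curl http://localhost:8081/models/mpt_7b
# Abridged, illustrative output; actual fields and values vary by deployment:
# [{"modelName": "mpt_7b", "modelVersion": "1.0", "minWorkers": 1, "maxWorkers": 1, ...}]
```
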
## Register Additional Models
TorchServe allows multiple models to be registered (loaded) simultaneously. To register multiple models, make sure that the Model Archive files for the models concerned are stored in the same directory.

The following is the template command:
```
curl -X POST "http://{inference_server_endpoint}:{management_port}/models?url={model_archive_file_name}.mar&initial_workers=1&synchronous=true"
```

### Examples
For the MPT-7B model:
```
curl -X POST "http://localhost:8081/models?url=mpt_7b.mar&initial_workers=1&synchronous=true"
```
For the Falcon-7B model:
```
curl -X POST "http://localhost:8081/models?url=falcon_7b.mar&initial_workers=1&synchronous=true"
```
For the Llama2-7B model:
```
curl -X POST "http://localhost:8081/models?url=llama2_7b.mar&initial_workers=1&synchronous=true"
```
!!! note
    Make sure the Model Archive file name given in the cURL request is correct and the file is present in the model store directory.

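One way to confirm that a registration succeeded is to follow the register call with a describe request; a minimal sketch, assuming the `mpt_7b` archive from the examples above:
```
# Register the model and wait for its worker to start (synchronous=true)
curl -X POST "http://localhost:8081/models?url=mpt_7b.mar&initial_workers=1&synchronous=true"
# Describe the model to verify that it is registered
curl http://localhost:8081/models/mpt_7b
```
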
## Edit Registered Model Configuration
The model can be configured after registration using the Management API of TorchServe.

The following is the template command:
```
curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/models/{model_name}?min_worker={number}&max_worker={number}&batch_size={number}&max_batch_delay={delay_in_ms}"
```

### Examples
For the MPT-7B model:
```
curl -v -X PUT "http://localhost:8081/models/mpt_7b?min_worker=2&max_worker=2"
```
For the Falcon-7B model:
```
curl -v -X PUT "http://localhost:8081/models/falcon_7b?min_worker=2&max_worker=2"
```
For the Llama2-7B model:
```
curl -v -X PUT "http://localhost:8081/models/llama2_7b?min_worker=2&max_worker=2"
```
!!! note
    Make sure there is enough GPU and system memory before increasing the number of workers; otherwise, the additional workers will fail to load.

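The TorchServe scale-workers API also accepts a `synchronous` parameter; a minimal sketch that blocks until the scaling call completes, using the MPT-7B example:
```
# Scale mpt_7b to two workers and wait for the operation to finish
curl -v -X PUT "http://localhost:8081/models/mpt_7b?min_worker=2&max_worker=2&synchronous=true"
```
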
## Unregister a Model
The following is the template command to unregister a model from the Inference Server:
```
curl -X DELETE "http://{inference_server_endpoint}:{management_port}/models/{model_name}/{repo_version}"
```
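### Example
For the MPT-7B model, assuming it was registered as version `1.0` (check the `modelVersion` reported by the describe request above):
```
curl -X DELETE "http://localhost:8081/models/mpt_7b/1.0"
```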