GPT-in-a-Box VM: Split Inference and Management requests (#52)
* split inference and management requests

* minor changes
saileshd1402 authored Nov 21, 2023
1 parent 3de1ccf commit dcf66a7
Showing 5 changed files with 112 additions and 79 deletions.
4 changes: 3 additions & 1 deletion docs/gpt-in-a-box/kubernetes/inference_requests.md
@@ -40,7 +40,9 @@ Curl request for Llama2-7B model
```
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/llama2_7b/infer -d @$WORK_DIR/data/translate/sample_test1.json
```
Input_file should be a json file in the following example format:

### Input data format
Input data should be in **JSON** format: a '.json' file containing the prompt in the format shown below:
```
{
"id": "42",
2 changes: 1 addition & 1 deletion docs/gpt-in-a-box/vm/custom_model.md
@@ -12,7 +12,7 @@ Where the arguments are:

- **model_name**: Name of custom model
- **repo_version**: Any model version, defaults to "1.0" (optional)
- **model_path**: Absolute path of custom model files (should be empty non empty)
- **model_path**: Absolute path of custom model files (should be a non-empty folder)
- **mar_output**: Absolute path of export of MAR file (.mar)
- **no_download**: Flag to skip downloading the model files, must be set for custom models
- **handler**: Path to custom handler, defaults to llm/handler.py (optional)
95 changes: 19 additions & 76 deletions docs/gpt-in-a-box/vm/inference_requests.md
@@ -1,15 +1,14 @@
# Inference and Management Requests
TorchServe can be inferenced and managed through it's Inference and Management APIs respectively. Find out more about TorchServe APIs in the official [Inference API](https://pytorch.org/serve/inference_api.html) and [Management API](https://pytorch.org/serve/management_api.html) documentation
# Inference Requests
The Inference Server accepts inference requests through the TorchServe Inference API. Find out more in the official [TorchServe Inference API](https://pytorch.org/serve/inference_api.html) documentation.

**Server Configuration**

| Variable | Value |
| --- | --- |
| inference_server_endpoint | localhost |
| inference_port | 8080 |
| management_port | 8081 |

The following are example cURL commands to Inference and Manage the Inference Server.
The following are example cURL commands to send inference requests to the Inference Server.

## Inference Requests
The following is the template command for inferencing with a text file:
@@ -50,78 +49,22 @@ curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/ll
curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/llama2_7b -d @$WORK_DIR/data/translate/sample_text3.json
```
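
Before sending inference requests, the health of the Inference Server can optionally be checked with TorchServe's ping endpoint on the inference port; a healthy server responds with a short status message:
```
curl http://localhost:8080/ping
```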

## Describe Registered Model
Once a model is loaded on the Inference Server, we can use the following request to describe the model and it's configuration.
### Input data format
Input data can be in either **text** or **JSON** format.

The following is the template command for the same:
```
curl http://{inference_server_endpoint}:{management_port}/models/{model_name}
```

### Examples
For MPT-7B model
```
curl http://localhost:8081/models/mpt_7b
```
For Falcon-7B model
```
curl http://localhost:8081/models/falcon_7b
```
For Llama2-7B model
```
curl http://localhost:8081/models/llama2_7b
```

## Register Additional Models
TorchServe allows the registering (loading) of multiple models simultaneously. To register multiple models, make sure that the Model Archive Files for the concerned models are stored in the same directory.

The following is the template command for the same:
```
curl -X POST "http://{inference_server_endpoint}:{management_port}/models?url={model_archive_file_name}.mar&initial_workers=1&synchronous=true"
```

### Examples
For MPT-7B model
```
curl -X POST "http://localhost:8081/models?url=mpt_7b.mar&initial_workers=1&synchronous=true"
```
For Falcon-7B model
```
curl -X POST "http://localhost:8081/models?url=falcon_7b.mar&initial_workers=1&synchronous=true"
```
For Llama2-7B model
```
curl -X POST "http://localhost:8081/models?url=llama2_7b.mar&initial_workers=1&synchronous=true"
```
!!! note
Make sure the Model Archive file name given in the cURL request is correct and is present in the model store directory.

## Edit Registered Model Configuration
The model can be configured after registration using the Management API of TorchServe.

The following is the template command for the same:
```
curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/models/{model_name}?min_workers={number}&max_workers={number}&batch_size={number}&max_batch_delay={delay_in_ms}"
```

### Examples
For MPT-7B model
```
curl -v -X PUT "http://localhost:8081/models/mpt_7b?min_worker=2&max_worker=2"
```
For Falcon-7B model
```
curl -v -X PUT "http://localhost:8081/models/falcon_7b?min_worker=2&max_worker=2"
```
For Llama2-7B model
```
curl -v -X PUT "http://localhost:8081/models/llama2_7b?min_worker=2&max_worker=2"
```
!!! note
Make sure to have enough GPU and System Memory before increasing number of workers, else the additional workers will fail to load.
1. For text format, the input should be a '.txt' file containing the prompt (a sample prompt file is shown after this list).

## Unregister a Model
The following is the template command to unregister a model from the Inference Server:
```
curl -X DELETE "http://{inference_server_endpoint}:{management_port}/models/{model_name}/{repo_version}"
2. For JSON format, the input should be a '.json' file containing the prompt in the format below:
```
{
"id": "42",
"inputs": [
{
"name": "input0",
"shape": [-1],
"datatype": "BYTES",
"data": ["Capital of India?"]
}
]
}
```
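
For the text format (item 1 above), the prompt file simply contains the raw prompt text; a minimal illustrative example (file name and contents are hypothetical) is:
```
Capital of India?
```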
87 changes: 87 additions & 0 deletions docs/gpt-in-a-box/vm/management_requests.md
@@ -0,0 +1,87 @@
# Management Requests
The Inference Server can be managed through the TorchServe Management API. Find out more about it in the official [TorchServe Management API](https://pytorch.org/serve/management_api.html) documentation.

**Server Configuration**

| Variable | Value |
| --- | --- |
| inference_server_endpoint | localhost |
| management_port | 8081 |

The following are example cURL commands to send management requests to the Inference Server.

## Describe Registered Model
Once a model is loaded on the Inference Server, we can use the following request to describe the model and its configuration.

The following is the template command for the same:
```
curl http://{inference_server_endpoint}:{management_port}/models/{model_name}
```

### Examples
For MPT-7B model
```
curl http://localhost:8081/models/mpt_7b
```
For Falcon-7B model
```
curl http://localhost:8081/models/falcon_7b
```
For Llama2-7B model
```
curl http://localhost:8081/models/llama2_7b
```
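
The response is a JSON description of the model and its workers. The snippet below is an illustrative, abridged sketch for the Llama2-7B model; the exact fields and values depend on the TorchServe version and the model's configuration:
```
[
  {
    "modelName": "llama2_7b",
    "modelVersion": "1.0",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 1,
    "workers": [
      {
        "id": "9000",
        "status": "READY",
        "gpu": true
      }
    ]
  }
]
```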

## Register Additional Models
TorchServe allows multiple models to be registered (loaded) simultaneously. To register multiple models, make sure that the Model Archive files for all of the models are stored in the same model store directory.

The following is the template command for the same:
```
curl -X POST "http://{inference_server_endpoint}:{management_port}/models?url={model_archive_file_name}.mar&initial_workers=1&synchronous=true"
```

### Examples
For MPT-7B model
```
curl -X POST "http://localhost:8081/models?url=mpt_7b.mar&initial_workers=1&synchronous=true"
```
For Falcon-7B model
```
curl -X POST "http://localhost:8081/models?url=falcon_7b.mar&initial_workers=1&synchronous=true"
```
For Llama2-7B model
```
curl -X POST "http://localhost:8081/models?url=llama2_7b.mar&initial_workers=1&synchronous=true"
```
!!! note
Make sure the Model Archive file name given in the cURL request is correct and is present in the model store directory.
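
Since `synchronous=true` is set, the request returns once the initial worker is up. A successful registration returns a confirmation similar to the following (illustrative):
```
{
  "status": "Model \"llama2_7b\" Version: 1.0 registered with 1 initial workers"
}
```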

## Edit Registered Model Configuration
The model can be configured after registration using the TorchServe Management API.

The following is the template command for the same:
```
curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/models/{model_name}?min_worker={number}&max_worker={number}&batch_size={number}&max_batch_delay={delay_in_ms}"
```

### Examples
For MPT-7B model
```
curl -v -X PUT "http://localhost:8081/models/mpt_7b?min_worker=2&max_worker=2"
```
For Falcon-7B model
```
curl -v -X PUT "http://localhost:8081/models/falcon_7b?min_worker=2&max_worker=2"
```
For Llama2-7B model
```
curl -v -X PUT "http://localhost:8081/models/llama2_7b?min_worker=2&max_worker=2"
```
!!! note
Make sure there is enough GPU and system memory before increasing the number of workers, otherwise the additional workers will fail to load.
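
To confirm that the new worker configuration has taken effect, reuse the Describe Registered Model request from above, for example:
```
curl http://localhost:8081/models/llama2_7b
```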

## Unregister a Model
The following is the template command to unregister a model from the Inference Server:
```
curl -X DELETE "http://{inference_server_endpoint}:{management_port}/models/{model_name}/{repo_version}"
```
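
### Example
For the Llama2-7B model with repo_version "1.0" (the version shown is illustrative; use the version the model was registered with):
```
curl -X DELETE "http://localhost:8081/models/llama2_7b/1.0"
```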
3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -124,9 +124,10 @@ nav:
- "Getting Started": "gpt-in-a-box/vm/getting_started.md"
- "Generating Model Archive File": "gpt-in-a-box/vm/generating_mar.md"
- "Deploying Inference Server": "gpt-in-a-box/vm/inference_server.md"
- "Inference and Management Requests": "gpt-in-a-box/vm/inference_requests.md"
- "Inference Requests": "gpt-in-a-box/vm/inference_requests.md"
- "Model Version Support": "gpt-in-a-box/vm/model_version.md"
- "Custom Model Support": "gpt-in-a-box/vm/custom_model.md"
- "Management Requests": "gpt-in-a-box/vm/management_requests.md"
- "Deploy on Kubernetes":
- "Getting Started": "gpt-in-a-box/kubernetes/getting_started.md"
- "Generating Model Archive File": "gpt-in-a-box/kubernetes/generating_mar.md"
