CodeInferflow is an efficient inference engine, built upon Inferflow, specifically designed for large code language models (Code LLMs). It enables efficient local deployment of popular Code LLMs and provides APIs for code completion.
CodeInferflow serves concurrent requests efficiently and supports various data types, model file formats, and network types. It is highly configurable and extensible, supporting a wide range of Code LLMs and enabling users to customize their own models.
- Popular Code LLM Support
Supports code_llama2, codegeex2, deepseek_coder, starcoder2, and more. Other models can be supported by editing a model specification file.
- API for IDE Plugins
You can use the Llama Coder extension in VSCode for code completion.
- Efficient Code Inference
With dynamic batching, inference throughput and request response time are optimized when concurrent requests are made.
Note: The above experiments were conducted on an NVIDIA A100, with the 1.3b code_llama2-7b model running inference in FP16 mode. The average token length (input + output) per request is 125.
Additional features:
- Extensible and highly configurable.
- Various data type support: F32, F16, and quantization in 2-bit, 3-bit, 3.5-bit, 4-bit, 5-bit, 6-bit, and 8-bit.
- Hybrid model partition for multi-GPU inference: partition-by-layer (pipeline parallelism), partition-by-tensor (tensor parallelism), and hybrid partitioning (hybrid parallelism).
- Wide model file format support: pickle, safetensors, llama.cpp gguf, etc.
- Wide network type support: decoder-only models, encoder-only models, encoder-decoder models, and MoE models.
- GPU/CPU hybrid inference: Supporting GPU-only, CPU-only, and GPU/CPU hybrid inference.
- code_llama2_instruct_7b
- codegeex2_6b
- deepseek_coder_7b_instruct_v1.5
- starcoder2-3b
- codeqwen1.5_7b (unstable)
There are 34 predefined chat model specifications. See ChatModels.
Please make sure your CUDA version is >= 12.4. If you want to use the CPU version, please build the project manually.
Download from the release page.
Select a model in `data/models/code` or `data/models/chat`, such as "starcoder2-3b". Run `download.sh` or `download.win.cmd` to download the model files.
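For example, to fetch the "starcoder2-3b" model files on Linux (a minimal sketch, assuming the download script sits inside the selected model directory; adjust the path to your checkout):

```bash
# Hypothetical layout: run the per-model download script from its directory.
cd data/models/code/starcoder2-3b
bash download.sh          # on Windows, run download.win.cmd instead
```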
Edit `inferflow_service.ini` in the `bin` directory and uncomment the model you want to use. By default, the model "starcoder2-3b" is already enabled.
In the `bin/release` directory, run `inferflow_service` or `inferflow_service.exe` to start the service, or specify a configuration file path like `inferflow_service <configuration_file_path>`.
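For example, on Linux (assuming the default layout, with the configuration file one directory above the binary as described above):

```bash
cd bin/release
./inferflow_service                           # use the default configuration
./inferflow_service ../inferflow_service.ini  # or pass an explicit configuration path
```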
After starting "inferflow_service", you can use the Llama Coder extension in VSCode for code completion.
Install the Llama Coder extension in VSCode.
- Open settings
- Find the Llama Coder settings. (Search "@ext:ex3ndr.llama-coder")
- Enter the CodeInferflow service endpoint in "endpoint". (E.g. http://127.0.0.1:8080)
- Change the "Model" option to "custom".
- In "Custom Model", enter the model name you used, which should be the same name enabled in CodeInferflow config files. (E.g. deepseek_coder_1.3b_instruct)
- Enjoy code completions.
To build the CUDA version, ensure that CUDA, CMake, and Ninja are installed correctly. Compatibility is crucial: your GCC and G++ versions should match the CUDA version, and the CUDA version must align with the GPU driver version. It's advisable to utilize the NVIDIA PyTorch Docker image to streamline the setup process.
We build the project with the following versions:
- docker image: nvcr.io/nvidia/pytorch:24.03-py3
- CUDA: 12.4
- GCC: 11.4
- G++: 11.4
- Ninja: 1.11.1
- CMake: 3.28.3
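One convenient way to obtain a matching toolchain is to work inside the recommended container; the mount path below is only illustrative:

```bash
# Launch the recommended NVIDIA PyTorch image with GPU access and the repository mounted.
docker run --gpus all -it --rm \
  -v "$(pwd)":/workspace -w /workspace \
  nvcr.io/nvidia/pytorch:24.03-py3
```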
```bash
cmake -B build -DUSE_CUDA=1 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target install
```
If you want to build the CPU version, set the USE_CUDA flag to 0. However, since some activation functions are not implemented yet, the CPU version may not work properly for some models.
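A minimal sketch of the CPU-only build, reusing the same commands with the flag flipped:

```bash
# CPU-only build; some activation functions are not yet implemented on CPU.
cmake -B build -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target install
```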
For the Windows platform, Visual Studio, CUDA, and CMake should be installed properly. The build process is similar to the Linux one.
We build the project with the following versions:
- Visual Studio: 2022
- CUDA: 12.4
- CMake: 3.28.3
```bash
cmake -B build -DUSE_CUDA=1 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target install --config release
```
CodeInferflow offers high configurability, enabling users to customize settings such as the model, device, data type, and completion template within a single configuration file. To edit the configuration file, see Model Serving Configuration.
Additionally, users have the flexibility to configure new or custom models by modifying the model specification file. For detailed instructions, refer to Model Setup Guide.
Supports the OpenAI and Ollama API formats. See API.
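For example, a completion request in the OpenAI format might look like the following; the endpoint path, model name, and request fields are illustrative, so check the API documentation for the exact schema:

```bash
# Illustrative OpenAI-format request; adjust host, port, model name, and parameters as needed.
curl http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "starcoder2-3b", "prompt": "def fibonacci(n):", "max_tokens": 64}'
```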
CodeInferflow is inspired by the following awesome projects: