CodeInferflow is an efficient inference engine, built upon Inferflow, specifically designed for large code language models (Code LLMs). It enables efficient local deployment of popular Code LLMs and provides APIs for code completion.
CodeInferflow serves concurrent requests efficiently and supports various data types, model file formats, and network types. It is highly configurable and extensible, supporting a wide range of Code LLMs and enabling users to customize their own models.
- Popular Code LLM Support
Supports code_llama2, codegeex2, deepseek_coder, starcoder2, and more. Other models can be supported by editing a model specification file.
- API for IDE Plugins
You can use the Llama Coder extension in VSCode for code completion.
- Efficient Code Inference
With dynamic batching, inference throughput and request response time are optimized when concurrent requests are made.
Note: The above experiments were conducted on an NVIDIA A100, with the 1.3b code_llama2-7b model running inference in FP16 mode. The average token length (input + output) per request is 125.
Additional features:
- Extensible and highly configurable.
- Various data type support: F32, F16, and quantization in 2-bit, 3-bit, 3.5-bit, 4-bit, 5-bit, 6-bit, and 8-bit.
- Hybrid model partition for multi-GPU inference: partition-by-layer (pipeline parallelism), partition-by-tensor (tensor parallelism), and hybrid partitioning (hybrid parallelism).
- Wide model file format support: pickle, safetensors, llama.cpp gguf, etc.
- Wide network type support: decoder-only models, encoder-only models, encoder-decoder models, and MoE models.
- GPU/CPU hybrid inference: Supporting GPU-only, CPU-only, and GPU/CPU hybrid inference.
- code_llama2_instruct_7b
- codegeex2_6b
- deepseek_coder_7b_instruct_v1.5
- starcoder2-3b
- codeqwen1.5_7b (unstable)
There are 34 predefined chat model specifications. See ChatModels.
Please make sure your CUDA version is >= 12.4. If you want to use the CPU version, please build the project manually.
Download from the release page.
Select a model in `data/models/code` or `data/models/chat`, such as "starcoder2-3b". Run `download.sh` or `download.win.cmd` to download the model files.
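For example, to fetch the "starcoder2-3b" model files on Linux (a minimal sketch, assuming the download script sits inside the selected model directory; adjust the path to your checkout):

```bash
# Hypothetical layout: run the per-model download script from its directory.
cd data/models/code/starcoder2-3b
bash download.sh          # on Windows, run download.win.cmd instead
```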
Edit `inferflow_service.ini` in the `bin` directory and uncomment the model you want to use. By default, the model "starcoder2-3b" is already enabled.
In the `bin/release` directory, run `inferflow_service` or `inferflow_service.exe` to start the service, or specify a configuration file path like `inferflow_service <configuration_file_path>`.
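For example, on Linux (assuming the default layout, with the configuration file one directory above the binary as described above):

```bash
cd bin/release
./inferflow_service                           # use the default configuration
./inferflow_service ../inferflow_service.ini  # or pass an explicit configuration path
```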
After starting "inferflow_service", you can use the Llama Coder extension in VSCode for code completion.
Install the Llama Coder extension in VSCode.
- Open settings
- Find the Llama Coder settings. (Search "@ext:ex3ndr.llama-coder")
- Enter the CodeInferflow service endpoint in "endpoint". (E.g. http://127.0.0.1:8080)
- Change the "Model" option to "custom".
- In "Custom Model", enter the model name you used, which should be the same name enabled in CodeInferflow config files. (E.g. deepseek_coder_1.3b_instruct)
- Enjoy code completions.
To build the CUDA version, ensure that CUDA, CMake, and Ninja are installed correctly. Compatibility is crucial: your GCC and G++ versions should match the CUDA version, and the CUDA version must align with the GPU driver version. It's advisable to utilize the NVIDIA PyTorch Docker image to streamline the setup process.
We build the project with the following versions:
- docker image: nvcr.io/nvidia/pytorch:24.03-py3
- CUDA: 12.4
- GCC: 11.4
- G++: 11.4
- Ninja: 1.11.1
- CMake: 3.28.3
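One convenient way to obtain a matching toolchain is to work inside the recommended container; the mount path below is only illustrative:

```bash
# Launch the recommended NVIDIA PyTorch image with GPU access and the repository mounted.
docker run --gpus all -it --rm \
  -v "$(pwd)":/workspace -w /workspace \
  nvcr.io/nvidia/pytorch:24.03-py3
```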
```bash
cmake -B build -DUSE_CUDA=1 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target install
```
If you want to build the CPU version, set the USE_CUDA flag to 0. However, since some activation functions are not implemented yet, the CPU version may not work properly for some models.
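A minimal sketch of the CPU-only build, reusing the same commands with the flag flipped:

```bash
# CPU-only build; some activation functions are not yet implemented on CPU.
cmake -B build -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target install
```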
For the Windows platform, Visual Studio, CUDA, and CMake should be installed properly. The build process is similar to the Linux one.
We build the project with the following versions:
- Visual Studio: 2022
- CUDA: 12.4
- CMake: 3.28.3
```bash
cmake -B build -DUSE_CUDA=1 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target install --config release
```
CodeInferflow offers high configurability, enabling users to customize settings such as the model, device, data type, and completion template within a single configuration file. To edit the configuration file, see Model Serving Configuration.
Additionally, users have the flexibility to configure new or custom models by modifying the model specification file. For detailed instructions, refer to Model Setup Guide.
Supports the OpenAI and Ollama API formats. See API.
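For example, a completion request in the OpenAI format might look like the following; the endpoint path, model name, and request fields are illustrative, so check the API documentation for the exact schema:

```bash
# Illustrative OpenAI-format request; adjust host, port, model name, and parameters as needed.
curl http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "starcoder2-3b", "prompt": "def fibonacci(n):", "max_tokens": 64}'
```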
CodeInferflow is inspired by the following awesome projects: