Add functionality in LLMBlock within the pipeline to override the global OpenAI client variable. This enhancement will allow us to support running multiple OpenAI clients for different LLMBlock instances if desired. The primary intention is to run LLMBlock inference calls against a model deployment tailored to serve specific inference requests.
Currently, in vLLM, certain LoRA inference calls do not support specific performance optimization flags. By separating these inference calls from the non-LoRA inference calls, we could deploy multiple vLLM instances, each optimized for its type of inference call, which would improve overall performance. A rough sketch of what this could look like follows.
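For illustration, here is a minimal sketch of a per-block client override. The class and parameter names (`LLMBlock`, `client`, `default_client`) and the endpoint URLs are assumptions for this sketch, not the actual instructlab/sdg API:

```python
# Hypothetical sketch: each LLMBlock accepts an optional client that overrides
# the pipeline-wide OpenAI client. Names and endpoints are illustrative only
# and do not reflect the real LLMBlock implementation.
from openai import OpenAI

# One client per vLLM deployment: a LoRA-serving instance and a base-model
# instance, each started with the flags that suit its workload.
lora_client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
base_client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


class LLMBlock:
    """Minimal stand-in for a pipeline block that issues chat completions."""

    def __init__(self, model_id: str, default_client: OpenAI, client: OpenAI | None = None):
        self.model_id = model_id
        # A per-block override wins; otherwise fall back to the global client.
        self.client = client or default_client

    def generate(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content


# The LoRA-backed block targets the LoRA-optimized deployment;
# everything else keeps using the default deployment.
knowledge_block = LLMBlock("my-base-model", default_client=base_client)
skill_block = LLMBlock("my-lora-adapter", default_client=base_client, client=lora_client)
```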
This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
Is this still an issue? It would not be trivial to wire up multiple functioning OpenAI clients with the ability to select a client per block of a Pipeline. We're redoing the internals of the pipeline config, so if this is something we really need, we should revisit it after the larger pipeline work lands.
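One way per-block selection could be expressed is a simple mapping from block spec to endpoint. This builds on the `LLMBlock` sketch above; the mapping keys and the idea of naming endpoints in the config are assumptions, not the project's actual pipeline schema:

```python
# Hypothetical config-driven client selection, reusing OpenAI and the LLMBlock
# class from the previous sketch. The "endpoint" field is an illustrative
# extension, not part of the real pipeline config.
ENDPOINTS = {
    "default": "http://localhost:8000/v1",
    "lora": "http://localhost:8001/v1",
}

block_specs = [
    {"name": "gen_knowledge", "model": "my-base-model", "endpoint": "default"},
    {"name": "gen_skill", "model": "my-lora-adapter", "endpoint": "lora"},
]

# Build one client per named endpoint, then hand each block the client it asked for.
clients = {name: OpenAI(base_url=url, api_key="EMPTY") for name, url in ENDPOINTS.items()}
blocks = [
    LLMBlock(
        spec["model"],
        default_client=clients["default"],
        client=clients[spec["endpoint"]],
    )
    for spec in block_specs
]
```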