Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Output from tool calls are not very observable #9770

Closed
tslmy opened this issue Dec 31, 2023 · 13 comments
Closed

[Bug]: Output from tool calls are not very observable #9770

tslmy opened this issue Dec 31, 2023 · 13 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@tslmy
Copy link
Contributor

tslmy commented Dec 31, 2023

Bug Description

What seems wrong

I can't see responses from function calls in Arize Phoenix...

image

... or in ChainLit:

image

What I was expecting

I usually can see results from other types of Spans, such as embedding in Arize Phoenix...

image

... and Query Engines tools in ChainLit:

image

so I was expecting the same level of observability for FunctionTools.

My guess

I think this is because "function_call_response", the string LlamaIndex uses to identify responses from generic function calls with callback handlers, was non-standard to OpenInference Tracing conventions.

The "function_call_response" is defined here:

FUNCTION_OUTPUT = "function_call_response" # function call output
.

It is sent to callback managers here:

event.on_end(payload={EventPayload.FUNCTION_OUTPUT: str(tool_output)})
.

And the conventions are here. Indeed, Arize Phoenix doesn't know "function_call_response", as shown here:

https://github.com/Arize-ai/phoenix/blob/363e8913a3a0f7dafca5dc6bba6bf0e9776c1158/app/src/openInference/tracing/semanticConventions.ts#L51-L58

Version

0.9.23

Steps to Reproduce

from llama_hub.tools.wikipedia import WikipediaToolSpec
from llama_index.agent import ReActAgent
from llama_index.callbacks.base import CallbackManager
from llama_index.llms import OpenAILike
import chainlit as cl
import phoenix as px


# "Phoenix can display in real time the traces automatically collected from your LlamaIndex application."
# https://docs.llamaindex.ai/en/stable/module_guides/observability/observability.html
px.launch_app()


@cl.on_chat_start
async def factory():
    callback_manager = CallbackManager([cl.LlamaIndexCallbackHandler()])

    local_llm = OpenAILike(
        api_base="http://localhost:1234/v1",
        timeout=600,  # secs
        api_key="loremIpsum",
        # I'm using this: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta .
        model="zephyr beta 7B q5_k_m gguf",
        is_chat_model=True,
        is_function_calling_model=True,
        context_window=32768,
        callback_manager=callback_manager,
    )

    tool_spec = WikipediaToolSpec()
    agent = ReActAgent.from_tools(
        tools=tool_spec.to_tool_list(),
        llm=local_llm,
        callback_manager=callback_manager,
    )
    cl.user_session.set("agent", agent)


@cl.on_message
async def main(message: cl.Message):
    """ChainLit provides a web GUI for this application."""
    agent: ReActAgent = cl.user_session.get("agent")
    response = await cl.make_async(agent.chat)(message.content)
    response_message = cl.Message(content="")
    response_message.content = response.response
    await response_message.send()

Save the script above as main.py. Run in terminal:

chainlit run main.py -w

Ask the Agent "Search Wikipedia and summarize what is a cheesecake", and observe.

Relevant Logs/Tracbacks

No response

@tslmy tslmy added bug Something isn't working triage Issue needs to be triaged/prioritized labels Dec 31, 2023
Copy link

dosubot bot commented Dec 31, 2023

🤖

It's great that you've identified the issue with the function_call_response not adhering to OpenInference Tracing conventions. Your detailed analysis and references to the specific code locations and conventions are very helpful.

If you have a proposed solution or fix in mind, I would encourage you to go ahead and create a pull request with your changes. Your contribution will be valuable in improving LlamaIndex and ensuring it aligns with the OpenInference Tracing conventions.

Thank you for your dedication to improving LlamaIndex!


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@logan-markewich
Copy link
Collaborator

@tslmy do you have any idea on what the fix here would be? Looking at arize's spec, I actually have no idea 😆 We should be populating a tool_calls attribute?

@tslmy
Copy link
Contributor Author

tslmy commented Dec 31, 2023

@logan-markewich , tool_calls is closer, but it also doesn't seem to have a conventional way to specify tool outputs.

I opened an Issue to the Open Inference Specifications repo above, where I asked maintainers there to come up with a conventional way to specify tool outputs. Let's see what they will say.

@tslmy tslmy changed the title [Bug]: function_call_response does not adhere to OpenInference Tracing conventions [Bug]: Output from tool calls are not very observable Dec 31, 2023
@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 31, 2024
Copy link

dosubot bot commented Mar 31, 2024

Hi, @tslmy,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, the issue highlighted a bug where the output from function calls is not observable in Arize Phoenix and ChainLit due to a non-standard string used to identify responses from generic function calls with callback handlers. The resolution involved encouraging the author to create a pull request with their proposed changes, and there is ongoing discussion about the need for a conventional way to specify tool outputs, with an issue opened in the Open Inference Specifications repo for further discussion.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you!

@tslmy
Copy link
Contributor Author

tslmy commented Apr 1, 2024

@dosubot shall we wait till Arize-ai/open-inference-spec#37 is closed before closing this?

@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Apr 1, 2024
@teis-e
Copy link

teis-e commented Apr 8, 2024

Hello, i am also trying to get a non OpenAI model working. Is OpenAILike the only way to do this, like you did. Or can i just use HuggingFaceLLM?

Copy link

dosubot bot commented Apr 8, 2024

Based on the context provided, OpenAILike is not the only option for integrating non-OpenAI models into your project. You can also use HuggingFaceLLM for this purpose. Both OpenAILike and HuggingFaceLLM are designed to provide interfaces for generating text completions and handling chat interactions, but they are tailored to work with different ecosystems.

OpenAILike provides an interface similar to OpenAI's API. It supports customization through various parameters such as model, api_key, context_window, max_tokens, and an optional tokenizer. It allows for generating text completions with the complete method and handling chat interactions with the chat method. You can find more details in the test_openai_like.py file.

On the other hand, HuggingFaceLLM is designed to integrate with Hugging Face's language models. It allows specifying a model and tokenizer from Hugging Face, with default options set to "StabilityAI/stablelm-tuned-alpha-3b". It supports custom configurations for the context window size, maximum number of new tokens to generate, system prompts, and query wrapper prompts. It also offers methods for both one-off text completions (complete) and streaming completions (stream_complete), as well as methods for handling chat interactions (chat and stream_chat). More details can be found in the huggingface.py file.

So, depending on the specific requirements of your project and the ecosystem you are working with, you can choose either OpenAILike or HuggingFaceLLM to integrate non-OpenAI models into your project.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@logan-markewich
Copy link
Collaborator

@teis-e openailike is for models hosted on a server that has the same API as openai

HuggingfaceLLM is specifically for running a model locally (no server, just in the same process/program as llamaindex)

@teis-e
Copy link

teis-e commented Apr 8, 2024

@teis-e openailike is for models hosted on a server that has the same API as openai

HuggingfaceLLM is specifically for running a model locally (no server, just in the same process/program as llamaindex)

Right! Thnx, but the openailike gives the opurtunity to run agents and function calling etc without OpenAI. So just a local model, like the Zypher from above?

@logan-markewich
Copy link
Collaborator

Using openailike does not magically give access to function calling (I wish haha)

The server running the api has to actually have this implemented to handle the tools kwarg when running the model

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 10, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 17, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 17, 2024
@ryann-sportsbet
Copy link

Hi @teis-e,

Do you have any update on this issue?

I'm experiencing the same problem while attempting to track token usage from LLM calls. I'm using Phoenix, and I'm finding it difficult to determine the precise number of input, output, and total tokens used for each LLM call.
For context, I'm working with Bedrock Converse.
Any insights or progress on this matter would be greatly appreciated mate

@teis-e
Copy link

teis-e commented Oct 14, 2024 via email

@teis-e
Copy link

teis-e commented Oct 14, 2024

I can think you can tokenize the messaga and then the len() is the amount of tokens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized
Projects
None yet
Development

No branches or pull requests

4 participants