
Encoding issue with non-ASCII characters in Ollama.chat responses (Spanish language issue) #168

Open
jasp402 opened this issue Nov 30, 2024 · 2 comments

jasp402 commented Nov 30, 2024

Description

When using the Ollama.chat method to interact with the llama3 model, responses containing special characters (e.g., accented characters like á, é, í, ó, ú, ü and punctuation like ¿, ¡) are improperly encoded. While standard ASCII characters work fine, non-ASCII characters are returned with encoding artifacts, making them unreadable.

This issue persists across attempts to decode or process the responses within the client code, suggesting the issue might be related to how the library or server processes UTF-8 encoding.


Observed Behavior

The following responses were received when interacting with the llama3 model via Ollama.chat:

assistant: Hola! ¿Cómo estás?

This was expected to be:

assistant: Hola! ¿Cómo estás?

Raw response from Ollama: {
  model: 'llama3',
  created_at: '2024-11-30T04:57:41.4175287Z',
  message: { role: 'assistant', content: 'Hola! ¿Cómo estás?' },
  done_reason: 'stop',
  done: true,
  total_duration: 2413650900,
  load_duration: 24432200,
  prompt_eval_count: 12,
  prompt_eval_duration: 288000000,
  eval_count: 8,
  eval_duration: 2100000000
}

Additional Context

  • English works fine: Messages containing only English characters are processed correctly.
  • Special characters fail: Any character outside the ASCII range (e.g., accented vowels, ¿, ¡) results in encoding artifacts.

Direct API Output
Testing with curl shows that responses from the server are returned in fragments:

curl -X POST http://localhost:11434/api/generate \
     -H "Content-Type: application/json" \
     -d '{
         "model": "llama3",
         "prompt": "¡Hola! ¿Cómo estás?"
     }'

Response

{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:16.7882944Z",
    "response": "¡",
    "done": false
}
{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:17.1132195Z",
    "response": "h",
    "done": false
}
{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:17.5597785Z",
    "response": "ola",
    "done": false
}
{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:19.5772488Z",
    "response": "?",
    "done": true
}

This suggests the fragments are being returned correctly in terms of structure but not properly encoded.
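
For reference, a minimal sketch of consuming this stream directly (assuming Node 18+ with global fetch; the code is illustrative and not part of ollama-js) that decodes with a streaming TextDecoder, so a multi-byte character such as ¿ split across two network chunks is reassembled instead of being replaced:

// Minimal sketch, assuming Node 18+ (global fetch) and a local Ollama server.
// TextDecoder with { stream: true } buffers incomplete multi-byte sequences,
// so a character split across chunk boundaries is not turned into U+FFFD.
const res = await fetch('http://127.0.0.1:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'llama3', prompt: '¡Hola! ¿Cómo estás?' })
});

const decoder = new TextDecoder('utf-8');
let buffered = '';

for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
    buffered += decoder.decode(chunk, { stream: true });

    // The endpoint streams newline-delimited JSON; parse each complete line.
    let newline;
    while ((newline = buffered.indexOf('\n')) !== -1) {
        const line = buffered.slice(0, newline).trim();
        buffered = buffered.slice(newline + 1);
        if (line) process.stdout.write(JSON.parse(line).response ?? '');
    }
}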

Attempts to Resolve

UTF-8 Decoding Using Buffer: Tried re-interpreting the response bytes as UTF-8 via a Latin-1 round trip:

const content = Buffer.from(response.message.content, 'latin1').toString('utf8');
console.log('Decoded content:', content);

Result:

Hola! ´┐¢C´┐¢mo est´┐¢s?
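
For what it's worth, that output is consistent with the content string already being valid UTF-8: round-tripping it through Latin-1 strips the multi-byte sequences, and ´┐¢ is how the bytes of the U+FFFD replacement character (EF BF BD) render under the CP850 code page that cmd.exe uses by default. A small sketch (variable name taken from the snippet above) to check whether the data or only the console rendering is at fault:

// Inspect the code points instead of printing the string. If the hex values
// include bf (¿), f3 (ó) and e1 (á), the response string is intact and the
// artifacts come from the terminal's code page rather than from the library.
// Switching the console to UTF-8 with `chcp 65001` is one way to confirm.
const content = response.message.content;
console.log([...content].map((c) => c.codePointAt(0)!.toString(16)).join(' '));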

Environment Details

OS: Windows 11
Node.js Version: 20.x
Library Version: Latest (installed via npm)
Model Used: llama3
API Host: http://127.0.0.1:11434

Request

  • Confirm UTF-8 Handling: Verify that the server and library are properly handling UTF-8 characters in both streaming and assembled responses.
  • Document Encoding Expectations: Clarify if clients need to perform additional decoding steps or if the library should natively handle this.
  • Provide Guidance: If this issue is expected behavior, please provide steps or examples for properly decoding responses with non-ASCII characters.

Thank you for addressing this issue. If more information or debugging steps are needed, feel free to reach out!

@jessegross

What version of Ollama are you running? There were some Unicode issues in earlier versions, but we now have tests to verify this, and it worked fine when I just tried it:

ollama % ./ollama run llama3                               
>>> ¡Hola! ¿Cómo estás?
¡Hola! Como soy un modelo de lenguaje artificial, no tengo sentimientos ni emociones como los seres humanos, por 
lo que no estoy realmente "bien" o "mal". Estoy aquí para ayudarte en cualquier cosa que necesites, responder a 
tus preguntas y tener una conversación con vosotros. ¿En qué puedo ayudarte hoy?

@BruceMacD
Collaborator

Hi @jasp402, as Jesse demonstrated, the server seems to encode the response correctly. I have also tested on my system (using macOS and Bun) and the content is encoded correctly.

Here is my example code:

import { Ollama } from 'ollama';

async function main() {
   const ollama = new Ollama({host: 'http://127.0.0.1:11434'});
   const message = {role: 'user', content: '¡Hola! ¿Cómo estás?'};

   try {
       // Regular chat
       const response = await ollama.chat({
           model: 'llama3.2:1b', 
           messages: [message]
       });
       console.log(response.message.content);

       // Streaming chat 
       const stream = await ollama.chat({
           model: 'llama3.2:1b',
           messages: [message],
           stream: true
       });

       for await (const chunk of stream) {
           process.stdout.write(chunk.message.content);
       }
   } catch (error) {
       console.error('Error:', error);
   }
}

main().catch(console.error);

And running the example:

❯ bun run chat.ts
¡Hola! Estoy bien, gracias. ¿Y tú cómo estáis? Es un placer tenerte aquí. ¿En qué puedo ayudarte hoy?
Estoy bien, gracias. ¿En qué puedo ayudarte hoy? ¿Necesitas ayuda con algo en particular o simplemente quieres charlar un rato? Estoy aquí para escucharte y responder a tus preguntas.%

Would you be able to try the sample code I have here? I'm wondering if it's a Windows-specific issue or a middleware/proxy causing the problem.
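
If it helps, here is a small sketch (file name illustrative, assuming Bun or an ESM Node setup) that writes the response to a file so the Windows console is taken out of the picture. If out.txt shows the accents correctly in a UTF-8-aware editor, the library and server are fine and the artifacts are introduced when printing to the terminal:

import { writeFile } from 'node:fs/promises';
import { Ollama } from 'ollama';

const ollama = new Ollama({ host: 'http://127.0.0.1:11434' });
const res = await ollama.chat({
    model: 'llama3',
    messages: [{ role: 'user', content: '¡Hola! ¿Cómo estás?' }]
});

// Write the content to disk instead of logging it; the bytes on disk bypass
// the console's code page entirely.
await writeFile('out.txt', res.message.content, 'utf8');
console.log('wrote', res.message.content.length, 'characters to out.txt');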
