
Encoding issue with non-ASCII characters in Ollama.chat responses (Spanish language issue) #168

Open
jasp402 opened this issue Nov 30, 2024 · 2 comments

jasp402 commented Nov 30, 2024

Description

When using the Ollama.chat method to interact with the llama3 model, responses containing special characters (e.g., accented characters like á, é, í, ó, ú, ü and punctuation like ¿, ¡) are improperly encoded. While standard ASCII characters work fine, non-ASCII characters are returned with encoding artifacts, making them unreadable.

This issue persists across attempts to decode or process the responses within the client code, suggesting the issue might be related to how the library or server processes UTF-8 encoding.


Observed Behavior

The following responses were received when interacting with the llama3 model via Ollama.chat:

assistant: Hola! ¿Cómo estás?

This was expected to be:

assistant: Hola! ¿Cómo estás?

Raw response from Ollama: {
  model: 'llama3',
  created_at: '2024-11-30T04:57:41.4175287Z',
  message: { role: 'assistant', content: 'Hola! ¿Cómo estás?' },
  done_reason: 'stop',
  done: true,
  total_duration: 2413650900,
  load_duration: 24432200,
  prompt_eval_count: 12,
  prompt_eval_duration: 288000000,
  eval_count: 8,
  eval_duration: 2100000000
}

Additional Context

  • English works fine: Messages containing only English characters are processed correctly.
  • Special characters fail: Any character outside the ASCII range (e.g., accented vowels, ¿, ¡) results in encoding artifacts.

Direct API Output
Testing with curl shows that responses from the server are returned in fragments:

curl -X POST http://localhost:11434/api/generate \
     -H "Content-Type: application/json" \
     -d '{
         "model": "llama3",
         "prompt": "¡Hola! ¿Cómo estás?"
     }'

Response

{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:16.7882944Z",
    "response": "¡",
    "done": false
}
{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:17.1132195Z",
    "response": "h",
    "done": false
}
{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:17.5597785Z",
    "response": "ola",
    "done": false
}
{
    "model": "llama3",
    "created_at": "2024-11-30T05:00:19.5772488Z",
    "response": "?",
    "done": true
}

This suggests the fragments are being returned correctly in terms of structure but not properly encoded.
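
For reference, a minimal sketch of consuming this stream directly (assuming Node 18+ with global fetch; the code is illustrative and not part of ollama-js) that decodes with a streaming TextDecoder, so a multi-byte character such as ¿ split across two network chunks is reassembled instead of being replaced:

// Minimal sketch, assuming Node 18+ (global fetch) and a local Ollama server.
// TextDecoder with { stream: true } buffers incomplete multi-byte sequences,
// so a character split across chunk boundaries is not turned into U+FFFD.
const res = await fetch('http://127.0.0.1:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'llama3', prompt: '¡Hola! ¿Cómo estás?' })
});

const decoder = new TextDecoder('utf-8');
let buffered = '';

for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
    buffered += decoder.decode(chunk, { stream: true });

    // The endpoint streams newline-delimited JSON; parse each complete line.
    let newline;
    while ((newline = buffered.indexOf('\n')) !== -1) {
        const line = buffered.slice(0, newline).trim();
        buffered = buffered.slice(newline + 1);
        if (line) process.stdout.write(JSON.parse(line).response ?? '');
    }
}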

Attempts to Resolve

UTF-8 Decoding Using Buffer: Tried re-interpreting the response bytes as UTF-8 via a Latin-1 round trip:

const content = Buffer.from(response.message.content, 'latin1').toString('utf8');
console.log('Decoded content:', content);

Result:

Hola! ´┐¢C´┐¢mo est´┐¢s?
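
For what it's worth, that output is consistent with the content string already being valid UTF-8: round-tripping it through Latin-1 strips the multi-byte sequences, and ´┐¢ is how the bytes of the U+FFFD replacement character (EF BF BD) render under the CP850 code page that cmd.exe uses by default. A small sketch (variable name taken from the snippet above) to check whether the data or only the console rendering is at fault:

// Inspect the code points instead of printing the string. If the hex values
// include bf (¿), f3 (ó) and e1 (á), the response string is intact and the
// artifacts come from the terminal's code page rather than from the library.
// Switching the console to UTF-8 with `chcp 65001` is one way to confirm.
const content = response.message.content;
console.log([...content].map((c) => c.codePointAt(0)!.toString(16)).join(' '));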

Environment Details

OS: Windows 11
Node.js Version: 20.x
Library Version: Latest (installed via npm)
Model Used: llama3
API Host: http://127.0.0.1:11434

Request

  • Confirm UTF-8 Handling: Verify that the server and library are properly handling UTF-8 characters in both streaming and assembled responses.
  • Document Encoding Expectations: Clarify if clients need to perform additional decoding steps or if the library should natively handle this.
  • Provide Guidance: If this issue is expected behavior, please provide steps or examples for properly decoding responses with non-ASCII characters.

Thank you for addressing this issue. If more information or debugging steps are needed, feel free to reach out!

@jessegross

What version of Ollama are you running? There were some Unicode issues in earlier versions, but we now have tests to verify this, and it worked fine when I just tried it:

ollama % ./ollama run llama3                               
>>> ¡Hola! ¿Cómo estás?
¡Hola! Como soy un modelo de lenguaje artificial, no tengo sentimientos ni emociones como los seres humanos, por 
lo que no estoy realmente "bien" o "mal". Estoy aquí para ayudarte en cualquier cosa que necesites, responder a 
tus preguntas y tener una conversación con vosotros. ¿En qué puedo ayudarte hoy?

@BruceMacD
Collaborator

Hi @jasp402, as Jesse demonstrated, the server seems to encode the response correctly. I have also tested on my system (using macOS and Bun) and the content is encoded correctly.

Here is my example code:

import { Ollama } from 'ollama';

async function main() {
   const ollama = new Ollama({host: 'http://127.0.0.1:11434'});
   const message = {role: 'user', content: '¡Hola! ¿Cómo estás?'};

   try {
       // Regular chat
       const response = await ollama.chat({
           model: 'llama3.2:1b', 
           messages: [message]
       });
       console.log(response.message.content);

       // Streaming chat 
       const stream = await ollama.chat({
           model: 'llama3.2:1b',
           messages: [message],
           stream: true
       });

       for await (const chunk of stream) {
           process.stdout.write(chunk.message.content);
       }
   } catch (error) {
       console.error('Error:', error);
   }
}

main().catch(console.error);

And running the example:

❯ bun run chat.ts
¡Hola! Estoy bien, gracias. ¿Y tú cómo estáis? Es un placer tenerte aquí. ¿En qué puedo ayudarte hoy?
Estoy bien, gracias. ¿En qué puedo ayudarte hoy? ¿Necesitas ayuda con algo en particular o simplemente quieres charlar un rato? Estoy aquí para escucharte y responder a tus preguntas.%

Would you be able to try the sample code I have here? I'm wondering if it's a Windows-specific issue or a middleware/proxy causing the problem.
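
If it helps, here is a small sketch (file name illustrative, assuming Bun or an ESM Node setup) that writes the response to a file so the Windows console is taken out of the picture. If out.txt shows the accents correctly in a UTF-8-aware editor, the library and server are fine and the artifacts are introduced when printing to the terminal:

import { writeFile } from 'node:fs/promises';
import { Ollama } from 'ollama';

const ollama = new Ollama({ host: 'http://127.0.0.1:11434' });
const res = await ollama.chat({
    model: 'llama3',
    messages: [{ role: 'user', content: '¡Hola! ¿Cómo estás?' }]
});

// Write the content to disk instead of logging it; the bytes on disk bypass
// the console's code page entirely.
await writeFile('out.txt', res.message.content, 'utf8');
console.log('wrote', res.message.content.length, 'characters to out.txt');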
