Still unable to produce a full output, even with context changes. 6000 seconds spent on processing context recursively, 1 second spend on generation of each 8 token chunk. #65

bucketcat · 2023-04-14T19:35:36Z

bucketcat
Apr 14, 2023

Currently, it's managing a whopping 64 tokens at most, often 32, before re-processing the entire context, which takes ages, even though the output itself takes one second. 1.1.6, 1.17, 1.7.1 all has the same issue. 1.1.6 reread the entire context after each 8 token output however, now it manages to pump out 8x4 tokens the first time, before returning to one 8 token output, and then rereading the entire context in those god damn 512 token chunks.

Why is the output even segmented into 8 token pieces before "re-reading" the context and continuing? It should be able to produce a 128 token output based on 2048 tokens of context, higher output count will obviously be less coherent but that's a fine trade off when you can manually edit and remove parts quickly. The worst part is that it segments reading the context into 512 token chunks, meaning that it spends 60 seconds per 512 tokens, to produce a fraction of an output, then repeats. That's 240 seconds, for 8 tokens while my hardware is idle (--threads 14, somehow just as broken as ffmpeg multithreading)

There's some real funky stuff going on, context is still not handled nor stored properly, since it should be cached and only the oldest tokens should be pushed out, and replaced with the previous output and any other user modifications. (assuming no keyword triggered world-info is used)

The performance is great, for the first output. But at 2048 context, writing a 128 token output takes 4-6000 seconds while my hardware is setting at a 40% cpu, 50% memory, 35% gpu (with CFBlast) essentially idling. There's also no way to specify a higher memory usage, it's evaluated on loading the model and is static, meaning that 50-40% of my memory is not utilized, which could be used for caching the context.

Is this a recent issue? I have the issue on both OLblast and CFBlast.

If there's a way to use more of my memory, and to ensure that multithreading actually works (well, it does for generation which takes less than a second, but 99% of the time is spent on reading the context and all my hardware is idle during that period).

Any clue what's going on? Why is the context simply not stored in memory during the entire output?

EDIT: Is there any way to just drop Blas/blast and use RAM caching since I have enough memory to spare to store 2048 tokens? Alternatively using pagefiles and disk caching? It's absurd at the moment.

LostRuins · 2023-04-15T02:26:13Z

LostRuins
Apr 15, 2023
Maintainer

That's the problem, there is no way to push "oldest" tokens out of the context without having to recalculate everything after that point. Everything we have is just a workaround for it. You did try running with --smartcontext? What's the token allowance displayed after it triggers? It could be that your story's memory is too long.

0 replies

bucketcat · 2023-04-15T12:49:29Z

bucketcat
Apr 15, 2023
Author

That's the problem, there is no way to push "oldest" tokens out of the context without having to recalculate everything after that point. Everything we have is just a workaround for it. You did try running with --smartcontext? What's the token allowance displayed after it triggers? It could be that your story's memory is too long.

I did try smart context. It made no difference at all. Also tried reducing context in increments of 512 without smart context, but even with a context limit of 512 tokens it's still unusable due to the fact that it can't output an entire output at once.

The main issue is that the output is processed in 8 token chunks, so the 4x512(or 1x512 with 512 tokens of context set) context processing is done sixteen times for an output of 128 tokens (eight times for an output of 64 tokens), once per 8 tokens. Every now and then it manages to output several of these 8 token chunks in a row after processing the context, but it's usually only on the first 32 tokens and only once after loading the model, then it goes back to re-processing the context once every 8 tokens instead of doing it once per generation.

If it only processed the context once per generation, it would be fine, but currently it has to process it up to sixteen times for a 128 token output, meaning that each 8 tokens takes 240(60 seconds per 512 tokens of context) seconds, and the entire output ends up being 16x240 seconds besides the first generation when a model is loaded.

Edit: Also, I've noticed that the System-Info doesn't actually mention threads used at all, which it should. Dunno if this is just a difference in verbosity in the sys_info compared to llama.cpp or if --threads isn't being parsed properly.

Edit 2: Smart context is removing 8 tokens from the allowance per 8 token chunk, which is to be expected given the circumstances. The problem is with the parameter "Max_length", which is n_batch I assume. I'll try to set it to half of the output tokens to see if it will change it so that context is only processed twice per output.

Input: {"n": 1, "max_context_length": 2048, "max_length": 8, "rep_pen": 1.08, "temperature": 0.62, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6],

Edit 3: Smart context

Truncated, but the this is the runtime during a generation, with the smart context being triggered at random, but still only resulting in a marginal improvement.

Processing Prompt [BLAS] (512 / 2024 tokens)
Processing Prompt [BLAS] (2024 / 2024 tokens)
Generating (8 / 8 tokens)

[New Smart Context Triggered! Buffered Token Allowance: 1020]
Processing Prompt [BLAS] (1020 / 1020 tokens)
Generating (8 / 8 tokens)

[Reusing Smart Context: 1008 allowance remaining]
Processing Prompt (1 / 1 tokens)
Generating (8 / 8 tokens)

[Reusing Smart Context: 996 allowance remaining]
Processing Prompt (1 / 1 tokens)
Generating (8 / 8 tokens)

[Reusing Smart Context: 988 allowance remaining]
Processing Prompt (1 / 1 tokens)
Generating (8 / 8 tokens)

Processing Prompt [BLAS] (512 / 2024 tokens)
Processing Prompt [BLAS] (2024 / 2024 tokens)
Generating (8 / 8 tokens)

It's a bit hit or miss, but does increase the chance of having more than 8 tokens output, but it keeps reprocessing the context again a lot. The smart context seems to manage to output around 4x8 tokens before context has to be reprocessed. I didn't manage to change the n_batch.

5 replies

LostRuins Apr 15, 2023
Maintainer

In that case, you probably have something (world info? author's note?) that is constantly modifying parts of your context. It may be better to disable streaming and request larger token amounts.

bucketcat Apr 15, 2023
Author

I have a total of 63 tokens in static memory, 0 world info 0 author's note (tested with the web-app llama tokenizer at huggingface). Btw, Kobold Lite is missing the variable in the ui to change the requested tokens and I didn't manage to figure out a launch arg for it. Is there any way to bind the Llama compatibility parts to united/regular kobold client instead of lite? I could access the api that way as well and use a bunch of other self hosted stuff like a second parallel 13b model alongside(to host I suppose, unless that's limited to lite) the 33b model and imagen for personal use. Lots of the features and settings I depend on for my work flow are missing from lite, such as prompt insertion, context ordering, advanced sampler settings, custom biasing of words and phrases, custom stop sequences etc. Makes finding out what's happening much harder when you can't test things out.

LostRuins Apr 15, 2023
Maintainer

The option to change requested length is "Amount to generate". Also make sure the streaming is off (url does not have a ?streaming=1 param)
Yes, you can use it from regular kobold, just select it from the "Online services -> API" option when loading a model.

bucketcat Apr 15, 2023
Author

So it's not possible to set a constant output token and have text streaming like novelAi? That's odd. Any reason streaming affects the output tokens?

There's no "Online services > API" field in the UI for me.

Edit: Disabling streaming solved the segmentation of output into 8 token chunks! A bit sad that I have to disable streaming though, even delayed streaming would be okay. Would never have guessed that to be the cause.

There's still no Online services field in the setting however.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Still unable to produce a full output, even with context changes. 6000 seconds spent on processing context recursively, 1 second spend on generation of each 8 token chunk. #65

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 5 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Still unable to produce a full output, even with context changes. 6000 seconds spent on processing context recursively, 1 second spend on generation of each 8 token chunk. #65

bucketcat Apr 14, 2023

Replies: 2 comments · 5 replies

LostRuins Apr 15, 2023 Maintainer

bucketcat Apr 15, 2023 Author

LostRuins Apr 15, 2023 Maintainer

bucketcat Apr 15, 2023 Author

LostRuins Apr 15, 2023 Maintainer

bucketcat Apr 15, 2023 Author

LostRuins Apr 16, 2023 Maintainer

bucketcat
Apr 14, 2023

Replies: 2 comments 5 replies

LostRuins
Apr 15, 2023
Maintainer

bucketcat
Apr 15, 2023
Author

LostRuins Apr 15, 2023
Maintainer

bucketcat Apr 15, 2023
Author

LostRuins Apr 15, 2023
Maintainer

bucketcat Apr 15, 2023
Author

LostRuins Apr 16, 2023
Maintainer