That's the problem: there is no way to push the "oldest" tokens out of the context without having to recalculate everything after that point. Everything we have is just a workaround for it. You did try running with …
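To make that concrete, here is a toy sketch of the constraint (an illustration only, not koboldcpp's actual code): the cached work can only be reused for an unchanged prefix, so appending new text is cheap, but sliding the window out from under the cache is not.

```python
# Hypothetical sketch: why a sliding context window forces a full re-evaluation
# with a plain causal-attention cache. Cached keys/values are tied to token
# positions, so dropping the oldest tokens shifts every remaining token and
# invalidates the cache from the first changed position onward.

def tokens_to_reevaluate(cached_prompt: list[int], new_prompt: list[int]) -> int:
    """Return how many tokens must be run through the model again,
    assuming the cache can only be reused for an unchanged prefix."""
    common = 0
    for a, b in zip(cached_prompt, new_prompt):
        if a != b:
            break
        common += 1
    return len(new_prompt) - common

# Appending text keeps the prefix intact -> only the new tokens are evaluated:
print(tokens_to_reevaluate([1, 2, 3, 4], [1, 2, 3, 4, 5, 6]))   # 2

# Sliding the window (dropping the oldest token) changes the prefix at
# position 0, so everything after that point has to be recomputed:
print(tokens_to_reevaluate([1, 2, 3, 4], [2, 3, 4, 5, 6]))      # 5
```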
---
I did try smart context. It made no difference at all. I also tried reducing the context in increments of 512 without smart context, but even with a context limit of 512 tokens it's still unusable, because it can't produce the entire output at once. The main issue is that the output is processed in 8-token chunks, so the 4x512 context processing (or 1x512 with a 512-token context limit) is done sixteen times for a 128-token output (eight times for a 64-token output), once per 8 tokens. Every now and then it manages to output several of these 8-token chunks in a row after processing the context, but usually only for the first 32 tokens and only once after loading the model; then it goes back to re-processing the context once every 8 tokens instead of once per generation. If it only processed the context once per generation, it would be fine, but currently it has to process it up to sixteen times for a 128-token output, meaning each 8 tokens takes 240 seconds (60 seconds per 512 tokens of context), and the entire output ends up taking 16x240 seconds, apart from the first generation after a model is loaded.

Edit: Also, I've noticed that the System Info doesn't actually mention the threads used at all, which it should. I don't know if this is just a difference in verbosity compared to llama.cpp's system info, or if `--threads` isn't being parsed properly.

Edit 2: Smart context is removing 8 tokens from the allowance per 8-token chunk, which is to be expected given the circumstances. The problem is the parameter `max_length`, which I assume is `n_batch`. I'll try setting it to half of the output tokens to see if that changes things so that the context is only processed twice per output.

Input:

```
{"n": 1, "max_context_length": 2048, "max_length": 8, "rep_pen": 1.08, "temperature": 0.62, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6],
```

Edit 3: Truncated, but this is the runtime output during a generation with smart context; it gets triggered at random but still only results in a marginal improvement:

```
Processing Prompt [BLAS] (512 / 2024 tokens)
[New Smart Context Triggered! Buffered Token Allowance: 1020]
[Reusing Smart Context: 1008 allowance remaining]
[Reusing Smart Context: 996 allowance remaining]
[Reusing Smart Context: 988 allowance remaining]
Processing Prompt [BLAS] (512 / 2024 tokens)
Processing Prompt [BLAS] (512 / 2024 tokens)
Processing Prompt [BLAS] (512 / 2024 tokens)
```

It's a bit hit or miss, but it does increase the chance of getting more than 8 tokens of output; it just keeps reprocessing the context a lot. With smart context it seems to manage around 4x8 tokens of output before the context has to be reprocessed. I didn't manage to change `n_batch`.
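For reference, this is my understanding of what the frontend is effectively doing with `max_length: 8` (a hedged sketch: the local URL, payload fields, and response shape are just the KoboldAI-style ones from my request above, so treat them as assumptions, not a description of the actual frontend code):

```python
# Sketch of a frontend that generates in small chunks: every chunk re-sends
# the full prompt, so the backend re-evaluates it once per chunk.
import requests  # third-party: pip install requests

API_URL = "http://localhost:5001/api/v1/generate"   # assumed local server

def generate(prompt: str, total_tokens: int = 128, chunk: int = 8) -> str:
    output = ""
    for _ in range(total_tokens // chunk):
        payload = {
            "prompt": prompt + output,      # full context goes out on every call
            "max_context_length": 2048,
            "max_length": chunk,            # tokens generated per call
        }
        r = requests.post(API_URL, json=payload, timeout=600)
        r.raise_for_status()
        output += r.json()["results"][0]["text"]
    return output

# One call with max_length=128 would send the prompt once; sixteen calls with
# max_length=8 send (and re-process) it sixteen times.
```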
---
Currently, it's managing a whopping 64 tokens at most, often 32, before re-processing the entire context, which takes ages, even though the output itself takes one second. 1.1.6, 1.17, 1.7.1 all have the same issue. 1.1.6 re-read the entire context after each 8-token output; now it manages to pump out 8x4 tokens the first time, before returning to one 8-token output and then re-reading the entire context in those god damn 512-token chunks.
Why is the output even segmented into 8-token pieces before "re-reading" the context and continuing? It should be able to produce a 128-token output based on 2048 tokens of context; a higher output count will obviously be less coherent, but that's a fine trade-off when you can manually edit and remove parts quickly. The worst part is that it segments reading the context into 512-token chunks, meaning that it spends 60 seconds per 512 tokens to produce a fraction of an output, then repeats. That's 240 seconds for 8 tokens while my hardware is idle (--threads 14, somehow just as broken as ffmpeg multithreading).
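To spell out the arithmetic with those numbers (60 seconds per 512-token batch, 2048 tokens of context, 8 tokens per chunk; these are my observed figures, not a benchmark):

```python
# Back-of-the-envelope cost of re-processing the prompt per 8-token chunk,
# using the figures quoted above (assumptions, not fresh measurements).
import math

context_tokens   = 2048     # prompt size
blas_batch       = 512      # tokens processed per BLAS batch
secs_per_batch   = 60       # observed time per 512-token batch
output_tokens    = 128      # desired completion length
tokens_per_chunk = 8        # tokens generated before the prompt is re-read

batches_per_pass = math.ceil(context_tokens / blas_batch)        # 4
passes           = math.ceil(output_tokens / tokens_per_chunk)   # 16

seconds_if_reprocessed_every_chunk = passes * batches_per_pass * secs_per_batch
seconds_if_processed_once          = batches_per_pass * secs_per_batch

print(seconds_if_reprocessed_every_chunk)  # 3840 s (~64 minutes)
print(seconds_if_processed_once)           # 240 s, plus ~1 s of generation
```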
There's some real funky stuff going on; the context is still not handled or stored properly, since it should be cached and only the oldest tokens should be pushed out and replaced with the previous output and any other user modifications (assuming no keyword-triggered world info is used).
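For what it's worth, the smart context log lines above suggest an allowance scheme along these lines (my own rough approximation inferred from the log messages, not the actual implementation):

```python
# Rough approximation of the behaviour implied by the "Smart Context" logs:
# about half the maximum context is kept free as an "allowance", new tokens
# consume it cheaply, and only when it runs out is the whole prompt
# re-evaluated (the expensive BLAS passes).

class SmartContext:
    def __init__(self, max_context: int = 2048):
        self.max_context = max_context
        self.allowance = 0          # tokens that can be appended cheaply

    def feed(self, new_tokens: int) -> str:
        if new_tokens <= self.allowance:
            self.allowance -= new_tokens
            return f"[Reusing Smart Context: {self.allowance} allowance remaining]"
        # Allowance exhausted: full prompt re-evaluation, after which
        # roughly half the context is free again.
        self.allowance = self.max_context // 2 - new_tokens
        return f"[New Smart Context Triggered! Buffered Token Allowance: {self.allowance}]"

ctx = SmartContext()
for _ in range(5):
    print(ctx.feed(8))   # 8-token chunks, as in the logs above
```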
The performance is great, for the first output. But at 2048 context, writing a 128-token output takes 4,000-6,000 seconds while my hardware is sitting at 40% CPU, 50% memory, 35% GPU (with CLBlast), essentially idling. There's also no way to specify higher memory usage; it's evaluated when loading the model and is static, meaning that 40-50% of my memory is not utilized, which could be used for caching the context.
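For comparison, the evaluated context itself (the key/value cache) needs relatively little memory. A rough estimate, assuming a 7B LLaMA-style model and a 16-bit cache (adjust the numbers for your model):

```python
# Rough KV-cache size estimate, to show that storing the evaluated context is
# cheap compared with recomputing it. Assumptions: 32 layers, 4096-dim
# embeddings (7B LLaMA-style), 2 bytes per value (fp16).

n_layers        = 32
n_embd          = 4096
bytes_per_value = 2          # fp16
n_tokens        = 2048

kv_bytes = 2 * n_layers * n_embd * bytes_per_value * n_tokens   # K and V
print(f"{kv_bytes / 1024**3:.2f} GiB")   # ~1 GiB for the whole 2048-token cache
```

So storing 2048 tokens isn't the hard part; the time goes into recomputing that cache.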
Is this a recent issue? I have the issue with both OpenBLAS and CLBlast.
Is there a way to use more of my memory, and to ensure that multithreading actually works? (It does work for generation, which takes less than a second, but 99% of the time is spent on reading the context, and all my hardware is idle during that period.)
Any clue what's going on? Why is the context simply not stored in memory during the entire output?
EDIT: Is there any way to just drop BLAS/CLBlast and use RAM caching, since I have enough memory to spare to store 2048 tokens? Alternatively, using page files and disk caching? It's absurd at the moment.