Replies: 1 comment
-
What ChatGPT misses is that this is much slower, but it is already what we do when you don't assign a layer to the GPU's memory, so I don't think the answer it gave you is a good one. We keep all the intermediate data on the GPU itself, since that is the fastest, and then put the parts of the model that didn't fit into shared (system) memory so they are still accessible from there. So it can already be done; it is exactly what happens when you put things in CPU memory.
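To make the idea concrete, here is a rough sketch in plain PyTorch (an illustration only, not KoboldAI's actual implementation): layers assigned to the GPU live in VRAM, the rest live in ordinary system memory, and the activations stay on the GPU for the whole forward pass.

```python
import torch
import torch.nn as nn

# Rough illustration only (not KoboldAI's actual code) of splitting a model
# between VRAM and system memory. Layers 0..gpu_layers-1 live in VRAM; the
# rest live in ordinary system RAM.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])
gpu_layers = 4  # comparable to the "GPU layers" setting in the UI

for i, layer in enumerate(layers):
    layer.to("cuda" if i < gpu_layers else "cpu")

def forward(x: torch.Tensor) -> torch.Tensor:
    x = x.to("cuda")  # intermediate data starts, and stays, on the GPU
    for i, layer in enumerate(layers):
        if i < gpu_layers:
            x = layer(x)  # weights already in VRAM: fast path
        else:
            # Weights live in system RAM, so they have to cross the PCIe bus
            # before this layer can run. That transfer is what makes the
            # offloaded part of the model so much slower.
            x = layer.to("cuda")(x)
            layer.to("cpu")
    return x
```

The exact mechanics differ, but the point is the same as above: the layers that don't fit in VRAM are still usable, they just sit behind a much slower path than the GPU's own memory.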
-
I will preface this by saying I know nothing about CUDA programming and little to nothing about Python or what makes KoboldAI tick; most of my knowledge comes from ChatGPT tutoring me. But I noticed something about my system, and it probably applies to others as well.
I have an RTX 3070. It has 8GB of VRAM, but, according to ChatGPT, the card also has an additional 16GB of "shared memory", giving a theoretical total of 24GB. According to ChatGPT, this memory is accessible and can be used to increase the performance of chatbots like KoboldAI.
Is it possible to modify KoboldAI to access this additional memory? If it is, performance (at least on my system) could be improved, since the available memory would triple.
(This of course assumes KoboldAI doesn't already do this. Whenever I get memory errors they seem to imply that 8GB is the maximum, so I am assuming it doesn't.)
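For what it's worth, here is the quick check I ran (a minimal sketch, assuming PyTorch with CUDA is installed, which I believe KoboldAI uses) of what CUDA itself reports for the card; it lines up with the memory errors I mentioned:

```python
import torch

# Ask CUDA how much memory it actually sees on the card. On an RTX 3070 this
# reports roughly 8 GB; the extra 16 GB of "shared GPU memory" that Windows
# lists on top of that is ordinary system RAM, not additional VRAM.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB of dedicated VRAM")
```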
A sample from ChatGPT regarding this idea: