Memory allocation #4665

Open
XiWeiGu opened this issue Apr 28, 2024 · 2 comments

@XiWeiGu
Contributor

XiWeiGu commented Apr 28, 2024

PR #4577

We introduced the adjust_thread_buffers() function, similar to the OpenMP code path, to initialize global thread buffers in place of the local buffers that were previously initialized in blas_thread_server.

In blas_thread_init, memory is allocated for blas_cpu_number threads using the adjust_thread_buffers interface. However, when calling interfaces like gemm, memory allocation is still performed in the main thread:

buffer = (XFLOAT *)blas_memory_alloc(0);

This would lead to an additional buffer being allocated, deviating from the logic of the code before the modification.
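
To make the accounting concrete, here is a minimal sketch of the buffer count described above. It is a model only, not OpenBLAS code: `peak_buffers` and its parameters are hypothetical names, and the only behaviour taken from this issue is that thread init sets up one buffer per thread and the interface call takes one more in the calling thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the buffer accounting discussed in this issue.
 * num_threads:      buffers set up at thread init (one per thread).
 * caller_allocates: nonzero if the interface (e.g. gemm) also takes a
 *                   buffer in the calling thread. */
static size_t peak_buffers(size_t num_threads, int caller_allocates)
{
    assert(num_threads > 0);
    return num_threads + (caller_allocates ? 1u : 0u);
}
```

With 16 threads and the extra allocation in the calling thread, the model gives 17 buffers; without that allocation it gives 16.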

@shivammonaka
Contributor

It appears this buffer is allocated as a precaution in case the execution follows the single-threaded path. However, if OpenBLAS determines that multi-threading is necessary, the exec_threads function (see source) utilizes the buffers already allocated during the adjust_thread_buffers phase.

In scenarios where the OpenBLAS call requires 16 buffers for execution, an additional buffer (making it 17) is unnecessarily allocated, resulting in the wastage of one buffer.

@martin-frbg Another concern is that OpenBLAS allocates a number of buffers equal to the maximum possible threads per BLAS call, which is generally the number of CPUs on the system. This allocation is static and often wastes significant memory, because many buffers remain unused during smaller BLAS calls.

Moreover, this fixed allocation strategy imposes a limitation on scalability, making it challenging to support a higher NUM_PARALLEL configuration efficiently. Could a more dynamic and adaptive buffer allocation method be explored to address these issues?
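
As one possible direction, a buffer could be allocated only when a thread index is first used. The sketch below is an illustration under assumptions, not a proposal for the actual OpenBLAS allocator: `lazy_pool`, `POOL_MAX`, and the functions are hypothetical, plain `malloc` stands in for the real allocation routine, and there is no locking, so it is not thread-safe as written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_MAX 64  /* hypothetical upper bound on thread slots */

/* Hypothetical lazily filled buffer pool: a slot is allocated on first use. */
typedef struct {
    void  *slot[POOL_MAX];
    size_t buf_size;   /* size of each buffer in bytes */
    size_t allocated;  /* number of slots currently backed by memory */
} lazy_pool;

static void lazy_pool_init(lazy_pool *p, size_t buf_size)
{
    assert(buf_size > 0);
    for (size_t i = 0; i < POOL_MAX; i++)
        p->slot[i] = NULL;
    p->buf_size  = buf_size;
    p->allocated = 0;
}

/* Returns the buffer for slot idx, allocating it on first use.
 * Returns NULL if idx is out of range or the allocation fails. */
static void *lazy_pool_get(lazy_pool *p, size_t idx)
{
    if (idx >= POOL_MAX)
        return NULL;
    if (p->slot[idx] == NULL) {
        p->slot[idx] = malloc(p->buf_size);
        if (p->slot[idx] != NULL)
            p->allocated++;
    }
    return p->slot[idx];
}

/* Frees every allocated slot and resets the pool. */
static void lazy_pool_destroy(lazy_pool *p)
{
    for (size_t i = 0; i < POOL_MAX; i++) {
        free(p->slot[i]);
        p->slot[i] = NULL;
    }
    p->allocated = 0;
}
```

In this model a call that uses only four threads leaves `allocated` at 4 instead of the full slot count. A real implementation would also need synchronization and a policy for releasing idle buffers.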

@XiWeiGu
Contributor Author

XiWeiGu commented Jan 17, 2025

> In scenarios where the OpenBLAS call requires 16 buffers for execution, an additional buffer (making it 17) is unnecessarily allocated, resulting in the wastage of one buffer.

Sometimes it's not just a matter of waste. #4662 fixed the HUGETLB_ALLOCATION option. When this option is enabled, even though OpenBLAS only requires 16 buffers, 17 buffers are allocated. As a result, the system's huge page count (/proc/sys/vm/nr_hugepages) must be set to 17 for it to work correctly, which becomes quite difficult to understand.
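
The relationship between buffer count and the required huge page count can be sketched as below. This is a rough model under assumptions: `required_hugepages` is a hypothetical helper, the buffer size and huge page size are parameters that depend on the build and the system, and each buffer is assumed to be backed by whole huge pages of its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of huge pages needed when each buffer is
 * backed by its own whole huge pages.
 * buffers:        number of buffers that get allocated
 * buffer_bytes:   size of one buffer (build dependent, assumed here)
 * hugepage_bytes: system huge page size (system dependent, assumed here) */
static size_t required_hugepages(size_t buffers,
                                 size_t buffer_bytes,
                                 size_t hugepage_bytes)
{
    assert(hugepage_bytes > 0);
    /* Round up: a partially used huge page is still consumed. */
    size_t pages_per_buffer =
        (buffer_bytes + hugepage_bytes - 1) / hugepage_bytes;
    return buffers * pages_per_buffer;
}
```

When one buffer fits in exactly one huge page, 17 buffers need 17 pages even though only 16 buffers are used, which matches the nr_hugepages value described above.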

When I tested Linpack with OpenBLAS on multiple NUMA nodes, enabling HUGETLB_ALLOCATION improved performance, but the extra buffer made it very hard to choose a correct value for the system's huge page count.
