Memory allocation #4665

XiWeiGu · 2024-04-28T06:48:43Z

We have introduced the adjust_thread_buffers() function, similar to OpenMP, which initializes global thread buffers in place of the local buffers previously initialized in blas_thread_server.

In blas_thread_init, memory is allocated for blas_cpu_number threads using the adjust_thread_buffers interface. However, when calling interfaces like gemm, memory allocation is still performed in the main thread:

buffer = (XFLOAT *)blas_memory_alloc(0);

This allocates one buffer more than the threads need, which deviates from the behavior of the code before the modification.
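To make the count concrete, here is a minimal toy model of the two allocation phases described above. It is only a sketch: every `toy_*` name and `TOY_BUFFER_BYTES` are hypothetical stand-ins, not OpenBLAS functions or constants, and the real allocator is considerably more involved.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_BUFFER_BYTES 4096  /* hypothetical size, not OpenBLAS's real buffer size */

static int toy_live_buffers = 0;  /* number of buffers currently allocated */

/* Hypothetical stand-in for blas_memory_alloc(): hands out one buffer. */
static void *toy_memory_alloc(void) {
  void *p = malloc(TOY_BUFFER_BYTES);
  assert(p != NULL);
  toy_live_buffers++;
  return p;
}

/* Hypothetical stand-in for the matching free routine. */
static void toy_memory_free(void *p) {
  free(p);
  toy_live_buffers--;
}

/* Returns the peak number of live buffers for one toy gemm call. */
static int toy_gemm_peak_buffers(int nthreads) {
  void **thread_bufs = calloc((size_t)nthreads, sizeof *thread_bufs);
  assert(thread_bufs != NULL);

  /* Init phase: one buffer per thread, as adjust_thread_buffers is described. */
  for (int i = 0; i < nthreads; i++) thread_bufs[i] = toy_memory_alloc();

  /* Interface phase: the main thread still allocates its own buffer. */
  void *buffer = toy_memory_alloc();
  int peak = toy_live_buffers;

  /* Clean up so the model can be called repeatedly. */
  toy_memory_free(buffer);
  for (int i = 0; i < nthreads; i++) toy_memory_free(thread_bufs[i]);
  free(thread_bufs);
  return peak;
}
```

In this model, `toy_gemm_peak_buffers(n)` is always `n + 1`: the per-thread buffers plus the one allocated at the interface level.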


shivammonaka · 2025-01-16T09:56:31Z

It appears this buffer is allocated as a precaution in case the execution follows the single-threaded path. However, if OpenBLAS determines that multi-threading is necessary, the exec_threads function (see source) utilizes the buffers already allocated during the adjust_thread_buffers phase.

In scenarios where the OpenBLAS call requires 16 buffers for execution, an additional buffer (making it 17) is unnecessarily allocated, resulting in the wastage of one buffer.

@martin-frbg Another concern is that OpenBLAS allocates a number of buffers equal to the maximum possible number of threads per BLAS call, which is generally the number of CPUs on the system. This approach is static and often wastes a significant amount of memory, as many buffers remain unused during smaller BLAS calls.

Moreover, this fixed allocation strategy imposes a limitation on scalability, making it challenging to support a higher NUM_PARALLEL configuration efficiently. Could a more dynamic and adaptive buffer allocation method be explored to address these issues?
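As one possible illustration of what "more dynamic" could mean, the sketch below allocates per-thread buffers lazily, only up to the thread count a call actually uses. This is an assumption-laden toy, not a proposal from the OpenBLAS code base: `toy_pool`, `toy_ensure_buffers`, `toy_release_buffers`, `TOY_MAX_THREADS`, and `TOY_BUFFER_BYTES` are all hypothetical, and thread safety is ignored.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_THREADS 16     /* hypothetical upper bound on threads per call */
#define TOY_BUFFER_BYTES 4096  /* hypothetical buffer size */

static void *toy_pool[TOY_MAX_THREADS];  /* slots stay NULL until first needed */
static int toy_pool_allocated = 0;       /* number of slots that hold a buffer */

/* Make sure the first `needed` slots hold a buffer.
   Returns the total number of buffers allocated so far. */
static int toy_ensure_buffers(int needed) {
  assert(needed >= 0 && needed <= TOY_MAX_THREADS);
  for (int i = 0; i < needed; i++) {
    if (toy_pool[i] == NULL) {
      toy_pool[i] = malloc(TOY_BUFFER_BYTES);
      assert(toy_pool[i] != NULL);
      toy_pool_allocated++;
    }
  }
  return toy_pool_allocated;
}

/* Free every buffer in the pool and reset the counter. */
static void toy_release_buffers(void) {
  for (int i = 0; i < TOY_MAX_THREADS; i++) {
    free(toy_pool[i]);  /* free(NULL) is a no-op */
    toy_pool[i] = NULL;
  }
  toy_pool_allocated = 0;
}
```

With this scheme a call that uses 4 threads leaves only 4 buffers allocated, and the pool grows only when a later call needs more. The trade-off is that allocation cost moves from initialization into the first larger call.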

XiWeiGu · 2025-01-17T01:30:57Z

> In scenarios where the OpenBLAS call requires 16 buffers for execution, an additional buffer (making it 17) is unnecessarily allocated, resulting in the wastage of one buffer.

Sometimes it's not just a matter of waste. #4662 fixed the HUGETLB_ALLOCATION option. When this option is enabled, even though OpenBLAS only requires 16 buffers, 17 buffers are allocated. As a result, the system's huge page count (/proc/sys/vm/nr_hugepages) must be set to 17 for it to work correctly, which becomes quite difficult to understand.

When I tested Linpack with OpenBLAS on multiple NUMA nodes, enabling HUGETLB_ALLOCATION improved performance, but it made configuring the system's huge page count a disaster.
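The arithmetic behind the huge page count can be sketched as follows. This assumes each buffer is mapped separately and rounded up to whole huge pages; the function name and the sizes in the usage note are hypothetical and chosen only so that one buffer occupies exactly one huge page, as in the 16-versus-17 example above.

```c
#include <assert.h>
#include <stddef.h>

/* Huge pages needed for `nbuffers` buffers, assuming each buffer is
   mapped on its own and rounded up to a whole number of huge pages. */
static size_t toy_required_hugepages(size_t nbuffers,
                                     size_t buffer_bytes,
                                     size_t hugepage_bytes) {
  assert(hugepage_bytes > 0);
  size_t pages_per_buffer = (buffer_bytes + hugepage_bytes - 1) / hugepage_bytes;
  return nbuffers * pages_per_buffer;
}
```

For 16 worker buffers plus the extra main-thread buffer, with buffer and huge page sizes assumed equal, `toy_required_hugepages(16 + 1, size, size)` gives 17 rather than the 16 one would expect from the thread count alone.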
