BOLT tasks and ABT_cond #51

Open
devreal opened this issue Apr 2, 2020 · 4 comments

@devreal

devreal commented Apr 2, 2020

I am trying to use low-level Argobots features inside BOLT tasks (BOLT 1.0rc3, built with internal Argobots). In particular, I would like to block a set of tasks on a condition variable and eventually unblock them from a different task, as in this example:

#include <abt.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int n = 10;
#pragma omp parallel
  {
#pragma omp master
    {
      int blocked = 0;  // number of tasks that are about to wait on cond
      ABT_mutex mtx;
      ABT_cond cond;
      ABT_mutex_create(&mtx);
      ABT_cond_create(&cond);

      for (int i = 0; i < n; ++i) {
        printf("Discovering task %d\n", i);
#pragma omp task shared(mtx, cond, blocked)
        {
          printf("Task %d blocking\n", i);
          ABT_mutex_lock(mtx);
          blocked++;
          ABT_cond_wait(cond, mtx);
          ABT_mutex_unlock(mtx);
        }
      }

#pragma omp task shared(cond, mtx, blocked)
      {
        printf("Broadcast task starting\n");
        // spin until every waiting task has incremented the counter
        while (n != blocked) {
          ABT_thread_yield();
        }
        // mutex required to ensure all tasks entered cond
        ABT_mutex_lock(mtx);
        printf("Broadcast task broadcasting\n");
        ABT_cond_broadcast(cond);
        ABT_mutex_unlock(mtx);
      }

#pragma omp taskwait
    }
  }
  return 0;
}

What I see is that all tasks are created and only the first task starts executing. Output:

$ ./test_bolt_abt_cond
Discovering task 0
Discovering task 1
Discovering task 2
Discovering task 3
Discovering task 4
Discovering task 5
Discovering task 6
Discovering task 7
Discovering task 8
Discovering task 9
Task 0 blocking

Any idea why only the first task is executing? Are the other runnable tasks not passed to Argobots? Do I need to set some environment variables to make this work?

@shintaro-iwasaki
Collaborator

Thanks for reporting this issue! I tested it in my environment, and it seems that there is a bug in BOLT's tasking logic (tasks are not parallelized). I will fix it this weekend (as well as #49).

@devreal
Author

devreal commented Apr 2, 2020

Thanks for looking into this. I'll be happy to give it a try as soon as you have a fix ready :)

@devreal
Author

devreal commented Apr 23, 2020

I tested with current master (4e6a8a4) but the problem persists. #47 did not fix it.

shintaro-iwasaki added a commit to shintaro-iwasaki/bolt that referenced this issue May 1, 2020
This bug was reported in pmodels#51, which has
been already fixed.  This test checks if this bug has been fixed.

Acknowledgment: the original code of this test has been created by Joseph
Schuchart ([email protected]).  Thank you.
shintaro-iwasaki added a commit to shintaro-iwasaki/bolt that referenced this issue May 2, 2020
This bug was reported in pmodels#51, which has
been already fixed.  This test checks if this bug has been really fixed.

Acknowledgment: the original code of this test has been created by Joseph
Schuchart ([email protected]).  Thank you.
@shintaro-iwasaki
Collaborator

Originally BOLT had a few tasking bugs, which I hope have been fixed in several PRs. I also added tests to make sure OpenMP tasks and OpenMP threads are scheduled in parallel. Thank you very much for reporting this issue!

"Correct" but nonintuitive behavior

It now works "correctly" (as I understand it). I finally found that the current BOLT design cannot run your program to completion by default, because Argobots blocking calls block OpenMP tasks together with the OpenMP threads that run them. I used Clang 10.0 in the following experiments, but any recent compiler should be fine. I am not sure whether an old GCC (e.g., GCC 4.x) works.

First, this is how it behaves on my four-core laptop.

$ # By default, the following is equivalent to KMP_ABT_NUM_ESS=4 OMP_NUM_THREADS=4 ./a.out 
$ ./a.out 
Discovering task 0
Discovering task 1
Discovering task 2
Task 0 blocking
Discovering task 3
Task 1 blocking
Discovering task 4
Discovering task 5
Discovering task 6
Task 2 blocking
Discovering task 7
Discovering task 8
Discovering task 9
Task 3 blocking
(hang)

Because OMP_NUM_THREADS=4, only four OpenMP tasks are executed. Since all four OpenMP threads are blocked inside the tasks they picked up, the remaining tasks are never scheduled.

There is a design issue in BOLT. At present, on Argobots blocking calls (e.g., ABT_cond_wait()), BOLT blocks the "underlying OpenMP threads" as well as the "currently running OpenMP tasks". This is because, unlike #pragma omp taskyield, the Argobots yield call (ABT_thread_yield(), which is executed in ABT_cond_wait()) does not release the mapping between OpenMP tasks and OpenMP threads in BOLT. Such management is needed, for example, to schedule only four tasks at a time in the above case (since there are only four OpenMP threads).
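The consequence for the reproducer can be sketched with a toy model. This is not BOLT code: the function names are hypothetical, and the model only assumes what is described above, namely that a task blocked in an Argobots call (or spinning on ABT_thread_yield()) keeps its OpenMP thread occupied.

```c
#include <assert.h>

/* Toy model (not BOLT code): a task that blocks in an Argobots call keeps
 * its OpenMP thread occupied until it is woken up. */

/* Number of tasks that can start before every OpenMP thread is occupied. */
int tasks_started(int num_threads, int num_tasks)
{
    assert(num_threads > 0 && num_tasks >= 0);
    return num_tasks < num_threads ? num_tasks : num_threads;
}

/* The reproducer creates n waiting tasks plus one broadcast task. Under the
 * model it can finish only if all n + 1 tasks hold a thread at once. */
int reproducer_completes(int num_threads, int n)
{
    return tasks_started(num_threads, n + 1) == n + 1;
}
```

Under this model, reproducer_completes(4, 10) is false, which matches the hang shown above, and the smallest team size that lets the reproducer finish is n + 1.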

The fundamental reason is that BOLT maps both OpenMP threads and OpenMP tasks to Argobots threads (let's say ULTs). If an OpenMP task (=ULT) calls ABT_thread_yield(), a natural expectation is that the task yields control to its parent OpenMP thread. In fact, control goes back to the parent Argobots scheduler (!= an OpenMP thread), since the parent OpenMP thread is also a ULT and is scheduled by an Argobots scheduler. To manage the OpenMP thread-task mapping, #pragma omp taskyield and internal __kmp functions explicitly handle this mapping "in BOLT".

Runtime-level solutions

(I list a few options, but none of them are available now.)

  1. Make BOLT-aware Argobots synchronization calls
    Just create BOLT_cond_wait instead of ABT_cond_wait. This lowers interoperability and maintainability, so I don't like it.

  2. Map "OpenMP threads" to "Argobots schedulers"
    This is the fundamental solution, but it requires a significant change in Argobots. There are a few slightly different ways to implement it: (a) make ABT_thread_yield() return to a parent "ULT" rather than a parent scheduler, either by hooking ABT_thread_yield() or by changing the ABT_thread_yield() definition; (b) create a "scheduler's scheduler" and let it schedule lightweight schedulers; and so on. In any case, it cannot be implemented soon. However, since this weird thread-task mapping management degrades the OpenMP tasking performance of BOLT, it will and should be fixed in the future (although perhaps not in the very near future).

  3. Ignore thread-task mapping
    If we allowed independent OpenMP threads that do not belong to a team but can still execute certain OpenMP tasks, this issue would be solved. Unfortunately, the current OpenMP specification and implementation do not allow this.

User-level solutions

Regardless of the number of Argobots schedulers (which, in the current implementation, equals the number of Pthreads), providing enough executors (i.e., OpenMP threads) is the easiest solution.

$ # On my laptop, equivalent to KMP_ABT_NUM_ESS=4 OMP_NUM_THREADS=11 ./a.out 
$ OMP_NUM_THREADS=11 ./a.out 
Discovering task 0
Discovering task 1
Discovering task 2
Task 0 blocking
Discovering task 3
Discovering task 4
Discovering task 5
Discovering task 6
Discovering task 7
Task 3 blocking
Task 4 blocking
Task 5 blocking
Task 6 blocking
Task 7 blocking
Task 1 blocking
Task 2 blocking
Discovering task 8
Discovering task 9
Task 8 blocking
Task 9 blocking
Broadcast task starting
Broadcast task broadcasting
$ 
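The same thread count could presumably also be requested from the source with the standard OpenMP num_threads clause. The sketch below is an assumption rather than something tested in this thread: only OMP_NUM_THREADS=11 was verified above, and the helper names threads_needed and team_size are hypothetical.

```c
#include <assert.h>

/* Threads the reproducer needs under the behavior described above:
 * one per waiting task plus one for the broadcast task. */
int threads_needed(int num_waiting_tasks)
{
    assert(num_waiting_tasks >= 0);
    return num_waiting_tasks + 1;
}

/* Opens a parallel region that requests `requested` threads and returns the
 * team size the runtime actually provided. When compiled without OpenMP
 * support, the pragmas are ignored and the result is 1. */
int team_size(int requested)
{
    int size = 0;
#pragma omp parallel num_threads(requested)
    {
#pragma omp atomic
        size++;
    }
    return size;
}
```

In the reproducer, this would amount to writing `#pragma omp parallel num_threads(n + 1)`. The runtime may grant fewer threads than requested, so checking the actual team size (as team_size() does) before relying on it seems prudent.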

In practice, the threading performance of BOLT is not bad, so another option is to use OpenMP threads instead of OpenMP tasks.

Anyway, thank you very much for raising such an insightful question! Thanks to it, we were able to find a few bugs, add scheduling tests, and recognize this design issue in BOLT.
