-
Notifications
You must be signed in to change notification settings - Fork 13
3.3 atomic_ring: A template for a quasi lock free ring implementation
A ring is a particular container in which elements can be added at the head or at the tail end in a comparable time. The structure is circular, which means that when you reach the end of the ring, the next element starts again at the beginning of the structure.
The version I propose here is designed to run in threads with minimal use of locks.
I was inspired by "lock-free" programming to build this container. The term is rather misleading because when the need to extend memory to accommodate new elements arises, it is sometimes necessary to lock memory accesses. Nevertheless, if we want to implement a container capable of surviving in a multithreaded environment, we must implement the following two constraints a minimum:
-
First constraint: the increase in memory must be done per page, without the pages already allocated being destroyed. In this way, we protect ourselves against the risk of access to a destroyed memory area, as it is sometimes the case with standard containers in C++.
-
Second constraint: this ring structure can be used to store pointers, which introduces particularly tricky concurrency problems. For instance, if two threads replace the same pointer with another one, the operations must be correctly sequenced to avoid that an object is destroyed twice or worse never destroyed, at the risk of untimely crashes or memory leaks.
This ring container has been implemented as a C++ template with some constraints that can be modified if necessary. In particular, it can only include 65535 elements. In addition, some choices to ensure "lock-free" programming give a rather poor memory size/occupancy ratio in the worst case scenario. It should be noted that a ring is used to place objects at the head or at the tail end, rarely to position them in a specific location. From this point of view, this structure ensures an equivalent amount of time to add new elements at the front or at the back. In addition, when operations are reduced to "push_front" or "push_back", memory usage is generally optimal.
We have mentioned above that one of the solutions to implement a "lock-free" container is to manage the memory in the form of pages. In this case, this pagination is based on a fixed table size of 256 elements. Each element is itself a table of 256 elements. At the beginning, this principal table is empty except for the first and last element.
template <class Z> class atomic_ring {
public:
//head is our main table
atomic_ring_element<Z>* head[256];
//the iterator to add at the tail or at the end of the ring
std::atomic<unsigned short> first;
std::atomic<unsigned short> last;
//a lock only used when the main table must be enlarged
std::recursive_mutex _lock;
atomic_ring() {
for (long i = 1; i < 255; i++)
head[i] = NULL;
//We create two initial elements at the first and last position
head[0] = new atomic_ring_element<Z>(NULL);
head[255] = new atomic_ring_element<Z>(NULL);
last = 0;
first = 0;
}
As can be seen from the example above, this template is actually very simple. The initial table of 256 elements contains elements of the type atomic_ring_element to which we will come back later.
At the first initialization, the table is empty except for the first and last element for which an atomic_ring_element is created.
First of all, first and last are unsigned short whose maximum value is 65535. This choice has the effect that if last is 65535, then last+1
will be 0, hence the ring effect thus obtained.
It should also be noted that they are implemented as atomic variables. This ensures that their values can be changed without a lock.
The only case where locks are necessary is when you try to place an element in a head location that is NULL.
This case is managed by the following method:
inline void addnewelement(unsigned char nb) {
if (head[nb] == NULL) {
_lock.lock();
if (head[nb] == NULL)
head[nb] = new atomic_ring_element<Z>(zero);
_lock.unlock();
}
}
Note the double test on head[nb]. Indeed, only one thread at a time can access the critical section. Once the element is created, the second test will allow waiting threads to skip this section since an atomic_ring_element is now available at this location.
The position of the last element added at the front is recorded by first while the position of the last element added at the back is recorded by last. When we add at the front, we decrement first, when we add at the back we increment last. Thus, the addition of a new element at the back or the front is totally parallel.
Here is for example the version: push_front that adds an element:
Z push_front(Z val) {
unsigned short pos = --first;
uchar nb = pos >> 8;
uchar i = pos - (nb << 8);
addnewelement(nb);
return head[nb]->replace(val,i);
}
The calculation to know where to store the element is very simple. We divide by 256 to know in which head location is located our subtable. Then we calculate the remainder to know the position in this subtable. addnewelement will check if the box marked nb requires the creation of a new element atomic_ring_element.
The fact that first is atomic ensures that each thread will have access to its own value.
The reverse version: push_back applies the same thing to last, except that last is incremented.
...
unsigned short pos = last++;
...
Note also the call to replace at the end of the method, which exchanges between the value stored in the subtable and val. We will come back to the why of this method.
You can remove an element at the back or at the front: remove_back and remove_front. These methods also return the value that is being removed. Indeed, if it is a pointer to an object, it may be necessary to destroy it.
Z remove_back() {
unsigned short pos = --last;
uchar nb = pos >> 8;
uchar i = pos - (nb << 8);
return head[nb]->replace(NULL, i);
}
Note the call of the replace method. replace returns the value pointed to by i and replaces it with NULL. This method is fundamental because it is the way to solve concurrent access problems.
Let's now look at atomic_ring_element to understand why.
As mentioned above, atomic_ring_element is also a table of 256 elements: 256x256 = 65536, but with a touch of surprise:
template <class Z> class atomic_ring_element {
public:
std::atomic<Z> vecteur[256];
...
Indeed, the table vecteur contains atomic elements. This type allows you to manage concurrent accesses on the condition that you use a particular std::atomic method: exchange. This method allows you to save a new value while returning the old value as the result. Being atomic, this operation guarantees that calls from different threads will be made one after the other.
This is what the famous replace method looks like:
Z replace(Z v, unsigned char i) {
return vecteur[i].exchange(v);
}
This replace method that we saw earlier in remove_back allows us to guarantee that no elements will be lost.
Access to the head or tail end is also very simple:
Z back() {
unsigned short pos = last-1;
uchar nb = pos >> 8;
uchar i = pos - (nb << 8);
return head[nb]->vecteur[i];
}
The back element is situated in last-1 while the front element is in first+1.
Here is an example of how to use this template:
//The parameter NULL is the default value stored in the subtables
//You can choose what you want
atomic_ring<char*> test(NULL);
char* c = new char[10];
strcpy (c, "test");
char* r = test.push_back(c);
if (r != NULL) //If a value has been exchanged with an already existing one
delete[] r; //it is destroyed. NULL is the default value provided above
The description of this template shows how to implement a structure as rich as a ring in a very simple way by limiting the use of locks as much as possible.
In our ring, we do not need locks to access or remove an element. In the case of adding elements, the lock is finally used only every 256 operations. Actually, when the main table is full, the locks become useless. Obviously, compared to a vector, this structure is slightly less efficient and a little more resource-intensive. On the other hand, its very low use of locks makes it remarkably effective in a multithreaded context.
The implementation is available here: atomic_ring