Concurrent atomic operations in C++ and Raku

Here’s a problem to solve: you have two threads, each incrementing the same single counter N times. What is the state of the counter at the end of the program?

A straightforward solution

A naïve C++ program can be written using the standard library threads like this:

#include <iostream>
#include <thread>

int counter;

void f() {
    for (int c = 0; c != 100000; c++) counter++;
}

int main() {
    std::thread thread_a {f};
    std::thread thread_b {f};

    thread_a.join();
    thread_b.join();

    std::cout << counter << std::endl;
}

The two threads, thread_a and thread_b, run the same function f, which increments the counter in a loop. The threads run concurrently, and we expect them to update the counter so that it reaches double the number of loop cycles.

At this point, you face a few surprises. The first one is rather expected: counter++ is not a single atomic action but a read-modify-write sequence, so the two threads sometimes overwrite each other's updates. The program behaves like a random number generator, printing a number that is never less than 100,000 but never reaches 200,000, staying very close to the lower bound:

$ ./a.out 
101640

$ ./a.out 
118064

$ ./a.out 
106088

$ ./a.out 
103100

$ ./a.out 
108400

The second surprise is that this was the result I got on macOS using the g++ compiler. The same compiler on Ubuntu gives an executable that prints exactly 200,000:

$ ./a.out 
200000

$ ./a.out 
200000

$ ./a.out 
200000

If you got similar output, try increasing the overall load on your computer, and most likely you will get different numbers again:

$ ./a.out 
200000

$ ./a.out 
127976

$ ./a.out 
135940

Notice that on Linux you have to compile the code with the -pthread flag (the p stands for POSIX):

g++ -pthread -std=c++11 test.cpp 

What’s going on under Windows?

Using mutexes

OK, that was a small detour, but the main point is that the original program is not thread-safe yet. We have to make sure that only a single thread updates the counter at any given moment.

First, let us try mutexes:

. . .
#include <mutex>

int counter;
std::mutex mu;

void f() {
    for (int c = 0; c != 100000; c++) {
        mu.lock();
        counter++;
        mu.unlock();
    }
}

. . .

On each iteration, the mutex is locked to ensure that the counter is not read or updated by another thread at the same time. Did it help? Well, it depends on which answer you accept.

The program prints the correct number, 200,000, but it is unbelievably slow:

$ time ./a.out 
200000
 
real 0m0.556s

Obviously, locking and unlocking a mutex on each iteration adds significant overhead. You can try moving it out of the loop:

void f() {
    mu.lock();
    for (int c = 0; c != 100000; c++) {
        counter++;
    }
    mu.unlock();
}

It is fast and correct now, but the logic is broken. We did not intend to run all 100,000 iterations without interruption. Now the second thread starts incrementing the counter only after the first thread has completely finished its job.

Another approach with mutexes gives the same slow result: a RAII wrapper, std::unique_lock, which unlocks the mutex automatically at the end of the scope:

int counter;
std::mutex mu;

void f() {
    for (int c = 0; c != 100000; c++) {
        std::unique_lock<std::mutex> lock {mu};
        counter++;
    }
}

A correct but slow answer again. We need something more radical.

Atomic operations in Raku

The Raku programming language, the recent new name of Perl 6, provides atomic operations that you can apply to special atomic integers. I refer you to the documentation on atomicint and the atomic operators for the details; here is the corresponding program in Raku:

my atomicint $counter = 0;

sub f() {
    $counter⚛++ for ^100_000;
}

await (start { f }), (start { f });

say $counter;

The atomic behaviour is enforced by two elements: the atomicint modifier in the variable declaration and the Unicode atom symbol in the increment operator, ⚛++. Without those, the result of the program is not predictable: as before, it can be any number between 100,000 and 200,000.

As soon as we start working with the atomic variable, the result is correct, and the program is quite fast:

$ time raku test.raku 
200000
 
real 0m0.303s

But wait, how can interpreted Raku be faster than compiled C++? To make it even more awkward, let us exclude the startup and compilation time and measure only the relevant part of the runtime:

. . .
my $t0 = now;

await (start { f }), (start { f });

say now - $t0;
. . .

It is now about four times faster than the C++ program with mutexes:

$ raku 168.raku 
0.13648206

What is wrong here? The next section gives you the answer.

The true C++ atomic

Atomic operations are the answer in C++ too. Modern C++ gives you a very handy way to make a counter atomic.

All you need is to instantiate the atomic template:

std::atomic<int> counter;

You can also do it using a predefined type synonym:

std::atomic_int counter;

Just for reference, here is the whole program again. It is almost the same as the first naïve solution, with the addition of the <atomic> header and the atomic counter:

#include <iostream>
#include <thread>
#include <atomic>

std::atomic<int> counter;

void f() {
    for (int c = 0; c != 100000; c++) {
        counter++;
    }
}

int main() {
    std::thread thread_a {f};
    std::thread thread_b {f};

    thread_a.join();
    thread_b.join();

    std::cout << counter << std::endl;
}

Whatever platform you use to compile and run this program, you get an instant correct result:

$ time ./a.out 
200000
 
real 0m0.008s

To check whether the generated code is lock-free on your hardware, use the is_lock_free method (it prints 1 in my case):

std::cout << counter.is_lock_free() << std::endl;

Hurray! Three goals have been achieved:

  • The code is thread-safe
  • It is very fast
  • It is correct

Congratulations!
