Concurrency: Testing isn’t good enough

Exploitation of multi-core hardware is essential for achieving performance. But developing software that exploits such hardware is difficult and risky. The fact that concurrent software is inherently non-deterministic means that reliance on testing for software quality assurance is inappropriate and dangerous. Use of static analysis to discover and diagnose concurrency defects is a cost-effective solution.

Key points

  • The results produced by concurrent software depend on uncontrollable aspects of thread execution
  • Lack of repeatability makes testing a fundamentally unreliable QA approach for concurrent software
  • Static analysis does not suffer from the flaws of testing since it can take all thread execution pathways and deployment scenarios into account
  • Contemplate’s ThreadSafe static analysis tool reduces risk by discovering and diagnosing concurrency defects in Java

Introduction

Modern multicore processors, in which several possibly unrelated computations can be performed at the same time, provide a rich and powerful foundation for building high performance systems. The features of present and future hardware that enable highly concurrent software must be fully exploited to attain the high throughput and low latency required of high-performance software systems.

Exploiting the benefits of multicore hardware has proven to be difficult and risky. Writing software that correctly and safely makes use of concurrency requires careful thought to account for the effects of running in a concurrent environment.

The increasing use of concurrency poses a fundamentally new challenge to software quality assurance processes. The way that threads are scheduled and the order in which events in different threads occur is unpredictable and can differ each time a concurrent program runs. Software that incorrectly accounts for concurrency can therefore contain intermittent defects that elude even the most rigorous testing processes, because the outcome is non-deterministic: errors are not repeatable. Since repeatability is the corner-stone of QA approaches based on testing, a new approach is needed.

Non-deterministic results

A program is deterministic if it always produces the same output, via the same computation steps, for any given input. Most non-concurrent programs are deterministic. Programs whose output depends on the time (or the contents of a web page, entries in a database, etc.) can be regarded as deterministic, where time (or web page contents, database contents, etc.) is an additional input.

Concurrent programs are inherently non-deterministic, with the order of and interaction between events depending on — among other things — the exact scheduling of threads as well as whether and at what points thread execution is pre-empted by some other activity. This fundamental non-determinism is the main reason why concurrent programming is so hard.

A very simple example that illustrates the problem is given by a program having two threads with a shared variable x, where one thread writes 1 to x and the other thread writes 2 to x:

Image: race between two threadsWhether the final value of x is 1 or 2 depends on which write takes place last. (In fact it is more complicated than that, because changes in per-core hardware caches are not immediately propagated to main memory, so the speed of propagation is another uncontrollable factor that influences the result.)

This is an example of a race condition: there is a “race” between the two threads, with the final value of x depending on which thread “wins” the race.

It is easy to understand very simple non-deterministic programs like the one above. The difficulty comes with more complicated programs. To work out the result of running two concurrent threads, we need to consider the results of all of the possible interleavings of events from the two threads. For two threads with one event each, there are two possible interleavings; for two threads with 10 events each, there are 184756 interleavings, and each interleaving may produce a different result.

Under normal operation, there is no way to control which interleaving is taken. So even very extensive testing will at best give a false sense of security: there is no way of knowing which interleavings have been tested and so no way of being sure which untested interleavings will lead to disastrous failures.

Place your bets

As everybody knows, rolling a pair of dice produces non-deterministic results: the outcome depends on the exact way that the dice hit the table and each other before they come to rest, the force with which they were thrown, and their initial orientation. Throw the dice twice and the outcome is unlikely to be the same. The same goes for the result of spinning a roulette wheel. Gamblers would obviously prefer the outcome to be deterministic: just wait to see the outcome once, and then bet on the outcome being the same next time. Casino owners rely on non-determinism: their profit depends on the fact that outcomes are unpredictable.

Now, drawing an analogy with concurrent programs, you — the one who cares about the result of the computation — are the gambler!

Image: a pair of 6-sided diceImagine testing that a pair of standard 6-sided dice never produce a result of 12. Of course, that’s not true! But since there is a 1:36 chance of a given roll producing a result of 12, 18 trials are required on average before 12 will appear. One test is not enough. Ten tests are probably not enough. 40 tests might not be enough, if you’re unlucky.

Image: 100-sided dieNow imagine testing that a pair of Zocchihedrons (100-sided dice) never produce a result of 200. Again, that’s not true. But on average, 5000 trials are required before 200 will appear. One test is obviously not enough. 1000 tests are not enough. Even 20000 tests might not be enough, if you’re unlucky.

Back to our analogy with concurrent code. Under the assumption that the developers of a concurrent system are competent and careful, we might expect that most thread interleavings produce correct results. A wrong result — leading perhaps to deadlock or crash — might be produced only rarely, when some unusual thread interleaving is encountered. This is like dealing with 100-sided or 1000-sided dice, not 6-sided dice. Place your bets!

Testing isn’t good enough anymore

Making careful use of synchronisation and other concurrency mechanisms so as to produce deterministic results is vital to keeping risks under control. Unfortunately, over-use of synchronisation can have a very serious impact on performance, and in the worse case can lead to deadlock. It’s easy to get it wrong — there are many pitfalls — and you can’t test whether you’ve got it right! Getting the same result 100 times in a row, or 1000 times, might make a program look deterministic, but there’s no guarantee that the result will be the same on the next attempt.

With non-deterministic programs, correct results under test don’t guarantee correct results in deployment. Correct results during the first week or month of deployment don’t guarantee correct results in the future: the particular unlucky interleaving of threads that leads to catastrophic data corruption might be a one-in-a-million chance or even less likely. The fact that the problematic case only happens once in a blue moon means that it has probably completely escaped testing until it happens one day in production. If you’re lucky, the result will be relatively benign. But if you’re not . . .

Correct results during years of single-core deployment don’t guarantee correct results when the hardware is upgraded to multi-core. Although single-core execution of a concurrent program is in theory non-deterministic to the same extent as multi-core execution, with the same number of possible interleavings, far fewer interleavings are actually encountered in practice because the events of one thread are only interleaved with the events of a different thread when thread execution is pre-empted by the need to perform some other activity. As a consequence, upgrading to multi-core hardware significantly increases the risk that untested interleavings will be encountered.

In order to debug a faulty program, it is typically run in an instrumented environment so that intermediate results can be logged, or so that its execution can be stopped and intermediate results probed. The fact that the debugging environment is so different from the production environment makes intermittent faults even more difficult to track down. A “Heisenbug” is a tongue-in-cheek name for a software bug that disappears when one tries to examine it, by analogy with the “observer effect” in quantum mechanics, where the act of observing affects the system that is observed.

Static analysis to the rescue

A static analysis tool scans the source code (or bytecode) of a codebase, performing a kind of symbolic execution to work out the potential consequences of each line of code. Inconsistencies and defects can be brought to the developers’ attention before the code has even been completely written.

Static analysis is cost-effective for development of non-concurrent programs because it leads to early detection of bugs, meaning that they are much less expensive to fix than if they are caught during testing or deployment. In that case, static analysis can be thought of as testing with all data values for certain kinds of errors.

Static analysis is essential for concurrent programs because testing, even with all data values, isn’t good enough. Static analysis amounts to exhaustive testing for all data values and for all interleavings, for all deployment scenarios.

Static analysis readily accommodates the possibility of non-deterministic outcomes. Going back to our example of rolling dice, imagine a version of static analysis that simply keeps track of bounds on values:

6-sided dice
One 6-sided die: result is [1. .6]
Sum of two 6-sided dice: result is [1. .6] + [1. .6] = [2. .12]
So 12 is possible
100-sided dice
One 100-sided die: result is [1. .100]
Sum of two 100-sided dice: result is [1. .100] + [1. .100] = [2. .200]
So 200 is possible

The same essential idea can be adapted to static analysis of non-deterministic concurrent programs, by taking account of all possible outcomes. This can be done without the huge computational overhead of checking all of the possible interleavings individually.

Related reading

Contemplate Ltd. Maintaining safe concurrent code with ThreadSafe. April 2013.

Victor Grazi. Exterminating Heisenbugs. InfoQ 2012.

Edward A. Lee. The problem with threads. IEEE Computer 39(5):33-42, 2006. [UC Berkeley technical report version]

Ben Ylvisaker. Multi-core processors are a headache for multithreaded code. GrammaTech 2013.