Maintaining safe concurrent code with ThreadSafe

A Contemplate technical briefing

Modern multicore hardware offers an unprecedented opportunity for developing highly concurrent software capable of delivering high throughput and low latency. However, developing reliable concurrent software is risky, and concurrency-related defects like data races and deadlocks can be subtle. Concurrency defects may manifest themselves only as one-in-a-million discrepancies in behaviour, evading the usual testing regimes. Moreover, even if one version of some software is correct with respect to concurrent behaviour, defects are often introduced as changes and improvements are made. Contemplate ThreadSafe can discover and diagnose concurrency defects as they creep into codebases, using cutting-edge static analysis techniques and tight integration into the Eclipse IDE.

Key points

  • Multicore systems and concurrency offer opportunities and risk
  • Concurrent software is hard to maintain, because design decisions are hidden and implicit in code
  • Contemplate’s ThreadSafe static analysis tool reduces risk by discovering and diagnosing concurrency defects in Java
  • ThreadSafe uses cutting-edge techniques to accurately discover defects that arise from the interaction of many methods and classes
  • ThreadSafe’s tight Eclipse integration allows rapid and precise diagnosis of concurrency defects

ThreadSafe discovering and helping to diagnose a concurrency defect in the Eclipse IDE.

Introduction

Modern multicore processors provide a rich and powerful foundation for building high performance concurrent systems. The features of present and future hardware that enable highly concurrent software must be fully exploited to attain the high throughput and low latency required of modern software systems.

Exploiting the benefits of multicore hardware has proven to be difficult and risky. Writing software that correctly and safely makes use of concurrency requires careful thought to take into account the effects of running in a concurrent environment. Software that incorrectly accounts for concurrency can contain intermittent defects that elude even the most rigorous testing processes. Software systems that exploit concurrent behaviour are typically designed by expert developers, and careful thought is put into avoiding common concurrency defects like race conditions, atomicity violations and deadlocks. A well-structured design for a software system that specifies matters such as synchronization strategies and lock acquisition patterns greatly reduces the possibility of concurrency defects.

However, it is rare for even the most perfectly designed software system to remain unchanged. Changing business requirements or environmental changes mean that new features must be added to existing systems. Changes are not necessarily carried out by the original authors of the system, and new developers may not be aware of the often undocumented design intentions of the original developers. Hard-to-find concurrency defects can slip in, manifesting as costly errors during production.

Contemplate’s ThreadSafe is an advanced static analysis tool that analyses Java code for concurrency defects, reducing the inherent risk in exploiting concurrency. With ThreadSafe, developers can discover the underlying design intentions of existing concurrent code, and be warned when new code deviates from this design. In this technical briefing, we present a simple example of concurrent software whose solid original design is violated when new features are added without heeding it. We show how ThreadSafe can provide early warning of the introduced concurrency defects, and how the cutting-edge technologies present in ThreadSafe contribute to the identification and explanation of concurrency defects.

Concurrency defects creep in during software evolution

Even when a piece of software has been constructed to use concurrency correctly, defects can be introduced as modifications and additions are made. The following simple example in Java demonstrates the incremental introduction of a concurrency defect over the lifetime of a software product. ThreadSafe identifies and helps with diagnosing the problem, catching defects long before they enter production.

Stage One: A concurrency-correct class is written

Our example starts with a Java class BankAccount for representing bank accounts that may be accessed concurrently. Despite its simplicity, careful thought has to be put into the concurrency aspects of the design of this class.

The original developer has designed this class with two Application Programming Interfaces (APIs). The first is a public API through which the current balance can either be read or be altered by credits and debits. The public API is intended to be safely used concurrently: any call to a public API method will act atomically on the current balance.

The second API is an internal API intended for use by subclasses that add additional features to the basic BankAccount class. This API offers direct access to read and modify the current balance. In order to allow for complex sequences of operations on the balance to happen atomically, the internal API does not perform any synchronization itself. Instead, a lock object is provided to subclasses via a protected field lock. The concurrent design of this internal API can be summed up in the following rule:

All clients must synchronize on the object in the field lock while accessing the internal API.

The actual implementation of the BankAccount class goes as follows. For brevity, we do not include any comments or Javadoc in this code, though they would be part of any real implementation.

package com.contemplateltd.example;

public class BankAccount {

    protected final Object lock = new Object();

    private int balance;

    public BankAccount(int initialBalance) {
        if (initialBalance < 0)
            throw new IllegalArgumentException("initial balance must be >= 0");

        balance = initialBalance;
    }

This code declares a private field balance that contains the internal state of a BankAccount object, and a constructor to initialise that field. The protected final field lock declared at the start of the class contains the object to be used for synchronizing access to the internal API. Note that the constructor does not need to synchronize its write to the balance field. Since a reference to this is not leaked, and assuming that the object is correctly published, the Java Memory Model guarantees that the constructor has exclusive access to the object being constructed.

    protected int readBalance() {
        return balance;
    }

    protected void adjustBalance(int adjustment) {
        balance = balance + adjustment;
    }

This pair of method declarations implements the internal API. Since the field balance has been declared private, we use two protected methods to allow subclasses of BankAccount to manipulate it. The use of protected methods for the internal API may seem like overkill for controlling access to a single field. However, one can imagine more complex implementations that maintain additional information like transaction logs, the internal details of which should not be exposed to subclasses.

    public void credit(int amount) {
        if (amount < 0)
            throw new IllegalArgumentException("credit amount must be >= 0");

        synchronized (lock) {
            adjustBalance(amount);
        }
    }

    public void debit(int amount) throws Exception {
        if (amount < 0)
            throw new IllegalArgumentException("debit amount must be >= 0");

        synchronized (lock) {
            if (readBalance() - amount < 0)
                throw new Exception("insufficient funds");
            adjustBalance(-amount);
        }
    }

    public void tax(double taxRate) throws Exception {
        if (taxRate < 0.0 || taxRate > 1.0)
            throw new IllegalArgumentException("tax rate must be >= 0.0 and <= 1.0");

        synchronized (lock) {
            int tax = (int)(taxRate * (double)readBalance());
            adjustBalance(-tax);
        }
    }

    public int getCurrentBalance() {
        synchronized (lock) {
            return readBalance();
        }
    }
}

Finally, we have declarations of the four methods of the public API. These methods adhere to the concurrent design that we stated above: a lock on the object in the field lock is acquired before calling the internal API methods. The implementations of the debit() and tax() methods demonstrate why the internal API cannot handle synchronization by itself: the balance check and the decrement must happen atomically to avoid a race condition. Atomicity is achieved by wrapping the sequence of operations in a single synchronized block.

Note that acquiring the lock is essential in all four cases. For example, failing to acquire the lock in the getCurrentBalance() method may result in out-of-date information being returned.
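To make the rule for the internal API concrete, here is a minimal sketch of a subclass that follows it. The class FeeBankAccount, its fee field and its debitWithFee() method are hypothetical and are not part of the original example. BankAccount is condensed to the members the sketch needs, with validation omitted.

```java
// Condensed copy of BankAccount, reduced to the members this sketch needs.
class BankAccount {
    protected final Object lock = new Object();
    private int balance;

    BankAccount(int initialBalance) {
        balance = initialBalance;
    }

    protected int readBalance() {
        return balance;
    }

    protected void adjustBalance(int adjustment) {
        balance = balance + adjustment;
    }

    public int getCurrentBalance() {
        synchronized (lock) {
            return readBalance();
        }
    }
}

// Hypothetical subclass (an assumption for illustration): charges a fixed fee on every debit.
class FeeBankAccount extends BankAccount {
    private final int fee;

    FeeBankAccount(int initialBalance, int fee) {
        super(initialBalance);
        this.fee = fee;
    }

    // The balance check and both adjustments run while holding lock,
    // so no correctly synchronized thread can interleave with them.
    public boolean debitWithFee(int amount) {
        synchronized (lock) {
            if (readBalance() - amount - fee < 0)
                return false;
            adjustBalance(-amount);
            adjustBalance(-fee);
            return true;
        }
    }
}
```

Because the internal API does no locking of its own, the subclass can combine one read and two adjustments into a single atomic operation. This is the flexibility the design is meant to provide.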

Stage Two: Software evolves, and a concurrency defect is introduced

Some time after the BankAccount class has been designed and implemented, another developer is given the task of implementing a class for “bonus” bank accounts. The class BonusBankAccount includes an additional method that applies a fixed bonus amount to the current balance.

The developer given the task of implementing BonusBankAccount, who is under pressure to finish quickly, looks at the method signatures for the internal API of BankAccount and produces the following implementation of BonusBankAccount:

package com.contemplateltd.example;

public class BonusBankAccount extends BankAccount {

    private final int bonus;

    public BonusBankAccount(int initialBalance, int bonus) {
        super(initialBalance);

        if (bonus < 0)
            throw new IllegalArgumentException("bonus must be >= 0");

        this.bonus = bonus;
    }

    public void applyBonus() {
        adjustBalance(bonus);
    }
}

The developer of BonusBankAccount has introduced a serious concurrency defect.

A consumer of instances of BonusBankAccount may reasonably assume that, as for instances of its superclass, it is safe to call public API methods from multiple threads concurrently. However, the applyBonus() method in BonusBankAccount does not adhere to the concurrency policy intended by the developer of BankAccount. No synchronization occurs on the object in the field lock, even though the internal API method adjustBalance() is invoked.

When executed in a concurrent environment, the effects of the lack of synchronization include, but are not limited to, the following:

  • Updates even from the correctly synchronized credit() and debit() methods in BankAccount may be lost if bonuses are applied concurrently. Updating the balance field in the adjustBalance() method involves a read of the balance and then a write. If the balance field is updated between the read and the write, then the intervening update will be lost. Without correct synchronization, atomicity of updates to the balance field cannot be guaranteed.
  • Lack of synchronization can cause updates to the shared balance field to be delayed for an arbitrarily long time. Therefore, threads invoking applyBonus() on different cores may see starkly different values of the balance field. This problem is exacerbated on machines with many cores distributed over multiple physical chips, as is common in server-class hardware.

On normal desktop hardware with four to eight cores, the effects of the BonusBankAccount developer’s mistake can be observed by setting up several threads to concurrently apply thousands of bonuses and credits to the same account over a short period of time. In a simple test case, the difference between the expected balance and the actual balance can be observed easily because we know exactly what the expected balance must be. In production, the expected balance is not necessarily known, and there is no choice but to accept the erroneous computed balance. Moreover, it is possible in this simple example to set up a test harness. Setting up a test harness for more complex concurrent designs can be prohibitively expensive.
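A test of the kind described here could be sketched as follows. This is an illustrative harness, not one taken from the original briefing: the class LostUpdateHarness, the thread count and the iteration count are assumptions, and the two account classes are condensed copies of the listings above with validation omitted.

```java
// Condensed copies of the classes above, before the defect is fixed.
class BankAccount {
    protected final Object lock = new Object();
    private int balance;

    BankAccount(int initialBalance) {
        balance = initialBalance;
    }

    protected int readBalance() {
        return balance;
    }

    protected void adjustBalance(int adjustment) {
        balance = balance + adjustment;
    }

    public void credit(int amount) {
        synchronized (lock) {
            adjustBalance(amount);
        }
    }

    public int getCurrentBalance() {
        synchronized (lock) {
            return readBalance();
        }
    }
}

class BonusBankAccount extends BankAccount {
    private final int bonus;

    BonusBankAccount(int initialBalance, int bonus) {
        super(initialBalance);
        this.bonus = bonus;
    }

    // Defect under test: the internal API is called without holding lock.
    public void applyBonus() {
        adjustBalance(bonus);
    }
}

// Hypothetical harness; THREADS and ITERATIONS are arbitrary illustrative values.
class LostUpdateHarness {
    static final int THREADS = 4;
    static final int ITERATIONS = 100_000;

    // Each iteration adds one credit of 1 and one bonus of 1.
    static int expectedBalance() {
        return THREADS * ITERATIONS * 2;
    }

    static int runOnce() {
        final BonusBankAccount account = new BonusBankAccount(0, 1);
        Thread[] workers = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < ITERATIONS; j++) {
                    account.credit(1);
                    account.applyBonus();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return account.getCurrentBalance();
    }

    public static void main(String[] args) {
        System.out.println("expected " + expectedBalance() + ", actual " + runOnce());
    }
}
```

A shortfall between the two printed numbers indicates lost updates. The outcome depends on thread scheduling, so a run that happens to print matching numbers does not show that the code is correct.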

As serious as this mistake is, it is an understandable lapse from the point of view of the developer of BonusBankAccount. Unlike the compiler-checked documentation afforded by the Java type system, there is no way to enforce a concurrency policy on users of an API. Developers operating under time pressure cannot always be expected to be able to accurately infer intended concurrency policies that are often left implicit in existing code.

Discovering and diagnosing the defect with ThreadSafe

With BonusBankAccount as it is, the concurrency defect in the applyBonus() method could go undetected for many months, running in production, silently corrupting the value of the balance field.

Static analysis can help to discover and diagnose this kind of concurrency defect, long before software enters production. Contemplate’s ThreadSafe is a static analysis tool for detecting and diagnosing concurrency defects in Java code. ThreadSafe’s results can be presented in the Eclipse IDE’s Java perspective during development as illustrated here, or via the Sonar software quality platform for a higher level view.

Applying ThreadSafe’s static analysis to this example produces the following report in Eclipse, directing our attention to the balance field:

Image: A screenshot of the ThreadSafe view in the Eclipse IDE

To investigate further, we right-click on the finding to get ThreadSafe to show us where the accesses to the balance field occur, and the locks that are being used to guard each access.

Image: A screenshot of ThreadSafe's Guards view in the Eclipse IDE

This view shows us the list of locations in the source code where the balance field is accessed from. Accesses that are likely to be problematic are marked with a warning symbol. Across the top of the view are descriptions of the objects that have been synchronized on for these accesses. In this relatively simple example, only the object referred to by the lock field is relevant, so this is the only object displayed by ThreadSafe.

The first row refers to the access to balance in the readBalance() method. Since this is only ever called from the correctly synchronized debit(), tax() and getCurrentBalance() methods, ThreadSafe has shown that the lock BankAccount.this.lock is Always Held. The second and third rows refer to the read and write accesses in the adjustBalance() method. ThreadSafe has indicated that the lock is only Maybe Held for these accesses, indicating that there are contexts in which these accesses may be unsynchronized.

By double clicking on each of the access locations listed in this view, we are taken to the lines in the implementation of BankAccount where the accesses to balance occur. From this information, we learn two useful facts.

  1. Since all of the accesses appear within the implementation of the internal API, we can deduce that the original developer of the BankAccount class probably intended that all uses of the internal API should synchronize on the object in the BankAccount.lock field.
  2. ThreadSafe has highlighted that some accesses do not synchronize on the object in the BankAccount.lock field: this is indicated by the Maybe Held label. By using Eclipse’s Call Hierarchy feature to discover calls to the internal API methods, the failure to synchronize can be traced back to the applyBonus() method in the BonusBankAccount class.

The developer of BonusBankAccount can now amend their code to take into account the concurrency policy of the BankAccount class. The new implementation of the applyBonus() method is:

    public void applyBonus() {
        synchronized (lock) {
            adjustBalance(bonus);
        }
    }

Re-running ThreadSafe on the amended code produces no reports. The concurrency defect introduced by the developer of BonusBankAccount has been remediated.

How ThreadSafe works

ThreadSafe is able to find the concurrency defect hidden in the interaction between the BankAccount class and the BonusBankAccount class by applying the following cutting-edge static analysis techniques:

Key technology: Context-sensitive inter-procedural analysis

In the BankAccount class, the synchronizations on the object in the lock field happen in the implementation of the public API, while the actual accesses to the balance happen in the implementation of the internal API. The root cause of the concurrency defect described above is the invocation of the internal API in two different kinds of context: ones where synchronization has occurred and ones where no synchronization has occurred. ThreadSafe can accurately track synchronization contexts, both on Java intrinsic locks and on java.util.concurrent locks, even through complex chains of method calls.

ThreadSafe not only tracks synchronizations across method calls but can also track references to shared data passed between methods. Tracking of references increases accuracy, exposing yet more hidden concurrency defects, and also reducing false positives (false positives are reports generated by a static analysis tool that do not indicate real defects).

Key technology: Inter-class analysis

To discover the discrepancy between the synchronization policy of BankAccount and the actions of BonusBankAccount, ThreadSafe has tracked method calls across the class hierarchy, as well as within a single class. Subtle defects can arise due to mismatches between the assumptions made by developers of different classes, so accurate tracking of the interactions between classes is essential for accurate reporting of concurrency defects. ThreadSafe can handle even the most complex class hierarchies.

Key technology: Aggregation and filtering of results

ThreadSafe uses its context-sensitive inter-procedural and inter-class analysis to gather rich information on the accesses to the shared balance field, and the circumstances under which they occur. This information is then filtered and aggregated to discover likely concurrency defects. In the example we presented above, filtering is essential to remove the irrelevant access to the field balance in the constructor of BankAccount. (This access is irrelevant because it can only ever happen before the object has been shared between threads.) Several other filtering techniques are also applied to reduce the false positive rate. ThreadSafe then aggregates the remaining information to discern hidden irregularities in the code. For example, the inconsistent use of synchronization between the BankAccount and BonusBankAccount classes is discovered by analysing all the conditions under which accesses to the balance field can occur.

Key technology: Straightforward presentation of results

After a likely concurrency bug is discovered by ThreadSafe, the developer is presented with the necessary information to help diagnose the underlying cause. In the example we presented above, ThreadSafe’s Guards view presented us with the precise locations of the unsynchronized accesses to the balance field, and the synchronization behaviour of the code for the other accesses. By comparing these pieces of information, a developer is able to determine the true defect and resolve it.

ThreadSafe helps with all aspects of concurrent software

The example presented above showed a relatively simple inconsistent synchronization defect that was discovered and diagnosed with the help of ThreadSafe. ThreadSafe helps with discovering all kinds of defects common in concurrent code: more complex race conditions, including inconsistent synchronization and atomicity violations, as well as deadlocks.

Inconsistent Synchronization

The BankAccount example presented above showed a concurrency defect caused by inconsistent synchronization of accesses to a shared field, which resulted in potential race conditions. ThreadSafe can also discover more complex cases of inconsistent synchronization. The following code snippet shows one such case, involving potentially concurrent accesses to a shared collection.

    private final Map<Long, Cache> caches = new HashMap<Long, Cache>();

    public synchronized Cache getCache(Long cacheId) {
        Cache cache = caches.get(cacheId);
        if (cache == null) {
            cache = new Cache();
            caches.put(cacheId, cache);
        }
        return cache;
    }

    public boolean isCacheIdAvailable(Long cacheId) {
        return caches.containsKey(cacheId);
    }

The developer has been inconsistent in their use of synchronization between the getCache() and isCacheIdAvailable() methods. This can lead to incorrect information being returned by isCacheIdAvailable(). ThreadSafe will highlight this discrepancy as it is likely a sign of a potentially serious defect when this code is run in a concurrent environment.
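One way to remove the discrepancy is to guard both methods with the same lock. The following is a minimal sketch under stated assumptions: the enclosing class name CacheRegistry and the empty Cache class are hypothetical placeholders, since the original snippet names neither.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical placeholder for the Cache type used in the snippet above.
class Cache {
}

// Hypothetical enclosing class; the original snippet does not name it.
class CacheRegistry {
    private final Map<Long, Cache> caches = new HashMap<Long, Cache>();

    public synchronized Cache getCache(Long cacheId) {
        Cache cache = caches.get(cacheId);
        if (cache == null) {
            cache = new Cache();
            caches.put(cacheId, cache);
        }
        return cache;
    }

    // Now guarded by the same lock (this) as getCache(), so the HashMap
    // is never read while another thread is modifying it.
    public synchronized boolean isCacheIdAvailable(Long cacheId) {
        return caches.containsKey(cacheId);
    }
}
```

With both methods synchronized on the same object, every access to the caches map happens under a single, consistent lock.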

Atomicity violations

Atomicity violations occur when sequences of interdependent operations are performed by a thread on some shared data, without ensuring that interfering operations performed by concurrently running threads are not occurring at the same time.

A common pattern of atomicity violations consists of sequences of get, check and put operations on shared collection objects. For an example of this, imagine an evolution of the example above where the developer has decided to gain some performance increase by dispensing with synchronization on the whole HashMap, and is using a ConcurrentHashMap instead:

    private final Map<Long, Cache> caches = new ConcurrentHashMap<Long, Cache>();

    public Cache getCache(Long cacheId) {
        Cache cache = caches.get(cacheId);
        if (cache == null) {
            cache = new Cache();
            caches.put(cacheId, cache);
        }
        return cache;
    }

Imagine there are two threads, A and B, that call getCache() concurrently with the same cacheId. First, A calls caches.get(cacheId), and gets null, as there is no Cache object yet associated with cacheId. Thread B then does the same, also receiving null. Now, both thread A and thread B will create separate Cache objects, insert them into the caches map, and return them. However, threads A and B have now obtained different Cache objects for the same cacheId, which is not what was intended. At best, this could lead to performance degradation as threads A and B do not share as much cached data as possible. At worst, threads A and B may end up with completely inconsistent states, leading to subtle and hard-to-find anomalous behaviour.

Note that this code is unsafe in a concurrent environment, despite the use of ConcurrentHashMap from the java.util.concurrent library.

ThreadSafe can detect atomicity violations like the one shown in this example, drawing developers’ attention to difficult-to-find defects in concurrent code.
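A common remedy for this get, check and put pattern is to let the map perform the whole sequence as one atomic operation. The sketch below uses ConcurrentHashMap.computeIfAbsent(), which is available from Java 8. The class name AtomicCacheRegistry and the empty Cache class are hypothetical placeholders, and this is one possible fix rather than the one the original authors had in mind.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical placeholder for the Cache type used in the snippet above.
class Cache {
}

// Hypothetical enclosing class; the original snippet does not name it.
class AtomicCacheRegistry {
    private final ConcurrentHashMap<Long, Cache> caches = new ConcurrentHashMap<Long, Cache>();

    public Cache getCache(Long cacheId) {
        // ConcurrentHashMap performs the absence check and the insertion
        // atomically, so all callers with the same id receive the same Cache.
        return caches.computeIfAbsent(cacheId, id -> new Cache());
    }
}
```

On older Java versions, putIfAbsent() can serve the same purpose, provided the caller returns the previously mapped value when one already exists.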

Deadlocks

Deadlocks occur when two (or more) threads must wait for each other to release some resource before they can continue. Since neither thread will release its resource until another thread does, both threads will wait indefinitely.

A subtle source of deadlocks arises from the interaction of blocking data structures and synchronization. Holding locks while waiting for other threads to perform tasks can prevent all participating threads from making progress. The essence of this kind of problem is demonstrated in the following code snippet:

    private final BlockingQueue<Integer> queue = new LinkedBlockingDeque<Integer>();

    public synchronized Integer take() throws InterruptedException {
        return queue.take();
    }

    public synchronized void add(int x) {
        queue.add(x);
    }

If a thread calls the method take() while queue is empty, then it will simply block. Since take() is a synchronized method, the blocked thread keeps holding the lock and prevents add() from ever being run by another thread, so nothing will ever be inserted into the queue, and the pair of threads will wait forever. Since the deadlock only occurs when the queue is empty, it may only show up in production under difficult-to-replicate circumstances.

ThreadSafe detects and warns developers about this and other kinds of potential deadlock defects hidden within concurrent code, helping to lower the risk that harmful behaviour occurs during production.
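For the snippet above, one possible repair is to drop the synchronized modifiers, since implementations of BlockingQueue in java.util.concurrent are already thread-safe. The sketch below assumes that the wrapper methods guard no other shared state. The class name IntegerChannel is hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical enclosing class; the original snippet does not name it.
class IntegerChannel {
    private final BlockingQueue<Integer> queue = new LinkedBlockingDeque<Integer>();

    // Not synchronized: a thread blocked here holds no lock that add() needs.
    public Integer take() throws InterruptedException {
        return queue.take();
    }

    // Not synchronized: the queue handles its own thread safety.
    public void add(int x) {
        queue.add(x);
    }
}
```

If the methods did need to protect additional state, the blocking call would have to be moved outside the synchronized region rather than the lock simply being removed.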