udoprog.github.io

Patching ThreadPoolExecutor to handle Errors

In this post I’ll describe an important patch that you always want to use when using a ThreadPoolExecutor (or any ExecutorService) in Java.

Edit (2017-11-05): Since JDK 8u92, there is a new option, -XX:+ExitOnOutOfMemoryError, that can effectively be used instead.

The patch intends to mitigate the unexpected death of threads and the impact that has on your application.

To help illustrate this, here is an example project with a very nasty thread eating up all memory:

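import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TransferQueue;

public class Example {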
    private static final int MESSAGE_SIZE = 1024 * 1000;

    public static void main(String[] argv) throws Exception {
        final ExecutorService executor =
            new ThreadPoolExecutor(2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        final TransferQueue<long[]> queue = new LinkedTransferQueue<>();

        executor.submit(new BadThread());

        executor.submit(() -> {
            while (true) {
                queue.transfer(new long[MESSAGE_SIZE]);
            }
        });

        while (true) {
            System.out.println("main: waiting for message...");
            queue.take();
            System.out.println("main: OK");
            Thread.sleep(500);
        }
    }

    /**
     * A bad thread eating up all available memory and holding on to it.
     */
    static class BadThread implements Callable<Void> {
        @Override
        public Void call() throws Exception {
            Thread.sleep(1000);

            System.out.println("BadThread: Start 'borrowing' memory...");

            final List<Long> list = new ArrayList<>();

            while (true) {
                try {
                    list.add(0L);
                } catch (final OutOfMemoryError error) {
                    System.out.println("BadThread: Hold on to OOM: " + error);
                    Thread.sleep(10000);
                }
            }
        }
    }
}

Compile and run this application with -Xmx16m. You should see something like the following:

main: waiting for message...
main: OK
main: waiting for message...
main: OK
BadThread: Start 'borrowing' memory...
main: waiting for message...
main: OK
BadThread: Hold on to OOM: java.lang.OutOfMemoryError: Java heap space
main: waiting for message...
...

The application is stuck: we are no longer seeing any main: OK messages. No stack traces, nothing.

The reason is that our coordinator thread allocates memory for each message. This means it can be the target of an OutOfMemoryError when an allocation fails because BadThread has locked up all available memory and is refusing to die.

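This state is where it gets interesting. When a task is handed over through submit, ThreadPoolExecutor will, as per documentation, happily catch any exception thrown by the task and tuck it away in the returned Future. Unless someone inspects that Future, the exception is effectively swallowed. It is explicitly left to the developer to handle this.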
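To make the swallowing concrete, here is a minimal sketch that is not part of the example project. The class name SwallowedErrorDemo and the simulated OutOfMemoryError are assumptions made for illustration; the point is only that the error surfaces when someone calls get() on the Future, and not before.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SwallowedErrorDemo {
    /**
     * Submits a task that throws an Error and returns whatever the Future
     * captured. Nothing is printed or propagated unless get() is called.
     */
    public static Throwable submitAndInspect() throws InterruptedException {
        final ExecutorService executor = Executors.newSingleThreadExecutor();

        try {
            // Stand-in for a real allocation failure (hypothetical, for the demo only).
            final Runnable task = () -> {
                throw new OutOfMemoryError("simulated");
            };

            final Future<?> future = executor.submit(task);

            try {
                future.get();
                return null; // the task completed normally
            } catch (final ExecutionException e) {
                // The Error thrown by the task is stored as the cause.
                return e.getCause();
            }
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] argv) throws Exception {
        System.out.println("captured: " + submitAndInspect());
    }
}
```

If the call to future.get() is removed, the program finishes without any trace of the error, which is the situation the coordinator in the example project ends up in.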

This leaves us with a dead coordinator thread at the other end of the Queue, and main is left to its own devices forever. :(.

The afterExecute patch

This patch is derived from this StackOverflow answer and can be applied to ThreadPoolExecutor.

final ExecutorService executor = new ThreadPoolExecutor(2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>()) {
    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        super.afterExecute(r, t);

        if (t == null && r instanceof Future<?>) {
            try {
                Future<?> future = (Future<?>) r;

                if (future.isDone()) {
                    future.get();
                }
            } catch (CancellationException ce) {
                t = ce;
            } catch (ExecutionException ee) {
                t = ee.getCause();
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // ignore/reset
            }
        }

        if (t != null) {
            if (t instanceof Error) {
                try {
                    System.err.println("Error in runnable: " + r);
                    t.printStackTrace(System.err);
                    System.err.println(
                        "This is an unrecoverable error, shutting down...");
                } finally {
                    System.exit(1);
                }
            }

            System.out.println(t);
        }
    }
};

This patch overrides the afterExecute method, a hook designed to allow for custom behavior after the completion of tasks.
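The reason the patch unwraps the Future when t is null can be seen in a small experiment. The sketch below is an illustration under assumptions (the class name AfterExecuteDemo and the harmless IllegalStateException are made up for the demo): it runs the same failing task once through submit and once through execute, and records what the hook receives in each case.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class AfterExecuteDemo {
    /** Runs a failing task via submit() and via execute(), recording what the hook sees. */
    public static List<String> observe() throws InterruptedException {
        final List<String> seen = new CopyOnWriteArrayList<>();

        final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>()) {
            @Override
            protected void afterExecute(Runnable r, Throwable t) {
                super.afterExecute(r, t);
                seen.add(t == null ? "t is null" : "t is " + t.getClass().getSimpleName());
            }
        };

        // A harmless exception keeps the demo safe to run (hypothetical task).
        final Runnable failing = () -> {
            throw new IllegalStateException("boom");
        };

        // submit() wraps the task in a FutureTask, which captures the exception.
        executor.submit(failing);
        // execute() runs the task as-is, so the exception reaches the hook
        // (and the worker's uncaught exception handler, which prints to stderr).
        executor.execute(failing);

        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return seen;
    }

    public static void main(String[] argv) throws Exception {
        System.out.println(observe());
    }
}
```

With a single worker the tasks run in submission order, so the hook sees a null throwable for the submitted task and the IllegalStateException for the executed one. That is why the patch has to call future.get() itself to find out what happened.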

Run the project again, and you should see the following:

main: waiting for message...
main: OK
main: waiting for message...
main: OK
BadThread: Start 'borrowing' memory...
main: waiting for message...
main: OK
BadThread: Hold on to OOM: java.lang.OutOfMemoryError: Java heap space
Error in runnable: java.util.concurrent.FutureTask@5cf149bb
java.lang.OutOfMemoryError: Java heap space
    at com.spotify.heroic.ExecutorServicePatch.lambda$main$0(ExecutorServicePatch.java:63)
    at com.spotify.heroic.ExecutorServicePatch$$Lambda$1/495053715.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
This is an unrecoverable error, shutting down...

Process finished with exit code 1

Errors

I want to emphasise that OutOfMemoryError is generally not an error that you can safely recover from. There are no guarantees that the thread responsible for eating up your memory is the target for this error. Even if that is the case, this thread might become important at a later stage in its life. In my opinion, the most reasonable thing to do is to give up.

The Javadoc for Error says as much: "An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch. Most such errors are abnormal conditions."

At this stage you might be tempted to attempt a clean shutdown of your application on errors. This might work. But we might also be in a state where a thread critical to the clean shutdown of your application is no longer alive. There might not be any memory left to support a complex shutdown. Attempting it could lead to your cleanup attempt crashing, leading us back to where we started.

If you want to cover manually created threads, you can make use of Thread#setDefaultUncaughtExceptionHandler. Just remember, this still does not cover tasks submitted to thread pools, since their errors end up in a Future instead of escaping the thread.
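As a rough sketch of what that could look like, the following installs a default handler and lets a manually created thread die with a simulated error. The class name UncaughtHandlerDemo and the capturing handler are assumptions for the sake of a runnable demo; a real handler would more likely log the error and call System.exit(1), like the afterExecute patch does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class UncaughtHandlerDemo {
    /** Starts a thread that dies with an Error and returns what the handler received. */
    public static Throwable runAndCapture() throws InterruptedException {
        final AtomicReference<Throwable> captured = new AtomicReference<>();
        final CountDownLatch handled = new CountDownLatch(1);

        final Thread.UncaughtExceptionHandler previous =
            Thread.getDefaultUncaughtExceptionHandler();

        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            // Demo only: record the error instead of shutting down the JVM.
            captured.set(error);
            handled.countDown();
        });

        try {
            final Thread worker = new Thread(() -> {
                // Stand-in for a real allocation failure (hypothetical).
                throw new OutOfMemoryError("simulated");
            }, "manual-worker");

            worker.start();
            handled.await(5, TimeUnit.SECONDS);
            return captured.get();
        } finally {
            // Restore whatever handler was installed before the demo.
            Thread.setDefaultUncaughtExceptionHandler(previous);
        }
    }

    public static void main(String[] argv) throws Exception {
        System.out.println("handler saw: " + runAndCapture());
    }
}
```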

On a final note, if you are a library developer: Please don’t hide your thread pools from us.