Java on bramp.net

Running Java in Production: A SRE’s Perspective

Sat, 13 Jan 2018 12:50:31 -0800

Originally published as part of the Java Advent 2017 series

As a Site Reliability Engineer (SRE) I make sure our production services are efficient, scalable, and reliable. A typical SRE is a master of production, and has to have a good understanding of the wider architecture, and be well versed in many of the finer details.

It is common for SREs to be polyglot programmers, expected to understand multiple different languages. For example, C++ may be hard to write, test and get right, but has high performance, perfect for backend systems such as databases. Python, by contrast, is easy to write and great for quick scripting, which makes it useful for automation. Java is somewhere in the middle: it is a compiled language that provides type safety, performance, and many other advantages that make it a good choice for writing web infrastructure.

Even though many of the best practices that SREs adopt can be generalised to any language, there are some unique challenges with Java web applications. This article highlights some of these challenges and talks about what we can do to address them.

Deployment

A typical Java application consists of hundreds of class files, either written by your team, or from common libraries that the application depends on. To keep the number of class files under control, and to provide better versioning and compartmentalisation, they are typically bundled up into JAR or WAR files.

There are many ways to host a Java application. One popular method is to use a Java Servlet Container such as Tomcat or JBoss. These provide some common web infrastructure and libraries to make it, in theory, easier to deploy and manage the Java application. Take Tomcat, a Java program that provides the actual web server and loads the application (bundled as a WAR file) on your behalf. This may work well in some situations, but it actually adds complexity. For example, you now need to keep track of the version of the JRE, the version of Tomcat, and the version of your application. Testing for incompatibility, and ensuring everyone is using the same versions of the full stack, can be problematic and lead to subtle problems. Tomcat also brings along its own bespoke configuration, which is yet another thing to learn.

A good tenet to follow is to “keep it simple”, but in the Servlet Container approach, you have to keep track of a few dozen Tomcat files, plus one or more WAR files that make up the application, plus all the Tomcat configuration that goes along with it.

Thus some frameworks attempt to reduce this overhead: instead of being hosted within a full application server, they embed their own web server. There is still a JVM, but it invokes a single JAR file that contains everything needed to run the application. Popular frameworks that enable these standalone apps are Dropwizard and Spring Boot. To deploy a new version of the application, only a single file needs to be changed, and the JVM restarted. This is also useful when developing and testing the application, because everyone is using the same version of the stack. It is especially useful for rollbacks (one of SRE’s core tools), as only a single file has to be changed (which can be as quick as a symlink change).

One thing to note is that a Tomcat style WAR file contains the application class files, as well as all the libraries the application depends on as JAR files. In the standalone approach, all the dependencies are merged into a single Fat JAR: one JAR file that contains the class files for the entire application. These Fat or Uber JARs are not only easier to version and copy around (because each is a single immutable file), but can actually be smaller than an equivalent WAR file due to pruning of unused classes in the dependencies.

This can even be taken further, by not requiring separate JVM and JAR files. Tools like capsule.io can actually bundle up the JAR file, JVM, and all configuration into a single executable file. Now we can really ensure the full stack is using the same versions, and the deployment is agnostic to what may already be installed on the server.

Keep it simple, and make the application as quick and easy to version as possible, using a single Fat JAR, or executable where possible.

Startup

Even though Java is a compiled language, it is not compiled to machine code; it is instead compiled to bytecode. At runtime the Java Virtual Machine (JVM) interprets the bytecode, and executes it in the most efficient way. For example, just-in-time (JIT) compilation allows the JVM to watch how the application is used, and on the fly compile the bytecode into optimal machine code. Over the long run this may be advantageous for the application, but during startup it can make the application perform suboptimally for tens of minutes, or longer. This is something to be aware of, as it has implications for load balancing, monitoring, capacity planning, etc.

In a multi-server deployment, it is best practice to slowly ramp up traffic to a newly started task, giving it time to warm up, and to not harm the overall performance of the service. You may be tempted to warm up new tasks by sending them artificial traffic before they are placed into the user-serving path. Artificial traffic can be problematic if it does not approximate normal user traffic. In fact, this fake traffic may trigger the JIT to optimise for cases that don’t normally occur, thus leaving the application in a sub-optimal or worse state than not being JIT’d.

Slow starts should also be considered when capacity planning. Don’t expect cold tasks to handle the same load as warm tasks. This is important when rolling out a new version of the application, as the capacity of the system will drop until the tasks warm up. If this is not taken into account, too many tasks may be reloaded concurrently, causing a capacity based cascading outage.

Expect cold starts, and try to warm the application up with real traffic.

Monitoring

This advice is generic monitoring advice, but it is worth repeating for Java. Make sure the most important and useful metrics are exported from the Java application, collected, and easily graphed. There are many tools and frameworks for exporting metrics, and even more for collecting, aggregating, and displaying them.

When something breaks, troubleshooting the issue should be possible from only the metrics being collected. You should not be depending on log files, or looking at code, to deal with an outage.

Most outages are caused by change. That is, a new version of the application, a config change, a new source of traffic, a hardware failure, or a backend dependency behaving differently. The metrics exported by the application should include ways to identify the version of Java, application, and configuration in use. They should break down sources of traffic, mix, error counts, etc. They should also track the health, latency, error rates, etc. of backend dependencies. Most of the time, this is enough to diagnose an outage quickly.

Specific to Java, there are metrics that can be helpful to understand the health and performance of the application, guiding future decisions on how to scale and optimise it. Garbage collection time, heap size, thread count, and JIT time are all important and Java specific.
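As a rough illustration, the Java-specific metrics listed above can be read in-process through the standard java.lang.management MXBeans. This is only a sketch: the class name JvmMetricsSnapshot and the metric names are made up for this example, and a real service would hand these values to whatever metrics library it already uses.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: the class and metric names are illustrative only.
final class JvmMetricsSnapshot {

    /** Collects a few JVM health metrics into a name -> value map. */
    static Map<String, Long> collect() {
        Map<String, Long> metrics = new LinkedHashMap<>();

        // Cumulative GC count and time, summed over all collectors.
        // A collector may report -1 when a value is undefined, so clamp at 0.
        long gcCount = 0;
        long gcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcCount += Math.max(0, gc.getCollectionCount());
            gcTimeMs += Math.max(0, gc.getCollectionTime());
        }
        metrics.put("gc.count", gcCount);
        metrics.put("gc.time_ms", gcTimeMs);

        // Heap usage as seen by the JVM.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        metrics.put("heap.used_bytes", heap.getUsed());
        metrics.put("heap.committed_bytes", heap.getCommitted());
        metrics.put("heap.max_bytes", heap.getMax()); // -1 if undefined

        // Number of live threads, daemon and non-daemon.
        metrics.put("threads.live", (long) ManagementFactory.getThreadMXBean().getThreadCount());

        // Total time spent in JIT compilation, where the JVM supports measuring it.
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit != null && jit.isCompilationTimeMonitoringSupported()) {
            metrics.put("jit.time_ms", jit.getTotalCompilationTime());
        }
        return metrics;
    }

    public static void main(String[] args) {
        collect().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The GC and JIT values are cumulative since JVM start, so they are best exported as counters and graphed as rates.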

Finally, a note about measuring response times, or latency. That is, the time it takes the application to handle a request. Many make the mistake of looking at average latency, in part because it can be easily calculated. Averages can be misleading, because they don’t show the shape of the distribution. The majority of requests may be handled quickly, but there may be a long tail of requests that are rare but take a while. This is especially troubling for JVM applications, because during garbage collection there is a stop the world (STW) phase, where the application must pause to allow the garbage collection to finish. In this pause, no requests will be responded to, and users may wait multiple seconds.

It is better to collect either the max, or the 99th (or higher) percentile latency. The 99th percentile says that for every 100 requests, 99 are served quicker than this number. Looking at the worst case latency is more meaningful, and more reflective of the user perceived performance.
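To make the difference concrete, here is a small sketch comparing the mean with percentiles on made-up data. The LatencyPercentiles class and the sample values are hypothetical, and it uses the simple nearest-rank definition of a percentile; production metrics libraries usually use streaming histograms instead of sorting raw samples.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not taken from any metrics library.
final class LatencyPercentiles {

    /** Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need at least one sample and 0 < p <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    /** Arithmetic mean of the samples. */
    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Made-up data: 98 fast requests at 10 ms, plus two 2 second stalls
        // of the kind a long GC pause might cause.
        long[] samples = new long[100];
        Arrays.fill(samples, 10);
        samples[98] = 2000;
        samples[99] = 2000;

        System.out.println("mean = " + mean(samples) + " ms");
        System.out.println("p50  = " + percentile(samples, 50) + " ms");
        System.out.println("p99  = " + percentile(samples, 99) + " ms");
    }
}
```

With this data the mean is 49.8 ms and the median is 10 ms, neither of which hints at the stalls, while the 99th percentile is 2000 ms.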

Measure metrics that matter, and that you can later depend on.

Memory Management

A good investment of your time is to learn about the various JVM garbage collection algorithms. The current state of the art is the concurrent collectors, G1 and CMS. You can decide on what may be best for your application, but for now G1 is the likely winner. There are many great articles that explain how they work, but I’ll cover some key topics.

When starting up, the Java Virtual Machine (JVM) reserves a large chunk of OS memory and splits it into heap and non-heap. The non-heap contains areas such as Metaspace (formerly called Permgen), and stack space. Metaspace is for class definitions, and stack space is for each thread’s stack. The heap is used for the objects that are created, which normally takes up the majority of the memory usage. Unlike a typical executable, the JVM has the -Xms and -Xmx flags that control the minimum and maximum size of the heap. These limits constrain the maximum amount of RAM the JVM will use, which can make the memory demands on your servers predictable. It is common to set both these flags to the same value, provisioning them to fill up the available RAM on your server. There are also best practices around this when sizing Docker containers.

Garbage collection (GC) is the process of managing this heap, by finding Java objects that are no longer in use (i.e. no longer referred to), and can be reclaimed. In most cases the JVM scans the full graph of objects, marking each one it finds. At the end, any that weren’t visited are deleted. To ensure there aren’t race conditions, the GC typically has to stop the world (STW), which pauses the application for a short while, while it finishes up.

The GC is a source of (perhaps unwarranted) resentment because it is blamed for many performance problems. Typically this boils down to not understanding how the GC works. For example, if the heap is sized too small, the JVM can aggressively garbage collect, trying futilely to free up space. The application can then get stuck in this “GC thrashing” cycle, making very little progress freeing up space, and spending a larger and larger proportion of time in GC, instead of running the application code.

Two common cases where this can happen are memory leaks and resource exhaustion. Garbage collected languages shouldn’t allow what is conventionally called a memory leak, however, they can occur. Take for example, maintaining a cache of objects that never expire. This cache will grow forever, and even though the objects in the cache may never be used again, they are still referenced, and thus ineligible to be garbage collected.
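One way to avoid the ever-growing cache described above is to put a hard bound on it. The sketch below is a minimal least-recently-used cache built on LinkedHashMap; the BoundedLruCache name is invented for this example, it is not thread safe as written, and in practice a dedicated caching library is likely a better choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class for illustration: a size-bounded, least-recently-used cache.
// Not thread safe; wrap or synchronise it before sharing between threads.
final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, oldest access first
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insert; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedLruCache<String, String> cache = new BoundedLruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a", so "b" becomes the eldest entry
        cache.put("c", "3"); // exceeds the bound, so "b" is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

Because evicted entries are no longer referenced by the cache, they become eligible for garbage collection, and the heap used by the cache stays bounded.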

Another common case is unbounded queues. If your application places incoming requests on an unbounded queue, this queue could grow forever. If there is a spike of requests, objects retained on the queue could increase the heap usage, causing the application to spend more and more time in GC. Thus the application will have less time to process requests from the queue, causing the backlog to grow. This spirals out of control as the GC struggles to find any objects to free, until the application can make no forward progress.
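A bounded queue breaks this spiral by refusing new work once it is full. The following sketch uses ArrayBlockingQueue and its non-blocking offer method; the BoundedRequestQueue wrapper and the use of plain strings as requests are assumptions made to keep the example short.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical wrapper for illustration; requests are modelled as plain strings.
final class BoundedRequestQueue {
    private final BlockingQueue<String> pending;

    BoundedRequestQueue(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    /** Enqueues the request, or returns false immediately if the queue is full. */
    boolean tryEnqueue(String request) {
        return pending.offer(request);
    }

    /** Removes and returns the oldest pending request, or null if there is none. */
    String poll() {
        return pending.poll();
    }

    int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        BoundedRequestQueue queue = new BoundedRequestQueue(2);
        System.out.println(queue.tryEnqueue("request-1")); // true
        System.out.println(queue.tryEnqueue("request-2")); // true
        System.out.println(queue.tryEnqueue("request-3")); // false: full, caller should reject
        System.out.println(queue.size());                  // 2
    }
}
```

When tryEnqueue returns false the caller can reject the request straight away, so the memory held by queued work never exceeds the chosen capacity.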

The garbage collector algorithms have many optimisations to try and reduce total GC time. One important observation, the weak generational hypothesis, is that objects either exist for a short time (for example, related to the handling of a request), or last a long time (such as global objects that manage long lived resources).

Because of this, the heap is further divided into young and old space. The GC algorithm that runs across the young space assumes the object will be freed, and if not, the GC promotes the object into old space. The algorithm for old space makes the opposite assumption, that the object won’t be freed. The size of the young/old spaces may thus also be tuned, and depending on G1 or CMS the approach will be different. But if the young space is too small, objects that should only exist for a short time end up getting promoted to old space. This breaks some of the assumptions the old GC algorithms make, causing GC to run less efficiently, and causing secondary issues such as memory fragmentation.

As mentioned earlier, GC is a source of long tail latency, so should be monitored closely. The time taken for each phase of the GC should be recorded, as well as the fullness of heap space (broken down by young/old/etc) before and after GC runs. This provides all the hints needed to either tune, or improve the application to get GC under control.

Make GC your friend. Careful attention should be paid to the heap, and garbage collector, and it should be tuned (even coarsely) to ensure there is enough heap space even in the fully loaded/worst case.

Other tips

Debugging

Java has many rich tools for debugging during development and in production. For example, it is possible to capture live stack traces, and heap dumps from the running application. This can be useful to understand memory leaks, or deadlocks. However, you must ensure the application is started to allow these features, and that the typical tools, jmap, jcmd, etc are actually available on the server. Running the application inside a Docker container, or non-standard environment, may make this more difficult, so test and write a playbook on how to do this now.

Many frameworks also expose much of this information via web services for easier debugging, for example the Dropwizard /threads resource, or the Spring Boot production endpoints.
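For a feel of what such endpoints do under the hood, the sketch below captures a live thread dump and checks for monitor deadlocks from inside the application, using the standard ThreadMXBean. The ThreadDumper class name and the output format are invented for this example; this is not how Dropwizard or Spring Boot implement their endpoints.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration only.
final class ThreadDumper {

    /** Returns the name, state and stack trace of every live thread as text. */
    static String dumpThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // (false, false): skip details of locked monitors and ownable synchronizers.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Counts threads deadlocked waiting on object monitors (synchronized blocks). */
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findMonitorDeadlockedThreads();
        return ids == null ? 0 : ids.length; // null means no deadlock was found
    }

    public static void main(String[] args) {
        System.out.print(dumpThreads());
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
    }
}
```

An in-process dump like this complements, but does not replace, external tools such as jcmd, which still work when the application itself is too unhealthy to answer requests.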

Don’t wait until you have a production issue, test now how to grab heap dumps and stack traces.

Fewer but larger tasks

There are many features of the JVM that have a fixed cost per running JVM, such as JIT and garbage collection. Your application may also have fixed overheads, such as resource pooling (backend database connections), etc. If you run fewer, but larger (in terms of CPU and RAM) instances, you can reduce this fixed cost, getting an economy of scale. I’ve seen doubling the amount of CPU and RAM a Java application had allow it to handle 4x the requests per second (with no impact to latency). This however makes some assumptions about the application’s ability to scale in a multi-threaded way, but generally scaling vertically is easier than horizontally.

Make your JVM as large as possible.

32-bit vs. 64-bit Java

It used to be common practice to run a 32-bit JVM if your application didn’t use more than 4GiB of RAM. This was because 32-bit pointers are half the size of 64-bit, which reduced the overhead of each Java object. However, modern CPUs are 64-bit, typically with 64-bit specific performance improvements, and RAM is cheap, which makes 64-bit JVMs the clear winner.

Use 64-bit JVMs.

Load Shedding

Again general advice, but important for Java. To avoid overload caused by GC thrashing, or cold tasks, the application should aggressively load shed. That is, beyond some threshold, the application should reject new requests. It may seem bad to reject some requests early, but it is better than allowing the application to become unrecoverably unhealthy and fail all requests. There are many ways to avoid overload, but common approaches are to ensure queues are bounded, and that thread pools are sized correctly. Additionally, outbound requests should have appropriate deadlines, to ensure a slow backend doesn’t cause problems for your application.
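One simple way to implement such a threshold is to cap the number of requests in flight and fail fast beyond it. The sketch below does this with a Semaphore; the LoadShedder class, its handle method, and the string responses are all hypothetical, and a real server would return an error response such as an HTTP 503 instead.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper for illustration: caps the number of in-flight requests.
final class LoadShedder {
    private final Semaphore permits;

    LoadShedder(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    /** Runs the handler if there is capacity, otherwise returns the rejection value immediately. */
    <T> T handle(Supplier<T> handler, T rejected) {
        if (!permits.tryAcquire()) {
            return rejected; // shed load: fail fast instead of queueing
        }
        try {
            return handler.get();
        } finally {
            permits.release(); // always free the slot, even if the handler throws
        }
    }

    public static void main(String[] args) {
        LoadShedder shedder = new LoadShedder(1);

        // While the outer request holds the only permit, a nested request is rejected.
        String nested = shedder.handle(
            () -> shedder.handle(() -> "inner ok", "rejected"),
            "rejected");
        System.out.println(nested); // prints rejected

        // Once the permit has been released, requests are accepted again.
        System.out.println(shedder.handle(() -> "ok", "rejected")); // prints ok
    }
}
```

The limit itself should come from load testing, not guesswork, and should be set below the point where the JVM starts GC thrashing.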

Handle as many requests as you can, and no more.

Conclusion

Hopefully this article has made you think about your Java production environment. While not being prescriptive, we highlight some areas to focus on. The links throughout should guide you in the right direction.

Maven Plugins on Java 8

Sat, 01 Apr 2017 15:21:27 -0700

As part of my standard Maven configuration, I like to use two plugins backed by Google technologies, the first to help keep my code formatted correctly, and the second to check for compile time errors. However, Google recently moved to require JDK 1.8, which broke anyone trying to compile my projects with an older JDK. In this article I’ll quickly explain how to configure Maven to work around this problem.

Specifically I use the following two plugins:

coveo/fmt-maven-plugin (which uses google-java-format). This follows Google’s Java Style Guide, and reformats the code to ensure it stays consistent. This is great when accepting external contributions, as it keeps the code base uniform, and avoids style discussions on pull requests.
plexus-compiler-javac-errorprone (which uses Google’s errorprone). This is a static code analysis tool, that checks for simple errors at compile time, and fails the build if they are found. Again, this helps improve the quality of the code.

Even though my projects typically target 1.7, these plugins need to run under 1.8. Really I’d prefer to bump all my projects to target 1.8+, but since a few of my projects are libraries (which other people include in their projects), that is easier said than done. To deal with this, I changed my Maven configuration to only run these two plugins when run under a sufficient JDK. This means those using an older JDK don’t get the benefits, but since locally I use JDK 8, and all my open source projects use Travis CI, eventually these issues will be identified.

So if you get an error like

java.lang.UnsupportedClassVersionError: com/google/googlejavaformat/java/FormatterException : Unsupported major.minor version 52.0

An API incompatibility was encountered while executing org.apache.maven.plugins:maven-compiler-plugin:3.5.1:compile: java.lang.UnsupportedClassVersionError: javax/tools/DiagnosticListener : Unsupported major.minor version 52.0

Please update to JDK 1.8, or update your Maven configuration to restrict these plugins to when run on a modern JDK:


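...
    <profiles>
        <profile>
            <id>java18</id>
            <activation>
                <jdk>1.8</jdk>
            </activation>
            <build>
                <plugins>
                    <plugin>
                        <groupId>com.coveo</groupId>
                        <artifactId>fmt-maven-plugin</artifactId>
                        <executions>
                            <execution>
                                <goals>
                                    <goal>format</goal>
                                </goals>
                            </execution>
                        </executions>
                    </plugin>
                    <plugin>
                        <groupId>org.apache.maven.plugins</groupId>
                        <artifactId>maven-compiler-plugin</artifactId>
                        <configuration>
                            <compilerId>javac-with-errorprone</compilerId>
                            <forceJavacCompilerUse>true</forceJavacCompilerUse>
                            <showWarnings>true</showWarnings>
                            <compilerArgs>
                                <arg>-Xlint:all</arg>
                            </compilerArgs>
                        </configuration>
                        <dependencies>
                            <dependency>
                                <groupId>org.codehaus.plexus</groupId>
                                <artifactId>plexus-compiler-javac-errorprone</artifactId>
                                <version>2.8.1</version>
                            </dependency>
                            <!-- Override the Error Prone version pulled in by plexus-compiler-javac-errorprone -->
                            <dependency>
                                <groupId>com.google.errorprone</groupId>
                                <artifactId>error_prone_core</artifactId>
                                <version>2.0.19</version>
                            </dependency>
                        </dependencies>
                    </plugin>
                </plugins>
            </build>
        </profile>
    </profiles>
...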

This defines a new profile that is only “activated” under Java 1.8. When activated, the <build> section has the two additional plugins added. Ensure that these plugins are no longer mentioned in the regular <build> section, and only in the <profile> section.

An example of this change can be found in a recent commit.

The importance of tuning your thread pools

Thu, 17 Dec 2015 01:00:00 +0000

Originally published as part of the Java Advent 2015 series

Whether you know it or not, your Java web application is most likely using a thread pool to handle incoming requests. This is an implementation detail that many overlook, but sooner or later you will need to understand how the pool is used, and how to correctly tune it for your application. This article aims to explain the threaded model, what a thread pool is, and what you need to do to correctly configure one.

Single Threaded

Let us start with some basics, and progress with the evolution of the threaded model. No matter which application server or framework you use, Tomcat, Dropwizard, Jetty, they all use the same fundamental approach. Buried deep inside the web server is a socket. This socket is listening for incoming TCP connections, and accepting them. Once accepted, data can be read from the newly established TCP connection, parsed, and turned into an HTTP request. This request is then handed off to the web application, to do with what it wants.

To provide an understanding of the role of threads, we won’t use an application server, instead we will build a simple server from scratch. This server mirrors what most application servers do under the hood. To start with, a single threaded web server may look like this:

ServerSocket listener = new ServerSocket(8080);
try {
	while (true) {
		Socket socket = listener.accept();
		try {
			handleRequest(socket);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
} finally {
	listener.close();
}

This code creates a ServerSocket on port 8080, then in a tight loop the ServerSocket checks for new connections to accept. Once accepted the socket is passed to a handleRequest method. That method would typically read the HTTP request, do whatever processing is needed, and write a response. In this simple example, handleRequest reads a single line, and returns a short HTTP response. It would be normal for handleRequest to do something more complex, such as reading from a database, or conducting some other kind of IO.

final static String response =
	"HTTP/1.0 200 OK\r\n" +
	"Content-type: text/plain\r\n" +
	"\r\n" +
	"Hello World\r\n";

public static void handleRequest(Socket socket) throws IOException {
	// Read the input stream, and return "200 OK"
	try {
		BufferedReader in = new BufferedReader(
			new InputStreamReader(socket.getInputStream()));
		
		log.info(in.readLine());

		OutputStream out = socket.getOutputStream();
		out.write(response.getBytes(StandardCharsets.UTF_8));
	} finally {
		socket.close();
	}
}

As there is only a single thread handling all accepted sockets, each request must be fully handled, before accepting the next. In a real application it could be normal for the equivalent handleRequest method to take on the order of 100 milliseconds to return. If this was the case, the server would be limited to handling only 10 requests per second, one after the other.

Multi-threaded

Even though handleRequest may be blocked on IO, the CPU is free to handle more requests. With a single threaded approach this is not possible. Thus this server can be improved to allow concurrent operations, by creating multiple threads:

public static class HandleRequestRunnable implements Runnable {
	final Socket socket;

	public HandleRequestRunnable(Socket socket) {
		this.socket = socket;
	}

	public void run() {
		try {
			handleRequest(socket);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

// Main loop here
ServerSocket listener = new ServerSocket(8080);
try {
	while (true) {
		Socket socket = listener.accept();
		new Thread( new HandleRequestRunnable(socket) ).start();
	}
} finally {
	listener.close();
}

Here, accept() is still called in a tight loop within a single thread, but once a TCP connection is accepted, and a socket available, a new thread is spawned. This spawned thread executes a HandleRequestRunnable, which simply calls the same handleRequest method from above.

Creating the new thread now frees up the original accept() thread to handle more TCP connections, and allows the application to handle requests concurrently. This technique is referred to as “thread per request”, and is the most popular approach taken. It is worth noting there are other approaches, such as the event driven asynchronous model NGINX and Node.js employ, but they don’t use thread pools, and thus are out of scope for this article.

In the thread per request approach, creating a new thread (and later destroying it) can be expensive, as both the JVM and the OS need to allocate resources. Additionally in the above implementation, the number of threads being created is unbounded. Being unbounded is very problematic, as it can quickly lead to resource exhaustion.

Resource exhaustion

Each thread requires a certain amount of memory for the stack. On recent 64bit JVMs, the default stack size is 1024KB. If the server receives a flood of requests, or the handleRequest method becomes slow, the server may end up with a huge number of concurrent threads. Thus to manage 1000 concurrent requests, the 1000 threads would consume 1GB of the JVM’s RAM just for the threads’ stacks. In addition the code executing in each thread will be creating objects on the heap needed to process the request. This very quickly adds up, and can exceed the heap space assigned to the JVM, putting pressure on the garbage collector, causing thrashing and eventually leading to OutOfMemoryErrors.

As well as consuming RAM, the threads may use other finite resources, such as file handles, or database connections. Exceeding these may lead to other types of errors or crashes. Thus to avoid exhausting resources it is important to avoid unbounded data structures.

Not a panacea, but the stack size issue can be somewhat mitigated by tuning the stack size with the -Xss flag. A smaller stack will reduce the per thread overhead, but potentially leads to StackOverflowErrors. Your mileage will vary, but for many applications the default 1024KB is excessive, and smaller 256KB or 512KB values might be more appropriate. The smallest value Java will allow is 160KB.

Thread pool

To avoid continuously creating new threads, and to bound the maximum number, a simple thread pool can be used. Simply put, the pool keeps track of all threads, creating new ones when needed up to an upper bound, and where possible reusing idle threads.

ServerSocket listener = new ServerSocket(8080);
ExecutorService executor = Executors.newFixedThreadPool(4);
try {
	while (true) {
		Socket socket = listener.accept();
		executor.submit( new HandleRequestRunnable(socket) );
	}
} finally {
	listener.close();
}

Now, instead of directly creating threads, this code uses an ExecutorService, which submits work (in the form of Runnables) to be executed across a pool of threads. In this example a fixed thread pool of four threads is used to handle all incoming requests. This bounds the number of “in-flight” requests, and thus places bounds on the resource usage.

In addition to newFixedThreadPool, the Executors utility class also provides a newCachedThreadPool method. This suffers from the earlier unbounded number of threads, but whenever possible makes use of previously created but now idle threads. Typically this type of pool is useful for short-lived requests that do not block on external resources.

ThreadPoolExecutors can be constructed directly, allowing their behaviour to be customised. For example, the min and max number of threads within the pool can be defined, as well as policies for when threads are created and destroyed. An example of this is shown shortly.

Work queue

In the fixed thread pool case, the observant reader may wonder what happens if all threads are busy, and a new request comes in. Well, the ThreadPoolExecutor may use a queue to hold pending requests before a thread becomes available. Executors.newFixedThreadPool by default uses an unbounded LinkedBlockingQueue. Again this leads to the resource exhaustion problem, albeit much more slowly, since each queued request is smaller than a full thread, and will typically not be using as many resources. However, in our examples, each queued request is holding a socket which (depending on OS) would be consuming a file handle. This is the kind of resource that the operating system will limit, so it may not be best to hold on to it unless needed. Therefore it also makes sense to bound the size of the work queue.

public static ExecutorService newBoundedFixedThreadPool(int nThreads, int capacity) {
	return new ThreadPoolExecutor(nThreads, nThreads,
		0L, TimeUnit.MILLISECONDS,
		new LinkedBlockingQueue<Runnable>(capacity),
		new ThreadPoolExecutor.DiscardPolicy());
}

public static void boundedThreadPoolServerSocket() throws IOException {
	ServerSocket listener = new ServerSocket(8080);
	ExecutorService executor = newBoundedFixedThreadPool(4, 16);
	try {
		while (true) {
			Socket socket = listener.accept();
			executor.submit( new HandleRequestRunnable(socket) );
		}
	} finally {
		listener.close();
	}
}

Again, we create a thread pool, but instead of using the Executors.newFixedThreadPool helper method, we create the ThreadPoolExecutor ourselves, passing a bounded LinkedBlockingQueue capped to 16 elements. Alternatively an ArrayBlockingQueue could have been used, which is an implementation of a bounded buffer.

If all threads are busy, and the queue fills up, what happens next is defined by the last argument to the ThreadPoolExecutor. In this example, a DiscardPolicy is used, which simply discards any work that would overflow the queue. There are other policies, such as the AbortPolicy which throws an exception, or the CallerRunsPolicy which executes the job on the caller’s thread. This CallerRunsPolicy provides a simple way to self limit the rate jobs can be added, however, it could be harmful, blocking a thread that should stay unblocked.

A good default policy is to Discard or Abort, which both drop the work. In these cases it would be easy to return a simple error to the client, such as an HTTP 503 “Service unavailable”. Some would argue that the queue size could just be increased, and then all work would eventually be run. However, users are unwilling to wait forever, and if fundamentally the rate at which work comes in exceeds the rate at which it can be executed, then the queue will grow indefinitely. Instead the queue should only be used to smooth out bursts of requests, or handle short stalls in processing. In normal operation the queue should be empty.
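As a sketch of how a rejection could be turned into an error response, the example below uses the AbortPolicy and catches the resulting RejectedExecutionException. The RejectingServerSketch class, its method names, and the response text are hypothetical; in the socket server above, the caller would write the 503 response to the socket and close it.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names and response text are illustrative only.
final class RejectingServerSketch {
    static final String OVERLOADED =
        "HTTP/1.0 503 Service Unavailable\r\n" +
        "Content-type: text/plain\r\n" +
        "\r\n" +
        "Overloaded, try again later\r\n";

    /** Returns null if the work was accepted, or the 503 response to send if it was rejected. */
    static String submitOrReject(ExecutorService executor, Runnable work) {
        try {
            executor.execute(work);
            return null;
        } catch (RejectedExecutionException e) {
            return OVERLOADED; // pool and queue are both full
        }
    }

    /** Submits blocking tasks to a pool with one thread and one queue slot; counts rejections. */
    static int countRejected(int submissions) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(1, 1,
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(1),
            new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < submissions; i++) {
                String response = submitOrReject(executor, () -> {
                    try {
                        release.await(); // simulate a request stuck on a slow backend
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                if (response != null) {
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // let the accepted tasks finish
            executor.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // One task runs, one waits in the queue, the third is rejected.
        System.out.println("rejected: " + countRejected(3)); // prints rejected: 1
    }
}
```

The AbortPolicy is used here instead of the DiscardPolicy because a silently discarded task gives the caller no chance to respond to the client or close the socket.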

How many threads?

Now that we understand how to create a thread pool, the hard question is: how many threads should be available? We have determined that the maximum number should be bounded to not cause resource exhaustion. This includes all types of resources, memory (stack and heap), open file handles, open TCP connections, the number of connections a remote database can handle, and any other finite resource. Conversely, if the threads are CPU bound instead of IO bound, then the number of physical cores should be considered finite, and perhaps no more than one thread per core should be created.

This all depends on the work the application is doing. A user should run load tests using various pool sizes, and a realistic mix of requests, each time increasing the thread pool size until breaking point. This makes it possible to find the upper bound, for when resources are exhausted. In some cases it may be prudent to increase the number of available resources, for example making more RAM available to the JVM, or tuning the OS to allow for more file handles. However, at some point the theoretical upper bound will be reached, and should be noted, but this is not the end of the story.

Little’s Law

Little’s Law: L = λW

Queuing theory, in particular Little’s Law, can be used to help understand the properties of the thread pool. In simple terms, Little’s Law describes the relationship between three variables: L the number of requests in-flight, λ the rate at which new requests arrive, and W the average time to handle the request. For example, if there are 10 requests arriving per second, and each request takes one second to process, there is an average of 10 requests in-flight at any time. In our example, this maps to using 10 threads. If the time to process a single request is doubled, then the average in-flight requests also doubles to 20, and thus requires 20 threads.

Understanding the impact that execution time has on in-flight requests is very important. It is common for some backend resource (such as a database) to stall, causing requests to take longer to process, quickly exhausting a thread pool. Therefore the theoretical upper bound may not be an appropriate limit for the pool size. Instead, a limit should be placed on execution time, and used in combination with the theoretical upper bound.

For example, let’s say the maximum in-flight requests that can be handled is 1000 before the JVM exceeds its memory allocation. If we budget for each request to take no longer than 30 seconds, we should expect in the worst case to handle no more than 33 ⅓ requests per second. However, if everything is working correctly, and requests take only 500ms to handle, the application can handle 2000 requests per second, on only 1000 threads. It may also be reasonable to specify that a queue can be used to smooth out short bursts of delay.
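The arithmetic above can be checked with a few lines of Java. This is only a sketch of Little’s Law rearranged both ways, and the class and method names are made up for illustration.

```java
final class LittlesLaw {
    // L = lambda * W: average requests in-flight for a given arrival rate and latency.
    static double inFlight(double arrivalsPerSecond, double secondsPerRequest) {
        return arrivalsPerSecond * secondsPerRequest;
    }

    // lambda = L / W: the arrival rate that keeps at most maxInFlight requests in-flight.
    static double maxArrivalRate(double maxInFlight, double secondsPerRequest) {
        return maxInFlight / secondsPerRequest;
    }
}
```

For instance, inFlight(10, 1.0) gives the 10 threads from the earlier example, maxArrivalRate(1000, 30) gives the worst case of 33 ⅓ requests per second, and maxArrivalRate(1000, 0.5) gives the healthy case of 2000 requests per second.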

Why the hassle?

If the thread pool has too few threads, you run the risk of underutilising the resources, and turning users away unnecessarily. However, if too many threads are allowed, resource exhaustion occurs, which can be more damaging.

Not only can local resources be exhausted, but it is possible to adversely impact others. Take, for example, multiple applications querying the same backend database. Databases typically have a hard limit on the number of concurrent connections. If one misbehaving unbounded application consumes all these connections, it would block the others from accessing the database, causing a widespread outage.

Even worse, a cascading failure could occur. Imagine an environment with multiple instances of a single application, behind a common load balancer. If one of the instances begins to run out of memory due to excessive in-flight requests, the JVM will spend more time garbage collecting, and less time handling the requests. That slowdown will reduce the capacity of that one instance, and force the other instances to handle a higher fraction of incoming requests. As they now handle more requests, with their unbounded thread pools, the same problem occurs. They run out of memory, and again begin aggressively garbage collecting. This vicious cycle cascades across all instances, until there is a systemic failure.

Far too often I’ve observed that load testing is not conducted, and an arbitrarily high number of threads is allowed. In the common case the application can happily process requests at the incoming rate using a small number of threads. If however, processing the requests depends on a remote service, and that service temporarily slows down, the impact of increasing W (the average processing time) can very quickly exhaust the pool. Because the application was never load tested at the maximum number, all the resource exhaustion issues outlined before are exhibited.

How many thread pools?

In microservice, or service oriented architectures (SOA), it is normal to access multiple remote backend services. This setup is particularly susceptible to failures, and thought should be given to gracefully dealing with them. If a remote service’s performance degrades, it can cause the thread pool to quickly hit its limit, and subsequent requests are dropped. However, not all requests may require this unhealthy backend, but since the thread pool is full these requests are needlessly dropped.

The failure of each backend can be isolated by providing backend specific thread pools. In this pattern, there is still a single request worker pool, but if the request needs to call a remote service, the work is transferred to that backend’s thread pool. This leaves the main request pool unburdened by a single slow backend. Then only requests needing that particular backend pool are impacted when it malfunctions.
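A minimal sketch of this pattern, using only java.util.concurrent, might look like the following. The two backend names and the pool sizes are hypothetical; a real service would have one pool per remote dependency, sized from load tests.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class BackendPools {
    // Each backend gets its own bounded pool, so one slow backend cannot use every thread.
    private static ThreadPoolExecutor boundedPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize), new ThreadPoolExecutor.AbortPolicy());
    }

    private final ExecutorService slowBackendPool = boundedPool(2, 2);
    private final ExecutorService healthyBackendPool = boundedPool(2, 2);

    // Work for the (hypothetical) slow backend only ever occupies slowBackendPool.
    <T> CompletableFuture<T> callSlowBackend(Supplier<T> call) {
        return CompletableFuture.supplyAsync(call, slowBackendPool);
    }

    // Work for the (hypothetical) healthy backend runs on its own threads.
    <T> CompletableFuture<T> callHealthyBackend(Supplier<T> call) {
        return CompletableFuture.supplyAsync(call, healthyBackendPool);
    }

    void shutdown() {
        slowBackendPool.shutdownNow();
        healthyBackendPool.shutdownNow();
    }
}
```

Even when every thread and queue slot of the slow backend’s pool is occupied, calls routed to the healthy backend still complete, because they never compete for the same threads.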

A final benefit of multiple thread pools, is it helps avoid a form of deadlock. If every available thread becomes blocked on a result of a yet to be processed request, then a deadlock occurs, and no thread is able to move forward. When using multiple pools, and having a good understanding of the work they execute, this issue can be somewhat mitigated.

Deadlines and other best practices

A common best practice is to ensure there is a deadline on all remote calls. That is, if the remote service does not respond within a reasonable time, the request is abandoned. The same technique can be used for work within the thread pool. Specifically, if the thread is processing one request for longer than a defined deadline, it should be terminated, making room for a new request and placing an upper bound on W. This may seem like a waste, but if the user (which might typically be a web browser) is waiting for a response, then after 30 seconds the browser might just give up anyway, or more likely the user becomes impatient and navigates away.
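One way to sketch a deadline with the standard library is Future.get with a timeout, followed by cancel(true) to interrupt the worker. The helper name and the fallback parameter are assumptions for illustration. Note that cancellation only frees the thread if the work responds to interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class Deadlines {
    // Runs work on the pool, returning fallback if it fails or misses the deadline.
    static <T> T callWithDeadline(ExecutorService pool, Callable<T> work,
                                  long deadlineMillis, T fallback) {
        Future<T> future = pool.submit(work);
        try {
            return future.get(deadlineMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so its thread can be reused
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the work itself threw an exception
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            return fallback;
        }
    }
}
```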

Failing fast is another approach that can be taken when creating pools for backends. If the backend has failed, the thread pool will quickly fill up with requests waiting to connect to the unresponsive backend. Instead, the backend can be flagged as unhealthy, and all subsequent requests can fail instantly instead of needlessly waiting. Note, however, that a mechanism is needed to determine when the backend has become healthy again.
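The idea can be sketched as a very small circuit breaker. This is a simplified illustration under stated assumptions, not how any particular library implements it: the class name, the failure threshold, the retry interval and the injected clock are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures before failing fast
    private final long retryAfterMillis;  // how long to wait before probing the backend again
    private final LongSupplier clock;     // injected so the behaviour can be tested
    private int consecutiveFailures;
    private long lastFailureMillis;

    SimpleCircuitBreaker(int failureThreshold, long retryAfterMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
        this.clock = clock;
    }

    // Calls are allowed while healthy, or once the retry interval has passed (a probe).
    synchronized boolean allowsCalls() {
        if (consecutiveFailures < failureThreshold) {
            return true;
        }
        return clock.getAsLong() - lastFailureMillis >= retryAfterMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        lastFailureMillis = clock.getAsLong();
    }

    // Fails fast with fallback when the backend is flagged as unhealthy.
    <T> T call(Supplier<T> backend, T fallback) {
        if (!allowsCalls()) {
            return fallback;
        }
        try {
            T result = backend.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback;
        }
    }
}
```

The probe after retryAfterMillis is the mechanism that detects recovery: a single successful call resets the failure count and normal traffic resumes.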

Finally, if a request will need to call multiple backends independently, it should be possible to call them in parallel, instead of sequentially. This would reduce the wait time, at the cost of increased threads.
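As a rough sketch, two independent backend calls can be issued in parallel with CompletableFuture and combined once both finish. The backend names here are placeholders, and the executor would normally be the backend specific pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Supplier;

final class ParallelCalls {
    // Starts both calls before waiting on either, so the total wait is roughly
    // the slower of the two rather than their sum.
    static String fetchBoth(Supplier<String> profileBackend,
                            Supplier<String> ordersBackend,
                            Executor pool) {
        CompletableFuture<String> profile = CompletableFuture.supplyAsync(profileBackend, pool);
        CompletableFuture<String> orders = CompletableFuture.supplyAsync(ordersBackend, pool);
        return profile.thenCombine(orders, (p, o) -> p + "," + o).join();
    }
}
```

The cost is visible in the signature: one request now occupies two pool threads at once, which must be accounted for when sizing the pools.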

Luckily, there is a great library, Hystrix, which packages many of these best practices and exposes them in a simple and safe way.

Conclusion

Hopefully this article has improved your understanding of thread pools. By understanding the application’s needs, and using a combination of the maximum thread count, and the average response time, an appropriate thread pool can be determined. Not only will this avoid cascading failures, but help plan and provision your service.

Even though your application may not explicitly use a thread pool, they are implicitly used by your application server or higher level abstraction. Tomcat, JBoss, Undertow and Dropwizard all provide multiple tunables for their thread pools (the pools in which your servlet is executed).

Unrolling loops at runtime with Byte Buddy

Wed, 09 Sep 2015 20:29:04 -0700

While creating the UnsafeArrayList, I encountered a problem that I felt I could optimise. The UnsafeArrayList copies objects into off-heap memory, instead of what a normal ArrayList would do, which is to store references to the object in an array on the heap. For example, an UnsafeArrayList<FourLongs> holds instances of FourLongs, whose fields consume a total of 32 bytes (4×8 bytes) of memory. By design, when set() or get() are called, the UnsafeArrayList copies these 32 bytes into or out of a contiguous segment of memory.

To achieve the copying, sun.misc.Unsafe’s putLong() is repeatedly called, moving 8 bytes at a time. For example, this simple loop will copy a long’s worth of memory each iteration, from src, into dest:

final long COPY_STRIDE = 8;
final Unsafe unsafe = UnsafeHelper.getUnsafe();

public void copy(Object dest, long src) {
	long destOffset = 0;
	long destEnd = UnsafeHelper.sizeOf(dest);

	while (destOffset < destEnd) {
		unsafe.putLong(dest, destOffset, unsafe.getLong(src));
		destOffset += COPY_STRIDE;
		src += COPY_STRIDE;
	}
}

Note, we use putLong, not because the UnsafeArrayList is storing objects made up of longs, but because this is the Unsafe method that can copy the most in one go. This putLong method is thus being used as the building block for a more complex looping copy method. Note, this works great for memory which is aligned on an 8 byte boundary, and where the total copy is a multiple of 8 bytes. For the sake of this article, we make the assumption that this is always true.

In the FourLongs case, the copy method would iterate four times. This is predictable, and occurs every time we call get() on an UnsafeArrayList instance. Since this copy loop will be executed every time get() is called, it is worth seeing if we can make it execute faster. A common optimisation is for the developer to manually unroll the loop, avoiding the loop counter, and producing potentially quicker code¹. In this case, manually unrolling the code is not possible because the parameterised type could be any size. For example, an UnsafeArrayList holding a class with two 4 byte int fields would only need to copy 8 bytes. You would hope that the JIT would notice the loop always iterates the same number of times (for a particular list), and be able to remove the loop. Sadly, it does not seem to do this, perhaps because the JVM does not know what side effects unsafe.{get,put}Long have. To measure the cost of the looping we compare the previous code to this:

final int COPY_STRIDE = 8;
final Unsafe unsafe = UnsafeHelper.getUnsafe();

public void copy(Object dest, long src) {
	assert(UnsafeHelper.sizeOf(dest) == 4 * COPY_STRIDE);

	long destOffset = 0;

	unsafe.putLong(dest, destOffset, unsafe.getLong(src));
	destOffset += COPY_STRIDE;
	src += COPY_STRIDE;

	unsafe.putLong(dest, destOffset, unsafe.getLong(src));
	destOffset += COPY_STRIDE;
	src += COPY_STRIDE;

	unsafe.putLong(dest, destOffset, unsafe.getLong(src));
	destOffset += COPY_STRIDE;
	src += COPY_STRIDE;

	unsafe.putLong(dest, destOffset, unsafe.getLong(src));
}

When benchmarked, this manually unrolled code runs 2 times faster! This got me thinking, since a particular UnsafeArrayList instance is always going to copy the same sized object, again and again and again, it could perhaps generate bytecode during creation, that unrolled the loop.

Enter Byte Buddy

Thus investigation into Byte Buddy began, a library designed for generating bytecode at runtime. The rest of this article explains how to use Byte Buddy for this goal.

To start, I used Intellij IDEA’s “Show Bytecode” option, to inspect the code generated by my hand unrolled code.

; Initialisation
  ; long destOffset = 0;
  LCONST_0  ; Load the long zero
  LSTORE 4  ; Store it in “destOffset”

; Copy
  ; unsafe.putLong(dest, destOffset, unsafe.getLong(src));
  ALOAD 0  ; Load “this”
  ; Get the “unsafe” member from this.
  GETFIELD net/bramp/unsafe/Test.unsafe : Lsun/misc/Unsafe;

  ALOAD 1  ; Load dest
  LLOAD 4  ; Load destOffset
  ALOAD 0  ; Load this
  ; Get the “unsafe” member from this.
  GETFIELD net/bramp/unsafe/Test.unsafe : Lsun/misc/Unsafe;

  LLOAD 2  ; Load src
  ; unsafe.getLong(src), storing result on stack.
  INVOKEVIRTUAL sun/misc/Unsafe.getLong (J)J
  ; unsafe.putLong(dest, destOffset, {stack result})
  INVOKEVIRTUAL sun/misc/Unsafe.putLong (Ljava/lang/Object;JJ)V

; Increment
  ; destOffset += 8;
  LLOAD 4   ; Load destOffset
  LDC 8     ; Load 8
  LADD      ; Add destOffset and 8
  LSTORE 4  ; Store result to destOffset

  ; src += 8;
  LLOAD 2   ; Load src
  LDC 8     ; Load 8
  LADD      ; Add src and 8
  LSTORE 2  ; Store result to src

After reading a primer to bytecode, this generated bytecode looked quite simple. It can be broken up into three steps, initialisation, copy, and increment. At runtime, Byte Buddy can be used to generate bytecode that is an unrolled equivalent, such that there is 1 initialisation step, N copy steps, and N-1 increment steps, where N is based on the size of the object the UnsafeArrayList plans to copy.

Reading through the Byte Buddy API it seems the best way to achieve this is to create an abstract class, which will form the base of a generated class. Then at runtime create an instantiation of this abstract class, specialised with the unrolled copy bytecode.

For example, the base class would look like this:

public abstract class UnsafeCopier {
	protected final Unsafe unsafe;

	public UnsafeCopier(Unsafe unsafe) {
		this.unsafe = checkNotNull(unsafe);
	}

	abstract void copy(Object dest, long src);
}

Leaving us to implement the copy(…) method optimally for the size of object being copied.

Using the Builder pattern I created the UnrolledUnsafeCopierBuilder class. The build() method calculates the size of the class being copied, then uses Byte Buddy to generate the copy implementation, and returns a specialised UnsafeCopier instance.

public UnsafeCopier build(Unsafe unsafe) {
	final long length = UnsafeHelper.sizeOf(clazz);

	Class dynamicType = new ByteBuddy()
		.subclass(UnsafeCopier.class)
		.method(named("copy"))
		.intercept(new CopierImplementation(length)).make()
		.load(getClass().getClassLoader(), ClassLoadingStrategy.Default.WRAPPER)
		.getLoaded();

	return (UnsafeCopier) dynamicType
		.getDeclaredConstructor(Unsafe.class)
		.newInstance(unsafe);
}

This begins by calculating the size of the class. Then using a ByteBuddy instance, creates a new dynamicType, which extends UnsafeCopier. This subclass then obtains its copy method with code generated by CopierImplementation(length). Finally, this new dynamicType is used to create an instance of the copier, which is now specialised for copying instances of clazz.

The real meat of the code is in CopierImplementation, which can be explained in pieces:

class CopierImplementation implements ByteCodeAppender, Implementation {

	public static final long COPY_STRIDE = 8;

	final long length;

	public CopierImplementation(long length) {
		this.length = length;
	}

	private StackManipulation buildStack() throws ... {
		...
		final StackManipulation setupStack = ...
		final StackManipulation copyStack = ...
		final StackManipulation incrementStack = ...

		final int iterations = (int) (length / COPY_STRIDE);
		final StackManipulation[] stack = new StackManipulation[1 + 2 * iterations];
		
		stack[0] = setupStack;
		for (int i = 0; i < iterations; i++) {
			stack[i * 2 + 1] = copyStack;
			stack[i * 2 + 2] = incrementStack;
		}

		// Override the last incrementStack with a "return"
		stack[stack.length - 1] = MethodReturn.VOID;

		return new StackManipulation.Compound(stack);
	}

	...
}

Byte Buddy uses StackManipulation objects to define what bytecode to generate. These StackManipulation objects can be built up hierarchically and contain all the bytecode instructions to execute. We define a separate StackManipulation object for each step, and in the buildStack() method combine the steps multiple times into one array. In particular, this stack array contains one initialise step, N copy steps, and N-1 increment steps, with a return instruction on the end.
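To see the layout that buildStack() produces without involving Byte Buddy, here is a stand-in sketch that uses plain strings in place of StackManipulation objects. The class name and the string labels are made up for illustration; only the index arithmetic mirrors the code above.

```java
import java.util.Arrays;
import java.util.List;

final class UnrollLayout {
    static final long COPY_STRIDE = 8;

    // Mirrors the array layout of buildStack(): one setup step, then copy/increment
    // pairs, with the final increment overwritten by the return instruction.
    static List<String> steps(long length) {
        int iterations = (int) (length / COPY_STRIDE);
        String[] stack = new String[1 + 2 * iterations];

        stack[0] = "setup";
        for (int i = 0; i < iterations; i++) {
            stack[i * 2 + 1] = "copy";
            stack[i * 2 + 2] = "increment";
        }

        // The last increment is not needed, so its slot is reused for the return.
        stack[stack.length - 1] = "return";
        return Arrays.asList(stack);
    }
}
```

For a 32 byte object such as FourLongs, this yields setup, four copies separated by three increments, and a return, which matches the 1, N and N-1 step counts described above.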

Recall from the earlier bytecode listing that the initialisation was two bytecode operations, an LCONST and an LSTORE. In Byte Buddy, we can thus do the following:

final StackManipulation setupStack = new StackManipulation.Compound(
	LongConstant.ZERO,                       // LCONST_0
	MethodVariableStore.LONG.storeOffset(4)  // LSTORE 4
);

Byte Buddy provides primitives for most bytecode instructions, which can be combined in these StackManipulation compounds. However, some instructions are missing, for example LADD (needed by the increment step). It is simple enough to create one from scratch, as shown outside of this article.

Next the copy step is defined which is a few more instructions than the increment, but relatively simple:

final Field unsafeField = UnsafeCopier.class.getDeclaredField("unsafe");
final Method getLongMethod = Unsafe.class.getMethod("getLong", long.class);
final Method putLongMethod = Unsafe.class.getMethod("putLong", Object.class, long.class, long.class);

final StackManipulation copyStack = new StackManipulation.Compound(
	// unsafe.putLong(dest, destOffset, unsafe.getLong(src));
	MethodVariableAccess.REFERENCE.loadOffset(0), // ALOAD 0 this

	FieldAccess.forField(new FieldDescription.ForLoadedField(unsafeField))
	                                   .getter(), // GETFIELD

	MethodVariableAccess.REFERENCE.loadOffset(1), // ALOAD 1 dest
	MethodVariableAccess.LONG.loadOffset(4),      // LLOAD 4 destOffset

	MethodVariableAccess.REFERENCE.loadOffset(0), // ALOAD 0 this
	FieldAccess.forField(new FieldDescription.ForLoadedField(unsafeField))
	                                   .getter(), // GETFIELD

	MethodVariableAccess.LONG.loadOffset(2),      // LLOAD 2 src

	MethodInvocation.invoke(new MethodDescription.ForLoadedMethod(getLongMethod)), // INVOKEVIRTUAL getLong
	MethodInvocation.invoke(new MethodDescription.ForLoadedMethod(putLongMethod))  // INVOKEVIRTUAL putLong
);

Again, the bytecode instructions are created as a sequence of StackManipulation objects, replicating the bytecode the Java compiler had generated earlier. This example contains a couple of new StackManipulation classes, in particular those built from the FieldDescription and MethodDescription classes.

The final step is the increment step, which won’t be explained, but for the interested reader the source can be found here.

One last piece of information Byte Buddy needs, is the size of the stack needed for the copy() method, including any space local variables may need. The StackManipulation comes in handy here, as it is able to infer some of these details from the byte code it represents. In particular, the following code calculates the stack size:

public Size apply(MethodVisitor methodVisitor, Implementation.Context implementationContext,
   MethodDescription instrumentedMethod) {

	...

	// Call buildStack() (from above) to generate the bytecode
	StackManipulation stack = buildStack();

	// Calculate the size of this bytecode
	StackManipulation.Size finalStackSize = stack.apply(methodVisitor, implementationContext);

	// Now return the size of this bytecode, plus two, which is the size of the local
	// destOffset variable.
	return new Size(finalStackSize.getMaximalSize(), instrumentedMethod.getStackSize() + 2);
}

An important part here, is the +2, which makes room for the long destOffset variable. If that was missing, the generated bytecode would incorrectly write over instructions on the stack, and most likely crash the JVM.

Now at runtime the UnsafeArrayList’s constructor can use the UnrolledUnsafeCopierBuilder to generate a specialised UnsafeCopier designed for the exact class the UnsafeArrayList is storing.

Results

Now we have most of what we need, it is worth benchmarking this code. Using JMH, we can write three microbenchmarks. One using the original looping code, one using the hand unrolled code, and one using the Byte Buddy unrolled code. The code for the benchmarks is on GitHub, and follows a similar methodology to that in a previous article.

The results are as you may expect:

Benchmark	Mode	Cnt	Score	Error	Units
Loop	thrpt	25	218.056	± 11.123	ops/us
Hand Unrolled	thrpt	25	430.376	± 27.448	ops/us
Byte Buddy Unrolled	thrpt	25	437.139	± 22.811	ops/us

The loop code can execute ~218 times per microsecond, whereas both the Byte Buddy and hand unrolled code had near identical performance, of ~430-437 operations per microsecond, nearly twice as fast. Of course, not measured here is the startup cost of generating the unrolled code. It is assumed this technique would only be used when the generated code would exist for a long time. Otherwise the setup cost undoes any per execution savings.

Conclusion

In summary, we managed to unroll a loop at runtime by generating on demand bytecode for that specific purpose. This was possible by inspecting machine generated bytecode, and using Byte Buddy to generate equivalent bytecode at runtime, customised specifically with the correct number of unrolled iterations.

This technique may seem completely crazy, and I don’t suggest it’s used unless you know what you are doing. That includes actually measuring that you have a performance problem which could be fixed with this, and not being able to depend on the JVM’s own JIT to do this optimisation for you.

Helpful Links: GitHub Home | GitHub Code | JavaDoc

¹ Unrolled code is not always faster, as larger code may not fit into the CPU instruction cache. ↩︎

Unsafe Part 3: Benchmarking a java UnsafeArrayList

Thu, 27 Aug 2015 20:39:04 -0700

Previously we introduced an UnsafeArrayList, an ArrayList style collection that, instead of storing references to the objects, uses sun.misc.Unsafe and UnsafeHelper to copy the objects into off-heap allocated memory. This has the unique property of keeping all objects contiguous in memory, and avoids a pointer indirection, at the cost of needing to copy values in and out. This article aims to benchmark this list, and understand its unique characteristics.

Methodology

To test the performance of this new style of list, a series of benchmarks were devised. The new JMH benchmark framework was used, and final benchmark code is available here.

Multiple iterations were run, and unless stated results were calculated with a 99% confidence interval. A couple of warmup iterations were always run and discarded. All tests were run on an Ubuntu Linux 3.19.0-22 desktop, with a 64bit Intel® Core™ i3-2125 CPU @ 3.30GHz, and 16 GiB of 1333 MHz DDR3 RAM. The JVM was OpenJDK (version 1.8.0_45-internal).

For each benchmark new ArrayLists and UnsafeArrayLists were constructed, and populated with newly created objects. The size of the lists was varied, up to a maximum that could be held in memory without disk swapping. Two artificial workloads were created:

Reading items from the lists start to finish, and
Processing the elements in a random order.

The first was reproduced by simply reading the first field of every element of the list in order, and the second by sorting the list based on the object’s fields (with a simple quicksort).

Three test classes of different sizes were created to be stored within the lists: one class had two long fields, one had four long fields, and finally one had eight long fields. These were named TwoLongs, FourLongs and EightLongs, requiring 16, 32, and 64 bytes for the fields respectively. Each iteration these classes were created with random values in the fields.

The Results

Benchmark	List	Type	Size	Mean Time (s)
Iterate	ArrayList	TwoLongs	80,000,000	2.266 ± 0.229
Iterate	UnsafeArrayList	TwoLongs	80,000,000	1.79 ± 0.03
IterateInPlace	UnsafeArrayList	TwoLongs	80,000,000	0.442 ± 0.023

Iterate	ArrayList	FourLongs	80,000,000	2.277 ± 0.211
Iterate	UnsafeArrayList	FourLongs	80,000,000	2.126 ± 0.019
IterateInPlace	UnsafeArrayList	FourLongs	80,000,000	0.648 ± 0.019

Iterate	ArrayList	EightLongs	80,000,000	2.792 ± 0.072
Iterate	UnsafeArrayList	EightLongs	80,000,000	2.672 ± 0.322
IterateInPlace	UnsafeArrayList	EightLongs	80,000,000	0.941 ± 0.032

Sort	ArrayList	TwoLongs	80,000,000	70.31 ± 3.939
Sort	ArrayList	FourLongs	80,000,000	79.673 ± 6.119
Sort	ArrayList	EightLongs	80,000,000	97.687 ± 4.86

Sort	UnsafeArrayList	TwoLongs	80,000,000	18.69 ± 3.158
Sort	UnsafeArrayList	FourLongs	80,000,000	24.822 ± 0.79
Sort	UnsafeArrayList	EightLongs	80,000,000	40.697 ± 0.743

Iterate

Starting with the smallest test object, TwoLongs, to read the first field of all 80 million elements within an ArrayList took on average 2.266 ± 0.229 seconds. To do the same with the UnsafeArrayList (which doesn’t store objects, and instead copies elements in/out) took on average 1.79 ± 0.03 seconds (a 24% improvement).

Remember from the previous article, UnsafeArrayList has two methods for retrieving an element: T get(int index) and T get(T dest, int index). The former creates a new object and copies the fields. The latter copies the fields in place into a given destination object, allowing the reuse of a single temp object, and avoiding the creation of new objects, and thus is labelled “InPlace” in the above results.
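The difference between the two get methods can be sketched without sun.misc.Unsafe. The following stand-in packs TwoLongs fields into a plain long[] rather than off-heap memory, so it only illustrates the allocating versus in-place behaviour; PackedTwoLongsList is a made-up name and not the real UnsafeArrayList.

```java
final class TwoLongs {
    long a;
    long b;
}

final class PackedTwoLongsList {
    // Stand-in for the contiguous memory region: two slots per element.
    private final long[] memory;

    PackedTwoLongsList(int capacity) {
        this.memory = new long[capacity * 2];
    }

    // Copies the fields of src into the backing memory.
    void set(int index, TwoLongs src) {
        memory[index * 2] = src.a;
        memory[index * 2 + 1] = src.b;
    }

    // Allocates a new object for every call, then copies the fields into it.
    TwoLongs get(int index) {
        return get(new TwoLongs(), index);
    }

    // Copies the fields into a caller supplied object, so no allocation happens.
    TwoLongs get(TwoLongs dest, int index) {
        dest.a = memory[index * 2];
        dest.b = memory[index * 2 + 1];
        return dest;
    }
}
```

An in-place iteration would create one TwoLongs before the loop and pass it to get(dest, i) on every step, which is what the IterateInPlace benchmark does with the real list.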

It is therefore surprising that the UnsafeArrayList can iterate 24% faster than an ArrayList, when it has the additional overhead of creating an object and copying fields into it, whereas the ArrayList is just reading existing objects.

Some theory is needed to understand what might be happening here. A modern CISC CPU can execute an instruction in a few clock cycles, let’s say ~0.5 nanoseconds, however, reading from RAM takes ~10 nanoseconds. While the CPU is waiting for the response from RAM it is effectively blocked. To compensate the CPU deploys a few tricks, two of which could be helping here. Firstly, the CPU tries to predict and prefetch the next memory request. Secondly, the CPU will execute instructions out of order, thus not waiting for the memory if a later instruction does not depend on the read.

In the ArrayList case, the array of references is stored in contiguous memory. However, the actual objects (that the references point to) could be anywhere in RAM. As the program loops through, it is making reads from effectively random locations in memory that can’t be predicted, and thus stall the CPU.

In the UnsafeArrayList case, the CPU is almost certainly prefetching the next elements before they are needed. Additionally the cost of creating these short lived objects is most likely very small because they live and die in eden space and are thus simple to create and garbage collect. I also would not be surprised if the CPU or the JIT compiler was able to do some kind of vectorising on the input, that is, operating on multiple entries at the same time.

If we then test the T get(T dest, int index) method (labelled IterateInPlace), it can iterate through the array in an impressive 0.442 ±0.023 seconds. That’s 5 times faster than the ArrayList, and 4 times faster than the T get(int index). This is certainly because the objects are not created for each get.

It was not measured here, but it is possible to confirm what the CPU is doing, by using hardware based performance counters. These are special registers within the CPU that can be configured to measure cache hit/miss rates, prefetches, instructions per cycle, and many other metrics. These can be invaluable to understand what’s truly going on, as in most cases humans are bad at understanding performance bottlenecks through intuition alone. Tools such as oprofile, perf, dtrace and systemtap can be used for this.

To do a quick sanity check, in the ArrayList case it takes an average of 28.325 nanoseconds per element. According to Wikipedia it takes between 9.00-18.75 nanoseconds to read from DDR3 memory at 1333 MHz. Thus this number doesn’t seem unexpected, as the ArrayList has to issue two memory reads, firstly reading sequentially from an array of references, and then reading from the object (which is at an unpredictable address).

With the UnsafeArrayList in-place test, it takes an average of 5.53 nanoseconds per element. As the fields are stored contiguously in memory, the CPU can efficiently pipeline the requests, amortizing the 9-18 ns memory read cost. Here the speed is most likely limited by either the memory’s bandwidth, or the CPU’s clock cycles. To read 80 million memory addresses in 0.442 seconds requires 180 megatransfers per second, and assuming each object is two longs, or 16 bytes, requires ~2.68 GiB/s of throughput. Neither of those values approach the upper limit of what DDR3 is capable of, thus I suspect the time is a combination of this and CPU instructions.
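The sanity check figures follow directly from the benchmark table, as the small calculation below shows. The class and method names are only for illustration.

```java
final class IterateSanityCheck {
    // Average time spent on each element, in nanoseconds.
    static double nanosPerElement(double seconds, long elements) {
        return seconds * 1e9 / elements;
    }

    // Number of elements read per second.
    static double elementsPerSecond(double seconds, long elements) {
        return elements / seconds;
    }

    // Data throughput in GiB/s, given the size of each element's fields.
    static double gibPerSecond(double seconds, long elements, int bytesPerElement) {
        return (double) elements * bytesPerElement / seconds / (1L << 30);
    }
}
```

Plugging in 80 million elements gives about 28.3 ns per element for the 2.266 second ArrayList run and about 5.5 ns for the 0.442 second in-place run, which also works out to roughly 181 million elements per second and about 2.7 GiB/s for 16 byte elements.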

Sorting

The second benchmark measured the speed at which the lists could be read and written to somewhat randomly, and in particular sorted. This should cause less predictable reads from memory. To sort 80 million elements in the ArrayList took 70.31 ± 3.939 seconds, and only 18.69 ± 3.158 seconds for the UnsafeArrayList using the in-place get. The relative improvement is not as impressive as in the previous test, but still the UnsafeArrayList is ~3.7 times as quick. I’m unsure exactly why the UnsafeArrayList would be faster, but I suspect it is related to the fewer memory indirections, and the prefetching effect the copying of fields has.

It’s also worth noting that the performance increase becomes less pronounced as the size of the stored class increases. For FourLongs the difference between ArrayList and UnsafeArrayList is 3.2x, and for EightLongs the difference is 2.4x. This can easily be explained by the increasing cost of copying the fields in and out of the list. Even so, I would argue that the copy cost is in part hidden, as it is effectively prefetching the object’s fields into the CPU cache, saving a memory load when the field is actually used (most likely shortly after it is pulled from the list).

Other observations

Overlooked is the smaller memory requirement of the UnsafeArrayList. A TwoLongs instance is 16 bytes of data, plus 16 bytes of JVM object header. Thus an ArrayList of 80 million instances takes 2.4 GiB of RAM (32 bytes x 80M), plus an additional 305 MiB for an array of 80 million references (assuming compressed object pointers take 4 bytes each), totalling 2.68 GiB. The UnsafeArrayList takes 16 bytes per entry, totalling only 1.2 GiB (roughly half the size!).

Of course if the array is holding larger classes (such as EightLongs), the per object overhead is proportionally smaller. In this case the totals are 6.25 GiB vs 4.76 GiB, roughly 75% of the size.
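These footprint estimates can be reproduced with a short calculation. It assumes, as the text does, a 16 byte object header and 4 byte compressed references; the class and method names are hypothetical.

```java
final class FootprintEstimate {
    static final int HEADER_BYTES = 16;    // assumed JVM object header size
    static final int REFERENCE_BYTES = 4;  // assumed compressed object pointer size
    static final double GIB = 1L << 30;

    // ArrayList: fields plus header for each object, plus one reference per element.
    static double arrayListGiB(long elements, int fieldBytes) {
        return (double) elements * (fieldBytes + HEADER_BYTES + REFERENCE_BYTES) / GIB;
    }

    // UnsafeArrayList: only the raw field bytes are stored.
    static double unsafeArrayListGiB(long elements, int fieldBytes) {
        return (double) elements * fieldBytes / GIB;
    }
}
```

For 80 million TwoLongs this gives roughly 2.68 GiB against 1.19 GiB, and for EightLongs roughly 6.26 GiB against 4.77 GiB.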

One last observation of interest is the confidence intervals for the results. A larger error implies more variability in the test runs. For example, if the garbage collector ran during some of the runs, and slowed down the test, it would increase this error. In all the tests using the UnsafeArrayList in-place methods, the confidence interval is smaller, implying more consistency and predictability. This can be important in certain situations, such as real-time systems.

Conclusion

We benchmarked the UnsafeArrayList against a normal ArrayList in two artificial workloads. We found that in both the start-to-finish iteration and the sorting case, the UnsafeArrayList was roughly 4-5x faster than its counterpart. This result itself is interesting when designing high performance data structures, however, the use of sun.misc.Unsafe is considered dangerous, and thus the performance comes with many caveats and risks. In fact, it was recently announced that the Unsafe class is being deprecated and hidden in Java 9. So instead, this was just an insightful journey into how the CPU can optimise particular workloads, and how Java can be pushed to extreme speeds.

Your results may vary, and as always you should benchmark your exact workload instead of a hypothetical one, but this was still an interesting experiment.

Unsafe Part 2: Using sun.misc.Unsafe to create a contiguous array of objects

Wed, 26 Aug 2015 17:51:02 -0700

I recently came across an article from the Mechanical Sympathy blog, that used the flyweight pattern to build a “compact off-heap” array of objects. They basically allocated an area of memory large enough to store N copies of their object. Then using a single instance of a proxy object, would pack/unpack fields into this memory. For example, let’s say we needed to store an array of Point objects. We could construct a simple array like so:

Point[] points = new Point[N];

The inefficiency here is that each instance of a Point requires 12-16 bytes of overhead to store metadata about the object (such as class, GC state, etc), and each additional instance adds to the cost of garbage collection. Additionally, the array actually contains references to Point objects stored elsewhere in RAM. These references require a memory indirection when accessing the actual instances.

In the Mechanical Sympathy article, they instead packed all the fields of the instances into a contiguous array. For simplification I changed their example, but it was something like this:

int[] memory = new int[N*2];

class ProxyPoint {
    private int index = 0;

    public void setIndex(int index) {
        this.index = index;
    }

    public int getX() {
    	return memory[index*2 + 0];
    }

    public int getY() {
    	return memory[index*2 + 1];
    }
}

With this approach there is no overhead for each Point object (as there is only ever one ProxyPoint, and one array). This also has the interesting property that the fields for all the Points are stored in the same contiguous region of memory, which leads to some great cache/CPU benefits. For example, if you read all the points sequentially, adjacent objects share the same CPU cache line, and the CPU can predictably prefetch the next point. This would not be possible with an array of references to Points, as each Point could potentially be stored anywhere in RAM.

Now with this primer, it would be interesting to have a normal Java List that stored fields packed together like this. The above solution only works if you create a proxy object ahead of time, knowing what class you will be storing. Using the recently released UnsafeHelper class (discussed previously), I set out to build something that looked like a standard generic ArrayList and could store any type, but with the benefit of storing all elements in a contiguous region of memory.

The final solution is UnsafeArrayList.java. This implements the Java List interface, but instead of storing references to objects, it copies the object into a contiguous region of memory. If you are a C++ programmer, you can think of this as a std::vector<Point> instead of a std::vector<Point*>. This minor change comes with its own pros and cons, outlined later.

To begin with, the list is constructed like so: new UnsafeArrayList<Point>(Point.class). The Point.class is passed in so that the list knows what kind of objects it will be storing. This is required due to a limitation in Java’s implementation of generics (type erasure) that makes it impossible for a class to know its own generic type at runtime.
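As a small illustration of why the class has to be passed in, here is a sketch of a generic container that keeps a Class<T> token. TypedBox is a hypothetical class invented for this example and has nothing to do with the UnsafeArrayList code:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical container showing why a Class<T> token is passed to the constructor. */
final class TypedBox<T> {
    private final Class<T> type;
    private final List<T> items = new ArrayList<>();

    TypedBox(Class<T> type) {
        this.type = type;
    }

    /** Only possible because the token was supplied; T itself is erased at runtime. */
    Class<T> elementType() {
        return type;
    }

    /** Runtime type check via the token; throws ClassCastException on a mismatch. */
    void addChecked(Object candidate) {
        items.add(type.cast(candidate));
    }

    int size() {
        return items.size();
    }
}
```

Without the token, a TypedBox<String> and a TypedBox<Integer> would be indistinguishable at runtime, so there would be no way to compute a size or field offset for the element type.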

The constructor begins by calculating the size of an instance, and uses the UnsafeHelper to calculate the offset to the first field within an instance.

public UnsafeArrayList(Class<T> type, int capacity) {
    this.firstFieldOffset = UnsafeHelper.firstFieldOffset(type);
    this.elementSize      = UnsafeHelper.sizeOf(type) - firstFieldOffset;
    this.unsafe           = UnsafeHelper.getUnsafe();

An area of memory is then allocated, like so:

    base = unsafe.allocateMemory(elementSize * capacity);

This base variable holds the address to the beginning of the memory, and can only be used via the Unsafe class. The memory is large enough to hold capacity objects of elementSize bytes.

Unlike a Java reference, this base address allows pointer arithmetic, and thus to access a particular element we have a simple method to calculate the memory offset:

    private long offset(int index) {
        return base + (index * elementSize);
    }

Then to set an element within this List, we copy its fields into the allocated memory:

    @Override
    public T set(int index, T element) {
        unsafe.copyMemory(element, firstFieldOffset, // src, src_offset
                          null, offset(index),       // dst, dst_offset
                          elementSize);              // size

This copies from object element, starting at offset firstFieldOffset, into the raw memory address determined by offset(index).

The get method is a little more problematic, as the List interface expects get to return an instance of the object. Since we aren’t actually storing references to the objects (but copies of their fields), we need to construct an instance and populate it. This is quite costly, and defeats the point of this UnsafeArrayList. Instead an additional get method is provided, that allows an object to be passed in, which will have its fields replaced.

    public T get(T dest, int index) {
        unsafe.copyMemory(null, offset(index),
                          dest, firstFieldOffset,
                          elementSize);
        return dest;
    }
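The same copy-in, copy-out pattern can be sketched without Unsafe. The example below deliberately swaps in a direct java.nio.ByteBuffer and copies the fields by hand, so it only illustrates the shape of set() and the reusable-destination get(); it is not how UnsafeArrayList is implemented, and SketchPoint and PackedPointStore are names made up for this sketch:

```java
import java.nio.ByteBuffer;

/** Illustrative element type with two int fields. */
final class SketchPoint {
    int x;
    int y;
}

/** Stores copies of SketchPoint fields in one contiguous off-heap buffer. */
final class PackedPointStore {
    private static final int ELEMENT_SIZE = 2 * Integer.BYTES;
    private final ByteBuffer memory;

    PackedPointStore(int capacity) {
        this.memory = ByteBuffer.allocateDirect(capacity * ELEMENT_SIZE);
    }

    /** Byte offset of an element, the analogue of base + index * elementSize. */
    private int offset(int index) {
        return index * ELEMENT_SIZE;
    }

    /** Copies the fields of src into the buffer; no reference to src is kept. */
    void set(int index, SketchPoint src) {
        memory.putInt(offset(index), src.x);
        memory.putInt(offset(index) + Integer.BYTES, src.y);
    }

    /** Overwrites the fields of dest, so a caller can reuse one instance. */
    SketchPoint get(SketchPoint dest, int index) {
        dest.x = memory.getInt(offset(index));
        dest.y = memory.getInt(offset(index) + Integer.BYTES);
        return dest;
    }
}
```

A loop can then allocate a single SketchPoint up front and pass it to get() on every iteration, which avoids creating one object per element read.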

For completeness a standard get(int index) method is provided, which creates a new instance of the object (using unsafe.allocateInstance() instead of new Type).

    public T get(int index) {
        return get((T) unsafe.allocateInstance(type), index);
    }

You can inspect the rest of the code via GitHub, but these are the main parts.

In conclusion, this approach has some pros and cons, but was mostly created for fun.

Pros
A List<> implementation that stores objects in contiguous memory
Better cache locality and CPU performance
Minimal memory overhead
Cons
Uses sun.misc.Unsafe
Additional CPU cycles needed to copy objects in and out of the array
Copies objects out of the garbage collector’s view; thus if a stored object holds the only reference to another object, the garbage collector will not know that object is still in use.

In the next article, we’ll benchmark this UnsafeArrayList, and investigate the performance impact of the cache locality, and other overheads.

Unsafe Part 1: sun.misc.Unsafe Helper Classes

Mon, 24 Aug 2015 20:13:58 -0700

I recently came across the sun.misc.Unsafe class, a poorly documented, internal API that gives your Java program direct access to the JVM’s memory. Of course accessing the JVM’s memory can be considered unsafe, but it allows for some exciting opportunities.

You can use Unsafe to inspect and manipulate the layout of your objects in RAM, allocate memory off the heap, do interesting things with threads, or even hack in multiple inheritance. Multiple people have written about Unsafe before, and there are some really good articles, so we won’t cover it here.

Using Unsafe is not too difficult, but I found the need for a few helper methods, thus I created a collection of classes wrapping the Unsafe code, starting with UnsafeHelper. The main methods of interest are getUnsafe(), sizeOf(), firstFieldOffset(), toByteArray() and hexDump(). The javadoc is the best place to look for documentation, however I’ll quickly explain their use.

To get a sun.misc.Unsafe instance, you have to extract it from a private static field within the sun.misc.Unsafe class. For ease, the UnsafeHelper.getUnsafe() method does that.
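For reference, the usual way to do this is with reflection. The sketch below assumes a HotSpot/OpenJDK style JVM where the private static field is named theUnsafe (other runtimes may use a different name), and UnsafeAccess is a hypothetical class name, not the actual UnsafeHelper source:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

/** Hypothetical helper showing the common reflective lookup of sun.misc.Unsafe. */
final class UnsafeAccess {
    static Unsafe getUnsafe() {
        try {
            // Assumption: the singleton lives in a private static field called "theUnsafe".
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null); // static field, so no receiver object
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }
}
```

The compiler will typically warn that Unsafe is an internal API, which is expected here.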

When accessing an object, you typically need to know the size of the object (in bytes), and be able to find the offset to individual fields. If you understand the memory layout the JVM uses, you’ll know there is a header in front of the Object’s fields. Typically it looks like this, but varies based on CPU architecture, platform, etc:

Bytes 0-7: mark word (8 bytes)
Bytes 8-11: klass pointer (4 bytes)
Bytes 12-15: padding

More information [here][6] and [here][7].

To hide some of the details, headerSize() returns the size of the header, and sizeOf() returns the total size of an object in bytes, including the header. firstFieldOffset() is then useful as it provides the offset to the first field. Note that headerSize() and firstFieldOffset() do not always return identical results, as padding (not part of the header) may be used to correctly align the first field.

Next, toByteArray() will take an object, and copy it (and its header) into a byte array, which is useful for easily inspecting and serialising the object. Finally, hexDump() uses toByteArray() to grab an object, and prints out a hex representation of the memory, for example:

/**
 * hexDump(new Class4()) prints:
 * 0x00000000: 01 00 00 00 00 00 00 00  8A BF 62 DF 67 45 23 01
 */
static class Class4 {
    int i = 0x01234567;
}

/**
 * Longs are always 8 byte aligned, so 4 bytes of padding
 * hexDump(new Class8()) prints:
 * 0x00000000: 01 00 00 00 00 00 00 00  9B 81 61 DF 00 00 00 00
 * 0x00000010: EF CD AB 89 67 45 23 01
 */
static class Class8 {
    long l = 0x0123456789ABCDEFL;
}

In the first example, Class4, a simple class with a single int field, takes up 16 bytes of memory: the first 8 bytes are used by the JVM, the next 4 bytes are a class pointer (basically how the object knows what kind of class it is), and the last 4 are the value of the field. The second example shows a similar header, but with bytes 12-15 being used as padding, so that the long field value is 8 byte aligned.
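The field bytes in these dumps appear reversed (67 45 23 01 for 0x01234567), presumably because the dump was taken on a little-endian machine. A quick way to check that reading is to encode the same values with a little-endian ByteBuffer. EndianDemo is just an illustrative name:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative check of how int and long values look in little-endian byte order. */
final class EndianDemo {
    /** Formats bytes as space separated, two digit, upper case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String littleEndianInt(int value) {
        ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        return hex(buffer.putInt(value).array());
    }

    static String littleEndianLong(long value) {
        ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        return hex(buffer.putLong(value).array());
    }

    public static void main(String[] args) {
        System.out.println(littleEndianInt(0x01234567));          // prints 67 45 23 01
        System.out.println(littleEndianLong(0x0123456789ABCDEFL)); // prints EF CD AB 89 67 45 23 01
    }
}
```

Both outputs match the trailing bytes of the Class4 and Class8 dumps above.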

These helper methods are available in a new project on GitHub, and downloadable via Maven. Just download the jar file, or include a Maven dependency, and import net.bramp.unsafe.UnsafeHelper.


<dependency>
    <groupId>net.bramp.unsafe</groupId>
    <artifactId>unsafe-helper</artifactId>
    <version>1.0</version>
</dependency>

In the next article, we’ll make use of this new UnsafeHelper to build a special List which copies objects, instead of storing references.

Decompile and Recompile Android APK

Sat, 01 Aug 2015 12:24:59 -0700

I needed to take an existing Android APK, tweak it, and rebuild it. This is not too difficult, but I did have to download the tools from a few different sites, and piece together a full list of instructions. Thus to make this easier, here is a quick recap of what’s needed.

Download the following:

apktool - tool for reverse engineering Android apk files. In this case it is used to extract and rebuild the APK.
keytool - Java tool for creating keys/certs. Comes with the JDK.
jarsigner - Java tool for signing JAR/APK files. Comes with the JDK.
zipalign - archive alignment tool, that comes with the Android SDK.

Some extras:

JD-GUI - Java Decompiler
dex2jar - Converts Android dex files to class/jar files.

Instructions:

We assume you are on Linux or a Mac, but this will work (with some tweaking) on Windows. Install a recent Java JDK, then the Stand-alone Android SDK, and finally apktool.

Optionally set up some aliases:

alias apktool='java -jar ~/bin/apktool_2.0.1.jar'
alias dex2jar='~/bin/dex2jar-2.0/d2j-dex2jar.sh'
alias jd-gui='java -jar ~/bin/jd-gui-1.3.0.jar'

First, unpack the application.apk file. This will create an “application” directory with assets, resources, compiled code, etc.

apktool d -r -s application.apk

Now poke around, and edit any of the files in the application directory. If you wish to decompile any Java you can do the following:

# Convert the Dex files into standard class files
dex2jar application/classes.dex

# Now use the JD (Java Decompiler) to inspect the source
jd-gui classes-dex2jar.jar

Once you have made your changes, you need to repack the APK. This will create a my_application.apk file:

apktool b -f -d application
mv application/dist/application.apk my_application.apk

The APK must be signed before it will run on a device. Create a key if you don’t have an existing one. If prompted for a password, enter anything (but remember it).

keytool -genkey -v -keystore my-release-key.keystore -alias alias_name \
                   -keyalg RSA -keysize 2048 -validity 10000

Now sign the APK with the key:

# Sign the apk
jarsigner -verbose -sigalg SHA1withRSA -digestalg SHA1 -keystore my-release-key.keystore my_application.apk alias_name

# Verify apk
jarsigner -verify -verbose -certs my_application.apk

Finally, the apk must be aligned for optimal loading:

zipalign -v 4 my_application.apk my_application-aligned.apk

Voila, now you have a my_application-aligned.apk file, which you can side load onto your device.

Grabbing a Certificate with OpenSSL and importing it into Java

Sat, 16 Aug 2014 00:00:00 +0000

Occasionally I have to grab an SSL cert from a server, and turn it into something that Java can use. Here are the quick instructions:

# Store the cert issued by a web server
openssl s_client -showcerts -connect www.google.com:443 > www.google.com.pem

# Convert it from PEM format to DER format
openssl x509 -in www.google.com.pem -inform PEM -out www.google.com.der -outform DER

# Import it into your keystore
sudo /usr/java6/bin/keytool -import -alias www.google.com -file www.google.com.der -keystore /usr/java6/jre/lib/security/cacerts

# The keystore password is by default "changeit"
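
If the certificate only needs to be trusted by one program, rather than added to the JVM-wide cacerts file, roughly the same import can be done in code with the standard java.security APIs. This is a minimal sketch, assuming the .der file produced by the steps above; CertImport and its method names are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;

/** Hypothetical helper; not part of the original instructions. */
final class CertImport {
    /** Creates an empty in-memory keystore of the JVM's default type. */
    static KeyStore emptyStore() throws GeneralSecurityException, IOException {
        KeyStore store = KeyStore.getInstance(KeyStore.getDefaultType());
        store.load(null, null); // a null stream initialises an empty keystore
        return store;
    }

    /** Reads one X.509 certificate from the stream and stores it under the alias. */
    static void importCertificate(KeyStore store, String alias, InputStream in)
            throws GeneralSecurityException {
        CertificateFactory factory = CertificateFactory.getInstance("X.509");
        Certificate cert = factory.generateCertificate(in);
        store.setCertificateEntry(alias, cert);
    }
}
```

A caller would open www.google.com.der as an InputStream, pass it to importCertificate() with an alias such as www.google.com, and then hand the resulting KeyStore to a TrustManagerFactory.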

Groovy / Grails

Wed, 04 Jul 2012 00:00:00 +0000

Over the past couple of weeks I’ve been playing with Groovy and Grails, and after a somewhat frustrating week I thought I’d share my thoughts. Groovy is a dynamic language that runs in a standard JVM, and effectively extends the Java language. This makes it easy for existing Java programmers to pick it up and ease into it. Grails is the Groovy equivalent of Ruby on Rails, a rapid web development framework. I had high hopes for both, as Groovy adds lots of interesting features to Java, such as closures, dynamic typing, mixins, and lots of clever syntax to reduce code and speed up the average developer. On top of this, Grails can quickly scaffold an MVC framework, allowing you to literally build a CRUD based application in minutes.

This all sounds great, but I think both of these technologies are still young and there are a lot of things to work out. I was consistently hitting bugs in Grails, and I found the support for Groovy to be lacking in my IDE of choice, Eclipse, forcing me to move to IntelliJ, which did a much better job.

Groovy

Dynamic typing

Dynamic variable typing allows you to create a variable and not declare what type it is. Then as you use the variable you can very easily convert it between types. To be honest, and maybe I miss the point, but I’ve never been fond of dynamic typing in other languages. I tend to create a variable and ensure I keep it a particular type. I do this because dynamic typing can introduce all sorts of errors, and you have to truly understand the rules. For example, if I try to convert a String to a boolean (as I might do in a condition), which Strings evaluate to true and which to false? In Groovy an empty string is false, but a string with a single whitespace char would be true.

def someString = ""
if (someString) {
...
}
// a useful example of String->boolean conversion

Groovy also adds duck typing. If a variable walks like a duck and quacks like a duck, then it must be a duck. This is effectively a way to avoid having to implement an interface, by checking at runtime if the class has a particular method. This is only useful because at runtime Groovy allows methods to be added to (and removed from) classes. This allows for some interesting programming, however I find it very error prone. As a method could be added to a class at runtime, there is no compile-time checking.

class SomeObject {

}
SomeObject o = new SomeObject();
o.someMethod();
// This code is valid at compile time; only at runtime will a MissingMethodException be thrown.
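
For comparison, plain Java has no built-in duck typing, but reflection gives a rough analogue of the same runtime lookup, with the same failure mode. The sketch below is only an illustration; DuckTyping, Duck and Robot are hypothetical names:

```java
import java.lang.reflect.Method;

/** Illustrative Java analogue of a runtime method lookup. */
final class DuckTyping {
    static final class Duck {
        public String quack() {
            return "Quack";
        }
    }

    /** Has no quack() method, so the lookup below fails for it. */
    static final class Robot {
    }

    /** Looks up quack() by name at runtime; the compiler cannot check that it exists. */
    static String callQuack(Object target) throws ReflectiveOperationException {
        Method quack = target.getClass().getMethod("quack");
        return (String) quack.invoke(target);
    }
}
```

Calling callQuack(new Duck()) returns "Quack", while callQuack(new Robot()) compiles fine and only fails at runtime with a NoSuchMethodException.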

Because of the dynamic nature, a lot of the silly typo errors that should be caught at compile time will now only be found at run time. Mistyping a method name wasn’t caught until that line of code was reached. Also, due to dynamic typing, errors such as calling a method with the wrong argument types were not caught. I found this very frustrating as it slowed down my development. This also makes me dread what will happen if this code is pushed into production without very rigorous, 100% line test coverage.

It looks like Groovy 2.0 is trying to resolve this concern with GEP 8, a new type of annotation that will force Groovy to statically check your class/method at compile time.

Grails

GORM

GORM is Grails’s ORM, which sits on top of Hibernate. It takes advantage of Groovy’s collection syntax to make configuring a model easy. However, I think due to the young nature of Grails, I found multiple problems with GORM. I started by using the super convenient H2 data source for testing, then as I progressed I moved to MySQL. However, the code that worked perfectly with H2 stopped working in MySQL. There were little things, like reserved keywords being different, which tripped up MySQL. Looking at the generated SQL, the MySQL queries weren’t being escaped, and escaping them would have solved this issue. Secondly, and a bigger issue, I was using hierarchical data models. That is, I had a generic abstract Base model, and multiple specific models that extended from the base. This worked well in H2 and avoided a lot of duplication of code, but with the MySQL data source it was handled incorrectly, causing me to spend hours investigating and modifying the code.

I also tried the MongoDB plugin, as the document store concept works great for my hierarchy concept. However it wasn’t a direct drop-in replacement for H2/MySQL, and I even found some bugs, which I reported.

Scaffolding

This was one of the coolest features, but also one of the biggest letdowns. Scaffolding quickly generates all the code you need for a simple CRUD application. There are two modes, dynamic and static. Dynamic scaffolding literally allows you to create a controller in just a few lines, with all the code for create/read/update/delete hidden behind the scenes. Static scaffolding is very similar in features, but places all the code in the groovy file ready for you to edit.

class SomeController {
    static scaffold = Author
}
// This is all you need for a CRUD controller that maps to the Author model

The problem I found here is that dynamic scaffolding served little purpose other than showing off how little you could write. To actually customise it you would have to use static scaffolding. Even then, the static scaffolding didn’t seem particularly neat and simple (as compared to other rapid dev frameworks I’ve used), and you eventually had to throw 90% of that generated code away and write it all yourself.

Closures

The concept of closures and anonymous functions is a very cool one, which in fact I have quite liked using in Python and JavaScript. The implementation here also seemed quite good, except for some minor pet peeves I had. The real issue I had with closures is how they polluted the call stack. Some of my call stacks were now chains of methods like:

at _GrailsCompile_groovy$_run_closure2.doCall(_GrailsCompile_groovy:46)
at com.springsource.loaded.ri.ReflectiveInterceptor.jlrMethodInvoke(ReflectiveInterceptor.java:1231)
at org.codehaus.gant.GantMetaClass.invokeMethod(GantMetaClass.java:133)
at com.springsource.loaded.ri.ReflectiveInterceptor.jlrMethodInvoke(ReflectiveInterceptor.java:1231)
at org.codehaus.gant.GantMetaClass.invokeMethod(GantMetaClass.java:133)

This is no doubt a limitation of being built on top of the JVM, which couldn’t provide more helpful output.

Run-app

Grails comes with a CLI tool that does a lot of the code generation for you. One of the useful commands is grails run-app, which will start up an embedded webserver that runs your application and, better yet, allows you to make code changes without recompiling/redeploying. This truly makes it quicker to develop and test your Java/Groovy, and allows those minor tweaks to your Controllers, etc. without a wait. However, yet again I was let down by this feature. Lots of simple changes would cause run-app to stop serving my pages with odd exceptions. The solution was to stop the webserver and start it again, which defeats the purpose. Even worse, I sometimes had to run grails clean as it did not always pick up my code changes.

Conclusion

I liked everything that Groovy and Grails were trying to do, but I think their implementation isn’t good enough yet, and there are too many gotchas for me to consider using them in a production environment. I no doubt will follow their progress and play with them every so often.