Unrolling loops at runtime with Byte Buddy

| unsafe | java | byte-buddy

While creating the UnsafeArrayList, I encountered a problem that I felt I could optimise. The UnsafeArrayList copies objects into off-heap memory, instead of what a normal ArrayList would do, which is to store references to the object in an array on the heap. For example an UnsafeArrayList<FourLong> holds instances of FourLongs, whose fields consume a total of 32 bytes (4×8 bytes) of memory. By design, when set() or get() are called, the UnsafeArrayList copies these 32 bytes into or out of a contiguous segment of memory.

To achieve the copying, sun.misc.Unsafe’s putLong() is repeatedly called, moving 8 bytes at a time. For example, this simple loop will copy a long’s worth of memory each iteration, from src, into dest:

final long COPY_STRIDE = 8;
final Unsafe unsafe = UnsafeHelper.getUnsafe();

public void copy(Object dest, long src) {
	long destOffset = 0;
	long destEnd = UnsafeHelper.sizeOf(dest);

	while (destOffset < dstEnd) {
		unsafe.putLong(dest, dstOffset, unsafe.getLong(src));
		destOffset += COPY_STRIDE;
		src += COPY_STRIDE;
	}
}

Note, we use putLong, not because the UnsafeArrayList is storing objects made up of longs, but because this is the Unsafe method that can copy the most in one go. This putLong method is thus being used as the building block to build a more complex looping copy method. Note, this works great for memory which is aligned on a 8 byte boundary, and the total copy is a multiple of 8 bytes. For the sake of this article, we make the assumption that this is always true.

In the FourLong’s case, the copy method would iterates four times. This is predictable, and occurs every time we get() on a UnsafeArrayList<FourLong> instance. Since this copy loop will be executed every time get() is called, it is worth seeing if we can make it execute faster. A common optimisation is for the developer to manually unroll the loop, avoiding the loop counter, and producing potentially quicker code1. In this case, manually unrolling the code is not possible because the parameterised type could be any size. For example, a UnsafeArrayList<Point> would only need to copy 8 bytes (two 4 byte ints). You would hope that the JIT would notice the loop always iterates the same number of times (for a particular list), and be able to remove the loop. Sadly, it does not seem to do this, perhaps because the JVM does not know what side effects unsafe.{get,put}Long have. To measure the cost of the looping we compare the previous code to this:

final int COPY_STRIDE = 8;
final Unsafe unsafe = UnsafeHelper.getUnsafe();

public void copy(Object dest, long src) {
	assert(UnsafeHelper.sizeOf(dest) == 4 * COPY_STRIDE)

	long destOffset = 0;

	unsafe.putLong(dest, destOffset, unsafe.getLong(src));
	destOffset += COPY_STRIDE;
	src += COPY_STRIDE;

	unsafe.putLong(dest, destOffset, unsafe.getLong(src));
	destOffset += COPY_STRIDE;
	src += COPY_STRIDE;

	unsafe.putLong(dest, destOffset, unsafe.getLong(src));
	destOffset += COPY_STRIDE;
	src += COPY_STRIDE;

	unsafe.putLong(dest, destOffset, unsafe.getLong(src));
}

When benchmarked, this manually unrolled code runs 2 times faster! This got me thinking, since a particular UnsafeArrayList instance is always going to copy the same sized object, again and again and again, it could perhaps generate bytecode during creation, that unrolled the loop.

Enter Byte Buddy

Thus investigation into Byte Buddy began, a library designed for generating bytecode at runtime. The rest of this article explains how to use Byte Buddy for this goal.

To start, I used Intellij IDEA’s “Show Bytecode” option, to inspect the code generated by my hand unrolled code.

; Initialisation
  ; long destOffset = 0;
  LCONST_0  ; Load the long zero
  LSTORE 4  ; Store it in “destOffset”

; Copy
  ; unsafe.putLong(dest, destOffset, unsafe.getLong(src));
  ALOAD 0  ; Load “this”
  ; The the “unsafe” member from this.
  GETFIELD net/bramp/unsafe/Test.unsafe : Lsun/misc/Unsafe;

  ALOAD 1  ; Load dest
  LLOAD 4  ; Load dstOffset
  ALOAD 0  ; Load this
  ; The the “unsafe” member from this.
  GETFIELD net/bramp/unsafe/Test.unsafe : Lsun/misc/Unsafe;

  LLOAD 2  ; Load src
  ; unsafe.getLong(src), storing result on stack.
  INVOKEVIRTUAL sun/misc/Unsafe.getLong (J)J
  ; unsafe.putLong(dest, dstOffset, {stack result})
  INVOKEVIRTUAL sun/misc/Unsafe.putLong (Ljava/lang/Object;JJ)V

;; Increment
  ; dstOffset += 8;
  LLOAD 4   ; Load dstOffset
  LDC 8     ; Load 8
  LADD      ; Add dstOffset and 8
  LSTORE 4  ; Store result to dstOffset

  ; src += 8;
  LLOAD 2   ; Load src
  LDC 8     ; Load 8
  LADD      ; Add src and 8
  LSTORE 2  ; Store result to src

After reading a primer to bytecode, this generated bytecode looked quite simple. It can be broken up into three steps, initialisation, copy, and increment. At runtime, Byte Buddy can be used to generate bytecode that is an unrolled equivalent, such that there is 1 initialisation step, N copy steps, and N-1 increment steps, where N is based on the size of the object the UnsafeArrayList plans to copy.

Reading through the Byte Buddy API it seems the best way to achieve this is to create an abstract class, which will form the base of a generated class. Then at runtime create an instantiation of this abstract class, specialised with the unrolled copy bytecode.

For example, the base class would look like this:

public abstract class UnsafeCopier {
	protected final Unsafe unsafe;

	public UnsafeCopier(Unsafe unsafe) {
		this.unsafe = checkNotNull(unsafe);
	}

	abstract void copy(Object dest, long src);
}

Leaving us to implement the copy(…) method optimally for the size of object being copied.

Using the Builder pattern I created the UnrolledUnsafeCopierBuilder class. The build() method will calculate the size of the class being copied, then using Byte Buddy generate the copy implementation, and returns a specialised instance UnsafeCopier.

public UnsafeCopier build(Unsafe unsafe) {
	final long length = UnsafeHelper.sizeOf(clazz);

	Class<?> dynamicType = new ByteBuddy()
		.subclass(UnsafeCopier.class)
		.method(named("copy"))
		.intercept(new CopierImplementation(length)).make()
		.load(getClass().getClassLoader(), ClassLoadingStrategy.Default.WRAPPER)
		.getLoaded();

	return (UnsafeCopier) dynamicType
		.getDeclaredConstructor(Unsafe.class)
		.newInstance(unsafe);
}

This begins by calculating the size of the class. Then using a ByteBuddy instance, creates a new dynamicType, which extends UnsafeCopier. This subclass then obtains its copy method with code generated by CopierImplementation(length). Finally, this new dynamicType is used to create an instance of the copier, which is now specialised for copying instances of clazz.

The real meat of the code is in CopierImplementation, which can be explained in pieces:

class CopierImplementation implements ByteCodeAppender, Implementation {

	public static final long COPY_STRIDE = 8;

	final long length;

	public CopierImplementation(long length) {
		this.length = length;
	}

	private StackManipulation buildStack() throws ... {
		...
		final StackManipulation setupStack = ...
		final StackManipulation copyStack = ...
		final StackManipulation incrementStack = ...

		final int iterations = (int) (length / COPY_STRIDE);
		final StackManipulation[] stack = new StackManipulation[1 + 2 * iterations];
		
		stack[0] = setupStack;
		for (int i = 0; i < iterations; i++) {
			stack[i * 2 + 1] = copyStack;
			stack[i * 2 + 2] = incrementStack;
		}

		// Override the last incrementStack with a "return"
		stack[stack.length - 1] = MethodReturn.VOID;

		return new StackManipulation.Compound(stack);
	}

	...
}

Byte Buddy uses StackManipulation objects to define what bytecode to generate. These StackManipulation objects can be built up hierarchically and contain all the bytecode instructions to execute. We define a separate StackManipulation object for each step, and in the buildStack() method combine the steps multiple times into one array. In particular, this stack array contains one initialise step, N copy steps, and N-1 increment steps, with a return instruction on the end.

Recall from the early bytecode listing, that the initialisation was two bytecode operations, a LCONST, and LSTORE. In Byte Buddy, we can thus do the following:

final StackManipulation setupStack = new StackManipulation.Compound(
	LongConstant.ZERO,                       // LCONST_0
	MethodVariableStore.LONG.storeOffset[4]  // LSTORE 4
);

Byte Buddy provides the primitives for most bytecode instructions, and can be built up in these StackManipulation arrays. However, some instructions are missing, for example LADD (needed by the increment step). But it is simple enough to create one from scratch, as shown outside of this article.

Next the copy step is defined which is a few more instructions than the increment, but relatively simple:

final Field unsafeField = UnsafeCopier.class.getDeclaredField("unsafe");
final Method getLongMethod = Unsafe.class.getMethod("getLong", long.class);
final Method putLongMethod = Unsafe.class.getMethod("putLong",Object.class, long.class, long.class);

final StackManipulation copyStack = new StackManipulation.Compound(
	// unsafe.putLong(dest, destOffset, unsafe.getLong(src));
	MethodVariableAccess.REFERENCE.loadOffset[0], // ALOAD 0 this

	FieldAccess.forField(new FieldDescription.ForLoadedField(unsafeField))
	                                   .getter(), // GETFIELD

	MethodVariableAccess.REFERENCE.loadOffset[1], // ALOAD 1 dest
	MethodVariableAccess.LONG.loadOffset[4],      // LLOAD 4 destOffset

	MethodVariableAccess.REFERENCE.loadOffset[0], // ALOAD 0 this
	FieldAccess.forField(new FieldDescription.ForLoadedField(unsafeField))
	                                   .getter(), // GETFIELD

	MethodVariableAccess.LONG.loadOffset[2],      // LLOAD 2 src

	MethodInvocation.invoke(new MethodDescription.ForLoadedMethod(getLongMethod)),
	MethodInvocation.invoke(new MethodDescription.ForLoadedMethod(putLongMethod))
);

Again, the bytecode instructions are created as a sequence of StackManipulation, replicating the bytecode the java compiler code had generated earlier. This example contains a couple of new StackManipulation classes, in particular the Field and Method Descriptions classes.

The final step is the increment step, which won’t be explained, but for the interested reader the source can be found here.

One last piece of information Byte Buddy needs, is the size of the stack needed for the copy() method, including any space local variables may need. The StackManipulation comes in handy here, as it is able to infer some of these details from the byte code it represents. In particular, the following code calculates the stack size:

public Size apply(MethodVisitor methodVisitor, Implementation.Context implementationContext,
   MethodDescription instrumentedMethod) {

	...

	// Call buildStack() (from above) to generate the bytecode
	StackManipulation stack = buildStack();

	// Calculate the size of this bytecode
	StackManipulation.Size finalStackSize = stack.apply(methodVisitor, implementationContext);

	// Now return the size of this bytecode, plus two, which is the size of the local
	// destOffset variable.
	return new Size(finalStackSize.getMaximalSize(), instrumentedMethod.getStackSize() + 2);
}

An important part here, is the +2, which makes room for the long destOffset variable. If that was missing, the generated bytecode would incorrectly write over instructions on the stack, and most likely crash the JVM.

Now at runtime the UnsafeArrayList’s constructor can use the UnrolledUnsafeCopierBuilder to generate a specialised UnsafeCopier designed for the exact class the UnsafeArrayList is storing.

Results

Now we have most of what we need, it is worth benchmarking this code. Using JMH, we can write three microbenchmarks. One using the original looping code, one using the hand unrolled code, and one using the Byte Buddy unrolled code. The code for the benchmarks is on GitHub, and follows a similar methodology to that in a previous article.

The results are as you may expect:

Benchmark Mode Cnt Score Error Units
Loop thrpt 25 218.056 ± 11.123 ops/us
Hand Unrolled thrpt 25 430.376 ± 27.448 ops/us
Byte Buddy Unrolled thrpt 25 437.139 ± 22.811 ops/us

The loop code can execute ~218 times per microseconds, whereas both the Byte Buddy, and hand unrolled code had near identical performance, of ~430-437 iterations per microsecond, nearly twice as fast. Of course, not measured here is the startup cost of generating the unrolled code. It is assumed this technique would only be used when the generated code would exist for a long time. Otherwise the setup cost undoes any per execution savings.

Conclusion

In summary, we managed to unroll a loop at runtime by generating on demand bytecode for that specific purpose. This was possible by inspecting machine generated bytecode, and using Byte Buddy to generate equivalent bytecode at runtime, customised specifically with the correct number of unrolled iterations.

This technique may seem completely crazy, and I don’t suggest its used unless you know what you are doing. That includes, actually measuring you have a performance problem which could be fixed with this, and not being able to depend on the JVM’s own JIT to do this optimisation for you.

Helpful Links: GitHub Home | Gitub Code | JavaDoc


  1. Unrolled code is not always faster, as larger code may not fit into CPU instruction cache. ↩︎