Malloc Envy

Folks using programming languages with automatic garbage collection (GC) often feel that its performance cost is a price worth paying for the convenience. I disagree: outside of some fairly limited circumstances, there’s no price to be paid at all.

If your application is short-running, or frees little memory, or allocates and frees a moderate number of objects that are all the same size, then malloc+free might win. Otherwise, the comparison is not so clear, and there are large classes of applications in which automatic garbage collection is more efficient than malloc+free.

Malloc+free is a very challenging interface to implement efficiently. Consider: you need to hand out chunks of memory of any size, one at a time, with no advance warning of usage patterns; and then you need to be able to free arbitrary chunks at arbitrary times. And you can never change your mind and move a chunk of memory afterward, because anyone is allowed to remember that chunk’s address without telling you.

These constraints inevitably lead to memory waste known as fragmentation, which comes in various forms. For one thing, freed spaces lie between allocated chunks, and until a request arrives for a chunk that fits, that space is wasted. In addition, the free call receives no indication of how large the chunk is, so the allocator must store that size somewhere, consuming even more memory, even though the application code can often already tell the size of a chunk. The allocator, being completely divorced from the application, can’t take advantage of this knowledge. All of these effects add up to memory overhead.
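To make that bookkeeping cost concrete, here’s a toy size-prefixed allocator, sketched in Java purely for illustration. Real allocators are far more sophisticated, but they all face the same problem: free is handed only an address, so the size must be stashed somewhere, typically in a header just before the chunk.

    import java.nio.ByteBuffer;

    /** A toy allocator illustrating malloc's per-chunk bookkeeping cost.
     *  Purely illustrative; not a real allocator design. */
    public class ToyAllocator {
        private static final int HEADER = 8; // bytes reserved to record each chunk's size
        private final ByteBuffer heap;
        private int top = 0;                 // next unused offset in the heap

        public ToyAllocator(int heapSize) {
            heap = ByteBuffer.allocate(heapSize);
        }

        /** Hands out an offset into the heap; the 8 bytes before it hold the size. */
        public int malloc(int n) {
            heap.putLong(top, n);            // the allocator must record n itself...
            int chunk = top + HEADER;
            top = chunk + n;
            return chunk;
        }

        /** ...because free() is given only the address, never the size. */
        public void free(int chunk) {
            long size = heap.getLong(chunk - HEADER);
            // A real allocator would now put this chunk on a free list, leaving
            // a hole of exactly `size` bytes between its neighbors: fragmentation.
        }
    }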

A modern garbage collector, like the one in Java, addresses many of these problems. I can talk about IBM’s collector, since I have some experience with that one, but most of what I’ll say applies to Oracle’s and OpenJDK’s as well.

Here’s how Java allocates an object of size N: round N up to a multiple of 8 bytes, and bump the allocation pointer by that much. That’s it. Thanks to a scheme called the thread-local heap (TLH), there is no contention between threads. There’s some additional handling for large objects, and for the case when the TLH is exhausted and another one must be allocated, but generally speaking, the GC is self-tuning so that none of these activities appear on the hot path of an application. The overhead is generally negligible, and all you’re left with is the pointer bump.
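Here’s what that fast path looks like, as a minimal sketch in Java. The names and types are hypothetical (the real thing is a few compiled machine instructions operating on raw pointers), but the logic is the whole story:

    /** A minimal sketch of thread-local heap (TLH) bump allocation.
     *  Hypothetical names; the real fast path is compiled to a few
     *  machine instructions operating on raw pointers. */
    class ThreadLocalHeap {
        private long allocPtr;    // next free address in this thread's TLH
        private final long limit; // one past the end of the TLH

        ThreadLocalHeap(long start, long size) {
            this.allocPtr = start;
            this.limit = start + size;
        }

        /** Returns the address of a fresh n-byte chunk, or -1 if this TLH is full. */
        long allocate(int n) {
            long size = (n + 7) & ~7L;  // round N up to a multiple of 8 bytes
            long result = allocPtr;
            if (limit - result < size) {
                return -1;              // slow path: ask the GC for a new TLH
            }
            allocPtr = result + size;   // the pointer bump -- that's the whole fast path
            return result;
        }
    }

Because each thread owns its own TLH, there’s not a lock or atomic operation anywhere on this path.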

Ok, that’s fine, but what about freeing memory? Surely that must be expensive?

True, a GC pass is more expensive than a single call to free, but then that’s not a fair comparison. One GC pass can free an unlimited number of objects, while free frees only one. The fair comparison would be the cost per object freed. From this point of view, the winner is far from obvious, and depends on the application.
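To put purely illustrative numbers on it: if a GC pass takes 10 ms and reclaims a million dead objects, that amortizes to 10 ns per object, in the same neighborhood as (or cheaper than) a single free call on a typical allocator. If the same pass reclaims only a thousand objects, the cost is 10 µs each, which is far worse. The ratio of garbage to survivors decides the winner.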

At one end of the spectrum, you have the pathological case in which GC scans the entire heap and all thread stacks, and finds no objects to free. Technically, this works out to an infinite cost per object, but it only occurs when the system is out of memory, at which point the cost of a GC pass is no longer your primary concern.

Now consider the other end of the spectrum: a million objects are allocated and then become unused before the next GC cycle. (This could happen, for example, if your application allocates an object inside a loop; string concatenation inside a loop will do this, as the sketch below shows.) The cost of calling free on each of those objects would clearly be proportional to the number of objects freed, but the cost of the GC pass depends only on the surviving objects and thread stacks. Java’s collector can free an unlimited number of objects in time independent of the number of objects freed.
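For instance, this loop quietly produces a brand-new dead String (plus a temporary StringBuilder) on every iteration, a small-scale version of the million-object scenario:

    public class LoopGarbage {
        public static void main(String[] args) {
            String s = "";
            // 10,000 iterations to keep the demo quick; the effect scales
            // up to the million-object case just the same.
            for (int i = 0; i < 10_000; i++) {
                // Builds a fresh String each time; the previous value of s
                // immediately becomes unreachable garbage.
                s = s + "x";
            }
            System.out.println(s.length());
        }
    }

A copying collector never touches those dead Strings; it pays only for the survivors.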

A more serious effect in real applications is that the GC disrupts caches when it runs. True, the GC only touches objects that are still in use, but there are generally more of those than will fit in the processor cache, so what’s left in the cache is unlikely to be the data the application thread will access right after the GC pass finishes. Furthermore, the GC often touches objects using threads other than the application thread that uses the object, and if you’re unlucky enough that the application thread is on another processor core, it will find nothing useful in its cache when it resumes.

However, in between GC passes, the opposite is true. While copying live objects, the GC naturally rearranges them in memory so that objects end up near other objects that reference them. IBM’s garbage collector uses a so-called “hybrid” algorithm that’s part-way between breadth-first and depth-first, giving decent locality for a wide variety of object graphs. For example, this algorithm can pack HashMap entries together with the corresponding key and value objects, and if these are small enough, they can all fit in one cache line. Even if they can’t fit in a cache line, they can often fit in one memory page, reducing the number of TLB† entries required to access them. In contrast, poor old malloc must decide once and for all where to put each chunk, knowing nothing but its size.

It’s also important not to overstate the duration of GC pauses. Applications don’t need to stop for the entire duration of a garbage collection cycle: they can continue to run and mutate objects (and even allocate objects), so long as they coordinate with the GC.

One final objection to automatic garbage collection is that each object must be self-describing, which entails more memory consumption. For example, an array in IBM’s Java has a 12-byte header‡, comprising a class pointer, a lock word, and a size field. An array in C++ has none of these things. However, consider that if the array is ever to be freed by free, then its size must be recorded anyway. That leaves 8 bytes of overhead, which is likely comparable to the overhead caused by malloc’s fragmentation.
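Concretely, the 12 bytes presumably break down something like this; the field widths are my illustrative guess for a compressed-references JVM, and subtracting the size field (which free forces you to store anyway) is what leaves the 8 bytes of overhead:

    // One plausible layout of a 12-byte array header with compressed
    // references (field widths are an illustrative guess):
    //
    //   +-----------+-----------+------------+--------------------
    //   | class ptr | lock word | size field | array elements ...
    //   |  4 bytes  |  4 bytes  |  4 bytes   |
    //   +-----------+-----------+------------+--------------------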

So it’s far from obvious that malloc+free is the paragon of memory allocation performance. It’s too bad that head-to-head comparisons are so hard to do, since few systems support both accurate GC and malloc+free. I’m sure such comparisons would show that automatic garbage collection has nowhere near the overhead that people imagine it has.

† The TLB (Translation Lookaside Buffer) is a page-table cache. It’s generally very small: usually on the order of 64 entries for 4 KB pages. Using larger pages requires fewer TLB entries, but processors usually offer far fewer TLB entries for large pages, perhaps only four or eight. The TLB is an important resource to manage.

‡ Object headers (and references) grow larger if your heap is too large to support compressed references. Generally this limit is around 30 GB, though recent JVMs can increase this limit by aligning objects to 16- or 32-byte boundaries, at the expense of wasting more memory in alignment padding.

Posted by Patrick Doyle

Director of Engineering at Vena