Locate humongous allocations with BPF

2020-02-16 update: a new article has been published that incorporates feedback from many people and answers the points left unexplained in this article.


Brendan Gregg's book "BPF Performance Tools" describes a wealth of tools to diagnose even the hardest problems. Today, we will have a look at funccount, trace, argdist and stackcount to pinpoint where humongous objects are allocated in a Java application.

The test application

As in the previous article, we will analyse a very standard Java application: Spring PetClinic. This time, however, I deliberately modified it so that it allocates large (humongous) objects. With the funccount, trace, argdist and stackcount commands, we will figure out where these allocations happen.

"Humongous" objects?

First, let’s do a quick recap about humongous objects. When our JVM uses G1GC, the heap is divided into multiple regions. These regions are then assigned to either the Eden, the Survivors or the Old Generation. But what happens when the application allocates an object that is bigger than a region?

The contract of every GC is that the application should not have to know how memory is allocated. It should not even have to know that memory is finite, for that matter.

But in this particular case, we have to look at that "how". G1GC will combine adjacent free regions until it has a contiguous space big enough to hold the object. These regions will be part of neither the Young Generation nor the Old Generation: they are processed separately, during mixed GC cycles.

To be completely precise, all objects bigger than half of a region are considered "humongous".

The important thing to keep in mind here is that G1GC was not optimised at all for humongous objects. They are a known anti-pattern. It is recommended to change the application so that they are split into multiple, smaller objects.

When such modifications are not an option, an alternative is to increase the region size with -XX:G1HeapRegionSize=…. But this flag comes with its own set of issues. In particular, using it disables G1GC’s adaptive logic entirely. When you set it, you tell the GC that you know precisely how your application behaves. The GC therefore hands you full control of its algorithm and won’t get in the way trying to adapt to changing allocation patterns.
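Before overriding it, it can be worth checking which region size G1 picks ergonomically for a given heap. A quick sketch (using the 3 GB heap configured later in this article); the flag value is printed in bytes:

java -Xms3G -Xmx3G -XX:+UseG1GC -XX:+PrintFlagsFinal -version | grep G1HeapRegionSize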

The test

For this article, I will use:

  • JDK 13.0.1

  • The BPF tools installed from the bcc and bcc-tools packages on Arch Linux, with a 5.4.6 Linux kernel

  • The Java agent perf-map-agent to have readable stacktraces

The following environment variables will be defined:

  • LIBJVM will point to the libjvm.so file from the JDK

  • PATH will include the BPF tools

  • JAVA_OPTS will contain JVM parameters to have 1 MB regions and to preserve the frame pointer so that BPF can walk stacktraces

export LIBJVM=/home/alarm/.sdkman/candidates/java/13.0.1.hs-adpt/lib/server/libjvm.so
export PATH=/usr/share/bcc/tools/:$PATH
export JAVA_OPTS="-XX:+PreserveFramePointer -Xms3G -Xmx3G -XX:+UseG1GC -XX:G1HeapRegionSize=1048576 -verbose:gc"

A light load test will be executed in the background to stress the application and trigger the problematic code. Finally, note that the application will only be analysed after it has been optimised by the JVM (when it is in steady-state).
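For reference, the load test can be as simple as a shell loop that keeps requesting the relevant pages. A minimal sketch, assuming the application listens on localhost:8080 and that the owner search pages are the ones exercising the modified code (both assumptions are specific to my setup):

# Very naive load generator: request the owner search pages in a loop.
while true; do
  curl -s -o /dev/null "http://localhost:8080/owners/find"
  curl -s -o /dev/null "http://localhost:8080/owners?lastName="
done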

Preliminary checks

Let’s execute the application, put it under load, and look at the GC logs. We can see that, multiple times, a GC cycle is initiated because of humongous allocations. That, in itself, is a strong indication that the application has an issue and that we should investigate further.

[442.770s][info][gc] GC(227) Pause Young (Normal) (G1 Humongous Allocation) 3059M->3034M(3072M) 3.816ms
[442.779s][info][gc] GC(228) Pause Young (Normal) (G1 Humongous Allocation) 3055M->3054M(3072M) 3.749ms
[442.866s][info][gc] GC(229) Pause Full (G1 Humongous Allocation) 3054M->3048M(3072M) 86.581ms
[442.870s][info][gc] GC(230) Pause Young (Normal) (G1 Humongous Allocation) 3055M->3054M(3072M) 2.324ms
[442.935s][info][gc] GC(231) Pause Full (G1 Humongous Allocation) 3054M->3054M(3072M) 65.282ms

Funccount

Let’s start with the funccount command. This command counts, as its name suggests, how many times a given function was invoked. It is going to help us establish a first diagnosis.

We want to identify where humongous objects are allocated. Before anything else, we need to determine which JVM function handles these allocation requests. So we can either ping our favorite JVM developers on Twitter, or just read the source code and find the G1CollectedHeap class. More specifically, its G1CollectedHeap::humongous_obj_allocate method.

The great thing about open-source is not that it is free. It is that we can read the source.
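For the record, a plain text search in a local clone of the OpenJDK sources is enough to locate that method; a sketch, assuming a JDK 13 source tree checked out in the current directory:

# Look for the allocation functions mentioning "humongous" in the G1 sources.
grep -Rn "humongous_obj_allocate" src/hotspot/share/gc/g1/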

Alright, so now we have to figure out what the actual name of the function is, after it has been compiled by GCC (the mangled name). Since we know that the name contains humongous, we can simply run funccount "$LIBJVM:*humongous*" and see what we get.

[alarm@archlinux ~]$ sudo funccount "$LIBJVM:*humongous*"
Tracing 24 functions for "b'/home/alarm/.sdkman/candidates/java/13.0.1.hs-adpt/lib/server/libjvm.so:*humongous*'"... Hit Ctrl-C to end.
^C
FUNC                                                                             COUNT
b'_ZN16G1ConcurrentMark34humongous_object_eagerly_reclaimedEP10HeapRegion'       218
b'_ZN15G1CollectedHeap33eagerly_reclaim_humongous_regionsEv'                     228
b'_ZN9metaspace35get_size_for_nonhumongous_chunktypeENS_10ChunkIndexEb'          746
b'_ZN22G1BlockOffsetTablePart24set_for_starts_humongousEPP12HeapWordImplm'       3357
b'_ZN10HeapRegion20set_starts_humongousEPP12HeapWordImplm'                       3357
b'_ZN15G1CollectedHeap41humongous_obj_allocate_initialize_regionsEjjm'           3357
b'_ZN10HeapRegion23set_continues_humongousEPS_'                                  3357
b'_ZN15G1CollectedHeap28attempt_allocation_humongousEm'                          3393
b'_ZN15G1CollectedHeap22humongous_obj_allocateEm'                                3849
b'_ZN15G1CollectedHeap29humongous_obj_size_in_regionsEm'                         4562
b'_ZN15G1CollectedHeap21free_humongous_regionEP10HeapRegionP14FreeRegionList'    5858
b'_ZN10HeapRegion15clear_humongousEv'                                            6361
b'_ZN22G1RemSetTrackingPolicy31update_humongous_before_rebuildEP10HeapRegionb'   8649
Detaching...

We find our function with a slightly less readable decorated name: _ZN15G1CollectedHeap22humongous_obj_allocateEm.
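As a sanity check, that mangled name can be demangled with c++filt (shipped with binutils), which confirms the signature:

[alarm@archlinux ~]$ c++filt _ZN15G1CollectedHeap22humongous_obj_allocateEm
G1CollectedHeap::humongous_obj_allocate(unsigned long)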

Let’s ignore the rest and focus on that function only. Our funccount command now becomes: funccount -p `pgrep java` "$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm".

[alarm@archlinux ~]$ sudo funccount -p `pgrep java` "$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm"
Tracing 1 functions for "b'/home/alarm/.sdkman/candidates/java/13.0.1.hs-adpt/lib/server/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm'"... Hit Ctrl-C to end.
^C
FUNC                                                  COUNT
b'_ZN15G1CollectedHeap22humongous_obj_allocateEm'     2571
Detaching...

2571 calls, which means as many humongous objects. This is a clear confirmation that the application has an issue. And I did not even let funccount run for very long.
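Note that funccount can also print counts at fixed intervals, which gives an idea of the allocation rate over time; for instance, every 5 seconds:

sudo funccount -i 5 -p `pgrep java` "$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm"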

Now, to make some more progress, let’s use another tool: trace.

Trace

The trace command allows us to detect which thread called a given function. It can also record the stacktraces that led to a function call. So we are going to use it with _ZN15G1CollectedHeap22humongous_obj_allocateEm. But this time, we will run the perf-map-agent just before trace to get readable stacktraces.

./perf-map-agent/bin/create-java-perf-map.sh `pgrep java`

First, let’s just execute trace -p `pgrep java` "$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm".

[alarm@archlinux ~]$ sudo trace -p `pgrep java` "$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm"
PID     TID     COMM            FUNC
631     804     http-nio-8080-e _ZN15G1CollectedHeap22humongous_obj_allocateEm
631     948     http-nio-8080-e _ZN15G1CollectedHeap22humongous_obj_allocateEm
631     946     http-nio-8080-e _ZN15G1CollectedHeap22humongous_obj_allocateEm
631     896     http-nio-8080-e _ZN15G1CollectedHeap22humongous_obj_allocateEm
631     887     http-nio-8080-e _ZN15G1CollectedHeap22humongous_obj_allocateEm
631     803     http-nio-8080-e _ZN15G1CollectedHeap22humongous_obj_allocateEm
631     944     http-nio-8080-e _ZN15G1CollectedHeap22humongous_obj_allocateEm
[...]

We can see that the allocations were performed by different HTTP threads, according to their ID (TID). This information can be useful in the diagnosis, especially if the application under scrutiny uses different thread pools for different tasks.

Let’s now generate stacktraces. The trace command will be configured to generate only 10 of them (-M 10). Without this parameter, we would get thousands of stacktraces being printed continuously in the terminal.

In the output below, I will only list a single stacktrace, to keep things readable.

[alarm@archlinux ~]$ sudo trace -Up `pgrep java` -M 10 "$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm"
[...]
PID     TID     COMM            FUNC
631     951     http-nio-8080-e _ZN15G1CollectedHeap22humongous_obj_allocateEm
        b'G1CollectedHeap::humongous_obj_allocate(unsigned long)+0x0 [libjvm.so]'
        b'G1CollectedHeap::mem_allocate(unsigned long, bool*)+0x105 [libjvm.so]'
        b'MemAllocator::allocate() const+0x9a [libjvm.so]'
        b'TypeArrayKlass::allocate_common(int, bool, Thread*)+0xb8 [libjvm.so]'
        b'OptoRuntime::new_array_C(Klass*, int, JavaThread*)+0x33 [libjvm.so]'
        b'_new_array_Java+0x2a [perf-631.map]'
        b'Lorg/springframework/samples/petclinic/owner/OwnerController;::initFindForm+0x60 [perf-631.map]' (3)
        b'Interpreter+0x1500 [perf-631.map]'
        b'Lorg/springframework/web/method/support/InvocableHandlerMethod;::invokeForRequest+0x298 [perf-631.map]'
        b'Lorg/springframework/web/servlet/mvc/method/annotation/ServletInvocableHandlerMethod;::invokeAndHandle+0x58 [perf-631.map]'
        b'Lorg/springframework/web/servlet/DispatcherServlet;::doDispatch+0x32c0 [perf-631.map]'
        b'Lorg/springframework/web/servlet/FrameworkServlet;::service+0x1130 [perf-631.map]'
        b'Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+0x1d4 [perf-631.map]'
        b'Lorg/apache/tomcat/websocket/server/WsFilter;::doFilter+0x68 [perf-631.map]'
        b'Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+0x138 [perf-631.map]'
        b'Lorg/springframework/web/servlet/resource/ResourceUrlEncodingFilter;::doFilter+0x174 [perf-631.map]'
        b'Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+0x138 [perf-631.map]'
        b'Lorg/springframework/web/filter/RequestContextFilter;::doFilterInternal+0x1538 [perf-631.map]'
        b'Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+0x464 [perf-631.map]'
        b'Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+0x138 [perf-631.map]'
        b'Lorg/springframework/web/filter/FormContentFilter;::doFilterInternal+0x2f0 [perf-631.map]'
        b'Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+0x464 [perf-631.map]'
        b'Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+0x138 [perf-631.map]'
        b'Lorg/springframework/boot/actuate/metrics/web/servlet/WebMvcMetricsFilter;::doFilterInternal+0x5dc [perf-631.map]'
        b'Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+0x464 [perf-631.map]'
        b'Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+0x138 [perf-631.map]'
        b'Lorg/springframework/web/filter/CharacterEncodingFilter;::doFilterInternal+0x20c [perf-631.map]'
        b'Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+0x464 [perf-631.map]' (2)
        b'Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+0x138 [perf-631.map]'
        b'Lorg/apache/catalina/core/StandardWrapperValve;::invoke+0x384 [perf-631.map]'
        b'Lorg/apache/catalina/authenticator/AuthenticatorBase;::invoke+0x538 [perf-631.map]'
        b'Lorg/apache/catalina/core/StandardHostValve;::invoke+0x4c0 [perf-631.map]'
        b'Lorg/apache/catalina/connector/CoyoteAdapter;::service+0x680 [perf-631.map]'
        b'Lorg/apache/coyote/http11/Http11Processor;::service+0x53c [perf-631.map]'
        b'Lorg/apache/coyote/AbstractProcessorLight;::process+0xac [perf-631.map]'
        b'Lorg/apache/coyote/AbstractProtocol$ConnectionHandler;::process+0x550 [perf-631.map]'
        b'Lorg/apache/tomcat/util/net/NioEndpoint$SocketProcessor;::doRun+0x1f8 [perf-631.map]'
        b'Lorg/apache/tomcat/util/net/SocketProcessorBase;::run+0xdc [perf-631.map]' (1)
        b'Ljava/util/concurrent/ThreadPoolExecutor;::runWorker+0x150 [perf-631.map]'
        b'Interpreter+0x17b3 [perf-631.map]'
        b'Interpreter+0x17f8 [perf-631.map]'
        b'Interpreter+0x17f8 [perf-631.map]'
        b'call_stub+0x8a [perf-631.map]'
        b'JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x3c9 [libjvm.so]'
        b'JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+0x1e5 [libjvm.so]'
        b'thread_entry(JavaThread*, Thread*)+0x6c [libjvm.so]'
        b'JavaThread::thread_main_inner()+0x1be [libjvm.so]'
        b'Thread::call_run()+0x10d [libjvm.so]'
        b'thread_native_entry(Thread*)+0xe7 [libjvm.so]'
        b'start_thread+0xdf [libpthread-2.30.so]'
[...]

From bottom to top, we can see Tomcat (1), Spring (2), and finally a PetClinic class: org.springframework.samples.petclinic.owner.OwnerController (3). More precisely, we can see that it is the initFindForm method of OwnerController that caused the humongous allocation. We do not know which line triggered it, but we know that the memory is used for an array, and that the allocation happens directly in the body of initFindForm.

We can now go back to development and update the application code to split the allocation.

To conclude on this command, make sure to only use trace with code that has been optimised by the JVM. BPF tools run outside of the JVM. They can inspect a running JVM, but they are not as well integrated as standard Java tools like jstack. In particular, if trace is used on interpreted Java code, it won’t produce satisfying results. The output below is an example of what trace can "see" when the code is interpreted. Admittedly, Interpreter+0x17b3 and similar frames are not very helpful…

545     748     http-nio-8080-e _ZN15G1CollectedHeap22humongous_obj_allocateEm
        b'G1CollectedHeap::humongous_obj_allocate(unsigned long)+0x0 [libjvm.so]'
        b'G1CollectedHeap::mem_allocate(unsigned long, bool*)+0x105 [libjvm.so]'
        b'MemAllocator::allocate() const+0x9a [libjvm.so]'
        b'TypeArrayKlass::allocate_common(int, bool, Thread*)+0xb8 [libjvm.so]'
        b'InterpreterRuntime::newarray(JavaThread*, BasicType, int)+0x20 [libjvm.so]'
        b'Interpreter+0x17399 [perf-545.map]'
        b'Interpreter+0x1500 [perf-545.map]'
        b'[unknown] [perf-545.map]'
        b'Interpreter+0x1500 [perf-545.map]'
        b'Interpreter+0x17b3 [perf-545.map]'
        b'Interpreter+0x1500 [perf-545.map]'
[...]
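As an aside, one way to check that a given method has been JIT-compiled before tracing it is to start the JVM with -XX:+PrintCompilation and grep its output; a sketch, where the jar name is hypothetical:

# Hypothetical check: has OwnerController::initFindForm been compiled yet?
java $JAVA_OPTS -XX:+PrintCompilation -jar spring-petclinic.jar 2>&1 | grep initFindForm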

Argdist

To go even further, I looked at the argdist command, but I could not get a good result. This section documents what I tried and the questions that remain.

The question is this: how big were the humongous objects?

Given its signature, the G1CollectedHeap::humongous_obj_allocate(size_t size) method contains the answer. We "just" need to inspect the value of its first and only parameter. And this is precisely what argdist is for.

I know that the application under test allocated arrays of 250,000 long values. This should represent approximately 2 MB of heap (2,000,000 bytes, plus the array object header). So I am expecting to find something close to this value.

Given that I use a 64-bit JVM, the size_t type should resolve to an 8-byte unsigned integer. But when I tell argdist that the parameter type is u64, I get incorrect results that are way too high (about 140 TB).

[alarm@archlinux ~]$ sudo argdist -i 5 -p `pgrep java` -C "p:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm(size_t size):u64:size"
[19:30:56]
p:/home/alarm/.sdkman/candidates/java/13.0.1.hs-adpt/lib/server/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm(size_t size):u64:size
	COUNT      EVENT
	31         size = 140652730090208
[19:31:01]
p:/home/alarm/.sdkman/candidates/java/13.0.1.hs-adpt/lib/server/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm(size_t size):u64:size
	COUNT      EVENT
	38         size = 140652730090208
[19:31:06]
p:/home/alarm/.sdkman/candidates/java/13.0.1.hs-adpt/lib/server/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm(size_t size):u64:size
	COUNT      EVENT
	38         size = 140652730090208

So my next attempt was to suppose that size_t was really a 32-bit unsigned integer. I got slightly less absurd results (about 1.1 GB), but they were still very far from the expected value.

[alarm@archlinux ~]$ sudo argdist -i 5 -p `pgrep java` -C "p:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm(size_t size):u32:size"
[19:34:24]
p:/home/alarm/.sdkman/candidates/java/13.0.1.hs-adpt/lib/server/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm(size_t size):u32:size
	COUNT      EVENT
	39         size = 1141080800
[19:34:29]
p:/home/alarm/.sdkman/candidates/java/13.0.1.hs-adpt/lib/server/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm(size_t size):u32:size
	COUNT      EVENT
	39         size = 1141080800
[19:34:34]
p:/home/alarm/.sdkman/candidates/java/13.0.1.hs-adpt/lib/server/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm(size_t size):u32:size
	COUNT      EVENT
	38         size = 1141080800
[...]

The "BPF Performance Tools" book also mentions Big Endian to Little Endian conversions multiple times. So I tried to convert the values that I got, without success. This remains a very puzzling point for me at this time. If you know what happens, please do send me a message on Twitter!

Stackcount

To close this article, let’s talk about another failed attempt. The trace command lists all the humongous allocations, in no specific order and without any grouping. Fortunately, the test application only had 3 humongous allocation sites, which were easy to list from the trace output.

In a bigger application, there could be many more allocation sites. It would be very useful to group them and determine which allocation site is the most frequent, so that it can be refactored first. And this is precisely what stackcount was created for.

Unfortunately, I could not get a single usable stacktrace out of this tool.

[alarm@archlinux ~]$ sudo stackcount -p `pgrep java` "$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm"
Tracing 1 functions for "/home/alarm/.sdkman/candidates/java/13.0.1.hs-adpt/lib/server/libjvm.so:_ZN15G1CollectedHeap22humongous_obj_allocateEm"... Hit Ctrl-C to end.
^C
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
[...]

If you manage to get it to work, please let me know on Twitter. I think it would be very handy for this use case.

Conclusion

The BPF tools are truly versatile and powerful. With a more precise understanding of how the JVM works, we can use them to diagnose hard problems like humongous allocations. And even though they were initially created for natively compiled applications (such as C/C++ programs), their support for the JVM is already very solid. No doubt they will soon become even more useful.

Do you know other interesting JVM methods to monitor? Feel free to share them with me, and I will write new articles about them.


If you have any question/comment, feel free to send me a tweet at @pingtimeout. And if you enjoyed this article and want to support my work, you can always buy me a coffee ☕️.