Follow-up - Locate humongous allocations with BPF

In the previous article, I explored how BPF tools can be used to locate humongous allocations. I received a lot of positive feedback, along with some clarifications and advice. Let’s review them in this article.

Group allocations per stacktrace

In the previous article, I managed to print a stacktrace every time a humongous object was allocated. I wanted to group these stacktraces to get a more concise result. I tried the stackcount command, but I could not get it to work.

Alex Blewitt suggested using bpftrace instead of stackcount. An advantage of bpftrace is that we can pass it a custom script, in this case u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[ustack] = count() }. Let’s analyse it:

  • u: tells bpftrace that we are going to instrument a user function

  • $LIBJVM points to the library that contains the code to be instrumented (one way to set it is sketched after this list)

  • _ZN15G1CollectedHeap22humongous_obj_allocateEm is the name of the function to instrument

  • ustack is a variable which BPF associates with the userspace stacktrace automatically

  • @[ustack] = count() creates an associative array in which this stacktrace is mapped to its number of occurrences

In other words, this script increments a per-stacktrace counter every time a humongous object is allocated at a given site. When the script is stopped, each stacktrace is printed along with the number of times a humongous object was allocated there.
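
The script assumes that the $LIBJVM environment variable points to libjvm.so, as it was presumably set up in the previous article. If it is not set, one possible way to retrieve it from the running JVM is sketched below; this is my own assumption, not part of Alex’s suggestion, and it expects a single java process:

# Extract the libjvm.so path from the JVM memory mappings (prefix with sudo if needed)
export LIBJVM=$(grep -m1 -o '/[^ ]*libjvm\.so' /proc/$(pgrep java)/maps)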

The example below shows that 65 objects were allocated in PetClinic’s OwnerController::showOwner() method.

[alarm@archlinux ~]$ ./perf-map-agent/bin/create-java-perf-map.sh `pgrep java`
[alarm@archlinux ~]$ sudo bpftrace -e "u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[ustack] = count() }"
Attaching 1 probe...
[...]
@[
    G1CollectedHeap::humongous_obj_allocate(unsigned long)+0
    G1CollectedHeap::mem_allocate(unsigned long, bool*)+261
    MemAllocator::allocate() const+154
    TypeArrayKlass::allocate_common(int, bool, Thread*)+184
    OptoRuntime::new_array_C(Klass*, int, JavaThread*)+51
    _new_array_Java+42
    Lorg/springframework/samples/petclinic/owner/OwnerController;::showOwner+96
    Ljdk/internal/reflect/GeneratedMethodAccessor134;::invoke+172
    Lorg/springframework/web/method/support/InvocableHandlerMethod;::invokeForRequest+584
    Lorg/springframework/web/servlet/mvc/method/annotation/ServletInvocableHandlerMethod;::invokeAndHandle+88
    Lorg/springframework/web/servlet/mvc/method/AbstractHandlerMethodAdapter;::handle+2848
    Lorg/springframework/web/servlet/DispatcherServlet;::doDispatch+2628
    Ljavax/servlet/http/HttpServlet;::service+6900
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+492
    Lorg/apache/tomcat/websocket/server/WsFilter;::doFilter+104
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/springframework/web/servlet/resource/ResourceUrlEncodingFilter;::doFilter+372
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/springframework/web/filter/RequestContextFilter;::doFilterInternal+1160
    Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+292
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/springframework/web/filter/FormContentFilter;::doFilterInternal+392
    Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+292
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/springframework/boot/actuate/metrics/web/servlet/WebMvcMetricsFilter;::doFilterInternal+1148
    Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+292
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/springframework/web/filter/CharacterEncodingFilter;::doFilterInternal+200
    Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+292
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/apache/catalina/core/StandardWrapperValve;::invoke+908
    Lorg/apache/catalina/authenticator/AuthenticatorBase;::invoke+2144
    Lorg/apache/catalina/core/StandardHostValve;::invoke+988
    Lorg/apache/catalina/core/StandardEngineValve;::invoke+512
    Lorg/apache/catalina/connector/CoyoteAdapter;::service+1292
    Lorg/apache/coyote/http11/Http11Processor;::service+800
    Lorg/apache/coyote/AbstractProcessorLight;::process+172
    Lorg/apache/coyote/AbstractProtocol$ConnectionHandler;::process+988
    Lorg/apache/tomcat/util/net/NioEndpoint$SocketProcessor;::doRun+492
    Lorg/apache/tomcat/util/net/SocketProcessorBase;::run+220
    Interpreter+6136
    Interpreter+6067
    Interpreter+6136
    Interpreter+6136
    call_stub+138
    JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+969
    JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+485
    thread_entry(JavaThread*, Thread*)+108
    JavaThread::thread_main_inner()+446
    Thread::call_run()+269
    thread_native_entry(Thread*)+231
    start_thread+223
]: 65

Measure the size of humongous objects

In the previous article, I also tried to measure the size of humongous objects with argdist. However, I could not get a readable (and correct) result.

Aleksey Shipilëv noted a flaw in my reasoning. I started with the idea that the size_t word_size parameter of the G1CollectedHeap::humongous_obj_allocate method could tell me the size of objects in bytes. I based that assumption on the type of the parameter itself, which is usually used for this kind of value. But the value is actually expressed in heap words, not in bytes. How many bytes does a word take? The answer is given by the value of the -XX:HeapWordSize parameter, which is usually 8 bytes. We will use this information later in this article.

As Alex Blewitt suggested, bpftrace is a good candidate for this use case, especially since it identifies function parameter types automatically: there is no need to manually specify types like uint32 or uint64, which means fewer errors.

Let’s start very simply by printing a message every time a humongous object is allocated, using the script { printf("Humongous allocation of %d bytes\n", arg1 * 8 ); }. The syntax is pretty self-explanatory, but two things are worth noting. We use the arg1 variable, which bpftrace automatically associates with the word_size parameter (arg0 holds the implicit this pointer of the C++ method). And we multiply its value by 8 to convert heap words into bytes.

[alarm@archlinux ~]$ sudo bpftrace -e "u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { printf(\"Humongous allocation of %d bytes\n\", arg1 * 8 ); }"
Attaching 1 probe...
Humongous allocation of 2000016 bytes
Humongous allocation of 2000016 bytes
Humongous allocation of 2000016 bytes
Humongous allocation of 2000016 bytes
Humongous allocation of 2000016 bytes
Humongous allocation of 2000016 bytes
[...]

Good news, we finally get correct values. The PetClinic application was modified to allocate arrays of 250,000 longs. This should amount to approximately 2 MB of heap (2,000,000 bytes, plus the array header). And this is precisely the value that we get with bpftrace.
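
As a quick sanity check, this number can be reconstructed by hand, assuming a 16-byte array header (a 12-byte object header with compressed class pointers, plus the 4-byte array length):

[alarm@archlinux ~]$ echo $(( 250000 * 8 + 16 ))
2000016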

Let’s now move to Alex’s script: u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[ustack] = hist(arg1 * 8) }. Let’s analyse it first:

  • u: tells bpftrace, just like previously, that we are going to instrument a user function

  • $LIBJVM points to the code to be instrumented

  • _ZN15G1CollectedHeap22humongous_obj_allocateEm is the name of the function to instrument

  • ustack is a variable which BPF associates with the userspace stacktrace automatically

  • @[ustack] = hist(arg1 * 8) creates an associative array in which each user stacktrace is mapped to a histogram of all allocated object sizes

The use of a histogram is not strictly necessary here, given that all humongous objects have the same size. But in a real situation, typically when allocation sizes are computed dynamically, it would make it possible to group them into buckets and see how far apart these sizes are. The hist function groups the values by powers of 2. For instance:

@size:
    [4, 8)                25 |                                                    |
    [8, 16)               74 |                                                    |
    [16, 32)               5 |                                                    |
    [32, 64)           13603 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [64, 128)           2527 |@@@@@@@@@                                           |
...

Let’s execute this script. We can see that 16 objects whose size is between 1 MB and 2 MB were allocated in PetClinic’s OwnerController::showOwner() method.

[alarm@archlinux ~]$ sudo bpftrace -e "u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[ustack] = hist(arg1 * 8) }"
Attaching 1 probe...
[...]
@[
    G1CollectedHeap::humongous_obj_allocate(unsigned long)+0
    G1CollectedHeap::mem_allocate(unsigned long, bool*)+261
    MemAllocator::allocate() const+154
    TypeArrayKlass::allocate_common(int, bool, Thread*)+184
    OptoRuntime::new_array_C(Klass*, int, JavaThread*)+51
    _new_array_Java+42
    Lorg/springframework/samples/petclinic/owner/OwnerController;::showOwner+96
    Ljdk/internal/reflect/GeneratedMethodAccessor134;::invoke+172
    Lorg/springframework/web/method/support/InvocableHandlerMethod;::invokeForRequest+584
    Lorg/springframework/web/servlet/mvc/method/annotation/ServletInvocableHandlerMethod;::invokeAndHandle+88
    Lorg/springframework/web/servlet/mvc/method/AbstractHandlerMethodAdapter;::handle+2848
    Lorg/springframework/web/servlet/DispatcherServlet;::doDispatch+2628
    Ljavax/servlet/http/HttpServlet;::service+6900
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+492
    Lorg/apache/tomcat/websocket/server/WsFilter;::doFilter+104
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/springframework/web/servlet/resource/ResourceUrlEncodingFilter;::doFilter+372
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/springframework/web/filter/RequestContextFilter;::doFilterInternal+1160
    Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+292
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/springframework/web/filter/FormContentFilter;::doFilterInternal+392
    Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+292
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/springframework/boot/actuate/metrics/web/servlet/WebMvcMetricsFilter;::doFilterInternal+1148
    Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+292
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/springframework/web/filter/CharacterEncodingFilter;::doFilterInternal+200
    Lorg/springframework/web/filter/OncePerRequestFilter;::doFilter+292
    Lorg/apache/catalina/core/ApplicationFilterChain;::internalDoFilter+284
    Lorg/apache/catalina/core/StandardWrapperValve;::invoke+908
    Lorg/apache/catalina/authenticator/AuthenticatorBase;::invoke+2144
    Lorg/apache/catalina/core/StandardHostValve;::invoke+988
    Lorg/apache/catalina/core/StandardEngineValve;::invoke+512
    Lorg/apache/catalina/connector/CoyoteAdapter;::service+1292
    Lorg/apache/coyote/http11/Http11Processor;::service+800
    Lorg/apache/coyote/AbstractProcessorLight;::process+172
    Lorg/apache/coyote/AbstractProtocol$ConnectionHandler;::process+984
    Lorg/apache/tomcat/util/net/NioEndpoint$SocketProcessor;::doRun+492
    Lorg/apache/tomcat/util/net/SocketProcessorBase;::run+220
    Interpreter+6136
    Interpreter+6067
    Interpreter+6136
    Interpreter+6136
    call_stub+138
    JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+969
    JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+485
    thread_entry(JavaThread*, Thread*)+108
    JavaThread::thread_main_inner()+446
    Thread::call_run()+269
    thread_native_entry(Thread*)+231
    start_thread+223
]:
[1M, 2M)              16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

Bonus: more BPF!

Alex did not stop there and suggested several other BPF scripts. They could be very useful in a real situation, although they were not in my experiment, since all my threads run the same code and allocate objects of the same size.

These scripts are listed below, as they can surely be useful in other cases.

Trace the allocations per thread and per stacktrace

The script u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[tid, ustack] = hist(arg1 * 8) } creates a two-dimensional associative array. At each humongous allocation, for each thread id and each stacktrace, a histogram records the size of the allocated objects.

This can be useful when an application has several thread pools and we want to identify which one is responsible for which allocations.
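
For reference, this script is launched exactly like the previous ones, assuming $LIBJVM is set as before:

[alarm@archlinux ~]$ sudo bpftrace -e "u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[tid, ustack] = hist(arg1 * 8) }"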

Trace the allocations per JVM, per thread and per stacktrace

Why stop at two dimensions? The script u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[pid, tid, ustack] = hist(arg1 * 8) } creates a three-dimensional associative array. At each humongous allocation, for each JVM pid, each thread id and each stacktrace, a histogram records the size of the allocated objects.

This can be useful when multiple JVMs run simultaneously on the same machine.
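
The invocation follows the same pattern. Since the uprobe is attached to the libjvm.so file itself, every JVM that maps this particular library is traced:

[alarm@archlinux ~]$ sudo bpftrace -e "u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[pid, tid, ustack] = hist(arg1 * 8) }"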

Trace precisely the size of allocated objects

It is also possible to use the allocated object size as a key in the associative array. With the script u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[arg1, ustack] = count() }, for each object size and each stacktrace, we count the number of allocations. Note that arg1 is expressed in heap words here; multiply it by 8 inside the key to get bytes, as before.

This script can be used when the number of different object sizes is small.
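
The corresponding command line, following the same pattern as the previous ones:

[alarm@archlinux ~]$ sudo bpftrace -e "u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[arg1, ustack] = count() }"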

Periodically print results

All the scripts presented in this article print their results when bpftrace is stopped. It is possible to append interval:s:10 { print(@); clear(@); } to each of them so that the results are printed every 10 seconds instead. For example:

[alarm@archlinux ~]$ sudo bpftrace -e "u:$LIBJVM:_ZN15G1CollectedHeap22humongous_obj_allocateEm { @[ustack] = hist(arg1 * 8) } interval:s:10 { print(@); clear(@); }"

Bonus: trace humongous allocations with async-profiler

Andrei Pangin mentioned that it is possible to use async-profiler to trace humongous allocations. The main idea is identical to the one presented in the previous article, as async-profiler can also instrument user functions.

Using the command profiler.sh -e _ZN15G1CollectedHeap22humongous_obj_allocateEm -f allocations.svg `pgrep java`, it is possible to get an SVG file with all the allocation stacktraces, as a flamegraph.

(Screenshot from Andrei Pangin: an async-profiler flame graph of the humongous allocations. The mangled name is different between Java 8 and Java 11.)

In Andrei’s screenshot, you can see that the mangled name ends with allocateEmh, whereas in my command it ends with allocateEm. This is due to a signature change in humongous_obj_allocate: in Java 8, the method had two parameters, but in Java 11 the second parameter was removed. Therefore, the mangled name differs depending on which Java version you use.

  • if you use Java 8, you want to run your BPF scripts/async-profiler against _ZN15G1CollectedHeap22humongous_obj_allocateEmh

  • if you use Java 11 or higher, you want to run your BPF scripts/async-profiler against _ZN15G1CollectedHeap22humongous_obj_allocateEm

2020-02-17 Update: It is possible to use _ZN15G1CollectedHeap22humongous_obj_allocateEm* (note the trailing * wildcard) in BPF scripts and with async-profiler. This allows for a solution that is compatible with both Java 8 and Java 11+. Additionally, it should be noted that async-profiler now supports demangled names with the usual -e flag, as in -e G1CollectedHeap::humongous_obj_allocate.
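
Putting the update together, the two variants would look like the following. Treat these as sketches: the exact behaviour depends on your async-profiler version.

# Wildcard on the mangled name, compatible with both Java 8 and Java 11+
profiler.sh -e '_ZN15G1CollectedHeap22humongous_obj_allocateEm*' -f allocations.svg `pgrep java`

# Demangled name, supported by recent async-profiler versions
profiler.sh -e G1CollectedHeap::humongous_obj_allocate -f allocations.svg `pgrep java`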

Bonus: trace humongous allocations with Java Flight Recorder

Jean-Philippe Bempel suggested using Java Flight Recorder to trace these allocations.

This time, the main idea is different. Instead of instrumenting a JVM function, we rely on the JVM’s own monitoring. We know that, because of their size, humongous objects are allocated outside of each thread’s TLAB. Therefore, we can inspect the allocations that happened outside TLABs, sort them by size, and locate the objects, their sizes and their allocation sites.

It should however be noted that not all objects allocated outside a TLAB are humongous. But this solution has the advantage of being integrated into tooling that most enterprises already use (JFR/JMC).
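
As a sketch, such a recording could be started with jcmd and then opened in JDK Mission Control to inspect the allocations that happened outside TLABs. The options and event name below are my assumptions and may vary between JDK versions:

# Start a 60-second recording on the running JVM and dump it to a file
jcmd `pgrep java` JFR.start duration=60s filename=humongous.jfr

# Once the recording is done, open humongous.jfr in JDK Mission Control and look at
# the allocations outside TLAB (jdk.ObjectAllocationOutsideTLAB events), sorted by size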

(Screenshot: the JFR recording in JDK Mission Control, showing the allocations that happened outside TLABs.)

Conclusion

If I had to keep only one thing from all of this, it would be that the Java community is awesome.

I did not think I would get so much positive, useful and actionable feedback. A simple BPF command exploration exercise became something a lot more elaborate. This would not have been possible without all these people, so thank you very much for your help! Thanks to you, I have learned a lot more than I anticipated!


If you have any questions or comments, feel free to send me a tweet at @pingtimeout. And if you enjoyed this article and want to support my work, you can always buy me a coffee ☕️.