Tuesday, August 27, 2013

How to not measure memory impact?

Today, I was asked on Twitter to have a look at this article by Brice LEPORINI and provide feedback. This could not fit in a "comments" section, so I wrote an article. Here it goes!

The experiment

Brice set up an experiment to measure the cost of going immutable, compared to staying mutable.

Although the exercise is risky, he emphasizes that his results cannot be used to draw bad/flame/godwin conclusions like "mutable is soooo better than immutable, you functional suckers!". There are a lot of flawed benchmarks on the internet that claim to be the only truth, so let's rejoice when experiment reports try to be objective and complete, and come with a disclaimer about their limits.

My specs

I ran the code on my laptop, without any application running in the background (not even an X server), to make sure the system was quiet. I used OpenJDK 1.7.0_40. My CPU is a quad-core Intel(R) Core(TM) i7-3720QM @ 2.60GHz.

Also, my memory bandwidth is about 2.5GB/s. This is an important piece of information; keep it in mind.
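For the curious, a figure of this kind can be roughly estimated from Java itself. Here is a minimal sketch (the buffer size and round count are arbitrary choices of mine, and the result is only a crude estimate, since the JIT, the GC and cache effects all interfere):

```java
import java.util.Arrays;

// Rough memory-bandwidth estimate: repeatedly fill a buffer larger than
// the CPU caches and divide the bytes written by the elapsed time.
public class BandwidthEstimate {
    static double estimateGbPerSec(int bufferMb, int rounds) {
        long[] buf = new long[bufferMb * 1024 * 1024 / 8]; // 8 bytes per long
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            Arrays.fill(buf, r); // streams bufferMb megabytes to main memory
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double gigabytes = (double) bufferMb * rounds / 1024;
        return gigabytes / seconds;
    }

    public static void main(String[] args) {
        // Warm up the JIT once, then measure.
        estimateGbPerSec(64, 2);
        System.out.printf("~%.1f GB/s%n", estimateGbPerSec(64, 8));
    }
}
```

This only exercises sequential writes, so dedicated tools (STREAM, for instance) will give a more trustworthy number.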

The attempt

I ran the test with the following parameters:
  • Parallel GC, PrintGCDetails, PrintTenuringDistribution, Xms1g, Xmx1g
  • Parallel GC, PrintGCDetails, PrintTenuringDistribution, Xms6g, Xmx6g, Xmn5g
  • CMS, PrintGCDetails, PrintTenuringDistribution, Xms1g, Xmx1g
  • CMS, PrintGCDetails, PrintTenuringDistribution, Xms6g, Xmx6g, Xmn5g
  • G1, PrintGCDetails, PrintTenuringDistribution, Xms1g, Xmx1g
  • G1, PrintGCDetails, PrintTenuringDistribution, Xms6g, Xmx6g, Xmn5g
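In HotSpot terms, those combinations translate into command lines along these lines (a sketch: the main class name `ImmutabilityTest` is borrowed from the results below, the classpath is omitted, and the GC log names follow the convention described later in this article):

```shell
# Parallel GC, 1 GB fixed heap
java -XX:+UseParallelGC -XX:+PrintGCDetails -XX:+PrintTenuringDistribution \
     -Xms1g -Xmx1g -Xloggc:gc-parallel-1-1.log ImmutabilityTest

# CMS, 6 GB heap with a 5 GB young generation
java -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintTenuringDistribution \
     -Xms6g -Xmx6g -Xmn5g -Xloggc:gc-cms-6-6-5.log ImmutabilityTest

# G1, 1 GB fixed heap
java -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintTenuringDistribution \
     -Xms1g -Xmx1g -Xloggc:gc-g1-1-1.log ImmutabilityTest
```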

CMS and G1 runs were only for science ;-)

The results

Here is the table of results that I built from these runs, sorted by full average:

Test              Heap         GC        Iterations  Max    Min   Sampled avg  Full avg
ImmutabilityTest  6mx 6ms 5mn  G1        10000       38612  1164  1803         1361
ImmutabilityTest  6mx 6ms 5mn  Parallel  10000       11813  1093  1605         1217
MutabilityTest    6mx 6ms 5mn  G1        10000       13058  1027  1310         1180
ImmutabilityTest  6mx 6ms 5mn  CMS       10000       12182  1081  1603         1171
ImmutabilityTest  1mx 1ms      CMS       10000       13522  1031  1586         1101
ImmutabilityTest  1mx 1ms      Parallel  10000       12756  1023  1613         1084
MutabilityTest    6mx 6ms 5mn  Parallel  10000       31577  941   1336         997
MutabilityTest    1mx 1ms      Parallel  10000       6289   869   1072         895
MutabilityTest    1mx 1ms      CMS       10000       9604   860   1068         893
MutabilityTest    6mx 6ms 5mn  CMS       10000       6518   770   973          819

For each test, two files were generated :
  • One file that has the console output (run-xxxxx)
  • One file for the GC log

The file names follow this convention : <TYPE>-<GC>-<MX>-<MS>-<MN?>.log where
  • TYPE is "run" or "gc", depending on the log
  • GC is "parallel", "cms" or "g1", depending on the GC that was used
  • MX is the max heap size in GB
  • MS is the min heap size in GB
  • MN is the young generation size in GB
They are available on this Github repository.

So, what can we conclude from these results? Well, not much, I'm afraid.

There are several problems with this experiment on my machine. One of them is that we are measuring things that take about 1 millisecond. This resolution is very low, and we cannot see significant differences at this timescale.

Also, a look at the GC logs shows that the GC durations for PSYoungGen and ParNew are spread between 1ms and 10ms. To say it differently: a GC cycle takes between 1 and 10 times the duration of an iteration. Therefore a GC cycle would have a huge impact on our microbenchmark.
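The distortion is easy to quantify with back-of-the-envelope arithmetic; here is a minimal sketch (the sample count and durations are illustrative, not taken from the logs):

```java
// Back-of-the-envelope: how much does a single GC pause distort an
// average over samples that each take about 1 ms?
public class PauseSkew {
    // Average of `samples` iterations of `iterMs` each, with one
    // GC pause of `pauseMs` landing inside one of them.
    static double skewedAverageMs(int samples, double iterMs, double pauseMs) {
        double total = samples * iterMs + pauseMs;
        return total / samples;
    }

    public static void main(String[] args) {
        // 100 samples of ~1 ms plus one 10 ms pause:
        // the average jumps from 1.0 ms to 1.1 ms, a 10% error.
        System.out.println(skewedAverageMs(100, 1.0, 10.0)); // 1.1
    }
}
```

A single pause thus shifts a 100-sample average by up to 10%, which is the same order of magnitude as the effect being measured.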

Now, the real stuff. If we look at the "Parallel GC, 1mx, 1ms" GC log, we can see two different phases in memory utilization. Those two phases are the immutable test first and then the mutable test.

Now, if we look at the allocation rate, things get really interesting.

See that almost flat cloud of dots at ~2.5GB/s? This means that at that time, the application was creating that much garbage. At the beginning of this article, I mentioned that my memory bandwidth was about 2.5GB/s; now you can see it clearly.

The ImmutabilityTest is actually limited by the size of the wires between my CPU and my memory chips. To say this differently, the ImmutabilityTest does not measure how fast we can create immutable date holders; it measures how fast data can travel between my CPU and my RAM.

Therefore, all my ImmutabilityTest results in the previous table are invalid.
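To put rough numbers on this, consider a hypothetical immutable holder (the class below and its assumed size are illustrations of mine, not Brice's actual code):

```java
// Sketch: a loop creating immutable holders produces garbage at a rate of
// (object size) x (objects per second). Once that rate reaches the memory
// bandwidth, the benchmark measures the bus, not the code.
public class AllocRate {
    // Hypothetical immutable date holder: object header plus three long
    // fields, roughly 40 bytes per instance on a 64-bit JVM.
    static final class ImmutableDate {
        final long year, month, day;
        ImmutableDate(long y, long m, long d) { year = y; month = m; day = d; }
        ImmutableDate withDay(long d) { return new ImmutableDate(year, month, d); }
    }

    // Allocation rate in GB/s for a given object size and throughput.
    static double allocationRateGbPerSec(long bytesPerObject, long objectsPerSec) {
        return bytesPerObject * objectsPerSec / 1e9;
    }

    public static void main(String[] args) {
        // ~62 million holders per second at 40 bytes each already
        // saturates a 2.5 GB/s memory channel.
        System.out.println(allocationRateGbPerSec(40, 62_500_000L)); // 2.5
    }
}
```

Past that throughput, making object creation "faster" in the code changes nothing: the bus is the bottleneck.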

Final Thoughts

It is quite common for microbenchmarks to measure something different from what we expect. This is another example of such a misconception: the results were quite stable, which made us confident in the experiment, and yet they were completely wrong.

I think Brice had noticed the limits of this benchmark, but did not push the analysis to its conclusion. I hope this article added some clarification and will provide more material for future experiments.

By the way, I did not mention CMS or G1 because CMS never ran and because G1 needed to be tuned a bit. This could also be an interesting topic :-)

2 comments:

  1. This comment has been removed by the author.

  2. Hello,

    First of all, I'd like to stress that my paper was *NOT* about measuring memory impact, but about overall impact on execution time.

    Well, the first interesting fact is that the run on a different environment, with heap tweaks and with different GC implementations, mostly shows the same raw results. If you let me order the rows by Memory/GC/Test and compare SampledAvg (the full average has too high a standard deviation for me):

    1mx 1ms CMS --> 32.70%
    6mx 6ms 5mn CMS --> 39.30%
    6mx 6ms 5mn G1 --> 27.30%
    1mx 1ms Parallel --> 33.50%
    6mx 6ms 5mn Parallel --> 16.80%

    Most values show the 27~39% ratio, cool. The second interesting thing is that the tuning you applied to the heap settings seems to allow noticeable gains.
    Cooking with these parameters was not part of my goals, as modifying all entities in lists did not seem like a realistic application to me (the second reason is that it sometimes sounds like sorcery to me!).

    About GC duration, that's a good point as well, and reopening my logs confirmed the numbers you gave. Anyway, if a collection can take 10 times the measured sequence (but the average duration is twice it, and the minimum is a 20th of it), it occurs only once every 80/150 iterations (Immut/Mut). So maybe I'm wrong, but I still believe that folding the GC duration into a 100-element average makes sense.

    As we discussed on Twitter (
    , the 2.5GB/s allocation rate ceiling needs more explanation, because according to the supplier, the memory bandwidth for your CPU is not 2.5GB/s but 25.6GB/s (cf.

    Your work confirms my final thoughts: whatever the root causes are, the fact is that immutability implies more work, but the amount is low. In most cases, if the best you can gain when optimizing an application is 15~30% (and these amounts are valid only for a very specific part!), well, it seems the app is already very fast! So, in most cases, using immutability in your design would not impact performance in a noticeable way, and that's even more true if the allocation rate bottleneck can be fixed.

    After your G1 tuning paper teasing, we now await your new article ;-)