mardi 27 août 2013

How to not measure memory impact ?

Today, I was asked on Twitter to have a look at this article by Brice LEPORINI and provide feedback. This could not fit in a "comments" section so I wrote an article. Here it goes !

The experiment

Brice wrote an experiment to measure the cost of going immutable and compare it with staying mutable.

Although the exercise is risky, he emphasizes that his results cannot be used to draw bad/flame/godwin conclusions like "mutable is soooo better than immutable, you functional suckers !". There are a lot of flawed benchmarks on the internet that claim to be the only truth, so let's also rejoice when an experiment report tries to be objective and complete, and comes with a disclaimer about its limits.

My specs

I ran the code on my laptop, without any application running in the background (not even an X server) to make sure the system was quiet. I used OpenJDK 1.7.0_40. My CPU is a quad-core Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz.

Also, my memory bandwidth is about 2.5GB/s. This is an important piece of information.

The attempt

I ran the test with the following parameters :
  • Parallel GC, PrintGCDetails, PrintTenuringDistribution, Xms1g, Xmx1g
  • Parallel GC, PrintGCDetails, PrintTenuringDistribution, Xms6g, Xmx6g, Xmn5g
  • CMS, PrintGCDetails, PrintTenuringDistribution, Xms1g, Xmx1g
  • CMS, PrintGCDetails, PrintTenuringDistribution, Xms6g, Xmx6g, Xmn5g
  • G1, PrintGCDetails, PrintTenuringDistribution, Xms1g, Xmx1g
  • G1, PrintGCDetails, PrintTenuringDistribution, Xms6g, Xmx6g, Xmn5g

CMS and G1 runs were only for science ;-)

The results

Here is the table of results that I built from these runs, sorted by full average :

Test             | Memory      | GC       | Iterations | Max   | Min  | SampledAvg | FullAvg
ImmutabilityTest | 6mx 6ms 5mn | G1       | 10000      | 38612 | 1164 | 1803       | 1361
ImmutabilityTest | 6mx 6ms 5mn | Parallel | 10000      | 11813 | 1093 | 1605       | 1217
MutabilityTest   | 6mx 6ms 5mn | G1       | 10000      | 13058 | 1027 | 1310       | 1180
ImmutabilityTest | 6mx 6ms 5mn | CMS      | 10000      | 12182 | 1081 | 1603       | 1171
ImmutabilityTest | 1mx 1ms     | CMS      | 10000      | 13522 | 1031 | 1586       | 1101
ImmutabilityTest | 1mx 1ms     | Parallel | 10000      | 12756 | 1023 | 1613       | 1084
MutabilityTest   | 6mx 6ms 5mn | Parallel | 10000      | 31577 | 941  | 1336       | 997
MutabilityTest   | 1mx 1ms     | Parallel | 10000      | 6289  | 869  | 1072       | 895
MutabilityTest   | 1mx 1ms     | CMS      | 10000      | 9604  | 860  | 1068       | 893
MutabilityTest   | 6mx 6ms 5mn | CMS      | 10000      | 6518  | 770  | 973        | 819

For each test, two files were generated :
  • One file that has the console output (run-xxxxx)
  • One file for the GC log

The file names follow this convention : <TYPE>-<GC>-<MX>-<MS>-<MN?>.log where
  • TYPE is "run" or "gc", depending on the log
  • GC is "parallel", "cms" or "g1", depending on the GC that was used
  • MX is the max heap size in GB
  • MS is the min heap size in GB
  • MN is the young generation size in GB
They are available on this Github repository.

So, what can we conclude from these results ? Well, not much, I'm afraid.

There are lots of problems with this experiment on my machine. One of them is that we are measuring operations that take about 1 millisecond. This resolution is very, very low, and we cannot see significant differences at this timescale.

Also, a look at the GC logs shows that the GC durations for PSYoungGen and ParNew are spread between 1ms and 10ms. To say it differently : a GC cycle takes between 1 and 10 times the duration of an iteration. Therefore a GC cycle would have a huuuge impact on our microbenchmark.

Now, the real stuff. If we look at the "Parallel GC, 1mx, 1ms" GC log, we can see two different phases in memory utilization. Those two phases are the immutable test first and then the mutable test.


Now, if we look at the allocation rate, things get really interesting.





See that almost flat cloud of dots at ~2.5GB/s ? This means that at that time, the application was creating that much garbage. At the beginning of this article, I mentioned that my memory bandwidth was about 2.5GB/s; now you can see it clearly.

The ImmutabilityTest is actually limited by the size of the wires between my CPU and my memory chips. To say this differently, the ImmutabilityTest does not measure how fast we can create immutable date holders, it measures how fast data can go from my CPU to my RAM.

Therefore, all my ImmutabilityTest results in the previous table are invalid.
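If you want a rough idea of the allocation ceiling on your own machine, a quick and dirty probe like the following can help. It is itself a microbenchmark with all the usual caveats (JIT warmup, GC interference), so treat the number as an order of magnitude only; the class name and all values below are arbitrary choices for the example.

// Rough, self-contained probe (not Brice's benchmark) : estimate the allocation
// rate a single thread can sustain. The chunk size and iteration count are arbitrary.
public class AllocationRateProbe {
    public static void main(String[] args) {
        final int chunkSize = 1024;            // 1 KB per allocation
        final long iterations = 10_000_000L;   // ~10 GB allocated in total
        byte[] sink = null;

        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            sink = new byte[chunkSize];
            sink[(int) (i % chunkSize)] = 1;   // touch the array so it is really materialized
        }
        long elapsedNanos = System.nanoTime() - start;

        double gigabytes = (double) chunkSize * iterations / (1024 * 1024 * 1024);
        double seconds = elapsedNanos / 1_000_000_000.0;
        System.out.printf("Allocated ~%.1f GB in %.1f s => %.2f GB/s (last=%d)%n",
                gigabytes, seconds, gigabytes / seconds, sink[0]);
    }
}

On a machine whose memory bandwidth is the bottleneck, this kind of probe tends to plateau around the same figure as the flat cloud of dots above.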

Final Thoughts

Microbenchmarks quite often measure something different from what we expected. This is another example of such a misconception : the results were quite stable, they made us confident in the experiment, and yet they were completely wrong.

I think Brice had noticed the limits of this benchmark, but did not push the analysis to its end. I hope this article added some clarifications and will provide more material for future experiments.

Btw, I did not mention CMS nor G1 because CMS never ran and because G1 needed to be tuned a bit. This could also be an interesting topic :-)

mardi 13 août 2013

GC log format change

Recently, I noticed that System.gc() calls are reported differently in GC logs. This article reports the changes that happened and the evolution of the GC log format for System.gc() calls.

Disclaimer

System.gc() calls are bad (like in "baaaaaaad"). However, they are fairly easy to detect, so it did not take long to fix them.

This article is mostly about 1.7.0_40, even if other versions are referenced.

The Hotspot GC log format is always subject to change without notice. The information given here may not be correct for previous or future versions of the Hotspot JVM.

System.gc() detection

Finding out whether you have System.gc() calls is easy. You just have to grep your GC logs for "Full GC (System)". If it matches, then you have some of those nasty calls. If it does not, you are clean.
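If you prefer doing this from Java rather than with grep (for instance to plug it into an existing tool), a minimal sketch could look like the following; the class name is arbitrary, and the detection is only as good as the "Full GC (System)" marker itself.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Minimal sketch : print every line of a GC log that reports an explicit GC.
// Only works as long as the JVM still writes the "Full GC (System)" marker.
public class SystemGcDetector {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("Full GC (System)")) {
                    System.out.println(line);
                }
            }
        }
    }
}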

But with a recent JVM, things get complicated. Consider the following program :

public class Test {
    public static void main(String... args) {
        System.gc();
    }
}

Run it with an old JVM and you will get a classic "Full GC (System)" line in the logs.

$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)

$ java -verbose:gc -XX:+PrintGCDetails Test
[GC [PSYoungGen: 629K->256K(36736K)] 629K->256K(120768K), 0.0007280 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
[Full GC (System) [PSYoungGen: 256K->0K(36736K)] [PSOldGen: 0K->120K(84032K)] 256K->120K(120768K) [PSPermGen: 2566K->2566K(21248K)], 0.0028790 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 

But with the latest stable version, System.gc() calls are reported as regular full GCs.

$ java -version
java version "1.7.0_40"
OpenJDK Runtime Environment (IcedTea 2.4.1) (ArchLinux build 7.u40_2.4.1-2-x86_64)
OpenJDK 64-Bit Server VM (build 24.0-b50, mixed mode)

$ java -verbose:gc -XX:+PrintGCDetails Test 
[GC [PSYoungGen: 634K->352K(36864K)] 634K->352K(120832K), 0.0018710 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
[Full GC [PSYoungGen: 352K->0K(36864K)] [ParOldGen: 0K->226K(83968K)] 352K->226K(120832K) [PSPermGen: 2726K->2725K(21504K)], 0.0076090 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 

Things get even more tricky : in that particular case, you can deduce that this was a System.gc() call :
  • It cannot be triggered by heap over-utilization since the old gen was empty (0 KB)
  • It cannot be triggered by PermGen overflow since the perm gen was not full (2.7 MB)
  • It cannot be triggered by heap fragmentation since the old gen was empty, the parallel GC was used and the heap did not grow
  • It cannot be triggered by an oversized object allocation since the heap occupancy did not increase

However, in the general case, the old gen is rarely completely empty, so we will usually not be able to identify System.gc() calls that way.

What about CMS ?

Using CMS, we have a bit more information.

$ java -verbose:gc -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC Test 
[Full GC[CMS: 0K->227K(84032K), 0.0111990 secs] 673K->227K(121856K), [CMS Perm : 2727K->2726K(21248K)], 0.0112790 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 

A "Full GC" line with CMS means promotion failure, concurrent mark failure or System.gc(). On recent JVM, Full GC logs based on promotion failure contain the string "promotion failure". As well, Full GC logs caused by CMF contains the string "concurrent mode failure".

In this particular case and particular JVM version, we can infer that this line is a call to System.gc().

What about G1 ?

Strangely enough, G1 is the easiest GC to deal with in this situation. The output is the following :

$ java -verbose:gc -XX:+PrintGCDetails -XX:+UseG1GC Test 
[Full GC 323K->226K(7168K), 0.0222030 secs]
   [Eden: 1024.0K(24.0M)->0.0B(3072.0K) Survivors: 0.0B->0.0B Heap: 323.0K(124.0M)->226.6K(7168.0K)]
 [Times: user=0.01 sys=0.01, real=0.02 secs] 

Usually, detailed G1 logs come with a LOT of information. Here, so little data must mean that this is not a regular young or mixed collection. This is definitely a System.gc() call.

What about Java 8 ?

This pattern is different on a Java 8 VM. Indeed, such calls are reported with the string "System.gc()".

$ java -version
java version "1.8.0-ea"
Java(TM) SE Runtime Environment (build 1.8.0-ea-b99)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b41, mixed mode)

With the parallel GCs :

$ java -verbose:gc -XX:+PrintGCDetails Test
[GC (System.gc()) [PSYoungGen: 634K->352K(36864K)] 634K->352K(120832K), 0.0185000 secs] [Times: user=0.01 sys=0.10, real=0.02 secs] 
[Full GC (System.gc()) [PSYoungGen: 352K->0K(36864K)] [ParOldGen: 0K->252K(83968K)] 352K->252K(120832K), [Metaspace: 2585K->2585K(110592K)], 0.0037170 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 

With CMS :

$ java -verbose:gc -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC Test
[Full GC (System.gc()) [CMS: 0K->258K(84032K), 0.0108710 secs] 673K->258K(121856K), [Metaspace: 2585K->2585K(110592K)], 0.0114610 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 

With G1 :

$ java -verbose:gc -XX:+PrintGCDetails -XX:+UseG1GC Test
[Full GC (System.gc())  372K->252K(7168K), 0.0146070 secs]
   [Eden: 1024.0K(24.0M)->0.0B(3072.0K) Survivors: 0.0B->0.0B Heap: 372.4K(124.0M)->253.0K(7168.0K)], [Metaspace: 2585K->2585K(110592K)]
 [Times: user=0.01 sys=0.01, real=0.01 secs] 


Final Thoughts

Now, you know that you have to look even more closely at GC logs to find suspicious activity like Full GC events that are actually System.gc() calls.

Btw, life must be really hard for the folks who write GC log parsers. It surely takes time to keep up with all these changes !

samedi 10 août 2013

Resampling the audio track of a video

MOOC platforms like Coursera and MIT OpenCourseWare are fantastic. But sometimes the videos provided are not really audible, especially when you are in noisy public transportation. This article shows how to solve this issue.

The issue

Take any course from MIT OCW. It was likely recorded in a classroom, with some echo and little noise. The problem is that when you are in a bus and want to watch your video lessons, you are very easily disturbed by the environment (horns, brake noises, people speaking loudly).

How can we transform a video so that the volume is higher ? ffmpeg+sox is a very efficient combination for that.

The process

First, we have to extract the audio track from the video, then amplify the volume, and finally rebuild the whole video using the amplified audio track.

Here, I assume that the file you want to modify is stored in ${video}.

To extract the audio track in a file named "${video}.wav", use this command :
ffmpeg -i $video ${video}.wav

Then, find the maximum amplification you can use without introducing noise :
sox ${video}.wav -n stat 2>&1 | grep "Volume adjustment"

This will return, for instance, "Volume adjustment : 2.041".

Use this value to amplify the audio track (here, 2.041) :
sox -v 2.041 ${video}.wav ${video}.louder.wav

Finally, rebuild the video using the amplified audio track :
ffmpeg -i $video -i ${video}.louder.wav \
  -map 0:0 -map 1:0 \
  -strict -2 \
  -acodec aac -ab 96k -vcodec copy resampled.${video}
Note that you may need to use the audio codec "libfaac", depending on your version of ffmpeg and how it was built. In that case, drop the parameter "-strict -2", which is only required to enable the experimental audio codec "aac".

That's it. Now you can clean the temporary files that were generated :
rm ${video}.wav ${video}.louder.wav
mv ${video} ${video}.old
mv resampled.${video} ${video}

Final thoughts

That was simple; now you can enjoy/learn without being affected by a too-low volume.

Any other issue with MOOC or video encoding/decoding/resampling ? Let's discuss it in the comments !

mardi 4 juin 2013

Project Tyrion

Recently, thanks to Kirk Pepperdine for the idea and a brainstorming session, I started working on a new open-source lock-profiler (Tyrion). Now that I have a first version working, it's time to write an article to explain the concepts for those who would want to join.

Why Tyrion ?

Because it is interesting !

Consider the following questions :
  • In my application, which locks are never used ?
  • In my application, which locks are used only by one thread ?
  • In my application, which locks are contended ?
To answer them, you have to dig inside the JVM and understand its inner workings, you have to be curious, you may even have to manipulate code that you did not write. These are challenges that sound fun !

How (high level view) ?

I wanted to instrument code to have a clear view of the critical sections of the application. Basically, I wanted to transform this kind of code (assuming bar is an instance variable) :
public void foo() {
  synchronized(bar) {
    //...
  }
}

My goal is to have this result :
public void foo() {
  synchronized(bar) {
    LockInterceptor.enteredSynchronizedBlock(bar);
    //...
    LockInterceptor.leavingSynchronizedBlock(bar);
  }
}

How (tricky technical details) ?

The bytecode of method foo looks like this.
GETFIELD #3
MONITORENTER
...
GETFIELD #3
MONITOREXIT

I want to get a reference to the target of the lock. Fortunately, the bytecode DUP is just what I need, since it duplicates the top element of the stack. I just have to transform the code so that it looks like this.
GETFIELD #3
DUP
MONITORENTER
INVOKESTATIC   #4    // fr/pingtimeout/tyrion/LockInterceptor.enteredSynchronizedBlock:(Ljava/lang/Object;)V
...
GETFIELD #3
DUP
INVOKESTATIC   #5    // fr/pingtimeout/tyrion/LockInterceptor.leavingSynchronizedBlock:(Ljava/lang/Object;)V
MONITOREXIT

And guess what ? Using ASM, obtaining this result is really simple : you only have to find MONITORENTER and MONITOREXIT, and add instructions before/after them. See the class LockTransformer.java for more details.
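To make the idea more concrete, here is a hedged sketch of what such a MethodVisitor could look like. This is not the actual Tyrion source, just an illustration based on the bytecode listings above, and it assumes the ASM 4 API that was current at the time.

import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Illustration only : around each MONITORENTER / MONITOREXIT, duplicate the lock
// target on the stack and notify the interceptor, like the bytecode shown above.
class MonitorInstructionVisitor extends MethodVisitor {
    MonitorInstructionVisitor(MethodVisitor mv) {
        super(Opcodes.ASM4, mv);
    }

    @Override
    public void visitInsn(int opcode) {
        if (opcode == Opcodes.MONITORENTER) {
            // DUP ; MONITORENTER ; INVOKESTATIC enteredSynchronizedBlock
            super.visitInsn(Opcodes.DUP);
            super.visitInsn(Opcodes.MONITORENTER);
            super.visitMethodInsn(Opcodes.INVOKESTATIC,
                    "fr/pingtimeout/tyrion/LockInterceptor",
                    "enteredSynchronizedBlock", "(Ljava/lang/Object;)V");
        } else if (opcode == Opcodes.MONITOREXIT) {
            // DUP ; INVOKESTATIC leavingSynchronizedBlock ; MONITOREXIT
            super.visitInsn(Opcodes.DUP);
            super.visitMethodInsn(Opcodes.INVOKESTATIC,
                    "fr/pingtimeout/tyrion/LockInterceptor",
                    "leavingSynchronizedBlock", "(Ljava/lang/Object;)V");
            super.visitInsn(Opcodes.MONITOREXIT);
        } else {
            super.visitInsn(opcode);
        }
    }
}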

Pitfall 1 - Synchronized methods

This approach is nice, but does not cover all possible cases. For instance, there is no MONITORENTER nor MONITOREXIT for synchronized methods. The JVM handles synchronization internally. The solution is to visit every synchronized method and insert calls to LockInterceptor at method entry and method exit.

Now the question is : what is the lock ? Simple : it is "this", and a reference to "this" in instance methods can be pushed on the stack using ALOAD_0.

So the following bytecodes should be added as the first statements of every synchronized method.
ALOAD_0
INVOKESTATIC   #6    // fr/pingtimeout/tyrion/LockInterceptor.enteredSynchronizedMethod:(Ljava/lang/Object;)V

And the following bytecodes should be added as the last statements of every synchronized method.
ALOAD_0
INVOKESTATIC   #7    // fr/pingtimeout/tyrion/LockInterceptor.leavingSynchronizedMethod:(Ljava/lang/Object;)V

Pitfall 2 - Static methods

So far, things are cool. However, static methods are not. In a static synchronized method, the lock is the class itself, and it cannot be retrieved by ALOAD_0.

To get a reference to the class, I had to get the class FQN and insert a call to java/lang/Class.forName() instead of the ALOAD_0 instruction. So the following bytecodes should be added as the first statements of every static-synchronized method.
LDC            "class FQN"
INVOKESTATIC   #2    // java/lang/Class.forName:(Ljava/lang/String;)Ljava/lang/Class;
INVOKESTATIC   #6    // fr/pingtimeout/tyrion/LockInterceptor.enteredSynchronizedMethod:(Ljava/lang/Object;)V

And the following bytecodes should be added as the last statements of every static-synchronized method.
LDC            "class FQN"
INVOKESTATIC   #2    // java/lang/Class.forName:(Ljava/lang/String;)Ljava/lang/Class;
INVOKESTATIC   #7    // fr/pingtimeout/tyrion/LockInterceptor.leavingSynchronizedMethod:(Ljava/lang/Object;)V
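As an illustration only (again, not the actual Tyrion code, and assuming the ASM 4 API), here is a sketch of a visitor handling the method-entry part for both instance and static synchronized methods, using visitCode() as the injection point. The exit part is left out, since it requires intercepting every RETURN and ATHROW instruction.

import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Illustration only : push the lock owner on the stack at the beginning of a
// synchronized method, then notify the interceptor. For an instance method the
// lock is "this" (ALOAD 0); for a static method it is the Class object,
// obtained through Class.forName("the.class.FQN").
class SynchronizedMethodEntryVisitor extends MethodVisitor {
    private final boolean isStatic;
    private final String classFqn;   // e.g. "com.example.Foo" (dotted form for Class.forName)

    SynchronizedMethodEntryVisitor(MethodVisitor mv, boolean isStatic, String classFqn) {
        super(Opcodes.ASM4, mv);
        this.isStatic = isStatic;
        this.classFqn = classFqn;
    }

    @Override
    public void visitCode() {
        super.visitCode();
        if (isStatic) {
            // LDC "class FQN" ; INVOKESTATIC Class.forName
            super.visitLdcInsn(classFqn);
            super.visitMethodInsn(Opcodes.INVOKESTATIC, "java/lang/Class",
                    "forName", "(Ljava/lang/String;)Ljava/lang/Class;");
        } else {
            // ALOAD 0 : push "this"
            super.visitVarInsn(Opcodes.ALOAD, 0);
        }
        super.visitMethodInsn(Opcodes.INVOKESTATIC,
                "fr/pingtimeout/tyrion/LockInterceptor",
                "enteredSynchronizedMethod", "(Ljava/lang/Object;)V");
    }
}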

Conclusion & Special Thanks

That's it, synchronized blocks and methods are intercepted, and it did not require lots of code (3 bytecodes inserted at most, 5 classes to do that with ASM). The source is available on GitHub, if you want to have a look.

There are lots of subtle things going on, such as reentrant locks, but this is another topic.

Special thanks to Julien Ponge and Frédéric Le Mouël, who gave a great talk at Devoxx France 2012 about bytecode manipulation. This really helped me a lot. Also, thanks to the ASM guys, this framework rox.

Any experience with ASM or bytecode manipulation ? Already wrote an open-source lock profiler ? Let's discuss it in the comments !

vendredi 26 avril 2013

WANTED : Stacktraces

Lately, I ran into a situation in my production environment : I was getting NullPointerExceptions that did not have any stacktrace. In this article, I will describe what happened and analyse the cause of such a strange behaviour. This article is inspired by Andrey Vityuk's blog, where I found the first pieces of information I needed to fully understand the problem, so, thanks Andrey !

The problem

Until then, you were having a day just like any other, getting rid of bugs in a production system, one by one. With everything else done, you open that old ticket that mentions a NullPointerException. Fine, usual business, it does not seem critical, and you think "let's just grab yesterday's production logs !".

But then, you realise that all you have are lines like these :
2012-12-12 15:42:01 [main] ERROR Exception during feature xyz
java.lang.NullPointerException
2012-12-12 15:47:22 [main] ERROR Exception during feature xyz
java.lang.NullPointerException
2012-12-12 15:48:00 [main] ERROR Exception during feature xyz
java.lang.NullPointerException


The first thing you think is "Did the developer really write log.error("Error during feature xyz\n" + e.getClass()) ?". And also "How am I supposed to solve this bug without knowing where the NPE is thrown ?".

So you open the Java file that contains the feature xyz and there, you see something that you did not expect :
log.error("Exception during feature xyz", e);


Your logging framework is correctly used, nothing to report here, but then why do you have no stacktrace ?

Guess what ? It's Hotspot's fault !

Fortunately, you find Andrey Vityuk's blog and discover that Hotspot can simply throw Exceptions that do not have stacktraces, for performance's sake.

Indeed, if Hotspot detects that you have a hot method that throws exceptions, it can decide to remove the stacktrace so that the exception throwing mechanism is faster. Some people might call this "a bug", I won't.

We should always remember that Hotspot's job is to make our code run faster. Period.

When Hotspot optimises code, it rewrites that code, merges methods, swaps instructions and compiles it to native code. Once this is done, you don't have anything that looks like your Java classes. Unfortunately, this comes at a very high price when Exceptions start to mess around, since Hotspot has to do the exact opposite work to rebuild the consistent stacktrace that you are expecting.

By definition, Exceptions are exceptional (or at least, they should be). In some code bases, we can find exceptions that are used to implement business cases. This can certainly be considered a code smell.

The code

Consider the following code :

public class StackTraceGenerator {
    public static void main(String[] args) {
        for (int i = 0; i < 16_000; i++) {
            System.err.print("iteration " + i + ", ");
            throwAndCatchNullPointerException();
        }
    }

    private static void throwAndCatchNullPointerException() {
        try {
            throwNPE();
        } catch (NullPointerException e) {
            System.err.print(e.hashCode() + ", ");
            e.printStackTrace();
        }
    }

    private static int throwNPE() {
        return ((Object) null).hashCode();
    }
}


If you run this code with C2 (-server), you may eventually see that it generates exceptions without stacktraces. On my machine, it happens around iteration 10,050.

$ java -server fr.pingtimeout.StackTraceGenerator
iteration 0, 1442002549, java.lang.NullPointerException
        at fr.pingtimeout.StackTraceGenerator.throwNPE(StackTraceGenerator.java:27)
        at fr.pingtimeout.StackTraceGenerator.throwAndCatchNullPointerException(StackTraceGenerator.java:19)
        at fr.pingtimeout.StackTraceGenerator.main(StackTraceGenerator.java:13)
[...]
iteration 10015, 533361599, java.lang.NullPointerException
        at fr.pingtimeout.StackTraceGenerator.throwNPE(StackTraceGenerator.java:27)
        at fr.pingtimeout.StackTraceGenerator.throwAndCatchNullPointerException(StackTraceGenerator.java:19)
        at fr.pingtimeout.StackTraceGenerator.main(StackTraceGenerator.java:13)
iteration 10016, 27836516, java.lang.NullPointerException
iteration 10017, 27836516, java.lang.NullPointerException
[...]


Exceptions without Stacktraces, behind the scenes

In the print statement, I included the hashCode of the exception that was thrown. This is intentional, since it shows that after the optimisation is applied, the same exception is thrown for each call to throwNPE.

Indeed, the way Hotspot optimises this code is the following :
  • Create an instance of the exception that should be thrown
  • Clear its stacktrace
  • Cache this exception
  • Throw this exception instead of creating a new one for future calls


So Hotspot does not only truncate the stacktrace, it also optimises the instance creation. At first, I thought that it was an optimisation in the way try/catch blocks work under heavy pressure, but that is not the case. Simply put, this optimisation consists of a singleton for a given exception; the rest of the code stays identical. Main gain : Hotspot does not have to rebuild a new stacktrace from optimised code each time the exception is thrown.

This behaviour is triggered by the JVM flag OmitStackTraceInFastThrow, which is enabled by default. We could disable it using -XX:-OmitStackTraceInFastThrow, but I would not do that.

Final thoughts

This exception was the result of a coding mistake and happened enough times for Hotspot to optimise it. The bug should be easily reproducible on any machine and fixed right away.

We should not reduce the performance of a whole system in order for a bug to be easily fixed.

Ever faced this situation ? Another war story to share ? Let's discuss it in the comments !

vendredi 12 avril 2013

Extracting iCMS data from garbage collector logs

Recently, I had to inspect some GC logs that had been created on a recent Hotspot JVM, using ParNew, CMS and iCMS. I did not find any tool that could report any information on iCMS duty cycles, so I wrote one that allows me to spot iCMS problems in a matter of seconds.

The problem

The Concurrent Mark and Sweep GC is almost always used on websites that have a lot of traffic and that cannot tolerate long pauses. Its incremental mode is often used on servers with limited hardware, so that small CMS collections on small chunks of memory are executed between minor collections.

However, I could not find any GC log parser that took advantage of iCMS duty cycle information. In particular, how can we instantly see that something is wrong with our GC tuning ?

Incremental CMS reports

iCMS reports can be seen in ParNew GC logs, in the "icms_dc=..." part of the message. For instance, take the following messages :

18.862: [GC 18.862: [ParNew: 136320K->15040K(153344K), 0.0167820 secs] 158326K->37046K(2080128K), 0.0168440 secs] [Times: user=0.10 sys=0.00, real=0.02 secs]
21.028: [GC 21.028: [ParNew: 151360K->15263K(153344K), 0.0458710 secs] 173366K->48678K(2080128K) icms_dc=5 , 0.0459650 secs] [Times: user=0.22 sys=0.00, real=0.05 secs]


This log shows two minor collections, which occurred at 18.862s and 21.028s, and information about an iCMS collection. This collection ran between the two ParNew collections and took 5% of the time between them. So, let's do some math.

21.028 - 18.862 = 2.166 secs
5% of 2.166 secs = 0.1083 secs


=> The iCMS collection took 108.3ms to complete, in the time frame of 2.166 seconds between the two ParNew collections. This collection took place concurrently, so our application only suffered a small overhead.

Some bad values

These log messages describe a normal situation; however, what would you think about these ?

928404.003: [GC 928404.003: [ParNew: 140485K->17024K(153344K), 0.0724360 secs] 1854871K->1745355K(2080128K) icms_dc=100 , 0.0726830 secs] [Times: user=0.33 sys=0.00, real=0.07 secs]
928405.145: [GC 928405.145: [ParNew: 153344K->13475K(153344K), 0.0365750 secs] 1786382K->1660879K(2080128K) icms_dc=100 , 0.0368000 secs] [Times: user=0.18 sys=0.01, real=0.04 secs]


Yes, you read that right : iCMS duty cycles ran 100% of the time between ParNew collections.

With ParNew being a stop-the-world collector, we can conclude that at 928404.003, a ParNew collection stopped the application for 0.07 seconds, after which an incremental CMS collection ran concurrently during the whole time the application was working, until it was stopped again at 928405.145 by a new ParNew collection.

So basically, during that time frame, our application was either paused or slowed by an iCMS collection.

Now, the big question is : if we have a 3-day GC log, how can we instantly know that we have a problem linked to iCMS duty cycles ?

Cumulative frequency of iCMS

My idea was to create a graph of the cumulative frequency of the icms_dc ratios observed during the execution of the application.

The graph has the following characteristics :
  • X axis is the time spent in iCMS duty cycles, in percent
  • Y axis is the cumulative frequency of such iCMS duty cycles
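The data behind such a graph is easy to extract. As an illustration, here is a small Java sketch (not the actual scripts linked in the conclusion, just a hedged re-implementation of the idea) that collects the icms_dc values from a GC log and prints their cumulative frequency :

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only : collect every icms_dc value of a GC log (passed as first
// argument) and print the cumulative frequency of the duty cycle percentages.
public class IcmsDutyCycles {
    private static final Pattern ICMS_DC = Pattern.compile("icms_dc=(\\d+)");

    public static void main(String[] args) throws IOException {
        List<Integer> dutyCycles = new ArrayList<Integer>();
        for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            Matcher matcher = ICMS_DC.matcher(line);
            if (matcher.find()) {
                dutyCycles.add(Integer.parseInt(matcher.group(1)));
            }
        }
        if (dutyCycles.isEmpty()) {
            System.out.println("No icms_dc entry found, is incremental CMS enabled ?");
            return;
        }
        Collections.sort(dutyCycles);

        // Cumulative frequency : how many duty cycles took at most X% of the
        // time available between two minor collections.
        for (int percent = 0; percent <= 100; percent += 5) {
            int below = 0;
            for (int dutyCycle : dutyCycles) {
                if (dutyCycle <= percent) below++;
            }
            System.out.printf("icms_dc <= %3d%% : %5.1f%% of the duty cycles%n",
                    percent, 100.0 * below / dutyCycles.size());
        }
    }
}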


My goal is to have the lowest possible iCMS duty cycles, so let's say that I want 90% of my iCMS collections to take less than 10% of the time between two minor collections.

With such requirements, I can say that this GC log has no problem regarding iCMS :



However, this application does. It suffers from long iCMS duty cycles and some measures should be taken to reduce the pressure on the Garbage Collector. We can see that only 80% of iCMS cycles take less than 20% of the time available, which is not good. Even worse, look at the 90th, 95th and 99th percentiles. iCMS is definitely working too much and causes too much overhead on the application.



Finally, I can merge my graphs to make sure that my tuning is really improving the application's Garbage Collections.



Conclusion

The scripts I created to generate those charts are available on GitHub under the GPL licence. You can find them here : https://github.com/pingtimeout/icms-cumulative-frequency. I hope they will be useful :-)

Do you have other diagnostic techniques ? Some ideas to share ? Let's talk about it in the comments !

lundi 1 avril 2013

Devoxx France recap - NoSSII BOF

Last week, I attended the NoSSII BOF at Devoxx France 2013. A very interesting un-conference where the talk was not about languages, frameworks or tools. Here are my thoughts.

The companies present, and the notion of values

Three companies were represented on the BOF panel : Lateral Thoughts, Ninja Squad and Scopyleft.

Their founder-employee-geek members explained the reasons that led them to team up, as well as the ones that pushed them towards one legal status or another.

The point that struck me the most is the one about values. Each company was founded on a definition of the values that will drive it throughout its existence. The main objective of these companies is to unite developers around a common goal, and that goal cannot be financial profitability since, let's be honest, at one point or another a better situation will always come along (for example, Google poaches you to go work in SF with a six-figure salary).

What binds these companies together is their values, because that is the only thing that will keep a team together when times get hard.

NoSSII = Not Only SSII

A good part of the discussion (the first half-hour) revolved around the definition of an SSII (a French IT services company). Several definitions compete.

An SSII is an entity that works either with a commitment to results (fixed-price projects) or with a commitment to means (time-and-materials projects, on-site consulting). The software industry is similar to every other one, in that the question is always whether to build (develop software in-house with your own employees) or to have it built (outsource the development, call in contractors). From these two points, we can conclude that freelancers are all one-person SSIIs. The three companies present are also SSIIs, just like the usual Steria, Capgemini & co.

But that is not the only valid definition.

An SSII is a company that places resources on projects (fixed-price or time-and-materials) and whose direction is defined by a boss and then relayed by every level of the group's hierarchy. That is what I take away from my experience at Steria, and it is also what the audience shared. An SSII does not try to go further than placing consultants. Consulting is the end goal.

At Lateral Thoughts, Ninja Squad and Scopyleft, on the other hand, consulting is a means. A means to get, for example, to build a piece of software and make a living out of it. This new end goal is decided by all the co-managers, collegially. In that sense, these companies are not SSIIs.

So are they SSIIs ? Are they not ? Are they a little, but not completely ? That is precisely where the essence of the term NoSSII lies : yes, they are SSIIs, but not only that (Not Only).

Ok, but does it work ? And does it scale ?

LT, NS and Scopyleft are still relatively young (one and a half years, one year and three months old, respectively) but the model looks sustainable. They are small (between 3 and 8 people), so one can wonder whether they will be able to grow without losing their values.

EDIT : As Scopyleft says on Twitter, growth is not necessarily the goal of every company.

There are relatively few examples of cooperative companies that work on this model and have a large number of partners. I did not write down the name of the one that was cited as an example, but it was a Scop (a French worker cooperative) with several hundred employees. According to Wikipedia, the average size of a Scop is 21 employees, which leaves plenty of room for growth :-)

A matter of risk

What I take away most from this session is related to the risk you take.

When you are an employee of a traditional SSII, the risk you take is very low : you have a permanent contract, a guaranteed salary, health insurance, all of that without having to worry about anything. In short, you are comfortable. The salary, on the other hand, can be a bit disappointing when you see the rates charged by the profession.

If, on the other hand, you are a freelancer, the risk is higher, because from one month to the next your contract may not be renewed. That risk is still limited to you alone : if you fail, you fail alone, and you can always go back to a traditional SSII. Salary-wise, it is higher.

Now, if you are part of a cooperative company, the decisions you take as a co-manager will impact the lives of your partners. The risk is therefore as big as when you are a freelancer, but with an impact on other people. Enough to give you cold sweats.

A great responsibility

When you work for a classic SSII, the life of your company does not really concern you, since you have no way to weigh in on its decisions. You let yourself be carried along, and you will see where it takes you.

As a freelancer or in a cooperative company, you *decide*, and that is a big change. You are an actor in the life of your company, you cannot simply let yourself be carried along, you bear the responsibility for your decisions, you own your work tool and you choose what you want to do with it.

It can be scary, because it requires courage and strength. Committing to a cooperative company requires as much of it as going freelance. Being in a NoSSII means being independent, but as a team. A fine lesson in courage that gives food for thought.

mercredi 27 mars 2013

Code Kata : FizzNabillaBuzz

Fizz Nabilla Buzz

This exercise is a variation on the FizzBuzz code kata, Nabilla edition.

The original FizzBuzz code kata is available here : http://codingdojo.org/cgi-bin/wiki.pl?KataFizzBuzz

Rules

When a number is divisible by 3, print "Nan mais allô quoi !".
When a number is divisible by 4, print "Vous me recevez ?".
When a number is divisible by 5, print "T'es une fille et t'as pas de shampoing !".
If a number is divisible neither by 3, nor by 4, nor by 5, print "Allô..."
When a number is divisible by several of these, print the corresponding messages one after the other, as in the example below.

Example

Here is the output of FizzNabillaBuzz for the numbers from 1 to 15.

1 -> Allô...
2 -> Allô...
3 -> Nan mais allô quoi !
4 -> Vous me recevez ?
5 -> T'es une fille et t'as pas de shampoing !
6 -> Nan mais allô quoi !
7 -> Allô...
8 -> Vous me recevez ?
9 -> Nan mais allô quoi !
10 -> T'es une fille et t'as pas de shampoing !
11 -> Allô...
12 -> Nan mais allô quoi ! Vous me recevez ?
13 -> Allô...
14 -> Allô...
15 -> Nan mais allô quoi ! T'es une fille et t'as pas de shampoing !


Your turn to play ! You can even invent your own variations !

Any ideas ? Let's discuss them in the comments !

samedi 16 mars 2013

PetClinic Performance Tuning - About Ippon audit

Last week, Ippon Technologies published several articles about Java Performance Tuning. Those articles were interesting, but not completely accurate. In this post, I will give my own conclusions and remarks.

But first, a disclaimer : dear Ippon readers, what you did represents a huge amount of work, and I don't want to criticise that. I like what you did, it is a very good job. However, I would have appreciated some more information, and I will try to describe it here.

1 - The "Prove it !" mindset

I often repeat to anyone willing to hear it that they must prove what they think before taking any action. I call this the "Prove It !" mindset, but it is also well known as the "Diagnose before Cure !" or "Measure, Don't Guess !" principles.

I tend to be unhappy when solutions are applied without any measurement showing that they would do any good. If someone says that he/she magically has the solution to a problem that has not been measured, it is the same as taking medicine without even knowing the kind of disease you have : it may work, or not, or even do nothing.

The more one proves that his/her decision is the right one, the better knowledge is spread. If no-one understands why you did something, discussions like the following will eventually happen : "Oh, that thing was made by an expert, DO NOT TOUCH IT ! If it works, don't fix it !"

2 - Goals

Performance audits are like recursive functions : if you don't have any stop condition, you could go on forever.

When we write benchmarks, we want to answer a very precise question, like "Can my application handle 1200 requests per second ?". When you have your first answer (usually "No", otherwise no-one would have called you in the first place), you can work to reach that goal.

Note : one goal can include "... and I cannot have more than 32 GB of RAM on my server, which I cannot duplicate".

However, there was no precise goal in Ippon's articles. They were more a tooling showcase than a real-life situation. So, to be completely exhaustive : you *need* to have goals, otherwise you won't ever make your client smile.

3 - Testing conditions

In the first article, we can see that the default configuration (-Xms128m) did produce an OutOfMemoryError. The conclusion was the following :

"The first results are as follow: the application quickly goes up to 285 requests per second (req/sec), then slows down and throws “java.lang.OutOfMemoryError: Java heap space” errors after nearly 700 users."

The problem with this sentence is that what we did was not a benchmark per se. A benchmark has to give stable results over time. Here, we just cannot say that the application handles 285 requests per second.

What we saw here was that the application, put under a heavy load, was able to cope with 285 requests per second at best. But since the application crashed, we cannot draw any conclusion from this. The test environment is just not ready. In the last article, the 1GB heap result is taken as a starting point, which shows that this 285req/s was not a benchmark result.

I think what Julien Dubois meant, in the first article, was that without a proper environment, tests could not be made seriously. As a general rule, we want a stable testing environment that does not crash even if you run a test for several hours.

So this first article was related to performance tuning, but was not about optimisation. It was only about setting up the environment, which is a prerequisite for any rigorous benchmark.

Finally, the conclusion mentions "548 req/sec and 0,2% of HTTP errors". Does this mean that at some point, the application still crashes with an OOME ? If that is the case, we can prove that the test environment is not good enough by increasing the loop count of the thread group in JMeter. And if that happens, then the environment is still not ready.

4 - The right fix

I did not understand why the second article started with a focus on the memory consumed by the Dandelion library.

In the first screenshot, we can see that Dandelion's objects represent about ~130MB of the total heap, which has been set to 1GB, so we focus on a consumer of 10% of the heap. The questions are : do we really have a memory problem ? How did we prove it ? Are those 0.20% HTTP errors caused by a bad memory consumption pattern ?

Long story short : upgrading a library and refactoring your code to go stateless makes you handle fewer requests per second (532req/s instead of 548req/s) with more HTTP errors (0.70% instead of 0.20%).

In that case, I tend to think that these actions were not the right fix. The causes of the HTTP errors have never been clearly defined, so we might end up with new errors, who knows ?

I would have liked more justification for the actions that were taken. Did we have a memory consumption problem ? How did we prove it ? How did we prove that the actions taken were actually solutions, and not problems we were adding to the code ?

5 - Audit conclusion

The conclusion of the last article mentions that going back to -Xmx128m produces about the same result, which is ~1200req/s.

"At the beginning of the tests, we had to increase our heap memory size to 1 Gb, and could only serve 548 req/sec, with some HTTP errors. After completing our audit, we are now back to 128 M, and can serve 1225 req/sec with no error at all. We expect those results to be even better on a real server, with many cores and threads, and where removing JVM locks will have a more significant impact."

I could not reproduce that.

On my machine, decreasing the heap to 128 MB makes the result go down to ~180req/s, which is just not acceptable. The following screenshot shows the memory consumption of the application : the old generation is completely filled, resulting in a huge number of full GCs (~2500) that represent 15 minutes of waiting for the GC to finish.



Is this memory reduction thing a typo ? How could an application behave identically with less memory if its liveset does not even fit in the heap ? I would like to know more about it.

6 - More testing !

I checked out the source (revision fa1e1a8f86c574b765d65b26fd858e0d28ae81fc, 2013-03-15 09:36:26) and ran my own tests. Here are some more conclusions.

Note : I do not have the same testing environment. I chose to reintroduce HSQLDB and slightly modified the JMeter test so that 10 visits per pet per owner are added before the test takes place. This is because the HSQLDB is flushed at each server restart. The req/s results will be different, since I do not run the tests on the same hardware, but that is not the point here.

With 1 GB, we can see that we are now down to 3 full GCs (total pause time ~2s), which is pretty good. My test finished with an average of 600req/s.



But can we achieve the same performance if we reduce the size of the heap ? The answer is yes. By tuning the different memory pools, I managed to get the same number of requests per second with the following parameters : -Xmx512m -XX:NewRatio=1 -XX:SurvivorRatio=14

Even more tests, for science. What about switching to ParNew and CMS ? By switching to CMS, we can expect shorter pauses, since only ParNew pauses (~20ms) would stop the application. However, this comes at a price : a decrease in the overall performance, since CMS implies much more overhead than other GCs. Well, this is easily verified : by adding CMS (-Xmx512m -XX:NewRatio=1 -XX:SurvivorRatio=14 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC), I got an average of ~537req/s, with 49 CMS cycles.

Finally, considering the memory consumption profile of the application, we can set the CMS starting threshold to a higher value. This way, we will get less concurrent execution of CMS, meaning more CPU cycles for our application. By adding -XX:CMSInitiatingPermOccupancyFraction=70, we end up with ~599req/s and only 1 CMS execution.

Wrap Up

Even if I would have preferred many more details about the choices that were made, the performance audit is pretty cool :)

I hope you enjoyed having more information and more test cases on petclinic. Do not hesitate to try any configuration that comes to mind, as long as you measure whether it improves the situation, and as long as you are not afraid to revert your changes if you cannot prove any improvement. This is one of the most entertaining parts of working with the JVM.

The default GC causes long STW pauses, and that could have become a problem if the heap was set to more than 8 GB. The switch to CMS comes at a price, but with enough testing and tuning, we can manage to have identical performance.

Any idea how to improve these articles ? Some war stories that you want to share ? Let's talk about this in the comments !

samedi 23 février 2013

Random thoughts - Space vs. Time

Last week, Olivier Croisier resurrected an old coding challenge on his blog, the Coder's Breakfast. As geeky as every challenge can be, I immediately started to think about potential solutions. This led me to an amusing discussion with a Scala guy. This post is a follow-up to that discussion.

Problem

The discussion was like :
Him : "Pff, it's easy, do a Stream"
Me : "Err, yeah but you would have to generate all numbers from 0 to max"
Him : "Of course not ! You filter ! I can do it in one line right now"

I was quite surprised by his arguments, and wanted to explain that I was not looking for brevity but for optimization. So I told him that if our discussion had been about Java, it would have gone like :
"Pff, easy, do an ArrayList
- [same answer]
- Not at all, just use Guava !"

My point was that a data structure cannot be the answer to a complex problem; we have to visualize the whole problem, not just the first implementation that comes to our minds.

This is commonly known as a Space/Time tradeoff (or a CPU/Memory tradeoff) : if you write a very small program, such as a one-liner, then chances are that this space gain will have a time cost (wasted CPU time). On the other hand, if you craft a program that does not waste any CPU cycle, then chances are that it will take more space (either in memory or in number of LoC).

Note that I know perfectly well that in Scala and Clojure, Streams and seqs are lazy, whereas ArrayList is not. It was just a troll.


Some algorithm

For me, this challenge is about "What you are going to do" instead of "How you are going to do it". Brute-force solutions are different in their implementation details, but follow the same algorithm :
- For each number n taken between [0; max]
--- If n fulfills a test, keep it

In those one-liner wars, we do not think about the algorithm, and we should feel bad about it, especially in a coding challenge. Such challenges are the only place where we can seek premature optimization, for the beauty of the exercise.

So let's go back to the roots of programming : algorithms.
The code we have to create can be split into the following components :
Generator -> Filter -> Aggregator
We know that the upper bound will be a power of 10, and won't be above 10^10.


First implementation

The most basic implementation does this :
- Generate all numbers in [1; max] that fulfill a given test
- Iterate over every element starting at index 1 and compute the difference between the element and the previous one
- Keep the highest difference
- Count the number of results

Considering we have numbers and solutions, our algorithm would generate numbers in O(max).


Second implementation

But we can achieve a better solution : let's generate every number by picking digits in a list of "allowed digits".

To calculate every non-repeating-digit number of length x,
- For each digit d taken in [1;9]
--- Find all numbers of length x that start with d and that are composed of digits from [0; d[ and ]d; 9]
To find all numbers of length x that start with d and that are composed of someDigits
- For each digit i taken in someDigits
--- Add d at position x to all numbers of length x-1 that start with i and that are composed of someDigits without i

The main goal of this algorithm is to have no "Filter" component. We know the range of acceptable digits perfectly, so we can just generate every combination by restricting this range to a smaller subset. By using recursion, the program may be harder to reason about, but it is also simpler to test and faster to run.
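To make the recursion a bit more tangible, here is a minimal, hedged sketch of the idea (deliberately simplified, limited to 4-digit numbers, and not the repository code linked below) :

import java.util.ArrayList;
import java.util.List;

// Minimal sketch : generate every number whose digits never repeat, by extending
// prefixes with digits that have not been used yet. No "Filter" component needed.
public class NonRepeatingDigits {
    public static void main(String[] args) {
        List<Long> numbers = new ArrayList<Long>();
        for (int d = 1; d <= 9; d++) {           // no leading zero
            generate(d, 1 << d, 4, numbers);     // numbers of at most 4 digits
        }
        System.out.println(numbers.size() + " numbers generated");
    }

    // prefix : the number built so far; usedDigits : bitmask of digits already used
    private static void generate(long prefix, int usedDigits, int remainingLength, List<Long> out) {
        out.add(prefix);
        if (remainingLength == 1) return;
        for (int d = 0; d <= 9; d++) {
            if ((usedDigits & (1 << d)) == 0) {
                generate(prefix * 10 + d, usedDigits | (1 << d), remainingLength - 1, out);
            }
        }
    }
}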

A Java implementation of this algorithm is available here : https://github.com/pingtimeout/challenges/tree/master/non-repeating-numbers-java

A Clojure one should follow in a few days.


Wrap-Up

My Java implementation is way longer than a Scala one-liner; however, it is also faster from an algorithmic standpoint.

However, I do not mean that it is a better implementation.

In a real project, I would probably prefer writing a simple (inefficient) solution in 10 minutes to a very optimised one that would take hours or even days. But keep this Space vs. Time tradeoff in mind. There is no such thing as a perfect solution, there are only tradeoffs, and we must know them to make better decisions.

Note that I won't write here about the wall-clock standpoint of those solutions since it would make me create microbenchmarks, which is something one should not do.

Final thoughts : Thanks Olivier for this challenge ! It was a fun exercise =]