Friday, April 12, 2013

Extracting iCMS data from garbage collector logs

Recently, I had to inspect some GC logs produced by a recent HotSpot JVM using ParNew, CMS and iCMS. I could not find any tool that reported information on iCMS duty cycles, so I wrote one that lets me spot iCMS problems in a matter of seconds.

The problem

The Concurrent Mark and Sweep (CMS) collector is almost always used on high-traffic websites that cannot tolerate long pauses. Its incremental mode is often used on servers with limited hardware, so that small CMS collections on small chunks of memory are executed between minor collections.

However, I could not find any GC log parser that takes advantage of iCMS duty cycle information. In particular, how can we instantly see that something is wrong with our GC tuning?

Incremental CMS reports

iCMS reports can be seen in the ParNew GC log, in the "icms_dc=..." part of the message. For instance, take the following messages:

18.862: [GC 18.862: [ParNew: 136320K->15040K(153344K), 0.0167820 secs] 158326K->37046K(2080128K), 0.0168440 secs] [Times: user=0.10 sys=0.00, real=0.02 secs]
21.028: [GC 21.028: [ParNew: 151360K->15263K(153344K), 0.0458710 secs] 173366K->48678K(2080128K) icms_dc=5 , 0.0459650 secs] [Times: user=0.22 sys=0.00, real=0.05 secs]
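
For illustration, here is a minimal Python sketch of how such lines could be parsed. It assumes logs produced with -XX:+PrintGCTimeStamps (a timestamp in seconds at the start of each line); the regexes and function name are mine, not part of any existing tool.

import re

# Timestamp of the ParNew event, in seconds since JVM start.
TIMESTAMP = re.compile(r'^(\d+\.\d+): \[GC ')
# Optional iCMS duty cycle, in percent.
DUTY_CYCLE = re.compile(r'icms_dc=(\d+)')

def parse_line(line):
    # Returns (timestamp, icms_dc or None) for a ParNew line, None otherwise.
    ts = TIMESTAMP.match(line)
    if ts is None:
        return None
    dc = DUTY_CYCLE.search(line)
    return float(ts.group(1)), (int(dc.group(1)) if dc else None)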


This log shows two minor collections that occurred at 18.862s and 21.028s, plus information about an iCMS collection. This collection ran between the two ParNew collections and took 5% of the time between them. So, let's do some math.

21.028 - 18.862 = 2.166 secs
5% of 2.166 secs = 0.1083 secs


=> The iCMS collection took 108.3ms to complete, in the time frame of 2.166 seconds between the two ParNew collections. This collection took place concurrently, so our application only suffered from a small overhead.
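
The same computation as a tiny Python sketch (the function name is mine; timestamps are in seconds, icms_dc is a percentage):

def icms_duration(previous_ts, current_ts, icms_dc):
    # Time, in seconds, spent in the iCMS duty cycle between two ParNew collections.
    return (current_ts - previous_ts) * icms_dc / 100.0

print(icms_duration(18.862, 21.028, 5))  # ~0.1083 secs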

Some bad values

These log messages described a normal situation. However, what would you think about these?

928404.003: [GC 928404.003: [ParNew: 140485K->17024K(153344K), 0.0724360 secs] 1854871K->1745355K(2080128K) icms_dc=100 , 0.0726830 secs] [Times: user=0.33 sys=0.00, real=0.07 secs]
928405.145: [GC 928405.145: [ParNew: 153344K->13475K(153344K), 0.0365750 secs] 1786382K->1660879K(2080128K) icms_dc=100 , 0.0368000 secs] [Times: user=0.18 sys=0.01, real=0.04 secs]


Yes, you read that right: the iCMS duty cycles ran 100% of the time between ParNew collections.

Since ParNew is a stop-the-world collector, we can conclude that at 928404.003, a ParNew collection stopped the application for 0.07 seconds, after which an incremental CMS collection ran concurrently during the whole time the application was working, until it was stopped again at 928405.145 by a new ParNew collection.

So basically, during that time frame, our application was either paused or slowed by an iCMS collection.

Now, the big question is: if we have a 3-day GC log, how can we instantly know that we have a problem linked to iCMS duty cycles?

Cumulative frequency of iCMS

My idea was to create a graph showing the cumulative frequency of the icms_dc ratios observed during the execution of the application.

The graph has the following characteristics:
  • X axis is the time spent in iCMS duty cycles, in percent
  • Y axis is the cumulative frequency of such iCMS duty cycles


My goal is to have the lowest possible iCMS duty cycles, so let's say that I want 90% of my iCMS collections to take less than 10% of the time between two minor collections.
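
For instance, the cumulative frequency and the percentiles I care about can be computed with Python and numpy (a sketch with my own naming, not the actual scripts; the sample duty cycles below are made up):

import numpy as np

def cumulative_frequency(duty_cycles):
    # y[i] is the fraction of duty cycles that are <= x[i].
    x = np.sort(np.asarray(duty_cycles, dtype=float))
    y = np.arange(1, len(x) + 1) / float(len(x))
    return x, y

duty_cycles = [2, 3, 5, 5, 8, 9, 12, 15, 40, 100]  # made-up icms_dc values
x, y = cumulative_frequency(duty_cycles)
print("90th percentile:", np.percentile(duty_cycles, 90))
print("share of cycles <= 10%:", np.mean(np.asarray(duty_cycles) <= 10.0))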

With such requirements, I can say that this GC log has no problem regarding iCMS:

[Chart: cumulative frequency of iCMS duty cycles for a healthy GC log]

However, this application does. It suffers from long iCMS duty cycles, and measures should be taken to reduce the pressure on the garbage collector. We can see that only 80% of iCMS cycles take less than 20% of the available time, which is not good. Even worse, look at the 90th, 95th and 99th percentiles. iCMS is definitely working too much and causes too much overhead on the application.

[Chart: cumulative frequency of iCMS duty cycles for an application under GC pressure]

Finally, I can merge my graphs to make sure that my tuning is really improving the application's garbage collections.
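
A short matplotlib sketch (again hypothetical, with made-up data) is enough to overlay the curves of two GC logs on one chart:

import numpy as np
import matplotlib.pyplot as plt

def plot_cumulative(duty_cycles, label):
    # Cumulative frequency of icms_dc values for one GC log.
    x = np.sort(np.asarray(duty_cycles, dtype=float))
    y = np.arange(1, len(x) + 1) / float(len(x))
    plt.step(x, y, where='post', label=label)

plot_cumulative([5, 10, 20, 40, 80, 100, 100], label='before tuning')  # made-up values
plot_cumulative([1, 2, 3, 5, 5, 8, 10], label='after tuning')          # made-up values

plt.xlabel('iCMS duty cycle (% of time between minor GCs)')
plt.ylabel('Cumulative frequency')
plt.legend(loc='lower right')
plt.show()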

[Chart: merged cumulative frequency curves comparing the two GC logs]

Conclusion

The scripts I created to generate those charts are available on GitHub under the GPL licence. You can find them here: https://github.com/pingtimeout/icms-cumulative-frequency. I hope they will be useful :-)

Do you have other diagnostic techniques? Some ideas to share? Let's talk about it in the comments!
