Saturday, March 16, 2013

PetClinic Performance Tuning - About Ippon audit

Last week, Ippon Technologies published several articles about Java performance tuning. Those articles were interesting, but not completely accurate. In this post, I will give my own conclusions and remarks.

But first, a disclaimer: dear Ippon authors, what you did represents a huge amount of work, and I don't want to criticise that. I like what you did; it is a very good job. However, I would have appreciated some more information, and I will try to describe what is missing here.

1 - The "Prove It!" mindset

I often repeat to anyone willing to hear it that they must prove what they think before taking any action. I call this the "Prove It!" mindset, but it is also well known as the "Diagnose before Cure!" or "Measure, Don't Guess!" principles.

I tend to be unhappy when solutions are applied without any measurement showing that they would do any good. If someone claims to magically have the solution to a problem that has not been measured, it is like taking medicine without even knowing which disease you have: it may work, it may not, or it may do nothing at all.

The more one proves that his/her decisions are the right ones, the better knowledge is spread. If no-one understands why you did something, discussions like the following will eventually happen: "Oh, that thing was made by an expert, DO NOT TOUCH IT! If it works, don't fix it!"

2 - Goals

Performance audits are like recursive functions: if you don't have a stop condition, you could go on forever.

When we write benchmarks, we want to answer a very precise question, like "Can my application handle 1200 requests per second?". Once you have your first answer (usually "No", otherwise no-one would have called you in the first place), you can work towards that goal.

Note: a goal can include constraints such as "... and I cannot have more than 32 GB of RAM on my server, which I cannot duplicate".
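To make the "precise question" idea concrete, such a goal can even be written down as an executable check. Here is a minimal sketch, where the 1200 req/s target and the measured figures are illustrative assumptions, not numbers from the audit:

```java
public class PerfGoal {
    public static void main(String[] args) {
        // The precise question: "Can my application handle 1200 requests per second?"
        double targetReqPerSec = 1200.0;
        double maxErrorRate = 0.0;        // no HTTP errors tolerated

        // Hypothetical measurements from a load-test run
        double measuredReqPerSec = 548.0;
        double measuredErrorRate = 0.002; // 0.2% of requests failed

        boolean goalMet = measuredReqPerSec >= targetReqPerSec
                && measuredErrorRate <= maxErrorRate;
        System.out.println("Goal met: " + goalMet); // prints: Goal met: false
    }
}
```

Having the stop condition in executable form makes it obvious when the audit is finished.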

However, there was no precise goal in Ippon's articles. They were more a tooling showcase than a real-life situation. So, to be completely clear: you *need* to have goals, otherwise you will never make your client smile.

3 - Testing conditions

In the first article, we can see that the default configuration (-Xms128m) produced an OutOfMemoryError. The conclusion was the following:

"The first results are as follow: the application quickly goes up to 285 requests per second (req/sec), then slows down and throws “java.lang.OutOfMemoryError: Java heap space” errors after nearly 700 users."

The problem with this sentence is that what was done was not a benchmark per se. A benchmark has to give stable results over time. Here, we simply cannot say that the application handles 285 requests per second.

What we saw here was that the application, put under a heavy load, was able to cope with 285 requests per second at best. But since the application crashed, we cannot draw any conclusion from this. The test environment is just not ready. In the last article, the 1 GB heap result is taken as a starting point, which shows that this 285 req/s figure was not a benchmark result.

I think what Julien Dubois meant, in the first article, was that without a proper environment, serious tests could not be made. As a general rule, we want a stable testing environment that does not crash even if a test runs for several hours.

So this first article was related to performance tuning, but was not about optimisation. It was only about setting up the environment, which is a prerequisite for any rigorous benchmark.

Finally, the conclusion mentions "548 req/sec and 0,2% of HTTP errors". Does this mean that at some point, the application still crashes with an OutOfMemoryError? If so, we can prove that the test environment is not good enough by increasing the loop count of the thread group in JMeter, which would mean the environment is still not ready.

4 - The right fix

I did not understand why the second article started with a focus on the memory consumed by the Dandelion library.

In the first screenshot, we can see that Dandelion's objects represent about 130 MB of the total heap, which has been set to 1 GB, so we are focusing on a consumer of roughly 13% of the heap. The questions are: do we really have a memory problem? How did we prove it? Are those 0.20% HTTP errors caused by a bad memory consumption pattern?

Long story short: upgrading a library and refactoring the code to go stateless made the application handle fewer requests per second (532 req/s instead of 548 req/s) with more HTTP errors (0.70% instead of 0.20%).

In that case, I tend to think that these actions were not the right fix. The causes of the HTTP errors were never clearly identified, so we might end up with new errors, who knows?

I would have liked more justification for the actions that were taken. Did we have a memory consumption problem? How did we prove it? How did we prove that the actions taken were actually solutions, and not new problems we were adding to the code?

5 - Audit conclusion

The conclusion of the last article mentions that going back to -Xmx128m produces about the same result, ~1225 req/s:

"At the beginning of the tests, we had to increase our heap memory size to 1 Gb, and could only serve 548 req/sec, with some HTTP errors. After completing our audit, we are now back to 128 M, and can serve 1225 req/sec with no error at all. We expect those results to be even better on a real server, with many cores and threads, and where removing JVM locks will have a more significant impact."

I could not reproduce that.

On my machine, decreasing the heap to 128 MB makes the throughput drop to ~180 req/s, which is just not acceptable. The following screenshot shows the memory consumption of the application: the old generation is completely filled, resulting in a huge number of full GCs (~2500) that represent 15 minutes of waiting for the GC to finish.
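A quick back-of-the-envelope computation, using only the figures observed above (~2500 full GCs adding up to roughly 15 minutes of pause time), gives the average cost of a single collection:

```java
public class GcPauseMath {
    public static void main(String[] args) {
        int fullGcCount = 2500;                  // full GCs observed with a 128 MB heap
        double totalPauseMs = 15 * 60 * 1000.0;  // ~15 minutes spent waiting for the GC

        double avgPauseMs = totalPauseMs / fullGcCount;
        // prints: Average full GC pause: 360 ms
        System.out.printf("Average full GC pause: %.0f ms%n", avgPauseMs);
    }
}
```

With the application frozen for more than a third of a second on every collection, the ~180 req/s result is no surprise.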

Is this memory reduction a typo? How could an application behave identically with less memory if its live set does not even fit in the heap? I would like to know more about it.

6 - More testing !

I checked out the source (revision fa1e1a8f86c574b765d65b26fd858e0d28ae81fc, 2013-03-15 09:36:26) and ran my own tests. Here are some more conclusions.

Note: I do not have the same testing environment. I chose to reintroduce HSQLDB and slightly modified the JMeter test plan so that 10 visits per pet per owner are added before the test takes place, because the HSQLDB database is flushed at each server restart. The req/s results will be different, since I do not run the tests on the same hardware, but that is not the point here.

With 1 GB of heap, we can see that we are now down to 3 full GCs (total pause time ~2 s), which is pretty good. My test finished with an average of 600 req/s.

But can we achieve the same performance if we reduce the size of the heap? The answer is yes. By tuning the different memory pools, I managed to get the same number of requests per second with the following parameters: -Xmx512m -XX:NewRatio=1 -XX:SurvivorRatio=14
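As a sketch of what those flags mean in HotSpot (NewRatio is the old/young generation ratio, SurvivorRatio is the eden/survivor ratio; the JVM may align the actual sizes slightly differently):

```java
public class HeapLayout {
    public static void main(String[] args) {
        int heapMb = 512;       // -Xmx512m
        int newRatio = 1;       // -XX:NewRatio=1       -> old gen / young gen = 1
        int survivorRatio = 14; // -XX:SurvivorRatio=14 -> eden / one survivor = 14

        int youngMb = heapMb / (newRatio + 1);          // 256 MB young generation
        int oldMb = heapMb - youngMb;                   // 256 MB old generation
        int survivorMb = youngMb / (survivorRatio + 2); // 16 MB per survivor space
        int edenMb = youngMb - 2 * survivorMb;          // 224 MB eden

        // prints: young=256MB old=256MB eden=224MB survivor=16MB
        System.out.printf("young=%dMB old=%dMB eden=%dMB survivor=%dMB%n",
                youngMb, oldMb, edenMb, survivorMb);
    }
}
```

A large eden lets short-lived request objects die young instead of being promoted to the old generation, which is exactly what a stateless web application wants.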

Even more tests, for science. What about switching to ParNew and CMS? With CMS, we can expect less significant pauses, since only the ParNew pauses (~20 ms) would stop the application. However, this comes at a price: a decrease in overall throughput, since CMS implies much more overhead than the other collectors. This is easily verified: by adding CMS (-Xmx512m -XX:NewRatio=1 -XX:SurvivorRatio=14 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC), I got an average of ~537 req/s, with 49 CMS cycles.

Finally, considering the memory consumption profile of the application, we can set the CMS initiating threshold to a higher value. This way, we get fewer concurrent CMS executions, meaning more CPU cycles for our application. By adding -XX:CMSInitiatingPermOccupancyFraction=70, we end up with ~599 req/s and only 1 CMS execution.
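Comparing the three runs above (600 req/s with the parallel collector, 537 req/s with CMS at the default threshold, 599 req/s with the raised threshold), the CMS overhead can be expressed as a percentage:

```java
public class GcOverhead {
    public static void main(String[] args) {
        double parallel = 600.0; // req/s, parallel collector, -Xmx512m
        double cms = 537.0;      // req/s, CMS, default threshold (49 CMS cycles)
        double cmsTuned = 599.0; // req/s, CMS, raised threshold (1 CMS cycle)

        // prints: CMS overhead: 10.5%
        System.out.printf("CMS overhead: %.1f%%%n",
                (parallel - cms) / parallel * 100);
        // prints: Tuned CMS overhead: 0.2%
        System.out.printf("Tuned CMS overhead: %.1f%%%n",
                (parallel - cmsTuned) / parallel * 100);
    }
}
```

In other words, tuning the threshold reclaimed almost all of the throughput that CMS cost us, while keeping its shorter pauses.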

Wrap Up

Even though I would have preferred many more details about the choices that were made, this performance audit is pretty cool :)

I hope you enjoyed having more information and more test cases on PetClinic. Do not hesitate to try any configuration that comes to mind, as long as you measure whether it improves the situation, and as long as you are not afraid to revert your changes if you cannot prove any improvement. This is one of the most entertaining parts of working with the JVM.

The default GC causes long stop-the-world pauses, which could have become a problem if the heap had been set to more than 8 GB. The switch to CMS comes at a price, but with enough testing and tuning, we can achieve identical performance.

Any ideas on how to improve these articles? Any war stories you want to share? Let's talk about it in the comments!
