Pseudo-random values

Regardless of the technology used to write benchmarks, generating random data is necessary. In this article, I will discuss using fixed seeds as a best practice in performance tests.

I will specifically focus on load testing tools that run on a JVM: typically JMeter, Gatling, The Grinder, or any in-house tool.

Example

First, let’s start with an example.

Consider an e-commerce website. Users must be able to sign up. To do so, they have to fill in a form with several pieces of information (contact and shipping info, billing info, credentials, …). They can then order products, following the usual buyer’s journey.

We can create a benchmark for that. We will need to generate lots of users. We will then reuse them to simulate load on the e-commerce website.

(Figure: three generated users)

Problem: how to reuse existing users?

Alright, so we have created and populated an environment dedicated to performance tests. Data has been generated, and a large number of users are available.

How can we reuse their identifiers in our future benchmarks?

If the test data has been generated randomly, we first need to extract the identifiers from the database and store them in a file. Typically, a CSV file. And that’s where things become nasty.

On the one hand, if we read the file line by line during the benchmark, we end up with unwanted I/O, which can mess with the measurements. On the other hand, if we load the file upfront in memory, we end up with a lot more GC pressure. And if the response times we measure are low enough, the noise generated by either of these solutions can be significant.

The Gatling v2.3 documentation is very clear about that issue: if it cannot be avoided, it recommends reading the file on the fly. Since version 3.0, the load testing tool comes with a feature to facilitate this process.

Loading feeder files in memory uses a lot of heap, expect a 5-to-10-times ratio with the file size. This is due to JVM’s internal UTF-16 char encoding and object headers overhead. If memory is an issue for you, you might want to read from the filesystem on the fly and build your own Feeder.
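To make this concrete, here is a minimal sketch of what such an on-the-fly feeder could look like. It assumes a Gatling-style feeder, i.e. essentially an Iterator[Map[String, T]]; the file path and the "userId" column name are purely illustrative, not taken from a real project.

import scala.io.Source

// Rough sketch: stream the CSV lazily instead of loading it all in memory.
// The path and the "userId" key are assumptions made for this example.
def onTheFlyFeeder(path: String): Iterator[Map[String, String]] =
  Source.fromFile(path).getLines().map(line => Map("userId" -> line))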

Using a fixed seed

Alright, now suppose that the seed that was used to generate the random data was noted somewhere. This means that if we reuse the very same seed, we can re-generate the exact same random data, and we no longer need to extract that data into a file.

This can be simply illustrated by the code below. It is written in Scala but works just the same in Java. It uses instances of Random for which the seed is explicitly set. These instances will always produce the same values in the same order.

scala> val r1 = new Random(-229985452)
r1: scala.util.Random = scala.util.Random@7d528cf7

scala> val r2 = new Random(-147909649)
r2: scala.util.Random = scala.util.Random@6f6cc7da

scala> def randomWord(rand: Random) =
     |   Stream.continually(rand.nextInt(27)).map(x => ('`' + x).toChar).take(5).mkString
randomWord: (rand: scala.util.Random)String

scala> randomWord(r1) + " " + randomWord(r2)
res1: String = hello world

Translating these commands into Java is left as an exercise to the reader.

Pros

As previously suggested, this approach has a lot of advantages.

First, it is no longer necessary to extract data from an initialised database before new tests can be run. This simplifies the design and the execution of said tests.

It is also no longer necessary to store large CSV files to reuse data. This simplifies the operational aspect by significantly reducing the amount of required storage.

More importantly, it reduces the noise during the test. Since generating random values is a 100% CPU-bound activity, no disk I/O is ever performed because of random data generation. Only the data required at a given time is computed, so the amount of memory used at time t is the strict minimum for the currently in-flight requests. And since generated values are short-lived, the allocation profile respects the weak generational hypothesis: the load generation engine is "GC-friendlier".

Finally, suppose you want to verify the entire state of your database. Say that you have run a resilience test that involved voluntarily corrupting disks. You now want to verify that the data stored in your database is still correct and contains no corrupted record. The volume of data to verify can be colossal. If you are able to re-generate the same data from a simple seed, you don’t need any database export as the source of truth. That very source of truth is now the combo formed by the seed and the data generation algorithm.

In terms of test operations, this is a significant advantage, even more so if you want to verify a database like Apache Cassandra, distributed across multiple nodes. The data volume is proportional to the number of nodes and can easily reach hundreds of TB.
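As a rough illustration of that idea (and not of any actual tooling), a verification pass can boil down to re-generating every expected value from the seed and checking it against the database. In the sketch below, existsInDatabase is a hypothetical placeholder for the real driver call, and the 8-letter identifier format is an arbitrary assumption.

import scala.util.Random

// Hedged sketch: re-generate the expected identifiers from the seed and
// verify that each one is still present in the database.
// `existsInDatabase` stands in for the actual driver/query.
def verifyAll(seed: Long, count: Int, existsInDatabase: String => Boolean): Boolean = {
  val rand = new Random(seed)
  Iterator
    .fill(count)(Stream.continually(rand.nextInt(26)).map(x => ('a' + x).toChar).take(8).mkString)
    .forall(existsInDatabase)
}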

Cons

Now let’s talk about the cons, because there are some. After all, if this technique were a silver bullet, everybody would use it.

First, there is an additional cost at the design phase. When the code of the test is written, we must always keep in mind that data has to be generated predictably, i.e. always in the same order. Because, as I mentioned before, only the combo "seed + data generation algorithm" allows for a perfectly identical re-generation.

The code is a little bit more complex, because methods like Math.random() or UUID.randomUUID() cannot be used. Each component that needs to generate random values must take a random source as a parameter, for example an instance of java.util.Random. This was the case for the function randomWord(rand: Random) in my previous example. Developers who are comfortable with referential transparency won’t have any issue with that. For the others, it could be a friction point.
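For illustration, here is a minimal sketch of what such a component could look like. The User fields and the way e-mails and passwords are derived are assumptions made up for this example; the only point is that the generator takes an explicit Random and is therefore fully deterministic for a given seed.

import scala.util.Random

// Illustrative sketch: no Math.random() or UUID.randomUUID() calls.
// Every piece of randomness flows from the Random instance passed in,
// so the same seed always yields the same users, in the same order.
final case class User(email: String, password: String)

def randomUser(rand: Random): User = {
  val localPart = Stream.continually(rand.nextInt(26)).map(x => ('a' + x).toChar).take(8).mkString
  val password  = Stream.continually(rand.nextInt(26)).map(x => ('a' + x).toChar).take(12).mkString
  User(s"$localPart@example.com", password)
}

// Same seed => same three users.
val rand  = new Random(42L)
val users = List.fill(3)(randomUser(rand))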

Finally, the biggest impact is regarding scalability. Suppose that multiple clients are started simultaneously, to produce an even bigger load. For instance, consider a Cassandra cluster spread across 3 datacenters, where there is one dedicated Gatling process per DC. If all clients run with the same seed, they will produce exactly the same values. We could very well end up with simultaneous user registrations with the very same e-mail address/password in multiple continents. Don’t get me wrong, it is a very interesting test case. Just maybe not the one we want to simulate at this time.

We have to define one seed per client, for instance by creating a base seed value and associating a unique ordinal with each client. The actual seed used by the load testing engine can then be baseSeed + clientOrdinal, as sketched below. Keep in mind that we need to remember how many clients were used to load the database, because in this case only the combo "base seed + client ordinal + data generation algorithm" allows for a perfectly identical re-generation.
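A minimal sketch of that seed derivation, assuming the ordinal is handed to each client at startup (here via a made-up system property), could look like this.

import scala.util.Random

// Hedged sketch: derive a per-client seed from a shared base seed and a
// unique client ordinal, so two clients never generate the same data, yet
// the whole data set can still be re-generated from (baseSeed, number of clients).
val baseSeed      = 20191211L                                        // illustrative value
val clientOrdinal = sys.props.getOrElse("client.ordinal", "0").toLong // hypothetical property
val rand          = new Random(baseSeed + clientOrdinal)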

Conclusion

I consider the usage of pseudo-random values a testing best practice. I have not (yet) run into a single use case where this technique could not simplify a test.

That being said, moving from a 100% random to a pseudo-random test is not straightforward. Rewriting the code is one of the challenges we all think about, but changing users’ habits is another one, as we have to make sure the correct seeds are reused everywhere.

Overall, the pros/cons ratio is still very positive.


If you have any question/comment, feel free to send me a tweet at @pingtimeout. And if you enjoyed this article and want to support my work, you can always buy me a coffee ☕️.