Rsync observations

Recently, I was wondering how the rsync command manages to send only the parts of files that have changed after an update. It triggered an interesting experiment, whose results are shared in this post.

10-second introduction

rsync is a very common program used to synchronize files between two machines. Its main advantage over programs like scp is that it does not need to re-transfer an entire file if only part of it has changed since the last time rsync was called.
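For instance, both commands below copy a file to another machine, but scp always re-sends the whole file, while rsync only sends the parts that changed since the previous run:

scp ~/big-file node1:~/
rsync -a ~/big-file node1:~/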

But how is this implemented? Simply put, a file is split into blocks, and a hash is computed for each block. This happens on both the sender and the receiver sides. When hashes differ, rsync knows that the associated block needs to be updated.
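As a toy illustration of that idea (this is not rsync's actual implementation: the 4 KB block size is arbitrary, the demo file is made up, and md5sum merely stands in for rsync's real checksums):

# Create a small demo file, then hash it block by block.
file=demo.bin
block=4096
head -c $((16 * 4096)) /dev/urandom > $file

size=$(stat -c %s "$file")
for ((offset = 0; offset < size; offset += block)); do
  # md5sum stands in for rsync's strong per-block checksum.
  sum=$(dd if="$file" bs=$block skip=$((offset / block)) count=1 2>/dev/null | md5sum | cut -d' ' -f1)
  echo "block at byte $offset: $sum"
done

Running this on both machines and comparing the two outputs line by line would reveal which blocks differ.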

The question

Now, the question is this: Does rsync need to recompute the hash of each block every time? Or are these hashes stored and reused later?

I was asking that question in a specific context. Consider a 1+ GB file that differs in exactly 1 byte between two machines. I expect that only one block will be sent over the network; that much is certain.

But what happens on the disk I/O side? Is the file loaded entirely into memory, on both sides, to detect that 1-byte difference? Or is there another optimisation that improves these cases?

This page contains the answer, but it was not 100% clear to me when I read it.

Experiment design

Stand back, I’m going to try science.

— XKCD

In situations like this one, I like to write down my expectations, and only after that, design an experiment to verify them. The experiment looks like this:

  1. Monitor two machines on which the disk bandwidth is capped at 50 MB/s

  2. On node 0, create a 3 GB file*

  3. rsync it to node 1

  4. Append one byte to the file on node 0

  5. rsync it again to node 1

*: It takes exactly 60 seconds to write a 3 GB file at 50 MB/s. I like round numbers.

The OS metrics, and more specifically the disk bandwidth usage, will tell me whether the entire file is read from disk. One thing to note is that after steps 2 and 3 (the file creation and the first rsync), I need to clear the file system cache on both nodes. Failing to do so would result in the file being read from RAM, which would give false results.

The test script is given below. It is run on node 0.

#!/bin/bash
input_file=~/big-file

# Step 2: create the 3 GB file. With the disk bandwidth capped at
# 50 MB/s, this takes 60 seconds. The sleep calls separate the
# phases on the monitoring charts.
dd if=/dev/zero of=$input_file bs=1M count=3000
sleep 1m

# Clear the file system cache on both nodes, so that the next read
# actually hits the disk.
sudo sync ; echo 1 | sudo tee /proc/sys/vm/drop_caches
ssh node1 'sudo sync ; echo 1 | sudo tee /proc/sys/vm/drop_caches'
sleep 1m

# Step 3: first rsync, the file is fully transferred.
rsync -a $input_file node1:~/
sleep 1m

# Clear the caches again before the second rsync.
sudo sync ; echo 1 | sudo tee /proc/sys/vm/drop_caches
ssh node1 'sudo sync ; echo 1 | sudo tee /proc/sys/vm/drop_caches'
sleep 1m

# Steps 4 and 5: append a single byte, then rsync again.
printf '0' >> $input_file
rsync -a $input_file node1:~/

Test phases

In the charts below, the three red vertical lines mark, respectively, the timestamp at which the big file is generated and the timestamps at which the first and second rsync commands start.

Node 0 disk bandwidth

The disk bandwidth usage on node 0 clearly shows that it took 1 minute to generate the 3 GB file (3 GB at 50 MB/s). Then it shows that the file was entirely read from disk during the first rsync invocation. On the second rsync invocation, the file is again read entirely from disk, but only after a 1-minute delay.

Disk bandwidth usage on node 0

Node 1 disk bandwidth

On node 1, we can see that the disk bandwidth is also at 50 MB/s during the first rsync, which is normal. But during the second rsync, we see an interesting pattern. The 3 GB file is read entirely from disk as soon as rsync starts, and the end of that read phase coincides with the start of node 0 reading that very file. In other words, node 0 waited for node 1 to compute its hashes before starting its own computation.

This point was surprising to me. I expected both nodes to compute the file hash at the same time.

Another (slightly less important) curiosity is that the large file is entirely rewritten to disk during the second rsync. This is rsync's default behaviour: the receiver reconstructs the file in a temporary copy, then renames it over the original (the --inplace option avoids this).

Disk bandwidth usage on node 1

Rolling checksums

This 1-minute offset between sender and receiver hash computation is explained by "rolling checksum computation".

Consider the case where a single byte is prepended to a file. If block boundaries were fixed every 4 KB, the two files would be considered 100% different, since every byte of the source file is shifted by one. Hence, block boundaries cannot be fixed on the sender side: the sender computes a checksum over a window that slides through the file one byte at a time.
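Computing a checksum at every byte offset is only affordable because rsync's weak checksum is "rolling": when the window slides by one byte, the new value can be derived from the previous one in constant time. Below is a minimal bash sketch of that rolling property, modelled on the Adler-32-style weak checksum described in the rsync technical report; the tiny 16-byte block and random demo file are made up so the output stays readable:

# Create a small demo file and load it as an array of byte values.
head -c 32 /dev/urandom > demo.bin
bytes=($(od -An -v -tu1 demo.bin))
n=${#bytes[@]}
block=16
M=$((1 << 16))

# Checksum of the first block, computed from scratch:
# a = sum of the bytes, b = position-weighted sum (both mod 2^16).
a=0; b=0
for ((i = 0; i < block; i++)); do
  a=$(( (a + bytes[i]) % M ))
  b=$(( (b + (block - i) * bytes[i]) % M ))
done
echo "offset 0: weak checksum = $(( a + (b << 16) ))"

# Slide the window one byte at a time. Each step only needs the byte
# leaving the window and the byte entering it: O(1) work per offset,
# instead of re-reading the whole block.
for ((k = 1; k + block <= n; k++)); do
  old=${bytes[k-1]}
  new=${bytes[k+block-1]}
  a=$(( (a - old + new + M) % M ))
  b=$(( (b - block * old + a + 2 * M) % M ))
  echo "offset $k: weak checksum = $(( a + (b << 16) ))"
done

rsync layers a strong checksum on top of this: it is only computed for offsets where the cheap weak checksum already matches one of the receiver's blocks.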

As soon as rsync has replicated the change of the first block, it recomputes the hash starting from the byte immediately after the last change. That way, if the rest of the file is identical, it will detect it and won't send it over again.

During the test, node 1 first computed the hash of each block on its side. Then, node 0 started computing them on its side as well. Every time two blocks matched, it moved on to the next one while letting node 1 know that they were identical.

My naive expectation that both nodes would hash their file at the same time would have worked in this particular case, but only because it was the last block of the source file that had been modified. If the file had been prepended with an extra byte instead, all the pre-computed blocks would have been marked as different, and the sender would have had to read the file a second time to compute its rolling checksums. In other words, the receiver would have read the file once, and the sender twice.

Conclusion

By default, rsync identifies the files that it needs to hash based on their size and last modification date. Every time it detects that a file has changed, it reads it entirely from disk, on both sides.
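These defaults can be changed with documented rsync options, for instance:

# Default quick check: compare size + modification time to decide
# which files need to be examined at all.
rsync -a ~/big-file node1:~/

# Force full-content checksums even when size and mtime match.
# Every file is then read entirely on both sides.
rsync -a --checksum ~/big-file node1:~/

# Update the destination file in place instead of rebuilding it in a
# temporary copy (avoids the full rewrite observed on node 1).
rsync -a --inplace ~/big-file node1:~/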

More charts

The two charts below show respectively the network bandwidth usage on node 0 and node 1. They show that the 3 GB content was only fully sent during the first rsync command invocation. This matches what we expect from rsync.

And this further confirms that node 1 rewriting the entire file during the second invocation did not involve any network transfer.

Network bandwidth usage on node 0
Network bandwidth usage on node 1

If you have any question/comment, feel free to send me a tweet at @pingtimeout. And if you enjoyed this article and want to support my work, you can always buy me a coffee ☕️.