Recently, I was wondering how the rsync command manages to send only the parts of files that have changed after an update. It triggered an interesting experiment, the results of which are shared in this post.
rsync is a very common program used to synchronize files between two machines.
Its main advantage over programs like scp is that it does not need to re-transfer an entire file if only part of it has changed since the last time rsync was called.
But how is this implemented?
Simply put, a file is split into blocks.
For each block, a hash is computed.
This process happens both on the sender and receiver sides.
When hashes differ, it tells
rsync that the associated block needs to be updated.
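A minimal sketch of this block-hashing idea (this is an illustration, not rsync's actual implementation — rsync combines a weak rolling checksum with a strong hash, and the 4 KB block size here is an arbitrary choice):

```python
import hashlib

BLOCK_SIZE = 4096  # arbitrary block size, for illustration only

def block_hashes(data: bytes) -> list:
    """Split data into fixed-size blocks and hash each one."""
    return [
        hashlib.md5(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]

# Two files that differ in a single byte...
a = bytes(3 * BLOCK_SIZE)
b = bytearray(a)
b[2 * BLOCK_SIZE] = 1  # flip one byte in the last block

# ...differ in exactly one block hash, so only that block needs sending.
changed = [i for i, (ha, hb) in enumerate(zip(block_hashes(a), block_hashes(bytes(b))))
           if ha != hb]
```

Only the blocks whose hashes differ (here, block 2) would need to travel over the network.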
Now, the question is this: does rsync need to recompute the hash of each block every time?
Or are these hashes stored and reused later?
I was asking that question in a specific context. Consider a 1+ GB file that differs in exactly 1 byte between two machines. I expect that only one block will be sent over the network; that much is certain.
But what happens on the disk I/O side? Is the file loaded entirely in memory, on both sides, to detect that 1 byte difference? Or is there another optimisation that improves these cases?
This page contains the answer, but it was not 100% clear to me when I read it.
Stand back, I’m going to try science.
In situations like this one, I like to write down my expectations, and only after that, design an experiment to verify them. The experiment looks like this:
1. Monitor two machines on which the disk bandwidth is capped at 50 MB/s
2. On node 0, create a 3 GB file*
3. rsync it to node 1
4. Append one byte to the file on node 0
5. rsync it again to node 1

*: It takes exactly 60 seconds to write a 3 GB file at 50 MB/s. I like round numbers.
The OS metrics, and more specifically the disk bandwidth usage, will tell me whether the entire file is read from disk. One thing to note is that after each step that touches the file, I need to clear the file system cache on both nodes. Failing to do so would result in the file being read from RAM, and would give false results.
The test script is given below. It is run on node 0.
input_file=~/big-file
dd if=/dev/zero of=$input_file bs=1M count=3000
sleep 1m
sudo sync ; echo 1 | sudo tee /proc/sys/vm/drop_caches
ssh node1 'sudo sync ; echo 1 | sudo tee /proc/sys/vm/drop_caches'
sleep 1m
rsync -a $input_file node1:~/
sleep 1m
sudo sync ; echo 1 | sudo tee /proc/sys/vm/drop_caches
ssh node1 'sudo sync ; echo 1 | sudo tee /proc/sys/vm/drop_caches'
sleep 1m
echo 0 >> $input_file
rsync -a $input_file node1:~/
In the charts below, the three red vertical lines represent respectively the timestamp at which the big file is generated, and the timestamps at which the first and second
rsync commands are started.
Node 0 disk bandwidth
The disk bandwidth usage on node 0 clearly shows, first, that it took 1 minute (3 GB at 50 MB/s = 60 seconds) to generate the 3 GB file.
Then it shows that the file was entirely read from disk during the first rsync.
However, on the second rsync invocation, it shows that the file is again read entirely from disk, but not immediately.
There is a 1-minute delay before this happens.
Node 1 disk bandwidth
On node1, we can see that the disk bandwidth is also at 50 MB/s during the first
rsync, which is normal.
But during the second
rsync, we see an interesting pattern.
The 3 GB file is read entirely from disk as soon as the second rsync starts.
And the end of that read phase coincides with the start of node 0 reading that very file.
In other words, it shows that node 0 waited for node 1 to build its hash before it did its part.
This point was surprising to me. I expected both nodes to compute the file hash at the same time.
Another (slightly less important) curiosity is that the large file is entirely rewritten to disk during the second rsync.
This 1-minute offset between sender and receiver hash computation is explained by "rolling checksum computation".
Consider the case where a single byte is prepended to a file. If block boundaries are fixed every 4 KB, the files will be considered 100% different, as all bytes are shifted by 1 in the source file. Hence, block boundaries may evolve dynamically during the process.
As soon as rsync has replicated the change of the first block, it recomputes the hash starting from the byte immediately after the last change.
That way, if the rest of the file is identical, it will detect it and won't send it over again.
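This is what makes the weak checksum "rolling": sliding the window one byte to the right is a constant-time update, not a full recompute. Here is a sketch in the spirit of rsync's Adler-32-style checksum (not its exact code; the window size and modulus are illustrative):

```python
def checksum(window: bytes) -> tuple:
    """Compute the weak checksum of a window from scratch."""
    s1 = sum(window) % 65536
    s2 = sum((len(window) - i) * byte for i, byte in enumerate(window)) % 65536
    return s1, s2

def roll(s1: int, s2: int, old_byte: int, new_byte: int, size: int) -> tuple:
    """Slide the window one byte to the right in O(1)."""
    s1 = (s1 - old_byte + new_byte) % 65536
    s2 = (s2 - size * old_byte + s1) % 65536
    return s1, s2

data = b"hello rolling checksum world"
SIZE = 8  # illustrative window size

# Compute the first window from scratch, then roll it across the data.
s1, s2 = checksum(data[0:SIZE])
for i in range(1, len(data) - SIZE + 1):
    s1, s2 = roll(s1, s2, data[i - 1], data[i + SIZE - 1], SIZE)
    # The rolled value always matches a full recompute of the new window.
    assert (s1, s2) == checksum(data[i:i + SIZE])
```

This cheap rolling property is what lets the sender test a candidate block at every byte offset without re-reading each window in full; only on a weak match does it pay for the strong hash comparison.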
During the test, node 1 first computed the hash for each block on its side. Then, node 0 started computing them on its side as well. Every time two blocks matched, it moved on to the next one while letting node 1 know that they were identical.
My naive expectation that both nodes would hash their file at the same time would have worked, in this particular case. But only because it was the last block of the source file that had been modified. If the file had been prepended with an extra byte, all blocks would have been marked as different. And to compute its rolling checksums, the sender would have read the file once more. In other words, the receiver would have fully read the file once, and the sender would have read it twice.
rsync identifies the files that it needs to hash based on their size and last modification date.
Every time it detects that a file has changed, it reads it entirely from disk.
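That quick check can be sketched like this (my own simplification, not rsync's actual code — rsync compares the size and mtime from the file list exchanged between the two sides):

```python
import os
import tempfile

def probably_unchanged(path: str, known_size: int, known_mtime: int) -> bool:
    """Mimic rsync's quick check: identical size and mtime means the file
    is assumed unchanged and is skipped without any hashing or reading."""
    st = os.stat(path)
    return st.st_size == known_size and int(st.st_mtime) == known_mtime

# Demo with a hypothetical temp file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"some data")
    path = f.name

st = os.stat(path)
assert probably_unchanged(path, st.st_size, int(st.st_mtime))

with open(path, "ab") as f:   # append one byte, like the experiment above
    f.write(b"0")
assert not probably_unchanged(path, st.st_size, int(st.st_mtime))  # size changed
os.remove(path)
```

This is why the second rsync in the experiment could not skip the big file: appending one byte changed both its size and its mtime, so the full block-hashing machinery kicked in.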
The two charts below show respectively the network bandwidth usage on node 0 and node 1.
They show that the 3 GB content was only fully sent during the first
rsync command invocation.
This matches what we expect from rsync.
And this further confirms that node 1 rewriting the entire file during the second invocation did not involve any network transfer.
If you have any question/comment, feel free to send me a tweet at @pingtimeout. And if you enjoyed this article and want to support my work, you can always buy me a coffee ☕️.