This article is part of a series on rsync:

1. rsync (part 1): basic commands and usage

2. rsync (part 2): inotify+rsync in detail, and sersync

3. rsync algorithm principles and workflow analysis

4. rsync technical report (translation)

5. rsync working mechanism (translation)

6. rsync man page (Chinese translation)

This article is a translation of How Rsync Works, a document recommended on the official rsync site. It mainly covers the Rsync glossary and a simplified account of how rsync works. The translation is not complete: the foreword was skipped, but its original text is retained for the sake of completeness.

How Rsync Works
A Practical Overview


The original Rsync technical report and Andrew Tridgell's PhD thesis (pdf) are both excellent
documents for understanding the theoretical mathematics and some of the
mechanics of the rsync algorithm. Unfortunately they are more about the theory
than the implementation of the rsync utility (hereafter referred to as Rsync).

In this document I hope to describe...

* A non-mathematical overview of the rsync algorithm.
* How that algorithm is implemented in the rsync utility.
* The protocol, in general terms, used by the rsync utility.
* The identifiable roles the rsync processes play.

This document should be able to serve as a guide for programmers needing something of
an entrée into the source code, but the primary purpose is to give the reader a
foundation from which he may understand

* Why rsync behaves as it does.
* The limitations of rsync.
* Why a requested feature is unsuited to the code-base.

This document describes in general terms the construction and behaviour of
Rsync. In some cases details and exceptions that would contribute to specific
accuracy have been sacrificed for the sake of meeting the broader goals.

Processes and Roles

When we talk about Rsync, we use specific terms to refer to the various processes and their roles in a task. To communicate conveniently and precisely, it is important that we all speak the same language; likewise, it is important that a term always means the same thing in a given context. On the Rsync mailing list there is often confusion about roles and processes. For these reasons I will define a few terms for the roles and processes that are used in the rest of this article.



client

role

The client initiates the synchronization.

server

role

The remote rsync process or remote system to which the client connects, whether in a local transfer, via a remote shell, or via a network socket.

"server" is only a general term and should not be confused with the daemon.

Once the connection between the client and the server is established, the sender and receiver roles are used to distinguish the two ends instead.

daemon

role and process

An rsync process that waits for connections from clients. On some platforms this would be called a service.

remote shell

role and set of processes

One or more processes that provide the connection between an Rsync client and a remote rsync server.

sender

role and process

The process that has access to the source files being synchronized.

receiver

role and process

As a role, the receiver is the destination system. As a process, the receiver is the process that receives update data and writes it to disk.

generator

process

The generator process identifies changed files and manages the file-level logic.

Process Startup

When an Rsync client starts, it first establishes a connection with a server process. The two ends of this connection may communicate through pipes or over a network socket.

When Rsync communicates with a remote non-daemon server via a remote shell, the startup method is to fork the remote shell, which starts an Rsync server process on the remote system. Both the Rsync client and the server then communicate via pipes through the remote shell. As far as the rsync processes are concerned, there is no network. In this mode the options for the server-side rsync process are passed on the command line used to start the remote shell.

When rsync communicates with an rsync daemon, it communicates directly over a network socket. This is the only kind of rsync communication that could be called network aware. In this mode the rsync options must be sent over the socket, as described below.

At the very start of communication between the client and the server, each side sends the maximum protocol version it supports to the other side. Both sides then use the smaller of the two versions for the transfer. If this is a daemon-mode connection, the rsync options are sent from the client to the server, and then the exclude list is transmitted. From this point on, the client-server relationship matters only for the delivery of error and log messages. (Translator's note: from here on, the two ends of the rsync connection are described in terms of the sender and receiver roles.)
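The version negotiation described above can be sketched in a few lines. This is only an illustration: the function name and the version numbers are hypothetical, and the real handshake exchanges the numbers over the pipe or socket.

```python
def negotiate_protocol(local_max: int, remote_max: int) -> int:
    """Return the protocol version both peers can speak: the smaller maximum."""
    return min(local_max, remote_max)


# Hypothetical version numbers, chosen only for illustration.
client_max = 31
server_max = 29
agreed = negotiate_protocol(client_max, server_max)
print(agreed)
```

Because each side applies the same rule to the same two numbers, both arrive at the same version without any further round trip.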

A local Rsync job (where source and destination are both on local file systems) is handled like a push. The client (translator's note: this is the side holding the source files) becomes the sender and forks a server process to fulfil the receiver role. The client/sender and the server/receiver then communicate with each other over pipes.

The File List

The file list includes not only path names but also attributes such as mode, ownership, permissions, file size and mtime. If the "--checksum" option is used, it also includes file-level checksums.

The first thing that happens once the connection is set up is that the sender creates the file list. As the list is built, each entry is transmitted to (shared with) the receiver.

When this is done, both sides sort the file list by path relative to the base directory of the transfer (the exact sorting algorithm depends on the protocol version in use). Once sorting is complete, all later references to files are made by their index in the file list.
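A minimal sketch of why sorting on both sides lets files be referenced by index. The `FileEntry` type and the plain string sort are assumptions made for illustration; the real ordering depends on the protocol version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str   # path relative to the base directory of the transfer
    size: int
    mtime: int
    mode: int


def build_sorted_list(entries):
    # Assumption: a plain string sort stands in for rsync's real ordering.
    # What matters is that both ends apply the same deterministic rule,
    # so index i names the same file on the sender and on the receiver.
    return sorted(entries, key=lambda e: e.path)


sender_view = build_sorted_list([
    FileEntry("docs/b.txt", 10, 1700000000, 0o644),
    FileEntry("a.txt", 5, 1700000100, 0o644),
])
# The receiver may get the entries in any order; sorting restores agreement.
receiver_view = build_sorted_list(list(reversed(sender_view)))
print(sender_view[0].path, receiver_view[0].path)
```

After this step a file can be identified on the wire by a small integer instead of its full path.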

After the receiver has received the file list, it forks the generator process, which together with the receiver process completes the pipeline.

The Pipeline

Rsync is heavily pipelined. This means that its processes communicate in one direction only. Once the file list has been transmitted, the pipeline behaves like this:

generator --> sender --> receiver

The output of the generator is the input of the sender, and the output of the sender is the input of the receiver. Each process runs independently and is delayed only when the pipeline stalls or when it is waiting for disk I/O or CPU resources.

(Translator's note: although communication is one-way, each process passes its output to the next process as soon as it finishes a piece of work and then moves on to the next piece, while the next process starts on the data as soon as it arrives. They therefore work as a pipeline, yet run independently and in parallel, with essentially no delay or blocking.)
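The one-way pipeline can be imitated with three workers joined by queues. This is a sketch under stated assumptions: threads and `queue.Queue` stand in for rsync's separate processes and pipes, and the strings passed along are placeholders for checksum sets and delta data.

```python
import queue
import threading

DONE = None  # sentinel marking the end of the stream


def generator(file_ids, to_sender):
    # Emits one (file index, checksum set) item per file, then the sentinel.
    for fid in file_ids:
        to_sender.put((fid, f"checksums-for-{fid}"))
    to_sender.put(DONE)


def sender(from_generator, to_receiver):
    # Consumes generator output and produces input for the receiver.
    while (item := from_generator.get()) is not DONE:
        fid, checksums = item
        to_receiver.put((fid, f"delta-from-{checksums}"))
    to_receiver.put(DONE)


def receiver(from_sender, results):
    while (item := from_sender.get()) is not DONE:
        results.append(item)


def run_pipeline(file_ids):
    # Small bounded queues: a stage blocks only when its neighbour stalls.
    gen_to_snd = queue.Queue(maxsize=2)
    snd_to_rcv = queue.Queue(maxsize=2)
    results = []
    stages = [
        threading.Thread(target=generator, args=(file_ids, gen_to_snd)),
        threading.Thread(target=sender, args=(gen_to_snd, snd_to_rcv)),
        threading.Thread(target=receiver, args=(snd_to_rcv, results)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return results


print(run_pipeline([0, 1, 2]))
```

The generator can already be working on file 2 while the sender handles file 1 and the receiver writes file 0, which is the overlap the note above describes.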

The Generator

The generator process compares the file list with its local directory tree. If the "--delete" option was specified, then before starting its main work it first identifies local files that are not on the sender (translator's note: the generator is a process on the receiver side) and deletes them on the receiver.

The generator then starts its main work, walking the file list file by file. Each file is checked to see whether it can be skipped. In the most common mode of operation, a file is not skipped if its mtime or size differs. If the "--checksum" option was specified, a file-level checksum is created and compared. Directories, block devices and symbolic links are never skipped. Missing directories are also created on the destination.
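The default skip decision can be sketched as a comparison of size and mtime. The function name and the whole-second mtime comparison are assumptions for illustration, and the "--checksum" variant is left out.

```python
import os
import stat


def needs_transfer(local_path, listed_size, listed_mtime):
    """Return True unless the local file can be skipped.

    Hypothetical helper: compares a file list entry (size, mtime)
    against the file at local_path on the receiving side.
    """
    try:
        st = os.lstat(local_path)
    except FileNotFoundError:
        return True  # nothing on the destination yet
    if not stat.S_ISREG(st.st_mode):
        return True  # directories, devices and symlinks are never skipped
    # Assumption: mtimes are compared at whole-second resolution.
    return st.st_size != listed_size or int(st.st_mtime) != listed_mtime
```

A file is skipped only when both values agree, which is why a file whose content changed but whose size and mtime did not is missed unless "--checksum" is used.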

If a file is not skipped, any existing version of it in the destination path becomes the basis file (translator's note: remember this term, it runs through the whole rsync working mechanism). The basis file is used as a source of matching data, so that the sender does not have to send the parts that match (translator's note: this is what makes the transfer incremental). To make this remote matching possible, block checksums are created for the basis file and sent to the sender immediately after the file's index number (file id). If the "--whole-file" option was specified, an empty block checksum set is sent for every file in the file list, which forces rsync to transfer whole files rather than deltas. (Translator's note: in other words, the generator sends the block checksum set of each file to the sender as soon as it has been calculated, rather than waiting until the checksums of all files are ready.)

The block size and the size of the block checksum are calculated per file, based on the size of that file (translator's note: the rsync command also lets you set the block size manually).
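A sketch of how a block checksum set for a basis file might be produced. Several things here are assumptions: the square-root block size rule with a floor of 700 bytes is only one plausible way of deriving the block size from the file size, SHA-256 stands in for rsync's strong checksum, and the weak checksum follows the two-part sum described in the rsync technical report.

```python
import hashlib
import math


def choose_block_size(file_size, minimum=700):
    # Assumption: grow the block size roughly with the square root of the
    # file size, never going below a fixed minimum.
    return max(minimum, math.isqrt(file_size))


def weak_checksum(block):
    # Two 16-bit sums packed into one 32-bit value.
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return (b << 16) | a


def block_checksums(basis, block_size):
    """Return one (weak, strong) pair per block of the basis file."""
    sums = []
    for offset in range(0, len(basis), block_size):
        block = basis[offset:offset + block_size]
        sums.append((weak_checksum(block), hashlib.sha256(block).digest()))
    return sums


basis = b"x" * 2500
print(len(block_checksums(basis, 1000)))  # two full blocks and one short one
```

An empty basis yields an empty set, which matches the "--whole-file" behaviour described above: with nothing to match against, everything has to be sent as literal data.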

The Sender

The sender process reads data from the generator one file at a time: a file id and the block checksum set (translator's note: or checksum list) of that file.

For each file the generator sends, the sender stores the block checksums and builds a hash index of them for fast lookup.

The sender then reads the local file and generates a checksum for the block that starts at the first byte. It looks this checksum up in the set sent by the generator. If no match is found, the non-matching byte is appended to the accumulated non-matching data (translator's note: here the non-matching byte is the first byte of the block just tested), and the block starting at the next byte (the second byte) is checksummed and compared, and so on until all of the data has been processed. This technique is called the "rolling checksum".

If the checksum of a block of the source file matches an entry in the checksum set, the block is considered a matching block. All non-matching data accumulated so far is then sent to the receiver, followed by the offset and length of the matching block in the receiver's file. (Translator's note: for example, if the match corresponds to the 8th block on the receiver, the offset, the matching block number and the block length are sent. Blocks have a fixed size, but when a file is divided into fixed-size blocks the last block may be shorter, so the length must also be sent to make sure it matches exactly.) The sender's checksum calculation then advances to the byte following the matching block and continues computing and comparing
(translator's note: after a match the window moves forward by a whole block; after a non-match it moves forward by one byte).

In this way, all matching blocks can be identified even if the blocks of the two files are in a different order or at different offsets. This process is the very heart of the rsync algorithm.
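The matching loop described above can be sketched as follows. This is an illustration under assumptions, not rsync's code: SHA-256 stands in for the strong checksum, the instruction format (`"data"` and `"copy"` tuples) is made up, and the weak checksum is recomputed at every position instead of being rolled, which keeps the sketch short at the cost of speed.

```python
import hashlib


def weak(block):
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return (b << 16) | a


def strong(block):
    return hashlib.sha256(block).digest()  # stand-in for the strong checksum


def signatures(basis, block_size):
    """What the generator would send: one (weak, strong) pair per basis block."""
    return [(weak(basis[o:o + block_size]), strong(basis[o:o + block_size]))
            for o in range(0, len(basis), block_size)]


def make_delta(source, sigs, block_size):
    """Return ('data', bytes) and ('copy', offset, length) instructions."""
    index = {}  # weak checksum -> [(strong checksum, offset in basis)]
    for i, (w, s) in enumerate(sigs):
        index.setdefault(w, []).append((s, i * block_size))
    delta, literal, pos = [], bytearray(), 0
    while pos < len(source):
        block = source[pos:pos + block_size]
        candidates = index.get(weak(block), ())
        digest = strong(block) if candidates else None
        match = next((off for s, off in candidates if s == digest), None)
        if match is None:
            literal.append(source[pos])  # accumulate the non-matching byte
            pos += 1                     # slide the window by one byte
        else:
            if literal:
                delta.append(("data", bytes(literal)))
                literal.clear()
            delta.append(("copy", match, len(block)))
            pos += len(block)            # jump past the matching block
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


basis = b"the quick brown fox jumps"
source = b"the quick red fox jumps"
for instruction in make_delta(source, signatures(basis, 4), 4):
    print(instruction)
```

In this example the blocks "fox ", "jump" and "s" are still found even though they sit at different offsets in the two files, and only the changed bytes travel as literal data.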

In this way, the sender gives the receiver instructions for how to reconstruct the source file into a new destination file. These instructions identify all the matching blocks that can be copied directly from the basis file (provided, of course, that one exists on the receiver), and they include all the raw data that is not available on the receiver (note: literal data). At the end of each file's processing, a whole-file checksum is also sent (translator's note: this is a file-level checksum), after which the sender proceeds with the next file.

Generating the rolling checksums and searching the checksum set for matches requires a good deal of CPU power. Of all the rsync processes, the sender is the most CPU intensive.
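What keeps this cost manageable is that the weak checksum can be updated in constant time when the window slides by one byte, instead of being recomputed over the whole block. The sketch below follows the two-part sum from the rsync technical report; the function names are hypothetical.

```python
M = 1 << 16  # both parts of the weak checksum are kept modulo 2**16


def weak_parts(block):
    """Compute the two sums from scratch over a whole block."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


data = bytes(range(50))
block_len = 8
a, b = weak_parts(data[:block_len])
for k in range(len(data) - block_len):
    a, b = roll(a, b, data[k], data[k + block_len], block_len)
    # The rolled value always equals a fresh computation over the new window.
    assert (a, b) == weak_parts(data[k + 1:k + 1 + block_len])
print("rolling update agrees with recomputation")
```

The strong checksum has no such shortcut, which is why it is only computed for windows whose weak checksum already hit the hash index.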

The Receiver

The receiver reads the data sent by the sender and identifies each file by its file index number. It then opens the local file (the one called the basis file) and creates a temporary file.

The receiver then reads non-matching data (literal data) and match records from the sender's stream. When it reads non-matching data, it writes that data to the temporary file. When it receives a match record, it seeks to the offset of that block in the basis file and copies the matching block to the temporary file. In this way the temporary file is built from beginning to end.

A checksum of the temporary file is generated as the file is built. At the end, this checksum is compared with the file checksum sent by the sender. If the two do not match, the temporary file is deleted and the file is reprocessed in a second phase. If it fails twice, an error is reported.

Once the temporary file is complete, its ownership, permissions and mtime are set. It is then renamed to replace the basis file.
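The receiver's steps can be sketched as one function. This is an illustration under assumptions: the instruction tuples (`"data"` and `"copy"`) are a made-up format, SHA-256 stands in for the whole-file checksum, and setting ownership, permissions and mtime is left out.

```python
import hashlib
import os
import tempfile


def rebuild(basis_path, delta, expected_digest):
    """Build a temp file from the delta and swap it in if the checksum agrees.

    Returns True on success, False if the whole-file checksum does not match
    (in which case the temp file is removed and the basis is left untouched).
    """
    directory = os.path.dirname(os.path.abspath(basis_path))
    digest = hashlib.sha256()
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    replaced = False
    try:
        with os.fdopen(fd, "wb") as tmp, open(basis_path, "rb") as basis:
            for op in delta:
                if op[0] == "data":
                    chunk = op[1]                # literal data from the sender
                else:                            # ("copy", offset, length)
                    basis.seek(op[1])
                    chunk = basis.read(op[2])
                tmp.write(chunk)
                digest.update(chunk)             # checksum grows with the file
        if digest.digest() == expected_digest:
            os.replace(tmp_path, basis_path)     # rename over the basis file
            replaced = True
        return replaced
    finally:
        if not replaced:
            os.unlink(tmp_path)
```

Creating the temporary file in the same directory as the basis file keeps the final rename on one file system, so the old version stays in place until the new one is complete and verified.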

Because the receiver copies data from the basis file to the temporary file, it is the most disk intensive of all the rsync processes. Small files may still be in the disk cache, which reduces disk I/O, but for large files the cached data may already have been flushed out because the generator has moved on to other files, and the sender adds further latency. Since data may be read at random from one file and written to another, a seek storm can occur if the working set is larger than the disk cache, which hurts performance further.

The Daemon

Like many other daemons, the rsync daemon forks a child process for every connection. On startup it parses the rsyncd.conf file to determine which modules exist and to set the global options.

When a connection is received for a defined module, the daemon forks a child process to handle it. The child process then reads the rsyncd.conf file and sets the options for the requested module, which may involve a chroot to the module path and may drop setuid/setgid privileges. After that, the daemon child process behaves just like any other rsync server and takes on either the sender or the receiver role.

The Rsync Protocol

A well-designed communication protocol has a number of characteristics.

* Everything is sent in well-defined packets with a header and an optional body or data payload.
* The header of each packet specifies a type and/or a command.
* Each packet has a definite length.

In addition to these characteristics, protocols have varying degrees of statefulness, inter-packet independence, human readability, and the ability to re-establish a disconnected session.

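Rsync's protocol has none of these characteristics. Data is transferred as an unbroken stream of bytes. Apart from the non-matching file data, there are no length specifiers and no counts. Instead, the meaning of each byte depends on its context as defined by the protocol level.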

For example, when the sender is sending the file list, it simply sends each file



This document is a work in progress. The author expects that it has some
glaring oversights and some portions that may be more confusing than
enlightening for some readers. It is hoped that this could evolve into a useful

Specific suggestions for improvement are welcome, as would be a complete