How rsync Works (translation)
The rsync series:
1. rsync (1): Basic commands and usage <http://www.cnblogs.com/f-ck-need-u/p/7220009.html>
2. rsync (2): inotify+rsync in detail, and sersync
3. rsync algorithm principles and workflow analysis <http://www.cnblogs.com/f-ck-need-u/p/7226781.html>
4. The rsync technical report (translation) <http://www.cnblogs.com/f-ck-need-u/p/7220753.html>
5. How rsync works (translation) <http://www.cnblogs.com/f-ck-need-u/p/7221535.html>
6. man rsync translation (the rsync manual in Chinese) <http://www.cnblogs.com/f-ck-need-u/p/7221713.html>
This article is a translation of How Rsync Works <https://rsync.samba.org/how-rsync-works.html>, a document recommended by the rsync project itself.
It mainly covers the Rsync glossary and a simplified description of how rsync works. The translation is not complete: the foreword was skipped, but its original text is kept here for the integrity of the article.
Collection of my translations: http://www.cnblogs.com/f-ck-need-u/p/7048359.html
How Rsync Works
A Practical Overview
The original Rsync technical report
<http://www.cnblogs.com/f-ck-need-u/p/7220753.html> and Andrew Tridgell's Phd
thesis (pdf) <http://samba.org/~tridge/phd_thesis.pdf> are both excellent
documents for understanding the theoretical mathematics and some of the
mechanics of the rsync algorithm. Unfortunately they are more about the theory
than the implementation of the rsync utility (hereafter referred to as Rsync).
In this document I hope to describe...
* A non-mathematical overview of the rsync algorithm.
* How that algorithm is implemented in the rsync utility.
* The protocol, in general terms, used by the rsync utility.
* The identifiable roles the rsync processes play.
This document should be able to serve as a guide for programmers needing something of
an entrée into the source code but the primary purpose is to give the reader a
foundation from which he may understand
* Why rsync behaves as it does.
* The limitations of rsync.
* Why a requested feature is unsuited to the code-base.
This document describes in general terms the construction and behaviour of
Rsync. In some cases details and exceptions that would contribute to specific
accuracy have been sacrificed for the sake of meeting the broader goals.
Processes and Roles
When we talk about rsync, we use specific terms to refer to the various processes and their roles in the task the utility performs. For effective communication it is important that we all speak the same language; likewise, it is important that we mean the same things when we use certain terms in a given context. On the Rsync mailing list there is often some confusion about roles and processes. For these reasons I will define a few terms, used in the role and process contexts, that will be used from here on.
* client (role): The client initiates the synchronisation.
* server (role): The remote rsync process or system to which the client connects, whether within a local transfer, via a remote shell, or via a network socket. This is a general term and should not be confused with the daemon. Once the connection between the client and the server is established, the distinction between them is superseded by the sender and receiver roles.
* daemon (role and process): An rsync process that awaits connections from clients. On certain platforms this would be called a service.
* remote shell (role and set of processes): One or more processes that provide connectivity between an Rsync client and an rsync server on a remote system.
* sender (role and process): The rsync process that has access to the source files being synchronised.
* receiver (role and process): As a role, the receiver is the destination system. As a process, the receiver is the process that receives update data and writes it to disk.
* generator (process): The generator process identifies changed files and manages the file-level logic.
Process Startup
When an Rsync client starts, it first establishes a connection with a server process. The two ends of this connection may communicate through pipes or over a network socket.
When Rsync communicates with a remote non-daemon server via a remote shell, the startup method is to fork the remote shell, which in turn starts an Rsync server process on the remote system. The Rsync client and server then communicate via pipes through the remote shell. As far as the rsync processes are concerned, no network is involved. In this mode the options for the server-side rsync process are passed on the command line used to start the remote shell.
When Rsync communicates with a daemon, it communicates directly over a network socket. This is the only kind of rsync communication that could be called network aware. In this mode the rsync options must be sent over the socket, as described below.
At the very start of the communication between the client and the server, each side sends the maximum protocol version it supports to the other side. Both sides then use the lower of the two versions for the transfer. If this is a daemon-mode connection, the rsync options are sent from the client to the server. Then the exclude list is transmitted. From this point on, the client-server relationship is relevant only to error and log message delivery. (Translator's note: from here on, the two ends of an rsync connection are described by the sender and receiver roles.)
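The version negotiation described above amounts to each side taking the minimum of the two advertised values. The sketch below only illustrates that rule; the function name and the version numbers are hypothetical and are not taken from the rsync source.

```python
def negotiate_protocol(local_max: int, remote_max: int) -> int:
    """Return the protocol version both sides will use.

    Each side advertises the highest version it supports, and the
    transfer then runs at the lower of the two values.
    """
    return min(local_max, remote_max)


# Hypothetical version numbers, chosen only for illustration.
client_max = 31
server_max = 29
agreed = negotiate_protocol(client_max, server_max)
```

Because both sides apply the same rule to the same pair of numbers, they reach the same answer without a further round trip.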
Local Rsync jobs (where the source and destination are both on local filesystems) are done exactly like a push. The client (translator's note: the side holding the source files) becomes the sender and forks a server process to fulfil the receiver role. The client/sender and the server/receiver then communicate with each other over pipes.
The File List
The file list includes not only the pathnames but also the mode, ownership, permissions, file size, modification time (mtime) and similar attributes. If the "--checksum" option is specified, it also includes the file-level checksums.
The first thing that happens once the rsync connection has been set up is that the sender creates the file list. Each entry in it is transmitted (shared) to the receiver.
When this is done, both sides sort the file list by path relative to the base directory of the transfer (the exact sorting algorithm depends on the protocol version in effect for the transfer). Once sorting is complete, all later references to files are made by their index in the file list.
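As a rough illustration of the sort-then-index idea, the sketch below builds a small file list, sorts it by relative path, and looks entries up by index. The `FileEntry` fields and the plain lexicographic sort are assumptions made for the example; the real ordering depends on the protocol version and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """One hypothetical file-list entry."""
    path: str    # path relative to the base directory of the transfer
    mode: int    # permission bits
    size: int    # size in bytes
    mtime: int   # modification time, seconds since the epoch


def build_sorted_file_list(entries):
    # Plain lexicographic sort by relative path (an assumption for this sketch).
    return sorted(entries, key=lambda entry: entry.path)


entries = [
    FileEntry("src/main.c", 0o644, 1200, 1500000000),
    FileEntry("README", 0o644, 300, 1500000100),
    FileEntry("docs/intro.txt", 0o644, 800, 1500000200),
]
file_list = build_sorted_file_list(entries)

# After sorting, both ends can refer to a file by its position alone.
index_of = {entry.path: index for index, entry in enumerate(file_list)}
```

Because both ends sort the same entries with the same rule, an index means the same file on either side and no pathname has to be sent again.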
After the receiver has received the file list, it forks the generator process, which together with the receiver process completes the pipeline.
The Pipeline
rsync is heavily pipelined. This means that its processes communicate in a (largely) unidirectional way. Once the file list has been transmitted, the pipeline behaves like this:
generator --> sender --> receiver
The output of the generator is the input of the sender, and the output of the sender is the input of the receiver. Each process runs independently and is delayed only when the pipeline stalls or when it is waiting for disk I/O or CPU resources.
(Translator's note: although the flow is unidirectional, each process passes its output to the next process as soon as it finishes a piece of work and then starts on the next piece, and the receiving process starts working as soon as the data arrives. They therefore work as a pipeline while remaining independent and running in parallel, with essentially no delay or blocking.)
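The generator, sender and receiver chain can be imitated with three workers connected by queues. This is only a sketch of the pipelining idea: rsync uses separate processes connected by pipes or sockets, whereas the example uses threads and `queue.Queue`, and the stage functions and message contents are invented for illustration.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(work, inbox, outbox):
    """Apply `work` to each item from `inbox` and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is DONE:
            if outbox is not None:
                outbox.put(DONE)
            return
        result = work(item)
        if outbox is not None:
            outbox.put(result)


def run_pipeline(file_indexes):
    to_generator = queue.Queue()
    to_sender = queue.Queue()
    to_receiver = queue.Queue()
    written = []

    def generate(index):
        return (index, "block checksums")   # placeholder payload

    def send(message):
        return (message[0], "delta")        # placeholder payload

    def receive(message):
        written.append(message[0])          # stands in for writing the file

    threads = [
        threading.Thread(target=stage, args=(generate, to_generator, to_sender)),
        threading.Thread(target=stage, args=(send, to_sender, to_receiver)),
        threading.Thread(target=stage, args=(receive, to_receiver, None)),
    ]
    for thread in threads:
        thread.start()
    for index in file_indexes:
        to_generator.put(index)
    to_generator.put(DONE)
    for thread in threads:
        thread.join()
    return written
```

Each stage can start on its next item while the stage after it is still busy with the previous one, which is the property the text describes.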
The Generator
The generator process compares the file list with its local directory tree. If the "--delete" option is specified, then before beginning its main function it first identifies local files that are not on the sender (translator's note: the generator is a process on the receiver side) and deletes them on the receiver.
The generator then starts its main work, walking the file list one file at a time. Each file is checked to see whether it can be skipped. In the most common mode of operation, a file is not skipped if its mtime or size differs. If the "--checksum" option is specified, a file-level checksum is generated and compared. Directories, block devices and symbolic links are not skipped. Missing directories are also created on the destination.
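The default skip test described above compares only size and modification time. A minimal sketch follows, assuming whole-second mtime comparison and a hypothetical function name; it does not cover the "--checksum" case or the special handling of directories, devices and symbolic links.

```python
import os


def can_skip(sender_size: int, sender_mtime: int, local_path: str) -> bool:
    """Return True when the local file looks identical to the sender's copy.

    The file is skipped only if it exists locally and both its size and
    its modification time (compared in whole seconds) match.
    """
    try:
        info = os.stat(local_path)
    except FileNotFoundError:
        return False  # nothing to compare against, so it must be transferred
    return info.st_size == sender_size and int(info.st_mtime) == sender_mtime
```

A file that fails this test is not necessarily sent in full; it only moves on to the block checksum stage.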
If a file is not skipped, any existing version of it in the destination path becomes the basis file (translator's note: remember this term; it runs through the whole of rsync's working mechanism). The basis file is used as a source of matching data, so that the sender does not have to send the parts that match it (translator's note: this is how incremental transfer is achieved). To make this remote matching possible, block checksums are created for the basis file and sent to the sender immediately after the file's index number (file id). An empty block checksum set is sent for new files, and for every file in the file list if the "--whole-file" option is specified, which forces rsync to transfer whole files instead of deltas. (Translator's note: in other words, the generator sends each file's block checksum set to the sender as soon as it has been calculated, rather than waiting until the block checksums of all files have been calculated.)
The size of the blocks that each file is divided into, and the size of the block checksums, are calculated per file according to the file's size (translator's note: the rsync command also allows the block size to be specified manually).
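To make the idea of a block checksum set concrete, the sketch below splits a basis file's contents into fixed-size blocks and records a cheap checksum and a strong checksum for each. The use of `zlib.adler32` and MD5 is a stand-in chosen for illustration, and `block_size` is passed in by the caller rather than derived from the file size as rsync does.

```python
import hashlib
import zlib


def block_checksums(data: bytes, block_size: int):
    """Return one (weak, strong) checksum pair per block of `data`.

    The last block may be shorter than `block_size`.
    An empty input yields an empty checksum set.
    """
    sums = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        weak = zlib.adler32(block)            # cheap checksum, stand-in only
        strong = hashlib.md5(block).digest()  # strong checksum, stand-in only
        sums.append((weak, strong))
    return sums
```

An empty list corresponds to the empty checksum set sent for files that have no basis file, in which case nothing can match and all data is sent.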
The Sender
The sender process reads the data from the generator one file at a time: a file index number and that file's block checksum set (translator's note: or checksum list).
For each file the generator sends, the sender stores the block checksums and builds a hash index of them for faster lookup.
It then reads the local file and generates a checksum for the block beginning at the first byte. It looks this checksum up in the set sent by the generator. If no match is found, the non-matching byte is appended to the accumulated non-matching data (translator's note: the non-matching byte here is the first byte of the block; it marks the offset at which the non-matching data begins), and a checksum is then generated and compared for the block starting at the next byte (the second byte), and so on until all of the data has been examined. This is what is called the "rolling checksum".
If the block checksum from the source file matches an entry in the checksum set, the block is considered a matching block. Any accumulated non-matching data (translator's note: together with non-file data such as block reassembly instructions and the file id) is then sent to the receiver, followed by the offset and length of the matching block in the receiver's file (translator's note: for example, if the block matches the receiver's 8th block, the offset, the matching block number and the block length are sent. Although the block size is fixed, the last block of a file may be smaller than that size, so the length is sent as well to guarantee an exact match). The sender's checksum calculation then advances to the byte after the matching block and continues generating and comparing checksums (translator's note: when a block matches, the window advances by a whole block; when it does not, it advances by one byte).
In this way matching blocks can be identified even if the blocks are in a different order or at different offsets in the files at the two ends. This process is the very heart of the rsync algorithm.
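The matching loop can be sketched as follows. This is an illustrative simplification, not rsync's implementation: the weak checksum is a simple pair of 16-bit sums that can be rolled forward one byte at a time, MD5 stands in for the strong checksum, and the delta is a list of hypothetical `("copy", offset, length)` and `("data", bytes)` tuples.

```python
import hashlib


def weak_sum(block: bytes):
    """Weak checksum as two 16-bit sums (an illustrative choice)."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return a, b


def make_delta(source: bytes, basis: bytes, block_size: int):
    # Hash index of the basis blocks, keyed by weak checksum.
    table = {}
    for offset in range(0, len(basis), block_size):
        block = basis[offset:offset + block_size]
        entry = (hashlib.md5(block).digest(), offset, len(block))
        table.setdefault(weak_sum(block), []).append(entry)

    delta, literal = [], bytearray()
    pos, window = 0, None
    while pos < len(source):
        block = source[pos:pos + block_size]
        if window is None or len(block) < block_size:
            window = weak_sum(block)  # compute from scratch
        match = None
        for strong, offset, length in table.get(window, ()):
            if length == len(block) and strong == hashlib.md5(block).digest():
                match = (offset, length)
                break
        if match is not None:
            if literal:  # flush accumulated non-matching data first
                delta.append(("data", bytes(literal)))
                literal.clear()
            delta.append(("copy", match[0], match[1]))
            pos += match[1]  # advance by a whole block
            window = None
        else:
            literal.append(source[pos])
            if pos + block_size < len(source):
                # Roll the window forward by one byte.
                a, b = window
                outgoing, incoming = source[pos], source[pos + block_size]
                a = (a - outgoing + incoming) & 0xFFFF
                b = (b - block_size * outgoing + a) & 0xFFFF
                window = (a, b)
            else:
                window = None  # the next block is short, so recompute it
            pos += 1  # advance by a single byte
    if literal:
        delta.append(("data", bytes(literal)))
    return delta
```

With a basis of `b"aaaabbbbccccdddd"`, a source of `b"ccccXaaaadddd"` and a block size of 4, the three surviving blocks are found at their new positions and only the single byte `X` travels as raw data.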
In this way the sender gives the receiver instructions on how to reconstruct the source file as a new destination file. These instructions detail all the matching blocks that can be copied directly from the basis file (provided, of course, that one exists on the receiver), and they also include any raw data that does not exist on the receiver (translator's note: pure data). At the end of each file's processing a whole-file checksum is also sent (translator's note: this is a file-level checksum), after which the sender proceeds with the next file.
Generating the rolling checksums and searching the checksum set for matches requires a good deal of CPU power. Of all the rsync processes, the sender is the most CPU intensive.
The Receiver
The receiver reads the data sent by the sender and identifies each file by its file index number. It then opens the local file (the one called the basis file) and creates a temporary file.
The receiver then reads the non-matching data blocks (pure data) and the records describing matching blocks from the data the sender has sent. When non-matching data is read, it is written to the temporary file. When a block match record is received, the receiver seeks to the offset of that block in the basis file and copies the matching block to the temporary file. In this way the temporary file is built from beginning to end.
A checksum of the file is generated as the temporary file is built. At the end, this checksum is compared with the file checksum sent by the sender. If the two do not match, the temporary file is deleted and the file is reprocessed in a second phase; if it fails twice, an error is reported.
Once the temporary file has been completed, its ownership, permissions and mtime are set. It is then renamed to replace the basis file.
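The receiver's steps can be sketched as a single function: build a temporary file from a list of instructions, verify the whole-file checksum, then rename the temporary file over the basis file. The instruction format (`("data", bytes)` and `("copy", offset, length)` tuples), the use of MD5 and the function name are assumptions for this example. Ownership and permissions are not handled, the second-phase retry is left to the caller, and a basis file is assumed to exist.

```python
import hashlib
import os
import tempfile


def rebuild_file(basis_path, delta, expected_digest, mtime):
    """Rebuild a file from `delta`; return True on success, False on mismatch."""
    directory = os.path.dirname(os.path.abspath(basis_path))
    digest = hashlib.md5()
    # The temporary file lives next to the target so the rename stays
    # on the same filesystem.
    fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as temp, open(basis_path, "rb") as basis:
            for op in delta:
                if op[0] == "data":
                    chunk = op[1]              # raw data from the sender
                else:
                    basis.seek(op[1])          # matching block in the basis file
                    chunk = basis.read(op[2])
                temp.write(chunk)
                digest.update(chunk)           # checksum grows as the file is built
        if digest.digest() != expected_digest:
            return False                       # caller may retry in a second phase
        os.utime(temp_path, (mtime, mtime))    # restore the modification time
        os.replace(temp_path, basis_path)      # rename over the basis file
        return True
    finally:
        # Remove the temporary file if it was not renamed into place.
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

Building into a temporary file first means the existing destination file stays intact until the new contents have passed the checksum comparison.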
Because the receiver copies data from the basis file to the temporary file, it is the most disk intensive of all the rsync processes. Small files may still be in the disk cache, which reduces disk I/O, but for large files the cache may thrash because the generator has already moved on to other files, and the sender adds further latency. Since data may be read at random from one file and written to another, if the working set is larger than the disk cache, a so-called seek storm can occur, which hurts performance further.
The Daemon
Like many other daemons, the rsync daemon forks a child process for every connection. On startup it parses the rsyncd.conf file to determine which modules exist and to set the global options.
When a connection is received for a defined module, the daemon forks a child process to handle the connection. The child process then reads the rsyncd.conf file and sets the options for the requested module, which may involve a chroot to the module path and may drop setuid and setgid for the process. After that, the daemon child process behaves just like any other rsync server process, taking on either the sender or the receiver role.
The Rsync Protocol
A well-designed communications protocol has a number of characteristics.
* Everything is sent in well-defined packets, with a header and an optional body or data payload.
* The header of each packet specifies a type and/or a command.
* Each packet has a definite length.
In addition to these characteristics, protocols have varying degrees of statefulness, inter-packet independence, human readability, and the ability to re-establish a disconnected session.
The rsync protocol has none of these characteristics. Data is transferred as an unbroken stream of bytes. With the exception of the non-matching file data, there are no length specifiers and no counts. Instead, the meaning of each byte depends on its context as defined by the protocol level.
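For contrast, the sketch below shows what the packet characteristics listed above look like in practice: every packet carries a type and an explicit length, so a reader can split the stream without knowing anything else about the conversation. This is explicitly not rsync's wire format; the header layout and the packet type values are invented for the example.

```python
import struct

# Hypothetical header: 1-byte packet type, 4-byte payload length, network byte order.
HEADER = struct.Struct("!BI")


def encode_packet(kind: int, payload: bytes) -> bytes:
    """Prefix the payload with its type and length."""
    return HEADER.pack(kind, len(payload)) + payload


def decode_packets(stream: bytes):
    """Split a byte stream back into (type, payload) pairs."""
    packets, pos = [], 0
    while pos < len(stream):
        kind, length = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        packets.append((kind, stream[pos:pos + length]))
        pos += length
    return packets


stream = encode_packet(1, b"file list entry") + encode_packet(2, b"")
packets = decode_packets(stream)
```

In the rsync stream there is no such header to rely on: a reader can interpret a byte only by tracking where it is in the exchange and which protocol version was negotiated.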
This document is a work in progress. The author expects that it has some
glaring oversights and some portions that may be more confusing than
enlightening for some readers. It is hoped that this could evolve into a useful reference.
Specific suggestions for improvement are welcome, as would be a complete