rsync Technical Report( translate)
This article isrsync Official recommended technical reportrsync technical report
Translation, The main content isRsync Algorithm principle andrsync How to implement these principles. In the process of translation, Add the translator's own notes to some incomprehensible places.
Collection of my translations：http://www.cnblogs.com/f-ck-need-u/p/7048359.html
Below isrsync Series：
1.rsync( One)： Basic commands and usage <http://www.cnblogs.com/f-ck-need-u/p/7220009.html>
2.rsync( Two)：inotify+rsync Detailed description andsersync
3.rsync Algorithm principle and workflow analysis <http://www.cnblogs.com/f-ck-need-u/p/7226781.html>
4.rsync Technical Report( translate) <http://www.cnblogs.com/f-ck-need-u/p/7220753.html>
5.rsync Working mechanism( translate) <http://www.cnblogs.com/f-ck-need-u/p/7221535.html>
6.man rsync translate(rsync Command manual in Chinese)
Andrew Tridgell Paul Mackerras
Department of Computer Science
Australian National University
Canberra, ACT 0200, Australia
This report describes an algorithm for updating files on one machine to be consistent with files on another machine. We assume low bandwidth between two machines, High latency two-way link for communication.
The algorithm calculates the consistent part of the source file and the target file( translator's note： Consistent parts of data block), Then send only those that don't match( translator's note： That is to say, the inconsistent parts in the documents at both ends) Part of
. Actually, This algorithm calculates a series of differences between two files on two machines. If the two documents are similar, The efficiency of the algorithm is very high, But even though the two documents are very different, It can also ensure correct and efficient work.
1.2 The problem
Suppose you have two filesA andB, You want to updateB Let it beA Exactly the same, The most direct way is to copyA Turn intoB.
But imagine, If the two files are connected and communicated between machines with extremely slow communication links, For example, using dial-upIP link. IfA File is very large. CopyA reachB It's very slow. In order to speed up, You can takeA Send after compression, But this method can only obtain20% reach40% Promotion.
Now supposeA andB Files are very similar, Maybe they are both separated from the same source file. For real speed, You need to take advantage of this similarity. A common approach is to send only over a linkA andB Part of the difference between, Then reorganize the file according to the difference list( translator's note： So you also need to create a list of differences, Send difference list, Finally, match the difference list and reorganize it).
The problem with this general approach is, To create a collection of differences between two files, It relies on the ability to open and read two files. therefore, This method is used before reading two files, It is required to obtain the file from the other end of the link in advance. If you cannot get a file from the other end, This algorithm will fail( But if you actually get the file from the other end, Both files will be on the same machine, At this time, you can copy directly, There is no need to compare the differences between these documents).rsync That's what we're dealing with.
rsync The algorithm can efficiently calculate the same part of the source file and the target existing file( translator's note： That is, the same data block can be matched). These same parts do not need to be sent over the link; All you need to do is reference these same parts of the target file to make matching references( translator's note： Reference documentbasis
file). Only the unmatched parts of the source file will be byte by byte in a pure data way
Send out. after, The receiver can form a copy of the source file through the same existing parts and byte by byte data received.
generally speaking, The data sent to the receiver can be compressed and transmitted by any common compression algorithm, To further speed up.
1.3 The rsync algorithm
Suppose we have two computersα andβ,α There are accessible files onA,β There are accessible files onB, AndA andB The two documents are similar.α andβ Low speed link communication between.
rsync The algorithm consists of the following processes：
1.β Will fileB Split into a series of non overlapping and fixed sizeS byte( translator's note： The authors note that500 reach1000 Byte is more suitable) Data block. Of course, Last block may be less thanS byte.
2.β Two check codes are calculated for each such block：32 Weak check code of bitrolling-checksum and128 Bit strong check codeMD4-checksum( translator's note： Currentrsync Used is128 BitMD5-checksum).
3.β Send these check codes toα.
4.α Will search for filesA, Find out from it that all the lengths areS Byte sum sumB Two data blocks with the same check code in( Search from any offset). This search and comparison process can be performed by using the weak roll check(rolling
checksum) The special functions of.
( translator's note： String123456 take as an example, To search for inclusion3 Character string, If you search all by any offset3 Character length string, The most basic method is from1 Start searching to get123, from2 Start searching to get234, from3 Start searching to get345, Until the search is complete. That's what any offset means, That is to say, the search length from any position isS Data block)
( Translator Renotes： Why to search at any offset, Consider a situation, There are two identical filesA andB, Now toA Insert a piece of data in the middle of the file, thatA From this data, The offsets of all data blocks immediately following are shifted back a certain length, If you do not search for a fixed length block at any offset, Then start with the newly inserted data, All data blocks andB Dissimilarity, stayrsync Which means that all these data blocks need to be transmitted, But actuallyA andB Different data blocks are only inserted in the middle segment)
5.α Send a series of instructions toβ To constructA Copies of documents. Each instruction is either rightB References to data blocks in, Or pure data. The pure data isA Cannot match toB Data block part of data block in( translator's note： NamelyA Different fromB Data block).
The end result isβ Got it.A Copy of, But only sentA Exist inB Data part not present in( It also includes some check codes and data block index data). The algorithm only requires one link round trip( translator's note： The first link communication isβ Send check code toα, The second communication isα Send instruction, Check code, Index andA Exist inB Pure data that does not exist inβ), This minimizes the impact of link latency.
The most important detail of the algorithm is rolling check(rolling
checksum) And many alternatives related to it(multi-alternate) Search mechanism, They guarantee all offset checks(all-offsets
checksum, I.e. the above steps4 Search data block from any offset position in) The search process can be handled very quickly. They are discussed in more detail below.
( translator's note： If you don't want to see algorithm theory, The following algorithm can be skipped( Let's see the final conclusion), Just make sense of itrsync What's done in the algorithm is enough.)
1.4 Rolling checksum
rsync Weak roll check used in the algorithm(rolling checksum) Need to be fast, Low consumption based on given bufferX1 ...Xn AndX1,Xn+1
The byte value of calculates the check value of the buffer.
We arersync The design inspiration of weak check algorithm used inMark Adler Ofadler-32 check. Our validation definition formula is：
amongs(k,l) Is the rolling check code of bytes. In order to calculate the rolling check code quickly and simply, We use.
The most important feature of this algorithm is that it can use recursion relation to calculate continuous values very efficiently.
Therefore, theS Length of data block to calculate the check code, And the number of calculations is very small.
Although the algorithm is simple enough, But it's enough to be the first level check when two file data blocks match. We found in practice, When block contents are different, Check code(rolling
checksum) The probability of matching is very low. This is very important, Because every weak parity check can match the data block on it, the strong parity check code must be calculated, It is very expensive to calculate strong check codes.( translator's note： In other words, Different block contents, Weak check codes may still be the same( Although the probability is very low), that isrolling
checksum There was a collision, Therefore, it is necessary to calculate the strong check code for the data block with the same weak check code for further matching)
1.5 Checksum searching
Whenα I got itB Check code list of file data block, It has to searchA Data block of file( Search at any offset), The goal is to find a matchB Data block part of the data block check code. The basic strategy is toA The length of each byte of the file is calculated successively asS Byte of data block32 Bit weak roll check code(rolling
checksum), Then, for each calculated weak check code, All taken andB Match the check codes in the file check code list. In our algorithm, Simple3 Layer search check( translator's note：3 Step search process).
First floor inspection, For each32 A bit weak roll check code is calculated16 Bit lengthhash value, And will each216 One suchhash Items in one sheethash surface. According to this16 positionhash Value pair check code list( E.g. receivedB Check code set of file data block) Sort.hash Each item in the table points to the corresponding one in the check code listhash First element of value( translator's note： Data blockID), Or when there is no corresponding in the check code listhash Value time, thishash Table entry will be a null value( translator's note： The reason why there is a null check code and the corresponding nullhash Possible values, Becauseβ Those will beα There areβ The check code of the file not found on is set to null and sent toα, suchα You'll know when you search for filesβ The file is not available on, Then send the entire file directly toβ).
For each offset in the file, It will be calculated32 Bit rolling check code and its16 positionhash value. If thishash Valuedhash Table item is a non null value, Layer 2 check will be called.
( translator's note： In other words, The first level is to compare and match16 Bithash value, If it can be matched, it is considered that the data block has the same potential, Then enter the second layer of check from the location of this data block)
The second level check will scan the sorted check code list, It will be fromhash Where the table entry points to( translator's note： This entry corresponds to the data block that can be matched after the first level check) Start scanning, The purpose is to find out what can match the current value32 Bit rolling check code. When scanning to the same32 Bit weak scrolling check value or until different16 positionhash None of the values match32 Bit weak check code, Scan termination( translator's note： Becausehash The probability of repetition of value and weak check code is very low, So basically scan down again1 Most items2 Item to find unmatched weak roll check codes). If you find a match, Then the third level check is called.
( translator's note： In other words, The second level of checking is to compare and match32 Weak roll check code of bit, If it can match, it means that it has the same potential, Then from this position, you can enter the third level of inspection, If there is no matching weak roll check code, It indicates thehash Duplicate value, But the good thing is that the matching of weak roll check excludes it)
The third level check will calculate the strong check code for the data block of the current offset in the file, And compare it with the strong check code in the check code list. If two strong check codes match, We thinkA Data blocks in andB Data blocks in are exactly the same. Theoretically, These data blocks may be different, But the probability is extremely small, So in the process of practice, We think it's a reasonable assumption.
When we find that we can match the data block on,α WillA The offset of this block in the file and the end offset of the previous matching block are sent toβ, The matching data block will also be sent in theB Index of data block in( translator's note： That is, theID,chunk number). When matching data is found, These data( translator's note： Including reassembly instructions related to data blocks on matching and pure data of data blocks that are not matched between two matching blocks) Will be sent immediately, This allows us to execute communication in parallel with further computation.
If it is found that the data block of the current offset in the file does not match up, The weak check code will scroll down to the next offset address and continue searching( translator's note： That is to say, a byte is scrolled down). If you find a match值时,弱校验码搜索将从匹配到的数据块的终止偏移地址重新开始(译者注：也就是说向下滚动了一个数据块).
使用标准GNU diff程序对这两个tar包进行"diff"操作,结果产生了超过32000行总共2.1 MB的输出.
matches：从A中匹配出B中的某个数据块所匹配的次数.(译者注：相比于下面的false matches,这个是true matches)
false alarms少于true matches次数的1/1000,这说明32位滚动校验码可以很好地检测出不同数据块.