On Linux, the most commonly used tool for generating and slicing files is dd. It is full-featured, but it cannot extract data line by line, and it cannot directly split a file by size or by number of lines (except by looping). Two other splitting tools, split and csplit, meet these requirements easily. csplit is an enhanced variant of split.

When dealing with a large file, a very efficient approach is to cut it into multiple small segments, process the segments with multiple processes or threads, and finally merge the results. The sort command is an example: when it sorts large input, the underlying algorithm cuts the data into multiple temporary small files.
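The split, process, merge idea can be sketched in a few lines of shell. This is only an illustration under assumptions: the input file big.txt, the part_ prefix, and the choice of summing numbers with awk are all hypothetical stand-ins for a real workload.

```shell
# Sketch: split a file, process the pieces in parallel, merge the results.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input: the numbers 1..1000, one per line.
seq 1 1000 > big.txt

# Cut into pieces of 250 lines each: part_aa, part_ab, ...
split -l 250 big.txt part_

# One background job per piece; each writes its partial sum to its own file.
for piece in part_??; do
    awk '{ s += $1 } END { print s }' "$piece" > "$piece.sum" &
done
wait    # block until every background job has finished

# Merge step: add up the partial results.
total=$(cat part_??.sum | awk '{ s += $1 } END { print s }')
echo "total=$total"

cd / && rm -rf "$workdir"
```

Each piece is handled by an independent process, so the approach only pays off when the per-piece work is expensive compared with the cost of splitting and merging.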


1.1 dd command

dd reads data from the file given by if and writes it to the file given by of. bs sets the block size for reads and writes, and count sets the number of blocks to copy; bs multiplied by count is the total size. skip ignores the first N blocks of the if file when reading, and seek skips the first N blocks of the of file when writing.
dd if=/dev/zero of=/tmp/abc.1 bs=1M count=20
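if is the input file and of is the output file. bs accepts the suffixes c (1 byte), w (2 bytes), b (512 bytes), kB (1000 bytes), K (1024 bytes), MB (1000*1000 bytes), M (1024*1024 bytes), GB, G, and so on. Therefore, do not append the letter B to a unit unless you really want the decimal (power-of-1000) size.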
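The difference between the two unit families is easy to check. The sketch below assumes GNU dd; the file names binary.img and decimal.img are made up for the illustration.

```shell
# Sketch: compare the binary (M) and decimal (MB) block size suffixes.
set -eu
workdir=$(mktemp -d)

# M is 1024*1024 bytes, MB is 1000*1000 bytes.
dd if=/dev/zero of="$workdir/binary.img"  bs=1M  count=2 2>/dev/null
dd if=/dev/zero of="$workdir/decimal.img" bs=1MB count=2 2>/dev/null

binary_size=$(wc -c < "$workdir/binary.img")
decimal_size=$(wc -c < "$workdir/decimal.img")
echo "bs=1M  count=2 -> $binary_size bytes"
echo "bs=1MB count=2 -> $decimal_size bytes"

rm -rf "$workdir"
```

With count=2 the first file is 2097152 bytes and the second 2000000 bytes, which is exactly bs multiplied by count in each unit.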

Suppose the existing file CentOS.iso is 1.3G in size and needs to be split and later restored. The first piece is to be 500M.
dd if=/tmp/CentOS.iso of=/tmp/CentOS1.iso bs=2M count=250

Now generate the second piece. Its exact size is not known in advance, so omit the count option. Because the second piece starts at the 500M mark, the first 500M of CentOS.iso must be skipped. With bs=2M, skip must therefore be 250 blocks.
dd if=/tmp/CentOS.iso of=/tmp/CentOS2.iso bs=2M skip=250
Now CentOS.iso = CentOS1.iso + CentOS2.iso, so the original can be restored by concatenating CentOS[1-2].iso.
cat CentOS1.iso CentOS2.iso >CentOS_m.iso
Compare the md5 values of CentOS_m.iso and CentOS.iso; they are exactly the same.
shell> md5sum CentOS_m.iso CentOS.iso
504dbef14aed9b5990461f85d9fdc667  CentOS_m.iso
504dbef14aed9b5990461f85d9fdc667  CentOS.iso
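The same split-and-restore procedure can be tried on a small file. This is a minimal sketch: orig.txt, piece1.bin, piece2.bin and merged.bin are hypothetical names, and a few megabytes of seq output stand in for the ISO image.

```shell
# Sketch: cut a file into two pieces with dd, then restore it with cat.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input: its size is not a multiple of the block size,
# just as the 1.3G image is not a multiple of 2M.
seq 1 1000000 > orig.txt

# First piece: the first 2 blocks (bs * count = 2 MiB).
dd if=orig.txt of=piece1.bin bs=1M count=2 2>/dev/null
# Second piece: everything after the first 2 blocks; no count needed.
dd if=orig.txt of=piece2.bin bs=1M skip=2 2>/dev/null

# Concatenate and compare byte for byte.
cat piece1.bin piece2.bin > merged.bin
if cmp -s orig.txt merged.bin; then result=identical; else result=different; fi
echo "result=$result"

cd / && rm -rf "$workdir"
```

cmp is used here instead of md5sum because only equality matters; either check works.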

What about the seek option, and how does it differ from skip? skip ignores the first N blocks when reading, while seek skips the first N blocks of the output file when writing. If the output file is a.log and seek=2 is given, writing starts at the third block of a.log. If a.log itself is shorter than 2 blocks, the missing part is filled with zeros, as if it had been read from /dev/zero.

Therefore, starting from the existing CentOS1.iso, you can turn it into a file identical to CentOS.iso as follows:
dd if=/tmp/CentOS.iso of=/tmp/CentOS1.iso bs=2M skip=250 seek=250
After restoring, their md5 values are the same.
shell> md5sum CentOS1.iso CentOS.iso
504dbef14aed9b5990461f85d9fdc667  CentOS1.iso
504dbef14aed9b5990461f85d9fdc667  CentOS.iso
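Both behaviours of seek can be observed on small files. The sketch below is an illustration only: orig.txt, copy.txt and gap.bin are hypothetical names, and the 4-byte block size in the second part is chosen just to keep the output short.

```shell
# Sketch: seek when restoring a partial copy, and seek past the end of a file.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

seq 1 1000000 > orig.txt                                # hypothetical input
dd if=orig.txt of=copy.txt bs=1M count=2 2>/dev/null    # copy only the first 2 blocks

# Skip 2 blocks when reading and 2 blocks when writing: the tail of
# orig.txt lands right after the 2 blocks already present in copy.txt.
dd if=orig.txt of=copy.txt bs=1M skip=2 seek=2 2>/dev/null
if cmp -s orig.txt copy.txt; then restored=yes; else restored=no; fi

# Seek 2 blocks into a file that does not exist yet: the gap reads as zeros.
printf 'DATA' | dd of=gap.bin bs=4 seek=2 2>/dev/null
gap_size=$(wc -c < gap.bin)
gap_hex=$(od -An -tx1 gap.bin | tr -d ' \n')
echo "restored=$restored gap_size=$gap_size gap_hex=$gap_hex"

cd / && rm -rf "$workdir"
```

In the second part the file ends up 12 bytes long: 8 zero bytes for the two skipped blocks, followed by the 4 bytes that were written.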

1.2 split command

The split tool cuts a file into several small files. Since multiple small files are generated, the unit of splitting must be specified; split supports splitting by line count and by file size. The naming of the small files also has to be settled, that is, their prefix and suffix. If no prefix is given explicitly, the default prefix is "x".

Here is the syntax of the command:
split [OPTION]... [INPUT [PREFIX]]
-a N: generate suffixes of length N; the default is N=2.
-b N: each small file is N in size, that is, the file is split by size. Supports K, M, G, T (units of 1024) or KB, MB, GB (units of 1000), and so on. The default unit is bytes.
-l N: each small file has N lines, that is, the file is split by line.
-d: use numeric suffixes instead of the default alphabetic ones, starting from 0. To start from another value N, use the long form --numeric-suffixes=N. For example, two-digit suffixes such as 01/02/03.
--additional-suffix=string: append an extra suffix to each small file, for example ".log". Some older versions do not support this option; it is already supported on CentOS 7.2.
-n CHUNKS: split the file according to CHUNKS. The valid forms of CHUNKS are (see below for usage): N, K/N, l/N, l/K/N, r/N, r/K/N.
--filter=CMD: instead of writing the pieces directly to files, pipe each piece into CMD as its input. If CMD needs a file name, split automatically provides the $FILE variable. See the examples below.
INPUT: the input file to split. To split standard input, use "-".
PREFIX: the prefix of the small files. If not specified, the default is "x".

1.2.1 Basic Usage

For example, split /etc/fstab by line, cutting every 5 lines, with "fs_" as the prefix of the small files and a numeric suffix of length 2.
shell> split -l 5 -d -a 2 /etc/fstab fs_
shell> ls
fs_00  fs_01  fs_02
View any one of the small files.
shell> cat fs_01
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=b2a70faf-aea4-4d8e-8be8-c7109ac9c8b8 /     xfs defaults 0 0
UUID=367d6a77-033b-4037-bbcb-416705ead095 /boot xfs defaults 0 0
These small files can be reassembled to restore the original. For example, restore the three small files above to ~/fstab.bak.
shell> cat fs_0[0-2] >~/fstab.bak
After restoring, the contents are completely identical. You can compare them with md5sum.
shell> md5sum /etc/fstab ~/fstab.bak
29b94c500f484040a675cb4ef81c87bf  /etc/fstab
29b94c500f484040a675cb4ef81c87bf  /root/fstab.bak
Data from standard input can also be split and written to small files. For example:
shell> seq 1 2 15 | split -l 3 -d - new_
shell> ls new*
new_00  new_01  new_02
Each small file can be given an additional suffix. Some older versions of split do not support this option, although csplit offers the equivalent; newer versions of split support it. For example, add ".log".
shell> seq 1 2 20 | split -l 3 -d -a 3 --additional-suffix=".log" - new1_
shell> ls new1*
new1_000.log  new1_001.log  new1_002.log  new1_003.log
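The examples above all split by line; splitting by size with -b works the same way. The following is a minimal sketch with hypothetical names (numbers.txt, the chunk_ prefix, rebuilt.txt). Note that size-based pieces may end in the middle of a line.

```shell
# Sketch: split by size with -b, then reassemble and verify.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

seq 1 1000 > numbers.txt       # hypothetical input, 3893 bytes

# -b 1K: every piece is 1024 bytes, except possibly the last one.
split -b 1K -d -a 3 numbers.txt chunk_
pieces=$(ls chunk_* | wc -l)

# The shell expands chunk_* in sorted order, so cat restores the original.
cat chunk_* > rebuilt.txt
if cmp -s numbers.txt rebuilt.txt; then same=yes; else same=no; fi
echo "pieces=$pieces same=$same"

cd / && rm -rf "$workdir"
```

Fixed-length numeric suffixes keep the glob order identical to the split order, which is what makes the plain cat restore safe.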
1.2.2 Splitting by CHUNKS

The "-n" option of split cuts the file according to CHUNKS. The related options are:
'-e': when using -n, do not generate empty files. For example, with 5 lines of data and a request for 100 files, every file after the first 5 would obviously be empty.
'-u', '--unbuffered': when using the r mode of -n, do not buffer the input; copy each line to the output as soon as it is read, so this option may be slow.
'-n CHUNKS', '--number=CHUNKS': split INPUT into CHUNKS output files, where CHUNKS may be:
N: split into N files by file size (the last file may differ in size)
K/N: print the content of the Kth of N files to standard output (the file is not cut first and then looked up; the piece is printed as soon as it is cut)
l/N: split into N files without breaking lines (the last file may have a different number of lines)
l/K/N: cut as with l/N, but print the content that belongs to the Kth file
r/N: like the l form, but lines are distributed round-robin: the first line goes to the first file, the second line to the second file, and so on
r/K/N: cut as with r/N, but print the content of the Kth file
Here the capital letters K and N are values you choose as needed, while l and r are literal letters that name the mode.
This may not be easy to grasp in the abstract; a few examples make it clear.

Suppose the file 1.txt contains the letters a-z, one letter per line, 26 lines in total. Split this file:

1. Specifying CHUNKS=N or K/N

Split 1.txt evenly into 5 files. 1.txt is 52 bytes in total (on every line, one byte for the letter and one byte for the newline), so each file after splitting is 10 bytes and the last file is 12 bytes.
shell> split -n 5 1.txt fs_
shell> ls -l
total 24
-rw-r--r-- 1 root root 52 Oct  6 15:23 1.txt
-rw-r--r-- 1 root root 10 Oct  6 16:07 fs_aa
-rw-r--r-- 1 root root 10 Oct  6 16:07 fs_ab
-rw-r--r-- 1 root root 10 Oct  6 16:07 fs_ac
-rw-r--r-- 1 root root 10 Oct  6 16:07 fs_ad
-rw-r--r-- 1 root root 12 Oct  6 16:07 fs_ae
If the K of K/N is also given, for example 2, the content that belongs to fs_ab is printed:
shell> split -n 2/5 1.txt fs_
f
g
h
i
j
2. Specifying CHUNKS=l/N or l/K/N

In this mode the file is divided into N pieces of roughly equal size without breaking any line. Because every line of 1.txt has the same length, this amounts to dividing the lines evenly. For example, splitting the 26 lines of 1.txt into 5 files gives the first 4 files 5 lines each and the 5th file 6 lines:
shell> split -n l/5 1.txt fs_
shell> wc -l fs_*
  5 fs_aa
  5 fs_ab
  5 fs_ac
  5 fs_ad
  6 fs_ae
 26 total
If K is given, the content that belongs to the Kth file is printed:
shell> split -n l/2/5 1.txt fs_
f
g
h
i
j
3. Specifying CHUNKS=r/N or r/K/N

With r, lines are distributed round-robin: each line goes to the next file in turn, and once every file has received a line, distribution returns to the first file.

See the results directly:
shell> split -n r/5 1.txt fs_
shell> head -n 2 fs*
==> fs_aa <==
a
f

==> fs_ab <==
b
g

==> fs_ac <==
c
h

==> fs_ad <==
d
i

==> fs_ae <==
e
j
a goes to the 1st file, b to the 2nd file, c to the 3rd file, and so on.

When K is given, the content of the Kth file is printed:
shell> split -n r/2/5 1.txt fs_
b
g
l
q
v
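The round-robin mode can be reproduced with a short script. This is a sketch under assumptions: the rr_ and sz_ prefixes are hypothetical, and GNU split is assumed. Where exactly the size-based forms (N, l/N) place their cut points may differ between split versions, so the sketch only checks properties that do not depend on them.

```shell
# Sketch: round-robin chunks, plus a sanity check of size-based chunks.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Rebuild the 26-line example file: one letter per line.
printf '%s\n' a b c d e f g h i j k l m n o p q r s t u v w x y z > 1.txt

# r/5: line 1 goes to rr_aa, line 2 to rr_ab, ..., line 6 to rr_aa again.
split -n r/5 1.txt rr_
first_file=$(paste -s -d ' ' rr_aa)

# r/2/5: print only what would land in the 2nd file; nothing is written.
second_chunk=$(split -n r/2/5 1.txt | paste -s -d ' ' -)

# -n 5: always 5 files, and concatenating them gives back the input.
split -n 5 1.txt sz_
size_files=$(ls sz_* | wc -l)
if cat sz_* | cmp -s - 1.txt; then lossless=yes; else lossless=no; fi

echo "rr_aa: $first_file"
echo "r/2/5: $second_chunk"
echo "size_files=$size_files lossless=$lossless"

cd / && rm -rf "$workdir"
```

Note that concatenating round-robin pieces does not restore the original line order; only the size-based and l forms can be put back together with a plain cat.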
1.2.3 Passing the split result to a command with --filter=CMD

By default split writes each piece to its own file. With the --filter option, the pieces are no longer written to files but are passed to CMD for processing. CMD may not need to store any data in a file at all. If the processed data should still be saved piece by piece into multiple files, use $FILE to refer to the output name (this is a variable recognized by split, so keep the shell from expanding it), just as in the normal mode of split.

For example, cut the 26 lines of 1.txt into 5 pieces of about 5 lines each, send each piece through the pipe once, and print the lines with echo:
shell> split -n l/5 --filter='xargs -i echo ---:{}' 1.txt
---:a
---:b
---:c
---:d
---:e     # 1st pipeline
---:f
---:g
---:h
---:i
---:j     # 2nd pipeline
---:k
---:l
---:m
---:n
---:o
---:p
---:q
---:r
---:s
---:t
---:u
---:v
---:w
---:x
---:y
---:z

In this case split generates no small files. To save the output above into small files, use $FILE. This variable is set by split itself and must not be expanded by the shell, so any command containing $FILE must be protected with single quotes:
shell> split -n l/5 --filter='xargs -i echo ---:{} >$FILE.log' 1.txt fs_
shell> ls
1.txt  fs_aa.log  fs_ab.log  fs_ac.log  fs_ad.log  fs_ae.log
shell> cat fs_aa.log
---:a
---:b
---:c
---:d
---:e
Here $FILE is the name that split would have given the piece. The small files above all have the prefix fs_ and the suffix ".log".

The filter is sometimes very useful, for example to cut one large compressed file into several smaller compressed files.
xz -dc BIG.xz | split -b200G --filter='xz > $FILE.xz' - big-
"-dc" means decompress to standard output. The decompressed data stream is passed to split, and every 200G of it is handed to the xz command in the filter for compression. The compressed files are named like "big-aa.xz".
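The same recompression pattern can be tried at small scale. The sketch below is an assumption-laden stand-in: it uses gzip instead of xz, splits by 300 lines instead of 200G, and the names numbers.txt and part- are hypothetical. It assumes a GNU split that supports --filter.

```shell
# Sketch: cut one compressed file into several smaller compressed files.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

seq 1 1000 > numbers.txt                 # hypothetical input
gzip -c numbers.txt > numbers.txt.gz

# Single quotes keep the shell from expanding $FILE; split sets it
# for each piece (part-aa, part-ab, ...) before running the filter.
gzip -dc numbers.txt.gz | split -l 300 --filter='gzip > $FILE.gz' - part-

pieces=$(ls part-*.gz | wc -l)

# Decompressing the pieces in glob order gives back the original data.
if gzip -dc part-*.gz | cmp -s - numbers.txt; then same=yes; else same=no; fi
echo "pieces=$pieces same=$same"

cd / && rm -rf "$workdir"
```

The uncompressed data only ever exists inside the pipe, which is the point of using the filter for very large archives.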


1.3 csplit command

split can only cut by line or by size; it cannot cut by paragraph. csplit is a variant of split with more features. It mainly cuts a file into paragraphs according to the specified context.
csplit [OPTION]... FILE PATTERN...
Description: split the file into "xx00", "xx01", ... according to PATTERN, and print the byte count of each small file to standard output.
Options:
-b FORMAT: specify the suffix format as a printf format. The default is %02d, meaning a 2-digit numeric suffix padded with 0.
-f PREFIX: specify the prefix. If not specified, the default is "xx".
-k: for unexpected situations. Even if an error occurs, do not delete the small files that have already been produced.
-m: explicitly suppress the lines of the file that match PATTERN.
-s: (silent) do not print the small file sizes.
-z: if any of the resulting small files are empty, delete them.
FILE: the file to cut. To split data from standard input, use "-".
PATTERNs:
INTEGER: a number. If it is N, lines 1 to N-1 are copied into one small file and the rest goes into another small file.
/REGEXP/[OFFSET]: copy up to, but not including, the matching line into a small file. OFFSET has the form "+N" or "-N" and moves the cut point N lines after or before the matching line.
%REGEXP%[OFFSET]: skip up to, but not including, the matching line; the skipped lines are not written to any file.
{INTEGER}: if the value is N, repeat the previous pattern N more times.
{*}: repeat the previous pattern until the end of the file.
Assume the file has the following content:
shell> cat test.txt
SERVER-1
[connection] success
[connection] failed
[disconnect] pending
[connection] success
SERVER-2
[connection] failed
[connection] failed
[disconnect] success
[CONNECTION] pending
SERVER-3
[connection] pending
[connection] pending
[disconnect] pending
[connection] failed
Suppose each SERVER-n marks a paragraph and the file must be split by paragraph. Use the following command:
shell> csplit -f test_ -b %04d.log test.txt /SERVER/ {*}
0
140
139
140
"-f test_" sets the small file prefix to "test_".
"-b %04d.log" sets the suffix format, giving suffixes such as "0000.log" and "0001.log"; it also appends the extra suffix ".log" to every small file.
"/SERVER/" is the matching pattern. Every match starts a new small file, and the matching line belongs to that small file.
"{*}" repeats the previous pattern /SERVER/ until the end of the file. If {*} is omitted, matching stops after the first successful match; {N} repeats the pattern N more times.
shell> ls test_*
test_0000.log  test_0001.log  test_0002.log  test_0003.log

The file above contains only three paragraphs (SERVER-1, SERVER-2, SERVER-3), yet the split produced 4 small files, and the first of them is 0 bytes. Why? Because every time the pattern matches a line, that line becomes the first line of the next small file. The first line of this file, "SERVER-1", is matched by /SERVER/, so it starts the next small file, and an empty file is generated in front of it.

The generated empty file can be removed with the "-z" option.
shell> csplit -f test1_ -z -b %04d.log test.txt /SERVER/ {*}
140
139
140
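The effect of -z can be checked on a smaller file. In this sketch demo.txt, its contents, and the raw_ and sec_ prefixes are all hypothetical; the pattern and options mirror the ones used above.

```shell
# Sketch: paragraph splitting with csplit, with and without -z.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical log with three sections, each introduced by a SERVER-n line.
printf '%s\n' SERVER-1 up down SERVER-2 up SERVER-3 down up up > demo.txt

# Without -z the first piece is empty, because line 1 already matches.
# '{*}' is quoted so the shell cannot treat it as a glob.
csplit -s -f raw_ -b '%02d.log' demo.txt /SERVER/ '{*}'
raw_count=$(ls raw_* | wc -l)
first_size=$(wc -c < raw_00.log)

# With -z the empty piece is dropped and numbering starts at the first section.
csplit -s -z -f sec_ -b '%02d.log' demo.txt /SERVER/ '{*}'
sec_count=$(ls sec_* | wc -l)
sec_first_line=$(head -n 1 sec_00.log)

echo "raw_count=$raw_count first_size=$first_size"
echo "sec_count=$sec_count sec_first_line=$sec_first_line"

cd / && rm -rf "$workdir"
```

The -s option is used only to keep the byte counts off the screen; it does not change which files are produced.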
You can also give a line offset to control how much is copied. For example, when a line matches, copy just the 1 line after it (two lines in total, including the matching line itself); the remaining lines go into the next small file.
shell> csplit -f test2_ -z -b %04d.log test.txt /SERVER/+2 {*}
42
139
140
98
The first small file has only two lines.
shell> cat test2_0000.log
SERVER-1
[connection] success
The rest of the SERVER-1 paragraph goes into the second small file.
shell> cat test2_0001.log
[connection] failed
[disconnect] pending
[connection] success
SERVER-2
[connection] failed
The third small file follows the same pattern, and the last small file holds all the remaining unmatched content.
shell> cat test2_0003.log
[connection] pending
[disconnect] pending
[connection] failed
Specify the "-s" or "-q" option to run in silent mode; the small file sizes are then not printed.
shell> csplit -q -f test3_ -z -b %04d.log test.txt /SERVER/+2 {*}
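The INTEGER and %REGEXP% pattern forms from the option list are not exercised above. The following sketch shows them on a small hypothetical file (demo.txt, with made-up contents and the made-up prefixes num_ and skip_).

```shell
# Sketch: the INTEGER and %REGEXP% pattern forms of csplit.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical 9-line input.
printf '%s\n' SERVER-1 up down SERVER-2 up SERVER-3 down up up > demo.txt

# INTEGER pattern 4: lines 1..3 go to the first piece, line 4 onward to the second.
csplit -s -f num_ demo.txt 4
num_first=$(wc -l < num_00)
num_second=$(wc -l < num_01)

# %REGEXP% skips everything before the match without writing it anywhere;
# the rest of the file (from SERVER-3 on) becomes the only piece.
csplit -s -f skip_ demo.txt '%SERVER-3%'
skip_count=$(ls skip_* | wc -l)
skip_first_line=$(head -n 1 skip_00)

echo "num_first=$num_first num_second=$num_second"
echo "skip_count=$skip_count skip_first_line=$skip_first_line"

cd / && rm -rf "$workdir"
```

Patterns can also be combined on one command line, in which case they are applied in order from left to right.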