On Linux, the most commonly used tool for generating and slicing files is dd. It is feature-rich, but it cannot extract data in units of lines, and it cannot directly split a file into pieces of a given size or line count (except with a loop). Two other tools, split and csplit, handle these needs easily. csplit is an enhanced version of split.

When processing a large file, an efficient approach is to cut it into several smaller pieces, process the pieces with multiple processes or threads, and then merge the results. The sort command is an example: when it sorts a large input, its underlying algorithm cuts the input into multiple temporary files.
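
The split, process, merge idea can be sketched in a few lines of shell. This is only an illustration, assuming GNU coreutils; the file names (big.txt, chunk_*) and the piece size of 250 lines are made up for the example.

```shell
# Sketch: split a file, sort each piece in parallel, then merge the results.
workdir=$(mktemp -d)              # scratch directory for the demo
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Build a sample "large" file: the numbers 1..1000 in random order.
seq 1 1000 | shuf > big.txt

# 1. Cut the file into pieces of 250 lines each: chunk_aa, chunk_ab, ...
split -l 250 big.txt chunk_

# 2. Sort every piece in its own background process.
for f in chunk_??; do
    sort -n "$f" > "$f.sorted" &
done
wait                              # block until all background sorts finish

# 3. Merge the already sorted pieces into the final result.
sort -n -m chunk_??.sorted > big.sorted
```

The merge step uses sort -m, which only interleaves inputs that are already sorted, so the expensive work happens in the parallel step.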


1.1 dd command

dd reads data from the file specified by if and writes it to the file specified by of. bs sets the block size for reads and writes, and count sets the number of blocks to copy; bs multiplied by count is the total size copied. skip ignores the first N blocks of the if file when reading, and seek skips the first N blocks of the of file when writing.
dd if=/dev/zero of=/tmp/abc.1 bs=1M count=20
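if is the input file and of is the output file. bs accepts the suffixes c (1 byte), w (2 bytes), b (512 bytes), kB (1000 bytes), K (1024 bytes), MB (1000*1000 bytes), M (1024*1024 bytes), GB, G, and so on. Therefore, do not append the letter B to the unit unless you really want multiples of 1000.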
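
The difference between the two unit families is easy to check on small files. A minimal sketch, assuming GNU dd; the output names binary.img and decimal.img are chosen just for this demo.

```shell
# Sketch: compare bs=1K (1024 bytes) with bs=1kB (1000 bytes).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Four blocks of each size; dd's statistics on stderr are silenced.
dd if=/dev/zero of=binary.img  bs=1K  count=4 2>/dev/null
dd if=/dev/zero of=decimal.img bs=1kB count=4 2>/dev/null

binary_size=$(wc -c < binary.img)     # 4 * 1024 = 4096
decimal_size=$(wc -c < decimal.img)   # 4 * 1000 = 4000

echo "bs=1K  count=4 -> $binary_size bytes"
echo "bs=1kB count=4 -> $decimal_size bytes"
```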

Suppose the existing file CentOS.iso is 1.3G in size and needs to be split and then restored, with the first piece being 500M.
dd if=/tmp/CentOS.iso of=/tmp/CentOS1.iso bs=2M count=250

Next, generate the second piece. Its exact size is not known, so do not specify the count option. Because the second piece starts at the 500M mark, the first 500M of CentOS.iso must be skipped. With bs=2M, the number of blocks to skip is 250.
dd if=/tmp/CentOS.iso of=/tmp/CentOS2.iso bs=2M skip=250
Now CentOS.iso = CentOS1.iso + CentOS2.iso, so CentOS[1-2].iso can be concatenated to restore the original.
cat CentOS1.iso CentOS2.iso >CentOS_m.iso
Compare the md5 values of CentOS_m.iso and CentOS.iso; they are identical.
shell> md5sum CentOS_m.iso CentOS.iso
504dbef14aed9b5990461f85d9fdc667  CentOS_m.iso
504dbef14aed9b5990461f85d9fdc667  CentOS.iso
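
The same split-and-restore steps can be tried on a small scale without a real ISO. The sketch below assumes GNU coreutils and uses a 13000-byte random file (orig.bin, a name invented here) as a stand-in for CentOS.iso, with bs=1K and a 5-block first piece.

```shell
# Sketch: split a file into two pieces with dd, then restore it with cat.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Stand-in for the large file: 13000 random bytes.
head -c 13000 /dev/urandom > orig.bin

# Piece 1: the first 5 blocks of 1 KiB (5120 bytes).
dd if=orig.bin of=part1.bin bs=1K count=5 2>/dev/null

# Piece 2: everything after the first 5 blocks; no count is needed.
dd if=orig.bin of=part2.bin bs=1K skip=5 2>/dev/null

# Restore by concatenation and compare byte by byte.
cat part1.bin part2.bin > merged.bin
cmp orig.bin merged.bin && echo "merged.bin is identical to orig.bin"
```

cmp is used here instead of md5sum because it compares the files directly and reports the result through its exit status.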

What about the seek option, and how does it differ from skip? skip ignores the first N blocks of the input when reading, while seek ignores the first N blocks of the output file when writing. If the output file is a.log and seek=2, writing starts at the 3rd block of a.log. If a.log is itself smaller than 2 blocks, the missing part is automatically filled with zero bytes, as if it had been read from /dev/zero.

Therefore, starting from CentOS1.iso, you can turn it back into a file identical to CentOS.iso as follows:
dd if=/tmp/CentOS.iso of=/tmp/CentOS1.iso bs=2M skip=250 seek=250
After the restore, their md5 values are the same.
shell> md5sum CentOS1.iso CentOS.iso
504dbef14aed9b5990461f85d9fdc667  CentOS1.iso
504dbef14aed9b5990461f85d9fdc667  CentOS.iso
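
Both behaviors of seek described above (writing past existing blocks, and zero-filling a gap) can be reproduced on tiny files. A sketch assuming GNU dd on Linux; orig.bin, part1.bin and short.bin are illustrative names.

```shell
# Sketch: restore a truncated copy in place with skip + seek,
# and show that seeking past the end of a short file leaves zero bytes.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

head -c 13000 /dev/urandom > orig.bin
dd if=orig.bin of=part1.bin bs=1K count=5 2>/dev/null

# Skip 5 blocks of the input and 5 blocks of the output, so the
# remainder of orig.bin lands right after the existing 5 KiB.
dd if=orig.bin of=part1.bin bs=1K skip=5 seek=5 2>/dev/null
cmp orig.bin part1.bin && echo "part1.bin now equals orig.bin"

# short.bin holds only 2 bytes; seek=2 with bs=4 starts writing at
# offset 8, so offsets 2..7 are filled with zero bytes.
printf 'AB' > short.bin
printf 'Z' | dd of=short.bin bs=4 seek=2 2>/dev/null
od -An -c short.bin
```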

1.2 split command

split divides a file into several smaller files. Since multiple files are generated, you must specify the unit of splitting (by line count or by file size), and the output files must be named, which involves a prefix and a suffix. If no prefix is given explicitly, the default prefix is "x".

Here is the syntax of the command:
split [OPTION]... [INPUT [PREFIX]]
-a N: generate suffixes of length N; the default is N=2
-b N: put N bytes in each output file, that is, split by file size. Supports K, M, G, T (multiples of 1024) or KB, MB, GB (multiples of 1000), etc.; the default unit is bytes
-l N: put N lines in each output file, that is, split by line count
-d, --numeric-suffixes[=N]: use numeric suffixes instead of the default alphabetic ones. Numbering starts at 0 by default; a different start value N can be given with the long form. For example, two-digit suffixes look like 00/01/02
--additional-suffix=string: append an extra suffix to each output file, for example ".log". Some older versions do not support this option; it is supported on CentOS 7.2
-n CHUNKS: split the file according to CHUNKS. The valid forms of CHUNKS (usage is covered later) are: N, l/N, l/K/N, K/N, r/N, r/K/N
--filter=CMD: do not write the pieces to files; instead, pipe each piece to CMD as its input. If CMD needs a file name, split provides it in the $FILE variable. See the examples later
INPUT: the input file to split; to split standard input, use "-"
PREFIX: the prefix for the output files; if not specified, the default is "x"

1.2.1 Basic Usage

For example, split /etc/fstab by lines, 5 lines per piece, with the prefix "fs_" and numeric suffixes of length 2.
shell> split -l 5 -d -a 2 /etc/fstab fs_
shell> ls
fs_00  fs_01  fs_02
View any one of the pieces.
shell> cat fs_01
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=b2a70faf-aea4-4d8e-8be8-c7109ac9c8b8 /     xfs defaults 0 0
UUID=367d6a77-033b-4037-bbcb-416705ead095 /boot xfs defaults 0 0
The pieces can be reassembled to restore the original. For example, restore the three pieces above to ~/fstab.bak.
shell> cat fs_0[0-2] >~/fstab.bak
After the restore, the contents are exactly the same. You can compare them with md5sum.
shell> md5sum /etc/fstab ~/fstab.bak
29b94c500f484040a675cb4ef81c87bf  /etc/fstab
29b94c500f484040a675cb4ef81c87bf  /root/fstab.bak
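
The same round trip can be reproduced without touching /etc/fstab. A minimal sketch, assuming GNU split; numbers.txt and the part_ prefix are names invented for the demo.

```shell
# Sketch: split a 12-line file into 5-line pieces and restore it.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

seq 1 12 > numbers.txt

# 5 lines per piece, numeric suffixes of length 2: part_00, part_01, part_02.
split -l 5 -d -a 2 numbers.txt part_
wc -l part_*

# Concatenate the pieces in suffix order and verify the result.
cat part_0[0-2] > numbers.bak
cmp numbers.txt numbers.bak && echo "numbers.bak is identical to numbers.txt"
```

The last piece holds only the 2 leftover lines, which is why the restore relies on concatenating the pieces in suffix order rather than on their sizes.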
Data from standard input can also be split and written to pieces. For example:
shell> seq 1 2 15 | split -l 3 -d - new_
shell> ls new*
new_00  new_01  new_02
Each piece can be given an additional suffix. Some older versions of split do not support this option (csplit does), but newer versions of split do. For example, add ".log". Here seq produces 10 lines, so 3 lines per piece yields 4 files.
shell> seq 1 2 20 | split -l 3 -d -a 3 --additional-suffix=".log" - new1_
shell> ls new1*
new1_000.log  new1_001.log  new1_002.log  new1_003.log
1.2.2 Splitting by CHUNKS

The "-n" option of split divides the file by CHUNKS. The related options are:
-e: when used with -n, do not generate empty files. For example, if 5 lines of data are to be cut into 100 files, every file after the 5th would be empty
-u, --unbuffered: when using the r mode of -n, do not buffer the input; copy each line to the output as soon as it is read, so this option may be slow
-n CHUNKS, --number=CHUNKS: split INPUT into CHUNKS output files, where CHUNKS may be:
    N: split into N files by size (the last file may differ in size)
    K/N: write the Kth of N pieces to standard output (the piece is emitted while splitting; no file is written and then read back)
    l/N: split into N files without breaking lines (the files may differ in line count)
    l/K/N: split as with l/N, but write the Kth piece to standard output
    r/N: like l/N, but distribute the lines round-robin: the first line goes to the first file, the second line to the second file, and so on
    r/K/N: split as with r/N, but write the Kth piece to standard output
Here the capital letters K and N are numbers you choose, while l and r are literal letters that select the mode.
This may be hard to follow in the abstract; a few examples make it clear.

Suppose the file 1.txt contains the letters a-z, one letter per line, 26 lines in total. Split this file:

1. Specifying CHUNKS=N or K/N

Split 1.txt evenly into 5 files. 1.txt is 52 bytes in total (each line holds one byte for the letter and one byte for the newline), so each piece is 10 bytes and the last piece is 12 bytes.
shell> split -n 5 1.txt fs_
shell> ls -l
total 24
-rw-r--r-- 1 root root 52 Oct  6 15:23 1.txt
-rw-r--r-- 1 root root 10 Oct  6 16:07 fs_aa
-rw-r--r-- 1 root root 10 Oct  6 16:07 fs_ab
-rw-r--r-- 1 root root 10 Oct  6 16:07 fs_ac
-rw-r--r-- 1 root root 10 Oct  6 16:07 fs_ad
-rw-r--r-- 1 root root 12 Oct  6 16:07 fs_ae
If the K of K/N is also specified, for example 2, the content that would belong to fs_ab is printed instead:
shell> split -n 2/5 1.txt fs_
f
g
h
i
j
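
The relationship between the N form and the K/N form can be checked directly: the piece printed by K/N should be the same bytes as the Kth file written by N. A sketch assuming GNU split; it recreates 1.txt with fold and does not depend on exactly where the installed split places the byte boundaries.

```shell
# Sketch: compare "split -n 5" with "split -n 2/5" on the a-z file.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Recreate 1.txt: the letters a-z, one per line (26 lines, 52 bytes).
echo abcdefghijklmnopqrstuvwxyz | fold -w 1 > 1.txt

# N form: write five pieces fs_aa ... fs_ae.
split -n 5 1.txt fs_
wc -c fs_*

# K/N form: print only the 2nd of 5 pieces; no files are written.
split -n 2/5 1.txt > second.out

# The printed piece should match the 2nd file from the N form.
cmp second.out fs_ab && echo "2/5 matches fs_ab"

# Concatenating the pieces in order restores the original.
cat fs_a? | cmp - 1.txt && echo "pieces reassemble to 1.txt"
```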
2. Specifying CHUNKS=l/N or l/K/N

In this case the file is still divided into N pieces of roughly equal size, but a line is never broken across two pieces. Because every line of 1.txt has the same length, the lines end up evenly distributed: splitting the 26 lines into 5 files gives the first 4 files 5 lines each and the 5th file 6 lines:
shell> split -n l/5 1.txt fs_
shell> wc -l fs_*
  5 fs_aa
  5 fs_ab
  5 fs_ac
  5 fs_ad
  6 fs_ae
 26 total
If K is specified, the content belonging to the Kth piece is printed:
shell> split -n l/2/5 1.txt fs_
f
g
h
i
j
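
The difference between the plain N form and the l/N form shows up as soon as the lines have different lengths. A small sketch, assuming GNU split; words.txt and the bytes_/lines_ prefixes are made up. The file is 24 bytes, so the byte-based split cuts exactly at byte 12, in the middle of the word "three".

```shell
# Sketch: byte-based chunks can cut a line in half, l/N chunks cannot.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf 'one\ntwo\nthree\nfour\nfive\n' > words.txt   # 5 lines, 24 bytes

split -n 2   words.txt bytes_    # two 12-byte pieces
split -n l/2 words.txt lines_    # two pieces made of whole lines

# Show each piece on one line, with newlines displayed as spaces.
for f in bytes_* lines_*; do
    printf '%s: ' "$f"
    tr '\n' ' ' < "$f"
    echo
done
```

In the byte-based output, the first piece ends with "thre" and the second begins with "e", while every lines_ piece ends on a line boundary.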
3. Specifying CHUNKS=r/N or r/K/N

With r, the lines are dealt out round-robin: each line goes to the next file in turn, and after the last file has received a line, the next line goes to the first file again.

Look at the result directly:
shell> split -n r/5 1.txt fs_
shell> head -n 2 fs*
==> fs_aa <==
a
f

==> fs_ab <==
b
g

==> fs_ac <==
c
h

==> fs_ad <==
d
i

==> fs_ae <==
e
j
a goes to the 1st file, b to the 2nd file, c to the 3rd file, and so on.

When K is specified, the content of the Kth piece is printed:
shell> split -n r/2/5 1.txt fs_
b
g
l
q
v
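
The round-robin distribution can be verified by joining each piece back into a single string. A sketch assuming GNU split; the rr_ prefix is arbitrary.

```shell
# Sketch: round-robin split of the a-z file into 5 pieces.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

echo abcdefghijklmnopqrstuvwxyz | fold -w 1 > 1.txt

# r/5: line 1 -> rr_aa, line 2 -> rr_ab, ..., line 6 -> rr_aa again.
split -n r/5 1.txt rr_

# The first piece holds lines 1, 6, 11, 16, 21 and 26.
first_piece=$(tr -d '\n' < rr_aa)
echo "rr_aa: $first_piece"        # afkpuz

# r/2/5 prints the 2nd piece (lines 2, 7, 12, 17, 22) without writing files.
second_piece=$(split -n r/2/5 1.txt | tr -d '\n')
echo "r/2/5: $second_piece"       # bglqv
```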
1.2.3 Passing the split result to a command with --filter

By default, split writes each piece to its own file. With the --filter option, the pieces are no longer written to files; each piece is piped to CMD for processing. CMD may not need to store anything in a file, but if the processed data should still be saved in one file per piece, use $FILE to refer to the name split would have used, just as in the normal splitting mode (this is a variable recognized by split, so keep the shell from expanding it).

For example, read the 26 lines of 1.txt, divide them into 5 pieces, pipe each piece to the command, and print the lines with echo:
shell> split -n l/5 --filter='xargs -i echo ---:{}' 1.txt
---:a
---:b
---:c
---:d
---:e    # 1st pipe
---:f
---:g
---:h
---:i
---:j    # 2nd pipe
---:k
---:l
---:m
---:n
---:o
---:p
---:q
---:r
---:s
---:t
---:u
---:v
---:w
---:x
---:y
---:z

This time split did not generate any piece files. To save the output above into one file per piece, use $FILE. This variable is built into split and must not be expanded by the shell, so $FILE must be protected with single quotes:
shell> split -n l/5 --filter='xargs -i echo ---:{} >$FILE.log' 1.txt fs_
shell> ls
1.txt  fs_aa.log  fs_ab.log  fs_ac.log  fs_ad.log  fs_ae.log
shell> cat fs_aa.log
---:a
---:b
---:c
---:d
---:e
Here $FILE is the name generated by split (fs_aa, fs_ab, ...). In the files above, "fs_" is the prefix and ".log" is the suffix.
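
A compact way to see what $FILE expands to is a filter that transforms each piece and redirects the result. The sketch below assumes GNU split; it uses -l 5 instead of -n so that the piece boundaries are fixed, and the up_ prefix and .up suffix are invented for the demo.

```shell
# Sketch: upper-case every 5-line piece and save it under $FILE.up.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

echo abcdefghijklmnopqrstuvwxyz | fold -w 1 > 1.txt

# Single quotes keep the calling shell from expanding $FILE;
# split sets it to up_aa, up_ab, ... for each run of the filter.
split -l 5 --filter='tr a-z A-Z > $FILE.up' 1.txt up_

ls up_*
cat up_aa.up
```

26 lines at 5 lines per piece give six outputs, up_aa.up through up_af.up, and the last one contains only the letter Z.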

The filter is sometimes very useful, for example to cut one large compressed file into several smaller compressed files.
xz -dc BIG.xz | split -b200G --filter='xz > $FILE.xz' - big-
"-dc" decompresses to standard output. The decompressed data stream is passed to split, and every 200G of it is compressed by the xz command in the filter. The compressed files are named like "big-aa.xz".
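
The same pipeline can be tried at a small scale. This sketch swaps in gzip for xz (an assumption made only so the demo runs on systems without xz) and splits by 250 lines instead of 200G; big.gz and the big- prefix are illustrative.

```shell
# Sketch: re-split one compressed file into several compressed pieces.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Stand-in for the big archive: 1000 lines, compressed.
seq 1 1000 > expected.txt
gzip -c expected.txt > big.gz

# Decompress to stdout, split every 250 lines, compress each piece again.
gzip -dc big.gz | split -l 250 --filter='gzip > $FILE.gz' - big-
ls big-*.gz

# Decompressing the pieces in order yields the original data.
gzip -dc big-*.gz | cmp - expected.txt && echo "pieces match the original"
```

Because each piece is compressed as its own stream, any single piece can be decompressed without the others.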


1.3 csplit command

split can only divide by lines or by size, not by paragraph. csplit is a variant of split with more features; it mainly splits a file into sections according to the specified context.
csplit [OPTION]... FILE PATTERN...
Description: split FILE according to PATTERN into the files "xx00", "xx01", ..., and print the byte count of each piece to standard output.
Options:
-b FORMAT: the suffix format, in printf style. The default is %02d, a 2-digit number padded with 0
-f PREFIX: the prefix; if not specified, the default is "xx"
-k: for unexpected errors; keep the pieces that have already been created even if an error occurs
--suppress-matched: leave the lines matching PATTERN out of the output (some older help texts list this as -m)
-s: (silent) do not print the piece sizes
-z: if any of the pieces is empty, delete it
FILE: the file to split; to split standard input, use "-"
PATTERNs:
INTEGER: a line number. If it is N, lines 1 through N-1 are copied into one piece and the rest goes into the next piece
/REGEXP/[OFFSET]: copy up to, but not including, the matching line into a piece. OFFSET has the form "+N" or "-N" and moves the cut N lines after or before the matching line
%REGEXP%[OFFSET]: skip up to, but not including, the matching line; the skipped lines are not written to any piece
{INTEGER}: if the value is N, repeat the previous pattern N more times
{*}: repeat the previous pattern as many times as possible, until the end of the file
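
The pattern types that the later examples do not cover (INTEGER, %REGEXP% and {INTEGER}) can be tried on a tiny file. A sketch assuming GNU csplit; sample.txt and the int_/skip_/rep_ prefixes are invented for this demo.

```shell
# Sketch: INTEGER, %REGEXP% and {INTEGER} patterns of csplit.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Three "paragraphs" of three lines each (9 lines in total).
printf '%s\n' SERVER-1 a1 a2 SERVER-2 b1 b2 SERVER-3 c1 c2 > sample.txt

# INTEGER: cut before line 4 -> int_00 has lines 1-3, int_01 has lines 4-9.
csplit -s -f int_ sample.txt 4

# %REGEXP%: skip everything before SERVER-2; only the rest is written.
csplit -s -f skip_ sample.txt '%SERVER-2%'

# {1}: apply /SERVER/ once, then one more time (two cuts in total).
# The first cut is before line 1, so rep_00 is empty.
csplit -s -f rep_ sample.txt '/SERVER/' '{1}'

wc -l int_* skip_* rep_*
```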
Suppose the file has the following contents:
shell> cat test.txt
SERVER-1
[connection] success
[connection] failed
[disconnect] pending
[connection] success
SERVER-2
[connection] failed
[connection] failed
[disconnect] success
[CONNECTION] pending
SERVER-3
[connection] pending
[connection] pending
[disconnect] pending
[connection] failed
Suppose each SERVER-n marks one paragraph, so the file needs to be split by paragraph. Use the following command:
shell> csplit -f test_ -b %04d.log test.txt /SERVER/ {*}
0
140
139
140
"-f test_" sets the prefix of the pieces to "test_". "-b %04d.log" sets the suffix format to four zero-padded digits followed by ".log", so every piece automatically gets the extra suffix ".log". "/SERVER/" is the pattern to match: every match starts a new piece, and the matching line becomes the first line of that piece. "{*}" repeats the preceding pattern /SERVER/ as many times as possible, until the end of the file. Without {*}, the pattern is matched only once; {1} would repeat it one more time, that is, two matches in total.
shell> ls test_*
test_0000.log  test_0001.log  test_0002.log  test_0003.log

The file above has only three paragraphs (SERVER-1, SERVER-2 and SERVER-3), yet the split produced 4 pieces, and the first piece is 0 bytes. Why? When the pattern matches a line, that line becomes the first line of the next piece. Because the first line of this file, "SERVER-1", matches /SERVER/, it goes into the next piece, and an empty file is generated before it.

The empty file can be removed with the "-z" option.
shell> csplit -f test1_ -z -b %04d.log test.txt /SERVER/ {*}
140
139
140
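
The effect of -z is easy to see by running the same split twice on a small file. A sketch assuming GNU csplit; sample.txt and the keep_/trim_ prefixes are made up for the demo.

```shell
# Sketch: the empty leading piece, with and without -z.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' SERVER-1 a1 a2 SERVER-2 b1 b2 SERVER-3 c1 c2 > sample.txt

# Without -z: the first line matches, so keep_00.log is created empty.
csplit -s -f keep_ -b '%02d.log' sample.txt '/SERVER/' '{*}'

# With -z: the empty piece is dropped and the numbering starts
# with the first paragraph.
csplit -s -z -f trim_ -b '%02d.log' sample.txt '/SERVER/' '{*}'

wc -c keep_* trim_*
```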
You can also give a line offset to move the cut. For example, with +2 the cut is made two lines after the matching line, so a piece ends with the matching line and the 1 line after it (two lines counting the matching line itself), and the following lines go into the next piece.
shell> csplit -f test2_ -z -b %04d.log test.txt /SERVER/+2 {*}
42
139
140
98
The first piece has only two lines.
shell> cat test2_0000.log
SERVER-1
[connection] success
The rest of the SERVER-1 paragraph goes into the second piece.
shell> cat test2_0001.log
[connection] failed
[disconnect] pending
[connection] success
SERVER-2
[connection] failed
The third piece follows the same pattern, and the last piece holds all the remaining unmatched content.
shell> cat test2_0003.log
[connection] pending
[disconnect] pending
[connection] failed
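
The offset behavior can be checked by counting the lines in each piece of a small file. A sketch assuming GNU csplit; sample.txt and the off_ prefix are illustrative.

```shell
# Sketch: /SERVER/+2 moves every cut two lines past the matching line.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' SERVER-1 a1 a2 SERVER-2 b1 b2 SERVER-3 c1 c2 > sample.txt

csplit -s -f off_ sample.txt '/SERVER/+2' '{*}'

# off_00: SERVER-1 a1          (match on line 1, cut before line 3)
# off_01: a2 SERVER-2 b1       (match on line 4, cut before line 6)
# off_02: b2 SERVER-3 c1       (match on line 7, cut before line 9)
# off_03: c2                   (the remaining unmatched line)
wc -l off_*
```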
Specify the "-s" or "-q" option to run in silent mode; the piece sizes are then not printed.
shell> csplit -q -f test3_ -z -b %04d.log test.txt /SERVER/+2 {*}