sed series of articles:

sed Cultivation Series (1): Flashy Basics, the Beginner's Chapter <http://www.cnblogs.com/f-ck-need-u/p/7488469.html>
sed Cultivation Series (2): The Inner Art (info sed translation + annotations) <http://www.cnblogs.com/f-ck-need-u/p/7478188.html>
sed Cultivation Series (3): Advanced sed, Implementing Sliding Windows <http://www.cnblogs.com/f-ck-need-u/p/7496916.html>
sed Cultivation Series (4): Tricky Problems in sed <http://www.cnblogs.com/f-ck-need-u/p/7499309.html>


1. Using variables and variable substitution in sed


When using sed in scripts, you will often need to reference shell variables inside sed, or even use variable substitution on the sed command line. Many people have run into this problem and found that they cannot get the quotes in the right place no matter how they try. This is not a sed problem but a characteristic of the shell. To solve sed quoting problems, understanding shell quoting helps a great deal. Master this one case and the whole category follows, so you will not be confused later when using awk, mysql, and other tools that do their own parsing.

For example, suppose you want to output the last 5 lines of a.txt. You might write the following commands:
total=`wc -l <a.txt`
sed -n '$((total-4)),$p' a.txt
Unfortunately, this fails. On the one hand, "$" is a special symbol in sed: in an address expression it marks the last line of the input stream. $(()) also contains a "$", so sed tries to parse that symbol itself. On the other hand, $(()) is meant to be evaluated by the shell, not by sed, so it must be exposed to the shell so that the shell can parse it.

First, a recap of single quotes, double quotes, and no quotes in the shell.

* Single quotes: all characters inside single quotes become literal. Note that a single quote cannot appear inside single quotes, not even when escaped with a backslash.
* Double quotes: all characters inside double quotes become literal, except "\", "$", and "`" (backquote). When "!" history expansion is enabled, the exclamation mark is also an exception.
* No quotes: almost equivalent to double quotes, except that brace expansion and tilde expansion are also performed.
The description of double quotes above is not really complete, but it is enough. These are only the literal-level effects. The real job of quotes is to decide which "words" on the command line need to be parsed by the shell and which are literals that the shell must leave alone. For details, see: The shell's command-line parsing process and the eval command <http://www.cnblogs.com/f-ck-need-u/p/7426371.html>.
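As a quick check of these rules, here is a minimal sketch that stores the same text with different quoting and prints what the shell actually keeps. The variable names are illustrative and do not come from the article.

```shell
name="sed"

# Single quotes: everything is literal, nothing is expanded.
single='$name and $((1+2))'

# Double quotes: "$" is still special, so both expansions happen.
double="$name and $((1+2))"

# Double quotes plus a backslash: the escaped "$" stays literal.
escaped="\$name"

printf '%s\n' "$single" "$double" "$escaped"
```

The first line printed is the untouched text, the second is the expanded form, and the third shows that a backslash inside double quotes protects a single "$".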


Clearly, since all characters inside single quotes become literal, the shell parses nothing there: variables in single quotes are not expanded, command substitution and arithmetic expansion are not performed, there is no pathname expansion, and so on. In short, to the shell, characters inside single quotes are all ordinary characters. If some characters need to be parsed by the command being run (a command that does its own parsing), they must be in single quotes. For example, "$", "!" and "{}" all have special meaning in sed. If you want sed to parse them, you must single-quote them; otherwise you get errors or ambiguity. In the following 3 sed statements, all the symbols must be single-quoted to get the correct result.
sed '$d' filename
sed '1!d' filename
sed -n '2{p;q}' filename
Conversely, if you want special characters to be parsed by the shell, they must not be inside single quotes. You can use double quotes or no quotes at all, even though the result may look odd. For example, the arithmetic expansion $(()) above needs to be parsed by the shell, so it must be exposed to the shell. The correct statements are:
sed -n $((total-4))',$p' a.txt
sed -n "$((total-4))"',$p' a.txt
sed -n "$((total-4)),\$p" a.txt
To the eye, the quoting in these statements looks really odd. But the shell does not care about ugly or pretty. It is rigid: it has its own rules for splitting the command line, and it splits exactly as those rules say.
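To see that the three quoting styles really are equivalent, the following sketch runs them against a throwaway 8-line file. The temporary file stands in for a.txt and is an assumption of this example.

```shell
# Build a throwaway 8-line sample file (a stand-in for a.txt).
tmp=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 > "$tmp"

total=$(wc -l < "$tmp")

# In each variant the shell expands $((...)) and sed receives a literal "$p".
out1=$(sed -n $((total-4))',$p' "$tmp")
out2=$(sed -n "$((total-4))"',$p' "$tmp")
out3=$(sed -n "$((total-4)),\$p" "$tmp")

printf '%s\n' "$out1"
rm -f "$tmp"
```

All three variants hand sed the same script, 4,$p, so each prints lines 4 through 8.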

So, for the question of how sed and the shell interact, we can draw the following conclusions:

* Whatever needs to be parsed by the shell goes unquoted or in double quotes;
* When a character is special to both the shell and the command being run, and it should be parsed by sed, put it in single quotes, or escape it with a backslash inside double quotes;
* For characters that are special to neither, any quoting will do.
Thus, a statement that uses command substitution to make sed output the last 5 lines is:
sed -n `expr $(wc -l <a.txt) - 4`',$p' a.txt
In this statement, `expr $(wc -l <a.txt) - 4` must be parsed by the shell, so it must not be in single quotes. The "$" in $p must be interpreted by sed as the last line, so it must be in single quotes to keep the shell from parsing it.

A more complicated case is variable substitution inside a sed regular expression. For example, output a.txt from the line that begins with the string stored in the variable str through the last line.
str="abc"
sed -n /^$str/',$p' a.txt
Because $str is unquoted, the shell replaces it with "abc". (Leaving it unquoted is safe only while the value contains no whitespace or glob characters.) This command can be written in many other ways:
sed -n '/^'$str'/,$p' a.txt
sed -n "/^$str"'/,$p' a.txt
sed -n "/^$str/,\$p" a.txt
sed -n "/^$str/,"'$'p a.txt
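The variants above can be compared side by side. This is a minimal sketch; the four-line sample file is made up for the demonstration and stands in for a.txt.

```shell
# Throwaway sample file standing in for a.txt.
tmp=$(mktemp)
printf '%s\n' 'xyz' 'abc first' 'middle' 'last' > "$tmp"

str="abc"

# Each variant exposes $str to the shell and keeps the final "$" for sed.
r1=$(sed -n /^$str/',$p' "$tmp")
r2=$(sed -n '/^'$str'/,$p' "$tmp")
r3=$(sed -n "/^$str/,\$p" "$tmp")
r4=$(sed -n "/^$str/,"'$'p "$tmp")

printf '%s\n' "$r1"
rm -f "$tmp"
```

Every variant prints the file from the line starting with "abc" to the end, because sed receives the identical script /^abc/,$p each time.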
Here is a harder case of symbol handling in sed. Replace the password field of the last line of /etc/shadow with "$1$123456$wOSEtcyiP2N/IfIl15W6Z0".
# tail -n 1 /etc/shadow
userX:$6$hS4yqJu7WQfGlk0M$Xj/SCS5z4BWSZKN0raNncu6VMuWdUVbDScMYxOgB7mXUj./dXJN0zADAXQUMg0CuWVRyZUu6npPLWoyv8eXPA.::0:99999:7:::
The replacement statements are as follows:
old_pass="$(tail -n 1 /etc/shadow | cut -d':' -f2)"
new_pass='$1$123456$wOSEtcyiP2N/IfIl15W6Z0'
sed -n '$'s%$old_pass%$new_pass%p /etc/shadow
Because old_pass and new_pass contain "/" and "$", the "s" command uses "%" as its delimiter instead. Look closely at old_pass again: it also contains ".", which is a regular expression metacharacter, so the pattern could match other strings as well.
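One way to avoid the metacharacter problem is to escape the variable before handing it to sed. The sketch below is not from the article: the helper names escape_bre and escape_repl are hypothetical, the sample file is invented so that nothing touches the real /etc/shadow, and it assumes the pattern is used as a basic regular expression with "%" as the delimiter.

```shell
# Hypothetical helper: backslash-escape BRE metacharacters and the "%" delimiter.
escape_bre() {
    printf '%s\n' "$1" | sed 's/[][\.*^$%]/\\&/g'
}

# Hypothetical helper: escape what is special in the replacement text
# (backslash, "&", and the "%" delimiter).
escape_repl() {
    printf '%s\n' "$1" | sed 's/[\&%]/\\&/g'
}

# Invented sample data in shadow-like format.
tmp=$(mktemp)
printf '%s\n' 'userW:x::' 'userX:$6$salt$ha.sh/value::' > "$tmp"

old_pass=$(tail -n 1 "$tmp" | cut -d: -f2)
new_pass='$1$123456$wOSEtcyiP2N/IfIl15W6Z0'

old_re=$(escape_bre "$old_pass")
new_re=$(escape_repl "$new_pass")

# '$' is the sed address for the last line; the rest is expanded by the shell.
result=$(sed -n '$'"s%$old_re%$new_re%p" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

With the "." escaped, the pattern matches only the exact old password string, so a line that merely resembles it is left alone.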


2. Backreference failure

When alternation "|" is used in a regular expression, a backreference does not work if the group in parentheses () that it refers to did not take part in the match. For example, (a)\1u|b\1 matches only "aau" and does not match "ba": in the second alternative the group that \1 refers to did not participate in the match, so \1 is invalid there, while in the first alternative \1 works.

This is a property of regex matching, not only of sed; other tools that use the basic and extended regex engines behave the same way.

In addition, a backreference in an "s" command cannot refer to groups outside that "s" command. For example:
echo "ab3456cd" | sed -r "/(ab)/s/([0-9]+)/\1/"
The result is ab3456cd rather than ababcd, and if you use \2 instead, sed reports the error "invalid reference \2 on 's' command's RHS".
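The scoping rule can be checked directly. In this sketch the second command is an assumption about what the user probably wanted (the output ababcd); it moves the group into the "s" command itself.

```shell
# \1 refers to ([0-9]+) inside the s command, not to (ab) in the address,
# so the digits are simply replaced by themselves.
unchanged=$(echo "ab3456cd" | sed -r "/(ab)/s/([0-9]+)/\1/")

# To reuse "ab", capture it inside the s command.
doubled=$(echo "ab3456cd" | sed -r 's/(ab)[0-9]+/\1\1/')

printf '%s\n' "$unchanged" "$doubled"
```

The first command leaves the line as ab3456cd; the second prints ababcd.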


3. How the "-i" option saves files

sed creates a temporary file, writes its output to that file, and then renames the temporary file to the source file's name to save the result. As a consequence, sed ignores the read-only attribute of the file.

Whether files can be renamed, moved in, or deleted is controlled by the permissions of the directory that contains the file. If the directory is read-only, sed cannot save its result with "-i", even if the file itself is writable.
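The temporary-file-and-rename behavior can be observed through the file's inode number. This is a minimal sketch that assumes GNU sed (where "-i" needs no argument) and uses a temporary directory created just for the demonstration.

```shell
# Assumes GNU sed. Work in a throwaway directory.
dir=$(mktemp -d)
printf 'hello world\n' > "$dir/f.txt"
chmod 444 "$dir/f.txt"               # read-only file, writable directory

before=$(ls -i "$dir/f.txt" | awk '{print $1}')
sed -i 's/world/sed/' "$dir/f.txt"   # still succeeds
after=$(ls -i "$dir/f.txt" | awk '{print $1}')
content=$(cat "$dir/f.txt")

printf '%s\n' "$content"
printf 'inode before: %s, after: %s\n' "$before" "$after"
rm -rf "$dir"
```

The inode number changes because the edited file is a new file that was renamed over the old one, which is also why the read-only bit on the original did not get in the way.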


4. The greedy matching problem


Greedy matching means that when a regular expression can match several candidates, it takes the longest one. In the simplest example, given the data "abcdsbaz", the regular expression "a.*b" can match both "ab" and "abcdsb"; because matching is greedy, it takes the longest, "abcdsb".
echo "abcdbaz" | grep -o "a.*b"
abcdb
One shortcoming of basic and extended regular expressions is that they cannot turn off greedy matching. Regex implementations in Perl and other programming languages are more complete: adding a "?" after a repetition such as "*" or "+" explicitly selects lazy matching, for example "a.*?b".
echo "abcdbaz" | grep -P -o "a.*?b"
ab
To work around greedy matching with basic or extended regular expressions, you have to resort to the trick of a negated bracket expression "[^...]". For the example above:
echo "abcdbaz" | grep -o "a[^b]*b"
ab
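The same contrast shows up in a sed substitution. The sketch below replaces whatever each pattern matches with "X", so the part that was consumed is easy to see; the marker "X" is just an illustrative choice.

```shell
# Greedy: "a.*b" swallows everything up to the last "b".
greedy=$(echo "abcdbaz" | sed 's/a.*b/X/')

# Negated bracket expression: "[^b]*" cannot cross a "b",
# so the match ends at the first one.
limited=$(echo "abcdbaz" | sed 's/a[^b]*b/X/')

printf '%s\n' "$greedy" "$limited"
```

The greedy pattern leaves Xaz, while the negated bracket expression leaves Xcdbaz.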

This workaround is not efficient either, because basic and extended regex engines always look for the longest match first and then back off, which is called "backtracking". When "abcdsbaz" is matched against "a[^b]*b", the engine still works in longest-first mode; the match ends at the first "b" only because "[^b]*" is not allowed to cross it, which gives the shortest result.

As another example, each line of /etc/passwd has the following format:
root:x:0:0:root:/root:/bin/bash
How can sed be used to greet every user in /etc/passwd, with output roughly like "hello root", "hello nobody"?

First, you have to extract the first column, the user name. Because every field on a line is separated by a colon, getting just the first field with a regular expression requires working around greedy matching. The statement is:
sed -r 's/^([^:]*):.*/hello \1/' /etc/passwd
Note that sed uses the basic and extended regex engines, so matching is still longest-first; the negated bracket expression "[^:]*" is what stops it at the first colon.

What if you want the first two fields of /etc/passwd? Just repeat the greedy-avoiding pattern as a unit.
sed -r 's/^([^:]*):([^:]*):.*/hello \1 \2/' /etc/passwd
The third field?
sed -r 's/^([^:]*:){2}([^:]*):.*/hello \2/' /etc/passwd
The third and fifth fields? There is no shortcut; the fourth field has to be written out explicitly.
sed -r 's/^([^:]*:){2}([^:]*):([^:]*):([^:]*):.*/hello \2 \4/' /etc/passwd
The third through fifth fields? That is simpler: repeat 3 times.
sed -r 's/^([^:]*:){2}(([^:]*:){3}).*/hello \2/' /etc/passwd

In the end, though, fields 3 to 5 captured this way always carry the ":" separators, including a trailing one. Want to get rid of it? Forget it! sed is not good at handling fields: working around greedy matching makes the expressions hard to read, and it is not efficient either. Using sed to process fields is asking for trouble.
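To make the capture groups concrete, this sketch applies the field-extraction expressions to a single sample line instead of the real /etc/passwd. The sample line is the root entry shown earlier; the variable names are illustrative, and the "fields 3 and 5" expression is written with a trailing ".*" so that the rest of the line is discarded.

```shell
line='root:x:0:0:root:/root:/bin/bash'

first=$(printf '%s\n' "$line" | sed -r 's/^([^:]*):.*/hello \1/')
first_two=$(printf '%s\n' "$line" | sed -r 's/^([^:]*):([^:]*):.*/hello \1 \2/')
third=$(printf '%s\n' "$line" | sed -r 's/^([^:]*:){2}([^:]*):.*/hello \2/')
third_fifth=$(printf '%s\n' "$line" | sed -r 's/^([^:]*:){2}([^:]*):([^:]*):([^:]*):.*/hello \2 \4/')
third_to_fifth=$(printf '%s\n' "$line" | sed -r 's/^([^:]*:){2}(([^:]*:){3}).*/hello \2/')

printf '%s\n' "$first" "$first_two" "$third" "$third_fifth" "$third_to_fifth"
```

The last result is "hello 0:0:root:", which shows the trailing separator that comes along when a repeated group is captured as a whole.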


5. The tangle between the sed commands "a" and "N"

The "a" command queues the supplied text in memory; then, when the contents of the pattern space are output, the queued text is appended to the end of the output stream.

For example, insert the line "matched successful" after the line matching "ccc".
echo -e "aaa\nbbb\nccc\nddd" | sed '/ccc/a matched successful'
aaa
bbb
ccc
matched successful
ddd
Used this way, "a" works fine. But try combining it with "N":
echo -e "aaa\nbbb\nccc\nddd" | sed '/ccc/{a\
matched successful
;N}'
aaa
bbb
matched successful
ccc
ddd

Wasn't the text supposed to be appended at the end? Why did it land before the matching line? And even if "N" reads the next line, shouldn't the text then appear after "ddd"? To really understand this, you need a clear picture of how sed outputs the pattern space; see sed Cultivation Series (1): Flashy Basics, the Beginner's Chapter <http://www.cnblogs.com/f-ck-need-u/p/7488469.html>. Here is a brief description of the output mechanism of the "N" command.


Whether sed reads the next line automatically or an "n" or "N" command reads it, whenever a read happens, the contents of the pattern space must be output first. When "N" is about to read the next line, it first checks whether there is a next line to read. If there is, it locks the pattern space, then performs the auto-print and the clearing of the pattern space, then unlocks the pattern space and appends a newline "\n" to its end, and finally reads the next line and appends it after that newline. Because the pattern space is locked, the auto-print outputs an empty stream, and the pattern space cannot be cleared either. Note that this is not the same as suppressed output. The visible result is the same, but outputting an empty stream still performs an output action, with an output stream written to standard output, whereas suppressed output performs no output action at all. If there is no next line to read, sed auto-prints the pattern space, clears it, and exits. The process can be described as follows:
if [ "$line" -ne "$last_line_num" ];then
    lock pattern_space;
    auto_print;
    remove_pattern_space;
    unlock pattern_space;
    append "\n" to pattern_space;
    read next_line to pattern_space;
else
    auto_print;
    remove_pattern_space;
    exit;
fi

Back to the problem of combining "a" with "N". The text queued by "a" ends up before the matching line because of that empty output stream. When "N" prepares to read the next line, it performs an output action, even though what it outputs is empty. The "a" command is waiting for sed's output stream: as soon as there is one, its text is appended to the end of it. So "matched successful" is appended to the end of the empty stream; after that, "N" reads the next line, and finally the contents of the pattern space, "ccc\nddd", are output. That is how we got the earlier "unexpected" result.
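Both behaviors can be reproduced in a few lines. This sketch uses the multi-line "a\" form and printf instead of echo -e, which is a portability choice of this example rather than something the article requires.

```shell
input=$(printf 'aaa\nbbb\nccc\nddd')

# "a" alone: the queued text comes out right after the matching line.
plain=$(printf '%s\n' "$input" | sed '/ccc/a\
matched successful')

# "a" followed by "N": the queued text is flushed when N fetches the
# next line, before the pattern space "ccc\nddd" is printed.
with_N=$(printf '%s\n' "$input" | sed '/ccc/{a\
matched successful
N
}')

printf '%s\n---\n%s\n' "$plain" "$with_N"
```

In the second result "matched successful" appears between "bbb" and "ccc", matching the output shown earlier.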


6. The twists of exclamation-mark negation in sed

You know that "!" negates, but you may not have noticed that the exclamation mark can be placed after the address expression and can also be placed in front of the command. Both are negations, but their meanings differ, and so do the results.

* Exclamation mark after the address expression: filters lines. Lines that satisfy the condition do not execute the command; lines that do not satisfy it do.
* Exclamation mark in front of the command: lines that satisfy the condition do not execute the command, and lines that do not satisfy it do not execute it either (they were never matched). This filters the command for lines already in the pattern space.
Suppose the file a.txt contains 3 lines:
djkaldahsdf
abcskdf2das
chhdsjaj
Consider the following three sed scripts:

* (1) /^abc/!{d}
* (2) /^abc/{!d}
* (3) /^abc/!d

In example (1) the exclamation mark follows the address expression, meaning that lines that do not begin with "abc" execute the d command. Lines that do begin with "abc" do not match the negated address expression, so the d command is not executed for them. In other words, this sed script deletes every line except those beginning with "abc", so only line 2 is output.


In example (2) the exclamation mark is in front of the command rather than after the address expression. Lines beginning with "abc" therefore do not execute d. Lines not beginning with "abc" do not satisfy the address, so they do not execute d either. In other words, the d command in this script is redundant: no line is deleted, and all lines are output.

Example (3) is equivalent to example (1): address matching takes precedence over command execution, so the exclamation mark is treated as part of the address expression.


In every case, for lines that do not satisfy the address expression (an exclamation mark after the address counts as part of it), none of the subsequent commands are executed. These lines go straight to auto-print, and the "-n" option controls whether they are output.
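The three scripts can be run against the sample file to confirm the analysis. This is a minimal sketch; the temporary file stands in for a.txt.

```shell
# Throwaway copy of the three-line sample file.
tmp=$(mktemp)
printf '%s\n' djkaldahsdf abcskdf2das chhdsjaj > "$tmp"

r1=$(sed '/^abc/!{d}' "$tmp")   # (1) "!" after the address
r2=$(sed '/^abc/{!d}' "$tmp")   # (2) "!" in front of the command
r3=$(sed '/^abc/!d' "$tmp")     # (3) same as (1)

printf '(1) %s\n' "$r1"
printf '(2) %s\n' "$r2"
printf '(3) %s\n' "$r3"
rm -f "$tmp"
```

Scripts (1) and (3) keep only the line beginning with "abc", while script (2) prints all three lines unchanged.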


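7. sed hangs at 100% CPU

Some people may have run into this problem, especially when using sed to process database files exported in UTF-8.

The cause is a character set problem; more precisely, the local environment (locale) and the file's encoding are inconsistent.

If this happens, set the LC_COLLATE and LC_CTYPE environment variables to C. You can also simply set LANG=C or LC_ALL=C.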
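The locale can be overridden for a single command by prefixing the assignment, which avoids changing the rest of the session. The following is a minimal sketch; the two-line UTF-8 sample file is invented for the demonstration.

```shell
# Invented sample: a small UTF-8 encoded export.
sample=$(mktemp)
printf 'id,name\n1,caf\303\251\n' > "$sample"

# LC_ALL=C applies only to this sed invocation; sed then treats the
# input as plain bytes instead of decoding multibyte characters.
fixed=$(LC_ALL=C sed 's/,/;/' "$sample")
first=$(printf '%s\n' "$fixed" | head -n 1)

printf '%s\n' "$fixed"
rm -f "$sample"
```

To apply the setting to every command in a script instead, export it once near the top, for example with export LC_ALL=C.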