This article only covers some common pitfalls of regular expressions. To learn regular expressions themselves, see the following two articles:

Basic regular expressions: https://www.cnblogs.com/f-ck-need-u/p/9621130.html

Perl regular expressions: https://www.cnblogs.com/f-ck-need-u/p/9648439.html

1. Every pattern in a regular expression should be understood as "after matching a character or string, immediately go on to match what follows". This concept is very important.

2. When a caret is the first character inside brackets, it means "match one character here that is not in the given set", not "the given characters may be absent".
Most of the time the two readings are equivalent, but they differ at the end of a line. For example, Aa[^bcd]$ matches lines ending in Aaa or Aax, but not a line ending in just Aa, because [^bcd] must still consume one character.
This is what "immediately go on to match" means in regular expressions.
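The point about the negated class can be checked with a small sketch. Python's `re` module is used here only as a convenient engine (an assumption; the article itself targets grep and Perl), and the variable names are illustrative.

```python
import re

pattern = re.compile(r"Aa[^bcd]$")

# A negated class still has to consume exactly one character.
matches_aaa = bool(pattern.search("Aaa"))  # 'a' is not b, c or d
matches_aax = bool(pattern.search("Aax"))  # 'x' is not b, c or d
matches_aab = bool(pattern.search("Aab"))  # 'b' is excluded by the class
matches_aa = bool(pattern.search("Aa"))    # nothing left for [^bcd] to consume

print(matches_aaa, matches_aax, matches_aab, matches_aa)
# prints: True True False False
```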

3. (\.[0-9]+)? matches an optional decimal part. It cannot be written as (\.?[0-9]*), because the latter can match the digits after the decimal point even when it has not matched a decimal point.
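A rough way to see the difference between the two spellings is to test each one against a few candidate "decimal parts". This is a sketch using Python's `re` (an assumption about the engine); the candidate strings are made up for illustration.

```python
import re

decimal_strict = re.compile(r"(\.[0-9]+)?")
decimal_loose = re.compile(r"(\.?[0-9]*)")

# Candidate decimal parts: empty, a proper fraction, bare digits, a bare dot.
candidates = ["", ".5", "5", "."]

strict_ok = [s for s in candidates if decimal_strict.fullmatch(s)]
loose_ok = [s for s in candidates if decimal_loose.fullmatch(s)]

print(strict_ok)  # ['', '.5']
print(loose_ok)   # ['', '.5', '5', '.']
```

The loose form accepts digits without a dot and a dot without digits, which is usually not what a "decimal part" should mean.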

4. When grouping with parentheses in Perl regular expressions, replacing the left parenthesis ( with (?: means "group only, do not capture". Capturing means the matched text can be back-referenced or saved to variables outside the regular expression.
([-+]?[0-9]+(\.[0-9]+)?) *(cm|mm): (cm|mm) is saved as $3
([-+]?[0-9]+(?:\.[0-9]+)?) *(cm|mm): (cm|mm) is saved as $2
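The change in group numbering can be observed directly. The sketch below uses Python's `re`, where `m.group(n)` plays the role of Perl's `$n` (an assumption about the engine; the sample input is invented).

```python
import re

text = "-12.5 cm"

capturing = re.fullmatch(r"([-+]?[0-9]+(\.[0-9]+)?) *(cm|mm)", text)
non_capturing = re.fullmatch(r"([-+]?[0-9]+(?:\.[0-9]+)?) *(cm|mm)", text)

# With a capturing inner group, the unit is group 3.
print(capturing.groups())      # ('-12.5', '.5', 'cm')
# With (?:...), the inner group is not numbered, so the unit becomes group 2.
print(non_capturing.groups())  # ('-12.5', 'cm')
```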

5. Special anchors. An anchor matches a position, not a character. The same is true of the line start ^ and the line end $.

Note that programs differ in how they define a word and its boundaries, and some programs do not support all of the special metacharacters below. Generally speaking, a word is made up of letters, digits, and underscores, i.e. [a-zA-Z0-9_].
For example, GNU grep 2.6 does not support \s or \d, while GNU grep 2.20 supports \s but not \d.
'\b': Match the empty string at the edge of a word.
'\B': Match the empty string provided it's not at the edge of a word.
'\<': Match the empty string at the beginning of a word.
'\>': Match the empty string at the end of a word.
'\w': Match a word constituent; it is a synonym for '[_[:alnum:]]'.
'\W': Match a non-word constituent; it is a synonym for '[^_[:alnum:]]'.
'\s': Match whitespace; it is a synonym for '[[:space:]]'.
'\S': Match non-whitespace; it is a synonym for '[^[:space:]]'.
'\d': Match a digit; it is a synonym for '[0-9]'.
'\D': Match a non-digit; it is a synonym for '[^0-9]'.

For example, '\brat\b' matches the separate word 'rat'; '\Brat\B' matches 'crate' but not 'furry rat'.
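The 'rat' example can be reproduced as a sketch in Python's `re`, which supports `\b` and `\B` but not the GNU-specific `\<` and `\>` (the choice of engine is an assumption).

```python
import re

# \b requires a word boundary on each side of 'rat'.
word_in_phrase = bool(re.search(r"\brat\b", "furry rat"))
word_in_crate = bool(re.search(r"\brat\b", "crate"))

# \B requires that each side of 'rat' is NOT a word boundary.
inner_in_crate = bool(re.search(r"\Brat\B", "crate"))
inner_in_phrase = bool(re.search(r"\Brat\B", "furry rat"))

print(word_in_phrase, word_in_crate)    # True False
print(inner_in_crate, inner_in_phrase)  # True False
```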

6. Character classes. Note that some programs do not support all of the following character classes. The equivalents shown assume the C locale and ASCII.
'[:alnum:]': Alphanumeric characters; same as '[0-9A-Za-z]'.
'[:alpha:]': Alphabetic characters, i.e. '[:lower:]' and '[:upper:]'; same as '[A-Za-z]'.
'[:lower:]': Lowercase letters; same as '[a-z]'.
'[:upper:]': Uppercase letters; same as '[A-Z]'.
'[:digit:]': Digits: '0 1 2 3 4 5 6 7 8 9'.
'[:xdigit:]': Hex digits: '0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f'.

'[:blank:]': Space and tab.
'[:space:]': Tab, newline, vertical tab, form feed, carriage return, and space.
'[:punct:]': Punctuation characters: '! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~'.
'[:print:]': Printable characters: '[:alnum:]', '[:punct:]', and space.
'[:graph:]': Graphical characters: '[:alnum:]' and '[:punct:]'.

'[:cntrl:]': Control characters: octal codes 000 through 037, and 177 ('DEL').

7. Within a single expression, a character that has already been matched cannot be matched a second time, because a regular expression works like this: after matching a character or string, it immediately goes on to match what follows.
For example, the regular expression "(#.)(.#)" cannot match the string "#c#".
As another example, the regular expression "(#.)(.*)(.#)" does match the string "#cc#", but only with the second group matching the empty string.
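Both cases can be verified with a short sketch in Python's `re` (the engine is an assumption; the patterns and strings are the ones from the text).

```python
import re

# "#c#" has three characters, but (#.)(.#) needs four distinct ones:
# the middle 'c' cannot be consumed by both groups.
too_short = re.search(r"(#.)(.#)", "#c#")

# With "#cc#" the outer groups take two characters each,
# so the greedy (.*) has to back off until it matches nothing.
groups = re.search(r"(#.)(.*)(.#)", "#cc#").groups()

print(too_short)  # None
print(groups)     # ('#c', '', 'c#')
```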

8. "Lookaround" anchors (also known as "zero-width assertions": what they match is a position, not a character).
Replacing the left parenthesis with (?= gives a lookahead, which looks from left to right. For example, (?=\d) is satisfied when the character to the right of the current position is a digit.
Replacing the left parenthesis with (?<= gives a lookbehind, which looks from right to left. For example, (?<=\d) is satisfied when the character to the left of the current position is a digit.

* Lookahead: (?=...) and (?!...). The exclamation mark means negation, i.e. the text to the right of the current position must not match the subexpression.
* Lookbehind: (?<=...) and (?<!...)

The expression inside a lookbehind must describe a fixed-length string. For example, (?<=word) or (?<=word|word) is fine, but (?<=words?) is not, because ? matches zero or one occurrence, so the length is not fixed.
In PCRE it can be rewritten as (?<=word|words), but Perl traditionally does not allow this, because it strictly requires the length to be fixed.
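As a sketch of the fixed-length rule, the snippet below probes which lookbehinds an engine accepts. It uses Python's `re`, which (as far as I know) is as strict as the Perl behaviour described here and also rejects alternatives of different lengths; the helper `compiles` and the sample strings are hypothetical. The last pattern shows a workaround that stays within the fixed-length rule: one lookbehind per alternative.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed_ok = compiles(r"(?<=word)!")              # fixed length: accepted
optional_ok = compiles(r"(?<=words?)!")         # 4 or 5 characters: rejected
alternation_ok = compiles(r"(?<=word|words)!")  # rejected by Python's re as well

# Workaround: each lookbehind has its own fixed length.
workaround = re.compile(r"(?:(?<=word)|(?<=words))!")
hits = [s for s in ["word!", "words!", "worm!"] if workaround.search(s)]

print(fixed_ok, optional_ok, alternation_ok)
print(hits)
```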

9. The most important thing to note about "lookaround" anchors is that the match does not consume any characters; it only anchors a position.
For example: your name is longshuai MA and your name is longfei MA
(?=longshuai) anchors the empty position before the word "longshuai" in the first sentence, but all it matches is that empty position before "longshuai".
So it takes (?=longshuai)long to match the characters "long".
Hence, for these two sentences only, long(?=shuai) and (?=longshuai)long are equivalent.
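The zero-width behaviour can be made visible by inspecting what a bare lookahead actually returns. This is a sketch in Python's `re` (the engine is an assumption), reusing the sentence from the text.

```python
import re

text = "your name is longshuai MA and your name is longfei MA"

# A bare lookahead matches a position: the match itself is empty.
anchor = re.search(r"(?=longshuai)", text)
anchor_text = anchor.group()
anchor_pos = anchor.start()

# Both spellings consume only "long", and only before "shuai".
ahead_first = re.findall(r"(?=longshuai)long", text)
ahead_last = re.findall(r"long(?=shuai)", text)
plain = re.findall(r"long", text)

print(repr(anchor_text), anchor_pos == text.index("longshuai"))  # '' True
print(ahead_first, ahead_last, plain)
# ['long'] ['long'] ['long', 'long']
```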

10. Greedy, lazy, and possessive matching
By default, repetition quantifiers are greedy: they match as much as possible.
Some more advanced regex engines also support lazy matching, which matches as little as possible and stops as soon as the condition is met.

* *, +, ?, {M,N}: all greedy
* *?, +?, ??, {M,N}?: all lazy (reluctant)
* *+, ++, ?+, {M,N}+: all possessive
Possessive matching behaves like an atomic group: once something is matched it is never given back, and backtracking is not allowed. For an example, see the atomic group (?>...) below.
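Greedy versus lazy repetition is easy to compare side by side. The sketch below uses Python's `re` and an invented sample string; possessive quantifiers are left out because Python only gained them in version 3.11.

```python
import re

text = "<b>one</b><b>two</b>"

# Greedy: .* runs to the end, then backs off to the LAST </b>.
greedy = re.findall(r"<b>.*</b>", text)

# Lazy: .*? grows one character at a time and stops at the FIRST </b>.
lazy = re.findall(r"<b>.*?</b>", text)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```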

11. Matching mode

* (?i): case-insensitive. It can be cancelled with (?-i). For example, in "(?i)abc(?-i)cdB" only the abc part is matched case-insensitively.
* Because (?i) stops applying at the closing parenthesis of the enclosing group, the part that needs case-insensitive matching can be written inside grouping parentheses, for example "((?i)abc)cdB". (?:(?i)abc)cdB is the same as (?i:abc)cdB.
* (?x): extended mode. Whitespace in the pattern, and comments running from # to the end of the line, are ignored.
* (?m): multiline mode. It changes how ^ and $ match. In the default mode they match at the beginning and end of the string. In multiline mode:
  * ^ matches at the start of the string and right after each newline. To match only the start of the string, use \A.
  * $ matches at the end of the string and right before each newline. To match only at the end of the string or before a final newline, use \Z; to match only at the very end of the string, use \z.
* (?s): single-line (dotall) mode. It changes how "." matches. In the default mode the dot "." cannot match a newline; in dotall mode it can.
* (?U): lazy (ungreedy) mode in PCRE. The default is greedy matching.
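A few of these modes can be tried out in a short sketch. It uses Python's `re`, which accepts `(?m)` and `(?s)` at the start of a pattern and the scoped form `(?i:...)`, but does not support `(?-i)` in mid-pattern or `(?U)`; the sample text is invented.

```python
import re

text = "first\nsecond\n"

# Default mode: ^ only matches at the very start, so no whole line qualifies.
default_lines = re.findall(r"^\w+$", text)
# Multiline mode: ^ and $ also match around each newline.
multiline_lines = re.findall(r"(?m)^\w+$", text)

# Default mode: "." does not match the newline between the two words.
dot_default = re.search(r"first.second", text)
# Dotall mode: "." matches the newline too.
dot_all = re.search(r"(?s)first.second", text)

# Scoped case-insensitivity: only "abc" ignores case, "cdB" does not.
scoped = re.compile(r"(?i:abc)cdB")
scoped_hit = scoped.fullmatch("ABCcdB")
scoped_miss = scoped.fullmatch("ABCCDB")

print(default_lines, multiline_lines)  # [] ['first', 'second']
print(dot_default is None, dot_all is not None)        # True True
print(scoped_hit is not None, scoped_miss is None)     # True True
```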
12. Forced literal interpretation: \Q...\E. This sequence forces every character between \Q and \E to be treated literally.
Perl and PCRE differ somewhat here, though. In Perl, variables can be referenced inside the sequence and are interpolated, whereas in PCRE the variable sigils are also treated as ordinary characters.
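Python's `re` has no `\Q...\E`; the closest counterpart is the function `re.escape`, which is a different mechanism with the same goal. The sketch below swaps it in to show the effect of forcing a literal reading (the sample strings are invented).

```python
import re

literal = "a.b*c"

# Unescaped, "." and "*" keep their special meaning.
raw_on_literal = re.search(literal, "xa.b*cx")
raw_on_other = re.search(literal, "aXc")

# Escaped, every character of the string must appear as-is.
escaped = re.escape(literal)
escaped_on_literal = re.search(escaped, "xa.b*cx")
escaped_on_other = re.search(escaped, "aXc")

print(raw_on_literal is None, raw_on_other is not None)          # True True
print(escaped_on_literal is not None, escaped_on_other is None)  # True True
```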

13. General grouping and capture

* (), $1, $2, $3, $4, ... In some contexts \1, \2, \3, \4 are used instead. sed uses & for the whole match; Perl uses $&.
* \g1, \g2, \g3 or \g{1}, \g{2}, \g{3}.
$1, $2, ... are for use outside the regular expression, while "\g1", "\g2", ... are for use inside it.
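The inside/outside distinction looks like this in a Python sketch. Python spells things differently from Perl: `\1` is used inside the pattern, `m.group(1)` plays the role of `$1` outside it, and `\1` or `\g<1>` is used in replacement strings. The sample strings are invented.

```python
import re

# Inside the pattern: \1 must repeat whatever group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine")
repeated_word = doubled.group(1)  # outside the pattern, like Perl's $1
whole_match = doubled.group(0)    # the whole match, like Perl's $&

# In a replacement string, groups are referenced as \1, \2 (or \g<1>, \g<2>).
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

print(repeated_word, "|", whole_match)  # is | is is
print(swapped)                          # host@user
```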

14. Named groups and captures

* (?:...): non-capturing group. It only groups and cannot be referenced; also known as non-capturing parentheses. For example, in "(1|one)(?:2|two)(3|three)", $1 is (1|one) and $2 is (3|three).
* (?<NAME>...): named capture, i.e. a capturing group with a name, much like assigning to a variable. It can be referenced with \k<NAME>, \k'NAME', or \g{NAME}.
* (?>...): atomic group. Once it has matched, it never gives back what it matched (this is easiest to understand in terms of backtracking).
For example, "hello world" is matched by "hel.* world" but not by "hel(?>.*) world".
Normally, ".*" first matches everything to the end and then backtracks, giving back characters until the space " " can match. With an atomic group, matched content is never given back, so backtracking is impossible.
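The sketch below illustrates named captures and the "never give back" behaviour in Python's `re`. Two assumptions to keep in mind: Python spells named groups `(?P<NAME>...)` rather than `(?<NAME>...)`, and because `(?>...)` only exists in Python 3.11 and later, the atomic group is emulated here with the lookahead-plus-backreference idiom `(?=(...))\1`, which is a substitute technique, not the `(?>...)` syntax itself.

```python
import re

# Named groups, Python spelling.
m = re.fullmatch(r"(?P<value>[-+]?[0-9]+(?:\.[0-9]+)?) *(?P<unit>cm|mm)", "12.5 mm")
value, unit = m.group("value"), m.group("unit")

text = "hello world"

# Ordinary greedy .* backtracks until " world" can match.
normal = re.search(r"hel.* world", text)

# Emulated atomic group: the lookahead grabs everything once, \1 consumes it,
# and a lookahead is never re-entered, so nothing can be given back.
atomic_like = re.search(r"hel(?=(.*))\1 world", text)

print(value, unit)                              # 12.5 mm
print(normal is not None, atomic_like is None)  # True True
```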

15. Resetting the match: \K resets the reported start of the match.
For example, foot\Kbar matches "footbar", but the reported match is "bar". However, \K does not affect the contents of captured subgroups. For example, (foot)\Kbar matches "footbar", and the first subgroup still contains "foot".
$ echo abc123abcfoo | grep -P -o '(abc)123\K\g1foo'
abcfoo
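Python's `re` does not support `\K`. A rough equivalent, shown here as a sketch and not as the same feature, is to wrap the part that should be kept in its own capturing group and read that group instead of the whole match.

```python
import re

# (abc)123 is the prefix to discard; the second group is what \K would keep.
m = re.search(r"(abc)123(\1foo)", "abc123abcfoo")

whole = m.group(0)   # the full match, prefix included
prefix = m.group(1)  # the first subgroup is unaffected
kept = m.group(2)    # what grep -P -o would print with \K

print(whole, prefix, kept)  # abc123abcfoo abc abcfoo
```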
16. Inverting a string match. This can be achieved indirectly with lookahead and lookbehind anchors.
For example, extracting the negative numbers, positive numbers, and whitespace from "-a -3 ac c 3 b" is simple: "-?[0-9]+|\s" will do. But to get "-a ac c b" instead, a regular expression can currently only do it by inverting with the negative lookahead (?!): "((?!-?[0-9]+|\s).)*". The parenthesized group matches one character as long as the text starting at that character is not a positive number, a negative number, or whitespace; the quantifier * then repeats it to join consecutive characters.
For example:
echo "-a -3 ac c 3 b" | grep -P '((?!-?[0-9]+|\s).)*'
...
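The negative-lookahead inversion can also be tried in Python's `re` as a sketch (the engine is an assumption). Because the pattern ends in `*`, it also produces empty matches at every excluded position, so the empty ones are filtered out here; a non-capturing group is used since only the whole match is needed.

```python
import re

text = "-a -3 ac c 3 b"

# Consume one character at a time, as long as neither a number
# nor a whitespace character starts at the current position.
inverse = re.compile(r"(?:(?!-?[0-9]+|\s).)*")

pieces = [m.group() for m in inverse.finditer(text) if m.group()]

print(pieces)  # ['-a', 'ac', 'c', 'b']
```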
