This article covers only some common misunderstandings about regular expressions. To learn regular expressions themselves, see the following two articles:

Basic regular expressions: https://www.cnblogs.com/f-ck-need-u/p/9621130.html

Perl regular expressions: https://www.cnblogs.com/f-ck-need-u/p/9648439.html

1. Every matching pattern in a regular expression should be understood as "after matching a character or string, continue matching immediately after it". This concept is very important.

2. When a caret appears at the start of a bracket expression, it means "match one character here that is not in the given set", not "the given characters are allowed to be absent".
Most of the time the two readings are equivalent, but they differ at the end of a line. For example, Aa[^bcd]$ matches lines ending in Aaa or Aax, but not lines ending in just Aa, because [^bcd] must still consume one character.
This is what "matching immediately after" means in a regex.
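
The point above can be checked with a short sketch. It uses Python's re module rather than grep or Perl, and the sample strings are made up for illustration:

```python
import re

# [^bcd] must consume exactly one character that is not b, c, or d.
pattern = re.compile(r"Aa[^bcd]$")

for line in ["Aaa", "Aax", "Aa", "Aab"]:
    # "Aa" fails: there is no character left for [^bcd] to match.
    print(line, bool(pattern.search(line)))
```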

3. (\.[0-9]+)? matches an optional decimal part. It must not be written as (\.?[0-9]*), because the latter can match digits even when no decimal point was matched (and can also match a lone decimal point).
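
A minimal sketch of the difference, using Python's re module and testing the two fragments on their own. The names strict_frac and loose_frac are only for illustration:

```python
import re

strict_frac = re.compile(r"(\.[0-9]+)?")   # a dot followed by digits, or nothing
loose_frac = re.compile(r"(\.?[0-9]*)")    # dot and digits are independently optional

for candidate in [".5", "5", ".", ""]:
    # Compare which fragments each pattern accepts as a whole.
    print(repr(candidate),
          bool(strict_frac.fullmatch(candidate)),
          bool(loose_frac.fullmatch(candidate)))
```

The loose form accepts "5" (digits without a dot) and "." (a dot without digits), which is usually not what a "decimal part" should mean.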

4. In Perl regular expressions, writing (?: in place of the opening parenthesis ( groups without capturing. Capturing means the matched text can be back-referenced or saved to a variable outside the regex.
([-+]?[0-9]+(\.[0-9]+)?) *(cm|mm) : (cm|mm) is saved as $3
([-+]?[0-9]+(?:\.[0-9]+)?) *(cm|mm) : (cm|mm) is saved as $2
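
The same two patterns can be compared in Python, whose re module supports the (?:...) syntax as well. The input string here is an assumed sample:

```python
import re

text = "-12.5 cm"

capturing = re.search(r"([-+]?[0-9]+(\.[0-9]+)?) *(cm|mm)", text)
non_capturing = re.search(r"([-+]?[0-9]+(?:\.[0-9]+)?) *(cm|mm)", text)

# With a capturing inner group, the unit is the third group.
print(capturing.groups())      # ('-12.5', '.5', 'cm')
# With (?:...), the inner group is not numbered, so the unit is the second group.
print(non_capturing.groups())  # ('-12.5', 'cm')
```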

5. Special anchors. An anchor matches a position, not a character. The same is true of the start-of-line ^ and the end-of-line $.

Note that programs differ in how they define word boundaries, and some programs do not support all of the special metacharacters below. Generally speaking, a word is made up of letters, digits, and underscores, that is, [a-zA-Z0-9_].
For example, GNU grep 2.6 supports neither \s nor \d, while GNU grep 2.20 supports \s but not \d.
'\b': Match the empty string at the edge of a word.
'\B': Match the empty string provided it's not at the edge of a word.
'\<': Match the empty string at the beginning of a word.
'\>': Match the empty string at the end of a word.
'\w': Match a word constituent; it is a synonym for '[_[:alnum:]]'.
'\W': Match a non-word constituent; it is a synonym for '[^_[:alnum:]]'.
'\s': Match whitespace; it is a synonym for '[[:space:]]'.
'\S': Match non-whitespace; it is a synonym for '[^[:space:]]'.
'\d': Match a digit; it is a synonym for '[0-9]'.
'\D': Match a non-digit; it is a synonym for '[^0-9]'.

For example, '\brat\b' matches the separate word 'rat', '\Brat\B' matches
'crate' but not 'furry rat'.
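
The 'rat' example can be reproduced with Python's re module, which supports \b and \B (but not \< or \>). This is a small sketch, not a statement about how grep implements them:

```python
import re

# \b requires a word boundary on each side; \B requires the absence of one.
whole_word = re.compile(r"\brat\b")
inside_word = re.compile(r"\Brat\B")

print(bool(whole_word.search("furry rat")))   # True: 'rat' stands alone
print(bool(whole_word.search("crate")))       # False: 'rat' is inside a word
print(bool(inside_word.search("crate")))      # True
print(bool(inside_word.search("furry rat")))  # False
```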

6. Character classes. Note that some programs do not support all of the following character classes.
'[:alnum:]' :same as '[0-9A-Za-z]'.
'[:alpha:]' :'[:lower:]' and '[:upper:]', same as '[A-Za-z]'.
'[:lower:]' :lowercase letters, same as '[a-z]'.
'[:upper:]' :uppercase letters, same as '[A-Z]'.
'[:digit:]' :'0 1 2 3 4 5 6 7 8 9'.
'[:xdigit:]' :Hex digits: `0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f'.

'[:blank:]' :space and tab.
'[:space:]' :tab, newline, vertical tab, form feed, carriage return, and
space.
'[:punct:]' :Punctuation characters; this is '! " # $ % & ' ( ) * + , - . / :
; < = > ? @ [ \ ] ^ _ ` { | } ~'.
'[:print:]' :'[:alnum:]', '[:punct:]', and space.
'[:graph:]' :Graphical characters: '[:alnum:]' and '[:punct:]'.

'[:cntrl:]' :Control characters. octal codes 000 through 037, and 177 (`DEL').

7. Within a single expression, a character that has already been matched cannot be matched a second time, because a regex works by matching a character or string and then continuing immediately after it.
For example, the regex "(#.)(.#)" cannot match the string "#c#".
As another example, the regex "(#.)(.*)(.#)" does match the string "#cc#", but the second group can only match the empty string.
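
Both examples can be verified with a short Python sketch (the variable names are illustrative):

```python
import re

# "#c#" has only three characters, but (#.)(.#) needs four distinct ones:
# the 'c' cannot be consumed by both groups.
no_match = re.fullmatch(r"(#.)(.#)", "#c#")
print(no_match)  # None

# "#cc#" matches, and the middle group is left with nothing to consume.
m = re.fullmatch(r"(#.)(.*)(.#)", "#cc#")
print(m.groups())  # ('#c', '', 'c#')
```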

8. "Lookaround" anchors (also known as "zero-width assertions"): what they match is a position, not a character.
Replacing the opening parenthesis with (?= gives a left-to-right lookahead. For example, (?=\d) is satisfied when the character to the right of the current position is a digit.
Replacing the opening parenthesis with (?<= gives a right-to-left lookbehind. For example, (?<=\d) is satisfied when the character to the left of the current position is a digit.

* Lookahead: (?=...) and (?!...). The exclamation mark means negation: the pattern after the exclamation mark must not match to the right of the current position.
* Lookbehind: (?<=...) and (?<!...)

A lookbehind expression must describe a fixed-length string. For example, (?<=word) or (?<=word|word) is allowed, but (?<=words?) is not, because ? matches 0 or 1 occurrences, so the length is variable.
In PCRE it can be rewritten as (?<=word|words), but Perl has traditionally not allowed this, because Perl strictly requires the overall length to be fixed.
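
The four lookaround forms can be tried out in Python as a rough sketch. Python's re module, like Perl as described above, requires fixed-width lookbehind; the sample strings are assumptions for illustration:

```python
import re

# Lookahead: digits that are followed by " dollars" (the suffix is not consumed).
ahead = re.findall(r"\d+(?= dollars)", "100 dollars, 200 euros")

# Lookbehind: digits preceded by a dollar sign.
behind = re.findall(r"(?<=\$)\d+", "cost $15 or 20")

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
not_behind = re.findall(r"(?<!\$)\b\d+", "cost $15 or 20")

# A variable-length lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=words?)x")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(ahead, behind, not_behind, variable_width_ok)
```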


9. With "lookaround" anchors, the most important thing to note is that the match consumes no characters; it only anchors a position.
For example, take the two sentences "your name is longshuai MA" and "your name is longfei MA".
(?=longshuai) anchors the position just before the word "longshuai" in the first sentence, but all it matches is that empty position,
so (?=longshuai)long matches the string "long".
Therefore, for these two sentences only, long(?=shuai) and (?=longshuai)long are equivalent.
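
A small Python sketch of this equivalence on the two sentences above. It compares match spans rather than asserting anything about other inputs:

```python
import re

first = "your name is longshuai MA"
second = "your name is longfei MA"

a = re.search(r"long(?=shuai)", first)
b = re.search(r"(?=longshuai)long", first)

# Both patterns consume only "long" and report the same span in the first sentence.
print(a.group(), a.span(), b.group(), b.span())

# Neither pattern matches the second sentence.
print(re.search(r"long(?=shuai)", second), re.search(r"(?=longshuai)long", second))
```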

10. Greedy, lazy, and possessive matching
By default, quantifiers are greedy: they match as much as possible.
Some advanced regex engines support lazy matching, which matches as little as possible and stops as soon as the condition is met.

* *, +, ?, {M,N} : greedy
* *?, +?, ??, {M,N}? : lazy (reluctant)
* *+, ++, ?+, {M,N}+ : possessive
Possessive matching behaves like an atomic group: once text has been taken it is never given back, so backtracking is not allowed. See the (?>...) atomic group in item 14 for an example.
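
Greedy versus lazy matching can be seen in a minimal Python sketch. Possessive quantifiers are left out because Python's re module only accepts them from version 3.11 onward; the tag-like input is an assumed sample:

```python
import re

text = "<a><b>"

# Greedy: .+ grabs as much as possible, then backtracks to the last '>'.
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? grabs as little as possible and stops at the first '>'.
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>
```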

11. Matching modes

* (?i): case-insensitive. (?-i) turns the mode off again. For example, in "(?i)abc(?-i)cdB" only the abc part is matched case-insensitively.
* Because (?i) stops applying at the closing parenthesis of the group it appears in, the part that needs case-insensitive matching can be put inside a group, for example "((?i)abc)cdB". (?:(?i)abc)cdB is equivalent to (?i:abc)cdB.
* (?x): extended mode. Whitespace in the pattern and comments running from # to the end of the line are ignored.
* (?m): multiline mode. Changes how ^ and $ match. By default they match at the start and the end of the string. In this mode:
  * ^ matches at the start of the string and right after each newline. To match only at the start of the string, use \A.
  * $ matches at the end of the string and right before each newline. To match only at the end of the string or before a final newline, use \Z. To match only at the very end of the string, use \z.
* (?s): single-line (dotall) mode. Changes how "." matches. By default "." cannot match a newline; in dotall mode it can.
* (?U): lazy (ungreedy) mode. Without it, quantifiers are greedy by default.
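
Some of these modes can be sketched in Python. This is only an approximation of the Perl/PCRE behaviour described above: Python has the scoped form (?i:...) but no (?-i) toggle in the middle of a pattern, and it has no (?U) mode. Flags are passed here as re.M and re.S:

```python
import re

# Scoped case-insensitivity: only "abc" ignores case, "cdB" must match exactly.
scoped = re.compile(r"(?i:abc)cdB")
print(bool(scoped.fullmatch("ABCcdB")))  # True
print(bool(scoped.fullmatch("ABCCDB")))  # False

# Multiline mode: ^ also matches right after each newline.
default_starts = re.findall(r"^\w+", "one\ntwo")
multiline_starts = re.findall(r"^\w+", "one\ntwo", re.M)
print(default_starts, multiline_starts)  # ['one'] ['one', 'two']

# Dotall mode: "." can match a newline.
without_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.S)
print(without_dotall, bool(with_dotall))
```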

12. Forced literal interpretation: \Q...\E. Every character between \Q and \E is treated as a literal, with no exceptions for metacharacters.
Perl and PCRE differ here, however. In Perl, variables inside the sequence are still interpolated, whereas in PCRE variable sigils are treated as ordinary characters too.
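
Python's re module has no \Q...\E, so the sketch below uses re.escape as the closest analogue: it escapes a whole string so that every character is matched literally. This illustrates the idea only, not Perl's or PCRE's exact behaviour:

```python
import re

raw = "a.b*c"

# Unescaped, "." and "*" keep their special meaning.
as_regex = re.fullmatch(raw, "aXbbbc")

# Escaped, the pattern matches only the literal text "a.b*c".
literal = re.escape(raw)
as_literal_same = re.fullmatch(literal, "a.b*c")
as_literal_other = re.fullmatch(literal, "aXbbbc")

print(bool(as_regex), bool(as_literal_same), bool(as_literal_other))
```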

13. Ordinary grouping and capturing

* (): captured groups are referenced as $1, $2, $3, $4, ...; some tools use \1, \2, \3, \4 instead. sed uses & for the whole match, and Perl uses $&.
* \g1, \g2, \g3 or \g{1}, \g{2}, \g{3}.
$1, $2, ... are used outside the regex, while \g1, \g2, ... are used inside it.
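
The inside/outside distinction looks like this in Python, as a sketch. Python writes the in-pattern backreference as \1 and reads captures from the match object instead of $1, and its replacement-string syntax \g<1> differs from Perl's \g1:

```python
import re

text = "say hello hello world"

# Inside the regex: \1 refers back to whatever group 1 matched.
m = re.search(r"(\w+) \1", text)

# Outside the regex: the capture is read from the match object.
print(m.group(0))  # hello hello
print(m.group(1))  # hello

# In a replacement string, \g<1> inserts group 1.
deduplicated = re.sub(r"(\w+) \1", r"\g<1>", text)
print(deduplicated)  # say hello world
```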

14. Named groups and captures

* (?:...): non-capturing group. It only groups and cannot be referenced; also called non-capturing parentheses. For example, with "(1|one)(?:2|two)(3|three)", $1 holds the text matched by (1|one) and $2 holds the text matched by (3|three).
* (?<NAME>...): named capture. The group is captured and given a name, much like assigning to a variable. It can be referenced with \k<NAME>, \k'NAME', or \g{NAME}.
* (?>...): atomic group. Once it has matched, it never gives back what it matched (this is easiest to understand in terms of backtracking).
For example, "hello world" is matched by "hel.* world" but not by "hel(?>.*) world".
Normally ".*" first matches everything to the end and then backtracks, releasing characters until " world" can match. Inside an atomic group the matched text is never given back, so this backtracking cannot happen.
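
A Python sketch of named and atomic groups follows. Two assumptions are worth stating: Python spells named groups (?P<NAME>...) and references them with (?P=NAME), and its re module accepts (?>...) only from Python 3.11, so that part is guarded by a version check:

```python
import re
import sys

# Named capture: the group is referenced by name inside and outside the regex.
m = re.search(r"(?P<word>\w+) (?P=word)", "hello hello")
print(m.group("word"))  # hello

# Without an atomic group, .* backtracks and the match succeeds.
plain = re.search(r"hel.* world", "hello world")

# With an atomic group, .* keeps everything it took, so " world" can never match.
atomic = None
if sys.version_info >= (3, 11):
    atomic = re.search(r"hel(?>.*) world", "hello world")

print(bool(plain), atomic)
```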

15. Resetting the match: \K resets the start position of the reported match.
For example, foot\Kbar matches "footbar", but the reported match is "bar". However, \K does not interfere with the contents of subgroups. For example,
when (foot)\Kbar matches "footbar", the first subgroup still holds "foot".
$ echo abc123abcfoo | grep -P -o '(abc)123\K\g1foo'
abcfoo
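
Python's re module does not support \K. For a fixed-length prefix, a lookbehind gives a similar reported result, which the sketch below uses as a stand-in. It is a substitute technique, not \K itself:

```python
import re

# (?<=foot) requires "foot" to the left but leaves it out of the match,
# so the reported match is just "bar", as with foot\Kbar in PCRE.
m = re.search(r"(?<=foot)bar", "footbar")
print(m.group())  # bar
print(m.span())   # (4, 7)
```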

16. Inverting a match. This can be achieved indirectly with lookahead and lookbehind anchors.
For example, given "-a -3 ac c 3 b", picking out the negative numbers, positive numbers, and whitespace is easy: "-?[0-9]+|\s" is enough. But to get "-a ac c b" instead, a regex can currently only do it by inverting with the negative lookahead (?!...): "((?!-?[0-9]+|\s).)*". Inside the outer parentheses, a character is matched and grouped only when what starts at that position is not a positive number, a negative number, or whitespace; the quantifier * then repeats this to join consecutive characters.
For example (-o prints only the matched parts):
echo "-a -3 ac c 3 b" | grep -P -o '((?!-?[0-9]+|\s).)*'
...
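
The same inversion can be sketched in Python. A non-capturing group is used here, and empty matches (which the * quantifier allows) are filtered out; both are choices made for this illustration:

```python
import re

text = "-a -3 ac c 3 b"

# At each position, consume one character only if a number or whitespace
# does NOT start there; * then joins consecutive accepted characters.
pattern = re.compile(r"(?:(?!-?[0-9]+|\s).)*")

# The pattern can match the empty string, so keep only non-empty matches.
pieces = [m.group() for m in pattern.finditer(text) if m.group()]
print(pieces)  # ['-a', 'ac', 'c', 'b']
```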