Many programming languages and tools support regular expressions, and C# is no exception. The C# base class library contains a namespace (System.Text.RegularExpressions) and a set of classes (Regex, Match, Group, and so on) that put the full power of regular expressions to work. So what is a regular expression, and how do you define one?

 

One, Regular expression basics

           What is a regular expression

    When writing code that processes strings, you often need to find strings that satisfy some complex rule. Regular expressions are a tool for describing such rules. In other words, a regular expression is code that records a rule about text.

     When searching for files on Windows, we commonly use wildcards (* and ?). To find all Word documents, for example, you can search for *.doc, where * is interpreted as an arbitrary string. Like wildcards, regular expressions are a tool for matching text, but they describe what you need far more precisely. The price, of course, is that they are more complicated.

          A simple example: verifying a phone number

The best way to learn regular expressions is to start with examples. Let's begin by verifying a phone number and build up an understanding of regular expressions step by step.

In China, a phone number (for example, 0379-65624150) usually consists of a 3 or 4 digit area code beginning with 0 and a 7 or 8 digit local number, separated by a hyphen '-'. To describe this we first introduce a metacharacter, \d, which matches a single digit from 0 to 9. The regular expression can be written as: ^0\d{2,3}-\d{7,8}$

Let's analyze it: ^ matches the start of the string, 0 matches the digit "0", \d matches a digit, {2,3} repeats it 2 to 3 times, - matches only "-" itself, the next \d again matches a digit, {7,8} repeats it 7 to 8 times, and $ matches the end of the string. Of course, the phone number can also be written as (0379)65624150; handling that form is left to the reader.

      A.  Metacharacters

The example above introduced one metacharacter, \d. As you might expect, regular expressions have many more metacharacters like \d. The following table lists some common ones:

 


Metacharacter    Description
.                Matches any character except a newline
\b               Matches the beginning or end of a word
\d               Matches a digit
\s               Matches any whitespace character
\w               Matches a letter, digit, underscore, or Chinese character
^                Matches the start of the string
$                Matches the end of the string

Table 1. Common metacharacters

       B.  Escape characters


    If you want to find a metacharacter itself, for example . or *, there is a problem: you cannot write it directly, because it would be interpreted with its special meaning. You have to put \ in front of it to remove that special meaning, so you write \. and \*. Likewise, to find \ itself, you write \\.

For example: unibetter\.com matches unibetter.com, and C:\\Windows matches C:\Windows.
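In C# source code the backslashes in a pattern also interact with string-literal escaping, so patterns are usually written as verbatim strings (@"..."). The following is a minimal sketch of the two examples above; the class and method names (EscapeDemo, MatchesDomain, MatchesPath) are made up for illustration, while Regex.IsMatch and Regex.Escape are library methods.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // A verbatim string (@"...") passes each backslash to the regex engine unchanged.
    public static bool MatchesDomain(string input) =>
        Regex.IsMatch(input, @"unibetter\.com");

    public static bool MatchesPath(string input) =>
        Regex.IsMatch(input, @"C:\\Windows");

    public static void Main()
    {
        Console.WriteLine(MatchesDomain("unibetter.com"));  // True
        Console.WriteLine(MatchesDomain("unibetterXcom"));  // False: \. matches only a literal dot
        Console.WriteLine(MatchesPath(@"C:\Windows"));      // True

        // Regex.Escape produces the escaped pattern for a piece of literal text.
        Console.WriteLine(Regex.Escape("unibetter.com"));   // unibetter\.com
    }
}
```

Without the backslash, unibetter.com would also match unibetterXcom, because an unescaped . matches any character.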

       C.   Quantifiers

A quantifier, also called a repetition specifier, states how many times a character may appear. For example, the {2,3} we used when matching phone numbers means "repeat 2 to 3 times". Common quantifiers are:

 


Quantifier    Description
*             Repeat zero or more times
+             Repeat one or more times
?             Repeat zero or one time
{n}           Repeat exactly n times
{n,}          Repeat n or more times
{n,m}         Repeat n to m times

Table 2. Common quantifiers

Two, Regular expression support in .NET

    The System.Text.RegularExpressions namespace contains classes that provide access to the .NET Framework regular expression engine. The regular expression functionality it provides can be used from any platform or language that runs within the Microsoft .NET Framework.

 

    A. Using regular expressions in C#

Now that we know which classes support regular expressions in C#, let's put the phone number expression from above into C# code and implement phone number validation.

Step 1: Create a Windows project named SimpleCheckPhoneNumber.

Step 2: Import the System.Text.RegularExpressions namespace.

Step 3: Write the regular expression. We start from the phone number expression above. Because that expression only accepts numbers whose area code and local number are joined by a hyphen, we extend it: 0\d{2,3}-\d{7,8}|\(0\d{2,3}\)\d{7,8}. In this expression, the part to the left of | is the pattern discussed above, and the part to the right validates numbers written like (0379)65624150. Because ( and ) are also metacharacters, they must be escaped. The | denotes alternation: the input matches either the left part or the right part. Note that this version has no ^ and $ anchors, so it succeeds if a phone number appears anywhere in the input; to require that the whole input be a phone number, wrap it as ^(?:...)$.

Step 4: Construct an instance of the Regex class from the regular expression.

Step 5: Use the IsMatch method of the Regex class to test for a match. IsMatch() returns a bool: true if there is a match, false otherwise.
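Steps 2 to 5 can be sketched as follows. This is an illustrative console version rather than the Windows project from step 1; the class and method names (PhoneValidator, IsValidPhone) are assumptions, and the pattern from step 3 is additionally wrapped in ^(?:...)$ so that the whole input has to be a phone number.

```csharp
using System;
using System.Text.RegularExpressions;   // step 2

public static class PhoneValidator
{
    // Steps 3 and 4: the pattern from step 3, anchored with ^(?:...)$ (an addition
    // made here) so that a phone number buried inside longer text does not pass.
    private static readonly Regex PhonePattern =
        new Regex(@"^(?:0\d{2,3}-\d{7,8}|\(0\d{2,3}\)\d{7,8})$");

    // Step 5: IsMatch returns true when the input matches the pattern.
    public static bool IsValidPhone(string input) => PhonePattern.IsMatch(input);

    public static void Main()
    {
        Console.WriteLine(IsValidPhone("0379-65624150"));   // True
        Console.WriteLine(IsValidPhone("(0379)65624150"));  // True
        Console.WriteLine(IsValidPhone("65624150"));        // False: no area code
    }
}
```

Creating the Regex once and reusing it avoids re-parsing the pattern on every call.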

 

Three, Advanced regular expressions

     A.  Grouping

When matching phone numbers, we repeated a single character. Now let's learn how to use grouping to match an IP address.


As is well known, an IP address is written as four decimal segments separated by dots, so we can match it segment by segment. First, let's match one segment: 2[0-4]\d|25[0-5]|[01]?\d\d? This expression matches one number of an IP address. 2[0-4]\d matches a three-digit segment that starts with 2, has a tens digit from 0 to 4, and any units digit. 25[0-5] matches a three-digit segment that starts with 25 and has a units digit from 0 to 5. [01]?\d\d? matches a segment with an optional leading 0 or 1 followed by one or two digits; ? means zero or one occurrence, so the [01] and the last \d may be absent.

To match a segment together with the dot that follows it, we append \. to the segment. Because | has the lowest precedence, the alternation must first be enclosed in parentheses so that \. applies to all three branches and not just the last one: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.). Now we can use this group: repeat it three times, then add one more segment, again in parentheses. The complete regular expression is: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
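The grouping technique can be tried out with a short sketch. Here each segment alternation is enclosed in its own parentheses so that \. and {3} apply to the whole segment, and ^ and $ are added so the entire input must be an address. The names IpMatcher, Octet, and IsIPv4 are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IpMatcher
{
    // One segment (0-255). The parentheses keep the alternation together.
    private const string Octet = @"(2[0-4]\d|25[0-5]|[01]?\d\d?)";

    // Three "segment + dot" groups followed by a final segment, anchored at both ends.
    private static readonly Regex IpPattern =
        new Regex("^(" + Octet + @"\.){3}" + Octet + "$");

    public static bool IsIPv4(string input) => IpPattern.IsMatch(input);

    public static void Main()
    {
        Console.WriteLine(IsIPv4("192.168.0.1"));  // True
        Console.WriteLine(IsIPv4("256.1.1.1"));    // False: 256 is out of range
        Console.WriteLine(IsIPv4("1.2.3"));        // False: only three segments
    }
}
```

If the parentheses around the segment are dropped, | splits the whole expression into independent alternatives, and a string such as 999 would be accepted because it contains a substring matching [01]?\d\d?.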

 

    B. Backreferences

Once we understand grouping, we can use backreferences. A backreference takes the text captured by an earlier group and uses it to match later text. It is most often used to match repeated text, such as go go, which can be matched with (go) \1.


By default, every group is automatically assigned a group number. The rule is: counting from left to right by each group's opening parenthesis, the first group is number 1, the second is number 2, and so on. You can also give a subexpression a group name, using this syntax: (?<Word>\w+) (or with the angle brackets replaced by single quotes: (?'Word'\w+)). This names the \w+ group Word. To refer back to what this group captured, use \k<Word>, so a pattern that finds a repeated word can be written as: \b(?<Word>\w+)\b\s+\k<Word>\b.

Named groups have another benefit: when a C# program needs the value of a group, it can fetch it by the name we defined instead of by a numeric index, which is much clearer.

When we do not need a backreference, there is no reason for the group to remember anything. In that case the (?:nocapture) syntax, where nocapture stands for the subexpression, tells the regular expression engine not to treat the contents of the parentheses as a capture group, which improves efficiency.
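The sketch below shows a named backreference, reading a group by name from C#, and the effect of a non-capturing group on the group count. The names RepeatedWordFinder, FindRepeatedWord, and CountGroups are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedWordFinder
{
    private static readonly Regex Repeated =
        new Regex(@"\b(?<Word>\w+)\b\s+\k<Word>\b");

    // Returns the word that is immediately repeated, or an empty string if there is none.
    public static string FindRepeatedWord(string text)
    {
        Match m = Repeated.Match(text);
        return m.Success ? m.Groups["Word"].Value : string.Empty;  // access by group name
    }

    // Number of groups in a match, including group 0 (the whole match).
    public static int CountGroups(string pattern, string input) =>
        Regex.Match(input, pattern).Groups.Count;

    public static void Main()
    {
        Console.WriteLine(FindRepeatedWord("let us go go now"));  // go
        Console.WriteLine(CountGroups(@"(go)+", "gogo"));         // 2: group 0 and group 1
        Console.WriteLine(CountGroups(@"(?:go)+", "gogo"));       // 1: only group 0
    }
}
```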

    C. Zero-width assertions

In the earlier introduction to metacharacters, we saw characters that match the beginning or end of the string (^ and $) or the beginning or end of a word (\b). These metacharacters match only a position, and require that the position satisfy some condition, instead of matching actual characters. They are therefore called zero-width assertions. "Zero-width" means they match a position without consuming any characters; "assertion" means a test. Matching continues only when the assertion is true.

Sometimes the position we need to match is something other than the edge of a string or a word, and then we have to write the assertion ourselves. Here is the syntax for assertions:

 


Assertion syntax    Description
(?=pattern)         Positive lookahead: matches a position that is followed by pattern
(?!pattern)         Negative lookahead: matches a position that is not followed by pattern
(?<=pattern)        Positive lookbehind: matches a position that is preceded by pattern
(?<!pattern)        Negative lookbehind: matches a position that is not preceded by pattern

Table 3. Assertion syntax and descriptions

Is that hard to follow? Let's look at an example.

Given a tag, <book>, suppose we want to extract its tag name (book). Assertions can handle this. Consider the expression (?<=\<)(?<tag>\w*)(?=\>). It matches the characters between < and >, which here is book. Assertions can be used to write more complex expressions too, but we will not give more examples here.

One more important point: the parentheses used in assertion syntax do not form capture groups, so they cannot be referenced by number or by name.
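A minimal sketch of the tag example follows. The names TagNameExtractor and ExtractTagName are made up for illustration, and the angle brackets inside the assertions are written without backslashes, since < and > are ordinary characters at that point in the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagNameExtractor
{
    // Lookbehind for "<", a named group for the tag name, lookahead for ">".
    private static readonly Regex TagName =
        new Regex(@"(?<=<)(?<tag>\w*)(?=>)");

    public static string ExtractTagName(string text)
    {
        Match m = TagName.Match(text);
        // The assertions consume nothing, so the whole match is just the tag name.
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractTagName("<book>"));  // book
    }
}
```

Because the assertions are zero-width, m.Value and m.Groups["tag"].Value are the same string here; neither includes the angle brackets.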

     D. Greedy and lazy matching


When a regular expression contains a quantifier that allows repetition, the usual behavior is to match as many characters as possible (provided the whole expression can still match). Consider the expression a\w*b: when it is applied to the string aabab, the match is aabab. This is called greedy matching.

Sometimes we want as few repetitions as possible instead, for example so that the expression above yields aab. This calls for lazy matching. To make a quantifier lazy, append a ? to it. The expression above becomes a\w*?b, and when it is applied to the string aabab, the matches are aab and ab.

You may ask: ab involves fewer repetitions than aab, so why is ab not matched first? The reason is that one rule takes priority over both greediness and laziness: the match that begins earliest wins.
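The difference is easy to observe by listing every match of each pattern. This is a small sketch; GreedyLazyDemo and AllMatches are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    // Collects the text of every match of pattern in input, in order.
    public static string[] AllMatches(string pattern, string input)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", AllMatches(@"a\w*b", "aabab")));   // aabab
        Console.WriteLine(string.Join(", ", AllMatches(@"a\w*?b", "aabab")));  // aab, ab
    }
}
```

The lazy pattern still starts at the first a, because the earliest possible starting position is tried first; laziness only decides how far that match extends.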

     E. Comments

Syntax: (?#comment)

    For example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199)

    Note: when using comments, be careful with whitespace such as spaces and line breaks around the comment, because by default these characters are treated as part of the pattern. If they should be ignored, it is better to enable the "ignore whitespace in pattern" option, which in C# is the IgnorePatternWhitespace member of the RegexOptions enumeration (RegexOptions is discussed below).
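Both commenting styles can be compared side by side. The sketch below is illustrative only: the class and field names are assumptions, and the segment pattern is wrapped in ^( ... )$ so that it has to match the whole input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentedPatternDemo
{
    // Inline (?#...) comments need no option.
    public static readonly Regex Inline = new Regex(
        @"^(2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199))$");

    // With IgnorePatternWhitespace, unescaped whitespace in the pattern is ignored
    // and # starts a comment that runs to the end of the line.
    public static readonly Regex Spaced = new Regex(@"
        ^(
            2[0-4]\d      # 200-249
          | 25[0-5]       # 250-255
          | [01]?\d\d?    # 0-199
        )$", RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        Console.WriteLine(Inline.IsMatch("249"));  // True
        Console.WriteLine(Spaced.IsMatch("249"));  // True
        Console.WriteLine(Spaced.IsMatch("256"));  // False
    }
}
```

With IgnorePatternWhitespace enabled, a literal space in the pattern has to be written in escaped form (for example \s or [ ]), since plain spaces are dropped.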

      F. Processing options in C#

In C#, the RegexOptions enumeration selects how regular expressions are processed. Its members are described in the MSDN documentation for RegexOptions.
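As a brief illustration, options are passed to the Regex constructor and can be combined with the | operator. The sketch below uses two members, IgnoreCase and Multiline; the names OptionsDemo and CountLinesStartingWith are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Counts the lines of text that begin with word, ignoring case.
    public static int CountLinesStartingWith(string text, string word)
    {
        // IgnoreCase: letters match regardless of case.
        // Multiline: ^ matches at the start of every line, not only at the start of the string.
        var regex = new Regex("^" + Regex.Escape(word),
                              RegexOptions.IgnoreCase | RegexOptions.Multiline);
        return regex.Matches(text).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountLinesStartingWith("Error: a\nok\nerror: b", "error"));  // 2
    }
}
```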

      The Capture, Group, and Match classes in C#

Capture class: Represents the result of a single subexpression capture, that is, one substring from a single successful capture. This class has no public constructor; a collection of Capture objects is obtained from the Group class or the Match class. Capture has three commonly used properties: Index, Length, and Value. Index is the position of the first character of the captured substring, Length is the length of the captured substring, and Value is the captured substring itself.

Group class: Represents the information for one group in a regular expression, and provides the support for matching with groups. This class has no public constructor; a collection of Group objects is obtained from the Match class. If a group in the regular expression is named, it can be accessed by name; otherwise it can be accessed by index. Note: element 0 of every Match's Groups collection (Groups[0]) is the entire string captured by that Match, which is also the Value of its Capture.

Match class: Represents the result of a single regular expression match. This class also has no public constructor; an instance is obtained from the Match() method of the Regex class, and a collection of Match objects is obtained from the Matches() method of the Regex class.

All three classes can represent the result of a single regular expression match, but the Match class gives the most detail, including capture and group information. For that reason, Match is the most commonly used of the three.
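The relationship between the three classes can be seen by inspecting one match. This is a sketch with assumed names (MatchInspector, FindPhone) and assumed group names (area, number) added to the phone number pattern from earlier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchInspector
{
    // Finds the first phone number in text; the group names are chosen for this example.
    public static Match FindPhone(string text) =>
        Regex.Match(text, @"(?<area>0\d{2,3})-(?<number>\d{7,8})");

    public static void Main()
    {
        Match m = FindPhone("call 0379-65624150 now");

        Console.WriteLine(m.Success);                     // True
        Console.WriteLine(m.Value);                       // 0379-65624150
        Console.WriteLine(m.Index);                       // 5
        Console.WriteLine(m.Groups[0].Value == m.Value);  // True: Groups[0] is the whole match

        // Named groups are read by name.
        Console.WriteLine(m.Groups["area"].Value);        // 0379
        Console.WriteLine(m.Groups["number"].Value);      // 65624150

        // Each Group exposes its captures; Capture has Index, Length, and Value.
        Capture c = m.Groups["number"].Captures[0];
        Console.WriteLine($"{c.Index} {c.Length} {c.Value}");  // 10 8 65624150
    }
}
```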