Many programming languages and tools now support regular expressions, and C# is no exception. The C# base class library contains a namespace (System.Text.RegularExpressions) and a set of classes (Regex, Match, Group, and so on) that make full use of the power of regular expressions. So what is a regular expression, and how do we define one?

 

1. Regular expression basics

           What is a regular expression

    When writing programs that process strings, we often need to find strings that satisfy some complex rule. Regular expressions are a tool for describing such rules. In other words, a regular expression is code that records a rule about text.


     Usually, when we search for files in Windows, we use wildcards (* and ?). If you want to find all Word documents, you can search for *.doc, where * is interpreted as an arbitrary string. Like wildcards, regular expressions are a tool for text matching, but compared with wildcards they can describe your needs far more precisely. The price, of course, is greater complexity.

          A simple example: validating a phone number

The best way to learn regular expressions is to start with examples. Let's begin by validating a phone number and build up our understanding step by step.


In our country, a phone number (such as 0379-65624150) usually consists of a 3- or 4-digit area code beginning with 0 and a 7- or 8-digit number, separated by a hyphen '-'. For this case we first introduce the metacharacter \d, which matches a single digit from 0 to 9. The regular expression can be written as: ^0\d{2,3}-\d{7,8}$


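Let's analyze it. ^ and $ match the start and end of the string, so the whole input must be a phone number. 0 matches the digit "0", \d matches a digit, and {2,3} repeats it 2 to 3 times. - matches only "-" itself. The next \d also matches a digit, and {7,8} repeats it 7 to 8 times. Of course, the phone number can also be written as (0379)65624150; handling that form is left as an exercise for the reader.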
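As a quick check of the pattern above, here is a minimal C# console sketch. The class name PhoneBasics and the helper IsPhone are made up for illustration; only the pattern itself comes from the text.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhoneBasics
{
    // Pattern from the text: area code starting with 0, a hyphen, then 7 or 8 digits.
    const string Pattern = @"^0\d{2,3}-\d{7,8}$";

    // Hypothetical helper: true when the whole input has the hyphenated form.
    public static bool IsPhone(string input) => Regex.IsMatch(input, Pattern);

    static void Main()
    {
        Console.WriteLine(IsPhone("0379-65624150")); // True
        Console.WriteLine(IsPhone("379-65624150"));  // False: area code must start with 0
        Console.WriteLine(IsPhone("0379-656241"));   // False: only 6 digits after the hyphen
    }
}
```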

      A.  Metacharacters

In the example above we met the metacharacter \d. As you might expect, regular expressions have many more metacharacters like \d. The following table lists some common ones:

 


Metacharacter    Description
.                Matches any character except a newline
\b               Matches the beginning or end of a word
\d               Matches a digit
\s               Matches any whitespace character
\w               Matches a letter, digit, underscore, or Chinese character
^                Matches the start of the string
$                Matches the end of the string

Table 1. Common metacharacters
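The metacharacters in the table can be tried out with a short C# sketch such as the following. The class MetaDemo, the helper Count, and the sample text are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class MetaDemo
{
    // Hypothetical helper: number of matches of pattern in input.
    public static int Count(string input, string pattern) =>
        Regex.Matches(input, pattern).Count;

    static void Main()
    {
        string text = "hi, this is him at room 42";

        Console.WriteLine(Count(text, @"hi"));      // 3: inside "hi", "this" and "him"
        Console.WriteLine(Count(text, @"\bhi\b"));  // 1: only the standalone word "hi"
        Console.WriteLine(Count(text, @"\d"));      // 2: the digits 4 and 2
        Console.WriteLine(Count(text, @"\s"));      // 6: the six spaces

        Console.WriteLine(Regex.IsMatch("a_1", @"^\w\w\w$")); // True: letter, underscore, digit
        Console.WriteLine(Regex.IsMatch("a-b", @"^a.b$"));    // True: . matches "-"
        Console.WriteLine(Regex.IsMatch("a\nb", @"^a.b$"));   // False: . does not match a newline
    }
}
```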

       B.  Escape characters


    If you want to find a metacharacter itself, for example . or *, there is a problem: you cannot specify it directly, because it would be interpreted as something else. You have to use \ to remove the character's special meaning, so you write \. and \*. And to find \ itself, you write \\.

For example: unibetter\.com matches unibetter.com, and C:\\Windows matches C:\Windows.
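The two escape examples can be checked in C# as sketched below. C# verbatim strings (@"...") are used so that each backslash reaches the regex engine unchanged. The class name EscapeDemo is hypothetical, and the Regex.Escape call is an extra that the text above does not mention.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Thin wrapper around Regex.IsMatch, named here only for illustration.
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    static void Main()
    {
        Console.WriteLine(Matches("unibetter.com", @"unibetter\.com")); // True
        Console.WriteLine(Matches("unibetterXcom", @"unibetter\.com")); // False: \. is a literal dot
        Console.WriteLine(Matches("unibetterXcom", @"unibetter.com"));  // True: unescaped . matches "X"
        Console.WriteLine(Matches(@"C:\Windows", @"C:\\Windows"));      // True: \\ matches one backslash

        // Regex.Escape adds the backslashes for you.
        Console.WriteLine(Regex.Escape("a.b*c")); // a\.b\*c
    }
}
```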

       C.   Quantifiers

A quantifier, also called a repetition specifier, indicates how many times an element may occur. For example, when matching phone numbers we used {2,3}, which means 2 to 3 repetitions. Common quantifiers are:

 


Quantifier    Description
*             Repeats zero or more times
+             Repeats one or more times
?             Repeats zero times or once
{n}           Repeats n times
{n,}          Repeats n or more times
{n,m}         Repeats n to m times

Table 2. Common quantifiers
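Each quantifier in the table can be exercised with a small C# sketch like this one. The patterns and inputs are illustrative assumptions, and every pattern is anchored with ^ and $ so that the whole input must match.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Thin wrapper around Regex.IsMatch, named here only for illustration.
    public static bool Ok(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    static void Main()
    {
        Console.WriteLine(Ok("color", @"^colou?r$"));   // True:  ? allows zero "u"
        Console.WriteLine(Ok("colour", @"^colou?r$"));  // True:  ? allows one "u"
        Console.WriteLine(Ok("a", @"^ab*$"));           // True:  * allows zero "b"
        Console.WriteLine(Ok("a", @"^ab+$"));           // False: + needs at least one "b"
        Console.WriteLine(Ok("12", @"^\d{3}$"));        // False: exactly 3 digits required
        Console.WriteLine(Ok("12345", @"^\d{2,}$"));    // True:  2 or more digits
        Console.WriteLine(Ok("12345", @"^\d{2,4}$"));   // False: at most 4 digits
    }
}
```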

2. Regular expression support in .NET

    The System.Text.RegularExpressions namespace contains classes that provide access to the .NET Framework regular expression engine. The functionality it provides can be used from any platform or language that runs within the Microsoft .NET Framework.

 

    A. Using regular expressions in C#

Now that we know which classes support regular expressions in C#, let's turn the phone-number expression above into C# code and implement the validation.

Step 1: create a Windows project named SimpleCheckPhoneNumber.

Step 2: import the System.Text.RegularExpressions namespace.


Step 3: write the regular expression. It is based on the phone-number pattern above. Because that pattern only accepts numbers whose area code and number are joined by a hyphen, we modify it: 0\d{2,3}-\d{7,8}|\(0\d{2,3}\)\d{7,8}. In this expression, the part to the left of | is the pattern discussed above, and the part to the right validates numbers written like (0379)65624150. Because ( and ) are also metacharacters, they must be escaped. | denotes alternation: match either the part before it or the part after it.

Step 4: construct an instance of the Regex class from the regular expression.

Step 5: use the IsMatch method of the Regex class to check for a match. IsMatch() returns a bool: true if there is a match, false otherwise.
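The five steps can be sketched as a small console program (a console application is used here instead of a Windows project to keep the example short). The field name PhoneRegex and the method IsValid are hypothetical. The surrounding ^(?: ... )$ is also an addition to the expression from step 3: without it, IsMatch would also accept a longer string that merely contains a phone number.

```csharp
using System;
using System.Text.RegularExpressions; // Step 2: import the namespace

static class SimpleCheckPhoneNumber
{
    // Steps 3 and 4: the pattern from the text, wrapped in ^(?: ... )$ (an assumption)
    // and used to construct a Regex instance.
    static readonly Regex PhoneRegex =
        new Regex(@"^(?:0\d{2,3}-\d{7,8}|\(0\d{2,3}\)\d{7,8})$");

    // Step 5: IsMatch returns true when the input matches, false otherwise.
    public static bool IsValid(string input) => PhoneRegex.IsMatch(input);

    static void Main()
    {
        Console.WriteLine(IsValid("0379-65624150"));  // True: hyphenated form
        Console.WriteLine(IsValid("(0379)65624150")); // True: parenthesised form
        Console.WriteLine(IsValid("(0379-65624150")); // False: mixes the two forms
    }
}
```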

 

3. Advanced regular expressions

     A.  Grouping

When matching phone numbers, we repeated only a single character. Now let's learn how to use grouping to match an IP address.


As everyone knows, an IP address is written as four dot-separated decimal fields. We can therefore match it field by field using groups. First, let's match a single field: 2[0-4]\d|25[0-5]|[01]?\d\d? This expression matches one numeric field of an IP address. 2[0-4]\d matches a three-digit field that starts with 2, has a tens digit from 0 to 4, and any ones digit. 25[0-5] matches a three-digit field that starts with 25 and has a ones digit from 0 to 5. [01]?\d\d? matches a field with an optional leading 0 or 1 followed by one or two digits. ? means zero or one occurrence, so both [01] and the final \d may be absent.

If we follow the field with \. to match the dot, we get one complete segment. Because | has the lowest precedence, the field alternation must be enclosed in its own group before \. is appended; otherwise \. would attach only to the last alternative. The segment is therefore written as the group ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.). Repeat this group three times, then add one more grouped field, and we are done. The complete regular expression is: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
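A minimal C# sketch of the IP address check follows. It wraps the field alternation in its own group so that \. and {3} apply to a whole field, and it adds ^ and $ anchors (an assumption) so the entire input must be an address. The names IpDemo, Octet, and IsIPv4 are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class IpDemo
{
    // One field, 0 to 255, grouped so that | does not leak outside it.
    const string Octet = @"(2[0-4]\d|25[0-5]|[01]?\d\d?)";

    // Three "field + dot" segments followed by a final field.
    static readonly Regex IpRegex =
        new Regex("^(" + Octet + @"\.){3}" + Octet + "$");

    public static bool IsIPv4(string input) => IpRegex.IsMatch(input);

    static void Main()
    {
        Console.WriteLine(IsIPv4("192.168.0.1"));     // True
        Console.WriteLine(IsIPv4("255.255.255.255")); // True
        Console.WriteLine(IsIPv4("256.1.1.1"));       // False: 256 is out of range
        Console.WriteLine(IsIPv4("1.2.3"));           // False: only three fields
    }
}
```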

 

    B. Backreferences

Once we understand grouping, we can use backreferences. A backreference uses the text captured by an earlier group to match later characters. It is mostly used to match repeated text, such as "go go". We can match that with (go) \1.
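The numbered backreference can be tried in C# as in this sketch. The class name BackrefDemo and the sample sentences are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

static class BackrefDemo
{
    // Group 1 captures "go"; \1 then requires the same text again after a space.
    const string Pattern = @"(go) \1";

    public static Match Find(string input) => Regex.Match(input, Pattern);

    static void Main()
    {
        Match m = Find("let's go go now");
        Console.WriteLine(m.Success);         // True
        Console.WriteLine(m.Value);           // go go
        Console.WriteLine(m.Groups[1].Value); // go

        Console.WriteLine(Find("go on").Success); // False: "go" is not repeated
    }
}
```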


By default, each group automatically gets a group number. The rule: scanning from left to right and counting each group's opening parenthesis, the first group is number 1, the second is number 2, and so on. You can also give a subexpression a group name, using the syntax (?<Word>\w+) (or with single quotes instead of angle brackets: (?'Word'\w+)). This names the \w+ group Word. To refer back to what this group captured, use \k<Word>. The previous example, generalized to any repeated word, can then be written like this: \b(?<Word>\w+)\b\s+\k<Word>\b.

Custom group names have another benefit: in a C# program, when we need the value of a group, we can retrieve it by the name we defined instead of by index.
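Reading a named group from C# might look like the following sketch. The class name NamedGroupDemo and the sample sentence are assumptions; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    // A word, whitespace, then the same word again via the named backreference.
    const string Pattern = @"\b(?<Word>\w+)\b\s+\k<Word>\b";

    public static Match Find(string input) => Regex.Match(input, Pattern);

    static void Main()
    {
        Match m = Find("this is is a test");
        Console.WriteLine(m.Success);              // True
        Console.WriteLine(m.Value);                // is is
        Console.WriteLine(m.Groups["Word"].Value); // is  (accessed by name, not by index)
    }
}
```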


When we do not need a backreference, there is no need for the group to remember anything. In that case the (?:nocapture) syntax tells the regular expression engine not to treat the contents of the parentheses as a capture group, which improves efficiency.
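The difference between a capturing and a non-capturing group shows up in the size of the Groups collection, as this sketch suggests. The class name and inputs are assumptions, and the sketch does not measure the efficiency gain.

```csharp
using System;
using System.Text.RegularExpressions;

static class NonCaptureDemo
{
    // Number of entries in Groups, including Groups[0] (the whole match).
    public static int GroupCount(string input, string pattern) =>
        Regex.Match(input, pattern).Groups.Count;

    static void Main()
    {
        Console.WriteLine(GroupCount("gogo", @"(go)+"));   // 2: whole match plus group 1
        Console.WriteLine(GroupCount("gogo", @"(?:go)+")); // 1: whole match only

        // Both patterns still match the same text.
        Console.WriteLine(Regex.Match("gogo", @"(?:go)+").Value); // gogo
    }
}
```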

    C. Zero-width assertions

In the earlier introduction to metacharacters, we saw characters that match the start or end of a string (^ and $) or the start or end of a word (\b). These metacharacters match only a position, and require that the position satisfy some condition, rather than matching any characters. They are therefore called zero-width assertions. "Zero-width" means they match a position, not characters; "assertion" means a test. In a regular expression, matching continues only when the assertion holds.

Sometimes we need to match a specific position that is not simply the boundary of a string or a word. For that we write our own assertions. The assertion syntax is:

 


Assertion syntax    Description
(?=pattern)         Positive lookahead: matches the position immediately before pattern
(?!pattern)         Negative lookahead: matches a position that is not followed by pattern
(?<=pattern)        Positive lookbehind: matches the position immediately after pattern
(?<!pattern)        Negative lookbehind: matches a position that is not preceded by pattern

Table 3. Assertion syntax and descriptions

Is it hard to understand? Let's take an example.


Suppose we have a tag <book> and want to extract its name (book). We can use assertions for this. Consider the expression (?<=\<)(?<tag>\w*)(?=\>). It matches the characters between < and >, which here is book. Assertions can be used to write more complex expressions too; no further examples are given here.

One more important point: the parentheses used in assertion syntax do not form capture groups, so they cannot be referenced by number or name.
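The tag example can be checked with a C# sketch like the one below. The second pattern, \d+(?= USD), and its sample text are assumptions added to show a lookahead on its own.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaroundDemo
{
    // Lookbehind for "<", a named group for the tag name, lookahead for ">".
    public static string TagName(string input) =>
        Regex.Match(input, @"(?<=\<)(?<tag>\w*)(?=\>)").Groups["tag"].Value;

    // Digits that are immediately followed by " USD" (hypothetical example).
    public static string UsdAmount(string input) =>
        Regex.Match(input, @"\d+(?= USD)").Value;

    static void Main()
    {
        Console.WriteLine(TagName("<book>"));                     // book
        Console.WriteLine(UsdAmount("price: 200 EUR or 100 USD")); // 100

        // The assertions consume nothing: the match is "book", without "<" or ">".
        Console.WriteLine(Regex.Match("<book>", @"(?<=\<)(?<tag>\w*)(?=\>)").Value); // book
    }
}
```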

     D. Greedy and lazy matching


When a regular expression contains a quantifier that accepts repetition, the usual behavior is to match as many characters as possible (provided the whole expression can still match). Consider the expression a\w*b. Applied to the string aabab, it matches aabab. This is called greedy matching.

Sometimes we want as few repetitions as possible, so that the example above would yield aab. For that we use lazy matching, which is requested by adding a ? after the quantifier. The expression above becomes a\w*?b; applied to the string aabab, it matches aab and ab.

You may ask: ab has fewer repetitions than aab, so why isn't ab matched first? Because regular expressions have a rule that takes priority over greedy and lazy matching: the match that begins earliest wins.
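The greedy and lazy results described above can be reproduced with this C# sketch. The class name and the helper AllMatches are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GreedyLazyDemo
{
    // Hypothetical helper: collects the text of every match, in order.
    public static List<string> AllMatches(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            results.Add(m.Value);
        return results;
    }

    static void Main()
    {
        // Greedy: \w* takes as much as it can.
        Console.WriteLine(string.Join(", ", AllMatches("aabab", @"a\w*b")));  // aabab

        // Lazy: \w*? takes as little as it can, but the match still starts
        // at the earliest possible position, so the first result is aab.
        Console.WriteLine(string.Join(", ", AllMatches("aabab", @"a\w*?b"))); // aab, ab
    }
}
```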

     E. Comments

Syntax: (?#comment)

    For example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199)

    Note: when using comments, be careful that no stray characters such as spaces or line breaks appear before the comment's parentheses. If such characters should be ignored, it is better to enable the "ignore whitespace in pattern" option, which in C# is the IgnorePatternWhitespace member of the RegexOptions enumeration (RegexOptions is covered below).
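Both comment styles can be sketched in C# as follows. The ^(?: ... )$ wrapper is an addition so that the whole input must be one field, and the # comments in the second pattern rely on RegexOptions.IgnorePatternWhitespace, which also lets a comment run from # to the end of the line. The class and field names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentDemo
{
    // Inline (?#...) comments; no option is needed for these.
    static readonly Regex Inline = new Regex(
        @"^(?:2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199))$");

    // With IgnorePatternWhitespace, unescaped whitespace in the pattern is
    // ignored, so the pattern can be spread over several commented lines.
    static readonly Regex Spaced = new Regex(@"
        ^(?:
            2[0-4]\d      # 200-249
          | 25[0-5]       # 250-255
          | [01]?\d\d?    # 0-199
        )$", RegexOptions.IgnorePatternWhitespace);

    public static bool IsFieldInline(string s) => Inline.IsMatch(s);
    public static bool IsFieldSpaced(string s) => Spaced.IsMatch(s);

    static void Main()
    {
        Console.WriteLine(IsFieldInline("249")); // True
        Console.WriteLine(IsFieldSpaced("249")); // True
        Console.WriteLine(IsFieldInline("256")); // False
        Console.WriteLine(IsFieldSpaced("256")); // False
    }
}
```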

      F. Processing options in C#

In C#, the RegexOptions enumeration selects how a regular expression is processed. Its members are described in the RegexOptions documentation on MSDN.

      The Capture, Group, and Match classes in C#
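As a brief illustration, the sketch below uses two RegexOptions members, IgnoreCase and Multiline. It is not a full list of the enumeration, and the inputs are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionsDemo
{
    // Hypothetical helper: number of matches under the given options.
    public static int Count(string input, string pattern, RegexOptions options) =>
        Regex.Matches(input, pattern, options).Count;

    static void Main()
    {
        // IgnoreCase: letters match regardless of case.
        Console.WriteLine(Regex.IsMatch("HELLO", "hello"));                          // False
        Console.WriteLine(Regex.IsMatch("HELLO", "hello", RegexOptions.IgnoreCase)); // True

        // Multiline: ^ and $ also match at the start and end of each line.
        Console.WriteLine(Count("a1\nb2", @"^\w\d$", RegexOptions.None));      // 0
        Console.WriteLine(Count("a1\nb2", @"^\w\d$", RegexOptions.Multiline)); // 2

        // Options are flags and can be combined with |.
        Console.WriteLine(Count("A1\nB2", @"^[a-z]\d$",
            RegexOptions.IgnoreCase | RegexOptions.Multiline));                // 2
    }
}
```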

Capture class: represents the result of a single subexpression capture, that is, one substring from a single successful capture. This class has no public constructor; a collection of Capture objects can be obtained from the Group class or the Match class. Capture has three commonly used properties: Index, Length, and Value. Index is the position of the first character of the captured substring, Length is the length of the captured substring, and Value is the captured substring itself.

Group class: represents the information for one group in a regular expression, and supports matching with groups. This class has no public constructor; a collection of Group objects can be obtained from the Match class. If a group in the regular expression is named, it can be accessed by name; otherwise it is accessed by index. Note: element 0 of every Match's Groups collection (Groups[0]) is the entire string captured by that Match, which is also the Match's Value as a Capture.

Match class: represents the result of a single regular expression match. This class also has no public constructor; an instance is obtained from the Match() method of the Regex class, and a collection of instances from the Matches() method of the Regex class.

All three classes can represent the result of a single regular expression match, but the Match class provides more detail, including capture and group information. Match is therefore the most commonly used of the three.
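How the three classes relate can be seen in a short C# sketch. The group names area and number, the class name, and the sample inputs are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class MatchGroupCaptureDemo
{
    // Phone pattern with two named groups (names are hypothetical).
    public static Match FindPhone(string input) =>
        Regex.Match(input, @"(?<area>0\d{2,3})-(?<number>\d{7,8})");

    static void Main()
    {
        Match m = FindPhone("call 0379-65624150 now");

        // Match: Index, Length and Value describe the whole match.
        Console.WriteLine(m.Index);  // 5
        Console.WriteLine(m.Length); // 13
        Console.WriteLine(m.Value);  // 0379-65624150

        // Group: Groups[0] is the whole match; named groups are read by name.
        Console.WriteLine(m.Groups[0].Value == m.Value); // True
        Console.WriteLine(m.Groups["area"].Value);       // 0379
        Console.WriteLine(m.Groups["number"].Value);     // 65624150

        // Capture: a repeated group keeps every capture; Value is the last one.
        Group g = Regex.Match("abc", @"(\w)+").Groups[1];
        Console.WriteLine(g.Value);             // c
        Console.WriteLine(g.Captures.Count);    // 3
        Console.WriteLine(g.Captures[0].Value); // a
    }
}
```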