Regular Expressions

The Anchor Characters: ^ and $

Pattern   Matches
^A        "A" at the beginning of a line
A$        "A" at the end of a line
A^        "A^" anywhere on a line
$A        "$A" anywhere on a line
^^        "^" at the beginning of a line
$$        "$" at the end of a line

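As a quick check of the anchors, here is a minimal sketch using Python's re module. Python is an assumption made for illustration; the text above does not name a tool, and Python's dialect differs from the syntax in the table. In particular, Python keeps "^" special in the middle of a pattern, so a literal "A^" has to be written with a backslash. The sample lines are made up.

```python
import re

lines = ["Apple", "banana A", "A", "cA^d"]

# "^A" anchors the match to the beginning of the line.
starts_with_a = [s for s in lines if re.search(r"^A", s)]

# "A$" anchors the match to the end of the line.
ends_with_a = [s for s in lines if re.search(r"A$", s)]

# Python dialect difference: "^" stays special mid-pattern, so escape it
# to search for the two literal characters "A^".
literal_a_caret = [s for s in lines if re.search(r"A\^", s)]

print(starts_with_a)    # ['Apple', 'A']
print(ends_with_a)      # ['banana A', 'A']
print(literal_a_caret)  # ['cA^d']
```
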
Match any character with .

The character "." is one of the special meta-characters. By itself it matches any single character except the end-of-line character.
The pattern that matches a line containing exactly one character is ^.$
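To see the single-character pattern in action, here is a small sketch in Python (an assumed tool, chosen only for illustration). The candidate strings are invented, and each one stands in for a line of text.

```python
import re

one_char = re.compile(r"^.$")

# Each string stands in for one line of input (no trailing newline).
candidates = ["x", "", "xy", " "]
matches = [s for s in candidates if one_char.search(s)]

print(matches)  # ['x', ' ']
```

The empty line is rejected because "." must consume exactly one character, and a space counts as a character like any other.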

Specifying a Range of Characters with [...]

If you want to match specific characters, you can use square brackets to list the exact characters you are searching for.
The pattern that matches any line of text containing exactly one digit is ^[0123456789]$
This is verbose. You can use a hyphen between two characters to specify a range: ^[0-9]$
You can intermix explicit characters with character ranges. This pattern matches a single character that is a letter, digit, or underscore: [A-Za-z0-9_]
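The range patterns can be tried out with a short Python sketch. Python is an assumption here, and the test strings are made up for the example.

```python
import re

single_digit = re.compile(r"^[0-9]$")
word_char = re.compile(r"[A-Za-z0-9_]")

# Only a line consisting of exactly one digit matches the anchored range.
digit_lines = [s for s in ["7", "42", "x", ""] if single_digit.search(s)]

# Without anchors, the set matches the first letter, digit, or underscore found.
first_word_chars = [word_char.search(s).group() for s in ["  foo", "-_-", "#9"]]

print(digit_lines)       # ['7']
print(first_word_chars)  # ['f', '_', '9']
```
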

Exceptions in a character set

You can search for all characters except those in the square brackets by putting a "^" as the first character after the "[".
To match all characters except vowels, use "[^aeiou]".
Just as the anchors lose their special meaning in places where they cannot act as anchors, the characters "]" and "-" have no special meaning if they directly follow the "[" (or the "[^"). Here are some examples:

Regular Expression   Matches
[]                   The characters "[]"
[0]                  The character "0"
[0-9]                Any digit
[^0-9]               Any character other than a digit
[-0-9]               Any digit or a "-"
[0-9-]               Any digit or a "-"
[^-0-9]              Any character except a digit or a "-"
[]0-9]               Any digit or a "]"
[0-9]]               Any digit followed by a "]"
[0-9-z]              Any digit, or any character between "9" and "z"
[0-9\-a\]]           Any digit, or a "-", an "a", or a "]"

The last row assumes a dialect in which a backslash escapes the next character inside brackets; in POSIX bracket expressions the backslash is an ordinary character.
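A few of these rows can be checked with a short Python sketch. Python is an assumed tool here, and its dialect is not identical to the one in the table: for instance, it rejects the bare "[]" from the first row as an unterminated set. The input strings are invented for the example.

```python
import re

# "^" right after "[" negates the set: everything except vowels.
non_vowels = re.findall(r"[^aeiou]", "education")

# A "-" directly after "[" or directly before "]" is an ordinary character.
dash_first = re.findall(r"[-0-9]", "10-4 ok")
dash_last = re.findall(r"[0-9-]", "10-4 ok")

# A "]" directly after "[" is an ordinary character too.
bracket_first = re.findall(r"[]0-9]", "a[1]")

# After "[^", a leading "-" is still literal.
not_dash_or_digit = re.findall(r"[^-0-9]", "1-a")

print(non_vowels)         # ['d', 'c', 't', 'n']
print(dash_first)         # ['1', '0', '-', '4']
print(dash_last)          # ['1', '0', '-', '4']
print(bracket_first)      # ['1', ']']
print(not_dash_or_digit)  # ['a']
```
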

Repeating character sets with *

The special character "*" matches zero or more copies of the character set that precedes it. That is, the regular expression "0*" matches zero or more zeros, while the expression "[0-9]*" matches zero or more digits.
This explains why the pattern "^#*" is useless: it matches any number of "#" characters at the beginning of the line, including zero. It therefore matches every line, because every line starts with zero or more "#" characters.
Use "^ *" to match zero or more spaces at the beginning of the line. If you need to match one or more, repeat the character set. That is, "[0-9]*" matches zero or more digits, and "[0-9][0-9]*" matches one or more digits.
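The "zero or more" behaviour is easy to confirm with a small Python sketch (Python is an assumption; the sample lines are made up). It shows that "^#*" accepts every line, that "^ *" measures leading spaces, and that doubling the set enforces at least one digit.

```python
import re

lines = ["# comment", "code", "## heading"]

# "^#*" matches zero or more "#", so it succeeds on every line.
star_matches_all = all(re.search(r"^#*", s) for s in lines)

# "^ *" always matches; the length of the match is the indentation width.
leading_spaces = [len(re.match(r"^ *", s).group()) for s in ["    x", "y", "  z"]]

# Repeating the set requires at least one digit somewhere on the line.
one_or_more = [s for s in ["abc", "a1", "2024"] if re.search(r"[0-9][0-9]*", s)]

print(star_matches_all)  # True
print(leading_spaces)    # [4, 0, 2]
print(one_or_more)       # ['a1', '2024']
```
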

Matching a specific number of sets with \{ and \}

You can specify the minimum and maximum number of repeats by putting those two numbers, separated by a comma, between "\{" and "\}".
The backslashes deserve a special discussion. Normally a backslash turns off the special meaning of a character. A period is matched by "\." and an asterisk is matched by "\*".
If a backslash is placed before a "<", ">", "{", "}", "(", ")", or before a digit, the backslash turns on a special meaning.
The regular expression to match 4, 5, 6, 7 or 8 lowercase letters is [a-z]\{4,8\}
Any numbers between 0 and 255 can be used. The second number may be omitted, which removes the upper limit. If the comma and the second number are omitted, the pattern must be repeated exactly the number of times specified by the first number.

Regular Expression   Matches
*                    Any line with an asterisk
\*                   Any line with an asterisk
\\                   Any line with a backslash
^*                   Any line starting with an asterisk
^A*                  Any line
^A\*                 Any line starting with an "A*"
^AA*                 Any line starting with one or more "A"s
^AA*B                Any line starting with one or more "A"s followed by a "B"
^A\{4,8\}B           Any line starting with 4, 5, 6, 7 or 8 "A"s followed by a "B"
^A\{4,\}B            Any line starting with 4 or more "A"s followed by a "B"
^A\{4\}B             Any line starting with "AAAAB"
\{4,8\}              Any line with "{4,8}"
A{4,8}               Any line with "A{4,8}"
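The repeat counts can be checked with a Python sketch, with one important caveat: Python is an assumed tool, and its dialect writes the braces without backslashes, so "\{4,8\}" from the text becomes "{4,8}" here. The roles of the escaped and unescaped braces are effectively swapped compared with the table above. The test strings are generated for the example.

```python
import re

# Lines made of n "A"s followed by a "B", keyed by n.
samples = {n: "A" * n + "B" for n in (3, 4, 8, 9)}

def counts_matching(pattern):
    """Return the values of n whose sample line matches the pattern."""
    return [n for n, line in samples.items() if re.search(pattern, line)]

four_to_eight = counts_matching(r"^A{4,8}B")  # text's ^A\{4,8\}B
at_least_four = counts_matching(r"^A{4,}B")   # text's ^A\{4,\}B
exactly_four = counts_matching(r"^A{4}B")     # text's ^A\{4\}B

# [a-z]{4,8} takes as many lowercase letters as it can, up to eight.
longest_run = re.search(r"[a-z]{4,8}", "abcdefghij").group()

print(four_to_eight)  # [4, 8]
print(at_least_four)  # [4, 8, 9]
print(exactly_four)   # [4]
print(longest_run)    # abcdefgh
```
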

Matching words with \< and \>

Searching for a word isn't quite as simple as it first appears. The string "the" will match the word "other". You can put spaces before and after the letters and use this regular expression: " the ". However, this does not match words at the beginning or end of the line, and it does not match when a punctuation mark follows the word.
There is an easy solution. The characters "\<" and "\>" are similar to the "^" and "$" anchors in that they don't occupy the position of a character. They "anchor" the expression between them so that it only matches on a word boundary.
The pattern to search for the word "the" would be "\<[tT]he\>". The "t" must either be at the beginning of the line or be preceded by a character other than a letter, number, or underscore. The "e" must either be at the end of the line or be followed by a character other than a letter, number, or underscore.
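Here is a hedged sketch of the same idea in Python. Python is an assumption, and it does not use "\<" and "\>"; its closest equivalent is "\b", which matches at either edge of a word. The sample lines are invented to cover the cases discussed above.

```python
import re

# Python dialect: "\b" stands in for both "\<" and "\>".
the_word = re.compile(r"\b[tT]he\b")

lines = ["the end", "other things", "The cat.", "breathe", "see the_x", "at the, then"]

word_hits = [s for s in lines if the_word.search(s)]

# The naive " the " pattern misses words at the line start or before punctuation.
naive_hits = [s for s in lines if re.search(r" the ", s)]

print(word_hits)   # ['the end', 'The cat.', 'at the, then']
print(naive_hits)  # []
```

Note that "see the_x" is rejected because the underscore counts as a word character, which agrees with the rule stated above.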
