These notes were taken while or after taking the Learning Regular Expressions course found on lynda.com and taught by Kevin Skoglund.
- Regular Expressions is a formal language interpreted by a regular expression processor.
- It is used for matching, searching, and replacing text.
- Regex is case sensitive.
- Regular Expressions are eager and greedy. See respective sections below.
- It is not a programming language. It is a text manipulation tool, which helps return text only. Therefore, working with numbers can be tricky. See the `[50-99]` example below.
- Metacharacters inside character sets are usually treated as literals (already escaped). There is no need to escape them again. They no longer have their metacharacter meaning. Ex: `/h[abc.xyz]t/` matches "hat" and "h.t", but does not match "hot" because the `.` within the set is not a wildcard.
  - Exceptions: `]`, `-`, `^`, and `\`. These characters need to be escaped within a character set.
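A minimal sketch of the case-sensitivity and "text only" points above, assuming Python's built-in `re` module as the regex processor (the course itself is engine-agnostic, and the sample strings are made up for illustration):

```python
import re

# Case sensitivity: a lowercase pattern does not match an uppercase letter.
lower_hit = re.search(r"regex", "Regex is fun")
exact_hit = re.search(r"Regex", "Regex is fun")
print(lower_hit)          # None
print(exact_hit.group())  # Regex

# Matches are always text, even when they look like numbers.
room = re.search(r"[0-9][0-9]", "Room 42").group()
print(type(room).__name__)  # str
print(int(room) + 1)        # 43 (convert explicitly before doing arithmetic)
```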
Metacharacter | What does it do? | Example |
---|---|---|
`.` | Matches any one character, except newline. | `/h.t/` matches "hat", "hot", and "hit", but does not match "heat". |
`\` | Escapes the next character. Allows metacharacters to be used as literal characters. | `/9\.00/` matches "9.00", but not "9500" or "9-00". |
`[]` | Creates a character set, which matches any one of the characters within the set. | `/gr[ea]y/` matches "gray" and "grey", but does not match "greay", because only one of the characters within the set is matched at a time. |
`-` | Matches all characters between two characters (a range). Only works within a set `[]`. Outside of a set, it is only a literal dash. | `/[0-9]/` matches 0 through 9. `/[A-Za-z]/` matches the alphabet in lower and upper case. However, `/[50-99]/` does not match the numbers from 50 to 99. Regex looks at text only, so this set reads as 5, the range 0-9, and 9, and therefore matches only a single digit between 0 and 9. |
`^` | When it is the first character inside a set, negates the set: matches any one character that is not in the set. | `/[^A-Z0-9]/` matches any one character that is not an uppercase letter or a digit, so only lowercase letters, spaces, and special characters are found. Another example: `/see[^mn]/` matches "see " (with a space at the end), but does not match "seen", "seem", or "see" without a trailing character. |
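The rows of the table can be checked with a short sketch, again assuming Python's `re` module. The variable names are made up for illustration, and `[5-9][0-9]` is just one possible way to match the two-digit numbers 50 to 99, not something taken from the course:

```python
import re

wildcard = re.compile(r"h.t")           # "." is any one character except newline
price = re.compile(r"9\.00")            # "\." is a literal dot
grey = re.compile(r"gr[ea]y")           # exactly one of "e" or "a"
single_digit = re.compile(r"[50-99]")   # the set {5, 0-9, 9}: any ONE digit
fifty_to_99 = re.compile(r"[5-9][0-9]") # assumption: one way to cover 50..99
see_not_mn = re.compile(r"see[^mn]")    # "see" plus one character that is not m or n

print(bool(wildcard.fullmatch("hot")), bool(wildcard.fullmatch("heat")))  # True False
print(bool(price.fullmatch("9.00")), bool(price.fullmatch("9500")))       # True False
print(bool(grey.fullmatch("grey")), bool(grey.fullmatch("greay")))        # True False
print(bool(single_digit.fullmatch("75")))                                 # False
print(bool(fifty_to_99.fullmatch("75")), bool(fifty_to_99.fullmatch("49")))  # True False
print(bool(see_not_mn.search("see you")), bool(see_not_mn.search("seen")))   # True False
```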
- Space: a space is an actual character in regex. So `/cat/` does not match `c a t`.
- Tabs `\t`: To find tab characters. To match an "a" and a "b" separated by a tab, I'd need to search for `/a\tb/`.
- Line returns and new lines: If I need to find a line return (when somebody hits "Enter" to start a new line) or a new line, there are a few options:
  - `\r`: Finds a return.
  - `\n`: Finds a new line.
  - `\r\n`: Finds a return and a new line.
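A small sketch of the whitespace rules above, assuming Python's `re` module. The sample text is made up and mixes a Windows-style line ending (`\r\n`) with a plain new line (`\n`):

```python
import re

# A space is a real character, so "cat" is not found in "c a t".
spaced = re.search(r"cat", "c a t")
print(spaced)  # None

# The pattern \t matches an actual tab character in the text.
tabbed = re.search(r"a\tb", "a\tb")
print(tabbed is not None)  # True

# Counting each kind of line ending in a sample string.
text = "first\r\nsecond\nthird"
returns = re.findall(r"\r", text)
newlines = re.findall(r"\n", text)
pairs = re.findall(r"\r\n", text)
print(len(returns), len(newlines), len(pairs))  # 1 2 1
```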
Most metacharacters are already escaped when they are inside a character set: `/h[abc.xyz]t/` matches `hat` and `h.t`, but does not match `hot`, because the `.` character is not a metacharacter here. Since the `.` is inside the set, it is automatically escaped (treated as a literal dot).
Exceptions:

These characters need to be escaped manually:

- `]`: Closing square bracket only. (The opening square bracket is escaped automatically when found inside of a set).
- `-`: Dash, or range metacharacter.
- `^`: Caret.
- `\`: Backslash. This gets tricky, because this is the metacharacter used to escape other metacharacters, including itself. So, in order to escape `\`, I'd need to use a second `\`, like this: `\\` 😱
Ex: Write a regular expression to match `file01`, `file-1`, `file\1`, and `file_1`.

Answer: `/file[0\-\\_]1/`

Explanation by parts:

- `file[0`: creates the match for the first part of our example: `file0`.
- `file[0\-`: here the `\` escapes the `-`, to make it a literal. This way, it would match `file-`.
- `file[0\-\\]`: here the first `\` escapes the second `\` to make it a literal. This way, it would match `file\`.
- `file[0\-\\_]1`: The full regular expression. Note that `_` isn't a metacharacter and therefore does not need to be escaped. The `]` closes the set and `1` concludes the expression.
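The answer can be tried out with a quick sketch, assuming Python's `re` module. The `candidates` list is made up for illustration, with `file21` added as a name that should not match:

```python
import re

# Raw string, so the regex engine receives exactly: file[0\-\\_]1
pattern = re.compile(r"file[0\-\\_]1")

# "file\\1" is the Python spelling of the text file\1 (one backslash).
candidates = ["file01", "file-1", "file\\1", "file_1", "file21"]
matches = [name for name in candidates if pattern.fullmatch(name)]
print(matches)  # the first four names; "file21" is left out
```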
Shorthand | Meaning | Equivalent |
---|---|---|
\d | Digit | [0-9] |
\w | Word character | [a-zA-Z0-9_] |
\s | Whitespace | [ \t\r\n] (space , tab \t , return \r or new line \n ) |
\D | Not digit | [^0-9] |
\W | Not word character | [^a-zA-Z0-9_] |
\S | Not whitespace | [^ \t\r\n] |
Notes on `\w`:
- Underscore (_) is a word character.
- Hyphen (-) is not a word character.
- Digits are word characters.
Ex:

- `/\d\d\d\d/` matches "1984", but not "text", because `\d` looks for digits only.
- `/\w\w\w/` matches "ABC", "123", and "1_A", because digits and the underscore character are considered word characters.
- `/\w\s\w\w/` matches "I am", but not "Am I", because it looks for one word character, followed by a whitespace character, followed by two word characters.
- `/[\w\-]/` matches one word character or a hyphen.
- `/[\d\s]/` matches any one digit or whitespace character.
- `/[^\d]/` is the same as `/\D/` and `/[^0-9]/`.
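A short sketch of the shorthand examples, assuming Python's `re` module. One caveat: for Python text strings, `\d` and `\w` also accept non-ASCII digits and letters by default, so the "Equivalent" column holds exactly only for ASCII input:

```python
import re

year = re.fullmatch(r"\d\d\d\d", "1984")
not_year = re.fullmatch(r"\d\d\d\d", "text")
print(year is not None, not_year is not None)  # True False

# Digits and the underscore count as word characters.
word_hits = [s for s in ["ABC", "123", "1_A", "a-b"] if re.fullmatch(r"\w\w\w", s)]
print(word_hits)  # ['ABC', '123', '1_A']

# One word character, one whitespace character, two word characters.
i_am = re.fullmatch(r"\w\s\w\w", "I am")
am_i = re.fullmatch(r"\w\s\w\w", "Am I")
print(i_am is not None, am_i is not None)  # True False
```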
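❗ Attention: `/[^\d\s]/` is not the same as `/[\D\S]/`:

- `/[^\d\s]/` means neither a digit nor a whitespace character. The whole set is negated.
- `/[\D\S]/` means either not a digit or not a whitespace character. Since no character is both a digit and a whitespace character, this set matches any character.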
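The difference is easy to see on a tiny sample, sketched here with Python's `re` module (the sample string is made up):

```python
import re

sample = "a1 -"  # a letter, a digit, a space, and a hyphen

# Negated set: keep characters that are neither digits nor whitespace.
neither = re.findall(r"[^\d\s]", sample)
print(neither)  # ['a', '-']

# Union of two negated shorthands: every character passes at least one test.
either = re.findall(r"[\D\S]", sample)
print(either)  # ['a', '1', ' ', '-']
```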
These POSIX bracket expressions are not as widely used as the shorthand character sets above, but could be useful. You never know 🤷
Class | Meaning | Equivalent |
---|---|---|
[:alpha:] | Alphabetic characters | A-Za-z |
[:digit:] | Numeric characters | 0-9 |
[:alnum:] | Alphanumeric characters | A-Za-z0-9 |
[:lower:] | Lowercase alphabetic characters | a-z |
[:upper:] | Uppercase alphabetic characters | A-Z |
[:punct:] | Punctuation characters | |
[:space:] | Space characters | \s |
[:blank:] | Blank characters (space, tab) | |
[:print:] | Printable characters, spaces | |
[:graph:] | Printable characters, no spaces | |
[:cntrl:] | Control characters (non-printable) | |
[:xdigit:] | Hexadecimal characters | A-Fa-f0-9 |
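Python's built-in `re` module does not support these POSIX classes, so the sketch below expands a few of them by hand using the "Equivalent" column. The `POSIX_EQUIVALENTS` table and the `posix_set` helper are hypothetical names made up for this example, not part of any library:

```python
import re

# Hypothetical lookup table, copied from the "Equivalent" column above.
POSIX_EQUIVALENTS = {
    "alpha": "A-Za-z",
    "digit": "0-9",
    "alnum": "A-Za-z0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "A-Fa-f0-9",
}

def posix_set(name):
    """Build a one-character set such as [A-Z] for a POSIX class name."""
    return "[" + POSIX_EQUIVALENTS[name] + "]"

uppercase = re.findall(posix_set("upper"), "Hello World")
hex_chars = re.findall(posix_set("xdigit"), "0xBEEF!")
print(uppercase)  # ['H', 'W']
print(hex_chars)  # ['0', 'B', 'E', 'E', 'F']
```

Engines that do support POSIX classes accept the class name directly inside a set, for example `[[:upper:]]`.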
Metacharacter | Meaning |
---|---|
`*` | Preceding item appears zero or more times |
`+` | Preceding item appears one or more times |
`?` | Preceding item appears zero or one times |
Examples:

- `/apples*/` matches "apple", "apples" and "applesssss".
- `/apples+/` matches "apples" and "applesssss", but does not match "apple".
- `/apples?/` matches "apple" and "apples", but does not match "applesssss".
- `/\d\d\d\d*/` matches numbers with 3 digits or more, i.e. 123, 1234, 12345, 123456 ...
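A sketch of the three quantifiers, assuming Python's `re` module. The `matching` helper is a hypothetical convenience that keeps only the words a pattern matches in full:

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

fruit = ["apple", "apples", "applesssss"]
star = matching(r"apples*", fruit)      # zero or more "s"
plus = matching(r"apples+", fruit)      # one or more "s"
optional = matching(r"apples?", fruit)  # zero or one "s"
print(star)      # ['apple', 'apples', 'applesssss']
print(plus)      # ['apples', 'applesssss']
print(optional)  # ['apple', 'apples']

# Each plain \d is required; only the last \d is repeated by "*".
numbers = ["12", "123", "12345"]
two_or_more = matching(r"\d\d\d*", numbers)
three_or_more = matching(r"\d\d\d\d*", numbers)
print(two_or_more)    # ['12', '123', '12345']
print(three_or_more)  # ['123', '12345']
```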
Use curly braces to create a repetition set: {min,max}. A minimum value is always required and it can be equal to 0. A maximum value is not required, as it can be infinite.
Examples:

- `\d{4,8}` matches numbers with four to eight digits.
- `\d{4}` matches numbers with four digits only.
- `\d{4,}` matches numbers with four or more digits.
- `\w{5}\s` matches 5 word characters followed by a whitespace character.
- `\d{3}-\d{3}-\d{4}` would match phone numbers like 555-867-5309.
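The repetition examples can be sketched the same way, assuming Python's `re` module. The sample strings and variable names are made up for illustration:

```python
import re

digits = ["123", "1234", "12345678", "123456789"]

four_to_eight = [d for d in digits if re.fullmatch(r"\d{4,8}", d)]
exactly_four = [d for d in digits if re.fullmatch(r"\d{4}", d)]
four_or_more = [d for d in digits if re.fullmatch(r"\d{4,}", d)]
print(four_to_eight)  # ['1234', '12345678']
print(exactly_four)   # ['1234']
print(four_or_more)   # ['1234', '12345678', '123456789']

# Pulling a phone number out of a longer sentence.
phone = re.search(r"\d{3}-\d{3}-\d{4}", "Call 555-867-5309 today").group()
print(phone)  # 555-867-5309
```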
Match as much as possible before giving control to the next expression part.
Examples:

- `/.+\.jpg/` matches "filename.jpg"
  - The `+` is greedy, but "gives back" the `.jpg` to make the match.
  - Think of it as rewinding or backtracking.
- `/.*[0-9]+/` matches "Page 266". `/.*/` matches "Page 26" while `/[0-9]+/` matches "6".
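To see which part of the text each piece of the expression ends up with, the sketch below wraps the pieces in capturing groups (parentheses). This assumes Python's `re` module, and the groups are added only for inspection:

```python
import re

# The greedy ".+" first takes the whole string, then gives back ".jpg".
jpg = re.search(r".+\.jpg", "filename.jpg")
print(jpg.group())  # filename.jpg

# Group 1 is the greedy ".*", group 2 is "[0-9]+".
page = re.fullmatch(r"(.*)([0-9]+)", "Page 266")
print(page.group(1))  # Page 26
print(page.group(2))  # 6
```

The `.*` keeps as much as it can and hands over only the single digit that `[0-9]+` needs in order to match.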
Match as little as possible before giving control to the next expression part.
Metacharacter | Meaning |
---|---|
`?` | Makes the preceding quantifier lazy (it matches as few characters as possible) |
Syntax:
- `*?`
- `+?`
- `{min,max}?`
- `??`
Examples:

- `/apples??/`: The first `?` means that this expression will match "apple" or "apples" (0 or 1 "s"). However, because regular expressions are greedy, one `?` on its own means it prefers to return "apples". The second `?` tells the expression to be lazy and take zero characters, therefore matching "apple".
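A side-by-side sketch of greedy and lazy matching, assuming Python's `re` module. The capturing groups in the second half are added only to show which part takes which text:

```python
import re

greedy = re.search(r"apples?", "apples").group()
lazy = re.search(r"apples??", "apples").group()
print(greedy)  # apples
print(lazy)    # apple

# Greedy ".*" keeps as much as possible; lazy ".*?" keeps as little as possible.
greedy_parts = re.match(r"(.*)([0-9]+)", "Page 266").groups()
lazy_parts = re.match(r"(.*?)([0-9]+)", "Page 266").groups()
print(greedy_parts)  # ('Page 26', '6')
print(lazy_parts)    # ('Page ', '266')
```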