Regular expression syntax

The following sections describe the basic rules of regular expression syntax in ColdFusion.

Using character sets

The pattern within the brackets of a regular expression defines a character set that is used to match a single character. For example, the regular expression "[A-Za-z]" matches any single uppercase or lowercase letter.

In the character set, a hyphen indicates a range of characters, for example [A-Z] will match any one capital letter.

In a character set, a ^ as the first character negates the set. For example, [^A-Z] matches any single character that is not a capital letter.

The regular expression "B[IAU]G" matches the strings "BIG", "BAG", and "BUG", but does not match the string "BOG".

If you specify the regular expression as "B[IA][GN]", the concatenation of character sets creates a regular expression that matches the corresponding concatenation of characters in the search string. This regular expression matches "B", followed by an "I" or "A", followed by a "G" or "N". The regular expression matches "BIG", "BAG", "BIN", and "BAN".

The regular expression [A-Z][a-z]* matches any sequence of letters that starts with an uppercase letter and is followed by zero or more lowercase letters. The special character * after the closing square bracket specifies to match zero or more occurrences of the character set.

Note: The * applies only to the character set that immediately precedes it, not to the entire regular expression.

A + after the closing square bracket specifies to find one or more occurrences of the character set. The regular expression " [A-Z]+ " matches one or more uppercase letters enclosed by spaces. Therefore, this regular expression matches " BIG " and also matches " LARGE ", " HUGE ", " ENORMOUS ", and any other string of uppercase letters surrounded by spaces.
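
The character-set rules above can be tried with a short script. This is a minimal sketch, assuming the default Perl-compatible engine; the search strings are made up for illustration, and REFind returns the position of the first match or 0 when nothing matches.

```cfml
<cfscript>
// Each character set matches exactly one character of the search string.
writeOutput(reFind("B[IAU]G", "BIG") & "<br>");    // 1: "I" is in the set
writeOutput(reFind("B[IAU]G", "BOG") & "<br>");    // 0: "O" is not in the set
writeOutput(reFind("B[IA][GN]", "BAN") & "<br>");  // 1: "A" then "N"

// The + applies only to the character set that immediately precedes it.
writeOutput(reFind(" [A-Z]+ ", "a LARGE box"));    // 2: the space before "LARGE"
</cfscript>
```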

Changes in ColdFusion (2018 release) Update 5

In Update 5 of the 2018 release of ColdFusion, the application flag useJavaAsRegex is introduced. If you enable this flag at the application level, you can override the default Regex engine and you can use the Java Regex engine.
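
At the application level, a flag like this is normally set in Application.cfc. The following is a sketch only: it assumes the flag is exposed as the application setting this.useJavaAsRegex, and the application name is hypothetical.

```cfml
// Application.cfc
component {
    this.name = "regexDemo";      // hypothetical application name
    this.useJavaAsRegex = true;   // assumption: switches this application to the Java Regex engine
}
```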

You can also enable the option at the server level. Enable the option Use Java As Regex Engine, located in Server Settings > Settings of the ColdFusion Administrator.

For example, using the default Perl-regex engine, you can write the following snippet:

<cfscript> 
     writeOutput(refind("[[:digit:]]","abc 456 ABC 789? Paraguay for $99 airfare!")) // Returns 5 
</cfscript>

After you enable the flag useJavaAsRegex, you can rewrite the snippet above as follows:

<cfscript> 
     writeOutput(refind("\p{Digit}","abc 456 ABC 789? Paraguay for $99 airfare!")) // Returns 5 
</cfscript>

Feature comparison of Perl and Java Regex engines

Feature Java Perl
Backslash escapes one metacharacter YES YES
\Q...\E escapes a string of metacharacters Java 6 YES
\x00 through \xFF (ASCII character) YES YES
\n (LF) YES YES
\f (form feed) and \v (vtab) YES YES
\a (bell) YES YES
\e (escape) YES YES
\b (backspace) and \B (backslash) no no
\cA through \cZ (control character) YES YES
\ca through \cz (control character) no YES
[abc] character class YES YES
[^abc] negated character class YES YES
[a-z] character class range YES YES
Hyphen in [\d-z] is a literal YES YES
Hyphen in [a-\d] is a literal no no
Backslash escapes one character class metacharacter YES YES
\Q...\E escapes a string of character class metacharacters Java 6 YES
\d shorthand for digits ascii YES
\w shorthand for word characters ascii YES
\s shorthand for whitespace ascii YES
\D YES YES
[\b] backspace YES YES
. (dot; any character except line break) YES YES
^ (start of string/line) YES YES
$ (end of string/line) YES YES
\A (start of string) YES YES
\Z (end of string, before final line break) YES YES
\z (end of string) YES YES
\` (start of string) no no
\' (end of string) no no
\b (at the beginning or end of a word) YES YES
\B (NOT at the beginning or end of a word) YES YES
\y (at the beginning or end of a word) no no
\Y (NOT at the beginning or end of a word) no no
\m (at the beginning of a word) no no
\M (at the end of a word) no no
\< (at the beginning of a word) no no
\> (at the end of a word) no no
| (alternation) YES YES
? (0 or 1) YES YES
* (0 or more) YES YES
+ (1 or more) YES YES
{n} (exactly n) YES YES
{n,m} (between n and m) YES YES
{n,} (n or more) YES YES
? after any of the above quantifiers to make it "lazy" YES YES
(regex) (numbered capturing group) YES YES
(?:regex) (non-capturing group) YES YES
\1 through \9 (backreferences) YES YES
\10 through \99 (backreferences) YES YES
Forward references \1 through \9 YES YES
Nested references \1 through \9 YES YES
Backreferences to non-existent groups are an error YES YES
Backreferences to failed groups also fail YES YES
(?i) (case insensitive) YES YES
(?s) (dot matches newlines) YES YES
(?m) (^ and $ match at line breaks) YES YES
(?x) (free-spacing mode) YES YES
(?n) (explicit capture) no no
(?-ismxn) (turn off mode modifiers) YES YES
(?ismxn:group) (mode modifiers local to group) YES YES
(?>regex) (atomic group) YES YES
?+, *+, ++ and {m,n}+ (possessive quantifiers) YES
(?=regex) (positive lookahead) YES YES
(?!regex) (negative lookahead) YES YES
(?<=text) (positive lookbehind) finite length fixed length
(?<!text) (negative lookbehind) finite length fixed length
\G (start of match attempt) YES YES
(?(?=regex)then|else) (using any lookaround) no YES
(?(regex)then|else) no no
(?(1)then|else) no YES
(?(group)then|else) no no
(?#comment) no YES
Free-spacing syntax supported YES YES
Character class is a single token no YES
# starts a comment YES YES
\X (Unicode grapheme) no YES
\u0000 through \uFFFF (Unicode character) YES no
\x{0} through \x{FFFF} (Unicode character) no YES
\pL through \pC (Unicode properties) YES YES
\p{L} through \p{C} (Unicode properties) YES YES
\p{Lu} through \p{Cn} (Unicode property) YES YES
\p{L&} and \p{Letter&} (equivalent of [\p{Lu}\p{Ll}\p{Lt}] Unicode properties) no YES
\p{IsL} through \p{IsC} (Unicode properties) YES YES
\p{IsLu} through \p{IsCn} (Unicode property) YES YES
\p{Letter} through \p{Other} (Unicode properties) no YES
\p{Lowercase_Letter} through \p{Not_Assigned} (Unicode property) no YES
\p{IsLetter} through \p{IsOther} (Unicode properties) no YES
\p{IsLowercase_Letter} through \p{IsNot_Assigned} (Unicode property) no YES
\p{Arabic} through \p{Yi} (Unicode script) no YES
\p{IsArabic} through \p{IsYi} (Unicode script) no YES
\p{BasicLatin} through \p{Specials} (Unicode block) no YES
\p{InBasicLatin} through \p{InSpecials} (Unicode block) YES YES
\p{IsBasicLatin} through \p{IsSpecials} (Unicode block) no YES
Part between {} in all of the above is case insensitive no YES
\P (negated variants of all \p as listed above) YES YES
\p{^...} (negated variants of all \p{...} as listed above) no YES
(?<name>regex) (.NET-style named capturing group) no no
(?'name'regex) (.NET-style named capturing group) no no
\k<name> (.NET-style named backreference) no no
\k'name' (.NET-style named backreference) no no
(?P<name>regex) (Python-style named capturing group) no no
(?P=name) (Python-style named backreference) no no
multiple capturing groups can have the same name n/a n/a
\i no no
[abc-[abc]] character class subtraction no no
[:alpha:] POSIX character class no YES
\p{Alpha} POSIX character class ascii no
\p{IsAlpha} POSIX character class no YES
[.span-ll.] POSIX collation sequence no no
[=x=] POSIX character equivalence no no

For more information on Regex patterns in Java, see the official Oracle docs on Regex.

Considerations when using special characters

Since a regular expression followed by an * can match zero instances of the regular expression, it can also match the empty string. For example,

<cfoutput>
REReplace("Hello","[T]*","7","ALL") - #REReplace("Hello","[T]*","7","ALL")#<BR>
</cfoutput>

results in the following output:

REReplace("Hello","[T]*","7","ALL") - 7H7e7l7l7o7

The regular expression [T]* can match empty strings. It first matches the empty string before "H" in "Hello". The "ALL" argument tells REReplace to replace all instances of an expression, so the empty string before "e" is matched next, and so on, through the empty string after the final "o".

This result might be unexpected. The workarounds for these types of problems are specific to each case. In some cases you can use [T]+, which requires at least one "T", instead of [T]*. Alternatively, you can specify an additional pattern after [T]*. 
In the following examples the regular expression has a "W" at the end:

<cfoutput>
REReplace("Hello World","[T]*W","7","ALL") - #REReplace("Hello World","[T]*W","7","ALL")#<BR>
</cfoutput>

This expression results in the following more predictable output:

REReplace("Hello World","[T]*W","7","ALL") - Hello 7orld
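
The other workaround mentioned above, using [T]+ instead of [T]*, can be sketched the same way. The second search string is invented for illustration.

```cfml
<cfscript>
// [T]+ requires at least one "T", so it never matches the empty string.
writeOutput(reReplace("Hello", "[T]+", "7", "ALL") & "<br>");  // Hello (no "T", nothing replaced)
writeOutput(reReplace("TTest", "[T]+", "7", "ALL"));           // 7est (the run "TT" is replaced once)
</cfscript>
```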

Finding repeating characters

In some cases, you might want to find a repeating pattern of characters in a search string. For example, the regular expression "a{2,4}" specifies to match two to four occurrences of "a". Therefore, it matches "aa", "aaa", and "aaaa", but not a single "a"; in "aaaaa" it matches only the first four characters. In the following example, the REFind function returns an index of 6:

<cfset IndexOfOccurrence = REFind("a{2,4}", "hahahaaahaaaahaaaaahhh")>
<!--- The value of IndexOfOccurrence is 6 --->

The regular expression "[0-9]{3,}" specifies to match any integer number containing three or more digits: "123", "45678", and so on. However, this regular expression does not match a one-digit or two-digit number.

You use the following syntax to find repeating characters:

  1. {m,n}
    Where m is 0 or greater and n is greater than or equal to m. Match m through n (inclusive) occurrences. The expression {0,1} is equivalent to the special character ?.

  2. {m,}
    Where m is 0 or greater. Match at least m occurrences. The syntax {,n} is not allowed. The expression {1,} is equivalent to the special character +, and {0,} is equivalent to *.

  3. {m}
    Where m is 0 or greater. Match exactly m occurrences.
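
The three repetition forms can be compared side by side. This is an illustrative sketch; the search string and the variable name s are made up for this example.

```cfml
<cfscript>
s = "room 42, ext 12345";
writeOutput(reFind("[0-9]{3,}", s) & "<br>");  // 14: "42" is too short, "12345" qualifies
writeOutput(reFind("[0-9]{2}", s) & "<br>");   // 6: exactly two digits, found at "42"
writeOutput(reFind("[0-9]{6,}", s) & "<br>");  // 0: no run of six or more digits

// {0,1} behaves like ?, so both patterns accept "color".
writeOutput(reFind("colou{0,1}r", "color") & " " & reFind("colou?r", "color"));  // 1 1
</cfscript>
```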

Case sensitivity in regular expressions

ColdFusion supplies case-sensitive and case-insensitive functions for working with regular expressions. REFind and REReplace perform case-sensitive matching and REFindNoCase and REReplaceNoCase perform case-insensitive matching. 
You can build a regular expression that models case-insensitive behavior, even when used with a case-sensitive function. To make a regular expression case insensitive, substitute individual characters with character sets. For example, the regular expression [Jj][Aa][Vv][Aa], when used with the case-sensitive functions REFind or REReplace, matches all of the following string patterns:

  • JAVA
  • java
  • Java
  • jAva
  • All other combinations of case
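
As a quick comparison of the two approaches, the sketch below runs the same search three ways. The search string and the variable name phrase are hypothetical.

```cfml
<cfscript>
phrase = "I use jAvA daily";
writeOutput(reFind("java", phrase) & "<br>");          // 0: case-sensitive, no match
writeOutput(reFindNoCase("java", phrase) & "<br>");    // 7: case-insensitive function
writeOutput(reFind("[Jj][Aa][Vv][Aa]", phrase));       // 7: character sets model case insensitivity
</cfscript>
```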

Using subexpressions

Parentheses group parts of regular expressions into subexpressions that you can treat as a single unit. For example, the regular expression "ha" specifies to match a single occurrence of the string. The regular expression "(ha)+" matches one or more instances of "ha". 
In the following example, you use the regular expression "B(ha)+" to match the letter "B" followed by one or more occurrences of the string "ha":

<cfset IndexOfOccurrence = REFind("B(ha)+", "hahaBhahahaha")>
<!--- The value of IndexOfOccurrence is 5 --->

You can use the special character | in a subexpression to create a logical "OR". You can use the following regular expression to search for the word "jelly" or "jellies":

<cfset IndexOfOccurrence = REFind("jell(y|ies)", "I like peanut butter and jelly")>
<!--- The value of IndexOfOccurrence is 26 --->
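
Subexpressions can also be retrieved individually. The sketch below assumes the optional fourth argument of REFind (return subexpressions), which makes the function return a structure with pos and len arrays: element 1 describes the whole match and element 2 the first parenthesized subexpression. The search string is made up for illustration.

```cfml
<cfscript>
phrase = "I like peanut butter and jellies";
match = reFind("jell(y|ies)", phrase, 1, true);

// Element 1: the entire match.
writeOutput(mid(phrase, match.pos[1], match.len[1]) & "<br>");  // jellies
// Element 2: the text matched by (y|ies).
writeOutput(mid(phrase, match.pos[2], match.len[2]));           // ies
</cfscript>
```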

Using special characters

Regular expressions define the following list of special characters:

+ * ? . [ ^ $ ( ) { | \

In some cases, you use a special character as a literal character. For example, if you want to search for the plus sign in a string, you have to escape the plus sign by preceding it with a backslash:

"\+"

The following table describes the special characters for regular expressions:

Special Character Description
\ A backslash followed by any special character matches the literal character itself, that is, the backslash escapes the special character. For example, "\+" matches the plus sign, and "\\" matches a backslash.
. A period matches any character, including newline. To match any character except a newline, use [^#chr(13)##chr(10)#], which excludes the ASCII carriage return and line feed codes. The corresponding escape codes are \r and \n.
[ ] A one-character character set that matches any of the characters in that set. For example, "[akm]" matches an "a", "k", or "m". A hyphen in a character set indicates a range of characters; for example, [a-z] matches any single lowercase letter. If the first character of a character set is the caret (^), the regular expression matches any character except those in the set. It does not match the empty string. For example, [^akm] matches any character except "a", "k", or "m". The caret loses its special meaning if it is not the first character of the set.
^ If the caret is at the beginning of a regular expression, the matched string must be at the beginning of the string being searched. For example, the regular expression "^ColdFusion" matches the string "ColdFusion lets you use regular expressions" but not the string "In ColdFusion, you can use regular expressions." In a character set (that is, within square brackets), a caret as the first character negates the set. For example, [^A] matches any character that is not an uppercase A.
$ If the dollar sign is at the end of a regular expression, the matched string must be at the end of the string being searched. For example, the regular expression "ColdFusion$" matches the string "I like ColdFusion" but not the string "ColdFusion is fun."
? A character set or subexpression followed by a question mark matches zero or one occurrence of the character set or subexpression. For example, xy?z matches either "xyz" or "xz".
| The OR character allows a choice between two regular expressions. For example, jell(y|ies) matches either "jelly" or "jellies".
+ A character set or subexpression followed by a plus sign matches one or more occurrences of the character set or subexpression. For example, [a-z]+ matches one or more lowercase characters.
* A character set or subexpression followed by an asterisk matches zero or more occurrences of the character set or subexpression. For example, [a-z]* matches zero or more lowercase characters.
() Parentheses group parts of a regular expression into subexpressions that you can treat as a single unit. For example, (ha)+ matches one or more instances of "ha".
(?x) If at the beginning of a regular expression, it specifies to ignore whitespace in the regular expression and lets you use ## for end-of-line comments. You can match a space by escaping it with a backslash. For example, the following regular expression includes comments, preceded by ##, that are ignored by ColdFusion:
reFind("(?x)
    one ##first option
    |two ##second option
    |three\ point\ five ## note escaped spaces
    ", "three point five")
(?m) If at the beginning of a regular expression, it specifies the multiline mode for the special characters ^ and $. When used with ^, the matched string can be at the start of the entire search string or at the start of new lines, denoted by a linefeed character, or chr(10), within the search string. For $, the matched string can be at the end of the search string or at the end of new lines. Multiline mode does not recognize a carriage return, or chr(13), as a new line character. The following example searches for the string "two" across multiple lines: #reFind("(?m)^two", "one#chr(10)#two")# This example returns 5 to indicate that it matched "two" after the chr(10) linefeed. Without (?m), the regular expression would not match anything, because ^ only matches the start of the string. The (?m) modifier does not affect \A or \Z, which always match the start or end of the string, respectively. For information on \A and \Z, see Using escape sequences.
(?i) If at the beginning of a regular expression for REFind(), it specifies to perform a case-insensitive compare. For example, the following line would return an index of 1: #reFind("(?i)hi", "HI")# If you omit the (?i), the line would return an index of zero to signify that it did not find the regular expression.
(?=...) If at the beginning of a regular expression, it specifies to use positive lookahead when searching for the regular expression. If you prefix a subexpression with this, ColdFusion uses positive lookahead for that subexpression. Positive lookahead tests for the parenthesized subexpression like regular parentheses, but does not include the contents in the match; it merely tests whether the subexpression is present next to the rest of the expression. For example, consider the expression to extract the protocol from a URL:
<cfset regex = "http(?=://)">
<cfset string = "http://">
<cfset result = reFind(regex, string, 1, "yes")>
<cfoutput>#mid(string, result.pos[1], result.len[1])#</cfoutput>
This example results in the string "http". The lookahead parentheses ensure that the "://" is there, but do not include it in the result. If you did not use lookahead, the result would include the extraneous "://". Lookahead parentheses do not capture text, so backreference numbering will skip over these groups. For more information on backreferencing, see Using backreferences.
(?!...) If at the beginning of a regular expression, it specifies to use negative lookahead. Negative lookahead is just like positive lookahead, as specified by (?=...), except that it tests for the absence of a match. Lookahead parentheses do not capture text, so backreference numbering will skip over these groups. For more information on backreferencing, see Using backreferences.
(?:...) If you prefix a subexpression with "?:", ColdFusion performs all operations on the subexpression except that it will not capture the corresponding text for use with a back reference.

Be aware of the following considerations when using special characters in character sets, such as [a-z]:

  • To include a hyphen (-) in the brackets of a character set as a literal character, you cannot escape it as you can other special characters because ColdFusion always interprets a hyphen as a range indicator. Therefore, if you use a literal hyphen in a character set, make it the last character in the set.
  • To include a closing square bracket (]) in the character set, escape it with a backslash, as in [1-3\]A-z]. You do not have to escape the ] character outside the character set designator.
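
The escaping rules can be checked with a few calls. This is an illustrative sketch; the search strings are invented for this example.

```cfml
<cfscript>
// Unescaped, + is a quantifier: "1+1" means one or more "1" followed by "1".
writeOutput(reFind("1+1", "1+1") & "<br>");   // 0: the string has no "11"
// Escaped, \+ matches a literal plus sign.
writeOutput(reFind("1\+1", "1+1") & "<br>");  // 1

// A hyphen placed last in a character set is a literal hyphen.
writeOutput(reReplace("A-1 B-2", "[A-Z-]", "", "ALL"));  // 1 2
</cfscript>
```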

Using escape sequences

Escape sequences are special characters in regular expressions preceded by a backslash (\). You typically use escape sequences to represent special characters within a regular expression. For example, the escape sequence \t represents a tab character within the regular expression, and the \d escape sequence specifies any digit, as [0-9] does. ColdFusion escape sequences are case sensitive.
The following table lists the escape sequences that ColdFusion supports:

Escape Sequence Description
\b Specifies a boundary defined by a transition from an alphanumeric character to a nonalphanumeric character, or from a nonalphanumeric character to an alphanumeric character. For example, the string " Big" contains a boundary defined by the space (nonalphanumeric character) and the "B" (alphanumeric character). The following example uses the \b escape sequence in a regular expression to locate the string "Big" at the end of the search string and not the fragment "big" inside the word "ambiguous":
<cfset IndexOfOccurrence = REFindNoCase("\bBig\b", "Don't be ambiguous about Big.")>
<!--- The value of IndexOfOccurrence is 26 --->
When used inside a character set (for example, [\b]), \b specifies a backspace.
\B Specifies a boundary defined by no transition of character type. For example, two alphanumeric characters in a row or two nonalphanumeric characters in a row; opposite of \b.
\A Specifies a beginning of string anchor, much like the ^ special character. However, unlike ^, you cannot combine \A with (?m) to specify the start of newlines in the search string.
\Z Specifies an end of string anchor, much like the $ special character. However, unlike $, you cannot combine \Z with (?m) to specify the end of newlines in the search string.
\n Newline character
\r Carriage return
\t Tab
\f Form feed
\d Any digit, similar to [0-9]
\D Any nondigit character, similar to [^0-9]
\w Any alphanumeric character, or the underscore (_), similar to [[:word:]]
\W Any nonalphanumeric character, except the underscore, similar to [^[:word:]]
\s Any whitespace character including tab, space, newline, carriage return, and form feed. Similar to [ \t\n\r\f].
\S Any nonwhitespace character, similar to [^ \t\n\r\f]
\xdd A hexadecimal representation of a character, where d is a hexadecimal digit
\ddd An octal representation of a character, where d is an octal digit, in the form \000 to \377
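
A few of the escape sequences in the table can be tried as follows. This is a minimal sketch with a made-up search string; the variable name s is hypothetical.

```cfml
<cfscript>
s = "Order 66 shipped";
writeOutput(reFind("\d+", s) & "<br>");                // 7: first digit of "66"
writeOutput(reReplace(s, "\s", "_", "ALL") & "<br>");  // Order_66_shipped
writeOutput(reReplace(s, "\W", "", "ALL"));            // Order66shipped (only the spaces are nonalphanumeric)
</cfscript>
```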

Using character classes

In character sets within regular expressions, you can include a character class. You enclose the character class inside brackets, as the following example shows:

REReplace ("Adobe Web Site","[[:space:]]","*","ALL")

This code replaces all the spaces with *, producing this string:

Adobe*Web*Site

You can combine character classes with other expressions within a character set. For example, the regular expression [[:space:]123] searches for a space, 1, 2, or 3. The following example also uses a character class in a regular expression:

<cfset IndexOfOccurrence = REFind("[[:space:]][A-Z]+[[:space:]]", "Some BIG string")>
<!--- The value of IndexOfOccurrence is 5 --->

The following table shows the character classes that ColdFusion supports. Regular expressions using these classes match any Unicode character in the class, not just ASCII or ISO-8859 characters.

Character class Matches
:alpha: Any alphabetic character.
:upper: Any uppercase alphabetic character.
:lower: Any lowercase alphabetic character.
:digit: Any digit. Same as \d.
:alnum: Any alphabetic or numeric character.
:xdigit: Any hexadecimal digit. Same as [0-9A-Fa-f].
:blank: Space or a tab.
:space: Any whitespace character. Same as \s.
:print: Any alphanumeric, punctuation, or space character.
:punct: Any punctuation character.
:graph: Any alphanumeric or punctuation character.
:cntrl: Any character not part of the character classes [:upper:], [:lower:], [:alpha:], [:digit:], [:punct:], [:graph:], [:print:], or [:xdigit:].
:word: Any alphabetic or numeric character, plus the underscore (_). Same as \w.
:ascii: The ASCII characters, in the hexadecimal range 0 through 7F.
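
The classes in the table are used inside a character set, so each one appears in double brackets. The following sketch assumes the default Perl-compatible engine (these POSIX classes are listed as unsupported by the Java engine in the comparison table); the search strings are invented for illustration.

```cfml
<cfscript>
// Replace every digit with "x".
writeOutput(reReplace("Call 555-0100 now", "[[:digit:]]", "x", "ALL") & "<br>");  // Call xxx-xxxx now

// Combine a class with literal characters in the same set.
writeOutput(reReplace("a 1b2 c3", "[[:space:]123]", "", "ALL") & "<br>");         // abc

// Find the first run of two or more uppercase letters.
writeOutput(reFind("[[:upper:]]{2,}", "Some BIG string"));                        // 6
</cfscript>
```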
