Regex (Pattern and Matcher)
About Regex
Regex (Regular Expression) is a sequence of characters that forms a search pattern. It is widely used for:
Validating inputs (e.g., email, phone numbers).
Searching and extracting text from larger strings.
Replacing patterns in text.
Splitting strings.
Terminology
1. Literals
Literals in regex are characters that match themselves exactly. They are the simplest building blocks of a regex pattern.
Example:
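Pattern: abc
Matches: the exact character sequence "abc".
Does not match: "ab" (too short). Against "abcd", a whole-input check such as matches() fails, while a search with find() still locates "abc" inside the string.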
Use Case: Used when you want to match static text exactly as it appears.
2. Meta-characters
Meta-characters are special characters in regex that have a unique meaning or functionality. They are used to define patterns beyond literal characters.
| Meta-character | Meaning | Example |
|---|---|---|
| . | Matches any single character (except newline). | Pattern: a.c → Matches: "abc", "a3c". |
| ^ | Matches the beginning of a string. | Pattern: ^abc → Matches: "abc" at the start of the string. |
| $ | Matches the end of a string. | Pattern: abc$ → Matches: "abc" at the end of the string. |
| [] | Denotes a character set. | Pattern: [a-z] → Matches any lowercase ASCII letter. |
| \ | Escapes meta-characters to treat them as literals. | Pattern: \. → Matches a literal dot ("."). |
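A minimal sketch of how a few of these meta-characters behave in Java. The class name MetaCharDemo and the sample strings are illustrative assumptions; note that each backslash is doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class MetaCharDemo {
    // Returns true only when the regex matches the whole input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a.c", "abc"));    // '.' stands in for 'b'
        System.out.println(fullMatch("a.c", "a\nc"));   // '.' does not match a newline by default
        System.out.println(fullMatch("a\\.c", "abc"));  // escaped dot only matches a literal '.'
        System.out.println(fullMatch("a\\.c", "a.c"));
        System.out.println(fullMatch("[a-z]", "q"));    // one lowercase ASCII letter
    }
}
```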
3. Quantifiers
Quantifiers define the number of occurrences of a character or group that must match for a pattern to be valid.
| Quantifier | Meaning | Example |
|---|---|---|
| * | Matches 0 or more occurrences. | Pattern: ab* → Matches: "a", "ab", "abb", "abbb". |
| + | Matches 1 or more occurrences. | Pattern: ab+ → Matches: "ab", "abb", "abbb". |
| ? | Matches 0 or 1 occurrence. | Pattern: ab? → Matches: "a", "ab". |
| {n} | Matches exactly n occurrences. | Pattern: a{2} → Matches: "aa". |
| {n,} | Matches at least n occurrences. | Pattern: a{2,} → Matches: "aa", "aaa", "aaaa". |
| {n,m} | Matches between n and m occurrences. | Pattern: a{2,4} → Matches: "aa", "aaa", "aaaa". |
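The quantifier rows above can be checked with a small sketch. QuantifierDemo is a hypothetical class name, and the checks use whole-input matching so that the occurrence counts are enforced strictly.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Whole-input match, so extra characters cause a failure.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("ab*", "a"));         // zero 'b's is allowed
        System.out.println(fullMatch("ab+", "a"));         // at least one 'b' is required
        System.out.println(fullMatch("ab?", "abb"));       // at most one 'b' is allowed
        System.out.println(fullMatch("a{2,4}", "aaa"));    // 3 is within 2..4
        System.out.println(fullMatch("a{2,4}", "aaaaa"));  // 5 exceeds the upper bound
    }
}
```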
4. Groups
Groups are portions of a regex enclosed in parentheses () that allow:
Capturing and extracting parts of a match.
Applying quantifiers to an entire group.
Types of Groups:
Capturing Groups: Regular parentheses ( ) capture the matched sub-pattern.
Example:
Pattern: (a|b)c
Matches: "ac" or "bc"
Captures: "a" or "b".
Non-Capturing Groups: (?: ) groups without capturing.
Example:
Pattern: (?:a|b)c
Matches: "ac" or "bc"
Captures: None.
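A minimal sketch contrasting the two group types. GroupDemo and its helper methods are illustrative names, not part of java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Number of capturing groups declared in the regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text captured by group 1 on a whole-input match, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(a|b)c"));        // one capturing group
        System.out.println(groupCount("(?:a|b)c"));      // non-capturing: none
        System.out.println(firstGroup("(a|b)c", "bc"));  // the captured alternative
        System.out.println(firstGroup("(ab)+", "abab")); // quantifier applied to the whole group
    }
}
```

When a quantified group repeats, group(1) holds the text from its last iteration.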
5. Flags
Flags are optional modifiers that change the behavior of a regex. In Java they are passed as the second argument to Pattern.compile(); several flags can be combined with the bitwise OR operator.

| Flag | Description | Code |
|---|---|---|
| CASE_INSENSITIVE | Makes the pattern case-insensitive (US-ASCII only unless UNICODE_CASE is also set). | Pattern.CASE_INSENSITIVE |
| MULTILINE | Makes ^ and $ match the start/end of each line. | Pattern.MULTILINE |
| DOTALL | Makes . match line terminators as well. | Pattern.DOTALL |
| UNICODE_CASE | Makes case-insensitive matching Unicode-aware when combined with CASE_INSENSITIVE. | Pattern.UNICODE_CASE |
| UNIX_LINES | Recognizes only \n as a line terminator for ., ^, and $. | Pattern.UNIX_LINES |
Example:
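A minimal sketch of compiling with flags, assuming a hypothetical FlagsDemo class and made-up log text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Counts how many times the pattern is found in the input.
    static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // True if "a.b" matches the whole input under the given flags.
    static boolean dotMatches(String input, int flags) {
        return Pattern.compile("a.b", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "Error: disk\nerror: net";
        Pattern plain = Pattern.compile("^error");
        // Flags are combined with bitwise OR.
        Pattern flagged = Pattern.compile("^error",
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        System.out.println(countMatches(plain, text));   // ^ only at input start, case-sensitive
        System.out.println(countMatches(flagged, text)); // ^ at each line start, any case
        System.out.println(dotMatches("a\nb", Pattern.DOTALL));
    }
}
```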
6. Anchors
Anchors are zero-width assertions that specify positions in the string (not actual characters).
| Anchor | Meaning | Example |
|---|---|---|
| ^ | Matches the start of a string. | Pattern: ^abc → Matches: "abc" at the start. |
| $ | Matches the end of a string. | Pattern: abc$ → Matches: "abc" at the end. |
| \b | Matches a word boundary. | Pattern: \bword\b → Matches "word" as a whole word. |
| \B | Matches non-word boundaries. | Pattern: \Bword\B → Matches "word" inside another word. |
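A short sketch of the word-boundary anchors. AnchorDemo and the sample sentences are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Counts non-overlapping occurrences of the regex in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Only the standalone words count; "concat" and "cats" are skipped.
        System.out.println(count("\\bcat\\b", "cat concat cats cat"));
        // \B requires word characters on both sides of "cat".
        System.out.println(count("\\Bcat\\B", "concatenate"));
        System.out.println(count("\\Bcat\\B", "cat"));
    }
}
```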
7. Escaping
Since some characters (meta-characters) have special meanings in regex, they must be escaped with a backslash (\) to be treated literally. In a Java string literal the backslash itself must be doubled, so the regex \. is written as "\\.".

| Meta-character | Escaped Form | Description |
|---|---|---|
| . | \. | Matches a literal dot. |
| * | \* | Matches a literal asterisk. |
| ( and ) | \( and \) | Matches literal parentheses. |

Example:
Pattern: 3\.14
Matches: "3.14".
Does not match: "314".
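A minimal sketch of escaping in Java source. EscapeDemo is a hypothetical class; Pattern.quote() is the standard library method that escapes an entire literal string at once, which helps when the text to match comes from user input.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The regex 3\.14 written as a Java string literal.
    static boolean isPi(String s) {
        return Pattern.matches("3\\.14", s);
    }

    // Treats every character of 'literal' as plain text.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        System.out.println(isPi("3.14"));
        System.out.println(isPi("3x14")); // the dot is no longer a wildcard
        System.out.println(matchesLiterally("a.b*(c)", "a.b*(c)"));
        System.out.println(matchesLiterally("a.b", "axb"));
    }
}
```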
8. Assertions
Assertions are zero-width patterns that check for specific conditions without consuming any characters.
| Assertion | Meaning | Example |
|---|---|---|
| Lookahead | Matches if a pattern exists ahead. | Pattern: foo(?=bar) → Matches: "foo" if "bar" follows. |
| Negative Lookahead | Matches if a pattern does NOT exist ahead. | Pattern: foo(?!bar) → Matches: "foo" if "bar" does NOT follow. |
| Lookbehind | Matches if a pattern exists behind. | Pattern: (?<=bar)foo → Matches: "foo" if "bar" precedes. |
| Negative Lookbehind | Matches if a pattern does NOT exist behind. | Pattern: (?<!bar)foo → Matches: "foo" if "bar" does NOT precede. |
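The four assertions can be compared with a small sketch. AssertionDemo and its inputs are illustrative assumptions; the helper reports where the first match starts, which shows which "foo" was selected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AssertionDemo {
    // Start index of the first match, or -1 if the regex is not found.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    // Length of the first match, or -1; assertions add nothing to this length.
    static int firstMatchLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(firstMatchStart("foo(?=bar)", "foobaz foobar"));   // second foo
        System.out.println(firstMatchStart("foo(?!bar)", "foobar foobaz"));   // second foo
        System.out.println(firstMatchStart("(?<=bar)foo", "foo barfoo"));     // foo after bar
        System.out.println(firstMatchStart("(?<!bar)foo", "barfoo foo"));     // foo not after bar
        System.out.println(firstMatchLength("foo(?=bar)", "foobar"));         // "bar" is not consumed
    }
}
```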
9. Greedy, Reluctant, and Possessive Quantifiers
Quantifiers in regex can control how much text they try to match:

| Type | Symbol | Behavior |
|---|---|---|
| Greedy | *, +, ?, {} | Matches as much as possible (default), giving characters back if the rest of the pattern needs them. |
| Reluctant | *?, +?, ?? | Matches as little as possible, extending only when the rest of the pattern requires it. |
| Possessive | *+, ++, ?+ | Matches as much as possible without backtracking. |

Example (input: "a123b456b"):
Pattern: a.*b (Greedy)
Matches: "a123b456b" (the entire string).
Pattern: a.*?b (Reluctant)
Matches: "a123b" (stops after the first "b").
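A minimal sketch of all three quantifier types on the same input. GreedyDemo is a hypothetical class name; the possessive case is an addition here to show that .*+ keeps everything it consumed, so the trailing b can never match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Text of the first match, or null if the regex is not found.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "a123b456b";
        System.out.println(firstMatch("a.*b", input));   // greedy: up to the last 'b'
        System.out.println(firstMatch("a.*?b", input));  // reluctant: up to the first 'b'
        System.out.println(firstMatch("a.*+b", input));  // possessive: no match at all
    }
}
```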
Pattern
About Pattern
The Pattern class represents a compiled regex. It is immutable and thread-safe, meaning a single Pattern instance can be shared across threads.
Advantages:
Pre-compiling a regex with Pattern.compile() improves performance for repeated use.
Pattern provides advanced regex features like flags and Unicode support.
Features
| Feature | Description |
|---|---|
| Pre-compilation | Compiles a regex once to avoid re-compilation in repeated use. |
| Flags | Enable special behavior like case-insensitivity or dotall mode. |
| Group Extraction | Supports capturing groups using parentheses for extracting matched sub-patterns. |
| Unicode Support | Supports Unicode-aware character classes and case folding. |
| Advanced Assertions | Provides zero-width assertions like lookaheads and lookbehinds. |
| Performance Optimization | Supports possessive quantifiers and atomic groups to reduce backtracking. |
| Escaping Characters | Allows matching meta-characters as literals (e.g., "\\." in a Java string literal to match a dot). |
Supported Methods in Pattern
| Feature Group | Method or Field | Description |
|---|---|---|
| Compilation | static Pattern compile(String regex) | Compiles a regex into a pattern. |
| | static Pattern compile(String regex, int flags) | Compiles a regex with specific flags. |
| Flags | int flags() | Returns the flags used when compiling the pattern. |
| Matching | static boolean matches(String regex, CharSequence input) | Compiles the regex and checks whether the entire input matches it. |
| Pattern Retrieval | String pattern() | Returns the regex pattern as a string. |
| Splitting Strings | String[] split(CharSequence input) | Splits the input string around matches of the pattern. |
| | String[] split(CharSequence input, int limit) | Splits the input string around matches, with a limit on the number of resulting pieces. |
| Unicode Support | static final int UNICODE_CASE | Flag constant that enables Unicode-aware case folding. |
| | static final int UNICODE_CHARACTER_CLASS | Flag constant that enables Unicode-aware character classes. |
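A brief sketch exercising several of the methods listed above. PatternApiDemo, the COMMA constant, and the sample line are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PatternApiDemo {
    // A comma followed by optional whitespace; compiled once and reused.
    static final Pattern COMMA = Pattern.compile(",\\s*");

    static String[] fields(String line) {
        return COMMA.split(line);
    }

    // A limit of 2 yields at most two pieces: the first field and the rest.
    static String[] firstAndRest(String line) {
        return COMMA.split(line, 2);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("a, b,c")));
        System.out.println(Arrays.toString(firstAndRest("a, b,c")));
        System.out.println(COMMA.pattern());                  // the original regex text
        System.out.println(Pattern.matches("\\d+", "12345")); // static one-off whole-input check
    }
}
```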
Some Regex Symbols
| Symbol | Description |
|---|---|
| . | Matches any single character except a newline. |
| \d | Matches a digit (equivalent to [0-9]). |
| \D | Matches a non-digit (equivalent to [^0-9]). |
| \w | Matches a word character (alphanumeric or _). |
| \W | Matches a non-word character (opposite of \w). |
| \s | Matches a whitespace character (spaces, tabs, newlines). |
| \S | Matches a non-whitespace character. |
| ^ | Matches the beginning of a line or string. |
| $ | Matches the end of a line or string. |
| \b | Matches a word boundary. |
| \B | Matches a position that is not a word boundary. |
| [...] | Matches any character inside the brackets (e.g., [abc] matches "a", "b", or "c"). |
| [^...] | Matches any character NOT inside the brackets (e.g., [^abc] matches anything except "a", "b", or "c"). |
| ? | Matches 0 or 1 occurrence of the preceding element. |
| * | Matches 0 or more occurrences of the preceding element (greedy). |
| + | Matches 1 or more occurrences of the preceding element (greedy). |
| {n} | Matches exactly n occurrences of the preceding element. |
| {n,} | Matches at least n occurrences of the preceding element. |
| {n,m} | Matches between n and m occurrences of the preceding element. |
| (?=...) | Positive lookahead: Ensures that a certain pattern follows. |
| (?!...) | Negative lookahead: Ensures that a certain pattern does NOT follow. |
| (?<=...) | Positive lookbehind: Ensures that a certain pattern precedes. |
| (?<!...) | Negative lookbehind: Ensures that a certain pattern does NOT precede. |
| \ | Escapes special characters (e.g., \. matches a literal dot; written as "\\." in a Java string literal). |
Matcher
About Matcher
The Matcher class in Java represents an engine that performs match operations on a character sequence using a Pattern. It works as a stateful iterator, allowing for complex matching, group extraction, and replacement operations. A Matcher is not thread-safe, so each thread must use its own instance if concurrency is required.
Features
| Feature | Description |
|---|---|
| Stateful Matching | Allows iteration through matches in a target string using find(). |
| Group Extraction | Extracts specific parts of the matched text using capturing groups ( ). |
| Position Tracking | Tracks the start and end positions of matches within the input string. |
| Regex Replacement | Performs targeted replacement using regex patterns with replaceAll() and replaceFirst(). |
| Anchored Matching | Matches from the beginning of the string with matches() or lookingAt(). |
| Region Matching | Limits matching to a specific substring of the input. |
| Reset Functionality | Allows resetting the Matcher with a new input or pattern. |
Supported Methods in Matcher
| Feature Group | Method | Description |
|---|---|---|
| Matching | boolean matches() | Attempts to match the entire input sequence against the pattern. |
| | boolean lookingAt() | Attempts to match the pattern starting at the beginning of the input, without requiring the whole input to match. |
| | boolean find() | Finds the next subsequence that matches the pattern. |
| | boolean find(int start) | Resets the matcher, then finds the next match starting at the specified index. |
| Group Extraction | String group() | Returns the matched subsequence from the last match. |
| | String group(int group) | Returns the specified capturing group's matched subsequence. |
| | int groupCount() | Returns the number of capturing groups in the pattern. |
| Position Tracking | int start() | Returns the start index of the last match. |
| | int start(int group) | Returns the start index of the specified group in the last match. |
| | int end() | Returns the end index (exclusive) of the last match. |
| | int end(int group) | Returns the end index (exclusive) of the specified group in the last match. |
| Replacement | String replaceAll(String replacement) | Replaces every subsequence that matches the pattern with the replacement string. |
| | String replaceFirst(String replacement) | Replaces the first subsequence that matches the pattern with the replacement string. |
| | Matcher appendReplacement(StringBuffer sb, String replacement) | Appends the input between the previous match and the current one, followed by the replacement, to the StringBuffer. |
| | StringBuffer appendTail(StringBuffer sb) | Appends the remaining input after the last match to the StringBuffer. |
| Region Matching | Matcher region(int start, int end) | Sets the bounds of the region within which matches are searched. |
| | boolean hasTransparentBounds() | Checks if the matcher uses transparent bounds. |
| | Matcher useTransparentBounds(boolean b) | Sets whether the matcher uses transparent bounds. |
| | boolean hasAnchoringBounds() | Checks if the matcher uses anchoring bounds. |
| | Matcher useAnchoringBounds(boolean b) | Sets whether the matcher uses anchoring bounds. |
| Reset | Matcher reset() | Resets the matcher, clearing any previous match state. |
| | Matcher reset(CharSequence input) | Resets the matcher with a new input sequence. |
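A minimal sketch of the matching and position-tracking methods working together. MatcherApiDemo and the output format "text@start-end" are illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherApiDemo {
    // Lists every match as "text@start-end" (end index is exclusive).
    static List<String> describeMatches(String regex, String input) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("\\d+", "a12b345"));

        Matcher m = Pattern.compile("\\d+").matcher("123abc");
        System.out.println(m.matches());   // whole input must match
        System.out.println(m.lookingAt()); // only a prefix must match
        m.reset();                         // clear state before searching again
        System.out.println(m.find());      // a match anywhere is enough
    }
}
```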
Named Capturing Groups
Named Capturing Groups allow us to assign names to specific groups in a regex pattern. This makes it easier to extract data without relying on the group index.
Syntax
Use the format (?<name>...) to define a named group.
Use Matcher.group("name") to retrieve the content of the named group.
Example
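A minimal sketch, assuming a hypothetical NamedGroupDemo class and an ISO-style date as the input format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Three named groups: year, month, and day.
    static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // Returns the named part of the date, or null if the input is not a date.
    static String part(String date, String groupName) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(groupName) : null;
    }

    public static void main(String[] args) {
        System.out.println(part("2024-03-15", "year"));
        System.out.println(part("2024-03-15", "day"));
        System.out.println(part("not a date", "year"));
    }
}
```

Named groups are still numbered, so m.group(1) and m.group("year") return the same text here.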
Advantages:
Improves code readability.
Reduces errors caused by incorrect group indices.
Atomic Groups
Atomic Groups are used to prevent backtracking within a group. Once a group is matched, the regex engine will not revisit it, even if the match fails later.
Syntax
Use the format (?>...) to define an atomic group.
Example
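A minimal sketch of atomic behavior. The pattern (?>a|aa)b, the input "aab", and the class name AtomicGroupDemo are chosen here for illustration; whole-input matching is used so the effect is not hidden by the engine retrying at a later start position.

```java
import java.util.regex.Pattern;

class AtomicGroupDemo {
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Atomic: the group commits to "a", so "b" cannot match the second 'a'.
        System.out.println(fullMatch("(?>a|aa)b", "aab"));
        // Non-atomic: the engine backtracks into the group and tries "aa".
        System.out.println(fullMatch("(?:a|aa)b", "aab"));
        // The first alternative already works here, so both forms succeed.
        System.out.println(fullMatch("(?>a|aa)b", "ab"));
    }
}
```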
Explanation: (?>a|aa) matches "a" first and, being atomic, commits to that choice. If the rest of the pattern then fails (for example, a required "b" does not follow), the regex engine does not backtrack into the group to try "aa".
Use Cases:
Performance Optimization: Reduces backtracking for large or complex patterns.
Matching Efficiency: Ensures certain patterns are matched only once.
When to Use:
When matching rules within a group are strict and should not allow any backtracking.
When the regex is suffering from performance issues due to excessive backtracking.
How Pattern and Matcher Work Together
The Pattern and Matcher classes in Java's java.util.regex package work together to provide a mechanism for regular expression processing.
Relationship Between Pattern and Matcher
Pattern: Represents the compiled version of a regular expression. It is immutable and thread-safe. You create a Pattern once and reuse it across multiple matching operations.
Matcher: Represents the engine that performs match operations against a specific input string using the Pattern. It is stateful and not thread-safe.
Workflow
Compile the Regex: A Pattern object is created using Pattern.compile(String regex). This compiles the regex for better performance.
Create a Matcher: A Matcher object is created from the Pattern using Pattern.matcher(CharSequence input).
Perform Matching Operations: The Matcher is used to perform operations like find(), matches(), or replaceAll() on the input string.
Reuse of Pattern: The Pattern can be reused to create multiple Matcher instances for different input strings.
Statefulness of Matcher: The Matcher retains state during operations (e.g., the position of the last match).
Thread-Safety:
Pattern: Thread-safe and reusable.
Matcher: Not thread-safe; each thread should use its own Matcher instance.
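The workflow can be sketched as follows. WorkflowDemo, the WORD constant, and countWords are hypothetical names; the point is the shape: one shared Pattern, and a fresh Matcher per call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WorkflowDemo {
    // Step 1: compile once. The Pattern is immutable and safe to share.
    private static final Pattern WORD = Pattern.compile("\\w+");

    static int countWords(String input) {
        // Step 2: create a Matcher for this input. It is local, so never shared.
        Matcher m = WORD.matcher(input);
        int n = 0;
        // Step 3: iterate; the Matcher remembers where the last match ended.
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countWords("one two  three"));
        System.out.println(countWords(""));
    }
}
```

Because each call builds its own Matcher, countWords can be called from several threads at once without extra synchronization.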
Performance Optimization Techniques
Regex operations can sometimes be computationally expensive. Below are techniques to optimize the performance of Pattern and Matcher:
1. Compile the Pattern Once
Problem: Re-compiling the regex repeatedly can be expensive.
Solution: Compile the regex once using Pattern.compile() and reuse the Pattern object across multiple matching operations.
2. Use Lazy Quantifiers When Appropriate
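Problem: Greedy quantifiers (*, +, ?) first consume as much input as possible and then give characters back one at a time, which can cause excessive backtracking, especially with large input strings.
Solution: Use lazy (reluctant) quantifiers (*?, +?, ??) when the shortest match is what you want, so the engine stops at the first point where the rest of the pattern matches. Lazy quantifiers still retry by extending the match one character at a time, so they are not automatically faster; measure before switching.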
3. Avoid Catastrophic Backtracking
Problem: Nested quantifiers can lead to exponential backtracking, causing performance issues.
Solution:
Use atomic groups ((?>...)) to prevent backtracking.
Simplify regex patterns to reduce complexity.
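A hedged sketch of the atomic-group fix. The nested pattern (a+)+b is a commonly cited example of a regex that can backtrack exponentially on input such as a long run of 'a' followed by a non-matching character; it is shown only as a comment and never executed. BacktrackingDemo and its helpers are illustrative names.

```java
import java.util.regex.Pattern;

class BacktrackingDemo {
    // Risky form (not run here): "(a+)+b" has nested quantifiers.
    // Safer form: the inner a+ is atomic, so it never gives characters back.
    static final Pattern SAFE = Pattern.compile("(?>a+)+b");

    static boolean safeMatch(String input) {
        return SAFE.matcher(input).matches();
    }

    // Builds a string of n 'a' characters (avoids String.repeat for older JDKs).
    static String repeatA(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(safeMatch("aaab"));
        // Fails quickly: there is only one way to consume the run of 'a'.
        System.out.println(safeMatch(repeatA(40) + "c"));
    }
}
```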
4. Use Predefined Character Classes
Problem: Defining custom character classes like [a-zA-Z0-9_] can make a regex verbose and harder to read.
Solution: Use predefined character classes like \\w (word character), \\d (digit), or \\s (whitespace), written here as they appear in a Java string literal.
5. Limit the Region for Matching
Problem: Searching the entire string when only a portion is relevant can waste time.
Solution: Use Matcher.region(int start, int end) to limit matching to a specific substring.
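A minimal sketch of region-limited matching, assuming a hypothetical RegionDemo class and a made-up key=value string. The region is given as a start index (inclusive) and an end index (exclusive) into the original input, so no substring copy is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionDemo {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // First number found between start (inclusive) and end (exclusive), or null.
    static String firstNumberInRegion(String input, int start, int end) {
        Matcher m = NUMBER.matcher(input);
        m.region(start, end);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "id=12; qty=345; x=6";
        System.out.println(firstNumberInRegion(input, 0, input.length())); // whole input
        System.out.println(firstNumberInRegion(input, 6, 15));             // only " qty=345;"
        System.out.println(firstNumberInRegion(input, 14, 17));            // no digits here
    }
}
```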
6. Use Anchors for Efficiency
Problem: Matching without specifying start (^) or end ($) anchors can lead to unnecessary scanning.
Solution: Use anchors to match at specific positions in the input.
7. Optimize Replacement Operations
Problem: Using complex patterns for replacement can be inefficient.
Solution:
Use Matcher.appendReplacement() and Matcher.appendTail() for fine-grained control.
Precompile the Pattern for repeated replacements.
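A minimal sketch of a computed replacement, where each match is transformed by code rather than by a fixed replacement string. ReplaceDemo and doubleNumbers are hypothetical names; Matcher.quoteReplacement() guards against $ or \ in the replacement being treated specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    // Precompiled once for repeated replacements.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Doubles every integer found in the input and leaves other text untouched.
    static String doubleNumbers(String input) {
        Matcher m = NUMBER.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Appends the text before this match, then the replacement.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(doubled)));
        }
        // Appends whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("2 apples, 10 pears"));
    }
}
```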
8. Profile and Benchmark Regex
Use tools like JMH (Java Microbenchmark Harness) to benchmark regex operations.
Analyze the runtime behavior of regex patterns and optimize accordingly.
9. Avoid Using Regex When Simpler Solutions Exist
Regex is powerful but can be overkill for simple operations. For example:
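Use String.contains() for simple substring checks.
Use String.indexOf() and String.substring() for basic splitting on a fixed delimiter. Note that String.split() itself interprets its argument as a regex, so a delimiter such as "." must be escaped (for example with Pattern.quote()).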
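A short sketch of the simpler alternatives, and of one pitfall worth knowing: String.split() treats its argument as a regex, so an unescaped "." matches every character. SimpleOpsDemo and the file name are illustrative assumptions.

```java
import java.util.regex.Pattern;

class SimpleOpsDemo {
    // Plain substring check; no regex involved.
    static boolean isPdf(String fileName) {
        return fileName.contains(".pdf");
    }

    // Extension via indexOf/substring; returns "" when there is no dot.
    static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(isPdf("report.final.pdf"));
        System.out.println(extension("report.final.pdf"));
        // Pitfall: "." is a regex wildcard, so every character is a delimiter.
        System.out.println("a.b.c".split(".").length);
        // Quoting the delimiter makes it literal again.
        System.out.println("a.b.c".split(Pattern.quote(".")).length);
    }
}
```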