Regex (Pattern and Matcher)

About Regex

Regex (Regular Expression) is a sequence of characters that forms a search pattern. It is widely used for:

  • Validating inputs (e.g., email, phone numbers).

  • Searching and extracting text from larger strings.

  • Replacing patterns in text.

  • Splitting strings.
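The four uses listed above can be sketched in a few lines of Java. The class name, the method names, and the deliberately simple phone format are assumptions made for illustration, not a recommended validation rule.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexUses {
    // Hypothetical format for this sketch: 3 digits, 3 digits, 4 digits.
    static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");

    // Validating: the whole input must match.
    static boolean isValidPhone(String input) {
        return PHONE.matcher(input).matches();
    }

    // Searching and extracting: first run of digits, or null if there is none.
    static String firstNumber(String text) {
        Matcher m = Pattern.compile("\\d+").matcher(text);
        return m.find() ? m.group() : null;
    }

    // Replacing: collapse every run of whitespace into one space.
    static String collapseSpaces(String text) {
        return text.replaceAll("\\s+", " ");
    }

    // Splitting: split on commas, ignoring surrounding whitespace.
    static String[] splitCsv(String line) {
        return line.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isValidPhone("555-123-4567"));          // true
        System.out.println(firstNumber("Order 42 shipped"));       // 42
        System.out.println(collapseSpaces("a   b \t c"));          // a b c
        System.out.println(Arrays.toString(splitCsv("x , y,z")));  // [x, y, z]
    }
}
```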

Terminology

1. Literals

Literals in regex are characters that match themselves exactly. They are the simplest building blocks of a regex pattern.

  • Example:

    • Pattern: abc

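    • Matches: the exact character sequence "abc".

    • Does not match: "ab" or "abd". Note that "abcd" contains "abc", so a substring search (Matcher.find()) succeeds on it, while a full match (Matcher.matches()) does not.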

  • Use Case: Used when you want to match static text exactly as it appears.

2. Meta-characters

Meta-characters are special characters in regex that have a unique meaning or functionality. They are used to define patterns beyond literal characters.

| Meta-character | Meaning | Example |
| --- | --- | --- |
| . | Matches any single character except a line terminator (unless DOTALL is set). | Pattern: a.c → Matches: "abc", "a3c". |
| ^ | Matches the beginning of the input (or of each line in MULTILINE mode). | Pattern: ^abc → Matches: "abc" at the start of the string. |
| $ | Matches the end of the input (or of each line in MULTILINE mode). | Pattern: abc$ → Matches: "abc" at the end of the string. |
| [] | Denotes a character class (a set of allowed characters). | Pattern: [a-z] → Matches any single lowercase ASCII letter. |
| \ | Escapes meta-characters to treat them as literals. | Pattern: \. → Matches a literal dot ("."). |
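As a rough check of the meta-characters above, the following sketch tries each one against small inputs. The helper names full and found are assumptions introduced here; full requires the whole input to match, found looks for a match anywhere.

```java
import java.util.regex.Pattern;

public class MetaCharDemo {
    // True if the regex matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(full("a.c", "a3c"));     // true
        System.out.println(full("a.c", "a\nc"));    // false: . skips line terminators by default
        System.out.println(found("^abc", "xabc"));  // false: "abc" is not at the start
        System.out.println(found("abc$", "xabc"));  // true
        System.out.println(full("[a-z]", "Q"));     // false: uppercase is outside the class
        System.out.println(full("\\.", "."));       // true: escaped dot is a literal
    }
}
```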

3. Quantifiers

Quantifiers define the number of occurrences of a character or group that must match for a pattern to be valid.

| Quantifier | Meaning | Example |
| --- | --- | --- |
| * | Matches 0 or more occurrences. | Pattern: ab* → Matches: "a", "ab", "abb", "abbb". |
| + | Matches 1 or more occurrences. | Pattern: ab+ → Matches: "ab", "abb", "abbb". |
| ? | Matches 0 or 1 occurrence. | Pattern: ab? → Matches: "a", "ab". |
| {n} | Matches exactly n occurrences. | Pattern: a{2} → Matches: "aa". |
| {n,} | Matches at least n occurrences. | Pattern: a{2,} → Matches: "aa", "aaa", "aaaa". |
| {n,m} | Matches between n and m occurrences (inclusive). | Pattern: a{2,4} → Matches: "aa", "aaa", "aaaa". |
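The quantifier rules above can be checked with full-string matches. This is a minimal sketch; the class and helper names are illustrative assumptions.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // True if the regex matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(full("ab*", "a"));         // true: zero b's are allowed
        System.out.println(full("ab+", "a"));         // false: at least one b is required
        System.out.println(full("ab?", "abb"));       // false: at most one b is allowed
        System.out.println(full("a{2}", "aaa"));      // false: exactly two a's
        System.out.println(full("a{2,}", "aaaaa"));   // true: two or more a's
        System.out.println(full("a{2,4}", "aaaaa"));  // false: five a's exceed the upper bound
    }
}
```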

4. Groups

Groups are portions of a regex enclosed in parentheses () that allow:

  • Capturing and extracting parts of a match.

  • Applying quantifiers to an entire group.

Types of Groups:

  1. Capturing Groups:

    • Regular parentheses ( ) are used to capture matched sub-patterns.

    • Example:

      • Pattern: (a|b)c

      • Matches: "ac" or "bc"

      • Captures: "a" or "b".

  2. Non-Capturing Groups:

    • (?: ) are used for grouping without capturing.

    • Example:

      • Pattern: (?:a|b)c

      • Matches: "ac" or "bc"

      • Captures: None.
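The difference between the two group types shows up in groupCount() and group(1). The sketch below assumes hypothetical helper names; it also applies a quantifier to a whole group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Number of capturing groups the regex defines (group 0 is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text captured by group 1 on a full match, or null if there is no such group or no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return (m.matches() && m.groupCount() >= 1) ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(a|b)c"));              // 1
        System.out.println(groupCount("(?:a|b)c"));            // 0
        System.out.println(firstGroup("(a|b)c", "bc"));        // b
        System.out.println(firstGroup("(?:a|b)c", "bc"));      // null
        System.out.println(Pattern.matches("(ab)+", "ababab")); // true: + applies to the group
    }
}
```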

5. Flags

Flags are optional modifiers that change how a regex is matched. In Java they are passed as the second argument to Pattern.compile(); several flags can be combined with the bitwise OR operator (|).

| Flag | Description | Code |
| --- | --- | --- |
| CASE_INSENSITIVE | Makes the pattern case-insensitive (ASCII letters by default). | Pattern.CASE_INSENSITIVE |
| MULTILINE | Makes ^ and $ match the start/end of each line. | Pattern.MULTILINE |
| DOTALL | Makes . match line terminators as well. | Pattern.DOTALL |
| UNICODE_CASE | Enables Unicode-aware case-insensitive matching when combined with CASE_INSENSITIVE. | Pattern.UNICODE_CASE |
| UNIX_LINES | Recognizes only \n as a line terminator for ., ^, and $. | Pattern.UNIX_LINES |

Example:

Pattern pattern = Pattern.compile("abc", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("ABC");
boolean matched = matcher.matches();  // true: "ABC" matches "abc" due to case-insensitivity.
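Flags are plain int bit masks, so they can be combined. The sketch below combines CASE_INSENSITIVE and MULTILINE to pull messages out of a small log; the log format, class name, and method name are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagDemo {
    // MULTILINE makes ^ and $ work per line; CASE_INSENSITIVE accepts "ERROR", "Error", ...
    static final Pattern ERROR_LINE =
            Pattern.compile("^error: (.*)$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Collects the text after "error: " from every matching line.
    static List<String> errorMessages(String log) {
        List<String> messages = new ArrayList<>();
        Matcher m = ERROR_LINE.matcher(log);
        while (m.find()) {
            messages.add(m.group(1));
        }
        return messages;
    }

    public static void main(String[] args) {
        String log = "info: ok\nERROR: disk full\nError: retry";
        System.out.println(errorMessages(log)); // [disk full, retry]

        // DOTALL lets . cross a line terminator.
        System.out.println(Pattern.compile("a.b").matcher("a\nb").matches());                 // false
        System.out.println(Pattern.compile("a.b", Pattern.DOTALL).matcher("a\nb").matches()); // true
    }
}
```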

6. Anchors

Anchors are zero-width assertions that specify positions in the string (not actual characters).

| Anchor | Meaning | Example |
| --- | --- | --- |
| ^ | Matches the start of a string. | Pattern: ^abc → Matches: "abc" at the start. |
| $ | Matches the end of a string. | Pattern: abc$ → Matches: "abc" at the end. |
| \b | Matches a word boundary. | Pattern: \bword\b → Matches "word" as a whole word. |
| \B | Matches a position that is not a word boundary. | Pattern: \Bword\B → Matches "word" inside another word (e.g., "swordsman"). |
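To illustrate \b and \B, the sketch below searches for "cat" as a whole word and as an inner fragment. The helper name found is an assumption introduced here.

```java
import java.util.regex.Pattern;

public class BoundaryDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("\\bcat\\b", "a cat sat"));    // true: "cat" is a whole word
        System.out.println(found("\\bcat\\b", "concatenate"));  // false: letters on both sides
        System.out.println(found("\\Bcat\\B", "concatenate"));  // true: "cat" is inside a word
        System.out.println(found("\\Bcat\\B", "a cat sat"));    // false: boundaries on both sides
    }
}
```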

7. Escaping

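Since some characters (meta-characters) have special meanings in regex, they must be escaped with a backslash (\) to be treated literally. In a Java string literal the backslash must itself be escaped, so the regex \. is written "\\." in source code.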

| Meta-character | Escaped Form | Description |
| --- | --- | --- |
| . | \. | Matches a literal dot. |
| * | \* | Matches a literal asterisk. |
| (, ) | \(, \) | Matches literal parentheses. |

Example:

  • Pattern: 3\.14

    • Matches: "3.14".

    • Does not match: "314".
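In Java source the escape needs a doubled backslash, and Pattern.quote() can escape an entire literal at once. The following is a small sketch; the helper name matchesLiterally is an assumption for illustration.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Treats every character of 'literal' as plain text by quoting it.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("3.14", "3x14"));     // true: unescaped . matches 'x'
        System.out.println(Pattern.matches("3\\.14", "3x14"));   // false: \. needs a real dot
        System.out.println(Pattern.matches("3\\.14", "3.14"));   // true
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));   // false: + is a quantifier here
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));  // true: + is quoted
    }
}
```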

8. Assertions

Assertions are zero-width patterns that check for specific conditions without consuming any characters.

| Assertion | Meaning | Example |
| --- | --- | --- |
| Lookahead | Matches if a pattern exists ahead. | Pattern: foo(?=bar) → Matches: "foo" if "bar" follows. |
| Negative Lookahead | Matches if a pattern does NOT exist ahead. | Pattern: foo(?!bar) → Matches: "foo" if "bar" does NOT follow. |
| Lookbehind | Matches if a pattern exists behind. | Pattern: (?<=bar)foo → Matches: "foo" if "bar" precedes. |
| Negative Lookbehind | Matches if a pattern does NOT exist behind. | Pattern: (?<!bar)foo → Matches: "foo" if "bar" does NOT precede. |
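Because assertions are zero-width, the text they check is not part of the returned match. The sketch below shows this with a hypothetical helper, first, that returns the first match or null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // First match of the regex in the input, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(first("foo(?=bar)", "foobar"));       // foo ("bar" is checked, not consumed)
        System.out.println(first("foo(?!bar)", "foobar"));       // null
        System.out.println(first("(?<=bar)foo", "barfoo"));      // foo
        System.out.println(first("(?<!bar)foo", "barfoo"));      // null
        System.out.println(first("(?<=\\$)\\d+", "cost: $45"));  // 45 (the "$" is not included)
    }
}
```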

9. Greedy, Reluctant, and Possessive Quantifiers

Quantifiers in regex can control how much text they try to match:

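| Type | Symbol | Behavior |
| --- | --- | --- |
| Greedy | *, +, ?, {n,m} | Matches as much as possible, then backtracks if needed (default). |
| Reluctant | *?, +?, ??, {n,m}? | Matches as little as possible, then extends if needed. |
| Possessive | *+, ++, ?+, {n,m}+ | Matches as much as possible without backtracking. |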

Example (input: "a123b456b"):

  • Pattern: a.*b (Greedy)

    • Matches: "a123b456b" (the entire string, up to the last "b").

  • Pattern: a.*?b (Reluctant)

    • Matches: "a123b" (stops at the first "b").
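The three quantifier modes can be compared on the same input. This is a minimal sketch with an assumed helper name; note that the possessive form finds nothing here because .*+ consumes the final "b" and never gives it back.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierModes {
    // First match of the regex in the input, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "a123b456b";
        System.out.println(first("a.*b", input));   // a123b456b (greedy)
        System.out.println(first("a.*?b", input));  // a123b     (reluctant)
        System.out.println(first("a.*+b", input));  // null      (possessive: no backtracking)
    }
}
```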

Pattern

About Pattern

The Pattern class represents a compiled regex. It is immutable and thread-safe, meaning a single Pattern instance can be shared across threads.

Advantages:

  • Pre-compiling a regex with Pattern.compile() improves performance for repeated use.

  • Pattern provides advanced regex features like flags and Unicode support.

Features

| Feature | Description |
| --- | --- |
| Pre-compilation | Compiles a regex once to avoid re-compilation in repeated use. |
| Flags | Enable special behavior like case-insensitivity or dotall mode. |
| Group Extraction | Supports capturing groups using parentheses for extracting matched sub-patterns. |
| Unicode Support | Supports Unicode-aware character classes and case folding. |
| Advanced Assertions | Provides zero-width assertions like lookaheads and lookbehinds. |
| Performance Optimization | Supports possessive quantifiers and atomic groups to reduce backtracking. |
| Escaping Characters | Allows matching meta-characters as literals (e.g., \\. to match a dot). |

Supported Methods in Pattern

| Feature Group | Method / Field | Description |
| --- | --- | --- |
| Compilation | static Pattern compile(String regex) | Compiles a regex into a pattern. |
| | static Pattern compile(String regex, int flags) | Compiles a regex with specific flags. |
| Flags | int flags() | Returns the flags used when compiling the pattern. |
| Matching | static boolean matches(String regex, CharSequence input) | Compiles the regex and checks whether the entire input matches it. |
| | Matcher matcher(CharSequence input) | Creates a Matcher that applies this pattern to the input. |
| Pattern Retrieval | String pattern() | Returns the regex pattern as a string. |
| Splitting Strings | String[] split(CharSequence input) | Splits the input string around matches of the pattern. |
| | String[] split(CharSequence input, int limit) | Splits the input string around matches; limit controls the maximum number of resulting elements. |
| Unicode Support | static final int UNICODE_CASE | Flag constant: enables Unicode-aware case folding (used together with CASE_INSENSITIVE). |
| | static final int UNICODE_CHARACTER_CLASS | Flag constant: enables Unicode-aware character classes. |
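The following sketch exercises several of the Pattern methods listed above on a comma-separated string. The class name and the COMMA constant are assumptions for illustration; the limit argument of split() behaves as described in the Java API documentation.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PatternMethodsDemo {
    // A comma with optional whitespace on either side.
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        System.out.println(COMMA.pattern());  // \s*,\s*
        System.out.println((COMMA.flags() & Pattern.CASE_INSENSITIVE) != 0);  // true

        String input = "a, b ,c,,";
        // Default: trailing empty strings are dropped.
        System.out.println(Arrays.toString(COMMA.split(input)));     // [a, b, c]
        // limit = 2: at most two elements; the rest of the input stays in the last one.
        System.out.println(Arrays.toString(COMMA.split(input, 2)));  // [a, b ,c,,]
        // Negative limit: trailing empty strings are kept.
        System.out.println(COMMA.split(input, -1).length);           // 5

        // One-off full match without keeping the Pattern around.
        System.out.println(Pattern.matches("\\d+", "123"));          // true
    }
}
```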

Some Regex Symbols

| Symbol | Description |
| --- | --- |
| . | Matches any single character except a line terminator. |
| \d | Matches a digit (equivalent to [0-9]). |
| \D | Matches a non-digit (equivalent to [^0-9]). |
| \w | Matches a word character (alphanumeric or _). |
| \W | Matches a non-word character (opposite of \w). |
| \s | Matches a whitespace character (spaces, tabs, newlines). |
| \S | Matches a non-whitespace character. |
| ^ | Matches the beginning of a line or string. |
| $ | Matches the end of a line or string. |
| \b | Matches a word boundary. |
| \B | Matches a position that is not a word boundary. |
| [...] | Matches any character inside the brackets (e.g., [abc] matches "a", "b", or "c"). |
| [^...] | Matches any character NOT inside the brackets (e.g., [^abc] matches anything except "a", "b", or "c"). |
| ? | Matches 0 or 1 occurrence of the preceding element. |
| * | Matches 0 or more occurrences of the preceding element (greedy). |
| + | Matches 1 or more occurrences of the preceding element (greedy). |
| {n} | Matches exactly n occurrences of the preceding element. |
| {n,} | Matches at least n occurrences of the preceding element. |
| {n,m} | Matches between n and m occurrences of the preceding element. |
| (?=...) | Positive lookahead: Ensures that a certain pattern follows. |
| (?!...) | Negative lookahead: Ensures that a certain pattern does NOT follow. |
| (?<=...) | Positive lookbehind: Ensures that a certain pattern precedes. |
| (?<!...) | Negative lookbehind: Ensures that a certain pattern does NOT precede. |
| \ | Escapes special characters (e.g., \. matches a literal dot; written "\\." in a Java string literal). |

Matcher

About Matcher

The Matcher class in Java represents an engine that performs match operations on a character sequence using a Pattern. It works as a stateful iterator, allowing for complex matching, group extraction, and replacement operations. The Matcher is not thread-safe, so each thread must use its own instance if concurrency is required.

Features

| Feature | Description |
| --- | --- |
| Stateful Matching | Allows iteration through matches in a target string using find(). |
| Group Extraction | Extracts specific parts of the matched text using capturing groups ( ). |
| Position Tracking | Tracks the start and end positions of matches within the input string. |
| Regex Replacement | Performs targeted replacement using regex patterns with replaceAll() and replaceFirst(). |
| Anchored Matching | Matches from the beginning of the input: matches() requires the entire input to match, lookingAt() only a prefix. |
| Region Matching | Limits matching to a specific substring of the input. |
| Reset Functionality | Allows resetting the Matcher, optionally with a new input (reset(CharSequence)) or a new pattern (usePattern(Pattern)). |

Supported Methods in Matcher

| Feature Group | Method | Description |
| --- | --- | --- |
| Matching | boolean matches() | Attempts to match the entire input sequence against the pattern. |
| | boolean lookingAt() | Attempts to match the pattern at the beginning of the input, without requiring the whole input to match. |
| | boolean find() | Finds the next subsequence that matches the pattern. |
| | boolean find(int start) | Resets the matcher, then finds the next match starting at the specified index. |
| Group Extraction | String group() | Returns the matched subsequence from the last match. |
| | String group(int group) | Returns the specified capturing group's matched subsequence. |
| | String group(String name) | Returns the subsequence captured by the named group. |
| | int groupCount() | Returns the number of capturing groups in the pattern (group 0, the whole match, is not counted). |
| Position Tracking | int start() | Returns the start index of the last match. |
| | int start(int group) | Returns the start index of the specified group in the last match. |
| | int end() | Returns the end index (exclusive) of the last match. |
| | int end(int group) | Returns the end index (exclusive) of the specified group in the last match. |
| Replacement | String replaceAll(String replacement) | Replaces every subsequence that matches the pattern with the replacement string. |
| | String replaceFirst(String replacement) | Replaces the first subsequence that matches the pattern with the replacement string. |
| | Matcher appendReplacement(StringBuffer sb, String replacement) | Appends the input between the previous match and the current one, followed by the replacement, to the StringBuffer. |
| | StringBuffer appendTail(StringBuffer sb) | Appends the remaining input after the last match to the StringBuffer. |
| Region Matching | Matcher region(int start, int end) | Sets the bounds (start inclusive, end exclusive) of the region within which matches are searched. |
| | boolean hasTransparentBounds() | Checks if the matcher uses transparent bounds. |
| | Matcher useTransparentBounds(boolean b) | Sets whether the matcher uses transparent bounds. |
| | boolean hasAnchoringBounds() | Checks if the matcher uses anchoring bounds. |
| | Matcher useAnchoringBounds(boolean b) | Sets whether the matcher uses anchoring bounds. |
| Reset | Matcher reset() | Resets the matcher, clearing any previous match state. |
| | Matcher reset(CharSequence input) | Resets the matcher with a new input sequence. |
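A short walkthrough of the stateful methods above: iterate with find(), read group(), start(), and end() after each hit, then compare lookingAt() with matches() and reuse the matcher through reset(). The class and method names are assumptions for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherWalkthrough {
    // Describes every match as "text@start-end" (end is exclusive).
    static List<String> describeMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("\\d+", "a12b345"));  // [12@1-3, 345@4-7]

        Matcher m = Pattern.compile("\\d+").matcher("12ab");
        System.out.println(m.lookingAt());  // true: the input starts with digits
        System.out.println(m.matches());    // false: "ab" is left over

        m.reset("789");                     // same matcher, new input
        System.out.println(m.matches());    // true
    }
}
```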

Named Capturing Groups

Named Capturing Groups allow us to assign names to specific groups in a regex pattern. This makes it easier to extract data without relying on the group index.

Syntax

  • Use the format (?<name>...) to define a named group.

  • Use Matcher.group("name") to retrieve the content of the named group.

Example

Pattern pattern = Pattern.compile("(?<day>\\d{2})-(?<month>\\d{2})-(?<year>\\d{4})");
Matcher matcher = pattern.matcher("15-08-2023");
if (matcher.matches()) {
    System.out.println("Day: " + matcher.group("day"));    // Output: 15
    System.out.println("Month: " + matcher.group("month")); // Output: 08
    System.out.println("Year: " + matcher.group("year"));   // Output: 2023
}

Advantages:

  • Improves code readability.

  • Reduces errors caused by incorrect group indices.
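Named groups can also be referenced in replacement strings with the ${name} syntax. The sketch below reorders a date that way; the class name, method name, and target date format are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

public class NamedGroupReplace {
    static final Pattern DMY =
            Pattern.compile("(?<day>\\d{2})-(?<month>\\d{2})-(?<year>\\d{4})");

    // Rewrites every dd-MM-yyyy date in the text as yyyy-MM-dd.
    static String toIso(String text) {
        return DMY.matcher(text).replaceAll("${year}-${month}-${day}");
    }

    public static void main(String[] args) {
        System.out.println(toIso("Due 15-08-2023"));  // Due 2023-08-15
    }
}
```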

Atomic Groups

Atomic Groups are used to prevent backtracking within a group. Once a group is matched, the regex engine will not revisit it, even if the match fails later.

Syntax

  • Use the format (?>...) to define an atomic group.

Example

Pattern pattern = Pattern.compile("(?>a|aa)b");
Matcher matcher = pattern.matcher("aab");
System.out.println(matcher.matches()); // Output: false

Explanation:

  • (?>a|aa) matches the first "a" and commits to it. The following b then fails against the second "a", and because the group is atomic the regex engine does not backtrack to try the "aa" alternative, so matches() returns false.

Use Cases:

  • Performance Optimization: Reduces backtracking for large or complex patterns.

  • Committed Matching: Once the group has matched, the engine keeps that match instead of retrying other ways to match the same text.

When to Use:

  • When matching rules within a group are strict and should not allow any backtracking.

  • When the regex is suffering from performance issues due to excessive backtracking.
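To make the effect of atomic groups concrete, the sketch below compares a regular non-capturing group with atomic groups on the same input. It also suggests that the order of alternatives matters inside an atomic group. The helper name full is an assumption.

```java
import java.util.regex.Pattern;

public class AtomicDemo {
    // True if the regex matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(full("(?:a|aa)b", "aab"));  // true: backtracks from "a" to "aa"
        System.out.println(full("(?>a|aa)b", "aab"));  // false: commits to "a", never tries "aa"
        System.out.println(full("(?>aa|a)b", "aab"));  // true: "aa" is tried first
    }
}
```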

How Pattern and Matcher Work Together?

The Pattern and Matcher classes in Java's java.util.regex package work together to provide a mechanism for regular expression processing.

Relationship Between Pattern and Matcher

  • Pattern: Represents the compiled version of a regular expression. It is immutable and thread-safe. You create a Pattern once and reuse it across multiple matching operations.

  • Matcher: Represents the engine that performs match operations against a specific input string using the Pattern. It is stateful and not thread-safe.

Workflow

  1. Compile the Regex: A Pattern object is created using Pattern.compile(String regex). This compiles the regex for better performance.

  2. Create a Matcher: A Matcher object is created from the Pattern using Pattern.matcher(CharSequence input).

  3. Perform Matching Operations: The Matcher is used to perform operations like find(), matches(), or replaceAll() on the input string.

import java.util.regex.*;

public class RegexExample {
    public static void main(String[] args) {
        // Step 1: Compile the regex
        Pattern pattern = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");
        
        // Step 2: Create a matcher for the input string
        Matcher matcher = pattern.matcher("123-45-6789");
        
        // Step 3: Perform matching operations
        if (matcher.matches()) {
            System.out.println("The input matches the pattern."); //The input matches the pattern.
        } else {
            System.out.println("The input does not match the pattern.");
        }
    }
}

  • The regex \\d{3}-\\d{2}-\\d{4} is compiled into a Pattern.

  • The Pattern is used to create a Matcher for the input string "123-45-6789".

  • The matches() method checks if the entire input matches the regex.

Performance Optimization Techniques

Regex operations can sometimes be computationally expensive. Below are techniques to optimize the performance of Pattern and Matcher:

1. Compile the Pattern Once

  • Problem: Re-compiling the regex repeatedly can be expensive.

  • Solution: Compile the regex once using Pattern.compile() and reuse the Pattern object across multiple matching operations.

// Compile once
Pattern pattern = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

// Reuse Pattern for multiple inputs
Matcher matcher1 = pattern.matcher("123-45-6789");
Matcher matcher2 = pattern.matcher("987-65-4321");

2. Use Lazy Quantifiers When Appropriate

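  • Problem: Greedy quantifiers (*, +, ?) first consume as much input as possible and then backtrack, which can be wasteful when the intended match is short relative to the remaining input.

  • Solution: Use lazy quantifiers (*?, +?, ??) when the match should end at the first possible position. Lazy quantifiers also change what is matched and are not automatically faster; a negated character class such as [^b]*b often avoids the backtracking altogether.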

// Greedy
Pattern greedyPattern = Pattern.compile(".*b");

// Lazy
Pattern lazyPattern = Pattern.compile(".*?b");
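Greedy and lazy quantifiers differ in what they match, not only in how much work they do. The sketch below uses a tag-like input made up for illustration and adds a negated character class as a third option that often needs no backtracking; the helper name first is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyDemo {
    // First match of the regex in the input, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b>";
        System.out.println(first("<.*>", input));     // <b>bold</b> (greedy runs to the last '>')
        System.out.println(first("<.*?>", input));    // <b>         (lazy stops at the first '>')
        System.out.println(first("<[^>]*>", input));  // <b>         (negated class cannot pass '>')
    }
}
```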

3. Avoid Catastrophic Backtracking

  • Problem: Nested quantifiers can lead to exponential backtracking, causing performance issues.

  • Solution:

    • Use atomic groups ((?>...)) to prevent backtracking.

    • Simplify regex patterns to reduce complexity.

// Problematic regex
Pattern pattern = Pattern.compile("(a+)+b");

// Optimized with atomic groups
Pattern atomicPattern = Pattern.compile("(?>(a+))+b");
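For this particular pattern the nested quantifier is redundant: (a+)+b accepts the same strings as a+b, so simplifying is another option, and the possessive form a++b rules out backtracking into the run of a's. The sketch below only checks that the simplified patterns agree on a matching input and on a long non-matching one; it is not a benchmark, and the helper repeat is an assumption added to stay compatible with older Java versions.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class BacktrackingDemo {
    // Builds a string of n copies of c (hypothetical helper for this sketch).
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String nearMiss = repeat('a', 1000) + "c";  // many a's, but no trailing b

        // Simplified pattern: one quantifier instead of two nested ones.
        System.out.println(Pattern.matches("a+b", "aaab"));     // true
        System.out.println(Pattern.matches("a+b", nearMiss));   // false

        // Possessive quantifier: a++ never gives characters back.
        System.out.println(Pattern.matches("a++b", "aaab"));    // true
        System.out.println(Pattern.matches("a++b", nearMiss));  // false
    }
}
```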

4. Use Predefined Character Classes

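  • Problem: Spelling out character classes like [a-zA-Z0-9_] by hand makes a regex verbose and easier to get wrong.

  • Solution: Use predefined character classes like \\w (word character), \\d (digit), or \\s (whitespace). By default \\w is equivalent to [a-zA-Z_0-9], so the gain is mainly readability; any speed difference is usually small.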

// Custom character class
Pattern custom = Pattern.compile("[a-zA-Z0-9_]");

// Predefined character class
Pattern predefined = Pattern.compile("\\w");

5. Limit the Region for Matching

  • Problem: Searching the entire string when only a portion is relevant can waste time.

  • Solution: Use Matcher.region(int start, int end) to limit matching to a specific substring.

Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher("123-45-6789");
matcher.region(4, 11); // Search only within "45-6789" (start index inclusive, end index exclusive)
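Since the end index of region() is exclusive, off-by-one mistakes are easy to make. The sketch below lists the digit runs found inside a region together with their start index, which is reported relative to the full input; the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegionDemo {
    // Finds digit runs inside input[start, end) and reports them as "text@start".
    static List<String> numbersInRegion(String input, int start, int end) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        m.region(start, end);  // start inclusive, end exclusive
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group() + "@" + m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "123-45-6789";
        System.out.println(numbersInRegion(input, 4, 11));  // [45@4, 6789@7]  region is "45-6789"
        System.out.println(numbersInRegion(input, 4, 9));   // [45@4, 67@7]    region is "45-67"
    }
}
```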

6. Use Anchors for Efficiency

  • Problem: Matching without specifying start (^) or end ($) anchors can lead to unnecessary scanning.

  • Solution: Use anchors to match at specific positions in the input.

// Match only if the entire input is a number
Pattern pattern = Pattern.compile("^\\d+$");

7. Optimize Replacement Operations

  • Problem: Using complex patterns for replacement can be inefficient.

  • Solution:

    • Use Matcher.appendReplacement() and Matcher.appendTail() for fine-grained control.

    • Precompile the Pattern for repeated replacements.
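When the replacement text has to be computed per match, appendReplacement() and appendTail() can be combined in a find() loop. The sketch below doubles every number in a sentence; the task and the names are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceDemo {
    // Precompiled once and reused for every call.
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Replaces each number in the text with twice its value.
    static String doubleNumbers(String text) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the text before this match, then the computed replacement.
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        m.appendTail(sb);  // copies whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("2 apples and 10 pears"));  // 4 apples and 20 pears
    }
}
```

If the computed replacement could contain $ or \, wrapping it in Matcher.quoteReplacement() keeps those characters from being interpreted as group references.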

8. Profile and Benchmark Regex

  • Use tools like JMH (Java Microbenchmark Harness) to benchmark regex operations.

  • Analyze the runtime behavior of regex patterns and optimize accordingly.

9. Avoid Using Regex When Simpler Solutions Exist

  • Regex is powerful but can be overkill for simple operations. For example:

    • Use String.contains() for simple substring checks.

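    • For splitting on a fixed delimiter, remember that String.split() also interprets its argument as a regex; escape the delimiter with Pattern.quote() or split manually with String.indexOf() and substring().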
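The sketch below contrasts a plain substring check with a split on a literal dot. Because String.split() treats its argument as a regex, an unescaped "." matches every character; the helper name splitLiteral is an assumption.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SimpleAlternatives {
    // Splits on the delimiter as plain text by quoting it first.
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String input = "a.b.c";
        System.out.println(input.contains("."));                        // true: no regex involved
        System.out.println(input.split(".").length);                    // 0: "." matched every character
        System.out.println(Arrays.toString(splitLiteral(input, ".")));  // [a, b, c]
    }
}
```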
