# Regex (Pattern and Matcher)

## **About Regex**

**Regex** (Regular Expression) is a sequence of characters that forms a search pattern. It is widely used for:

* Validating inputs (e.g., email, phone numbers).
* Searching and extracting text from larger strings.
* Replacing patterns in text.
* Splitting strings.

## **Terminology**

### **1. Literals**

Literals in regex are characters that match themselves exactly. They are the simplest building blocks of a regex pattern.

* **Example**:
  * Pattern: `abc`
  * Matches: The string "abc" exactly, no variations.
  * Does not match: "ab". The string "abcd" contains a match (found with `find()`), but it does not match as a whole string (`matches()`).
* **Use Case**: Used when you want to match static text exactly as it appears.

### **2. Meta-characters**

Meta-characters are special characters in regex that have a unique meaning or functionality. They are used to define patterns beyond literal characters.


| **Meta-character** | **Meaning** | **Example** |
| ------------------ | ----------- | ----------- |
| `.` | Matches any single character (except newline). | Pattern: `a.c` → Matches: "abc", "a3c". |
| `^` | Matches the beginning of a string. | Pattern: `^abc` → Matches: "abc" at the start of the string. |
| `$` | Matches the end of a string. | Pattern: `abc$` → Matches: "abc" at the end of the string. |
| `[]` | Denotes a character set. | Pattern: `[a-z]` → Matches any lowercase letter. |
| `\` | Escapes meta-characters to treat them as literals. | Pattern: `\.` → Matches a literal dot ("."). |
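The meta-characters in the table above can be tried out with a small sketch like the following. The class `MetaCharacterDemo` and its helper `found` are hypothetical names introduced only for illustration; the helper uses `find()`, so it reports a match anywhere in the input.

```java
import java.util.regex.Pattern;

class MetaCharacterDemo {
    // Hypothetical helper: true when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("a.c", "a3c"));     // true: '.' stands for any one character
        System.out.println(found("^abc", "xabc"));   // false: "abc" is not at the start
        System.out.println(found("abc$", "xabc"));   // true: "abc" is at the end
        System.out.println(found("[a-z]", "123"));   // false: no lowercase letter present
        System.out.println(found("3\\.14", "3.14")); // true: the escaped dot matches a literal '.'
        System.out.println(found("3\\.14", "3514")); // false: '5' is not a literal dot
    }
}
```

Note that the backslash is doubled inside the Java string literal: the regex `\.` is written as `"\\."`.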

### **3. Quantifiers**

Quantifiers define the number of occurrences of a character or group that must match for a pattern to be valid.


| **Quantifier** | **Meaning** | **Example** |
| -------------- | ----------- | ----------- |
| `*` | Matches 0 or more occurrences. | Pattern: `ab*` → Matches: "a", "ab", "abb", "abbb". |
| `+` | Matches 1 or more occurrences. | Pattern: `ab+` → Matches: "ab", "abb", "abbb". |
| `?` | Matches 0 or 1 occurrence. | Pattern: `ab?` → Matches: "a", "ab". |
| `{n}` | Matches exactly `n` occurrences. | Pattern: `a{2}` → Matches: "aa". |
| `{n,}` | Matches at least `n` occurrences. | Pattern: `a{2,}` → Matches: "aa", "aaa", "aaaa". |
| `{n,m}` | Matches between `n` and `m` occurrences. | Pattern: `a{2,4}` → Matches: "aa", "aaa", "aaaa". |
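As a quick check of the quantifier table, the sketch below tests a few inputs against whole-string matches. `QuantifierDemo` and `wholeMatch` are illustrative names, not part of the Java API; `Pattern.matches` is the standard static convenience method.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: true only when the entire input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("ab*", "a"));        // true: zero 'b' is allowed
        System.out.println(wholeMatch("ab+", "a"));        // false: at least one 'b' required
        System.out.println(wholeMatch("ab?", "abb"));      // false: at most one 'b' allowed
        System.out.println(wholeMatch("a{2}", "aaa"));     // false: exactly two 'a' required
        System.out.println(wholeMatch("a{2,}", "aaaa"));   // true: two or more 'a'
        System.out.println(wholeMatch("a{2,4}", "aaaaa")); // false: five 'a' exceeds the upper bound
    }
}
```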

### **4. Groups**

Groups are portions of a regex enclosed in parentheses `()` that allow:

* Capturing and extracting parts of a match.
* Applying quantifiers to an entire group.

#### **Types of Groups**:

1. **Capturing Groups**:
   * Regular parentheses `( )` are used to capture matched sub-patterns.
   * Example:
     * Pattern: `(a|b)c`
     * Matches: "ac" or "bc"
     * Captures: "a" or "b".
2. **Non-Capturing Groups**:
   * `(?: )` are used for grouping without capturing.
   * Example:
     * Pattern: `(?:a|b)c`
     * Matches: "ac" or "bc"
     * Captures: None.

### **5. Flags**

Flags are optional modifiers that change the behavior of a regex. They are typically passed as the second argument to `Pattern.compile()` in Java.


| **Flag** | **Description** | **Code** |
| -------- | --------------- | -------- |
| `CASE_INSENSITIVE` | Makes the pattern case-insensitive. | `Pattern.CASE_INSENSITIVE` |
| `MULTILINE` | Makes `^` and `$` match the start/end of each line. | `Pattern.MULTILINE` |
| `DOTALL` | Makes `.` match newlines as well. | `Pattern.DOTALL` |
| `UNICODE_CASE` | Enables Unicode-aware case-insensitive matching (used together with `CASE_INSENSITIVE`). | `Pattern.UNICODE_CASE` |
| `UNIX_LINES` | Recognizes only `\n` as a line terminator for `.`, `^`, and `$`. | `Pattern.UNIX_LINES` |
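Because the flags are `int` bit masks, several can be combined with the bitwise OR operator `|`. The following sketch illustrates `MULTILINE`, `CASE_INSENSITIVE`, and `DOTALL` on a three-line input; the class `FlagDemo` and the helper `countMatches` are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Hypothetical helper: counts how many times the pattern is found in the input.
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "one\nONE\nOne";

        // Without MULTILINE, ^ and $ anchor to the whole input, so nothing matches.
        System.out.println(countMatches(Pattern.compile("^one$"), text)); // 0

        // MULTILINE lets ^ and $ match at every line; only the first line is lowercase.
        System.out.println(countMatches(Pattern.compile("^one$", Pattern.MULTILINE), text)); // 1

        // Combining flags with | also ignores case, so all three lines match.
        Pattern both = Pattern.compile("^one$", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
        System.out.println(countMatches(both, text)); // 3

        // DOTALL lets '.' match the newline between the first two lines.
        System.out.println(countMatches(Pattern.compile("one.ONE"), text));                 // 0
        System.out.println(countMatches(Pattern.compile("one.ONE", Pattern.DOTALL), text)); // 1
    }
}
```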

**Example**:

```java
Pattern pattern = Pattern.compile("abc", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("ABC");
System.out.println(matcher.matches()); // true: "ABC" matches due to case-insensitivity
```

### **6. Anchors**

Anchors are zero-width assertions that specify positions in the string (not actual characters).


| **Anchor** | **Meaning** | **Example** |
| ---------- | ----------- | ----------- |
| `^` | Matches the start of a string. | Pattern: `^abc` → Matches: "abc" at the start. |
| `$` | Matches the end of a string. | Pattern: `abc$` → Matches: "abc" at the end. |
| `\b` | Matches a word boundary. | Pattern: `\bword\b` → Matches "word" as a whole word. |
| `\B` | Matches non-word boundaries. | Pattern: `\Bword\B` → Matches "word" inside another word. |
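The difference between `\b` and `\B` is easiest to see by counting matches in a sample sentence. This is a minimal sketch; `AnchorDemo`, `count`, and the sample text are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Hypothetical helper: number of matches of the regex in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "word swordsman password word";

        // Whole-word occurrences only: the first and the last "word".
        System.out.println(count("\\bword\\b", text)); // 2

        // "word" with word characters on both sides: only inside "swordsman".
        System.out.println(count("\\Bword\\B", text)); // 1

        // Anchored to the start and to the end of the input.
        System.out.println(count("^word", text)); // 1
        System.out.println(count("word$", text)); // 1
    }
}
```

"password" is not counted by `\Bword\B` because the trailing "d" is followed by a space, which is a word boundary.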

### **7. Escaping**

Since some characters (meta-characters) have special meanings in regex, they must be escaped with a backslash (`\`) to be treated literally.

| **Meta-character** | **Escaped Form** | **Description** |
| ------------------ | ---------------- | ---------------------------- |
| `.` | `\.` | Matches a literal dot. |
| `*` | `\*` | Matches a literal asterisk. |
| `(`, `)` | `\(`, `\)` | Matches literal parentheses. |

**Example**:

* Pattern: `3\.14`
* Matches: "3.14".
* Does not match: "314".

### **8. Assertions**

Assertions are zero-width patterns that check for specific conditions without consuming any characters.


| **Assertion** | **Meaning** | **Example** |
| ------------- | ----------- | ----------- |
| Lookahead | Matches if a pattern exists ahead. | Pattern: `foo(?=bar)` → Matches: "foo" if "bar" follows. |
| Negative Lookahead | Matches if a pattern does NOT exist ahead. | Pattern: `foo(?!bar)` → Matches: "foo" if "bar" does NOT follow. |
| Lookbehind | Matches if a pattern exists behind. | Pattern: `(?<=bar)foo` → Matches: "foo" if "bar" precedes. |
| Negative Lookbehind | Matches if a pattern does NOT exist behind. | Pattern: `(?<!bar)foo` → Matches: "foo" if "bar" does NOT precede. |
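The four assertions from the table can be compared by listing where each one finds "foo" in the same input. This is only a sketch; `LookaroundDemo`, `matchStarts`, and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Hypothetical helper: start index of every match of the regex in the input.
    static List<Integer> matchStarts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "foo" occurs at index 0 (before "bar"), 7 (before "baz") and 17 (after "bar").
        String text = "foobar foobaz barfoo";

        System.out.println(matchStarts("foo(?=bar)", text));  // [0]
        System.out.println(matchStarts("foo(?!bar)", text));  // [7, 17]
        System.out.println(matchStarts("(?<=bar)foo", text)); // [17]
        System.out.println(matchStarts("(?<!bar)foo", text)); // [0, 7]
    }
}
```

In every case the matched text is just "foo": the assertion checks the surrounding characters without making them part of the match.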

### **9. Greedy, Reluctant, and Possessive Quantifiers**

Quantifiers in regex can control how much text they try to match:

| **Type** | **Symbol** | **Behavior** |
| -------------- | ------------------- | ------------------------------------------------- |
| **Greedy** | `*`, `+`, `?`, `{}` | Matches as much as possible (default). |
| **Reluctant** | `*?`, `+?`, `??` | Matches as little as possible. |
| **Possessive** | `*+`, `++`, `?+` | Matches as much as possible without backtracking. |

**Example** (input: "a123b456b"):

* Pattern: `a.*b` (Greedy)
  * Matches: "a123b456b" (entire string).
* Pattern: `a.*?b` (Reluctant)
  * Matches: "a123b" (stops after first "b").

## **Pattern**

### **About Pattern**

The `Pattern` class represents a compiled regex. It is immutable and thread-safe, meaning a single `Pattern` instance can be shared across threads.

### **Advantages**:

* Pre-compiling a regex with `Pattern.compile()` improves performance for repeated use.
* `Pattern` provides advanced regex features like flags and Unicode support.

### **Features**


| **Feature** | **Description** |
| ----------- | --------------- |
| Pre-compilation | Compiles a regex once to avoid re-compilation in repeated use. |
| Flags | Enable special behavior like case-insensitivity or dotall mode. |
| Group Extraction | Supports capturing groups using parentheses for extracting matched sub-patterns. |
| Unicode Support | Supports Unicode-aware character classes and case folding. |
| Advanced Assertions | Provides zero-width assertions like lookaheads and lookbehinds. |
| Performance Optimization | Supports possessive quantifiers and atomic groups to reduce backtracking. |
| Escaping Characters | Allows matching meta-characters as literals (e.g., `\.` to match a dot, written `"\\."` in a Java string literal). |

### **Supported Methods in Pattern**


| **Feature Group** | **Method / Constant** | **Description** |
| ----------------- | --------------------- | --------------- |
| Compilation | `static Pattern compile(String regex)` | Compiles a regex into a pattern. |
|  | `static Pattern compile(String regex, int flags)` | Compiles a regex with specific flags. |
| Flags | `int flags()` | Returns the flags used when compiling the pattern. |
| Matching | `static boolean matches(String regex, CharSequence input)` | Compiles the regex and matches the entire input against it. |
| Pattern Retrieval | `String pattern()` | Returns the regex pattern as a string. |
| Splitting Strings | `String[] split(CharSequence input)` | Splits the input string around matches of the pattern. |
|  | `String[] split(CharSequence input, int limit)` | Splits the input string around matches, with a limit on splits. |
| Unicode Support | `Pattern.UNICODE_CASE` | Flag constant (not a method) that enables Unicode-aware case folding. |
|  | `Pattern.UNICODE_CHARACTER_CLASS` | Flag constant (not a method) that enables Unicode-aware character classes. |
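A short sketch of the `Pattern` methods listed above follows. The class name `PatternMethodsDemo`, the `COMMA` pattern, and the two split helpers are assumptions made for illustration; the `Pattern` calls themselves are the standard ones from the table.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PatternMethodsDemo {
    // Compiled once and reused: a comma with optional surrounding whitespace.
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Hypothetical helper: split on every delimiter.
    static String[] splitAll(String csv) {
        return COMMA.split(csv);
    }

    // Hypothetical helper: produce at most `limit` pieces.
    static String[] splitLimited(String csv, int limit) {
        return COMMA.split(csv, limit);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAll("a , b,c ,d")));        // [a, b, c, d]
        System.out.println(Arrays.toString(splitLimited("a , b,c ,d", 2))); // [a, b,c ,d]

        // pattern() returns the source regex; flags() returns the compile-time flags.
        System.out.println(COMMA.pattern()); // \s*,\s*
        Pattern p = Pattern.compile("abc", Pattern.CASE_INSENSITIVE);
        System.out.println(p.flags() == Pattern.CASE_INSENSITIVE); // true

        // Static convenience method: compiles the regex and matches the whole input.
        System.out.println(Pattern.matches("\\d+", "12345")); // true
        System.out.println(Pattern.matches("\\d+", "123a"));  // false
    }
}
```

With a limit of 2 the remainder of the input, including later delimiters, is kept intact in the last element.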

### **Some Regex Symbols**


| **Symbol** | **Description** |
| ---------- | --------------- |
| `.` | Matches any single character except a newline. |
| `\d` | Matches a digit (equivalent to `[0-9]`). |
| `\D` | Matches a non-digit (equivalent to `[^0-9]`). |
| `\w` | Matches a word character (alphanumeric or `_`). |
| `\W` | Matches a non-word character (opposite of `\w`). |
| `\s` | Matches a whitespace character (spaces, tabs, newlines). |
| `\S` | Matches a non-whitespace character. |
| `^` | Matches the beginning of a line or string. |
| `$` | Matches the end of a line or string. |
| `\b` | Matches a word boundary. |
| `\B` | Matches a position that is not a word boundary. |
| `[...]` | Matches any character inside the brackets (e.g., `[abc]` matches "a", "b", or "c"). |
| `[^...]` | Matches any character NOT inside the brackets (e.g., `[^abc]` matches anything except "a", "b", or "c"). |
| `?` | Matches 0 or 1 occurrence of the preceding element. |
| `*` | Matches 0 or more occurrences of the preceding element (greedy). |
| `+` | Matches 1 or more occurrences of the preceding element (greedy). |
| `{n}` | Matches exactly `n` occurrences of the preceding element. |
| `{n,}` | Matches at least `n` occurrences of the preceding element. |
| `{n,m}` | Matches between `n` and `m` occurrences of the preceding element. |
| `(?=...)` | Positive lookahead: Ensures that a certain pattern follows. |
| `(?!...)` | Negative lookahead: Ensures that a certain pattern does NOT follow. |
| `(?<=...)` | Positive lookbehind: Ensures that a certain pattern precedes. |
| `(?<!...)` | Negative lookbehind: Ensures that a certain pattern does NOT precede. |
| `\` | Escapes special characters (e.g., `\.` matches a literal dot; in a Java string literal this is written `"\\."`). |
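A few of the predefined and negated character classes from the table can be sketched with `String.replaceAll` and `String.split`, both of which accept a regex. The class and method names below are hypothetical, and each call recompiles its regex, which is fine for a demo but not for hot code paths.

```java
import java.util.Arrays;

class CharacterClassDemo {
    // Replaces every digit (\d) with '#'.
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    // Removes everything that is not a word character (\W).
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    // Splits on runs of whitespace (\s+) after trimming the ends.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    // Replaces every character that is NOT a lowercase vowel with '-'.
    static String keepVowels(String s) {
        return s.replaceAll("[^aeiou]", "-");
    }

    public static void main(String[] args) {
        System.out.println(maskDigits("Order 66 at 9am"));                       // Order ## at #am
        System.out.println(stripNonWord("user_name@example.com"));               // user_nameexamplecom
        System.out.println(Arrays.toString(words("  split   on\twhitespace\n"))); // [split, on, whitespace]
        System.out.println(keepVowels("hello"));                                 // -e--o
    }
}
```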

## **Matcher**

### **About Matcher**

The `Matcher` class in Java represents an engine that performs match operations on a character sequence using a `Pattern`. It works as a stateful iterator, allowing for complex matching, group extraction, and replacement operations. The `Matcher` is **not thread-safe**, so each thread must use its own instance if concurrency is required.

### **Features**


| **Feature** | **Description** |
| ----------- | --------------- |
| Stateful Matching | Allows iteration through matches in a target string using `find()`. |
| Group Extraction | Extracts specific parts of the matched text using capturing groups `( )`. |
| Position Tracking | Tracks the start and end positions of matches within the input string. |
| Regex Replacement | Performs targeted replacement using regex patterns with `replaceAll()` and `replaceFirst()`. |
| Anchored Matching | Matches from the beginning of the string with `matches()` or `lookingAt()`. |
| Region Matching | Limits matching to a specific substring of the input. |
| Reset Functionality | Allows resetting the `Matcher`, optionally with a new input (`reset(CharSequence)`) or a new pattern (`usePattern(Pattern)`). |
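Stateful matching, group extraction, and position tracking usually appear together in a `find()` loop. The sketch below assumes a hypothetical `key=value` input format; `FindLoopDemo` and `describeMatches` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    // Group 1 is the key (word characters), group 2 is the numeric value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    // Hypothetical helper: describes every key=value pair found in the input.
    static List<String> describeMatches(String input) {
        Matcher m = PAIR.matcher(input);
        List<String> out = new ArrayList<>();
        // Each call to find() resumes where the previous match ended.
        while (m.find()) {
            out.add(m.group(1) + " -> " + m.group(2)
                    + " at [" + m.start() + ", " + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeMatches("a=1, bb=22, ccc=333")) {
            System.out.println(line);
        }
        // a -> 1 at [0, 3)
        // bb -> 22 at [5, 10)
        // ccc -> 333 at [12, 19)
    }
}
```

The end index reported by `end()` is exclusive, which is why the ranges are printed as half-open intervals.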

### **Supported Methods in Matcher**


| **Feature Group** | **Method** | **Description** |
| ----------------- | ---------- | --------------- |
| Matching | `boolean matches()` | Attempts to match the entire input sequence against the pattern. |
|  | `boolean lookingAt()` | Attempts to match the input sequence from the beginning (the match need not cover the whole input). |
|  | `boolean find()` | Finds the next subsequence that matches the pattern. |
|  | `boolean find(int start)` | Resets the matcher, starts the search at the specified index, and finds the next match. |
| Group Extraction | `String group()` | Returns the matched subsequence from the last match. |
|  | `String group(int group)` | Returns the specified capturing group's matched subsequence. |
|  | `int groupCount()` | Returns the number of capturing groups in the pattern. |
| Replacement | `String replaceAll(String replacement)` | Replaces every subsequence that matches the pattern with the replacement string. |
|  | `String replaceFirst(String replacement)` | Replaces the first subsequence that matches the pattern with the replacement string. |
|  | `Matcher appendReplacement(StringBuffer sb, String replacement)` | Appends the text before the current match, followed by the replacement, to the `StringBuffer`. |
|  | `StringBuffer appendTail(StringBuffer sb)` | Appends the remaining input after the last match to the `StringBuffer`. |
| Position Tracking | `int start()` | Returns the start index of the last match. |
|  | `int start(int group)` | Returns the start index of the specified group in the last match. |
|  | `int end()` | Returns the end index (exclusive) of the last match. |
|  | `int end(int group)` | Returns the end index (exclusive) of the specified group in the last match. |
| Region Matching | `Matcher region(int start, int end)` | Sets the bounds of the region within which matches are searched. |
|  | `boolean hasTransparentBounds()` | Checks if the matcher uses transparent bounds. |
|  | `Matcher useTransparentBounds(boolean b)` | Sets whether the matcher uses transparent bounds. |
|  | `boolean hasAnchoringBounds()` | Checks if the matcher uses anchoring bounds. |
|  | `Matcher useAnchoringBounds(boolean b)` | Sets whether the matcher uses anchoring bounds. |
| Reset | `Matcher reset()` | Resets the matcher, clearing any previous match state. |
|  | `Matcher reset(CharSequence input)` | Resets the matcher with a new input sequence. |
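The `appendReplacement()` / `appendTail()` pair from the table is useful when the replacement has to be computed from each match. Here is a minimal sketch under that assumption; `AppendReplacementDemo` and `doubleNumbers` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Hypothetical helper: doubles every integer that appears in the text.
    static String doubleNumbers(String text) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the text since the previous match, then the computed replacement.
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        // Copies whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples and 12 pears")); // 6 apples and 24 pears
    }
}
```

The replacement string here contains only digits. If it could contain `$` or `\`, which have special meaning in replacement strings, wrapping it in `Matcher.quoteReplacement()` would be the safer choice.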

### **Named Capturing Groups**

**Named Capturing Groups** allow us to assign names to specific groups in a regex pattern. This makes it easier to extract data without relying on the group index.

#### **Syntax**

* Use the format `(?<name>...)` to define a named group.
* Use `Matcher.group("name")` to retrieve the content of the named group.

#### **Example**

```java
Pattern pattern = Pattern.compile("(?<day>\\d{2})-(?<month>\\d{2})-(?<year>\\d{4})");
Matcher matcher = pattern.matcher("15-08-2023");

if (matcher.matches()) {
    System.out.println("Day: " + matcher.group("day"));     // Output: Day: 15
    System.out.println("Month: " + matcher.group("month")); // Output: Month: 08
    System.out.println("Year: " + matcher.group("year"));   // Output: Year: 2023
}
```

#### **Advantages**:

* Improves code readability.
* Reduces errors caused by incorrect group indices.

### **Atomic Groups**

**Atomic Groups** are used to prevent backtracking within a group. Once a group is matched, the regex engine will not revisit it, even if the match fails later.

#### **Syntax**

* Use the format `(?>...)` to define an atomic group.

#### **Example**

```java
Pattern pattern = Pattern.compile("(?>a|aa)b");
Matcher matcher = pattern.matcher("aab");
System.out.println(matcher.matches()); // Output: false
```

**Explanation**:

* `(?>a|aa)` matches the first "a" (atomic group). The next character is "a", not "b", so the match fails, and the regex engine cannot backtrack into the group to try "aa".

#### **Use Cases**:

* **Performance Optimization**: Reduces backtracking for large or complex patterns.
* **Matching Efficiency**: Ensures that text matched by the group is never re-examined.

#### **When to Use**:

* When matching rules within a group are strict and should not allow any backtracking.
* When the regex is suffering from performance issues due to excessive backtracking.

## **How Pattern and Matcher Work Together?**

The `Pattern` and `Matcher` classes in Java's `java.util.regex` package work together to provide a mechanism for regular expression processing.
### **Relationship Between Pattern and Matcher**

* **`Pattern`**: Represents the compiled version of a regular expression. It is immutable and thread-safe. You create a `Pattern` once and reuse it across multiple matching operations.
* **`Matcher`**: Represents the engine that performs match operations against a specific input string using the `Pattern`. It is stateful and not thread-safe.

### **Workflow**

1. **Compile the Regex**: A `Pattern` object is created using `Pattern.compile(String regex)`. This compiles the regex for better performance.
2. **Create a Matcher**: A `Matcher` object is created from the `Pattern` using `Pattern.matcher(CharSequence input)`.
3. **Perform Matching Operations**: The `Matcher` is used to perform operations like `find()`, `matches()`, or `replaceAll()` on the input string.

```java
import java.util.regex.*;

public class RegexExample {
    public static void main(String[] args) {
        // Step 1: Compile the regex
        Pattern pattern = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

        // Step 2: Create a matcher for the input string
        Matcher matcher = pattern.matcher("123-45-6789");

        // Step 3: Perform matching operations
        if (matcher.matches()) {
            System.out.println("The input matches the pattern."); // This branch is printed
        } else {
            System.out.println("The input does not match the pattern.");
        }
    }
}
```

{% hint style="info" %}
* The regex `\\d{3}-\\d{2}-\\d{4}` is compiled into a `Pattern`.
* The `Pattern` is used to create a `Matcher` for the input string `"123-45-6789"`.
* The `matches()` method checks if the entire input matches the regex.
{% endhint %}

{% hint style="warning" %}
* **Reuse of Pattern**: The `Pattern` can be reused to create multiple `Matcher` instances for different input strings.
* **Statefulness of Matcher**: The `Matcher` retains state during operations (e.g., the position of the last match).
* **Thread-Safety**:
  * `Pattern`: Thread-safe and reusable.
  * `Matcher`: Not thread-safe; each thread should use its own `Matcher` instance.
{% endhint %}

## **Performance Optimization Techniques**

Regex operations can sometimes be computationally expensive. Below are techniques to optimize the performance of `Pattern` and `Matcher`:

### **1. Compile the Pattern Once**

* **Problem**: Re-compiling the regex repeatedly can be expensive.
* **Solution**: Compile the regex once using `Pattern.compile()` and reuse the `Pattern` object across multiple matching operations.

```java
// Compile once
Pattern pattern = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

// Reuse Pattern for multiple inputs
Matcher matcher1 = pattern.matcher("123-45-6789");
Matcher matcher2 = pattern.matcher("987-65-4321");
```

### **2. Use Lazy Quantifiers When Appropriate**

* **Problem**: Greedy quantifiers (`*`, `+`, `?`) first consume as much input as possible and then backtrack, which can be expensive on large input strings when the terminating text appears early.
* **Solution**: Use lazy quantifiers (`*?`, `+?`, `??`) when the shortest match is what you want. Lazy quantifiers still try alternatives step by step, so benchmark before assuming they are faster.

```java
// Greedy
Pattern greedyPattern = Pattern.compile(".*b");

// Lazy
Pattern lazyPattern = Pattern.compile(".*?b");
```

### **3. Avoid Catastrophic Backtracking**

* **Problem**: Nested quantifiers can lead to exponential backtracking, causing performance issues.
* **Solution**:
  * Use atomic groups (`(?>...)`) to prevent backtracking.
  * Simplify regex patterns to reduce complexity.

```java
// Problematic regex
Pattern pattern = Pattern.compile("(a+)+b");

// Optimized with atomic groups
Pattern atomicPattern = Pattern.compile("(?>(a+))+b");
```

### **4. Use Predefined Character Classes**

* **Problem**: Spelling out custom character classes like `[a-zA-Z0-9_]` makes a regex verbose and harder to read.
* **Solution**: Use predefined character classes like `\\w` (word character), `\\d` (digit), or `\\s` (whitespace).
```java
// Custom character class
Pattern custom = Pattern.compile("[a-zA-Z0-9_]");

// Predefined character class
Pattern predefined = Pattern.compile("\\w");
```

### **5. Limit the Region for Matching**

* **Problem**: Searching the entire string when only a portion is relevant can waste time.
* **Solution**: Use `Matcher.region(int start, int end)` to limit matching to a specific substring.

```java
// `pattern` is a previously compiled Pattern
Matcher matcher = pattern.matcher("123-45-6789");
matcher.region(4, 11); // Search only within "45-6789" (the end index is exclusive)
```

### **6. Use Anchors for Efficiency**

* **Problem**: Matching without specifying start (`^`) or end (`$`) anchors can lead to unnecessary scanning.
* **Solution**: Use anchors to match at specific positions in the input.

```java
// Match only if the entire input is a number
Pattern pattern = Pattern.compile("^\\d+$");
```

### **7. Optimize Replacement Operations**

* **Problem**: Using complex patterns for replacement can be inefficient.
* **Solution**:
  * Use `Matcher.appendReplacement()` and `Matcher.appendTail()` for fine-grained control.
  * Precompile the `Pattern` for repeated replacements.

### **8. Profile and Benchmark Regex**

* Use tools like JMH (Java Microbenchmark Harness) to benchmark regex operations.
* Analyze the runtime behavior of regex patterns and optimize accordingly.

### **9. Avoid Using Regex When Simpler Solutions Exist**

* Regex is powerful but can be overkill for simple operations. For example:
  * Use `String.contains()` for simple substring checks.
  * Use `String.indexOf()` and `String.substring()` for splitting on a fixed delimiter. Note that `String.split()` itself takes a regex, so meta-characters in the delimiter (such as `.` or `|`) must be escaped.