String Algorithms

Rabin-Karp Substring Search Algorithm

Given two strings:

  • Text (T) of length n

  • Pattern (P) of length m

The goal is to find all occurrences of the pattern P in the text T.

  • Convert the pattern and substrings of the text to numbers using a hash function.

  • Instead of comparing strings, compare their hashes.

  • Only do full character comparison if the hashes match (to confirm the match, in case of hash collisions).

This gives us average-case linear time, and it's especially useful when:

  • We are searching for multiple patterns.

  • We want a fast filtering technique.

How the Hash Works

The pattern and substrings are treated as numbers in a base (commonly base 256, for ASCII).

Example (simplified):

Assume base = 10.

  • String "abc" → hash = a × 10² + b × 10¹ + c × 10⁰

  • If a = 1, b = 2, c = 3 → hash = 1×100 + 2×10 + 3 = 123

We use a rolling hash to compute the next substring's hash in constant time, using the previous hash:

A prime number is used to mod the hash to prevent overflow and reduce collisions.

Step-by-Step Algorithm

  1. Choose:

    • A base (like 256)

    • A large prime number (say, 101)

  2. Compute the hash of the pattern.

  3. Compute the hash of the first window (first m characters) of the text.

  4. Slide the window:

    • At each step, compare the hash of the window with the pattern hash.

    • If they match, compare the actual substrings (to avoid false matches due to collisions).

    • Use rolling hash to compute next window’s hash efficiently.

Implementation

Find all occurrences of a pattern string in a text string.

Text: "abedabcabcabcde" Pattern: "abc"

1. Brute Force Approach

  • Slide the pattern over the text one character at a time.

  • At each position, compare the entire pattern with the substring from the text.

  • If all characters match, we found an occurrence.

Output

Rabin-Karp Approach

  • Convert the pattern and substrings of the text into hash values.

  • Slide a window and compare hash values.

  • If hashes match, do a character-level comparison (to handle collisions).

  • Use rolling hash to compute the next window hash in O(1) time.

Output

Time and Space Complexity

Case
Time Complexity
Why

Best/Average

O(n + m)

Because of rolling hash and few hash matches

Worst

O(n * m)

Too many hash collisions lead to full comparisons

Space

O(1) or O(m)

For hash calculation

Last updated