Substring Matching

Posted by Beetle B. on Thu 04 December 2014

We have a string of length \(N\), and wish to find an \(M\) character string in it.

Brute Force

Keep moving through the string till the first letter in the pattern is matched. Then keep checking the subsequent characters.

Problem: If your pattern is, say, 20 characters, and the first 19 match, then you have to backtrack 19 characters and start all over. This leads to a worst case complexity of \(MN\).

And backtracking is not always feasible (e.g. a live stream over the Internet). You’d have to keep a buffer of the last \(M\) characters.

Knuth-Morris-Pratt Algorithm

The cost of a backup could be diminished if we had a means to know how far back to go. If your pattern is \(M\) characters, you may get away by going back only a few characters. A concrete example: Say you want to find the string AAAABB. Let’s say the text has AAAAAABB. At the 5th character, we have a mismatch, but we need not go back and start from the 2nd A. We can just note that we still have 4 contiguous A‘s and continue.

The general way to handle this is to have a finite state machine, where decisions are made based on what the next character is. This way, we never backtrack on the main string!

The complexity is \(O(M+N)\) (\(M\) is needed to initially construct a desired finite state machine). We need extra space of \(RM\).

Boyer-Moore Algorithm

The basic gist is that if you don’t get a match after, say, 5 characters, you may be able to continue your search from character 6 (under ideal conditions). Even if you backtrack, you may not need to backtrack as much. There’s a scheme to calculate how much to jump.

On average, this takes \(O(N/M)\) time (which is amazing). But the worst case is still as bad as the brute force. If we make a hybrid of this with the KMP algorithm, we could get it down to about \(3N\).

Rabin-Karp Algorithm

The Rabin-Karp algorithm utilizes hashing.

Say your pattern is 26535. Pick a high prime (e.g. 997). Take the mod of it to get 613.

Now as you iterate through your text, after you encounter 5 digits, calculate the modulus. If it doesn’t match, update with the next entry (easy to do - you don’t need to go through all 5 characters again).

If you get the same value, you may have a match. You could explicitly look at the characters, or you could probabilistically declare a match. The error likelihood will be roughly \(1/Q\) where \(Q\) is your large prime.

For general text, you calculate the hash using \(a_{n}R^{n}+a_{n-1}R^{n-1}+\cdots+a_{1}R+a_{0} % Q\) where \(R\) is the size of your alphabet. You can calculate the polynomial in linear time using Horner’s Rule. Updating when you get the next character is also incrementally linear.

Pick a large \(Q\), but not so large that you risk overflow.

The algorithm is linear, but perhaps not as fast as some other methods.