A suffix array is an array formed by taking a string, and storing pointers to all the suffixes in the array. As an example, if the string is:
“A quick brown fox jumped over the lazy dog”, we form this temporary array:
Index | Content |
---|---|
0 | “A quick brown fox jumped over the lazy dog” |
1 | ” quick brown fox jumped over the lazy dog” |
2 | “quick brown fox jumped over the lazy dog” |
3 | “uick brown fox jumped over the lazy dog” |
4 | “ick brown fox jumped over the lazy dog” |
5 | “ck brown fox jumped over the lazy dog” |
6 | “k brown fox jumped over the lazy dog” |
7 | ” brown fox jumped over the lazy dog” |
8 | “brown fox jumped over the lazy dog” |
9 | “rown fox jumped over the lazy dog” |
10 | “own fox jumped over the lazy dog” |
11 | “wn fox jumped over the lazy dog” |
12 | “n fox jumped over the lazy dog” |
13 | ” fox jumped over the lazy dog” |
14 | “fox jumped over the lazy dog” |
15 | “ox jumped over the lazy dog” |
16 | “x jumped over the lazy dog” |
17 | ” jumped over the lazy dog” |
18 | “jumped over the lazy dog” |
19 | “umped over the lazy dog” |
20 | “mped over the lazy dog” |
21 | “ped over the lazy dog” |
22 | “ed over the lazy dog” |
23 | “d over the lazy dog” |
24 | ” over the lazy dog” |
25 | “over the lazy dog” |
26 | “ver the lazy dog” |
27 | “er the lazy dog” |
28 | “r the lazy dog” |
29 | ” the lazy dog” |
30 | “the lazy dog” |
31 | “he lazy dog” |
32 | “e lazy dog” |
33 | ” lazy dog” |
34 | “lazy dog” |
35 | “azy dog” |
36 | “zy dog” |
37 | “y dog” |
38 | ” dog” |
39 | “dog” |
40 | “og” |
41 | “g” |
Then we sort this array by using a string sort. The resulting array is called the suffix array:
Index | Content |
---|---|
7 | ” brown fox jumped over the lazy dog” |
38 | ” dog” |
13 | ” fox jumped over the lazy dog” |
17 | ” jumped over the lazy dog” |
33 | ” lazy dog” |
24 | ” over the lazy dog” |
1 | ” quick brown fox jumped over the lazy dog” |
29 | ” the lazy dog” |
0 | “A quick brown fox jumped over the lazy dog” |
35 | “azy dog” |
8 | “brown fox jumped over the lazy dog” |
5 | “ck brown fox jumped over the lazy dog” |
23 | “d over the lazy dog” |
39 | “dog” |
32 | “e lazy dog” |
22 | “ed over the lazy dog” |
27 | “er the lazy dog” |
14 | “fox jumped over the lazy dog” |
41 | “g” |
31 | “he lazy dog” |
4 | “ick brown fox jumped over the lazy dog” |
18 | “jumped over the lazy dog” |
6 | “k brown fox jumped over the lazy dog” |
34 | “lazy dog” |
20 | “mped over the lazy dog” |
12 | “n fox jumped over the lazy dog” |
40 | “og” |
25 | “over the lazy dog” |
10 | “own fox jumped over the lazy dog” |
15 | “ox jumped over the lazy dog” |
21 | “ped over the lazy dog” |
2 | “quick brown fox jumped over the lazy dog” |
28 | “r the lazy dog” |
9 | “rown fox jumped over the lazy dog” |
30 | “the lazy dog” |
3 | “uick brown fox jumped over the lazy dog” |
19 | “umped over the lazy dog” |
26 | “ver the lazy dog” |
11 | “wn fox jumped over the lazy dog” |
16 | “x jumped over the lazy dog” |
37 | “y dog” |
36 | “zy dog” |
Applications
Searching For All Instances of a Substring
If you wanted to search some text for all instances of a string, do a binary search on the string in the suffix array, and look at all adjacent entries to get all matches.
Longest Repeated Substring
Items with the first \(n\) characters equal in the whole string are adjacent to one another in the suffix array. This is useful in finding the longest repeated substring in some text (brute force would be \(O(N^{2})\)).
This works well if your longest repeated substring is not huge. The reason is that if “abcdefghij” is your longest repeated substring, then your suffix array will also have “bcdefghij”, “cdefghij”, and so on. If your longest repeated substring is \(D\) characters long, then to find it you’ll need at least \(1+2+\cdots+D\) comparisons - this is quadratic in \(D\).
This will be a problem even when forming the suffix array.
To get around this problem, use the Manber-Myers algorithm which is linearithmic.