What is a Suffix Array? Let’s start with the concept of “suffixes.” For a string, say “banana,” its suffixes are the substrings starting from each position to the end:
- Position 0: banana (the entire string)
- Position 1: anana
- Position 2: nana
- Position 3: ana
- Position 4: na
- Position 5: a
Each suffix has a starting position (e.g., “banana” starts at position 0). If we sort these suffixes lexicographically (the order we use in dictionaries), the resulting list is called a “suffix array.” In practice, suffix arrays do not directly store the suffixes themselves (which would be space-consuming); instead, they store the starting positions of these suffixes.
Let’s take a simpler example with the string “abrac” (indices 0 to 4):
- Position 0: a b r a c → Suffix: abrac
- Position 1: b r a c → Suffix: brac
- Position 2: r a c → Suffix: rac
- Position 3: a c → Suffix: ac
- Position 4: c → Suffix: c
To sort these suffixes lexicographically, the comparison rule is: first compare the first character. If they differ, the suffix with the smaller first character comes first. If they are the same, compare the second character, and so on. If one suffix is a prefix of another, the shorter suffix is considered smaller.
Comparing the suffixes one by one:
- Two suffixes start with “a”: “abrac” (position 0) and “ac” (position 3). Compare their second characters: “abrac” has “b”, “ac” has “c”. Since “b” < “c”, “abrac” comes before “ac”.
- Next is “brac” (position 1) starting with “b”.
- Then “c” (position 4) starting with “c”.
- Finally, “rac” (position 2) starting with “r”.
Thus, the sorted order is: “abrac” (0), “ac” (3), “c” (4), “brac” (1), “rac” (2). The suffix array for “abrac” is the array of starting positions: [0, 3, 4, 1, 2].
Why are suffix arrays useful? Consider problems like “finding the longest repeated substring” or “checking if a substring exists.” Without a suffix array, comparing all possible substrings directly would be computationally expensive, especially for long strings. By sorting all suffixes, adjacent suffixes often share longer common prefixes, enabling more efficient solutions to these problems.
For example, the Longest Common Prefix (LCP) of adjacent suffixes in a suffix array can help identify the longest repeated substring in the original string. Alternatively, to check if a substring exists, we can perform a binary search on the sorted suffix array to quickly locate the corresponding starting position.
In summary, a suffix array is the sorted list of all suffixes of a string (by lexicographical order), typically storing the starting positions of these sorted suffixes. It acts as a “navigation map” to efficiently solve complex string problems such as substring search, repeated substring detection, and longest common prefix analysis—making it a highly practical tool in string processing.