Least Significant Digit
Suppose we have \(N\) strings of length \(W\). The idea behind the LSD sort is to first sort the last “digit” (or letter) using counting sort, and then the next last, and so on. Stability of counting sort guarantees that sorting the i-th column will preserve the sorting of the characters after it (i.e. will not mess up the sorting performed in the prior pass).
Properties
- As described, we need all the entities to be of the same length.
- The sort is stable.
- The complexity is \(O(2WN)\) (not sure why).
Implementations
Python
from countingsort import counting_sort
def lsd_sort(lst, mapping):
"""
Radix sort using the least significant digit. The assumption is that each
element has the same length.
Keyword Arguments:
lst -- The list of elements to sort.
mapping -- A dictionary that maps the elements in the alphabet to numbers.
"""
length = len(lst[0])
for i in xrange(1, length+1):
lst = counting_sort(lst, mapping, lambda x: x[-i])
return lst
Most Significant Digit
Instead of sorting by least significant first, we could sort by the most significant digit. Then when we go to the next pass, we basically recurse on the sublists. As an example, we repeat the procedure to all elements that begin with A, and again to all the elements that begin with B.
Imagine sorting all the homework assignments students have handed in. You first make a pile of all names beginning with A, another pile of all names beginning with B, and so on. Then you take the A pile, and sort that by the 2nd letter, and so on.
Properties
- This can handle strings of variable length. Just pretend there is an “infinitely” low letter after you end the string.
- While recursing, it’s quite easy for your sublist to be much smaller than \(R\) (the size of your alphabet). This is particularly relevant in string processing: Extended ASCII has about 255 elements, and Unicode has about 65535 elements. Doing a counting sort under these conditions will be a waste. One mitigating strategy is to switch to insertion sort after the \(d\)-th digit.
- This algorithm, compared to the usual algorithms (quicksort, etc), has the benefit of not having to compare 2 full strings - it simply needs to scan the string till it comes across the first difference.
- Complexity is \(O(2WN)\) for the worst case, but \(O(N\log_{R}N)\) for the randomized case (no derivation).
- It is stable.
Compared to quicksort:
- This algorithm can suffer from cache mishits.
- The inner loop is “heavy” and slow.
- Requires more extra storage.
3-Way String Quicksort
This method is very similar to the regular quicksort. At the \(d\)-th character, We pick a partition element (which is a character, not a string), and partition into three sublists instead of 2. The first list will have all the elements “smaller” than the partition element. The third will have all the elements “larger” than the partition element. The 2nd list will have all the elements corresponding to the partitioned element.
For example, in the first pass, if we pick r as the partition element, the 2nd list will have all elements beginning with r.
We then recurse on the 1st and 3rd lists, but for the 2nd sublist, we recurse on the \(d+1\)-th element (as all of them share the same \(d\)-th element).
In a sense, this is like the usual MSD sort except we’re doing a 3 way sort instead of an \(R\)-way sort.
Comparison With Quick Sort
- In quick sort, we have to compare the whole string roughly \(2N\ln N\) times. In this 3 way string quick sort, we have roughly the same number of character compares.
- Like quick sort, it is cache friendly.
- It is in place.
- It is not stable.
Implementations
Python
from collections import OrderedDict
from copy import copy
def identity(x):
return x
def counting_sort(lst, mapping, function=identity):
"""
Counting sort. The assumption is that the entries in the list come from a
finite alphabet. Performance is good if the alphabet is much smaller than
the size of the list.
Keyword Arguments:
lst -- List to sort
mapping -- A dictionary that maps the elements in the alphabet to numbers.
function -- lst may contain complex objects. An optional function can be
passed that will extract the 'key' of the element. The default
is simply the identity function.
"""
counter = [0] * (len(mapping)+1)
for letter in lst:
counter[mapping[function(letter)]+1] += 1
cumul = OrderedDict()
running_sum = 0
for index in xrange(len(mapping)):
cumul[index] = running_sum
running_sum += counter[index+1]
orig_cumul = copy(cumul)
new_lst = [None] * len(lst)
for letter in lst:
new_lst[cumul[mapping[function(letter)]]] = letter
cumul[mapping[function(letter)]] += 1
return new_lst, orig_cumul
def msd_sort(lst, mapping):
"""
Radix sort using the most significant digit. The general algorithm should be
able to handle variable length strings, but mine doesn't.
Keyword Arguments:
lst -- The list of elements to sort.
mapping -- A dictionary that maps the elements in the alphabet to numbers.
"""
msd_helper(lst, mapping, 0, len(lst), 0)
def msd_helper(lst, mapping, lower, upper, d):
"""
Keyword Arguments:
lst -- The list being sorted.
mapping -- A dictionary that maps the elements in the alphabet to numbers.
lower -- The lower index to sort from.
upper -- The upper index (which itself is not sorted)
d -- The column to use as a key
"""
if d == len(lst[0]):
return
if (upper - lower) <= 1:
return
new_lst, cumul = counting_sort(lst[lower:upper], mapping,
lambda x: x[d])
lst[lower:upper] = new_lst
for index, location in list(cumul.iteritems())[:-1]:
msd_helper(lst, mapping, lower+location, lower+cumul[index+1], d+1)
# Don't forget the last index!
msd_helper(lst, mapping, lower+cumul[index+1], upper, d+1)
def qsort_3way(lst):
"""
Use the 3 way quicksort to sort a list of equal length strings.
"""
qs3w_helper(lst, 0, len(lst)-1, 0)
def qs3w_helper(lst, lower, upper, d):
"""
Keyword Arguments:
lst -- list to be sorted
lower -- lower index to sort from
upper -- upper index to sort to
d -- The column to sort.
"""
if upper <= lower:
return
lt, gt = lower, upper
pivot = lst[lower][d]
i = lower + 1
while i <= gt:
curr_char = lst[i][d]
if curr_char < pivot:
tmp = lst[lt]
lst[lt] = lst[i]
lst[i] = tmp
i += 1
lt += 1
elif curr_char > pivot:
tmp = lst[i]
lst[i] = lst[gt]
lst[gt] = tmp
gt -= 1
else:
i += 1
qs3w_helper(lst, lower, lt-1, d)
if d+1 < len(lst[0]):
qs3w_helper(lst, lt, gt, d+1)
qs3w_helper(lst, gt+1, upper, d)
def q3(lst):
"""
A more Pythonic but probably slower way to do it. Lots of temporary lists
created!
"""
return q3_helper(lst, 0)
def q3_helper(lst, d):
if len(lst) < 2:
return lst
pivot = lst[0][d]
left = [x for x in lst if x[d] < pivot]
right = [x for x in lst if x[d] > pivot]
center = [x for x in lst if x[d] == pivot]
left = q3_helper(left, d)
if d + 1 < len(lst[0]):
center = q3_helper(center, d+1)
right = q3_helper(right, d)
return left + center + right