New Motoko sort package

We are pleased to announce a new package, sort, available on mops. It bundles various sorting algorithms in one package and is optimised for performance. The initial version contains bucket sort, radix sort, and merge sort for keys of type Nat32.

All algorithms work on VarArray input and are in-place sorts.

The implementations outperform all previously available sort algorithms in Motoko in terms of instructions and memory (see benchmarks further below).

As counting-based algorithms, bucket sort and radix sort require keys with a fixed-width bit representation (not just a compare function).
The package currently requires keys to be of type Nat32. Versions for other key types such as Nat64, Int32/64 or Blob are planned for the future based on demand. Please request what you need.

Merge sort is comparison based, but at this point it is also only implemented for key type Nat32 in this package. For merge sort with arbitrary keys, requiring only a compare function, see the motoko-core PR "Optimize sort" by AStepanov25 (caffeinelabs/motoko-core#453) instead. That PR replaces VarArray.sort in motoko-core with the merge sort of this package, rewritten for a generic compare function. It is about twice as fast and uses only half the memory of core's old VarArray.sort.
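To make the compare-function idea concrete, here is a hedged sketch, in Python for illustration only (not the Motoko implementation), of a stable merge sort driven by nothing but a compare function, the same shape of API the PR gives VarArray.sort:

```python
def merge_sort(a, cmp):
    # Top-down, stable merge sort driven by an explicit compare function
    # (cmp(x, y) < 0 means x orders before y).
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left = merge_sort(a[:mid], cmp)
    right = merge_sort(a[mid:], cmp)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if cmp(right[j], left[i]) < 0:  # right strictly smaller: it goes first
            out.append(right[j]); j += 1
        else:                           # ties keep the left element: stable
            out.append(left[i]); i += 1
    return out + left[i:] + right[j:]
```

Stability follows from taking the left element on ties, which matters when sorting records by a single field.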

Which algorithm should you use? Radix sort has no bad worst case (its worst case equals its average case), but bucket sort is faster in the average case. Merge sort is slower but uses the least memory. Both bucket sort and radix sort internally fall back on merge sort for small inputs, so there is no need to do that manually.
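For intuition on why radix sort's cost does not depend on the key distribution: a radix-256 LSD radix sort makes exactly four counting passes over 32-bit keys, whatever the input looks like. A minimal Python sketch, for illustration only (the package's in-place Motoko implementation differs in detail):

```python
def radix_sort_u32(a):
    # LSD radix sort for 32-bit keys: one stable counting pass per byte,
    # so always four passes regardless of the key distribution.
    for shift in (0, 8, 16, 24):
        counts = [0] * 256
        for x in a:
            counts[(x >> shift) & 0xFF] += 1
        # exclusive prefix sums give each digit bucket's start offset
        total = 0
        for d in range(256):
            counts[d], total = total, total + counts[d]
        out = [0] * len(a)
        for x in a:
            d = (x >> shift) & 0xFF
            out[counts[d]] = x
            counts[d] += 1
        a = out
    return a
```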

Some recommendations:

  • Use radix sort when the distribution of keys in the input is unknown or known to be highly non-uniform. Its speed is independent of the distribution.
  • Use radix sort if tight control of the worst case performance is required. For example, if the key distribution in the input can be controlled by an adversary.
  • Use bucket sort if the distribution is known to be close to uniform.
  • Use bucket sort for best performance in the average case.

Optional: for bucket sort and radix sort, if the full key space isn’t used, you can speed up the sort by passing in an upper bound for the keys. In some cases it may be worth computing the maximum key with one extra pass over the data before calling sort.
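The reason a bound helps: with a byte-at-a-time radix, only the byte positions that can be nonzero under the bound need a counting pass. A small Python illustration of that arithmetic (the actual name and form of the package's bound parameter are not shown here and may differ):

```python
def radix_passes_needed(max_key):
    # Number of radix-256 counting passes needed when all keys are <= max_key:
    # only ceil(bit_length / 8) byte positions can ever be nonzero.
    return max(1, (max_key.bit_length() + 7) // 8)
```

So if an extra pass over the data shows all keys fit in 16 bits, a radix-256 sort needs 2 passes instead of 4.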

The bucket sort implementation was inspired by motoko-sort (also see this forum thread). We improved instruction count a little and memory use by a lot. However, motoko-sort supports many key types where we currently only support Nat32.

For further development of this package we would like to know how sorting is being used in canisters. Please let us know below. For example:

  • Are you sorting call input (arguments) or data that is already in state before the call?
  • Are you sorting output (preparing return values) or data kept in state in sorted order?
  • What sizes of input are you sorting?
  • What are your key types?
  • What is your desired input/output type (VarArray, Array, List, etc.)?

The algorithms are described further in the README.

Benchmarks:
Instructions:

| n | 100 | 1000 | 10000 | 12000 | 100000 | 1000000 |
| --- | --- | --- | --- | --- | --- | --- |
| bucketSort | 43_946 | 0.42 M | 4.17 M | 4.96 M | 41.3 M | 412 M |
| bucketSortWorstCase | 0.21 M | 1.15 M | 9.41 M | 11.1 M | 91.9 M | 553 M |
| radixSort | 66_439 | 1.04 M | 8.82 M | 10.5 M | 63.9 M | 527 M |
| Zhus | 63_420 | 0.62 M | 6.3 M | 7.53 M | 63 M | 628 M |
| mergeSort | 66_620 | 1.04 M | 15.4 M | 18.5 M | 193 M | 2_319 M |
| VarArray | 0.21 M | 2.68 M | 35.8 M | 43.1 M | 442 M | 5_047 M |

Memory:

| n | 100 | 1000 | 10000 | 12000 | 100000 | 1000000 |
| --- | --- | --- | --- | --- | --- | --- |
| bucketSort | 872 B | 5.24 KiB | 55.4 KiB | 63.21 KiB | 518.96 KiB | 4.82 MiB |
| bucketSortWorstCase | 1.54 KiB | 8.27 KiB | 72.41 KiB | 80.23 KiB | 646.99 KiB | 4.88 MiB |
| radixSort | 536 B | 2.28 KiB | 47.44 KiB | 55.25 KiB | 647 KiB | 4.07 MiB |
| Zhus | 1.52 KiB | 12.06 KiB | 117.53 KiB | 140.97 KiB | 1.14 MiB | 11.44 MiB |
| mergeSort | 536 B | 2.28 KiB | 19.86 KiB | 23.77 KiB | 195.64 KiB | 1.91 MiB |
| VarArray | 736 B | 4.23 KiB | 39.39 KiB | 47.2 KiB | 390.95 KiB | 3.82 MiB |

Do you have a comparison of how much performance gain bucketSort has from switching to Insertion sort for (n < 8) buckets and Merge sort for (8 < n < 16) buckets?

For my algorithm I’ve found that each next bucket size has significantly less performance gain from switching to a more appropriate algorithm. I had a ~11% improvement from using a sorting network for 4-item buckets and only a ~3% improvement from a sorting network for 5-item buckets, which was already too low a gain to justify the additional code. Also, when I tried using a generic insertion sort for 5-item buckets it was actually slower than the main algorithm (maybe I didn’t optimize it well, but still).
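For readers following along, the small-bucket fallback under discussion is plain insertion sort; here is a generic, non-unrolled Python sketch of sorting one bucket in place (the unrolled Motoko versions are what the numbers in this thread refer to):

```python
def insertion_sort(a, lo, hi):
    # Sort the slice a[lo:hi] in place; cheap for the tiny buckets
    # (n below roughly 8-16) discussed above.
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]  # shift larger elements one slot right
            j -= 1
        a[j + 1] = x
```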

Did you try switching at different bucket size breakpoints?

Also, I think I can improve memory usage by 1/3; my algorithm was never updated for the 64-bit address space. But that only applies to arrays shorter than 2^32 elements. I don’t know how practical it is to support sorting arrays longer than that.

How much gain relative to what, exactly? Relative to not doing anything special for small buckets at all? Then I would think the difference to radix sort gives a ballpark number, because radix sort does essentially the same work as bucket sort but without the small-bucket optimization.

Or do you mean relative to doing it only for n <= 4? Then also I don’t know exactly. But I do remember that extending it from 8 to n <= 10, and respectively n <= 20 for merge sort, didn’t give that much. Maybe only 1%. We stopped optimising things when they gave less than 1% gain.

To be absolutely sure what the gain is I think you can easily turn it off or change the n <= 8 if-clause to another number and re-run mops bench.

We also had the experience that using generic code here didn’t give any gain.

What do you mean? Do you mean whether we tried different values instead of 8 in the n <= 8 condition? Yes, we saw an incremental gain with each higher value; it just dropped to around 1% or lower when going from 8 to 9.

What do you mean by address space? Do you mean wasm32/wasm64? Or how long the array can be? I forgot to say in the announcement that the input array size is limited to 2^32 - 1. I guess it is not practical to sort anything larger than 100M with these types of algorithms (within the DTS instruction limit). For anything larger, an incremental sort would be required.


The thing is, the distribution of bucket sizes (i.e. the counts of buckets of a given size in a large array) is different from yours, and the expected number of elements per bucket is different. We have larger buckets because of the way we choose the radix: we switched from a larger radix to one half the size, and thus bigger buckets, because it showed performance gains, since an unrolled insertion sort is faster than any other algorithm for n approximately equal to 16.
