Text Compression in Motoko

Does anyone know if there is a library to compress (zip) text? I want to reduce the storage footprint of long text. Thank you.

You can encode a Text value to a Blob with Text.encodeUtf8(). That will already save a lot of space because Text probably uses 4 bytes per character.

I am not aware of anyone having written compression in Motoko yet.


Interesting. I expected that Text would use 2 bytes per Unicode character, same as a Blob. Where would the other 2 bytes come from?


Text values are internally represented as (ropes of) UTF-8, so mostly 1 byte per character. If the rope consists of a single piece, then the conversion to Blob does nothing at runtime, otherwise it copies and concatenates the pieces.
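
To see the difference between character count and byte count, here is a minimal sketch, assuming the base library modules mo:base/Text and mo:base/Debug are available. The sample string is just an illustration.

```motoko
import Text "mo:base/Text";
import Debug "mo:base/Debug";

let t : Text = "héllo";            // 5 characters, one of them non-ASCII
let b : Blob = Text.encodeUtf8(t); // the UTF-8 bytes of t

Debug.print(debug_show (t.size())); // number of characters: 5
Debug.print(debug_show (b.size())); // number of bytes: 6 ("é" takes 2 bytes)

// Decoding returns an option because not every Blob is valid UTF-8.
switch (Text.decodeUtf8(b)) {
  case (?back) { assert back == t };
  case null { Debug.print("not valid UTF-8") };
};
```

Note that size() on a Text counts characters while size() on a Blob counts bytes, so the two numbers only agree for pure ASCII text.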

FWIW, contemporary Unicode has a 21-bit value space. Hence, representations that use 2 bytes combine the disadvantages of a 1-byte representation (no random access) with those of a 4-byte representation (waste of space). In most cases they are only used for legacy reasons, like in old languages and APIs.


Thank you Mr Rossberg.

Just to clarify, my understanding is that Text and Blob values use the same memory footprint: 1 byte for ASCII characters and 2 bytes for non-ASCII UTF-8 characters. (Assuming the blob is holding UTF-8 encoded data.)

Is that correct?

UTF-8 uses up to 4 bytes for a character, depending on its code point value. In practice, though, Latin alphabets don’t require more than 2.
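
As a rough illustration: UTF-8 as standardized today (RFC 3629) encodes a code point in 1 to 4 bytes, depending on which range it falls in. The helpers utf8Len and utf8Size below are hypothetical names made up for this sketch, not part of the base library.

```motoko
import Char "mo:base/Char";
import Text "mo:base/Text";

// Hypothetical helper: number of UTF-8 bytes needed for one character,
// decided by the range its code point falls in.
func utf8Len(c : Char) : Nat {
  let code = Char.toNat32(c);
  if (code < 0x80) { 1 }          // ASCII
  else if (code < 0x800) { 2 }    // e.g. most accented Latin letters
  else if (code < 0x10000) { 3 }  // rest of the Basic Multilingual Plane
  else { 4 }                      // supplementary planes, e.g. emoji
};

assert utf8Len('a') == 1;
assert utf8Len('é') == 2;
assert utf8Len('€') == 3;
assert utf8Len('😀') == 4;

// Hypothetical helper: sum the per-character lengths of a whole Text.
func utf8Size(t : Text) : Nat {
  var n = 0;
  for (c in t.chars()) { n += utf8Len(c) };
  n
};

// Cross-check against the size of the actual encoding.
assert utf8Size("a€") == Text.encodeUtf8("a€").size();
```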

A text value may be represented as a rope data structure, if it was produced via uses of the # operator. But the concatenation of the individual parts will be identical to the blob. Hence, converting to a blob does not usually save space. It’s more likely it wastes some because it produces a copy and prevents sharing between identical parts of multiple text values.

For example, if you do this:

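import Iter "mo:base/Iter";
import Nat "mo:base/Nat";

// f stands for any function that takes a Text argument.
let t = "a fairly long piece of text";
for (i in Iter.range(1, 10)) {
  f(t # Nat.toText(i));
};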
This will share the same part of t for all the text values passed to f. The sharing gets lost if you convert to blobs, and you end up with 10 copies.
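
To put a rough number on that, here is a sketch of the blob variant of the same loop, assuming the base library modules shown in the imports. It only adds up the payload bytes of the resulting blobs and says nothing about the actual heap layout or rope overhead, which is up to the runtime.

```motoko
import Iter "mo:base/Iter";
import Nat "mo:base/Nat";
import Text "mo:base/Text";

let t = "a fairly long piece of text"; // 27 ASCII characters, so 27 bytes

var blobBytes = 0;
for (i in Iter.range(1, 10)) {
  let piece = t # Nat.toText(i);     // a Text that can refer back to t
  let flat = Text.encodeUtf8(piece); // a flat Blob with its own copy of the bytes
  blobBytes += flat.size();
};

// Nine blobs of 28 bytes plus one of 29 bytes.
assert blobBytes == 281;
```

So the ten blobs together hold 281 bytes of payload, while the Text values only need the 27 bytes of t once plus 11 bytes for the digits, as long as the sharing is preserved.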


Well, that’s a nice optimization! Thank you for the deep insight!

So for data sent to another canister (serialized, then deserialized into new memory addresses on another machine), I will treat Text and Blob as the same size, as there is no sharing between identical parts in that scenario.