Text Compression in Motoko

Does anyone know if there is a library to compress (zip) text? I want to reduce the storage footprint of long text. Thank you.

You can encode a value of type Text to a Blob with Text.encodeUtf8(). That will already save a lot of space, because Text probably uses 4 bytes per character.

I am not aware of anyone having written compression in Motoko yet.
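For what it's worth, here is a minimal sketch of the round trip, assuming only the base library's Text module (the variable names are just for illustration):

import Text "mo:base/Text";

// Text -> Blob: the UTF-8 bytes of the text
let bytes : Blob = Text.encodeUtf8("a fairly long piece of text");

// Blob -> ?Text: returns null if the bytes are not valid UTF-8
let restored : ?Text = Text.decodeUtf8(bytes);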

Interesting. I expected that Text would use 2 bytes per Unicode character, same as a Blob. Where would the other 2 bytes come from?

Text values are internally represented as (ropes of) UTF-8, so mostly 1 byte per character. If the rope consists of a single piece, then the conversion to Blob does nothing at runtime, otherwise it copies and concatenates the pieces.

FWIW, contemporary Unicode has a 21-bit value space. Hence, representations that use 2 bytes combine the disadvantages of a 1-byte representation (no random access) with those of a 4-byte representation (wasted space). In most cases they are only used for legacy reasons, as in old languages and APIs.
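If you want to check the byte counts yourself, here is a small sketch (assuming the base Text and Debug modules; Text.size counts characters, while size() on the encoded Blob counts bytes):

import Text "mo:base/Text";
import Debug "mo:base/Debug";

let ascii = "hello";     // 5 characters, all ASCII
let accented = "héllo";  // still 5 characters, but "é" needs 2 bytes in UTF-8

Debug.print(debug_show (Text.size(ascii), Text.encodeUtf8(ascii).size()));       // prints (5, 5)
Debug.print(debug_show (Text.size(accented), Text.encodeUtf8(accented).size())); // prints (5, 6)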

Thank you Mr Rossberg.

Just to clarify, my understanding is that Text and Blob values use the same memory footprint: 1 byte for ASCII characters and 2 bytes for non-ASCII UTF-8 characters. (Assuming the blob is holding UTF-8 encoded data.)

Is that correct?

UTF-8 uses up to 4 bytes for a character, depending on the code point. Though in practice, Latin alphabets don’t require more than 2.

A text value may be represented as a rope data structure if it was produced via uses of the # operator. But the concatenation of the individual parts will be identical to the blob. Hence, converting to a blob does not usually save space. It’s more likely to waste some, because it produces a copy and prevents sharing between identical parts of multiple text values.

For example, if you do this:

import Iter "mo:base/Iter";
import Nat "mo:base/Nat";

let t = "a fairly long piece of text";
for (i in Iter.range(1, 10)) {
  // f stands for any function that consumes a Text value
  f(t # Nat.toText(i));
}

This will share the same part of t for all the text values passed to f. The sharing gets lost if you convert to blobs, and you end up with 10 copies.
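Concretely, the blob version of the same loop (a sketch, assuming the same imports as above plus mo:base/Text; g is a stand-in for whatever consumes the Blob) copies t’s bytes on every iteration:

for (i in Iter.range(1, 10)) {
  // encodeUtf8 flattens the rope and copies all of t's bytes each time
  g(Text.encodeUtf8(t # Nat.toText(i)));
}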

Well, that’s a nice optimization! Thank you for the deep insight!

So for data sent to another canister (serialized, then deserialized into new memory addresses on another machine), I will treat Text and Blob as the same size, as there is no sharing between identical parts in that scenario.