Text Compression in Motoko

Does anyone know if there is a library to compress (zip) text? I want to reduce the storage footprint of long text. Thank you.

You can encode a value of type Text to a Blob with Text.encodeUtf8(). That will already save a lot of space, because Text probably uses 4 bytes per character.

I am not aware of anyone having written compression in Motoko yet.
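For what it's worth, here is a minimal sketch of the round trip, assuming only the base library's Text module (the variable names are just for illustration):

import Text "mo:base/Text";

// Text -> Blob: the UTF-8 bytes of the text
let bytes : Blob = Text.encodeUtf8("a fairly long piece of text");

// Blob -> ?Text: returns null if the bytes are not valid UTF-8
let restored : ?Text = Text.decodeUtf8(bytes);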

Interesting. I expected that Text would use 2 bytes per Unicode character, same as a Blob. Where would the other 2 bytes come from?

Text values are internally represented as (ropes of) UTF-8, so mostly 1 byte per character. If the rope consists of a single piece, then the conversion to Blob does nothing at runtime, otherwise it copies and concatenates the pieces.

FWIW, contemporary Unicode has a 21-bit value space. Hence, representations that use 2 bytes combine the disadvantages of a 1-byte representation (no random access) with those of a 4-byte representation (wasted space). In most cases they are only used for legacy reasons, as in old languages and APIs.
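If you want to check the byte counts yourself, here is a small sketch (assuming the base Text and Debug modules; Text.size counts characters, while size() on the encoded Blob counts bytes):

import Text "mo:base/Text";
import Debug "mo:base/Debug";

let ascii = "hello";     // 5 characters, all ASCII
let accented = "héllo";  // still 5 characters, but "é" needs 2 bytes in UTF-8

Debug.print(debug_show (Text.size(ascii), Text.encodeUtf8(ascii).size()));       // prints (5, 5)
Debug.print(debug_show (Text.size(accented), Text.encodeUtf8(accented).size())); // prints (5, 6)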

Thank you Mr Rossberg.

Just to clarify, my understanding is that Text and Blob values use the same memory footprint: 1 byte for ASCII characters and 2 bytes for non-ASCII UTF-8 characters. (Assuming the blob is holding UTF-8 encoded data.)

Is that correct?

UTF-8 uses up to 4 bytes for a character, depending on the code point. Though in practice, Latin alphabets don’t require more than 2.

A text value may be represented as a rope data structure if it was produced via uses of the # operator. But the concatenation of the individual parts will be identical to the blob. Hence, converting to a blob does not usually save space. It’s more likely to waste some, because it produces a copy and prevents sharing between identical parts of multiple text values.

For example, if you do this:

import Iter "mo:base/Iter";
import Nat "mo:base/Nat";

let t = "a fairly long piece of text";
for (i in Iter.range(1, 10)) {
  // f stands for any function that consumes a Text value
  f(t # Nat.toText(i));
}

This will share the same part of t for all the text values passed to f. The sharing gets lost if you convert to blobs, and you end up with 10 copies.
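Concretely, the blob version of the same loop (a sketch, assuming the same imports as above plus mo:base/Text; g is a stand-in for whatever consumes the Blob) copies t’s bytes on every iteration:

for (i in Iter.range(1, 10)) {
  // encodeUtf8 flattens the rope and copies all of t's bytes each time
  g(Text.encodeUtf8(t # Nat.toText(i)));
}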

Well, that’s a nice optimization! Thank you for the deep insight!

So for data sent to another canister (serialized, then deserialized into new memory addresses on another machine), I will treat Text and Blob as the same size, as there is no sharing between identical parts in that scenario.