Announcing Motoko package stable-trie

We published a new package https://mops.one/stable-trie

It is a key-value map that lives entirely and permanently in stable memory, so it is not limited by the 4 GB heap and requires no serialization during canister upgrades. Upgrades are therefore “free”.

Keys and values are constant-size Blobs, whose lengths are configured in the constructor. The trie internally stores its data in two Motoko Regions.

It is also efficient. Here is a profiling comparison against two heap data structures and one stable memory data structure. The third data column (“stable trie map”) is this package.

| method | rb tree | zhus map | stable trie map | motoko stable btree |
| --- | ---: | ---: | ---: | ---: |
| put | 3_748 | 3_720 | 4_476 | 259_441 |
| inside get | 2_196 | 1_905 | 3_778 | 202_699 |
| outside get | 1_605 | 1_089 | 2_325 | 209_641 |
| inside deletion | 5_034 | 2_152 | 10_548 | 445_004 |
| outside deletion | 4_148 | 1_085 | 2_364 | 406_710 |

Here, “inside” means that the key that is looked up or deleted is present in the tree.
“Outside” means that the key is not present (it’s a “miss”).

And here is an analysis of the memory utilization, which can be obtained by running `mops test large`:

```
children number: 4
pointer size: 4
keys: 16_777_216
byte size: 327_693_280
bytes per key: 19
leaves: 16_777_216
nodes: 12_092_222
nodes per leaf: 0.720753
pointers per leaf: 2.883010
children per node: 2.387439
children utilization: 0.596860
```

This is a trie with 4 children per node and 16.7 million keys in it. The keys are 8 bytes long and uniformly distributed. The value size is 0, i.e. this trie is not actually a map but merely functions as a set. The trie has 12 million inner nodes and 16.7 million leaves, and it occupies a total of 327 MB. This amounts to 19.5 bytes per key. The 8 bytes of the key are part of those 19.5 bytes, so we can say the trie overhead is 11.5 bytes per key. This overhead does not depend on the key size or on the value size. It turns out that it does not depend on the size of the trie (the number of keys in it) either, as long as the keys are uniformly distributed.

A typical example use is to store keys that are 29 bytes long, representing user principals. Then the storage requirement would be 11.5 + 29 = 40.5 bytes per user, plus the size of the value stored for each user. The value size can be 0 (for a set) or any other constant.

The above trie uses a pointer size of 4 bytes, which allows storing up to 2 billion keys. If more are required, then a pointer size of 5 bytes can be used. In this case the overhead goes up from 11.5 bytes to 14.4 bytes per key.
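For illustration, here is a sketch of how a trie for that use case could be configured. The constructor fields are the ones that appear in the code later in this thread; the values here are only an example, so check the package documentation for your version:

```motoko
import StableTrieMap "mo:stable-trie/Map";

actor {
  // Example only: a set of user principals.
  let users = StableTrieMap.Map({
    pointer_size = 4;   // 4-byte pointers allow up to 2 billion keys
    aridity = 4;        // 4 children per node
    root_aridity = null;
    key_size = 29;      // caller must pad every key to exactly 29 bytes
    value_size = 0;     // no values: the map functions as a set
  });
};
```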

Let us know if you have use for this data structure!


How is stable-trie better than using Motoko Regions directly?

A trie is a key-value map that lets you look up values by keys efficiently, in O(log n) time. A Region is bare memory and offers no lookup by key.


This is great news. We are using base:trie extensively for storing our data and will soon evaluate a migration. Thank you for sharing 🙂

Hey there, I am trying to actually use this data structure and was testing it.
I created some sample code where I use this map to store key-value pairs, where the key is a principal and the value is a record type, and the key_size assertion always seems to fail.
Can you help me with that?

```motoko
import Text "mo:base/Text";
import Principal "mo:base/Principal";
import Error "mo:base/Error";
import Blob "mo:base/Blob";
import Debug "mo:base/Debug";
import StableTrieMap "mo:stable-trie/Map";

actor {
  public type data = {
    name : Text;
    email : Text;
  };

  public type key = Principal;

  let data_map = StableTrieMap.Map({
    pointer_size = 4;
    aridity = 4;
    root_aridity = null;
    key_size = 1; // key size is 1 since anonymous principal size is 1
    value_size = 256;
  });

  public shared ({ caller }) func add_data(data : data) : async Text {
    let key_blob = Principal.toBlob(caller);
    let data_blob = to_candid (data);
    Debug.print(debug_show (data_blob.size()));
    Debug.print(debug_show (key_blob.size()));
    data_map.put(key_blob, data_blob);
    return "Data added";
  };

  public query ({ caller }) func get_data() : async data {
    let key_blob = Principal.toBlob(caller);
    let data_blob = data_map.get(key_blob);

    switch (data_blob) {
      case null {
        throw Error.reject("No data found for the corresponding principal");
      };

      case (?val) {
        let Data : ?data = from_candid (val);
        switch (Data) {
          case null {
            throw Error.reject("Failed to Deserialize the Data from the blob");
          };
          case (?v) {
            return v;
          };
        };
      };
    };
  };
};
```

But the output is always an error when I call the `add_data` function:

```
Call was rejected:
Request ID: 37f098e30f72260dd4aaa09e6636353f266999c8b8f962511f32f85c8d43f0ba
Reject code: 5
Reject text: Error from Canister bkyz2-fmaaa-aaaaa-qaaaq-cai: Canister called `ic0.trap` with message: assertion failed at base.mo:304.7-304.41.
Consider gracefully handling failures from this canister or altering the canister to handle exceptions. See documentation: http://internetcomputer.org/docs/current/references/execution-errors#trapped-explicitly
```

The call always fails at this assertion:

```motoko
assert key.size() == args.key_size;
```

What am I doing wrong here?

With stable-trie you are responsible for supplying keys and values of exactly the size that was configured when creating the map. The key size (1) matches if you only ever call this as the anonymous principal. But the value size (256) will in general not match.

If values are of variable size, as in your case, then you are responsible for padding. So you would have to take the Blob obtained from to_candid and pad it either on the left or on the right with zeros to make it 256 bytes long. You may also have to add a length byte so that you are able to parse it again.

The data structure does not do the padding for you because it does not want to make any assumptions about how people use it. Not everyone has variable-length values, and there are many different ways to implement variable-length values; padding is just one of them.

I suspect it actually fails at a different assertion, because I think it is the value that causes the trap, not the key.

Here is some code that embeds a Principal into 32 byte Blob with padding and length byte: icrc-84/src/lib.mo at main · research-ag/icrc-84 · GitHub

The functions in that file are called toSubaccount and fromSubaccount, but Subaccount is just an alias for Blob. So just replace the word Subaccount with Blob everywhere and you have what you need for the keys. Then you are no longer limited to the caller being anonymous. If you care about saving 2 bytes, you can modify the code to make the Blob 30 bytes instead of 32. Principals are at most 29 bytes, so with the length byte the most needed for a key is 30 bytes.

Then you can use the same scheme to embed a variable-length Blob into a fixed-size Blob: in your case, the output of to_candid into a 256-byte Blob.

Functions look like this:

```motoko
import Blob "mo:base/Blob";
import Principal "mo:base/Principal";
import Array "mo:base/Array";
import Nat8 "mo:base/Nat8";

/// Convert Principal to 32 byte Blob
func encode32(p : Principal) : Blob {
  let bytes = Blob.toArray(Principal.toBlob(p));
  let size = bytes.size();

  assert size <= 29;

  Array.tabulate<Nat8>(
    32,
    func(i : Nat) : Nat8 {
      if (i + size < 31) {
        0;
      } else if (i + size == 31) {
        Nat8.fromNat(size);
      } else {
        bytes[i + size - 32];
      };
    },
  )
  |> Blob.fromArray(_);
};

/// Converts 32 byte Blob back to Principal.
///
/// The conversion can fail when the format is not as expected.
/// In this case `null` is returned.
func decode32(blob : Blob) : ?Principal {
  let bytes = Blob.toArray(blob);
  assert bytes.size() == 32;

  let (start, size) = do {
    var i = 0;
    label L while (i < 32) {
      if (bytes[i] != 0) break L;
      i += 1;
    };
    if (i == 32) return null;
    (i + 1, Nat8.toNat(bytes[i]));
  };

  if (start + size != 32) return null;
  Array.tabulate(size, func(i : Nat) : Nat8 = bytes[start + i])
  |> Blob.fromArray(_)
  |> ?Principal.fromBlob(_);
};
```
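The same scheme generalizes from 32 bytes to any fixed size n. Here is a sketch (encodeN is a hypothetical helper that mirrors encode32 above, not part of any package): with n = 256 it embeds a variable-length Blob, such as the output of to_candid, into a 256-byte Blob.

```motoko
import Blob "mo:base/Blob";
import Array "mo:base/Array";
import Nat8 "mo:base/Nat8";

/// Sketch: pad a variable-length Blob into a fixed n-byte Blob,
/// zero-filled on the left, with a length byte directly before the
/// payload. Requires b.size() + 1 <= n and b.size() <= 255 so the
/// length fits into a single byte.
func encodeN(b : Blob, n : Nat) : Blob {
  let bytes = Blob.toArray(b);
  let size = bytes.size();

  assert size + 1 <= n and size <= 255;

  Array.tabulate<Nat8>(
    n,
    func(i : Nat) : Nat8 {
      if (i + size + 1 < n) {
        0; // left zero padding
      } else if (i + size + 1 == n) {
        Nat8.fromNat(size); // length byte
      } else {
        bytes[i + size - n]; // payload starts at index n - size
      };
    },
  )
  |> Blob.fromArray(_);
};
```

Decoding generalizes the same way as decode32: skip the leading zeros, read the length byte, and check that the remaining bytes match it.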

Should I specify my use case to you? I think then you will be able to help me!

Sure, feel free to send your use case.

Thanks for helping out.
Let’s say I built a basic e-commerce platform where I have products, categories, users’ personal data, order data, and so on, and I’ve stored them in their own dedicated TrieMaps. The issue is that this data is not persistent, and as the platform owner I want it to persist across canister upgrades. How can I do that?
Here is a sample of the data types that I must use:

```motoko
public type User = {
  id : Principal;
  firstName : Text;
  lastName : Text;
  email : Text;
};

public type UserProduct = {
  title : Text;
  //price : Float;
  //sellingPrice : Float;
  description : Text;
  category : SlugId;
  active : Bool;
  newArrival : Bool;
  trending : Bool;
};

// Backend data for products
public type Product = UserProduct and {
  //img : Text; // Upload 3 images for each product
  id : ProductId;
  slug : SlugId;
  variantSize : [VariantSize];
  variantColor : [VariantColor];
  time_created : Time.Time;
  time_updated : Time.Time;
};

public type NewOrder = {
  shippingAddress : Address;
  products : [OrderProduct];
  paymentAddress : Text;
  userid : Principal;
  totalAmount : Float;
  subTotalAmount : Float;
  shippingAmount : ShippingAmount;
  paymentMethod : Text;
  orderStatus : Text;
  paymentStatus : Text;
  awb : Text;
};

private var Users = TrieMap.TrieMap<Principal, Index>(Principal.equal, Principal.hash);
private var products = TrieMap.TrieMap<Types.SlugId, Index>(Text.equal, Text.hash);
private var orders = TrieMap.TrieMap<Types.OrderId, Index>(Text.equal, Text.hash);
```

Here the values are data of these types.

I tried creating an implementation based on Regions and using it to store the data, but it is not efficient. Sometimes I get weird “magic bytes” errors when reading the data back, and sometimes retrieval takes a lot of time.
You can check the basic implementation here:

simple regions implementation

Can you help me choose a data structure that takes the minimum amount of time but stores the data in stable memory so that it persists across upgrades?

Have you tried Map (Mops • Motoko Package Manager) for that use case?

I think I have tried it, but I don’t know whether it stores data in stable memory.
What do you think of this one? My concern is that I don’t know whether we need to upgrade the canister in order to store the data in stable memory:
https://mops.one/stableheapbtreemap

How much data would you like to store? I think the map uses heap memory, and with that you can store up to about 3 GB in one canister with the current 32-bit integration.

The map keeps the data safe across a canister upgrade as long as you don’t change the structure. If you change the structure of your data, then you lose your data too. In that case you have to use a migration strategy to keep the data safe.

It’s definitely more than that, 10 to 20 GB I think.

What do you mean by “time” here? Do you mean the number of wasm instructions that the code uses, the real-world time that the code takes, or development time?

You basically have 3 options:

  1. Use wasm64 and Motoko’s new enhanced orthogonal persistence, and just use heap data structures.
  2. Write directly to Regions with or without the help of some packages.
  3. Use heap data structures and copy to stable memory during preupgrade/postupgrade.

Option 2 is what you have done in your code that you linked. Some problems you may encounter are:

  • Using to_candid and from_candid may be expensive (in number of instructions).
  • The approach may be a lot of work and error-prone.
  • In your application the Blobs produced by to_candid will have a large variance in size. Allocating a fixed (constant) amount of space for each of them will result in a lot of waste. But if you allow variable sizes, then you are left doing all the memory management inside the Regions yourself, which is a lot of work. There are packages on Mops • Motoko Package Manager to help with that, though.

Option 3 is ruled out by the size requirement that you mentioned of 10-20 GB. Option 3 only works up to 3 GB.

So I would use Option 1 in your case. Motoko EOP (enhanced orthogonal persistence) was introduced recently and allows you to not worry about stable memory at all. You just use heap data structures, and with wasm64 you don’t have the 4 GB limit. See Enhanced orthogonal persistence | Internet Computer
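For completeness, a minimal sketch of Option 1, assuming a moc version with EOP enabled and the mops Map package suggested earlier (API details may differ between versions). With a `persistent` actor, declarations are stable by default, so the heap map survives upgrades without any serialization:

```motoko
import Principal "mo:base/Principal";
import Map "mo:map/Map"; // heap-based map package (mops.one/map)

persistent actor {
  // Lives on the heap; with EOP it survives upgrades as-is.
  var users = Map.new<Principal, Text>();

  public shared ({ caller }) func setName(name : Text) : async () {
    Map.set(users, Map.phash, caller, name);
  };

  public query ({ caller }) func getName() : async ?Text {
    Map.get(users, Map.phash, caller);
  };
};
```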