Motoko B-Tree, for efficient ordered key-value data

I’ve been talking with @icme about the use case for an imperative data structure for ordered key-value data. As many recognize, the existing RBTree module from base is deficient for this use case: its delete operation does not actually remove the key or reclaim its space. Further, the RBTree is always a binary tree, spending a lot of space on pointers relative to the data being stored.

By moving to a B-Tree with a tunable degree parameter, the pointer-to-data ratio becomes tunable, and potentially, efficiency improves too. Further, the delete operation for a B-Tree is comparatively simpler than that of either a functional or imperative RBTree, and removes the key completely. As a result, the B-Tree (but not the existing, functional RBTree) will be space-efficient for practical workloads with lots of deletion.
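For concreteness, here is one shape the representation might take in Motoko. This is a hypothetical sketch, not a committed design; all names are illustrative:

```motoko
// One node layout for an imperative B-Tree. A single node holds up to
// `order - 1` entries, so pointer overhead is amortized over many entries
// (unlike a binary tree's pointers-per-entry).
type Data<K, V> = {
  entries : [var ?(K, V)]; // sorted entries; `null` marks unused slots
  var count : Nat          // number of slots currently in use
};

type Node<K, V> = {
  #leaf : Data<K, V>;
  #internal : {
    data : Data<K, V>;
    children : [var ?Node<K, V>] // up to `order` children
  }
};

type BTree<K, V> = {
  var root : Node<K, V>;
  order : Nat // the tunable degree: max children per internal node
};
```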

We were both interested in opening the discussion of this data structure to the wider dev community, and @icme suggested posting here.

Questions:

  1. Has this been done in Motoko? We don’t think so, but please correct us if that’s mistaken. There does seem to exist a Rust solution: ic/btreemap.rs at master · dfinity/ic · GitHub

We’d be aiming to do something comparable, but in Motoko, hopefully for inclusion in base.

In any case, it would also be a stable data structure for Motoko, in that stable variables could (in some form) hold its values.

  2. What operations are most essential?

Our list thus far (a rough interface sketch follows the lists below):

  • Insertion
  • Bulk insertion
  • Validation check (is it a valid B-Tree?)
  • Search
  • Pure iteration: iterate through a key range.
  • Map Iteration: Map (and update) values in a key range.
  • Deletion
  3. What performance tests?
  • Speed of each operation (plot for increasing data sizes)
  • Throughput of each operation (i.e., how many operations fit within one consensus round and its cycle limits) (plot for increasing data sizes)
  • Space used (plot for increasing data sizes)
  • Cycle usage (plot for increasing data sizes)
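To make that concrete, here is a hypothetical Motoko interface sketch for the operations listed above; the names, argument orders, and the placeholder BTree type are all illustrative:

```motoko
import Order "mo:base/Order";
import Iter "mo:base/Iter";

type BTree<K, V> = { var root : ?(K, V) }; // placeholder representation
type Compare<K> = (K, K) -> Order.Order;   // caller-supplied key ordering

type Operations<K, V> = {
  insert : (BTree<K, V>, Compare<K>, K, V) -> ?V;                  // returns any replaced value
  insertMany : (BTree<K, V>, Compare<K>, Iter.Iter<(K, V)>) -> (); // bulk insertion
  isValid : (BTree<K, V>, Compare<K>) -> Bool;                     // validation check
  get : (BTree<K, V>, Compare<K>, K) -> ?V;                        // search
  range : (BTree<K, V>, Compare<K>, K, K) -> Iter.Iter<(K, V)>;    // pure iteration through a key range
  mapRange : (BTree<K, V>, Compare<K>, K, K, (K, V) -> V) -> ();   // map (and update) values in a key range
  delete : (BTree<K, V>, Compare<K>, K) -> ?V                      // removes the key entirely
};
```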

I hadn’t seen this lib before, but take a look at this one:


Map Iteration: Map (and update) values in a key range.

This would be awesome! Really important for pagination and not currently part of RBTree.

It would also be great if you could talk with @dieter.sommer about the stable Rust data structure that his team is building for the Bitcoin integration.

I believe it directly stores bytes into stable memory. A Motoko equivalent using the ExperimentalStableMemory library would be very useful (as opposed to just declaring a data structure as a stable variable, which runs into the 4 GB heap limit).

One of the drawbacks of storing this in stable memory (as opposed to using the stable keyword) would be additional performance overhead for various operations.

One of the goals of this library is to improve upon the performance of the Red-Black tree in terms of operational performance (speed, throughput, space occupied, and cost).

Hopefully we’ll be at a place in a few months with multi-canister scaling such that developers don’t need to worry as much about the 4 GB individual canister size limit and are optimizing for compute, not storage.

Other operations that would be nice:

  • split() - split the tree in half, with the lower half on the left and the higher half on the right
  • slice(n) - like split, but take n elements instead of half

It would also be very helpful to have a parallel implementation that is crypto-aware:
  • witness(id) - calculate the witness of a leaf
  • root() - get the root hash of the tree.
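If it helps the discussion, here are hypothetical Motoko signatures for these four operations (the BTree and Witness types are placeholders, not real APIs):

```motoko
type BTree<K, V> = { var root : ?(K, V) }; // placeholder representation
type Witness = Blob;                       // opaque placeholder for a hash-path witness

type Extras<K, V> = {
  split : BTree<K, V> -> (BTree<K, V>, BTree<K, V>);        // (lower half, upper half)
  slice : (BTree<K, V>, Nat) -> (BTree<K, V>, BTree<K, V>); // (first n elements, the rest)
  witness : (BTree<K, V>, K) -> ?Witness;                   // witness for the leaf holding the key
  root : BTree<K, V> -> Blob                                // root hash of the Merkleized tree
};
```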

Did you make the library?


It seems that it uses mo:base/Trie behind the scenes?

No, the bucket lib is a library for storing and getting stable data.

Thanks!

Yes, I was also thinking we’d start by writing a data structure to be stored in stable vars.

But I also plan to keep in mind that we eventually want a version of a B-Tree in Motoko that uses stable memory more directly, either through the existing library API (quite slow), or more likely, with additional system and language support that improves the currently slow performance of using the system API (which that library wraps under the hood). I believe such support is in the works for all Wasm canisters, not just Motoko-based ones.
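Concretely, the first milestone would let you write something like the following; the package path and BTree.init are hypothetical names for the library under discussion:

```motoko
import BTree "mo:btree/BTree"; // hypothetical package path

actor {
  // The whole tree lives in a stable var, so it survives upgrades. This
  // works only if the node types are stable (no function values inside).
  stable var users : BTree.BTree<Nat, Text> = BTree.init<Nat, Text>(32);
};
```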

Great to know, thanks!

I wonder how it compares to this one? (also linked above)

In any case, I am interested to do that. Thanks again!

Makes sense!

Quick check: As this would be imperative, the split operations would only make sense if the original B-Tree is mutated (“destroyed”) in the process, right? (I just want to confirm that we aren’t imagining a functional version where the pre- and post-split versions co-exist.)

Yes, good point. This aspect is important too. (WDYT @icme?)

Is the basic (non-Merkle) version still useful to you @skilesare?

To manage the work, I would do the basic operations and tests first, and move on to “authenticated” use cases after that, where query operations produce witnesses related to the root hash, and update operations update it.

And for performance, I wonder how many extra cycles the Merklized hashing will require? I assume the overhead will be sizable, and some folks may prefer to skip it entirely. But experiments will confirm the overhead, and they seem important for answering other questions (e.g., how best to design the authenticated-ness as an optional feature).

Quick check: As this would be imperative, the split operations would only make sense if the original B-Tree is mutated (“destroyed”) in the process, right? (I just want to confirm that we aren’t imagining a functional version where the pre- and post-split versions co-exist.)

You just went over my head. :grimacing: If it is impossible, then it is impossible. (Confession: I don’t know what imperative means :joy:) My functional programming is a bit sparse.

Yes! But less useful for returning certified data to HTTP queries.

Don’t mean to derail, but curious about using StableBTreeMap vs the approach taken here:

Wouldn’t the overhead only be during upgrades?

Oops, sorry – let me try to be more concrete.

Suppose that t is a B-Tree that holds some key-value data.

Suppose that you define let (t1, t2) = t.splitAt(splitKey), so that t1 and t2 hold the left and right portions of the data around splitKey (defined somehow).

There are two ways to implement this regarding how t1 and t2 relate to t:

  • Functional: t is still valid after the operation, and t1.append(t2) == t
  • Imperative: t is not valid after the operation; the nodes it uses have been mutated to represent t1 and t2

I think both are possible, and was wondering which you expected?
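In code, assuming t : BTree<K, V> and splitKey : K are in scope, and splitAt and append are the hypothetical operations in question:

```motoko
let (t1, t2) = t.splitAt(splitKey);
// Functional reading:  t is still valid afterwards, and t1.append(t2)
//                      is structurally equal to t.
// Imperative reading:  t's nodes were reused (mutated) to build t1 and t2,
//                      so any further use of t is a bug.
```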

I don’t see much difference for me. I’d say do the one that is the most efficient for cycle purposes.


The imperative version will most likely be more efficient in terms of cycles and memory usage. The drawback is that, as a programmer, you then have to be aware that the operation directly mutates the data structure passed into it, so you may get side effects that you did not originally intend.

For the purposes of permanent data storage, the imperative version is fine.

However, if you are splitting the data structure just to perform an operation on both halves, the split will mutate the underlying data structure. In this case, a map/updateIter/scan type of operation between two bounds would be best (sketched after the list below). Something like a combination of the following functions:

  • getMidpoint(bt, compareTo)
  • mapIter/Scan(bt, compareTo, lowerBound, upperBound, mapFunc) // midpoint as lowerbound or upperbound
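As a sketch of that last point, assuming bt : BTree<Nat, Text>, a midpoint-derived lowerBound, and an upperBound are in scope (mapRange is an assumed name for the hypothetical library operation, not an existing API):

```motoko
import Nat "mo:base/Nat";

// Update every value in [lowerBound, upperBound] in place, leaving the
// tree whole, instead of splitting it just to touch one half.
BTree.mapRange<Nat, Text>(
  bt,
  Nat.compare,
  lowerBound,
  upperBound,
  func (k : Nat, v : Text) : Text { v # " (visited)" } // rewrite each value in range
);
```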

I don’t have a strong opinion on crypto awareness when it comes to data structures.

Unless it fundamentally changes how the data is stored, it might be nice to build the data structure out first, and then build wrappers around this base data structure that include that added functionality.

I’m not too familiar with merkle trees and the witness() and root() functionalities, but I’d prefer to keep this BTree data structure library as general purpose and lean as possible.

@matthewhammer Are you familiar with the witness() and root() functions and what that would take?

Perhaps someone who has touched that code can comment more (@chenyan?), but my first guess is that they do indeed look similar in purpose (the asset one uses text keys, as opposed to byte strings, but that seems minor). The non-asset-specific one also seems to be engineered differently, in terms of its expression as Rust code. Without comparing them in a lot more detail, it’s hard to say more.

In my understanding*, the overhead of using the ExperimentalStableMemory module comes from the way the system API is exposed to the canister: it imposes per-call overhead for every read and write performed through that library.

By contrast, using the stable keyword affects upgrade behavior (persisting the data in these vars using stable memory, while incurring much less overhead during upgrades, thanks to the bulk operations used then). Outside of upgrades, stable data structures in Motoko use ordinary reads and writes in ordinary Wasm memory, and do not incur any system-call overhead.

(* @claudio @ggreif please correct me if that’s inaccurate!)
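To make the contrast concrete, a minimal sketch using mo:base as it stands today:

```motoko
import StableMemory "mo:base/ExperimentalStableMemory";

actor {
  // Via the library: every load/store crosses the system API, paying
  // per-access overhead at runtime.
  public func viaLibrary() : async Nat32 {
    if (StableMemory.size() == 0) {
      ignore StableMemory.grow(1); // reserve one 64 KiB page
    };
    StableMemory.storeNat32(0, 42); // one system call
    StableMemory.loadNat32(0)       // another system call
  };

  // Via the `stable` keyword: ordinary Wasm memory accesses at runtime;
  // stable memory is only involved, in bulk, during upgrades.
  stable var counter : Nat32 = 0;
  public func viaStableVar() : async Nat32 {
    counter += 1;
    counter
  };
};
```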

Yes, a layered approach, with nice factoring, would be ideal. That would be my initial design goal, when the time comes for it.

The need for this comes from relating computations done outside of consensus to those done as part of consensus. For everything within an update message, consensus means that the IC itself ensures a level of authenticity (“tamper proof”!).

For query messages, however, only a single replica replies and in principle, that one machine could be compromised to be deceptive or faulty in some way. If we could generally certify these quickly (without paying for full consensus), then the technical need for per-application certification would be eased, or eliminated, AFAIK.

Until that time, the role of witnesses and root hashes guards against this possibility of deception for such query responses, since the caller can check this response against a hash that has been processed by consensus.

To get a better sense of how this looks in Motoko, there’s an initial example for a single counter value. Ideally, some B-Tree (like the one we are discussing now) would generalize this example, certifying arbitrary sets of key-value data, not just a single 32-bit value (as the example now shows).
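For reference, here is a minimal sketch of what that single-counter pattern looks like with mo:base/CertifiedData; the byte encoding is ad hoc, and the real linked example may differ in detail:

```motoko
import CertifiedData "mo:base/CertifiedData";
import Blob "mo:base/Blob";
import Nat8 "mo:base/Nat8";
import Nat32 "mo:base/Nat32";

actor {
  stable var counter : Nat32 = 0;

  // Encode the counter as a 4-byte little-endian blob (certified data
  // must be at most 32 bytes).
  func encode(n : Nat32) : Blob {
    Blob.fromArray([
      Nat8.fromNat(Nat32.toNat(n & 0xff)),
      Nat8.fromNat(Nat32.toNat((n >> 8) & 0xff)),
      Nat8.fromNat(Nat32.toNat((n >> 16) & 0xff)),
      Nat8.fromNat(Nat32.toNat((n >> 24) & 0xff))
    ])
  };

  public func inc() : async () {
    counter += 1;
    CertifiedData.set(encode(counter)); // only legal in update calls
  };

  // A query reply is signed by a single replica; the certificate lets the
  // caller check the value against a hash that went through consensus.
  public query func get() : async (Nat32, ?Blob) {
    (counter, CertifiedData.getCertificate())
  };
};
```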


PS: For anyone interested in diving into that code example, I recommend reading this comment that explains the steps briefly.

Here is an old example I did that (I think) is correct, and a way to do a real tree: Motoko Playground - DFINITY. I’m pretty sure I borrowed liberally from someone, but I don’t remember who now.


What’s strange is that @paulyoung’s example uses the ic_cdk storage API to read/write to stable memory, but StableBTreeMap doesn’t.

Can canisters even use that data structure then? It appears to me that it relies on replica internals, which makes it replica-specific and not canister-general. I’d like to be wrong though.

By contrast, using the stable keyword affects upgrade behavior (persisting the data in these vars using stable memory, while incurring much less overhead during upgrades, thanks to the bulk operations used then). Outside of upgrades, stable data structures in Motoko use ordinary reads and writes in ordinary Wasm memory, and do not incur any system-call overhead.

This sounds correct to me, but there is one huge drawback to Motoko stable variables: they are limited by the size of ordinary Wasm memory, which is 4 GB (and in practice much less, due to GC).

Compare this to a stable data structure in the mold of StableBTreeMap, which can access the entire 8 GB (and soon to be increased) stable memory address space.

I think @icme’s point about multi-canister scaling is valid, but I think any data structure developed from here onwards should be “multi-canister-aware”. Would an eventual (or even current) BigMap replace this? Developers don’t want to think about storage restrictions; they just want to read and write data to a data structure that abstracts all of that away. In 2022, I think any newly developed data structure that aims to be scalable should try to be as forward-compatible as possible, in light of all the changes going on.


With respect to Merkle trees, witnesses, and root hashes, I think it’s a great idea to make this library at least optionally “crypto-aware”. Certified variables are an incredibly powerful feature of the IC, and right now the biggest barrier to more widespread adoption of certified variables is a lack of simple libraries. Perhaps @nomeata could chime in on whether this would be the right place to do it.

Yep, I recognize that code. It’s from here:

And while it’s a perfectly reasonable starting place for a merklized KV store, it will not work in general: an attacker who can choose the keys being put can pick them to create arbitrary tree depths, and thereby arbitrarily bad future put/get performance.

OTOH, if the user who chooses the put keys is trusted, or if the keys are always hashed before being put into the tree, these concerns about balance are not as salient or security-critical.

But if we care about security enough to use certification at all, we should also care about the possibility of the attacker controlling put keys, as it will be severely problematic for data structures that do not treat these choices as potentially adversarial (DoS one path in the tree, and future requests there create canister paralysis, basically).

The extra complexity of a B-Tree (or other self-balancing tree) buys the guarantee that these problematic tree shapes can never happen, thanks to rebalancing steps that the simpler approach totally lacks.

In a B-Tree (or RBTree, or other rebalancing tree), the attacker still chooses the put keys, whose order influences the tree shape, but the tree responds by rebalancing when a subtree becomes too deep.

The Trie from base solves the problem differently, and does not use ordered keys, but rather their hashes (not controlled by the attacker at all, unless the hash function is broken).
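For illustration, with mo:base/Trie an entry’s position is derived from the key’s hash, so a caller’s choice of key order does not dictate the tree shape (these are the real base APIs):

```motoko
import Trie "mo:base/Trie";
import Text "mo:base/Text";

// Build a Trie key: the hash, not the text itself, steers the path
// taken through the tree.
func key(t : Text) : Trie.Key<Text> = { hash = Text.hash(t); key = t };

var trie : Trie.Trie<Text, Nat> = Trie.empty();
trie := Trie.put(trie, key("alice"), Text.equal, 1).0;
```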