Hi, when’s the deadline? Still working on my DHT-style solution.
We will likely keep this open for a while, but the sooner you get us a solution the better. Will likely shut things down after we have six or so submissions.
Hey @skilesare, great timing with this bounty. I had “research canister scaling” as a grant application todo, so I took a stab at it. Here’s a brief description of the project. There’s also a screen capture of the front-end, demoing the main concepts of the app.
Architecture: I went for the full scaling solution, where clients upload content to many buckets. The buckets create indexes based on the content they receive and periodically (5s for the demo, but configurable) send the index to an Index canister. The Index canister instructs the front-end to upload new content to particular canisters based on the indexing strategy (more below). It also serves an index of canister IDs to the front-end, based on the indexes received from the Buckets. Thus the front-end first requests an “index” for #dogs, then queries all the canisters where #dogs content was recorded and displays the list.
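To make that flow concrete, here is a minimal sketch of the Index canister’s tag index in Rust. The method names (`submit_index`, `get_buckets_for_tag`) are illustrative, not the repo’s actual API: Buckets push their local indexes, and the front-end queries the tag-to-bucket map.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

use candid::Principal;

thread_local! {
    // hashtag -> canister IDs of the Buckets holding content for that tag
    static TAG_INDEX: RefCell<HashMap<String, Vec<Principal>>> = RefCell::new(HashMap::new());
}

// Buckets periodically push their local index here (update call).
#[ic_cdk_macros::update]
fn submit_index(tags: Vec<String>) {
    let bucket = ic_cdk::caller();
    TAG_INDEX.with(|idx| {
        let mut idx = idx.borrow_mut();
        for tag in tags {
            let buckets = idx.entry(tag).or_default();
            if !buckets.contains(&bucket) {
                buckets.push(bucket);
            }
        }
    });
}

// The front-end asks which Buckets to query for a given tag.
#[ic_cdk_macros::query]
fn get_buckets_for_tag(tag: String) -> Vec<Principal> {
    TAG_INDEX.with(|idx| idx.borrow().get(&tag).cloned().unwrap_or_default())
}
```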
To demo “user access” I chose a trivial top-to-bottom approach. “Moderators” are added to the Index canister (based on a principal) and the Index sends the “moderator list” to every Bucket. By default the Buckets only serve content created by the requester, or by Anonymous. If a moderator is added, however, they can view every piece of content uploaded to that Bucket.
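A Bucket-side sketch of that gating logic might look like the following (again with hypothetical names; the real demo keeps more state than a flat entry list):

```rust
use std::cell::RefCell;
use std::collections::HashSet;

use candid::Principal;

thread_local! {
    static MODERATORS: RefCell<HashSet<Principal>> = RefCell::new(HashSet::new());
    static ENTRIES: RefCell<Vec<(Principal, String)>> = RefCell::new(Vec::new()); // (creator, content)
}

// The Index canister pushes the current moderator list on each heartbeat.
#[ic_cdk_macros::update]
fn set_moderators(mods: Vec<Principal>) {
    // A real Bucket would verify that the caller is the Index canister.
    MODERATORS.with(|m| *m.borrow_mut() = mods.into_iter().collect());
}

// Serve only the caller's own entries (or anonymous ones), unless the caller is a moderator.
#[ic_cdk_macros::query]
fn get_entries() -> Vec<String> {
    let caller = ic_cdk::caller();
    let is_mod = MODERATORS.with(|m| m.borrow().contains(&caller));
    ENTRIES.with(|e| {
        e.borrow()
            .iter()
            .filter(|(creator, _)| is_mod || *creator == caller || *creator == Principal::anonymous())
            .map(|(_, content)| content.clone())
            .collect()
    })
}
```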
I’ve used “entries” here, to denote pieces of content. This could be anything that we can track and quantize on a bucket. It could be megabytes for file storage, live sessions for proxy canisters, users served for game realms, etc. The 20 entry limit was chosen just so we can see the spawning in a live environment.
Implementation: The study is written in Rust, with a React mock-up for the front-end.
The canisters rely on heartbeat() functionality, as I wanted to also test this at scale. There are a ton of ways to optimize the flows, and the architecture could probably support having heartbeat() functionality just on the Index canister. Or, better yet, a dedicated heartbeat-enabled canister for the entire project.
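For reference, the heartbeat pattern in question looks roughly like this with ic_cdk — a sketch assuming the demo’s 5-second sync interval; the time-gating field and the helper are illustrative:

```rust
use std::cell::Cell;

thread_local! {
    // Heartbeat fires roughly every consensus round, so we track elapsed
    // time ourselves and only do work every ~5 seconds.
    static LAST_SYNC_NS: Cell<u64> = Cell::new(0);
}

const SYNC_INTERVAL_NS: u64 = 5_000_000_000; // 5 seconds

#[ic_cdk_macros::heartbeat]
fn heartbeat() {
    let now = ic_cdk::api::time(); // nanoseconds since epoch
    let due = LAST_SYNC_NS.with(|t| {
        if now - t.get() >= SYNC_INTERVAL_NS {
            t.set(now);
            true
        } else {
            false
        }
    });
    if due {
        // Spawn the async work; heartbeat itself stays synchronous.
        ic_cdk::spawn(send_index_to_index_canister());
    }
}

async fn send_index_to_index_canister() {
    // Collect local tags and call the Index canister's update method here.
}
```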
Code organization: I tried to be as non-opinionated as possible. There are a lot of great production-ready projects out there from which one could choose to draw inspiration on file organization. I chose to keep it as simple as possible, so that people reading the code focus on the IC stuff and not on implementation details. A lot of things could obviously be optimized and better organized.
Both the Index and Bucket canisters have four main source files:
- lib.rs deals with canister settings and IC-related calls (queries and updates);
- businesslogic.rs deals with… business logic. Here lies the main impl for most of the functionality;
- env.rs is an adaptation from @hpeebles’ starting project and deals with helpers for the cdk API;
- lifetime.rs deals with pre/post upgrades and the heartbeat function.
The front-end is simply thrown together to demo the canister workflows, nothing to write home about.
Key things to note when playing with the demo (some of the things can be hopefully seen in the video below):
- There are two indexing strategies implemented - FillFirst and BalancedLoad. The default is BalancedLoad. The front-end first requests a list of canister IDs where it can upload content, then calls the first canister in the list. We can imagine the Index canister using a multitude of indexing strategies, based on business needs (an interesting one would be grouping content to optimize for querying as few canisters as possible on content display); see the sketch after this list.
- The Index canister maintains a number of metrics. The main thing to notice is the relation between Free Slots, Desired Free Slots and Planned Slots. Free Slots are recomputed every 5 seconds. When Free Slots drops below the desired number, a new bucket is planned and added to the spawn queue. We count the planned slots so that we don’t over-provision canisters when the spawning process takes ~4-5 seconds and heartbeat() gets called multiple times in the meantime (also sketched after this list).
- When sending multiple tags in a short time, notice that even though the indexing strategy wants to add content to the least-loaded canister, re-indexing happens only once every 5 seconds, so multiple entries will be sent to the same canister. (This turns out not to be a problem in larger deployments.)
- A moderator added to the Index canister is propagated to all the Buckets on the next heartbeat update. This probably needs to be as close to synchronous as possible: a real-life implementation would also require deleting moderators, and prompt propagation prevents weird edge cases where removed moderators could still edit content on slow-to-update Buckets.
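Since a couple of the points above lean on the indexing strategy and the planned-slot bookkeeping, here is a hedged sketch of both in plain Rust. All names are illustrative; the real logic lives in the canisters’ businesslogic.rs.

```rust
use candid::Principal;

// Illustrative strategy enum; the study implements FillFirst and BalancedLoad.
enum IndexingStrategy {
    FillFirst,
    BalancedLoad,
}

struct BucketInfo {
    id: Principal,
    used_slots: u32,
    capacity: u32,
}

// Return upload targets in preferred order; the front-end calls the first one.
fn upload_order(strategy: &IndexingStrategy, mut buckets: Vec<BucketInfo>) -> Vec<Principal> {
    buckets.retain(|b| b.used_slots < b.capacity); // never suggest a full bucket
    match strategy {
        // Fill one bucket at a time: fullest non-full bucket first.
        IndexingStrategy::FillFirst => buckets.sort_by(|a, b| b.used_slots.cmp(&a.used_slots)),
        // Spread load evenly: least-loaded bucket first.
        IndexingStrategy::BalancedLoad => buckets.sort_by_key(|b| b.used_slots),
    }
    buckets.into_iter().map(|b| b.id).collect()
}

// Planned slots keep a slow (~4-5s) spawn from being requested repeatedly
// by heartbeats that fire while the new bucket is still being created.
struct SlotMetrics {
    free_slots: u32,
    desired_free_slots: u32,
    planned_slots: u32,
    slots_per_bucket: u32,
}

fn plan_buckets_if_needed(m: &mut SlotMetrics, spawn_queue: &mut Vec<()>) {
    while m.free_slots + m.planned_slots < m.desired_free_slots {
        m.planned_slots += m.slots_per_bucket;
        spawn_queue.push(()); // each entry = one bucket to spawn on a later heartbeat
    }
}
```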
Play around with the demo, and let me know if there are any questions.
Note: please do not use this code as is in production!!! This was approached as a scalability study, and it is not optimized, and not tested well enough to be production ready. Feel free to use the code as you want, but please make sure it’s tested and stable before using this with real stakes!
Repo:
Video Demo:
Great job! Thanks for the research, explanation, great video, and well organized project and code. If it was up to me I’d give you the grand prize (pokes @skilesare)
This gives me a bunch of new ideas I’d like to test out
A few follow-up questions:
1. I see you’re storing a HashMap<hashtag, List<canisterId>> in the index canister, and that the frontend is making two rounds of calls - one to query the index canister and retrieve the list of storage canisters that contain a particular hashtag, and then one to query all of the storage canisters containing that specific hashtag. Did you explore any ideas for how you might scale out the index canister if the HashMap fills up?
I could see potentially having that “HeartBeat” canister you mentioned earlier turn into a management canister that holds your slot metrics and can spin up new index canisters as a second level of indirection, but now you’re scaling by putting additional levels of indirection in front of the user. If you take this route you now have 3 rounds of front-end calls - one to the management canister holding the index canisters, one to all the index canisters asking for canisters with your hashtag, and then finally to the storage canisters. For now, this multiple query approach makes a lot of sense since inter-canister queries are blocked and inter-canister updates are very slow.
Wonder if you were able to ponder on this point of scaling out the index canister itself, potentially without adding additional levels of indirection.
2. The tradeoff made to achieve scalable key-value storage sacrifices sortability and some query depth (say, getting the latest entries sorted by timestamp, or getting all cat hashtags geotagged in a particular region). This also comes into play when I want to update a specific post that I’ve given the #cat hashtag, or someone else tries to like my post - with no unique identifier, which cat post gets updated?
Did you give any thought to how one might auto scale while still providing sortability and additional query depth?
Thank you for the kind words
For the first question, that’s literally on my whiteboard right now. I have two diagrams: one with an index of indexes, and one with simply two index canisters. There are a couple of benefits to simply having two Index canisters, maybe sorted by “frequently used stuff”. I believe that 8GB ought to be enough (famous last words), and worst case scenario the clients will sometimes query both index canisters (instead of always making two queries). I guess we’ll have to grow and see…
Your second question is a bit more complicated, and I don’t have a good answer at this point. There are a lot of factors going into how you decide to index your data, and obviously there are tradeoffs one way or the other. My hope is that my project is not a critically real-time system, so I have the benefit of being able to schedule tasks; as long as they “eventually” complete, the system should still work. In other words, I might be able to have many indexing tasks work on each Bucket, with the main Index just pointing towards them. It’s a kind of both distributed storage and distributed computing. The tradeoff, of course, is that the front-end needs to do more queries. Looking at existing projects, that seems to be a good tradeoff to make, as the IC seems uniquely positioned to serve this use case with ease. When in doubt, we’ll “simply” add another layer of caching, trading space for compute time.
Hey, as promised I updated my submission with docs, some diagrams, cleaner code, and an instructional tutorial. I will remove the old reply. Thanks! I hope it’s not too late.
The summary below is from GitHub.
The purpose of this project is to create canisters that have shared ownership. However, this is different from allowing multiple canisters to control a canister in the default sense. The problem is that some default canister functions give full access to anyone with control (i.e. there is only one level of control). A quick analogy: in a joint bank account, a married couple can each have their name on the account, but each party is able to completely withdraw the funds. In a divorce, things can become messy if one party decides to withdraw everything quickly, since both parties have full control. Instead, it would be nice to create a joint bank account where people can vote on money spent, split amounts evenly, divvy up the percentage owned, etc. In essence, if the default settings for multiple users with control over a canister are like a legacy joint bank account, this program is meant to create extra sharing functionality for canisters!
I achieve this by creating a primary canister that creates canisters. The primary canister retains default control of the new canisters, and any customized ownership is tracked in a list in the secondary canisters. Users interface with the secondary canisters through method calls on the primary one. Essentially, the primary canister has an assoc list (dictionary) with users’ principal IDs as keys and a buffer of secondary canister IDs per principal ID. This achieves project goal 1.
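The submission itself is Motoko (main.mo), but the core bookkeeping is easy to sketch; here is a rough Rust equivalent of that assoc-list idea, with hypothetical method names:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

use candid::Principal;

thread_local! {
    // user principal -> secondary canisters that user participates in
    static MEMBERSHIPS: RefCell<HashMap<Principal, Vec<Principal>>> =
        RefCell::new(HashMap::new());
}

// Record that the caller joined a secondary canister.
#[ic_cdk_macros::update]
fn join(secondary: Principal) {
    let user = ic_cdk::caller();
    MEMBERSHIPS.with(|m| {
        let mut m = m.borrow_mut();
        let list = m.entry(user).or_default();
        if !list.contains(&secondary) {
            list.push(secondary);
        }
    });
}

// List the secondary canisters the caller belongs to.
#[ic_cdk_macros::query]
fn my_canisters() -> Vec<Principal> {
    let user = ic_cdk::caller();
    MEMBERSHIPS.with(|m| m.borrow().get(&user).cloned().unwrap_or_default())
}
```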
The primary canister provides indexing information such that a client can distribute parallel calls across secondary canisters directly.
Principal IDs/users can join and unjoin secondary canisters as they please. (However, custom membership, limiting the number of principal IDs/users, etc. could be added; for now it is just join and unjoin.) The actual calling of canisters isn’t done by the principal IDs/users themselves but by the primary canister. If we remove all control from the primary canister after enough development, the system can be quite trustless, as default control would essentially be “black holed”. This achieves project goal 2.
Provide a security interface such that secondary canisters can hold private data from many users but only deliver requests to authorized requesters. Attempt to use as few inter-canister calls as possible.
TO SEE THE CODE’S STRUCTURE, SEE THE DIAGRAMS AT THE BOTTOM OF THIS PAGE. Essentially, main.mo is the primary canister, and it creates many instances of NodeCanisters (NodeCanisters.mo). These live in the /src/Main directory.
Office hours tomorrow. Hopefully an update today.
Technical features of ICSP:
- Infinite Capacity ICSP Canister: read and write to the same Canister without having to worry about storage space.
- Explanation:
- “Infinite capacity” refers to unbounded creation of canisters (given a sufficient cycle balance): when a storage canister is full, a new one is created automatically, without blocking data writes and with smooth switching of the storage destination.
- Supports CRUD (only read and write are supported in the current version)
- Explanation:
- The business layer supports the four CRUD operations (create, read, update, delete), with appropriate memory optimizations in the related operations to support reuse of fragmented memory (next version).
- One-Step Store Data and Two-Step Get Data
- Explanation:
- One-step storage: the back-end can store (key, value) into the ICSP canister directly without awaiting the return value, which makes storing convenient.
- Two-step get: first ask the ICSP which storage unit (bucket) holds the key, then fetch the metadata from that bucket. (See the sketch after this list.)
- Automatic cycle monitoring
- Explanation:
- The ICSP heartbeat actively monitors the cycle balances of the ICSP and its storage buckets and automatically tops them up, so the user only needs to monitor the ICSP’s ICP balance.
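As referenced above, here is a hedged sketch of the one-step store / two-step get pattern from a Rust caller’s perspective. The method names (“store”, “get_bucket_of”, “get”) are illustrative, not ICSP’s actual interface:

```rust
use candid::Principal;

// Fire-and-forget store: a one-way `notify` doesn't await the reply,
// which is presumably what "ignore await" refers to.
fn store_one_step(icsp: Principal, key: String, value: Vec<u8>) {
    let _ = ic_cdk::api::call::notify(icsp, "store", (key, value));
}

// Two-step get: first ask the ICSP index which bucket holds the key,
// then fetch the value from that bucket.
async fn get_two_steps(icsp: Principal, key: String) -> Option<Vec<u8>> {
    let (bucket,): (Principal,) = ic_cdk::call(icsp, "get_bucket_of", (key.clone(),))
        .await
        .ok()?;
    let (value,): (Option<Vec<u8>>,) = ic_cdk::call(bucket, "get", (key,)).await.ok()?;
    value
}
```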
Welcome to check it out!
Awesome and pure key/value store! I’d love to see you replace blob with CandyLibrary so you can store all kinds of data!
No, please don’t replace anything with CandyLibrary. The native types exist for a reason.
There’s huge overhead with variant types, as we’ve discovered today in our project. I couldn’t think of anything worse than having a “variant type of all types”
What did you figure out? I thought it was like two bytes as long as you kept it under 256 entries?
And how else are you doing hierarchical data? (If you are.)
I can’t tell you exactly because it’s a black box. We replaced variants with text fields and reduced our wasm function count by 600, from 5600 down to 5000.
It wouldn’t matter if there wasn’t a 6000 function cap on the wasm binary.
wasm-objdump and wasm-opt have helped us a lot
Good to know. Thanks. We’re doing a bunch of json and whatnot and nested documents that seem to require variants of some sort.
Okay, I will try it!
It’s with great pleasure I present the fruits of weeks of labor: Scaled Storage
Introduction
Before starting this bounty, I knew I wanted to build a general solution that could be easily added into any rust project.
While working on the bounty, the following goals drove most decisions I made:
- Simplicity
- Ease of use
- Storage efficiency
- It should work as an imported library.
Scaled Storage is a generic distributed hash table tailored for the Internet Computer. It can scale to a possibly unbounded number of canisters, with a worst case of one inter-canister call and usually a best case of zero (I’ll get into this later). I’ve currently tested it up to 10 canisters. The client side never needs prior knowledge of all canisters holding data - just the canister id of any one of them.
Features
- Imported as a Rust library
- Scales up or down depending on developer defined behaviour.
- “Quine” style canister replication. All canisters are functionally alike and use the same code.
- Housekeeping operations (migrations, scaling up and down) are abstracted away.
- There isn’t a “primary”, “index” or “secondary” canister; any request can be handled by any canister.
- Tries to reduce inter-canister calls.
Developer UX
- Developer imports the scaled_storage library
- Copy-pastes a few predefined scaled_storage operations.
- Uploads the canister’s WASM using ss_uploader.
How It Works
Scaled Storage uses a consistent hashing algorithm to map keys to their appropriate canisters.
When a key is accessed through the exposed functions, the result is either the value or the canister where it is located; in the latter case the request is simply forwarded to that canister. Here’s a code excerpt:
match canister.with_data_mut(key.clone(), |data| data.clone()) {
    NodeResult::NodeId(canister_id) => {
        // forward the request to canister_id, or return it to the client side
    }
    NodeResult::Result(result) => {
        // return the result to the client side
    }
}
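For the NodeId arm, forwarding could look something like this - a sketch only; the method name and types are illustrative, not the actual scaled_storage API:

```rust
use candid::Principal;
use ic_cdk::api::call::CallResult;

// The node that receives the request simply relays it to the canister
// the hash function points at, then returns the answer to the caller.
async fn forward_get(target: Principal, key: String) -> CallResult<Option<String>> {
    let (value,): (Option<String>,) = ic_cdk::call(target, "get_data", (key,)).await?;
    Ok(value)
}
```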
Scaled Storage provides the following aptly named functions:
with_data_mut
with_upsert_data_mut
Upgrades & Migrations
When nodes are added or deleted, the changes are broadcast to all other nodes.
pub enum CanisterManagerEvent {
NodeCreated(Principal),
NodeDeleted(Principal),
Migrate(MigrateArgs),
}
These nodes then update their node list and hash function. The affected values that need to be migrated are converted to a generic byte array and sent to the relevant node.
#[derive(CandidType, Deserialize, Debug, Clone)]
pub struct MigrateArgs {
#[serde(with = "serde_bytes")]
data: Vec<u8>,
}
///candid file
type migrate_args = record {
data: blob;
};
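One plausible way to produce that generic byte array is to candid-encode the affected key/value pairs - an assumption on my part; the crate’s actual encoding may differ:

```rust
use std::collections::HashMap;

use candid::{decode_one, encode_one};

// Serialize the values selected for migration into MigrateArgs::data.
fn to_migrate_bytes(data: &HashMap<String, String>) -> Vec<u8> {
    encode_one(data).expect("candid encoding failed")
}

// On the receiving node, restore the migrated entries.
fn from_migrate_bytes(bytes: &[u8]) -> HashMap<String, String> {
    decode_one(bytes).expect("candid decoding failed")
}
```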
The consistent hashing algorithm in question is Anchor Hash, which has “high key lookup rates, and a low memory footprint” according to the authors of its paper.
This particular algorithm guarantees that given N nodes and K keys, the hash function distributes roughly K/N keys to each node, achieving uniformity (a toy illustration follows the list below). It also reduces the number of inter-canister calls needed by:
- Only needing to migrate values to a newly created canister (instead of having to migrate to other nodes).
- Only needing to migrate values from a deleted canister, without disrupting the location of other keys.
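To see why this matters, here is a toy consistent-hash ring - deliberately not AnchorHash - demonstrating the minimal-migration property the bullets above rely on: when a node is added, every key that moves, moves to the new node.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash64<T: Hash>(x: &T) -> u64 {
    let mut h = DefaultHasher::new();
    x.hash(&mut h);
    h.finish()
}

struct Ring {
    nodes: BTreeMap<u64, String>, // position on the ring -> node id
}

impl Ring {
    // A key belongs to the first node at or after its position (wrapping).
    fn node_for(&self, key: &str) -> &str {
        let k = hash64(&key);
        self.nodes
            .range(k..)
            .next()
            .or_else(|| self.nodes.iter().next()) // wrap around
            .map(|(_, n)| n.as_str())
            .unwrap()
    }
}

fn main() {
    let mut ring = Ring { nodes: BTreeMap::new() };
    for n in ["node-a", "node-b", "node-c"] {
        ring.nodes.insert(hash64(&n), n.to_string());
    }
    let keys: Vec<String> = (0..1000).map(|i| format!("key{i}")).collect();
    let before: Vec<String> = keys.iter().map(|k| ring.node_for(k).to_string()).collect();

    // Add a fourth node.
    ring.nodes.insert(hash64(&"node-d"), "node-d".to_string());

    // Every key that moved must have moved to the new node.
    for (k, old) in keys.iter().zip(&before) {
        let new = ring.node_for(k);
        assert!(new == old || new == "node-d");
    }
}
```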
Architecture
Scaled Storage can be described as a linked list of nodes, with each node having the following (sketched in code after this list):
- The previous node id
- The next node id
- A consistent-hashing function mapping keys to a canister
- A LIFO list of all nodes.
- The underlying data
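In code, the node shape described above might look roughly like this; the field names are illustrative, not the crate’s actual definitions:

```rust
use std::collections::HashMap;

use candid::Principal;

// Rough shape of a Scaled Storage node as described above.
struct Node<Data> {
    previous: Option<Principal>, // None for the first node
    next: Option<Principal>,     // None for the last node
    all_nodes: Vec<Principal>,   // LIFO list: last entry is the newest node
    // Plus a consistent-hashing function mapping a key to one of `all_nodes`
    // (AnchorHash in the actual implementation).
    data: HashMap<String, Data>, // the underlying data
}
```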
I chose this linked list structure over an “index node” → “secondary node” structure for the following reasons:
- Canisters/Nodes have exactly the same code and behaviour.
- Minimal responsibilities - each node is only responsible for its subsequent node and serves as its sole controller (with the developer being the controller of the first node). This reduces the surface area for accidents and gives each node a single scapegoat.
- It reduces memory fragmentation - new canisters are only created after prior ones are full. Canisters cannot be removed in an arbitrary order, but must first have all subsequent nodes removed.
- It reduces inter-canister calls, since there’s no need to check the memory usage of all canisters; when the last canister wants to scale up, it can simply assume all other canisters are full.
Issues and Solutions
- Large binary from inlining wasm via include_bytes!:
Wasm code for creating new canisters is instead added to heap memory at runtime via an init_wasm operation. I’ve created wasm_uploader for this.
- Payload size too large for the init_wasm operation:
The wasm binary is chunked and sent one piece at a time:
"init_wasm":(wasm_init_args)->(bool);
type wasm_init_args = record {
position: nat8;
wasm_chunk: blob;
};
Performance
- Update operations take at most two calls; one made by the end user and a second potential inter-canister call if the hash function points to another canister.
- Query operations take at most three calls: since query calls can’t currently be forwarded between canisters, the correct canister id is returned to the client, which then makes an additional call (see the issues section below).
- Since operations can be handled by any canister, operation results could contain the id of the canister that held the data, to be stored/cached on the front-end. Subsequent operations could then be requested directly from the cached canister id.
Tests and examples
- Scaled Storage Simple Scaling Test
This demonstrates the scaling capabilities of Scaled Storage. The default test demonstrates 10 canisters storing simple string types.
- IC Snippets
This is a more complex example, demonstrating storage of more advanced data structures, plus authentication to prevent unauthorised access:
...
.with_upsert_data_mut(snippet_key.page_id(), |page| {
    match page.get_snippet(&snippet_key) {
        Some(snippet) => {
            let owner = snippet.owner;
            if owner != ic::caller() {
                return Err("Auth Error");
            }
            ...
        }
        ...
    }
});
Future updates and issues:
- Consistent hashing algorithm change
Although Anchor Hash prevents unnecessary data migrations, it distributes data equally amongst nodes. Typically this is a desirable effect, but for Scaled Storage it means “filled” nodes still have new data inserted into them (although the probability of that decreases as the number of nodes increases).
A better hash function would have the following properties:
a. Given N nodes, it only distributes new keys to node N.
b. When the number of nodes increases from N to N+1, only keys in node N have to be migrated to node N+1.
c. When the number of nodes decreases from N to N-1, only keys in node N are migrated to node N-1.
This would limit the number of canisters involved in any migration to one: either the prior node (when downscaling) or the next node (when upscaling).
- A rewrite implemented as a finite state machine: easier to read, fewer bugs since state transitions can be validated at compile time, and easier-to-write rollback operations when a state change errors.
- Macros to reduce the amount of boilerplate to write.
- Downscaling hasn’t been implemented.
- More operations on the underlying data apart from inserting and getting data. For example map reduce operations that require data from every node.
- Authenticate house-keeping requests between nodes. Currently it is trivial for a bad actor to pose as a newly created node, receiving migrated data.
- Query calls cannot currently be forwarded; the first demo’s query forwarding only works locally, pending inter-canister query calls. For the other demos, the canister id is returned to the client.
- Cycles sent to one canister should be automatically distributed amongst all canisters.
- Scaled Storage uses heartbeat; it may be possible to avoid this and instead “hitchhike” on other requests, reducing cycle consumption.
Summary
There’s still significant work to be done, but the foundation for a generic easy to use scaling solution has been created. Thanks for reading.
- Scaled Storage repo: Github
- Scaled Storage library crate: Crates
- Scaled Storage uploader crate: Crates
- Canister for the simple scaling test: qjikg-riaaa-aaaag-aaf7a-cai
Curious, was this wasm function count decrease with respect to variants observed strictly when writing your code directly in Motoko, or when writing your code in Go and then using that tool to compile it to wasm?
First off, I want to say that I absolutely love this scaling idea and what you’ve done with data partitioning, especially in the context of canister reliability. Very cool work, and I can tell you put a lot of hard work into designing and building this!
If one can distribute their data enough that one canister has the same management capacity as all canisters, then it becomes much harder to DDOS an entire application (you may just take one or more canisters offline).
Although I love the creativity of this idea, I see several potential issues with the scalability of the design in terms of the inter-canister update and/or query calls required for all the nodes to remain in sync. The fact that multiple canisters would constantly be rebalancing with each other (using inter-canister update calls) means that under significant load the entire application would most likely slow down and get “behind” on rebalancing.
In this situation imagine Canister A trying to update and re-balance with Canister C, Canister C simultaneously trying to re-balance with Canister B, and Canister B is trying to re-balance with Canister A. We end up with inter-canister update call 2+ second “locks” that lock up the rest of re-balancing efforts.
In a large enough application with enough of these requests coming in, I could see this eventually leading to application failure.
A few questions.
- When the number of nodes is increased from N to N+1, how long does the complete re-partitioning take? How does this change as we go from N=3 to N=50?
- If you have N=10 canister nodes and you hit each canister with the exact same ~500 requests in a second, how long does it take to re-balance? What about ~1000 requests/sec?
- Does Anchor Hashing allow me any control over knowing the exact canister node/partition that my data was sent to? For example, can I easily ensure that all data for a specific user is in a specific canister, or do I need to spin up a new Scaled Storage actor for each user in order to do this?
- How is the client expected to know of and randomly pick which canister to interact with at any one time, given that the management canister generates random canister ids?
We were at about 5800 wasm functions (not using any optimisers at that point). We had about 17 entities that used variants, e.g.:
#common; #uncommon; #rare
I changed the Go generator to skip variants and just reference them by ID (like a normal database relation), then recompiled, and we were at 5200.
Going through the wasm-objdump logs and grepping for #common returned way more lines than I would have expected. It was at about 40 per variant, which makes sense if we dropped about 600 funcs.