ICP.Lab Storage & Scalability Summaries

Hey everybody,

A few weeks ago, we had the 4th ICP.Lab at the DFINITY HQ in Zurich to discuss pain points and potential solutions with teams that push the boundaries concerning Storage and Scalability on the Internet Computer.

For this ICP.Lab, we’d like to do something new, and provide summaries for (most of) the sessions for anybody who is interested. This is a long post, so get yourself a cup of coffee.

Content

  • Introductions and main pain points of the participating teams
  • Stable Structures & Memory + Data structures and schema evolution
  • Scalability: Current limitations and improvements
  • Scalability 2 - Multi-canister architectures
  • Immutable Storage API
  • Backup and restore
  • Internet Computer as CDN
  • Queries & Query charging
  • Encryption and Data Privacy
  • Partnerships between the teams

Introductions and main pain points

To start off the week, each project was asked to briefly introduce themselves and present their main pain points. In the following summary, we list the pain points mentioned.

Origyn

  • Subnet routing/Subnet Distribution - need to roll our own (self-directed subnet assignment)
  • Caching - need to roll our own
  • Alternative routing - prptl.io
  • Stable Storage?
  • Cost clarity (storage pricing/storage subnets, queries)
  • Easier backup and restore

Anvil/Internet Base

  • The agility of (Motoko) data structures. How to enable fast queries on data structures that evolve over time?

CanDB

  • Canister cloning (in order to do efficient sharding/partitioning)
  • ICQC could enable a more performant CanDB canister client (Motoko/Rust)
  • Canister groups & Subnet Splitting help canisters remain on the same subnet

OpenChat

  • Support more canisters per subnet
  • Threshold Key Derivation to support e2e encryption
  • SEV-SNP so user data is more secure when e2e encryption is not enabled
  • Fire and forget HTTPS outcalls from canisters
  • Inter subnet query calls (+ capability-based security model)
  • Upgrading canisters shouldn’t require stop/upgrade/start; instead, ingress messages should be queued during the upgrade
  • Cost to provision and upgrade canisters
  • Canister hook on low cycles balance

Hot or Not

  • AuthZ mechanism for canisters belonging to the same project (Forum post)
  • Inter subnet query calls
  • Cost of canister creation for user-per-canister dapps
  • Native backup/restore mechanism (Forum Post 1 and 2)
  • Handle chunking at the protocol level instead of the application level
  • Better primitives for working with stable memory
  • Safer primitives for canister upgrades using pre/post upgrade hooks
  • Specialized subnets
    • Low-replication storage subnets that aren’t globally replicated
    • GPU support for hardware-accelerated workloads:
      • Video encoding/decoding
      • Training neural networks

Portal

  • The dream is boundary nodes functioning as CDNs
  • Caching on the boundary nodes

Signals

  • Mobile iOS issues with streaming videos
  • Usability of stable memory
  • Complexity in migrating large amounts of data over the network (Cloning/Partitioning)
  • Best practices for testing

Stable Structures & Memory + Data structures and schema evolution

Storing persistent data in the Wasm heap has two main drawbacks: upgrades become riskier because the entire contents need to be serialized to stable memory (without bugs!), and storage is limited to 4 GB. We recommend using stable memory for persistent data because it allows for safe upgrades and 48 GB of storage (soon to be increased!).

Wasm-native stable memory is improving the performance of stable memory. In some cases it is still over 10x slower than the heap, but in other cases it is only ~2x worse, and there is still a bunch of work to be done that can bring it close to heap performance.

The ic-stable-structures and ic-stable-memory libraries provide high-level structures (BTreeMap, Vec, etc.) that make using stable memory ergonomic.
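
For readers new to these libraries, here is a minimal, hedged sketch in Rust using ic-stable-structures (assuming a recent library version where String values are supported) of keeping a map directly in stable memory, so it survives upgrades without any pre/post-upgrade serialization. The names (NOTES, put_note, get_note) are illustrative, not from the session.

```rust
use std::cell::RefCell;

use ic_stable_structures::memory_manager::{MemoryId, MemoryManager, VirtualMemory};
use ic_stable_structures::{DefaultMemoryImpl, StableBTreeMap};

type Memory = VirtualMemory<DefaultMemoryImpl>;

thread_local! {
    // The memory manager splits the canister's stable memory into virtual
    // memories, so several stable structures can coexist.
    static MEMORY_MANAGER: RefCell<MemoryManager<DefaultMemoryImpl>> =
        RefCell::new(MemoryManager::init(DefaultMemoryImpl::default()));

    // A BTreeMap that lives in stable memory instead of on the Wasm heap.
    static NOTES: RefCell<StableBTreeMap<u64, String, Memory>> = RefCell::new(
        StableBTreeMap::init(MEMORY_MANAGER.with(|m| m.borrow().get(MemoryId::new(0)))),
    );
}

#[ic_cdk::update]
fn put_note(id: u64, text: String) {
    NOTES.with(|n| {
        n.borrow_mut().insert(id, text);
    });
}

#[ic_cdk::query]
fn get_note(id: u64) -> Option<String> {
    NOTES.with(|n| n.borrow().get(&id))
}
```

Because the data never goes through the heap-to-stable-memory serialization step, the pre_upgrade/post_upgrade hooks can stay (almost) empty, which is exactly what makes upgrades safer.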

Work is in development for Motoko to use stable memory as the main heap.

Comments from the community:

  • Worries about the ability to change the schema of data in stable memory (HotOrNot). There are tricks for doing this (one is sketched after this list), but maybe the libraries should make it as easy as possible, or come with examples of how it can be done?
  • The size limits for data in the ic-stable-structures lib are a problem and should be removed (HotOrNot).
  • More guidance/patterns on using stable memory in Motoko would be helpful (Origyn)
  • We should fix the 10% Wasm-native stable memory regression when reading large chunks (Origyn/Portal)
  • Canister forking or similar could be helpful for sharding data that no longer fits in a single canister (Byron)
  • Generally more examples of non-toy uses of stable memory would be helpful (including data evolution).
  • We need the ability to free stable memory
  • Tooling for testing with large stable memory could be helpful, since it takes a lot of time to load dummy data into a test canister (Origyn)
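
One of the “tricks” for schema evolution mentioned above is to store versioned records and migrate them lazily on read. Below is a hedged sketch under the assumption that records are Candid-encoded before being written to stable memory; the UserV1/UserV2 types are purely illustrative.

```rust
use candid::{CandidType, Decode, Encode};
use serde::Deserialize;

#[derive(CandidType, Deserialize)]
struct UserV1 {
    name: String,
}

#[derive(CandidType, Deserialize)]
struct UserV2 {
    name: String,
    bio: Option<String>, // field added in a later release
}

// Every record is written with an explicit version tag, so old bytes can
// always be decoded and upgraded on the fly.
#[derive(CandidType, Deserialize)]
enum VersionedUser {
    V1(UserV1),
    V2(UserV2),
}

impl VersionedUser {
    fn into_latest(self) -> UserV2 {
        match self {
            VersionedUser::V1(u) => UserV2 { name: u.name, bio: None },
            VersionedUser::V2(u) => u,
        }
    }
}

fn encode_user(user: UserV2) -> Vec<u8> {
    Encode!(&VersionedUser::V2(user)).expect("encoding failed")
}

fn decode_user(bytes: &[u8]) -> UserV2 {
    Decode!(bytes, VersionedUser).expect("corrupt record").into_latest()
}
```

In practice, these encode/decode helpers would sit inside the Storable implementation used by a stable structure, so the rest of the canister only ever sees the latest version.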

Scalability

When developing applications for the IC, multiple scalability limits come into play when choosing between a single- and a multi-canister architecture. Examples are: i) limits on each canister’s storage capacity (48 GiB), ii) the number of canisters supported per subnet (~100k canisters), and iii) the throughput of calls (~970 updates/s per subnet, ~4000 queries/s per node).

Several opportunities for increasing the IC’s scalability have been identified and rolled out since launch. Examples are: i) improvements in the memory subsystem to achieve higher concurrency for update execution, ii) composite queries to allow calls to other canisters at the speed of query calls, and iii) deterministic time slicing to allow long-running messages, which permits more complex application logic in canisters.
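
To make the composite-query improvement concrete, here is a minimal, hedged sketch in Rust (ic-cdk): an aggregator canister fans out query calls to shard canisters and combines the results at query speed. The shard principals and the item_count method are assumptions for illustration, and note that composite queries currently only work between canisters on the same subnet.

```rust
use candid::Principal;

// A composite query may itself call plain queries on other canisters
// (on the same subnet) without going through consensus.
#[ic_cdk::query(composite = true)]
async fn total_items(shards: Vec<Principal>) -> u64 {
    let mut total: u64 = 0;
    for shard in shards {
        // Each shard is assumed to expose `item_count : () -> (nat64) query`.
        let (count,): (u64,) = ic_cdk::call(shard, "item_count", ())
            .await
            .expect("inter-canister query failed");
        total += count;
    }
    total
}
```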

Discussions led to the following takeaways:

  • So far, the IC has been optimized for stability rather than performance, as we haven’t run into any significant performance and scalability limits yet.
  • Several ideas for novel architectural improvements have been discussed. Examples are: read-only nodes to scale up query processing, different subnet types (storage, not-geo-replicated ones, different consensus protocol)
  • Developers would appreciate it if canisters could be made aware of their subnet assignment. At the very least, a way to check whether another canister is on the same subnet would be desirable. For example, this would be useful to figure out whether canisters are in different legal jurisdictions, based on which we might want to reject a call.
  • There were discussions around capability support in the IC and whether capabilities are a useful and easy-to-use abstraction for a wide audience of IC developers. They would, for example, help to “group” canisters into applications, so that permissions can easily be shared between all canisters of the same application. Access control is a significant pain point in multi-canister architectures.
  • “Downtime” of dapps because of checkpointing (every 500 blocks) is a big problem for dapps on subnets with a large number of canisters (OpenChat)

Scalability 2 - Multi-canister architectures

In this session, Hot or Not and OpenChat presented their dapp architectures, as well as their reasoning towards these architectures.

Hot or Not

  • Canister-per-user architecture, because it gives a logical partitioning and the actor model maps nicely to nodes in a social graph.
  • Planning for flexible multi-subnet architecture
  • Upgrades take 1-2 days because they make a backup in a second subnet
  • Use ic-state-machine-tests for testing, including upgrade tests

OpenChat

  • Canister per user and per group chat architecture. They have a multi-subnet architecture in place.
  • UserID is a pair of user canister principal and user principal.
  • Use queues and batches for inter-subnet messaging.
  • User registration: Decide where to register new users based on the size of the user index in each subnet. Check if a user is an Internet Identity principal to prevent bot users.
  • Use peer-js and a signaling server to send messages off-chain as well for lower latency.
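
As a rough illustration of the canister-per-user pattern both teams describe (and not their actual code), the sketch below shows an index canister that maps each user principal to that user's canister and forwards a call to it. All names are illustrative, and a real index would keep the map in stable memory.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;

use candid::Principal;

thread_local! {
    // user principal -> principal of that user's dedicated canister.
    // (Illustration only: a real index would persist this in stable memory.)
    static USER_CANISTERS: RefCell<BTreeMap<Principal, Principal>> =
        RefCell::new(BTreeMap::new());
}

#[ic_cdk::update]
fn register(user_canister: Principal) {
    let caller = ic_cdk::caller();
    USER_CANISTERS.with(|m| {
        m.borrow_mut().insert(caller, user_canister);
    });
}

#[ic_cdk::update]
async fn post_message(text: String) -> Result<(), String> {
    let caller = ic_cdk::caller();
    let target = USER_CANISTERS
        .with(|m| m.borrow().get(&caller).copied())
        .ok_or("unregistered user")?;
    // Forward the call to the caller's own canister.
    let _: () = ic_cdk::call(target, "add_message", (text,))
        .await
        .map_err(|(code, msg)| format!("{:?}: {}", code, msg))?;
    Ok(())
}
```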

Immutable Storage API

Nodes have 10x more storage than is currently used by the protocol. For performance reasons, we can’t simply use all of the available storage; we need to limit the protocol guarantees for that storage in order to use more of it. The proposal was immutable blob storage: writes happen in update calls, reads only in composite queries, content is only available through the canister that wrote the blob, and content gets deleted when the canister runs out of cycles.

Discussion led to the following key takeaways:

  • Not super urgent for the teams, as long as storage costs won’t be dramatically increased soon (Origyn)
  • More interest in shared storage. Access control is not a big concern and can be added at the application layer using encryption (Portal) => data does not need to be served by a canister
  • IPFS integration strategic opportunity because of the pervasiveness in the Web3 ecosystem (Origyn)
  • IPFS integration (ICPFS) where nodes are able to pin content
  • Boundary Nodes used as general IPFS Gateways. This would also allow better caching and a path to the usage of Boundary Nodes as CDNs

Further related feedback from the teams:

  • Uploading large amounts of data is very slow (Origyn, Portal); a sketch of today’s application-level chunking approach follows this list
  • Wasm deduplication would be valuable for teams with large fleets of the same canister that need to be upgraded regularly (OpenChat, Hot or Not)
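
To make the chunking/upload pain point concrete: ingress messages are capped at roughly 2 MB, so today large files are uploaded chunk by chunk at the application level, roughly along the lines of the hedged sketch below (names are illustrative, and a real canister would keep the chunks in stable memory rather than on the heap).

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;

thread_local! {
    // asset id -> ordered list of chunks.
    static ASSETS: RefCell<BTreeMap<String, Vec<Vec<u8>>>> =
        RefCell::new(BTreeMap::new());
}

// The client splits a file into pieces of at most ~2 MB and calls this once per piece.
#[ic_cdk::update]
fn upload_chunk(asset_id: String, index: u64, bytes: Vec<u8>) {
    ASSETS.with(|a| {
        let mut assets = a.borrow_mut();
        let chunks = assets.entry(asset_id).or_default();
        let index = index as usize;
        if chunks.len() <= index {
            chunks.resize(index + 1, Vec::new());
        }
        chunks[index] = bytes;
    });
}

#[ic_cdk::query]
fn get_chunk(asset_id: String, index: u64) -> Option<Vec<u8>> {
    ASSETS.with(|a| {
        a.borrow()
            .get(&asset_id)
            .and_then(|chunks| chunks.get(index as usize).cloned())
    })
}
```

Moving this bookkeeping into the protocol, as requested above, would remove boilerplate that every team currently re-implements.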

Backup and restore

  • There’s an overwhelming interest in support for backups.
  • Most of the support for snapshotting revolves around the use case of recovering from bad upgrades.
  • Having backups at irregular intervals (last month, last week, last day, last few hours) would be a good cadence.
  • There are user-level solutions for backups as well as protocol-level solutions. The community is mostly leaning towards a protocol-level solution.
  • There’s interest in forking (i.e. the ability to spin up a snapshot as a separate canister).
  • The minimum required on the protocol level is the ability to take snapshots of a canister when it is stopped, and forking from an existing snapshot. Besides that, it would be great to allow downloading and uploading the snapshots.
  • There’s interest in a System API call for freeing stable memory.

Internet Computer as CDN

This session had three parts. First, Jesse gave a presentation about Portal. Second, Severin from the SDK team gave a presentation about the asset canister and asset certification. Third, Rüdiger from the Boundary Nodes team gave a presentation about the boundary nodes.

Portal

  • Quick presentation explaining the process required to upload a full adaptive bitrate (4 copies in different resolutions) video onto the IC:
    • Videos are first stored in temporary storage until they can be processed and optimized for long-term storage and delivery
      • Currently the temporary storage is S3, but there were talks about cheap storage solutions e.g. “ICPFS” (IPFS nodes on the IC) which could replace S3 as a temporary storage solution for pre-optimized videos
      • Other solutions would be using standard IPFS (maybe not ideal for ownership-gated videos because of the open nature of IPFS), uploading directly to a canister (non-ideal because uploads are expensive and slow), or using one of the file storage protocols built on top of IPFS such as Arweave, Filecoin or Storj
    • After they’re stored in temporary storage at a publicly accessible URL, the URL can be submitted to the Portal transcoding network (an L2 on the IC), which orchestrates a fleet of nodes. The nodes are specialized in transcoding, chunking, and uploading content onto the IC using FFmpeg and agent-js
    • The video content is then processed and uploaded to the IC in a distributed manner over several canisters and subnets to try and reduce any load on a particular canister or subnet
    • Once the content has finished uploading to a user’s channel, the Portal validator node validates the upload quality and, if acceptable, triggers a payment for the work completed. The validator nodes use staking/slashing to discourage bad actors from participating in the protocol
    • At this point the original video has been processed and uploaded to the user’s channel canister, and the user can begin using the on-chain content in their own apps/dapps just as they would with a video URL from any traditional web2 video provider
  • Roadblocks to being on par with web2 services are the number of boundary nodes and the lack of caching, or more generally HTTP response times in geographic locations such as Japan and Australia (which have a low concentration of replica nodes and boundary nodes)
  • In terms of price, the IC can offer a cheaper content-delivery solution than traditional CDNs as long as the upcoming query charges are reasonable (they must be much lower than current CDN prices to offset the much higher cost of storage and uploading to the IC), performance continues to increase to achieve parity with traditional CDNs, and storage costs continue to decline as new solutions and optimizations emerge

Asset Canister

Asset certification v2 has recently been introduced. The main improvements over v1 are the ability to certify headers, not only the body, and the possibility of selectively certifying only parts of a response. It is included in asset canisters starting at dfx 0.14.0.

Libraries that make certification easier are in the works: Motoko Server (like express.js, but for Motoko) has already been released in an initial form, but only supports certification v1 so far. A library for convenient v2 certification is in the works for Rust.

Discussion/Take-Aways:

  • The most common sources of non-matching certificates are fallen-behind replicas or mismatched system time
  • Supporting additional encodings (e.g. brotli) is not hard, but will increase the service worker size, which impacts all dapps
  • Streaming should be (partially) possible soon. Range headers are not entirely solved.
  • Feature requests
    • Access control for asset canister. Not on the roadmap, may be left for community implementations
    • Expiration for assets. Not on the asset canister roadmap, may be a better fit for something like Motoko Server

Boundary Nodes

The Boundary Nodes are the Gateway to the IC. They offer two endpoints: (1) the API endpoints and (2) the HTTP endpoints serving the service worker or going through icx-proxy.

The API endpoints are defined in the IC’s interface specification. The Boundary Nodes do not apply any filters on the API endpoints, but enforce very moderate rate-limits to protect the replicas; exceeding them results in a 429 status code. Query responses are cached for one second.

The HTTP endpoints serve the service worker or use icx-proxy to translate HTTP requests into API calls. The Boundary Nodes do perform content filtering on these endpoints according to DFINITY’s code of conduct.

When users access content on the IC, they are first routed to the closest Boundary Node (DNS-based). The Boundary Node then forwards the request to a random replica.

Currently, the team is working on decentralizing the Boundary Nodes by splitting them into two entities, the API Boundary Nodes and the HTTP Gateways.

Discussion/Take-Aways:

  • Geographical distribution and number of Boundary Nodes - To provide better performance for users, it would be desirable to have more Boundary Nodes better spread across the entire world. As we decentralize the Boundary Nodes and onboard new Node Providers in other regions, we will most likely also have better geographical coverage for the API Boundary Nodes. (Portal)
  • Caching - To provide better performance and reduce the number of queries hitting replicas, Boundary Nodes should provide extensive caching. Caching is on our Roadmap, but needs to wait for improvements in the Boundary Nodes. (Portal)
  • Analytics - A canister has no information about the client besides the caller, which in many cases is the anonymous identity. It would be great to provide further information (e.g., IP address, user-agent). We realize the importance and have it on our wishlist. Currently, this is very difficult as information is either not known where it could be added to the payload (e.g., the service worker does not know the client’s IP address) or cannot be added (e.g., the Boundary Node cannot modify the request as it is signed by the caller). (Origyn)
  • One-Shot HTTP Upgrades - For analytics, it would be great to enable one-shot upgrades that immediately return (like a query), but allow mutating some state, for example, for statistics. We took note of the request, but currently this is not on our roadmap.

Queries & Query charging

Queries are an important part of the IC: their execution is much faster since they are executed by a single node in a non-replicated fashion and thereby avoid the cost of reaching consensus between nodes.

Queries have recently received multiple improvements. For example, there is now one query input queue per canister instead of a global queue shared by all canisters. This provides fairness and better locality.

There will soon also be support for caching queries in the replica. Compared to caching in the boundary nodes, replicas can invalidate the cache whenever the state of a canister has changed, and therefore avoid unnecessary periodic eviction of the cache.

Another feature related to query processing is charging for query calls: it increases fairness as available hardware resources of IC nodes are assigned to query processing proportionally to how much a canister is charged.

Discussions led to the following takeaways:

  • Manipulating query charges does not seem like an attractive attack vector for potentially malicious node providers. Because query stats from many machines are aggregated over the course of an epoch, it is not possible for malicious replicas to significantly impact cycle charges.
  • Appending query statistics to blocks does not affect consensus.
  • For protection against bots, the sender IP address might be a useful input to the rate limiter. The W3C is working on trust tokens; how about having update calls produce trust tokens and query calls consume them? Capability-based access control should also be considered.
  • It might be useful to produce some metadata during query calls (e.g. how often an NFT has been viewed). A similar approach as aggregating query charges might work for this generalized use case as well.
  • Developer-defined instruction counters provided by the system API might also be useful.
  • Query chunking would be useful for many developers.

Encryption and Data Privacy

Everyone knows that data on the IC is not fully public like it is on other networks, but there are different views on the level of the existing privacy guarantees. We saw a table distinguishing the privacy levels for the different types of IC data (ingress messages, state, key material) and the different entities that handle it (canisters, nodes, network). We also saw the motivation for why we should want encryption and what the issues with existing solutions are: key management is difficult, especially for server-side encryption.

We then introduced the vetKeys solution by giving a high-level view of how the scheme would work, followed by different extensions/use cases for vetKeys, namely encrypted file storage and retrieval (symmetric keys), E2EE messaging (public keys / IBE), and timelock / MEV protection (encrypting to the ~future). Finally, we saw the first WIP demo of this feature. Three canisters were shown: a canister stubbing the system API of the new functionality, a canister mocking the front end, and the back end of an encrypted file storage dapp. The demo showed a key derivation, a key derivation with file/message encryption, and decryption.

Discussion led to the following key takeaways:

  • There are remaining questions about pricing (similar or lower than tECDSA).
  • There is demand for multiple demos and for them to be open sourced.
  • Storage overheads are to be considered and depend on the encryption scheme used.
  • The most motivated ‘consumers’ of vetKeys right now are user-centric dapps (e.g. OpenChat), businesses (e.g. Canistore, for GDPR reasons), and node providers (to reduce liability).
  • There are interesting use cases bridging software (vetKeys) and hardware (SEV).

Partnerships between the teams

The main discussion point was integration between dapps on the IC, and integration with OpenChat in particular. Most teams would like to use OpenChat as an API, not as an iframe or widget. Having to log in to an app and then log in to OpenChat inside that app again is a no-go for UX reasons. Most teams would also like to take their chats from dapp to dapp. Internet Identity makes these things hard at the moment.

The second discussion point was about funding infrastructure projects. Infra is about solving (current) inefficiencies of the platform, so as an infrastructure project there is always the fear of eventually getting eaten by the platform. Ideas for funding:

  • List of verified builders on the IC. DAOs can airdrop shares to builders or provide discounts (Anvil)
  • Social norm to reserve parts of an SNS for the infrastructure builders
17 Likes

Have there been recent improvements for update calls? Because I remember that not long ago the average was ~1k updates/s per subnet.

1 Like

I seem to remember that the 9700 number came from a load test across the whole of the IC. Around 1K updates per second per subnet is indeed more realistic.

1 Like

Maybe @stefan-kaestle wants to chime in.

I think the 9700 updates/s per subnet is a typo.
It should be 970 updates/s, which is just short of the 1000 messages/block limit currently set in consensus.

3 Likes

Alright, makes sense. Corrected.

2 Likes

Cool, thanks for chiming in. Are there any particular reasons why the limit is currently set to 1k? Are there any ongoing efforts to raise the limit?

I’m not on the consensus team, so please take this with a grain of salt 🙂

I think it basically boils down to the physical limits of geo-replicated subnetworks. For a block to be signed, each node in the subnet needs access to all ingress messages proposed in the block so that it can verify its content.

Currently, it takes several round-trips until artifacts arrive on all nodes. If we allowed larger blocks, or more ingress messages per block, we would likely increase the probability that a block does not get signed by enough nodes to be finalized.

As far as I know, we have currently simply not seen a need to increase this number further and have opted to keep current limits on the low side to increase the probability that blocks get finalized.

If there was ever a need to reach higher ingress message input rates, we could:

  • Explore increasing this limit further, perhaps while keeping the existing limits on the block size in bytes
  • Optimize the way ingress messages get gossiped/broadcast to all nodes in a subnet
  • Offer more localized subnets, where the round-trip time is lower and we should be able to increase the block rate, while keeping the limits on the blocks as they are now.
2 Likes

Isn’t the 970 updates/s per subnet limit there to prevent DDoS attacks?

The response to DDoS attacks is rate-limited at the IP and subnet level.
However, my understanding is that on Ethereum and similar networks, gas fees are what contain DDoS attacks, whereas ICP does not seem to be able to do that because of its reverse-gas model.
Since DDoS attacks themselves are not prevented, please let me know whether these rate limits are enough to contain DDoS attacks, or whether there is a loophole that could be attacked, and how DFINITY is handling this area.

AFAIK, the 970 updates/s is a consensus limit given the current distribution of node machines in subnetworks. I don’t think DDoS protection was the reason it got introduced.

The boundary nodes have some rate limits as well to protect against DDoS attacks. These also take IP addresses and connections into account for rate limiting, which the IC itself does not.

1 Like

There are a couple more reasons for having a limit on ingress messages.

One is the size of the ingress history. Ingress messages in terminal states (replied or rejected) have a time-to-live of 5 minutes. At 1K ingress messages per block and about one block per second, that's up to 300K ingress messages in terminal states. Each of those can be up to 2 MB in size, meaning 600 GB (out of a total of 512 GB physical memory) allocated to ingress history only. (There’s actually a significantly lower limit on ingress history size; replied or rejected messages will be transitioned to done, and their payloads dropped, whenever this limit is reached. But at that point, under full load you only have something like 10 seconds to query for the status of/response to your ingress message before the response gets dropped and replaced with done.)

The other reason for having a limit is execution throughput. Execution can handle more than 1K updates per round (as long as they don’t do significant computation), but that must cover both ingress and canister messages. And simply accepting multiple thousands of updates per second only to time them out after 5 minutes is not particularly useful.

1 Like

Understood, thanks for the insight. It seems there is more than one bottleneck constraining a subnet’s throughput, so my next question would be: is there a set of protocol improvements being worked on, planned, or even under initial consideration for increasing the TPS? And if so, by what order of magnitude?

Because if it’s a metric that can only be improved by a small margin, scaling big apps on the IC will prove very challenging, if not impossible. CEX order-matching engines alone can process tens of thousands of transactions per second, while the IC in its current state has ~35k updates/s as a theoretical maximum, and that assumes all existing subnets are used by a single dapp, which isn’t always possible given data dependencies; even when it is, it comes with greater architectural complexity and still wouldn’t be a very cost-efficient use of resources.

1 Like

A subnet is a replicated virtual machine. With the key words for our purpose being “a machine”. As in “one machine”. And like a physical machine, it can only handle so much load before it hits a wall.

So the way to scale applications on the IC is sharding (i.e. horizontal scaling). Which is exactly how a CEX scales: you shard your data among N servers (or N subnets, in the case of the IC) and each of those shards handles 1/N of the load.

There are a number of directions in which the IC may be made to scale:

  • In terms of subnet throughput, more efficient orthogonal persistence and storage should allow for higher parallelism (i.e. more than 4 canisters running in parallel at a time).
  • With regards to scaling out, we need higher bandwidth between subnets. A physical machine has a network adapter with something like 10 Gbps; and with the right network architecture, all machines in the same data center can send and receive the full amount concurrently among themselves. The IC is currently limited to a few MB/s (around 50 Mbps), so it’s 2-3 orders of magnitude off. To address this, we are looking into immutable storage/IPFS (see OP); and how to make use of the storage infrastructure to significantly improve subnet bandwidth (e.g. by having clients upload some piece of content to IPFS and only including the CID (content hash) as opposed to the full content into the ingress message; same with (some) canister messages; this would allow for tiny blocks pointing to arbitrarily large payloads).
3 Likes

Quick question - are there any plans for lifting the application payload size limit for canister responses?

I’ve hit the limit when returning a large struct which has several vecs within. What’s the best way to chunk a response?

Not a massive issue for 221Bravo app… the problem only really hits accounts with over 100k transactions.

Thanks,

Nathan.

The canister message (request and response) payload size is limited by subnet block size (4 MB block, 2 MB payload). Block size is limited by the target subnet latency: you cannot get one block per second if you have to gossip more than a few MB around the world in that time.

So, unlike the subnet memory size, which keeps getting bumped, the payload size limit cannot simply be raised, except possibly for subnet-local messages (where it would still complicate things).

We have discussed ways to lift this limit entirely, by having a side channel for payloads. E.g. subnet A or user U upload a payload into IPFS; subnet B has every replica pin that payload; when all or most of subnet B replicas have the payload, it can be processed in replicated executions.

But there are a lot of edge cases: what if you need 2 such payloads to process a block and 2/3 of replicas have one and another 2/3 of replicas have the other? what if the same 2/3 of replicas have all payloads, but then the remaining 1/3 fall behind and stay behind because they’re all in locations with limited bandwidth? So we keep thinking about this, but it’s not as trivial as making a few optimizations here and there and then bumping the limit.
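
Until something along those lines exists, the usual workaround is application-level paging: the canister returns one page per query, sized to stay well under the 2 MB limit, together with a cursor for the next page. A minimal, hedged sketch (the Tx type, page size and method names are illustrative, not 221Bravo's actual API):

```rust
use std::cell::RefCell;

use candid::CandidType;

// Illustrative record type; a real transaction type will differ.
#[derive(CandidType, Clone)]
struct Tx {
    amount: u64,
    memo: String,
}

#[derive(CandidType)]
struct Page {
    items: Vec<Tx>,
    // Pass this back as `start` to fetch the following page; None means done.
    next_start: Option<u64>,
}

// Chosen so that a full page stays comfortably below the ~2 MB response limit.
const PAGE_SIZE: usize = 1_000;

thread_local! {
    static TXS: RefCell<Vec<Tx>> = RefCell::new(Vec::new());
}

#[ic_cdk::query]
fn get_transactions(start: u64) -> Page {
    TXS.with(|t| {
        let txs = t.borrow();
        let start = (start as usize).min(txs.len());
        let end = (start + PAGE_SIZE).min(txs.len());
        Page {
            items: txs[start..end].to_vec(),
            next_start: if end < txs.len() { Some(end as u64) } else { None },
        }
    })
}
```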

3 Likes

Update on one of the key topics from the ICP.Lab on Storage & Scalability: Canister backup and restore [Community Consideration]

1 Like

Following on from the topic of large amounts of data (I mean in the tens of PBs) being needed to feed IC applications: are there any other, more elaborate efforts or write-ups on making some sort of IPFS (or another method) useful in the IC space?

Or is this the thread for that topic?

I’m asking because we are building big-data oracle services that should not rely on the cloud for end users (the cloud can help feed the data stack… but it shouldn’t be the source for IC users, for security and reliability reasons).