Do we still need BigMap / NoSQL database?

Yes! Byron is doing some great work on this, we recently had a number of entries for our QuickStart bounty that came at it from different angles, and Sudograph is coming along as well.

In my opinion, we need a Manhattan Project style effort, like what is happening with the ledger right now, to settle on an infinitely scalable certified data structure standard. Users should be able to power up a canister, do the following, and never have to think about data again, other than topping off cycles to pay for the storage.

    let mydata : MetaTree = new MetaTree(a_storage_canister);
    mydata.put(collection : namespace, key : variant, value : variant);
    mydata.index(collection, key, index_type, indexerfunc);
    mydata.get(collection, search);
    mydata.getCertificateRoot();         // root hash to feed into certified data
    mydata.getWitness(collection, key);  // Merkle proof for a single entry
    mydata.validate(witness);
    mydata.onUpdate(some_function_that_lets_me_certify_my_data_with_other_data_in_my_canister);
6 Likes

@icme @lastmjs - thanks for the input.

Here’s where I’m coming from on this topic:

  1. It’s great to see that the community is coming up with various implementations of a DB framework.
  2. As a product manager, I’m interested in learning more about the developer use cases and the demand. I want to know what types of apps will be easier to build if we had a database framework.
  3. Since building a true DBMS on the IC is challenging today because of platform constraints, I’d like to build a better mental map of the potential developer benefits to make a good case for underlying platform improvements.

Does it make sense?

@skilesare Thanks for the ideas!

What would this code do?

Create a Merkle tree of your data so that you can certify it and return it with queries to applications that want to know they can trust the data. You can use the output for CertifiedData | Internet Computer Developer Portal
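
For anyone new to this, here is a minimal Motoko sketch (not the MetaTree API above, just an illustration) of wiring a root hash into the system certificate with the base-library CertifiedData module; `rootHash` and `recertify` are placeholder names for whatever your Merkle tree actually produces:

    import CertifiedData "mo:base/CertifiedData";
    import Blob "mo:base/Blob";

    actor {
      // Placeholder: in practice this is the root hash (<= 32 bytes) of a
      // Merkle tree built over your application data.
      stable var rootHash : Blob = Blob.fromArray([]);

      // Call this from every update method that changes the data, so the
      // subnet signs the new root as part of the certified state.
      func recertify() {
        CertifiedData.set(rootHash);
      };

      // Queries return the certificate alongside the data so clients can
      // verify it against a witness for the requested entry.
      public query func readCertificate() : async ?Blob {
        CertifiedData.getCertificate()
      };
    }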

edit: To be blunt, all the apps we are writing are just web2 wannabe apps until we all start certifying our data, and the APIs for doing this right now are very underdeveloped and lack good examples of why it is necessary/useful.

6 Likes

I think that scalable storage architecture is better formed on a case-by-case basis, based on the specifics of a system/dapp. BigMap-like libraries can cover many use cases, but as an example, the ledger block storage and the OpenChat user storage use different scaling techniques. OpenChat gives each user or group their own canister and keeps a map of user -> user-canister. The ledger stores blocks in archive canisters in a linear way, only creating a new archive canister when the previous one is full.
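
As a rough illustration of the OpenChat-style technique (the names and actor class below are hypothetical, not OpenChat’s actual code), an index canister can hold the user -> user-canister map and create a new bucket canister per user from an actor class:

    import Principal "mo:base/Principal";
    import Trie "mo:base/Trie";
    // Hypothetical actor class holding a single user's data.
    import UserBucket "UserBucket";

    actor Index {
      stable var buckets : Trie.Trie<Principal, Principal> = Trie.empty();

      func key(p : Principal) : Trie.Key<Principal> {
        { hash = Principal.hash(p); key = p }
      };

      // Return the caller's bucket canister, creating one on first use.
      public shared ({ caller }) func bucketFor() : async Principal {
        switch (Trie.find(buckets, key(caller), Principal.equal)) {
          case (?canisterId) canisterId;
          case null {
            let bucket = await UserBucket.UserBucket(caller);
            let canisterId = Principal.fromActor(bucket);
            buckets := Trie.put(buckets, key(caller), Principal.equal, canisterId).0;
            canisterId
          };
        }
      };
    }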

1 Like

In short, I definitely still want and need a BigMap.

Speaking from a Motoko developer’s POV, the status quo is insufficient for a variety of reasons:

  • Stable variables are the easiest way to persist data across canister upgrades. Currently, stable variables are limited by the size of the canister heap rather than the size of stable memory, because Motoko needs to serialize and deserialize stable data through the heap. The heap is currently 4 GB, and in practice the available heap is lower due to GC constraints (which are in the process of being lifted). 4 GB is not a lot of space.

  • To access the full 8 GB of stable memory, a developer can use the ExperimentalStableMemory library (see the sketch after this list). However, it is very cumbersome to program with right now, as you can’t use any of the off-the-shelf Motoko standard library data structures (like Trie) if you go that route. Also, the library is marked as experimental and may change at any time, which makes it unsuitable for production.

  • Applications that store images or video will very quickly hit the 4 or 8 GB limit. There was talk about specialized storage subnets, but there have been no updates on that front. I anticipate that many teams will plan on using a BigMap as a blob store for large assets like video, which may have unintended implications if not carefully designed.

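To make the second point concrete, here is a rough sketch, under the current API, of an append-only blob store on top of ExperimentalStableMemory; offsets and growth are tracked by hand, which is exactly the bookkeeping a higher-level data structure would otherwise do for you:

    import ExperimentalStableMemory "mo:base/ExperimentalStableMemory";
    import Nat64 "mo:base/Nat64";

    actor BlobStore {
      let pageSize : Nat64 = 65536; // Wasm page size in bytes
      stable var next : Nat64 = 0;  // next free byte in stable memory

      // Append a blob, growing stable memory as needed, and return its offset.
      public func append(b : Blob) : async Nat64 {
        let offset = next;
        let needed = next + Nat64.fromNat(b.size());
        let pagesNeeded = (needed + pageSize - 1) / pageSize;
        if (pagesNeeded > ExperimentalStableMemory.size()) {
          ignore ExperimentalStableMemory.grow(pagesNeeded - ExperimentalStableMemory.size());
        };
        ExperimentalStableMemory.storeBlob(offset, b);
        next := needed;
        offset
      };

      public query func read(offset : Nat64, len : Nat) : async Blob {
        ExperimentalStableMemory.loadBlob(offset, len)
      };
    }
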
I am building a consumer-facing dapp intended for a mass market, and I worry a lot about hitting memory limits. I currently use Motoko standard library data structures (like Trie, RBTree, List) in a single canister, but that won’t be nearly enough. A database (or blob store) solution would be huge. In fact, the vision of BigMap is one of the more compelling sells of the IC developer ecosystem.

It wouldn’t necessarily lower the bar for new developers trying to prototype something on the IC, but it would massively lower the bar for more experienced developers trying to build a scalable, productionized web3 dapp on the IC.

8 Likes

I would say that most cloud developers coming from web2 expect not just scalability, but also to have the complexities of that scalability abstracted away, so that they can focus less on infrastructure and more on the business logic of their applications. The current architecture works for isolated or individual-centric applications like blogging or drive storage, but it will struggle to support the web3 generation of social media applications.

If I’m coming from web2 and have built a web application geared towards hundreds of thousands to millions of users, I’m used to Firebase/Firestore, AWS, GCP, Azure - data storage that scales and API gateways that load balance, throttle, and auto-scale without breaking. Coming to the IC from the web2 world is a bit like running straight into a wall. Part of this is because the DevX (developer experience) of web2 managed services is just that good, but the other part is the lack of ready-to-go IC solutions that developers can use to build their own automation and management tooling on top of. I think this will come in time as developer-built tooling matures, but for many developers (like myself) who were originally attracted to the IC by the promise of “infinite scalability” and “BigMap”, it was a bit of a letdown to realize that this abstraction did not yet exist and we had to build our own data storage scaling solutions from scratch.

As @jzxchiang mentioned, a developer looking to prototype an application can move pretty quickly if that application is contained in a single canister - one can go pretty far with 2-4 GB, but this eventually becomes a liability. Not just in terms of data storage, but also in terms of application throughput - a canister can only handle so many query and update calls per second.

This means that even in a multi-canister application, if a single canister is responsible for running a pivotal part of that application, a high-traffic or denial-of-service event could shut down most of, if not the entire, application. The start of some work to throttle requests with inspect_message should help, but there’s still quite a bit of work to be done.
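
For readers who haven’t used it yet, message inspection in Motoko looks roughly like the sketch below: a `system func inspect` that cheaply rejects unwanted ingress update calls before full execution (the method name and limits are just illustrative assumptions):

    import Principal "mo:base/Principal";

    actor {
      public func postMessage(_text : Text) : async () {
        // ... application logic ...
      };

      // Runs before an ingress update call is accepted; returning false
      // drops the call instead of paying for full execution.
      system func inspect({
        caller : Principal;
        arg : Blob;
        msg : { #postMessage : () -> Text }
      }) : Bool {
        not Principal.isAnonymous(caller) and arg.size() <= 4096
      };
    }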

This throughput concern, plus the limitations of the current state of inter-canister calls, makes common architecture designs such as the API Gateway pattern infeasible for the majority of application use cases, and has forced many developers to embrace client-centric multi-canister architectures when designing their applications. While this approach to scaling will “work”, it is not ideal to rely on the client so heavily, as there is a limit on the amount of work the client (browser, phone app, etc.) can be expected to do and the amount of data it can be expected to keep in local storage. Exposing the client directly to a wide variety of microservices also has security implications, as it widens the attack surface of a particular application, whereas with the API Gateway pattern a developer just has to secure a single point of entry.

I believe improvements in the following areas would remove some of these barriers.

  1. Introduce a “load-balancing” scalability option for canisters that are pivotal in a multi-canister architecture, such as indexing, management, or transactional canisters that may need to handle high query or transactional (update) volume.
  2. Start long-term research into solving the performance issues with inter-canister calls, which could potentially mean re-architecting how inter-canister query and update calls work on the IC and breaking backwards compatibility. I think the IC can go pretty far with client-centric architectures, but when the data gets large enough, the client will eventually be overwhelmed without the capacity to do SSR or pre-computation with tools like MapReduce on the backend. I don’t think that a patchwork solution to inter-canister query and update calls will suffice - we’d just be kicking the can down the road.
  3. Efficient forking/cloning of canister code and state/data would allow developers to easily re-partition the data in those canisters to scale out when a canister fills up, and would lessen the workload of the client application.
  4. Improve heartbeat. Building managed services on top of web3 infrastructure will eventually require some sort of cron-job/heartbeat usage. Currently, heartbeat is pretty expensive for developers to use, as it executes every round whether or not there is work to do (see the sketch after this list). DFINITY has suggested that the community build shared heartbeat canisters to “share the cost”, but this involves trusting other parties or entering into a DAO/SNS, which is way too much complexity for a simple cron job.
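
To illustrate point 4: the usual workaround today is to gate the real work behind a counter inside `heartbeat`, but the canister is still charged for the heartbeat invocation on every round, which is exactly the cost problem described above (the interval below is an arbitrary assumption):

    actor {
      stable var beats : Nat = 0;
      // Roughly one beat per consensus round (about one per second on many
      // subnets), so this runs the job about once an hour. The empty beats
      // still cost cycles.
      let interval : Nat = 3600;

      system func heartbeat() : async () {
        beats += 1;
        if (beats % interval == 0) {
          await runScheduledJob();
        };
      };

      // Hypothetical periodic work, e.g. pruning or batching.
      func runScheduledJob() : async () {
      };
    }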

Also, just speaking for myself, I would appreciate it if the marketing for the Internet Computer product and its presence as a new IT/cloud stack shifted to speak to the strengths that currently exist, rather than using buzzwords like “infinite scalability”. Nothing scales infinitely (not even the centralized cloud providers), so providing specific metrics and data will give developers the confidence they need to develop scalable designs and architectures for their applications.

13 Likes

I think this would be a great opportunity and time to start an “Infrastructure and Scalability” developer working group

9 Likes

I don’t know if people really want BigMap or a NoSQL database.

I suspect that what people actually want is the freedom to write code using whatever data structures they deem appropriate for the task at hand, and that they would love to never have to think about the limits imposed by the implementation details of canisters or the IC.

Some of the original differentiators for development on the IC were things like “write code as if it will live in memory forever”, “no databases”, and “infinite scalability”.

If this could be achieved through the right set of abstractions then that would be a truly amazing experience.

Perhaps for developers this would mean building an app using the otherwise naive “single canister” architecture, while under the hood any complexity involved in addressing scaling or other limitations is taken care of.

7 Likes

@paulyong I want to give some thoughts on what you said.

I think you’re basically correct, or at least in line with my thinking. As in, I think DFINITY/Internet Computer should provide basic stable/scalable data structures. Then developers can take those data structures and do what they need.

So for simple projects people might get away with just using a BigHashMap, BigBTreeMap, etc., that scales out and automatically persists across calls and upgrades.
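
Purely as a sketch of the shape such a library might take (nothing like this exists today, and every name here is hypothetical), the interface could look like an ordinary map whose operations happen to be async because they may fan out across canisters:

    // Hypothetical interface only: each call may be routed to one of many
    // backing canisters, which is why every operation is async.
    type BigBTreeMap<K, V> = {
      put : (K, V) -> async ();
      get : K -> async ?V;
      remove : K -> async ?V;
      // Range scan, the main reason to prefer a BTree over a hash map.
      range : (K, K) -> async [(K, V)];
    };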

But I do believe “databases” are necessary, as in data structures that have powerful querying, relation, and other CRUD functionality. That’s what Sudograph provides. Many apps will need this basic functionality, but it should all be buildable with underlying “Big” data structures.

Like right now Sudograph is just a bunch of Rust BTreeMaps under the hood. If I could just swap them out for BigBTreeMaps then we could instantly scale across canisters (hopefully). Sudograph doesn’t have indexing yet, but the basic concept should work pretty well I’m thinking.

So I don’t agree that all we need are the scalable data structures, but I do believe that DFINITY should focus on creating those and then let devs build abstractions on top.

7 Likes

The point I was trying to make is that the currently proposed solutions (BigMap/NoSQL db) appear to be mixing the concerns of data structures with otherwise orthogonal limitations.

I suppose I am trying to encourage the exploration of other solutions, perhaps at the system level, that would work more generally than providing BigMap instead of Map, BigBTreeMap instead of BTreeMap, etc.

Otherwise, everyone will have to continue to think in terms of size/time limits that are unique to the IC instead of being freed up to innovate on what is unique about what they’re building.

Maybe that’s not realistic, or at least not any time soon, but I think it would be a shame not to explore :slightly_smiling_face:

7 Likes

In that case, the most transparent and flexible solution is to simply wait for 64-bit wasm IMO. Canisters would then have a practically limitless address space upon which they can use any data structure they’d like; the replica would handle the messy details.

I think the main issue with that is that we might need to wait a long time.

1 Like

Throughput would still be limited by single canister performance tho.

5 Likes

If ICP is truly going to run everything on the blockchain, we are going to need storage services that compare to the likes of AWS and other cloud providers performance-wise. Perhaps DFINITY could build a high-performance storage subnet service that uses the blockchain but with far fewer replications.

1 Like

We have actually written this in our IS20 standard but have decided against it because we have not written an implementation compatible with stable storage.

I remember reading a post from a DFINITY dev 7-8 months ago stating that in the future we’d be able to configure the desired replication factor, so one could potentially build an app (or some parts of it) that runs on a single machine, with the ability to make it decentralized at the click of a button. Unfortunately, I can’t find the post again, so it might just be my mind playing tricks on me.
It’d be an interesting solution to explore: not everything needs to be replicated across 10-20 nodes, and having the possibility to choose would unlock lots of new options for developers.

I have not read every post in this thread, but I can speak from a DSCVR perspective.

I believe better control over our stable storage would help us quite a bit. Right now, writing the entire DSCVR state to stable storage fails because we hit the max cycles per request; this happens in pre_upgrade. Back when we could still get through pre_upgrade and load the state in post_upgrade, we could not copy all the objects out of that state and rebuild our indexes in one go. We instead had update functions that would load the state and then progressively chunk-copy the data we needed from the stable storage state into the desired objects (i.e. restoring indexes and content).

Currently our state is too big to write to stable storage in one pre_upgrade, so we can’t even get to post_upgrade. I know we could solve this problem if we had libraries in place that worked like traditional file systems and allocated regions of stable storage that are written to progressively… like storing stale data in a specific region of stable storage in chunks and having an object that keeps track of which regions that data was stored in. Just to note, about 500 MB of data written to stable storage is the limit we are seeing. I also understand that we could potentially improve our data models, but I believe we will eventually hit a limit on the maximum amount of optimization.

Regardless of whether we need BigMap, NoSQL, or SQL, having something like ICFS would be a predecessor to those types of tools. Data still needs to be marshalled, and building simple abstractions on top of the current APIs will just make this process easier. We get 8 GB of storage now, and eventually we will get 300 GB? I could imagine using a concept like drive volumes: dedicated regions of stable storage (which can be resized) that hold data for us and are progressively updated, while we repurpose pre_upgrade/post_upgrade for our highly transactional data, which makes up about 10% of DSCVR’s state.

    stable_save(object, region, index)

  • Object: the value being serialized and stored
  • Region: the region of stable storage the object is being stored in
  • Index: the index within the region we are writing it to (not sure about this one)

I could imagine having a master file table that tells me that object IDs 1 → 1,000 are stored in Region A at Index 5. These blocks of similar data are ~50 MB in size and can be quickly retrieved, mutated, and stored again.
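
A very rough sketch of that master-file-table idea on top of today's API (the record layout and names are my assumptions, not DSCVR's actual code): keep a small stable table recording which offset and length each chunk occupies, and read chunks back with ExperimentalStableMemory so they can be deserialized, mutated, and rewritten:

    import ExperimentalStableMemory "mo:base/ExperimentalStableMemory";

    actor {
      // One entry per chunk of serialized objects ("object IDs 1-1,000 are
      // stored in Region A at Index 5").
      type Extent = {
        region : Text;   // logical volume name
        index : Nat;     // position within the region
        offset : Nat64;  // byte offset in stable memory
        size : Nat;      // chunk length in bytes (~50 MB per chunk)
      };

      // Master file table, persisted across upgrades.
      stable var mft : [Extent] = [];

      // Load one chunk so it can be deserialized and its objects restored.
      func loadChunk(e : Extent) : Blob {
        ExperimentalStableMemory.loadBlob(e.offset, e.size)
      };
    }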

Also, this would empower map-reduce: through a series of update calls we could process the stored data and modify the canister’s state according to rules created in future updates.

7 Likes

Deterministic Time Slicing (DTS) should be able to help here without additional application-level logic.

3 Likes

DTS is indeed my number one most needed feature right now. It will allow a lot of interesting tooling and solutions to spawn in the IC ecosystem.
Let’s hope it comes soon :crossed_fingers: :crossed_fingers:

I think progressively using stable storage is the way to go, even when DTS comes out. DTS will be great for map reduce functions and a bunch of other things.

We talk about Big in this thread, but can someone define Big? 10GB, 50GB, 100GB, 1TB, 1ZB? File storage, NoSQL with a single primary key, and SQL all have massively different implications when dealing with scale.

With file storage (images, videos, blobs), horizontal scale becomes easy to conceptualize with a multi-canister approach. With NoSQL and a single primary key it becomes harder, and complex relational databases are harder still.

I think we should master the single canister before trying multi-canister with complex data systems.

2 Likes