Do we still need BigMap / NoSQL database?

Hey fellow IC developers,

I’m interested in learning your opinion on whether it would be valuable for DFINITY to implement a framework for managing large amounts of structured data.

In the past we had plans to implement something called BigMap; however, that project was never released in its original scope. For the purposes of this discussion, I will call the new feature a “NoSQL database” to allow some flexibility in how we define it going forward, instead of focusing on the past.

Since I joined DFINITY last month, I’ve been trying to work out whether this is still a desirable feature or something the community has moved past.

Today, I’m looking for your feedback on this topic. Here’s a list of warm-up questions, but feel free to add any relevant context:

  • How would a NoSQL database help you in your development?
  • Are you already hitting the 8GB size limit? If yes, what does your app do?
  • Is it only about the 8GB limit, or are there other benefits?
  • What types of new apps would you build if the database was available?
  • Would it lower the bar for developers to get started on the IC?

Best regards,
Mikhail Turilin
Product Manager
DFINITY foundation

Here’s a list of previous discussions around this topic:

Here’s a list of various implementations and prototypes of past concepts:

4 Likes

I think databases in general are needed. Even with orthogonal persistence making things easier, a DBMS offers many features, such as query languages, procedures, access control, etc., which we currently have to implement from scratch, losing a lot of dev time on an inferior version (both in terms of features and security).

@icme is developing something similar to BigMap, so he might have more to say on the matter.

1 Like

@Zane Thanks for pulling me in

@mikhail-turilin My grant project, CanDB, aims not just to replicate BigMap but to build upon it by providing generic and flexible NoSQL query patterns on top of multi-canister, auto-scaling key-value storage.

I showcased both the API concept and the multi-canister design in the DFINITY developer Discord community call last month, and I’m currently building out my multi-canister POC, which I’m hoping to release as an alpha version in the next 1-2 months.

If you have any questions I’d be happy to set up some time to chat and dig a bit deeper offline. You can DM me on the forums, or at @canscale on the DFINITY developer Discord.

4 Likes

A vertically and horizontally scalable database is absolutely necessary for the IC to succeed in its vision. Demergent Labs is also pursuing this with Sudograph, which is a general purpose db that uses GraphQL as its query language. We will be going multi-canister hopefully within a few months. Beta is live now: GitHub - sudograph/sudograph: GraphQL database for the Internet Computer

I think DFINITY should focus on releasing the scalable primitives that we need as developers or library authors. For example, StableBigMap, BigMap, and other stable/big versions of fundamental data structures. This would IMO provide the greatest flexibility for database developers to build on top of.
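
For flavor, here is a purely hypothetical sketch of what such a primitive might look like to a Motoko library author — every name here is made up, since none of these primitives exist today:

    // Hypothetical: a DFINITY-provided scalable map primitive as a
    // library author might consume it. Operations are async because
    // an entry may live in a different canister. Nothing here exists yet.
    module {
      public type BigMap<K, V> = {
        get : K -> async ?V;
        put : (K, V) -> async ();
        delete : K -> async ();
        size : () -> async Nat;
      };
    }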

8 Likes

Yes! Byron is doing some great work on this, we recently had a number of entries that came at it from different angles for our QuickStart bounty, and Sudograph is coming along as well.

In my opinion, we need a Manhattan Project-style effort like what is happening with the ledger right now to settle on an infinitely scalable certified data structure standard. Users should be able to power up a canister, do the following, and never have to think about data again other than topping off cycles to pay for the storage.

    let mydata : MetaTree = MetaTree(a_storage_canister);
    mydata.put(collection, key, value); // collection : Namespace, key/value : Variant
    mydata.index(collection, key, index_type, indexerfunc);
    mydata.get(collection, search);
    mydata.getCertificateRoot();
    mydata.getWitness(collection, key);
    mydata.validate(witness);
    mydata.onUpdate(some_function_that_lets_me_certify_my_data_with_other_data_in_my_canister);
6 Likes

@icme @lastmjs - thanks for the input.

Here’s where I’m coming from on this topic:

  1. It’s great to see that the community is coming up with various implementations of a DB framework.
  2. As a product manager, I’m interested in learning more about the developer use cases and the demand. I want to know what types of apps would be easier to build if we had a database framework.
  3. Since building a true DBMS on the IC is challenging today because of platform constraints, I’d like to build a better mental map of potential developer benefits to make a good case for underlying platform improvements.

Does that make sense?

@skilesare Thanks for the ideas!

What would this code do?

Create a Merkle tree of your data so that you can certify it and return it with queries to applications that want to know they can trust the data. You can use the output for CertifiedData | Internet Computer Developer Portal
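
As a rough illustration, here is a minimal sketch of the certification half of that flow using the CertifiedData API from the Motoko base library; the Merkle tree itself is stubbed out, and a full MetaTree would also return per-key witnesses:

    import Array "mo:base/Array";
    import Blob "mo:base/Blob";
    import CertifiedData "mo:base/CertifiedData";

    actor CertifiedKV {
      stable var store : [(Text, Blob)] = [];

      // Stub: a real implementation would compute the Merkle root over
      // `store` here (CertifiedData.set accepts at most 32 bytes).
      func rootHash() : Blob {
        Blob.fromArray([]);
      };

      public func put(key : Text, value : Blob) : async () {
        store := Array.append(store, [(key, value)]);
        // Re-certify on every update so later queries can prove answers.
        CertifiedData.set(rootHash());
      };

      // Returns the value plus the system certificate over the root;
      // a full design would also return a Merkle witness for `key`.
      public query func get(key : Text) : async (?Blob, ?Blob) {
        let hit = Array.find<(Text, Blob)>(
          store, func (kv : (Text, Blob)) : Bool { kv.0 == key });
        let value = switch (hit) { case (?kv) { ?kv.1 }; case null { null } };
        (value, CertifiedData.getCertificate());
      };
    }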

edit: To be blunt, all the apps we are writing are just web2 wannabe apps until we all start certifying our data, and the APIs for doing this right now are very underdeveloped and lack good examples of why it is necessary/useful.

6 Likes

I think that scalable storage architecture is better formed on a case-by-case basis, based on the specifics of a system/dapp. BigMap-like libraries can cover many use cases, but as an example, the ledger block storage and the OpenChat user storage use different scaling techniques: OpenChat gives each user or group their own canister and keeps a map of user->user-canister, while the ledger stores the blocks in archive canisters in a linear way, only creating a new archive canister when the previous one is full.
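
For a flavor of the first technique, here is a minimal sketch of an index canister that maps each user to their own storage canister, creating one on first contact. `UserStorage` is a hypothetical actor class assumed to live in its own file, the cycle amount is arbitrary, and the in-heap map is not upgrade-safe as written:

    // index.mo - sketch of the "one canister per user" pattern.
    // Assumes UserStorage.mo defines: actor class UserStorage(owner : Principal)
    import Cycles "mo:base/ExperimentalCycles";
    import HashMap "mo:base/HashMap";
    import Principal "mo:base/Principal";
    import UserStorage "UserStorage";

    actor Index {
      let users = HashMap.HashMap<Principal, UserStorage.UserStorage>(
        16, Principal.equal, Principal.hash);

      public shared (msg) func myStorage() : async UserStorage.UserStorage {
        switch (users.get(msg.caller)) {
          case (?c) { c };
          case null {
            // First contact: fund and spin up a dedicated canister.
            Cycles.add(1_000_000_000_000);
            let c = await UserStorage.UserStorage(msg.caller);
            users.put(msg.caller, c);
            c;
          };
        };
      };
    }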

1 Like

In short, I definitely still want and need a BigMap.

Speaking from a Motoko developer’s POV, the status quo is insufficient for a variety of reasons:

  • Stable variables are the easiest way to persist data across canister upgrades. Currently, stable variables are limited by the size of the canister heap instead of the size of stable memory, because Motoko needs to serialize and deserialize the stable memory data in the heap. That is currently 4 GB, and in practice, the available heap is lower due to GC constraints (which are in the process of being lifted). 4 GB is not a lot of space.

  • To access the full 8 GB of stable memory, a developer can use the ExperimentalStableMemory library. However, it is very cumbersome to program with right now, as you can’t use any of the off-the-shelf Motoko standard library data structures (like Trie) if you go that route (see the sketch after this list). Also, that library is marked as experimental and may change at any time, which makes it unsuitable for production.

  • Applications that store images or video will very quickly hit the 4 or 8 GB limit. There was talk about specialized storage subnets, but there have been no updates on that front. I anticipate many may plan on using a BigMap as a blob store for large assets like video; that may have unintended implications if not carefully designed.
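
To illustrate how low-level that route is today, here is a minimal sketch of programming directly against ExperimentalStableMemory: you manage raw offsets and bytes yourself, with no data structures on top (the offset arithmetic is simplified and unchecked):

    import StableMemory "mo:base/ExperimentalStableMemory";
    import Nat64 "mo:base/Nat64";

    actor RawStore {
      let PAGE_SIZE : Nat64 = 65536; // stable memory grows in 64 KiB pages

      public func store(offset : Nat64, value : Blob) : async () {
        // Grow stable memory if the write would run past the current end.
        let needed = (offset + Nat64.fromNat(value.size())) / PAGE_SIZE + 1;
        if (needed > StableMemory.size()) {
          ignore StableMemory.grow(needed - StableMemory.size());
        };
        StableMemory.storeBlob(offset, value);
      };

      public query func load(offset : Nat64, size : Nat) : async Blob {
        StableMemory.loadBlob(offset, size);
      };
    }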

For me, I am building a consumer-facing dapp intended for a mass market. I worry a lot about hitting memory limits. I currently use Motoko standard library data structures (like Trie, RBTree, List) in a single canister, but that won’t be nearly enough. A database (or blob store) solution would be huge. In fact, the vision of BigMap is one of the more compelling sells of the IC developer ecosystem.

It wouldn’t necessarily lower the bar for new developers trying to prototype something on the IC, but it would massively lower the bar for more experienced developers trying to build a scalable, productionized web3 dapp on the IC.

8 Likes

I would say that most cloud developers coming from web2 expect not just scalability, but to have the complexities of that scalability abstracted away so that they can focus less on infrastructure and more on the business logic of their applications. The current architecture works for isolated or individual-centric applications like blogging/drive storage, but will struggle to support the web3 generation of social media applications.

If I’m coming from web2 and have built a web application that’s geared towards hundreds of thousands to millions of users, I’m used to Firebase/Firestore, AWS, GCP, Azure - data storage that scales and API gateways that load balance, throttle, and auto-scale without breaking. Coming to the IC from the web2 world is a bit like running straight into a wall. Part of this is because the DevX (developer experience) of web2 managed services is just that good, but the other part is a lack of ready-to-go IC solutions that developers can use to build their own automation and management tooling on top of. I think this will come in time as the developer-built tooling matures, but for many developers (like myself) who were originally attracted to the IC by the promise of “infinite scalability” and “BigMap”, it was a bit of a letdown to realize that this abstraction did not yet exist and we had to build our own data storage scaling solutions from scratch.

As @jzxchiang mentioned, a developer looking to prototype an application can move pretty quickly if that application is contained in a single canister - one can go pretty far with 2-4 GB, but this eventually becomes a liability. Not just in terms of data storage, but also in terms of application throughput - a canister can only handle so many query and update calls per second.

This means that even in a multi-canister application, if a single canister is responsible for running a pivotal part of that application, a high-traffic or denial-of-service event could shut down most of, if not the entire, application. The start of some work to throttle requests with inspect_message should help, but there’s still quite a bit of work to be done.

This throughput concern, plus the limitations with the current state of inter-canister calls, makes common architecture designs such as the API Gateway pattern infeasible for the majority of application use cases, and has forced many developers to embrace client-centric multi-canister architectures when designing their applications. While this approach to scaling will “work”, it is not ideal to rely on the client so heavily, as there is a limit to the amount of work the client (browser, phone app, etc.) can be expected to do and the data it can be expected to keep in local storage. Exposing the client directly to a wide variety of microservices also has security implications, as it widens the attack surface of a particular application, whereas with the API Gateway pattern a developer just has to secure a single point of entry.

I believe improvements in the following areas would remove some of these barriers.

  1. Introduce a “load-balancing” scalability option to canisters that are pivotal in a multi-canister architecture, such as indexing, management, or transactional canisters that may need to handle high query or transactional (update) volume.
  2. Start long-term research into solving the performance issues with inter-canister calls, which could potentially mean re-architecting how inter-canister query and update calls work on the IC and breaking backwards compatibility. I think the IC can go pretty far with client-centric architectures, but when the data gets large enough, the client will eventually be overwhelmed without the capacity to do SSR or pre-computation with tools like Map Reduce on the backend. I don’t think that a patchwork solution to inter-canister query and update calls will suffice - we’d just be kicking the can down the road.
  3. Efficient forking/cloning of canister code and state/data would allow developers to easily re-partition the data in those canisters to scale out when a canister fills up, and would lessen the workload of the client application.
  4. Improve heartbeat. Building managed services on top of web3 infrastructure will eventually require some sort of cron-job/heartbeat usage. Currently, heartbeat is pretty expensive for developers to use, as it executes on every single round regardless of whether there is work to do (a common gating workaround is sketched after this list). DFINITY has suggested that the community build shared heartbeat canisters to “share the cost”, but this involves trusting other parties or entering into a DAO/SNS, which is way too much complexity for a simple cron job.
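
The gating workaround mentioned in point 4 looks roughly like this - a minimal sketch with a placeholder interval and job; note that the canister still pays for the heartbeat execution every round, which is exactly the cost problem:

    import Time "mo:base/Time";

    actor Cron {
      stable var lastRun : Int = 0;
      let INTERVAL : Int = 60 * 1_000_000_000; // 60 s in nanoseconds

      // Fires every round; we pay for each invocation even when the
      // guard below skips the actual work.
      system func heartbeat() : async () {
        let now = Time.now();
        if (now - lastRun >= INTERVAL) {
          lastRun := now;
          await doCronWork();
        };
      };

      // Placeholder for the periodic job.
      func doCronWork() : async () {};
    }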

Also, just speaking for myself, but I would appreciate it if the marketing for the Internet Computer product and its presence as a new IT/cloud stack shifted to speak to the strengths that currently exist, and dropped buzzwords like “infinite scalability”. Nothing scales infinitely (not even the centralized cloud providers), so providing specific metrics and data will give developers the confidence they need to develop scalable designs and architectures for their applications.

14 Likes

I think this would be a great opportunity and time to start an “Infrastructure and Scalability” developer working group

9 Likes

I don’t know if people really want BigMap or a NoSQL database.

I suspect that what people actually want is the freedom to write code using whatever data structures they deem appropriate for the task at hand, and that they would love to never have to think about the limits imposed by the implementation details of canisters or the IC.

Some of the original differentiators for development on the IC were things like “write code as if it will live in memory forever”, “no databases”, and “infinite scalability”.

If this could be achieved through the right set of abstractions then that would be a truly amazing experience.

Perhaps for developers this would mean building an app using the otherwise naive “single canister” architecture, with any complexity involved in addressing scaling or other limitations taken care of under the hood.

7 Likes

@paulyong I want to give some thoughts on what you said.

I think you’re basically correct, or at least in line with my thinking. As in, I think DFINITY/Internet Computer should provide basic stable/scalable data structures. Then developers can take those data structures and do what they need.

So for simple projects, people might get away with just using a BigHashMap, BigBTreeMap, etc. that scale out and automatically persist across calls and upgrades.

But I do believe “databases” are necessary, as in data structures that have powerful querying, relation, and other CRUD functionality. That’s what Sudograph provides. Many apps will need this basic functionality, but it should all be buildable with underlying “Big” data structures.

Like right now Sudograph is just a bunch of Rust BTreeMaps under the hood. If I could just swap them out for BigBTreeMaps then we could instantly scale across canisters (hopefully). Sudograph doesn’t have indexing yet, but the basic concept should work pretty well I’m thinking.

So I don’t agree that all we need are the scalable data structures, but I do believe that DFINITY should focus on creating those and then let devs build abstractions on top.

7 Likes

The point I was trying to make is that the currently proposed solutions (BigMap/NoSQL db) appear to be mixing the concerns of data structures with otherwise orthogonal limitations.

I suppose I am trying to encourage the exploration of other solutions, perhaps at the system level, that would work more generally than providing BigMap instead of Map, BigBTreeMap instead of BTreeMap, etc.

Otherwise, everyone will have to continue to think in terms of size/time limits that are unique to the IC instead of being freed up to innovate on what is unique about what they’re building.

Maybe that’s not realistic, or at least not any time soon, but I think it would be a shame not to explore :slightly_smiling_face:

7 Likes

In that case, the most transparent and flexible solution is to simply wait for 64-bit wasm IMO. Canisters would then have a practically limitless address space upon which they can use any data structure they’d like; the replica would handle the messy details.

I think the main issue with that is that we might need to wait a long time.

1 Like

Throughput would still be limited by single canister performance tho.

5 Likes

If ICP is truly to run everything on the blockchain, we are going to need storage services that compare to the likes of AWS and other cloud providers performance-wise. Perhaps DFINITY could build a high-performance storage subnet service that uses the blockchain but with far fewer replicas.

1 Like

We actually wrote this into our IS20 standard but decided against it, because we have not written an implementation compatible with stable storage.

I remember reading a post from a DFINITY dev 7-8 months ago stating that in the future we’d be able to configure the desired replication factor, so one could potentially build an app (or some parts of it) that would run on a single machine, with the ability to make it decentralized with the click of a button. Unfortunately, I can’t find the post again, so it might just be my mind playing tricks on me.
It’d be an interesting solution to explore: not everything needs to be replicated on 10-20 nodes, and having the possibility to choose would unlock lots of new options for developers.