Database-like searching/querying with BigMap/BigSearch

I haven’t put much thought into it just yet, but I wonder if the IC lends itself more towards a system like Hadoop/MapReduce than towards relational DBs.

Totally. In fact, those tools you mention are good any time you are employing a lambda architecture (not to be confused with AWS Lambda): Lambda architecture - Wikipedia

Any time you have way more reads than writes, it can be helpful to take your most common reads/queries, precompute them on write, and store the results somewhere; then reads/queries are just key-value lookups, and your service can spend its time calculating expensive queries instead of common/easy ones.

Some programmers don’t like that it’s not ‘DRY’, but it’s simply a matter of data structure choice based on the nature of your reads/writes in the wild.
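
To make that concrete, here is a minimal sketch of the precompute-on-write pattern in Rust (all names, like `PostStore` and `like_count`, are invented for illustration):

```rust
use std::collections::HashMap;

// Illustrative sketch of precomputing a common query (likes per post) on
// write, so that reads become O(1) key-value lookups.
struct PostStore {
    // Raw write log: (post_id, user_id) pairs for every like event.
    likes: Vec<(u64, u64)>,
    // Precomputed view, updated on every write.
    like_counts: HashMap<u64, u64>,
}

impl PostStore {
    fn new() -> Self {
        PostStore { likes: Vec::new(), like_counts: HashMap::new() }
    }

    // Write path: record the event AND update the precomputed view.
    fn add_like(&mut self, post_id: u64, user_id: u64) {
        self.likes.push((post_id, user_id));
        *self.like_counts.entry(post_id).or_insert(0) += 1;
    }

    // Read path: the common query is now a plain lookup, not a scan.
    fn like_count(&self, post_id: u64) -> u64 {
        self.like_counts.get(&post_id).copied().unwrap_or(0)
    }
}

fn main() {
    let mut store = PostStore::new();
    store.add_like(1, 100);
    store.add_like(1, 101);
    assert_eq!(store.like_count(1), 2);
}
```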

I agree with what Enzo said.

The IC removes the need to use databases or files for persistence purposes, but it does not, nor was it intended to, change any data relationships in your application. If your data naturally fits a key/value model, BigMap can be a candidate solution for scaling. If your data fits a traditional relational model, something else needs to foot the bill. We (those who work at DFINITY) don’t have a ready solution to offer developers at this moment, but I do imagine building a relational data model will also be less complex when persistence is transparent and atomicity (at the message level) is already provided by the system.

It makes a lot of sense that a conventional application will want to use a relational DB both for its data model and for persistence. But when it comes to scaling out, that can also become very complex and application-specific. I’m afraid there is no silver bullet in this area (unlike key/value stores, where every other NoSQL solution claims to be the silver bullet).

Ping @matthewhammer as I recall he has built a relational data model with joins in Motoko, but I don’t think it supports multi-canister, nor was it intended as a scaling solution.

Thanks for all of the thoughts in this thread so far, they have been very helpful.

As I’ve been reasoning through this more, based on all of the discussion here, I’m starting to conclude that an efficient and scalable data structure that allows relational modeling will be possible by composing multiple basic data structures, customized somewhat for the IC (the customization mostly dealing with scaling across canisters).

It really seems that with a combination of tree/map structures, like B-trees and B+ trees, we can get the basic functionality that we want. For example, these two data structures are what SQLite is built out of under the hood (if my sources aren’t wrong): SQLite uses B+ trees to store the tables, and B-trees for indexes.

Given the orthogonal persistence of the IC, a lot of the underlying complexity of getting a database to function properly in a web/network app context will just melt away. That is exciting.

I’m thinking of starting with simple Rust data structures that already exist, such as BTreeMap. With some binary search functionality, either built in or custom-made, I think I can get a good prototype working well within one canister. If that works, I think we’ll be well on our way to a scalable solution.
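
As a rough illustration of what that single-canister prototype could look like (all types and names here are hypothetical), a primary BTreeMap can act as the table while a second BTreeMap acts as an index, with ordered keys giving cheap range queries:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical single-canister prototype: a primary BTreeMap as the "table"
// plus a second BTreeMap as an index, the way SQLite pairs tables with
// B-tree indexes.
#[derive(Debug)]
struct User { id: u64, age: u32, name: String }

struct UserTable {
    rows: BTreeMap<u64, User>,            // primary key -> row
    by_age: BTreeMap<u32, BTreeSet<u64>>, // secondary index: age -> ids
}

impl UserTable {
    fn insert(&mut self, user: User) {
        self.by_age.entry(user.age).or_default().insert(user.id);
        self.rows.insert(user.id, user);
    }

    // Range query via the index: ordered keys make this a cheap scan.
    fn ids_with_age_between(&self, lo: u32, hi: u32) -> Vec<u64> {
        self.by_age
            .range(lo..=hi)
            .flat_map(|(_, ids)| ids.iter().copied())
            .collect()
    }
}

fn main() {
    let mut t = UserTable { rows: BTreeMap::new(), by_age: BTreeMap::new() };
    t.insert(User { id: 1, age: 30, name: "a".into() });
    t.insert(User { id: 2, age: 35, name: "b".into() });
    assert_eq!(t.ids_with_age_between(30, 34), vec![1]);
}
```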

As long as the data structures are built to compose together well, I’m thinking we should be able to create Big versions of these data structures, and they should work remarkably similarly to the non-Big versions that only work within a single canister.

So if we can get a BTreeMap, a B+TreeMap, and a binary search to work well within one canister, then by building a BigBTreeMap and a BigB+TreeMap, along with a BigBinarySearch, we might be able to simply swap out the data structures in the single-canister version and, voilà, get a multi-canister version.
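
One way to picture that swap (purely a sketch; BigBTreeMap is hypothetical) is to write application code against a common trait, so the single-canister and multi-canister versions are interchangeable:

```rust
use std::collections::BTreeMap;

// Sketch of the "swap out the data structure" idea: code written against a
// map trait shouldn't care whether the backing store is a local BTreeMap or
// a hypothetical multi-canister BigBTreeMap.
trait OrderedMap {
    fn insert(&mut self, key: u64, value: String);
    fn get(&self, key: &u64) -> Option<String>;
}

// Single-canister version: a plain std BTreeMap.
impl OrderedMap for BTreeMap<u64, String> {
    fn insert(&mut self, key: u64, value: String) {
        let _ = BTreeMap::insert(self, key, value);
    }
    fn get(&self, key: &u64) -> Option<String> {
        BTreeMap::get(self, key).cloned()
    }
}

// A stub standing in for the multi-canister version; a real BigBTreeMap
// would route each key to the canister holding its shard.
struct BigBTreeMap { /* shard routing table, canister ids, ... */ }

// Application code has identical call sites for both versions.
fn put_user(store: &mut impl OrderedMap, id: u64, name: String) {
    store.insert(id, name);
}

fn main() {
    let mut m: BTreeMap<u64, String> = BTreeMap::new();
    put_user(&mut m, 1, "alice".into());
    assert_eq!(OrderedMap::get(&m, &1), Some("alice".into()));
}
```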

Thoughts?

@lastmjs Here is a port of SQLite to the Internet Computer: IC_sqlite. It’s a simple canister that allows you to create a database and to run SQL updates and queries. SQLite is probably the easiest port of a legacy application to the IC you can find; I think it is nearly impossible to do this for the most popular database software like MySQL or PostgreSQL. Running SQLite on the IC also has advantages compared to running it on an OS. SQLite does not have any specific user-management functionality, and hence is not suitable for multi-user access; it has no built-in authentication mechanism. However, you can add an authentication layer to your canister, along with the ability to have multiple users access multiple databases hosted on the same canister.
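
For a rough idea of the shape of such an interface (this is a sketch, not the actual IC_sqlite code, and it assumes the ic_cdk and rusqlite crates build for the canister’s Wasm target):

```rust
// Hedged sketch of a SQL canister interface: one update method for writes
// and one query method for reads, over a per-canister SQLite database.
use std::cell::RefCell;
use rusqlite::Connection;

thread_local! {
    // One database per canister; orthogonal persistence keeps the canister
    // heap (and thus the in-memory database) alive across messages.
    static DB: RefCell<Connection> =
        RefCell::new(Connection::open_in_memory().expect("open db"));
}

// Run a CREATE/INSERT/UPDATE statement; returns the affected row count.
#[ic_cdk::update]
fn execute(sql: String) -> Result<u64, String> {
    DB.with(|db| {
        db.borrow()
            .execute(&sql, [])
            .map(|n| n as u64)
            .map_err(|e| e.to_string())
    })
}

// Run a SELECT; returns each row's columns rendered as strings.
#[ic_cdk::query]
fn query(sql: String) -> Result<Vec<Vec<String>>, String> {
    DB.with(|db| {
        let db = db.borrow();
        let mut stmt = db.prepare(&sql).map_err(|e| e.to_string())?;
        let n = stmt.column_count();
        let rows = stmt
            .query_map([], |row| {
                (0..n)
                    .map(|i| row.get::<_, rusqlite::types::Value>(i)
                        .map(|v| format!("{:?}", v)))
                    .collect::<Result<Vec<String>, _>>()
            })
            .map_err(|e| e.to_string())?;
        rows.collect::<Result<Vec<_>, _>>().map_err(|e| e.to_string())
    })
}
```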

Note that the wasm is not optimized at all; 2/3 of the wasm code is probably dead code, and many settings that make sense on a traditional OS are not needed on the IC. Still, there is a lot to like about it despite its limitations. It’s probably not too difficult to extract the core B-tree logic and throw away most of the rest of the code; that way you can take advantage of years of optimizations put into SQLite.

I’d like to echo what @enzo and others have said, though. Building applications and services on the IC is as much about simplification as it is about taking advantage of the fact that messaging is reliable, persistence is orthogonal, and computation is trusted. So I think we might be presented with a great opportunity to innovate here.

Thank you for sharing that! Can you also estimate how many days/hours it took you to make this port?

I would recommend an RDF triple store like TerminusDB (https://terminusdb.com/) or FlureeDB (https://flur.ee/), with Datalog as the query language.

Please also check the article Rethinking the Database and an implementation of the idea in Rust: GitHub - Roenbaeck/bareclad. Based on transitional modeling, bareclad is a new type of database that provides features from relational, graph, columnar, and name-value pair databases.

For use cases like the one in this thread, would it be possible to implement a memory manager that abstracts away the canister memory/storage limit? If you were porting C code like SQLite, you could override malloc(), free(), etc. to call into your memory manager. The memory manager would scale the heap across multiple canisters, effectively extending the heap address space from 32-bit to 64-bit (16 exabytes).

That’s assuming you can compile the SQLite C code as 64-bit and run it in a canister. I haven’t compiled C code to Wasm, so it’s quite possible that what I just suggested can’t be done, at least not without some tricks in the C code to implement pseudo-64-bit addressing. It also seems like it would be inefficient, since inter-canister calls to retrieve large blocks of data are presumably far more expensive than accessing memory within a canister.
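
Still, the address-translation part of the idea is easy to sketch. Assuming (purely for illustration) that each canister hosts a fixed 4 GiB slice of a 64-bit address space:

```rust
// Hypothetical sketch: translate a 64-bit "virtual" address into a
// (canister index, local offset) pair, with each canister hosting a
// fixed-size 4 GiB slice of the address space. Names are illustrative.
const SLICE_BYTES: u64 = 4 * 1024 * 1024 * 1024; // 4 GiB per canister

fn translate(addr: u64) -> (u64, u32) {
    let canister_index = addr / SLICE_BYTES;
    let local_offset = (addr % SLICE_BYTES) as u32; // always fits in 32 bits
    (canister_index, local_offset)
}

fn main() {
    // An address just past the first 4 GiB lands in the second canister.
    assert_eq!(translate(SLICE_BYTES + 42), (1, 42));
    println!("{:?}", translate(10 * SLICE_BYTES + 7)); // (10, 7)
}
```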

I agree that the best option in most cases is to try to think exclusively in a way that breaks large, complex systems into components that live within multiple IC canisters, at least while the 4GB Wasm/canister limit exists. For those of us coming from 64-bit programming though, this is a limitation we’re not used to, at least not in the past decade.

This seems really relevant and interesting:

It seems like it would not work perfectly with ICP out of the box, but it appears to have done a lot of relevant research that may apply here as well.

A very interesting conversation here!

I just wanted to add that if you have a BigMap and a BigTree, then any database in an actor-based system could be expressed as a set of actors, each containing a different index over the same data.

It seems likely that at the maximum scales we’re considering here, such a database won’t be able to support arbitrary queries without limits, and that aggregates will likely need to be expressed as actor indexes themselves.
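
Here’s a toy sketch of that, with plain in-process structs standing in for the index actors (all names invented): each write fans out to every index, and an aggregate is itself just another index:

```rust
// Illustrative sketch of "each actor is an index over the same data": a
// write fans out to every index, and each index answers its own queries.
use std::collections::BTreeMap;

#[derive(Clone)]
struct Event { id: u64, country: String, amount: u64 }

// "Actor" 1: index by id.
#[derive(Default)]
struct ByIdIndex { map: BTreeMap<u64, Event> }

// "Actor" 2: a precomputed aggregate, itself maintained as an index.
#[derive(Default)]
struct TotalByCountry { map: BTreeMap<String, u64> }

impl ByIdIndex {
    fn apply(&mut self, e: &Event) { self.map.insert(e.id, e.clone()); }
}

impl TotalByCountry {
    fn apply(&mut self, e: &Event) {
        *self.map.entry(e.country.clone()).or_insert(0) += e.amount;
    }
}

fn main() {
    let (mut by_id, mut totals) = (ByIdIndex::default(), TotalByCountry::default());
    for e in [Event { id: 1, country: "CH".into(), amount: 5 },
              Event { id: 2, country: "CH".into(), amount: 7 }] {
        // A write fans out to every index actor.
        by_id.apply(&e);
        totals.apply(&e);
    }
    assert_eq!(totals.map["CH"], 12);
}
```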

I think this would be a huge step forward. I’d be happy to work on this pretty much full-time. Has anyone worked on this yet?

Hi. Has there been any progress, or any code published, since this conversation?

I’ve made a lot of progress on Sudograph, which you can view here: GitHub - sudograph/sudograph: GraphQL database for the Internet Computer

Sudograph has an underlying relational database, called Sudodb: https://crates.io/crates/sudodb

Currently Sudodb is implemented with a combination of Rust BTreeMaps and works within a single canister. Based on multiple conversations with DFINITY engineers, though it’s not a certainty, it seems very likely that with memory64 and some improvements to the IC, canisters will theoretically be able to scale to the capacity of a subnet without too many complications. A single canister being able to operate on terabytes of data doesn’t seem out of the question.

So for now, Sudograph will be moving forward with its single-canister design and will take advantage of the hopefully soon-to-come improvements to the canister storage limit. And I am starting to believe that this will be sufficient for an enormous number of use cases.
