Database-like searching/querying with BigMap/BigSearch

I’ve watched the videos on BigMap and BigSearch, and I’ve seen some examples and a lot of hearsay. One major piece of functionality might be missing here, and I would love to be corrected if I’m wrong.

Is there going to be any native (built into the data structure) way of querying a BigMap, with search queries similar to what you’d find in a relational database or something like MongoDB? As far as I know, BigSearch is for full-text search, but we need per-field search capabilities on BigMap so that we can truly emulate a database.

The LinkedUp GitHub example uses a simple Motoko HashMap as its database, and implements search with a for loop that iterates over every record to find a match. I’m really hoping BigMap will have this type of functionality built in; otherwise that’s a major project someone is going to have to take on.
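The pattern in question looks roughly like this; a Motoko sketch, not LinkedUp’s actual code (the Profile type is made up):

```motoko
// A sketch of the full-scan search pattern, not LinkedUp's actual code.
import HashMap "mo:base/HashMap";
import Text "mo:base/Text";

actor Directory {
  type Profile = { id : Text; firstName : Text; lastName : Text };

  let profiles = HashMap.HashMap<Text, Profile>(16, Text.equal, Text.hash);

  // O(n): every search walks all records.
  public query func search(term : Text) : async ?Profile {
    for ((_, p) in profiles.entries()) {
      if (Text.contains(p.firstName, #text term)) {
        return ?p;
      };
    };
    null
  };
}
```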

So, will BigMap/BigSearch have these features? If not, am I correct that a library would need to be made that essentially reimplements the common search functionality of popular databases like Postgres, MySQL, or MongoDB? And that in the meantime, rudimentary searches through all records are the best we have (and that such a library would essentially provide abstractions over full-iteration searches)?

14 Likes

The purpose of BigMap is to provide a key-value store that is distributed across multiple canisters. In contrast to relational databases, which define a data structure made up of tables of rows and columns with predefined data types, key-value stores keep data as a single collection without any structure or relations. I do not expect BigMap to provide relational features, since it is not relational in nature. If you want something that resembles a relational database, then I would recommend looking into compiling SQLite down to WebAssembly and deploying that.

I’ll let others speak to the capabilities of the newer Rust implementation of BigMap. My Motoko implementation from last summer supported few features beyond get and put, but was also written in just a few hundred lines of code, which I think is rather remarkable. If I had more time, I would have supported ordered iteration through keys and thus searches on partial keys. I believe the Rust implementation has been more focused on autoscaling via dynamic canister creation and automatic rebalancing of data.
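For concreteness, this is the shape of the get/put surface in question; a minimal Motoko sketch, not the actual BigMap code:

```motoko
// Not the actual BigMap code, just the shape of a get/put key-value
// canister in Motoko.
import HashMap "mo:base/HashMap";
import Blob "mo:base/Blob";

actor KeyValue {
  let store = HashMap.HashMap<Blob, Blob>(16, Blob.equal, Blob.hash);

  public func put(key : Blob, value : Blob) : async () {
    store.put(key, value);
  };

  public query func get(key : Blob) : async ?Blob {
    store.get(key)
  };
}
```

Swapping the hash map for an ordered structure is what would unlock iteration in key order, and with it searches on partial keys.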

4 Likes

So it sounds like there is still going to be a significant effort to get something like the databases we’re used to developing with. Compiling something like SQLite to Wasm could work within one canister, but we’re going to need a database that automatically scales across canisters and can query/search across them.

We were told we wouldn’t need databases anymore. It looks like we still need a database, but one specifically designed for the Internet Computer. A simple key-value store is insufficient for many applications; in fact, I’m not sure any web-scale CRUD application is going to succeed without a scalable, queryable data store.

Am I missing something?

6 Likes

Hi @lastmjs

I’ve been researching this for a bit, but it seems BigMap won’t be the solution for a relational DB.

As @enzo said, and AFAIK, BigMap will behave like an auto-scalable Redis.

Check Matt’s answer here (scroll down a bit): Idea for CRUD types - feedback appreciated - #4 by Gabriel

I’ve been working on a simple structure for a graph as well, and hopefully I’ll be able to share it soon, but it’s going slowly because of my different background (Go and TS).

For a better understanding, I would suggest having a look at the resources below.

Most of the online docs point towards Neo4j, and it will be a bit hard to get your head around it, but the core implementation could start with this:

7 Likes

If one could access the canister’s underlying memory based on addresses, it should be possible to implement Neo4j’s approach and scale it out following their “Fabric” approach. Is this what you have in mind @Gabriel?

2 Likes

Hi @cryptoschindler

Not sure what you mean by “access the canister’s underlying memory based on addresses”, but basically yes. In the end, their approach is based on the labeled property graph model, but they’ve been developing it for years, so my approach is rudimentary at best. Also, in terms of scaling, I don’t know how this is going to work. Seeing how BigMap auto-scales would give me an idea of how to map things in order to scale when the data structure gets too big.

For now I have a simple structure just to get things going, and on top of that I’m trying to add a simple ORM for CRUD operations, validation, etc.

4 Likes

@cryptoschindler

Ah, now I see what you meant. Yeah, basically: if we can access each canister’s memory, we could do what their Fabric infrastructure does, e.g. data sharding, where the query language retrieves results from multiple federated graphs.

That should solve the scaling issue quite nicely.

I don’t know exactly how this will work with DFINITY, as I don’t have a clear understanding of how canisters clone and split memory between them and how you keep track of each one of them, but this could become the go-to graph database.

3 Likes

Basically, one of the reasons why Neo4j is so fast is that it mostly stores fixed-size records and uses index-free adjacency. This means each node acts as a micro-index of nearby nodes. Neo4j also divides its storage into node storage, label storage, property storage, and relationship storage, each containing the respective information.
If we have fixed-size records and want to look up a certain node id, we can calculate the node record’s location directly at cost O(1). One could probably implement this on top of a higher-level data structure, but it would be nice to get rid of the overhead and write directly to the canister’s memory at a certain address.
As a canister only has access to 4 GB of memory, you have to shard this out to multiple canisters anyway if your DB is bigger than 4 GB.
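To make the O(1) claim concrete, here is a minimal Motoko sketch of fixed-size record addressing over a flat byte array; the 16-byte record size and all names are assumptions for illustration:

```motoko
// Fixed-size record addressing over a flat byte array. The 16-byte
// record size and all names are assumptions for illustration only.
import Array "mo:base/Array";

actor NodeStore {
  let RECORD_SIZE : Nat = 16;

  // Pre-allocated storage for 1024 records.
  let store : [var Nat8] = Array.init<Nat8>(RECORD_SIZE * 1024, 0);

  // O(1): a record's location is a pure function of its id.
  func recordOffset(nodeId : Nat) : Nat {
    nodeId * RECORD_SIZE
  };

  public query func readRecord(nodeId : Nat) : async [Nat8] {
    let base = recordOffset(nodeId);
    Array.tabulate<Nat8>(RECORD_SIZE, func(i : Nat) : Nat8 { store[base + i] })
  };
}
```

Writing at a raw address into the canister’s actual memory would cut out even this array indirection, which is the part that isn’t exposed to us today.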

I’m interested in working on this, let me know if you have an accessible repository 🙂

6 Likes

Ah, thanks for all of the info.

So we definitely will need to create a general-purpose CRUD database that automatically scales and allows complex querying; BigMap and BigSearch seem to be out for that, then.

Now my question is, what is the best architecture to choose for this database? I know you’ve mentioned a lot about graph databases, I just want to make sure that’s the right choice. Please reason through this with me if you don’t mind.

Taking an existing relational database like SQLite or Postgres is probably out of the question for multiple reasons. SQLite would probably work well compiled to Wasm, but it would be stuck within one canister, so no automatic scaling. And as far as I understand, relational databases don’t do horizontal scaling too well (I could be wrong; I believe you can scale them horizontally, but their underlying mathematical properties don’t lend themselves to it well, and I never seem to hear about horizontally sharded Postgres or other SQL databases).

So if a traditional SQL/relational database is out of the question, which architecture is best? For vertical scaling, we’ve got about 4 GB of storage and that’s it. As for processing power within a canister, I’m not sure what we’re dealing with. In any case, we’ll need an architecture that can be sharded and scale out horizontally extremely well.

So a graph database architecture seems like a good choice. But is it the best choice? What other architectures have you considered? What led you to the graph database?

This is going to be a foundational and indispensable project. I feel most apps are going to need a scalable database; it will be crucial for web-scale applications.

You mentioned creating an ORM. Personally, I’m more interested in just having the query language or the low-level API. It might make more sense to focus efforts there and create a solid foundation first that ORMs and other projects can build on. ORMs are notoriously bad, and getting them right is, I believe, a very difficult task.

To that end, that’s one of the major motivations for the project I’m about to start working on in a couple of weeks. I’m not sure of the exact name yet, probably something like GraphIC, GraphICP, or ic-graphql. The idea is to create the ultimate GraphQL generator: you define a schema, and it automatically generates resolvers for all basic CRUD operations, and eventually more. Essentially it will be an ORM, and IMO GraphQL is the best way to do an ORM that I’ve ever seen. The standardization and world of tooling will be unmatched compared to any custom ORM, especially one being built from scratch now.
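To give a rough idea, here is a hypothetical Motoko sketch of what the generated CRUD storage for a `User` type might look like on the canister side (every name here is illustrative; the real generator would emit GraphQL resolvers on top of something like this):

```motoko
// A rough sketch of what generated CRUD operations might look like for
// a hypothetical User type; every name here is illustrative.
import HashMap "mo:base/HashMap";
import Text "mo:base/Text";

actor Generated {
  type User = { id : Text; name : Text };

  let users = HashMap.HashMap<Text, User>(16, Text.equal, Text.hash);

  public func createUser(user : User) : async () {
    users.put(user.id, user);
  };

  public query func readUser(id : Text) : async ?User {
    users.get(id)
  };

  public func updateUser(user : User) : async () {
    users.put(user.id, user);
  };

  public func deleteUser(id : Text) : async () {
    ignore users.remove(id);
  };
}
```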

But for my project to succeed and reach its full potential, it needs an infinitely scalable database API to implement the resolvers with. I was hoping BigMap/BigSearch were going to provide that implementation, but I guess not.

So my motivation here is creating the ultimate declarative backend, the simplest way to develop a CRUD application that has ever existed. I’m about to start dedicating most of my time to the GraphQL project, but I need this database.

8 Likes
  1. Option 1: Build the database you need. Maybe others will even pay to use it, if it’s so necessary. Nothing remotely GraphQL-related will be useful until this ambitious effort is done.
  2. Option 2: Limit the domain of GraphQL queries your tools can handle to ones you know how to program a resolver for. It’s totally fine to only be able to resolve simple queries in a first prototype milestone. Most of the time, rather than taking on a big ambitious set of work in one batch, you’ll deliver a better return on investment by limiting your batch size, shipping something useful for a small set of use cases, and then releasing regularly with progressively larger use-case goals.

MVP

8 Likes

Now my question is, what is the best architecture to choose for this database?

Mostly because you don’t have any other options. Unless you get TCP/HTTP, the only option for now is in-memory DBs. The SQLite option could work as well, but I’m fairly sure it won’t scale (e.g. Augur core).

So a graph database architecture seems like a good choice. But is it the best choice? What other architectures have you considered? What led you to the graph database?

For scaling, if @cryptoschindler’s suggestion can be implemented, then we’re golden. Also, most of the benchmarks show that graph databases are way faster. Check https://www.researchgate.net/publication/220996559_A_comparison_of_a_graph_database_and_a_relational_database_A_data_provenance_perspective

You mentioned creating an ORM. Personally, I’m more interested in just having the query language or the low-level API.

The ORM in question is not the magic ORM you’re thinking of.

Basically, I want the graph structure to stay a simple structure, with the ORM part added on top of the graph. For example, for our own API we need custom validations per field per action, sanitization, and a lot of custom logic. See Idea for CRUD types - feedback appreciated
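As a rough sketch of what I mean by per-field validation (all names here are hypothetical):

```motoko
// A minimal sketch of per-field validation hooks layered on top of the
// graph; all names are hypothetical.
import Result "mo:base/Result";
import Text "mo:base/Text";

module {
  // A validator takes a raw field value and either accepts it or
  // returns an error message.
  public type FieldValidator = Text -> Result.Result<(), Text>;

  public func nonEmpty(value : Text) : Result.Result<(), Text> {
    if (Text.size(value) == 0) { #err("field must not be empty") }
    else { #ok(()) }
  };

  // Run every validator registered for a field, failing on the first error.
  public func validate(value : Text, validators : [FieldValidator]) : Result.Result<(), Text> {
    for (v in validators.vals()) {
      switch (v(value)) {
        case (#err(msg)) { return #err(msg) };
        case (#ok(_)) {};
      };
    };
    #ok(())
  };
}
```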

The idea is to create the ultimate GraphQL generator.

For a graph database, a GraphQL layer is essentially useless, because you don’t need custom resolvers, parsers, etc. for complex queries. Cypher is the best example of why you don’t need it. See: Connecting your React app to Neo4j with React Hooks | Neo4j Developer Blog

Baby steps.

PS: @cryptoschindler I’ll be in touch over a message once I clean my stuff.

2 Likes

I wouldn’t agree with that. Taking a look at Cypher, GraphQL is far more declarative. Cypher, I’m sure, is far more powerful and flexible, but I think GraphQL could provide a very nice declarative abstraction over resolvers written in Cypher, for example.

For the database itself, I’m not sure GraphQL would be the right choice (but it could be, see https://dgraph.io/), but for an application that might be using a graph database along with other data sources, using GraphQL could be a great choice to abstract over those underlying details.

Also, authentication, authorization, and per-field rules can be provided through GraphQL schema directives, though perhaps the graph database should have that functionality natively as well.

3 Likes

This database is such an important building block, I’m hoping a robust community project evolves or the foundation takes a deep interest in incentivizing its creation.

One of my biggest problems here is the marketing we have received. We were told specifically that there would be no need for databases, and though skeptical, I hoped it would be true. It seems we will have the primitives to build a database on the Internet Computer, but it still must be built, and many applications will require a specialized database designed for the Internet Computer.

I wish the DFINITY team would be more transparent about the vision and the limitations of the Internet Computer. Right now I have to sift through everything that’s said to get to what’s real and practical and what’s not.

8 Likes

Few would dispute that there is much work to be done before the vision for the IC is fully achieved. There are many challenges still to overcome, not least better search functionality. Though perhaps the greatest challenge of all is to think differently on how existing use cases can be better satisfied without resorting to the legacy stack, especially when messaging is reliable, persistence is orthogonal, and computation is trusted. All the same, I will share your feedback with marketing to ensure we deliver a consistent message across all communication channels.

4 Likes

Thanks, I appreciate it!

And thanks for pointing out these new capabilities. I’m excited for the challenge ahead!

It should be possible to build a balanced-tree or general-balanced-tree library, in which case you would be well on the way to enabling indexes that can be searched in O(log n) time. Finding a good method to partition the tree across multiple canisters might have some challenges, but perhaps just try it naively and see what the performance is like.
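For example, a minimal sketch of a secondary index built on the red-black tree already in motoko-base (the email-to-user-id mapping is just an illustration):

```motoko
// A minimal sketch of an ordered index using the red-black tree from
// motoko-base, giving O(log n) lookups by key.
import RBTree "mo:base/RBTree";
import Text "mo:base/Text";

actor Index {
  // Index user ids by email address.
  let byEmail = RBTree.RBTree<Text, Nat>(Text.compare);

  public func index(email : Text, userId : Nat) : async () {
    byEmail.put(email, userId);
  };

  public query func lookup(email : Text) : async ?Nat {
    byEmail.get(email)
    // byEmail.entries() iterates in ascending key order, which is
    // what would enable prefix and range scans on top of this.
  };
}
```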

3 Likes

What about joins with a B-tree? Any advice there? Do you think those would perform well?

1 Like

It should work just fine:

Consider - MySQL :: MySQL 5.7 Reference Manual :: 8.2.1.6 Nested-Loop Join Algorithms
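To illustrate, a minimal Motoko sketch of the naive nested-loop approach over two in-memory collections (all types and names made up):

```motoko
// A minimal sketch of a naive nested-loop join between two in-memory
// collections; all types and names are illustrative.
import Buffer "mo:base/Buffer";

module {
  public type User = { id : Nat; name : Text };
  public type Order = { userId : Nat; item : Text };

  // For every user, scan all orders for matching rows: O(n * m).
  // Indexing the inner side with a B-tree is what cuts this down.
  public func join(users : [User], orders : [Order]) : [(User, Order)] {
    let out = Buffer.Buffer<(User, Order)>(0);
    for (u in users.vals()) {
      for (o in orders.vals()) {
        if (u.id == o.userId) { out.add((u, o)) };
      };
    };
    Buffer.toArray(out)
  };
}
```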

The tricky part would be how you handle a join with data split between two canisters.

I haven’t put much thought into it just yet, but I wonder if the IC lends itself more towards a system like Hadoop / MapReduce instead of relational DBs.

2 Likes

Yep, the multi-canister issues seem to be the major source of complexity when discussing these primitives that need to be built on top of the platform.

It would have been really nice for the multi-canister/multi-Wasm-module abstraction to have been handled at the canister level, instead of moving it up above the canister level to user-space.

+1

It’s good practice to present an internet service as an API that isn’t tightly coupled to one database vendor like Neo4j. GraphQL is clearly becoming the de facto standard for that (IMHO) and (I believe) has much more widespread adoption in HTTP APIs than Neo4j’s query language.

1 Like