Sqlite backed by stable memory: the perfect match?

You might have seen my previous posts discussion issues about safe upgrades of canisters and zero downtime upgrades, and clearly, our platform isn’t particularly great for developers when it comes to persisting your data across safely and efficiently canister code upgrades. But it should be great!

If we squint a bit, we’ll see that the problems we are facing here aren’t that unique: We have a system with a persistent storage (the stable memory) that is unstructured, flat and kinda slow. We need to store all relevant data in this storage. And we want to be able to make sense of this data even after changing the code of our canister, or possibly even changing the implementation language. This describes basically any computer out there with a disk attached to it! There, this is solved using file systems and/or databases, so maybe we can simply use the same?

TL;DR

The sqlite in-process database solves most if not all of the problems related to persistent storage, and can be backed by the IC’s stable memory.

The vision

Something like this should work, and the developer shoudn’t need to worry about much else:

(TBD, avoiding 403 error)

Behind the scenes this would use the canister’s stable memory like a raw block device.

Why is this great?

  • sqlite solves the problem of mapping structured data onto a flat disk, in a time-tested way.
  • sqlite is well known to many developers, as it is the most commonly used data base out there. Even mobile phone apps tend to use sqlite for their data storage!
  • The problem of schema migration, i.e. what to do if your new canister version stores its data differently, is well-known and has plenty of existing solutions to pick from.
  • sqlite is used to work with a slow, block-based backend, so backing it with stable memory is sensible, and will likely make good use of the canister main memory as a cache.
  • sqlite is designed to support large files (>4GB), so you can use large stable memory easily.
  • sqlite guarantees backwards compatibility, so no worry about losing your data because your data serialization library has changes.
  • The sqlite file format can be analyzed by dedicated tools, so if the IC (or your canister) provides read access to stable memory, this can be used for debugging, backup and archival.
  • sqlite is available in many programming languages. By storing your data this way, you can in the future switch from one Canister implementation language to another without losing your data.
  • By syncing at the end of each message, the data in stable memory is complete after each messages. This means you can upgrade your canister without running code in the canister_preupgrade hook, greatly reducing the risk of non-upgradeable canisters.
  • The atomicity guarantees of the Internet Computer may allow you to use sqlite without locking and possibly without a journal, simplifying things. (But beware cross-message transactions are possible).
  • Plenty of tools out there that build on top of sqlite, e.g. ORMs, that you could then use on the IC.

How to get there?

@hassen.saidi has already demonstrated that we can compile the sqlite library to Wasm and use it on the Internet Computer; there backed by Main memory only (memvfs), and thus not persisted across upgrades.

Luckily, sqlite is designed to be cross-platform and allows us to provide our own storage backend code, using the VFS feature. It even has example VFS code, which quite nicely describes our use case:

This file implements a demonstration VFS named “fs” that shows how SQLite can be used on an embedded device that lacks a filesystem. Content is written directly to the underlying media. A VFS derived from this demonstration code could be used by a gadget with a limited amount of flash memory to make SQLite behave as the filesystem for the flash memory on the device.

So we’d have to change this file to use the ic0.stable_* system API calls in mediaRead() and mediaWrite(), wrap it in a Rust crate, maybe add some initialization code, and we are ready to roll!

So, who wants this?

I find the prospect of this pretty appealing, but I currently don’t have a project of my own that needs this, so I shied away from just doing this and instead came here to describe the idea. Maybe someone is now enthused and builds this?

Or maybe ICDevs considers this a worthwhile bounty (ping @skilesare)?

Or maybe the SDK team in the foundation will take this on?

15 Likes

This is the code that was missing above (stupid overly eager SQL injection protection – can this please please be disabled?)

use rusqlite::{…};
…
  let conn = Connection::open(":stable-memory:", ).unwrap();
  conn.execute("CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT NOT NULL
  )", NO_PARAMS).unwrap();

  let name: String = "Steve Example".to_string();
  let email: String = "[email protected]".to_string();
  conn.execute("INSERT INTO person (name, email) VALUES (?1, ?2)",
    &[&name, &email]).unwrap();
1 Like

Yes please, this would be a game changer for our project. I don’t think I’m capable of coding something like this (nor would I want the future of data storage to hinge on my janky code).

We would gladly contribute to any bounty though.

I think I’d still use really low level stuff like arrays for smaller tables, but for player data this would be perfect.

4 Likes

Well, in this vision, all persisted memory would be put in the database somehow (like in a web service). You can of course keep the data in some other form in main memory for faster access (although I am not sure if this would really be needed).

1 Like

I found an example of a sqlite VFS that is written in pure rust (as well as another one inspired by that); so maybe no C writing would be necessary.

3 Likes

Hopefully we can also have Motoko bindings for sqlite if OP’s proposal really gets implemented…

1 Like

Now that I think about it, this problem is more widespread than I had thought.

For example, I need full-text search functionality, so I was looking around on Rust and Motoko and found tantivy on Rust. It just occurred to me that in order to persist tantivy data (e.g. search indexes) across upgrades, I’d have to implement some sort of IC VFS adapter, just like OP suggested for sqlite. Currently, tantivy doesn’t support writing to stable memory using IC syscalls, obviously.

So maybe writing canisters in Rust to take advantage of pre-existing libraries isn’t as clear of a home run as I had thought. For libraries that don’t interface with a file system, Rust definitely provides an advantage. For libraries that do, well… someone needs to write an adapter.

This is one area where the Motoko stable variable abstraction is super convenient (although still limited to 4 GB until 64-bit support is in).

This looks super interesting and we could certainly do an ICDevs bounty to get this up and running. I’d love a couple of pieces of analysis from Rust/C Devs that can look at this code. @lastmjs probably has some good experience to throw in here as well from his work on GraphQL.

  1. What processes are going to be an issue for long-running processes? I’m particularly interested in processes that need to run out of band and build indexes/run table scan searches.
  2. Does the architecture of SQLLite bend itself toward chunking these processes so that they might be optimized to fit inside of the cycle limit?

You are spot on a possible weak point with that question The whole scheme only works well if sqlite does not have to long (i.e. linear with the file size) operations that can’t be interrupted (migrations, indexing etc.). So yes, checking that (maybe simply by asking on their mailing list) is part of the job.

I think we need journaling at a block level. This is simple to do, than
-porting sql
-figuring out the performance and other nooks of a database engine.

There is nothing lite about sqllite in this context.

Why SQL ? Sql is justified, when data is large, structured and more importantly queries are adhoc. The relational algebra helps with the adhoc introspection of data. Using SQL to solve journaling/atomicity problem seems like a overkill, and ties the solution to unrelated usecase

A block based journal is fairly easy to implement.

  1. Divide free memory into a linear block address space
  2. Reserve initial 32 or N blocks for journal. You can fit fixed sized txn here

Journal api
Txn_create
Txn_add(block_no, new_data)
Txn_add(…)
Txn_commit(…)
- replay any transaction that exists in uncommitted state and valid state (crc32 matches)
- write out txn blocks to journal area, with a crc32 checksum over all blocks in txn

Eventually you can extend this to be an infinitely scalable lvm like journaled block Device, by linearly stitching canisters together

If you are going down the sqllite path. Two things

  1. Storage sizing. If you report X as disk/canister size, the writes to disk for X bytes must always succeed

  2. Exclusive access to memory for every update. Imo ic model lends well here

And other things from here
https://www.sqlite.org/howtocorrupt.html

I am not sure why journaling plays a role here. The IC already guarantees atomicity, so data corruption isn’t an issue (with or without a SQL database). Maybe you are right with the above statement, but among many problems solved by sqlite, this is the one that we don’t have on the IC.

But the problems we are facing on the IC is:

  • Coming up with a binary storage format for your application’s data
  • That allows you to index into it (as reading all of stable data into main memory isn’t feasible)
  • Where writes are reflected in the storage at the end of each message (to avoid canister_preupgrade and it’s dangers)
  • Is stable even across library versions (and hence canister versions)
  • Schema changes are a well-understood problem.
  • Does not tie your canister to a single programming language forever.
  • Cleverly uses the (fast) main memory as a cache to the (slower) stable memory.
  • Is designed to avoid dirtying too many pages of stable memory, to keep costs down, just like a system built on a SSD disk.

Of course it’s possible and fun to build all that manually, but why not reach for an existing solution.

There might be other block-device-backed persistence solutions out there to consider, but sqlite seems to be a go-to standard.

2 Likes

@nomeata Yes, you are right in the “intra” canister calls case atomicity is guaranteed at a call level (barring the throw). The atomicity/isolation pitfalls you described in your blog are for the multi-call transactions (inter/intra canister).
So the block journaling approach I described is not applicable in the single canister call case. It might be useful in multi-canister transactions but will additional infrastructure for isolation, (some form of cross canister locking :(.

So, back to SQLite/FS as a solution to the problems you mentioned. It’s a great idea, the insight probably is that you cannot use data structures/lib that were not intended to host persistent data (viz rust hash tables) to compose systems that store data permanently (ie IC). These in-memory data-structures/lib (viz in-memory hash) have no obligation to keep their internal layout backward compatible, (unlike FS/databases).
On that front, there are two logical solutions.

  1. Switch/port on-disk solution like leveldb,sqllite,initramfs,tinydb to work on IC. Canister memory then truly is dumb disk abstraction.
  2. Or version the in-memory data-structures libraries, such that incompatible code-changes don’t interpret data the wrong way.

Just be aware that FS/DB doesn’t magically solve the upgrade problem. Much of backward compatibility is at the cost of backing off on “features” that require incompatible data-structure changes :). Filesystems and databases are slow on innovation because they are tied to on-disk formats.

https://www.joachim-breitner.de/blog/788-How_to_audit_an_Internet_Computer_canister

After reading this, looks like building any “serious” composable solution (all persistent storage is serious business) is a non-trivial task on IC. The promise of IC is composability and to compose anything trivial it seems the developer is expected to have an in-depth understanding of the implementation of ACID.

Auditing is great, SQLLite is a great start. But I think it needs more thought on making it difficult for people to shoot themselves in the foot in the first place. In short, I believe we are presenting a raw disk abstraction to users to work with, and expecting the FS/DB details to be filled in by or careful programming/audits :slight_smile:

1 Like

I wondered someone can implement a filesystem but no one reply me.
Can stable memory used as a virtual filesystem - Developers - Internet Computer Developer Forum (dfinity.org)
I tried a little but OS develop is not such an easy project and it’is better done by dfinity offical rather than by community.

Wow ! great where are you stuck ? How can we help, but why do need OS programming for this. I might not have the entire picture here, but you don’t have to deal with WASI when working with the IC canister abstraction. Think of building a “model” canister library - that can aggregate stable memory across canister, scale when needed and provide a S3 like abstraction. Filesystem hierarchy can be built on top of it. For just accessing local canister memory as a filesystem - I don’t see the universal applicability. It may be just me.

I must admit that I never thought that I would actually build a file-system WITHIN a smart contract.

I am working on developing two different filesystems on stable memory… the first one is the s3-like abstraction … i.e. an object store. This is a toy file system…just to get the nuances of working on mediaRead and mediaWrite just so.

The second is a more ambitious file-system that automatically does keyword based pathing for textual objects (not a general purpose fs).

Will publish this in my github soon.

2 Likes

very nice. Filesystem everywhere ! A request, let’s keep this thread for the SQLite discussion that @nomeata started, we can use the old FS thread or make a new one for the filesystem discussion :slight_smile:

2 Likes

Hello @nomeata my opinion

SQLlite might not be a perfect match and maybe dogged with a long tail of corruption issues. Please read this Atomic Commit In SQLite for more context on the response.

Performance and cost: DB relies on the underlying FS and storage hardware features for performance. One can present the DB with a featureless storage and FS(stable memory). This would force the DB to take the most conservative and non-performant path to “ensure” crash consistency. In a canister, this would translate to a single-byte update resulting in a exorbitantly large write amplification.

Also, SQLite documentation mentions support for configurable sector size (atomic unit of write to disk). In practice this is hardly configurable - only 2 sizes are guaranteed to be tested heavily and work 4K and 512bytes. Every byte change would result in reading and rewriting of a sector chunk. Roughly 2 read 2 write per byte change (of course there is coalescing but let’s take the worst case). The sector size cannot be scaled down arbitrarily.

For correctness: To the very least we need SQLite crash consistency test (.i.e. crash the OS at an inopportune time) to pass on the IC. Don’t quite know how to build a test harness for this.

So if it’s a build vs re-use decision, it boils down to WRITE GOOD code OR Trust SQLlite is good code on new hardware. SQLite is good code it needs to be qualified on new hardware aka stable memory.

If someone wants to pursue this I would suggest getting the crash consistency tests running on IC. If the test fails this is what the debugging will involve No Title

Some interesting scenarios to work out:

  1. A crash/trap leaves an uncommitted transaction on the disk.
  2. An update query will involve having to clean/replay the transaction, which would involve a write. The write will be discarded. At least one problem is that every update will involve a read replay discard of the uncommitted journal.

TL;DR
If one is considering mission-critical (all production) data like ledger info I won’t recommend SQL lite. Too much complexity in lieu of benefits.

I think you are missing the fact that canisters never ever crash uncontrolled. If a canister method traps, the IC rolls back the changes! It will never happen that a canister stops in the middle of a sqlite operations and just leaves it’s memory in that corrupted state. So none of the problems that sqlite needs to deal with exits on the IC.

And the sector size setting is actually ideal as well: The IC will charge memory usage not per byte, but per read or dirtied Wasm page (64k). So if sqlite manages its storage in 64k sectors (or even 4k) then we benefit from the years of work that sqlite authors did to make that efficient, i.e. avoid reading and writing more sectors than needed, plus its logic for not hitting stable memory when the data is in main memory and combining reads at the end of the message (assuming that’s when you end your transaction and sync).

The complexity is not in sqlite per se, but inherent in the system (sans worry about corrupted state due to corruption). And the problems are not unique to the IC. So one has to deal with it in any case, and then it’s better to reach for a proven solution than to reinvent the wheel again.

1 Like