Hello everyone,
Over the past few months, I’ve been working on improving the current implementation of stable-fs, and I’m excited to share the new features and optimizations I’ve introduced.
TLDR
Stable-fs is the backend of ic-wasi-polyfill. Together with the wasi2ic project, it makes it possible to adapt various WASI projects to run on the Internet Computer. The file system is implemented using a few BTreeMap stable structures and can utilize stable memory for larger storage and for preserving canister data between upgrades.
Various optimizations and some interesting new features were implemented. It is now possible to write a 100MB file onto the file system within the 2B instruction limit.
Some other interesting new features: mounted memory files (for faster read/write operations), the ability to configure the chunk size, and the ability to limit the maximum allowed file size.
The new version (0.7) is backwards compatible and should just work with the older storage after canister upgrade.
I don’t consider it production-ready yet, and would recommend it for non-critical projects. At the same time, I think it is quite stable and should be able to handle reasonably heavy loads.
Your feedback is welcome!
What is Stable-FS?
Stable-fs is a file system implementation that serves as the backend for the ic-wasi-polyfill library, which is used by the wasi2ic tool to convert WASI applications and run them on the Internet Computer (IC).
The initial implementation of stable-fs used a combination of a stable Cell and three BTreeMap
structures to manage file system components such as the folder structure, file metadata, and file contents stored in data chunks:
struct StableFileSystem {
header: Cell<Header>,
metadata: BTreeMap<Node, Metadata>,
dir_entry: BTreeMap<(Node, DirEntryIndex), DirEntry>,
file_chunk: BTreeMap<(Node, FileChunkIndex), FileChunk>,
}
- header: Stores general information like the file system version and the next node ID to use.
- metadata: Contains details like file type, size, and reference count.
- dir_entry: Represents folder entries within parent nodes.
- file_chunk: Holds 4KB chunks of file data.
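For illustration, here is a minimal sketch of how such a layout can be created on top of ic-stable-structures, with each map placed on its own virtual memory. It uses simplified placeholder key/value types (u64 keys and raw byte values) instead of the actual stable-fs Node/Metadata/DirEntry/FileChunk definitions:
use ic_stable_structures::memory_manager::{MemoryId, MemoryManager, VirtualMemory};
use ic_stable_structures::{DefaultMemoryImpl, StableBTreeMap};

type Mem = VirtualMemory<DefaultMemoryImpl>;

// Placeholder types: the real stable-fs keys are (Node, index) pairs and the
// values are Metadata / DirEntry / FileChunk structs implementing Storable.
fn init_layout(
    mm: &MemoryManager<DefaultMemoryImpl>,
) -> (
    StableBTreeMap<u64, Vec<u8>, Mem>, // metadata
    StableBTreeMap<u64, Vec<u8>, Mem>, // dir_entry
    StableBTreeMap<u64, Vec<u8>, Mem>, // file_chunk
) {
    (
        StableBTreeMap::init(mm.get(MemoryId::new(0))),
        StableBTreeMap::init(mm.get(MemoryId::new(1))),
        StableBTreeMap::init(mm.get(MemoryId::new(2))),
    )
}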
While functional, this initial implementation had some performance problems; to address them, I implemented a few new features and improvements.
Mounted virtual memory
As a first step to address the performance concerns, I implemented mounted memory files. This allows a user to configure files to act as hosts for virtual memories. Once configured, the file system forwards all read and write operations on such a file to one of the virtual memories, bypassing several overheads and working considerably faster than the chunked file storage. This makes mounted files particularly suitable for use cases with a few large files requiring high-speed random access, such as SQLite databases.
While implementing mounted memories, I wanted to give as much flexibility as possible in how they are used. A memory can be attached at runtime over an existing file. There is functionality to transfer an existing mounted memory into the chunked storage, or to initialize a new mounted memory with the data from the host file's chunked storage. Alternatively, it is possible to use the mounted storage without synchronizing it with the chunked storage at all: after a canister upgrade, simply re-mount it again and no data is lost, since the current file size is stored in the mounted file's own metadata.
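As a rough usage sketch (the function names and signatures below are assumptions for illustration rather than the confirmed ic-wasi-polyfill API, so please check the crate documentation for the exact calls), mounting a memory over a file could look something like this:
use ic_stable_structures::memory_manager::{MemoryId, MemoryManager};
use ic_stable_structures::DefaultMemoryImpl;

#[ic_cdk::init]
fn init() {
    let memory_manager = MemoryManager::init(DefaultMemoryImpl::default());

    // Hypothetical call names: initialize the polyfill, then back "db.sqlite"
    // with its own virtual memory so that reads and writes on this file bypass
    // the chunked storage.
    ic_wasi_polyfill::init(&[0u8; 32], &[]);
    ic_wasi_polyfill::mount_memory_file(
        "db.sqlite",
        Box::new(memory_manager.get(MemoryId::new(10))),
    );
}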
While this approach has limitations, e.g. the limited number of available virtual memories, it sets a good performance baseline for comparison with the chunked file storage.
New chunked memory storage
For use cases where the file system holds many files, a faster chunked storage is still important. There are a few overheads to deal with:
- Serialization overhead: Reading or writing even small amounts of data requires processing entire chunks.
- Search overhead: Chunks have to be located within a BTreeMap.
- Insertion/re-balancing overhead: Adding new chunks to the BTreeMap requires locating and repositioning existing nodes and re-balancing the tree, which adds extra overhead.
New Chunked Storage Structure
Inspired by discussions with @Ulan, I introduced a more efficient data structure split across three virtual memories:
struct V2FileChunks {
chunk_ptr: BTreeMap<(Node, FileChunkIndex), FileChunkPtr>,
chunks: VirtualMemory,
allocator: ChunkPtrAllocator,
}
The chunk_ptr BTreeMap now stores only pointers instead of the actual data, which reduces the serialization overhead. Also, as long as chunks are not deleted, the pointers to them remain constant and can be cached for faster rewrites.
The chunks memory stores the actual fixed-size file chunks. Direct access to this memory allows reading and writing small portions of data: there is no need to rewrite a whole chunk, and no multiple copies are made (the data is read/written only once between the source and the destination memory).
Finally, the allocator is responsible for managing freed chunks. It remembers all released pointers on its stack and offers them for reuse when the allocate function is called. If no freed chunks are available, the allocator hands out the next never-used address from the chunk memory, growing the chunks memory as a result.
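To make the allocation scheme concrete, here is a simplified sketch of the idea (not the actual stable-fs implementation, which keeps this state in stable memory): released pointers go onto a free-list stack, and allocation either pops from that stack or hands out the next never-used offset.
struct ChunkPtrAllocator {
    chunk_size: u64,     // e.g. 16 * 1024 bytes
    next_unused: u64,    // offset of the first never-used chunk in the chunks memory
    free_list: Vec<u64>, // pointers of released chunks, ready for reuse
}

impl ChunkPtrAllocator {
    fn allocate(&mut self) -> u64 {
        // Reuse a previously released chunk if one is available...
        if let Some(ptr) = self.free_list.pop() {
            return ptr;
        }
        // ...otherwise hand out the next never-used offset,
        // effectively growing the chunks memory.
        let ptr = self.next_unused;
        self.next_unused += self.chunk_size;
        ptr
    }

    fn free(&mut self, ptr: u64) {
        // Remember the released pointer for later reuse.
        self.free_list.push(ptr);
    }
}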
Chunk size matters!
Experimentation revealed that the chunk size significantly affects the chunk insertion overhead: for instance, doubling the chunk size halves the number of insertions needed for the same amount of data.
I’ve increased the default chunk size from 4KB to 16KB, enabling the writing of a 100MB file in under 2 billion instructions.
For flexibility, users can configure the chunk size (4K to 64K) during canister initialization, before anything is written to the file system. With the chunk size set to 64K, writing 100MB can be done in under 1B instructions.
Summary of new features available in v0.7
- Backward compatibility with older versions.
- Storing a mounted memory into its host file, and initializing a mounted memory from the host file's chunked storage.
- Switching between V1 and V2 chunks if needed (V2 is active by default, and there is probably no good reason to switch back to the older V1 chunks other than for performance comparisons).
- Configurable chunk size (possible settings: 4K, 8K, 16K, 32K, 64K).
- Sparse file support.
- A maximum allowed file size setting.
Summary of optimizations
- Caching of metadata, file descriptors, and chunk pointers for faster access times.
- Caching of missing chunks (if a chunk was not found, this information is also cached).
- Look-ahead pointer caching for reading and writing (we only search for the first chunk, then advance the iterator to find the following chunks; see the sketch after this list).
- Faster metadata read/write (no serialization, the object is stored directly in a dedicated memory area).
- Chunk storage optimization for metadata (the same V2 file chunks are also used to store file metadata).
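To illustrate the look-ahead idea, here is a minimal sketch using a regular in-memory BTreeMap rather than the actual stable structures: the map is searched once for the first chunk of the requested range, and the following chunks are taken from the same iterator instead of performing a separate lookup per chunk.
use std::collections::BTreeMap;

// Simplified illustration: keys are (node, chunk index), values are chunk pointers.
fn collect_chunk_ptrs(
    chunk_ptr: &BTreeMap<(u64, u32), u64>,
    node: u64,
    first: u32,
    last: u32,
) -> Vec<u64> {
    // One range search positions the iterator on the first chunk;
    // the subsequent chunks are obtained by simply advancing it.
    chunk_ptr
        .range((node, first)..=(node, last))
        .map(|(_, ptr)| *ptr)
        .collect()
}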
Performance comparisons
Here’s how the stable-fs versions compare when reading or writing 100 MB of data with different access patterns: the older version (v0.4), the current version (v0.7) and the current version performance based on the mounted files:
Read/write 100 MB of data | v0.4 | v0.7 | v0.7 mounted files |
---|---|---|---|
Write into a new file | 14.51 B | 1.32 B | 124.45 M |
Overwrite existing file | 10.52 B | 142.73 M | 100.01 M |
Read in a single data segment | 4.00 B | 142.34 M | 100.02 M |
Write 1 KB segments into a new file | 79.48 B | 2.26 B | 918.28 M |
Overwrite an existing file in multiple 1K segments | 57.42 B | 655.89 M | 519.91 M |
Read multiple 1 KB segments | 22.59 B | 621.00 M | 476.11 M |
Write 1 KB segments into 10 new files (10 MB each) | 81.39 B | 2.29 B | 1.11 B |
Overwrite 1 KB segments in 10 files | 57.61 B | 819.30 M | 699.55 M |
Read 1 KB segments from 10 files | 22.93 B | 736.88 M | 655.03 M |
The first three tests check the performance of writing and reading one 100 MB data segment in a single call. The rest of the tests estimate the performance of making multiple 1024 byte calls on one or multiple files.
As shown, there was a significant improvement from version v0.4 to v0.7.
The mounted-memory baseline is, as expected, faster than the chunked storage. The biggest difference is for newly created files, where the chunked storage needs an extra ~1 billion instructions to create the file and insert all the chunks into the tree. On the other hand, reading files or overwriting existing data (which does not require chunk node insertions) is relatively fast, with only a few million instructions of overhead.
SQLite database benchmarks
The database consists of two tables: users(user_id, username, email) and orders(order_id, user_id, amount).
SQLite benchmark | regular files | database on mounted memory |
---|---|---|
Create 100 000 users | 2.35 B | 2.30 B (improved by 2.23%) |
Create 1M orders (each refers to one of the users, no extra indexes present) | 36.28 B | 35.99 B |
Create indexes on users.email and orders.user_id once orders is filled with data | 6.98 B | 6.78 B (improved by 2.76%) |
Join query: SELECT u.user_id, u.username, o.order_id, o.amount FROM users u JOIN orders o ON u.user_id = o.user_id WHERE u.user_id < 1000 ORDER BY o.created_at DESC; | 247.81 M | 217.42 M (improved by 12.26%) |
Select using LIKE on an indexed field: SELECT * FROM users WHERE email LIKE 'user%' | 856.80 M | 846.90 M |
Create 100 extra orders after 1M orders and the field indexes are already in place | 32.37 M | 18.39 M (improved by 43.20%) |
Remove 1000 orders (each user has 10 orders): DELETE FROM orders WHERE user_id <= 100 | 90.10 M | 53.32 M (improved by 40.82%) |
Delete 100000 orders with transaction rollback: BEGIN TRANSACTION; DELETE FROM orders WHERE order_id > 900000; ROLLBACK | 6.42 B | 6.30 B |
Interestingly, adding records in bulk did not benefit much from the mounted files. The main advantage was seen in operations involving random memory access.
Is it production-ready?
I don’t think it is production-ready yet. I have made my best effort to make the system durable, and I would recommend using it in various non-critical projects and sharing your experience. I plan to continue improving the existing implementation and adding more tests to check its WASI compliance, and I would be happy to receive your feedback.