Sandbox is now fully switched to the shared memory representation.
Stable memory operations are handled within the sandbox process without IPC.
System calls that return static data (such as canister_id, controller, and canister_state) are also handled within the sandbox process without IPC.
The SubnetAvailableMemory counter is now maintained properly.
The existing performance optimization to avoid tracking dirty pages for queries is ported to sandbox.
The unit tests now automatically rebuild the sandbox binary if there were any changes.
A lot of investigation and debugging work was done on the infrastructure side to identify issues blocking CI with sandboxing.
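Two of the items above (answering static-data system calls locally, and skipping dirty-page tracking for queries) can be illustrated with a minimal sketch. This is not the replica's actual code: `StaticSystemState`, `SandboxedExecution`, `ExecutionKind`, `PAGE_SIZE`, and the `u64` identifiers are hypothetical names and simplifications chosen for illustration. The assumption is that static data is copied into the sandbox process once when execution starts, and that query changes are discarded, so their dirty pages never need to be recorded.

```rust
use std::collections::BTreeSet;

/// Hypothetical page size used for dirty-page bookkeeping.
const PAGE_SIZE: usize = 4096;

/// Hypothetical message kind; only updates persist their memory changes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ExecutionKind {
    Query,
    Update,
}

/// Hypothetical snapshot of static data, sent to the sandbox once per execution.
#[derive(Clone, Debug)]
struct StaticSystemState {
    canister_id: u64,
    controller: u64,
}

/// Hypothetical per-execution state that lives inside the sandbox process.
struct SandboxedExecution {
    static_state: StaticSystemState,
    kind: ExecutionKind,
    dirty_pages: BTreeSet<usize>,
}

impl SandboxedExecution {
    fn new(static_state: StaticSystemState, kind: ExecutionKind) -> Self {
        Self { static_state, kind, dirty_pages: BTreeSet::new() }
    }

    /// Answered from the local snapshot, so no round trip to the replica is needed.
    fn canister_id(&self) -> u64 {
        self.static_state.canister_id
    }

    /// Also answered from the local snapshot.
    fn controller(&self) -> u64 {
        self.static_state.controller
    }

    /// Writes `data` into `memory` at `offset`. The touched pages are
    /// recorded only for updates; queries skip the bookkeeping entirely.
    fn write(&mut self, memory: &mut [u8], offset: usize, data: &[u8]) {
        if data.is_empty() {
            return;
        }
        memory[offset..offset + data.len()].copy_from_slice(data);
        if self.kind == ExecutionKind::Update {
            let first_page = offset / PAGE_SIZE;
            let last_page = (offset + data.len() - 1) / PAGE_SIZE;
            self.dirty_pages.extend(first_page..=last_page);
        }
    }
}
```

In this sketch, a 10-byte write that straddles the first page boundary marks pages 0 and 1 as dirty for an update and marks nothing for a query, while the bytes themselves are written in both cases.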
Preliminary benchmarking shows that the performance of sandboxed query calls is on par with the non-sandboxed version. Performance of update calls in memory-intensive workloads has improved by 100x compared to the state two weeks ago, but it is still 2x slower than the non-sandboxed version. We are currently experimenting with an optimization that should bring the performance gap down to 20%-30%.
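To put the update-call numbers on one scale, here is a small sketch of the arithmetic. It assumes that "2x slower" means an update takes twice as long as in the non-sandboxed version and that a "gap of 20%-30%" means 1.2x to 1.3x the non-sandboxed time. The function names are hypothetical, and the results are only as precise as the rounded figures quoted above.

```rust
/// Slowdown relative to the non-sandboxed version (1.0 means "on par")
/// before an improvement of `improvement_factor` was applied.
fn slowdown_before(current_slowdown: f64, improvement_factor: f64) -> f64 {
    current_slowdown * improvement_factor
}

/// Converts a performance gap in percent into a slowdown factor,
/// assuming the gap is measured against the non-sandboxed time.
fn slowdown_for_gap(gap_percent: f64) -> f64 {
    1.0 + gap_percent / 100.0
}

/// Speedup still needed to get from `current_slowdown` to `target_slowdown`.
fn remaining_speedup(current_slowdown: f64, target_slowdown: f64) -> f64 {
    current_slowdown / target_slowdown
}
```

Under these assumptions, `slowdown_before(2.0, 100.0)` puts memory-intensive updates at roughly 200x the non-sandboxed time two weeks earlier, and reaching a 20%-30% gap from the current 2x requires a further speedup of roughly 1.5x to 1.7x.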
Remaining work:
Fix numerous infrastructure issues to enable CI with sandboxing.
Enable sandboxing in ic-replay that is used for disaster recovery.
Implement the performance optimization for memory-intensive update calls (mentioned above).
Port WasmExecutor::create_execution_state to sandbox so that module instantiation happens in the sandbox process.
Improve error logging and metrics reporting in the sandbox process.
I am happy to share that the implementation of sandboxing is almost complete. We are aiming for the launch in two weeks and will host a community conversation with a detailed update on January 25.
The work mentioned in my previous post is done. The sandboxed version now passes all our qualification tests. We also implemented optimizations that fixed the performance blockers.
You can follow most of the sandboxing work since December here and here (doesn't include infra, guest image changes).
The sandbox process crashed on a single node on pjljw and we had to disable sandboxing there. The investigation is in progress. Until we have a fix, we will keep sandboxing disabled on pjljw, eq6en, and tdb26.
We found the root cause of the crash. It was caused by an old bug in the state manager. Sandboxing changed the timing and made the crash more likely to happen.
The fix is in review and will make it into the next release cut, which will be deployed in two weeks.