Security Sandboxing

Update

A lot has happened in the last two weeks:

  • Sandbox is now fully switched to the shared memory representation.
  • Stable memory operations are handled within the sandbox process without IPC.
  • System calls that return static data (such as canister_id, controller, and canister_state) are also handled within the sandbox process without IPC.
  • The SubnetAvailableMemory counter is now maintained properly.
  • The existing performance optimization that avoids tracking dirty pages for queries has been ported to the sandbox.
  • The unit tests now automatically rebuild the sandbox binary if there were any changes.
  • A lot of investigation and debugging work was done on the infrastructure side to identify issues blocking CI with sandboxing.
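
To illustrate the "static data without IPC" item above, here is a minimal sketch, not the actual replica code. It assumes the sandbox receives a snapshot of the static values once and only falls back to the replica for everything else. All names (`StaticSystemState`, `SystemCall`, `ReplicaChannel`, `SandboxSystemApi`, `CountingChannel`) are hypothetical and made up for this example.

```rust
use std::cell::Cell;

/// Hypothetical snapshot of values that cannot change during one execution.
struct StaticSystemState {
    canister_id: Vec<u8>,
    controller: Vec<u8>,
    canister_state: u32,
}

/// Hypothetical subset of system calls a canister might issue.
enum SystemCall {
    CanisterId,
    Controller,
    CanisterState,
    /// Stands in for any call whose answer depends on replica-side state.
    Dynamic(String),
}

#[derive(Debug, PartialEq)]
enum Reply {
    Bytes(Vec<u8>),
    State(u32),
}

/// Stand-in for the IPC channel between the sandbox and the replica process.
trait ReplicaChannel {
    fn round_trip(&self, request: &str) -> Reply;
}

/// System API as seen from inside the sandbox process (assumed design).
struct SandboxSystemApi<C: ReplicaChannel> {
    static_state: StaticSystemState,
    channel: C,
}

impl<C: ReplicaChannel> SandboxSystemApi<C> {
    fn handle(&self, call: SystemCall) -> Reply {
        match call {
            // Static data is answered from the local snapshot: no IPC.
            SystemCall::CanisterId => Reply::Bytes(self.static_state.canister_id.clone()),
            SystemCall::Controller => Reply::Bytes(self.static_state.controller.clone()),
            SystemCall::CanisterState => Reply::State(self.static_state.canister_state),
            // Everything else still needs a round trip to the replica.
            SystemCall::Dynamic(name) => self.channel.round_trip(&name),
        }
    }
}

/// Test double that counts round trips and echoes the request back.
struct CountingChannel {
    round_trips: Cell<usize>,
}

impl ReplicaChannel for CountingChannel {
    fn round_trip(&self, request: &str) -> Reply {
        self.round_trips.set(self.round_trips.get() + 1);
        Reply::Bytes(request.as_bytes().to_vec())
    }
}
```

The counting channel is only there to make the saving visible: static calls leave the round-trip counter untouched, while a dynamic call increments it.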

Preliminary benchmarking shows that the performance of sandboxed query calls is on par with the non-sandboxed version. Performance of update calls in memory-intensive workloads has improved by 100x compared to the state two weeks ago, but it is still 2x slower than the non-sandboxed version. We are currently experimenting with an optimization that should bring the performance gap down to 20%-30%.

Remaining work:

  • Fix numerous infrastructure issues to enable CI with sandboxing.
  • Enable sandboxing in ic-replay that is used for disaster recovery.
  • Implement the performance optimization for memory-intensive update calls (mentioned above).
  • Port WasmExecutor::create_execution_state to sandbox so that module instantiation happens in the sandbox process.
  • Improve error logging and metrics reporting in the sandbox process.
12 Likes

Update

I am happy to share that the implementation of sandboxing is almost complete. We are aiming to launch in two weeks and will host a community conversation with a detailed update on January 25.

The work mentioned in my previous post is done. The sandboxed version now passes all our qualification tests. We also implemented optimizations that fixed the performance blockers.

You can follow most of the sandboxing work since December here and here (doesn’t include infra, guest image changes).

10 Likes

What’s the current status of sandboxing? Is it deployed now on all subnets?

1 Like

We had to disable sandboxing after the initial launch because of issues discovered on large subnets like pjljw:

  • The static process allocation strategy (create a process lazily on the first message and keep it alive) did not work well.
  • The process spawn duration increased over time from 10ms to 500ms as more processes were spawned.

The good news is that we have since fixed these issues and will enable sandboxing again in the next release (which rolls out next week):

  • We now have dynamic process allocation, where idle processes are terminated.
  • We optimized process spawning down to 2ms, and it no longer regresses over time.
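
For readers wondering what "dynamic process allocation" could look like, here is a rough sketch under my own assumptions, not the replica's implementation: processes are still created lazily on the first message, but each one records when it was last used, and a periodic sweep terminates those idle for longer than a threshold. `SandboxPool`, `SandboxProcess`, `max_idle`, and the `u64` canister key are all hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Stand-in for a handle to a running sandbox process.
struct SandboxProcess {
    last_used: Instant,
}

/// Hypothetical pool of sandbox processes, keyed by a numeric canister id.
struct SandboxPool {
    processes: HashMap<u64, SandboxProcess>,
    max_idle: Duration,
    /// Total number of spawns, kept only to make the behavior observable.
    spawned: usize,
}

impl SandboxPool {
    fn new(max_idle: Duration) -> Self {
        SandboxPool {
            processes: HashMap::new(),
            max_idle,
            spawned: 0,
        }
    }

    /// Returns the process for `canister`, spawning it lazily on first use,
    /// and marks it as used at `now`.
    fn get_or_spawn(&mut self, canister: u64, now: Instant) -> &mut SandboxProcess {
        let spawned = &mut self.spawned;
        let process = self.processes.entry(canister).or_insert_with(|| {
            *spawned += 1; // a real implementation would start a process here
            SandboxProcess { last_used: now }
        });
        process.last_used = now;
        process
    }

    /// Drops every process that has been idle for longer than `max_idle`
    /// and returns how many were terminated.
    fn evict_idle(&mut self, now: Instant) -> usize {
        let before = self.processes.len();
        let max_idle = self.max_idle;
        self.processes
            .retain(|_, p| now.saturating_duration_since(p.last_used) <= max_idle);
        before - self.processes.len()
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic. An evicted canister simply gets a fresh process on its next message, which is why a fast spawn time matters for this strategy.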
7 Likes

Awesome! And will it be rolled out to all subnets initially or will you expand per subnet over time?

2 Likes

It will roll out to all subnets (as part of a normal release process) except for eq6en and tdb26:

  • eq6en has high memory usage,
  • tdb26 is the NNS.

In both cases we want to be extra-careful and enable sandboxing after confirming that it works well.

6 Likes

Fwiw, the disabling was in the release change log, but I didn’t think to post here. My bad.

2 Likes

Sandboxing rolled out for all subnets except eq6en, w4rem, and tdb26. These three subnets will get sandboxing in the next release.

There are no issues so far.

11 Likes

Legendary comment, excellent work

4 Likes

"There are no issues so far."

The sandbox process crashed on a single node on pjljw and we had to disable sandboxing there. The investigation is in progress. Until we have a fix, we will keep sandboxing disabled on pjljw, eq6en, and tdb26.

2 Likes

We found the root cause of the crash. It was caused by an old bug in the state manager. Sandboxing changed the timing and made the crash more likely to happen.

The fix is in review and will make it into the next release cut, which will be deployed in two weeks.

10 Likes

Canister sandboxing is deployed on all subnets and is running without any issues so far.

17 Likes

That is awesome, congratulations!

7 Likes