Security Sandboxing

Update

As mentioned in the community conversation, we have a prototype implementation that passes all tests but is quite slow. To fix the performance, we:

  • Found a memory representation that allows sharing of memory pages between processes [design doc].
  • Implemented a file-based page allocator for PageDelta pages according to the design.
  • Refactored the execution layer to unify the handling of Wasm and stable memory such that both benefit from memory sharing.
  • Added support for transferring file descriptors between the replica and the sandbox processes.
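To make the list above more concrete, here is a minimal sketch of how a file-backed page allocator and file descriptor transfer can fit together. It is not the replica's code: `FilePageAllocator`, `send_fd`, `recv_fd`, `read_page`, and `demo` are hypothetical names, the 4096-byte page size is an assumption, and for brevity both ends of the socket live in one process, whereas the real setup has a replica process and a separate sandbox process. It needs a Unix-like OS and Python 3.9 or newer.

```python
import mmap
import os
import socket
import tempfile

PAGE_SIZE = 4096  # assumed page size for this sketch


class FilePageAllocator:
    """Hypothetical allocator: pages live in one file, so any process that
    holds the file descriptor can map the same pages."""

    def __init__(self, num_pages):
        self.file = tempfile.TemporaryFile()
        self.file.truncate(num_pages * PAGE_SIZE)
        self.fd = self.file.fileno()
        self.num_pages = num_pages
        self.next_free = 0
        # On Unix, mmap defaults to a shared, read-write mapping.
        self.mapping = mmap.mmap(self.fd, num_pages * PAGE_SIZE)

    def allocate(self, contents):
        """Copy `contents` into the next free page and return the page index."""
        if len(contents) > PAGE_SIZE:
            raise ValueError("contents do not fit into one page")
        if self.next_free >= self.num_pages:
            raise MemoryError("allocator is full")
        index = self.next_free
        self.next_free += 1
        start = index * PAGE_SIZE
        self.mapping[start:start + len(contents)] = contents
        return index


def send_fd(sock, fd):
    """Send a file descriptor over a Unix domain socket."""
    socket.send_fds(sock, [b"F"], [fd])


def recv_fd(sock):
    """Receive one file descriptor from a Unix domain socket."""
    _data, fds, _flags, _address = socket.recv_fds(sock, 1, 1)
    return fds[0]


def read_page(fd, index, length):
    """Map the whole file read-only and return `length` bytes of one page."""
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as view:
        start = index * PAGE_SIZE
        return view[start:start + length]


def demo():
    payload = b"page delta"
    replica_end, sandbox_end = socket.socketpair()
    allocator = FilePageAllocator(num_pages=4)
    index = allocator.allocate(payload)
    send_fd(replica_end, allocator.fd)
    received_fd = recv_fd(sandbox_end)
    try:
        # The receiver sees the page without the bytes being copied over the socket.
        return read_page(received_fd, index, len(payload))
    finally:
        os.close(received_fd)
        replica_end.close()
        sandbox_end.close()
```

Only the descriptor and a page index cross the socket; the page contents stay in the shared file mapping, which is the property that makes this kind of sharing cheap.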

More implementation work remains to switch the sandbox to the new shared memory representation.

8 Likes

Such excellent work, I love hearing about this.

1 Like

I want to know: if a canister is processed in a sandbox, is it processed by only one node in the subnet or by all the nodes in the subnet? And if only one node runs the canister, can security be ensured?

Replication will remain the same as before, so an update message of a canister will be processed by all nodes in the subnet. What is changing is that each node gets a sandbox process per canister in order to isolate the canister from the system and other canisters on that node.
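As a rough illustration of the answer above, the sketch below models a subnet in which every node executes the same update message, and each node keeps its own sandbox per canister. All names (`Sandbox`, `Node`, `execute_on_subnet`) are hypothetical, and the "canister" is just a counter; this is a toy model of the structure, not the replica's execution or consensus logic.

```python
class Sandbox:
    """Stands in for one OS process that hosts exactly one canister."""

    def __init__(self, canister_id):
        self.canister_id = canister_id
        self.counter = 0  # toy canister state

    def execute(self, amount):
        self.counter += amount
        return self.counter


class Node:
    """One node of a subnet with one sandbox per canister it has executed."""

    def __init__(self, name):
        self.name = name
        self.sandboxes = {}  # canister_id -> Sandbox

    def execute_update(self, canister_id, amount):
        if canister_id not in self.sandboxes:
            self.sandboxes[canister_id] = Sandbox(canister_id)
        return self.sandboxes[canister_id].execute(amount)


def execute_on_subnet(nodes, canister_id, amount):
    """Run the same update on every node and check that the results agree."""
    results = [node.execute_update(canister_id, amount) for node in nodes]
    if len(set(results)) != 1:
        raise RuntimeError("nodes diverged")
    return results[0]
```

In this model, a subnet of three nodes that runs two canisters ends up with six sandboxes in total: two on each node.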

3 Likes

Sorry if this was already answered, but I didn’t see it explicitly specified in the original proposal: is the sandboxing something that will be uniformly applied to all canisters on the IC, or is this an additional feature that will be available to specific canisters/subnets?

1 Like

I believe it is for all canisters and all subnets. However, @ulan and @hcb may want to roll it out subnet by subnet and test the feature at different stages.

I will let @ulan and @hcb add more color.

Sandboxing will be applied uniformly to all canisters on each subnet, once the feature is launched on a subnet. The purpose of sandboxing is to protect the IC and other canisters “from” a sandboxed canister; a canister's own sandbox affords no protection to that canister itself. Hence, it makes no sense for canisters to “opt in” to sandboxing; it only makes sense to treat all canisters equally.

5 Likes

Update

A lot has happened in the last two weeks:

  • The sandbox is now fully switched to the shared memory representation.
  • Stable memory operations are handled within the sandbox process without IPC.
  • System calls that return static data, such as canister_id, controller, and canister_state, are also handled within the sandbox process without IPC.
  • The SubnetAvailableMemory counter is now maintained properly.
  • The existing performance optimization to avoid tracking dirty pages for queries is ported to the sandbox.
  • The unit tests now automatically rebuild the sandbox binary if there were any changes.
  • A lot of investigation and debugging work was done on the infrastructure side to identify issues blocking CI with sandboxing.
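The dirty page optimization in the list above can be sketched as follows. The idea, as we understand it, is that changes made by a query are not persisted, so recording which pages it touched is wasted work. `TrackedMemory` and `run_message` are hypothetical names and the 4096-byte page size is an assumption; the real implementation tracks pages at the OS level rather than in a Python set.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class TrackedMemory:
    """Toy memory that can optionally record which pages were written."""

    def __init__(self, num_pages, track_dirty_pages):
        self.data = bytearray(num_pages * PAGE_SIZE)
        self.track_dirty_pages = track_dirty_pages
        self.dirty_pages = set()

    def write(self, offset, payload):
        end = offset + len(payload)
        if offset < 0 or end > len(self.data):
            raise IndexError("write is out of bounds")
        self.data[offset:end] = payload
        if self.track_dirty_pages and payload:
            first_page = offset // PAGE_SIZE
            last_page = (end - 1) // PAGE_SIZE
            self.dirty_pages.update(range(first_page, last_page + 1))


def run_message(kind, writes, num_pages=4):
    """Apply `writes` and return the sorted dirty pages.

    Tracking is enabled only for updates; for queries the bookkeeping is skipped.
    """
    memory = TrackedMemory(num_pages, track_dirty_pages=(kind == "update"))
    for offset, payload in writes:
        memory.write(offset, payload)
    return sorted(memory.dirty_pages)
```

A write of 10 bytes at offset 4090 crosses a page boundary, so an update reports pages 0 and 1 as dirty, while the same write in a query reports nothing.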

Preliminary benchmarking shows that the performance of sandboxed query calls is on par with the non-sandboxed version. Performance of update calls in memory-intensive workloads has improved by 100x compared to the state two weeks ago, but it is still 2x slower than the non-sandboxed version. We are currently experimenting with an optimization that should bring the performance gap down to 20%-30%.

Remaining work:

  • Fix numerous infrastructure issues to enable CI with sandboxing.
  • Enable sandboxing in ic-replay that is used for disaster recovery.
  • Implement the performance optimization for memory-intensive update calls (mentioned above).
  • Port WasmExecutor::create_execution_state to the sandbox so that module instantiation happens in the sandbox process.
  • Improve error logging and metrics reporting in the sandbox process.

12 Likes

Update

I am happy to share that the implementation of sandboxing is almost complete. We are aiming to launch in two weeks and will host a community conversation with a detailed update on January 25.

The work mentioned in my previous post is done. The sandboxed version now passes all our qualification tests. We also implemented optimizations that fixed the performance blockers.

You can follow most of the sandboxing work since December here and here (doesn’t include infra, guest image changes).

10 Likes

What’s the current status of sandboxing? Is it deployed now on all subnets?

1 Like

We had to disable sandboxing after the initial launch because of issues discovered on large subnets like pjljw:

  • The static process allocation strategy (create a process lazily on the first message and keep it alive) did not work well.
  • The process spawn duration increased over time from 10ms to 500ms as more processes were spawned.

The good news is that we have fixed these issues in the meantime and will enable sandboxing again in the next release (which rolls out next week):

  • We now have dynamic process allocation, where idle processes are terminated.
  • We optimized process spawning down to 2ms, and it no longer regresses over time.
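A minimal sketch of what dynamic process allocation with idle termination could look like is shown below. `SandboxPool` and `FakeProcess` are hypothetical names, the idle timeout is an arbitrary parameter, and timestamps are passed in explicitly so that the behavior is easy to follow; the eviction policy used in the replica may differ.

```python
class FakeProcess:
    """Stands in for a sandbox OS process in this sketch."""

    def __init__(self, canister_id):
        self.canister_id = canister_id
        self.alive = True

    def terminate(self):
        self.alive = False


class SandboxPool:
    """Spawns a sandbox lazily per canister and terminates idle ones."""

    def __init__(self, spawn, idle_timeout):
        self.spawn = spawn                # callable: canister_id -> process
        self.idle_timeout = idle_timeout  # same unit as the `now` values
        self.processes = {}               # canister_id -> (process, last_used)

    def get(self, canister_id, now):
        """Return the canister's process, spawning it on first use."""
        entry = self.processes.get(canister_id)
        process = entry[0] if entry is not None else self.spawn(canister_id)
        self.processes[canister_id] = (process, now)
        return process

    def terminate_idle(self, now):
        """Terminate processes unused for at least `idle_timeout`."""
        idle = [
            canister_id
            for canister_id, (_process, last_used) in self.processes.items()
            if now - last_used >= self.idle_timeout
        ]
        for canister_id in idle:
            process, _last_used = self.processes.pop(canister_id)
            process.terminate()
        return idle
```

Compared with the static strategy, where a process stays alive forever after the first message, the number of live processes here is bounded by the number of recently active canisters. The trade-off is that a terminated sandbox has to be spawned again on the next message, which is why a fast spawn path matters.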

7 Likes

Awesome! And will it be rolled out to all subnets initially or will you expand per subnet over time?

2 Likes

It will roll out to all subnets (as part of a normal release process) except for eq6en and tdb26:

  • eq6en has high memory usage,
  • tdb26 is the NNS.

In both cases we want to be extra-careful and enable sandboxing after confirming that it works well.

6 Likes

Fwiw, the disabling was in the release change log, but I didn’t think to post here. My bad.

2 Likes

Sandboxing rolled out for all subnets except eq6en, w4rem, and tdb26. These three subnets will get sandboxing in the next release.

There are no issues so far.

11 Likes

Legendary comment, excellent work

4 Likes

Update on “There are no issues so far”: the sandbox process crashed on a single node on pjljw and we had to disable sandboxing there. The investigation is in progress. Until we have a fix, we will keep sandboxing disabled on pjljw, eq6en, and tdb26.

2 Likes

We found the root cause of the crash. It was caused by an old bug in the state manager. Sandboxing changed the timing and made the crash more likely to happen.

The fix is in review and will make it into the next release cut, which will be deployed in two weeks.

10 Likes

Canister sandboxing is deployed on all subnets and is running without any issues so far.

17 Likes

That is awesome, congratulations!

7 Likes