Can you upgrade a canister without stopping it? Yes, eh, no, eh … maybe

This is a copy of a post of mine in Actor Model fundamentals compromised? - #25 by nomeata, but it may be of general interest:

Ah, upgrades-without-stopping … interesting topic! Here are some comments (and all of them apply to rust and_ motoko):

  • The system doesn’t stop you from upgrading a canister that isn’t “stopped”. But it isn’t always a good idea.

  • If you have a canister that never does outgoing calls, you can safely upgrade atomically without stopping, and have no downtime.

  • If you have a canister that you know at the moment has no outgoing calls, you can safely upgrade atomically without stopping, and have no downtime.

    Maybe your service provides lots of useful functionality without doing calls on it own, and only rarely does something that requires an outgoing call. Then you could add application logic where you instruct the canister (not the system!) to stop doing outgoing calls, wait for all outstanding to come back, and then upgrade (atomically and without stopping), without impeding the main functionality of the service.

  • If you have a conventional canister (Motoko or Rust), and you might have outstanding calls, you really really should not upgrade without stopping first.

    It’s not just that you might lose the response, but when the response comes back, the way things are set up right now, it could arbitrarily corrupt your canister state.

    The technical reason is that responses are delivered by invoking a Wasm function identified by a WebAssembly function table index, and neither Rust nor Motoko give you control over the location of functions in the table. So the new version of your canister might have a completely unrelated, internal function in that slot.

  • Theoretically, you can write canisters that you can upgrade while the call is in flight, if you make sure that the functions handling the callbacks are at the same position in the table. For example if you write your canister by hand in wasm, or beef up your Rust toolchain, or maybe some clever post-processing.

    I don’t think anyone has done or tried that so far. But the system conceptually supports this.

    In this model, you wouldn’t be using async and await, though, but you would use top-level named functions as callback handlers, and implement your service closer to the actor model, or maybe closer to a state machine. After all, you do want to handle the responses in the upgraded version, so you need to be more explicit about the flow here.

  • We have plans (but not high priority, unfortunately) to change or extend the System Interface to remove the problem that you can corrupt your state if you get this wrong, by delivering callbacks to named exported functions of the module (separate from the public methods, though).

    With that in place, you’ll be able to use write canisters that can be upgraded instantaneous and autonomously with in-flight outgoing calls at last in Rust/C/etc, and – after a bit more language design – hopefully also as an opt-in in Motoko.

16 Likes

Changing the programming model to the actor model is a very big ask.

I think it’s pretty infeasible to expect a user to keep track of which functions they can and can’t upgrade.

Why not just reject incoming messages that come back after a certain period of time has elapsed?

I think this forum post turned into the following blog post.

https://www.joachim-breitner.de/blog/789-Zero-downtime_upgrades_of_Internet_Computer_canisters

5 Likes