Canister Lifecycle Hooks

A lifecycle hook or method is a common way of giving developers a passive means of tying into and handling specific, important events. In the context of React and Angular, lifecycle hooks let developers react to a component mounting or to state changes, and in the context of AWS EC2 these hooks are triggered when auto-scaling an instance.

On the IC, developers are currently able to actively interact with canisters but are limited in their ability to passively tie into specific important events in the lifecycle of a canister. This proposal aims to introduce an initial set of these lifecycle hooks for canisters on the IC.


Types of Lifecycle Hooks Proposed

Canister Metric Hooks

  • canister_on_low_cycles(cyclesThreshold: Nat): async ()
    → triggers when the canister has cycles <= cyclesThreshold

  • canister_on_low_heap_memory(heapMemoryThreshold: Nat): async ()
    → triggers when the canister has heapMemory >= heapMemoryThreshold

Currently, in order to monitor canisters, developers need to proactively reach out to the canister or call a system-level API. Providing canister metric lifecycle hooks allows developers to define their own thresholds for canisters and react when those thresholds are breached. Each of these thresholds will use 8 bytes and be stored in the canister settings within the replica.

If a canister metric threshold has been provided, once the replica detects that a canister has crossed that threshold, it will send a message to the corresponding endpoint (e.g. canister_on_low_cycles()) of the canister, triggering that lifecycle hook once the canister is able to process the message.
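
For illustration, a handler for this hook might look roughly like the following Motoko sketch (the system function itself is hypothetical and does not exist today; the reaction logic is just a placeholder):

```motoko
import Cycles "mo:base/ExperimentalCycles";
import Debug "mo:base/Debug";

actor {
  // Hypothetical hook: the replica would invoke this once the cycles balance
  // drops to or below the threshold stored in the canister settings.
  system func canister_on_low_cycles(cyclesThreshold : Nat) : async () {
    Debug.print(
      "Cycles running low: balance " # debug_show (Cycles.balance()) #
      ", threshold " # debug_show cyclesThreshold
    );
    // React here, e.g. notify a top-up service or shed non-essential work.
  };
}
```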

Canister Error Hooks

  • canister_on_error(methodName: Text, args: Blob, error: Error): async()
    → triggers on uncaught canister runtime error (trap), allows the developer to capture, analyze, and log different types of errors

Currently, there is no way for canister developers to catch synchronous errors occurring within a single canister, including heartbeat errors. With the canister_on_error() hook, if the execution of a message, heartbeat, or timer fails, the replica will schedule a message to the canister_on_error() endpoint of the canister, passing along the error.
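
As a sketch of how a canister might consume this, assuming the hook were exposed to Motoko (the shape of the Error argument is not pinned down by this proposal, so the record type below is an assumption):

```motoko
import Array "mo:base/Array";
import Debug "mo:base/Debug";

actor {
  // Assumed shape for the proposal's Error argument, for illustration only.
  type HookError = { code : Nat; message : Text };

  stable var errorLog : [(Text, Text)] = [];

  // Hypothetical hook: scheduled by the replica after a message, heartbeat,
  // or timer execution traps.
  system func canister_on_error(methodName : Text, args : Blob, error : HookError) : async () {
    errorLog := Array.append(errorLog, [(methodName, error.message)]);
    Debug.print("trap in " # methodName # ": " # error.message);
  };
}
```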

Canister Lifecycle Hooks

  • canister_post_init(): async()
    → triggers immediately after the canister_init() (i.e. constructor) function of the canister

Currently, there is no way to execute inter-canister/asynchronous calls during the canister_init() method. In order to ensure that an asynchronous call happens before any other message is executed, the current workaround is to add a guard that either awaits the completion of the asynchronous task or rejects any messages until that task is completed.
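
Roughly, today's workaround looks like the following in Motoko (the registry canister, its id, and the method names are purely illustrative):

```motoko
actor {
  stable var initialized : Bool = false;

  // Illustrative dependency that must be contacted before serving traffic.
  let registry : actor { register : () -> async () } =
    actor "rrkah-fqaaa-aaaaa-aaaaq-cai";

  // Asynchronous setup that cannot run inside canister_init today.
  func setup() : async () {
    await registry.register();
    initialized := true;
  };

  // Every public update method carries a guard like this.
  public func doWork() : async () {
    if (not initialized) { await setup() };
    // ...regular business logic...
  };
}
```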

The canister_post_init() lifecycle hook would create a message in the per-canister task queue abstraction mentioned here: Cross-canister compatible Post-Init hook - #5 by ulan, ensuring that the canister_post_init() hook is executed before any other “regular” message.

Wish List Hooks (additional, nice to have)

  • canister_output_queue_size(): Nat → synchronous call that exposes the output queue size of a canister, helping a developer to throttle/pace outgoing calls from a canister (see the sketch below)
  • canister_on_output_queue_full(): () → hook that triggers when a canister’s output queue is full
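
If the queue-size API existed, a canister could pace its fan-out roughly as below (the API itself, the 400-message margin, and the downstream canister are all assumptions):

```motoko
actor {
  // Illustrative downstream canister.
  let downstream : actor { ingest : (Nat) -> async () } =
    actor "rrkah-fqaaa-aaaaa-aaaaq-cai";

  public func fanOut(items : [Nat]) : async () {
    for (item in items.vals()) {
      // canister_output_queue_size() is the hypothetical wish-list API.
      // Back off once the queue approaches an assumed safety margin.
      if (canister_output_queue_size() >= 400) {
        return; // stop here; a timer could resume the fan-out later
      };
      ignore downstream.ingest(item); // one-way (fire-and-forget) call
    };
  };
}
```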



Special thanks to @ulan for his technical background and expertise, and encouraging me to take on this proposal.

22 Likes

I like the proposal! Two suggestions:

  1. Similar to canister_post_init(), would it also be useful to have a hook that executes after a canister upgrade? If so, those two might be merged into one hook…
  2. It seems the low cycles and low heap memory hooks are just a special case of an error, so we might have just one canister_on_error() hook with some predefined set of errors…

WDYT?

5 Likes

Thanks for starting this thread @icme. I think this is a very reasonable proposal and I like it.

Currently, there is no way for canister developers to catch synchronous errors occurring within a single canister, including heartbeat errors.

Nit: I think that’s true for Motoko but in Rust you can actually catch synchronous errors in your canister, e.g. a queue being full or the canister not having enough cycles when attempting to send an inter-canister call (the heartbeat errors are still problematic and thus it’s still a good example).

I also have one clarifying question on the canister_post_init hook. It sounds to me like one important aspect you want from it is the ability to execute inter-canister/asynchronous calls after the canister’s initialization. Adding a task in the per-canister task queue to ensure canister_post_init is triggered first before any other regular messages doesn’t seem to ensure that any inter-canister calls triggered by it would also be executed before anything else. Unless, that is, we somehow keep adding some task in the queue until the outgoing requests from canister_post_init are already handled.

So, to confirm, you expect that any calls initiated by canister_post_init would also be processed before any other messages right?

2 Likes

I think this makes sense - the only potential issue I see is confusion with the postupgrade() system function hook in Motoko. That also runs immediately after an upgrade, and is meant to deserialize stable variables from stable memory back into the heap, or to run something immediately after an upgrade.

In my opinion, canister_post_init() should run after Motoko’s postupgrade(). Tagging @claudio to pull in some of his thoughts on how this might work from the Motoko side.

The canister_on_low_cycles and canister_on_low_heap_memory hooks are actually meant as warnings and should be triggered well before the respective canister cycles freezing threshold or canister heap memory limit is reached. In this way, they’re less of an error message and more of a “hurry up and do something before this becomes an error” message.

Correct. Since each of these hooks is async, would it be possible to await the response of canister_post_init before any other messages are executed?

Doesn’t the IC’s deterministic execution (preservation of message ordering) take care of this?

2 Likes

If I understand the intent behind this correctly, then I’d disagree. You probably want to treat low_* hooks as a warning rather than an error. To borrow an analogy from rust, canister_on_error is a panic! (and you need to rethink something in your code, probably can’t recover at runtime), while low_* would be a Result that you can and should cover in your code (e.g. by adding more cycles, or freeing up some buffers if possible, or spinning up new canisters depending on your use-case).

Sure it’s possible, but I’m not sure it’s as simple as “add a task in the canister’s task queue”; we’d need something like a distinct state to know whether the canister is still being initialized (so that we know other messages should be rejected or queued up).

Doesn’t the IC’s deterministic execution (preservation of message ordering) take care of this?

I think there’s some misunderstanding here or maybe I wasn’t clear enough. If we send a message from canister_post_init I don’t see how it’s guaranteed that we won’t process any other messages before the response to this message is received. Your canister might have received ingress messages or other inter-canister calls in the meantime. Nothing prevents that. (Are you suggesting that we do?)

So, if we want to really make sure that canister_post_init is “fully” done, a (one-time) task added in the canister’s task queue will not be enough in itself I think. You’ll need either to keep injecting tasks in the queue until the responses for canister_post_init are processed or we use some intermediate state for the canister to know that certain responses need to be handled before other messages (and likely reject other messages while we’re waiting for those responses).

2 Likes

To avoid confusion, we could rename the hook to canister_ready_to_serve() or something similar…

I agree, there are warnings and errors. But in the end, in both cases, all we can do from the canister’s perspective is to log the error/warning or send an alert elsewhere.

If we have something more neutral, like canister_system_notification(), we could handle errors and warnings in one place, as the logic might be quite similar…

2 Likes

What do you see as some of the pros and cons of having one system API vs. multiple system APIs?

From a DevX perspective, I’d personally like to handle each of these events in a different API, rather than having one single endpoint where the developer has to parse/match on different system notification codes.

Ideally this would both execute the canister_post_init() call and await its resolution before any other messages in the queue are executed. This would allow a newly created canister to fully sync with the rest of a multi-canister application before opening up its message queues. Maybe this would look like opening up the task queue, but closing off/halting the canister’s ingress, input, and output queues until canister_post_init() completes (and then opening the ingress/input/output queues back up).

From the implementation and feasibility side of things I’m a bit out of my depth here, and am just going off the videos I’ve seen, some brief browsing of the IC code, and what I read about the canister task queue that was implemented specifically for DTS. I’ll rope in @ulan.

1 Like

Ideally this would both execute the canister_post_init() call and await its resolution before any other messages in the queue are executed.

We can ensure that canister_post_init() is executed before any other message, but if it makes other calls, then these calls will be regular calls. In other words, we can guarantee that the part of canister_post_init() until the first await executes before any other messages. After the await, other messages may execute (i.e. the standard async/await semantics).
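
To illustrate with a hypothetical Motoko handler (the hook, the registry canister, and its id are made up):

```motoko
actor {
  var config : Text = "default";

  // Illustrative remote canister the hook depends on.
  let registry : actor { fetchConfig : () -> async Text } =
    actor "rrkah-fqaaa-aaaaa-aaaaq-cai";

  // Hypothetical hook.
  system func canister_post_init() : async () {
    config := "initializing";                   // guaranteed to run before any other message
    let remote = await registry.fetchConfig();  // other messages may interleave from here on
    config := remote;                           // runs whenever the reply arrives
  };
}
```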

To ensure that all awaits also happen before other messages, we would need some kind of atomic inter-canister transactions, which is super difficult.

Does this limitation greatly reduce usefulness of canister_post_init() ?

1 Like

Yeah, we might want to purge some memory on low heap or something; it might be useful to have a dedicated hook for that…

Does this limitation greatly reduce usefulness of canister_post_init() ?

Yes. I think the driving motivation behind a canister_post_init() hook (or whatever we end up naming it) is specifically to await cross-canister calls. If you just want to execute non-async code I’d imagine you would just add it to your init() or postUpgrade() methods. It’s specifically the async stuff that we’re trying to provide a handle for.

To provide a concrete use case, consider the need to seed custom randomness. This requires making a cross-canister call to the management canister’s raw_rand() method. You’d most assuredly want that to complete before you attempted to use random in your other canister methods. If other methods start getting hit before the randomness is fully seeded it could cause significant problems.
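
A rough Motoko sketch of what I mean, assuming the proposed hook existed (raw_rand is the real management canister method; everything else is illustrative):

```motoko
actor {
  // Minimal binding to the management canister's raw_rand method.
  let ic : actor { raw_rand : () -> async Blob } = actor "aaaaa-aa";

  var seed : ?Blob = null;

  // Hypothetical hook: runs right after canister_init(), before regular messages.
  system func canister_post_init() : async () {
    // 32 bytes of subnet randomness to seed the application-level RNG.
    seed := ?(await ic.raw_rand());
  };
}
```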

Seeding randomness is just one example. I imagine there will be many scenarios where external data is needed to get a canister into the correct start state.

Furthermore, when asking about how to guarantee async code happens before other messages, the currently accepted solution is to add a guard clause inside each update method. I see the canister_post_init() hook as the replacement to those initialization guards, so I think it must provide the same level of assurance.

2 Likes

Thanks for explaining @dansteren!

When I was thinking about the feasibility of canister_post_init(), I was only thinking about calling that function before other messages, but didn’t consider that we also want to wait for all its calls to complete, which makes a lot of sense in hindsight.

I am afraid it is very difficult to support the full solution that waits for all the outgoing calls of canister_post_init().

1 Like

@ulan Given the complexity around a full async block until canister_post_init() is completed, a non-blocking canister_post_init() is still valuable in guaranteeing that it is the first message executed after the init() function is called (as long as the canister implements canister_post_init()). It’s also valuable for kicking off async canister calls without requiring a heartbeat or an external message trigger.

@dansteren Looking at the other thread, I was thinking about the guards solution and came up with another potential solution to your problem.

  1. Have 2 wasms, wA and wB. wA initializes the canister and has only the canister_post_init() method and no other public APIs. wB extends wA with all of your public APIs and other business logic, and can upgrade wA without a breaking change or loss of state.
  2. Create canister C with wA. canister_post_init() will run directly afterwards, create your seed for C, and make any needed calls elsewhere, but no additional ingress or input calls will be able to come through to canister C since no endpoints/APIs are exposed on it.
  3. Upgrade canister C with wB. canister_post_init() might run again immediately after the upgrade, so include logic inside canister_post_init() that skips the work if your seed has already been generated. Now canister C will already have its seed set up prior to receiving any messages, and you don’t have to worry about adding any additional guards/blocking logic.
2 Likes

I agree, kicking off an async call could still be valuable. Not as great as ensuring it returns, but it’s a step in the right direction since you can’t even kick calls off from the init() method.

This is a decent solution, but it won’t really work for my specific use case. Basically you’re saying, when deploying a canister, first deploy a very basic canister that only contains init and canister_post_init. Then afterwards add all the rest of your functionality.

That could be an alright solution for some, but my team and I are building CDKs, i.e. working at the language level (Typescript and Python). We don’t have control over when our users deploy their canisters. They just expect Math.random() and random.random() to work in TS and Python respectively. So it’s up to us to seed the randomness for them, meaning, inject some underlying code that calls the management canister’s raw_rand method, and do so as early as possible.

So I think having a canister_post_init would still be helpful. It would be ideal if we knew the call to raw_rand had fully returned, but if not, even just kicking it off from there would be better than putting a guard at the top of every method, especially since that wouldn’t work for query methods.

Given the new year, I’d like to revive this thread and potentially add canister lifecycle hooks to the DFINITY roadmap somewhere :slight_smile:

So far, I haven’t seen any red flags that would hold these back, with the one limitation being that the canister_post_init implementation may not perfectly meet @dansteren’s use case, but it would meet others’ use cases such as the one mentioned here: https://forum.dfinity.org/t/question-how-do-i-make-an-inter-canister-call-from-the-post-upgrade-hook

Also, I’d like to add a couple of additional wish-list items that canisters could tie into, related to their output queues.

  • canister_output_queue_size(): Nat → synchronous call that exposes the output queue size of a canister, helping a developer to throttle/pace outgoing calls from a canister.
  • canister_on_output_queue_full(): () → triggers when a canister’s output queue is full

I’ve added these to the “wish list” section of the main post in an edit.

3 Likes

Wdyt about the following plan?

  1. Submit an NNS motion proposal to add the three non-controversial hooks to the roadmap:
    • canister_on_low_cycles()
    • canister_on_low_heap_memory()
    • canister_on_error()
  2. Postpone canister_post_init() until we have ideas on how to implement the useful version of it.
  3. Move canister_output_queue_size() into a standalone feature. It sounds like a very useful API orthogonal to the hooks.
  4. Drop canister_on_output_queue_full() because it doesn’t seem very actionable. By the time the hook runs, the queue may have changed arbitrarily (messages may have been drained from the queue or added to it).
7 Likes

I’d like to summarize an offline discussion with @berestovskyy. Andriy, please correct me if I missed something.

  • canister_on_error():
    • The main downside is that non-replicated queries are not supported. To make errors in non-replicated queries observable to the developers, we would need to implement an off-chain mechanism that allows the canister owner to access statistics on executed queries and errors that occurred. If this mechanism were in place, it could also be used to gather statistics on replicated message execution, making the canister_on_error() hook unnecessary.
    • The hook could also become redundant if we supported some generic logging mechanism. E.g. if there were a way for the developer to pull the last N debug_print() outputs of the canister from all nodes and trapped executions were automatically logged, then it would work for both replicated and non-replicated messages.
  • canister_on_low_memory(), canister_on_low_cycles():
    • If there were a way to query the balance and memory statistics for a set of controlled canisters in a single message (without querying each canister individually), then the monitoring canister or the off-chain monitoring service could periodically poll the statistics and detect the low-memory/low-cycles conditions (see the sketch after this list).
    • Nevertheless, it might still be useful to have these hooks because they might allow the canister to react faster and adjust its behavior.
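
For comparison, the polling approach is already possible today when the monitor controls the target canisters; a rough Motoko sketch (only a subset of the canister_status response fields is declared, and the threshold handling is illustrative):

```motoko
actor Monitor {
  // Partial binding to the management canister; Candid subtyping lets us
  // declare only the canister_status response fields we care about.
  let ic : actor {
    canister_status : { canister_id : Principal } -> async {
      cycles : Nat;
      memory_size : Nat;
    };
  } = actor "aaaaa-aa";

  // The monitor must be a controller of `target` for this call to succeed.
  public func isLowOnCycles(target : Principal, threshold : Nat) : async Bool {
    let status = await ic.canister_status({ canister_id = target });
    status.cycles <= threshold
  };
}
```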

After this discussion, I am not sure if we should include canister_on_error() in the proposal. I am still optimistic about canister_on_low_memory(), canister_on_low_cycles() because they make it easy for a canister to react to these corner-case conditions without building a complex monitoring system.

@icme: wdyt?

5 Likes

I’m not proposing catching errors in query calls at this point in time.

Could this off-chain mechanism be implemented by a non-DFINITY developer team, or do you have something in mind at the protocol level?

If we’re having this discussion, it might be even better to build some sort of DLQ tooling at the protocol level. I’ve had some thoughts about what a general-purpose DLQ would look like at a higher software level, but it seems like something like that could utilize the canister_on_error() type of failures and send the raw error metadata to a DLQ/logger type of canister. You could then imagine the user being able to redrive API calls and events that had failed through inter-canister calls instead of ingress messages.

I’m perfectly fine with starting small and building from that. Start with one or two hooks, and see what the residual effects on the canister message queues, etc. look like before adding more.

In fact, if we were ordering these “hook”-type features in terms of ease of implementation and immediate developer impact, I really think that Proposal: Configurable Wasm Heap Limit should come first since it’s synchronous, followed by canister_on_low_memory() and canister_on_low_cycles().

I agree canister_on_error() and canister_post_init() could use some more time to marinate!

1 Like

Yep, my initial thinking was the same: that we could handle update errors first and then tackle the query errors at some point in the future. However, it seems better to think the query case through to ensure that we don’t build a hook that will become redundant in the future.

The idea was to add a replica endpoint that developers could query to get stats and logs for their canisters’ messages (a strawman sketch follows the list):

  • The number of executed messages (updates/queries).
  • The latency of executed messages.
  • The number of executed instructions per message type.
  • The number of consumed cycles per message type.
  • The number of errors/failures per message type.
  • The output of ic0.debug_print().
  • etc.
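
Purely as a strawman, if this data were also fed to canisters, the per-canister payload might look something like this (all names and fields below are made up for illustration):

```motoko
module {
  // Strawman shape for the stats such a replica endpoint might expose.
  public type MessageKind = { #updateCall; #queryCall; #heartbeat; #timer };

  public type MessageStats = {
    kind : MessageKind;
    executed : Nat;        // number of executed messages
    errors : Nat;          // number of errors/failures
    instructions : Nat;    // executed instructions
    cyclesConsumed : Nat;  // consumed cycles
  };

  public type CanisterStats = {
    perMessageKind : [MessageStats];
    recentDebugPrints : [Text]; // last N ic0.debug_print() outputs
  };
}
```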

This would require changes in the replica code, so the DFINITY team would need to help with the implementation. I am not sure if it needs to be specified at the protocol level or not. If we allow only off-chain queries of the endpoint, then we probably don’t need to specify it. If we want to feed the data to other canisters, then we need to specify it.

it might be even better to build some sort of DLQ tooling at the protocol level

If we limit the scope only to update (~ replicated) messages and ignore queries, then canister_on_error() would be sufficient to build such a DLQ system at the canister level, right?

I don’t have a good idea on how to implement DLQ for queries. It depends on whether we want to protect against malicious nodes or not.

This plan sgtm!

3 Likes

The NNS motion proposal: Proposal: 106146 - IC Dashboard

2 Likes