Deterministic Time Slicing

Summary

Currently, the amount of computation a smart contract can perform per WebAssembly call is limited by the block time. Deterministic time slicing allows for long(er) running, multi-round computations by suspending the execution of an update message at the end of one round and resuming it later.

Status

Discussing

Key people involved

Andriy Berestovskyy (@berestovskyy), Bogdan Warinschi (@bogwar), Ulan Degenbaev (@ulan).

Timeline

1-pager posted on the forum for review: February 2, 2022.
Community Conversation: TBD.
NNS Motion Proposal (to approve design + project): TBD
If the NNS Motion Proposal passes, implementation and deployment would take a few months (target: April 2022).

Objective

The current implementation of the Internet Computer (IC) does not support multi-round WebAssembly execution. In other words, if a WebAssembly function starts executing in one round, its execution must finish before the next round can start. This introduces a trade-off between fast finality and the amount of work a smart contract can do in a single update message execution.

To ensure fast finality, the IC imposes a limit on the number of instructions per WebAssembly call. If the limit is exceeded, the execution traps with an error:
Canister <id> exceeded the cycles limit for single message execution.

The problem is that the existing limit is not sufficient in some important cases:

  • Pre- and post-upgrade hooks of smart contracts. They potentially serialize and deserialize the entire Wasm memory.
  • Garbage collection in Motoko smart contracts. The garbage collection potentially touches every byte of the Wasm memory.

Besides these cases, a smart contract may need to perform other long-running computations specific to its domain.

The goal of this project is to fix the problem by supporting multi-round WebAssembly execution of update messages. Note that supporting long-running query messages is out of scope of this proposal.

Proposal

The proposal is to implement deterministic time slicing (DTS) – a mechanism that automatically suspends WebAssembly execution when it reaches the per-round instruction limit and resumes the execution in a subsequent round. The mechanism should be transparent to the smart contract and the developers, such that pausing and resuming are not observable. Specifically, DTS should guarantee atomicity and isolation of a long-running execution:

  • atomicity: if an error occurs in any chunk of the execution, then the smart contract rolls back to the state it had before the entire execution started.
  • isolation: other update messages do not run until the entire execution completes. Queries may be executed concurrently, but they must use the state from before the execution started.
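To make these guarantees concrete, here is a minimal toy sketch of slicing with atomicity. All names here are hypothetical, and the real replica slices Wasm execution rather than Rust code; the point is only the shape of the mechanism: each round runs a bounded slice against a scratch copy of the state, and the committed state is replaced only if the whole computation finishes without trapping.

```rust
// Illustrative sketch only: a toy "canister" whose long computation is
// executed in per-round slices against a scratch copy of its state.

#[derive(Clone, Debug, PartialEq)]
struct State {
    counter: u64,
}

enum SliceResult {
    Paused,          // instruction budget for this round exhausted
    Done,            // computation finished; scratch state can be committed
    Trapped(String), // error: discard scratch state (atomicity)
}

struct Execution {
    scratch: State, // copy we mutate; committed state is untouched until commit
    remaining_steps: u64,
}

impl Execution {
    fn start(original: &State, total_steps: u64) -> Self {
        Execution { scratch: original.clone(), remaining_steps: total_steps }
    }

    // Run at most `slice_budget` steps, then pause until the next round.
    fn run_slice(&mut self, slice_budget: u64) -> SliceResult {
        let steps = self.remaining_steps.min(slice_budget);
        self.scratch.counter += steps; // stands in for the long computation
        self.remaining_steps -= steps;
        if self.remaining_steps == 0 { SliceResult::Done } else { SliceResult::Paused }
    }
}

fn main() {
    let committed = State { counter: 0 };
    let mut exec = Execution::start(&committed, 25);
    let mut rounds = 0;
    // Each loop iteration stands for one consensus round.
    let final_state = loop {
        rounds += 1;
        match exec.run_slice(10) {
            SliceResult::Paused => continue,            // resume next round
            SliceResult::Done => break exec.scratch,    // commit new state
            SliceResult::Trapped(_) => break committed, // roll back (atomicity)
        }
    };
    println!("{} rounds, counter = {}", rounds, final_state.counter);
}
```

In this sketch the 25-step computation spans three rounds with a budget of 10 steps per round; other updates would simply not be scheduled between slices, which is the isolation property above.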


Figure 1. Pausing and resuming WebAssembly execution in the sandbox process.

Developer experience

The main observable effect of DTS is an increased instruction limit per WebAssembly execution. Many messages that currently fail with the instruction limit error will succeed once DTS is implemented. Developers do not need to take any action to benefit from DTS.

The initial implementation will gradually increase the instruction limit by 10 to 50 times, allowing a single message execution to span up to 50 rounds. Multi-round executions will likely have a small performance and memory overhead, which will be reflected in the execution fees. The performance and cost of single-round messages will not be affected.

As a consequence, the worst-case cost of an infinite-loop bug will increase by up to 50x. One of the follow-up features being considered is to allow developers to lower the instruction limit per method if they know the upper bound and want to guard against infinite-loop bugs.
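As a rough back-of-the-envelope illustration (using the ~5B instructions per round figure mentioned later in this thread; the exact limits are subject to change), the worst case an infinite loop can burn before trapping scales linearly with the number of rounds a message may span:

```rust
// Hypothetical numbers for illustration: a per-round limit of 5 billion
// instructions and up to 50 rounds per message under DTS.
fn max_instructions(per_round_limit: u64, max_rounds: u64) -> u64 {
    per_round_limit * max_rounds
}

fn main() {
    let single_round = 5_000_000_000u64;
    let with_dts = max_instructions(single_round, 50);
    // An infinite loop can now burn up to 50x more instructions before
    // trapping, which is why a per-method opt-out limit is being considered.
    println!(
        "worst case: {} instructions ({}x single-round)",
        with_dts,
        with_dts / single_round
    );
}
```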

Implementation

The initial implementation is estimated to take about 4 SWE-months. Most of the changes will be in the execution layer of IC. The main risks and challenges:

  • Multi-round messages will likely have higher performance and memory overhead.
  • The message scheduler needs to support multi-round messages while preserving the compute allocation and fairness guarantees.
  • The pausing and resuming mechanism may increase the complexity of the IC.

This is a huge feature and will personally allow me to shift a significant amount of focus to other things.

I wish that I’d be able to specify which functions I’d like to have this property at run time, so that I don’t have massive cost overruns due to infinite-loop bugs. Much like method upgrading, it can always call the LONGUPDATE if UPDATE fails, and most of the time my client will know if it should call LONGUPDATE in the first place.


This is indeed a very huge feature and I would love to see it implemented.

But I have a couple of questions:

  1. As far as I understand, these long-running updates will be executed one by one according to the consensus. This means that a single update call could occupy a canister for up to 50 consensus ticks (~100 sec), with no other update message (including async responses) being executed within this time window. For some frequently used canisters this could lead to message queue overflow. What is the plan for such scenarios?
  2. How does this proposal connect to the multi-threading proposal?

Thanks in advance!


Feels analogous to how we built telephony. Will have a big impact in how we can implement Rust and interconnect. Mission-critical for Iridium.

@skilesare

  1. Once this feature is implemented, the message limit will be increased gradually, e.g. from 5B to 10B, then to 50B, etc. So no worries about infinite-loop bills :wink:
  2. Later, runtime per-message instruction limits are planned, but that is part of another feature.

@senior.joinu

  1. You’re totally right: no other updates will be scheduled until the long-running execution is complete, but queries will run. Scaling canisters horizontally might be the way, but it’s a big change from the dev (or IC) perspective.
  2. I have no details about the multi-threading proposal; maybe @ulan knows? Sounds like a challenge to me, as multi-threading is usually non-deterministic… :cry:

Thanks all for showing the support!

+1 to @berestovskyy’s reply. To add more details:

I wish that I’d be able to specify which functions I’d like to have this property at run time so that I don’t have massive costs overruns due to infinite loop bugs.

This is one of the features we are considering to work on after the initial implementation of DTS.

  1. How does this proposal connect to the multi-threading proposal?

DTS is orthogonal to multi-threading: having DTS would make multi-threading neither easier nor harder.


Looks like a super important feature to me!


Does this affect queries and updates identically? Why do query calls have an execution limit, and why would DTS allow us to increase it?


Does this affect queries and updates identically?

DTS will not apply to queries and queries will keep the existing limit.

The main motivation for having an instruction limit for queries is to bound resource consumption. Queries run concurrently with update messages and do not block the progress of the IC. Each query keeps a reference to the replicated state from the round in which it started. If we allowed long-running queries, we could get into a situation where the IC has executed many rounds while the query is still running and keeping a very old state in memory. (A similar concern blocks inter-canister query calls: Inter-Canister Query Calls (Community Consideration) - #44 by akhilesh.singhania)
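The memory-retention concern can be pictured with reference-counted snapshots. In this toy sketch (names are illustrative; `Arc` merely stands in for however the replica snapshots state), a query pins the state of the round in which it started while the replicated state keeps advancing:

```rust
use std::sync::Arc;

// Toy model: each round produces a new replicated state, and a query
// holds a handle to the snapshot that was current when it started.
#[derive(Debug)]
struct ReplicatedState {
    round: u64,
}

fn main() {
    let mut current = Arc::new(ReplicatedState { round: 1 });

    // A query starts and keeps a reference to the round-1 state.
    let query_snapshot = Arc::clone(&current);

    // Meanwhile the IC executes more rounds, producing new states.
    for round in 2..=100 {
        current = Arc::new(ReplicatedState { round });
    }

    // A long-running query would still pin the round-1 state in memory,
    // even though the replica has already advanced to round 100.
    println!(
        "query sees round {}, replica at round {}",
        query_snapshot.round, current.round
    );
}
```

The longer the query runs, the older (and the more of) the retained state becomes, which is why queries keep a hard instruction limit under this proposal.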


This is a bit confusing, because a smart contract performs computations in query calls as well as update calls. This proposal doesn’t readily explain that the computational limits are only being increased for update computation.

I just want to point out that increasing the cycle limit for query calls could also be very important.


Good point. I added “update messages” and a clarifying sentence about queries not being supported.


Are these limits present in the local replica? Or do they only show up on mainnet?


The same limits are present in the local replica. This is very much intentional to allow developers to develop canisters under mainnet conditions.

Agreed! Our assessment is that increasing the limit for update calls is more urgent. Increasing the limit for query calls will require a different set of optimisations to be performed and different design considerations. Would definitely be worth digging into this in future.


I was thinking…
If this feature is implemented, there is no longer a reason for the consensus timer to tick every 2 seconds. Since any computation can be split into slices and executed across several blocks, there is no need to wait that long. Moreover, it then becomes preferable to process more blocks in the same time interval, because in some cases that results in faster total execution time.

Do you guys consider lowering the consensus interval? Is the current consensus interval dictated by the network latency (with no way to lower it), or is it artificial, so that you can affect it this way?


I am not an expert in consensus, but I think the block rate is set mainly based on the estimated network latency between any two nodes. See “5.9 Delay functions” in https://dfinity.org/whitepaper.pdf

DTS will probably help in making the block rate more stable (i.e. ensuring that the block rate doesn’t drop because of slow execution).

Is this new? I’ve been fighting with the fact that the local replica doesn’t have cycle simulation for months. This is great news if we now have cycle simulation on the local replica.


Cycles and instructions are two different things. We do not have cycles simulation on the local replica; we do have instruction simulation, though. Instruction simulation puts limits on how long a message can execute for; cycles simulation is actually paying for the resources a canister is using.

I regularly have functions that work fine on the local replica and hit a limit in production. Is there a switch to turn this on?


I think the confusion also stems from the wording…

To ensure fast finality, IC imposes a limit on the number of instructions per WebAssembly call. If the limit is exceeded, then the execution traps with an error:
Canister <id> exceeded the cycles limit for single message execution.

Shouldn’t the error message say “exceeded the instruction limit”?


@akhilesh.singhania: doesn’t dfx use the configuration of the system subnet by default, which has much higher instruction limits? I heard that there was a feature request to add a flag to use the application subnet configuration, but I don’t know whether it materialized or not.

Shouldn’t the error message say “exceeded the instruction limit”?

Good point! I’ll prepare a fix for this message.
