Reducing End-to-End Latencies on the Internet Computer

Hi Everyone!

I want to update you all on the progress DFINITY has made on the Tokamak milestone, which aims to reduce the end-to-end (E2E) latency of update calls to canister smart contracts. We have completed the implementation of several exciting upcoming features that will lower user-perceived latency when interacting with the Internet Computer and greatly improve how users experience the speed of ICP.

Synchronous Ingress Messages

We have completed the implementation of a new HTTPS endpoint for making update calls, /v3/…/call. This new endpoint is synchronous and responds with a certificate. This differs from today's asynchronous endpoint, /v2/…/call, where users submit ingress messages and must then continuously poll for the ingress message's status. This polling adds significant latency to every call, so switching to a synchronous endpoint that does not require polling will lower the end-to-end latency.

Figure 1 below illustrates the semantics of the old and new call endpoints. On the left, we see that when a user submits an update call to the old endpoint, they must start polling the ICP for the certified response with /read_state requests. On the right-hand side, we see that the new call endpoint waits to respond until a certificate is ready and sends it back to the user.


Figure 1

Routing ingress messages to the new call endpoint will just be an implementation detail of user agents such as agent-js and agent-rs. This means your dapp can benefit from the new endpoint by simply upgrading the agent version once the agents support the new endpoint.
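To make the agent-side behavior concrete, here is a minimal Python sketch of the logic described above. The /api/v3/canister/…/call path pattern follows the endpoint naming in the post, but the function names and the exact response handling are illustrative assumptions, not the real agent-js/agent-rs implementation.

```python
# Hypothetical sketch of an agent routing update calls to the
# synchronous v3 endpoint. A 200 response already carries a
# certificate; a 202 means the agent must fall back to polling.

def call_endpoint(canister_id: str, use_v3: bool = True) -> str:
    """Build the HTTPS path for an update call (illustrative)."""
    version = "v3" if use_v3 else "v2"
    return f"/api/{version}/canister/{canister_id}/call"

def handle_call_response(status_code: int) -> str:
    """Decide the agent's next step from the replica's status code.

    200 -> body carries a certificate: done, no polling needed.
    202 -> call accepted but no certificate yet: poll read_state.
    """
    if status_code == 200:
        return "certificate"
    if status_code == 202:
        return "poll_read_state"
    raise ValueError(f"unexpected status: {status_code}")
```

Once agents carry this kind of logic internally, a dapp only needs to upgrade its agent dependency to benefit.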

Geo-Aware Boundary Node Routing

Boundary nodes serve as a gateway to the Internet Computer by providing HTTP endpoints that route canister requests to the right subnet.

At the moment, boundary nodes route a request to a random node within the destination subnet. As the nodes are distributed across the globe, some nodes are closer and others further away. This leads to vastly different network latencies per request (from tens to hundreds of milliseconds).

We are proposing to change the boundary nodes' routing behavior for update calls. In particular, we propose that they choose among the closest "third" (f + 1 nodes) of the subnet. This helps reduce the latency between the boundary nodes and the replica nodes.
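As a rough illustration of the proposed rule, the sketch below (hypothetical names, assuming a subnet of n = 3f + 1 nodes) ranks nodes by measured latency and picks randomly among the closest f + 1:

```python
# Illustrative sketch of "closest third" routing: for a subnet of
# n = 3f + 1 nodes, route to a random node among the f + 1 with the
# lowest measured latency, instead of any random node in the subnet.
# Function and variable names are hypothetical.
import random

def closest_third(node_latencies_ms: dict[str, float]) -> list[str]:
    """Return the f + 1 lowest-latency nodes of the subnet."""
    n = len(node_latencies_ms)
    f = (n - 1) // 3  # fault tolerance for n = 3f + 1
    ranked = sorted(node_latencies_ms, key=node_latencies_ms.get)
    return ranked[: f + 1]

def route(node_latencies_ms: dict[str, float]) -> str:
    """Pick a random node among the closest third."""
    return random.choice(closest_third(node_latencies_ms))
```

For a 13-node subnet (f = 4), this narrows the choice from 13 candidates to the 5 closest, which bounds the boundary-node-to-replica leg of the latency.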

Increasing the Block Rate of Subnets

We also want to increase the rate at which Internet Computer subnets produce blocks. A higher block rate means messages can be included in a block sooner, leading to a lower latency. Note that this change will lead to more variability in the block rate of subnets: under low load, we expect to see more than 2 blocks per second, but under high load, the block rate would likely fall to ~1 block per second. This increased block rate can be achieved by modifying consensus protocol parameters that are part of the subnet settings in the registry (namely, the initial_notary_delay_millis).

The reason we can now lower this notarization delay is partly due to the new P2P layer of ICP, announced here. The new P2P layer supports higher throughput and an optimized protocol for sending messages between nodes. This means that subnets can produce blocks at a faster rate and ensure that all nodes can keep up.
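As a back-of-envelope illustration (my own simplified model, not DFINITY's), the block interval can be thought of as the notarization delay plus network and execution overhead, and an ingress message arriving at a random time waits about half an interval before it can be included in a block:

```python
# Simplified latency model (an assumption for illustration only):
# block interval ~= initial_notary_delay_millis + overhead, so
# lowering the notary delay raises the block rate, and a message
# waits on average half a block interval before inclusion.

def approx_block_rate(initial_notary_delay_millis: float,
                      overhead_millis: float) -> float:
    """Blocks per second under this simplified model."""
    interval_ms = initial_notary_delay_millis + overhead_millis
    return 1000.0 / interval_ms

def expected_inclusion_wait_ms(block_rate: float) -> float:
    """Average wait before a randomly-arriving message can be
    included in the next block: half a block interval."""
    return (1000.0 / block_rate) / 2.0
```

Under this toy model, light load (small overhead) yields rates above 2 blocks per second, while heavy load pushes the rate back toward 1 block per second, matching the variability described above. The overhead figures are illustrative, not measurements.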

Next Steps

DFINITY plans to submit proposals to gradually roll out these three features over the coming months and will use this forum thread to keep everybody informed of the progress. We also plan to collect end-to-end latency metrics, so hopefully we will see these numbers decrease as the new features roll out.

Feel free to comment if you have any questions!

  • Daniel

CC @quint @neeboo @rdobrik @Gekctek @levi @jleni (just re-using a list of agent developers that I found in an older post of mine)


Thank you @Severin @dsharifi! We will start working on the Java IC4J Agent implementation ASAP.


Great work Daniel, super interesting, can't wait to see this live!


This is going to be epic! The user experience for DApps like RuBaRu will improve significantly. We had planned some UX workarounds for handling update latency, like optimistic updates for a few requests, but it looks like we'll need to revisit those plans.

Geo-aware update calls sound amazing. Are we planning to implement the same for query calls, bringing Edge capabilities to the network? Query performance is already close to Web2 standards, so I'm curious if this is in place.

Can't wait to see it in action. @neeboo, integrating this into agent_dart would be fantastic. We would love to integrate and test. Could we also measure latency improvements and publish them?


I have a question on the synchronous endpoint.

I'm looking for solutions for authenticating raw/pure HTTP requests from clients to canisters. I would like to enable JWTs etc. for authentication so that developers can use traditional, non-ICP-specific HTTP clients.

The problem is that all of these calls are treated as anonymous, and the result is written to a location that can be polled publicly.

Is it possible for the synchronous endpoint to return the result of the call directly and only to the entity making the call? And not write it to a public location that can be polled?

If so this might solve the problem.

Does it work like this? I am afraid not, as it looks like the certified state is still written to, and thus anyone could call read_state on it. Do I have that correct?


read_state is still the fallback behavior. We can't count on perfect conditions for the boundary nodes, or even for clients to have consistent network connections. If the request can't stay open, agents need to be able to confirm the result of an update.


It's great to see the latency being reduced. Keep up the progress!


Hey @lastmjs,
I'm curious about your interest in having the synchronous endpoint return results directly without writing them to the blockchain. Could you elaborate a bit more on your specific use case or application?
Are you primarily looking to improve performance and reduce latency, or are there other factors like data privacy or simplifying certain types of queries that are driving this request?


My explanation above is what I am after: authentication purposes. The concern isn't that the result is written to the blockchain, but that it is written to a public location that any anonymous user can retrieve it from.


It seems like a good approach; let us digest v3 and see what to do next.


Is agent-js ready for v3?


That's great! Here is the PR for the IC specification change for the new endpoint, which I believe can be helpful when creating the new agent. You can also use the latest PocketIC or dfx (locally) to test against the endpoint.

Keep in mind, as @kpeacock mentioned, the fallback behavior is similar to the v2 endpoint. That means if the request cannot be processed within some long time threshold, the replica can terminate the connection and reply with 202 Accepted, meaning the agent must fall back to polling.
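The fallback described above might look roughly like the sketch below in an agent (hypothetical function names; fetch_status stands in for a real read_state round trip):

```python
# Sketch of the polling fallback an agent needs when the v3 call
# endpoint replies 202 Accepted instead of a certificate: keep
# checking the request's status until it is terminal or a deadline
# passes. `fetch_status` is a hypothetical stand-in for issuing a
# read_state request for the ingress message's status.
import time

def poll_request_status(fetch_status, interval_s: float = 0.5,
                        timeout_s: float = 10.0) -> str:
    """Poll until the ingress message reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("replied", "rejected", "done"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("ingress message status still pending")
```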


No, the processing of the update call on the subnet is exactly the same as before. This indeed means that the result of the update call is written to the certified state of the subnet.


Hey @TusharGuptaMm

At the moment, we will go with the update calls and see how it works (e.g., how the load is spread among the nodes). At a later point, we will reconsider and might even change our routing completely: deciding not just randomly or based on latency, but based on the actual load of the different nodes.


Update:

We have submitted an NNS proposal to increase the block rate of the io67a-2jmkw-zup3h-snbwi-g6a5n-rm5dn-b6png-lvdpl-nqnto-yih6l-gqe subnet. Here's the proposal: https://dashboard.internetcomputer.org/proposal/132123.


@dsharifi
Yesterday you said that "We also plan to collect end-to-end latency metrics". Why did you submit this proposal before collecting the end-to-end latency metrics?

300ms is too low for nodes on the other side of the world.

This node is in the sc1 data centre, an hour north of Brisbane in Australia. I have been monitoring ping times from a server on AWS in Frankfurt to a node on the io67a subnet, q3w37-sdo2u-z72qf-hpesy-rgqes-lzflk-aescx-c5ivv-qdbty-s6pgc-jae:

This node won't be able to keep up with European nodes if the timeout is set at 300ms.

400ms would be a better number, and a more relevant test, if we want to move all application subnets to a lower number in the future.
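The arithmetic behind this concern can be sketched as follows (my own simplification, not a measurement of the actual protocol): whatever round-trip time a remote node needs to receive a block and return its notarization share eats directly into the delay budget.

```python
# Hypothetical headroom check: how much of the notarization delay
# budget is left after one block/share round trip to a remote node.
# This is an illustrative simplification of the concern, not the
# actual consensus timing logic.

def notary_headroom_ms(notary_delay_ms: float, rtt_ms: float) -> float:
    """Delay budget left after one round trip to a remote node."""
    return notary_delay_ms - rtt_ms
```

With a ~280ms round trip between Frankfurt and an Australian node, a 300ms delay leaves only ~20ms of headroom, while 400ms would leave ~120ms.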


Just got a couple pieces of feedback from prodsec, but it'll be ready to go out as soon as the endpoint doesn't 404.


For anyone interested, there's a dedicated thread for changes to the subnet affected by this proposal → Subnet Management - io67a (Application) - Developers - Internet Computer Developer Forum (dfinity.org)

@Lerak also provided his network latency analysis there. Great job @Lerak!

Note that the consensus protocol is designed to be able to cope with network delays that occasionally fall behind the 0-rank block notarization delay.

I am curious as to why this parameter is a fixed value rather than adaptive (adapting to network conditions as they ebb and flow: incrementing when network conditions are bad, and decrementing when better performance can be achieved). An implementation like this would have avoided the need to update the config, since the notarization delay would have gradually reduced to an equilibrium point that optimises for throughput. Just thinking out loud.
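As a thought experiment only (this is entirely hypothetical and not part of the IC consensus implementation), such an adaptive delay could look like an AIMD-style controller: shrink the delay gently while rounds complete in time, and back off multiplicatively when a round falls behind.

```python
# Hypothetical AIMD-style adjustment of the notarization delay:
# additive decrease while rounds keep up, multiplicative increase
# when a round falls behind, clamped to sane bounds. All parameter
# values are illustrative assumptions.

def adapt_notary_delay(delay_ms: float, round_fell_behind: bool,
                       decrement_ms: float = 5.0,
                       backoff_factor: float = 1.5,
                       floor_ms: float = 100.0,
                       ceiling_ms: float = 600.0) -> float:
    """One adjustment step of a hypothetical adaptive notary delay."""
    if round_fell_behind:
        delay_ms *= backoff_factor
    else:
        delay_ms -= decrement_ms
    return min(max(delay_ms, floor_ms), ceiling_ms)
```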


Wow, it's an amazing update. ICP is getting faster and making the community more bullish about its tech stack. ICP is true love for us.