Reducing End-to-End Latencies on the Internet Computer

Hi Everyone!

I want to update you all on the progress DFINITY has made on the Tokamak milestone, which aims to reduce the end-to-end (E2E) latency of update calls to canister smart contracts. We have completed the implementation of several exciting upcoming features that will lower user-perceived latency when interacting with the Internet Computer and greatly improve how users experience the speed of ICP.

Synchronous Ingress Messages

We have completed the implementation of a new HTTPS endpoint for making update calls, /v3/…/call. This new endpoint is synchronous and responds with a certificate. This differs from today's asynchronous endpoint, /v2/…/call, where users submit ingress messages and must then continuously poll for the ingress message's status. This polling adds significant latency to every call, so switching to a synchronous endpoint that does not require polling will lower the end-to-end latency.

Figure 1 below illustrates the semantics of the old and new call endpoints. On the left, we see that when a user submits an update call to the old endpoint, they must start polling the ICP for the certified response with /read_state requests. On the right-hand side, we see that the new call endpoint waits to respond until a certificate is ready and sends it back to the user.


Figure 1

Routing ingress messages to the new call endpoint will just be an implementation detail of user agents such as agent-js and agent-rs. This means your dapp can benefit from the new endpoint by simply upgrading the agent version once the agents support the new endpoint.
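To make the agent-side behavior concrete, here is a minimal Python sketch of the logic described above. The /api/v3/canister/…/call path pattern follows the endpoint naming in the post, but the function names and the exact response handling are illustrative assumptions, not the real agent-js/agent-rs implementation.

```python
# Hypothetical sketch of an agent routing update calls to the
# synchronous v3 endpoint. A 200 response already carries a
# certificate; a 202 means the agent must fall back to polling.

def call_endpoint(canister_id: str, use_v3: bool = True) -> str:
    """Build the HTTPS path for an update call (illustrative)."""
    version = "v3" if use_v3 else "v2"
    return f"/api/{version}/canister/{canister_id}/call"

def handle_call_response(status_code: int) -> str:
    """Decide the agent's next step from the replica's status code.

    200 -> body carries a certificate: done, no polling needed.
    202 -> call accepted but no certificate yet: poll read_state.
    """
    if status_code == 200:
        return "certificate"
    if status_code == 202:
        return "poll_read_state"
    raise ValueError(f"unexpected status: {status_code}")
```

Once agents carry this kind of logic internally, a dapp only needs to upgrade its agent dependency to benefit.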

Geo-Aware Boundary Node Routing

Boundary nodes serve as a gateway to the Internet Computer by providing HTTP endpoints that route canister requests to the right subnet.

At the moment, boundary nodes route a request to a random node within the destination subnet. As the nodes are distributed across the globe, some nodes are closer and others further away. This leads to vastly different network latencies per request (from tens to hundreds of milliseconds).

We are proposing to change the boundary nodes' routing behavior for update calls. In particular, we propose that they choose among the closest "third" (f + 1 nodes) of the subnet. This helps reduce the latency between the boundary nodes and the replica nodes.
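As a rough illustration of the proposed rule, the sketch below (hypothetical names, assuming a subnet of n = 3f + 1 nodes) ranks nodes by measured latency and picks randomly among the closest f + 1:

```python
# Illustrative sketch of "closest third" routing: for a subnet of
# n = 3f + 1 nodes, route to a random node among the f + 1 with the
# lowest measured latency, instead of any random node in the subnet.
# Function and variable names are hypothetical.
import random

def closest_third(node_latencies_ms: dict[str, float]) -> list[str]:
    """Return the f + 1 lowest-latency nodes of the subnet."""
    n = len(node_latencies_ms)
    f = (n - 1) // 3  # fault tolerance for n = 3f + 1
    ranked = sorted(node_latencies_ms, key=node_latencies_ms.get)
    return ranked[: f + 1]

def route(node_latencies_ms: dict[str, float]) -> str:
    """Pick a random node among the closest third."""
    return random.choice(closest_third(node_latencies_ms))
```

For a 13-node subnet (f = 4), this narrows the choice from 13 candidates to the 5 closest, which bounds the boundary-node-to-replica leg of the latency.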

Increasing the Block Rate of Subnets

We also want to increase the rate at which Internet Computer subnets produce blocks. A higher block rate means messages can be included in a block sooner, leading to a lower latency. Note that this change will lead to more variability in the block rate of subnets: under low load, we expect to see more than 2 blocks per second, but under high load, the block rate would likely fall to ~1 block per second. This increased block rate can be achieved by modifying consensus protocol parameters that are part of the subnet settings in the registry (namely, the initial_notary_delay_millis).

The reason we can now lower this notarization delay is partly due to the new P2P layer of ICP, announced here. The new P2P layer supports higher throughput and an optimized protocol for sending messages between nodes. This means that subnets can produce blocks at a faster rate and ensure that all nodes can keep up.
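As a back-of-envelope illustration (my own simplified model, not DFINITY's), the block interval can be thought of as the notarization delay plus network and execution overhead, and an ingress message arriving at a random time waits about half an interval before it can be included in a block:

```python
# Simplified latency model (an assumption for illustration only):
# block interval ~= initial_notary_delay_millis + overhead, so
# lowering the notary delay raises the block rate, and a message
# waits on average half a block interval before inclusion.

def approx_block_rate(initial_notary_delay_millis: float,
                      overhead_millis: float) -> float:
    """Blocks per second under this simplified model."""
    interval_ms = initial_notary_delay_millis + overhead_millis
    return 1000.0 / interval_ms

def expected_inclusion_wait_ms(block_rate: float) -> float:
    """Average wait before a randomly-arriving message can be
    included in the next block: half a block interval."""
    return (1000.0 / block_rate) / 2.0
```

Under this toy model, light load (small overhead) yields rates above 2 blocks per second, while heavy load pushes the rate back toward 1 block per second, matching the variability described above. The overhead figures are illustrative, not measurements.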

Next Steps

DFINITY plans to submit proposals to gradually roll out these three features over the coming months and will use this forum thread to keep everybody informed of the progress. We also plan to collect end-to-end latency metrics, so hopefully we will see these numbers decrease as the new features roll out.

Feel free to comment if you have any questions!

  • Daniel

CC @quint @neeboo @rdobrik @Gekctek @levi @jleni (just re-using a list of agent developers that I found in an older post of mine)


Thank you @Severin @dsharifi! We will start working on the Java IC4J Agent implementation ASAP.


Great work Daniel, super interesting, can't wait to see this live!


This is going to be epic! The user experience for DApps like RuBaRu will improve significantly. We had planned some UX workarounds for handling update latency, like optimistic updates for a few requests, but it looks like we'll need to revisit those plans.

Geo-aware update calls sound amazing. Are we planning to implement the same for query calls, bringing Edge capabilities to the network? Query performance is already close to Web2 standards, so I'm curious if this is in place.

Can't wait to see it in action. @neeboo, integrating this into agent_dart would be fantastic. We would love to integrate and test. Could we also measure latency improvements and publish them?


I have a question on the synchronous endpoint.

I'm looking for solutions for authenticating raw/pure HTTP requests from clients to canisters. I would like to enable JWTs etc. for authentication so that developers can use traditional, non-ICP-specific HTTP clients.

The problem is that all of these calls are treated as anonymous, and the result is written to a location that can be polled publicly.

Is it possible for the synchronous endpoint to return the result of the call directly and only to the entity making the call? And not write it to a public location that can be polled?

If so this might solve the problem.

Does it work like this? I am afraid not, as it looks like the certified state is still written to, and thus anyone could call read_state on it. Do I have that correct?


read_state is still the fallback behavior. We can't count on perfect conditions for the boundary nodes, or even for clients to have consistent network connections. If the request can't stay open, agents need to be able to confirm the result of an update.


It's great to see the latency being reduced. Keep up the progress!


Hey @lastmjs,
I'm curious about your interest in having the synchronous endpoint return results directly without writing them to the blockchain. Could you elaborate a bit more on your specific use case or application?
Are you primarily looking to improve performance and reduce latency, or are there other factors like data privacy or simplifying certain types of queries that are driving this request?


My explanation above is what I am after: authentication purposes. The concern isn't that the result is written to the blockchain, but that it is written to a public location that any anonymous user can retrieve it from.


It seems like a good approach; let us digest v3 and see what to do next.


Is agent-js ready for v3?


That's great! Here is the PR for the IC specification change for the new endpoint, which I believe can be helpful when creating the new agent. You can also use the latest PocketIC or dfx (locally) to test against the endpoint.

Keep in mind, as @kpeacock mentioned, the fallback behavior is similar to the v2 endpoint. That means if the request cannot be processed within some long time threshold, the replica can terminate the connection and reply with 202 Accepted, meaning the agent must fall back to polling.
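The fallback described above might look roughly like the sketch below in an agent (hypothetical function names; fetch_status stands in for a real read_state round trip):

```python
# Sketch of the polling fallback an agent needs when the v3 call
# endpoint replies 202 Accepted instead of a certificate: keep
# checking the request's status until it is terminal or a deadline
# passes. `fetch_status` is a hypothetical stand-in for issuing a
# read_state request for the ingress message's status.
import time

def poll_request_status(fetch_status, interval_s: float = 0.5,
                        timeout_s: float = 10.0) -> str:
    """Poll until the ingress message reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("replied", "rejected", "done"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("ingress message status still pending")
```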


No, the processing of the update call on the subnet is exactly the same as before. This indeed means that the result of the update call is written to the certified state of the subnet.


Hey @TusharGuptaMm

At the moment, we will go with the update calls and see how it works (e.g., how the load is spread among the nodes). At a later point, we will reconsider and might even change our routing completely: deciding not just randomly or based on latency, but based on the actual load of the different nodes.


Update:

We have submitted an NNS proposal to increase the block rate of the io67a-2jmkw-zup3h-snbwi-g6a5n-rm5dn-b6png-lvdpl-nqnto-yih6l-gqe subnet. Here's the proposal: https://dashboard.internetcomputer.org/proposal/132123.


@dsharifi
Yesterday you said that "We also plan to collect end-to-end latency metrics". Why did you submit this proposal before collecting the end-to-end latency metrics?

300ms is too low for nodes on the other side of the world.

This node is in the sc1 data centre, an hour north of Brisbane in Australia. I have been monitoring ping times from a server on AWS in Frankfurt to a node on the io67a subnet, q3w37-sdo2u-z72qf-hpesy-rgqes-lzflk-aescx-c5ivv-qdbty-s6pgc-jae:

This node won't be able to keep up with European nodes if the timeout is set at 300ms.

400ms would be a better number, and a more relevant test, if we want to move all application subnets to a lower number in the future.
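The arithmetic behind this concern can be sketched as follows (my own simplification, not a measurement of the actual protocol): whatever round-trip time a remote node needs to receive a block and return its notarization share eats directly into the delay budget.

```python
# Hypothetical headroom check: how much of the notarization delay
# budget is left after one block/share round trip to a remote node.
# This is an illustrative simplification of the concern, not the
# actual consensus timing logic.

def notary_headroom_ms(notary_delay_ms: float, rtt_ms: float) -> float:
    """Delay budget left after one round trip to a remote node."""
    return notary_delay_ms - rtt_ms
```

With a ~280ms round trip between Frankfurt and an Australian node, a 300ms delay leaves only ~20ms of headroom, while 400ms would leave ~120ms.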


Just got a couple pieces of feedback from prodsec, but it'll be ready to go out as soon as the endpoint doesn't 404.


For anyone interested, there's a dedicated thread for changes to the subnet affected by this proposal → Subnet Management - io67a (Application) - Developers - Internet Computer Developer Forum (dfinity.org)

@Lerak also provided his network latency analysis there. Great job @Lerak!

Note that the consensus protocol is designed to be able to cope with network delays that occasionally fall behind the 0-rank block notarization delay.

I am curious as to why this parameter is a fixed value rather than adaptive (adapting to network conditions as they ebb and flow: incrementing when network conditions are bad, and decrementing when better performance can be achieved). An implementation like this would have avoided the need to update the config, since the notarization delay would have gradually reduced to an equilibrium point that optimises for throughput. Just thinking out loud.
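As a thought experiment only (this is entirely hypothetical and not part of the IC consensus implementation), such an adaptive delay could look like an AIMD-style controller: shrink the delay gently while rounds complete in time, and back off multiplicatively when a round falls behind.

```python
# Hypothetical AIMD-style adjustment of the notarization delay:
# additive decrease while rounds keep up, multiplicative increase
# when a round falls behind, clamped to sane bounds. All parameter
# values are illustrative assumptions.

def adapt_notary_delay(delay_ms: float, round_fell_behind: bool,
                       decrement_ms: float = 5.0,
                       backoff_factor: float = 1.5,
                       floor_ms: float = 100.0,
                       ceiling_ms: float = 600.0) -> float:
    """One adjustment step of a hypothetical adaptive notary delay."""
    if round_fell_behind:
        delay_ms *= backoff_factor
    else:
        delay_ms -= decrement_ms
    return min(max(delay_ms, floor_ms), ceiling_ms)
```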


Wow, it's an amazing update. ICP is getting faster and making the community more bullish about its tech stack. ICP is true love for us.