I think this is fairly easily mitigated with a migration fee/delay of some kind…I’m certainly not proposing we don’t try to mitigate the risk.
As @free says… many applications just stop working well if the canisters in their ‘canister group’ get moved to different subnets. There are mitigations for this, like having a private key sign the principal id + group id and putting that signature in each canister’s metadata so the balancer keeps them together.
I’m more concerned about things like attestation certificates or security certificates that would become harder to validate if all of a sudden the canister moves subnets and the validation procedure uses a subnet look up.
For example, here in ICRC-75 we create a certificate for the user to carry around with them that proves they are part of a list:
And then here we return the cert:
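Roughly, the pattern (as a hypothetical sketch, not the actual ICRC-75 code) is: an update call commits a digest of the membership claim into the canister’s certified data, and a query hands that digest back together with the certificate the user carries around:

```motoko
// Hypothetical sketch of the general certified-data pattern (not the ICRC-75 code).
import CertifiedData "mo:base/CertifiedData";
import Blob "mo:base/Blob";

actor {
  // 32-byte digest of the membership claim ("principal X is on list Y").
  // A real implementation would hash the full claim; certified data is capped at 32 bytes.
  stable var claimDigest : Blob = Blob.fromArray([]);

  // Update call: commit the digest into the canister's certified data,
  // making it a leaf of the subnet's certified state tree.
  public func issue(digest : Blob) : async () {
    claimDigest := digest;
    CertifiedData.set(digest);
  };

  // Query call: return the digest plus the certificate the user carries around.
  // getCertificate() only yields a certificate inside (non-replicated) query calls.
  public query func getCert() : async { digest : Blob; certificate : ?Blob } {
    { digest = claimDigest; certificate = CertifiedData.getCertificate() }
  };
}
```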
I’d imagine that these break when you move subnets (not the end of the world, as you can get them re-issued, but letting everyone know that this is necessary gets tough)… and then I also guess some immutable applications are not possible. I’d rather have the ability to split and move subnets, but I’m just noting for discussion one of the things that needs to be considered.
But it looks to me like what this is, is certified data. That is simply a blob attached to your canister state in the canonical representation of the subnet state (which is what gets certified every round). This behaves very much the same as an ingress response: it’s some leaf in subnet A’s certified state; when your canister is moved to subnet B, it simply becomes a leaf in subnet B’s certified state.
Ultimately, the thing that proves that an ingress response or a canister’s certified data is authentic is the NNS’ public key. A subnet certifies the piece of data, whatever it is, and proves that it’s an actual IC subnet by also providing you with a delegation from the NNS. So the NNS certifies that this is an actual subnet; the subnet provides a certificate that the root of the state tree has some specific hash; the witness then proves that the data fits exactly into the pruned state tree (after you combine all hashes in the witness with the hash of the data, you end up with the expected root hash). Which precise subnet is on this trust chain is irrelevant.
(Which is problematic if you have a mix of 40-replica subnets and single-replica subnets, since you end up trusting both just as much. But that’s a completely different issue. And the reason why we don’t have single replica or single data center subnets on mainnet.)
Great point. Another reason to be more careful about letting canisters choose which subnets to live on.
Here the difference that I’m talking about is control: who/what decides on which subnets canisters live. Speed doesn’t matter in this context.
Incremental in general is a good plan, but in this specific case, letting canister controllers choose which specific subnets to migrate to would block the protocol from having a chance to handle a targeted DDoS. So it’s not strictly incremental; it adds/solidifies a new blocker.
It works great if we start with the statement that: Every inter-canister call can take up to the time of a cross-subnet call. Most canisters are already making many cross-subnet calls to ledgers, SNSs, and other dapps. As someone building canisters on the other side, the benefit of being safe from an indefinite dapp-targeted DDoS is much greater than the benefit of a canister call sometimes completing 1 or 2 seconds faster. It’s not even close.
Random shuffling of canisters between subnets of the same subnet-type is a great tool in the toolbox for handling an attempted dapp-targeted DDoS. The only canister-architecture constraint this adds is the statement that: Every inter-canister call can take up to the time of a cross-subnet call.
For people looking to build microservices/actors with millisecond level inter-actor communication, a decentralized 13-times-replicated consensus environment already isn’t the best choice for that use-case. But guessing what people might need can lead us in circles. Do you know of a specific factual use case that “would not work” with this constraint but would work otherwise?
We can wait if we know that the outcome will be a canister environment that is safe.
A migration fee or delay would hurt the good canisters just as much as the bad ones, and would make it so that whoever has more money to burn will win. We need a platform where someone with very little money can build something that is safe from bad-actors that may have a lot of money.
What specific things will “just stop working well”? The only difference is that every canister-call can take up to the time of a cross-subnet canister-call. Do you know of a specific dapp that exists now that would not be able to function with this constraint?
Function and function with tolerable UX are two debatably different things, but I’d argue that if users get bored waiting for your system to respond then you’ve lost the game already.
The ICRC-72 configuration assumes that a subscriber is on the same subnet as its broadcaster.
ICRC-1/2/3 ledger interactions will be significantly slowed if the caller is on a different subnet from the ledger/index canisters.
I believe DeFi vectors degrade performance significantly if they go cross-subnet.
If the Bob miners had been on a different subnet than the ‘mother’ canister, we might have had a whole host of different problems.
Composite queries currently ONLY work when the canisters are on the same subnet (see the sketch after this list).
Any system using canPack relies on canisters being on the same subnet.
The protocol dreamed up by OpenInternetSummer was looking to use canisters that assumed they were on the same subnet.
Basically, any ‘chatty’ protocol that relies on having intra-round communications for settlement and reporting to users becomes unusable when you cross a subnet boundary as the slow UX will likely not be tolerated.
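To make the composite-query point concrete, here is a rough sketch (hypothetical canister and method names) of the pattern that only works while the caller and all callees share a subnet:

```motoko
// Hypothetical sketch: a composite query may only call query methods of
// canisters on the SAME subnet, so this breaks if a counter is moved elsewhere.
actor Aggregator {
  type Counter = actor { peek : shared query () -> async Nat };

  public composite query func total(counters : [Counter]) : async Nat {
    var sum = 0;
    for (c in counters.vals()) {
      sum += await c.peek(); // cross-canister query call, same subnet only
    };
    sum
  };
}
```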
Just taking a quick look at ICRC-72, it looks like the thing that would change is that canister calls will take 1 to 2 seconds longer, but the canisters don’t have to make any code changes. Yeah, that is a personal preference, but the 1-2-seconds-faster calls come with a hole for an indefinite dapp-targeted DDoS.
The ICRC-1/2/3 methods don’t make any downstream canister calls, so their duration/latency stays the same. If an index canister is on a different subnet from the ledger canister, it will take 1 to 2 seconds longer for the index canister to update. But again, that comes with the benefit of being safe from a targeted DDoS. For a ledger I think that is a considerable factor.
I don’t know about the specifics of the implementation, but I do know that canister-calls made by a dapp to the ICP or SNS ledgers are already cross-subnet and their duration/latency stays the same.
If the system would have automatically spread the load of the Bob canisters throughout the available space on other subnets, there would have been zero downtime. If needed, new subnets could’ve been created with the spare nodes.
Yea, they are still marked as EXPERIMENTAL in the interface spec. Also the ICRC-3 query pattern can always replace composite queries.
It still works if the canisters are on different subnets; each canister call would just take 1 to 2 seconds longer. I heard the wasm component model will replace the need for canPack.
Not sure their reasons for this, but that would limit it to the size of a single subnet? Are you sure their protocol doesn’t need to grow beyond a single subnet?
For sure, UX must be workable. Will canister-calls taking 1.5 seconds longer make the user bored? I guess it is different for different people with different personal preferences. But what will the user think if the application is unavailable for an indefinite amount of time, due to a targeted DDOS? Then would the user rather that the canister-calls take 1.5 seconds longer if that meant there would be no chance of a DDOS?
I don’t think we can afford any of these options. If latency exceeds 2 seconds, there’s little benefit to having the frontend on-chain. Even as a programmer, I already find it difficult to wait for the NNS app. Similarly, we can’t risk getting DDOSed the moment the dapp becomes popular.
Why are we suddenly prioritizing DDoS protection? As someone already pointed out, we don’t have bullet-proof DDoS protection right now, so providing a canister migration implementation that also does not provide bullet-proof DDoS protection does not constitute a regression. I.e. we’re not breaking anything, we’re simply providing more tools to canister developers.
No one is (honestly) looking for millisecond latency on the IC. But what you are proposing is to increase the latency of the average update with downstream calls by a factor of 3 (not by 1-2 seconds). Currently, given canisters colocated on a subnet that is not meaningfully backlogged, an update call that involves multiple such canisters takes on the order of 2 seconds (maybe half a second to a second more if we’re talking multiple calls).
With your approach, if we have canister 1 on subnet A and canister 2 on subnet B, with canister 1 calling canister 2 once as part of handling an update call, latency jumps to at least 6 seconds (and likely more). This is because instead of one induct → execute ( → execute → execute → …) → certify cycle, you have 3: one for the ingress message on subnet A, one for the request on subnet B, and one for the response on subnet A again. So the increase in latency is about 5 seconds, not 1.5. And, as said, this is not something that can be optimized away; it is a physical limitation, having to do with geography, the speed of light and consensus algorithms.
While you may be happy with somewhat better DDoS protection at the cost of a 3x latency increase (from an already borderline base), most canister developers won’t be. There are studies going back 20 years or more, showing that a meaningful portion of users will drop off if latency increases by as little as a couple hundred milliseconds. It’s why Google literally chooses to show you only part of the search results if some of the backends executing your search query take longer than tens of milliseconds to respond.
Yes. Anything that involves me clicking on something and then waiting for close to 10 seconds for anything at all to happen. That may be par for the course for the average crypto dapp on the average crypto network, but we’re building a general purpose decentralized compute platform that should be able to do quite a bit more than run a glorified ledger. So going “we’ve implemented DDoS protection; BTW latency has increased 3x” is wholly unrealistic.
Leaving aside the 1.5 seconds (which is actually 5 seconds), randomly shuffling around canisters would not guarantee “no chance of a DDOS”. No single silver bullet would.
Also, I would argue that the average dapp whose developer is unwilling to put more than a couple of bucks per month into it is not the usual DDoS attack target. If OTOH your dapp is the kind of dapp that is the target of DDoS attacks (e.g. a large dapp, or one managing meaningful amounts of funds), then you should be able to protect against them by running on your own separate subnet or by paying for a compute allocation. And if you are indeed fine with a 10 second latency (under DDoS conditions; under regular conditions you would still get the regular 2 second latency), then you only need to pay for a compute allocation of 4 (one round out of every 25). With no additional DDoS measures required. And without an absolutely huge latency impact on every single non-trivial dapp and their users.
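For reference, compute allocation is just a canister setting that a controller can change, either via `dfx canister update-settings --compute-allocation 4 <canister>` or programmatically through the management canister. A rough Motoko sketch (hypothetical method name, assuming the calling canister is a controller of the target):

```motoko
// Rough sketch: reserve a 4% compute allocation (roughly one guaranteed
// execution round out of every 25) for a canister we control.
actor {
  // The management canister, typed with just the method and fields we need.
  type Management = actor {
    update_settings : shared {
      canister_id : Principal;
      settings : {
        controllers : ?[Principal];
        compute_allocation : ?Nat;
        memory_allocation : ?Nat;
        freezing_threshold : ?Nat;
      };
    } -> async ();
  };
  let IC : Management = actor "aaaaa-aa";

  public func reserveCompute(target : Principal) : async () {
    await IC.update_settings({
      canister_id = target;
      settings = {
        controllers = null;      // leave controllers unchanged
        compute_allocation = ?4; // percent of a scheduler slot
        memory_allocation = null;
        freezing_threshold = null;
      };
    });
  };
}
```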
Oh, and we didn’t even cover bandwidth. Canisters on the same subnet can transfer more than 1 GB between them every round. Cross-subnet bandwidth is limited by the block size to 4 MB per round, shared with ingress messages and HTTP outcalls. And while there’s ongoing work to significantly increase that, you won’t be able to sustain much higher concurrent traffic between every single pair of subnets (which is what will happen if you randomly distribute every single dapp’s canisters). So it’s actually at least 3x higher latency and at least 250x lower bandwidth. The costs are way out of proportion to the benefits.
Finally, I will go back to my initial point: this thread discusses reducing latency and improving scalability. The proposed solutions do not materially impact the DDoS resilience of the network, such as it is. I don’t think we should be focusing on DDoS here. Or on anything not related to or impacted by the proposed changes.
According to a conversation in another thread the increase in latency is 1.8 seconds since the recent lowering of the notarization delays:
(I was surprised too.)
I just want to clarify for our readers the cases where the latency will stay the same (will not change): canister methods that don’t make any inter-canister calls, and canister methods that already make inter-canister calls to canisters on different subnets. Neither of these will change. The one specific case where latency will increase is canister methods that currently make inter-canister calls to other canisters on the same subnet; the latency of this case will increase by 1.8 seconds.
If most canister developers don’t want that, then that’s ok, as long as the different factors are being considered one way or the other.
I was using random canister shuffling as an example of something the protocol could do. The main point is that the protocol (and not canister developers) being in charge of where canisters live and how they get spread out across subnets is what puts the protocol in the best position that I know of to spread any kind of load (DDoS or well-intentioned) throughout the subnets.
If there are other things we will be able to do later on to round out the load balancing while still letting canister developers choose their specific subnets, that could be the best option for sure.
I think it is great that compute-allocation exists now to be able to at least have an option. It’s something like $130 per month per canister for a compute allocation of 4. Works for a few canisters.
Collectively, yeah, good point, but on the public subnets it’s not so easy for a single dapp to take advantage of that.
My purpose here is sharing my thoughts on a plan for scalability which contains the mechanics to handle any kind of load (DDoS or well-intentioned), and which makes it simpler for canister controllers to scale their dapps.
The typical frontend and backend canisters would maintain the same latency, correct? That doesn’t seem too bad then. I agree that allowing subnet selection might not be a viable long-term option unless the subnets are of different types. Maybe the selection of subnet is what should be costly.
No, it will not. It will increase by something like 3-5 seconds (the equivalent of 2 extra induct → execute → certify roundtrips). More simply, whatever the current latency is, it’s going to be 3x that (because you’re inducting, executing and certifying 3 messages instead of one; and because a local roundtrip more often than not takes zero extra rounds).
What I was trying to say, in turn, is that random canister shuffling with no input from canister controllers (as in “these 3 canisters should stay together”) is an absolute no-go. It breaks one of the core behaviors of the network: yes, the minimum call latency is on the order of 2 seconds; yes calling out to a random ledger is 3x that; but if you deploy your app just right, you can get away with the former the vast majority of the time.
I am also trying to say that we should not be discussing additional DDoS protection on this thread. Particularly not arguing about whether it should block features that are necessary to get us out of the uneven load (and its significant side effects) that we are experiencing as we speak.
I’m not sure what your point is, though. Are you arguing that because the average canister can only get a fraction of the 1 GB/block subnet-local bandwidth it should be perfectly fine for it to only get the same fraction of a 4 MB/block bandwidth instead?
I apologize if I sound like I’m losing patience. Because I really am losing patience. Please believe me when I say that randomly separating the frontends and backends of a dapp across subnets will lead to a 3x latency increase compared to them being hosted by the same subnet. The actual number depends on the actual ingress latency, which depends on a lot of factors, but it is 3x of what it would otherwise be. And this “otherwise” is unfortunately not tens or hundreds of milliseconds, it is seconds. Plural.
Canisters are actors, i.e. they’re single threaded. You may get a small dapp running acceptably as a single canister, but anything that needs to handle non-negligible load or traffic must be split across multiple canisters. And spreading out those canisters as widely as possible means that any non-trivial application’s latency increases by 3x. For something that we agree is not even a panacea for DDoS prevention. Which is something that should be discussed on a separate thread. And definitely not block efforts to improve latency and scalability for canisters suffering through uneven subnet load.
It’s still a problem if someone decides to create 20k canisters on your subnet just for fun. Increasing the price 10x for subnet selection could solve this problem and allow the frontend/backend to still have low latency. Afaik, shared VM providers don’t let the user select the machines for free.
I think you are fundamentally missing the point that if canisters are on the same subnet, then one can call the other canisters (or set of canisters) on the subnet and get a response up to around 20 times WITHIN THE SAME ROUND, and get a result within 2-3 seconds. If you unknowingly move one or more of those canisters to a different subnet, this becomes a 6-second roundtrip for each and increases your call to 2 minutes.
I’m not sure where you are getting 1.5 seconds from, but going across subnets requires
a. finalizing your call on subnet A - 2-3 seconds
b. processing on subnet B and finalizing the return to A - 2-3 seconds
c. receiving and processing the response on A - 2-3 seconds
So each x-subnet call is 6-9 seconds if no other calls to other canisters are needed.
Please see Motoko Playground - DFINITY and deploy subactor and then main. Call test with the principal of subactor. You will see that main.mo can call, and get a response from, subactor 22 times during a single round (imagine that’s a swarm of user-controlled wallets syncing a multi-party swap).
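For those who don’t want to open the playground, here is a rough sketch in the spirit of that example (hypothetical names, not the actual playground code):

```motoko
// Hypothetical sketch: sequential awaits against a sub-actor. On the same
// subnet these typically all complete within one round; across subnets, each
// await becomes its own multi-second roundtrip.
import Principal "mo:base/Principal";

actor Main {
  type SubActor = actor { ping : shared () -> async Nat };

  // Call with the principal of the sub-actor canister.
  public func test(subId : Principal) : async Nat {
    let sub = actor (Principal.toText(subId)) : SubActor;
    var acc = 0;
    var i = 0;
    while (i < 20) {
      acc += await sub.ping(); // one inter-canister roundtrip per iteration
      i += 1;
    };
    acc
  };
}
```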
I think that finding a mitigation for this is worth it for that architectural gain. It would be great if the IC dev could ignore subnet topology, but it is not possible for certain classes of applications.
I tested it myself today. The average latency-difference between doing an end-to-end ingress-message that makes an inter-canister call to a canister on the same subnet, versus doing an ingress-message that makes an inter-canister call to a canister on a different subnet, is 3.0025 seconds. So it’s not 1.8 seconds, my mistake. It is 3 seconds.
I tested it with two canisters on fuqsr and the third canister on qdvhd. Here the average of the first case is 4.116 seconds, and the average of the second case is 7.020 seconds. So that is a 2.9 seconds difference and 1.7x.
I also did the same thing with the first two canisters on opn46 and the third canister on lspz2, here the average of the first case is 2.656 seconds and the average of the second case is 5.757 seconds. So that is a 3.1 seconds difference and 2.17x.
Average is 3 seconds and 1.9x.
And that only applies to canister calls that are currently between canisters on the same subnet; if those canisters end up on different subnets, the call takes the same amount of time as any other cross-subnet call.
Trying to keep canisters on the same specific subnet is an unstable optimization on the public subnets: if you try to take advantage of it, it gets about 100 times harder to build and maintain an application, if it is possible at all. A public subnet can get filled up at any time by things outside the canister developer’s control, which would then force the application to spread across multiple subnets anyway.
For NNS/rental/Utopia subnets, where there is one developer controlling what can live on the subnet, it makes sense.
Is the ICP ledger random? Are the SNS ledgers random?
When an application calls the ledgers and/or any other dapp besides itself, then there is no difference.
Hold on, are you saying that if a dapp uses another dapp, the first dapp should manually try to follow the second dapp to the same subnets!? That would make things so unstable and chaotic; I didn’t even consider that that’s what you would recommend. The load would never even out like that: all the canisters that integrate with a big dapp would try to move to the same subnet(s) as the big dapp, immediately overloading those subnet(s); everyone would want to be on that subnet, so people would wait for others to move first so they don’t have to move; then the big dapp would need to move, then everyone would try to follow it again… Long-term it would be absolute insanity, and it would keep getting more unstable as the ecosystem and its dapps get bigger.
It’s not just DDoS, per my previous paragraph, it could make the whole network more unstable.
Automatic protocol-managed subnet-splitting and load-balancing stabilizes the network in a way that lasts, and it gets more stable as the network gets bigger.
No, it wouldn’t be limited to the bandwidth between two subnets. Because the application’s canisters would be spread throughout all the subnets in the whole network (of the same subnet-type), a canister can make calls to all of them in parallel, so the bandwidth per block is multiplied by the number of subnets in the whole network. As the network grows with more subnets, the dapp’s scalability and bandwidth get better and better, forever.
If it is a single frontend canister and a single backend canister, there would not be any latency increase. If there are multiple canisters of each, then as long as the canister methods don’t make any inter-canister calls, there would not be any latency increase. When the canisters make calls to other dapps besides themselves, there would not be any latency increase.
For canisters that currently make calls to other canisters on the same subnet, this specific case happens to be extra fast right now, and with the changes, the latency will be the same as any other call to a ledger or to another dapp.
You can have a non-trivial application that works with the ledgers or the SNS canisters whose latency would not increase at all. Also, per my point above, spreading canisters as widely as possible lets you use the bandwidth of every subnet on the network in parallel, which can make huge applications even faster than they would be trying to clump together on the same public subnets.
Letting canister controllers choose which specific subnets to migrate to is something that could cause more uneven subnet load.
No, if your canisters are spread out throughout all the subnets in the whole network, and you have 20 canisters, you can call all of them in parallel, and they would all complete at the same time within a few seconds.
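A rough sketch of what I mean by calling them in parallel (hypothetical worker canisters): start all the calls before awaiting any of them, so the cross-subnet latency is paid roughly once overall rather than once per worker:

```motoko
// Hypothetical sketch: fan out to worker canisters that may live on different
// subnets. The calls are all started before any is awaited, so they are in
// flight concurrently; the same pattern extends to any number of workers.
actor Coordinator {
  type Worker = actor { work : shared () -> async Nat };

  public func fanOut(a : Worker, b : Worker, c : Worker) : async Nat {
    // Start every call first ...
    let fa = a.work();
    let fb = b.work();
    let fc = c.work();
    // ... then await the results.
    (await fa) + (await fb) + (await fc)
  };
}
```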
It’s easy to tell people about a quick optimization, but what happens when that application needs to scale beyond 20 users (like 99% of applications out there)? Then what is your plan to keep taking advantage of that?
Spreading the dapp’s canisters throughout all the subnets in the whole network is automatic infinite scalability.
I was going to post a sarcastic answer, but I’ll refrain. I will just point out again that 5 second best-case latency and spreading your app over 250 subnets to get the same bandwidth as deploying 2 canisters next to one another is unacceptable.
I will also not further engage in this discussion, it does not belong on this thread.
Feel free to put together a motion proposal. If the majority of the community accepts it, I will swallow my pride (and my common sense) and implement it.
My conclusion from this entire thread is that the IC is not scalable as a world computer, as we desire the scalability of @levi combined with the latency of @free.
Based on the comments, achieving this seems impossible. It would be beneficial for everyone to become aware of this reality.
I know that many Dfinity employees don’t have patience for this discussion, but I have been here since the beginning as an investor, developer, and founder.
As it is right now, I don’t imagine people using my dapp for anything other than read-only activities.
Perhaps it would be wise to move away from the ‘everything on-chain’ motto and implement a solution similar to Solana, where at least token swaps, which are the basis of blockchains, can be scaled with proper latency.