Question about potential internal network overload

Hi everyone,

I’ve been digging into how ICP handles scaling and security, and I have a concern I’d like to get some insight on.

Is there any built-in mechanism within the ICP protocol that protects the network from being overloaded from the inside? Let’s say someone with malicious intent wants to damage the network’s reputation and is willing to spend, for example, a million dollars to do so.

In such a case, it doesn’t seem too difficult to write a bot that registers Internet Identities, creates canisters, and starts filling up subnets. For instance, the OpenChat subnet currently holds around 90,000 canisters, while the subnet limit is 100,000. What happens if 10,000 fake users suddenly join and start creating canisters? Wouldn’t that push the subnet to its limit and potentially impact the usability or performance of real dapps?
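Back-of-the-envelope (treating the canister creation fee and the cycles price as assumptions I haven’t verified, and ignoring that fees differ by subnet size), the creation fees alone look cheap next to a million-dollar budget:

```rust
// Rough cost of the scenario above. ASSUMPTIONS (please correct me):
// canister creation fee ~0.1T cycles, and 1T cycles ~= 1 XDR ~= 1.33 USD.
fn main() {
    let creation_fee_cycles: f64 = 100_000_000_000.0; // ~0.1T cycles per canister (assumed)
    let usd_per_trillion_cycles: f64 = 1.33;          // ~1 XDR per 1T cycles (assumed)
    let canisters: f64 = 10_000.0;

    let total_cycles = creation_fee_cycles * canisters;
    let total_usd = total_cycles / 1e12 * usd_per_trillion_cycles;
    println!("Creating {canisters} canisters costs ~{total_usd:.0} USD in creation fees alone");
}
```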

Is there any throttling, economic deterrent, or governance mechanism in place to prevent this type of resource exhaustion scenario?

Thanks in advance for any insights.

4 Likes

There are indeed a number of subnet resources that could be exhausted by malicious actors with enough money / ICP / cycles, and the issue has not yet been addressed holistically: partly because it would take quite a bit of money to mount such an attack; partly because such an attack has never been attempted; and partly because addressing it requires a change of mindset (see below).

We do have one-off solutions for specific resources (e.g. progressively higher storage cycle reservations when subnet storage usage exceeds some threshold), but the elephant in the room is the fact that even in those situations fees are still flat (i.e. in the case of storage reservations, canisters are forced to reserve cycles to pay for said storage over longer and longer time spans, but the fee itself is flat – $5 / GB / year).
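To sketch the shape of that mechanism (illustrative numbers only, not the actual protocol parameters): the per-byte rate stays flat, and only the period that new allocations must be reserved for grows as the subnet fills up.

```rust
// Sketch of the reservation mechanism's shape. All constants below are
// illustrative assumptions, not the real protocol parameters.

/// Reservation period (in days) as a function of subnet storage usage:
/// zero below an activation threshold, then growing toward some maximum
/// as usage approaches the subnet's capacity.
fn reservation_period_days(subnet_usage_gib: u64) -> u64 {
    const THRESHOLD_GIB: u64 = 450;   // assumed activation threshold
    const CAPACITY_GIB: u64 = 750;    // assumed subnet capacity
    const MAX_PERIOD_DAYS: u64 = 180; // assumed maximum reservation period
    if subnet_usage_gib <= THRESHOLD_GIB {
        0
    } else {
        let over = (subnet_usage_gib - THRESHOLD_GIB).min(CAPACITY_GIB - THRESHOLD_GIB);
        MAX_PERIOD_DAYS * over / (CAPACITY_GIB - THRESHOLD_GIB)
    }
}

fn main() {
    for usage in [400u64, 500, 600, 700] {
        println!(
            "subnet at {usage} GiB: new allocations must be reserved for ~{} days",
            reservation_period_days(usage)
        );
    }
}
```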

In order to address abuse more broadly, we need a generalized model of progressive fees / surge pricing. Storage can get away with increasingly long reservations while keeping fees flat, but we cannot apply this approach to all types of resources. So while it’s really nice to be able to easily predict one’s costs thanks to flat fees, efficient and reliable use of shared resources requires the ability to react dynamically to malicious behavior (or just random spikes). Also note that, as you observed, it is relatively trivial to mount a Sybil attack on the IC, so surge pricing wouldn’t only affect the “bad guys”: everyone using the resource under attack at the time would face significantly higher fees. That means any solution will have to be carefully designed so as not to turn the protection mechanism itself into a tool for attack (e.g. if there’s a hefty fee penalty for some resource when its subnet or global usage exceeds some threshold, an attacker could exhaust just enough of said resource to make everyone else pay the penalty while avoiding it themselves).
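To make that design problem concrete, here is a toy illustration (not a proposal, and not anything implemented in the protocol) of why the shape of a surge curve matters: a hard threshold hands an attacker a cheap lever, whereas a smooth ramp has no single cliff edge to weaponize.

```rust
// Illustration only, not a proposed fee schedule: two ways a "surge"
// multiplier could depend on resource utilization (0.0..=1.0).

/// Step function: flat until a threshold, then a hefty penalty.
/// An attacker can park usage just above the threshold and make everyone pay it.
fn step_multiplier(utilization: f64) -> f64 {
    if utilization < 0.8 { 1.0 } else { 10.0 }
}

/// Smooth curve: fees ramp up continuously as the resource fills,
/// so there is no single cliff edge to exploit cheaply.
fn smooth_multiplier(utilization: f64) -> f64 {
    1.0 + 9.0 * utilization.clamp(0.0, 1.0).powi(4)
}

fn main() {
    for u in [0.5, 0.79, 0.81, 0.95] {
        println!(
            "utilization {u:.2}: step x{:.1}, smooth x{:.1}",
            step_multiplier(u),
            smooth_multiplier(u)
        );
    }
}
```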

And given the nature of the IC, those maintaining and extending the protocol (i.e. DFINITY engineers) cannot unilaterally decide to tweak the fee structure, even if that would mean a safer, more scalable network. We need to do a comprehensive analysis, put together a well-argued solution, get the community to weigh in on it and, finally, ask them to vote on a high-level design.

In the meantime, in addition to the specific solutions mentioned above, we also have (less granular) tools and processes to help us deal with such occurrences as one-off incidents: we can create new subnets to spread the load (there are still many hundreds of unassigned nodes); we can split heavily oversubscribed subnets; controllers will soon be able to migrate their canisters to less loaded subnets; and there’s always the option of falling back on the NNS to deal with problematic behavior (e.g. by taking down / blocking / penalizing actively malicious canisters, including via one-off changes to the protocol to achieve this).

And we are continuously working on improving scalability. For a while now, the limits on the number of canisters and on subnet storage have mostly been there to keep subnets responsive, rather than as hard limits beyond which a subnet would fall over. There is now quite a bit of headroom for subnets to scale vertically if need be.

6 Likes

When you say it has never been attempted, what do you make of this?

It can’t be a coincidence that the state jumped randomly to 451 GiB for a week and all the other canisters on the subnet were drained.

3 Likes

It’s really hard to say exactly what happens on specific subnets beyond metrics of the kind you linked to: increase in storage usage, lots of canister installs, canisters being created, etc.

There are no per-canister metrics, because it would be virtually impossible to collect any meaningful set of metrics from every single canister on the IC without having dozens and dozens of servers (and a whole lot of infrastructure) dedicated to this. I imagine there may also be reasons why some canister developers / controllers wouldn’t want DFINITY (or everyone) to closely monitor their canisters’ cycle balances or behavior.

So all I can say is that the canister count on the subnet increased in a couple of large steps (even though that was at least months before this issue was reported); there was very high load on the subnet for a few hours, including apparently a whole lot of canister installs; and some canister’s large balance may or may not have evaporated. The exact code deployed to the subnet at the time can be looked up on GitHub; and having looked at it a lot over the years, AFAIK there’s nothing fishy about it (not even, considering my previous message, any sort of surge pricing beyond storage reservations). So if the cycle balance did evaporate, it was due to some sort of resource usage by that canister. It was potentially triggered by malicious load on the canister (e.g. maybe someone intentionally triggered lots of HTTP outcalls), but that’s entirely a wild guess and nothing more.

If you want to find out why your canister is running out of cycles, the only method I can think of is to instrument your canister code: count how many times each method is called, count HTTP outcalls or whatever other costly operations you’re doing, track the cycle balance, etc. With that in hand, there’s a good chance that you could pinpoint the cause and likely address it (e.g. rate limit specific APIs, block untrusted callers). Unfortunately, that’s the best answer I have.
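For what it’s worth, here is a minimal sketch of that kind of instrumentation for a Rust canister (assuming ic-cdk and candid; `expensive_call` is a placeholder method, and exact API names may differ between ic-cdk versions):

```rust
// Minimal instrumentation sketch for a Rust canister (assumes ic-cdk + candid).
// `expensive_call` is a placeholder for whatever costly method you expose.
use std::cell::RefCell;
use std::collections::BTreeMap;

use candid::CandidType;

thread_local! {
    // Per-method call counters, kept in canister heap memory.
    static CALL_COUNTS: RefCell<BTreeMap<String, u64>> = RefCell::new(BTreeMap::new());
}

fn bump(method: &str) {
    CALL_COUNTS.with(|c| *c.borrow_mut().entry(method.to_string()).or_insert(0) += 1);
}

#[derive(CandidType)]
struct Stats {
    call_counts: Vec<(String, u64)>,
    cycle_balance: u128,
}

#[ic_cdk::update]
fn expensive_call() {
    bump("expensive_call");
    // ... the actual work (HTTP outcalls, heavy computation, etc.) goes here ...
}

#[ic_cdk::query]
fn stats() -> Stats {
    Stats {
        call_counts: CALL_COUNTS
            .with(|c| c.borrow().iter().map(|(k, v)| (k.clone(), *v)).collect()),
        // May be named `canister_cycle_balance()` in newer ic-cdk versions.
        cycle_balance: ic_cdk::api::canister_balance128(),
    }
}
```

Querying `stats` before and after a suspicious period should at least tell you which methods are being hammered and how fast the balance is dropping.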

6 Likes

Let’s please stick to facts and keep the thread focused on what the OP asks/comments on. @free gave an honest reply about where we stand and what we’re considering for the future; I have nothing to add to that. However, I do want to address some potentially false/imprecise accusations/observations.

The subnet in question experienced a brief spike in memory usage (barely above 450 GiB) around January 24-25th (one can easily verify this via the graphs in the linked threads). I was not following it back then, and as @free explained, we have limited visibility anyway into what each and every canister on the subnet is doing (we mostly collect aggregates). For all we know, it was a legitimate spike that then dropped (likely canisters were re-installed or similar, but of course that’s a guess).

The other thread you linked referred to a cycles burn spike on March 9th (which, as I replied on that thread, was related to a large number of canister installs), and there was no spike in the subnet’s memory usage (again, easily verified through the public dashboard if one expands the state chart here over the last 3 months). Therefore, there isn’t even an argument to be made about the subnet going beyond the memory reservation threshold, which could affect other canisters on the subnet (by forcing cycles to be reserved for extra memory allocations). Any canisters that were “drained” during that time were “drained” because of their own usage, not because of something else.

6 Likes

The fact is that this predominantly Hot or Not subnet spiked to 640 GiB for seventeen days straight.

How is this off-topic? What happens when a malicious actor with 30k canisters on a subnet artificially inflates the state size so that all the canisters that haven’t reserved cycles have to pay extra? Am I misunderstanding it?

This is a potential internal network overload. Don’t try and wenzel my posts, I expected more from you.

2 Likes

Technically speaking, you’re not paying extra when the subnet’s state size exceeds whatever the threshold for memory reservations is (400 GB? 500 GB?). You are simply required to reserve any new memory that you allocate for a longer time span. That is, instead of paying by the minute, as usual, cycles are set aside (not burned) to pay for the extra 1 MB (or however much your memory usage increased by) for a month or two or three (the duration increases progressively the closer the subnet gets to its storage limit). But unless you intend to uninstall your canister, those cycles will then be spent to cover your own storage, instruction or call costs over the following days and weeks. The storage costs themselves stay flat, at $5 / GB / year.
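To put a rough number on it (using the published ~$5 / GB / year figure, i.e. roughly 127k cycles per GiB per second; do verify against the current fee schedule), reserving 1 MiB for three months sets aside on the order of a billion cycles, a small fraction of a cent:

```rust
// Rough numbers for reserving 1 MiB for 3 months at the flat storage rate.
// 127_000 cycles / GiB / s corresponds to roughly $5 / GiB / year; verify current values.
fn main() {
    let cycles_per_gib_per_sec: u128 = 127_000;
    let mib: u128 = 1 << 20;
    let gib: u128 = 1 << 30;
    let three_months_secs: u128 = 90 * 24 * 3600;

    let reserved = mib * cycles_per_gib_per_sec * three_months_secs / gib;
    // ~0.96 billion cycles, i.e. roughly a tenth of a cent at ~$1.33 per 1T cycles.
    println!("Reserving 1 MiB for 3 months sets aside ~{reserved} cycles");
}
```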

That being said, this is exactly the issue I mentioned above: on the very same thread, questions about “how can the protocol deal with abuse?” are quickly followed by “what happens when someone uses up a lot of resources and I have to pay extra?”. This is the problem that needs to be addressed, and it’s not a simple one: not technically, and especially not politically.

Because AFAICT (and I’d be exceedingly happy to be proven wrong) pricing is the one tool we can use to mitigate most abuse scenarios.

5 Likes

How is this off-topic? What happens when a malicious actor with 30k canisters on a subnet artificially inflates the state size so that all the canisters that haven’t reserved cycles have to pay extra? Am I misunderstanding it?

Yes, you’re correct that if some canisters bring the state size above the threshold, other canisters would also need to reserve cycles. However, the canisters do not exactly pay more; they have to prepay for a longer period via memory reservations, as explained earlier in the thread (but at the same price). All other costs (pertaining to other resources) remain exactly the same.

This is a potential internal network overload.

Indeed, and the best mechanism we currently have is memory reservations, which disincentivize short-term memory allocations; but of course the mechanism is not perfect. Any ideas for improvements are very welcome.

5 Likes

Ok thanks, and thanks @free.

I know it’s complicated to fix and I’m not the person to help there. I’m just seeing a lot of people complaining about the IC breaking or being slow in certain ways, and when I dig in I always find it connected to the same group of people.

Anything that can be gamed will be gamed.

4 Likes

Rest assured (you and anyone else reading the thread) that we keep an eye on these things and always make the improvements we can.

Scalability and performance of the IC are among our top priorities, and we invest a lot of time analyzing data and incidents from mainnet and running our internal benchmarks, looking for improvements. This work is not always visible, though, and we can’t always talk about it openly, as in some scenarios it could give truly malicious actors ideas about exploits in the system.

The storage story is a particularly challenging one, and changes are not so easy to make (mostly for political reasons, actually, not technical ones).

6 Likes