Long Term R&D: Malicious Node Security (proposal)

1. Summary

This is a motion proposal for the long-term R&D of the DFINITY Foundation, as part of the follow-up to this post: Motion Proposals on Long Term R&D Plans (Please read this post for context).

This project’s objective

This project is about the IC monitoring, reacting to, and handling malicious behavior from nodes in a subnet. Examples include handling equivocating block makers more efficiently and detecting and acting upon malicious behavior.

2. Discussion lead

Manu Drijvers

3. How this R&D proposal is different from previous types

Previous motion proposals have revolved around specific features and tended to have clear, finite goals that are delivered and completed. They tended to be measured in days, weeks, or months.

These motion proposals are different and are defining the long-term plan that the foundation will use, e.g., for hiring and organizational build-out. They have the following traits and patterns:

  1. Their scope is years, not weeks or months as in previous NNS motions
  2. They have a broad direction but are active areas of R&D so they do not have an obvious line of execution.
  3. They involve deep research in cryptography, networking, distributed systems, language, virtual machines, operating systems.
  4. They are meant to match the areas where the DFINITY foundation’s expertise is best suited.
  5. Work on these proposals will not start immediately.
  6. There will be many follow-up discussions and proposals on each topic when work is underway and smaller milestones and tasks get defined.

An example may be the R&D for “Scalability” where there will be a team investigating and improving the scalability of the IC at various stages. Different bottlenecks will surface and different goals will be met.

3. How this R&D proposal is similar to what we have seen

We want to double down on the behaviors we think have worked well. These include:

  1. Publicly identifying owners of subject areas to engage and discuss their thinking with the community
  2. Providing periodic updates to the community as things evolve, milestones are reached, proposals are needed, etc.
  3. Presenting more and more R&D thinking early and openly.

This has worked well for the last 6 months so we want to repeat this pattern.

4. Next Steps

Developer forum intro posted
1-pager from the discussion lead posted
NNS Motion proposal submitted

5. What we are asking the community

  • Ask questions
  • Read 1-pager
  • Give feedback
  • Vote on the motion proposal

Frankly, we do not expect many nitty-gritty details because these are meant to address projects that go on for long time horizons.

The DFINITY foundation’s only goal is to improve the adoption of the IC so we want to sanity-check the projects we see necessary for growing the IC by having you (the ICP community) tell us what you all think of these active R&D threads we have.

6. What this means for the existing Roadmap or Projects

In terms of the current roadmap and proposals executed, those are still being worked on and have priority.

An intellectually honest way to look at these long-term R&D projects is to see them as the upstream or “primordial soup” from which more baked projects emerge. With this lens, these proposals are akin to asking, “what kind of specialties or strengths do we want to make sure DFINITY foundation has built up?”

Most (if not all) projects that the DFINITY foundation has executed or is executing are born from long-running R&D threads. Even when community feedback tells the foundation, “we need X” or “Y does not work”, it is typically the team with the most relevant R&D area that picks up the short-term feature or project.

Please note:

Some folks have asked if they should vote to “reject” any of the Long Term R&D projects as a way to signal prioritization. The answer is simple: “No, please, ACCEPT” :wink:

These long-term R&D projects are the DFINITY foundation’s thesis on the R&D threads it should have across years (3 years is the number we sometimes use internally). We are asking the community to ACCEPT (pending 1-pager and more community feedback of course). Prioritization can come at a separate step.

Malicious party security

Objective

The internet computer is designed to run software in the form of canister smart contracts in a very secure and reliable fashion. In simple terms, it means that many replicas together form a subnet, and even if some of those replicas are malicious and misbehaving, the internet computer still guarantees two important properties:

  • Liveness: the canister smart contracts on that subnet should still be available and process incoming messages
  • Safety: the state of a canister only transitions according to the rules of the canister (i.e., tampering with a canister state is impossible, and in particular rolling back / “double spending” attacks are impossible).

The internet computer protocol by design has a very simple argument that guarantees safety, which can be observed simply by receiving sufficiently many cryptographic signatures. Liveness however is harder to achieve in practice: if peers in the subnet are malicious, they can send a lot of useless messages, trying to waste the bandwidth or CPU of an honest replica.

Why this is important

The IC should be able to withstand and defend against all sorts of adversarial and non-adversarial behavior that may affect its liveness, safety, performance and/or fairness. In particular, the IC should cope with DoS and other attacks that aim to waste its resources.

Beyond spamming the network by sending invalid, incorrect, irrelevant (e.g. duplicate, expired) or explicitly unwanted artifacts, any behavior capable of wasting system resources, such as advertising and then withholding requested artifacts, is considered a form of misbehavior.

Background

There are many different ways in which a replica can misbehave. Some are relatively simple: if a node sends signed artifacts that deviate from the protocol (no honest party would ever send such artifacts), then this node is clearly misbehaving and there exists evidence. Other types of misbehavior are much harder to act upon: if a node never sends any message, there is no hard evidence of misbehavior.

We can classify the types of misbehavior based on whether a node can deterministically detect and prove them to other nodes in the network. If certain malicious actions are not provable, the affected victims could still file a complaint and let the system punish the bad node or user once more than a threshold of nodes (e.g. > n/3) have complained against the same node or user.
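As a rough illustration of the complaint mechanism, here is a minimal Python sketch. The class `ComplaintTally`, its method names, and the strict "more than n/3 distinct complainers" rule are assumptions made for this example; they are not taken from the IC implementation.

```python
class ComplaintTally:
    """Hypothetical tally of complaints filed against accused nodes."""

    def __init__(self, subnet_size: int):
        self.subnet_size = subnet_size
        # accused node id -> set of distinct complainer ids
        self._complaints = {}

    def complain(self, complainer: str, accused: str) -> None:
        # Self-complaints are ignored; repeated complaints from the
        # same node count once because complainers are kept in a set.
        if complainer != accused:
            self._complaints.setdefault(accused, set()).add(complainer)

    def should_punish(self, accused: str) -> bool:
        # Strictly more than n/3 distinct complainers.
        # Integer arithmetic avoids floating-point rounding.
        count = len(self._complaints.get(accused, ()))
        return 3 * count > self.subnet_size
```

Under the common assumption that fewer than a third of the nodes are faulty, a threshold above n/3 means at least one honest node is among the complainers, so a group of malicious nodes alone could not get an honest node punished.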

Deterministically detectable misbehavior

This category of misbehavior is unequivocally detectable based on a specific action or received artifact.

Publicly blamable:

  • Node sends an unrequested or invalid artifact/advert/request
  • Node sends the same artifact twice
  • Node withholds requested artifacts
  • Node sends too many artifacts (e.g., state sync requests) per time unit
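To make the "too many artifacts per time unit" case concrete, here is a minimal sliding-window sketch. `PeerRateLimiter`, its parameters, and the choice of a sliding window (rather than, say, a token bucket) are assumptions for illustration only.

```python
from collections import defaultdict, deque


class PeerRateLimiter:
    """Hypothetical per-peer limiter: at most max_artifacts per window."""

    def __init__(self, max_artifacts: int, window_seconds: float):
        self.max_artifacts = max_artifacts
        self.window_seconds = window_seconds
        # peer id -> timestamps of recently accepted artifacts
        self._arrivals = defaultdict(deque)

    def allow(self, peer: str, now: float) -> bool:
        arrivals = self._arrivals[peer]
        # Forget timestamps that have fallen out of the window.
        while arrivals and now - arrivals[0] >= self.window_seconds:
            arrivals.popleft()
        if len(arrivals) >= self.max_artifacts:
            # Over the limit: the caller may count this as misbehavior.
            return False
        arrivals.append(now)
        return True
```

Each per-peer queue holds at most `max_artifacts` timestamps, so the limiter itself cannot be used to exhaust memory as long as the set of peers is bounded.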

Publicly provable:

  • Equivocating blocks
  • Equivocating signatures
  • DKG dealings from non-dealers
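The "publicly provable" property can be sketched for equivocating blocks: two validly signed, different blocks from the same maker at the same height form evidence that any third party can check. In the sketch below, `SignedBlock`, `is_equivocation_proof`, and the `verify` callback are hypothetical names; signature verification is deliberately left abstract rather than tied to the IC's actual cryptography.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SignedBlock:
    maker: str        # id of the node that signed the block
    height: int       # position of the block in the chain
    payload: bytes    # block contents
    signature: bytes  # maker's signature (scheme left abstract)

    def digest(self) -> bytes:
        return hashlib.sha256(self.payload).digest()


def is_equivocation_proof(
    a: SignedBlock,
    b: SignedBlock,
    verify: Callable[[SignedBlock], bool],
) -> bool:
    """True if a and b show that one maker signed two different
    blocks at the same height. `verify` must check the signature."""
    return (
        a.maker == b.maker
        and a.height == b.height
        and a.digest() != b.digest()
        and verify(a)
        and verify(b)
    )
```

Because the check only needs the two signed blocks and the maker's public key (hidden inside `verify` here), the pair can be forwarded to other nodes as self-contained evidence.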

Probabilistically detectable misbehavior

A node can only probabilistically infer misbehavior by statistically analyzing the communication patterns of a peer.

Publicly blamable:

  • Node regularly responds to artifact requests late (but before timeout)
  • Node sends “too many” errors to requests.

Publicly observable:

  • Node fails to produce blocks most of the time, i.e. the fraction of the blocks in the final chain is below a certain level
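The block-production check can be sketched as a simple share computation over an observation window of the final chain. The function name, the assumption that each node's fair share is roughly 1/n, and the `min_share_of_fair` parameter are all illustrative choices, not protocol constants.

```python
from collections import Counter
from typing import Dict, List


def underperforming_makers(
    final_chain_makers: List[str],
    subnet_nodes: List[str],
    min_share_of_fair: float,
) -> Dict[str, float]:
    """Return each node whose share of blocks in the window is below
    min_share_of_fair times the fair share 1/n, with that share.

    final_chain_makers lists the maker of every finalized block in
    the observation window, in any order.
    """
    total = len(final_chain_makers)
    if total == 0 or not subnet_nodes:
        return {}  # nothing observed yet, so nothing to conclude
    counts = Counter(final_chain_makers)
    threshold = min_share_of_fair / len(subnet_nodes)
    flagged = {}
    for node in subnet_nodes:
        share = counts[node] / total
        if share < threshold:
            flagged[node] = share
    return flagged
```

For example, with four nodes and `min_share_of_fair=0.5`, a node is flagged once its share of the window drops below 12.5%. A short window makes this estimate noisy, which is why the choice of observation window and tolerated deviation is left as an open question.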

Collectively detectable

Some adversarial communication patterns may only be detectable on an aggregate basis by multiple nodes.

Publicly blamable:

  • Node selectively advertises some artifacts to some of the peers but not to others.
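A minimal sketch of how such aggregate detection might look: every node reports which artifacts it saw advertised by one suspect peer, and the reports are compared. The function `selectively_advertised` and the report format are assumptions for this example.

```python
from typing import Dict, Set


def selectively_advertised(reports: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Compare per-node reports about a single suspect peer.

    reports maps each reporting node to the set of artifact ids it
    saw the suspect advertise. The result maps every artifact that
    was not advertised to everyone to the reporters that missed it.
    """
    all_artifacts = set().union(*reports.values())
    missed = {}
    for artifact in all_artifacts:
        missing = {
            node for node, seen in reports.items() if artifact not in seen
        }
        if missing:
            missed[artifact] = missing
    return missed
```

This is only a starting point: reporters may themselves be malicious, and honest reports can differ because of timing, so a real scheme would presumably need thresholds and observation windows on top of this comparison.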

Key milestones

  1. Replicas temporarily disconnect from peers that send invalid artifacts
  2. All data structures are of bounded size, and malicious peers cannot prevent an honest node from staying up-to-date on the blockchain and state of the subnet
  3. Statistically deviating replicas can be identified by the protocol
  4. Action is taken against provably misbehaving replicas, and such replicas may be permanently removed from the internet computer by the protocol
  5. Action is taken against statistically deviating replicas, and such replicas may be permanently removed from the internet computer by the protocol
  6. No bad peer can deteriorate the throughput and latency by more than 20% in and across subnets of at least 50 nodes
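The first two milestones (temporary disconnects and bounded data structures) can be illustrated together with a small ban list that expires entries and never grows beyond a fixed size. `TemporaryBanList`, its parameters, and the eviction policy are hypothetical and only meant as a sketch.

```python
class TemporaryBanList:
    """Hypothetical bounded list of temporarily disconnected peers."""

    def __init__(self, ban_seconds: float, max_entries: int):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.ban_seconds = ban_seconds
        self.max_entries = max_entries
        # peer id -> time at which the ban ends
        self._banned_until = {}

    def report_invalid_artifact(self, peer: str, now: float) -> None:
        # Drop bans that have already expired.
        self._banned_until = {
            p: t for p, t in self._banned_until.items() if t > now
        }
        if (
            peer not in self._banned_until
            and len(self._banned_until) >= self.max_entries
        ):
            # Stay bounded: evict the ban that would expire soonest.
            soonest = min(self._banned_until, key=self._banned_until.get)
            del self._banned_until[soonest]
        self._banned_until[peer] = now + self.ban_seconds

    def is_banned(self, peer: str, now: float) -> bool:
        expiry = self._banned_until.get(peer)
        return expiry is not None and now < expiry
```

Evicting the soonest-to-expire ban is one possible policy; the point of the sketch is that a malicious peer cannot make the structure grow without bound, whatever it sends.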

Discussion leads

@yvonneanne, @Manu

Open Research questions

  • What are relevant metrics for correct participation in the Internet Computer Protocol?
  • What are suitable observation windows and what are the expected and tolerated deviations from aggregate metrics?
  • How can the traffic be shaped to utilize the available bandwidth between nodes in the same and different data centers efficiently despite malicious nodes?
  • How can one design OS and networking scheduling algorithms to provide Quality of service guarantees despite Byzantine players?
  • How can we prove that the internet computer protocol is live while maintaining bounded data structures?

Skills and expertise necessary to accomplish this

This project will require a wide range of different skills. On the one hand, this is a theoretical problem that requires academic distributed systems and cryptography experts. On the other hand, it requires empirical testing, which includes testing how specific malicious behavior is handled, but also chaos engineering and fuzzing techniques.

What we are asking the community

  • Review comments, ask questions, give feedback
  • Vote accept or reject on NNS Motion
  • Participate in technical discussions as the motion moves forward

The proposal is live! Internet Computer Network Status