Canister audit advice

From a recent Canister source code auditing gig I extracted some general advice, which I am happy to share here with the community. If you are implementing Canisters beyond toy examples, this might be a useful list to go through:

https://www.joachim-breitner.de/blog/788-How_to_audit_an_Internet_Computer_canister

18 Likes

This is incredibly helpful.

One question: when you say reply/response handler for an inter-canister message, do you mean the code that runs after the await?

Also, I really hope canister upgrades are improved in the near future. I read somewhere in the forum that this is in the roadmap. Right now, it seems at best tedious and at worst quite dangerous.

1 Like

Great insights! Two quick questions:

  1. (in rust) There doesn’t seem to be a way to set a timeout when calling a canister. Are there any plans to support this in the future? Being able to handle long response times at source would be handy.

  2. Regarding backups - I had this flow in mind: have a canister state that goes from “live” to “maintenance”, and while it’s in maintenance, drop every call in “inspect_message” except for a set of backup-related calls. Would dropping all update calls in “inspect_message” guarantee that the state cannot change? How would one check whether there are “in flight” calls still pending? Just wait some number of minutes before proceeding with the backup?

1 Like

Exactly! Unfortunately, I expect serious developers won’t be able to avoid thinking of their code in the form that the compiler transforms it into, with explicit continuations. At least sometimes.
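
To make that concrete, here is a minimal Rust sketch (withdraw, BALANCE and ledger_transfer are made-up names) of what that splitting means in practice: everything after the await is effectively the reply/reject handler and runs as a separate message, so other calls may have modified the state in between.

use std::cell::RefCell;

thread_local! {
    static BALANCE: RefCell<u64> = RefCell::new(1_000);
}

// Hypothetical stand-in for a real inter-canister call.
async fn ledger_transfer(_amount: u64) -> Result<(), String> {
    Ok(())
}

#[ic_cdk::update]
async fn withdraw(amount: u64) -> Result<(), String> {
    // First half: runs as one message handler, up to the outgoing call.
    BALANCE.with(|b| {
        let mut bal = b.borrow_mut();
        *bal = bal.checked_sub(amount).ok_or("insufficient balance")?;
        Ok::<(), String>(())
    })?;

    // The compiler splits the function here: the call is sent and this
    // message handler returns.
    let result = ledger_transfer(amount).await;

    // Second half: the reply/reject handler. Other messages may have run and
    // modified BALANCE between the two halves.
    if result.is_err() {
        BALANCE.with(|b| *b.borrow_mut() += amount); // roll back on failure
    }
    result
}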

Inter-canister calls or external calls? For inter-canister calls you cannot. For external calls, your agent library has to poll for the response anyway, so a timeout applies there.

No, inspect message is only for ingress messages, and will not block inter-canister messages.

You should add this “maintenance mode” check to the beginning of each update method, then your plan is good.
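
For example, ic_cdk’s guard attribute is one way to put such a check in front of every update method (a minimal sketch; the MAINTENANCE flag and the method names are made up). Unlike inspect_message, a guard runs before the method body for every call, including calls coming from other canisters.

use std::cell::Cell;

thread_local! {
    static MAINTENANCE: Cell<bool> = Cell::new(false);
}

fn not_in_maintenance() -> Result<(), String> {
    if MAINTENANCE.with(|m| m.get()) {
        Err("canister is in maintenance mode".to_string())
    } else {
        Ok(())
    }
}

// Regular update methods are rejected while in maintenance mode.
#[ic_cdk::update(guard = "not_in_maintenance")]
fn create_post(_text: String) {
    // ... normal state-changing logic ...
}

// Backup-related calls stay callable; real code should also check that the
// caller is an admin.
#[ic_cdk::update]
fn set_maintenance(on: bool) {
    MAINTENANCE.with(|m| m.set(on));
}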

I see. That’s an important distinction to make. Thanks.

1 Like

Canister upgradeability

Just FYI: for both pre- and post-upgrade, if these fail, nothing happens to your production environment; it’s still up and still has the same data. Pre_upgrade and post_upgrade have a cycle limit that is a lot larger than the one for standard queries and updates. DSCVR has not hit the cycle limit for pre_upgrade but often hits it for post_upgrade.
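
For context, a minimal sketch of the upgrade hook pair using ic_cdk’s stable_save/stable_restore helpers (the State type and STATE cell are made-up placeholders). If either hook traps, the upgrade is rolled back and the old code and data stay in place.

use std::cell::RefCell;
use candid::{CandidType, Deserialize};

#[derive(CandidType, Deserialize, Clone, Default)]
struct State {
    users: Vec<String>,
}

thread_local! {
    static STATE: RefCell<State> = RefCell::new(State::default());
}

#[ic_cdk::pre_upgrade]
fn pre_upgrade() {
    // Serialize the heap state into stable memory before the new module runs.
    let state = STATE.with(|s| s.borrow().clone());
    ic_cdk::storage::stable_save((state,)).expect("failed to write state to stable memory");
}

#[ic_cdk::post_upgrade]
fn post_upgrade() {
    // Rebuild the heap state (and any derived indexes) from stable memory.
    let (state,): (State,) = ic_cdk::storage::stable_restore().expect("failed to read state from stable memory");
    STATE.with(|s| *s.borrow_mut() = state);
}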

Backup and recovery

DSCVR currently has the following service calls to perform all the needed offline backups:

backup: () -> (opt Backup) query; // needs to be chunked soon
backup_content_chunks: (nat64, nat64) -> (opt BackupContentChunk) query;
backup_users_chunk: (nat64, nat64) -> (opt BackupUserChunk) query;
backup_portals_chunk: (nat64, nat64) -> (opt BackupPortalChunk) query;
backup_limits: () -> (opt ActionLimits) query; //never needs to be chunked
backup_nfts: () -> (vec NFT) query;   // needs to be chunked soon


restore: (Backup) -> (); //Backup can contain partial data which the canister will detect
restore_content: (Backup) -> ();
restore_users: (Backup) -> ();
restore_nfts: (vec NFT) -> ();
restore_portals: (BackupPortalChunk) -> ();
restore_limits: (ActionLimits) -> ();

Chunked backups are extremely important, as you only get 3 MB down and 2 MB up. When DSCVR deploys a new version of the backend canister, we typically back up the entire backend, deploy a new instance, and then restore all the data. A complete download of all the data (400 MB) takes about 10 minutes (±2 minutes). A chunked backup in Rust looks something like this:

fn content_chunks(start: usize, limit: usize) -> StableStorageContent {
    let all_content: Vec<&Content> = storage::get::<ContentStore>().values().collect();

    // Clamp the range so the last (partial) chunk does not panic on an
    // out-of-bounds slice.
    let end = (start + limit).min(all_content.len());
    let chunk = all_content[start.min(end)..end].to_vec();

    // Pull in the polls belonging to the content in this chunk.
    let mut chunk_polls: Vec<Poll> = Vec::new();
    for content in chunk.iter() {
        if let Some(poll) = Poll::get_by_content_id(&content.id) {
            chunk_polls.push(poll.clone());
        }
    }

    StableStorageContent {
        chunk: chunk.into_iter().cloned().collect(),
        chunk_polls,
        // Total item count, so the caller knows how much trailing data is
        // still left to grab.
        total_length: all_content.len(),
    }
}

If you’re reading this and have a better way to do this, please just say something.

It’s worth pointing out that DSCVR has experimented heavily with how to actually download all the canister data without hitting the limits. The best method has been to grab as much as possible and, when a request fails, decrease the amount of data being requested using the start and limit params.
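
A sketch of that “shrink on failure” loop, assuming a hypothetical fetch_content_chunk wrapper around the backup_content_chunks query (this only illustrates the retry logic, not DSCVR’s actual client code):

struct BackupContentChunk {
    total_length: u64,
    // ... chunk payload ...
}

// Hypothetical: issues the query (e.g. through an agent) and returns Err when
// the response exceeds the message size limit.
fn fetch_content_chunk(_start: u64, _limit: u64) -> Result<Option<BackupContentChunk>, String> {
    unimplemented!()
}

fn download_all_content() -> Vec<BackupContentChunk> {
    let mut chunks = Vec::new();
    let mut start: u64 = 0;
    let mut limit: u64 = 10_000; // start optimistic, shrink when the ~3 MB response limit is hit

    loop {
        match fetch_content_chunk(start, limit) {
            Ok(Some(chunk)) => {
                start += limit;
                let done = start >= chunk.total_length;
                chunks.push(chunk);
                if done {
                    break;
                }
            }
            Ok(None) => break, // walked past the end of the data
            Err(_too_large) => {
                // Response too big: halve the request size and retry the same range.
                limit = (limit / 2).max(1);
            }
        }
    }
    chunks
}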

When performing a restore, we follow a similar method. We used to have a dry run to estimate the size of the restore request, but the LEB128 encoding takes too long for large amounts of data, and it’s easier and much faster to just reduce the amount being restored when a limit is hit (an exception from the canister) and retry with less data.

Full site backup runs every hour right now and is stored in case a state-bomb happens. We have a lot of guards in place to prevent this from happening, but it helps me sleep at night knowing we can only lose about 1 hour of state.

Multi-stage stable store

Needs a lot of design thinking, but possible solutions are outlined below.

Multi-stage stable restore

Post_upgrade has a cycle limit that is actually really big (no idea how big, but bigger than a query or update).

In theory, a multi-stage stable restore works like this:

  • Use guards to block all transactions not from Admins
  • Execute canister upgrade
  • Pre_upgrade: All state that is within the realm of a backup is transformed and written to stable storage
  • Post upgrade restores all stable storage and not content (80% of our state)
    • There is a lot of data transformation that happens here to restore indexes that are not written to stable storage.
  • restore_content_stable(to, from) is called; it pulls data out of stable storage and restores the content within the to/from range (see the sketch below)
  • Allow transactions again

The above only works if the restore_content_stable does not hit the instruction limit when pulling data out of stable storage.
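
A very rough sketch of what such a range-restricted restore could look like (not DSCVR’s actual code; is_admin, read_stable_content and insert_content are hypothetical helpers, and it assumes pre_upgrade wrote the content in independently decodable chunks). The point is that each call does a bounded amount of work and so stays under the per-message instruction limit.

struct Content { /* ... */ }

fn is_admin(_caller: candid::Principal) -> bool { unimplemented!() }
fn read_stable_content(_from: u64, _to: u64) -> Vec<Content> { unimplemented!() }
fn insert_content(_c: Content) { unimplemented!() }

#[ic_cdk::update]
fn restore_content_stable(from: u64, to: u64) {
    // Only admins may call this while regular transactions are paused.
    assert!(is_admin(ic_cdk::caller()), "only admins may restore");

    // Restore just this slice of content and rebuild its derived indexes.
    for content in read_stable_content(from, to) {
        insert_content(content);
    }
}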

Right now, data is not exactly critical on DSCVR and losing some of it is not a big deal, but once we have more critical systems, pausing and unpausing transactions will be extremely important for large backups and restores.

I might be missing some things here, but this is a high-level overview.

One thing I wish I knew is how many instructions my requests are using relative to the limit of my subnet.
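
For what it’s worth, the system API does expose a per-message instruction counter. A minimal sketch, assuming an ic_cdk version that exposes ic_cdk::api::performance_counter (expensive_query is a made-up method):

#[ic_cdk::query]
fn expensive_query() -> u64 {
    // ... the actual work would happen here ...

    // Number of instructions executed so far in this message; compare it
    // against the subnet's per-message limit to see how much headroom is left.
    ic_cdk::api::performance_counter(0)
}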

6 Likes

Thanks for sharing that, great insights! (And a clear sign that the IC platform still has quite some way to go before it matches the vision…)

2 Likes

What is the destination for these backups?
Also, have you considered off-chain backup, where a cloud-hosted, cron-scheduled worker takes incremental backups to a cloud-hosted data store?

Asking because I’m contemplating these exact questions myself

1 Like