Canister Logging Support [Community Consideration]

Background

  • When a canister traps, changes it makes to the state are completely reverted. As a result, it’s not possible for developers to detect these errors.
  • The IC provides a way to log messages throughout execution, the debug_print API, but it’s only available when running the canister locally.

Problem

There is no reliable way for canister developers to know that their canister is trapping.

  • Traps leave no trace in the replica that’s visible for developers.
  • debug_print is available locally, but not on mainnet.
  • Heartbeat and timer traps are completely silent.
  • Logging solutions built by canister developers can only capture logs on the happy path (i.e. when the canister doesn’t trap).

The only way for developers to learn of errors today is to rely on user reports: a user encounters an error and then has to report it before the developers even become aware of it.

Proposal

  • An endpoint is added to the management canister to fetch a canister’s logs. The IC will maintain the last few KBs of logs + trap messages and return them in the response.
// New API on the management canister
fetch_logs : (record { canister_id }) -> (fetch_logs_response) query;

type log_record = record {
  idx: nat;
  timestamp: nat;
  contents: blob;
};

type fetch_logs_response = record {
  log_records: vec log_record;
};
  • Every log record returned has an idx, which is an auto-incrementing ID for each log record. This is useful for merging logs between different requests and for detecting missing logs.

  • A new setting will be added in the canister’s settings for configuring the log’s visibility. Logs are visible only to controllers by default, but can also be configured to be available publicly.

// An enum for setting logs' visibility.
type log_visibility = variant {
  public;
  controllers;
};
  • Logs persist across upgrades.
  • Logs are deleted if the canister is reinstalled or uninstalled.
  • Logs include:
    • Messages in debug_print
    • Trap messages
    • (In the future) Stack traces
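
To make the proposed interface more concrete, here is a hedged Rust sketch of what consuming it could look like on the developer side (assuming the candid and serde crates). The types mirror the Candid above, and the merge helper shows how idx can be used to deduplicate records between fetches and to detect gaps; nothing here is a released API, and u64 is used in place of nat purely for simplicity.

// Hypothetical Rust types mirroring the proposed Candid interface above
// (not a released API).
use candid::{CandidType, Decode};
use serde::Deserialize;

#[derive(CandidType, Deserialize, Clone, Debug)]
struct LogRecord {
    idx: u64,          // auto-incrementing ID per record
    timestamp: u64,    // assumed: nanoseconds since the UNIX epoch
    contents: Vec<u8>, // debug_print output or trap message bytes
}

#[derive(CandidType, Deserialize, Debug)]
struct FetchLogsResponse {
    log_records: Vec<LogRecord>,
}

// Decode a raw Candid-encoded fetch_logs response (e.g. obtained by
// querying the management canister with an agent) into typed records.
fn decode_logs(raw: &[u8]) -> Result<Vec<LogRecord>, candid::Error> {
    Ok(Decode!(raw, FetchLogsResponse)?.log_records)
}

// Merge a newly fetched batch into already-seen records, using `idx`
// to skip duplicates and to notice gaps (records evicted from the
// bounded log buffer before they could be fetched).
fn merge_logs(seen: &mut Vec<LogRecord>, batch: Vec<LogRecord>) {
    for record in batch {
        match seen.last().map(|r| r.idx) {
            Some(last) if record.idx <= last => continue, // duplicate
            Some(last) if record.idx > last + 1 => {
                eprintln!("missing log records {}..{}", last + 1, record.idx);
                seen.push(record);
            }
            _ => seen.push(record),
        }
    }
}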

SDK Changes

A new command to tail a canister’s log can be added to the SDK to facilitate retrieving the latest logs.

dfx canister tail <canister_id>
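
Under the hood, such a command could be a simple polling loop over the proposed fetch_logs endpoint. The following is a hedged sketch only: the tail command does not exist yet, LogRecord is the type from the sketch in the Proposal section, and query_logs is a stub standing in for an actual query made via an agent.

// Sketch of a tail-style loop: poll the (proposed) fetch_logs endpoint,
// print records that haven't been shown yet, and sleep between polls.
use std::{thread, time::Duration};

// Placeholder: a real tool would perform the fetch_logs query here.
fn query_logs(_canister_id: &str) -> Vec<LogRecord> {
    Vec::new()
}

fn tail(canister_id: &str, poll_interval: Duration) {
    let mut last_seen_idx: Option<u64> = None;
    loop {
        for record in query_logs(canister_id) {
            if Some(record.idx) <= last_seen_idx {
                continue; // already printed in a previous poll
            }
            println!(
                "[{} #{}] {}",
                record.timestamp,
                record.idx,
                String::from_utf8_lossy(&record.contents)
            );
            last_seen_idx = Some(record.idx);
        }
        thread::sleep(poll_interval);
    }
}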

Use-Cases

  • Integration with 3rd party logging services
    • Logs can be pulled into logging services like Kibana, Datadog, etc. for developers to store, index, set up alerts on, and more.
  • Live debugging sessions
    • dfx canister tail can print the canister’s logs in real time.

Feedback Request

Let us know your thoughts! This proposal is meant to be a very simple yet functional way for canisters to access their logs and trap messages on mainnet. It isn’t meant to be a fully satisfying long-term solution, but rather a simple one that can be delivered quickly. Specifically, the following are non-goals at this point:

  • Supporting large amounts of logs.
  • Supporting searching (by timestamp, method name, etc.)
  • Supporting log levels

In the future this solution can be expanded to address the above shortcomings.


Does this mean that trap logs will also be available locally when developing? I’ve had issues using timers and not seeing any errors.

That’s correct, the same functionality will be available locally as well.


I’d really like to be able to access the logs, especially for traps! Can’t wait for this feature to be available!


Agree that this is an issue, but I’d like something a bit more than just logging. It would be good if you could run a function on traps. For example: notify canister X of the error, or re-attempt/restart a timer to try again.

  • For future-proofing purposes, it could be beneficial to consider adding a log level parameter, such as info|warn|error, to the interface.

  • Including from|to filters in the fetch_logs interface, based on timestamps, would be a useful feature to consider, as it can provide more precise log retrieval options and I guess wouldn’t add too much implementation work.

  • Having two timestamps, one for the time of the error and another for when the log entry was collected, could maybe provide valuable context during the debugging process?

Absolutely, I agree with you. In some limited capacity, the IC does support executing a function if the canister traps, but only when making an inter-canister request. There, you can specify a cleanup function that is called if your canister traps when handling the response. This is supported by the protocol (see the call_on_cleanup System API in the interface specification), but to my knowledge neither Motoko nor the Rust CDK actually expose that feature for developers.
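
For reference, the raw shape of that part of the System API, as declared in the interface specification (wasm32 signatures), looks roughly like this when written as Rust imports. CDKs normally wrap call_new/call_perform for you; this sketch only illustrates where the cleanup hook sits, and is not something application code would usually declare directly.

#[link(wasm_import_module = "ic0")]
extern "C" {
    // Starts assembling an inter-canister call, registering the reply and
    // reject callbacks (a function table index plus an environment value each).
    fn call_new(
        callee_src: i32, callee_size: i32,
        name_src: i32, name_size: i32,
        reply_fun: i32, reply_env: i32,
        reject_fun: i32, reject_env: i32,
    );
    // Registers a cleanup callback that the system invokes if the reply or
    // reject callback traps, so resources can still be released.
    fn call_on_cleanup(fun: i32, env: i32);
    // Enqueues the assembled call; returns a non-zero error code on failure.
    fn call_perform() -> i32;
}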

We have had discussions about generalizing the call_on_cleanup System API to allow developers to specify a cleanup handler across all replicated execution.

cc @claudio


Thanks for the feedback! We did discuss a lot of these suggestions, but see my note at the very bottom of the initial post:

All these features can easily be introduced in future iterations, and given that logging support is non-existent today, we wanted to start with the simplest possible solution, release it quickly, and iterate from there. We’ll of course take that feedback into account in the next iterations 🙂


Really good proposal. And including both the log ID and the timestamp is great!

I know we are trying to build an MVP logging solution, but I think it’s a good idea to have OpenTelemetry support in mind from the get-go.

Even if we are not 100% spec-compliant from the get-go, it’s a good idea to reserve the necessary keywords, conventions, aspects, etc. as this gets built out.

This will lead to us not having to reinvent the wheel on common observability conventions.

I believe this should be a lesson learned from the IC System API: it solves the same problems that WASI does, but it reinvented the wheel during implementation, and now we have to figure out how to wrap the System API to be WASI-compatible. The same story holds for Candid and WIT.

Thoughts?


Hey @saikatdas0790, thanks for your feedback. I’m not familiar with OpenTelemetry myself. Can you share what, concretely, you would change in the current proposal in order to add OpenTelemetry support?


To be honest, I don’t have any experience designing a logging destination. I have mostly been on the publishing end of logs.

I can take a stab at it, however. I imagine we just want to be compliant with this part of the spec:

Here are some example models that show how to take the internal data representation of a platform and translate that to the OpenTelemetry model:

Now, on the log viewer side of things, we have some assurances about the kind and shape of data that’s expected to come in, and we can have nicely structured messages in both TUIs and GUIs.

The ultimate benefit of doing this is that you become automatically compatible with a host of vendors that support OTel, and you don’t have to painstakingly build the additional tooling required to consume the logs/metrics/traces generated as we build this system out.
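
As a rough illustration, a mapping could look something like the following sketch. Field names loosely follow the OTLP/JSON encoding, the attribute keys and the default severity are placeholders rather than established conventions, and LogRecord is the hypothetical type sketched earlier in this thread (serde_json assumed).

use serde_json::{json, Value};

// Indicative mapping of one proposed log record onto an OpenTelemetry-style
// log record; not a spec-compliant exporter.
fn to_otel_log_record(canister_id: &str, r: &LogRecord) -> Value {
    json!({
        // The proposal has a single timestamp; an "observed" timestamp could
        // be filled in by the off-chain collector when it pulls the log.
        "timeUnixNano": r.timestamp,
        "observedTimeUnixNano": r.timestamp,
        // No log levels exist in the proposal yet, so default to INFO.
        "severityText": "INFO",
        "severityNumber": 9,
        "body": { "stringValue": String::from_utf8_lossy(&r.contents) },
        "attributes": [
            { "key": "ic.canister_id", "value": { "stringValue": canister_id } },
            { "key": "ic.log_index",   "value": { "intValue": r.idx } }
        ]
    })
}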

Here’s an indicative list of vendor support:

Imagine just being able to point your Grafana deployment at the relevant endpoint and having pretty-printed logs that you can query with all the power of that tooling. Or pretty graphs once we do OTel-compliant metrics monitoring. And so on…

Thoughts?


@ielashi will this work when the canister is built with the ic-wasm shrink option?

Is there any info on which dfx versions support this, and an example of using the canister logging feature from Rust code?

btwl@btwl-virtual-machine ~/c/R/e/hello (master) [2]> dfx --version
dfx 0.15.3
btwl@btwl-virtual-machine ~/c/R/e/hello (master)> dfx canister tail v7g7o-oiaaa-aaaag-qcj3q-cai
error: unrecognized subcommand 'tail'

Usage: dfx canister [OPTIONS] <COMMAND>

For more information, try '--help'.

The commands in the OP were suggestions for how it could look in the future. There’s nothing released yet.


Do you have a rough release date for a beta we can try? Also, when you say the last few KBs, are we talking about the last 5 minutes of logs, or a fixed amount of logs regardless of the time elapsed?

Is there a difference in cycle cost between pulling the last chunk of logs and tailing?

Thanks


I forwarded the question about an ETA.

A fixed amount of bytes. How much time that covers depends on the volume of logs.

dfx canister tail would simply poll the logs regularly. It would be equivalent to querying the logs on a timer.


Regarding an ETA, I think we’re talking on the order of short weeks to have something available for people to try.


Hey @ielashi @claudio
Is there more information available on this call_on_cleanup feature, or any idea when it will be available in Motoko? I have a use case for it that I’d love to implement.

No progress on the Motoko side, I’m afraid. The compiler is using it internally to free the resource(s) associated with the unused callback(s), but you can’t hook into that yourself at the moment.

What is your use case?