DFX Telemetry Proposal

DFX Telemetry Proposal

Summary

We are working on adding telemetry collection to the Internet Computer SDK to improve the platform’s developer experience based on insights derived from behavioral data.

We’d like to create one of the most privacy-friendly telemetry collection pipelines:

  • Data collection is opt-in only.
  • There won’t be required telemetry that you can’t opt-out from.
  • The data will be stored on the IC platform and managed by a canister with open-source code.
  • The information will be thoroughly anonymized.

Please let us know what you think!

Intro

DFX is the default tool for developers to create, build, and manage Internet Computer canisters.

We were hearing community feedback that we need to improve the developer tools. For example, a survey of SuperNova participants flagged the developer experience as the #1 concern for the Internet Computer:

What should we improve exactly?

Taking a step back, how do you decide in what direction you should improve the product?

Approach #1: Heuristic.

The product team would use DFX, notice inconsistencies, and add features that improve the situation.

The good part of this approach is that the team would feel deep empathy for users. If you experience how bad (or good) something works, it’s easy to get motivated to improve it. Also, developers will know exactly how users feel to the tiniest detail instead of general categories like “improve onboarding” or “optimize identity.”

However, the product team usage may differ from the community’s. They may use tools less often than most devoted developers. The product team is cursed by knowing how things work internally and don’t stumble on the issues that confuse beginners.

Approach #2: Collect user feedback.

The product team would ask users to complete a survey or conduct an in-depth interview. This is a traditional way to do product research.

While this method works, there are a lot of limitations:

  • Users don’t like surveys. Participation is usually low. If there’s a reward, some users fill out surveys randomly just to collect the incentive.
  • Users don’t necessarily know what’s wrong. Only a minority of developers can express what exactly needs improvement.
  • This is the most expensive way to collect information. The team needs to hire a UX researcher to create surveys and conduct deep interviews.
  • The feedback cycle is slow. Learning the users’ opinions will take weeks or months if something goes wrong.

Approach #3: Collect behavioral data ← BEST

Collecting direct behavioral data is often the best way to know what’s happening with the product. You see your product usage dynamics, can conclude quickly, and iterate on features.

The data collected by the product is usually called “telemetry.” The product collects telemetry and sends the data to the server for further analysis.

The product team doesn’t know if dfx users experience many errors. We must rely on Twitter and the developer forum to see if users experience SDK failures. When the data collection infrastructure is established, it will be trivial to collect errors to proactively detect spikes and save environment information that will have to detect root causes.

However, data collection can be privacy-invasive. Let’s take Visual Studio Code as an example:

  • Opt-In vs. Opt-out. The IDE collects telemetry by default without asking for explicit user consent (maybe somewhere in the ToS, but who reads those anyways). If you have never thought about telemetry, the chances are you will never know that you can opt-out.
  • Required telemetry. Furthermore, only the “optional” analytics can be opted out from. The “required” analytics will always be sent to Microsoft unless you block it with a personal firewall.
  • Hidden processing. You don’t know how the telemetry data is processed on the server and whom it’s shared with. There’s only vague language in the privacy policy.
  • Bad anonymization. Anonymization could be inadequate. For example, most IP addresses can pinpoint a particular household (because of the residential ISP) and later correlate with other activities to reconstruct the real identity of the developer.

Solution

Is there a way to keep most of the benefits of telemetry but make data sharing consensual, voluntary, and maximum private?

DFINITY team is working on the telemetry collection system that solves most concerns:

  • Opt-in only. Telemetry will be opt-in only. Users will go through a consent prompt before any data hits the server.
  • All telemetry is optional. Developers are not required to share ANY data to use the IC.
  • Transparent processing. The analytics platform will process the data in an IC canister with open-source code. Users can see exactly how data is stored, anonymized, and aggregated.
  • Good anonymization. The analytics platform will drop IP addresses and other unique identifiers from the dataset. Since the data collection and processing code is open source, developers can check the anonymization logic by themselves.
  • Purposefulness. Inform the community on how data-driven decisions are taken (e.g., we could write we added feature X after analyzing telemetry data over the past Y days in the release notes).

Rollout and timing

We plan to start working on the feature in the coming weeks. Please look for updates on the architecture and timing.

5 Likes

I appreciate this proposal, but I can imagine many in the community might be hesitant to allow their data to be collected unless they know the specifics around which metadata will be collected with each call. Can you provide more specifics around this data as the data collection requirements are finalized?

A few additional questions:

  1. What data is “off-limits” from being collected?
  2. What steps are you putting forth to assure developers that nothing more than usage data is collected?
  3. If additional data is collected, will developers be notified as the tool evolves?
  4. How can developers ensure the security of any seedphrase(s) that they generate and use locally?

Would be a :cherries: on top if we can verify the canister’s wasm hash and controllers.

Can you provide more specifics around this data as the data collection requirements are finalized?

Sure. We are working on the design and will share more information when it’s ready.

  1. What data is “off-limits” from being collected?
  • Generally, we won’t collect PII (personally identifiable data) such as IP addresses, user names, and principal ids.
  • We should not collect information about which application was developed and deployed. So, for example, we won’t collect particular canister IDs.
  1. What steps are you putting forth to assure developers that nothing more than usage data is collected?

DFX is open source, and the telemetry code will be a part of the repo for anyone to check.

  1. If additional data is collected, will developers be notified as the tool evolves?

Trivial changes will not require a separate opt-in. So, for example, if we have consent to collect information on which dfx subcommand is used, it will be collected for newly added dfx subcommands.

Significant changes will require additional opt-in.

  1. How can developers ensure the security of any seedphrase(s) that they generate and use locally?

Do you mean how developers can be sure that we don’t send the seed phrase to the server by mistake?

Would be a :cherries: on top if we can verify the canister’s wasm hash and controllers.

Canister WASM and controller history will become standard for all canisters on the IC as part of the Canister Explorer project that I presented on Global R&D recently.

Yes, the seed phrase or any personally identifiable information. (Why are you using the term “generally” with respect to PII, instead of say “we will never”, or “will not” - absolutes?).

I think maintaining a publicly available list of the types of information collected in a forum thread during the proposal period, and then the sdk repository would be a good starting place.

At least then developers can agree/disagree whether the metadata collected is appropriate, and whether they want to opt-in or not. Developers can verify the open source code afterwards.

Looking forward to this.

1 Like

What is the analytics platform? Do the words “analytics platform” just refer to the canister?

Seeing as many of the dfx calls currently go through the ic-api which is not yet open source, will the canister be the only way that the telemetry data is being collected, or will you additionally be using the ic-api as a source for collecting and aggregating this data?

Edit: After clarifying with @Dylan, it does not appear that dfx hits the ic-api in any way.

I think there may be a misunderstanding here, since dfx does not use the ic-api. This API was created for the IC dashboard and exists outside of the Internet Computer. The API is used by a few non-DFINITY projects, such as the proposals Telegram bot, but it’s not used by any developer tools as far as I know.

1 Like

Hi,
I will not opt-in for your proposal. The speech is exactly the same as Google and Co.
Personally the only acceptable approach is #2. Moreover, one aspect is missing: the sharing of raw data in opendata.

1 Like

You don’t have to. All analytics is optional.

You have a keen eye :slight_smile:
We are discussing if the data should be shared as well. I think sharing the data openly is a good idea but I’d like to think more about potential risks.

We will not collect PII that can identify user directly. We will need some mechanism to collate sessions from the same user (otherwise we won’t be able to tell correlation between events). However, identifier needs to be short-lived to prevent profiling.

Good idea!

1 Like