Idea Discussion:: On-Chain AI Red-Teaming Sandbox for DeAI Models

Hi ICP Community!

I was wondering that, even though the IC enables AI inference to run directly inside tamper-evident canisters, I felt that it still lacks a controlled execution environment where developers can run harmful or adversarial prompts without risking system abuse, data leaks, or accidental unsafe actions. Existing tools run locally or on cloud VMs, with no isolation, no auditability, and no tamper resistance. Red-teaming is mostly performed manually using ad-hoc scripts, with no reproducible logs and no way to compare how models behave across versions.

This leads to difficulty in testing agents’ behaviours against prompt injection, model manipulation, resource abuse, and unsafe decision-making while ensuring the results are deterministic, transparent, and auditable. Consequently, prevents teams from confidently deploying AI agents on ICP.

As more AI models run directly inside canisters, it becomes important to have a systematic way to test them for negative scenarios like:

  • prompt injection
  • adversarial perturbations
  • backdoors or hidden triggers
  • model poisoning
  • nondeterministic drift over time
  • resource-exhaustion or DoS

Addressing this situation will benefit in having an open , standardized red-testing sandbox that could help developers validate their models before deploying publicly.

I’d love to hear thoughts from the community about whether such tools would be useful or any existing tool is already present in which above is being addressed and aligned with.
@Jan @Kyle_Langham @hansl @timo @Manu @alexu @dieter.sommer @lastmjs