funnAI: first Proof‑of‑AI‑Work on ICP

:backhand_index_pointing_right: Introduction Video on YouTube

Say hello to funnAI — the first fully on‑chain, incentivized AI competition built on the Internet Computer Protocol, by onicai.

:gear: Current Status: Mainnet Stress Test

  1. Launchable funnAI App

    We have deployed a private version of the launchable funnAI app to ICP mainnet.

  2. Scaling on Mainnet

    To ensure funnAI can support hundreds of concurrent AI agents (“mAIners”) without impacting the broader ICP ecosystem, we will run full-scale system tests on mainnet:

    • Objective: Determine the maximum sustainable number of mAIner canisters
    • Metrics: CPU usage, memory usage, and cycle burn rates (cost)

    Gating policy: We will gradually onboard mAIners to ensure smooth operations

:magnifying_glass_tilted_left: Discussion & Call for Input

We’d love to kick off a conversation with the DFINITY team and the broader ICP community around the funnAI application:

  • As we collect metrics, we will post them in this thread.
  • If your application is impacted by funnAI tests running on your subnet, please let us know in this thread.

Your feedback will help shape funnAI’s next phases and ensure a smooth, community‑friendly launch.

:books: Resources


Looking forward to your thoughts! Let’s make on‑chain AI work for everyone. :rocket:

7 Likes

@berestovskyy, @dsarlis, @free, @Manu, @ielashi

I am tagging you to make you aware of this effort, and to request your feedback on the architectural decisions we still have to make based on the mainnet testing results.

If you have any feedback or thoughts right now, or if there is any data you recommend we collect during the stress tests, please let us know.

The backend canister architecture of funnAI is summarized by this diagram:

Some important details:

  • The mAIner agent Creator spins up the mAIner agent & Large Language Model (LLM) canisters, installs the code, and then, for the LLM canister, uploads a 644 MB model file (the LLM with all its parameters). The upload of that file is done with inter-canister update calls, using chunks of 1.9 MB (see the sketch after this list).

  • A mAIner agent canister generates responses from the LLM using a sequence of inter-canister update calls. Each update call is able to generate 13 tokens before hitting the instruction limit.

  • The speed of token generation is important but not critical. What matters most is that every request gets handled and produces a response within a reasonable amount of time.

  • Based on earlier work with ICGPT & Charles, and a session with the Scalability & Performance Working Group last year (here & here), we concluded that 4 LLMs per subnet is optimal. We assume this is still the case.
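
For scale: 644 MB uploaded in 1.9 MB chunks works out to roughly 340 sequential update calls per LLM canister. Below is a minimal Rust sketch of that chunked-upload pattern; the method name `upload_model_chunk` and the use of the classic `ic_cdk::call` API are illustrative assumptions, not funnAI's actual interface.

```rust
// Sketch: upload a large model file to an LLM canister in ~1.9 MB chunks,
// one inter-canister update call per chunk (staying below the ~2 MB message
// size limit). `upload_model_chunk` is a hypothetical method name.
use candid::Principal;

const CHUNK_SIZE: usize = 1_900_000; // ~1.9 MB per update call

async fn upload_model(llm: Principal, model: &[u8]) -> Result<(), String> {
    for (i, chunk) in model.chunks(CHUNK_SIZE).enumerate() {
        // For a 644 MB model this loop performs roughly 340 calls.
        let _: () = ic_cdk::call(llm, "upload_model_chunk", (i as u64, chunk.to_vec()))
            .await
            .map_err(|(code, msg)| format!("chunk {i} failed: {code:?} {msg}"))?;
    }
    Ok(())
}
```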

The architectural decisions we need to make are:

  1. How many mAIners can be created at the same time without creating bottlenecks on the network?

  2. Is it OK to have a single mAIner agent Creator, or should we have one for each subnet?

  3. How should we configure the mAIner service to support N mAIner agents?

    We plan to go with one mAIner service and 16 LLMs across 4 subnets. We think this is sufficient, but we need to verify the concurrent load it can support.

  4. How many mAIner Agents of type Own (i.e. having their own dedicated LLM canister attached) can run concurrently on one subnet?

    We will initially limit each mAIner of type Own to just one LLM. It is not feasible to limit the number of mAIners of type Own to just 4 per subnet, so what is the maximum number of LLMs that can run concurrently while still guaranteeing that a result is produced?

Some of our main concerns:

  • Our biggest concern is running into timeouts or other bottlenecks that leave a mAIner agent unable to generate a response. We plan to queue requests (see the sketch after this list), based on this finding from April 2024:

    From here: “I also tried to go to 16 concurrent users but that resulted in timeouts and unsuccessful update calls”

  • Beyond timeouts, what other bottlenecks do you think are possible that we should anticipate?

  • We also want to ensure that our ICP neighbors (other apps running on the same subnets) aren’t drastically affected by our workloads; which measurements would you recommend we take to check for this?
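
To make the queueing idea concrete, here is a minimal Rust sketch of a per-subnet concurrency cap with a FIFO queue: prompts above the in-flight limit wait instead of firing off immediately and risking timeouts. The cap of 4, the names, and the dispatch-from-a-timer mechanism are illustrative assumptions, not the actual funnAI design.

```rust
// Sketch of capping concurrent LLM generations and queueing the overflow.
// All names and the cap value are illustrative.
use std::cell::RefCell;
use std::collections::VecDeque;

const MAX_IN_FLIGHT: usize = 4; // assumed cap, matching "4 LLMs per subnet"

thread_local! {
    static QUEUE: RefCell<VecDeque<String>> = RefCell::new(VecDeque::new());
    static IN_FLIGHT: RefCell<usize> = RefCell::new(0);
}

// Callers submit prompts; they are parked in a FIFO queue rather than
// triggering an LLM call immediately. Returns the queue position.
#[ic_cdk::update]
fn enqueue_prompt(prompt: String) -> u64 {
    QUEUE.with(|q| {
        let mut q = q.borrow_mut();
        q.push_back(prompt);
        q.len() as u64
    })
}

// Called periodically (e.g. from an ic-cdk timer): hand out the next prompt
// only while the number of concurrent LLM calls stays under the cap.
fn try_dispatch() -> Option<String> {
    IN_FLIGHT.with(|n| {
        if *n.borrow() >= MAX_IN_FLIGHT {
            return None;
        }
        let next = QUEUE.with(|q| q.borrow_mut().pop_front());
        if next.is_some() {
            *n.borrow_mut() += 1; // decrement again when the LLM call completes
        }
        next
    })
}
```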

1 Like

That’s a lot to digest all at once, but at a very high level what I understand is that you intend to have a number of canisters per subnet that are expected to be constantly running deterministic time slicing (DTS) executions (in addition to other canisters that are lighter weight in terms of CPU usage, which I’m going to ignore).

Now, as you observe, a subnet has 4 execution threads / virtual CPU cores. It will not schedule more than 3 DTS executions at the same time, unless there is nothing else to execute on the subnet (and except for the first round of a DTS execution, where it doesn’t yet know it is going to be a DTS execution).

The scheduler also tries to balance DTS and non-DTS executions, by looking at how many canisters (and with how much compute allocation) are in the middle of DTS executions vs. about to start new message / task executions. So e.g. a mix of 10 DTS to 30 “new execution” canisters will result in 1 CPU core allocated for DTS executions and 3 for “new executions”. Canisters with DTS executions will get appended round-robin to the queues of the various execution threads, including to the nominally non-DTS ones, behind the “new” executions, so assuming there isn’t much CPU load from other canisters, you might end up using most of the 4 threads / cores anyway.

Anyway, leaving all of the nitty-gritty aside, having 3(+1) canisters pegging 3(+1) of a subnet’s CPU cores is potentially not great for other canisters on the subnet, who may or may not find themselves “squeezed out” and experiencing high latencies (e.g. getting scheduled every 2-3 rounds instead of every round). And since you’re asking about how to ensure you are being good neighbors, I can think of 2 possible approaches:

  1. Deploy a “canary” canister next to your DTS canisters, only running a heartbeat. Count how many times that heartbeat gets executed (i.e. how many times a “random” canister on the subnet gets scheduled). If it drifts away from 2.5x per second, the subnet is slow and/or your canary didn’t get scheduled every round. Could be due to your other canisters’ DTS executions, could be due to something else. Still, it means you’re not going to get great latency from that subnet. And neither is everyone else.

  2. You can count the instructions executed by, and measure the duration of, your DTS executions (from the caller, as time does not change during a DTS execution). As long as the executed instructions per second approach the instruction limit (2B?) times the expected subnet block rate (2.5/s), the subnet is doing fine, scheduling canisters with work to do (virtually) every round. (You are likely to be some way away from that target, as even on an otherwise idle subnet, simply executing 2B instructions every round is likely to take longer than 400 ms and reduce the block rate. You’ll probably need to do some benchmarking first. Or constantly.) The farther your per-canister throughput drifts from this target, the more loaded the subnet is, and you should likely reduce your load on that particular subnet, directing it elsewhere instead. Looking at it even more simply: the higher the latency you experience, the higher the load on that subnet.
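
A minimal Rust sketch of the canary idea in approach 1, assuming nothing more than a counter behind a heartbeat (names are illustrative):

```rust
// "Canary" canister sketch: counts how often its heartbeat runs, i.e. how
// often a random, lightweight canister on the subnet gets scheduled.
use std::cell::Cell;

thread_local! {
    static HEARTBEATS: Cell<u64> = Cell::new(0);
    static START_NS: Cell<u64> = Cell::new(0);
}

#[ic_cdk::init]
fn init() {
    START_NS.with(|t| t.set(ic_cdk::api::time()));
}

#[ic_cdk::heartbeat]
fn heartbeat() {
    // One increment per round in which this canister was scheduled.
    HEARTBEATS.with(|c| c.set(c.get() + 1));
}

#[ic_cdk::query]
fn executions_per_second() -> f64 {
    let beats = HEARTBEATS.with(|c| c.get()) as f64;
    let elapsed_s = (ic_cdk::api::time() - START_NS.with(|t| t.get())) as f64 / 1e9;
    // A healthy, lightly loaded subnet should report roughly 2.5 here;
    // values drifting well below that suggest the canary (and everyone
    // else) is not being scheduled every round.
    if elapsed_s > 0.0 { beats / elapsed_s } else { 0.0 }
}
```

As a back-of-the-envelope reference for approach 2, the target works out to roughly 2B instructions per round times 2.5 rounds/s, i.e. about 5B executed instructions per second for a continuously running DTS canister, subject to the reduced-block-rate caveat above.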

4 Likes

We’re running quite a large test at the moment, and noticed that one of the subnets where we run 3 LLMs suddenly dropped its cycle burn rate from 2 B/s to 0.

https://dashboard.internetcomputer.org/network/subnets/io67a-2jmkw-zup3h-snbwi-g6a5n-rm5dn-b6png-lvdpl-nqnto-yih6l-gqe

There appears to be a degraded node in the subnet. Could that be the cause?

That is an issue with the underlying metrics. We have a consumed_cycles_by_use_cases canister metric that underlies all of this (we simply sum all canisters’ consumed cycles), but unfortunately it measures “burned plus reserved” cycles, not only actually burned cycles. So when your canister makes a call and reserves 50B cycles for a maximum of 50B instructions potentially needed to execute the response, this amount increases by 50B, but then drops by almost 50B when the response is actually processed, requiring only a few million instructions.

In order to compute a cycle burn rate from this metric (one that doesn’t go negative, raising even more questions), we first generate a “high water mark” of this metric: a step function that retains the highest value ever seen. The problem with that is that whenever there’s a spike in activity (particularly lots of concurrent canister calls) the underlying metric spikes temporarily, but the “high water mark” gets stuck at that high value for as long as it takes the subnet to actually burn through that many cycles.

E.g. if something on the subnet makes 1k concurrent calls, they would reserve 1k * 50B = 50T cycles and then be refunded most of them. A subnet that doesn’t have a high cycle burn rate would then take many hours to actually burn through 50T cycles. And during that time its cycle burn rate would be reported as zero.
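
To make the artifact concrete, here is a toy Rust sketch (made-up numbers in billions of cycles, not the dashboard's actual pipeline) of how the high water mark latches onto a reservation spike and the derived burn rate then reads as zero:

```rust
// Toy illustration of the "high water mark" artifact described above.
fn main() {
    // Cumulative "burned plus reserved" cycles (in billions), sampled once
    // per interval. The second sample includes ~50T cycles of temporary
    // reservations that are refunded shortly afterwards.
    let consumed = [10u64, 50_010, 12, 14, 16, 18];

    // Step function: retain the highest value ever seen.
    let mut hwm = 0u64;
    let high_water: Vec<u64> = consumed
        .iter()
        .map(|&c| {
            hwm = hwm.max(c);
            hwm
        })
        .collect();

    // Burn rate derived from the high water mark: one huge spike, then zero
    // until the subnet genuinely burns past the 50_010 mark.
    let per_interval_burn: Vec<u64> = high_water.windows(2).map(|w| w[1] - w[0]).collect();

    println!("high water mark:   {:?}", high_water); // [10, 50010, 50010, 50010, 50010, 50010]
    println!("derived burn rate: {:?}", per_interval_burn); // [50000, 0, 0, 0, 0]
}
```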

The good news is that the Execution team will soon start working on an actual burned cycles metric, so we should be able to produce less spiky and more responsive cycle burn rates.

2 Likes