funnAI: first Proof‑of‑AI‑Work on ICP

:backhand_index_pointing_right: Introduction Video on YouTube

Say hello to funnAI — the first fully on‑chain, incentivized AI competition built on the Internet Computer Protocol, by onicai.

:gear: Current Status: Main‑Net Stress Test

  1. Launchable funnAI App

    We have deployed a private version of the launchable funnAI app to ICP Main-Net.

  2. Scaling on Main‑net

    To ensure funnAI can support hundreds of concurrent AI agents (“mAIners”) without impacting the broader ICP ecosystem, we’re going to run full‑scale system tests on main‑net:

    • Objective: Determine the maximum sustainable number of mAIner canisters
    • Metrics: We will collect CPU load, memory consumption, and cycle burn rate (cost) per canister (a sketch of how we pull the cycle and memory numbers follows below)

    Gating policy: We will gradually on‑board mAIners to ensure smooth operations
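
A minimal sketch of how we plan to pull the per-canister cycle and memory numbers, assuming ic-cdk's management canister bindings (`snapshot_metrics` is a hypothetical helper; the burn rate is derived off-chain by diffing successive snapshots, and CPU load is inferred separately from instruction counts and latency):

```rust
use candid::{Nat, Principal};
use ic_cdk::api::management_canister::main::{canister_status, CanisterIdRecord};

/// Hypothetical helper: snapshot the cycle balance and memory size of one
/// mAIner canister. Only a controller of the canister can call canister_status.
async fn snapshot_metrics(canister_id: Principal) -> Result<(Nat, Nat), String> {
    let (status,) = canister_status(CanisterIdRecord { canister_id })
        .await
        .map_err(|(code, msg)| format!("canister_status failed: {:?} {}", code, msg))?;
    // `cycles` is the current balance; `memory_size` is total memory in bytes.
    // The burn rate (cost) comes from diffing two snapshots taken some time apart.
    Ok((status.cycles, status.memory_size))
}
```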

:magnifying_glass_tilted_left: Discussion & Call for Input

We’d love to kick off a conversation with the DFINITY team and the broader ICP community around the funnAI application:

  • As we collect metrics, we will post them in this thread.
  • If your application is impacted by funnAI tests running on your subnet, please let us know here.

Your feedback will help shape funnAI’s next phases and ensure a smooth, community‑friendly launch.

:books: Resources


Looking forward to your thoughts! Let’s make on‑chain AI work for everyone. :rocket:

8 Likes

@berestovskyy, @dsarlis, @free, @Manu, @ielashi

I am tagging you to make you aware of this effort and to request your feedback on the architectural decisions we still have to make based on the main-net testing results.

If you have any feedback or thoughts right now, or if there is any data you recommend we collect during the stress-tests, please let us know.

The backend canister architecture of funnAI is summarized by this diagram:

Some important details:

  • The mAIner agent Creator spins up the mAIner agent and Large Language Model (LLM) canisters, installs the code, and then uploads a 644 MB model file (the LLM with all of its parameters) to the LLM canister. The upload is done with inter-canister update calls, in chunks of about 1.9 MB each, to stay below the ~2 MiB message payload limit (see the upload sketch after this list).

  • A mAIner agent canister generates responses from the LLM using a sequence of inter-canister update calls. Each update call can generate 13 tokens before hitting the instruction limit (also sketched below).

  • The speed of token generation is important but not critical. Most important is that every request gets handled and produces a response within a reasonable amount of time.

  • Based on earlier work with ICGPT & Charles, we found that 4 LLMs per subnet is optimal.

    We had a session with the Scalability & Performance Working Group last year (here & here) that led to this conclusion, and we assume it still holds.
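
To make the first two bullet points concrete, here is a minimal sketch of the chunked upload, assuming a hypothetical `upload_model_chunk` endpoint on the LLM canister (the real interface differs, and the real Creator streams chunks rather than holding the whole 644 MB file in a `Vec`):

```rust
use candid::Principal;

// Stay below the ~2 MiB inter-canister message payload limit.
const CHUNK_SIZE: usize = 1_900_000;

/// Sketch: the Creator pushes the model file to a freshly installed LLM
/// canister, one inter-canister update call per ~1.9 MB chunk.
async fn upload_model(llm: Principal, model: &[u8]) -> Result<(), String> {
    for (i, chunk) in model.chunks(CHUNK_SIZE).enumerate() {
        let offset = (i * CHUNK_SIZE) as u64;
        let _: () = ic_cdk::call(llm, "upload_model_chunk", (offset, chunk.to_vec()))
            .await
            .map_err(|(code, msg)| format!("chunk {} failed: {:?} {}", i, code, msg))?;
    }
    Ok(())
}
```

And a sketch of how a mAIner agent drives token generation, repeatedly calling a hypothetical `generate_tokens` endpoint until the LLM reports completion (roughly 13 tokens per update call):

```rust
use candid::{CandidType, Deserialize, Principal};

#[derive(CandidType, Deserialize)]
struct GenerateResult {
    new_text: String, // the ~13 tokens produced by this update call
    done: bool,       // true once the response is complete
}

/// Sketch: keep extending the response until the LLM signals completion.
async fn generate_response(llm: Principal, prompt: String) -> Result<String, String> {
    let mut output = String::new();
    loop {
        let (step,): (GenerateResult,) =
            ic_cdk::call(llm, "generate_tokens", (prompt.clone(), output.clone()))
                .await
                .map_err(|(code, msg)| format!("generate_tokens failed: {:?} {}", code, msg))?;
        output.push_str(&step.new_text);
        if step.done {
            return Ok(output);
        }
    }
}
```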

The architectural decisions we need to make are:

  1. How many mAIners can be created at the same time without bottlenecking the network?

  2. Is it OK to have a single mAIner agent Creator, or should we have one per subnet?

  3. How should we configure the mAIner service to support N mAIner agents?

    We plan to go with one mAIner service and 16 LLMs across 4 subnets. We think this is sufficient, but we need to verify the concurrent load it can support.

  4. How many mAIner Agents of type Own (i.e. having their own dedicated LLM canister attached) can run concurrently on one subnet?

    We will initially limit each mAIner of type Own to just one LLM. It is not feasible to limit the number of mAIners of type Own to just 4 per subnet. But what is the maximum number of LLMs that can be running concurrently and still guarantee that a result is produced?

Some of our main concerns:

  • Our biggest concern is running into time-outs or other bottlenecks that leave a mAIner agent unable to generate a response. We will queue requests based on this finding from April 2024 (see the gating sketch after this list):

    From here: “I also tried to go to 16 concurrent users but that resulted in timeouts and unsuccessful update calls”

  • Beyond time-outs, what other bottlenecks do you think are possible that we should anticipate?

  • We also want to ensure that our ICP neighbors (other apps running on the same subnets) aren’t drastically affected by our workloads. Which measurements would you recommend we take to check for this?
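
For the queueing mentioned in the first bullet, a minimal sketch of the kind of gate we have in mind in front of the LLM calls (the names and the limit are placeholders, not the actual funnAI code):

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;

// Placeholder cap, informed by the April 2024 observation that ~16 concurrent
// users already produced timeouts; the real value comes out of the stress tests.
const MAX_IN_FLIGHT: usize = 4;

thread_local! {
    static IN_FLIGHT: Cell<usize> = Cell::new(0);
    static PENDING: RefCell<VecDeque<String>> = RefCell::new(VecDeque::new());
}

/// Returns Some(prompt) if the caller should start the LLM call chain now,
/// or None if the prompt was parked until a slot frees up.
fn submit(prompt: String) -> Option<String> {
    let got_slot = IN_FLIGHT.with(|n| {
        if n.get() < MAX_IN_FLIGHT {
            n.set(n.get() + 1);
            true
        } else {
            false
        }
    });
    if got_slot {
        Some(prompt) // caller kicks off the LLM call chain immediately
    } else {
        PENDING.with(|q| q.borrow_mut().push_back(prompt)); // parked for later
        None
    }
}

/// Called when an LLM call chain finishes: free the slot and return the next
/// queued prompt (if any) so the caller can start it.
fn finish() -> Option<String> {
    IN_FLIGHT.with(|n| n.set(n.get().saturating_sub(1)));
    PENDING.with(|q| q.borrow_mut().pop_front())
}
```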

1 Like

That’s a lot to digest all at once, but at a very high level what I understand is that you intend to have a number of canisters per subnet that are expected to be constantly running DTS (deterministic time slicing) executions (in addition to other, lighter weight – in terms of CPU usage – canisters, which I’m going to ignore).

Now, as you observe, a subnet has 4 execution threads / virtual CPU cores. It will not schedule more than 3 DTS executions at the same time, unless there is nothing else to execute on the subnet (and except for the first round of a DTS execution, where it doesn’t yet know it is going to be a DTS execution).

The scheduler also tries to balance DTS and non-DTS executions, by looking at how many canisters (and with how much compute allocation) are in the middle of DTS executions vs. about to start new message / task executions. So e.g. a mix of 10 DTS to 30 “new execution” canisters will result in 1 CPU core allocated for DTS executions and 3 for “new executions”. Canisters with DTS executions will get appended round-robin to the queues of the various execution threads, including to the nominally non-DTS ones, behind the “new” executions, so assuming there isn’t much CPU load from other canisters, you might end up using most of the 4 threads / cores anyway.

Anyway, leaving all of the nitty-gritty aside, having 3(+1) canisters pegging 3(+1) of a subnet’s CPU cores is potentially not great for other canisters on the subnet, who may or may not find themselves “squeezed out” and experiencing high latencies (e.g. getting scheduled every 2-3 rounds instead of every round). And since you’re asking about how to ensure you are being good neighbors, I can think of 2 possible approaches:

  1. Deploy a “canary” canister next to your DTS canisters, only running a heartbeat. Count how many times that heartbeat gets executed (i.e. how many times a “random” canister on the subnet gets scheduled). If it drifts away from 2.5x per second, the subnet is slow and/or your canary didn’t get scheduled every round. Could be due to your other canisters’ DTS executions, could be due to something else. Still, it means you’re not going to get great latency from that subnet. And neither is everyone else. (A minimal canary sketch follows after this list.)

  2. You can count the instructions executed by and measure the duration of your DTS executions (from the caller, as time does not change during a DTS execution). As long as the executed instructions per second approach the instruction limit (2B?) times the expected subnet block rate (2.5/s), the subnet is doing fine, scheduling canisters with work to do (virtually) every round. (You are likely to be some way away from that target, as even on an otherwise idle subnet, simply executing 2B instructions every round is likely to take longer than 400 ms and reduce the block rate. You’ll probably need to do some benchmarking first. Or constantly.) The farther your per-canister throughput drifts from this target, the more loaded the subnet is. And you should likely reduce your load on that particular subnet, directing it elsewhere instead. Looking at it even more simply, the higher the latency you experience, the higher the load on that subnet.
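
For option 1, a minimal canary sketch, assuming ic-cdk's heartbeat and query attributes (the comparison against 2.5 executions per second happens off-chain, by whoever polls the counter):

```rust
use std::cell::Cell;

thread_local! {
    // How many times this canister's heartbeat has executed, i.e. in how many
    // rounds this "random" canister actually got scheduled.
    static HEARTBEATS: Cell<u64> = Cell::new(0);
}

#[ic_cdk::heartbeat]
fn heartbeat() {
    HEARTBEATS.with(|c| c.set(c.get() + 1));
}

// Read from an off-chain monitor: take two readings some wall-clock interval
// apart and compare the delta against interval_seconds * 2.5 (nominal block
// rate). A persistent shortfall means the subnet is slow and/or the canary is
// not being scheduled every round.
#[ic_cdk::query]
fn heartbeat_count() -> u64 {
    HEARTBEATS.with(|c| c.get())
}
```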

5 Likes

We’re running quite a large test at the moment, and noticed that one of the subnets where we run 3 LLMs suddenly dropped its cycle burn rate from about 2 B/s to 0.

https://dashboard.internetcomputer.org/network/subnets/io67a-2jmkw-zup3h-snbwi-g6a5n-rm5dn-b6png-lvdpl-nqnto-yih6l-gqe

There appears to be a degraded node in the subnet. Could that be the cause?

That is an issue with the underlying metrics. We have a consumed_cycles_by_use_cases canister metric that underlies all of this (we simply sum all canisters’ consumed cycles), but unfortunately it measures “burned plus reserved” cycles, not only actually burned cycles. So when your canister makes a call and reserves 50B cycles for a maximum of 50B instructions potentially needed to execute the response, this amount increases by 50B, but then drops by almost 50B when the response is actually processed, requiring only a few million instructions.

In order to compute a cycle burn rate from this metric (one that doesn’t go negative, raising even more questions), we first generate a “high water mark” of this metric: a step function that retains the highest value ever seen. The problem with that is that whenever there’s a spike in activity (particularly lots of concurrent canister calls) the underlying metric spikes temporarily, but the “high water mark” gets stuck at that high value for as long as it takes the subnet to actually burn through that many cycles.

E.g. if something on the subnet makes 1k concurrent calls, they would reserve 1k * 50B = 50T cycles and then be refunded most of them. A subnet that doesn’t have a high cycle burn rate would then take many hours to actually burn through 50T cycles. And during that time its cycle burn rate would be reported as zero.

The good news is that the Execution team will soon start working on an actual burned cycles metric, so we should be able to produce less spiky and more responsive cycle burn rates.

8 Likes

Hi @free , we’re observing extended periods of 0 cycle burn rate followed by huge spikes on this subnet since last Wednesday: https://dashboard.internetcomputer.org/network/subnets/snjp4-xlbw4-mnbog-ddwy6-6ckfd-2w5a2-eipqo-7l436-pxqkh-l6fuv-vae?nd-s=100

Afterwards, we noticed that at least some messages were no longer handled promptly by the canister and showed unusually high latency. We also observed backlogs: messages seemed to pile up and were then processed all at once, with great delay.

On Saturday, we noticed a huge increase in latency and that a key canister didn’t seem to make any progress in general anymore. This continued throughout the weekend, practically making the canister unusable.

Is there anything you could recommend we look into? Are there any known issues with the subnet, or were there any recent upgrades that might have caused this?

As per the above, the spikes in cycle burn followed by zero cycle usage are due to an issue with how we measure cycle burn. They do not indicate any issue with the health of the subnet.

Subnet snjp was and still is working perfectly fine: a minimum of 1.85 blocks per second and peak end-to-end latency of about 1.5 seconds. There was however (until about 18:00 UTC yesterday) one (or a few) canisters with a backlog of 500 canister messages (canister call requests and/or responses). And one canister (very likely the same) that was rate limited due to mutating too much memory (and thus being scheduled only about 2 rounds out of every 5). Up until 18:00 UTC there were also quite a few DTS executions (one every 4 or so rounds).

Remember that a canister is a (single-threaded) actor, so it can only handle so much throughput before it starts accumulating a backlog.

As far as I can tell, this incident likely had to do with your canister being unable to keep up with its load; and very, very likely nothing to do with the subnet itself (beyond it likely rate limiting your canister for reasons of fairness).

8 Likes

Thank you @free for your response and the info!

That’s really helpful for our investigation. We have a candidate in mind for the canister that had the backlog of messages. We believe the canister handled higher workloads before without issue, so we’re not sure which messages caused this. Is there any way we could find out which exact calls/messages started the backlog or were mainly responsible for it?

We’re using the system API a lot (e.g. for randomness, and sending cycles). Is there a way to understand how the system calls contributed to the overall workload of the canister?

Thanks, it’s good to know that a canister got rate limited, which likely contributed to the backlog. When does the rate limiting kick in, e.g. is there a set threshold of memory and/or messages? Is there a way you’d recommend we investigate how the canister used its memory, and which messages and data structures mutated a lot of memory? And do compute and memory allocations for the canister raise this limit?

2 Likes

This would be precisely the canister that was experiencing high latency. As said, the average canister on the subnet was only seeing 1.5 seconds end-to-end latency. So if your app is made up of multiple canisters, most of said canisters would not have experienced any excessive latency (unless, that is, they were all making calls to the backlogged canister(s) themselves).

There exists a system API call that returns the number of instructions executed by the message or call context. You probably want the latter. And you probably need some macro if you want to easily apply it to arbitrary update methods. Else, you need to manually add a block of code to the end of each of your methods (and, if so, you may not intercept every exit path). I don’t know of anything that would do this automatically.

System API calls are usually pretty light. You can use the same performance counter API call from above to actually instrument each call to check if this is indeed so.

Management canister calls, while similar to system API calls, are actual canister calls (hence the await) so they will suspend execution, cause the management canister to execute, then execute the resulting response on your canister (so there’s extra latency involved). raw_rand is a management canister call and will definitely result in at least one round of latency.
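
To tie the last three paragraphs together, a minimal instrumentation sketch, assuming ic-cdk (counter type 1 is the call-context performance counter from the interface spec; the method name and logging are illustrative):

```rust
/// Sketch: measure the instructions used by a whole update call, including the
/// response handling after an awaited management canister call.
#[ic_cdk::update]
async fn instrumented_method() -> u64 {
    // raw_rand is a management canister call: the await suspends execution and
    // adds at least one round of latency before the response is processed.
    let (_random_bytes,): (Vec<u8>,) = ic_cdk::api::management_canister::main::raw_rand()
        .await
        .expect("raw_rand failed");

    // ... the actual work of the method goes here ...

    // Counter type 0 = current message only; type 1 = the whole call context
    // (all messages of this update call executed so far).
    let instructions = ic_cdk::api::performance_counter(1);
    ic_cdk::println!("call context used {} instructions so far", instructions);
    instructions
}
```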

There’s a “heap delta” limit per checkpoint interval (500 rounds), with the goal of limiting the amount of data that needs to be written to disk during checkpointing (you don’t want to have to write a full 2 TB to disk every couple of minutes; checkpointing would take forever and the subnet would be unresponsive for the duration). So if a canister mutates more than its “fair share” of that limit, it gets rate limited, i.e. skipped by the scheduler. I’m not entirely sure what the actual numbers are, but the metrics showed something like 60 MB being mutated every round (so the canister being rate limited was presumably mutating some 2.5x that much memory per round when it actually got scheduled, or about 150 MB). Although as said, I’m not particularly familiar with the details, so my calculations may be off.

Compute and memory allocations do nothing about the heap delta limit. Maybe spreading the work across multiple canisters does (i.e. maybe the rate limiting logic won’t let a single canister use up the entire “memory bandwidth”), but again, I don’t know.

9 Likes

Hi @free , thank you once more for your reply and the info!

We implemented several improvements based on it, and ran tests. It indeed seems like the management canister calls (e.g. for randomness, or to verify canister hashes) slowed down this canister a lot, so we replaced them wherever possible.
In addition, there are a few situations where the canister might handle enough data to hit the heap delta limit, and the updated code should prevent this going forward.
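
For reference, the randomness part of that change boils down to something like the following sketch (rand_chacha is used here for illustration; the exact code we shipped differs): seed a local PRNG once via raw_rand, then generate randomness without further management canister calls.

```rust
use std::cell::RefCell;
use rand_chacha::ChaCha20Rng;
use rand_core::{RngCore, SeedableRng};

thread_local! {
    static RNG: RefCell<Option<ChaCha20Rng>> = RefCell::new(None);
}

/// One management canister call (e.g. triggered from a timer or the first
/// update call after install) to seed the PRNG ...
async fn seed_rng() {
    let (seed_bytes,) = ic_cdk::api::management_canister::main::raw_rand()
        .await
        .expect("raw_rand failed");
    let mut seed = [0u8; 32];
    seed.copy_from_slice(&seed_bytes[..32]); // raw_rand returns 32 bytes
    RNG.with(|r| *r.borrow_mut() = Some(ChaCha20Rng::from_seed(seed)));
}

/// ... then draw randomness locally, with no extra call or round of latency.
fn next_u64() -> u64 {
    RNG.with(|r| r.borrow_mut().as_mut().expect("RNG not seeded").next_u64())
}
```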

Is there any data around this that’d be interesting to you or colleagues at DFINITY? We’d be happy to provide it if we have it or can collect it :+1:

2 Likes

I’ll let others speak to whether there’s any data we’d be interested in, but if you could put together a list of the things you did to improve your situation, I’m sure a lot of other devs in similar situations would be very thankful.

5 Likes

Hi @free, the canister has again been operating more slowly for the last 2 or so days, and I wanted to see if there are any current observations or recommendations you could share with us.
The reported block finalization rate of the subnet also seems to have been decreasing over the last few days (from slightly above 1.7 blocks/s to 1.5), including a sharp increase and then decrease just now.

Are there any current subnet stats you could share that could help us understand what might be going on and how our canisters might be contributing to this or might be affected?

We’re grateful for any insights that help us cope with the slowed-down canister :folded_hands:

Not sure if this is relevant to the above, but one node in the subnet seems to be offline.

The reduction in snjp4’s block rate from around 1.7 blocks/s to around 1.5 blocks/s is indeed related to one replica becoming unhealthy around 2025-08-18 20:00 UTC.

And then, since 2025-08-19 15:30 UTC there’s a persistent backlog (again, across all canisters on the subnet, so impossible to pinpoint) of around 200 canister messages, which (if in your canisters’ input queues) may account for the high latency you are observing.

Same as before, the subnet has a slightly reduced block rate (1.5 blocks/s), but it’s not enough to actually affect the measured end-to-end latency of the average ingress message, which stayed consistently under 1.2 seconds for the past week (consensus takes slightly longer whenever the missing replica is designated as block maker, but that simply results in the block spending less time backlogged before it’s executed). And the average round execution time also stayed relatively constant over the past week, at around 300 ms.

As said, canisters are (single-threaded) actors, so a backlog of messages in input queues implies extra latency (because all already enqueued messages have to be executed before a newly received one can be).

(As for the spike and dip in block rate, that can still be observed on the public dashboard, but nothing like it shows up on either of our internal monitoring databases. So it’s likely an artifact of the public dashboard metrics outage from last evening.)

1 Like