I would expect the cost of queries to be significantly lower than half the cost of update calls (because only one machine, instead of 13 on common subnets, is needed to execute the call). We should have more details about this very soon.
Any updates on the latest thinking around query charging?
Hi Jordan
We are currently re-scoping this feature and are now proposing to:
1. Build a feature that allows developers to get access to query-related statistics via the canister status API. This is basically the entire mechanism that we envision as a possible solution for query charging (a way to deterministically aggregate per-node query statistics), without actually charging for anything.
2. Hold back query charging for now.
Step 1 will hopefully help developers learn something about the usage of query calls in their canisters and help optimize code, also in preparation for query charging.
Does this make sense?
We are planning to put up a motion proposal for voting soon.
Hi all,
here is the motion that we are thinking of putting up for voting. What do you all think about it? Does this sound reasonable to everybody?
Background
Many dapps execute a significant number of query calls. Since queries are read-only and do not persist state changes, it is difficult for developers to keep track of executed queries. This means that developers currently do not know how many queries and instructions their canisters are executing.
Proposal
This proposal aims to add a mechanism to the IC to deterministically aggregate per-node statistics for query execution and periodically update each canister's status with those aggregates.
Aggregated statistics will appear in the canister status in a delayed and approximate manner due to limitations of data processing in Byzantine distributed systems.
We propose to collect query statistics for both replicated and non-replicated query calls.
Details
This proposal will add a quadruple of query statistics to each canister's state:
- Approximate number of query calls executed
- Approximate number of instructions executed by query calls
- Approximate total request payload size
- Approximate total response payload size
Counters are incremented periodically and are initialized to 0 during canister creation. Developers can determine rates for these counters by periodically querying them and analyzing the differences over time.
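As a sketch of how a developer might consume these counters, the minimal Rust below derives rates from two samples taken some time apart. The `QueryStatsSample` struct and its field names are assumptions mirroring the quadruple above, not an existing API; the samples would come from periodically reading the canister status.

```rust
// Minimal sketch: derive rates from two periodic samples of the four
// aggregate counters proposed above. Struct and field names are illustrative.
#[derive(Clone, Copy, Default)]
struct QueryStatsSample {
    num_calls: u128,
    num_instructions: u128,
    request_payload_bytes: u128,
    response_payload_bytes: u128,
    timestamp_ns: u64,
}

/// Approximate calls/s and instructions/s between two samples.
fn query_rates(prev: &QueryStatsSample, cur: &QueryStatsSample) -> (f64, f64) {
    let dt_s = cur.timestamp_ns.saturating_sub(prev.timestamp_ns) as f64 / 1e9;
    let calls_per_s = cur.num_calls.saturating_sub(prev.num_calls) as f64 / dt_s;
    let instr_per_s = cur.num_instructions.saturating_sub(prev.num_instructions) as f64 / dt_s;
    (calls_per_s, instr_per_s)
}
```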
What we are asking the community
- Vote accept or reject on NNS Motion.
- Participate in technical discussions as the motion moves forward.
This sounds like a reasonable first step to me.
I put the motion up for voting here.
Thank you all for your input, here on the forum and offline, and happy voting!
Any update on the implementation now that the motion has passed?
Yes, we have started the implementation and the first partial MR will land on master soon.
As the changes are quite intricate and span multiple components, it will be a few months until the feature can ultimately be enabled on mainnet.
Any updates on this? What are the next steps towards cycle charging?
Hi Jordan
The query statistics feature is coming along nicely. We have so far managed to put most of the required code in the replica, but it is still disabled by means of a feature flag.
We are now adding more tests and making sure to provide support in the CDKs and dfx.
After that has all been merged, we are planning to incrementally enable it on subnets. With that, developers will be able to gain insights into query statistics.
Note that we are not currently planning to actually charge for queries. If we ever do, most likely we would just charge cycles directly proportional to the metrics collected by this feature.
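For illustration only, "directly proportional to the metrics" could look like the hypothetical linear fee below; every price constant is invented, and no such fee schedule exists today.

```rust
// Hypothetical linear fee over the four query-statistics aggregates.
// All price constants are invented placeholders; there is no such fee today.
const CYCLES_PER_QUERY_CALL: u128 = 1_000_000;
const CYCLES_PER_INSTRUCTION: u128 = 1;
const CYCLES_PER_REQUEST_BYTE: u128 = 400;
const CYCLES_PER_RESPONSE_BYTE: u128 = 400;

fn hypothetical_query_fee(calls: u128, instructions: u128, req_bytes: u128, resp_bytes: u128) -> u128 {
    calls * CYCLES_PER_QUERY_CALL
        + instructions * CYCLES_PER_INSTRUCTION
        + req_bytes * CYCLES_PER_REQUEST_BYTE
        + resp_bytes * CYCLES_PER_RESPONSE_BYTE
}
```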
Thanks for your patience!
I'd like to cross-link to this discussion, where @jeshli requests enabling query charging together with an increase in the instruction limit, which is currently blocking running AI on chain.
Introducing an Opt-In Mechanism for Query Charging
The discussion in this thread has revolved around whether to enable query charging for all canisters or for none. In contrast, I propose an opt-in mechanism. By integrating a system where canisters can opt in to pay for queries, we can tailor the IC's capabilities to diverse needs and use cases, especially in the realm of advanced AI applications. Introducing options like `ic_cdk::query(cycles_fee=true)` or `ic_cdk::query(cycles_fee=true, composite_query=true)` could responsibly increase the instruction limits per query. Such a modification would maintain robust security measures against DDoS attacks while enabling significant AI functionalities on the IC.
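To make the shape of the opt-in concrete, here is a hypothetical Rust sketch. The `cycles_fee` attribute argument does not exist in today's ic_cdk, so this would not compile against the current CDK; it only illustrates the proposed annotation.

```rust
// Hypothetical opt-in annotation: the `cycles_fee` argument does NOT exist in
// today's ic_cdk. A method marked this way would pay cycles per query in
// exchange for a higher per-query instruction limit.
#[ic_cdk::query(cycles_fee = true)]
fn infer(prompt: Vec<u32>) -> Vec<u32> {
    // long-running inference pass that would exceed today's free query limit
    run_model(prompt)
}

fn run_model(tokens: Vec<u32>) -> Vec<u32> {
    // placeholder for the actual model forward pass
    tokens
}
```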
I am developing an open-source repository for AI inference on the IC that facilitates the deployment of PyTorch/TensorFlow AI models. The instruction limit for single queries is a significant obstacle to running quality AI models. Experiments with splitting the AI model across multiple canisters and leveraging composite queries have encountered limitations due to the current instruction cap. The proposed modification to the query structure promises to overcome these barriers, enabling larger model sizes and faster response times, which are crucial for efficient AI inference and machine learning model training.
A preliminary empirical analysis I conducted focuses on the instruction limits for queries and updates in a local deployment setting, specifically using 32-bit floating point precision and 768 hidden dimensions. The current structure allows up to 20 billion instructions for an update but restricts queries to only 5 billion instructions. Below is a comparative analysis under these limits:
| Master Type | Worker Type | Configuration     | Tokens | Time   | Max Tokens | Max Layers |
|-------------|-------------|-------------------|--------|--------|------------|------------|
| Composite   | Query       | Embed + 1 Hidden  | 12     | 0.156s | 12         | f(tokens)  |
| Composite   | Query       | Embed + 2 Hidden  | 4      | 0.126s | 4          | f(tokens)  |
| Update      | Query       | Embed + 1 Hidden  | 4      | 2.069s | 12         | 12+        |
| Update      | Query       | Embed + 1 Hidden  | 12     | 2.062s | 12         | 12+        |
| Update      | Query       | Embed + 2 Hidden  | 4      | 2.056s | 12         | 12+        |
| Update      | Query       | Embed + 2 Hidden  | 12     | 3.325s | 12         | 12+        |
| Update      | Query       | Embed + 4 Hidden  | 12     | 4.788s | 12         | 12+        |
| Update      | Query       | Embed + 12 Hidden | 12     | 11.18s | 12         | 12+        |
| Update      | Update      | Embed + 1 Hidden  | 4      | 2.049s | 52         | 12+        |
| Update      | Update      | Embed + 1 Hidden  | 12     | 3.292s | 52         | 12+        |
| Update      | Update      | Embed + 1 Hidden  | 52     | 8.633s | 52         | 12+        |
| Update      | Update      | Embed + 2 Hidden  | 4      | 2.091s | 52         | 12+        |
| Update      | Update      | Embed + 2 Hidden  | 12     | 6.588s | 52         | 12+        |
| Update      | Update      | Embed + 2 Hidden  | 52     | 17.93s | 52         | 12+        |
| Update      | Update      | Embed + 4 Hidden  | 4      | 3.289s | 52         | 12+        |
| Update      | Update      | Embed + 4 Hidden  | 12     | 11.12s | 52         | 12+        |
| Update      | Update      | Embed + 4 Hidden  | 52     | 27.28s | 52         | 12+        |
In practice, we observe a range of context windows in AI models, from BERT's minimum of 512 tokens to the state-of-the-art Mistral's 8000+ tokens. My experiments have been on the smaller side of this range, constrained by the IC's "context window" of 52 tokens at the 20 billion instruction limit and 12 tokens at the 5 billion instruction limit. Extrapolating, we could estimate that 200 billion instructions would allow for a context window of approximately 520 tokens, and 500 billion instructions could enable around 1200 tokens. Aiming for a context window of at least 128 tokens, which I estimate to be around 50 billion instructions, would represent a significant advancement within the current limitations. Alongside these efforts, I plan to explore increasing the context window capacity by reducing the floating point precision to the lowest feasible level. This initiative, while promising, represents a significant and complex undertaking that may require considerable time to implement effectively.
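The extrapolation above is roughly linear in tokens. A small sketch of the arithmetic, assuming ~20 billion instructions buy a 52-token context (about 0.385 billion instructions per token at this model size and precision):

```rust
// Rough linear extrapolation behind the estimates above: ~20B instructions
// for a 52-token context, i.e. roughly 385M instructions per token at this
// model size and precision. Assumes cost scales linearly with tokens.
fn instructions_for_context(tokens: u64) -> u64 {
    const INSTRUCTIONS_PER_TOKEN: u64 = 20_000_000_000 / 52; // ~385M
    tokens * INSTRUCTIONS_PER_TOKEN
}

fn main() {
    for tokens in [52u64, 128, 520, 1200] {
        println!(
            "{tokens:>5} tokens ≈ {:.0}B instructions",
            instructions_for_context(tokens) as f64 / 1e9
        );
    }
}
```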
While integrating GPUs is a promising direction for enhancing computational efficiency, particularly in matrix computations, it's becoming clear that increasing the query instruction limit could have a more immediate and pronounced impact on performance. This is not to diminish the importance of GPUs but to highlight that they, too, would benefit from such an adjustment in instruction limits, as they do not significantly reduce the number of operations required.
@jeshli I fully agree with the proposal for optional query charging. It maintains the status quo wrt all "small" instruction limit queries being free (for now) while allowing canister code to ask for higher query instruction limits in return for a fee in cycles.
A question about your suggested cdk call options:
Would it be better to ask for a specific increased limit (in billions of Wasm instructions) so that a cycle fee can be precalculated, and also to prevent a single "runaway" query from continuing until some practical upper limit is reached?
This question might have direct impact on how this proposal would be implemented in practice.
@stefan-kaestle could you comment on this?
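To illustrate the design question: a fixed, declared budget per call makes the worst-case fee computable up front and caps a runaway query at that budget. Everything below (the attribute argument and the price constant) is hypothetical and does not exist in ic_cdk today.

```rust
// Hypothetical variant of the opt-in: the canister declares an explicit
// instruction budget, e.g. #[ic_cdk::query(instruction_limit_b = 50)], so the
// worst-case fee is known before execution and a runaway query is cut off at
// that budget. The attribute argument and price constant are invented.
const CYCLES_PER_BILLION_INSTRUCTIONS: u128 = 400_000_000; // placeholder

/// Maximum cycles that could be charged for one opted-in query.
fn max_fee_for_budget(budget_billion_instructions: u128) -> u128 {
    budget_billion_instructions * CYCLES_PER_BILLION_INSTRUCTIONS
}
```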
On-chain AI and ML inference tasks are the prime use case for greatly increased query Wasm instruction limits, and they therefore drive the need to implement a flexible canister query charging mechanism.
@jeshli thank you for providing a detailed analysis and discussion about different AI model inference requirements and mapping these back into estimates of single query instruction limit requirements to target.
As we discussed in the recent DeAI Working Group meeting (attn @icpp @patnorris @lastmjs), providing the Dfinity IC engineering team with concrete use cases and requirements analysis will help to focus and prioritise efforts for the whole IC development community.
@stefan-kaestle could your Dfinity engineering team provide a detailed response to this proposal and analysis by @jeshli and confirm technical feasibility and any blockers? The DeAI WG is keen to resolve this first-order problem of queries being able to execute realistic AI model inference tasks as soon as possible. There are multiple individual developers and teams working with on-chain AI inference who are spending precious development time on workarounds instead of directing those efforts into product development.
I also agree with the above statement regarding adding both on-chain & off-chain GPU compute facility to the IC.
The greatest need for the data protection guarantees offered by the IC is for AI inference queries, which will process vast amounts of private and protected information about individuals and businesses. Inference queries clearly belong on-chain using CPU and GPU computation, so charging for queries with larger instruction limits, and being able to choose between single-replica query computation (the default) and verified query computation, is important.
For AI model training, which typically has much higher memory, throughput and data storage requirements, off-chain GPU compute will be necessary in the near term. We can discuss this in more detail over in the DeAI WG thread to keep this thread focused on query charging. However, one related question for the Dfinity engineering team: who should we be asking about the goals and implementation progress for on-chain GPU compute coming in the Gen 3 replica node specification currently under development and testing at Dfinity?
In our last meeting, the DeAI WG discussed the need to better understand how Gen 3 replica node GPU compute might be utilised by canister developers in the future, and how the choice of physical GPUs specified for Gen 3 will align with their product development requirements wrt AI inference and model training/fine-tuning.
@stefan-kaestle could you point us in the right direction?
As this is somewhat off-topic, we can take the GPU discussion over here: Technical Working Group DeAI
Hello everybody and thanks for the detailed use case.
What you are suggesting definitely sounds like a good way to introduce query charging. We have already discussed this briefly internally.
As a first step, we have to finish up the work on "query statistics", which is nearly done. Query stats are a prerequisite for query charging, as they deterministically aggregate per-node query statistics. Those statistics are then the foundation for charging for query calls.
We will get back to you very soon with detailed thoughts.
Query charging would open up a world of opportunities for the type of AI apps I am working on.
There is a whole field of research in conversational AI on creating smaller models trained on targeted datasets. Those types of models are ideally suited to run on the IC and be used as embedded AI agents for every dApp imaginable.
The instruction limit is a blocker, while GPU is a great-to-have, but not a blocker.
Hello @icpp , do you calculate compute costs in ICP or USD?
Hi @TonyArtworld ,
I did not yet check how many cycles my LLM is burning per token prediction. Once I do that I can calculate the cost/token in USD, because the price of cycles is tied to XDR.
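For reference, once the cycles burned per token are measured, the conversion is straightforward: 1 XDR is pegged to 1 trillion cycles, and the XDR/USD rate (roughly 1.3, it fluctuates) does the rest. A tiny sketch with an invented example value:

```rust
// Back-of-the-envelope cycles-to-USD conversion for per-token cost.
// 1 XDR is pegged to 1 trillion cycles on the IC; the XDR/USD rate fluctuates,
// so the value below is only an approximate placeholder.
fn usd_per_token(cycles_per_token: f64) -> f64 {
    const CYCLES_PER_XDR: f64 = 1e12;
    const USD_PER_XDR: f64 = 1.33; // approximate, check the current rate
    cycles_per_token / CYCLES_PER_XDR * USD_PER_XDR
}

fn main() {
    // Example: if one token prediction burned 2 billion cycles:
    println!("{:.6} USD per token", usd_per_token(2e9)); // ≈ 0.00266 USD
}
```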
@timo ,
tagging you since you were part of the conversation. Can you please tag others at DFINITY that would be able to comment?
I'd like to raise this request for query charging again, because it would unlock a powerful use case: running the Qwen2.5-0.5B-Q8 model. It turns out that this is a great LLM that fits in a 32-bit canister, and token calculation is fast thanks to SIMD.
The limiting factor is the instruction limit. With an update call, I can generate 10 tokens, then need to go through consensus and cache the data in the canister's OP, and then call it again. This makes it very slow and expensive.
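For readers unfamiliar with the pattern being described, here is a minimal sketch of that chunked update-call loop; the names and state handling are illustrative, and the real llama_cpp_canister interface differs.

```rust
// Sketch of the chunked-generation loop described above: each update call
// produces a small batch of tokens within the per-call instruction limit and
// keeps the running state in the canister, so the client must call repeatedly.
use std::cell::RefCell;

thread_local! {
    static GENERATED: RefCell<Vec<u32>> = RefCell::new(Vec::new());
}

const TOKENS_PER_CALL: usize = 10; // bounded by the update instruction limit

#[ic_cdk::update]
fn generate_chunk() -> Vec<u32> {
    GENERATED.with(|state| {
        let mut tokens = state.borrow_mut();
        for _ in 0..TOKENS_PER_CALL {
            // placeholder for one token of actual LLM inference
            tokens.push(0);
        }
        tokens.clone()
    })
}
```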
Nevertheless, the Qwen2.5 LLM running in llama_cpp_canister is a great demo model, and I will release it in ICGPT soon, but I would like to be able to give users some hope that the model will become more usable via either query charging or an increase of the instruction limit on update calls, so it can generate more tokens at a time.
This model is also great for fine-tuning purposes, and it will unlock several use cases in gaming and gambling.