Community Consideration: Explore Query Charging

Introducing an Opt-In Mechanism for Query Charging

The discussion in this thread has revolved around whether to enable query charging for all canisters or for none. I propose a third option: an opt-in mechanism. By letting canisters opt in to pay for queries, we can tailor the IC’s capabilities to diverse needs and use cases, especially in the realm of advanced AI applications. Introducing options like `ic_cdk::query(cycles_fee=true)` or `ic_cdk::query(cycles_fee=true, composite_query=true)` could responsibly raise the per-query instruction limit. Such a modification would preserve robust protection against DDoS attacks while enabling significant AI functionality on the IC.
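As a sketch, the opt-in could look like the following in a canister’s source. To be clear, the `cycles_fee` attribute is hypothetical and not part of today’s `ic_cdk` API, and `run_model` is an illustrative placeholder:

```rust
use ic_cdk::query;

// Hypothetical opt-in: this method pays cycles per query in exchange
// for a higher instruction limit. `cycles_fee` does not exist in the
// current ic_cdk API; it illustrates the proposal.
#[query(cycles_fee = true)]
fn infer(prompt: String) -> String {
    // expensive AI inference that would exceed today's query instruction cap
    run_model(&prompt)
}

// Composite variant for models sharded across several canisters.
#[query(cycles_fee = true, composite = true)]
async fn infer_sharded(prompt: String) -> String {
    // fan out to worker canisters holding the remaining layers
    ...
}
```

Canisters that never opt in would keep the current free-query behavior, so the default security and cost model is unchanged.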

I am developing an open-source repository for AI inference on the IC, facilitating the deployment of PyTorch/TensorFlow AI models. The per-query instruction limit is a significant obstacle to running quality AI models. Experiments with splitting the AI model across multiple canisters and chaining them with composite queries have run into the same limitation, since they are bound by the current instruction cap. The proposed modification to the query structure promises to overcome these barriers, enabling larger model sizes and faster response times, which are crucial for efficient AI inference and machine learning model training.
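For context, the multi-canister experiments follow a master/worker split along these lines. This is a sketch using the existing composite-query support in `ic_cdk`; `embed`, `workers`, and the `forward_layer` method are illustrative names, not code from the repository:

```rust
use candid::Principal;

// Master canister: holds the embedding layer. Each worker canister
// holds one or more hidden layers of the model.
#[ic_cdk::query(composite = true)]
async fn forward(tokens: Vec<u32>) -> Vec<f32> {
    // embed locally on the master
    let mut activations = embed(&tokens);
    // Chain the activations through each worker's layers in turn.
    // The whole chain runs under the query instruction cap, which is
    // where these experiments hit the limit.
    for worker in workers() {
        let (out,): (Vec<f32>,) =
            ic_cdk::call(worker, "forward_layer", (activations,))
                .await
                .expect("inter-canister query failed");
        activations = out;
    }
    activations
}
```

The same topology can be driven by update calls instead (the “Update/Update” rows in the table below), trading latency for the higher update instruction limit.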

A preliminary empirical analysis I conducted focuses on the instruction limits for queries and updates in a local deployment, using 32-bit floating point precision and 768 hidden dimensions. The current limits allow up to 20 billion instructions per update call but only 5 billion per query. Below is a comparative analysis under these limits:

Master Type   Worker Type   Configuration       Tokens   Time     Max Tokens   Max Layers
-----------   -----------   -----------------   ------   ------   ----------   ----------
Composite     Query         Embed + 1 Hidden    12       0.156s   12           f(tokens)
Composite     Query         Embed + 2 Hidden    4        0.126s   4            f(tokens)
-----------   -----------   -----------------   ------   ------   ----------   ----------
Update        Query         Embed + 1 Hidden    4        2.069s   12           12+
Update        Query         Embed + 1 Hidden    12       2.062s   12           12+
Update        Query         Embed + 2 Hidden    4        2.056s   12           12+
Update        Query         Embed + 2 Hidden    12       3.325s   12           12+
Update        Query         Embed + 4 Hidden    12       4.788s   12           12+
Update        Query         Embed + 12 Hidden   12       11.18s   12           12+
-----------   -----------   -----------------   ------   ------   ----------   ----------
Update        Update        Embed + 1 Hidden    4        2.049s   52           12+
Update        Update        Embed + 1 Hidden    12       3.292s   52           12+
Update        Update        Embed + 1 Hidden    52       8.633s   52           12+
Update        Update        Embed + 2 Hidden    4        2.091s   52           12+
Update        Update        Embed + 2 Hidden    12       6.588s   52           12+
Update        Update        Embed + 2 Hidden    52       17.93s   52           12+
Update        Update        Embed + 4 Hidden    4        3.289s   52           12+
Update        Update        Embed + 4 Hidden    12       11.12s   52           12+
Update        Update        Embed + 4 Hidden    52       27.28s   52           12+

In practice, we observe a range of context windows in AI models, from BERT’s 512 tokens to the 8,000+ tokens of state-of-the-art models like Mistral. My experiments sit at the small end of this range, constrained by the IC’s “context window” of 52 tokens at the 20 billion instruction limit and 12 tokens at the 5 billion instruction limit. Extrapolating, we could estimate that 200 billion instructions would allow a context window of approximately 520 tokens, and 500 billion around 1,200 tokens. Aiming for a context window of at least 128 tokens, which I estimate to require around 50 billion instructions, would represent a significant advancement within the current limitations. Alongside these efforts, I plan to explore increasing the context window capacity by reducing the floating point precision to the lowest feasible level. This initiative, while promising, is a significant and complex undertaking that may require considerable time to implement effectively.
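The extrapolation above is a simple linear fit through the two measured points from the table (12 tokens at the 5 billion query limit, 52 tokens at the 20 billion update limit); a quick sketch, with helper names of my own choosing:

```python
# Linear fit of observed context window vs. instruction budget,
# from the two measured points in the table above:
#   5B instructions (query limit)   -> 12-token window
#   20B instructions (update limit) -> 52-token window
SLOPE = (52 - 12) / (20 - 5)   # ~2.67 tokens per billion instructions
INTERCEPT = 12 - SLOPE * 5     # ~-1.33 tokens

def tokens_for_budget(billions: float) -> float:
    """Estimated context window for a given instruction budget."""
    return SLOPE * billions + INTERCEPT

def budget_for_tokens(tokens: float) -> float:
    """Estimated instruction budget needed for a given context window."""
    return (tokens - INTERCEPT) / SLOPE

print(f"200B instructions -> ~{tokens_for_budget(200):.0f}-token window")
print(f"128-token window  -> ~{budget_for_tokens(128):.1f}B instructions")
```

The fit lands near the figures quoted above (roughly 530 tokens at 200 billion instructions, and just under 50 billion for a 128-token window); it is only a first-order estimate and ignores any superlinear cost of attention at longer contexts.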

While integrating GPUs is a promising direction for enhancing computational efficiency, particularly in matrix computations, it’s becoming clear that raising the query instruction limit could have a more immediate and pronounced impact on performance. This is not to diminish the importance of GPUs, but to highlight that they, too, would benefit from such an adjustment: GPUs speed up the work without significantly reducing the number of operations a query must execute.