Community Consideration: Explore Query Charging

Introducing an Opt-In Mechanism for Query Charging

The discussion in this thread has revolved around whether to enable query charging for all canisters or for none. In contrast, I propose an opt-in mechanism. By letting canisters opt in to pay for queries, we can tailor the IC’s capabilities to diverse needs and use cases, especially in the realm of advanced AI applications. Introducing options such as ic_cdk::query(cycles_fee=true) or ic_cdk::query(cycles_fee=true, composite_query=true) could allow the instruction limit per query to be raised responsibly. Such a modification would preserve robust protection against DDoS attacks while enabling significant AI functionality on the IC.
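
To make the proposal concrete, here is a minimal sketch of what the opt-in could look like at the CDK level, mirroring the existing `#[query]` attribute syntax in ic_cdk. The `cycles_fee` flag and the `forward` helper are hypothetical; nothing like this exists in the CDK today.

```rust
use ic_cdk::query;

// Today: a plain query, free of charge and capped at roughly 5B instructions.
#[query]
fn infer(tokens: Vec<u32>) -> Vec<f32> {
    forward(&tokens)
}

// Proposed (hypothetical): the canister opts this endpoint into query charging
// in exchange for a higher instruction limit.
#[query(cycles_fee = true)]
fn infer_paid(tokens: Vec<u32>) -> Vec<f32> {
    forward(&tokens)
}

// Placeholder standing in for the model's forward pass.
fn forward(_tokens: &[u32]) -> Vec<f32> {
    Vec::new()
}
```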

I am developing an open-source repository for AI inference on the IC that facilitates the deployment of PyTorch/TensorFlow models. The instruction limit for a single query is a significant obstacle to running quality AI models. Experiments with splitting the AI model across multiple canisters and leveraging composite queries (sketched below) have run into the same limitation, because each hop is still subject to the current instruction cap. The proposed modification to the query structure would overcome these barriers, enabling larger model sizes and faster response times, both crucial for efficient AI inference and machine learning model training.
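
For context, this is roughly how the split-model experiment looks with today’s API: a master canister exposes a composite query that forwards activations through worker canisters, each holding a slice of the layers. The worker list and the `forward_layer` method name are illustrative placeholders; each call in the chain still has to fit under the current instruction cap.

```rust
use candid::Principal;
use ic_cdk::query;

// Master canister: a composite query that chains worker canisters,
// each of which evaluates one slice of the model's layers.
#[query(composite = true)]
async fn forward(embedding: Vec<f32>) -> Vec<f32> {
    let mut activations = embedding;
    for worker in worker_canisters() {
        // Inter-canister query call; "forward_layer" is an illustrative
        // method name exposed by each worker canister.
        let (out,): (Vec<f32>,) = ic_cdk::call(worker, "forward_layer", (activations,))
            .await
            .expect("worker query failed");
        activations = out;
    }
    activations
}

// Placeholder: in the real project this list would come from configuration.
fn worker_canisters() -> Vec<Principal> {
    Vec::new()
}
```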

A preliminary empirical analysis I conducted focuses on the instruction limits for queries and updates in a local deployment, using 32-bit floating-point precision and 768 hidden dimensions. The current limits allow up to 20 billion instructions for an update but restrict queries to only 5 billion instructions. Below is a comparative analysis under these limits:

| Master Type | Worker Type | Configuration | Tokens | Time | Max Tokens | Max Layers |
|---|---|---|---|---|---|---|
| Composite | Query | Embed + 1 Hidden | 12 | 0.156s | 12 | f(tokens) |
| Composite | Query | Embed + 2 Hidden | 4 | 0.126s | 4 | f(tokens) |
| Update | Query | Embed + 1 Hidden | 4 | 2.069s | 12 | 12+ |
| Update | Query | Embed + 1 Hidden | 12 | 2.062s | 12 | 12+ |
| Update | Query | Embed + 2 Hidden | 4 | 2.056s | 12 | 12+ |
| Update | Query | Embed + 2 Hidden | 12 | 3.325s | 12 | 12+ |
| Update | Query | Embed + 4 Hidden | 12 | 4.788s | 12 | 12+ |
| Update | Query | Embed + 12 Hidden | 12 | 11.18s | 12 | 12+ |
| Update | Update | Embed + 1 Hidden | 4 | 2.049s | 52 | 12+ |
| Update | Update | Embed + 1 Hidden | 12 | 3.292s | 52 | 12+ |
| Update | Update | Embed + 1 Hidden | 52 | 8.633s | 52 | 12+ |
| Update | Update | Embed + 2 Hidden | 4 | 2.091s | 52 | 12+ |
| Update | Update | Embed + 2 Hidden | 12 | 6.588s | 52 | 12+ |
| Update | Update | Embed + 2 Hidden | 52 | 17.93s | 52 | 12+ |
| Update | Update | Embed + 4 Hidden | 4 | 3.289s | 52 | 12+ |
| Update | Update | Embed + 4 Hidden | 12 | 11.12s | 52 | 12+ |
| Update | Update | Embed + 4 Hidden | 52 | 27.28s | 52 | 12+ |
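
For anyone reproducing these numbers, one way to see how close a given configuration gets to the cap is the instruction counter that recent ic_cdk versions expose. A minimal sketch, with `run_inference` as a stand-in for the real forward pass:

```rust
use ic_cdk::query;

// Returns the model output together with the number of instructions the call
// consumed, which can be compared against the 5B query / 20B update limits.
#[query]
fn forward_with_count(tokens: Vec<u32>) -> (Vec<f32>, u64) {
    let start = ic_cdk::api::instruction_counter();
    let logits = run_inference(&tokens);
    let used = ic_cdk::api::instruction_counter() - start;
    (logits, used)
}

// Placeholder standing in for the real forward pass.
fn run_inference(tokens: &[u32]) -> Vec<f32> {
    tokens.iter().map(|&t| t as f32).collect()
}
```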

In practice, we observe a range of context windows in AI models, from BERT’s 512 tokens up to the 8,000+ tokens of state-of-the-art models such as Mistral. My experiments sit at the small end of this range, constrained to an effective IC “context window” of 52 tokens at the 20 billion instruction limit and 12 tokens at the 5 billion instruction limit. Extrapolating linearly (a rough calculation is sketched below), 200 billion instructions would allow a context window of approximately 520 tokens, and 500 billion instructions around 1,200 tokens. Aiming for a context window of at least 128 tokens, which I estimate would require around 50 billion instructions, would already represent a significant advancement within the current limitations. Alongside these efforts, I plan to explore increasing the context window by reducing the floating-point precision to the lowest feasible level. That initiative, while promising, is a significant and complex undertaking that may require considerable time to implement effectively.
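
The extrapolation assumes roughly linear scaling from the measured data point above (52 tokens under the 20 billion instruction update limit); the arithmetic as a tiny sketch:

```rust
// Linear extrapolation from the table above: ~52 tokens fit under the
// 20B-instruction update limit, so estimated tokens ≈ 52 * limit / 20e9.
fn estimated_context_window(instruction_limit: u64) -> u64 {
    52 * instruction_limit / 20_000_000_000
}

fn main() {
    assert_eq!(estimated_context_window(20_000_000_000), 52);
    assert_eq!(estimated_context_window(50_000_000_000), 130); // ~128-token target
    assert_eq!(estimated_context_window(200_000_000_000), 520);
}
```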

While integrating GPUs is a promising direction for enhancing computational efficiency, particularly in matrix computations, it is becoming clear that increasing the query instruction limit could have a more immediate and pronounced impact on performance. This is not to diminish the importance of GPUs, but to highlight that they, too, would benefit from such an adjustment, as they do not significantly reduce the number of operations required.
