Hi scalability team,
I am really happy that the 0.5 billion parameter LLM is now running stably in a 32-bit canister, and it is giving very good answers.
You can try it out at https://icgpt.icpp.world/
You will notice right away that token generation is slow, and the main reason is the instruction limit on update calls.
This LLM can generate 10 tokens within that limit.
We work around it by doing a sequence of update calls and using the prompt caching that is a standard feature of llama.cpp. We store the prompt cache for every conversation of every user, identified by their principal ID.
This works very well, but a speedup would be very welcome.
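To make the workaround concrete, here is a minimal, self-contained sketch of the idea, not our actual canister code: names like PromptCache and generate_one_token are placeholders standing in for llama.cpp's real prompt/KV cache and decode step, and the budget of 10 tokens per call mirrors what fits in the current instruction limit.

```cpp
// Sketch of chunked generation across a sequence of update calls.
// Each call generates a small token budget; the per-principal cache
// lets the next call resume without re-processing the prompt.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct PromptCache {
    std::vector<int> kv_tokens;   // stands in for the llama.cpp KV/prompt cache
    std::string generated_text;   // text produced so far for this conversation
    bool done = false;
};

// One cache per conversation, keyed by the caller's principal ID.
static std::map<std::string, PromptCache> caches;

// Placeholder for a real llama.cpp decode step.
static std::string generate_one_token(PromptCache &cache) {
    cache.kv_tokens.push_back(static_cast<int>(cache.kv_tokens.size()));
    if (cache.kv_tokens.size() >= 25) cache.done = true;  // pretend we hit EOS
    return "tok" + std::to_string(cache.kv_tokens.size()) + " ";
}

// One "update call": generate at most `budget` tokens, then return so the
// client can issue the next call in the sequence.
std::string update_call(const std::string &principal, int budget) {
    PromptCache &cache = caches[principal];
    std::string chunk;
    for (int i = 0; i < budget && !cache.done; ++i) {
        chunk += generate_one_token(cache);
    }
    cache.generated_text += chunk;
    return chunk;
}

int main() {
    const std::string principal = "aaaaa-aa";  // example principal ID
    // The client keeps calling until the model signals it is finished.
    while (!caches[principal].done) {
        std::cout << update_call(principal, /*budget=*/10);
    }
    std::cout << "\n";
}
```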
Would it be possible to further increase the instruction limit?
Doubling it would directly double the token generation speed, since each update call could then produce twice as many tokens.