Hi scalability team,
I am really happy that the 0.5 billion parameter LLM is now running stably in a 32-bit canister, and it is giving very good answers.
You can try it out at https://icgpt.icpp.world/
You will notice right away that token generation is slow, and the main reason is the instruction limit on update calls.
This LLM can generate 10 tokens within that limit.
We work around it by doing a sequence of update calls and using the prompt caching that is a standard feature of llama.cpp. We store the prompt cache for every conversation of every user, identified by their principal ID.
This works very well, but a speedup would be very welcome.
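To make the workaround concrete, here is a minimal, self-contained sketch of the idea, not our actual canister code: names like PromptCache and generate_one_token are placeholders standing in for llama.cpp's real prompt/KV cache and decode step, and the budget of 10 tokens per call mirrors what fits in the current instruction limit.

```cpp
// Sketch of chunked generation across a sequence of update calls.
// Each call generates a small token budget; the per-principal cache
// lets the next call resume without re-processing the prompt.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct PromptCache {
    std::vector<int> kv_tokens;   // stands in for the llama.cpp KV/prompt cache
    std::string generated_text;   // text produced so far for this conversation
    bool done = false;
};

// One cache per conversation, keyed by the caller's principal ID.
static std::map<std::string, PromptCache> caches;

// Placeholder for a real llama.cpp decode step.
static std::string generate_one_token(PromptCache &cache) {
    cache.kv_tokens.push_back(static_cast<int>(cache.kv_tokens.size()));
    if (cache.kv_tokens.size() >= 25) cache.done = true;  // pretend we hit EOS
    return "tok" + std::to_string(cache.kv_tokens.size()) + " ";
}

// One "update call": generate at most `budget` tokens, then return so the
// client can issue the next call in the sequence.
std::string update_call(const std::string &principal, int budget) {
    PromptCache &cache = caches[principal];
    std::string chunk;
    for (int i = 0; i < budget && !cache.done; ++i) {
        chunk += generate_one_token(cache);
    }
    cache.generated_text += chunk;
    return chunk;
}

int main() {
    const std::string principal = "aaaaa-aa";  // example principal ID
    // The client keeps calling until the model signals it is finished.
    while (!caches[principal].done) {
        std::cout << update_call(principal, /*budget=*/10);
    }
    std::cout << "\n";
}
```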
Would it be possible to further increase the instruction limit?
Doubling it would directly double the token generation speed, since each update call could then produce twice as many tokens.