We completed the update to the latest llama.cpp version (sha 615212).
Please start fresh by following the instructions at [onicai/llama_cpp_canister: llama.cpp for the Internet Computer](https://github.com/onicai/llama_cpp_canister).
This update allows you to run many new LLM architectures, including the 1.5-billion-parameter DeepSeek model that attracted a lot of attention with this X post.
The main limiting factor in running larger LLMs is the per-call instructions limit. If a model can generate at least one token within that limit, you can use it, because we generate tokens across multiple update calls. (See the README in the repo for details.)
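As a rough illustration, a client-side generation loop can look like the Python sketch below, which shells out to `dfx`. The method names (`new_chat`, `run_update`), the argument record, and the end-of-generation marker are assumptions for illustration only; the README documents the actual interface.

```python
import subprocess

CANISTER = "llama_cpp"  # assumption: the name of your deployed canister

def call(method: str, candid_args: str) -> str:
    """Invoke a canister method via dfx and return its textual reply."""
    result = subprocess.run(
        ["dfx", "canister", "call", CANISTER, method, candid_args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Start a fresh chat, then keep issuing update calls; each call generates
# a bounded number of tokens, so long outputs are produced across many calls.
call("new_chat", "(record { args = vec {} })")
while True:
    out = call(
        "run_update",
        '(record { args = vec {"-p"; "Explain the Internet Computer in one sentence."; "-n"; "8"} })',
    )
    print(out)
    if "generated_eog=1" in out:  # hypothetical end-of-generation marker
        break
```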
Latency is of course high, which will hopefully improve with further ICP protocol and hardware updates, but we believe it is already possible to build useful, targeted AI agents with their LLM running on-chain. This requires some smart prompt engineering, which is an area where we are focusing our efforts.
To assist with prompt engineering, a Python notebook, prompt-design.ipynb, is included in the repository; it lets you iterate on prompts against the original llama.cpp compiled for your native system.
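If you prefer a plain script over the notebook, a minimal sketch along these lines works for quick local prompt iteration. It assumes you have built llama.cpp natively so the `llama-cli` binary is on your PATH, and the model path is a hypothetical placeholder; flags may differ slightly depending on your llama.cpp build.

```python
import subprocess

MODEL = "models/DeepSeek-R1-Distill-Qwen-1.5B.gguf"  # hypothetical local path

def run_prompt(prompt: str, n_tokens: int = 64) -> str:
    """Run a prompt against natively compiled llama.cpp and return stdout."""
    result = subprocess.run(
        ["llama-cli", "-m", MODEL, "-p", prompt, "-n", str(n_tokens),
         "--temp", "0.7", "-no-cnv"],  # -no-cnv: single pass, no interactive chat
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(run_prompt("You are a targeted on-chain agent. In one line, what is a canister?"))
```

Because the same GGUF model and sampling parameters are used natively and on-chain, a prompt tuned this way should behave the same once deployed to the canister.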