Yes, you're making really great progress. Maybe for Milestone 2 you could apply the speculative execution approach that Karpathy described to speed up inference for Llama2 7B Chat, assuming the 1.1B LLaMA plays nicely with the strategy as the draft model.
The issue that I'm having with Burn is that it doesn't have enough ONNX operators implemented. Currently, I'm debating whether to adapt tract, ORT, or Burn.
The randomness in GPT's output is not just there to make the text appealing; it is a consequence of having a neural network operate over an "action space." GPT predicts what the next token (word fragment) should be and assigns a probability to each token, and the probabilities of all tokens sum to 100%. The randomness comes in when choosing which token to emit next. Since these probabilities always sum to one, you could put an upper bound on the number of tokens to return and calculate the (pseudo)random numbers before running GPT, making the output entirely deterministic.
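To make that concrete, here is a minimal sketch of the idea in Rust; the probability values and the precomputed uniform numbers are made up, and the picker is a stand-in for GPT's actual sampler:

```rust
/// Pick a token index from a probability distribution using a
/// pre-supplied uniform random number in [0, 1).
/// Because `u` is fixed ahead of time, the whole generation run
/// is deterministic and reproducible.
fn sample_token(probs: &[f32], u: f32) -> usize {
    let mut cumulative = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    probs.len() - 1 // guard against rounding error if the sum falls slightly below 1.0
}

fn main() {
    // Hypothetical distribution over a 4-token vocabulary (sums to 1.0).
    let probs = [0.1, 0.6, 0.2, 0.1];
    // Random numbers drawn *before* running the model, one per output token.
    let precomputed = [0.73_f32, 0.05, 0.41];
    for u in precomputed {
        println!("picked token {}", sample_token(&probs, u));
    }
}
```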
I would like to introduce MotokoLearn, a Motoko package meant to facilitate on-chain training and inference of machine learning models without requiring a GPU: Introduction of MotokoLearn V0.1: on-chain training and inference of machine learning models
Way to go! I have been working along similar lines, but focused on Rust. I have adapted tract-onnx to run in a canister. I wanted to wait until I had a better demo working before sharing, but I will definitely share it this coming week. I have already verified that I can handle a broad array of NN methodologies: I tested Mistral, Llama2, GPT2, and ResNet.
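For anyone curious what the tract side looks like, the basic load-and-run flow is roughly the following; this is close to tract's own README example, with a placeholder model path and an assumed 1x3x224x224 image input (the ResNet case rather than the LLM case):

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load the ONNX graph, pin the input shape, optimize, and make it runnable.
    let model = tract_onnx::onnx()
        .model_for_path("model.onnx")? // placeholder path
        .with_input_fact(0, f32::fact([1, 3, 224, 224]).into())?
        .into_optimized()?
        .into_runnable()?;

    // Dummy all-zero input; a real caller would fill this with image data.
    let input: Tensor = tract_ndarray::Array4::<f32>::zeros((1, 3, 224, 224)).into();
    let outputs = model.run(tvec!(input.into()))?;

    println!("output shape: {:?}", outputs[0].shape());
    Ok(())
}
```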
The problem is: how do you deal with tens of thousands of tensor computations in a canister?
RAM is the only issue that I am aware of, and it can be handled with inter-canister composite queries and by chopping the model into subsections that each fit within a canister. It would be a problem if a single layer of a NN took more than 4GB of RAM, though. In my opinion, the biggest priority would be making it easy to run 4-bit models; I haven't done any testing there yet. The most important design requirement for me was to be able to easily import arbitrary designs/models from PyTorch and TensorFlow into a canister. That is why I chose the tract framework, which has implemented 85% of the ONNX functionality, but long term I would strongly support whichever crate best achieves this ambition. Efficiently handling, and making it easy to work with, lower-bit-precision NNs would be a huge boon for what could run efficiently and fit within a single canister. My current priority is to explore the complexity/capability trade-off curves of models with only QKV layers, evaluated against purpose-specific goals.
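As a very rough sketch of the chopping idea (untested, with assumed method names; not something I have running), a front canister could stream activations through shard canisters that each hold a slice of the layers:

```rust
use candid::Principal;
use ic_cdk::query;
use std::cell::RefCell;

thread_local! {
    // Principals of the canisters that each hold one slice of the model's layers.
    // Assumed to be filled in at install time; left empty here.
    static SHARDS: RefCell<Vec<Principal>> = RefCell::new(Vec::new());
}

/// Composite query: pipe the activations through each shard canister in turn.
/// `shard_forward` is an assumed query method exposed by every shard.
#[query(composite = true)]
async fn forward(input: Vec<f32>) -> Vec<f32> {
    let shards = SHARDS.with(|s| s.borrow().clone());
    let mut activations = input;
    for shard in shards {
        let (next,): (Vec<f32>,) =
            ic_cdk::call(shard, "shard_forward", (activations,))
                .await
                .expect("inter-canister query failed");
        activations = next;
    }
    activations
}
```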
I tried ONNX 11 months ago. The problem is that the low-level Rust randomness library cannot be compiled for the ICP.
I removed that library. It was only used in two components (just multinomial sampling), neither of which is critical for the majority of NN designs; that's why I have paid attention to which models I am able to run while verifying that the Rust code still compiles for a canister. If I find it necessary, I will later rebuild the multinomial functionality using libraries that are usable on the IC. Tract-ONNX does not work on the IC by default; I adapted it so that it does, while still maintaining the desired functionality.
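If it comes to that, the multinomial piece does not really need the rand crate at all; a tiny seeded generator is enough to produce the uniform numbers. A dependency-free sketch, not what the crate currently does:

```rust
/// Minimal xorshift64 generator: no `rand`/`getrandom`, so it compiles
/// cleanly for the IC's wasm32-unknown-unknown target.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        Self { state: seed.max(1) } // state must be non-zero
    }

    /// Next uniform value in [0, 1), usable for multinomial sampling
    /// (e.g. with a cumulative-probability pick over the token distribution,
    /// like the one sketched earlier in the thread).
    fn next_uniform(&mut self) -> f32 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        (x >> 40) as f32 / (1u64 << 24) as f32
    }
}

fn main() {
    // The seed could come from the caller, or from the IC's raw_rand in an update call.
    let mut rng = XorShift64::new(42);
    for _ in 0..3 {
        println!("{}", rng.next_uniform());
    }
}
```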
So other challenges you will deal with are: 1. how to upload the .onnx model to the ICP, and 2. how to transform the .pth file into the .onnx format. I assumed that the only feasible option for now is to run a simple model rather than an LLM, but even then I haven't found a way to run it; I can only run a simple machine learning algorithm on the ICP. So we need to consider a lot of things, such as how to store the computed results on chain. It will be a very complicated thing. As for "import arbitrary models from PyTorch and TensorFlow into a canister", I can say it's a huge project, even just to port a framework written in Rust.
I’m assuming that you ran into issues with model size. If you would be willing to speak to issues that you faced, I would be happy to listen.
My Rust crate, although still in early development, is open-source. I post my updates on my taggr account. This post has a link to the open-source tract library that I have started maintaining.
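On the upload question raised above: the usual approach is to split the .onnx file into chunks on the client side and append them over repeated update calls, since a single ingress message is limited to roughly 2 MB. A bare-bones sketch of the canister side (the names are placeholders, not my crate's actual API):

```rust
use ic_cdk::update;
use std::cell::RefCell;

thread_local! {
    // Raw .onnx bytes accumulated across many small update calls.
    static MODEL_BYTES: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

/// Append one chunk of the .onnx file and return the total size so far.
/// The client splits the file (e.g. into 1 MB pieces) and calls this
/// repeatedly until the whole model is on-chain, then triggers a separate
/// call that parses the bytes with tract and builds the runnable model.
#[update]
fn upload_model_chunk(chunk: Vec<u8>) -> u64 {
    MODEL_BYTES.with(|m| {
        let mut bytes = m.borrow_mut();
        bytes.extend_from_slice(&chunk);
        bytes.len() as u64
    })
}
```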
I am currently investigating whether placing queries in separate canisters increases the throughput of composite_queries. I'm starting by asking others if this is the case, but I will know for sure after I implement it myself; I would just implement it with more urgency if it were confirmed. I'm excited to share an example that can enable any AI developer to easily publish their methods to the IC.
Thank you for sharing!
I ran into the instructions-per-query limit as well, and asked around whether composite queries would help, but the answer was that they would not. I never tried it myself, though, and I am very interested to hear the outcome of your experiment.
My solution was to switch to a series of authenticated update calls, managed by the frontend. The frontend calls the LLM canister and asks for the next 10 tokens. The canister generates those 10 tokens and also saves the LLM's runstate for the logged-in principal using orthogonal persistence. The frontend then calls for the next 10 tokens, and so on, until the LLM has generated the full text.
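Sketching the pattern rather than the actual code, with the token generation itself stubbed out, the canister side looks roughly like this in Rust:

```rust
use candid::Principal;
use ic_cdk::update;
use std::cell::RefCell;
use std::collections::HashMap;

/// Whatever the model needs to resume generation (KV cache, position, etc.).
/// Kept minimal here; the real runstate is model-specific.
#[derive(Default, Clone)]
struct RunState {
    tokens_so_far: Vec<u32>,
}

thread_local! {
    // One saved runstate per logged-in principal, kept in ordinary heap
    // memory between calls (orthogonal persistence).
    static RUNSTATES: RefCell<HashMap<Principal, RunState>> = RefCell::new(HashMap::new());
}

/// Stand-in for the actual forward pass; returns one fake token.
fn generate_one_token(state: &mut RunState) -> u32 {
    let next = state.tokens_so_far.len() as u32;
    state.tokens_so_far.push(next);
    next
}

/// Update call the frontend invokes repeatedly: produce the next `n` tokens
/// for the caller and save the runstate so the next call can resume.
#[update]
fn next_tokens(n: u64) -> Vec<u32> {
    let caller = ic_cdk::caller();
    RUNSTATES.with(|s| {
        let mut map = s.borrow_mut();
        let state = map.entry(caller).or_default();
        (0..n).map(|_| generate_one_token(state)).collect()
    })
}
```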
For model size, I was able to load a 110M Llama2 model. Still working on the larger models and investigating how to handle them.
One thing that is still really limiting the use of bigger models is the instruction limit per message. With the 110M model I am using, I can only get 1 token per message. I am interested in exploring more compute-efficient models that require fewer instructions per token.
I used to be able to get more tokens per message, but my hunch (not a proven fact) is that the IC's new cycle-cost calculation is negatively impacting it. Floating-point operations are weighted more heavily, which is fair, but it hurts LLM-type apps.
Hello everyone, I am looking for a Motoko example of how to put an LLM model fully on-chain. I have gone through the Motoko examples on the official page, but they were loading the model into the browser instead of into the canister. The Rust examples are quite good, but my preferred language is Motoko, so if there's a Motoko example for putting a model on-chain, I would appreciate it if someone could share it. Thank you!
The Rust example is there because there are inference libraries for Rust, which can be compiled to Wasm. Some of the libraries, for example tract, support Wasm SIMD instructions, so the inference is possible even in query calls.
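Concretely, once a tract model has been built in canister memory, serving it from a query call has roughly this shape; this is only a sketch, with the model type alias and the loading step assumed, and the input flattened to a plain Vec<f32> for simplicity:

```rust
use ic_cdk::query;
use std::cell::RefCell;
use tract_onnx::prelude::*;

// Runnable tract model; assumed to be built elsewhere (in init or an update call).
type Model = SimplePlan<TypedFact, Box<dyn TypedOp>, TypedModel>;

thread_local! {
    static MODEL: RefCell<Option<Model>> = RefCell::new(None);
}

/// Read-only inference in a query call: no state is modified, and tract's
/// Wasm SIMD support keeps the forward pass reasonably fast.
#[query]
fn infer(input: Vec<f32>) -> Vec<f32> {
    MODEL.with(|m| {
        let guard = m.borrow();
        let model = guard.as_ref().expect("model not loaded");
        // Assumed single 1 x N input; a real canister would match its model's shape.
        let len = input.len();
        let tensor: Tensor = tract_ndarray::Array2::from_shape_vec((1, len), input)
            .expect("bad shape")
            .into();
        let outputs = model.run(tvec!(tensor.into())).expect("inference failed");
        outputs[0]
            .to_array_view::<f32>()
            .expect("f32 output")
            .iter()
            .copied()
            .collect()
    })
}
```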
Unfortunately, there are no such libraries for Motoko. But there is hope: we're exploring the Wasm component model, which should allow linking Rust libraries into Motoko.
Shall look forward to this, thanks!