Llama2-7B is definitely a milestone goal for canister LLMs. But in terms of “usable” LLMs for applications (e.g. NPCs), there are more compact LLMs that are useful enough for generative text/conversational tasks.
The TinyLlama 1.1B project follows Llama2’s architecture and tokenizer format (which makes fine-tuning/Q-LoRA easier) and is being trained on 3 trillion tokens, with all the speed optimizations and performance hacks for inferencing. Using the llama.cpp framework, a Mac M2 with 16GB RAM generates 71.8 tokens/second. That’s really fast for non-GPU inferencing.
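For anyone who wants to try that locally, the llama.cpp run is a single command. A minimal sketch, assuming you already have a GGUF export of TinyLlama on disk (the model file name/path below is just a placeholder):

```bash
# Sketch: run TinyLlama through llama.cpp's CLI example.
# -m : path to the GGUF model file (placeholder name below)
# -p : starting prompt
# -n : number of tokens to generate
# -t : CPU threads to use
./main -m ./models/tinyllama-1.1b.gguf \
       -p "Once upon a time" \
       -n 128 \
       -t 8
```

(Newer llama.cpp builds may ship the CLI under a different binary name, so adjust accordingly.)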
Given its performance relative to its parameter size, I’d imagine this will be the model to test canister LLMs with.
An LLM of this size generates a very coherent story and doesn’t suffer from some of the deficiencies of the 15M model:
--------------------------------------------------
Generate a new story using llama2_110M, 10 tokens at a time, starting with an empty prompt.
(variant { ok = 200 : nat16 })
(variant { ok = "Once upon a time, there was a little girl" })
(variant { ok = " named Lily. She loved to play outside in the" })
(variant { ok = " sunshine. One day, she saw a big" })
(variant { ok = ", red apple on a tree. She wanted to eat" })
(variant { ok = " it, but it was too high up.\nL" })
(variant { ok = "ily asked her friend, a little bird, \"Can" })
(variant { ok = " you help me get the apple?\"\nThe bird said" })
(variant { ok = ", \"Sure, I can fly up and get" })
(variant { ok = " it for you.\"\nThe bird flew up to" })
(variant { ok = " the apple and pecked it off the tree." })
(variant { ok = " Lily was so happy and took a big bite" })
(variant { ok = ". But then, she saw a bug on the apple" })
(variant { ok = ". She didn\'t like bugs, so she threw" })
(variant { ok = " the apple away.\nLater that day, L" })
(variant { ok = "ily\'s mom asked her to help with the la" })
(variant { ok = "undry. Lily saw a shirt that was" })
(variant { ok = " too big for her. She asked her mom, \"" })
(variant { ok = "Can you make it fit me?\"\nHer mom said" })
(variant { ok = ", \"Yes, I can make it fit you.\"" })
(variant { ok = "\nLily was happy that her shirt would fit" })
(variant { ok = " her. She learned that sometimes things don\'t fit" })
(variant { ok = ", but there is always a way to make them fit" })
(variant { ok = "." })
(variant { ok = "" })
--------------------------------------------------
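For reference, output like the above is just the result of repeatedly calling the canister’s update method with a small step count until it returns an empty string. A rough sketch with dfx; the method names and record fields below are assumptions for illustration, not the canister’s actual interface:

```bash
# Start a fresh story (method name assumed).
dfx canister call llama2_110M new_chat '()'

# Ask for 10 tokens at a time with an empty prompt; repeat this call until the
# canister returns an empty string, which means the story is finished.
dfx canister call llama2_110M inference \
  '(record { prompt = ""; steps = 10 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64 })'
```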
Will expose it in the front end shortly, after more testing. Something is not working yet when you use a non-empty prompt.
The 42M model is a lot better than the 15M model, and it determines by itself when it has completed a story. For example, the prompt “Bobby wanted to catch a big fish” results in a generated story that ends on its own.
Response time of this model is almost as good as the 15M model’s, and the words stream onto the screen very naturally.
However, even though the stories are a lot better than the 15M model’s, this 42M model still throws some nonsensical stuff in there. For example, the prompt “Bobby was playing a card game with his friend” results in a pretty good story, but some of the comprehension is not right and there is a repeat in the middle. This should go away with the 110M model.