About CMake, yes, I initially was using that, but it resulted in a very poor developer experience. The icpp-pro CLI takes care of a lot of boilerplate to build a wasm from a C++ project. I hope you got it going…
Thanks, @icpp!
I was thinking about a Wasm binary that inlines both the tokenizer and model data to simplify the setup steps:
- canister_init(): initializes everything including the tokenizer, model, chat mode, etc.
- run(): performs a fixed inference load.
With such a Wasm binary, the benchmark becomes simple: install the canister and call run() without uploading data. Do you think you could prepare something like that?
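Roughly, the benchmark run would then look something like this (the canister and endpoint names below are just placeholders for whatever you end up exposing):
# install the canister; canister_init() sets up the tokenizer, model, chat mode, etc.
dfx deploy llama2_benchmark
# perform the fixed inference load
dfx canister call llama2_benchmark run '()'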
In the meantime, I managed to follow your instructions and the ones in README to run the inference in a test environment with a custom replica.
The results:
- Initialization instructions: 2_901_171_142 before => 2_901_171_142 after (no change)
- Inference instructions: 23_892_491_273 before => 8_015_185_548 after (~3x reduction)
It was done with an empty prompt, temperature = 0.1, topp = 0.9, rng_seed = 0.
Both runs produced the same result:
"Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, she saw a big, red ball and she wanted to play with it. But the ball was too high for her to reach.\nLily saw a boy named Max playing with
It is interesting that the 3x improvement is similar to the image classification benchmark.
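For reference, the inference call matching those settings (using the inference endpoint and the 60-step limit discussed further below) looks roughly like this:
# empty prompt, 60 steps, temperature = 0.1, topp = 0.9, rng_seed = 0
dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'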
Wow!!!
That is amazing. This is going to help a lot in improving the response times and being able to run more powerful LLMs.
LLMs at their core are just neural networks that do the same kind of calculations as image classification, so this outcome confirms that any kind of AI task will benefit from this.
Thank you for working through the build, deploy and upload steps to be able to perform the test.
I also now understand what you mean with a standalone wasm that includes everything. Let me look into that. It should be possible.
Is there a shared drive where I can upload the wasm?
@ulan ,
One more item that would be a really powerful demonstration of the impact of this improvement in instructions is the following.
- the story generated by the dfx call is not finished. We only ask for 60 steps, or tokens. If you call it again, you will see that the LLM continues the story. You have to call it multiple times, until the LLM returns fewer than the requested 60 tokens. At that point the LLM is no longer generating new tokens and will always return the empty string; you can confirm this by calling it again and seeing the empty string come back. To start a new story, you call the new_chat endpoint, which resets the orthogonally persisted run state (see the sketch after this list).
- now, we chose the value of 60 steps because if you go higher, say 100 tokens, you run into the instruction limit. It would be interesting to see how many steps can be generated with one update call.
- the end goal is to actually be able to generate a full story with a single query call, and this improvement is a big step towards that.
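For example, a rough shell loop along these lines keeps asking for 60 more tokens until the story is finished (the check against '("")' assumes the endpoint returns a plain Candid text value; adjust it to the actual return format):
# start a new story
dfx canister call llama2_15M new_chat '()'
# keep requesting 60 more tokens until the canister returns the empty string
while true; do
  out=$(dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})')
  echo "$out"
  # assumption: a finished story comes back as an empty text value
  if [ "$out" = '("")' ]; then break; fi
done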
I tried 200 tokens in the optimized replica and here is the generated story:
Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, she saw a big, red ball and she wanted to play with it. But the ball was too high for her to reach.\nLily saw a boy named Max playing with a ball. She asked him, \"Can you help me get the ball?\" Max said, \"Sure, I can help you.\" Max picked up the ball and gave it to Lily. She was so happy and said, \"Thank you, Max! You are so generous!\"\nLily and Max played with the ball for a while. They had so much fun together. When it was time to go home, Lily said, \"Thank you for helping me, Max. I had a great time playing with you.\" Max replied, \"You're welcome, Lily. I had fun too. Let's play again soon
It used 25_973_374_060 instructions, so it will fit within the instruction limit of an update message.
Increasing above 200 tokens (e.g. 300, 400) doesn’t change the story or the instruction count.
@ulan ,
That is really great news. Meaning we can now generate that story with one update call. Before, it took 4 update calls.
The length of the story generated varies based on the setting of the temperature and the prompt.
Can you try this? It varies from call to call, but this typically generates a longer story:
dfx canister call llama2_15M new_chat '()'
dfx canister call llama2_15M inference '(record {prompt = "Joe loves writing stories" : text; steps = 600 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
The temperature setting introduces variability in the sampling of the next token to be generated. Sometimes the LLM goes off the rails and creates a crazy story, but most of the time it works really well, and you get a wide variety of creative, but still comprehensible stories.
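For example, raising the temperature in the same call (the 0.9 below is just an illustrative value) gives noticeably more varied stories from run to run:
dfx canister call llama2_15M new_chat '()'
dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 600 : nat64; temperature = 0.9 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'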
Something seems off. I tried that and some other prompts and temperatures. All of them return the original prompt as the result.
Only empty prompts return stories. @icpp: any idea what could be wrong?
@ulan ,
Oh, I am sorry, I forgot that a non-empty prompt returns only that prompt. It kind of primes the story generation, and you then have to follow it up with one more call to continue the story generation. All of this rather tedious logic was created to avoid running into the instruction limit.
This is the sequence of calls to make:
# start a new story
dfx canister call llama2_15M new_chat '()'
# run the prompt through the LLM
dfx canister call llama2_15M inference '(record {prompt = "Joe loves writing stories" : text; steps = 600 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
# continue the story generation
dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 600 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
No problem. That makes sense now. Here are the results:
- The initial prompt instructions: 657_012_786
- The second call instructions: 19_011_382_852
"Joe loves writing stories"
". He has a special book called a journal. He writes stories about his adventures and his friends.\nOne day, Joe's mom asked him to write a story about his adventures. Joe was so excited! He wrote a story about his adventures and wrote it in his journal.\nJoe's mom was very proud of him. She said, \"You are so creative, Joe! I love your stories!\" Joe smiled and said, \"Thank you, Mom!\"\nJoe was so happy that he had written a story about his adventures. He was so proud of himself. He knew that he would write more stories in his journal."
Hi everyone,
This Thursday we’ll have @stefan.schneider talking about the new storage layer! Come hear how it’ll improve Scalability and Performance!
Thursday May 16th at 5:30 pm CEST Zoom link
Hi all!
We’re going to skip the meeting this month because of some scheduling conflicts, but we’ll be back in July!
Reminder that we’ll be meeting today, 5:30 pm CEST Zoom link
@dsarlis, @stefan.schneider, and @pakhomov-dfinity1 will present on the ongoing work to scale the number of canisters with the eventual goal to support 1M canisters on a single subnet.
Sorry how do we access the recording?
In general, the recordings are linked in the notes doc.
Here’s the recording from yesterday’s meeting: Zoom link
The meeting on August 15th will be canceled due to the vacation season. Enjoy the summer!
Hi everyone,
Our next meeting will be delayed a week to Thursday September 26th. We’ll have Ognjen Maric talking about the new best-effort messaging model!
Thursday September 26th at 5:30 pm CEST Zoom link
Actually we’ll move the meeting up 30 minutes this week to avoid a conflict:
Thursday September 26th at 5:00 pm CEST Zoom link
Hi, @abk I just joined this meeting and the room was closed! Can you open it back up?
Wasn’t expecting it to be moved by 30 min
Ah I think we have to bump it to next week anyway. Our speaker got sick last minute.
Sorry for the scheduling issues everyone. We’re now set to have Ognjen talking tomorrow at the normal time:
Thursday October 3rd at 5:30 pm CEST Zoom link