We need to reschedule this month’s WG by one week to next Thursday, January 25th. Sorry for the inconvenience. The planned agenda is to give an update on Backup & Restore.
As always, please reach out if you have any requests or ideas for upcoming sessions.
I am working on the next version of ICGPT, a playground for on-chain Llama2 models, and I am looking into performance improvements by using horizontal scaling with a load-balancer in front of multiple LLMs.
I was surprised by the non-linear scaling with the experimental load-balancer and would love to get a full understanding of why this is. Would I be able to join one of the upcoming WG meetings to ask some questions?
Thank you for today’s review of my scalability tests for ICGPT.
The most important learning is to limit the number of LLMs per subnet to 4.
This is because a subnet uses 4 threads for update calls, so going above that will not help.
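As a concrete illustration of that limit at deploy time, here is a minimal sketch. The canister names (llm_0..llm_3, load_balancer) and the register_llm method are made up for the sketch, not the actual ICGPT code, and the canisters are assumed to be defined in dfx.json:

```bash
#!/bin/bash
# Sketch only: canister names and the register_llm method are illustrative,
# not the actual ICGPT code. Assumes the canisters are defined in dfx.json.
NUM_LLMS=4   # at most one LLM canister per update-call thread of the subnet

dfx deploy load_balancer

for ((i = 0; i < NUM_LLMS; i++)); do
  dfx deploy "llm_${i}"
  llm_id=$(dfx canister id "llm_${i}")
  # Register the LLM with the LoadBalancer so it can spread update calls.
  dfx canister call load_balancer register_llm "(principal \"${llm_id}\")"
done
```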
I reran my tests, and I am now indeed getting very consistent & excellent results up to 8 concurrent users.
In this graph I am plotting the max duration of the story generation, i.e. this is the longest a user will have to wait for their story to be completed. The lower the number the better, and the red curve for 4 LLMs is really good!
I’d like to follow up on this post in the Technical Working Group DeAI. @patnorris mentioned that you had some really exciting news about large improvements in floating point computations that could be factors faster.
If you are interested, I can make my test environment available to you. It is in a private repository right now, with bash scripts to:
deploy, load & configure the LLMs and the LoadBalancer
run the tests for different numbers of LLMs and different numbers of concurrent users (a simplified sketch follows below).
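For context, a simplified sketch of what such a test driver could look like; the script names deploy_llms.sh and generate_story.sh are placeholders, not the actual contents of the private repo:

```bash
#!/bin/bash
# Simplified sketch of the test driver; script names are placeholders for
# the actual scripts in the private repo.
for num_llms in 1 2 4 8; do
  ./deploy_llms.sh "${num_llms}"       # deploy & configure the LLMs + LoadBalancer
  for num_users in 1 2 4 8; do
    start=$(date +%s)
    for ((u = 0; u < num_users; u++)); do
      ./generate_story.sh &            # each background job = one concurrent user
    done
    wait                               # finishes when the slowest user is done
    echo "llms=${num_llms} users=${num_users} max_duration=$(( $(date +%s) - start ))s"
  done
done
```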
Alternatively, if you can give me early access to the new capabilities, I can also run tests and share the results with you.
I guess I should be able to simply extract a standalone LLM canister from it, right? If so, I’ll give it a try and compare how much improvement the FP optimization brings there.
> if you can give me early access to the new capabilities, I can also run tests and share the results with you.
It would require building a custom replica from source that uses an unreleased version of Wasmtime, and then a custom dfx based on that replica. If you’re familiar with the build process, then I can send you a patch for the replica code.
I want to let you know that the icpp_llm repo includes 260K and 15M parameter models and tokenizer.
If you like, I can also provide you with 42M and 110M parameter models.
They used to be able to predict one token before hitting the instruction limit, but after the revamp of how instructions are counted, that is no longer the case.
I would be very interested in learning if the new approach will allow these larger models to run.
They use the same wasm; you just upload a different model and tokenizer. I’d be happy to drop them in a shared drive.
And if that works, what about a 7B model? I could prepare that too… It might just fit in memory, but I never tried because of the instruction limit.
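To make the “same wasm, different weights” point concrete, the flow could look roughly like this. upload_model.sh / upload_tokenizer.sh stand in for whatever upload tooling the icpp_llm repo actually ships, and the stories*.bin filenames follow the llama2.c naming convention:

```bash
#!/bin/bash
# Sketch only: upload_model.sh / upload_tokenizer.sh are placeholders for the
# repo's own upload tooling; filenames follow the llama2.c convention.
dfx deploy llm                           # one wasm, reused for every model size

# start with the 15M parameter checkpoint
./upload_model.sh     llm stories15M.bin
./upload_tokenizer.sh llm tokenizer.bin

# to try the 42M (or 110M) checkpoint, keep the wasm and swap the uploads
./upload_model.sh     llm stories42M.bin
./upload_tokenizer.sh llm tokenizer.bin
```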
I am not aware of a public doc that describes how to build a custom dfx with custom replica. I’ll try to write a doc explaining the steps when I get free time.
> I want to let you know that the icpp_llm repo includes 260K and 15M parameter models and tokenizer.
That’s great, thanks a lot! I’ll try this first and report here. After that we can discuss larger models.
@icpp: would you mind building a Wasm binary that has a run() endpoint that does some fixed amount of inference with the 260K or 15M model? I tried to do it myself, but you’ll probably be much faster since you already know Candid in C++.
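Once such a run() endpoint exists, a rough benchmark loop could be as simple as the sketch below; the endpoint name is the one requested above, not something already in the repo, and since wall-clock time includes consensus latency, comparing instruction counts (ic0.performance_counter) would isolate the floating point speedup more cleanly:

```bash
#!/bin/bash
# Sketch: times the requested run() endpoint, which would do a fixed amount
# of inference per call. Wall-clock time includes consensus latency, so
# instruction counts are the more precise before/after metric.
for i in 1 2 3 4 5; do
  start=$(date +%s%N)                  # GNU date; nanosecond resolution
  dfx canister call llm run '()'
  end=$(date +%s%N)
  echo "run ${i}: $(( (end - start) / 1000000 )) ms"
done
```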
Out of curiosity: have you considered using standard CMake for the C++ CDK? For many C++ developers, a dependency on a custom Python package might be a disadvantage.