Technical Working Group: Scalability & Performance

We need to reschedule this month’s WG by one week, to next Thursday, January 25th. Sorry for the inconvenience. The planned agenda is an update on Backup & Restore.

As always, please reach out if you have any requests or ideas for upcoming sessions.

Looking forward to seeing you next week!

4 Likes

Hi everyone,
Tomorrow @Alexandra will be talking about canister Backup & Restore.

Thursday, January 25th at 5:30 pm CET (Zoom link)

6 Likes

Hi everyone,
In two days @free will be discussing the new Scalable Messaging Model. Hope to see you there!

Thursday, February 15th at 5:30 pm CET (Zoom link)

@here Reminder that this is happening today.

1 Like

Recording and transcript of @free’s presentation have been added to the meeting notes.

1 Like

For the next WG meeting (March 14), we’ll have a joint session with the DeAI working group, run as an open Q&A (description here).

1 Like

I am working on the next version of ICGPT, a playground for on-chain Llama2 models, and I am looking into performance improvements by using horizontal scaling with a load-balancer in front of multiple LLMs.

I was surprised by the non-linear scaling with the experimental load-balancer and would love to fully understand why this happens. Would I be able to join one of the upcoming WG meetings to ask some questions?
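
For context, this is roughly how I measure a single data point (a minimal sketch; the canister names and the direct round-robin dispatch here are simplified for illustration, since my real tests go through the load-balancer canister):

#!/usr/bin/env bash
# Minimal sketch: fire N concurrent "users" (background dfx calls) at the
# LLM canisters and time how long the slowest story takes.
set -eu

CANISTERS=(llm_0 llm_1 llm_2 llm_3)   # assumed canister names
N_USERS=8

start=$(date +%s)
for ((i = 0; i < N_USERS; i++)); do
  target=${CANISTERS[i % ${#CANISTERS[@]}]}   # spread the users over the LLM canisters
  (
    dfx canister call "$target" new_chat '()'
    dfx canister call "$target" inference \
      '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
  ) &
done
wait   # returns once the slowest user has their story
echo "Longest wait for ${N_USERS} concurrent users: $(( $(date +%s) - start ))s"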

1 Like

This week, we’ll have a joint meeting with the new Inter-Canister Event Utility WG.

@skilesare and collaborators will present their first designs and results. We’ll have @abk, @ulan, and @free from DFINITY to provide feedback.

@icpp I’m pretty sure that we can also reserve 15 minutes at the end to cover your questions.

Looking forward to seeing you this Thursday, April 18th at 5:30pm CEST (Zoom link).

5 Likes

Thank you for today’s review of my scalability tests for ICGPT.

The most important takeaway is to limit the number of LLMs per subnet to 4.

This is because a subnet uses 4 threads for update calls, so going above that will not help.

I reran my tests, and I am now indeed getting very consistent & excellent results up to 8 concurrent users.

In this graph I am plotting the max duration of the story generation, i.e. this is the longest a user will have to wait for their story to be completed. The lower the number the better, and the red curve for 4 LLMs is really good!

I also tried to go to 16 concurrent users but that resulted in timeouts and unsuccessful update calls.

3 Likes

Hi @ulan,

I’d like to follow up on this post in the Technical Working Group DeAI. @patnorris mentioned that you had some really exciting news about large improvements in floating-point computations that could make them several times faster.

If you are interested, I can make my test environment available to you. It is in a private repository right now, with bash scripts to:

  • deploy, load & configure the LLMs and the LoadBalancer
  • run the tests for different numbers of LLMs and concurrent users (a rough sketch of the sweep loop is below).
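
The sweep itself is just a nested loop, something like this (deploy.sh and run_test.sh are illustrative stand-ins, not the actual script names in the private repo):

#!/usr/bin/env bash
# Outer sweep over the two test dimensions; the two helper scripts are
# placeholders for the bash scripts in the private repo.
set -eu

for n_llms in 1 2 4; do
  ./deploy.sh --num-llms "$n_llms"   # deploy, load & configure the LLMs and the LoadBalancer
  for n_users in 1 2 4 8 16; do
    ./run_test.sh --num-llms "$n_llms" --concurrent-users "$n_users" >> results.csv
  done
done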

Alternatively, if you can give me early access to the new capabilities, I can also run the tests and share the results with you.

Hi @icpp. There will be a presentation in the upcoming Public Global R&D on April 24 about the floating point optimization.

If you are interested, I can make my test environment available to you.

@patnorris shared a link to this repository

I guess I should be able to simply build a standalone LLM canister from it, right? If so, I’ll give it a try and compare how much improvement the FP optimization brings there.

if you can give me early access to the new capabilities, I can also run tests and share the results with you.

It would require building a custom replica from source that uses an unreleased version of Wasmtime, and then a custom dfx based on that replica. If you’re familiar with the build process, then I can send you a patch for the replica code.
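
Very roughly, the steps would look something like this (not an official recipe; the patch file name and the Bazel target below are placeholders, and the way to wire the custom replica into dfx may differ):

git clone https://github.com/dfinity/ic.git && cd ic
git apply wasmtime-update.patch   # placeholder name for the patch I'd send you
bazel build //rs/replica          # exact Bazel target may differ
# then build or configure a dfx that picks up this replica binary, e.g. by
# replacing the replica in the local dfx cache (see `dfx cache show`)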

1 Like

Great session yesterday. You can find the transcript and the recording in the meeting notes.

1 Like

Hi @ulan,

@patnorris shared a link to this repository

Yes, you can indeed build a standalone LLM canister using the instructions in that repo. Very interested to see the impact of the FP optimization.
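
In broad strokes it looks like this (a sketch only; the sub-folder name and setup step are assumptions, and the repo’s README has the authoritative instructions):

# starting from a clone of the icpp_llm repo
cd icpp_llm/llama2_c              # assumed sub-folder containing the llama2 canister
pip install -r requirements.txt   # installs icpp-pro (assumed setup step)
icpp build-wasm                   # compile the C++ canister to Wasm
dfx deploy                        # deploy to a local replica
# then upload a model & tokenizer per the README and call new_chat / inference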

If you’re familiar with the build process, then I can send you a patch for the replica code.

I am not yet familiar with that build process, but I am willing to put in the effort to learn it. Is there a description of the build process?

@domwoe I tried to watch the recording and got this error

Thanks for the heads up. Should be fixed now :pray:

1 Like

Confirmed, works now!

1 Like

@ulan,

I want to let you know that the icpp_llm repo includes the 260K and 15M parameter models and a tokenizer.

If you like, I can also provide you with 42M and 110M parameter models.

They used to run and could predict one token before hitting the instruction limit, but after the revamp of how instructions are counted, that is no longer the case.

I would be very interested in learning if the new approach will allow these larger models to run.

They use the same Wasm; you just upload a different model and tokenizer (rough sketch below). I’d be happy to drop them in a shared drive.

And if that works, what about a 7B model? I could prepare that too… It might just fit in memory, but I never tried because of the instruction limit.
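
To illustrate what swapping models looks like (the upload helper and file names below are placeholders, not the repo’s actual interface):

dfx deploy llama2                                # same Wasm for every model size
./upload_model.sh stories15M.bin tokenizer.bin   # placeholder for the repo's upload step
# swap in the 42M or 110M model files and re-run the upload to try the larger models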

3 Likes

I am not aware of a public doc that describes how to build a custom dfx with a custom replica. I’ll try to write a doc explaining the steps when I find some free time.

I want to let you know that the icpp_llm repo includes 260K and 15M parameter models and tokenizer.

That’s great, thanks a lot! I’ll try this first and report here. After that we can discuss larger models.

2 Likes

@icpp: would you mind building a Wasm binary that has a run() endpoint that does some fixed amount of inference with the 260K or 15M model? I tried to do it myself, but you’ll probably be much faster since you already know Candid in C++.

Out of curiosity: have you considered using standard CMake for the C++ CDK? For many C++ developers a dependency on a custom Python package might be a disadvantage.

Hi @ulan,

If you call the inference endpoint with the temperature equal to 0, it will always do the same amount of computation.

Like this:

dfx canister call llama2_15M new_chat '()'
dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'

Let me know if this is what you’re looking for; otherwise I can build a custom run() method for you as well…

1 Like