Technical Working Group: Scalability & Performance

We need to reschedule this month’s WG by one week, to next Thursday, January 25th. Sorry for the inconvenience. The planned agenda is an update on Backup & Restore.

As always, please reach out if you have any requests or ideas for upcoming sessions.

Looking forward to seeing you next week!

4 Likes

Hi everyone,
Tomorrow @Alexandra will be talking about canister Backup & Restore.

Thursday, January 25th at 5:30 pm CET (Zoom link)

6 Likes

Hi everyone,
In two days @free will be discussing the new Scalable Messaging Model. Hope to see you there!

Thursday, February 15th at 5:30 pm CET (Zoom link)

@here Reminder that this is happening today.

1 Like

Recording and transcript of @free’s presentation have been added to the meeting notes.

1 Like

For the next WG meeting (March 14), we’ll have a joint session with the DeAI working group, run as an open Q&A (description here).

1 Like

I am working on the next version of ICGPT, a playground for on-chain Llama2 models, and I am looking into performance improvements by using horizontal scaling with a load-balancer in front of multiple LLMs.

I was surprised by the non-linear scaling with the experimental load-balancer and would love to fully understand why this happens. Would I be able to join one of the upcoming WG meetings to ask some questions?
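
For context, this is roughly how I measure a single data point (a minimal sketch; the canister names and the direct round-robin dispatch here are simplified for illustration, since my real tests go through the load-balancer canister):

#!/usr/bin/env bash
# Minimal sketch: fire N concurrent "users" (background dfx calls) at the
# LLM canisters and time how long the slowest story takes.
set -eu

CANISTERS=(llm_0 llm_1 llm_2 llm_3)   # assumed canister names
N_USERS=8

start=$(date +%s)
for ((i = 0; i < N_USERS; i++)); do
  target=${CANISTERS[i % ${#CANISTERS[@]}]}   # spread the users over the LLM canisters
  (
    dfx canister call "$target" new_chat '()'
    dfx canister call "$target" inference \
      '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
  ) &
done
wait   # returns once the slowest user has their story
echo "Longest wait for ${N_USERS} concurrent users: $(( $(date +%s) - start ))s"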

1 Like

This week, we’ll have a joint meeting with the new Inter-Canister Event Utility WG.

@skilesare and collaborators will present their first designs and results. We’ll have @abk, @ulan, and @free from DFINITY to provide feedback.

@icpp I’m pretty sure that we can also reserve 15 minutes at the end to cover your questions.

Looking forward to seeing you this Thursday, April 18th at 5:30pm CEST (Zoom link).

5 Likes

Thank you for today’s review of my scalability tests for ICGPT.

The most important takeaway is to limit the number of LLMs per subnet to 4.

This is because a subnet uses 4 threads for update calls, so going above that will not help.

I reran my tests, and I am now indeed getting very consistent & excellent results up to 8 concurrent users.

In this graph I am plotting the max duration of the story generation, i.e. this is the longest a user will have to wait for their story to be completed. The lower the number the better, and the red curve for 4 LLMs is really good!

I also tried to go to 16 concurrent users but that resulted in timeouts and unsuccessful update calls.

3 Likes

Hi @ulan,

I’d like to follow up on this post in the Technical Working Group DeAI. @patnorris mentioned that you had some really exciting news about large improvements in floating-point computations that could make them several times faster.

If you are interested, I can make my test environment available to you. It is in a private repository right now, with bash scripts to:

  • deploy, load & configure the LLMs and the LoadBalancer
  • run the tests for different numbers of LLMs and concurrent users (a rough sketch of the sweep loop is below).
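
The sweep itself is just a nested loop, something like this (deploy.sh and run_test.sh are illustrative stand-ins, not the actual script names in the private repo):

#!/usr/bin/env bash
# Outer sweep over the two test dimensions; the two helper scripts are
# placeholders for the bash scripts in the private repo.
set -eu

for n_llms in 1 2 4; do
  ./deploy.sh --num-llms "$n_llms"   # deploy, load & configure the LLMs and the LoadBalancer
  for n_users in 1 2 4 8 16; do
    ./run_test.sh --num-llms "$n_llms" --concurrent-users "$n_users" >> results.csv
  done
done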

Alternatively, if you can give me early access to the new capabilities, I can also run the tests and share the results with you.

Hi @icpp. There will be a presentation in the upcoming Public Global R&D on April 24 about the floating point optimization.

If you are interested, I can make my test environment available to you.

@patnorris shared a link to this repository

I guess I should be able to simply build a standalone LLM canister from it, right? If so, I’ll give it a try and compare how much improvement the FP optimization brings there.

if you can give me early access to the new capabilities, I can also run tests and share the results with you.

It would require building a custom replica from source that uses an unreleased version of Wasmtime, and then a custom dfx based on that replica. If you’re familiar with the build process, then I can send you a patch for the replica code.
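
Very roughly, the steps would look something like this (not an official recipe; the patch file name and the Bazel target below are placeholders, and the way to wire the custom replica into dfx may differ):

git clone https://github.com/dfinity/ic.git && cd ic
git apply wasmtime-update.patch   # placeholder name for the patch I'd send you
bazel build //rs/replica          # exact Bazel target may differ
# then build or configure a dfx that picks up this replica binary, e.g. by
# replacing the replica in the local dfx cache (see `dfx cache show`)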

1 Like

Great session yesterday. You can find the transcript and the recording in the meeting notes.

1 Like

Hi @ulan,

@patnorris shared a link to this repository

Yes, you can indeed build a standalone LLM canister using the instructions in that repo. Very interested to see the impact of the FP optimization.
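
In broad strokes it looks like this (a sketch only; the sub-folder name and setup step are assumptions, and the repo’s README has the authoritative instructions):

# starting from a clone of the icpp_llm repo
cd icpp_llm/llama2_c              # assumed sub-folder containing the llama2 canister
pip install -r requirements.txt   # installs icpp-pro (assumed setup step)
icpp build-wasm                   # compile the C++ canister to Wasm
dfx deploy                        # deploy to a local replica
# then upload a model & tokenizer per the README and call new_chat / inference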

If you’re familiar with the build process, then I can send you a patch for the replica code.

I am not yet familiar with that build process, but I am willing to put in the effort to learn it. Is there a description of the build process?

@domwoe I tried to watch the recording and got this error

Thanks for the heads up. Should be fixed now :pray:

1 Like

Confirmed, works now!

1 Like

@ulan,

I want to let you know that the icpp_llm repo includes the 260K and 15M parameter models and a tokenizer.

If you like, I can also provide you with 42M and 110M parameter models.

They used to run and could predict one token before hitting the instruction limit, but after the revamp of how instructions are counted, that is no longer the case.

I would be very interested in learning if the new approach will allow these larger models to run.

They use the same Wasm; you just upload a different model and tokenizer (rough sketch below). I’d be happy to drop them in a shared drive.

And if that works, what about a 7B model? I could prepare that too… It might just fit in memory, but I never tried because of the instruction limit.
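
To illustrate what swapping models looks like (the upload helper and file names below are placeholders, not the repo’s actual interface):

dfx deploy llama2                                # same Wasm for every model size
./upload_model.sh stories15M.bin tokenizer.bin   # placeholder for the repo's upload step
# swap in the 42M or 110M model files and re-run the upload to try the larger models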

3 Likes

I am not aware of a public doc that describes how to build a custom dfx with a custom replica. I’ll try to write a doc explaining the steps when I find some free time.

I want to let you know that the icpp_llm repo includes 260K and 15M parameter models and tokenizer.

That’s great, thanks a lot! I’ll try this first and report here. After that we can discuss larger models.

2 Likes

@icpp: would you mind building a Wasm binary that has a run() endpoint that does some fixed amount of inference with the 260K or 15M model? I tried to do it myself, but you’ll probably be much faster since you already know Candid in C++.

Out of curiosity: have you considered using standard CMake for the C++ CDK? For many C++ developers a dependency on a custom Python package might be a disadvantage.

Hi @ulan,

If you call the inference endpoint with the temperature equal to 0, it will always do the same amount of computation.

Like this:

dfx canister call llama2_15M new_chat '()'
dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.0 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'

Let me know if this is what you’re looking for; otherwise I can build a custom run() method for you as well…

1 Like