AI and machine learning on the IC?

I found this LLaMA 7B Rust implementation using dfdx.

1 Like

It consumes a lot of cycles to upload the model to the IC.

LLaMA-7B takes ~12 GB.

Even with 4-bit quantization, it probably wouldn’t fit in the canister’s heap memory.

Our current best bet is using something like Flan-T5:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

line = "answer the following question reasoning step by step: joe has 22 apples, he gave away half and ate 6 apples how many apples does he have"
# model_name = "google/flan-t5-small" # ~850 MB peak RAM usage, model is ~350 MB
model_name = "google/flan-t5-base"    # ~2.1 GB peak RAM usage, model is ~900 MB

config = GenerationConfig(max_new_tokens=200)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

tokens = tokenizer(line, return_tensors="pt")
outputs = model.generate(**tokens, generation_config=config)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

He gave away 22 / 2 = 8 apples. He ate 6 apples so he has 8 + 6 = 10 apples. Therefore, the final answer is 10.

2 Likes

Absolutely love the convincing reasoning :smile:

4 Likes

For those who are interested…

Here’s roughly what would need to happen to get the above to work on IC:

LLMs work on a token-by-token basis. To simplify things, think of a token as if it were a syllable. So when you look at this sentence:

He gave away 22 / 2 = 8 apples. He ate 6 apples so he has 8 + 6 = 10 apples. Therefore, the final answer is 10.

you can imagine splitting the above into syllables.
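
For a concrete picture, you can run that sentence through the Flan-T5 tokenizer from the snippet above and look at the pieces it produces (the exact split depends on the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
sentence = "He gave away 22 / 2 = 8 apples. He ate 6 apples so he has 8 + 6 = 10 apples."
print(tokenizer.tokenize(sentence))
# a list of subword pieces, roughly one per "syllable",
# e.g. ['▁He', '▁gave', '▁away', '▁22', ...] -- exact pieces vary by tokenizer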

The Flan-T5 Python code above would need to be reimplemented using burn and cdk-rs (a rough pseudocode sketch follows the list below):

  1. implement two cache structures in the canister:
    a. a cache for the sentence that’s currently getting generated
    b. a special cache for embeddings
  2. build an update method fn generate in the canister that generates one syllable per update call
  3. during the update call, the newly generated “syllable” is appended to the list of generated syllables (stored in the cache from 1a)
  4. at the very end of that update call, the canister calls itself (calls the same fn generate update method) again, and this repeats until it generates the stop token (a special token used to signify the end of a generated sequence)
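
A rough pseudocode sketch of that flow (all names here are made up; the real implementation would be Rust using burn and cdk-rs):

# pseudo code -- made-up names, not a real API
SENTENCE_CACHE = []        # 1a. the "syllables" (tokens) generated so far
EMBEDDING_CACHE = {}       # 1b. cached embeddings / intermediate results
STOP_TOKEN = "</s>"        # special token marking the end of the sequence

def generate():
    # 2. one update call produces exactly one new token
    token = run_model_one_step(SENTENCE_CACHE, EMBEDDING_CACHE)
    # 3. append it to the running sentence (cache 1a)
    SENTENCE_CACHE.append(token)
    # 4. keep calling our own fn generate update method until the stop token appears
    if token != STOP_TOKEN:
        call_self("generate")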

It’s not trivial work, but also not super complex. It definitely is not going to be useful by any stretch of the imagination: too slow (the sentence above is ~30 syllables × ~4 s ≈ ~2 min), and the output quality will be terrible because we can’t fit bigger models for now. But it is still a stepping stone.

We can always pick a different idea, but we’ll be limited by hardware no matter what we pick, because all we have is CPUs without access to vectorized operations (no CUDA, no SIMD) and 4 GB of RAM (the canister heap). There is only so much that can fit inside 4 GB of RAM, and without vectorized operations, all the math for neural nets is bound to be slow. In order to get the quality of bigger models, we would need to figure out how to spread the computation among multiple canisters, but that’s a story for another day.

5 Likes

Is there something that prevents us from using the canister’s 48 GB stable memory instead of squeezing everything into the heap?

2 Likes

I would also think that you would need to reimplement how random numbers are generated in that system, because of consensus.

If the above is true, then this is non-trivial, because it would likely be a patch in an open-source system that you’d need to carry forward (unless you get that open-source system to accept your patch).

Good point, I hadn’t thought about it…

From what I understand, stable memory can imitate runtime memory, but at runtime we’re still limited by a 32-bit address space, which is 4 GB (until this), so at any given time the runtime memory can only hold 4 GB worth of data for calculations. Assuming we developed an ML inference library that natively uses something like https://crates.io/crates/ic-stable-memory, and we coded things up so that instead of loading the whole model into memory at once (as is currently done in the ML world) the library loaded weights and performed calculations layer by layer (load a layer, do its matrix multiplication, clear the runtime memory, load the next layer, and so on until it produces the output), then yes, I think it could work.

Though I think it would be super duper slow, because if we look at it from a hardware perspective, we get something like this:

# pseudo code

model_architecture = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1),
)
# weights live in stable memory; only one layer's worth is pulled into runtime memory at a time
weights_and_biases = LazyLoadFromStableMemory('model.bin')
cache__last_layer_result = initialize_from_input()

for layer in model_architecture:
    w = load_weights_for_layer(layer, weights_and_biases)    # SSD -> node RAM on every call
    cache__last_layer_result = w * cache__last_layer_result  # apply the layer, keep only the result

We would be reading from the SSD and putting data into the node’s RAM every time a call to load_weights_for_layer(layer, weights_and_biases) is made. The combined size of w and cache__last_layer_result could not exceed 4 GB at any given time.

The original size of the 65 billion parameter LLaMA model (often referred to as the 65B model) is 120 gigabytes. However, when quantized to 4-bit, the size is reduced to 38.5 gigabytes.

Therefore, if we take an example SSD listed in the node requirements (Node Provider Machine Hardware Guide - Internet Computer Wiki), the 6.4 TB NVMe Kioxia SSD 3D-NAND TLC U.3 (Kioxia CM6-V), we can see it can read up to 3,500 MB/s. So just loading and unloading the model’s weights and biases would take a total of 38,500 MB / 3,500 MB/s = 11 seconds.
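
As a quick back-of-the-envelope check (keeping in mind that autoregressive generation has to stream the full set of weights again for every new token):

# back-of-the-envelope numbers from the post above
model_size_mb = 38_500             # LLaMA 65B quantized to 4-bit
ssd_read_mb_s = 3_500              # Kioxia CM6-V sequential read speed

seconds_per_pass = model_size_mb / ssd_read_mb_s
print(seconds_per_pass)            # ~11 s just to stream the weights once

# every generated token needs another full pass over the weights, so a ~30-token
# answer would spend roughly 30 * 11 s, i.e. about 5.5 minutes, on I/O alone
print(30 * seconds_per_pass / 60)  # ~5.5 minutes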

2 Likes

Correct, though I think this part is not as difficult; we could try talking with the maintainers of Rust ML frameworks and ask them to modify their code so that we can provide a custom RNG implementation (e.g. based on some trait).

Edit: never mind; this is achievable today with burn’s Backend trait: Backend in burn::tensor::backend - Rust

3 Likes

This is an amazing thread. Kinic Developer Organization is looking into this tech.
There are models that run in constrained environments like phones. I think we can push the edge of what is possible here.

2 Likes

Do you know if we have a small model that could be deployed at the moment and fit within the 4 GB?

google/flan-t5-small · Hugging Face would fit for sure

There’s essentially no difference between heap and stable memory. They are backed by the same hardware. The only difference is that currently stable memory needs to be accessed via the system API, i.e. there’s a context switch from Wasm to Rust which adds overhead. Maybe @abk can give a bit more context on what the upcoming work on multiple memories and wasm64 brings.

4 Likes

Yes, this is correct. Wasm-native stable memory has recently been approved, which allows us to avoid the API overhead in many cases, but stable memory is still not as fast as main memory.

One reason is that stable memory uses a 64-bit address space which means we need to insert bounds checks around every access. A second difference is that we need to track which pages have been accessed and limit the total number of accessed pages (if a canister could touch all of stable memory during an execution, then a few canisters running in parallel could use up the entire replica’s memory).

These checks can both be skipped on regular memory (because the address space is 32-bit and the total size is only 4 GiB), but if we used the Wasm memory64 proposal to support a larger main memory, then we would need to add all these checks for regular memory operations.
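
To illustrate what those checks amount to on every stable-memory access, here is a toy sketch (illustrative pseudocode only, not the actual replica code; the page size and names are made up):

# pseudo code -- not the actual replica implementation
PAGE_SIZE = 4096

def stable_read(addr, memory, accessed_pages, page_limit):
    # bounds check: the 64-bit address space is larger than the memory actually backing it
    if addr >= len(memory):
        raise Exception("out of bounds")
    # track which pages this execution touched and cap the total,
    # so canisters running in parallel can't exhaust the replica's memory
    page = addr // PAGE_SIZE
    if page not in accessed_pages:
        if len(accessed_pages) >= page_limit:
            raise Exception("accessed-page limit exceeded")
        accessed_pages.add(page)
    return memory[addr]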

9 Likes

Question to you and @abk:

Let’s say I load an array of u8 integers whose total size is 48 GB into stable memory, and write a for loop that increments each u8 by 1. Roughly how many times will the node access the SSD in order to transfer the data into RAM? The way I see it, it will access the SSD at least 48 GB / 4 GB = 12 times.

1 Like

For an inference proof of concept, TinyStories is the smallest model I can think of.

Here’s a link to the datasets and the sub 10M parameter models. roneneldan/TinyStories · Datasets at Hugging Face

But if separate AI compute units are coming to the IC (as mentioned by Dom in this interview), would an inference engine like vLLM be possible if the AI compute layers do not need to pass through consensus?

2 Likes

Where are you getting the “/ 4 GB” from? If you’re sequentially copying 1 byte at a time from stable memory to main memory, the OS should be pretty good at prefetching the needed memory from disk, so I’d imagine we’d get pretty high throughput.

FYI, at Demergent Labs we’re working on Kybra, a Python CDK for the Internet Computer. We already have broad support for the Python language and a limited portion of the stdlib; we’ll soon be releasing a much more capable version of the stdlib. The end goal is for Kybra to support C extensions and the C API, which would allow data science and machine learning. This will probably be very difficult to achieve, but I’m still hopeful.

The underlying computational environment of the IC will need a lot of improvement/change to support the kind of computational workload required from the AI world though, in my assessment.

9 Likes

Has anyone tried TinyLlama yet? 1.1B parameters, Llama2 architecture and tokenizer, trained on 3 trillion tokens.

Current checkpoint available: PY007/TinyLlama-1.1B-step-50K-105b · Hugging Face
GitHub: GitHub - jzhang38/TinyLlama: The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

Since the 4-bit quantized model weights of TinyLlama consume only ~550 MB of RAM, I’d imagine the llama.cpp 4-bit quantized version would run nicely inside a canister.
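
For a quick off-chain sanity check of the 4-bit checkpoint, something like this should work locally (assuming llama-cpp-python is installed and a 4-bit quantized file of TinyLlama, e.g. in GGUF format, has been downloaded; the file name below is illustrative):

from llama_cpp import Llama

# loads the ~550 MB quantized weights into RAM and generates a short completion
llm = Llama(model_path="tinyllama-1.1b-q4_0.gguf")
out = llm("Once upon a time", max_tokens=32)
print(out["choices"][0]["text"])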

1 Like

@Gamris,
Actively working on it!

  • see icpp-llm, which allows you to run the llama2.c model in a canister. I’ve tested it on the TinyStories model for now, but I’m working on larger models.
  • see also this forum post

Is this what you’re looking for?

5 Likes