AI and machine learning on the IC?

I found this LLaMA 7B Rust implementation using dfdx.

1 Like

It consumes a lot of cycles to upload the model to the IC.

LLaMA-7B takes ~12 GB.

Even with 4-bit quantization, it probably wouldn’t fit in the canister’s heap memory.

Our current best bet is using something like Flan-T5:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

line = "answer the following question reasoning step by step: joe has 22 apples, he gave away half and ate 6 apples how many apples does he have"
# model_name = "google/flan-t5-small" # ~850 MB peak RAM usage, model is ~350 MB
model_name = "google/flan-t5-base"    # ~2.1 GB peak RAM usage, model is ~900 MB

config = GenerationConfig(max_new_tokens=200)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

tokens = tokenizer(line, return_tensors="pt")
outputs = model.generate(**tokens, generation_config=config)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

He gave away 22 / 2 = 8 apples. He ate 6 apples so he has 8 + 6 = 10 apples. Therefore, the final answer is 10.

2 Likes

Absolutely love the convincing reasoning :smile:

4 Likes

For those who are interested…

Here’s roughly what would need to happen to get the above to work on IC:

LLMs work on a token-by-token basis. To simplify things, think of a token as if it were a syllable. So when you look at this sentence:

He gave away 22 / 2 = 8 apples. He ate 6 apples so he has 8 + 6 = 10 apples. Therefore, the final answer is 10.

you can imagine splitting the above into syllables.
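
For a concrete picture, you can run that sentence through the Flan-T5 tokenizer from the snippet above and look at the pieces it produces (the exact split depends on the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
sentence = "He gave away 22 / 2 = 8 apples. He ate 6 apples so he has 8 + 6 = 10 apples."
print(tokenizer.tokenize(sentence))
# a list of subword pieces, roughly one per "syllable",
# e.g. ['▁He', '▁gave', '▁away', '▁22', ...] -- exact pieces vary by tokenizer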

The Flan-T5 Python code above would need to be reimplemented using burn and cdk-rs (a rough pseudocode sketch follows the list below):

  1. implement two cache structures in the canister:
    a. a cache for the sentence that’s currently getting generated
    b. a special cache for embeddings
  2. build an update method fn generate in the canister that generates one syllable per update call
  3. during the update call, the newly generated “syllable” is appended to the list of generated syllables (stored in the cache from 1a)
  4. at the very end of that update call, the canister calls itself (calls the same fn generate update method) again, and this repeats until it generates the stop token (a special token used to signify the end of a generated sequence)
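
A rough pseudocode sketch of that flow (all names here are made up; the real implementation would be Rust using burn and cdk-rs):

# pseudo code -- made-up names, not a real API
SENTENCE_CACHE = []        # 1a. the "syllables" (tokens) generated so far
EMBEDDING_CACHE = {}       # 1b. cached embeddings / intermediate results
STOP_TOKEN = "</s>"        # special token marking the end of the sequence

def generate():
    # 2. one update call produces exactly one new token
    token = run_model_one_step(SENTENCE_CACHE, EMBEDDING_CACHE)
    # 3. append it to the running sentence (cache 1a)
    SENTENCE_CACHE.append(token)
    # 4. keep calling our own fn generate update method until the stop token appears
    if token != STOP_TOKEN:
        call_self("generate")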

It’s not trivial work, but also not super complex. It definitely is not going to be useful by any stretch of the imagination: too slow (the sentence above is ~30 syllables × ~4 s ≈ ~2 min), and the output quality will be terrible because we can’t fit bigger models for now. But it is still a stepping stone.

We can always pick a different idea, but we’ll be limited by hardware no matter what we pick, because all we have is CPUs without access to vectorized operations (no CUDA, no SIMD) and 4 GB of RAM (the canister heap). There is only so much that can fit inside 4 GB of RAM, and without vectorized operations, all the math for neural nets is bound to be slow. In order to get the quality of bigger models, we would need to figure out how to spread the computation among multiple canisters, but that’s a story for another day.

5 Likes

Is there something that prevents us from using the canister’s 48 GB stable memory instead of squeezing everything into the heap?

2 Likes

I would also think that you would need to reimplement how random numbers are generated in that system, because of consensus.

If the above is true, then this is non-trivial, because it would likely be a patch in an open-source system that you’d need to carry forward (unless you get that open-source system to accept your patch).

Good point, I hadn’t thought about it…

From what I understand, stable memory can imitate runtime memory, but at runtime we’re still limited by a 32-bit address space, which is 4 GB (until this), so at any given time the runtime memory can only hold 4 GB worth of data for calculations. Assuming we developed an ML inference library that natively uses something like https://crates.io/crates/ic-stable-memory, and we coded things up so that instead of loading the whole model into memory at once (as is currently done in the ML world) the library loaded weights and performed calculations layer by layer (load a layer, do its matrix multiplication, clear the runtime memory, load the next layer, and so on until it produces the output), then yes, I think it could work.

Though I think it would be super duper slow, because if we look at it from a hardware perspective, we get something like this:

# pseudo code

model_architecture = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1),
)
# weights live in stable memory; only one layer's worth is pulled into runtime memory at a time
weights_and_biases = LazyLoadFromStableMemory('model.bin')
cache__last_layer_result = initialize_from_input()

for layer in model_architecture:
    w = load_weights_for_layer(layer, weights_and_biases)    # SSD -> node RAM on every call
    cache__last_layer_result = w * cache__last_layer_result  # apply the layer, keep only the result

We would be reading from the SSD and putting data into the node’s RAM every time a call to load_weights_for_layer(layer, weights_and_biases) is made. The combined size of w and cache__last_layer_result could not exceed 4 GB at any given time.

The original size of the 65 billion parameter LLaMA model (often referred to as the 65B model) is 120 gigabytes. However, when quantized to 4-bit, the size is reduced to 38.5 gigabytes.

Therefore, if we take an example SSD listed in the node requirements (Node Provider Machine Hardware Guide - Internet Computer Wiki), the 6.4 TB NVMe Kioxia SSD 3D-NAND TLC U.3 (Kioxia CM6-V), we can see it can read up to 3,500 MB/s. So just loading and unloading the model’s weights and biases would take a total of 38,500 MB / 3,500 MB/s = 11 seconds.
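
As a quick back-of-the-envelope check (keeping in mind that autoregressive generation has to stream the full set of weights again for every new token):

# back-of-the-envelope numbers from the post above
model_size_mb = 38_500             # LLaMA 65B quantized to 4-bit
ssd_read_mb_s = 3_500              # Kioxia CM6-V sequential read speed

seconds_per_pass = model_size_mb / ssd_read_mb_s
print(seconds_per_pass)            # ~11 s just to stream the weights once

# every generated token needs another full pass over the weights, so a ~30-token
# answer would spend roughly 30 * 11 s, i.e. about 5.5 minutes, on I/O alone
print(30 * seconds_per_pass / 60)  # ~5.5 minutes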

2 Likes

Correct, though I think this part is not as difficult; we could try talking with the maintainers of Rust ML frameworks and ask them to modify their code so that we can provide a custom RNG implementation (e.g. based on some trait).

Edit: never mind; this is achievable today with burn’s Backend trait: Backend in burn::tensor::backend - Rust

3 Likes

This is an amazing thread. Kinic Developer Organization is looking into this tech.
There are models that run in constrained environments like phones. I think we can push the edge of what is possible here.

2 Likes

Do you know if we have a small model that could be deployed at the moment and fit within the 4 GB?

google/flan-t5-small · Hugging Face would fit for sure

There’s essentially no difference between heap and stable memory. They are backed by the same hardware. The only difference is that currently stable memory needs to be accessed via the system API, i.e. there’s a context switch from Wasm to Rust which adds overhead. Maybe @abk can give a bit more context on what the upcoming work on multiple memories and wasm64 brings.

4 Likes

Yes, this is correct. Wasm-native stable memory has recently been approved, which allows us to avoid the API overhead in many cases, but stable memory is still not as fast as main memory.

One reason is that stable memory uses a 64-bit address space which means we need to insert bounds checks around every access. A second difference is that we need to track which pages have been accessed and limit the total number of accessed pages (if a canister could touch all of stable memory during an execution, then a few canisters running in parallel could use up the entire replica’s memory).

These checks can both be skipped on regular memory (because the address space is 32-bit and the total size is only 4 GiB), but if we used the Wasm memory64 proposal to support a larger main memory, then we would need to add all these checks for regular memory operations.
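
To illustrate what those checks amount to on every stable-memory access, here is a toy sketch (illustrative pseudocode only, not the actual replica code; the page size and names are made up):

# pseudo code -- not the actual replica implementation
PAGE_SIZE = 4096

def stable_read(addr, memory, accessed_pages, page_limit):
    # bounds check: the 64-bit address space is larger than the memory actually backing it
    if addr >= len(memory):
        raise Exception("out of bounds")
    # track which pages this execution touched and cap the total,
    # so canisters running in parallel can't exhaust the replica's memory
    page = addr // PAGE_SIZE
    if page not in accessed_pages:
        if len(accessed_pages) >= page_limit:
            raise Exception("accessed-page limit exceeded")
        accessed_pages.add(page)
    return memory[addr]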

9 Likes

Question to you and @abk:

Let’s say I load an array of u8 integers whose total size is 48 GB into stable memory, and write a for loop that increments each u8 by 1. Roughly how many times will the node access the SSD in order to transfer the data into RAM? The way I see it, it will access the SSD at least 48 GB / 4 GB = 12 times.

1 Like

For an inference proof of concept, TinyStories is the smallest model I can think of.

Here’s a link to the datasets and the sub 10M parameter models. roneneldan/TinyStories · Datasets at Hugging Face

But if separate AI compute units are coming to the IC (as mentioned by Dom in this interview), would an inference engine like vLLM be possible if the AI compute layers do not need to pass through consensus?

2 Likes

Where are you getting the “/ 4 GB” from? If you’re sequentially copying 1 byte at a time from stable memory to main memory, the OS should be pretty good at prefetching the needed memory from disk, so I’d imagine we’d get pretty high throughput.

FYI, at Demergent Labs we’re working on Kybra, a Python CDK for the Internet Computer. We already have broad support for the Python language and a limited portion of the stdlib; we’ll soon be releasing a much more capable version of the stdlib. The end goal is for Kybra to support C extensions and the C API, which would allow data science and machine learning. This will probably be very difficult to achieve, but I’m still hopeful.

The underlying computational environment of the IC will need a lot of improvement/change to support the kind of computational workload required from the AI world though, in my assessment.

9 Likes

Has anyone tried TinyLlama yet? 1.1B parameters, Llama2 architecture and tokenizer, trained on 3 trillion tokens.

Current checkpoint available: PY007/TinyLlama-1.1B-step-50K-105b · Hugging Face
GitHub: GitHub - jzhang38/TinyLlama: The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

Since the 4-bit quantized model weights of TinyLlama consume only ~550 MB of RAM, I’d imagine the llama.cpp 4-bit quantized version would run nicely inside a canister.
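
For a quick off-chain sanity check of the 4-bit checkpoint, something like this should work locally (assuming llama-cpp-python is installed and a 4-bit quantized file of TinyLlama, e.g. in GGUF format, has been downloaded; the file name below is illustrative):

from llama_cpp import Llama

# loads the ~550 MB quantized weights into RAM and generates a short completion
llm = Llama(model_path="tinyllama-1.1b-q4_0.gguf")
out = llm("Once upon a time", max_tokens=32)
print(out["choices"][0]["text"])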

1 Like

@Gamris,
Actively working on it!

  • see icpp-llm, which allows you to run the llama2.c model in a canister. I’ve tested it on the TinyStories model for now, but I’m working on larger models.
  • see also this forum post

Is this what you’re looking for?

5 Likes