Hi all, I have a question about raw_rand in ic_cdk::api::management_canister::main (Rust).
So raw_rand is a Verifiable Random Function (VRF), right?
Have a look at this thread: Is raw_rand a VRF?
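For reference, calling it from a canister looks roughly like this; a minimal sketch, assuming a recent ic_cdk (the exact module path has moved between versions):

```rust
// Minimal sketch: fetch 32 bytes of consensus-derived randomness from the management canister.
// Assumes a recent ic_cdk; the module path has changed across versions.
use ic_cdk::api::management_canister::main::raw_rand;

#[ic_cdk::update]
async fn random_bytes() -> Vec<u8> {
    let (bytes,) = raw_rand().await.expect("raw_rand failed");
    bytes // 32 bytes drawn from the subnet's random tape
}
```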
training a leading-edge, commercial large language model is very expensive (fine-tuning a LLaMA 65B variant is much easier, but it cannot be used commercially), and it is done on clusters of high-end TPUs and GPUs designed to spread the load rather than duplicate it. The Gen-2 IC nodes and blockchain consensus aren’t designed for this.
The latest language models are increasingly capable of providing instructions to weaponize dangerous viruses and perform other dire acts, so distributing the weights for inference is becoming a sensitive issue. The rapid increases in parameter count and training have resulted in sudden jumps in the ability of these models to reason. The benchmarks used to measure them are becoming significantly harder and the models are likely to surpass average human scores within the next 1-2 years. DFINITY and the 2nd gen clouds might well have to satisfy governmental scrutiny before they will be allowed to host inference weights for the newer models. I don’t really use Siri or Alexa that much, but some believe that the new models will give a significant boost to the assistants and transform user interfaces. Regardless, the backends are very likely to make heavy use of the new models. If so, then persuading one of the (commercially-licensed) models to jump to the IC, at least for inference, might be important.
perhaps an optional inference co-processor or specialized inference software for the Gen-2 nodes would help.
For now we’re trying to keep nodes pretty homogeneous because it keeps node management and node provider rewards substantially simpler. Mandating specialised HW for that seems a bit overkill for now. But for the further future, having specialised nodes sounds really useful.
it does seem likely that the move to 3nm will allow CPUs to dedicate more space to neural co-processors… as the mobile/desktop ARM designs are doing.
if ChatGPT is not hosted on the IC, then it seems that a mechanism like HTTPS outcalls would be used. If so, my understanding, from Internet Computer Loading, is that multiple requests would be sent to ChatGPT to satisfy a single request from a canister, and the results must pass a consensus check before they affect the state of the canister. ChatGPT intentionally randomizes its output to increase the appeal of the text it generates. Would this require a custom consensus check? Perhaps there are alternatives… just curious at this point.
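The usual lever here is the outcall’s transform function, which runs as a query on every replica and strips the non-deterministic parts of the response before consensus compares them. A minimal sketch, assuming a recent ic_cdk HTTPS-outcall API (the endpoint URL, JSON body, and cycles figure below are placeholders, not any provider’s real API):

```rust
// Sketch: HTTPS outcall whose transform drops headers so replicas compare identical bytes.
// The URL, request body, and cycles amount are illustrative assumptions only.
use ic_cdk::api::management_canister::http_request::{
    http_request, CanisterHttpRequestArgument, HttpHeader, HttpMethod, HttpResponse,
    TransformArgs, TransformContext,
};

#[ic_cdk::update]
async fn ask_llm(prompt: String) -> String {
    // Crude JSON escaping via {:?}; temperature 0 to reduce (not eliminate) output randomness.
    let body = format!(r#"{{"prompt": {:?}, "temperature": 0}}"#, prompt);
    let request = CanisterHttpRequestArgument {
        url: "https://example-llm-api.test/v1/completions".to_string(), // hypothetical endpoint
        method: HttpMethod::POST,
        headers: vec![HttpHeader {
            name: "Content-Type".to_string(),
            value: "application/json".to_string(),
        }],
        body: Some(body.into_bytes()),
        max_response_bytes: Some(2_000_000),
        transform: Some(TransformContext::from_name("normalize".to_string(), vec![])),
    };
    // Newer ic_cdk versions take the cycles to attach as a second argument.
    let (response,) = http_request(request, 50_000_000_000).await.expect("outcall failed");
    String::from_utf8(response.body).unwrap_or_default()
}

// Runs on every replica; dropping headers removes per-node differences (dates, request ids).
#[ic_cdk::query]
fn normalize(args: TransformArgs) -> HttpResponse {
    HttpResponse {
        status: args.response.status,
        headers: vec![],
        body: args.response.body,
    }
}
```

Even with a transform, a model that samples tokens randomly can still return different bodies to different replicas, so the upstream API would also need to pin a seed or honour idempotent requests.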
Vicuna (a 13B-parameter, LLaMA-based model) was fine-tuned at low cost. It should be possible to fine-tune or LoRA-tune on ICP and Motoko documentation, then quantize to 4 bits to shrink the model down to a canister-friendly size.
As for inference, CMU used WebGPU and Wasm to run 4-bit Vicuna over the web. A GPU with 10 GB+ VRAM can run 13B-Q4 Vicuna locally, so I’m excited to see WebLLM do the same in the browser. GitHub - mlc-ai/web-llm: Bringing large-language models and chat to web browsers. Everything runs inside the browser with no server support.
If you want to run inference on the IC, the challenge is how to generate the Candid interface file.
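If the Candid file is the specific hurdle, newer ic_cdk versions can emit it from the Rust signatures. A sketch, assuming ic_cdk 0.11+ with the export_candid! macro (the request type and the generate entry point are made-up placeholders):

```rust
// Sketch: declare the inference entry point in Rust and let Candid tooling emit the .did file.
// `InferenceRequest` and `generate` are hypothetical names for illustration.
use candid::{CandidType, Deserialize};

#[derive(CandidType, Deserialize)]
struct InferenceRequest {
    prompt: String,
    max_tokens: u32,
}

#[ic_cdk::update]
fn generate(req: InferenceRequest) -> String {
    // Placeholder: a real canister would run the quantized model here.
    format!("(echo) {}", req.prompt)
}

// Emits the Candid interface; `candid-extractor` (or dfx tooling) can write it out as a .did file.
ic_cdk::export_candid!();
```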
lol… ask permission? scrutiny from whom, 80-year-olds who have zero clue? more like do it and ask for forgiveness later. it’s first come, first served in nascent markets… we want this stuff running on the IC and controlled by the NNS (has a nice safe-guardrails spin to it lol)
who talks about mandating? why isn’t it possible to have various h/w subnets competing for rewards… instead of just going with some plain vanilla stuff… competition is always good, esp when there are subsidies… otherwise some other networks will eat our lunch and we are inflating the ICP supply for less value than we extract from having these nodes exist (bad business imho)
this is rather unfortunate… we should want these billions of fiat flowing into this very expensive activity; instead of subsidizing a bunch of nodes burning up electricity for nada, they could start earning money for actually providing computation
There’s a concept called idempotent requests, where multiple calls to the same API will produce the same result. So you’d have to have some mechanism to use the same randomness for multiple calls; otherwise there’s no way to achieve consensus between the nodes.
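A rough sketch of that idea (the header name and key derivation are assumptions, not any provider’s documented API): derive the idempotency key deterministically from data every replica already shares, so each replica’s copy of the outcall presents the same key and an idempotency-aware API can replay the same stored response.

```rust
// Sketch: deterministic "Idempotency-Key"-style header derived from shared request data.
// Assumes the sha2 and hex crates; the header name is a convention, not a specific API.
use ic_cdk::api::management_canister::http_request::HttpHeader;
use sha2::{Digest, Sha256};

fn idempotency_header(request_id: u64, body: &[u8]) -> HttpHeader {
    let mut hasher = Sha256::new();
    hasher.update(request_id.to_be_bytes()); // same logical request id on every replica
    hasher.update(body);                     // same request body on every replica
    HttpHeader {
        name: "Idempotency-Key".to_string(),
        value: hex::encode(hasher.finalize()),
    }
}
```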
The IC’s consensus does not use competition between nodes like PoW or PoS. And subnets are independent from each other in the sense that they don’t work on the same data/requests, so there’s nothing to compete over either
Does it help that LLaMA inference has been ported to Rust and can run on the CPU? GitHub - rustformers/llama-rs: Run LLaMA inference on CPU, with Rust 🦀🚀🦙
Dolly 2.0 by Databricks is fully open, so unlike LLaMA you’re not restricted to research purposes only when running that LLM.
Hi Gamris, I have read the code in that GitHub repo.
But unfortunately it cannot be compiled and deployed on the IC right now, because it depends on the rand library and on C libraries.
Maybe we can rewrite it in pure Rust and try some ways to deploy it on the IC.
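On the rand problem specifically, one workaround I’ve seen (a sketch, assuming the getrandom crate with its "custom" feature; crate and feature names may differ between versions) is to register a custom entropy source seeded from raw_rand, so rand-based code can compile for wasm32-unknown-unknown:

```rust
// Sketch: back the `rand`/`getrandom` stack with IC randomness on wasm32-unknown-unknown.
// Requires getrandom with the "custom" feature; this is a workaround, not official ic_cdk API.
use std::cell::RefCell;

use rand::{rngs::StdRng, Rng, SeedableRng};

thread_local! {
    static RNG: RefCell<Option<StdRng>> = RefCell::new(None);
}

// Seed once (e.g. from init, post_upgrade, or a timer) using the management canister.
#[ic_cdk::update]
async fn seed_rng() {
    let (bytes,) = ic_cdk::api::management_canister::main::raw_rand()
        .await
        .expect("raw_rand failed");
    let mut seed = [0u8; 32];
    seed.copy_from_slice(&bytes[..32]);
    RNG.with(|r| *r.borrow_mut() = Some(StdRng::from_seed(seed)));
}

// Fallback entropy source that getrandom (and therefore rand) will call on the IC.
fn custom_getrandom(buf: &mut [u8]) -> Result<(), getrandom::Error> {
    RNG.with(|r| {
        let mut rng = r.borrow_mut();
        let rng = rng.as_mut().ok_or(getrandom::Error::UNSUPPORTED)?;
        rng.fill(buf);
        Ok(())
    })
}
getrandom::register_custom_getrandom!(custom_getrandom);
```

The C-library dependency is a separate problem; those parts would still need pure-Rust replacements.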
Just found https://edgematrix.pro/
Hope this will be helpful to you.
that should be changed then. the IC shouldn’t become some sort of node welfare state… distributing ICP to nodes that ain’t doing nuthin’ or providing much less marginal value vs. potential other new nodes/subnets.
we are burning money and getting what for it?
I urge you, Dom, to move us away from a socialistic welfare-state funding model to a free-market model as soon as possible.
Just found a port of llama.cpp in … Rust. Has anyone tried to put that in a canister (if that’s even possible)?
Yep, I tried and it failed:(
I found a GitHub repo with Rust bindings for TensorFlow and another with PyTorch Rust bindings. Maybe there’s hope to run a small Hugging Face model now?