Persistence of Llama2.c LLM weights in a canister

Hey, this is an absolutely amazing project by @icpp! It is surely pushing the boundaries of what can be done on the IC, and on blockchain in general. I saw the post by the author here:
https://forum.dfinity.org/t/llama2-c-llm-running-in-a-canister/21991/56 and read this comment:

When installing an LLM in a canister, there are two steps:

1. Deploy the wasm
2. Upload the trained model (the weights)
Step 1 is always the same, and the wasm fits in a canister without any problem.

Step 2, however, is a different story. Uploading the model weights is done by running a Python script provided in the icpp_llm repo. The script must be pointed at a model.bin file on your disk; it reads the file in chunks and sends them to an endpoint of the canister where the LLM is running. When the canister endpoint receives those bytes, they are appended to an std::vector, which is orthogonally persisted. That vector grows dynamically, and that is where the question of fit comes into play.
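A minimal sketch of what that canister-side pattern looks like (the endpoint name and signature here are hypothetical, for illustration only, and not the actual icpp_llm API):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical upload endpoint -- illustrative, not the real icpp_llm API.
// The model bytes live in a global std::vector that is orthogonally
// persisted across calls; each call from the Python script appends one chunk.
static std::vector<uint8_t> model_bytes;

void upload_model_bytes_chunk(const std::vector<uint8_t> &chunk) {
  // The vector grows dynamically until it holds the full model.bin,
  // or the canister runs out of memory -- that is the "fit" question.
  model_bytes.insert(model_bytes.end(), chunk.begin(), chunk.end());
}
```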

Note that this upload is done only once. If the upload succeeds, then the model fits!

Small models with, e.g., millions of parameters fit just fine, but models with several billion parameters will hit the canister limit.
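To make that concrete, here is a back-of-the-envelope sizing sketch. It assumes fp32 weights (4 bytes per parameter) and the roughly 4 GiB address space of a wasm32 canister; the parameter counts are illustrative, not specific checkpoints:

```cpp
#include <cstdio>

int main() {
  const double bytes_per_param = 4.0;          // fp32 weights
  const double limit_gib = 4.0;                // ~wasm32 heap ceiling
  const double params[] = {15e6, 110e6, 7e9};  // example model sizes
  for (double p : params) {
    double gib = p * bytes_per_param / (1024.0 * 1024.0 * 1024.0);
    std::printf("%6.0fM params -> %6.2f GiB  (%s)\n", p / 1e6, gib,
                gib < limit_gib ? "fits" : "exceeds the limit");
  }
  return 0;
}
```

By this rough arithmetic a 15M-parameter model needs only ~0.06 GiB, while a 7B-parameter model needs ~26 GiB, far beyond what a single canister's wasm memory can hold.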

If you want to dig deeper and see where the memory goes once a model is fully loaded, [this](https://github.com/icppWorld/icpp_llm/blob/c29712f4d65e8c9ffea62e37b45a7c0f1cd23492/icpp_llama2/README_icpp_llama2_resource_requirements.md#canister-resource-requirements-for-icpp_llama2) is a study I did. I find it very interesting and believe there is still a lot of room for improvement.

My question: since the model's .bin file is stored in an orthogonally persisted vector, does that mean the whole model weights file is not loaded into wasm memory (canister memory), and the data is only fetched lazily, as and when required?