I am working on the next version of ICGPT, a playground for on-chain Llama2 models, and I am looking into improving performance through horizontal scaling, with a load balancer in front of multiple LLM instances.
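To illustrate the kind of setup I mean, here is a minimal round-robin sketch in Rust. It is simplified and self-contained: the backend names are just placeholders standing in for the LLM canisters, and the real dispatch path is of course more involved.

```rust
/// Minimal round-robin load-balancer sketch (illustrative only;
/// the backends stand in for the LLM canisters behind the balancer).
struct LoadBalancer {
    backends: Vec<String>, // placeholder IDs for the LLM instances
    next: usize,           // index of the next backend to use
}

impl LoadBalancer {
    fn new(backends: Vec<String>) -> Self {
        assert!(!backends.is_empty(), "need at least one backend");
        Self { backends, next: 0 }
    }

    /// Return the next backend in round-robin order.
    fn pick(&mut self) -> &str {
        let i = self.next;
        self.next = (self.next + 1) % self.backends.len();
        &self.backends[i]
    }
}

fn main() {
    // Hypothetical backend IDs, for illustration only.
    let mut lb = LoadBalancer::new(vec![
        "llm-0".to_string(),
        "llm-1".to_string(),
        "llm-2".to_string(),
    ]);
    // Six requests cycle evenly over the three backends.
    for req in 0..6 {
        println!("request {} -> {}", req, lb.pick());
    }
}
```

Naively I would expect a dispatch like this to scale roughly linearly with the number of backends, which is why the measurements below surprised me.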
I was surprised by the non-linear scaling I saw with the experimental load balancer and would love to get a full understanding of why this happens. Would I be able to join one of the upcoming WG meetings to ask some questions?