Hello everyone,
I wanted to share an idea for a potential feature that could be built on ICP, particularly around federated learning.
Imagine a university or hospital (let’s say University A) has access to sensitive lung scan data that could be used to train an AI model. However, due to privacy regulations — or because patients only gave consent for local use — this data can’t be shared externally. On top of that, the dataset might be large, making it expensive or impractical to transfer.
Now consider another institution (University B) that wants to train a lung cancer detection model but lacks sufficient data. Instead of transferring the data, the model could be sent to University A, where it trains directly on the local server. ICP could act as the coordination and verification layer for this process — ensuring trust, logging training metadata, and possibly even enforcing access control.
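To make that flow a bit more concrete, here is a minimal sketch (plain Python, not actual canister code) of the kind of record the coordination layer might keep per training round. The `TrainingRound` class, its field names, and `submit_update` are all hypothetical, just to illustrate logging training metadata without storing raw data or full weights:

```python
import hashlib
import json
import time
from dataclasses import dataclass, field


@dataclass
class TrainingRound:
    """Hypothetical record the coordination layer could keep for one round."""
    round_id: int
    model_hash: str                    # hash of the model sent to the data holder
    data_holder: str                   # e.g. "university-a"
    submitted_updates: dict = field(default_factory=dict)

    def submit_update(self, holder_id: str, weights_bytes: bytes, metrics: dict) -> None:
        """Log an update: keep only a hash of the weights plus training metadata."""
        self.submitted_updates[holder_id] = {
            "weights_hash": hashlib.sha256(weights_bytes).hexdigest(),
            "metrics": metrics,        # e.g. {"epochs": 5, "val_auc": 0.91}
            "timestamp": time.time(),
        }


# Example usage: University B registers a round, University A reports back.
round_1 = TrainingRound(round_id=1, model_hash="abc123", data_holder="university-a")
round_1.submit_update("university-a", b"...serialized weights...", {"epochs": 5, "val_auc": 0.91})
print(json.dumps(round_1.submitted_updates, indent=2))
```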
This approach could scale across multiple data holders, allowing a model to train across several institutions — all without ever exposing raw data. Smaller universities and research centers that lack data or infrastructure could greatly benefit from using models trained on shared (but secure) resources.
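For the multi-institution case, the standard aggregation step would be something like federated averaging. Below is a small FedAvg-style sketch, assuming each institution returns its weights as a list of NumPy arrays together with its local sample count; the names and toy numbers are purely illustrative:

```python
import numpy as np


def federated_average(updates):
    """
    Combine model updates from several data holders without seeing their raw data.

    `updates` is a list of (weights, n_samples) pairs, where `weights` is a list
    of NumPy arrays (one per layer) and `n_samples` is the local dataset size.
    Returns the sample-weighted average of the weights (FedAvg-style).
    """
    total = sum(n for _, n in updates)
    n_layers = len(updates[0][0])
    averaged = []
    for layer in range(n_layers):
        averaged.append(sum(w[layer] * (n / total) for w, n in updates))
    return averaged


# Toy example: two institutions, a one-layer "model".
update_a = ([np.array([1.0, 2.0])], 800)   # University A, 800 local scans
update_b = ([np.array([3.0, 4.0])], 200)   # University B, 200 local scans
print(federated_average([update_a, update_b]))  # -> [array([1.4, 2.4])]
```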
There could be a problem with data preprocessing, since the researchers building the model can't do much without knowing something about the data. Perhaps the institution hosting the data could share a short, de-identified snippet or a schema as an example.
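One way this could look is a small published descriptor instead of raw scans, along the lines of the sketch below (every field name and value here is made up, just to show the idea):

```python
# Hypothetical dataset descriptor a data holder could publish instead of raw scans.
lung_dataset_descriptor = {
    "name": "university-a-lung-scans",       # illustrative identifier
    "modality": "CT",
    "image_shape": [512, 512, 1],            # height, width, channels
    "dtype": "uint16",
    "labels": ["benign", "malignant"],
    "n_samples": 12000,
    "preprocessing": "HU windowing [-1000, 400], resampled to 1 mm spacing",
    # A few fully synthetic or de-identified example records, not real patients:
    "example_records": [
        {"label": "benign", "pixel_stats": {"mean": 0.12, "std": 0.31}},
        {"label": "malignant", "pixel_stats": {"mean": 0.18, "std": 0.29}},
    ],
}
```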
Training might be too expensive to run directly inside a canister, especially given the lack of GPU support and the potential costs involved. A practical solution could be to perform the training off-chain and then provide the updated model weights along with verifiable evidence that training was completed, such as the number of epochs, validation metrics, or logs. That said, smaller models might still be feasible to train directly on ICP, and in some cases even the data could be kept in decentralized storage such as Filecoin.
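For the off-chain route, the minimum artifact to record might look like the sketch below: hash the final weights and bundle them with the agreed training metadata, so the on-chain record can later be checked against what the data holder actually delivered (the function and field names are assumptions, not an existing API):

```python
import hashlib
import json


def build_training_attestation(weights_bytes: bytes, metadata: dict) -> dict:
    """
    Produce a compact record that could be stored on-chain after off-chain training.

    `weights_bytes` is the serialized model (e.g. a saved state dict);
    `metadata` holds whatever the parties agree to disclose (epochs, metrics, ...).
    Only hashes and metadata go on-chain; the weights themselves stay off-chain.
    """
    payload = {
        "weights_hash": hashlib.sha256(weights_bytes).hexdigest(),
        "metadata": metadata,
    }
    # Hash the whole payload too, so later tampering with the metadata is detectable.
    payload["record_hash"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return payload


attestation = build_training_attestation(
    b"...serialized model weights...",
    {"epochs": 5, "val_accuracy": 0.92, "dataset": "university-a-lung-scans"},
)
print(json.dumps(attestation, indent=2))
```

Of course, a hash like this only pins down what was delivered; the accompanying metrics and logs would still need to be checked by whatever verification logic the coordination layer runs.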
Furthermore, this could become a new type of benchmark. For example: "Our model achieved 92% accuracy on the Dfinity lung dataset."
I’d love to hear your thoughts on this concept and whether you see potential in exploring it further.