Non-reproducibility of ic-os build

A group of five interested community members is attempting to rebuild ic-os for each ICP NNS Replica Version Management proposal, following the guidelines at https://www.codegov.org. Our intent is to serve as one part of an independent validation of the claims made about a replica version, PRIOR to its adoption. (The other part, not in scope for this post, is the sanity check of the release notes.)

As background, “ic-os” is an umbrella term for all the operating systems within the IC, including SetupOS, HostOS, GuestOS, and Boundary-guestOS. Currently we are focused on SetupOS, HostOS, and GuestOS.

At a high level, the method of validation is:
(a) building on local machines
(b) downloading the corresponding artifacts for the different OSs
(c) comparing the sha256sum of each locally built artifact with that of the same downloaded artifact
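Step (c) can be sketched in a few lines of Python. This is only an illustration of the comparison, not the script used by the proposal; the function names and file paths are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that large disk images do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifacts_match(local_path: Path, downloaded_path: Path) -> bool:
    """True when the locally built and downloaded files hash identically."""
    return sha256_of_file(local_path) == sha256_of_file(downloaded_path)
```

For example, `artifacts_match(Path("local/disk-img.tar.gz"), Path("downloaded/disk-img.tar.gz"))` would be called once per artifact; both paths here are placeholders.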

Our findings (for the latest proposal: DSCVR)
1. ALL members have successfully built GuestOS, and the sha256sums are consistent with each other as well as with the downloaded artifacts.

2. NONE of the members obtained a sha256sum matching the downloaded artifact for the SetupOS disk-img.tar.gz:
   2.1 The sha256sum of the downloaded disk-img.tar.gz: 7d729...
   2.2 Two team members have a common sha256sum for the same disk-img.tar.gz: 21a60... (which does NOT match the sha256sum of the downloaded artifact)
   2.3 Three other team members have differing sha256sums, matching neither each other NOR 2.1 NOR 2.2

3. The situation is similar for HostOS.
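To make a pattern like the one in finding 2 easier to see, the reported hashes can be grouped by value. The sketch below assumes each member reports one hash per artifact; the function names are hypothetical and any input data would be supplied by the reviewers.

```python
from collections import defaultdict


def group_builders_by_hash(reports: dict[str, str]) -> dict[str, list[str]]:
    """Map each distinct hash to the sorted list of builders who reported it.
    Hashes are normalized (trimmed, lower-cased) before comparison."""
    groups = defaultdict(list)
    for builder, digest in reports.items():
        groups[digest.strip().lower()].append(builder)
    return {digest: sorted(names) for digest, names in groups.items()}


def builders_matching(reports: dict[str, str], reference: str) -> list[str]:
    """Return the builders whose hash equals the reference (downloaded) hash."""
    wanted = reference.strip().lower()
    return sorted(
        name for name, digest in reports.items()
        if digest.strip().lower() == wanted
    )
```

A group with more than one builder that still differs from the reference hash, as in 2.2, suggests a shared but different build input rather than purely random nondeterminism.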

We need immediate help in resolving this issue. If we cannot trust the SetupOS (or the HostOS), we cannot simply rely on the correctness of the GuestOS.

I think it’s very healthy that folks validate the proposals. It’s a good service for the community. Let me ping some folks to see what may be going on.

btw maybe it’s just me, but this hyperlink linked to a blank page

Definitely strange. @wpb would you know? This is what I get below.

@diegop can you log into dscvr and see if it resolves?

I believe it’s a private dscvr portal. Anyone would need an invite first

I assumed so, so I wanted to make sure @Icdev2dev knew, in case they intended to link to a public URL

You have to join the portal to view the contents. This can be done without any admin support.

@Icdev2dev

I asked the team. We have seen a problem with some folks who used codegov, where codegov had outdated info. The site would recommend using its own copy of the build-ic.sh file, not the one from the IC repo, and that file was not up to date: it did not contain the commit “Clean up icos folders” (dfinity/ic@254bf3d).

Can you see if it works with the file from the IC repo?

@wpb do you know how we can update the codegov website?

this worked! thank you very much

I maintain the website, so I can fix whatever needs to be fixed. I double checked the IC-OS Verification script that is posted here and it reads exactly the same as the latest proposal 118023. The codegov.sh file is not a script to build the replica…it is used for a different purpose. The script to build the replica is intended to be the script in the proposal and I have simply copied it verbatim for convenience, but can certainly take it down if it’s creating confusion.

@jwiegley initially had an issue with his build on this proposal too and reached out to Dfinity for clarification. However, the problem wasn’t the codegov.sh file, as was originally thought. I don’t recall what was causing his issue, but I can review the messages when I get home later.

Makes sense. Thank you. Let’s see what John says!

To be clear, I am using ic/ic-os at master in the dfinity/ic GitHub repository.

In there, it states:

"As an alternative, the following script can be used to build the images in a container with the correct environment already configured:

./gitlab-ci/container/container-run.sh"

Since that script checks for a porcelain (clean) repository prior to the build, and only git and podman need to be configured, I presume that the instructions there are correct.
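For readers unfamiliar with the term, a clean-repository check of this kind can be approximated with `git status --porcelain`, which prints one line per modified or untracked path and nothing for a clean working tree. The sketch below is an assumption about what such a check looks like, not a copy of what container-run.sh does; the function names are hypothetical.

```python
import subprocess


def porcelain_output_is_clean(output: str) -> bool:
    """`git status --porcelain` lists one changed or untracked path per line,
    so empty output means the working tree is clean."""
    return output.strip() == ""


def worktree_is_clean(repo_dir: str = ".") -> bool:
    """Run `git status --porcelain` in repo_dir and report whether it is clean.
    Raises CalledProcessError if git fails (for example, outside a repository)."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return porcelain_output_is_clean(result.stdout)
```

Running such a check before building helps rule out local modifications as a cause of differing hashes.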

@jwiegley will definitely comment. But, imo, his posted sha256sums have the same issue with the hostos and setupos, i.e. the setupos sha256sum does NOT match that of the downloaded artifact (and therefore, I presume, neither does the hostos one).

To be clear, the problem is not with following the build instructions, and as the post says, all codegov participants managed to successfully reproduce the guestos image.

The question that @Icdev2dev asks is why the hashes of the other artifacts that are being built differ (the hostos and setupos images). I don’t have a full answer to that, but do note that the proposal we’re voting on only elects a guest os image, as you can see in the payload of the proposal. So even though the script we run builds more artifacts, the only one relevant for checking the proposal is the guest os image, and the proposal only approves a guestos upgrade to the hash that is in the payload.

Thanks, @Manu.

It is correct that the proposal only approves a guestos upgrade to the hash that is in the payload.

HOWEVER, it is my understanding that it is the hostos that manifests that intent in a technical sense.

“Hostos: Its main responsibility is to launch and run the GuestOS in a virtual machine”

Q1. Then, to the extent we are building the Hostos as a part of this process, shouldn’t we be checking for the fidelity of the Hostos?

Q2. I suppose that the Hostos is not updated in a replica version management update. When is the Hostos updated, if it is not a part of the proposal that we are checking?

Q3. A set of similar questions for SetupOs.

Hi @Icdev2dev!

Thanks again to you and the other community members who took the initiative to test the reproducibility of the builds!

While having a reproducible GuestOS is the most crucial element, you are absolutely correct that ensuring the reproducibility of all the IC-OS builds is essential for the integrity and decentralization of the project, so thank you!

We have tests in place to try and ensure the reproducibility of all the IC-OS builds, but evidently, they are not perfect, and we are continuing to work on them.

  1. You are right; the primary role of HostOS is to launch and run the GuestOS in a virtual machine. And this proposal to upgrade the GuestOS will have no impact on the HostOSes running on the IC. So for the sake of this proposal, it’s not crucial to check the fidelity of the HostOS (though always important!)

  2. The HostOS has actually never been upgraded. However, we’re currently working on a proposal for the first ever HostOS upgrade, so thank you for bringing these reproducibility issues to our attention before we went ahead with the proposal. The HostOS is upgraded in a manner similar to the GuestOS. A HostOS upgrade proposal is sent out, and upon approval via the NNS, the upgrade is triggered.

  3. SetupOS is the operating system that installs the HostOS hypervisor and the GuestOS virtual machine. SetupOS is a convenience wrapper for Node Operators, as it streamlines the process of installing HostOS and GuestOS. As it only exists for installing HostOS and GuestOS, SetupOS is never upgraded, and so there are no NNS proposals to upgrade SetupOS. However, we still want SetupOS to be reproducible for the sake of decentralization, security, and determinism, and will be working hard to resolve these issues.

Once again, thank you for your work! If you have any more questions or concerns please share!

Thanks for the kind words and update on hostos.

Since our local build infrastructure has been proven to work for guestos and we are building hostos as a part of that build, we would like to extend this to include hostos as well.

Can we (codegov & dfinity) work together on the reproducibility of hostos PRIOR to the proposal to upgrade the hostos being submitted? That way, we (codegov/others) don’t struggle at the last moment to figure out build issues.

The fact that two codegov members were able to build hostos (and setupos) with the same sha256sum (though not matching the sha256sum of the downloaded artifacts) seems to indicate some version omission, which subsequent codegov members presumably pulled in as inadvertent updates.

Additionally, I wanted to understand how you intend to test the hostos update, which is likely trickier than guestos. This is because you would probably have nodes with different versions of hostos for a certain length of time.

Yes, we’d love to work together on the reproducibility of HostOS prior to the proposal to upgrade the HostOS. I will be in touch regarding this.

I think you might be right about the version omission. I will look into this and be in touch.

Yes, you’re correct that the HostOS upgrades are trickier than the GuestOS:

GuestOS upgrades are currently done at a subnet level. Once a proposal is accepted to upgrade GuestOS in a subnet, all the nodes in the subnet upgrade, and this upgrade process occurs very quickly, causing a small window of subnet downtime. However, for HostOS upgrades to take effect, all services must shut down and the machine must reboot. This process takes a few minutes, which is too long to bring down an entire subnet. As a result, we are planning to propose that HostOS upgrades happen on a per-node or a per-datacenter level: a proposal will be sent out to upgrade an individual node or all the nodes in a particular datacenter. As a result, only a single node per subnet will upgrade the HostOS at a given time, and there will be no subnet downtime. So it’s actually not a problem if nodes in a subnet have different HostOS versions at a given time (unlike the GuestOS, where all nodes in a subnet must be running the same version).
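The "single node per subnet at a time" property can be illustrated with a small check. This is a hypothetical model, not DFINITY's implementation: the `Node` type, its fields, and both functions are assumptions made for the sketch, and it presumes that a per-datacenter batch is safe only when the datacenter hosts at most one node of any given subnet.

```python
from collections import Counter
from typing import NamedTuple


class Node(NamedTuple):
    node_id: str
    subnet: str
    datacenter: str


def batch_for_datacenter(nodes: list[Node], datacenter: str) -> list[Node]:
    """Select every node located in the given datacenter."""
    return [node for node in nodes if node.datacenter == datacenter]


def subnets_losing_multiple_nodes(batch: list[Node]) -> list[str]:
    """Return the subnets that would have more than one node rebooting
    at once if the whole batch were upgraded together."""
    counts = Counter(node.subnet for node in batch)
    return sorted(subnet for subnet, count in counts.items() if count > 1)
```

An empty result from `subnets_losing_multiple_nodes` means that upgrading the batch takes at most one node offline in each subnet.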

I hope that helps!

Regarding the build reproducibility, I’m curious to learn more about your build environment so that we can track down the source of the indeterminism.

Were you using the gitlab-ci/container/container-run.sh script? Or did you build images directly through Bazel? Or did members do a mixture of the two? And I presume that all members were building production images?

Any information would be greatly appreciated!

Thank you!

I have changed the portal role permissions so everyone can view the contents of the portal. It no longer requires you to join the portal to view content. I still have restrictions enabled for posting and commenting because I don’t want the portal to become cluttered with messages that are outside the scope or don’t benefit these proposal reviews. That said, I have no issue with giving people a role that enables comments. Just join the portal and let me know that you want a role that can comment. We actually have many people in the portal that are able to provide reviews if they want, but their intention is just to be an advisor. Those contributions are welcome.

I am using gitlab-ci/container/container-run.sh. I imagine most others are using the same as well.

Just fyi: I did try to run the build on WSL with limited (16 GB) RAM. The build failed with a “not enough space” message. I modified it to NOT mount tmpfs. That build completed successfully, BUT with wrong sha256sums, even for guestos. Others have been able to complete the build on WSL with expanded RAM and got the correct sha256sum.

I then reverted to a standalone Ubuntu 20.04 machine with 24 GB of RAM, where the porcelain build produced the correct sha256sums.
