Non-reproducibility of ic-os build

You can see the sha256 results for each build from each reviewer on the codegov portal. Each reviewer is required to post a screen capture of their artifacts.
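For anyone who wants to compute those sha256 values outside of the proposal script, here is a minimal Python sketch. It is not part of the official verification flow. The function names, the `*.tar.gz` pattern, and the idea that your artifacts sit under a single directory are assumptions for illustration; adjust them to wherever your build actually writes its output.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large disk images do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_artifacts(directory, pattern="*.tar.gz"):
    """Map each matching file (path relative to `directory`) to its digest.

    Relative paths are used as keys because the same file name can appear
    in more than one subdirectory.
    """
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob(pattern))
    }
```

Calling `hash_artifacts("artifacts")` (with `artifacts` standing in for your own output directory) gives a dictionary you can paste next to the hashes in the proposal for a side-by-side comparison.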

My environment is Windows 11 running WSL2 with Ubuntu 22.04 LTS (not the desktop version) and Podman. I run the script that is posted in the proposal without modification. I have also been able to run this script in Ubuntu 22.04 desktop in a VirtualBox virtual machine on the same PC, but I usually prefer WSL2. I created the VM for these ic-os verifications back when Docker was required. Hence Docker is installed on the VM and directly on my PC, but I don’t intentionally use it unless it is being used by the proposal script.

I can’t quote the stats from memory, but I do believe that people have struggled to build the replica due to a lack of hard drive space, memory, or cores. If it were possible to reduce the hardware requirements, more people might be able to build the replica.

Other codegov reviewers can provide info about their own environment. I will tag them here. @Icdev2dev @jwiegley @Gekctek @northman @NathanosDev @Zane

I hope this helps. Thank you so much for offering feedback and guidance that can help improve our understanding and contributions to these reviews. I love seeing this type of collaboration. DFINITY is doing a great job of engaging the community and enabling independent reviews and voting. It’s very impressive and exciting.

6 Likes

Same here.
Windows 11
WSL2
Ubuntu 22.04
Podman

3 Likes

I want to make sure I follow the TLDR of this thread so far, is this right?

The main issue is that folks’ computers ran out of memory… but those who had enough memory were able to match the hashes?

I want to make sure there is a happy path where hashes matched first, before expanding that path.

My issue ended up being that I was running an out-of-date version of build-ic.sh. Once I switched to what is provided within the repo, my build completed fine.

2 Likes

It’s a little bit more nuanced.

There are three sets of hashes: setupos, hostos & guestos.

  1. For guestos, a couple of reviewers attempted to build on WSL2 with 16 GB of RAM. The build failed, reporting "no enough space" or something similar.
    1.1 I had the same issue. I modified the config to NOT mount “tmpfs”. Then the build itself succeeded, but produced incorrect hashes.
  2. All reviewers with correctly configured machines were able to produce correct hashes for guestos.
  3. None of the reviewers were able to produce correct hashes for setupos, and the hashes were not consistent amongst reviewers. For hostos, I verified that hashes were not consistent amongst reviewers (I did not check for correctness).
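The distinction in the list above between "correct" (matches the proposal) and "consistent" (reviewers agree with each other) can be made explicit with a small Python sketch. This is only an illustration: the function name, the category labels, and the reviewer handles and hash strings in the example are all made up, not real results.

```python
def classify_artifact(expected_hash, reviewer_hashes):
    """Classify one artifact's results across reviewers.

    expected_hash: the hash published in the proposal.
    reviewer_hashes: dict mapping a reviewer handle to the hash they got.
    """
    if not reviewer_hashes:
        return "no results"
    distinct = set(reviewer_hashes.values())
    if distinct == {expected_hash}:
        # Everyone reproduced the published hash.
        return "reproducible"
    if len(distinct) == 1:
        # Reviewers agree with each other but not with the proposal.
        return "consistent but incorrect"
    # At least two reviewers disagree with each other.
    return "inconsistent"


# Placeholder data, not actual hashes from any proposal.
example = {
    "guestos": ("aaa", {"reviewer_1": "aaa", "reviewer_2": "aaa"}),
    "hostos": ("bbb", {"reviewer_1": "b01", "reviewer_2": "b02"}),
}
for name, (expected, results) in example.items():
    print(name, "->", classify_artifact(expected, results))
```

Checking consistency separately from correctness is useful here because an artifact that is inconsistent between reviewers points at something host-dependent in the build, rather than at a mismatch with the published source.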
3 Likes

I have moved my build to a different system with Ubuntu 20.04, 16M, on an i7 with 8 cores. I wish it would run cleanly under WSL2.

2 Likes

Hello codegov!

Thank you again for all your work!

I’ve been trying to get to the bottom of the build indeterminism. First, some further background on the issue:

  • If HostOS is not deterministic, SetupOS won’t be either because SetupOS contains a HostOS image inside of it. So the source of the indeterminism should be isolated to HostOS.
  • @wpb and @Gekctek, you two were running very similar build environments and almost had the same builds. You had differing update-img-test images, and we’re looking into this, but this is a separate issue that we believe will be straightforward to resolve.
    • You both said the build environment you were running was:
      • Windows 11
      • WSL2
      • Ubuntu 22.04
      • Podman
    • I’m curious to know what Podman version you were running.

Anyone who ran the test, could you take a moment to upload your HostOS disk-img.tar.gz file to a shared folder:
tinyurl .com/2p8wfuz9
NOTE: Links cannot be shared in these threads, so please just remove the space in the URL.

I’ve created folders with each of your account handles. Please place the file in your respective folder. This way, I can inspect the issue and identify the source of the indeterminism. After uploading your images, can you also comment in this thread with your OS, Podman version, and any other build environment information you think might be relevant?

Additionally, @wpb and @Gekctek, can you both upload your update-img-test.tar.gz files so that I can take a look at that issue?

Thanks again Codegov for all your work!

2 Likes

CC: @jwiegley @northman

CC: @NathanosDev @Zane

Thank you @andrewbattat for diving deep to understand the lack of determinism. Please let us know what you find!

2 Likes

Uploaded disk-img.tar.gz from hostos in icdev2dev

OS - Ubuntu 20.04.2 LTS
podman - 3.4.2

1 Like

Done. Uploaded update-img-test.tar.gz

Podman 3.4.4

2 Likes

Uploaded disk-img.tar.gz from hostos in northman
OS baseline found in readme.txt

Thank you.

2 Likes

My uploads are in progress.

12th Gen Intel(R) Core™ i7-12700T 1.40 GHz
32.0 GB (31.7 GB usable)
64-bit operating system, x64-based processor

Windows 11 Pro Version 22H2
Ubuntu 22.04.2 LTS
podman version 3.4.4

I hope this helps! Thanks for taking a deeper look.

2 Likes

Thanks for investigating @andrewbattat. I’ve uploaded my HostOS img to the folder.
I’m using Windows 11, Ubuntu 22.10 on WSL2 and Podman 3.4.4.

3 Likes

Hi all! Update:

We believe we have found and resolved the source of the indeterminism!

For those curious about the issue:
In the bootloader, microcode updates were being added to the initrd based on the detected CPU of the machine building the image. We solved the issue by preventing the initramfs from including microcode updates.

Here’s the commit in question: Remove microcode updates from initramfs · dfinity/ic@d8549e6 · GitHub
Hats off to Eero for tracking down the issue! :clap:

Note: the reason our existing testing infrastructure did not discover the issue is that our machines were all using the same AMD processors, which meant they were getting the same microcode updates, causing them to create deterministic images.
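To make the explanation above concrete, here is a toy Python model of how host-specific bytes in the initrd change the HostOS hash and, because SetupOS embeds HostOS, the SetupOS hash as well. It is purely illustrative: the `build_*` functions and byte strings are invented for this sketch and have nothing to do with how the real ic-os images are assembled.

```python
import hashlib


def digest(data):
    """Hex SHA-256 of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def build_initrd(microcode=b""):
    """Toy initrd: fixed content plus whatever microcode the build host adds."""
    return b"initrd-base" + microcode


def build_hostos(initrd):
    """Toy HostOS image that embeds the initrd."""
    return b"hostos-rootfs" + initrd


def build_setupos(hostos_image, guestos_image):
    """Toy SetupOS image that embeds both HostOS and GuestOS images."""
    return b"setupos" + hostos_image + guestos_image


guestos = b"guestos-image"  # does not depend on the build host in this model

# Two build hosts with different CPUs pick up different microcode blobs.
host_a = build_hostos(build_initrd(microcode=b"cpu-vendor-1-ucode"))
host_b = build_hostos(build_initrd(microcode=b"cpu-vendor-2-ucode"))
print(digest(host_a) == digest(host_b))  # False: HostOS differs
print(digest(build_setupos(host_a, guestos))
      == digest(build_setupos(host_b, guestos)))  # False: SetupOS inherits it

# With microcode left out of the initrd, both hosts build identical images.
fixed_a = build_hostos(build_initrd())
fixed_b = build_hostos(build_initrd())
print(digest(fixed_a) == digest(fixed_b))  # True
```

The same model also shows why a build farm made up of identical CPUs would not notice the problem: every machine would pass the same `microcode` bytes and get the same hash.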

I’m not sure if this change will be included in the next update proposal, but if not, it should definitely be included in the subsequent one!

9 Likes

Wow, very interesting. Thank you for the explanation. Sounds like Eero did some amazing forensic work to get to the bottom of it.

2 Likes

Indeed. This is great! I will resume testing. Just to be clear, this fix should resolve both hostos & setupos. Did I get that right?

2 Likes

It should, yes! Because a SetupOS image contains GuestOS and HostOS images, we believe the SetupOS indeterminism was a result of the HostOS indeterminism.

3 Likes

aagh!

From the looks of the builds in the codegov review for the latest replica update proposal, we still have a HostOS indeterminism issue. Apologies for the premature celebration. We’ve either overlooked something, or this is a new issue.

Like before, NNS proposal 122529 is just a replica update proposal, and the indeterminism is just in HostOS and SetupOS, not GuestOS. So this proposal can still pass.

I am posting another link to a shared drive, and requesting that people upload their HostOS disk-img.tar.gz file.

Thanks again for all your help and patience. We will get this resolved!

4 Likes