DRAFT Motion Proposal: New Hardware specification and remuneration for IC nodes

Hi @Sormarler

Edit: I crossed out the text below because I was wrong.

~~I need to verify, but my understanding is that aspiring NPs can currently use the wiki instructions, although the hardware spec may block people who cannot get access to the machines. An aspiring node provider can follow the instructions to onboard themselves.~~

~~The work to make this much more user-friendly (e.g. by using the NNS Frontend dapp) is likely coming after SNS and hardware spec.~~

@diegop I am a little confused here.
What I understood initially is that this form (Node Provider Interest) is closed, pending a conclusion on the proposal in DRAFT Motion Proposal: New Hardware specification and remuneration for IC nodes - #10 by diegop, before new nodes can be onboarded to the network.

Are you suggesting that anyone who follows the instructions in the wiki will be able to become an NP and receive rewards?

Same. I am a little confused as well. I was under the assumption the whole process stopped because a new method was in development.

@ritvick @Sormarler I am not surprised I confused you both: it turns out I was wrong, and the team corrected me when I checked in with them. The new process is still under development, and so are the wiki instructions (they could work, but have some rough areas the team is still working on), so they are not yet the experience the IC should offer.

That being said, I believe the hardware work (which is the main theme of this thread) is a necessary condition, so @garym will post an update on hardware tests.

Apologies for the confusion I caused. This is not my area, and I should have verified earlier.

We have executed the validation plan described previously on two ASUS machines. These meet the generic Gen 2 hardware specifications given earlier in this thread. Some abbreviated specs:

  • 2x AMD EPYC 7313 (3.00 GHz, 16-Core, 128 MB)
  • 512 GB (16x 32GB) ECC Reg ATP DDR4 3200 RAM
  • 32TB (5x 6.4TB) NVMe Kioxia SSD 3D-NAND TLC U.3 (Kioxia CM6-V)
  • Swiss price: 21’595.75 CHF

Validation results:

  • Low Level

    • Stress tests (using stress-ng) - increase confidence in the hardware configuration and these specific machine instances - Passed :white_check_mark:
    • System benchmarks (using sysbench) - Gauge performance against known Gen 1 node performance - Passed :white_check_mark:
      • About 2x performance increase for cpu and memory.
      • Disk performance ranges from equally good to better for the majority of tests
    • SEV-SNP capability - Verified BIOS and kernel support working in tandem - Pending :construction:
  • High Level

    • Method: deploy machines into subnets and ensure subnet metrics do not deviate negatively by a meaningful threshold
    • Low usage subnet deployment. Scalability benchmarks (system baseline and large memory)
      • All metrics nominal - Passed :white_check_mark:
    • High usage subnet deployment
      • All metrics nominal, except
        • Individual node checkpointing performance discrepancy of <3-6%.
          This has no impact on subnet performance, but we’re still keeping an eye on it.
      • Passed :white_check_mark:
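
The high-level method above (compare subnet metrics with and without the new machine, and flag anything that gets worse by more than a threshold) can be sketched roughly as follows. This is only an illustration: the metric names, values, and thresholds are hypothetical, not the ones used in the actual validation runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str               # hypothetical metric name
    baseline: float         # value observed before the new machine joined
    candidate: float        # value observed with the new machine in the subnet
    higher_is_better: bool  # direction in which the metric improves


def negative_deviation(m: Metric) -> float:
    """Relative change in the 'worse' direction; 0.0 if the metric did not get worse."""
    if m.baseline == 0:
        raise ValueError(f"baseline for {m.name} must be non-zero")
    change = (m.candidate - m.baseline) / m.baseline
    worse = -change if m.higher_is_better else change
    return max(worse, 0.0)


def regressions(metrics: list[Metric], threshold: float) -> list[str]:
    """Names of metrics that got worse by more than `threshold` (e.g. 0.10 for 10%)."""
    return [m.name for m in metrics if negative_deviation(m) > threshold]


# Hypothetical sample values, for illustration only.
sample = [
    Metric("finalization_rate", baseline=1.0, candidate=0.99, higher_is_better=True),
    Metric("checkpoint_seconds", baseline=10.0, candidate=10.5, higher_is_better=False),
]
print(regressions(sample, threshold=0.10))
```

With a looser threshold nothing is flagged; tightening it to 3% would flag the 5% checkpointing slowdown in this made-up sample.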

We have updated the node provider hardware wiki to include this ASUS server configuration. An example ASUS quote and bill of materials (BOM) is available for interested community members.

What’s the plan now?

We plan on continuing validation of new Gen 2 hardware configurations and publishing the results. Many factors influence how we proceed, e.g., community input on hardware configurations/manufacturers, price, availability.

What We Are Asking The Community

Please comment on and prioritize next hardware choices (abbreviated specs):

  • Dell PowerEdge
    • 2x AMD EPYC 7343 3.2GHz, 16C/32T, 128M Cache (190W) DDR4- 3200
    • 16x 32GB RDIMM, 3200MT/s, Dual Rank 16Gb (BASE x8)
    • 5x 6.4TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 with carrier
    • Swiss price: 27’159.09 CHF
    • USA price: $26,460.32 USD
  • HPE Proliant
    • 2x AMD EPYC 7343 3.2GHz 16-core 190W Processor for HPE
    • 16x HPE 32GB Dual Rank x4 DDR4-3200 CAS-22-22-22 Registered Smart Memory Kit
    • 5x HPE 6.4TB NVMe Gen4 Mainstream Performance Mixed Use SFF BC U.3 Static Multi Vendor SSD
    • Swiss price: 27’031.83 CHF
  • Lenovo
    • 2 x ThinkSystem AMD EPYC 7343 16C 190W 3.2GHz Processor
    • 16 x ThinkSystem 32GB TruDDR4 3200MHz (2Rx4 1.2V) RDIMM-A
    • 5 x ThinkSystem U.3 Kioxia CM6-V 6.4TB Mainstream NVMe PCIe4.0 x4 Hot Swap SSD
    • Swiss price: 30’534.27 CHF
    • USA price: $28,525.54 USD

Note: Prices are provided as rough examples and don’t include tax. USA prices are provided for comparison; the hardware will be validated in Switzerland. Example quotes and BOMs for these hardware configurations are available on request.
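
For anyone weighing the options on cost, here is a small sketch that ranks the three candidate configurations by their Swiss list price relative to the already validated ASUS machine. The prices are the rough, pre-tax examples quoted in this thread; the function names are just illustrative.

```python
# Rough, pre-tax Swiss prices (CHF) as quoted in this thread.
ASUS_CHF = 21595.75
CANDIDATES_CHF = {
    "Dell PowerEdge": 27159.09,
    "HPE Proliant": 27031.83,
    "Lenovo": 30534.27,
}


def premium_over_baseline(price: float, baseline: float) -> float:
    """Fractional price increase over the baseline (0.25 means 25% more)."""
    return (price - baseline) / baseline


def rank_by_premium(prices: dict[str, float], baseline: float) -> list[tuple[str, float]]:
    """Configurations sorted from smallest to largest premium over the baseline."""
    ranked = [(name, premium_over_baseline(p, baseline)) for name, p in prices.items()]
    return sorted(ranked, key=lambda item: item[1])


for name, premium in rank_by_premium(CANDIDATES_CHF, ASUS_CHF):
    print(f"{name}: {premium:.1%} above the ASUS configuration")
```

Price is only one of the factors mentioned above; availability and manufacturer diversity are not captured here.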

Having instances of the same hardware that node providers run has an additional benefit: if node providers face problems, the DFINITY engineering team can reproduce and debug them independently in an identical environment. This must be done without access to machines owned by node providers.

Are variations like using Kioxia disks in Dell servers acceptable, or can only the specific combinations that have been validated be used?

> Are variations like using Kioxia disks in Dell servers acceptable?

Yes. Some vendors may not offer the exact components listed in the configurations above.
That said, the performance characteristics of any alternatives should be equivalent.

I have not seen any connectivity requirements.
What are the expectations per node in this respect, and how is that compensated?

For instance, if a rack has a 10Gbps dedicated connection shared by N nodes… what are the requirements and how are rewards calculated in that case?

The aim is for 10Gb connectivity per the second requirement here: Node Provider Onboarding - Internet Computer Wiki

Regarding node rewards: still under development.

EDIT: I’ll check on per-node connectivity expectations

How and where do we request this?

DM’ing me here works!

Evolving spreadsheet of BOM listings: IC Gen2 HW BOMs - Google Sheets

> What are the expectations per node in this respect?

I consulted with the team. We’re aiming for 10Gb connectivity per 1 or 2 racks.
A more nuanced answer including info about rewards is coming.
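
To make the "10Gb per 1 or 2 racks" target concrete, here is a back-of-the-envelope sketch of the worst-case share per node when every node saturates the shared uplink at the same time. The number of nodes per rack is a hypothetical input, not a stated requirement.

```python
def per_node_share_gbps(uplink_gbps: float, nodes_per_rack: int, racks: int = 1) -> float:
    """Worst-case bandwidth per node when all nodes use the shared uplink at once."""
    total_nodes = nodes_per_rack * racks
    if total_nodes <= 0:
        raise ValueError("need at least one node sharing the uplink")
    return uplink_gbps / total_nodes


# Hypothetical example: 10 nodes per rack sharing one 10Gb uplink.
print(per_node_share_gbps(10, nodes_per_rack=10, racks=1))
print(per_node_share_gbps(10, nodes_per_rack=10, racks=2))
```

In practice nodes rarely all peak together, so the real per-node headroom would usually be higher than this floor.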

Thanks! This makes sense.

More information about rewards would be useful, particularly given that DC costs are likely to keep rising due to energy costs. Also, if the reward multiplier goes down to 2.5/1.75 (as I read somewhere else), the economic equation changes quite a bit.

We have run the validation procedure on a machine from an additional vendor: Gigabyte.

Gigabyte test machine abbreviated specs:

  • Dual AMD EPYC 7413 2.65 GHz, 24 Core, 180W
  • 512GB (16x32GB) 3200MHz DDR4 RDIMM
  • 10x7.68TB NVMe SSD (exceeds minimum specs)

Validation Results:

  • Stress tests - Passed :white_check_mark:
  • System benchmarks - Passed :white_check_mark:
    • As with the ASUS, about a 2x performance increase for cpu and memory.
    • Disk performance similar to or better than previous gen hardware
  • SEV-SNP capability - Verified BIOS and kernel support working in tandem - Passed :white_check_mark:

We have updated the node provider hardware wiki to include this configuration.

Does this mean that it is valid to use 7.68TB drives ( DWPD < 3 ) in gen2?

Adding this additional flexibility would be really useful!

> Does this mean that it is valid to use 7.68TB drives ( DWPD < 3 ) in gen2?

~~Yes~~ No. See update: DRAFT Motion Proposal: New Hardware specification and remuneration for IC nodes - #31 by garym

The intent of the 5x 6.4TB recommendation was to cover minimum storage requirements while balancing cost and reliability. Using fewer drives reduces the probability that any one of them fails, which matters because the disk layout does not use redundancy.
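
As a rough illustration of the drive-count argument: if drive failures are assumed to be independent and each drive has the same failure probability over some period, the chance that at least one drive fails grows with the number of drives. The 1% per-drive figure below is purely hypothetical.

```python
def p_any_drive_fails(n_drives: int, p_single: float) -> float:
    """Probability that at least one of n independent drives fails.

    Assumes independent failures with identical probability p_single,
    which is a simplification of real-world drive behaviour.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    return 1.0 - (1.0 - p_single) ** n_drives


P_SINGLE = 0.01  # hypothetical per-drive failure probability over the period
print(f"5 drives:  {p_any_drive_fails(5, P_SINGLE):.4f}")
print(f"10 drives: {p_any_drive_fails(10, P_SINGLE):.4f}")
```

Under these assumptions a 10-drive layout is roughly twice as likely as a 5-drive layout to lose a drive, and without redundancy any single loss affects the node.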

More ‘functional’ guidance and specifications for SSDs and RAM are in the works.
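
On the DWPD point from the question: drive endurance is commonly compared by converting DWPD (drive writes per day) into total data that can be written over the warranty period. The sketch below does that conversion; the 1 DWPD rating for the 7.68TB drive and the 5-year warranty are assumptions for illustration, so check the vendor datasheet for real figures.

```python
def total_writes_tb(capacity_tb: float, dwpd: float, warranty_years: float = 5.0) -> float:
    """Total terabytes that can be written over the warranty period.

    capacity_tb * dwpd is the rated daily write volume; multiply by the
    number of days in the (assumed) warranty period.
    """
    return capacity_tb * dwpd * 365 * warranty_years


# 6.4TB mixed-use drive at 3 DWPD versus a 7.68TB drive at an assumed 1 DWPD.
mixed_use = total_writes_tb(6.4, dwpd=3)
read_intensive = total_writes_tb(7.68, dwpd=1)
print(f"6.4TB @ 3 DWPD:  {mixed_use:,.0f} TB written over 5 years")
print(f"7.68TB @ 1 DWPD: {read_intensive:,.0f} TB written over 5 years")
```

Even though the 7.68TB drive is larger, a lower DWPD rating can leave it with much less rated write endurance, which is one reason capacity alone does not settle whether a drive is suitable.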

We have run the validation procedure on a machine from an additional vendor: Supermicro.

Supermicro test machine abbreviated specs:

  • Dual AMD EPYC 7543 32-Core Processor 2.8 GHz, 225W
  • 1024GB (16x64GB) 3200MHz DDR4 RDIMM
  • 10x7.68TB NVMe SSD

Note that the configuration of each of these components exceeds the Gen2 specs.

Validation Results:

  • Stress tests - Passed :white_check_mark:
  • System benchmarks - Passed :white_check_mark:
    • As with other Gen2 configurations, about a 2x performance increase for cpu and memory.
    • Disk performance similar to or better than previous gen hardware.
  • SEV-SNP capability - Verified BIOS and kernel support working in tandem - Passed :white_check_mark:

The sysbench tool used in previous validation runs gave us some trouble and odd numbers for disk benchmarks. We switched to fio, which provided more stable and predictable results. The validation procedure is being updated and will be run again on the previous machines to maintain a fair comparison.
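
For readers who want to try a comparable disk measurement themselves, here is a minimal sketch of driving fio from Python and reading its JSON output. The job parameters (random reads, 4k blocks, queue depth 32, 1G test file) are assumptions chosen for illustration and are not the team's actual validation procedure. Point it at a scratch file, never at a device holding data you care about.

```python
import json
import subprocess


def fio_command(target: str, rw: str = "randread", block_size: str = "4k",
                runtime_s: int = 60) -> list[str]:
    """Build an fio command line; all job parameters here are illustrative."""
    return [
        "fio",
        "--name=gen2-disk-check",
        f"--filename={target}",
        f"--rw={rw}",
        f"--bs={block_size}",
        "--ioengine=libaio",
        "--direct=1",
        "--iodepth=32",
        "--size=1G",
        "--time_based",
        f"--runtime={runtime_s}",
        "--output-format=json",
    ]


def summarize(fio_json: str) -> dict[str, float]:
    """Extract read/write IOPS of the first job from fio's JSON output."""
    job = json.loads(fio_json)["jobs"][0]
    return {"read_iops": job["read"]["iops"], "write_iops": job["write"]["iops"]}


def run_fio(target: str) -> dict[str, float]:
    """Run fio (must be installed) against `target` and return the IOPS summary."""
    result = subprocess.run(fio_command(target), check=True,
                            capture_output=True, text=True)
    return summarize(result.stdout)
```

Calling `run_fio("/path/to/scratch-file")` on two machines with the same parameters gives numbers that can be compared directly, which is the point of re-running the updated procedure on the earlier hardware.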

We have updated the node provider hardware wiki to include this configuration.

When is this validated hardware expected to get final approval (if there is such a step)?
I am asking because I want to know when it is safe to place an order for hardware like this. I have checked with vendors like Dell, and they have a lead time of 6-8 months (at least here in Canada).

Is there a timeline for defining the minimum storage and RAM requirements for Gen 2 hardware?
