Community Consideration: GPU Use Cases and Specifications

TLDR: The focus is on community feedback for GPU types and usage in the upcoming Gen 3 node specifications for Dfinity. Current nodes lack GPU support, and the discussion centers on whether to include GPUs in the Gen 3 specifications or in dedicated subnets; community insights on GPU utilization are sought.

  1. Status Quo:
  • Current Gen 2 nodes in Dfinity’s network boast 500 GB of RAM and substantial processing power, adept at parallelizing processes and running multiple single-threaded WASM canisters, each with a 4 GB RAM limit. Despite their capability, these nodes do not feature GPU support, which could offer significant improvements in execution parallelization.
  • However, the architectural design of GPUs does not effectively support task-specific parallelization. For instance, a task requiring only 1 GB of VRAM would monopolize an entire GPU with 80 GB of VRAM, leading to potential inefficiencies.
  • The integration of GPUs, in any form, into Dfinity’s nodes therefore represents a major step beyond the current state-of-the-art (SotA) hardware capabilities of the Gen 2 nodes. The question now is whether to integrate GPUs into the new Gen 3 specifications (and if so, what the VRAM and compute requirements should be) or to develop dedicated GPU-focused subnets.
  2. What We Are Asking the Community:
  • Community input is sought to determine the type of GPU integration preferred. Key questions include:
    • How would GPUs be utilized in various projects?
    • What level of compute power is needed?
    • How long would the GPU be in use for typical tasks?
    • What are the VRAM requirements for these applications?

This approach aims to align the development of Gen 3 nodes with the practical needs and technological aspirations of the Dfinity community.

11 Likes

Kinic is testing various types of cloud hardware for AI inference to get ideal specs.

  • How would GPUs be utilized in various projects?
    Browser-based models work but are still clunky. Having a GPU subnet for inference would be great.
  • What level of compute power is needed?
    Roughly a g4dn.xlarge equivalent, e.g. Deep Learning AMI GPU CUDA 11.5.2 (Amazon Linux 2) 20230222 or Deep Learning Base AMI (Ubuntu 18.04) Version 56.6.
  • How long would the GPU be in use for typical tasks?
    10s to 50s.
  • What are the VRAM requirements for these applications?
    16 GB.

5 Likes

Thank you for your feedback. It would be great to hear from others about additional applications:

  • Graphics and Simulation: @UnfoldVR and @cgiteach might provide insights on GPUs’ original purpose in graphics rendering. GPUs are extensively used in gaming, film production, and architectural design for complex 3D graphics, VR, and AR.
  • Media and Webcasting: @rlaracue, considering Catalyze’s interest in webcasting integration, it’s worth noting that GPUs are pivotal in the media industry for efficient video encoding and decoding. This is crucial in streaming services and video production for processing high-resolution content.
  • Scientific Simulations: For SCINET_INC, GPUs’ role in replicating simulations in physics, chemistry, and biology is significant. Their use extends to weather forecasting, molecular modeling, and astronomy.
  • Cryptography: There’s also a noteworthy application of GPUs in cryptographic tasks, including hashing, encryption, and decryption.
  • Signal Processing: In telecommunications, signal processing tasks such as filtering and transforming signals benefit greatly from GPU acceleration.
  • Network and Cybersecurity: GPUs can also accelerate network-related tasks, enhancing capabilities in traffic analysis and intrusion detection systems.
  • [Amend] Multiplayer Gaming: @Eimolad may be able to provide insight into the role of server-side GPUs in multiplayer gaming.

I look forward to hearing more insights and perspectives on these diverse applications.

1 Like

My current focus in AI applications involves a critical examination of GPU usage, especially how GPUs facilitate execution parallelization. A notable observation is that while GPUs significantly enhance parallel processing, they are not as effective at task-specific parallelization. Task-specific parallelization is critical for efficiently utilized subnets and for the health of the IC ecosystem. The inefficiency is evident when a task requiring only 1 GB of VRAM monopolizes an entire 80 GB VRAM GPU.

In the realm of AI, particularly with the popular Transformer architecture, data size inflates massively within each layer while the activations passed between layers remain comparatively small. This creates an opportunity to pass small data packets between layers efficiently. If enough 1 GB VRAM GPUs were available, it might be feasible to distribute state-of-the-art neural network models across multiple GPUs. While this approach, involving data transfer between GPUs or from GPU to CPU and back, might be slower than processing on a single GPU, it could offer a more modular and efficient utilization of GPU resources across various applications.

To delve deeper into this concept, I aim to answer these specific questions through empirical analysis:

  1. What is the minimum GPU capacity required for a given task?
  2. How many GPUs, of this smaller size, would be necessary to work together effectively?
  3. What is the time cost associated with each additional GPU used in the application (this should boil down to data transfer time)?

This inquiry aims to optimize GPU utilization in AI, balancing performance with subnet resource efficiency.
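To make question 3 concrete, here is a minimal sketch of how the per-hop transfer cost could be measured, assuming PyTorch and two visible CUDA devices; the function name, tensor shape, and iteration count are illustrative choices of mine rather than part of any existing benchmark:

```python
import time
import torch

def time_gpu_to_gpu_copy(shape=(1, 512, 768), iters=100):
    """Rough per-hop cost of handing an activation tensor from one GPU to
    the next (question 3 above). The shape (1, 512, 768) roughly matches a
    GPT-2 hidden state at a 512-token context; both defaults are arbitrary."""
    assert torch.cuda.device_count() >= 2, "needs two visible CUDA devices"
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
    x = torch.randn(*shape, device=dev0)
    x.to(dev1)                          # warm-up so one-time setup is not timed
    torch.cuda.synchronize(dev0)
    torch.cuda.synchronize(dev1)
    t0 = time.perf_counter()
    for _ in range(iters):
        x.to(dev1)                      # device-to-device copy
    torch.cuda.synchronize(dev0)
    torch.cuda.synchronize(dev1)
    return (time.perf_counter() - t0) * 1000.0 / iters   # milliseconds per copy

if __name__ == "__main__":
    print(f"average transfer: {time_gpu_to_gpu_copy():.3f} ms")
```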

[Amendment] How the GPUs are connected will also be critical in evaluating performance. At home I use PCI Express (PCIe), which offers lower bandwidth than more specialized interconnects: PCIe 4.0, for instance, provides up to 32 GB/s in each direction, while NVIDIA NVLink, designed to link GPUs directly or connect GPUs to CPUs, offers significantly higher bandwidth, with NVLink 2.0 providing up to 300 GB/s of bidirectional bandwidth. In our scenario, however, raw bandwidth is unlikely to be the main concern. More important are the areas where NVLink or Infinity Fabric outperform PCIe in ways that matter for small data (a rough calculation after the list below puts numbers on this):

  • Lower Latency: the time it takes for data to start its transfer from one point to another, which is especially beneficial for small data transfers.
  • Efficient Data Transfer: a more direct and efficient path for data transfer between GPUs, or between GPUs and CPUs. This efficiency is not just about the volume of data that can be moved, but also about how quickly and seamlessly these transfers can occur, regardless of the size of the data.
  • Reduced Overheads: NVLink is designed to reduce protocol overheads that are typical in PCIe transfers. This streamlined approach allows for faster data movement, again benefiting both small and large data transfers.
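To put rough numbers on the point above, here is a back-of-the-envelope calculation of my own, assuming an fp32 GPT-2-sized hidden state at a 512-token context and the headline per-direction bandwidths mentioned earlier; the figures are illustrative, not measurements:

```python
# Back-of-the-envelope: time to move one GPT-2-sized activation between GPUs.
tokens, hidden, bytes_per_value = 512, 768, 4          # fp32 hidden state
payload_gb = tokens * hidden * bytes_per_value / 1e9   # ~0.0016 GB (~1.6 MB)

for name, gb_per_s in [("PCIe 4.0 x16 (~32 GB/s per direction)", 32),
                       ("NVLink 2.0 (~150 GB/s per direction)", 150)]:
    print(f"{name}: {payload_gb / gb_per_s * 1e6:.0f} microseconds")
# Tens of microseconds either way, so fixed per-transfer latency and
# protocol overhead dominate for payloads this small, not raw bandwidth.
```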
2 Likes

Interesting proposition @jeshli
At first glance the main utility for GPU processing on-chain would be for AI training done within a secure and transparent environment.

From our (UnfoldVR) perspective, when it comes to gaming or intensive graphics you typically defer to the end user’s hardware to act as the graphical processor of whatever you’re trying to render. Either the browser or the PC or the mobile device renders the assets downloaded from ICP canisters and reads/writes instructions to the blockchain.

The main utility from a gaming perspective would be strictly from running an AI algorithm on-chain to power interactive chat for characters or other similar tasks.

One potentially relevant use case would be having GPU machines do pixel streaming (as Epic Games calls it) or something similar to cloud real-time rendering, and then stream the result to the end user over the WebRTC protocol. In this scenario the user has a WebRTC connection to the GPU machine. The user takes an action, and the GPU machine runs a render engine which renders a scene locally, in real time, based on the user’s input (movement, configurations, selections, etc.), then streams that render back to the user. This technology is similar to Xbox Cloud or Nvidia GeForce Now.

In practice, as most recently demonstrated by Xbox Cloud support on the new Quest 3 VR headset, this isn’t a great solution for fast-paced games and experiences. Even on a traditional architecture, with Wifi 6 and high internet speeds, getting user input, rendering a high-quality scene on a server machine, compressing a video frame, sending it over, and finally decompressing it locally introduces at least 0.5 seconds of lag.

As such, I think it’s optimal to have the rendering burden on the end-user and the graphics optimization burden on the developers, for graphics intensive scenarios. As devices become more and more powerful and miniaturized, the use cases for cloud real-time rendering are very niche.

Finally, there are multiple use cases for non-real-time rendering solutions (web-based product design, architectural design, etc.). You could render a simplified version of the 3D scene in real time on-device and process a high-fidelity render on-chain, on demand. But this could also happen on the CPU, and the use cases I can think of don’t warrant the level of security the blockchain provides.

4 Likes

GPU subnets might be worth it for this alone.

2 Likes

Representing SCINET here. We foresee a need for AI applications but haven’t determined what our specific GPU needs would be. I’m interested to see what the capabilities of on-chain GPUs are though.

As for applications of the GPUs, we could use them for simulation replication, per your comment, as well as for analyzing research data and determining new scientific applications for the data. There are also likely a myriad of other applications to be discovered in time.

1 Like

I conducted an experimental analysis in order to understand the performance implications of splitting a GPT-2 model for distributed processing across multiple GPUs. By dividing the model into two distinct components and executing these components across different GPUs, I aimed to explore the efficiency and potential latency trade-offs involved in such a distributed computing approach.

For my comparison, I had two NVIDIA GeForce GTX 1080 Ti graphics cards available, interconnected via PCIe 4.0 links. The experimental design involved running the two halves of the GPT-2 model both sequentially on a single GPU and separately on two different GPUs. The key variable in this setup was the inter-GPU data transfer required when the model was split across two GPUs. This transfer represents a critical aspect of distributed GPU computing, as it is often a bottleneck in multi-GPU configurations.

The experiment analyzes three distinct token sizes for our input data: 12 tokens, representing the query limit; 52 tokens, the update limit; and 512 tokens, corresponding to GPT-2’s context window limit. These sizes were selected to provide insights across a spectrum of potential use cases and to understand how input size affects performance in distributed GPU setups.
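For concreteness, here is a minimal sketch of how such a two-way split can be set up, assuming PyTorch and the Hugging Face transformers GPT-2 implementation; the device placement, helper function, and timing code are my own reconstruction for illustration, not the exact script behind the numbers below:

```python
import time
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

# Manual split: embeddings plus the first 6 transformer blocks on cuda:0,
# the last 6 blocks plus the final layer norm on cuda:1.
dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
model.wte.to(dev0); model.wpe.to(dev0); model.drop.to(dev0)
for block in model.h[:6]:
    block.to(dev0)
for block in model.h[6:]:
    block.to(dev1)
model.ln_f.to(dev1)

@torch.no_grad()
def split_forward(input_ids):
    ids = input_ids.to(dev0)
    pos = torch.arange(ids.shape[1], device=dev0).unsqueeze(0)
    hidden = model.drop(model.wte(ids) + model.wpe(pos))
    for block in model.h[:6]:
        hidden = block(hidden)[0]
    hidden = hidden.to(dev1)            # the inter-GPU hop under test
    for block in model.h[6:]:
        hidden = block(hidden)[0]
    return model.ln_f(hidden)

# Time one forward pass at the 512-token context limit.
ids = tokenizer("hello " * 600, return_tensors="pt").input_ids[:, :512]
split_forward(ids)                      # warm-up run
torch.cuda.synchronize(dev0); torch.cuda.synchronize(dev1)
t0 = time.perf_counter()
split_forward(ids)
torch.cuda.synchronize(dev0); torch.cuda.synchronize(dev1)
print(f"split forward pass: {(time.perf_counter() - t0) * 1000:.3f} ms")
```

The single-GPU baseline only requires placing every module on cuda:0 and dropping the intermediate .to(dev1) hop, so the difference between the two timings isolates the transfer overhead.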

The results of the experiments revealed a consistent pattern across all input sizes. When the model was split across two GPUs, there was an observable increase in runtime compared to running the model sequentially on a single GPU. This increase averaged around 5 milliseconds, which translates to approximately a 3% increase in runtime.

| Token Size | Configuration | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
|---|---|---|---|---|
| 12 tokens | Single GPU | 166.481 | 164.100 | 162.965 |
| 12 tokens | Separate GPUs | 171.462 | 169.392 | 169.625 |
| 52 tokens | Single GPU | 162.891 | 163.413 | 159.587 |
| 52 tokens | Separate GPUs | 165.439 | 166.575 | 162.839 |
| 512 tokens | Single GPU | 177.951 | 179.436 | 178.569 |
| 512 tokens | Separate GPUs | 185.505 | 182.381 | 185.721 |

If we were to segment the GPT-2 model across the maximum of 12 distinct GPUs (one per transformer block), then based on our current findings we could approximate its runtime at around 226.481 milliseconds, a 36% increase. Although any reduction in speed is generally undesirable, this preliminary investigation reveals that, within the constraints of current canister limitations, transferring data between GPUs for deep transformer architectures is relatively efficient and emerges as a promising approach. Improvements in interconnect technology could further mitigate latency issues. However, this leads to another consideration: the limit on how many GPUs can feasibly be connected to one another within a single rack. This highlights a delicate balance between enhancing computational power through distributed processing and the practical constraints of hardware architecture and interconnect capabilities.
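For transparency, the 226.481 ms figure appears to follow from adding the roughly 5 ms per-split overhead once per GPU to the 166.481 ms single-GPU baseline for 12 tokens; a quick check of that arithmetic, under the assumption that the overhead scales linearly with the number of splits:

```python
# Rough check of the 12-GPU extrapolation, assuming the ~5 ms per-split
# overhead observed with two GPUs scales linearly with the number of GPUs.
baseline_ms = 166.481                    # 12-token, single-GPU Run 1
per_split_overhead_ms = 5.0              # average observed increase per split
gpus = 12

estimate_ms = baseline_ms + gpus * per_split_overhead_ms
print(estimate_ms)                                        # 226.481
print(f"{(estimate_ms / baseline_ms - 1) * 100:.0f}%")    # ~36% increase
```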

5 Likes