
Modern deep learning infrastructure often requires transferring large amounts of data between GPUs across machines. At Perplexity, we encountered a unique technical challenge: efficiently transferring non-contiguous GPU memory regions between machines at maximum possible speed. Our target platform, AWS p5 instances, offers an impressive 3200 Gbps of network bandwidth through 32 network cards. This article shares our journey of building a custom high-performance networking solution that achieves 97.1% of this theoretical bandwidth.


The Technical Challenge

Our use case presented two key technical requirements:

  • High-bandwidth transfer of non-contiguous memory chunks between remote GPUs
  • Support for peer-to-peer communication patterns

While NVIDIA’s NCCL library is the de facto standard for distributed deep learning, it wasn’t ideal for our use case:

  • NCCL excels at collective communication but requires establishing a static “world,” so adjusting the set of participating nodes means restarting the entire cluster.
  • NCCL’s synchronous communication model adds complexity for our asynchronous workload.
  • We wanted direct control over our memory transfer patterns for optimization.

Modern High-Performance Networks

Most networks we use daily rely on TCP/IP protocols, where applications communicate with the network card through the operating system kernel using sockets. However, high-performance networks use RDMA (Remote Direct Memory Access) – a completely different hardware and software stack that enables direct memory access between machines without involving the CPU.

  1. Buffer Ownership: Unlike traditional sockets where the kernel manages network buffers and requires copying between user space and kernel space, RDMA requires applications to manage their own buffers. When an application initiates a network operation, it transfers buffer ownership to the network card until the operation completes, eliminating the need for data copying.
  2. Memory Registration: Applications must register memory regions with the operating system kernel. The kernel sets up virtual address mappings that allow the CPU, GPUs, and network cards to all understand the same virtual addresses. This registration is a one-time operation that enables subsequent zero-copy data transfers.
  3. Control Plane vs Data Plane: High-performance networks separate operations into two categories:
    • Control plane operations (like connection setup and memory registration) go through the kernel to ensure security
    • Data plane operations (actual data transfer) bypass the kernel for maximum performance
  4. Reception Before Transmission: Without kernel-managed buffers, applications must pre-post receive operations, specifying where incoming data should be placed. This is a fundamental shift from the socket model where applications can receive data at any time.
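Point 4 is the biggest mental shift coming from sockets, so it is worth illustrating. The sketch below is a toy in-process model of an RDMA receive queue, not real verbs code: in a real application the equivalent steps are registering memory with `ibv_reg_mr` and posting receive buffers with `ibv_post_recv`, and hardware rejects incoming data when no receive is posted. The toy mirrors the ownership rules: a posted buffer belongs to the “NIC” until a message completes into it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Toy model of an RDMA receive queue (real code: ibv_post_recv on a
// queue pair, after registering the buffers with ibv_reg_mr).
#define MAX_RECV 4

typedef struct {
    void  *bufs[MAX_RECV];  // posted receive buffers, owned by the "NIC"
    size_t lens[MAX_RECV];
    int    head, count;
} recv_queue;

// Post a buffer before any data arrives; ownership passes to the NIC
// until the corresponding receive completes.
static int post_recv(recv_queue *q, void *buf, size_t len) {
    if (q->count == MAX_RECV) return -1;        // receive queue full
    int slot = (q->head + q->count) % MAX_RECV;
    q->bufs[slot] = buf;
    q->lens[slot] = len;
    q->count++;
    return 0;
}

// An incoming message consumes the oldest posted buffer. With nothing
// posted (or a buffer that is too small), delivery fails -- unlike
// sockets, where the kernel would buffer the data for a later read().
static int deliver(recv_queue *q, const void *data, size_t len) {
    if (q->count == 0 || q->lens[q->head] < len) return -1;
    memcpy(q->bufs[q->head], data, len);  // NIC writes straight into app memory
    q->head = (q->head + 1) % MAX_RECV;
    q->count--;                           // ownership returns to the app
    return 0;
}
```

The ordering constraint this enforces (receives posted before the peer transmits) is why RDMA applications typically keep a ring of receive buffers posted at all times.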

AWS p5 instances have a sophisticated internal architecture. As shown below, each instance contains two CPU sockets forming two NUMA nodes, with each NUMA node connecting to four PCIe switches:
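This topology matters because a NIC should serve GPU memory attached to the same NUMA node. One way to discover the affinity on Linux is the standard sysfs attribute `/sys/bus/pci/devices/<addr>/numa_node`; the sketch below reads it for a given PCIe address (the address shown in the comment is a made-up example, and this helper is illustrative rather than taken from our codebase).

```c
#include <stdio.h>
#include <stdlib.h>

// Parse the contents of a sysfs numa_node file: "0\n", "1\n", or
// "-1\n" when the kernel has no affinity information for the device.
static int parse_numa_node(const char *s) {
    return (int)strtol(s, NULL, 10);
}

// Return the NUMA node a PCIe device is attached to, or -1 if unknown.
// pci_addr is a full PCI address such as "0000:4d:00.0" (hypothetical).
static int pci_numa_node(const char *pci_addr) {
    char path[128], line[16];
    snprintf(path, sizeof path,
             "/sys/bus/pci/devices/%s/numa_node", pci_addr);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int node = fgets(line, sizeof line, f) ? parse_numa_node(line) : -1;
    fclose(f);
    return node;
}
```

Pinning worker threads and registering buffers on the node reported here keeps transfers off the inter-socket link.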
