𝗡𝗩𝗜𝗗𝗜𝗔 𝗣𝗘𝗘𝗥𝗠𝗘𝗠 𝗜𝗡𝗩𝗔𝗟𝗜𝗗 𝗔𝗥𝗚𝗨𝗠𝗘𝗡𝗧 𝗙𝗜𝗫

You see the error "Invalid argument" when running sudo modprobe nvidia-peermem on Ubuntu.

This happens because you use the standard Ubuntu InfiniBand stack. The nvidia-peermem module needs a specific API found only in MLNX_OFED. If you use the inbox rdma-core stack, the module will always fail.
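
To confirm this is your situation, you can check whether the running kernel exports the peer memory API. Below is a minimal Python sketch. It assumes the API in question is the ib_register_peer_memory_client symbol (this post does not name it) and that ib_core is already loaded, since module symbols only show up in /proc/kallsyms after loading. The name exports_symbol is made up for illustration.

```python
from pathlib import Path

# Assumption: this is the symbol nvidia-peermem depends on. The post only says
# "a specific API" and does not name it.
PEER_MEM_SYMBOL = "ib_register_peer_memory_client"


def exports_symbol(kallsyms_text: str, symbol: str = PEER_MEM_SYMBOL) -> bool:
    """Return True if `symbol` is listed as a symbol name in kallsyms text."""
    for line in kallsyms_text.splitlines():
        fields = line.split()
        # Each line looks like: "<address> <type> <name> [module]"
        if len(fields) >= 3 and fields[2] == symbol:
            return True
    return False


if __name__ == "__main__":
    try:
        text = Path("/proc/kallsyms").read_text()
    except OSError:
        text = ""
    if exports_symbol(text):
        print("peer memory API found")
    else:
        print("peer memory API not found (or ib_core is not loaded)")
```

If the symbol is missing on an inbox stack, that is consistent with the failure described above.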

Do not install MLNX_OFED just to fix this. That is a heavy and unnecessary step.

If you use Hopper or Blackwell GPUs with the NVIDIA open driver, you do not need nvidia-peermem. You should use DMA-BUF instead. It provides GPUDirect RDMA natively.

How to fix it:

You must load nvidia-drm with modeset=1. A disabled modeset is the most common reason DMA-BUF fails.

Check your current status: cat /sys/module/nvidia_drm/parameters/modeset

If it shows N, follow these steps:

To fix it for the current session:

  sudo modprobe -r nvidia_drm
  sudo modprobe nvidia_drm modeset=1

To fix it permanently:

  echo 'options nvidia-drm modeset=1' | sudo tee /etc/modprobe.d/nvidia-drm-modeset.conf
  sudo update-initramfs -u
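
If you want to script the modeset check, here is a small Python sketch. It reads the same sysfs file used in the status check and treats a missing file as "nvidia_drm is not loaded". The function names parse_modeset and read_modeset are made up for illustration.

```python
from pathlib import Path
from typing import Optional

MODESET_PATH = Path("/sys/module/nvidia_drm/parameters/modeset")


def parse_modeset(text: str) -> bool:
    """Boolean module parameters read back as 'Y' or 'N'."""
    return text.strip() == "Y"


def read_modeset(path: Path = MODESET_PATH) -> Optional[bool]:
    """Return True/False for Y/N, or None if the file cannot be read."""
    try:
        return parse_modeset(path.read_text())
    except OSError:
        # The file only exists while the nvidia_drm module is loaded.
        return None


if __name__ == "__main__":
    state = read_modeset()
    if state is None:
        print("nvidia_drm is not loaded")
    elif state:
        print("modeset is enabled")
    else:
        print("modeset is disabled, apply the fix above")
```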

Requirements for DMA-BUF:

  • NVIDIA open kernel driver
  • HCA with ODP support (ConnectX-6 or ConnectX-7)
  • Hopper or newer GPU (H100, H200, or B200)
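
The first requirement can usually be checked from /proc/driver/nvidia/version. This Python sketch assumes the open driver identifies itself with the words "Open Kernel Module" on the NVRM line. Treat that string as an assumption and verify it on your own system. The function name is_open_kernel_module is made up for illustration.

```python
from pathlib import Path

VERSION_PATH = Path("/proc/driver/nvidia/version")

# Assumption: the open driver's NVRM line contains this phrase.
OPEN_MARKER = "Open Kernel Module"


def is_open_kernel_module(version_text: str) -> bool:
    """Return True if the NVRM version line looks like the open driver."""
    for line in version_text.splitlines():
        if line.startswith("NVRM version:"):
            return OPEN_MARKER in line
    return False


if __name__ == "__main__":
    try:
        text = VERSION_PATH.read_text()
    except OSError:
        print("NVIDIA driver is not loaded")
    else:
        if is_open_kernel_module(text):
            print("open kernel driver detected")
        else:
            print("open kernel driver not detected")
```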

To verify it works, ensure these three steps succeed in your code:

  1. cudaMalloc() to allocate memory.
  2. cuMemGetHandleForAddressRange() with handle type CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD to export memory as a DMA-BUF file descriptor.
  3. ibv_reg_dmabuf_mr() to register that descriptor with the HCA.

Comparison:

Legacy (nvidia-peermem):

  • Needs MLNX_OFED
  • Needs an external module
  • Works on all GPUs

Modern (DMA-BUF):

  • Works on inbox rdma-core
  • No external module needed
  • Works on Hopper and newer
  • Preferred by NVIDIA

If your InfiniBand ports stay in the INIT state, you likely lack a Subnet Manager. Run:

  sudo apt install opensm
  sudo systemctl start opensm
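
To see which ports are stuck, you can read the port state from sysfs. This Python sketch assumes the usual layout /sys/class/infiniband/<device>/ports/<port>/state, where each file holds a line such as "2: INIT" or "4: ACTIVE". The function names parse_port_state and list_port_states are made up for illustration.

```python
from pathlib import Path
from typing import Dict, Tuple

IB_ROOT = Path("/sys/class/infiniband")


def parse_port_state(text: str) -> Tuple[int, str]:
    """Split a sysfs state line like '2: INIT' into (2, 'INIT')."""
    number, name = text.split(":", 1)
    return int(number.strip()), name.strip()


def list_port_states(root: Path = IB_ROOT) -> Dict[str, Tuple[int, str]]:
    """Map 'device/port' to its parsed state; empty if no devices exist."""
    states: Dict[str, Tuple[int, str]] = {}
    if not root.is_dir():
        return states
    for state_file in sorted(root.glob("*/ports/*/state")):
        port = state_file.parent.name
        device = state_file.parent.parent.parent.name
        states[f"{device}/{port}"] = parse_port_state(state_file.read_text())
    return states


if __name__ == "__main__":
    for port, (number, name) in list_port_states().items():
        hint = "  <- check the Subnet Manager" if name == "INIT" else ""
        print(f"{port}: {name}{hint}")
```

After opensm starts, ports that were in INIT should move to ACTIVE.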

Source: https://dev.to/fpolica91/nvidia-peermem-invalid-argument-fix-2b3n
