𝗡𝗩𝗜𝗗𝗜𝗔 𝗣𝗘𝗘𝗥𝗠𝗘𝗠 𝗜𝗡𝗩𝗔𝗟𝗜𝗗 𝗔𝗥𝗚𝗨𝗠𝗘𝗡𝗧 𝗙𝗜𝗫
You see the error "Invalid argument" when running sudo modprobe nvidia-peermem on Ubuntu.
This happens because you are using the standard Ubuntu InfiniBand stack. The nvidia-peermem module needs the kernel peer-memory API that ships with MLNX_OFED. On the inbox rdma-core stack that API is missing, so loading the module fails.
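One way to confirm this on your own machine is to look for the peer-memory registration symbol in the running kernel. This is a minimal sketch, not an official diagnostic: the symbol name ib_register_peer_memory_client and the helper has_peer_mem_api are assumptions that do not come from the original post, and the check only makes sense once ib_core is loaded.

```shell
#!/bin/sh
# has_peer_mem_api: hypothetical helper. Succeeds if the symbol table at $1
# (default: /proc/kallsyms) lists the peer-memory registration symbol.
# Assumption: nvidia-peermem binds to ib_register_peer_memory_client.
has_peer_mem_api() {
    symtab="${1:-/proc/kallsyms}"
    # -w matches the whole symbol name, -q suppresses normal output
    grep -qw 'ib_register_peer_memory_client' "$symtab" 2>/dev/null
}

if has_peer_mem_api; then
    echo "peer-memory API present: nvidia-peermem has what it needs"
else
    echo "peer-memory API missing: modprobe nvidia-peermem will likely fail"
fi
```

If the second message appears, the DMA-BUF route described in the rest of this post is the one to try.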
Do not install MLNX_OFED just to fix this. That is a heavy and unnecessary step.
If you use Hopper or Blackwell GPUs with the NVIDIA open driver, you do not need nvidia-peermem. You should use DMA-BUF instead. It provides GPUDirect RDMA natively.
How to fix it:
Enable modeset=1 for the nvidia-drm module. A disabled modeset is the most common reason DMA-BUF fails.
Check your current status: cat /sys/module/nvidia_drm/parameters/modeset
If it shows N, follow these steps:
To fix it for the current session:
    sudo modprobe -r nvidia_drm
    sudo modprobe nvidia_drm modeset=1
To fix it permanently:
    echo 'options nvidia-drm modeset=1' | sudo tee /etc/modprobe.d/nvidia-drm-modeset.conf
    sudo update-initramfs -u
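After applying either fix, a short script can confirm both the runtime value and the persistent option. This is a sketch under stated assumptions: the helper names modeset_state and modeset_persisted are hypothetical, and the script assumes the option lives in a *.conf file under /etc/modprobe.d as in the command above.

```shell
#!/bin/sh
# modeset_state: hypothetical helper. Prints "on", "off", or "unloaded"
# based on the sysfs parameter file at $1.
modeset_state() {
    param="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    if [ ! -r "$param" ]; then
        echo "unloaded"      # nvidia_drm is not loaded, so no parameter file
        return
    fi
    case "$(cat "$param")" in
        Y|1) echo "on" ;;
        *)   echo "off" ;;
    esac
}

# modeset_persisted: hypothetical helper. Succeeds if any *.conf file in the
# directory $1 sets modeset=1 for nvidia-drm (or nvidia_drm).
modeset_persisted() {
    dir="${1:-/etc/modprobe.d}"
    grep -Eqs '^[[:space:]]*options[[:space:]]+nvidia[-_]drm[[:space:]].*modeset=1' "$dir"/*.conf
}

echo "runtime modeset: $(modeset_state)"
if modeset_persisted; then
    echo "persistent option: found"
else
    echo "persistent option: not found"
fi
```

A result of "runtime modeset: on" together with "persistent option: found" suggests the setting is active now and should survive a reboot.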
Requirements for DMA-BUF:
- NVIDIA open kernel driver
- HCA with ODP support (ConnectX-6 or ConnectX-7)
- Hopper or newer GPU (H100, H200, or B200)
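The first requirement can be checked from the shell. The sketch below assumes that the open kernel driver identifies itself with the phrase "Open Kernel Module" in /proc/driver/nvidia/version; that string and the helper name driver_flavor are assumptions, so compare against what your own system prints before relying on it.

```shell
#!/bin/sh
# driver_flavor: hypothetical helper. Prints "open", "proprietary", or
# "unknown" based on the version file at $1.
# Assumption: the open driver's version line contains "Open Kernel Module".
driver_flavor() {
    ver="${1:-/proc/driver/nvidia/version}"
    if [ ! -r "$ver" ]; then
        echo "unknown"           # driver not loaded or file not readable
    elif grep -q 'Open Kernel Module' "$ver"; then
        echo "open"
    else
        echo "proprietary"
    fi
}

echo "NVIDIA kernel driver: $(driver_flavor)"
```

If this prints "proprietary", switch to the open kernel driver before spending time on the DMA-BUF steps.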
To verify it works, ensure these three steps succeed in your code:
- cudaMalloc() to allocate memory.
- cuMemGetHandleForAddressRange() to export memory as a DMA-BUF file descriptor.
- ibv_reg_dmabuf_mr() to register that descriptor with the HCA.
Comparison:
Legacy (nvidia-peermem):
- Needs MLNX_OFED
- Needs an external module
- Works on all GPUs
Modern (DMA-BUF):
- Works on inbox rdma-core
- No external module needed
- Works on Hopper and newer
- Preferred by NVIDIA
If your InfiniBand ports stay in the INIT state, you likely lack a Subnet Manager. Run:
    sudo apt install opensm
    sudo systemctl start opensm
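To see which ports are affected, you can read the port state files under /sys/class/infiniband, which hold values such as "2: INIT" or "4: ACTIVE". This is a minimal sketch; the helper name ports_in_init is hypothetical.

```shell
#!/bin/sh
# ports_in_init: hypothetical helper. Prints "<device>/<port>" for every
# port under $1 whose state file reads INIT.
ports_in_init() {
    base="${1:-/sys/class/infiniband}"
    for f in "$base"/*/ports/*/state; do
        [ -r "$f" ] || continue                  # glob matched nothing
        if grep -q 'INIT' "$f"; then
            port_dir=$(dirname "$f")             # <base>/<device>/ports/<port>
            dev=$(basename "$(dirname "$(dirname "$port_dir")")")
            echo "$dev/$(basename "$port_dir")"
        fi
    done
}

stuck=$(ports_in_init)
if [ -n "$stuck" ]; then
    echo "ports waiting for a Subnet Manager:"
    echo "$stuck"
else
    echo "no ports found in INIT"
fi
```

Run it again after starting opensm. Ports that were listed should move to ACTIVE and drop out of the output.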
Source: https://dev.to/fpolica91/nvidia-peermem-invalid-argument-fix-2b3n