Fedora Guide: Running Large Language Models (LLMs) on an AMD NPU with FastFlowLM


AI-assisted draft.


On Fedora, you can now run large language models (LLMs) directly on an AMD NPU. This guide shows how to set up the stack on an ASUS ROG Flow Z13 with a Ryzen AI Max 390 chip.

The setup requires four working layers:

Kernel + DKMS driver (amdxdna): Creates the device node and loads firmware.
XRT base: AMD's Xilinx Runtime (XRT), the userspace runtime.
XRT NPU plugin: Lets XRT detect the NPU.
FastFlowLM (flm): The tool that runs the models.
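
Before running anything heavier, the layers can be spot-checked from a shell. The sketch below is not from the original guide and rests on assumptions: it treats /sys/module/amdxdna as evidence that the driver is loaded, looks for xrt-smi under /opt/xilinx/xrt/bin (the path the guide's wrapper uses), and checks that flm is on the PATH. check_path and check_cmd are hypothetical helper names. The NPU plugin has no simple file check here; xrt-smi examine is the better test for it.

```shell
#!/bin/sh
# Sketch: report which layers of the stack look present on this machine.
# The sysfs and PATH checks are assumptions about a typical install.

# Hypothetical helper: prints ok/missing for a label based on whether a path exists.
check_path() {
    if [ -e "$2" ]; then
        echo "ok      $1 ($2)"
        return 0
    fi
    echo "missing $1 ($2)"
    return 1
}

# Hypothetical helper: prints ok/missing for a label based on whether a command is on PATH.
check_cmd() {
    if command -v "$2" >/dev/null 2>&1; then
        echo "ok      $1 ($2)"
        return 0
    fi
    echo "missing $1 ($2)"
    return 1
}

missing=0
check_path "amdxdna kernel module" /sys/module/amdxdna || missing=$((missing + 1))
check_path "XRT base" /opt/xilinx/xrt/bin/xrt-smi || missing=$((missing + 1))
check_cmd "FastFlowLM" flm || missing=$((missing + 1))
echo "$missing layer check(s) failed"
```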

Since Fedora lacks prebuilt packages for this, you must build from source.

⚠️ Critical Fixes Before You Start

Enable IOMMU: Many users disable the IOMMU when tuning the GPU, and this breaks the NPU. Check your kernel command line:

cat /proc/cmdline

If you see amd_iommu=off, remove it from /etc/default/grub, regenerate your GRUB config, and reboot.
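
The IOMMU check can be scripted so it matches amd_iommu=off as a whole kernel argument rather than as a substring. A minimal sketch; iommu_disabled is a hypothetical helper name, and the GRUB regeneration command in the comment assumes Fedora's default /boot/grub2/grub.cfg location.

```shell
#!/bin/sh
# Sketch: detect whether the running kernel was booted with amd_iommu=off.

# Hypothetical helper: succeeds if the command line in $1 contains amd_iommu=off
# as a separate, space-delimited argument.
iommu_disabled() {
    case " $1 " in
        *" amd_iommu=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

cmdline=$(cat /proc/cmdline 2>/dev/null || true)
if iommu_disabled "$cmdline"; then
    echo "amd_iommu=off is set: the NPU will not work until it is removed."
    # After editing /etc/default/grub, regenerate the config (assumed Fedora path):
    #   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
else
    echo "amd_iommu=off not found on the kernel command line."
fi
```
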
Set unlimited memlock: The NPU needs locked memory. Check your limit:

ulimit -l

If it is not unlimited, add these lines to /etc/security/limits.d/99-memlock.conf:

* soft memlock unlimited
* hard memlock unlimited

Then log out and back in.
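
A small script can check the limit and print the drop-in lines only when they are needed. This sketch assumes the wildcard domain `*`, since limits.conf-style entries need a domain field before the type and item; note that wildcard entries do not apply to root. memlock_conf is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: check the locked-memory limit and print the drop-in lines if needed.

# Hypothetical helper: prints the two limits.d lines. The "*" domain is an
# assumption; it matches all non-root users.
memlock_conf() {
    printf '%s\n' '* soft memlock unlimited' '* hard memlock unlimited'
}

current=$(ulimit -l)
if [ "$current" = "unlimited" ]; then
    echo "memlock limit is already unlimited."
else
    echo "memlock limit is $current; add these lines to /etc/security/limits.d/99-memlock.conf:"
    memlock_conf
    # One way to write them (needs root):
    #   memlock_conf | sudo tee /etc/security/limits.d/99-memlock.conf
fi
```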

Fix the xrt-smi path: Do not symlink xrt-smi; a symlink breaks its internal script. Use a wrapper instead:

sudo tee /usr/local/bin/xrt-smi <<'EOF'
#!/bin/sh
exec /opt/xilinx/xrt/bin/xrt-smi "$@"
EOF
sudo chmod +x /usr/local/bin/xrt-smi
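
Because a leftover symlink is the failure mode the guide warns about, it can help to confirm what actually sits at /usr/local/bin/xrt-smi. A minimal sketch; is_wrapper is a hypothetical helper name. The test relies on `[ -f ]` following symlinks while `[ -L ]` does not.

```shell
#!/bin/sh
# Sketch: confirm that xrt-smi on the PATH is the wrapper, not a symlink.

# Hypothetical helper: succeeds if $1 is an executable regular file
# that is not itself a symbolic link.
is_wrapper() {
    [ -f "$1" ] && [ ! -L "$1" ] && [ -x "$1" ]
}

target=/usr/local/bin/xrt-smi
if is_wrapper "$target"; then
    echo "$target is an executable wrapper script."
elif [ -L "$target" ]; then
    echo "$target is a symlink; replace it with the wrapper script."
else
    echo "$target is missing or not executable."
fi
```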

Build Steps Summary

Install dependencies: Use dnf to install git, dkms, cmake, and various development libraries.
Build XRT: Clone the xdna-driver repo. Create a cmake3 wrapper for Fedora. Build and install the RPMs.
Install NPU Plugin: Build the xrt_plugin from the xdna-driver repo and install the resulting RPM.
Build FastFlowLM: Clone the FastFlowLM repo and use cmake to build and install.
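
The guide does not spell out the cmake3 wrapper. A minimal sketch, assuming it follows the same exec-wrapper pattern as the xrt-smi fix and that Fedora's cmake is at /usr/bin/cmake; make_wrapper is a hypothetical helper name, not part of the xdna-driver or FastFlowLM build scripts.

```shell
#!/bin/sh
# Sketch: write a small exec wrapper, the same pattern the guide uses for xrt-smi.
# Assumption: the cmake3 wrapper simply forwards every argument to Fedora's cmake.

# Hypothetical helper: writes an executable script at $2 that runs $1 with all arguments.
make_wrapper() {
    printf '#!/bin/sh\nexec "%s" "$@"\n' "$1" > "$2" && chmod +x "$2"
}

# Only act when called with exactly two arguments: TARGET DEST.
if [ "$#" -eq 2 ]; then
    make_wrapper "$1" "$2"
fi
```

Saved as, say, make_wrapper.sh (a hypothetical file name), it could be run as `sudo sh make_wrapper.sh /usr/bin/cmake /usr/local/bin/cmake3`.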

Verification Commands

Check the kernel and NPU: flm validate

Check the hardware:

xrt-smi examine
xrt-smi validate

Run a model: flm run gemma4-it:e4b
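
The checks above can be chained so they stop at the first failure, which points to the layer that needs attention. A minimal sketch; run_checks is a hypothetical helper name. It deliberately does not start `flm run`, since that launches an interactive session.

```shell
#!/bin/sh
# Sketch: run the verification commands in order and stop at the first failure.

# Hypothetical helper: runs each argument as a shell command line and
# returns 1 as soon as one of them fails.
run_checks() {
    for cmd in "$@"; do
        echo ">>> $cmd"
        if ! sh -c "$cmd"; then
            echo "FAILED: $cmd"
            return 1
        fi
    done
    echo "all checks passed"
}

if run_checks "flm validate" "xrt-smi examine" "xrt-smi validate"; then
    echo "Stack looks healthy; try: flm run gemma4-it:e4b"
fi
```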

Performance Benchmarks (Ryzen AI Max 390)

Time to first token: 1.21 s
Prefill speed: 18 tok/s
Decoding speed: 11 tok/s
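
To turn these figures into a rough wait time, one simple model is: total time ≈ TTFT + (N - 1) / decoding speed for a reply of N tokens. This is an illustration, not a measurement: it ignores prompt length and assumes the measured TTFT already covers prefill for a similar prompt. estimate_seconds is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: estimate wall-clock seconds for a reply of N generated tokens.
# Assumes the first token arrives after TTFT and every later token
# costs 1/decode_rate seconds.

# Hypothetical helper. Usage: estimate_seconds TTFT DECODE_TOK_PER_S N_TOKENS
estimate_seconds() {
    awk -v ttft="$1" -v rate="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", ttft + (n - 1) / rate }'
}

estimate_seconds 1.21 11 100
```

With the figures above, a 100-token reply works out to 1.21 + 99/11 = 10.21 s.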

Source: https://dev.to/ankk98/running-llms-on-amd-npu-with-fastflowlm-fedora-guide-1oo5

Optional learning community: https://t.me/GyaanSetuAi
