The Node.js bug that's invisible to your monitoring
Your health check says everything is fine. It takes one millisecond. Then traffic grows. Suddenly, your p99 latency jumps to 400ms. You look at your dashboards. Everything looks green.
CPU usage is moderate. Event loop lag is flat. Memory is healthy. Your APM shows a slow request but tells you nothing about why. There are no slow database calls. There are no errors.
The time is spent in the libuv thread pool.
Standard Node observability focuses on the event loop. The pool is a separate queue, and most tooling never measures how long work waits in it.
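Node runs JavaScript on the event loop. It pushes blocking or CPU-heavy native work to the libuv thread pool. Network sockets are not part of this; they use the operating system's non-blocking I/O. The pool handles:
- Filesystem work (fs.readFile, fs.writeFile, and the rest of the async fs API).
- Crypto tasks (crypto.pbkdf2, crypto.scrypt, and native addons such as the bcrypt package).
- Compression (async zlib: gzip, deflate).
- DNS lookups (dns.lookup).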
The pool defaults to only four threads. This is true regardless of how many CPU cores your machine has.
Four threads are often not enough under load. Here are three ways the pool breaks:
Bcrypt at login. A single bcrypt hash can take 250ms. If four people log in at once, all slots are full. A fifth person waits in a queue. They wait up to 250ms before their hash even starts, so their login takes roughly twice as long.
Large gzip operations. Compressing a large response holds a pool slot. If four requests do this at once, every other task waits. DNS lookups and file reads get stuck in line.
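DNS lookups. Most Node apps use dns.lookup, because the built-in net and http modules call it by default. It wraps the blocking getaddrinfo system call, so each lookup runs on the pool. If your resolver is slow or timing out, every stalled lookup holds a slot until it returns. A handful of them can stall the whole pool.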
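For clients you control, one way to keep lookups off the pool is to pass a custom lookup function backed by dns.resolve4, which uses c-ares instead of getaddrinfo. This is a minimal sketch under stated assumptions: makePoolFreeLookup is a hypothetical helper name, it handles IPv4 only, and because dns.resolve4 queries DNS servers directly it ignores /etc/hosts.

```javascript
const dns = require('node:dns');

// Build a replacement for the `lookup` option accepted by net.connect and
// http.request. `resolve4` is injectable only so the sketch can be exercised
// without a network; by default it is the c-ares-backed dns.resolve4.
function makePoolFreeLookup(resolve4 = dns.resolve4) {
  return function lookup(hostname, options, callback) {
    // dns.lookup allows the options argument to be omitted.
    if (typeof options === 'function') {
      callback = options;
      options = {};
    }
    resolve4(hostname, (err, addresses) => {
      if (err) return callback(err);
      if (options && options.all) {
        // Callers that pass `all: true` expect a list of { address, family }.
        callback(null, addresses.map((address) => ({ address, family: 4 })));
      } else {
        callback(null, addresses[0], 4);
      }
    });
  };
}

// Usage (hypothetical host):
// http.get({ host: 'example.com', lookup: makePoolFreeLookup() }, onResponse);
```

The trade-off is behavioral: names that only exist in the hosts file, and IPv6-only hosts, will not resolve through this function, so it suits known upstream hostnames better than arbitrary input.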
A request stuck in the pool queue is invisible. It is not using CPU. It is not running JavaScript. It is just parked.
How to find it:
If your p99 latency rises under load but event loop lag stays flat, check the pool.
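The fastest test: Increase UV_THREADPOOL_SIZE. Set it to 64 in the environment before the process starts, then restart. libuv reads the value once, when the pool is first used, and caps it at 1024. If latency drops, you found the problem.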
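If you would rather measure than guess, you can time a trivial pool task from inside the running process. The sketch below is one possible probe, not a standard API: probePool and startPoolProbe are hypothetical names, and it relies on the callback form of crypto.randomBytes being dispatched to the thread pool. The work itself is tiny, so under load most of the measured time is queue wait.

```javascript
const crypto = require('node:crypto');

// Time one trivial pool task. The callback form of crypto.randomBytes runs
// as a thread pool request, so a slow result points at a busy pool.
function probePool() {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    crypto.randomBytes(1, (err) => {
      if (err) return reject(err);
      resolve(Number(process.hrtime.bigint() - start) / 1e6);
    });
  });
}

// Sample the pool every `intervalMs` and hand each reading to `report`.
// Returns a function that stops the probe.
function startPoolProbe(intervalMs = 1000, report = console.log) {
  const timer = setInterval(async () => {
    try {
      report({ poolWaitMs: await probePool() });
    } catch (err) {
      report({ error: err.message });
    }
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for the probe
  return () => clearInterval(timer);
}
```

Feed poolWaitMs into the same dashboard as your event loop lag. A probe that climbs while event loop lag stays flat is the pattern described above.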
How to fix it properly:
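- Use worker_threads for heavy crypto like bcrypt, and call the synchronous variant inside the worker. The libuv pool is shared by the whole process, so an async hash called from a worker still queues on it.
- Use dns.resolve instead of dns.lookup where you can. It sends queries through c-ares and does not use the pool. It also ignores /etc/hosts, and http clients only use it if you pass them a custom lookup function.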
- Use streams for zlib work to release slots faster.
- Avoid heavy filesystem work on your main request paths.
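As an illustration of the worker_threads approach, the sketch below moves hashing into a dedicated worker that calls a synchronous hash. It is a minimal example under assumptions: crypto.scryptSync stands in for bcrypt, createHasher is a hypothetical helper, and the worker source is inlined with eval: true only to keep the example in one file.

```javascript
const { Worker } = require('node:worker_threads');

// Worker source kept inline so the sketch is a single file; a real project
// would normally put this in its own module.
const workerSource = `
  const { parentPort } = require('node:worker_threads');
  const crypto = require('node:crypto');
  parentPort.on('message', ({ id, password, salt }) => {
    // Synchronous call: it blocks this worker thread only and never
    // enters the libuv thread pool queue.
    const hash = crypto.scryptSync(password, salt, 64).toString('hex');
    parentPort.postMessage({ id, hash });
  });
`;

function createHasher() {
  const worker = new Worker(workerSource, { eval: true });
  const pending = new Map(); // request id -> { resolve, reject }
  let nextId = 0;

  worker.on('message', ({ id, hash }) => {
    pending.get(id).resolve(hash);
    pending.delete(id);
  });
  worker.on('error', (err) => {
    // Fail every in-flight request if the worker crashes.
    for (const { reject } of pending.values()) reject(err);
    pending.clear();
  });

  return {
    hash(password, salt) {
      return new Promise((resolve, reject) => {
        const id = nextId++;
        pending.set(id, { resolve, reject });
        worker.postMessage({ id, password, salt });
      });
    },
    close: () => worker.terminate(),
  };
}
```

A single worker hashes one password at a time, so logins still queue behind each other. What changes is that they no longer block file reads, gzip, or DNS lookups. In production you would likely run several workers, sized to the cores you can spare.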
Stop staring at green dashboards while your users wait.
Source: https://dev.to/r9v/the-nodejs-bug-thats-invisible-to-your-monitoring-oo8
