𝗧𝗲𝘀𝘁𝗶𝗻𝗴 𝗔𝗜 𝗦𝗮𝗳𝗲𝘁𝘆: 𝗪𝗶𝗹𝗱𝗚𝘂𝗮𝗿𝗱 𝘃𝘀 𝗡𝗲𝗺𝗼𝘁𝗿𝗼𝗻

📅2 hours ago⏱1 min read

I tested three local AI safety classifiers for medical use.

I used 50 test cases. 22 cases were real attacks like jailbreaks. 18 cases were safe medical queries.

The results show a big trade-off between catching attacks and blocking real users.

𝗧𝗵𝗲 𝗧𝗲𝘀𝘁 𝗥𝗲𝘀𝘂𝗹𝘁𝘀:

• WildGuard (F1: 0.941): It caught every single attack. However, it blocked many safe medical terms. It marked "BRCA1 mutation" and "GFR calculation" as harmful. This is bad for medical AI. • Nemotron-3-CS (F1: 0.857): It had zero false alarms. It never blocked a safe medical query. But, it missed 8 attacks. It failed to catch attacks using Base64 or ROT13 encoding. • LlamaGuard3 (F1: 0.800-0.821): It is a middle ground. It is better at allowing medical terms than WildGuard, but less accurate overall.

𝗪𝗵𝗮𝘁 𝗜 𝗟𝗲𝗮𝗿𝗻𝗲𝗱:

Medical terms cause false positives. Models like WildGuard often treat rare medical strings as high-risk content. You will need a whitelist if you use this in a clinic.
Encoding bypasses work. Nemotron-3-CS cannot stop technical attacks like Kubernetes infra probes or encoded commands.
Testing automation needs speed. I used Passmark AI to run these tests. I found that clicking through UIs like Swagger is too slow for AI agents. It is better to navigate directly to REST URLs or use Playwright request fixtures.

𝗪𝗵𝗶𝗰𝗵 𝗼𝗻𝗲 𝘀𝗵𝗼𝘂𝗹𝗱 𝘆𝗼𝘂 𝗰𝗵𝗼𝗼𝘀𝗲?

Choose WildGuard if you must catch every attack and can handle manual reviews for blocked terms.
Choose Nemotron-3-CS if you cannot afford to block legitimate medical questions.
Choose LlamaGuard if you need a balance on limited hardware.

No model is perfect for medical AI yet. We still need better audit logs and specialized training for medical vocabulary.

Source: https://dev.to/jh5_pulse/nemoclaw-shi-ce-ping-zheng-qie-qu-vs-zi-liao-wai-xie-11ga

Optional learning community: https://t.me/GyaanSetuAi

𝗧𝗲𝘀𝘁𝗶𝗻𝗴 𝗔𝗜 𝗦𝗮𝗳𝗲𝘁𝘆: 𝗪𝗶𝗹𝗱𝗚𝘂𝗮𝗿𝗱 𝘃𝘀 𝗡𝗲𝗺𝗼𝘁𝗿𝗼𝗻

Continue reading

𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝗦𝘂𝗰𝗰𝗲𝘀𝘀 𝗜𝘀 𝗔 𝗦𝗮𝗳𝗲𝘁𝘆 𝗙𝗮𝗶𝗹𝘂𝗿𝗲

𝗕𝗿𝗼𝘄𝘀𝗲𝗿 𝗔𝗴𝗲𝗻𝘁 𝗙𝗶𝗿𝗲𝘄𝗮𝗹𝗹 𝗳𝗼𝗿 𝗔𝗜 𝗦𝗮𝗮𝗦

𝗘𝘃𝗮𝗹𝘀 𝗔𝗿𝗲 𝗔𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 𝗘𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁: 𝗥𝘂𝗻𝘁𝗶𝗺𝗲 𝗦𝗮𝗳𝗲𝘁𝘆 𝗖𝗵𝗲𝗰𝗸𝘀

𝗔𝗜 𝗔𝘁𝘁𝗮𝗰𝗸 𝗖𝗼𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝗼𝗻: 𝗧𝗵𝗲 𝗡𝗲𝘄 𝗥𝗶𝘀𝗸

𝗦𝗲𝗹𝗲𝗰𝘁𝗶𝘃𝗲 𝗔𝘁𝘁𝗮𝗰𝗸𝗲𝗿𝘀 𝗖𝘂𝘁 𝗔𝗜 𝗦𝗮𝗳𝗲𝘁𝘆