Carlini on LLM Security: Black-hat Attacks Explained

A talk by Nicholas Carlini titled “Black-hat LLMs” is climbing Hacker News, where it has drawn 160 points and a long discussion thread. According to the Hacker News submission, the video features the longtime adversarial ML researcher walking through how attackers actually use large language models in the wild, and what defenders keep getting wrong.

Carlini built his reputation tearing apart security assumptions baked into modern AI systems. He spent years at Google Brain and DeepMind publishing some of the most cited work on extracting training data from production models, slipping past safety filters, and probing the gap between what labs claim their models do and what the models actually do under pressure. When he gives a talk framed around “black-hat” use of LLMs, the AI security crowd pays attention.

What the talk covers

The framing flips the usual conversation. Most LLM security coverage focuses on defenders: red teams, refusal training, alignment evals. Carlini’s beat is the attacker side. His published work spans:

Training data extraction, where prompts pull verbatim copyrighted text or personal data out of a deployed model
Jailbreaks that survive RLHF and constitutional AI defenses
Prompt injection in agentic systems, where untrusted input hijacks a model with tool access
Membership inference and model stealing attacks against API-only endpoints

The HN discussion latches onto his recurring argument that current LLM defenses are paper-thin and that the field has not internalized 50 years of computer security lessons. Commenters echo the Schneier-style refrain that running adversarial input through a probabilistic system and hoping it stays safe is not a security model.

Why it matters for builders

If you’re shipping an LLM-backed product, Carlini’s lens is the one that should keep you up at night. A few practical takeaways surface across his body of research:

Treat any model output as untrusted, especially when tools or shells run downstream. The model is not your security boundary.
Assume jailbreaks exist for your system prompt. Design as if it’s already leaked.
Sensitive data in your training set or RAG corpus can be extracted. Audit before you pipe it in.
Refusal training is a UX feature, not a control. It does not stop a motivated attacker.
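
The first takeaway can be made concrete with a small sketch. This is not from the talk; it is a minimal illustration, assuming a hypothetical agent that proposes shell commands as plain strings. The allowlist contents and the names `ALLOWED_COMMANDS`, `validate_tool_call`, and `is_allowed` are invented for the example. The idea is that ordinary code, not the model, decides what is allowed to run.

```python
import shlex

# Hypothetical allowlist for illustration: command name -> permitted flags.
ALLOWED_COMMANDS = {
    "ls": {"-l", "-a"},
    "cat": set(),
}

# Characters that could chain, substitute, or redirect shell commands.
SHELL_METACHARACTERS = set(";|&$`><\n")


def validate_tool_call(model_output: str) -> list[str]:
    """Return an argv list if the proposed command passes every check.

    Raises ValueError otherwise. The model's text is treated as
    untrusted input, exactly like a string typed by an anonymous user.
    """
    if any(ch in SHELL_METACHARACTERS for ch in model_output):
        raise ValueError("shell metacharacters are not allowed")
    try:
        argv = shlex.split(model_output)
    except ValueError as exc:  # e.g. an unclosed quote
        raise ValueError(f"unparseable command: {exc}") from exc
    if not argv:
        raise ValueError("empty command")

    command, *args = argv
    if command not in ALLOWED_COMMANDS:
        raise ValueError(f"command not on allowlist: {command!r}")

    for arg in args:
        if arg.startswith("-"):
            # Flags must be explicitly permitted for this command.
            if arg not in ALLOWED_COMMANDS[command]:
                raise ValueError(f"flag not allowed for {command}: {arg!r}")
        elif arg.startswith("/") or ".." in arg.split("/"):
            # Reject absolute paths and parent-directory traversal.
            raise ValueError(f"path leaves the working directory: {arg!r}")
    return argv


def is_allowed(model_output: str) -> bool:
    """Convenience wrapper: True if the proposal passes validation."""
    try:
        validate_tool_call(model_output)
    except ValueError:
        return False
    return True
```

In a setup like this, the validated list would be handed to something like `subprocess.run(argv, shell=False)` so that no shell ever interprets the model's text. A string check of this kind is one layer at best; it does not replace sandboxing or least-privilege credentials for whatever the tool can reach.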

The talk lands as agentic systems push LLMs deeper into developer tools, customer support pipelines, and internal data lakes. Each new tool call is another attack surface, and the gap between the marketing copy and the real threat model keeps widening.

What stands out

Carlini’s pattern is consistent: he picks a class of attacks the field has dismissed, builds a working version, and forces vendors to respond. The interesting question is whether labs treat his arguments as actionable engineering or as PR problems to manage. The HN thread is split between practitioners who already build under his assumptions and folks discovering for the first time that “aligned” does not mean “safe.”

The black-hat framing also matters because it changes who shows up. Defensive AI research is crowded. Offensive research, done responsibly and in public, is rarer, and it’s the work that tends to actually move security postures inside the labs.

For anyone shipping AI in production, the video is worth the watch and the comment thread is worth the scroll. Full talk and discussion are at the original source.
