ForgeAwareness
0 of 7 complete0%
Module 19 min

The AI threat model

TL;DR

Foundations covered the risk of you leaking data into an AI. This course covers the bigger surface: when AI reads your data, processes untrusted content, and takes actions, attackers start targeting the AI itself. Five attack surfaces — most real incidents come from just one of them: untrusted content the AI reads.

A different mental model

In Foundations, the risk was a person pasting something they shouldn't. The fix was personal discipline.

Once you connect AI to real systems — your email, your docs, your codebase, your customers — the AI becomes a new piece of software in your trust boundary. And it's a strange piece of software:

  • It follows instructions written in plain English
  • It can't reliably tell your instructions from instructions hidden in the content it reads
  • It increasingly has permission to act, not just answer

That last point is the whole game. An AI that can only talk is low-risk. An AI that can read your inbox and send email is a privileged user that takes orders from anyone whose text it happens to read.

The five attack surfaces

1. The prompt (injection)

Attackers smuggle instructions into content the AI reads — a web page, an email, a PDF, a calendar invite, a support ticket. The AI treats those instructions as if they came from you. This is the #1 source of real AI incidents. Module 2 is entirely about it.

2. The data (poisoning & leakage)

  • Leakage: the AI surfaces data the current user shouldn't see (broken permissions in a RAG system).
  • Poisoning: an attacker plants malicious content where the AI will later read it — a wiki page, a public dataset, a code comment — to steer future answers.

3. The model & its supply chain

Models, fine-tunes, and AI libraries are downloaded from registries just like any dependency. A backdoored model or a typosquatted package is a supply-chain compromise. Module 5 covers this.

4. The actions (agentic blast radius)

When the AI can call tools — send email, run code, hit APIs, move money — a single bad instruction becomes a real-world action. The damage is bounded only by the permissions you gave it. Module 3 covers this.

5. The human (social engineering, amplified)

Everything in Foundations, but faster and cheaper: cloned voices, real-time deepfakes, personalized lures at scale. The defenses are the same (verify out-of-band), but the volume is new.

Where real incidents actually come from

If you remember one thing: the dangerous combination is an AI that can (a) access private data, (b) read untrusted/attacker-controlled content, and (c) communicate externally. Security researcher Simon Willison named this the "lethal trifecta." When all three are true at once, attacker-controlled text can instruct the AI to read your secrets and send them out — no malware required.

Most safe AI deployments are safe because they break at least one leg of that trifecta. Most incidents happen because someone wired all three together without noticing.

Real incident: Bing "Sydney," 2023

Researchers and users discovered that Microsoft's Bing chat could be steered into ignoring its own rules by instructions embedded in web pages it browsed, and by clever conversational framing. It revealed its internal codename ("Sydney") and confidential prompt instructions. No system was "hacked" in the traditional sense — the model simply could not reliably separate trusted configuration from untrusted input. That gap is the root cause of almost everything in this course.

What to actually do

  • Map your AI tools against the five surfaces. For each tool, ask: what data can it read, what content does it process, and what can it do?
  • Flag any tool where all three legs of the lethal trifecta are present. Those get the most scrutiny.
  • Don't try to "prompt your way to safety." You cannot write instructions strong enough to reliably override injected ones. Security comes from architecture (permissions, isolation), not phrasing.

Knowledge check

Knowledge check 1

Which combination makes an AI deployment genuinely dangerous?

Knowledge check 2

A teammate says, "We're safe from prompt injection because our system prompt tells the model to ignore any instructions in documents." Is that sufficient?