Building trustworthy agents rests on three things: safety (the agent behaves as designed), a clear system message framework (meta prompt → basic prompt → generated system message → iteration), and an understanding of the threats the agent faces. Human-in-the-loop adds a final safeguard: users approve or reject agent actions before they take effect, which is essential for bookings, payments, and other sensitive operations.
Threats and mitigations
| Threat | Mitigation |
|---|---|
| Task / instruction manipulation | Validate inputs; limit conversation turns. |
| Access to critical systems | Need-only access; secure channels; auth. |
| Resource / service overloading | Rate limits; cap turns and tool calls. |
| Knowledge base poisoning | Verify data; restrict who can change it. |
| Cascading errors | Fallbacks; retries; isolate agent (e.g. sandbox). |
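Two of the mitigations above, capping conversation turns and tool calls, can be enforced with a small per-conversation budget. This is a minimal sketch; the class and limit names (`AgentBudget`, `max_turns`, `max_tool_calls`) are illustrative, not from the text.

```python
class BudgetExceeded(Exception):
    """Raised when a conversation exceeds its allotted resources."""


class AgentBudget:
    """Tracks per-conversation limits on turns and tool calls."""

    def __init__(self, max_turns=10, max_tool_calls=25):
        self.max_turns = max_turns
        self.max_tool_calls = max_tool_calls
        self.turns = 0
        self.tool_calls = 0

    def spend_turn(self):
        self.turns += 1
        if self.turns > self.max_turns:
            raise BudgetExceeded("conversation turn limit reached")

    def spend_tool_call(self):
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call limit reached")
```

The agent loop calls `spend_turn()` once per user turn and `spend_tool_call()` before each tool invocation, so a runaway loop fails fast instead of overloading downstream services.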
Human-in-the-loop: conversation flow
The agent proposes an action and pauses for confirmation. If the user says No or Edit, the agent stops or adjusts the plan; nothing is charged automatically.
Human-in-the-loop
The user acts as another agent in the loop: they can approve, reject, or correct a proposed action before the agent continues. Reserve this pattern for high-stakes actions such as payments, bookings, and operations on sensitive data.
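The approve / reject / correct loop can be sketched as a gate in front of any side-effecting call. This is a minimal illustration, assuming an `ask_user` callback supplied by your UI and an `execute` function that performs the actual action; both names are hypothetical.

```python
def human_in_the_loop(proposed_action, ask_user, execute):
    """Run `execute` only after explicit user approval; allow edits.

    `ask_user` returns a (decision, payload) pair, where decision is
    "approve", "edit", or "reject". On "edit", payload is the
    user-corrected action.
    """
    while True:
        decision, payload = ask_user(proposed_action)
        if decision == "approve":
            return execute(proposed_action)
        if decision == "edit":
            proposed_action = payload  # loop again with the corrected action
        elif decision == "reject":
            return None  # no automatic charge or side effect
```

Because `execute` is only reachable through the "approve" branch, a rejected or abandoned proposal can never trigger a booking or payment.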
Example: System message framework
Step 1: Meta prompt — "You are an expert at creating AI agent system prompts."
Step 2: Basic prompt — describe the agent's role, tasks, and responsibilities.
Step 3: Feed both to an LLM to produce a detailed system message.
Step 4: Iterate and refine based on observed agent behavior.
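The steps above can be sketched in a few lines. The `generate` parameter stands in for any LLM call, and the chat-message shape (`role` / `content` dicts) is an assumption about your client library, not something the framework mandates.

```python
# Step 1: the meta prompt, quoted from the framework above.
META_PROMPT = "You are an expert at creating AI agent system prompts."


def build_messages(basic_prompt):
    """Step 3 input: combine the meta prompt and the basic prompt."""
    return [
        {"role": "system", "content": META_PROMPT},
        {"role": "user", "content": basic_prompt},
    ]


def refine_system_message(basic_prompt, generate):
    """Steps 3-4: produce a detailed system message from the two prompts.

    Step 4 happens outside this function: test the agent with the draft,
    observe its behavior, fold feedback into `basic_prompt`, and call
    this again.
    """
    return generate(build_messages(basic_prompt))
```

A usage example: `refine_system_message("Role: travel agent. Tasks: find and book flights.", call_llm)`, where `call_llm` is whatever client function your stack provides.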