
🛡️ Chapter 12: Trustworthy Agents & Human-in-the-Loop

Safety, system message framework, threats, and human approval flows

Building trustworthy agents rests on three pillars: safety (the agent behaves as designed), a clear system message framework (meta prompt → basic prompt → optimized system message → iterate), and an understanding of threats. Human-in-the-loop lets users approve or reject agent actions before they take effect, which is essential for bookings, payments, and other sensitive operations.

Threats and mitigations

| Threat | Mitigation |
| --- | --- |
| Task / instruction manipulation | Validate inputs; limit conversation turns. |
| Access to critical systems | Need-only access; secure channels; authentication. |
| Resource / service overloading | Rate limits; cap turns and tool calls. |
| Knowledge base poisoning | Verify data; restrict who can change it. |
| Cascading errors | Fallbacks; retries; isolate the agent (e.g. in a sandbox). |
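The "cap turns and tool calls" mitigation can be sketched as a simple budget tracker; the class name and limits below are illustrative, not part of the chapter:

```python
# Sketch: capping conversation turns and tool calls to mitigate
# resource / service overloading. Limits are illustrative defaults.

class TurnLimiter:
    """Rejects further work once the agent exceeds its budget."""

    def __init__(self, max_turns: int = 10, max_tool_calls: int = 20):
        self.max_turns = max_turns
        self.max_tool_calls = max_tool_calls
        self.turns = 0
        self.tool_calls = 0

    def start_turn(self) -> bool:
        """Return True if another conversation turn is allowed."""
        if self.turns >= self.max_turns:
            return False
        self.turns += 1
        return True

    def allow_tool_call(self) -> bool:
        """Return True if another tool call is within budget."""
        if self.tool_calls >= self.max_tool_calls:
            return False
        self.tool_calls += 1
        return True


limiter = TurnLimiter(max_turns=2, max_tool_calls=3)
print(limiter.start_turn())   # True  (turn 1)
print(limiter.start_turn())   # True  (turn 2)
print(limiter.start_turn())   # False (budget exhausted)
```

The same pattern applies to rate limits: check the budget before acting, never after.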

Human-in-the-loop: conversation flow

User: "Book this flight to Tokyo"
Agent finds flight and prepares booking
Agent asks: "Confirm booking flight XY for $450? (Yes / No / Edit)"
User: "Yes" → agent completes booking

If user says No or Edit, the agent stops or adjusts — no automatic charge.
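The flow above can be sketched in a few lines; the `book_flight` function and flight data are hypothetical placeholders, not a real booking API:

```python
# Sketch of the confirm-before-charging flow: the side effect (the
# charge) only happens after an explicit "Yes" from the user.

def book_flight(flight: dict, user_reply: str) -> str:
    """Complete, pause, or cancel a booking based on the user's reply."""
    reply = user_reply.strip().lower()
    if reply == "yes":
        # Only now does the charge happen.
        return f"Booked flight {flight['id']} for ${flight['price']}"
    if reply == "edit":
        return "Booking paused: awaiting revised criteria"
    return "Booking cancelled: no charge made"


flight = {"id": "XY", "price": 450}
print(book_flight(flight, "Yes"))   # Booked flight XY for $450
print(book_flight(flight, "No"))    # Booking cancelled: no charge made
```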

Human-in-the-loop

User request → Agent runs → Human approval / reject → Continue or stop

Users act like agents in the loop: they can approve, reject, or correct before the agent continues. Use for high-stakes actions (payments, bookings, sensitive data).
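A minimal sketch of this approval gate, assuming a callable `action` and a `get_approval` prompt (both names are illustrative):

```python
# Sketch: gate a high-stakes agent action behind explicit human approval.
# `action` and `get_approval` are placeholder callables.

def run_with_approval(action, get_approval):
    """Run `action` only if the human approves; otherwise stop safely."""
    decision = get_approval()
    if decision == "approve":
        return action()      # side effect happens only after a human yes
    return None              # reject (or anything else): no side effect


result = run_with_approval(
    action=lambda: "payment sent",
    get_approval=lambda: "approve",   # stand-in for a real user prompt
)
print(result)  # payment sent
```

Defaulting to "stop" on anything other than an explicit approval keeps the failure mode safe: an ambiguous reply never triggers a payment.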

Example: System message framework

Step 1: Meta prompt — "You are an expert at creating AI agent system prompts."
Step 2: Basic prompt — role, tasks, responsibilities.
Step 3: Feed both to an LLM to produce a detailed system message.
Step 4: Iterate and refine based on observed behavior.
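The four steps can be sketched as a small pipeline; `call_llm` is a stub you would swap for your model client, and the travel-agent basic prompt is an invented example:

```python
# Sketch of the system message framework. `call_llm` is a placeholder
# for a real chat-completion call; prompts are illustrative.

META_PROMPT = "You are an expert at creating AI agent system prompts."

BASIC_PROMPT = (
    "Role: travel agent. Tasks: search flights, propose options, "
    "book only after user confirmation."
)

def call_llm(system: str, user: str) -> str:
    # Placeholder: a real implementation would call a chat-completion API
    # with `system` as the system message and `user` as the user message.
    return f"[system message generated from: {user[:40]}...]"

# Steps 1-3: feed the meta prompt plus the basic prompt to the LLM.
system_message = call_llm(META_PROMPT, BASIC_PROMPT)
print(system_message)

# Step 4: test the agent with this system message, then refine the
# basic prompt and regenerate until behavior matches expectations.
```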