Building trustworthy agents rests on three things: safety (the agent behaves as designed), a clear system message framework (meta prompt → basic prompt → generated system message → iteration), and an understanding of the threats the agent faces. Human-in-the-loop adds a final safeguard: users approve or reject agent actions before they take effect, which is essential for bookings, payments, and other sensitive operations.
Threats and mitigations
| Threat | Mitigation |
|---|---|
| Task / instruction manipulation | Validate inputs; limit conversation turns. |
| Access to critical systems | Need-only access; secure channels; auth. |
| Resource / service overloading | Rate limits; cap turns and tool calls. |
| Knowledge base poisoning | Verify data; restrict who can change it. |
| Cascading errors | Fallbacks; retries; isolate agent (e.g. sandbox). |
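Two of the mitigations above, capping conversation turns and tool calls, can be enforced with a small per-conversation budget. This is a minimal sketch; the class and limit names (`AgentBudget`, `max_turns`, `max_tool_calls`) are illustrative, not from the text.

```python
class BudgetExceeded(Exception):
    """Raised when a conversation exceeds its allotted resources."""


class AgentBudget:
    """Tracks per-conversation limits on turns and tool calls."""

    def __init__(self, max_turns=10, max_tool_calls=25):
        self.max_turns = max_turns
        self.max_tool_calls = max_tool_calls
        self.turns = 0
        self.tool_calls = 0

    def spend_turn(self):
        self.turns += 1
        if self.turns > self.max_turns:
            raise BudgetExceeded("conversation turn limit reached")

    def spend_tool_call(self):
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call limit reached")
```

The agent loop calls `spend_turn()` once per user turn and `spend_tool_call()` before each tool invocation, so a runaway loop fails fast instead of overloading downstream services.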
Human-in-the-loop: conversation flow
The agent proposes an action and pauses for confirmation. If the user says No or Edit, the agent stops or adjusts the plan; nothing is charged automatically.
Human-in-the-loop
The user acts as another agent in the loop: they can approve, reject, or correct a proposed action before the agent continues. Reserve this pattern for high-stakes actions such as payments, bookings, and operations on sensitive data.
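The approve / reject / correct loop can be sketched as a gate in front of any side-effecting call. This is a minimal illustration, assuming an `ask_user` callback supplied by your UI and an `execute` function that performs the actual action; both names are hypothetical.

```python
def human_in_the_loop(proposed_action, ask_user, execute):
    """Run `execute` only after explicit user approval; allow edits.

    `ask_user` returns a (decision, payload) pair, where decision is
    "approve", "edit", or "reject". On "edit", payload is the
    user-corrected action.
    """
    while True:
        decision, payload = ask_user(proposed_action)
        if decision == "approve":
            return execute(proposed_action)
        if decision == "edit":
            proposed_action = payload  # loop again with the corrected action
        elif decision == "reject":
            return None  # no automatic charge or side effect
```

Because `execute` is only reachable through the "approve" branch, a rejected or abandoned proposal can never trigger a booking or payment.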
Example: System message framework
Step 1: Meta prompt — "You are an expert at creating AI agent system prompts."
Step 2: Basic prompt — describe the agent's role, tasks, and responsibilities.
Step 3: Feed both to an LLM to produce a detailed system message.
Step 4: Iterate and refine based on observed agent behavior.
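The steps above can be sketched in a few lines. The `generate` parameter stands in for any LLM call, and the chat-message shape (`role` / `content` dicts) is an assumption about your client library, not something the framework mandates.

```python
# Step 1: the meta prompt, quoted from the framework above.
META_PROMPT = "You are an expert at creating AI agent system prompts."


def build_messages(basic_prompt):
    """Step 3 input: combine the meta prompt and the basic prompt."""
    return [
        {"role": "system", "content": META_PROMPT},
        {"role": "user", "content": basic_prompt},
    ]


def refine_system_message(basic_prompt, generate):
    """Steps 3-4: produce a detailed system message from the two prompts.

    Step 4 happens outside this function: test the agent with the draft,
    observe its behavior, fold feedback into `basic_prompt`, and call
    this again.
    """
    return generate(build_messages(basic_prompt))
```

A usage example: `refine_system_message("Role: travel agent. Tasks: find and book flights.", call_llm)`, where `call_llm` is whatever client function your stack provides.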