The Problem: Manual Data Quality Debugging
Data quality failures are inevitable in modern data warehouses. A uniqueness test fails at 3 AM. A null constraint is violated in production. A relationship test breaks downstream pipelines. The traditional response? A data engineer wakes up, starts investigating, and begins a manual process that can take hours.
Tools like dbt have made data testing straightforward. You define tests in YAML, run them, and get pass/fail results. But when a test fails, the real work begins:
- Understanding the failure — Which table? Which column? What constraint was violated?
- Finding the data — Query the warehouse to see the actual failing records.
- Tracing lineage — Where did the bad data come from? Which upstream table?
- Multi-level investigation — Often the root cause isn't in the immediate upstream table, but 2–3 hops back.
- Root cause analysis — Logic bug? Source data issue? Join explosion?
- Solution design — Deduplication? Null handling? Join refinement?
- Implementation — Modify SQL, test locally, verify the fix.
- Documentation — Record findings so knowledge isn't lost.
This process is time-consuming (several hours per investigation), context-heavy (requires understanding data models, business logic, and SQL), error-prone (easy to miss upstream causes), and knowledge-siloed (investigation insights live in someone's head or scattered across Slack threads).
What if an AI agent could handle the entire investigation autonomously—tracing data lineage through multiple tables, identifying root causes, implementing fixes, and generating comprehensive documentation?
The Agentic Solution
We built a multi-tool agentic workflow, grounded in the ReACT (Reasoning + Acting) pattern, that handles dbt test failure investigation end to end. Given a failing test name, the agent autonomously investigates the failure, traces causation through the data lineage, designs a context-appropriate fix, implements it, and generates a comprehensive investigation report.
The key insight is that debugging data quality failures follows a reasoning loop, not a fixed script. Each finding informs what to investigate next. The depth of investigation varies. The solution depends on what the agent discovers along the way. This is a natural fit for an agentic approach.
The ReACT Pattern at Work
The workflow follows the ReACT framework (Yao et al., 2022): the agent iterates through cycles of Thought, Action, and Observation until it reaches a resolution.
Each iteration looks like this:
"This is a uniqueness test failure on a surrogate key. I need to find which records are duplicated and why."
Query the warehouse for duplicate values of the surrogate key column.
"Found 2 complete row duplicates. The rows are identical. Need to understand how they were generated."
The agent doesn't execute a predetermined script. It reasons about findings, decides what to do next, and terminates when it has enough evidence—whether that takes 3 iterations or 12.
Architecture and Tools
The system comprises three custom Python tools and an AI agent orchestrator:
Tool Details
| Tool | Input | Output | ReACT Role |
|---|---|---|---|
| Manifest Parser | dbt manifest.json + test name |
Test type, target model, column, SQL file path, dependencies | Perception |
| Warehouse Query | SQL query string | Structured query results (JSON/table) | Investigation |
| SQL Lineage Tracer | SQL file path + column name | Upstream sources, CTE chain, column transformations | Analysis |
| File Read/Write | File path | File contents / modified file | Perception + Modification |
The AI agent doesn't need specialized orchestration code. It operates within an AI coding assistant (like Cursor with Claude) guided by a detailed skill prompt—a structured document that teaches the agent the investigation methodology, tool usage patterns, solution design heuristics, and reporting format.
The Investigation Flow
When given a failing test name, here is the typical autonomous workflow. Note that the flow is not rigid—the agent may skip steps, repeat steps, or investigate different branches depending on what it finds.
Context Gathering
Action: Parse dbt manifest for the failing test.
Learns: Test type (unique/not_null/relationship), target model and column, SQL file location, upstream dependencies.
Data Investigation
Action: Query the warehouse for failing records.
Learns: Specific records that violate the constraint—duplicate keys, null values, orphaned foreign keys. Patterns in the data.
Code Analysis
Action: Read the SQL model file.
Learns: Transformation logic, join patterns, aggregation logic, existing deduplication attempts.
Lineage Tracing (Recursive)
Action: Trace the failing column through CTEs and upstream models.
Decision point: If the root cause isn't in this model, the agent follows the lineage upstream and repeats steps 2–4. It can trace 3–4 levels deep.
Root Cause Determination
Action: Synthesize findings across all investigated layers.
Determines: Duplicate source records? Join explosion? Missing aggregation? Logic bug?
Solution Design & Implementation
Action: Design a context-appropriate fix considering business logic, performance, and correctness. Implement the change in the SQL model file.
Documentation
Action: Generate a comprehensive HTML investigation report with executive summary, data flow diagrams, investigation queries with results, code changes, and verification queries.
Worked Example
To make this concrete, here's a simplified walkthrough of a real investigation.
unique_order_items_item_sk
A uniqueness test on a surrogate key column in an order items model.
Iteration 1: Understand the test
A uniqueness test is failing on a surrogate key. I need to parse the manifest to find out which model and column.
Run manifest parser → target model is order_items, column is item_sk, SQL file is located at models/mart/order_items.sql.
This is a uniqueness test on a surrogate key. Surrogate key duplicates usually mean rows that should be distinct are generating the same hash.
Iteration 2: Find the duplicates
Query the table for duplicated surrogate keys to see how many and which ones.
Query: SELECT item_sk, COUNT(*) FROM order_items GROUP BY 1 HAVING COUNT(*) > 1 → 2 duplicate records found for key a3f8c1...
Only 2 duplicates, and the rows are completely identical—not just key collisions. This means the same source record is appearing twice.
Iteration 3: Analyze the SQL
Read the model SQL to understand how duplicates could emerge.
Read SQL file → model uses UNION ALL combining 4 CTEs (one per data source), each with its own GROUP BY.
UNION ALL preserves duplicates. If any source CTE produces duplicates, they'll persist. I see a commented-out QUALIFY clause. Need to check upstream.
Iteration 4: Trace upstream
Trace the lineage for item_sk to find which source CTE the duplicate originates from.
Lineage tracer → column composed from item_id, source_type, event_timestamp. One branch comes from the web_events CTE. Query that source → source has multiple records per item_id.
Root cause found: source table has duplicate records per item that produce identical surrogate keys. The GROUP BY in the CTE wasn't preventing this because it grouped on different columns.
Iteration 5: Design and implement fix
qualify row_number() over ( partition by item_id, source_type, event_type, event_timestamp order by priority_rank, -- Business logic: prefer higher priority event_id -- Deterministic tie-breaker ) = 1
The agent chose QUALIFY ROW_NUMBER() over DISTINCT or GROUP BY because:
- The partition columns match the surrogate key components exactly.
- The ordering respects business logic (priority ranking).
- The tie-breaker column ensures deterministic results.
QUALIFYis optimized for columnar warehouses and avoids a subquery.
What Makes It Truly Agentic
This workflow goes beyond simple automation. Here are the characteristics that make it genuinely agentic:
1. Autonomous Goal Pursuit
The agent receives a high-level goal ("fix this test") and autonomously determines the investigation strategy. It doesn't follow a fixed script—it decides when to query upstream, how deep to trace, and when it has enough evidence to design a solution.
2. Dynamic Decision Making
Every investigation is different. The agent adapts its approach based on the test type:
- Unique test failures → investigate duplicates, trace surrogate key components
- Not-null failures → investigate nulls, check upstream propagation and COALESCE gaps
- Relationship failures → investigate orphaned records, check referential integrity
3. Variable Investigation Depth
This is the critical agentic feature. Sometimes the root cause is in the immediate model. Sometimes it's 3–4 tables upstream. The agent doesn't stop at a fixed depth—it continues the ReACT loop until it finds the actual root cause.
4. Context-Aware Solution Design
The agent doesn't apply template fixes. It synthesizes everything it has learned to design a solution that:
- Respects business logic (e.g., priority fields for deduplication ordering)
- Considers performance implications (e.g.,
QUALIFYvs subquery in columnar warehouses) - Ensures deterministic behavior (tie-breaker columns)
- Preserves existing code patterns and style
5. Tool Orchestration
Each tool invocation is informed by the previous one. The agent chains tools dynamically:
Manifest → "Uniqueness test on item_sk" ↓ Warehouse → "2 duplicates found, here are the IDs" ↓ File Read → "Model has UNION ALL with GROUP BYs" ↓ Lineage → "Column comes from web_events CTE" ↓ Warehouse → "Source has multiple records per ID" ↓ Reasoning → "Need dedup based on SK components" ↓ Implement → "QUALIFY ROW_NUMBER() applied"
Extension: Automated Merge Requests
The base workflow ends with a human review checkpoint: the agent implements the fix and creates documentation, then waits for approval. A natural extension is automating the merge request creation.
With this extension, the merge request becomes a self-contained work package: code changes, investigation rationale, data evidence, and verification queries all in one reviewable artifact.
The agent automates the tedious, investigative, mechanical work. Humans retain control of the critical decision: whether to merge to production. Not too early (human doesn't need to investigate), not too late (human doesn't blindly trust the agent).
Results and Impact
Qualitative Improvements
- Reduced cognitive load: Engineers focus on architecture and system design rather than reactive debugging.
- Better root cause identification: The agent doesn't take shortcuts, get tired, or stop at surface-level symptoms.
- Faster onboarding: New engineers read past investigation reports to understand the data model and common failure patterns.
- Audit trail: Every investigation is documented with the queries run, data found, reasoning applied, and solution chosen.
Limitations and Honest Assessment
This workflow is not a silver bullet. Being clear about limitations is important for setting expectations.
What it handles well
- Standard dbt test types (unique, not_null, relationship)
- Root causes within 3–4 hops of the failing model
- SQL-level fixes (deduplication, null handling, join refinement)
- Failures with clear data evidence
What it struggles with
- Business logic ambiguity: When the "correct" behavior requires domain knowledge the agent doesn't have.
- Cross-system issues: Root causes in source systems (API changes, ingestion bugs) that aren't visible in the warehouse.
- Complex custom tests: Tests with custom SQL logic that the manifest parser can't decompose.
- Performance regression: The agent focuses on correctness, not query performance. A fix that passes the test might introduce expensive patterns.
Cost considerations
- Each investigation involves multiple LLM calls (10–30 depending on depth).
- Warehouse queries during investigation consume compute credits. Smart sampling and partition filters are essential guardrails.
- Agent occasionally proposes overly aggressive fixes (e.g., refactoring an entire model when a targeted fix suffices). The skill prompt includes explicit guidance to implement minimal, targeted changes.
Challenges Overcome in Development
| Challenge | Symptom | Solution |
|---|---|---|
| Infinite recursion | Agent traces lineage endlessly | Max depth limits + "found source table" termination |
| Query costs | Full table scans on large tables | Smart sampling, partition filters, LIMIT clauses |
| Solution overreach | Agent refactors entire models | Explicit "minimal targeted fix" guidance in prompt |
| False root causes | Stops at surface symptoms | Require verification queries that prove the fix |
Lessons Learned
Tool quality is the foundation
The agent is only as good as its tools. We invested heavily in the manifest parser, lineage tracer, and query executor before building the agent logic. Tools that return well-structured JSON/tables enable much better agent reasoning than free-text outputs.
Prompt engineering scales with complexity
The skill prompt is several hundred lines because the agent needs detailed guidance on when to stop investigating, how to structure reports, what makes a good solution, and how to handle edge cases. This isn't a limitation—it's how you encode domain expertise into an agentic system.
Documentation is a first-class output
The HTML investigation reports aren't afterthoughts. They're how humans understand agent decisions, how knowledge gets preserved, and how auditing happens. Treating documentation as a core deliverable (not a nice-to-have) changed how we designed the entire workflow.
Sandbox-first development
Early in development, the agent occasionally proposed aggressive fixes. Having it work in a dev environment first—and requiring verification queries before declaring success—was essential for building confidence.
The right human checkpoint
Full autonomy isn't the goal. The agent automates investigation, analysis, and implementation. Humans retain judgment over production changes. This division of labor plays to each party's strengths.
Future Directions
Self-Correction Loop
The agent implements a fix, runs the test in a sandbox, and if it still fails, analyzes why and tries an alternative approach. This moves from single-shot fixes to iterative problem-solving.
Learning from History
When a new failure occurs, the agent searches past investigation reports for similar patterns: "This looks like investigation #47: a duplicate UNION ALL issue." It applies proven solution patterns rather than investigating from scratch.
Proactive Detection
Instead of reacting to failures, the agent monitors test runs for flakiness trends—tests that intermittently fail or show degrading data quality—and investigates before a full failure occurs.
Multi-Agent Collaboration
Specialized sub-agents handle different aspects of investigation: a SQL analysis agent reviews query patterns, a data profiling agent analyzes distributions, a testing agent generates comprehensive test cases. A coordinator agent delegates and synthesizes their findings.
Conclusion
Building an agentic workflow for dbt test debugging transformed data quality incidents from multi-hour firefighting into autonomous, well-documented resolutions. The workflow exhibits key agentic characteristics:
- Autonomy: Pursues high-level goals without step-by-step instructions.
- Reasoning: Follows the ReACT pattern—analyzing data, tracing causation, designing solutions.
- Adaptation: Adjusts investigation depth and approach based on findings.
- Tool orchestration: Chains tools dynamically where each informs the next.
- Knowledge creation: Generates comprehensive, reusable documentation.
The future isn't about replacing data engineers. It's about elevating them from reactive debugging to proactive system design. Agentic workflows make that shift possible today.
Yao, S. et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629.