Building an Agentic Workflow for Automated Data Quality Debugging

The Problem: Manual Data Quality Debugging

Data quality failures are inevitable in modern data warehouses. A uniqueness test fails at 3 AM. A null constraint is violated in production. A relationship test breaks downstream pipelines. The traditional response? A data engineer wakes up, starts investigating, and begins a manual process that can take hours.

Tools like dbt have made data testing straightforward. You define tests in YAML, run them, and get pass/fail results. But when a test fails, the real work begins:

Understanding the failure — Which table? Which column? What constraint was violated?
Finding the data — Query the warehouse to see the actual failing records.
Tracing lineage — Where did the bad data come from? Which upstream table?
Multi-level investigation — Often the root cause isn't in the immediate upstream table, but 2–3 hops back.
Root cause analysis — Logic bug? Source data issue? Join explosion?
Solution design — Deduplication? Null handling? Join refinement?
Implementation — Modify SQL, test locally, verify the fix.
Documentation — Record findings so knowledge isn't lost.

This process is time-consuming (several hours per investigation), context-heavy (requires understanding data models, business logic, and SQL), error-prone (easy to miss upstream causes), and knowledge-siloed (investigation insights live in someone's head or scattered across Slack threads).

The question:

What if an AI agent could handle the entire investigation autonomously—tracing data lineage through multiple tables, identifying root causes, implementing fixes, and generating comprehensive documentation?

The Agentic Solution

We built a multi-tool agentic workflow, grounded in the ReACT (Reasoning + Acting) pattern, that handles dbt test failure investigation end to end. Given a failing test name, the agent autonomously investigates the failure, traces causation through the data lineage, designs a context-appropriate fix, implements it, and generates a comprehensive investigation report.

The key insight is that debugging data quality failures follows a reasoning loop, not a fixed script. Each finding informs what to investigate next. The depth of investigation varies. The solution depends on what the agent discovers along the way. This is a natural fit for an agentic approach.

The ReACT Pattern at Work

The workflow follows the ReACT framework (Yao et al., 2022): the agent iterates through cycles of Thought, Action, and Observation until it reaches a resolution.

The ReACT Loop

Each iteration looks like this:

Thought

"This is a uniqueness test failure on a surrogate key. I need to find which records are duplicated and why."

Action

Query the warehouse for duplicate values of the surrogate key column.

Observation

"Found 2 complete row duplicates. The rows are identical. Need to understand how they were generated."

The agent doesn't execute a predetermined script. It reasons about findings, decides what to do next, and terminates when it has enough evidence—whether that takes 3 iterations or 12.

Architecture and Tools

The system comprises three custom Python tools and an AI agent orchestrator:

System Architecture

Tool Details

Tool	Input	Output	ReACT Role
Manifest Parser	dbt `manifest.json` + test name	Test type, target model, column, SQL file path, dependencies	Perception
Warehouse Query	SQL query string	Structured query results (JSON/table)	Investigation
SQL Lineage Tracer	SQL file path + column name	Upstream sources, CTE chain, column transformations	Analysis
File Read/Write	File path	File contents / modified file	Perception + Modification

The AI agent doesn't need specialized orchestration code. It operates within an AI coding assistant (like Cursor with Claude) guided by a detailed skill prompt—a structured document that teaches the agent the investigation methodology, tool usage patterns, solution design heuristics, and reporting format.

The Investigation Flow

When given a failing test name, here is the typical autonomous workflow. Note that the flow is not rigid—the agent may skip steps, repeat steps, or investigate different branches depending on what it finds.

Context Gathering

Action: Parse dbt manifest for the failing test.

Learns: Test type (unique/not_null/relationship), target model and column, SQL file location, upstream dependencies.

Data Investigation

Action: Query the warehouse for failing records.

Learns: Specific records that violate the constraint—duplicate keys, null values, orphaned foreign keys. Patterns in the data.

Code Analysis

Action: Read the SQL model file.

Learns: Transformation logic, join patterns, aggregation logic, existing deduplication attempts.

Lineage Tracing (Recursive)

Action: Trace the failing column through CTEs and upstream models.

Decision point: If the root cause isn't in this model, the agent follows the lineage upstream and repeats steps 2–4. It can trace 3–4 levels deep.

Root Cause Determination

Action: Synthesize findings across all investigated layers.

Determines: Duplicate source records? Join explosion? Missing aggregation? Logic bug?

Solution Design & Implementation

Action: Design a context-appropriate fix considering business logic, performance, and correctness. Implement the change in the SQL model file.

Documentation

Action: Generate a comprehensive HTML investigation report with executive summary, data flow diagrams, investigation queries with results, code changes, and verification queries.

Worked Example

To make this concrete, here's a simplified walkthrough of a real investigation.

Failing test: unique_order_items_item_sk

A uniqueness test on a surrogate key column in an order items model.

Iteration 1: Understand the test

Thought

A uniqueness test is failing on a surrogate key. I need to parse the manifest to find out which model and column.

Action

Run manifest parser → target model is order_items, column is item_sk, SQL file is located at models/mart/order_items.sql.

Observation

This is a uniqueness test on a surrogate key. Surrogate key duplicates usually mean rows that should be distinct are generating the same hash.

Iteration 2: Find the duplicates

Thought

Query the table for duplicated surrogate keys to see how many and which ones.

Action

Query: SELECT item_sk, COUNT(*) FROM order_items GROUP BY 1 HAVING COUNT(*) > 1 → 2 duplicate records found for key a3f8c1...

Observation

Only 2 duplicates, and the rows are completely identical—not just key collisions. This means the same source record is appearing twice.

Iteration 3: Analyze the SQL

Thought

Read the model SQL to understand how duplicates could emerge.

Action

Read SQL file → model uses UNION ALL combining 4 CTEs (one per data source), each with its own GROUP BY.

Observation

UNION ALL preserves duplicates. If any source CTE produces duplicates, they'll persist. I see a commented-out QUALIFY clause. Need to check upstream.

Iteration 4: Trace upstream

Thought

Trace the lineage for item_sk to find which source CTE the duplicate originates from.

Action

Lineage tracer → column composed from item_id, source_type, event_timestamp. One branch comes from the web_events CTE. Query that source → source has multiple records per item_id.

Observation

Root cause found: source table has duplicate records per item that produce identical surrogate keys. The GROUP BY in the CTE wasn't preventing this because it grouped on different columns.

Iteration 5: Design and implement fix

qualify
    row_number() over (
        partition by
            item_id,
            source_type,
            event_type,
            event_timestamp
        order by
            priority_rank,  -- Business logic: prefer higher priority
            event_id        -- Deterministic tie-breaker
    ) = 1

The agent chose QUALIFY ROW_NUMBER() over DISTINCT or GROUP BY because:

The partition columns match the surrogate key components exactly.
The ordering respects business logic (priority ranking).
The tie-breaker column ensures deterministic results.
QUALIFY is optimized for columnar warehouses and avoids a subquery.

What Makes It Truly Agentic

This workflow goes beyond simple automation. Here are the characteristics that make it genuinely agentic:

1. Autonomous Goal Pursuit

The agent receives a high-level goal ("fix this test") and autonomously determines the investigation strategy. It doesn't follow a fixed script—it decides when to query upstream, how deep to trace, and when it has enough evidence to design a solution.

2. Dynamic Decision Making

Every investigation is different. The agent adapts its approach based on the test type:

Unique test failures → investigate duplicates, trace surrogate key components
Not-null failures → investigate nulls, check upstream propagation and COALESCE gaps
Relationship failures → investigate orphaned records, check referential integrity

3. Variable Investigation Depth

This is the critical agentic feature. Sometimes the root cause is in the immediate model. Sometimes it's 3–4 tables upstream. The agent doesn't stop at a fixed depth—it continues the ReACT loop until it finds the actual root cause.

Variable Depth Investigation

4. Context-Aware Solution Design

The agent doesn't apply template fixes. It synthesizes everything it has learned to design a solution that:

Respects business logic (e.g., priority fields for deduplication ordering)
Considers performance implications (e.g., QUALIFY vs subquery in columnar warehouses)
Ensures deterministic behavior (tie-breaker columns)
Preserves existing code patterns and style

5. Tool Orchestration

Each tool invocation is informed by the previous one. The agent chains tools dynamically:

Manifest  → "Uniqueness test on item_sk"
   ↓
Warehouse → "2 duplicates found, here are the IDs"
   ↓
File Read → "Model has UNION ALL with GROUP BYs"
   ↓
Lineage  → "Column comes from web_events CTE"
   ↓
Warehouse → "Source has multiple records per ID"
   ↓
Reasoning → "Need dedup based on SK components"
   ↓
Implement → "QUALIFY ROW_NUMBER() applied"

Extension: Automated Merge Requests

The base workflow ends with a human review checkpoint: the agent implements the fix and creates documentation, then waits for approval. A natural extension is automating the merge request creation.

Extended Workflow with Merge Request Automation

With this extension, the merge request becomes a self-contained work package: code changes, investigation rationale, data evidence, and verification queries all in one reviewable artifact.

The human-in-the-loop sweet spot:

The agent automates the tedious, investigative, mechanical work. Humans retain control of the critical decision: whether to merge to production. Not too early (human doesn't need to investigate), not too late (human doesn't blindly trust the agent).

Results and Impact

~10x

Investigation Speed

Hours → minutes per failure

100%

Documentation Rate

Every investigation produces a report

24/7

Coverage

Agent investigates at any hour

Searchable

Knowledge Base

Reports become institutional memory

Qualitative Improvements

Reduced cognitive load: Engineers focus on architecture and system design rather than reactive debugging.
Better root cause identification: The agent doesn't take shortcuts, get tired, or stop at surface-level symptoms.
Faster onboarding: New engineers read past investigation reports to understand the data model and common failure patterns.
Audit trail: Every investigation is documented with the queries run, data found, reasoning applied, and solution chosen.

Limitations and Honest Assessment

This workflow is not a silver bullet. Being clear about limitations is important for setting expectations.

What it handles well

Standard dbt test types (unique, not_null, relationship)
Root causes within 3–4 hops of the failing model
SQL-level fixes (deduplication, null handling, join refinement)
Failures with clear data evidence

What it struggles with

Business logic ambiguity: When the "correct" behavior requires domain knowledge the agent doesn't have.
Cross-system issues: Root causes in source systems (API changes, ingestion bugs) that aren't visible in the warehouse.
Complex custom tests: Tests with custom SQL logic that the manifest parser can't decompose.
Performance regression: The agent focuses on correctness, not query performance. A fix that passes the test might introduce expensive patterns.

Cost considerations

Each investigation involves multiple LLM calls (10–30 depending on depth).
Warehouse queries during investigation consume compute credits. Smart sampling and partition filters are essential guardrails.
Agent occasionally proposes overly aggressive fixes (e.g., refactoring an entire model when a targeted fix suffices). The skill prompt includes explicit guidance to implement minimal, targeted changes.

Challenges Overcome in Development

Challenge	Symptom	Solution
Infinite recursion	Agent traces lineage endlessly	Max depth limits + "found source table" termination
Query costs	Full table scans on large tables	Smart sampling, partition filters, LIMIT clauses
Solution overreach	Agent refactors entire models	Explicit "minimal targeted fix" guidance in prompt
False root causes	Stops at surface symptoms	Require verification queries that prove the fix

Lessons Learned

Tool quality is the foundation

The agent is only as good as its tools. We invested heavily in the manifest parser, lineage tracer, and query executor before building the agent logic. Tools that return well-structured JSON/tables enable much better agent reasoning than free-text outputs.

Prompt engineering scales with complexity

The skill prompt is several hundred lines because the agent needs detailed guidance on when to stop investigating, how to structure reports, what makes a good solution, and how to handle edge cases. This isn't a limitation—it's how you encode domain expertise into an agentic system.

Documentation is a first-class output

The HTML investigation reports aren't afterthoughts. They're how humans understand agent decisions, how knowledge gets preserved, and how auditing happens. Treating documentation as a core deliverable (not a nice-to-have) changed how we designed the entire workflow.

Sandbox-first development

Early in development, the agent occasionally proposed aggressive fixes. Having it work in a dev environment first—and requiring verification queries before declaring success—was essential for building confidence.

The right human checkpoint

Full autonomy isn't the goal. The agent automates investigation, analysis, and implementation. Humans retain judgment over production changes. This division of labor plays to each party's strengths.

Future Directions

Self-Correction Loop

The agent implements a fix, runs the test in a sandbox, and if it still fails, analyzes why and tries an alternative approach. This moves from single-shot fixes to iterative problem-solving.

Learning from History

When a new failure occurs, the agent searches past investigation reports for similar patterns: "This looks like investigation #47: a duplicate UNION ALL issue." It applies proven solution patterns rather than investigating from scratch.

Proactive Detection

Instead of reacting to failures, the agent monitors test runs for flakiness trends—tests that intermittently fail or show degrading data quality—and investigates before a full failure occurs.

Multi-Agent Collaboration

Specialized sub-agents handle different aspects of investigation: a SQL analysis agent reviews query patterns, a data profiling agent analyzes distributions, a testing agent generates comprehensive test cases. A coordinator agent delegates and synthesizes their findings.

· · ·

Conclusion

Building an agentic workflow for dbt test debugging transformed data quality incidents from multi-hour firefighting into autonomous, well-documented resolutions. The workflow exhibits key agentic characteristics:

Autonomy: Pursues high-level goals without step-by-step instructions.
Reasoning: Follows the ReACT pattern—analyzing data, tracing causation, designing solutions.
Adaptation: Adjusts investigation depth and approach based on findings.
Tool orchestration: Chains tools dynamically where each informs the next.
Knowledge creation: Generates comprehensive, reusable documentation.

The future isn't about replacing data engineers. It's about elevating them from reactive debugging to proactive system design. Agentic workflows make that shift possible today.

References

Yao, S. et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629.

Building an Agentic Workflow for Automated Data Quality Debugging

Contents

The Problem: Manual Data Quality Debugging

The Agentic Solution

The ReACT Pattern at Work

Architecture and Tools

Tool Details

The Investigation Flow

Context Gathering

Data Investigation

Code Analysis

Lineage Tracing (Recursive)

Root Cause Determination

Solution Design & Implementation

Documentation

Worked Example

Iteration 1: Understand the test

Iteration 2: Find the duplicates

Iteration 3: Analyze the SQL

Iteration 4: Trace upstream

Iteration 5: Design and implement fix

What Makes It Truly Agentic

1. Autonomous Goal Pursuit

2. Dynamic Decision Making

3. Variable Investigation Depth

4. Context-Aware Solution Design

5. Tool Orchestration

Extension: Automated Merge Requests

Results and Impact

Qualitative Improvements

Limitations and Honest Assessment

What it handles well

What it struggles with

Cost considerations

Challenges Overcome in Development

Lessons Learned

Tool quality is the foundation

Prompt engineering scales with complexity

Documentation is a first-class output

Sandbox-first development

The right human checkpoint

Future Directions

Self-Correction Loop

Learning from History

Proactive Detection

Multi-Agent Collaboration

Conclusion