Event-Driven Observability: Making ETL Pipelines Debuggable

When an ETL pipeline fails at 3 AM, you need to know exactly what happened. Event-driven observability gives you that story.

ETL Observability

ETL observability is the ability to understand what happened inside your pipeline by examining its outputs — events, logs, metrics, and traces — without modifying the pipeline code. It turns a “something broke” error into a complete story: which record, which transformation, what state before and after, and exactly where things went wrong. Without it, you are left with the familiar scenario: your pipeline fails at 3 AM, and the log says “Error processing record.” Which record? Which transformation? What was the data before it failed? What did the cleaner try to do? You have no answers, only the knowledge that something broke.

This is the ETL observability problem. Traditional logging gives you noise — timestamps and messages scattered across files. Event-driven ETL observability gives you a story. This builds on the pipeline architecture from The 6-Phase Pipeline Pattern and the data quality work from Production-Tested Data Cleaners. Events are how those phases communicate what happened.

What Actually Happens When an ETL Observability Event is Dispatched

An event dispatcher is like a radio broadcast system. When something significant happens in your pipeline, it broadcasts a message. Anyone listening on that frequency receives the message and can act on it.

Let us trace what happens step by step when a pipeline starts:

Step 1: Pipeline creates an event

Event Creation: pipeline.started
Event: "pipeline.started"
Pipeline ID: "customer-sync-2024-01-15-103000"
Payload: {
  config: { source: "legacy_crm", target: "main_db" },
  context: { batch_size: 1000, started_by: "scheduler" }
}
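The event above can be modeled as a small value object carrying a name, a correlation ID, and a payload. This is a minimal sketch; the `Event` class and its field names are illustrative, not a prescribed API.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Event:
    """A pipeline observability event: name, correlation ID, and payload."""
    name: str                  # e.g. "pipeline.started"
    pipeline_id: str           # correlation ID linking all events in one run
    payload: dict = field(default_factory=dict)

# The pipeline.started event from above:
event = Event(
    name="pipeline.started",
    pipeline_id="customer-sync-2024-01-15-103000",
    payload={
        "config": {"source": "legacy_crm", "target": "main_db"},
        "context": {"batch_size": 1000, "started_by": "scheduler"},
    },
)
```

Keeping the pipeline ID on every event, rather than only on the start event, is what makes the trail searchable later.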

Step 2: Dispatcher receives the event

The dispatcher checks: who is listening for “pipeline.started” events?

Registered Listeners
Registered listeners for "pipeline.started":
  - LoggingListener (priority: 100)
  - MetricsListener (priority: 50)
  - SlackNotifier (priority: 10)

Step 3: Listeners are called in priority order

Higher priority first. LoggingListener runs before MetricsListener.

Listener Execution Order
→ LoggingListener receives event
  Writes: "[2024-01-15 10:30:00] Pipeline customer-sync started with config: ..."

→ MetricsListener receives event
  Records: pipeline_start_count++, start_time = now()

→ SlackNotifier receives event
  Sends: "#data-ops: Pipeline customer-sync started"

Step 4: Event is marked as handled

All listeners completed. The pipeline continues. The event itself did not change how the pipeline runs — it just announced what happened so observers could react. This separation is what makes ETL observability powerful: the pipeline does its work, and the observers do theirs, independently.
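The four steps above can be sketched as a dispatcher that registers listeners per event name and calls them highest priority first. This is a minimal sketch; the class and method names are illustrative.

```python
class Dispatcher:
    """Broadcasts events to registered listeners, highest priority first."""

    def __init__(self):
        self._listeners = {}  # event name -> list of (priority, callable)

    def listen(self, event_name, listener, priority=0):
        self._listeners.setdefault(event_name, []).append((priority, listener))

    def dispatch(self, event_name, payload):
        # Sort descending by priority so logging precedes alerting.
        for _, listener in sorted(self._listeners.get(event_name, []),
                                  key=lambda pair: -pair[0]):
            listener(event_name, payload)

calls = []
d = Dispatcher()
d.listen("pipeline.started", lambda n, p: calls.append("logging"), priority=100)
d.listen("pipeline.started", lambda n, p: calls.append("slack"), priority=10)
d.listen("pipeline.started", lambda n, p: calls.append("metrics"), priority=50)
d.dispatch("pipeline.started", {"pipeline_id": "customer-sync"})
# calls is now ["logging", "metrics", "slack"] -- priority order, not registration order
```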

Why Priority Order Matters for ETL Observability

Not all listeners are equal. When something fails, you want logging to happen before alerting. Why? Because if alerting fails (Slack is down, email server unreachable), you still have logs. If logging fails first, you have nothing.

| Priority | Listener Type | Why This Order | Failure Impact |
|---|---|---|---|
| 100 (Highest) | Logging | Always runs first. Even if everything else fails, you have a record. | Low — writes to local file system |
| 75 | Audit Trail | Compliance record of what happened. Must not depend on external services. | Low — writes to local database |
| 50 | Metrics | Record timing and counts. Important for dashboards but not critical for debugging. | Medium — may depend on metrics service |
| 25 | Alerting | Send notifications. Depends on external services that might be down. | High — Slack, PagerDuty, email all external |
| 10 (Lowest) | Analytics | Aggregate statistics for business reporting. Can be replayed from logs if missed. | Medium — can recover from logs |

The priority system ensures your most reliable observers act first. Logging to a local file will almost never fail. Sending a Slack message depends on network connectivity, Slack’s uptime, and API rate limits. By running the reliable listeners first, you guarantee that the ETL observability data exists even when external services are unavailable.

What Happens When a Record Fails: ETL Observability in Action

This is where event-driven ETL observability proves its value. Let us trace a failure from start to finish.

Record being processed:

Problem Record
{
  customer_id: 12345,
  phone: "---",
  email: "not-an-email"
}

Step 1: Phone cleaning fails

The phone cleaner cannot extract any digits from "---". It emits an event:

Event: record.field.cleaned
Event: "record.field.cleaned"
Payload: {
  record_id: 12345,
  field: "phone",
  original_value: "---",
  cleaned_value: null,
  status: "cleaned_to_null",
  reason: "No valid digits found"
}

Step 2: Email validation fails

The email cleaner cannot make “not-an-email” into a valid email:

Event: record.field.cleaned
Event: "record.field.cleaned"
Payload: {
  record_id: 12345,
  field: "email",
  original_value: "not-an-email",
  cleaned_value: null,
  status: "cleaned_to_null",
  reason: "Invalid email format"
}

Step 3: Business logic rejects the record

A business rule requires at least a phone OR email. This record has neither:

Event: record.rejected
Event: "record.rejected"
Payload: {
  record_id: 12345,
  phase: "refine",
  rule: "contact_info_required",
  reason: "Record has no valid phone or email",
  original_record: { customer_id: 12345, phone: "---", email: "not-an-email" },
  cleaned_record: { customer_id: 12345, phone: null, email: null }
}

What the logging listener captured:

Complete Event Trail
[10:30:01] Record 12345: phone cleaned "---" → null (No valid digits)
[10:30:01] Record 12345: email cleaned "not-an-email" → null (Invalid format)
[10:30:01] Record 12345: REJECTED at refine phase (contact_info_required)

Now when you debug at 3 AM, you know exactly what happened. The phone was all dashes. The email was garbage. The record got rejected because it had no valid contact info. No guessing required. This is what ETL observability delivers: a complete, traceable narrative of every decision the pipeline made.
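A cleaner that emits this kind of event is straightforward to write. The sketch below assumes the cleaner is handed an `emit` callable (for instance, a dispatcher's dispatch method); the function and parameter names are illustrative.

```python
import re

def clean_phone(record_id, raw, emit):
    """Strip everything but digits; emit an event when cleaning nulls the value.

    `emit` is any callable taking (event_name, payload).
    """
    digits = re.sub(r"\D", "", raw or "")
    if not digits:
        emit("record.field.cleaned", {
            "record_id": record_id,
            "field": "phone",
            "original_value": raw,
            "cleaned_value": None,
            "status": "cleaned_to_null",
            "reason": "No valid digits found",
        })
        return None
    return digits

events = []
result = clean_phone(12345, "---", lambda name, payload: events.append((name, payload)))
# result is None; events now holds one record.field.cleaned event explaining why
```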

Mental Model: The Sports Broadcast

Think of a sports game with multiple broadcasters:

| Component | Sports Analogy | ETL Equivalent |
|---|---|---|
| The Game | Players move, scores happen, fouls are called | Pipeline processes records through phases |
| The Press Box | Announces every significant moment over PA | Event dispatcher broadcasts events |
| TV Crew | Records video of the game | Logging listener writes to files |
| Stats Team | Updates the scoreboard | Metrics listener tracks counts and timing |
| Social Media | Tweets highlights | Alerting listener sends Slack/email |

Each broadcaster does their job independently. The TV crew does not wait for the tweet to go out. If the photographer’s camera breaks, the game continues and everyone else keeps working. The game itself does not care who is watching — it plays out the same regardless of the audience. Your ETL pipeline works the same way. The processing logic does not know or care about the observers. It just announces events, and whoever is listening can react.

Designing Your ETL Observability Event Taxonomy

Not every action needs an event. Emitting too many events creates noise that is as useless as having no events at all. A well-designed event taxonomy captures the significant state changes without drowning you in detail.

| Event Category | Event Name | When to Emit | Payload Includes |
|---|---|---|---|
| Pipeline lifecycle | pipeline.started | Pipeline begins processing | Config, source, target, batch size |
| Pipeline lifecycle | pipeline.completed | Pipeline finishes successfully | Duration, record counts, error counts |
| Pipeline lifecycle | pipeline.failed | Pipeline encounters fatal error | Error details, last record processed, phase |
| Batch processing | batch.processed | A batch of records is loaded | Batch number, record count, duration |
| Record events | record.field.cleaned | A field value was modified by cleaning | Record ID, field, original value, cleaned value |
| Record events | record.rejected | A record failed validation or business rules | Record ID, phase, rule, reason, record data |
| Record events | record.error | Unexpected error during processing | Record ID, phase, exception, stack trace |
| Performance | phase.timing | A phase completes for a batch | Phase name, duration, record count |

Notice that “record.processed.successfully” is NOT in the list. If you have 100,000 records and 99,990 succeed, you do not need 99,990 success events. You need the pipeline.completed event with the total count, and the 10 failure events with details. ETL observability is about knowing what went wrong and what changed, not about logging every success.

The field-level cleaning event (record.field.cleaned) is optional and configurable. For high-volume pipelines, you might only emit it when the cleaning actually changed a value. For debugging a new pipeline, you might emit it for every field. The event taxonomy should be adjustable without changing pipeline code — this is where Configuration-Driven ETL connects to observability.
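That configurability can live in a small helper that decides whether a field-level event is emitted at all. This is a sketch under assumptions: the `field_events` config key and its values ("all", "changes", "off") are hypothetical names, not an established convention.

```python
def emit_field_cleaned(emit, config, record_id, field_name, original, cleaned):
    """Emit record.field.cleaned according to configuration, not pipeline code.

    config["field_events"] (illustrative key): "all" emits for every field,
    "changes" only when cleaning altered the value, "off" suppresses entirely.
    """
    mode = config.get("field_events", "changes")
    if mode == "off":
        return
    if mode == "changes" and original == cleaned:
        return
    emit("record.field.cleaned", {
        "record_id": record_id, "field": field_name,
        "original_value": original, "cleaned_value": cleaned,
    })

events = []
cfg = {"field_events": "changes"}
emit_field_cleaned(events.append.__self__.append if False else (lambda n, p: events.append(p)),
                   cfg, 1, "phone", "555-1234", "5551234")   # changed -> emitted
emit_field_cleaned(lambda n, p: events.append(p),
                   cfg, 1, "email", "a@b.com", "a@b.com")    # unchanged -> skipped
# events holds only the phone change
```

Switching a high-volume pipeline from "all" to "changes" is then a one-line config edit, not a code change.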

Error Isolation: Why Listeners Cannot Crash Your Pipeline

Here is what happens when a listener throws an exception:

Scenario: Slack is down

Error Isolation in Action
→ LoggingListener receives event → Success ✓
→ MetricsListener receives event → Success ✓
→ SlackNotifier receives event → THROWS: "Connection refused"

Dispatcher catches the exception:
  → Logs: "SlackNotifier failed: Connection refused"
  → Continues to next listener (if any)
  → Pipeline processing continues uninterrupted

Result: Data gets loaded. Slack notification was missed.
         But you have logs and metrics. ETL observability data is preserved.

The pipeline continues. Data gets loaded. You see the Slack error in logs but your ETL did not fail just because Slack was down. This is error isolation. Observability is important, but it is not more important than the actual work. A pipeline that crashes because it could not send a notification is worse than a pipeline that silently processes data while logging the notification failure.
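The isolation boundary is a try/except around each listener call inside the dispatcher, nothing more. A minimal sketch (listener names are illustrative):

```python
import logging

def dispatch_isolated(listeners, event_name, payload):
    """Call every listener; one listener's failure never reaches the pipeline."""
    for listener in listeners:
        try:
            listener(event_name, payload)
        except Exception as exc:
            # Record the observer failure, then keep going.
            logging.error("Listener %s failed: %s",
                          getattr(listener, "__name__", repr(listener)), exc)

results = []

def logging_listener(name, payload):
    results.append("logged")

def slack_notifier(name, payload):
    raise ConnectionError("Connection refused")  # simulating Slack being down

def metrics_listener(name, payload):
    results.append("counted")

dispatch_isolated([logging_listener, slack_notifier, metrics_listener],
                  "pipeline.error", {"pipeline_id": "customer-sync"})
# results == ["logged", "counted"]: dispatch survived the Slack failure
```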

Propagation Control: When to Stop Broadcasting

Sometimes a listener needs to stop further processing. Maybe your primary alerting system already handled the error and calling the backup alert would be spam.

Scenario: PagerDuty acknowledges the alert

Propagation Control
→ PagerDutyListener receives "pipeline.error" event
  Sends alert to on-call engineer
  Engineer acknowledges within 30 seconds
  PagerDutyListener calls: event.stopPropagation()

→ BackupEmailListener would receive event, but...
  Dispatcher checks: isPropagationStopped() → true
  BackupEmailListener is SKIPPED

The backup email would have been redundant. The on-call engineer is already looking at the problem. Propagation control prevents alert fatigue — a real problem in operations teams that receive hundreds of duplicate notifications for a single incident.
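Propagation control needs only a flag on the event and a check in the dispatch loop. A sketch (the snake_case method names mirror the `stopPropagation()`/`isPropagationStopped()` calls above; all names are illustrative):

```python
class StoppableEvent:
    """Minimal event with propagation control."""

    def __init__(self, name, payload):
        self.name = name
        self.payload = payload
        self._stopped = False

    def stop_propagation(self):
        self._stopped = True

    def is_propagation_stopped(self):
        return self._stopped

def dispatch(listeners, event):
    for listener in listeners:
        if event.is_propagation_stopped():
            break  # remaining listeners are skipped
        listener(event)

notified = []

def pagerduty_listener(event):
    notified.append("pagerduty")
    event.stop_propagation()  # alert acknowledged; suppress the backups

def backup_email_listener(event):
    notified.append("email")

dispatch([pagerduty_listener, backup_email_listener],
         StoppableEvent("pipeline.error", {}))
# notified == ["pagerduty"]: the backup email was skipped
```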

Common Anti-Patterns in ETL Observability

These mistakes appear in almost every first attempt at building an observable pipeline. They turn ETL observability from an asset into a liability.

| Anti-Pattern | What Happens | Why It Fails | What to Do Instead |
|---|---|---|---|
| Logging everything | Emit an event for every record, every field, every step | 100,000 records × 10 fields = 1 million log entries. Searching for the actual problem becomes impossible. Storage fills up. I/O slows the pipeline. | Log state changes and failures. Skip successes unless debugging. |
| No correlation IDs | Events exist but cannot be linked to a specific pipeline run or record | When two pipelines run simultaneously, their events interleave and you cannot tell which event belongs to which run. | Include pipeline_id and record_id in every event payload. |
| Alerting on every error | Send a Slack message for each failed record | A batch with 500 failures sends 500 Slack messages. Alert fatigue sets in and real problems get ignored. | Alert on aggregates: “Pipeline X had 500 failures (12% error rate)” — one message. |
| Synchronous external calls | Listener sends HTTP request to external service and waits for response | If the external service is slow (5 seconds per call), each event adds 5 seconds to pipeline processing time. | Queue events for asynchronous delivery, or use fire-and-forget for non-critical listeners. |
| No error isolation | Listener exception propagates and crashes the pipeline | Slack goes down → SlackNotifier throws → Pipeline crashes → 50,000 records not processed, because of a notification failure. | Wrap every listener in try/catch. Log the listener error. Continue processing. |

The “logging everything” anti-pattern is the most common. It comes from a good instinct — more data means better debugging, right? Wrong. More data means more noise. When you have a million log lines and need to find the one that explains a failure, the volume works against you. Good ETL observability is about emitting the right events at the right granularity, not about capturing everything.
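The aggregate-alerting fix from the table can be a listener that counts failures and sends a single summary when the pipeline completes. A sketch under assumptions: the class name and the idea of injecting a `send` callable (a Slack webhook, for instance) are illustrative.

```python
class AggregateAlertListener:
    """Collects record.rejected events and sends ONE summary at pipeline end."""

    def __init__(self, send):
        self.send = send      # e.g. a function that posts to Slack
        self.failures = 0
        self.total = 0

    def __call__(self, name, payload):
        if name == "record.rejected":
            self.failures += 1
        elif name == "batch.processed":
            self.total += payload["record_count"]
        elif name == "pipeline.completed" and self.failures:
            rate = 100.0 * self.failures / max(self.total, 1)
            self.send(f"Pipeline {payload['pipeline_id']}: "
                      f"{self.failures} failures ({rate:.1f}% error rate)")

messages = []
alerts = AggregateAlertListener(messages.append)
alerts("batch.processed", {"record_count": 4000})
for _ in range(500):
    alerts("record.rejected", {})
alerts("pipeline.completed", {"pipeline_id": "customer-sync"})
# messages holds exactly one summary instead of 500 individual alerts
```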

How ETL Observability Affects Pipeline Performance

Every event has a cost: memory to create the event object, CPU to execute listeners, and I/O to write logs or send messages. The question is whether that cost is acceptable.

| Observability Level | Events per Record | Overhead per 100K Records | When to Use |
|---|---|---|---|
| Minimal | 0-1 (only failures) | <1 second | High-volume production, stable pipelines |
| Standard | 1-3 (failures + state changes) | 2-5 seconds | Normal production, most pipelines |
| Verbose | 5-10 (every phase, every field) | 10-30 seconds | Debugging, new pipeline development |
| Trace | 10+ (everything) | 30-60+ seconds | Investigating specific issues, never in production |

For most production pipelines, “Standard” is the right level. You capture failures and significant state changes without materially affecting throughput. The overhead of 2-5 seconds per 100,000 records is negligible compared to the hours you save when debugging a failure.

The key principle: make the observability level configurable. A single flag in your pipeline configuration should control how verbose the event system is. When everything is running smoothly, use “Minimal.” When investigating a problem, switch to “Verbose” for a single run, get the data you need, then switch back. This is another place where Configuration-Driven ETL and observability work together.
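One way to implement that single flag is a filter wrapped around the dispatch function. The level names follow the table above; the exact event sets per level are an illustrative assumption.

```python
# Events allowed at each observability level (event names from the taxonomy).
LEVELS = {
    "minimal":  {"pipeline.failed", "record.rejected", "record.error"},
    "standard": {"pipeline.started", "pipeline.completed", "pipeline.failed",
                 "batch.processed", "record.rejected", "record.error"},
    "verbose":  None,  # None means: emit everything
}

def make_emitter(dispatch, level="standard"):
    """Wrap a dispatch function so one config flag controls verbosity."""
    allowed = LEVELS[level]

    def emit(name, payload):
        if allowed is None or name in allowed:
            dispatch(name, payload)
    return emit

seen = []
emit = make_emitter(lambda n, p: seen.append(n), level="minimal")
emit("record.field.cleaned", {"record_id": 1})  # suppressed at minimal level
emit("record.rejected", {"record_id": 1})       # failures always get through
# seen == ["record.rejected"]
```

Switching a run to verbose then means constructing the emitter with `level="verbose"` from config, with no change to the pipeline or the cleaners.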

Building Toward Distributed Tracing

This event system is the foundation for OpenTelemetry integration. Events map directly to spans and traces:

| ETL Event | OpenTelemetry Span | What It Captures |
|---|---|---|
| pipeline.started | Opens a new trace span | Pipeline ID becomes the trace ID. All subsequent events are child spans. |
| batch.processed | Child span under pipeline span | Batch number, duration, record count. Shows where time is spent. |
| record.cleaned | Child span under batch span | Field-level changes. Only emitted in verbose mode. |
| pipeline.completed | Closes the trace span with duration | Total records, errors, duration. The summary of the entire run. |

When you need distributed tracing across microservices, the infrastructure is already in place. Each event becomes a span. The pipeline ID becomes the trace ID. Correlation happens automatically because the event system was designed with these identifiers from the start.
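To make the mapping concrete without pulling in the OpenTelemetry SDK, here is a stand-in listener that turns pipeline events into span-like records. The dict shape is illustrative only, not the OTel API; a real integration would create spans through the OpenTelemetry tracer instead.

```python
import time

class TracingListener:
    """Maps pipeline lifecycle events to span-like records (a stand-in for
    real OpenTelemetry spans; field names here are illustrative)."""

    def __init__(self):
        self.spans = []
        self._open = {}  # trace_id -> start time

    def __call__(self, name, payload):
        trace_id = payload["pipeline_id"]  # the pipeline ID doubles as trace ID
        if name == "pipeline.started":
            self._open[trace_id] = time.monotonic()
        elif name == "pipeline.completed":
            start = self._open.pop(trace_id)
            self.spans.append({
                "trace_id": trace_id,
                "name": "pipeline",
                "duration_s": time.monotonic() - start,
            })

tracer = TracingListener()
tracer("pipeline.started", {"pipeline_id": "customer-sync-2024-01-15"})
tracer("pipeline.completed", {"pipeline_id": "customer-sync-2024-01-15"})
# tracer.spans now holds one closed span whose trace_id is the pipeline ID
```

Because the correlation IDs were in every payload from the start, this listener needs no changes to the pipeline itself.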

Start simple. Add listeners as you need them. The event-driven pattern scales from basic logging to enterprise-grade ETL observability without changing your pipeline code. The first listener you write should be a logging listener that writes to a file. The second should be a metrics counter. Everything else — alerting, dashboards, distributed tracing — can be added later as listeners, without touching the pipeline logic.

Key Takeaways

ETL observability transforms pipeline debugging from guesswork into investigation. The event-driven approach separates what the pipeline does from what observers see, keeping both concerns clean and independent.

  1. Events tell a story: Each event captures a decision the pipeline made — what changed, what failed, and why. Together, they form a narrative you can follow during debugging.
  2. Priority order protects reliability: Run logging listeners first (priority 100) and alerting last (priority 10). When external services fail, you still have local logs.
  3. Error isolation is mandatory: Wrap every listener in try/catch. A notification failure must never crash a data pipeline.
  4. Design your event taxonomy deliberately: Emit state changes and failures. Skip successes unless debugging. More events does not mean better observability.
  5. Include correlation IDs in every event: Pipeline ID and record ID in every payload. Without them, events from concurrent pipelines become unsortable noise.
  6. Make observability level configurable: Minimal for stable production, verbose for debugging, trace for investigation. A single config flag should control this.
  7. Alert on aggregates, not individuals: One message saying “500 failures (12% error rate)” is more actionable than 500 individual failure alerts.
  8. Build for distributed tracing from the start: Use pipeline IDs and event hierarchies that map naturally to OpenTelemetry spans. The upgrade path should be adding a listener, not rewriting the pipeline.

For more on distributed tracing and observability standards, the OpenTelemetry Observability Primer provides the foundational concepts for building observable systems.