- What Actually Happens When an ETL Observability Event is Dispatched
- Why Priority Order Matters for ETL Observability
- What Happens When a Record Fails: ETL Observability in Action
- Mental Model: The Sports Broadcast
- Designing Your ETL Observability Event Taxonomy
- Error Isolation: Why Listeners Cannot Crash Your Pipeline
- Propagation Control: When to Stop Broadcasting
- Common Anti-Patterns in ETL Observability
- How ETL Observability Affects Pipeline Performance
- Building Toward Distributed Tracing
- Key Takeaways
ETL observability is the ability to understand what happened inside your pipeline by examining its outputs — events, logs, metrics, and traces — without modifying the pipeline code. It turns a “something broke” error into a complete story: which record, which transformation, what state before and after, and exactly where things went wrong.

Your ETL pipeline fails at 3 AM. The log says “Error processing record.” Which record? Which transformation? What was the data before it failed? What did the cleaner try to do? You have no answers, only the knowledge that something broke.
This is the ETL observability problem. Traditional logging gives you noise — timestamps and messages scattered across files. Event-driven ETL observability gives you a story. This builds on the pipeline architecture from The 6-Phase Pipeline Pattern and the data quality work from Production-Tested Data Cleaners. Events are how those phases communicate what happened.
What Actually Happens When an ETL Observability Event is Dispatched
An event dispatcher is like a radio broadcast system. When something significant happens in your pipeline, it broadcasts a message. Anyone listening on that frequency receives the message and can act on it.
Let us trace what happens step by step when a pipeline starts:
Step 1: Pipeline creates an event
Event: "pipeline.started"
Pipeline ID: "customer-sync-2024-01-15-103000"
Payload: {
config: { source: "legacy_crm", target: "main_db" },
context: { batch_size: 1000, started_by: "scheduler" }
}
Step 2: Dispatcher receives the event
The dispatcher checks: who is listening for “pipeline.started” events?
Registered listeners for "pipeline.started":
- LoggingListener (priority: 100)
- MetricsListener (priority: 50)
- SlackNotifier (priority: 10)
Step 3: Listeners are called in priority order
Higher priority first. LoggingListener runs before MetricsListener.
→ LoggingListener receives event
Writes: "[2024-01-15 10:30:00] Pipeline customer-sync started with config: ..."
→ MetricsListener receives event
Records: pipeline_start_count++, start_time = now()
→ SlackNotifier receives event
Sends: "#data-ops: Pipeline customer-sync started"
Step 4: Event is marked as handled
All listeners completed. The pipeline continues. The event itself did not change how the pipeline runs — it just announced what happened so observers could react. This separation is what makes ETL observability powerful: the pipeline does its work, and the observers do theirs, independently.
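To make the mechanics concrete, here is a minimal dispatcher sketch in TypeScript. The names (EtlEvent, Listener, EventDispatcher) are illustrative rather than taken from any particular library; the point is the shape: events are plain data, and the dispatcher only routes them in priority order.

```typescript
// A minimal sketch, not a specific library's API.
type EtlEvent = {
  name: string;                     // e.g. "pipeline.started"
  pipelineId: string;               // correlation ID for this run
  payload: Record<string, unknown>; // event-specific data
};

type Listener = {
  priority: number;                 // higher runs first
  handle: (event: EtlEvent) => void;
};

class EventDispatcher {
  private listeners = new Map<string, Listener[]>();

  // Register a listener, keeping the list sorted so that
  // higher-priority listeners run first.
  on(eventName: string, listener: Listener): void {
    const list = this.listeners.get(eventName) ?? [];
    list.push(listener);
    list.sort((a, b) => b.priority - a.priority);
    this.listeners.set(eventName, list);
  }

  // Broadcast an event to every registered listener, in priority order.
  dispatch(event: EtlEvent): void {
    for (const listener of this.listeners.get(event.name) ?? []) {
      listener.handle(event);
    }
  }
}

// Usage: the pipeline announces, observers react.
const dispatcher = new EventDispatcher();
dispatcher.on("pipeline.started", {
  priority: 100,
  handle: (e) => console.log(`Pipeline ${e.pipelineId} started`),
});
dispatcher.dispatch({
  name: "pipeline.started",
  pipelineId: "customer-sync-2024-01-15-103000",
  payload: { batch_size: 1000 },
});
```

Notice that the pipeline only ever calls dispatch. It never knows who is listening, which is exactly the separation described above.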
Why Priority Order Matters for ETL Observability
Not all listeners are equal. When something fails, you want logging to happen before alerting. Why? Because if alerting fails (Slack is down, email server unreachable), you still have logs. If logging fails first, you have nothing.
| Priority | Listener Type | Why This Order | Failure Impact |
|---|---|---|---|
| 100 (Highest) | Logging | Always runs first. Even if everything else fails, you have a record. | Low — writes to local file system |
| 75 | Audit Trail | Compliance record of what happened. Must not depend on external services. | Low — writes to local database |
| 50 | Metrics | Record timing and counts. Important for dashboards but not critical for debugging. | Medium — may depend on metrics service |
| 25 | Alerting | Send notifications. Depends on external services that might be down. | High — Slack, PagerDuty, email all external |
| 10 (Lowest) | Analytics | Aggregate statistics for business reporting. Can be replayed from logs if missed. | Medium — can recover from logs |
The priority system ensures your most reliable observers act first. Logging to a local file will almost never fail. Sending a Slack message depends on network connectivity, Slack’s uptime, and API rate limits. By running the reliable listeners first, you guarantee that the ETL observability data exists even when external services are unavailable.
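In code, the table above might translate into registrations like the following, reusing the dispatcher sketch from the previous section. The handler bodies are placeholders for the real listeners, not actual library calls.

```typescript
// Priorities mirror the table: reliable local writes first,
// external services last. Handler bodies are placeholders.
dispatcher.on("pipeline.failed", { priority: 100, handle: () => {/* write to local log file */} });
dispatcher.on("pipeline.failed", { priority: 75,  handle: () => {/* insert audit row in local DB */} });
dispatcher.on("pipeline.failed", { priority: 50,  handle: () => {/* push counters to metrics service */} });
dispatcher.on("pipeline.failed", { priority: 25,  handle: () => {/* page on-call via PagerDuty/Slack */} });
dispatcher.on("pipeline.failed", { priority: 10,  handle: () => {/* update analytics aggregates */} });
```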
What Happens When a Record Fails: ETL Observability in Action
This is where event-driven ETL observability proves its value. Let us trace a failure from start to finish.
Record being processed:
{
customer_id: 12345,
phone: "---",
email: "not-an-email"
}
Step 1: Phone cleaning fails
The phone cleaner cannot extract any digits from “---”. It emits an event:
Event: "record.field.cleaned"
Payload: {
record_id: 12345,
field: "phone",
original_value: "---",
cleaned_value: null,
status: "cleaned_to_null",
reason: "No valid digits found"
}
Step 2: Email validation fails
The email cleaner cannot make “not-an-email” into a valid email:
Event: "record.field.cleaned"
Payload: {
record_id: 12345,
field: "email",
original_value: "not-an-email",
cleaned_value: null,
status: "cleaned_to_null",
reason: "Invalid email format"
}
Step 3: Business logic rejects the record
A business rule requires at least a phone OR email. This record has neither:
Event: "record.rejected"
Payload: {
record_id: 12345,
phase: "refine",
rule: "contact_info_required",
reason: "Record has no valid phone or email",
original_record: { customer_id: 12345, phone: "---", email: "not-an-email" },
cleaned_record: { customer_id: 12345, phone: null, email: null }
}
What the logging listener captured:
[10:30:01] Record 12345: phone cleaned "---" → null (No valid digits)
[10:30:01] Record 12345: email cleaned "not-an-email" → null (Invalid format)
[10:30:01] Record 12345: REJECTED at refine phase (contact_info_required)
Now when you debug at 3 AM, you know exactly what happened. The phone was all dashes. The email was garbage. The record got rejected because it had no valid contact info. No guessing required. This is what ETL observability delivers: a complete, traceable narrative of every decision the pipeline made.
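To show where such events come from, here is a sketch of a phone cleaner that reports its own work. The dispatch callback and the “at least ten digits” rule are assumptions for illustration; the essential habit is that the cleaner emits what it did, not just what it returned.

```typescript
// Sketch only: the cleaning rule and dispatch signature are
// assumptions, not a specific cleaner library's behavior.
function cleanPhone(
  recordId: number,
  raw: string,
  dispatch: (name: string, payload: Record<string, unknown>) => void,
): string | null {
  const digits = raw.replace(/\D/g, ""); // "---" yields ""
  const cleaned = digits.length >= 10 ? digits : null;
  dispatch("record.field.cleaned", {
    record_id: recordId,
    field: "phone",
    original_value: raw,
    cleaned_value: cleaned,
    status: cleaned === null ? "cleaned_to_null" : "cleaned",
    reason: cleaned === null ? "No valid digits found" : "Normalized to digits only",
  });
  return cleaned;
}
```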
Mental Model: The Sports Broadcast
Think of a sports game with multiple broadcasters:
| Component | Sports Analogy | ETL Equivalent |
|---|---|---|
| The Game | Players move, scores happen, fouls are called | Pipeline processes records through phases |
| The Press Box | Announces every significant moment over PA | Event dispatcher broadcasts events |
| TV Crew | Records video of the game | Logging listener writes to files |
| Stats Team | Updates the scoreboard | Metrics listener tracks counts and timing |
| Social Media | Tweets highlights | Alerting listener sends Slack/email |
Each broadcaster does their job independently. The TV crew does not wait for the tweet to go out. If the TV crew’s camera breaks, the game continues and everyone else keeps working. The game itself does not care who is watching — it plays out the same regardless of the audience. Your ETL pipeline works the same way. The processing logic does not know or care about the observers. It just announces events, and whoever is listening can react.
Designing Your ETL Observability Event Taxonomy
Not every action needs an event. Emitting too many events creates noise that is as useless as having no events at all. A well-designed event taxonomy captures the significant state changes without drowning you in detail.
| Event Category | Event Name | When to Emit | Payload Includes |
|---|---|---|---|
| Pipeline lifecycle | pipeline.started | Pipeline begins processing | Config, source, target, batch size |
| Pipeline lifecycle | pipeline.completed | Pipeline finishes successfully | Duration, record counts, error counts |
| Pipeline lifecycle | pipeline.failed | Pipeline encounters fatal error | Error details, last record processed, phase |
| Batch processing | batch.processed | A batch of records is loaded | Batch number, record count, duration |
| Record events | record.field.cleaned | A field value was modified by cleaning | Record ID, field, original value, cleaned value |
| Record events | record.rejected | A record failed validation or business rules | Record ID, phase, rule, reason, record data |
| Record events | record.error | Unexpected error during processing | Record ID, phase, exception, stack trace |
| Performance | phase.timing | A phase completes for a batch | Phase name, duration, record count |
Notice that “record.processed.successfully” is NOT in the list. If you have 100,000 records and 99,990 succeed, you do not need 99,990 success events. You need the pipeline.completed event with the total count, and the 10 failure events with details. ETL observability is about knowing what went wrong and what changed, not about logging every success.
The field-level cleaning event (record.field.cleaned) is optional and configurable. For high-volume pipelines, you might only emit it when the cleaning actually changed a value. For debugging a new pipeline, you might emit it for every field. The event taxonomy should be adjustable without changing pipeline code — this is where Configuration-Driven ETL connects to observability.
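A sketch of that configurability: a single mode value decides whether the field-level event fires at all. The FieldEventMode names here are assumptions for illustration, not a standard.

```typescript
// "off" for high-volume production, "changes-only" as a sensible
// default, "all" for debugging a new pipeline.
type FieldEventMode = "off" | "changes-only" | "all";

function maybeEmitFieldEvent(
  mode: FieldEventMode,
  original: unknown,
  cleaned: unknown,
  emit: () => void,
): void {
  if (mode === "off") return;                                  // emit nothing
  if (mode === "changes-only" && original === cleaned) return; // skip no-op cleanings
  emit();                                                      // "all": every field
}
```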
Error Isolation: Why Listeners Cannot Crash Your Pipeline
Here is what happens when a listener throws an exception:
Scenario: Slack is down
→ LoggingListener receives event → Success ✓
→ MetricsListener receives event → Success ✓
→ SlackNotifier receives event → THROWS: "Connection refused"
Dispatcher catches the exception:
→ Logs: "SlackNotifier failed: Connection refused"
→ Continues to next listener (if any)
→ Pipeline processing continues uninterrupted
Result: Data gets loaded. Slack notification was missed.
But you have logs and metrics. ETL observability data is preserved.
The pipeline continues. Data gets loaded. You see the Slack error in logs but your ETL did not fail just because Slack was down. This is error isolation. Observability is important, but it is not more important than the actual work. A pipeline that crashes because it could not send a notification is worse than a pipeline that silently processes data while logging the notification failure.
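The dispatch loop that provides this isolation is only a few lines of defensive code. This sketch reuses the EtlEvent and Listener shapes from the dispatcher sketch earlier; the key detail is that the catch block logs and moves on.

```typescript
// Each listener runs inside its own try/catch, so one failing
// observer cannot take down the pipeline or the other observers.
function dispatchIsolated(event: EtlEvent, listeners: Listener[]): void {
  for (const listener of listeners) {
    try {
      listener.handle(event);
    } catch (err) {
      // Record the observer's failure locally, then keep going:
      // the pipeline's work matters more than a missed notification.
      console.error(`Listener failed for ${event.name}:`, err);
    }
  }
}
```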
Propagation Control: When to Stop Broadcasting
Sometimes a listener needs to stop further processing. Maybe your primary alerting system already handled the error and calling the backup alert would be spam.
Scenario: PagerDuty acknowledges the alert
→ PagerDutyListener receives "pipeline.error" event
Sends alert to on-call engineer
Engineer acknowledges within 30 seconds
PagerDutyListener calls: event.stopPropagation()
→ BackupEmailListener would receive event, but...
Dispatcher checks: isPropagationStopped() → true
BackupEmailListener is SKIPPED
The backup email would have been redundant. The on-call engineer is already looking at the problem. Propagation control prevents alert fatigue — a real problem in operations teams that receive hundreds of duplicate notifications for a single incident.
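One plausible implementation, sketched below: the event object carries a flag, a listener sets it, and the dispatcher checks it before each remaining call. The method names mirror common event-dispatcher APIs, but this class is illustrative.

```typescript
// An event that listeners can mark as "handled, stop broadcasting".
class StoppableEvent {
  private stopped = false;
  constructor(public name: string, public payload: Record<string, unknown>) {}
  stopPropagation(): void { this.stopped = true; }
  isPropagationStopped(): boolean { return this.stopped; }
}

function dispatchStoppable(
  event: StoppableEvent,
  listeners: Array<{ handle: (e: StoppableEvent) => void }>,
): void {
  for (const listener of listeners) {
    if (event.isPropagationStopped()) break; // e.g. PagerDuty already handled it
    listener.handle(event);
  }
}
```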
Common Anti-Patterns in ETL Observability
These mistakes appear in almost every first attempt at building an observable pipeline. They turn ETL observability from an asset into a liability.
| Anti-Pattern | What Happens | Why It Fails | What to Do Instead |
|---|---|---|---|
| Logging everything | Emit an event for every record, every field, every step | 100,000 records × 10 fields = 1 million log entries. Searching for the actual problem becomes impossible. Storage fills up. I/O slows the pipeline. | Log state changes and failures. Skip successes unless debugging. |
| No correlation IDs | Events exist but cannot be linked to a specific pipeline run or record | When two pipelines run simultaneously, their events interleave and you cannot tell which event belongs to which run. | Include pipeline_id and record_id in every event payload. |
| Alerting on every error | Send a Slack message for each failed record | A batch with 500 failures sends 500 Slack messages. Alert fatigue sets in and real problems get ignored. | Alert on aggregates: “Pipeline X had 500 failures (12% error rate)” — one message. |
| Synchronous external calls | Listener sends HTTP request to external service and waits for response | If the external service is slow (5 seconds per call), each event adds 5 seconds to pipeline processing time. | Queue events for asynchronous delivery, or use fire-and-forget for non-critical listeners. |
| No error isolation | Listener exception propagates and crashes the pipeline | Slack goes down → SlackNotifier throws → Pipeline crashes → 50,000 records not processed, because of a notification failure. | Wrap every listener in try/catch. Log the listener error. Continue processing. |
The “logging everything” anti-pattern is the most common. It comes from a good instinct — more data means better debugging, right? Wrong. More data means more noise. When you have a million log lines and need to find the one that explains a failure, the volume works against you. Good ETL observability is about emitting the right events at the right granularity, not about capturing everything.
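As a sketch of the “alert on aggregates” fix: count rejections per run and send one summary message when the pipeline completes. sendSlackMessage here is a hypothetical stand-in for whatever notification client you actually use.

```typescript
// Hypothetical notification client; replace with your real one.
function sendSlackMessage(message: string): void {
  console.log(`[slack] ${message}`);
}

const failureCounts = new Map<string, number>(); // pipeline_id -> rejected count

// Listener for "record.rejected": just count, never notify.
function onRecordRejected(pipelineId: string): void {
  failureCounts.set(pipelineId, (failureCounts.get(pipelineId) ?? 0) + 1);
}

// Listener for "pipeline.completed": one summary message per run.
function onPipelineCompleted(pipelineId: string, totalRecords: number): void {
  const failures = failureCounts.get(pipelineId) ?? 0;
  if (failures > 0) {
    const rate = ((failures / totalRecords) * 100).toFixed(1);
    sendSlackMessage(`Pipeline ${pipelineId} had ${failures} failures (${rate}% error rate)`);
  }
  failureCounts.delete(pipelineId);
}
```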
How ETL Observability Affects Pipeline Performance
Every event has a cost: memory to create the event object, CPU to execute listeners, and I/O to write logs or send messages. The question is whether that cost is acceptable.
| Observability Level | Events per Record | Overhead per 100K Records | When to Use |
|---|---|---|---|
| Minimal | 0-1 (only failures) | <1 second | High-volume production, stable pipelines |
| Standard | 1-3 (failures + state changes) | 2-5 seconds | Normal production, most pipelines |
| Verbose | 5-10 (every phase, every field) | 10-30 seconds | Debugging, new pipeline development |
| Trace | 10+ (everything) | 30-60+ seconds | Investigating specific issues, never in production |
For most production pipelines, “Standard” is the right level. You capture failures and significant state changes without materially affecting throughput. The overhead of 2-5 seconds per 100,000 records is negligible compared to the hours you save when debugging a failure.
The key principle: make the observability level configurable. A single flag in your pipeline configuration should control how verbose the event system is. When everything is running smoothly, use “Minimal.” When investigating a problem, switch to “Verbose” for a single run, get the data you need, then switch back. This is another place where Configuration-Driven ETL and observability work together.
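A sketch of that single flag in practice: each event class declares the minimum level at which it is emitted, and one comparison gates everything. The level-to-event mapping below is an assumption to tune for your own pipeline.

```typescript
type ObservabilityLevel = "minimal" | "standard" | "verbose" | "trace";

const levelRank: Record<ObservabilityLevel, number> = {
  minimal: 0,
  standard: 1,
  verbose: 2,
  trace: 3,
};

// Minimum level at which each event class is emitted.
// This mapping is illustrative; adjust it to your pipeline.
const requiredLevel: Record<string, number> = {
  "pipeline.failed": 0,      // always emitted, even at minimal
  "record.rejected": 0,
  "record.field.cleaned": 1, // standard and above
  "phase.timing": 2,         // verbose and above
};

function shouldEmit(level: ObservabilityLevel, eventName: string): boolean {
  // Unknown events are treated as trace-level detail.
  return levelRank[level] >= (requiredLevel[eventName] ?? 3);
}
```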
Building Toward Distributed Tracing
This event system is the foundation for OpenTelemetry integration. Events map directly to spans and traces:
| ETL Event | OpenTelemetry Span | What It Captures |
|---|---|---|
| “pipeline.started” | Opens a new trace span | Pipeline ID becomes the trace ID. All subsequent events are child spans. |
| “batch.processed” | Child span under pipeline span | Batch number, duration, record count. Shows where time is spent. |
| “record.cleaned” | Child span under batch span | Field-level changes. Only emitted in verbose mode. |
| “pipeline.completed” | Closes the trace span with duration | Total records, errors, duration. The summary of the entire run. |
When you need distributed tracing across microservices, the infrastructure is already in place. Each event becomes a span. The pipeline ID becomes the trace ID. Correlation happens automatically because the event system was designed with these identifiers from the start.
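As a sketch of that mapping using the JavaScript OpenTelemetry API (@opentelemetry/api): a listener opens a root span on pipeline.started and closes it on pipeline.completed. A real integration also needs the OpenTelemetry SDK and an exporter configured; this shows only the event-to-span translation.

```typescript
import { trace, Span } from "@opentelemetry/api";

const tracer = trace.getTracer("etl-pipeline");
const openSpans = new Map<string, Span>(); // pipeline_id -> root span

// On "pipeline.started": open the root span for this run.
function onPipelineStarted(pipelineId: string): void {
  const span = tracer.startSpan("pipeline.run");
  span.setAttribute("pipeline.id", pipelineId);
  openSpans.set(pipelineId, span);
}

// On "pipeline.completed": attach the summary and close the span.
function onPipelineCompleted(pipelineId: string, recordCount: number): void {
  const span = openSpans.get(pipelineId);
  if (span) {
    span.setAttribute("records.total", recordCount);
    span.end(); // span duration = pipeline duration
    openSpans.delete(pipelineId);
  }
}
```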
Start simple. Add listeners as you need them. The event-driven pattern scales from basic logging to enterprise-grade ETL observability without changing your pipeline code. The first listener you write should be a logging listener that writes to a file. The second should be a metrics counter. Everything else — alerting, dashboards, distributed tracing — can be added later as listeners, without touching the pipeline logic.
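That first logging listener can be very small. A minimal sketch using Node’s built-in fs module:

```typescript
import { appendFileSync } from "node:fs";

// Append each event as one line of JSON to a local file.
// A local write is about the most reliable observer you can start with.
function loggingListener(event: { name: string; payload: object }): void {
  const line = JSON.stringify({ ts: new Date().toISOString(), ...event });
  appendFileSync("pipeline-events.log", line + "\n");
}
```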
Key Takeaways
ETL observability transforms pipeline debugging from guesswork into investigation. The event-driven approach separates what the pipeline does from what observers see, keeping both concerns clean and independent.
- Events tell a story: Each event captures a decision the pipeline made — what changed, what failed, and why. Together, they form a narrative you can follow during debugging.
- Priority order protects reliability: Run logging listeners first (priority 100) and alerting last (priority 10). When external services fail, you still have local logs.
- Error isolation is mandatory: Wrap every listener in try/catch. A notification failure must never crash a data pipeline.
- Design your event taxonomy deliberately: Emit state changes and failures. Skip successes unless debugging. More events does not mean better observability.
- Include correlation IDs in every event: Pipeline ID and record ID in every payload. Without them, events from concurrent pipelines become untraceable noise.
- Make observability level configurable: Minimal for stable production, verbose for debugging, trace for investigation. A single config flag should control this.
- Alert on aggregates, not individuals: One message saying “500 failures (12% error rate)” is more actionable than 500 individual failure alerts.
- Build for distributed tracing from the start: Use pipeline IDs and event hierarchies that map naturally to OpenTelemetry spans. The upgrade path should be adding a listener, not rewriting the pipeline.
For more on distributed tracing and observability standards, the OpenTelemetry Observability Primer provides the foundational concepts for building observable systems.