
The 80/20 Framework Architecture: Maximizing Reuse in ETL Systems

80 percent of ETL code is the same across every pipeline. The 80/20 framework captures that common infrastructure so you focus on what makes your pipeline unique.


You start a new ETL project. You write code to stream data from a file. You write code to batch records for efficient database inserts. You write code to dispatch events for logging. You write code to handle errors gracefully. Two weeks later, you realize you have written this exact code before — on the last three projects. This is the problem that ETL framework architecture solves.

About 80% of every pipeline is infrastructure: streaming, batching, error handling, event dispatching, configuration management. Only 20% is unique: your specific field mappings, business rules, and domain logic. The 80/20 ETL framework architecture captures that common infrastructure so you write less code and reuse more. Instead of rebuilding the plumbing for every project, you build it once, test it thoroughly, and focus your time on the business logic that actually matters.

This is the capstone of the ETL Pipeline Series. It brings together everything: the 6-Phase Pipeline Pattern, production-tested cleaners, event-driven observability, configuration-driven design, and dependency management. Each of these patterns is a building block. The framework is the structure that holds them together.

What is the 80/20 ETL Framework Architecture?

The 80/20 principle applied to ETL means that most of the code you write for any pipeline is the same code you wrote for the last pipeline. The framework captures this repeated code — the infrastructure, the patterns, the edge case handling — and makes it reusable. You stop solving solved problems and focus on the 20% that makes each pipeline unique.

| Category | Percentage | What It Includes | Changes How Often |
| --- | --- | --- | --- |
| Framework (infrastructure) | 80% | Streaming engine, batch processor, event dispatcher, config parser, error handler, transaction manager, checkpoint system | Rarely. Stable after initial development. |
| Business Logic (your code) | 20% | Field mappings, validation rules, cleaning customizations, domain-specific transformations, business rules | Frequently. Changes with every new source or requirement. |

This is not a theoretical split. I have tracked it across multiple projects. The streaming code is identical. The batch insert code is identical. The event dispatching code is identical. The configuration loading code is identical. What changes is which fields map to which, what the validation rules are, and what domain-specific logic applies. That is the 20%.

What Gets Reused Across Every Pipeline: The 80%

Let us trace what code is identical in every ETL project. This is the infrastructure that the framework provides.

| Component | What It Does | Why It Is Always the Same | Series Reference |
| --- | --- | --- | --- |
| Data Streamer | Reads data one record at a time, constant memory | The mechanics of streaming do not change regardless of data source | Memory-Efficient Processing |
| Batch Processor | Groups records for efficient database inserts | Batch insert logic is the same for every table | Multi-Table ETL |
| Event Dispatcher | Broadcasts events to listeners with priority ordering | The observer pattern is universal across all pipelines | Event-Driven Observability |
| Config Loader | Reads, merges, and validates pipeline configuration | Loading JSON and substituting variables is always the same | Configuration-Driven ETL |
| Error Handler | Catches, logs, and recovers from processing errors | Try/catch patterns and retry logic are universal | 6-Phase Pattern |
| Checkpoint Manager | Records progress for recovery after failures | Checkpoint read/write logic does not change between pipelines | Multi-Table ETL |
| Data Cleaners | Phone, date, email, address, business ID cleaning | The same edge cases appear in every project | Data Cleaners |

All of this is framework code. It does not change between projects. It should be written once, tested thoroughly with edge cases from real production data, and reused forever. Every hour spent rewriting this code is an hour wasted.
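To make the reusable 80% concrete, here is a minimal Python sketch of the streaming and batching pieces. The essay's own examples are PHP-flavored; these names and the CSV source are illustrative assumptions, not the framework's actual API:

```python
import csv
from typing import Iterable, Iterator, List


def stream_records(path: str) -> Iterator[dict]:
    """Yield one record at a time so memory stays constant
    regardless of file size (a CSV source, for illustration)."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row


def batch(records: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Group streamed records into fixed-size batches for efficient
    bulk inserts; the final partial batch is still flushed."""
    buffer: List[dict] = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer
```

Note that nothing in either function mentions customers, invoices, or any other domain concept. That is exactly why this code never changes between projects.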

What is Actually Unique to Each Pipeline: The 20%

Now let us look at what actually changes between projects. This is the 20% where your time creates real value.

| Unique Element | Example: CRM Sync | Example: ERP Import | Example: CSV Load |
| --- | --- | --- | --- |
| Field mappings | cust_id → customer_id | customer_number → customer_id | "Customer ID" → customer_id |
| Source connection | PostgreSQL on crm.internal | Oracle on erp.company.com | SFTP file drop at /data/daily/ |
| Business rules | Require email OR phone | Require valid account number | Require positive quantities |
| Domain validation | MC number format | Account number checksum | SKU pattern match |
| Cleaning overrides | Strip "EXT:" from phone | Convert Oracle dates | Handle Excel date serials |

This is the code that actually differs between projects. The field names change. The connection details change. The business rules change. But the infrastructure — how you stream records, how you batch inserts, how you dispatch events — stays the same. The framework handles the infrastructure. You handle the business logic.
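The field-mapping half of that 20% is small enough to sketch in a few lines of Python (a hypothetical helper, not the framework's actual API):

```python
def apply_mappings(record: dict, mappings: dict) -> dict:
    """Rename source fields to destination fields according to the
    pipeline's mapping config. Unmapped fields are dropped here; a
    design choice, since some pipelines pass them through instead."""
    return {dest: record[src] for src, dest in mappings.items() if src in record}
```

The function is generic; the `mappings` dictionary it consumes, such as `{"cust_id": "customer_id"}`, is the project-specific part that lives in configuration.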

How Component Resolution Works in the Framework

The framework uses namespace-priority loading. When a configuration file references a component like “MappingTransformer”, the resolver looks for your custom version first. If it does not exist, it falls back to the framework default. This means you get framework behavior for free and override only what you need.

Component Resolution: Namespace Priority
Configuration requests "MappingTransformer"

Scenario 1: No custom version exists
  Step 1: Look for App\ETL\MappingTransformer → Not found
  Step 2: Look for Framework\Infrastructure\MappingTransformer → Found
  Result: Use framework's MappingTransformer ✓

Scenario 2: You created a custom version
  Step 1: Look for App\ETL\MappingTransformer → Found
  Result: Use YOUR MappingTransformer ✓
  (Framework version is ignored)

Scenario 3: You extend the framework version
  Step 1: Look for App\ETL\MappingTransformer → Found
  Result: Use YOUR version, which internally calls parent framework methods
  You get: Framework behavior + your customizations

The configuration does not change between scenarios. It still says “MappingTransformer.” The resolver finds the right implementation automatically based on namespace priority. This is the Open-Closed Principle in action: open for extension, closed for modification.
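A namespace-priority resolver like the one described above can be sketched in Python. The registries and class-path strings here are illustrative assumptions standing in for real autoloading:

```python
class ComponentResolver:
    """Resolve a component name by trying namespaces in priority
    order: application overrides first, framework defaults last."""

    def __init__(self, namespaces):
        # e.g. [app_registry, framework_registry], highest priority first
        self.namespaces = namespaces

    def resolve(self, name: str):
        for registry in self.namespaces:
            if name in registry:
                return registry[name]
        raise KeyError(f"No component named {name!r} in any namespace")


framework = {"MappingTransformer": "Framework\\Infrastructure\\MappingTransformer"}
app = {}
resolver = ComponentResolver([app, framework])

# No custom version exists: falls back to the framework default
assert resolver.resolve("MappingTransformer").startswith("Framework")

# Register an override: it now wins, and the config string never changed
app["MappingTransformer"] = "App\\ETL\\MappingTransformer"
assert resolver.resolve("MappingTransformer").startswith("App")
```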

Extending Without Modifying the Framework

The most powerful aspect of this ETL framework architecture is extension without modification. You never edit framework code. You extend it. The framework’s battle-tested logic stays intact while you add your domain-specific behavior on top.

Extension Pattern: Industry-Specific Phone Cleaning
Framework cleaner (500+ lines of production-tested code):
  class ContactDataCleaner {
      public function cleanPhoneNumber($phone) {
          // Handles 47 different phone formats
          // International support (7-15 digits)
          // Edge cases from millions of records
      }
  }

Your extension (10 lines):
  class IndustryContactCleaner extends ContactDataCleaner {
      public function cleanPhoneNumber($phone) {
          // Remove industry-specific prefixes
          $phone = removePrefix($phone, ['EXT:', 'FAX:', 'TEL:']);
          // Delegate to framework for everything else
          return parent::cleanPhoneNumber($phone);
      }
  }

Result:
  You get: 500+ lines of battle-tested cleaning logic for FREE
  You add: 10 lines of industry-specific preprocessing
  Framework code: Untouched, still works for every other project

This pattern works across every framework component. Need custom validation? Extend the validator. Need a different batch strategy? Extend the batch processor. Need industry-specific date handling? Extend the date cleaner. You always start with working, tested code and add only what is specific to your domain.
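The same delegation pattern, restated as a runnable Python sketch; both classes are stand-ins (the real framework cleaner is far larger, and these method names are hypothetical):

```python
class ContactDataCleaner:
    """Stand-in for the framework cleaner: strip non-digits and
    accept 7-15 digit numbers, a simplification of the real logic."""

    def clean_phone_number(self, phone: str) -> str:
        digits = "".join(ch for ch in phone if ch.isdigit())
        return digits if 7 <= len(digits) <= 15 else ""


class IndustryContactCleaner(ContactDataCleaner):
    """Strip industry-specific prefixes, then delegate everything
    else to the framework implementation."""

    PREFIXES = ("EXT:", "FAX:", "TEL:")

    def clean_phone_number(self, phone: str) -> str:
        for prefix in self.PREFIXES:
            phone = phone.removeprefix(prefix)
        return super().clean_phone_number(phone)
```

The subclass does its preprocessing and hands off to `super()`, so every improvement to the framework cleaner flows through automatically.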

The Battle-Tested Component Library

The framework includes components extracted from years of production ETL work. These are not theoretical implementations. They handle edge cases that only appear after processing millions of records.

| Component | What It Handles | Edge Cases Covered |
| --- | --- | --- |
| ContactDataCleaner | Phone, fax, email cleaning | 47 phone formats, "—" as phone, international 7-15 digits |
| AddressCleaner | Address sanitization and normalization | XSS prevention, PO Box detection, "NULL"/"N/A" removal |
| DateCleaner | Date parsing and normalization | MySQL 0000-00-00, multi-format fallback, timezone handling |
| BusinessDataCleaner | Industry identifiers | MC numbers, DOT numbers, EINs, SCAC codes |
| DataStream | Memory-efficient iteration | Files larger than memory, CSV/JSON/database sources |
| BatchProcessor | Efficient database loading | Configurable batch sizes, upsert support, transaction management |
| EventDispatcher | Observability infrastructure | Priority ordering, error isolation, propagation control |
| ConfigLoader | Layered configuration management | Environment variable substitution, validation, merging |

This is not theoretical code. The "NULL" literal string, the 0000-00-00 date, the phone number that is just dashes — these all come from real production data. When you use the framework, you get years of edge case handling without spending the years.
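A few of those edge cases, sketched in Python; the junk literals and date formats shown are examples, not the cleaners' full lists:

```python
from datetime import datetime

# Literal strings that real exports use to mean "no value"
NULL_LITERALS = {"", "null", "n/a", "none", "-", "--", "—"}


def clean_string(value):
    """Treat literal junk strings from production exports as missing."""
    if value is None or value.strip().lower() in NULL_LITERALS:
        return None
    return value.strip()


def clean_date(value):
    """Reject MySQL's zero date, then fall back across common formats."""
    value = clean_string(value)
    if value is None or value == "0000-00-00":
        return None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None  # unparseable: surface as missing rather than crash
```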

Mental Model: The Power Tool Workshop

Think of the difference between building furniture from raw lumber versus using a workshop full of power tools.

| Approach | What You Do | Time Per Project | ETL Equivalent |
| --- | --- | --- | --- |
| Raw lumber | Cut by hand, sand manually, carve joints | Weeks per piece | Write streaming, batching, events from scratch every project |
| Power tools | Table saw cuts, sander smooths, router makes joints | Days per piece | Use framework for infrastructure, write only business logic |
| What you focus on | Design, dimensions, finish — not how to build a saw | Creative work | Field mappings, business rules — not how to batch insert |

The framework is your power tool workshop. The tools are reliable, tested, and ready to use. You focus on what makes your project unique — the design, the business rules, the domain logic — not on rebuilding infrastructure that already exists.

And just like real power tools, the framework tools get better over time. Every edge case discovered in one project improves the tool for all future projects. A phone number format that broke one pipeline gets added to the cleaner, and every future pipeline handles it automatically.

Why This Architecture Scales Across Teams and Projects

The ETL framework architecture scales in three dimensions: across projects, across team members, and across time.

| Dimension | Without Framework | With Framework |
| --- | --- | --- |
| New project | 2-3 weeks to build infrastructure before business logic starts | Hours to configure, then straight to business logic |
| New team member | Must understand the entire pipeline codebase | Must understand configuration + their 20% of business logic |
| Bug in infrastructure | Fixed in one project, still broken in others | Fixed once in framework, all projects benefit |
| New edge case | Each project discovers and handles it independently | Handled in framework, applied everywhere |
| Code review | Reviewing infrastructure + business logic together | Reviewing only the 20% business logic |

The compounding effect is significant. Project 1 builds the framework. Project 2 extends it. By Project 5, you have a battle-tested framework with edge case handling from five different data sources, five different domains, and five different failure modes. Each project makes the framework stronger for all future projects.

Common Anti-Patterns to Avoid

Framework architecture can go wrong. These anti-patterns undermine the benefits and create more problems than they solve.

| Anti-Pattern | Why It Seems Reasonable | Why It Fails | What to Do Instead |
| --- | --- | --- | --- |
| Premature abstraction | "Let us build the framework before writing any pipelines" | You abstract the wrong things without real-world usage to guide you | Extract the framework from working pipelines, not the other way around |
| God framework | "The framework should handle everything" | Overly complex, hard to extend, fights you instead of helping | Keep framework focused on the 80%. Let business logic stay in application code. |
| Modifying framework code | "It is faster to just change this one line in the framework" | Your framework diverges from the canonical version, losing future updates | Always extend, never modify. Use namespace priority for overrides. |
| Config as code | "We can express everything in YAML" | Configuration becomes a programming language without proper tooling | Keep config declarative. Complex logic belongs in extendable code, not config. |
| Skipping tests | "The framework is stable, no need to test our extensions" | Your extensions might interact with framework code in unexpected ways | Test your business logic. The framework tests its own code. |

The most dangerous anti-pattern is premature abstraction. Building a framework before you have written at least two or three real pipelines means you are guessing at what should be abstracted. The result is a framework that abstracts the wrong things and forces awkward workarounds for the things that actually matter. Always extract a framework from working code, never design one in the abstract.

Getting Started: Your First Framework Pipeline

Building your first pipeline with the framework follows a simple pattern: configure, customize, run.

First Pipeline: 3 Steps
Step 1: Create configuration (the "what")
  pipelines/customer-sync.json
  {
    "name": "customer-sync",
    "extractor": { "class": "DatabaseExtractor", "connection": "legacy_crm" },
    "mappings": { "cust_id": "customer_id", "cust_name": "full_name" },
    "cleaners": { "phone": "phone_cleaner", "email": "email_cleaner" },
    "loader": { "class": "DatabaseLoader", "connection": "main_db", "table": "customers" }
  }

Step 2: Add business rules (the "unique 20%")
  class CustomerValidator extends FrameworkValidator {
      public function validate($record) {
          if (empty($record['email']) && empty($record['phone'])) {
              return reject("Customer must have email or phone");
          }
          return accept();
      }
  }

Step 3: Run
  framework run customer-sync

  The framework handles:
    ✓ Reading configuration and validating it
    ✓ Connecting to source and destination databases
    ✓ Streaming records one at a time (memory efficient)
    ✓ Applying field mappings from config
    ✓ Running your custom validator
    ✓ Cleaning phone and email with framework cleaners
    ✓ Batch inserting into destination (1,000 at a time)
    ✓ Dispatching events for logging and monitoring
    ✓ Recording checkpoints for recovery
    ✓ Handling errors without crashing the entire pipeline

You wrote a JSON config file and a 10-line validator. The framework handled everything else. That is the 80/20 split in practice: a few minutes of configuration and business logic, supported by thousands of lines of tested infrastructure.
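One of the pieces the framework handled in step 3, event dispatching with priority ordering and error isolation, can be sketched like this (hypothetical names, not the framework's actual API):

```python
class EventDispatcher:
    """Broadcast events to listeners in priority order; a failing
    listener is isolated so it cannot crash the pipeline."""

    def __init__(self):
        self.listeners = {}  # event name -> list of (priority, callback)

    def listen(self, event, callback, priority=0):
        self.listeners.setdefault(event, []).append((priority, callback))

    def dispatch(self, event, payload=None):
        errors = []
        # Higher priority runs first; sort on priority only, since
        # callbacks themselves are not orderable
        for _, callback in sorted(self.listeners.get(event, []),
                                  key=lambda pair: -pair[0]):
            try:
                callback(payload)
            except Exception as exc:  # error isolation
                errors.append(exc)
        return errors
```

A logging listener registered at priority 10 and a metrics listener at priority 1 would both see every `batch.loaded` event, in that order, even if some other listener between them throws.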

Key Takeaways

The ETL framework architecture is about focusing your time where it creates the most value. Infrastructure is a solved problem. Business logic is where your expertise matters.

  1. 80% is infrastructure: Streaming, batching, events, config, error handling. Write it once, reuse forever.
  2. 20% is business logic: Field mappings, validation rules, domain-specific cleaning. This is where your time creates value.
  3. Extract, do not design: Build the framework from working pipelines, not from abstract architecture diagrams.
  4. Extend, never modify: Use namespace priority to override framework behavior without changing framework code.
  5. The framework gets stronger over time: Every edge case from every project improves it for all future projects.
  6. Configuration drives behavior: What the pipeline does lives in config. How it does it lives in the framework.
  7. New projects take hours, not weeks: Configure, add business rules, run. Infrastructure is already solved.
  8. New team members contribute faster: They only need to understand the 20% business logic, not the 80% infrastructure.

This is not about being lazy. It is about not solving solved problems. Your competitive advantage is in understanding your data, your domain, and your business rules. Not in yet another implementation of batch processing or event dispatching. Build on what works. Extend only what you need. Let the framework handle the rest.

For more on framework design principles, the Open-Closed Principle explains the architectural foundation that makes extensible frameworks possible. For practical patterns, Inversion of Control is the mechanism that allows framework code to call your business logic without knowing about it in advance.