
The 80/20 Framework Architecture: Maximizing Reuse in ETL Systems

80 percent of ETL code is the same across every pipeline. The 80/20 framework captures that common infrastructure so you focus on what makes your pipeline unique.


You start a new ETL project. You write code to stream data from a file. You write code to batch records for efficient database inserts. You write code to dispatch events for logging. You write code to handle errors gracefully. Two weeks later, you realize you have written this exact code before — on the last three projects. This is the problem that ETL framework architecture solves.

About 80% of every pipeline is infrastructure: streaming, batching, error handling, event dispatching, configuration management. Only 20% is unique: your specific field mappings, business rules, and domain logic. The 80/20 ETL framework architecture captures that common infrastructure so you write less code and reuse more. Instead of rebuilding the plumbing for every project, you build it once, test it thoroughly, and focus your time on the business logic that actually matters.

This is the capstone of the ETL Pipeline Series. It brings together everything: the 6-Phase Pipeline Pattern, production-tested cleaners, event-driven observability, configuration-driven design, and dependency management. Each of these patterns is a building block. The framework is the structure that holds them together.

What is the 80/20 ETL Framework Architecture?

The 80/20 principle applied to ETL means that most of the code you write for any pipeline is the same code you wrote for the last pipeline. The framework captures this repeated code — the infrastructure, the patterns, the edge case handling — and makes it reusable. You stop solving solved problems and focus on the 20% that makes each pipeline unique.

| Category | Percentage | What It Includes | Changes How Often |
| --- | --- | --- | --- |
| Framework (infrastructure) | 80% | Streaming engine, batch processor, event dispatcher, config parser, error handler, transaction manager, checkpoint system | Rarely. Stable after initial development. |
| Business Logic (your code) | 20% | Field mappings, validation rules, cleaning customizations, domain-specific transformations, business rules | Frequently. Changes with every new source or requirement. |

This is not a theoretical split. I have tracked it across multiple projects. The streaming code is identical. The batch insert code is identical. The event dispatching code is identical. The configuration loading code is identical. What changes is which fields map to which, what the validation rules are, and what domain-specific logic applies. That is the 20%.

What Gets Reused Across Every Pipeline: The 80%

Let us trace what code is identical in every ETL project. This is the infrastructure that the framework provides.

| Component | What It Does | Why It Is Always the Same | Series Reference |
| --- | --- | --- | --- |
| Data Streamer | Reads data one record at a time, constant memory | The mechanics of streaming do not change regardless of data source | Memory-Efficient Processing |
| Batch Processor | Groups records for efficient database inserts | Batch insert logic is the same for every table | Multi-Table ETL |
| Event Dispatcher | Broadcasts events to listeners with priority ordering | The observer pattern is universal across all pipelines | Event-Driven Observability |
| Config Loader | Reads, merges, and validates pipeline configuration | Loading JSON and substituting variables is always the same | Configuration-Driven ETL |
| Error Handler | Catches, logs, and recovers from processing errors | Try/catch patterns and retry logic are universal | 6-Phase Pattern |
| Checkpoint Manager | Records progress for recovery after failures | Checkpoint read/write logic does not change between pipelines | Multi-Table ETL |
| Data Cleaners | Phone, date, email, address, business ID cleaning | The same edge cases appear in every project | Data Cleaners |

All of this is framework code. It does not change between projects. It should be written once, tested thoroughly with edge cases from real production data, and reused forever. Every hour spent rewriting this code is an hour wasted.
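To make the reusable 80% concrete, here is a minimal Python sketch of the streaming and batching pieces. The essay's own examples are PHP-flavored; these names and the CSV source are illustrative assumptions, not the framework's actual API:

```python
import csv
from typing import Iterable, Iterator, List


def stream_records(path: str) -> Iterator[dict]:
    """Yield one record at a time so memory stays constant
    regardless of file size (a CSV source, for illustration)."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row


def batch(records: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Group streamed records into fixed-size batches for efficient
    bulk inserts; the final partial batch is still flushed."""
    buffer: List[dict] = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer
```

Note that nothing in either function mentions customers, invoices, or any other domain concept. That is exactly why this code never changes between projects.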

What is Actually Unique to Each Pipeline: The 20%

Now let us look at what actually changes between projects. This is the 20% where your time creates real value.

| Unique Element | Example: CRM Sync | Example: ERP Import | Example: CSV Load |
| --- | --- | --- | --- |
| Field mappings | cust_id → customer_id | customer_number → customer_id | "Customer ID" → customer_id |
| Source connection | PostgreSQL on crm.internal | Oracle on erp.company.com | SFTP file drop at /data/daily/ |
| Business rules | Require email OR phone | Require valid account number | Require positive quantities |
| Domain validation | MC number format | Account number checksum | SKU pattern match |
| Cleaning overrides | Strip "EXT:" from phone | Convert Oracle dates | Handle Excel date serials |

This is the code that actually differs between projects. The field names change. The connection details change. The business rules change. But the infrastructure — how you stream records, how you batch inserts, how you dispatch events — stays the same. The framework handles the infrastructure. You handle the business logic.
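The field-mapping half of that 20% is small enough to sketch in a few lines of Python (a hypothetical helper, not the framework's actual API):

```python
def apply_mappings(record: dict, mappings: dict) -> dict:
    """Rename source fields to destination fields according to the
    pipeline's mapping config. Unmapped fields are dropped here; a
    design choice, since some pipelines pass them through instead."""
    return {dest: record[src] for src, dest in mappings.items() if src in record}
```

The function is generic; the `mappings` dictionary it consumes, such as `{"cust_id": "customer_id"}`, is the project-specific part that lives in configuration.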

How Component Resolution Works in the Framework

The framework uses namespace-priority loading. When a configuration file references a component like “MappingTransformer”, the resolver looks for your custom version first. If it does not exist, it falls back to the framework default. This means you get framework behavior for free and override only what you need.

Component Resolution: Namespace Priority
Configuration requests "MappingTransformer"

Scenario 1: No custom version exists
  Step 1: Look for App\ETL\MappingTransformer → Not found
  Step 2: Look for Framework\Infrastructure\MappingTransformer → Found
  Result: Use framework's MappingTransformer ✓

Scenario 2: You created a custom version
  Step 1: Look for App\ETL\MappingTransformer → Found
  Result: Use YOUR MappingTransformer ✓
  (Framework version is ignored)

Scenario 3: You extend the framework version
  Step 1: Look for App\ETL\MappingTransformer → Found
  Result: Use YOUR version, which internally calls parent framework methods
  You get: Framework behavior + your customizations

The configuration does not change between scenarios. It still says “MappingTransformer.” The resolver finds the right implementation automatically based on namespace priority. This is the Open-Closed Principle in action: open for extension, closed for modification.
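A namespace-priority resolver like the one described above can be sketched in Python. The registries and class-path strings here are illustrative assumptions standing in for real autoloading:

```python
class ComponentResolver:
    """Resolve a component name by trying namespaces in priority
    order: application overrides first, framework defaults last."""

    def __init__(self, namespaces):
        # e.g. [app_registry, framework_registry], highest priority first
        self.namespaces = namespaces

    def resolve(self, name: str):
        for registry in self.namespaces:
            if name in registry:
                return registry[name]
        raise KeyError(f"No component named {name!r} in any namespace")


framework = {"MappingTransformer": "Framework\\Infrastructure\\MappingTransformer"}
app = {}
resolver = ComponentResolver([app, framework])

# No custom version exists: falls back to the framework default
assert resolver.resolve("MappingTransformer").startswith("Framework")

# Register an override: it now wins, and the config string never changed
app["MappingTransformer"] = "App\\ETL\\MappingTransformer"
assert resolver.resolve("MappingTransformer").startswith("App")
```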

Extending Without Modifying the Framework

The most powerful aspect of this ETL framework architecture is extension without modification. You never edit framework code. You extend it. The framework’s battle-tested logic stays intact while you add your domain-specific behavior on top.

Extension Pattern: Industry-Specific Phone Cleaning
Framework cleaner (500+ lines of production-tested code):
  class ContactDataCleaner {
      public function cleanPhoneNumber($phone) {
          // Handles 47 different phone formats
          // International support (7-15 digits)
          // Edge cases from millions of records
      }
  }

Your extension (10 lines):
  class IndustryContactCleaner extends ContactDataCleaner {
      public function cleanPhoneNumber($phone) {
          // Remove industry-specific prefixes
          $phone = removePrefix($phone, ['EXT:', 'FAX:', 'TEL:']);
          // Delegate to framework for everything else
          return parent::cleanPhoneNumber($phone);
      }
  }

Result:
  You get: 500+ lines of battle-tested cleaning logic for FREE
  You add: 10 lines of industry-specific preprocessing
  Framework code: Untouched, still works for every other project

This pattern works across every framework component. Need custom validation? Extend the validator. Need a different batch strategy? Extend the batch processor. Need industry-specific date handling? Extend the date cleaner. You always start with working, tested code and add only what is specific to your domain.
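The same delegation pattern, restated as a runnable Python sketch; both classes are stand-ins (the real framework cleaner is far larger, and these method names are hypothetical):

```python
class ContactDataCleaner:
    """Stand-in for the framework cleaner: strip non-digits and
    accept 7-15 digit numbers, a simplification of the real logic."""

    def clean_phone_number(self, phone: str) -> str:
        digits = "".join(ch for ch in phone if ch.isdigit())
        return digits if 7 <= len(digits) <= 15 else ""


class IndustryContactCleaner(ContactDataCleaner):
    """Strip industry-specific prefixes, then delegate everything
    else to the framework implementation."""

    PREFIXES = ("EXT:", "FAX:", "TEL:")

    def clean_phone_number(self, phone: str) -> str:
        for prefix in self.PREFIXES:
            phone = phone.removeprefix(prefix)
        return super().clean_phone_number(phone)
```

The subclass does its preprocessing and hands off to `super()`, so every improvement to the framework cleaner flows through automatically.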

The Battle-Tested Component Library

The framework includes components extracted from years of production ETL work. These are not theoretical implementations. They handle edge cases that only appear after processing millions of records.

| Component | What It Handles | Edge Cases Covered |
| --- | --- | --- |
| ContactDataCleaner | Phone, fax, email cleaning | 47 phone formats, "—" as phone, international 7-15 digits |
| AddressCleaner | Address sanitization and normalization | XSS prevention, PO Box detection, "NULL"/"N/A" removal |
| DateCleaner | Date parsing and normalization | MySQL 0000-00-00, multi-format fallback, timezone handling |
| BusinessDataCleaner | Industry identifiers | MC numbers, DOT numbers, EINs, SCAC codes |
| DataStream | Memory-efficient iteration | Files larger than memory, CSV/JSON/database sources |
| BatchProcessor | Efficient database loading | Configurable batch sizes, upsert support, transaction management |
| EventDispatcher | Observability infrastructure | Priority ordering, error isolation, propagation control |
| ConfigLoader | Layered configuration management | Environment variable substitution, validation, merging |

This is not theoretical code. The "NULL" literal string, the 0000-00-00 date, the phone number that is just dashes — these all come from real production data. When you use the framework, you get years of edge case handling without spending the years.
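A few of those edge cases, sketched in Python; the junk literals and date formats shown are examples, not the cleaners' full lists:

```python
from datetime import datetime

# Literal strings that real exports use to mean "no value"
NULL_LITERALS = {"", "null", "n/a", "none", "-", "--", "—"}


def clean_string(value):
    """Treat literal junk strings from production exports as missing."""
    if value is None or value.strip().lower() in NULL_LITERALS:
        return None
    return value.strip()


def clean_date(value):
    """Reject MySQL's zero date, then fall back across common formats."""
    value = clean_string(value)
    if value is None or value == "0000-00-00":
        return None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None  # unparseable: surface as missing rather than crash
```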

Mental Model: The Power Tool Workshop

Think of the difference between building furniture from raw lumber versus using a workshop full of power tools.

| Approach | What You Do | Time Per Project | ETL Equivalent |
| --- | --- | --- | --- |
| Raw lumber | Cut by hand, sand manually, carve joints | Weeks per piece | Write streaming, batching, events from scratch every project |
| Power tools | Table saw cuts, sander smooths, router makes joints | Days per piece | Use framework for infrastructure, write only business logic |
| What you focus on | Design, dimensions, finish — not how to build a saw | Creative work | Field mappings, business rules — not how to batch insert |

The framework is your power tool workshop. The tools are reliable, tested, and ready to use. You focus on what makes your project unique — the design, the business rules, the domain logic — not on rebuilding infrastructure that already exists.

And just like real power tools, the framework tools get better over time. Every edge case discovered in one project improves the tool for all future projects. A phone number format that broke one pipeline gets added to the cleaner, and every future pipeline handles it automatically.

Why This Architecture Scales Across Teams and Projects

The ETL framework architecture scales in three dimensions: across projects, across team members, and across time.

| Dimension | Without Framework | With Framework |
| --- | --- | --- |
| New project | 2-3 weeks to build infrastructure before business logic starts | Hours to configure, then straight to business logic |
| New team member | Must understand the entire pipeline codebase | Must understand configuration + their 20% of business logic |
| Bug in infrastructure | Fixed in one project, still broken in others | Fixed once in framework, all projects benefit |
| New edge case | Each project discovers and handles it independently | Handled in framework, applied everywhere |
| Code review | Reviewing infrastructure + business logic together | Reviewing only the 20% business logic |

The compounding effect is significant. Project 1 builds the framework. Project 2 extends it. By Project 5, you have a battle-tested framework with edge case handling from five different data sources, five different domains, and five different failure modes. Each project makes the framework stronger for all future projects.

Common Anti-Patterns to Avoid

Framework architecture can go wrong. These anti-patterns undermine the benefits and create more problems than they solve.

| Anti-Pattern | Why It Seems Reasonable | Why It Fails | What to Do Instead |
| --- | --- | --- | --- |
| Premature abstraction | "Let us build the framework before writing any pipelines" | You abstract the wrong things without real-world usage to guide you | Extract the framework from working pipelines, not the other way around |
| God framework | "The framework should handle everything" | Overly complex, hard to extend, fights you instead of helping | Keep framework focused on the 80%. Let business logic stay in application code. |
| Modifying framework code | "It is faster to just change this one line in the framework" | Your framework diverges from the canonical version, losing future updates | Always extend, never modify. Use namespace priority for overrides. |
| Config as code | "We can express everything in YAML" | Configuration becomes a programming language without proper tooling | Keep config declarative. Complex logic belongs in extendable code, not config. |
| Skipping tests | "The framework is stable, no need to test our extensions" | Your extensions might interact with framework code in unexpected ways | Test your business logic. The framework tests its own code. |

The most dangerous anti-pattern is premature abstraction. Building a framework before you have written at least two or three real pipelines means you are guessing at what should be abstracted. The result is a framework that abstracts the wrong things and forces awkward workarounds for the things that actually matter. Always extract a framework from working code, never design one in the abstract.

Getting Started: Your First Framework Pipeline

Building your first pipeline with the framework follows a simple pattern: configure, customize, run.

First Pipeline: 3 Steps
Step 1: Create configuration (the "what")
  pipelines/customer-sync.json
  {
    "name": "customer-sync",
    "extractor": { "class": "DatabaseExtractor", "connection": "legacy_crm" },
    "mappings": { "cust_id": "customer_id", "cust_name": "full_name" },
    "cleaners": { "phone": "phone_cleaner", "email": "email_cleaner" },
    "loader": { "class": "DatabaseLoader", "connection": "main_db", "table": "customers" }
  }

Step 2: Add business rules (the "unique 20%")
  class CustomerValidator extends FrameworkValidator {
      public function validate($record) {
          if (empty($record['email']) && empty($record['phone'])) {
              return reject("Customer must have email or phone");
          }
          return accept();
      }
  }

Step 3: Run
  framework run customer-sync

  The framework handles:
    ✓ Reading configuration and validating it
    ✓ Connecting to source and destination databases
    ✓ Streaming records one at a time (memory efficient)
    ✓ Applying field mappings from config
    ✓ Running your custom validator
    ✓ Cleaning phone and email with framework cleaners
    ✓ Batch inserting into destination (1,000 at a time)
    ✓ Dispatching events for logging and monitoring
    ✓ Recording checkpoints for recovery
    ✓ Handling errors without crashing the entire pipeline

You wrote a JSON config file and a 10-line validator. The framework handled everything else. That is the 80/20 split in practice: a few minutes of configuration and business logic, supported by thousands of lines of tested infrastructure.
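One of the pieces the framework handled in step 3, event dispatching with priority ordering and error isolation, can be sketched like this (hypothetical names, not the framework's actual API):

```python
class EventDispatcher:
    """Broadcast events to listeners in priority order; a failing
    listener is isolated so it cannot crash the pipeline."""

    def __init__(self):
        self.listeners = {}  # event name -> list of (priority, callback)

    def listen(self, event, callback, priority=0):
        self.listeners.setdefault(event, []).append((priority, callback))

    def dispatch(self, event, payload=None):
        errors = []
        # Higher priority runs first; sort on priority only, since
        # callbacks themselves are not orderable
        for _, callback in sorted(self.listeners.get(event, []),
                                  key=lambda pair: -pair[0]):
            try:
                callback(payload)
            except Exception as exc:  # error isolation
                errors.append(exc)
        return errors
```

A logging listener registered at priority 10 and a metrics listener at priority 1 would both see every `batch.loaded` event, in that order, even if some other listener between them throws.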

Key Takeaways

The ETL framework architecture is about focusing your time where it creates the most value. Infrastructure is a solved problem. Business logic is where your expertise matters.

  1. 80% is infrastructure: Streaming, batching, events, config, error handling. Write it once, reuse forever.
  2. 20% is business logic: Field mappings, validation rules, domain-specific cleaning. This is where your time creates value.
  3. Extract, do not design: Build the framework from working pipelines, not from abstract architecture diagrams.
  4. Extend, never modify: Use namespace priority to override framework behavior without changing framework code.
  5. The framework gets stronger over time: Every edge case from every project improves it for all future projects.
  6. Configuration drives behavior: What the pipeline does lives in config. How it does it lives in the framework.
  7. New projects take hours, not weeks: Configure, add business rules, run. Infrastructure is already solved.
  8. New team members contribute faster: They only need to understand the 20% business logic, not the 80% infrastructure.

This is not about being lazy. It is about not solving solved problems. Your competitive advantage is in understanding your data, your domain, and your business rules. Not in yet another implementation of batch processing or event dispatching. Build on what works. Extend only what you need. Let the framework handle the rest.

For more on framework design principles, the Open-Closed Principle explains the architectural foundation that makes extensible frameworks possible. For practical patterns, Inversion of Control is the mechanism that allows framework code to call your business logic without knowing about it in advance.