Types of ETL Pipelines: Complete Guide to Every Data Source

Not all types of ETL pipelines are the same. The term “ETL” often gets treated as a single concept, but the reality is that every data source requires a fundamentally different extraction strategy, transformation approach, and loading pattern. A pipeline that moves data from MySQL to PostgreSQL looks nothing like one that transcribes audio files into searchable text, or one that ingests real-time events from a message queue.

After years of building data integration systems, I have worked with most of these types of ETL pipelines. Each one taught me something different about how data moves through systems. This guide covers every major type, what makes each one unique, and the real-world challenges you will face with each approach.

If you are new to ETL concepts, start with Understanding ETL: The Philosophy Behind Data Pipelines first. If you already understand the basics, this guide will show you the full landscape of what types of ETL pipelines exist and when to use each one.

Types of ETL Pipelines: Overview

Database-to-Database ETL

This is the most common type of ETL pipeline. You are moving data from one relational database to another. MySQL to PostgreSQL. SQL Server to BigQuery. Oracle to Snowflake. The source and destination both understand SQL, but they speak slightly different dialects.

Aspect	Details
Sources	MySQL, PostgreSQL, SQL Server, Oracle, MariaDB, SQLite
Extraction	SQL queries, change data capture (CDC), database replication logs
Key Challenge	Schema differences, data type mismatches, character encoding, large table pagination
Common Pattern	Full load for initial migration, incremental load for ongoing sync

The under-the-hood challenge here is pagination. You cannot just run SELECT * FROM customers when the table has 50 million rows. You need cursor-based pagination or chunked queries using primary key ranges. The extraction phase must handle connection timeouts, lock contention, and the reality that the source database is still serving production traffic while you are reading from it.

Type conversion is another hidden complexity. MySQL’s TINYINT(1) might mean boolean, but PostgreSQL expects an actual BOOLEAN type. SQL Server’s DATETIME2 has different precision than PostgreSQL’s TIMESTAMP. These differences are invisible until they cause failures at 3 AM.

Flat File ETL

Flat file ETL processes data from CSV, TSV, fixed-width, and other delimited text files. Despite being one of the oldest data formats, flat files remain one of the most common data exchange methods in enterprise environments.

Aspect	Details
Formats	CSV, TSV, fixed-width, pipe-delimited, custom delimiters
Extraction	File system read, stream parsing, line-by-line processing
Key Challenge	Encoding issues, inconsistent delimiters, quoted fields with embedded newlines, no schema enforcement
Common Pattern	Streaming iterator with memory-efficient line-by-line processing

The real problem with flat files is that they have no schema enforcement. A CSV file does not tell you that column 3 should be a date. It does not tell you that column 7 should never be null. Every assumption you make about the data must be validated during the transform and clean phases. One malformed row with an extra comma can shift every column in every subsequent row.

Memory management is critical here. A 10 GB CSV file cannot be loaded into memory all at once. You need iterator patterns that process one row at a time while keeping memory constant. For more on this, see Memory-Efficient Data Processing.

API-Based ETL

API-based ETL pulls data from web services using REST, GraphQL, or SOAP protocols. This is the fastest-growing type of ETL pipeline because every modern SaaS product exposes an API. Stripe for payments. Salesforce for CRM. HubSpot for marketing. GitHub for development metrics.

Aspect	Details
Protocols	REST, GraphQL, SOAP, gRPC
Extraction	HTTP requests with pagination, authentication tokens, rate limit handling
Key Challenge	Rate limiting, pagination strategies, authentication expiry, API versioning changes
Common Pattern	Paginated fetch with exponential backoff and retry logic

The under-the-hood complexity with API ETL is rate limiting. Stripe allows 100 requests per second. Salesforce gives you a daily API call limit. If your pipeline does not respect these limits, the API will throttle you or block you entirely. You need retry logic with exponential backoff baked into your extraction phase.

Pagination is another challenge that varies by API. Some APIs use page numbers. Others use cursors. Some use link headers. Your extractor needs to handle all these patterns, and it needs to detect when an API silently changes its pagination behavior during a version update.

JSON, XML, and Document ETL

Document-format ETL handles nested, hierarchical data structures. Unlike flat CSV rows, JSON and XML documents can contain arrays, nested objects, and mixed-depth structures that do not map cleanly to relational tables.

Aspect	Details
Formats	JSON, XML, YAML, TOML, MessagePack, Protocol Buffers
Extraction	File parsing, streaming parsers (SAX for XML, JSON streaming), path-based selection
Key Challenge	Nested structure flattening, array handling, schema evolution, large document memory usage
Common Pattern	Recursive flattening with path-based field mapping

The core challenge is flattening. A JSON object like {"address": {"city": "London", "country": "UK"}} needs to become flat columns: address_city and address_country. But when you have arrays like {"orders": [...]}, you need to decide: do you create a separate table for orders, or do you denormalize them into the parent record?

XML adds its own complexity with namespaces, attributes versus elements, and the distinction between document-oriented and data-oriented XML. A SOAP response from a legacy banking system looks nothing like a clean REST JSON response, even when they carry the same information.

Streaming and Event-Driven ETL

Streaming ETL processes data in real-time as events arrive, rather than in scheduled batches. This type of ETL pipeline is fundamentally different from all others because there is no “extract all data” step. Data arrives continuously, and your pipeline must keep up.

Aspect	Details
Sources	Apache Kafka, RabbitMQ, Amazon SQS, webhooks, WebSockets, server-sent events
Extraction	Consumer groups, topic subscriptions, webhook listeners
Key Challenge	Ordering guarantees, exactly-once processing, backpressure, consumer lag
Common Pattern	Consumer with offset tracking, dead letter queues for failures

The fundamental difference is that batch ETL asks “give me everything since last time,” while streaming ETL says “process each event as it arrives.” This changes your entire architecture. You cannot retry a failed batch easily because new events keep coming. You need dead letter queues for failed events, offset tracking to know where you stopped, and backpressure handling for when events arrive faster than you can process them.

For observability patterns that work with streaming pipelines, see Event-Driven Observability.

NoSQL Database ETL

NoSQL ETL extracts data from document stores, key-value stores, wide-column databases, and graph databases. The challenge is that these systems store data in fundamentally different structures than relational databases.

Aspect	Details
Sources	MongoDB, DynamoDB, Cassandra, Redis, Neo4j, CouchDB, Elasticsearch
Extraction	Collection scans, change streams, export utilities, query-based extraction
Key Challenge	Schema-less documents, inconsistent field types across documents, embedded vs referenced relationships
Common Pattern	Schema inference from sample documents, then validation against full dataset

The biggest challenge with NoSQL ETL is that there is no guaranteed schema. In MongoDB, one customer document might have a phone field as a string, another might have it as an array of strings, and a third might not have it at all. Your pipeline must handle all these variations without crashing. Schema inference from a sample is not enough — you need to validate and adapt as you process the full dataset.

Cloud Storage ETL

Cloud storage ETL processes files stored in object storage services like Amazon S3, Azure Blob Storage, or Google Cloud Storage. These files often use columnar formats optimized for analytics workloads.

Aspect	Details
Sources	Amazon S3, Azure Blob, Google Cloud Storage, MinIO
Formats	Parquet, Avro, ORC, Delta Lake, compressed CSV/JSON
Key Challenge	Partition discovery, schema evolution in columnar files, cost of listing/scanning large buckets
Common Pattern	Partition-aware reading with predicate pushdown for efficiency

Columnar formats like Parquet and Avro are fundamentally different from row-based formats like CSV. Instead of storing data row by row, they store data column by column. This means reading just the customer_name column from a 100-column Parquet file is extremely fast because you skip the other 99 columns entirely. Your ETL pipeline needs to understand these formats to take advantage of predicate pushdown and column pruning.

Spreadsheet ETL

Spreadsheet ETL is surprisingly common in enterprise environments. Business teams export data to Excel, add manual annotations, and upload it back. Finance sends monthly reports as spreadsheets. HR sends employee data in XLSX format.

Aspect	Details
Sources	Excel (.xlsx, .xls), Google Sheets, LibreOffice Calc, Numbers
Extraction	Library-based parsing, Google Sheets API, multi-sheet handling
Key Challenge	Multiple sheets, merged cells, formula dependencies, inconsistent formatting, manual data entry errors
Common Pattern	Header detection, data region identification, cell-by-cell type inference

Spreadsheets are the most unpredictable data source. Users merge cells, leave blank rows as separators, mix data types in the same column, and use formulas that reference other sheets. Your pipeline needs to handle all of this gracefully. The extraction phase for spreadsheets is often the most complex because you are essentially parsing a visual layout, not a data structure.

Log Processing ETL

Log ETL transforms unstructured or semi-structured log files into queryable, structured data. Application logs, web server access logs, system logs, and audit trails all contain valuable data buried in text patterns.

Aspect	Details
Sources	Application logs, nginx/Apache access logs, syslog, audit trails, CloudWatch, Stackdriver
Extraction	Regex pattern matching, log format parsers, structured logging (JSON lines)
Key Challenge	Multi-line log entries, inconsistent formats across versions, high volume, time zone handling
Common Pattern	Grok patterns or regex with streaming tail-based ingestion

The under-the-hood challenge with log ETL is multi-line entries. A Java stack trace spans dozens of lines, but it is logically one log entry. Your parser needs to detect where one entry ends and the next begins. Structured logging formats like JSON Lines solve this problem by putting one complete JSON object per line, but legacy applications rarely use structured logging.

FTP and Legacy System ETL

FTP-based ETL is still alive and well in finance, healthcare, government, and manufacturing. Many enterprise partners exchange data by dropping files on FTP or SFTP servers on a daily schedule.

Aspect	Details
Sources	FTP, SFTP, FTPS, mainframe exports, EDI (Electronic Data Interchange), legacy proprietary formats
Extraction	Scheduled file polling, file-based triggers, download-then-process
Key Challenge	File arrival timing, partial uploads, duplicate detection, fixed-width legacy formats
Common Pattern	Poll → download → verify checksum → process → archive

The real-world challenge with FTP ETL is timing. How do you know when a file is fully uploaded? If you start reading a file that is still being written, you get partial data. Common solutions include watching for a companion “.done” flag file, checking file size stability over time, or requiring the sender to upload to a staging directory and then move it to the pickup directory atomically.

OCR and Document Extraction ETL

OCR (Optical Character Recognition) ETL extracts structured data from unstructured documents. Invoices become line items. Scanned contracts become searchable text. Handwritten forms become database records.

Aspect	Details
Sources	PDF documents, scanned images, photographs of documents, fax outputs
Extraction	OCR engines (Tesseract, AWS Textract, Google Vision), layout analysis, table detection
Key Challenge	Recognition accuracy, layout preservation, handwriting variation, multi-language support
Common Pattern	Pre-process image → OCR → post-process text → validate against expected schema

OCR ETL has a unique challenge that other types do not: confidence scoring. When the OCR engine reads a blurry invoice, it might recognize “1,000” or “1.000” or “l,000”. Your pipeline needs confidence thresholds. Records below the threshold get flagged for human review instead of being loaded automatically. This means your pipeline architecture must include a human-in-the-loop path alongside the automated path.

Web Scraping ETL

Web scraping ETL extracts data from websites that do not provide APIs. Price monitoring, competitor analysis, public data collection, and content aggregation all rely on scraping pipelines.

Aspect	Details
Sources	HTML pages, JavaScript-rendered SPAs, public databases, government portals
Extraction	HTML parsing (DOM selectors, XPath), headless browsers, HTTP requests
Key Challenge	Website structure changes, anti-bot measures, JavaScript rendering, rate limiting, legal considerations
Common Pattern	Fetch HTML → parse selectors → validate extracted fields → handle structure changes gracefully

The fundamental problem with scraping ETL is fragility. When a website changes its HTML structure, your selectors break. Unlike APIs which have versioning and deprecation notices, websites change without warning. Your pipeline needs robust error detection that alerts you when extraction yields unexpected results, rather than silently loading garbage data.

Media ETL: Audio, Video, and Image

Media ETL is the newest frontier. With AI and machine learning capabilities becoming accessible, pipelines that process audio, video, and image data are becoming practical for mainstream engineering teams.

Media Type	ETL Use Case	Tools
Audio	Speech-to-text transcription, sentiment analysis, speaker identification	Whisper, AWS Transcribe, Google Speech-to-Text
Video	Frame extraction, subtitle generation, object detection, metadata extraction	FFmpeg, AWS Rekognition, OpenCV
Image	OCR, EXIF metadata extraction, classification, feature extraction for search	Tesseract, Pillow, AWS Rekognition, CLIP

Media ETL pipelines share a common pattern: the raw media file goes through an AI model to produce structured data, and then that structured data flows through a traditional ETL pipeline. A 2-hour meeting recording becomes a transcript, the transcript becomes speaker-labeled segments, and those segments become searchable records in a database. The ETL part is the same — it is the extraction step that is fundamentally different.

For pipelines that prepare data for AI processing, see Building AI-Ready ETL Pipelines.

How to Choose the Right Type of ETL Pipeline

Choosing the right type of ETL pipeline depends on three factors: your data source, your latency requirements, and your data volume.

If Your Source Is…	Use This Type	Key Consideration
Another relational database	Database-to-Database ETL	Use CDC for incremental loads over full extracts
Files from partners or exports	Flat File or FTP ETL	Validate file completeness before processing
SaaS platforms (Stripe, Salesforce)	API-Based ETL	Respect rate limits, handle pagination properly
Message queues or event streams	Streaming ETL	Design for exactly-once processing from the start
Physical documents or scans	OCR ETL	Build a human review path for low-confidence results
Websites without APIs	Web Scraping ETL	Monitor for structure changes, have fallback selectors
Audio, video, or images	Media ETL	Separate AI extraction from data transformation

In practice, most production systems use multiple types of ETL pipelines. You might pull customer data from a database, payment data from Stripe’s API, and invoice scans from an OCR service — all feeding into the same target warehouse. The key insight is that while the extraction strategy differs, the transformation and loading patterns remain consistent. That is why the 6-phase pipeline pattern works regardless of source type.

For a practical approach to building pipelines that handle multiple source types with maximum code reuse, see The 80/20 Framework Architecture.

Key Takeaways

The type of ETL pipeline you build is dictated by the source — not by the technology you prefer. Match the pipeline pattern to the data you are extracting, and most of the architectural decisions become obvious.

Database-to-database is the cleanest case: Schemas are known, types are explicit, transactions are available. When you can use CDC or a replication slot, prefer it over batch SELECT-and-INSERT.
Flat files demand schema discovery first: Headers, encodings, delimiters, and date formats vary even within a single source. Validate every file before the load phase, never after.
API ETL is rate-limit ETL: Pagination, throttling, retry-with-backoff, and idempotent upserts are the entire job. Build for the limit you see, not the limit you wish for.
Document ETL is schema-on-read: JSON and XML hide their schema until you look. Normalise into typed columns at the boundary, not later — debugging a JSON blob in production is misery.
Streaming ETL trades simplicity for freshness: Pick streaming only when the business value of seconds-old data exceeds the operational cost of a Kafka or Kinesis pipeline.
OCR and scraping ETL are best-effort: Both are probabilistic — design for incorrect extractions, downstream human review, and an audit trail that lets you go back to the source.
Choose the type by source, not by tool: The right pipeline pattern is determined by the shape of the source data. Same destination, same business rules — but ten very different pipelines depending on where the data comes from.

Knowing which type you are building is half the work. Once the type is named, the patterns, libraries, and failure modes follow predictably. The rest of this series covers each one in depth.

Types of ETL Pipelines: A Complete Guide to Every Data Source

Database-to-Database ETL

Flat File ETL

API-Based ETL

JSON, XML, and Document ETL

Streaming and Event-Driven ETL

NoSQL Database ETL

Cloud Storage ETL

Spreadsheet ETL

Log Processing ETL

FTP and Legacy System ETL

OCR and Document Extraction ETL

Web Scraping ETL

Media ETL: Audio, Video, and Image

How to Choose the Right Type of ETL Pipeline

Key Takeaways

Further Reading

Leave a Comment Cancel

Types of ETL Pipelines: A Complete Guide to Every Data Source

Database-to-Database ETL

Flat File ETL

API-Based ETL

JSON, XML, and Document ETL

Streaming and Event-Driven ETL

NoSQL Database ETL

Cloud Storage ETL

Spreadsheet ETL

Log Processing ETL

FTP and Legacy System ETL

OCR and Document Extraction ETL

Web Scraping ETL

Media ETL: Audio, Video, and Image

How to Choose the Right Type of ETL Pipeline

Key Takeaways

Further Reading

Leave a Comment Cancel

Related Essays

The 80/20 Framework Architecture: Maximizing Reuse in ETL Systems

Building AI-Ready ETL Pipelines: Embeddings, Chunking, and Vector Storage

Configuration-Driven ETL: Separating Logic from Declaration