
Iterator Patterns: Proven Guide to Avoid Memory Crashes

With thousands of records, loading everything into an array crashes your server. Iterator patterns solve this by processing one record at a time, keeping memory constant.

Iterator Patterns

Iterator patterns allow you to process datasets of any size without increasing memory usage. They are the foundation of memory-efficient data processing in ETL pipelines, enabling you to handle millions of records with the same memory footprint as processing a single record.

You have 100,000 rows in your database. You need to process each one. The obvious approach loads everything into an array, loops through it, and writes the results. This works fine with 1,000 records. With 100,000, your server runs out of memory and crashes.

This is not a tutorial on using PHP iterators. This is about iterator patterns — understanding what actually happens in memory when you process data, and why these patterns enable memory-efficient processing at its root. This builds on the philosophy discussed in Understanding ETL: The Philosophy Behind Data Pipelines.

Why Iterator Patterns Matter: What Actually Happens in Memory

Let us start with the simplest case. You have three records in a database. Each record is 1KB. You want to process them all.

The Array Approach:

Array Loading: fetchAll()
$rows = $database->query("SELECT * FROM users")->fetchAll();
// At this point: ALL rows are loaded into memory

Step by step, here is what happens in memory:

Step 1: You call fetchAll()
→ PHP allocates memory for an empty array
→ Memory used: ~200 bytes

Step 2: First row arrives from database
→ PHP allocates 1KB for this row
→ PHP adds it to the array
→ Memory used: ~1.2KB

Step 3: Second row arrives
→ PHP allocates another 1KB
→ Both rows now in memory
→ Memory used: ~2.2KB

Step 4: Third row arrives
→ PHP allocates another 1KB
→ All three rows in memory
→ Memory used: ~3.2KB

With three rows, this is fine. Now imagine 100,000 rows at 1KB each. That is 100MB of memory. Your server has 64MB allocated to PHP. The process crashes before it even starts processing.

The key insight: fetchAll() waits until EVERY row is loaded before returning. Your code cannot even start processing until all 100MB is in memory.

| Approach | Memory at 1K Rows | Memory at 100K Rows | Memory at 1M Rows |
| --- | --- | --- | --- |
| Array (fetchAll) | ~1 MB | ~100 MB | ~1 GB (crash) |
| Iterator (cursor) | ~1 KB | ~1 KB | ~1 KB |
| Batched Iterator | ~1 MB | ~1 MB | ~1 MB |
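The difference is easy to measure directly. Below is a minimal sketch that simulates rows with synthetic ~1KB strings instead of a database; the helper names (makeRowsArray, makeRowsGenerator) are invented for this demo. The array version's peak memory grows with the row count, while the generator's footprint stays flat.

```php
<?php
// Demo: peak memory of "load everything" vs "stream one row at a time".
// Rows are simulated with ~1KB strings; the helpers below are invented
// for this sketch, not part of any library.

function makeRowsArray(int $n): array {
    $rows = [];
    for ($i = 0; $i < $n; $i++) {
        $rows[] = str_repeat('x', 1024); // ~1KB per "row", all held at once
    }
    return $rows;
}

function makeRowsGenerator(int $n): Generator {
    for ($i = 0; $i < $n; $i++) {
        yield str_repeat('x', 1024); // ~1KB per "row", one at a time
    }
}

// Streaming: memory growth stays near zero regardless of $n.
$before = memory_get_usage();
$count = 0;
foreach (makeRowsGenerator(10000) as $row) {
    $count++; // process and release
}
$generatorGrowth = memory_get_usage() - $before;

// Loading: peak memory grows with the row count (roughly 10MB here).
$peakBefore = memory_get_peak_usage();
$all = makeRowsArray(10000);
$arrayPeakGrowth = memory_get_peak_usage() - $peakBefore;
unset($all);
```

The exact byte counts vary by PHP version and platform, but the shape of the result does not: the generator's growth is a few kilobytes at most, the array's is megabytes.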

How Iterator Patterns Enable Memory-Efficient Processing

Iterator patterns do not load everything at once. They load one record, give it to you, then forget it and load the next.

Same three records, different approach:

Iterator Approach: Cursor-Based Loading
$cursor = $database->query("SELECT * FROM users");
foreach ($cursor as $row) {
    process($row);
}

Step by step, here is what happens in memory:

Step 1: You start the foreach loop
→ PHP asks the cursor for the first row
→ Database sends row 1
→ PHP allocates 1KB for this row
→ Memory used: ~1.2KB

Step 2: You call process($row)
→ Your code works with the row
→ Memory still: ~1.2KB

Step 3: Loop moves to next iteration
→ PHP releases row 1 from memory
→ PHP asks cursor for row 2
→ Database sends row 2
→ PHP allocates 1KB for row 2
→ Memory used: ~1.2KB (not 2.2KB!)

Step 4: Loop moves again
→ PHP releases row 2
→ PHP gets row 3
→ Memory used: ~1.2KB (still!)

Whether you have 3 rows or 100,000 rows, memory usage stays at ~1.2KB. Iterator patterns only hold ONE row at a time.
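The same behavior is easy to observe outside a database. The sketch below streams lines from a file with a generator; fgets() reads one line per call, so memory stays flat however large the file is. (One caveat for MySQL: PDO buffers result sets by default, so a database cursor only behaves this way when buffered queries are disabled, e.g. by setting PDO::MYSQL_ATTR_USE_BUFFERED_QUERY to false.) The streamLines helper is invented for this demo.

```php
<?php
// Sketch: the cursor idea applied to a file. fgets() reads one line
// per call, so only one line is ever in memory.

function streamLines(string $path): Generator {
    $handle = fopen($path, 'r');
    try {
        while (($line = fgets($handle)) !== false) {
            yield rtrim($line, "\n"); // one line in memory at a time
        }
    } finally {
        fclose($handle); // runs even if the caller stops iterating early
    }
}

// Build a small sample file for the demo.
$path = tempnam(sys_get_temp_dir(), 'rows');
file_put_contents($path, implode("\n", ['alice', 'bob', 'carol']) . "\n");

$names = [];
foreach (streamLines($path) as $name) {
    $names[] = $name;
}
unlink($path);
```

The try/finally inside the generator is worth copying: it guarantees the file handle closes even when the consumer abandons the loop early.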

Array vs Iterator: Memory Behavior

Iterator Patterns Mental Model: The Assembly Line

Think of a factory assembly line versus a warehouse approach.

| Approach | Analogy | How It Works |
| --- | --- | --- |
| Array (Warehouse) | Trucks deliver ALL materials before work begins | Need warehouse space for 100,000 parts. If it does not fit, nothing gets done. |
| Iterator (Assembly Line) | One part arrives, gets processed, moves on | Floor only holds what is being worked on. Same space for 100 or 100,000 units. |

The warehouse approach requires you to know the total size in advance. If you order 100,000 parts but only have space for 50,000, nothing happens. With iterator patterns, the assembly line does not care how many parts are coming. It processes whatever arrives, one at a time, at a constant pace. The line itself never changes size.

What yield Actually Does Inside Iterator Patterns

In PHP, generators use the yield keyword. This is not just syntactic sugar. Understanding what yield does internally explains why memory stays constant.

PHP Generator with yield
function getRows($pdo) {
    $stmt = $pdo->query("SELECT * FROM users");
    while ($row = $stmt->fetch()) {
        yield $row;  // What happens here?
    }
}

Under the hood, yield does three things:

| Step | What yield Does | What It Means |
| --- | --- | --- |
| 1. Returns the value | The current $row is sent to the caller | The caller gets one row to work with |
| 2. Pauses execution | PHP remembers line number, variables, loop position | The function is frozen in place, not terminated |
| 3. Waits for next() | Execution stays paused until someone asks for more | No CPU or memory used while waiting |

When the caller asks for the next value, PHP jumps back to exactly where it paused, continues the while loop, fetches the next row, hits yield again, and pauses again.

The previous row? PHP released it when execution moved past the yield. It is no longer referenced. Garbage collection frees that memory. This is why memory stays constant — the generator only keeps the execution context (a few hundred bytes), not the data it has yielded.

Think of it this way: yield is a two-way door. Data goes out to the caller. Execution pauses. When the caller knocks again, execution resumes, the next piece of data goes out, and the door pauses again. The room behind the door (the generator function) only holds the current piece of data, never the history.
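The pause-and-resume behavior can be traced directly. This sketch appends to a shared log from both the caller and the generator; the interleaving shows that the generator body does not run until the first value is requested, and that each next() resumes exactly where the previous yield paused. The ticker name is invented for this demo.

```php
<?php
// Trace: when does generator code actually run?

$log = [];

function ticker(array &$log): Generator {
    $log[] = 'gen: before first yield';
    yield 1;
    $log[] = 'gen: resumed after first yield';
    yield 2;
    $log[] = 'gen: resumed after second yield';
}

$gen = ticker($log);
$log[] = 'caller: created generator';   // nothing inside ticker() has run yet
$log[] = 'caller: got ' . $gen->current(); // current() primes the generator
$gen->next();                            // resume, run to the next yield
$log[] = 'caller: got ' . $gen->current();
$gen->next();                            // resume, generator runs to completion
```

The log shows 'caller: created generator' before any generator line: creating a generator costs nothing until someone pulls a value.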

Iterator Patterns Meet Batch Processing

Processing one record at a time is memory-safe but slow. Database inserts one at a time are expensive. The network round trip for each record adds up.

The solution: batch processing with bounded memory. Iterator patterns provide the streaming, and batching provides the efficiency.

Here is what happens with batches of 1000:

Step 1: Iterator yields record 1
→ Add to batch array
→ Memory: ~1KB

Step 2-999: More records yielded
→ Add each to batch array
→ Memory grows: 1KB → 999KB

Step 1000: Batch is full
→ Bulk insert all 1000 records (one database call)
→ Clear the batch array
→ Memory drops back to ~0KB

Step 1001: Start new batch
→ Memory: ~1KB again

Maximum memory is always (batch size × record size). With 1000-record batches at 1KB each, you never use more than ~1MB, whether processing 10,000 or 100,000 records.

Batch Processing with Iterators
function processBatched($iterator, $batchSize = 1000) {
    $batch = [];
    foreach ($iterator as $record) {
        $batch[] = $record;
        if (count($batch) >= $batchSize) {
            bulkInsert($batch);   // One database call for 1000 records
            $batch = [];          // Release memory
        }
    }
    if (!empty($batch)) {
        bulkInsert($batch);       // Handle remaining records
    }
}
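Here is that batching pattern exercised end to end, with bulkInsert stubbed out to record batch sizes instead of touching a database (a sketch; the stub and the records helper are invented for this demo). With 2,500 records and a batch size of 1,000, the final partial batch of 500 must still be flushed.

```php
<?php
// Sketch: the batching pattern with a stub bulkInsert that records
// batch sizes rather than issuing real INSERTs.

$batchSizes = [];

function bulkInsert(array $batch): void {
    global $batchSizes;
    $batchSizes[] = count($batch); // stand-in for one INSERT ... VALUES call
}

function processBatched(iterable $iterator, int $batchSize = 1000): void {
    $batch = [];
    foreach ($iterator as $record) {
        $batch[] = $record;
        if (count($batch) >= $batchSize) {
            bulkInsert($batch);
            $batch = [];          // release memory
        }
    }
    if (!empty($batch)) {
        bulkInsert($batch);       // flush the final partial batch
    }
}

function records(int $n): Generator {
    for ($i = 1; $i <= $n; $i++) {
        yield ['id' => $i];
    }
}

processBatched(records(2500), 1000);
// 2,500 records at batch size 1,000: batches of 1000, 1000, and 500
```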

How to Choose the Right Batch Size for Iterator Patterns

Batch size is a trade-off between memory usage, database efficiency, and error recovery. There is no universal right answer, but there are clear guidelines based on what you are optimizing for.

| Batch Size | Memory Impact | Database Efficiency | Error Granularity | Best For |
| --- | --- | --- | --- | --- |
| 100 | Low (~100KB) | Moderate (many round trips) | Fine (lose max 100 records) | Small records, strict memory limits |
| 1,000 | Moderate (~1MB) | Good (balanced) | Acceptable (lose max 1K records) | General-purpose ETL |
| 5,000 | Higher (~5MB) | Excellent (few round trips) | Coarse (lose max 5K records) | Large records, fast networks |
| 10,000+ | High (~10MB+) | Diminishing returns | Poor (lose many records on failure) | Rarely recommended |

The sweet spot for most ETL pipelines is 500 to 2,000 records per batch. Below 500, you make too many database calls. Above 5,000, the memory savings from iterator patterns start to erode, and a failed batch means retrying more records.

One factor people overlook: batch size affects error recovery. If a batch of 1,000 records fails on insert, you need to figure out which record caused the failure. With a batch of 100, the investigation scope is ten times smaller. In production, this matters more than the performance difference.

Chaining Transformations with Iterator Patterns

What happens when you chain multiple operations? Map, then filter, then transform again?

Chained Iterator Operations
$stream
    ->map(fn($row) => normalize($row))
    ->filter(fn($row) => $row['active'])
    ->map(fn($row) => enrich($row));

Each operation returns a NEW iterator that wraps the previous one. No data is processed yet. No memory is used yet. This is lazy evaluation — the chain is a blueprint, not an execution.

When you finally iterate, here is what happens for ONE record:

Step 1: Outer iterator asks inner for next value
Step 2: Inner asks its inner for next value
Step 3: Innermost fetches row from source
Step 4: Row bubbles up through each transformation
Step 5: Final value emerges
→ Memory used: ONE record (plus transformation overhead)

Ten transformations chained together still only hold ONE record in memory. The transformations are not storing intermediate results. Each is a function that transforms and passes through. This is the power of iterator patterns combined with lazy evaluation — you build complex processing pipelines without multiplying memory usage.
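The fluent $stream API above is one way to build such a chain; plain generators give the same laziness with no framework at all. In the sketch below (mapIter, filterIter, and the sample rows are invented for this demo), the pipeline is built first with no data flowing, and then each record is pulled through the whole chain one at a time.

```php
<?php
// Sketch: a generator-based version of the chained operations.
// Each wrapper is lazy: nothing runs until the final foreach pulls.

function mapIter(iterable $source, callable $fn): Generator {
    foreach ($source as $value) {
        yield $fn($value); // transform and pass through, store nothing
    }
}

function filterIter(iterable $source, callable $keep): Generator {
    foreach ($source as $value) {
        if ($keep($value)) {
            yield $value;
        }
    }
}

function rows(): Generator {
    yield ['name' => ' Alice ', 'active' => true];
    yield ['name' => 'Bob',     'active' => false];
    yield ['name' => 'Carol',   'active' => true];
}

// Build the pipeline: no data flows yet, no memory is used yet.
$pipeline = mapIter(
    filterIter(
        mapIter(rows(), fn($r) => ['name' => trim($r['name']), 'active' => $r['active']]),
        fn($r) => $r['active']
    ),
    fn($r) => strtoupper($r['name'])
);

$result = [];
foreach ($pipeline as $name) {
    $result[] = $name; // each record traverses the whole chain alone
}
```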

Real-World Performance: What Iterator Patterns Actually Save

The memory savings are dramatic, but iterator patterns also affect processing speed, database load, and system reliability. Here is what I have measured across production pipelines processing real data.

| Metric | Array Approach | Iterator + Batching | Improvement |
| --- | --- | --- | --- |
| Peak memory (100K records) | ~100 MB | ~1 MB | 99% reduction |
| Time to first record processed | 30+ seconds (load all first) | <100ms (immediate) | Instant start |
| Database connections held | 1 long-running query | 1 cursor + batch inserts | Shorter lock time |
| Failure recovery | Restart from beginning | Restart from last batch | Minutes vs hours |
| Concurrent pipelines possible | 1-2 (memory limited) | 10+ (memory efficient) | 5-10x throughput |

The “time to first record” metric is often overlooked. With fetchAll(), your pipeline does nothing until every record is loaded. With iterator patterns, processing begins immediately. For a 100,000-record dataset, that can mean the difference between a pipeline that starts working in under a second and one that sits idle for 30 seconds loading data into memory.

The concurrency improvement is the most impactful in production. When each pipeline uses only 1MB instead of 100MB, you can run dozens of pipelines simultaneously on the same server. This is how teams scale ETL processing without scaling infrastructure.

Common Anti-Patterns in Memory-Efficient Data Processing

Understanding iterator patterns is not enough. You also need to know the mistakes that silently undo their benefits. These anti-patterns appear frequently in production code.

| Anti-Pattern | What Happens | Why It Fails | What to Do Instead |
| --- | --- | --- | --- |
| Collecting into arrays | $all = iterator_to_array($gen) | Loads everything into memory, defeating the entire purpose of the iterator | Process records inside the foreach loop, never convert to array |
| Logging every record | Appending each record to a log array | The log array grows unbounded, consuming the memory you saved | Log summaries per batch, or write to file/database incrementally |
| Accumulating errors | Storing all failed records in an array | If 50% of records fail, you hold 50% of the dataset in memory | Write errors to a file or error table as they occur |
| Multiple passes | Iterating the same generator twice | Generators are consumed on the first pass; in PHP a second traversal throws an Exception | Combine operations into a single pass, or recreate the generator |
| Unbounded caching | Caching lookup results without a size limit | Cache grows with every unique lookup, eventually consuming all memory | Use LRU cache with a fixed maximum size |

The “collecting into arrays” anti-pattern is the most common. A developer uses an iterator to stream data from the source, then immediately converts it to an array for processing. This loads the entire dataset into memory, exactly as if they had used fetchAll(). The iterator becomes decoration instead of architecture.

The “accumulating errors” anti-pattern is the most dangerous because it only appears under stress. When data quality is good, maybe 0.1% of records fail, and the error array stays small. When the source sends bad data and 40% of records fail, suddenly half your dataset is sitting in an error array. The pipeline crashes, and the error handling code is the cause.
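The fix for error accumulation is mechanical: append each failure to a file (or error table) the moment it occurs, so memory stays constant even at high failure rates. A minimal sketch, with a hypothetical validate() rule standing in for your pipeline's real checks:

```php
<?php
// Sketch: stream failures to a file instead of collecting them in
// memory. validate() is a hypothetical stand-in for real checks.

function validate(array $record): bool {
    return $record['email'] !== ''; // invented rule for the demo
}

$errorLog = tempnam(sys_get_temp_dir(), 'errors');
$handle = fopen($errorLog, 'w');

$records = [
    ['id' => 1, 'email' => 'a@example.com'],
    ['id' => 2, 'email' => ''],
    ['id' => 3, 'email' => 'c@example.com'],
];

$processed = 0;
foreach ($records as $record) {
    if (!validate($record)) {
        // One line written per failure: constant memory even when
        // 40% of the dataset is bad.
        fwrite($handle, json_encode($record) . "\n");
        continue;
    }
    $processed++;
}
fclose($handle);

$failedLines = file($errorLog, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
unlink($errorLog);
```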

When Iterator Patterns Break: Operations That Require Memory

Iterator patterns have limits. Some operations fundamentally REQUIRE holding data in memory. Recognizing these early prevents architectural mistakes.

| Operation | Why It Breaks Iterators | Solution |
| --- | --- | --- |
| Sorting | You cannot sort without seeing all values | Push sorting to the database with ORDER BY |
| Deduplication | Must remember all previous values to check uniqueness | Use database DISTINCT or GROUP BY |
| Aggregations | Sum, average, count need to see everything | Use database aggregate functions (SUM, AVG, COUNT) |
| Cross-record joins | Matching records across datasets requires holding one dataset | Use database JOINs or pre-join before streaming |
| Windowed analytics | Moving averages need a window of records | Use bounded windows (fixed size) or database window functions |

For these operations with large datasets, push the work to the database. Databases are designed to sort and aggregate on disk, not in memory. Let them do what they are good at. The general rule: if an operation needs to see more than one record at a time, it either needs bounded memory (like a batch or window) or it belongs in the database layer.
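The bounded-window case is worth seeing concretely. A moving average cannot be computed one record at a time, but it can be computed with memory capped at the window size. A sketch, assuming a movingAverage helper invented for this demo:

```php
<?php
// Sketch: a bounded window turns a "needs more than one record"
// operation into an iterator-friendly one. Memory is capped at the
// window size, not the dataset size.

function movingAverage(iterable $values, int $window): Generator {
    $buffer = [];
    $sum = 0.0;
    foreach ($values as $value) {
        $buffer[] = $value;
        $sum += $value;
        if (count($buffer) > $window) {
            $sum -= array_shift($buffer); // evict the oldest value
        }
        if (count($buffer) === $window) {
            yield $sum / $window;
        }
    }
}

$averages = [];
foreach (movingAverage([1, 2, 3, 4, 5], 3) as $avg) {
    $averages[] = $avg;
}
// Windows [1,2,3], [2,3,4], [3,4,5] produce averages 2, 3, 4
```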

How to Test and Debug Iterator-Based Pipelines

Testing iterator patterns requires a different approach than testing array-based code. You cannot just dump the results and compare. Here are the patterns that work in production.

| Test Type | What to Verify | How to Test |
| --- | --- | --- |
| Memory stability | Memory stays flat regardless of input size | Process 100 records, measure peak memory. Process 10,000 records, compare. Should be nearly identical. |
| Record accuracy | Each record is transformed correctly | Test with 5-10 known records. Compare output to expected values. Use small datasets, not large ones. |
| Batch boundaries | Records at batch boundaries are handled correctly | Test with exactly batch_size records, batch_size + 1, and batch_size - 1. The remainder batch must be flushed. |
| Empty input | Pipeline handles zero records gracefully | Pass an empty iterator. Pipeline should complete without errors and report zero processed. |
| Generator exhaustion | Generator is not accidentally iterated twice | After processing, attempt to iterate again and assert that an Exception is thrown (PHP refuses to traverse a consumed generator). Recreate the generator if a second pass is needed. |

The batch boundary test catches the most common bug: forgetting to flush the final partial batch. If your batch size is 1,000 and you have 2,500 records, the last 500 records must still be inserted. This off-by-one error is easy to miss because it only affects the tail end of the data, and in testing with round numbers it never appears.

Memory Stability Test
// Test: Memory should not grow with dataset size
$memoryBefore = memory_get_peak_usage();

// Process 10,000 records through iterator pipeline
foreach (getRows($source) as $row) {
    process($row);
}

$memoryAfter = memory_get_peak_usage();
$growth = $memoryAfter - $memoryBefore;

// Growth should be bounded (batch size × record size)
// NOT proportional to total records
assert($growth < 2 * 1024 * 1024, "Memory grew beyond 2MB — iterator pattern likely broken");

The Core Principle of Memory-Efficient Data Processing

Memory efficiency in data processing comes from one principle: never hold what you have already processed.

When you use yield, PHP releases the previous value. When you use iterators, each record flows through and disappears. When you batch, you bound the accumulation and flush regularly.

The mental simulation: imagine you ARE the pipeline. Data enters your left hand. You process it. It exits your right hand. Your hands only ever hold what is being processed right now. Nothing piles up behind you.

This is not about PHP specifically. This is how memory-efficient data processing works in any language. Python has generators. Java has streams. Go has channels. The syntax differs but the principle is identical: process and release, process and release. Iterator patterns are the mechanism; constant memory is the result.

Build for 100,000 records even when testing with 100. The cost of using iterator patterns is zero for small datasets. The cost of NOT using them appears suddenly and catastrophically when data grows beyond what memory can hold.

For a deeper understanding of how PHP generators work internally, the PHP Manual on Generators provides the official reference. The Lazy Evaluation concept on Wikipedia explains the broader computer science principle that makes chained iterators memory-efficient.

Key Takeaways

Iterator patterns are the foundation of scalable data processing. They turn memory from a limiting factor into a non-issue, allowing your pipelines to handle any dataset size.

  1. Never load all data at once: Use cursors and generators instead of fetchAll(). Memory should stay constant regardless of dataset size.
  2. Understand yield internally: It pauses execution, returns one value, and releases previous values. This is the mechanism behind constant memory usage.
  3. Combine iterators with batching: Stream one record at a time but insert in batches of 500-2,000 for database efficiency. Memory is bounded by batch size, not dataset size.
  4. Choose batch size deliberately: Balance memory usage, database efficiency, and error recovery. Smaller batches mean easier debugging when things fail.
  5. Watch for anti-patterns: Converting iterators to arrays, accumulating errors in memory, and logging every record all silently undo memory savings.
  6. Push heavy operations to the database: Sorting, deduplication, and aggregation belong in SQL, not in application memory.
  7. Test memory stability: Verify that peak memory with 10,000 records matches peak memory with 100 records. If it does not, something is accumulating.
  8. Design for the worst case: Build with iterator patterns from the start. The cost is zero for small datasets. The savings are critical for large ones.