
Python Generators and itertools

Author: Venkata Sudhakar

A Python generator is a function that uses the yield keyword to produce values one at a time, pausing execution between each value. Unlike a regular function, which computes every result before returning, a generator computes each value only when asked and holds just one value in memory at a time. This makes generators ideal for processing large datasets - a generator that reads a 10 GB file yields one line at a time without ever loading the entire file into memory. In data migration, generators are essential for processing millions of rows in batches without running out of RAM.
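A minimal sketch of the idea (the `squares` generator and the `read_lines` helper are illustrative, not part of the migration listing):

```python
def read_lines(path):
    # Lazily yields one stripped line at a time; the whole
    # file is never held in memory at once.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def squares(n):
    # Execution pauses at each yield and resumes on the next request.
    for i in range(n):
        yield i * i

gen = squares(3)
print(next(gen))   # 0
print(next(gen))   # 1
print(list(gen))   # [4] - only the remaining value is produced
```

Note that `gen` is exhausted after the final `list()` call: a generator can be iterated only once.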

The itertools module provides fast, memory-efficient tools for working with iterators. itertools.islice() lazily slices any iterator - for example, taking just its first N elements. itertools.chain() joins multiple iterables into a single stream. itertools.groupby() groups consecutive elements by a key, so the input must already be sorted or grouped by that key. itertools.takewhile() and itertools.dropwhile() take or skip elements while a condition holds. These tools compose cleanly with generators to build efficient data processing pipelines that handle data larger than memory.
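A quick tour of those five tools (the sample data is made up for illustration):

```python
from itertools import islice, chain, groupby, takewhile, dropwhile

nums = iter(range(10))
assert list(islice(nums, 3)) == [0, 1, 2]            # first 3 elements

assert list(chain([1, 2], (3, 4))) == [1, 2, 3, 4]   # one flat stream

# groupby groups *consecutive* elements, so sort by the key first.
words = sorted(["apple", "bee", "ant"], key=lambda w: w[0])
groups = {k: list(g) for k, g in groupby(words, key=lambda w: w[0])}
assert groups == {"a": ["apple", "ant"], "b": ["bee"]}

vals = [1, 4, 6, 4, 1]
assert list(takewhile(lambda x: x < 5, vals)) == [1, 4]       # stop at 6
assert list(dropwhile(lambda x: x < 5, vals)) == [6, 4, 1]    # skip until 6
```

Because every tool returns a lazy iterator, these calls can be nested around a generator without ever materialising the full dataset.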

The example below shows generators and itertools applied to data migration scenarios: streaming rows from a database, processing them in batches, and building a complete extract-transform-load pipeline that never loads all the data into memory at once.
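A minimal sketch of such a batching pipeline, assuming a simulated row source (`stream_rows` stands in for a real database cursor) and illustrative batch and table sizes:

```python
from itertools import islice

def stream_rows(total):
    # Simulated database cursor: yields one row dict at a time.
    for i in range(total):
        yield {"id": i, "amount": float(i)}

def batched(iterable, size):
    # Group any iterator into lists of at most `size` items.
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def migrate(total_rows=5000, batch_size=500):
    migrated = 0
    for batch in batched(stream_rows(total_rows), batch_size):
        # A real pipeline would execute a bulk INSERT here.
        print(f"Inserting {len(batch)} rows | first_id={batch[0]['id']}")
        migrated += len(batch)
    print(f"Done. Total migrated: {migrated}")

migrate()
```

Only one batch of 500 rows exists in memory at any moment, no matter how large the source table is.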


It gives the following output:

Inserting 500 rows | first_id=0
Inserting 500 rows | first_id=500
Inserting 500 rows | first_id=1000
Inserting 500 rows | first_id=1500
Inserting 500 rows | first_id=2000
Inserting 500 rows | first_id=2500
Inserting 500 rows | first_id=3000
Inserting 500 rows | first_id=3500
Inserting 500 rows | first_id=4000
Inserting 500 rows | first_id=4500
Done. Total migrated: 5000
# Peak memory: ~500 rows at any time regardless of table size
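The remaining scenarios can be sketched in the same spirit; the table names, change records, lag readings, and order amounts below are illustrative sample data:

```python
from itertools import chain, groupby, takewhile

# chain: flatten several per-table row streams into one.
def table_rows(name, n):
    for i in range(n):
        yield f"{name}:{i}"

for row in chain(table_rows("customers", 3), table_rows("orders", 2)):
    print(row)
print()

# groupby: summarise change records by operation.
# The input must already be grouped (or sorted) by the key.
changes = [
    {"op": "DELETE", "id": 998},
    {"op": "INSERT", "id": 1001},
    {"op": "INSERT", "id": 1002},
    {"op": "UPDATE", "id": 999},
]
for op, group in groupby(changes, key=lambda c: c["op"]):
    ids = [c["id"] for c in group]
    print(f"{op}: {len(ids)} rows - IDs: {ids}")
print()

# takewhile: keep replication-lag readings until the first spike over 100 ms.
readings = [0, 2, 5, 45, 250, 3, 7]
stable = list(takewhile(lambda r: r < 100, readings))
print(f"Stable readings before lag spike: {stable}")

# sum() consumes a generator expression directly - no intermediate list.
orders = [{"amount": 2000.00}, {"amount": 3004.99}]
total = sum(o["amount"] for o in orders)
print(f"Total amount: ${total:.2f}")
```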

It gives the following output:

customers:0
customers:1
customers:2
orders:0
orders:1

DELETE: 1 rows - IDs: [998]
INSERT: 2 rows - IDs: [1001, 1002]
UPDATE: 1 rows - IDs: [999]

Stable readings before lag spike: [0, 2, 5, 45]
Total amount: $5004.99

Generator expressions vs list comprehensions:

Write [x*2 for x in data] when you need random access or will iterate multiple times. Write (x*2 for x in data) when you will iterate once and the dataset is large. The built-ins sum(), min(), max(), any(), all(), and sorted() all accept generators directly. sum(row["amount"] for row in million_rows) sums without building a list. sorted() is the one exception - it must materialise the generator to sort, so use it only when the dataset fits in memory.
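A compact illustration of the single-pass behaviour:

```python
data = range(5)

# List comprehension: materialises everything; reusable and indexable.
doubled = [x * 2 for x in data]
assert doubled[3] == 6 and len(doubled) == 5

# Generator expression: lazy and single-use.
lazy = (x * 2 for x in data)
assert sum(lazy) == 20
assert sum(lazy) == 0   # already exhausted - a generator iterates only once

# sorted() must materialise its input, so the data must fit in memory.
assert sorted(x * 2 for x in [3, 1, 2]) == [2, 4, 6]
```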


 
  


  