01 — Generators & yield
A list holds data. We can iterate through a list with a for loop.
fruits = ["manzana", "plátano", "fresa"]

for fruit in fruits:
    print(fruit)

# manzana
# plátano
# fresa
When Python uses a list as above, it has to hold every value in memory at once. For small lists that's fine. But as the list grows this becomes impractical, and you can wind up with an out-of-memory (OOM) error.
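To see the difference concretely, here's a sketch comparing the memory footprint of a list against an equivalent generator (exact byte counts vary by Python version and platform):

```python
import sys

# A million squares, materialized vs. generated on demand.
as_list = [n * n for n in range(1_000_000)]   # every value lives in memory
as_gen = (n * n for n in range(1_000_000))    # just a tiny generator object

print(sys.getsizeof(as_list))  # megabytes, for the list's pointer array alone
print(sys.getsizeof(as_gen))   # a couple hundred bytes, regardless of length
```

Note that sys.getsizeof on the list measures only its pointer array, not the ints themselves; the real footprint is larger still.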
A generator is a function that uses yield to hand values back one by one. It produces a value, pauses, and resumes when the caller asks for the next one. The whole sequence never has to exist in memory at once, just one value at a time.
Imagine this: frutas.txt is a file with 5 million fruit names, one per line. We could load the whole file at once, but that would be rude, as it would take up a whole bunch of space. Or we can iterate through the file one line at a time and do whatever we need to do with each line of fruit.
# frutas.txt
manzana
plátano
fresa
mango
sandía
... 4,999,995 more lines ...
def get_frutas():
    # reads every line into memory at once
    return open("frutas.txt").readlines()

for fruta in get_frutas():
    print(fruta)
def get_frutas():
    # hands back one line at a time
    for fruta in open("frutas.txt"):
        yield fruta.strip()

for fruta in get_frutas():
    print(fruta)
yield turns a function into a generator. The calling code looks identical; a for loop works on both. The difference is entirely inside the function: one builds everything upfront, the other produces values on demand. This is why generators are called lazy: they do the minimum amount of work necessary, only when asked.
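You can watch this laziness directly. A minimal sketch, using a log list to record when the body actually runs:

```python
log = []

def lazy():
    log.append("doing work")
    yield "done"

g = lazy()        # creates the generator; the body has NOT run yet
print(log)        # [] (still empty)
value = next(g)   # NOW the body runs, up to the first yield
print(log)        # ['doing work']
print(value)      # done
```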
Remember, return sends a value back from a function to whatever called it, and the function is done. So what's the difference between return and yield, then?
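A minimal side-by-side sketch of the difference:

```python
def with_return():
    return 1
    return 2  # never reached: after return, the function is done

def with_yield():
    yield 1
    yield 2   # reached when the caller asks again

print(with_return())       # 1
print(list(with_yield()))  # [1, 2]
```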
Technically, a generator function does return an object when called. This special type of object is called a generator, and you can step through it one value at a time with the built-in next() function. You can see this yourself by opening a Python shell and running the following:
>>> def func():
...     for x in range(100):
...         yield x
...
>>> fun = func()
>>> next(fun)
0
>>> next(fun)
1
When you write for x in func(), Python is calling next() behind the scenes on every iteration.
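When the generator runs out, next() raises StopIteration, which is exactly the signal a for loop uses to know when to stop. A quick sketch:

```python
def countdown():
    yield 3
    yield 2
    yield 1

gen = countdown()
print(next(gen))  # 3
print(next(gen))  # 2
print(next(gen))  # 1
try:
    next(gen)     # nothing left to yield
except StopIteration:
    print("done") # a for loop catches this for you
```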
As you know, yield has two meanings in English: 1) to produce a result, and 2) to give way. yield in Python does both. It produces whatever value you ask it for, and then gives control back to the calling code, waiting patiently until it's asked for the next one.
02 — async / await / gather
When a function talks to something else, whether it's fetching a URL, waiting for a database, or running an operation on another machine, it spends most of its time sitting idle. Shooting off a task and waiting for the response is called synchronous execution. Shooting off a task, not waiting for the response, and working on other shit in the meantime is called asynchronous execution.
Async functions (marked async def) can be awaited. When they hit an await, they pause and let other tasks run. asyncio.gather() launches multiple coroutines at the same time and waits for all of them to finish.
A coroutine is a function that can pause itself mid-execution and hand control back to the caller, then be resumed later from where it left off. Normal functions run sequentially — one must finish before the next begins. Coroutines run concurrently — multiple can be in-progress at the same time, each pausing and resuming in turn. They're called coroutines because they cooperate with each other to make progress together rather than waiting in line.
async def main():
    # each await blocks the next text until the reply comes back
    a = await text_trump()
    b = await text_kurt()
    c = await text_christian()
    return [a, b, c]
import asyncio

async def main():
    # all three texts go out at once; replies come back together
    a, b, c = await asyncio.gather(
        text_trump(),
        text_kurt(),
        text_christian(),
    )
    return [a, b, c]
Concurrency is not the same as parallelism. asyncio.gather() doesn't run on multiple CPU cores, it interleaves tasks on a single thread, letting each one run while the others wait. We only have one brain, but we can use it at different times to text different people while we wait for a response from others.
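Here's a sketch that makes the overlap measurable. Three fake texts that each take 0.1 seconds finish in about 0.1 seconds total, not 0.3 (the names and delays are made up for illustration):

```python
import asyncio
import time

async def text(person):
    await asyncio.sleep(0.1)  # stand-in for waiting on a reply
    return f"reply from {person}"

async def main():
    start = time.perf_counter()
    replies = await asyncio.gather(
        text("kurt"), text("christian"), text("trump")
    )
    elapsed = time.perf_counter() - start
    print(replies)
    print(f"{elapsed:.2f}s")  # ~0.10s, not 0.30s: the waits overlapped

asyncio.run(main())
```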
03 — Async Generators
An async generator combines both ideas. It is an async def function that uses yield. The caller iterates it with async for, and each iteration can itself await something.
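In its smallest form (a sketch, with a fake delay standing in for real I/O):

```python
import asyncio

async def numbers():
    for n in range(3):
        await asyncio.sleep(0.01)  # pretend to wait on I/O
        yield n                    # hand each value back as it's ready

async def main():
    async for n in numbers():      # async for drives the async generator
        print(n)

asyncio.run(main())
```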
Imagine you're texting a group of friends one by one, waiting for each reply before moving on. You could collect every reply first and respond to all of them at the end, but by then the conversation is stale, your notifications are piling up, and you end up with no friends. You want to respond the moment you realize you have a reply, then move on to the next person.
async def check_in(people):
    all_replies = []
    while people:
        person = people.pop(0)
        reply = await text(person)
        all_replies.append(reply)
    return all_replies

# caller reads all at the end
replies = await check_in(friends)
for reply in replies:
    respond(reply)
async def check_in(people):
    while people:
        person = people.pop(0)
        reply = await text(person)
        yield reply

# caller responds as each reply arrives
async for reply in check_in(friends):
    respond(reply)
The bad version awaits each text, collects everything into a list, and hands it all back at once, so it's shit. Even gathering the requests concurrently wouldn't save it: if you need to do something with each response, you still have to wait for everyone before you can act. An async generator lets you act the moment each result is ready, then keep going. yield the instant you have something; don't make the caller wait for the whole pile.
04 — Twisted & DeferredQueue
Scrapy's core runs on Twisted, an older async framework that predates Python's built-in async support. Modern versions of Scrapy can bridge to asyncio as well, but Twisted is still the engine underneath. The concepts are the same, but the vocabulary is different.
In Twisted, a Deferred is the equivalent of an awaitable: an object representing a value that isn't ready yet. You use @inlineCallbacks to write Twisted async code that looks like normal Python, with yield in place of await.
A DeferredQueue is Twisted's producer/consumer queue. A spider puts requests in; workers pull them out. Each queue.get() returns a Deferred that fires when an item is available.
from twisted.internet.defer import (
    DeferredQueue, inlineCallbacks, gatherResults
)

@inlineCallbacks
def worker(queue, results):
    request = yield queue.get()    # wait for an item
    result = yield fetch(request)  # wait for the fetch
    results.append(result)

@inlineCallbacks
def run_engine(spider):
    queue = DeferredQueue()
    for req in spider.start_requests():
        queue.put(req)
    results = []
    # spawn all workers at once: Twisted's gather()
    workers = [worker(queue, results) for _ in range(5)]
    yield gatherResults(workers)
    return results
Every Scrapy spider ever written has run on top of this. start_requests() populates a DeferredQueue. Scrapy's downloader is a pool of workers draining that queue concurrently. gatherResults() is how it waits for all of them: the same idea as asyncio.gather(), different framework.
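To connect the vocabulary, here's the same producer/consumer shape in asyncio terms. This is a sketch, not Scrapy's actual code: fetch is replaced by a trivial stand-in, and the None sentinel is one common way to tell workers there's no more work.

```python
import asyncio

async def worker(queue, results):
    while True:
        request = await queue.get()  # wait for an item, like queue.get() in Twisted
        if request is None:          # sentinel: no more work
            break
        results.append(request * 2)  # stand-in for fetch(request)

async def run_engine(requests, n_workers=3):
    queue = asyncio.Queue()
    for req in requests:
        queue.put_nowait(req)
    for _ in range(n_workers):
        queue.put_nowait(None)       # one sentinel per worker
    results = []
    # spawn all workers at once: asyncio's gatherResults()
    await asyncio.gather(*(worker(queue, results) for _ in range(n_workers)))
    return results

print(asyncio.run(run_engine([1, 2, 3, 4])))
```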
Twisted can do more complicated shit, but it's not as important to know. Use the hint to get out of this one.