2017-09-19 16:35:00 +0000
Python 3, tutorial
This post briefly summarizes chapter 14 of Fluent Python by Ramalho. Please check the book for more detailed examples and usage.
Iteration is fundamental to data processing. And when scanning datasets that don’t fit in memory, we need a way to fetch the items lazily, that is, one at a time and on demand. This is what the Iterator pattern is about.
import re
import reprlib
RE_WORD = re.compile(r'\w+')
class Sentence:
    def __init__(self, text):
        self.text = text
        self.words = RE_WORD.findall(text)
    def __getitem__(self, index):  # for loop is available
        return self.words[index]
    def __len__(self):  # it completes the sequence protocol
        return len(self.words)
    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)
>>> s = Sentence('Hello there! Mighty fine morning! if you ask me, I am Waldo.')
>>> s
Sentence('Hello there!..., I am Waldo.')
>>> for word in s: print(word)
Hello
there
Mighty
...
Waldo
>>> list(s) # list takes any iterable to build a list instance
['Hello', 'there', 'Mighty', ..., 'Waldo']
How is the above for loop possible? Whenever the interpreter needs to iterate over an object x, it automatically calls iter(x).
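The iter built-in function looks for one of two methods on the object, namely __iter__ and __getitem__. It checks for __iter__ first and, if the object implements it, calls it to obtain an iterator. If __iter__ is missing, iter tries to make an iterator via __getitem__, fetching items in order starting from index 0. If that fails too, iter raises TypeError.
Based on these observations, we can describe how the Python iterable quacks: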
Iterable: Any object from which the iter built-in function can obtain an iterator. Objects implementing an __iter__ method returning an iterator are iterable. Sequences are always iterable, as are objects implementing a __getitem__ method which takes 0-based indexes.
Note also that goose typing is supported through collections.abc.Iterable, which strictly requires an __iter__ method; it does not consider __getitem__.
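The lookup order and the stricter ABC check can be tried out with a small sketch. The classes `Spam`, `Eggs`, and `Plain` are made up for illustration and do not come from the book.

```python
from collections import abc


class Spam:
    """Hypothetical class: iterable only through __getitem__."""

    def __getitem__(self, index):
        # IndexError past the end tells the fallback iterator to stop
        return 'abc'[index]


class Eggs:
    """Hypothetical class: iterable through __iter__."""

    def __iter__(self):
        return iter('abc')


class Plain:
    """Hypothetical class with neither method."""


spam_items = list(Spam())   # iter() falls back to __getitem__
eggs_items = list(Eggs())   # iter() calls __iter__

try:
    iter(Plain())
    plain_is_iterable = True
except TypeError:
    plain_is_iterable = False

# The ABC check only looks for __iter__
spam_passes_abc = isinstance(Spam(), abc.Iterable)
eggs_passes_abc = isinstance(Eggs(), abc.Iterable)
```

Here `Spam` works in a for loop yet fails the `isinstance` check, which is why calling `iter(x)` and catching TypeError is the more accurate test for iterability.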
>>> s = 'ABC'
>>> for c in s:
... print(c)
The above code is equivalent to the following.
>>> s = 'ABC'
>>> it = iter(s)
>>> while True:
...     try:
...         print(next(it))
...     except StopIteration:  # signals that the iterator is exhausted
...         del it
...         break
Iterator: Any object that implements the __next__ no-argument method which returns the next item in a series or raises StopIteration when there are no more items. Python iterators also implement the __iter__ method so they are iterable as well.
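As a rough illustration of this definition, here is a hand-written iterator over the words of a text. The class name `WordIterator` is hypothetical, and the sketch assumes the same `\w+` pattern used earlier in this post.

```python
import re

RE_WORD = re.compile(r'\w+')


class WordIterator:
    """Hypothetical hand-written iterator over the words of a text."""

    def __init__(self, text):
        self.words = RE_WORD.findall(text)
        self.index = 0

    def __next__(self):
        try:
            word = self.words[self.index]
        except IndexError:
            # no more items: signal exhaustion
            raise StopIteration() from None
        self.index += 1
        return word

    def __iter__(self):
        # an iterator returns itself, so it is iterable too
        return self


it = WordIterator('Hello there Waldo')
first = next(it)
rest = list(it)     # consumes the remaining words
again = list(it)    # the iterator is already exhausted
```

Once exhausted, the iterator stays empty; getting the words again requires building a new `WordIterator`.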
Any Python function that has the yield keyword in its body is a generator function: a function which, when called, returns a generator object. In other words, a generator function is a generator factory.
Generator functions in Python make implementing an iterator much easier.
def fib():  # invoking it does not result in the infinite loop
    a, b = 0, 1
    while 1:
        yield b
        a, b = b, a+b
>>> fibgen = fib() # it just gives us a generator
>>> fibgen  # can be used in for loops
<generator object fib at 0x7efec9e86620>
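Since the generator never ends, one way to consume it safely is to take a bounded slice. The sketch below repeats `fib` so it runs on its own and uses `itertools.islice`, which is one option among several.

```python
from itertools import islice


def fib():
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b


fibgen = fib()
# islice pulls only the first five values from the infinite generator
first_five = list(islice(fibgen, 5))
# the generator resumes where it stopped
next_value = next(fibgen)
```

The body of `fib` runs only as far as the next `yield` each time a value is requested, so the infinite loop never blocks the caller.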
We can make Sentence truly lazy via re.finditer, a lazy version of re.findall.
class Sentence:
    def __init__(self, text):
        self.text = text
    def __iter__(self):
        for match in RE_WORD.finditer(self.text):
            yield match.group()
Sentence also gets shorter with a generator expression. A generator expression can be understood as a lazy version of a list comprehension.
class Sentence:
    def __init__(self, text):  # same as before
        self.text = text
    def __iter__(self):
        return (match.group() for match in RE_WORD.finditer(self.text))
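To see the difference between a list comprehension and a generator expression, the following sketch records when a function is actually called. The helper `traced` and the list `calls` are made up for this illustration.

```python
calls = []


def traced(x):
    # hypothetical helper that records every call it receives
    calls.append(x)
    return x * 2


eager = [traced(x) for x in range(3)]   # runs all three calls immediately
calls_after_listcomp = list(calls)

calls.clear()
lazy = (traced(x) for x in range(3))    # nothing runs yet
calls_after_genexp = list(calls)

first = next(lazy)                      # runs exactly one step
calls_after_next = list(calls)
```

The list comprehension does all the work up front, while the generator expression does it one item at a time as values are requested.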
The chain implementation below shows what yield from, added in Python 3.3, does for generators.
def chain(*iters):
    for i in iters:
        yield from i
>>> list(chain('ABC', range(3)))
['A', 'B', 'C', 0, 1, 2]
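For plain iteration, `yield from` behaves like an inner for loop that yields each item. The sketch below compares both forms; `chain_with_loops` is a hypothetical name used only here.

```python
def chain_with_loops(*iterables):
    # hypothetical variant: same output as chain, without yield from
    for it in iterables:
        for item in it:
            yield item


def chain(*iterables):
    for it in iterables:
        yield from it


looped = list(chain_with_loops('ABC', range(3)))
delegated = list(chain('ABC', range(3)))
same = looped == delegated
```

Beyond replacing the inner loop, `yield from` also forwards `send` and `throw` calls to the subgenerator and captures its return value, which the explicit loop does not do.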
See the docs.