An iterator enables element-by-element processing of a collection. Before we look at iterators which operate over collection objects, let us examine more fundamental iteration statements.
The C for loop - under the hood
The for loop is a basic building block in languages which derive their syntax from C.
char i;
for (i = 'a'; i < 'z'; i++) {
printf("%c\n", i);
}
Looking at the assembly generated for the loop (using Visual Studio/Windows running on an Intel processor):
;...
; char i;
; for (i = 'a'; i < 'z'; i++) {
00A81A2E mov byte ptr [i],61h
00A81A32 jmp main+2Ch (0A81A3Ch)
;A. Start of loop
00A81A34 mov al,byte ptr [i]
00A81A37 add al,1
00A81A39 mov byte ptr [i],al
00A81A3C movsx eax,byte ptr [i]
;B. Compare and jump to the next iteration or out of the loop
00A81A40 cmp eax,7Ah
00A81A43 jge main+53h (0A81A63h)
;... printf("%c\n", i); is implemented here
; }
;C. Jump back to start of loop
00A81A61 jmp main+24h (0A81A34h)
;}
0A81A63 xor eax,eax
;...
The compiled code is straightforward. Essentially a pair of jumps (cmp/jge and jmp marked as B. and C. in comments) are used to move to the next loop iteration and eventually terminate the loop.An iterator in Scheme
An iterator is a mechanism which:1. Operates on a collection of objects (usually all of the same type, but not necessarily).
2. Provides a way to apply an arbitrary procedure to the current object.
3. Provides a way to move to the next object.
4. Provides a uniform interface for the collection, independent of the type of the objects in the collection.
It is straightforward to implement a simple list iterator in Scheme:
(define (for-each function collection)
(cond ((null? collection) #t)
(else (let ((rest (cdr collection)))
(function (car collection))
(for-each function rest)))))
;Use for-each to print a list of symbols
(for-each print (list 'hello 'world '!))
;Use for-each to square a list of numbers
(for-each
(lambda (x)
(print (* x x)))
(list 1 2 3))
Our for-each takes an arbitrary function of 1 argument and a collection. It applies the function to each item in the collection till there are no elements left.We used a recursive function to implement foreach. This looks like an inefficient way to do things, since we would be using up stack frames and if the collection was big enough cause a stack overflow.
However the for-each has been written using a special case of recursion. Since the recursion is in the very last statement, we have what is called tail recursion. With tail recursion, the Scheme compiler can avoid pushing the recursive calls onto the stack. Instead it can optimize the generated code into something similar to the jump mechanism we saw with the generated assembly statement in the C for loop. Whether this optimization leads to better performance in practice or not is another issue which we will not bother about in this post.*
Iterators in C#
That leads us to iterators in C#. One common mechanism consists of a pair of complementary constructs:IEnumerable interface: Has one method GetEnumerator().
foreach statement: operates on an IEnumerable and allows us to iterate over its objects.
using (EnumeratedFile file = new EnumeratedFile( "c:\\temp\\test.txt" )
foreach ( var line in file ) {
Console.WriteLine( line );
}
}
Here EnumeratedFile implements IEnumerable. foreach applies whatever procedure we provide on EnumeratedFile (here just a Console.WriteLine).But for this to work, there has to be an enumerator object which IEnumerable.GetEnumerator() provides, which actually iterates over the file. C# provides another interface IEnumerator for this purpose. The enumerator which implements IEnumerator does the real work of giving access to the current object and moving to the next.
IEnumerator has 3 methods whose intentions are obvious from their names: Current (actually a property), MoveNext() and Reset().
Note that the same object can implement both the IEnumerable and IEnumerator interfaces. For example, here is an implementation of EnumeratedFile:
class EnumeratedFile : IEnumerable, IEnumerator, IDisposable
{
string line;
StreamReader stream;
//Constructor
public EnumeratedFile( string path ) {
line = string.Empty;
if ( File.Exists( path ) ) {
stream = new StreamReader( path );
}
else {
stream = null;
}
}
IEnumerator IEnumerable.GetEnumerator() {
return this;
}
public object Current {
get { return line; }
}
public bool MoveNext() {
if ( stream == null ) return false;
return (line = stream.ReadLine()) != null ? true : false;
}
public void Reset() {
stream.BaseStream.Position = 0;
stream.DiscardBufferedData();
}
//We also implement IDisposable to clean up the StreamReader.
public void Dispose() {
if ( stream != null ) {
stream.Dispose();
}
}
}
So foreach provides a shorthand for the following:
using ( EnumeratedFile file = new EnumeratedFile( "c:\\temp\\test.txt" ) ) {
var line = string.Empty;
while ( file.MoveNext() ) {
line = (string)file.Current;
Console.WriteLine( line );
}
}
foreach can infer IEnumerable interface for array and dynamic types. The C# language spec has details on the behavior of foreach.C# provides a simpler way to implement an iterator. Instead of implementing Current, MoveNext() and Reset(); we could use the yield keyword to simplify things.
Using yield, EnumeratedFile can be re-written as:
class EnumeratedFile : IEnumerable, IDisposable
{
string line;
StreamReader stream;
//Constructor
public EnumeratedFile( string path ) {
line = string.Empty;
if ( File.Exists( path ) ) {
stream = new StreamReader( path );
}
else {
stream = null;
}
}
public IEnumerator GetEnumerator() {
if (stream == null) yield break;
while ( (line = stream.ReadLine()) != null ) {
yield return line;
}
yield break;
}
//We also implement IDisposable to clean up the StreamReader.
public void Dispose() {
if ( stream != null ) {
stream.Dispose();
}
}
}
yield does the heavy lifting of generating current/movenext and maintaining state between iterations. Raymond Chen and Jon Skeet have explained the code which is generated by the compiler. We could also look at the generated code directly using Ildasm disassembler or indirectly using a .NET decompiler. IEnumerable and IEnumerator also have generic counterparts IEnumerable< T > and IEnumerator< T >.What is the big deal?
What is the use of iterator abstraction, if it boils down eventually to a pair of jump statements?Two things:
1. Iteration is a very common operation while automating work with computers: When we model real world objects, create several instances of them and apply procedures repeatedly on all the instances**. Enough times that many languages provide the iterator as a built in construct or make it easy to write the construct ourselves.
2. By making iterators a part of the language, the language designers provider us users a 'conceptual framework'*** to build higher levels of abstraction. For example IEnumerable in C# lays the groundwork for transforming sequences to sequences, which is one of the underlying ideas behind the Linq framework.
* See this post for tail optimization with respect to C#↩
** Iteration is not the only way to apply a procedure on a group of objects. Set based operations are another way. In fact, it is a common programming error for application programmers new to databases to iterate over each object in a set, instead of operating over the sets of data, and encounter performance problems as the data set grows.↩
*** From SICP, on map and similar operations over lists: ...The difference between the two definitions is not that the computer is performing a different process (it isn't) but that we think about the process differently. In effect, map helps establish an abstraction barrier that isolates the implementation of procedures that transform lists from the details of how the elements of the list are extracted and combined. Like the barriers shown in figure 2.1, this abstraction gives us the flexibility to change the low-level details of how sequences are implemented, while preserving the conceptual framework of operations that transform sequences to sequences.↩
Dump of assembler code for function main:
ReplyDelete0x0000000000400500 <+0>: push %rbp
0x0000000000400501 <+1>: mov %rsp,%rbp
0x0000000000400504 <+4>: sub $0x10,%rsp
0x0000000000400508 <+8>: movb $0x61,-0x1(%rbp)
0x000000000040050c <+12>: jmp 0x40052d
0x000000000040050e <+14>: movsbl -0x1(%rbp),%eax
0x0000000000400512 <+18>: mov %eax,%esi
0x0000000000400514 <+20>: mov $0x4005c4,%edi
0x0000000000400519 <+25>: mov $0x0,%eax
0x000000000040051e <+30>: callq 0x4003e0
0x0000000000400523 <+35>: movzbl -0x1(%rbp),%eax
0x0000000000400527 <+39>: add $0x1,%eax
0x000000000040052a <+42>: mov %al,-0x1(%rbp)
0x000000000040052d <+45>: cmpb $0x79,-0x1(%rbp)
0x0000000000400531 <+49>: jle 0x40050e
0x0000000000400533 <+51>: leaveq
0x0000000000400534 <+52>: retq
End of assembler dump.
clang: Dump of assembler code for function main:
0x0000000000400590 <+0>: push %rbp
0x0000000000400591 <+1>: mov %rsp,%rbp
0x0000000000400594 <+4>: sub $0x10,%rsp
0x0000000000400598 <+8>: movl $0x0,-0x4(%rbp)
0x000000000040059f <+15>: movb $0x61,-0x5(%rbp)
0x00000000004005a3 <+19>: movsbl -0x5(%rbp),%eax
0x00000000004005a7 <+23>: cmp $0x7a,%eax
0x00000000004005ac <+28>: jge 0x4005d5
0x00000000004005b2 <+34>: lea 0x400664,%rdi
0x00000000004005ba <+42>: movsbl -0x5(%rbp),%esi
0x00000000004005be <+46>: mov $0x0,%al
0x00000000004005c0 <+48>: callq 0x400490
0x00000000004005c5 <+53>: mov %eax,-0xc(%rbp)
0x00000000004005c8 <+56>: mov -0x5(%rbp),%al
0x00000000004005cb <+59>: add $0x1,%al
0x00000000004005cd <+61>: mov %al,-0x5(%rbp)
0x00000000004005d0 <+64>: jmpq 0x4005a3
0x00000000004005d5 <+69>: mov -0x4(%rbp),%eax
0x00000000004005d8 <+72>: add $0x10,%rsp
0x00000000004005dc <+76>: pop %rbp
0x00000000004005dd <+77>: retq
End of assembler dump.
Great post, How can i subscribe to ur articles?
ReplyDelete