Mailing list debates considered harmful

2010/05/29 § 24 Comments

Recently, one particular thread at python-dev has piqued my interest. The thread discusses the possible pronouncement of PEP 3148 (re. the addition of futures to stdlib), and has been going on for over a hundred messages with no end in site. I’d like to use some examples from that thread in this post, but it could have been another thread from another list. I do not want to discuss the contents of the PEP itself or even the contents of the discussion – if I thought I had interesting opinions about these matters, I’d say them in public on the mailing list. I am interested in talking about the form of the discussion, which I think models many similar discussions in other threads and other mailing lists. To make a long story short, I feel the broad OSS community is largely under equipped for massive online debate.

We need a better way to debate in large, decentralized and open groups, because this is a lot of what we do, probably more than actually coding. Mailing list, newsgroups and bug-tracker comments are a good enough place to debate things which are one or more of (a) under a relative consensus, or (b) not too interesting to many of the listening audience, or (c) too arcane to actually be debated by many of the listening audience. The problem lies with the “hot” topics, about which it seems everyone has something to say and which are arguably the most important topics to debate effectively (to be able to do things, to conserve time, etc). The nature of these large, decentralized and open debates is that the participants vary a lot from one another. Some are well known members of the community and some passersby, some of them are better informed and some don’t support your opinion, some of them easier to take offence and some easier to deal it, some of them care a lot about the subject under discussion and some have nothing good on TV, some read and reply to new posts in near-real-time and some are married. In these kinds of debates you see the tools we currently use in all their crippled anti-glory; they’re just not meant to handle these things.

Because our tools don’t rise to the challenge, once a discussion goes beyond a few posts it’s hard to keep track of who is arguing for and who is arguing against (indeed in many cases people may do both). It’s hard to gauge whether the silent majority generally agrees with any particular argument (‘+1’ and ‘-1’ are a hack to alleviate this, not a terribly good one). Gauging the correct weight to attribute to what people are saying (and people vary a lot, as I said) is difficult and relies mostly on one being familiar or Googling the speakers. It’s even hard to simply keep the discussion on the same track: more often than not, the discussion creeps onto completely unrelated issues (maybe important ones), and ‘steals’ the focus from the original, probably also-important, subject. In short, we’re arguing in a mess and I’m not sure we apply our hackerish instincts of science, openness and rule-of-the-competent to how we make decisions.

I’m by no means an expert on these matters, so I Googled a bit (wasn’t sure which keywords to use) and looked at Wikipedia, the source of all knowledge which covers everything and is never wrong or inaccurate(™). It seems to me like the challenge of harnessing collective intelligence is well known, but it doesn’t seem like any of the vast array of ‘collaboration software‘ solutions out there tackles this particular issue in manner that befits a group of hackers working on an OSS project (what’s the name for a group of hackers? a pod of hackers? a pack of hackers? a pride of hackers? but I digress). IBM has a natural-language-processing 100,000 ton behemoth which can crunch the collective opinions of hundreds of thousands of people, but in typical IBM fashion it seems to also require a few millions dollars and the sacrifice of a virgin PhD (abundant resource) to operate. If indeed I didn’t miss an existing solutions, it seems obvious that a solution is needed. Maybe I missed a good solution, but the fact of the matter is that hackers around the globe still debate in mailing lists, groups, wiki-discussion-pages, tracker issues, forums, IRC channels, etc, all of which seem to me like the wrong tool for the job.

I think we need a rather simple solution, and luckily I think we’re pretty much the right crowd to implement it. It appears to me that even a naive analysis of the problem is sufficient to implement a ‘good enough’ solution, or at least a ‘better than now’ one. Consider this: a single subject (introduce futures into stdlib) can be debated from many different angles: Is the terminology chosen adequate, or should futures be called something else in Python? Should this be a part of stdlib or a part of PyPI? Should futures be implemented as one object or two separate instances, one on the emitting side and one on the consuming side? Each one of these angles can be argued for or against: futures is a good name because it’s a well-established name in computer science, futures is a bad name because Python already has __future__, and so on, and so forth.

I’m thinking of a tool a bit like Rietveld, only not intended to discuss the very specific implementation aspects of patches, but rather to discuss opinions. I think my naive analysis of ‘a debate’ reveals three abstractions: “what’s under discussion” (PEP 3148), “why don’t we just accept whatever it is that’s under discussion as-is” (because futures is the wrong terminology, etc) and “who has an opinion about any of the ‘why’s” (I do, I think ‘deferreds’ is a better idea, etc). I think these three abstractions cover most of what we need to implement a ‘debating tool’, if coupled with some development of the idea that people’s opinion carry different weights (the ‘who’s differ; Guido carries more weight than I do, and thought should be given to how the implementation represents that). My assessment is that implementing something like this shouldn’t be too much work for web-technology-savvy people, even a small team could probably finish it in one sprint.

I agree that some questions remain about the interaction of this with current tools (this isn’t a mailing list or bug tracker or anything else’s replacement, much like Rietveld doesn’t replace any of them), as well as maintaining ‘law and order’ in the discussion (among the target audience I’m talking about, I think this isn’t such a big deal). Also, there are lots of possible additions to this concept, and maybe some should be added, but I think simplicity is key here. Part of the whole idea is that it won’t turn into incredibly complex ‘jamming’ software. And finally, no, I don’t know how to monetize it other than adwords and no, I don’t mind if you implement it so long as you keep it free as in speech. That’s mostly it, I could post the post right now and be reasonably satisfied with it. Ten minutes later someone would come along, point me at a website that already does it, and then I’d just have to advocate using that website on python-dev (the debate over whether to use the debating-website for future debates is a self-hosting one). But no. I didn’t just write this post and let ‘good enough’ be; I had to expose myself to public humiliation by creating a mockup UI for this imaginary website.

I hope that in this case a picture will be worth more than the thousand words I wrote here (1250+ words after editing, I must be doing something wrong), so here goes (*gulp*, I know it’s not the best mockup UI you’ve seen):

Mockup UI for a technical debating website

debate.org

Thanks for listening, I’ll await your comments.

EDIT: I found this on the interwebs. Not exactly what I was thinking of, and maybe not streamlined enough to really use as a debating tool in online communities, but probably better than a mailing list.

Advertisement

Python’s Innards: pystate

2010/05/26 § 11 Comments

We started our series discussing the basics of Python’s object system (Objects 101 and 102), and it’s time to move on. Though we’re not done with objects by any stretch of the imagination, when I think of Python’s implementation I visualize this big machine with a conveyor belt feeding opcodes into a hulking processing plant with cranes and cooling towers sticking out, and I just have to friggin’ peer inside already. So whether our knowledge of objects is complete or not, we’ll put them aside for a bit and look into the Interpreter State and the Thread State structures (both implemented in ./Python/pystate.c). I may be naïve, but I chose these structures because I’d like this post to be a simple basis for our understanding of actual bytecode evaluation that we’ll begin in the next few posts. Soon enough we’ll pry open Pandora boxes like frames, namespaces and code objects, and before we tackle these concepts I’d like us see the broad-picture of how the data structures that bind them together are laid out. Finally, note that this post is a tad heavier on OS terminology, I assume at least passing familiarity with it (kernel, process, thread, etc).

As you probably know, in many operating systems user-space code is executed by an abstraction called threads that run inside another abstraction called processes (this includes most Unices/Unix-likes and the decent members of the Windows family). The kernel is in charge of setting up and tearing down these processes and execution threads, as well as deciding which thread will run on which logical CPU at any given time. When a process invokes Py_Initialize another abstraction comes into play, and that is the interpreter. Any Python code that runs in a process is tied to an interpreter, you can think of the interpreter as the root of all other concepts we’ll discuss. Python’s code base supports initializing two (or more) completely separate interpreters that share little state with one another. This is rather rarely done (never in the vanilla executable), because too much subtly shared state of the interpreter core and of C extensions exists between these ‘insulated’ interpreters. Still, since support exists for it and for completeness’ sake, I’ll try anyway to write this post from the standpoint of a multi-interpreter world. Anyhow, we said all execution of code occurs in a thread (or threads), and Python’s Virtual Machine is no exception. However, Python’s Virtual Machine itself is something which supports the notion of threading, so Python has its own abstraction to represent Python threads. This abstraction’s implementation is fully reliant on the kernel’s threading mechanisms, so both the kernel and Python are aware of each Python thread and Python threads execute as separate kernel-managed threads, running in parallel with all other threads in the system. Uhm, almost.

There’s an elephant in that particular living room, and that is the Global Interpreter Lock or the GIL. Due to many reasons which we’ll cover briefly today and revisit at length some other time, many aspects of Python’s CPython implementation are not thread safe. This is has some benefits, like simplifying the implementation of easy-to-screw-up pieces of code and guaranteed atomicity of many Python operations, but it also means that a mechanism must be put in place to prevent two (or more) Pythonic threads from executing in parallel, lest they corrupt each other’s data. The GIL is a process-wide lock which must be held by a thread if it wants to do anything Pythonic – effectively limiting all such work to a single thread running on a single logical CPU at a time. Threads in Python multitask cooperatively by relinquishing the GIL voluntarily so other threads can do Pythonic work; this cooperation is built-in to the evaluation loop, so ordinarily authors of Python code and some extensions don’t need to do something special to make cooperation work (from their point of view, they are preempted). Do note that while a thread doesn’t use any of Python’s APIs it can (and many threads do) run in parallel to another Pythonic thread. We will discuss the GIL again briefly later in this post and at length at a later time, but for the time being I can refer the interested readers to this excellent PyCon lecture by David Beazely for additional information about how the GIL works (how the GIL works is not the main subject of the lecture, but the explanation of how it works is very good).

With the concepts of a process (OS abstraction), interpreter(s) (Python abstraction) and threads (an OS abstraction and a Python abstraction) in mind, let’s go inside-out by zooming out from a single opcode outwards to the whole process. This should give us a good overview, since so far we mainly went inwards from the implementation of some object-centric opcodes to the actual implementation of how they operate on objects. Let’s look again at the disassembly of the bytecode generated for the simple statement spam = eggs - 1:

# what's 'diss'? see 'tools' under 'metablogging' above!
>>> diss("spam = eggs - 1")
  1           0 LOAD_NAME                0 (eggs)
              3 LOAD_CONST               0 (1)
              6 BINARY_SUBTRACT
              7 STORE_NAME               1 (spam)
             10 LOAD_CONST               1 (None)
             13 RETURN_VALUE
>>>

In addition to the actual ‘do work’ opcode BINARY_SUBTRACT, we see opcodes like LOAD_NAME (eggs) and STORE_NAME (spam). It seems obvious that evaluating such opcodes requires some storage room: eggs has to be loaded from somewhere, spam has to be stored somewhere. The inner-most data structures in which evaluation occurs are the frame object and the code object, and they point to this storage room. When you’re “running” Python code, you’re actually evaluating frames (recall ceval.c: PyEval_EvalFrameEx). For now we’re happy to lump frame objects and code objects together; in reality they are rather distinct, but we’ll explore that some other time. In this code-structure-oriented post, the main thing we care about is the f_back field of the frame object (though many others exist). In frame n this field points to frame n-1, i.e., the frame that called us (the first frame that was called in any particular thread, the top frame, points to NULL).

This stack of frames is unique to every thread and is anchored to the thread-specific structure ./Include.h/pystate.h: PyThreadState, which includes a pointer to the currently executing frame in that thread (the most recently called frame, the bottom of the stack). PyThreadState is allocated and initialized for every Python thread in a process by _PyThreadState_Prealloc just before new thread creation is actually requested from the underlying OS (see ./Modules/_threadmodule.c: thread_PyThread_start_new_thread and >>> from _thread import start_new_thread). Threads can be created which will not be under the interpreter’s control; these threads won’t have a PyThreadState structure and must never call a Python API. This isn’t so common in a Python application but is more common when Python is embedded into another application. It is possible to ‘Pythonize’ such foreign threads that weren’t originally created by Python code in order to allow them to run Python code (PyThreadState will have to be allocated for them). APIs exist that can do such a migration so long as only one interpreter exists, it is also possible though harder to do it manually in a multi-interpreter environment. I hope to revisit these APIs and their operation in a later post, possibly one about embedding. Finally, a bit like all frames are tied together in a backward-going stack of previous-frame pointers, so are all thread states tied together in a linked list of PyThreadState *next pointers.

The list of thread states is anchored to the interpreter state structure which owns these threads. The interpreter state structure is defined at ./Include.h/pystate.h: PyInterpreterState, and it is created when you call Py_Initialize to initialize the Python VM in a process or Py_NewInterpreter to create a new interpreter state for multi-interpreter processes. Mostly as an exercise to sharpen your understanding, note carefully that Py_NewInterpreter does not return an interpreter state – it returns a (newly created) PyThreadState for the single automatically created thread of the newly created interpreter. There’s no sense in creating a new interpreter state without at least one thread in it, much like there’s no sense in creating a new process with no threads in it. Similarly to the list of threads anchored to its interpreter, so does the interpreter structure have a next field which forms a list by linking the interpreters to one another.

This pretty much sums up our zooming out from the resolution of a single opcode to the whole process: opcodes belong to currently evaluating code objects (currently evaluating is specified as opposed to code objects which are just lying around as data, waiting for the opportunity to be called), which belong to currently evaluating frames, which belong to Pythonic threads, which belong to interpreters. The anchor which holds the root of this structure is the static variable ./Python/pystate.c: interp_head, which points to the first interpreter state (through that all interpreters are reachable, through each of them all thread states are reachable, and so fourth). The mutex head_mutex protects interp_head and the lists it points to so they won’t be corrupt by concurrent modifications from multiple threads (I want it to be clear that this lock is not the GIL, it’s just the mutex for interpreter and thread states). The macros HEAD_LOCK and HEAD_UNLOCK control this lock. interp_head is typically used when one wishes to add/remove interpreters or threads and for special purposes. That’s because accessing an interpreter or a thread through the head variable would get you an interpreter state rather than the interpreter state owning the currently running thread (just in case there’s more than one interpreter state).

A more useful variable similar to interp_head is ./Python/pystate.c: _PyThreadState_Current which points to the currently running thread state (important terms and conditions apply, see soon). This is how code typically accesses the correct interpreter state for itself: first find its your own thread’s thread state, then dereference its interp field to get to your interpreter. There are a couple of functions that let you access this variable (get its current value or swap it with a new one while retaining the old one) and they require that you hold the GIL to be used. This is important, and serves as an example of CPython’s lack of thread safety (a rather simple one, others are hairier). If two threads are running and there was no GIL, to which thread would this variable point? “The thread that holds the GIL” is an easy answer, and indeed, the one that’s used. _PyThreadState_Current is set during Python’s initialization or during a new thread’s creation to the thread state structure that was just created. When a Pythonic thread is bootstrapped and starts running for the very first time it can assume two things: (a) it holds the GIL and (b) it will find a correct value in _PyThreadState_Current. As of that moment the Pythonic thread should not relinquish the GIL and let other threads run without first storing _PyThreadState_Current somewhere, and should immediately re-acquire the GIL and restore _PyThreadState_Current to its old value when it wants to resume running Pythonic code. This behaviour is what keeps _PyThreadState_Current correct for GIL-holding threads and is so common that macros exist to do the save-release/acquire-restore idioms (Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS). There’s much more to say about the GIL and additional APIs to handle it and it’s probably also interesting to contrast it with other Python implementation (Jython and IronPython are thread safe and do run Pythonic threads concurrently). But we’ll leave all that to a later post.

We now have all pieces in place, so here’s a little diagram I quickly jotted1 which shows the relation between the state structures within a single process hosting Python as described so far. We have in this example two interpreters with two threads each, you can see each of these threads points to its own call stack of frames.

A Diagram of Python's State Structures

A Diagram of Python's State Structures

Lovely, isn’t it. Anyway, something we didn’t discuss at all is why these structures are needed. I mean, what’s in them? The reason we didn’t discuss the contents so far and will only briefly mention it now is that I wanted the structure to be clear more than the features. We will discuss the roles of each of these objects as we discuss the feature that relies on that role; for example, interpreter states contain several fields dealing with imported modules of that particular interpreter, so we can talk about that when we talk about importing. That said, I wouldn’t want to leave you completely in the dark, so we’ll briefly add that in addition to managing imports they hold bunch of pointers related to handling Unicode codecs, a field to do with dynamic linking flags and a field to do with TSC usage for profiling (see last bullet here), I didn’t look into it much.

Thread states have more fields but to me they were more easily understood. Not too surprisingly, they have fields that deal with things that relate to the execution flow of a particular thread and are of too broad a scope to fit particular frame. Take for example the fields recursion_depth, overflow and recursion_critical, which are meant to trap and raise a RuntimeError during overly deep recursions before the stack of the underlying platform is exhausted and the whole process crashes. In addition to these fields, this structure accommodates fields related to profiling and tracing, exception handling (exceptions can be thrown across frames), a general purpose per-thread dictionary for extensions to store arbitrary stuff in and counters to do with deciding when a thread ran too much and should voluntarily relinquish the GIL to let other threads run.

I think this pretty much sums up what I have to say about the general layout of a Python process. I hope it was indeed simple enough to follow, I plan on getting into rough waters soon and wanted this post to sink in well with you (and me!). The next two (maybe three) posts will start chasing a real heavyweight, the frame object, which will force us to discuss namespaces and code objects. Engage.


I would like to thank Antoine Pitrou and Nick Coghlan for reviewing this article; any mistakes that slipped through are my own.


1 This is a vicious lie. I worked far harder than I’d like to on this diagram (and desperately need your encouragement in the comments for it ), trying several tools along the way (dia, inkscape, Google Docs drawing and gaphor). If you know how to make diagrams like this one, especially in SVG format, using FOSS tools that run on Ubuntu and don’t suck, please let me know. While I’m ranting, if you knows CSS and WordPress.com and can figure out why the right border of the diagram is chopped, also let me know. Hrmf.

Python’s Innards: Series Renamed

2010/05/22 § Leave a comment

Just yesterday I promised to tell you if I make significant mistakes, hoping it would be a seldom occasion. Yet here I am writing to you about another mistake I made, though this time not one of technical nature. I chose the old name for this series of articles about Python’s inner working as I was musing about how creators may see their creations far more thoroughly than other people possibly can (the old name of the series was “Guido’s Python”). It was recently brought to my attention that the name I chose might have been incorrect, misleading and/or simply tasteless. I certainly did not mean to give a false impression of affiliation or endorsement; of course Python’s creator and community leader never had anything to do with the series. Due to these reasons and to set things straight, the series will be promptly and retroactively renamed to “Python’s Innards”.

Since the series’ old name was used in URLs and category names, things will break as I do the rename and readers will have to update their bookmarks. I’m sorry for the discomfort I’ll cause my readers with the rename, but far worse, I apologize if I misled anyone or tainted someone’s good name. I’ll use this unfortunate opportunity to say explicitly that unless stated otherwise, everything you can find anywhere under the aknin.name domain (so far there’s only this blog) is solely my creation, not endorsed nor affiliated with anyone else (be assured that I’ll give credit where such is due if I’ll receive some kind of help in the future).

Genuinely no malice was intended and I hope this will be forgotten soon enough; I’m working on the next article and look forward to see you as we continue this journey into Python.

Correction for ‘Python’s Innards: Objects 102’

2010/05/20 § 1 Comment

Alas, it has happened, the first mistake in the ‘Python’s Innards’ series has been found. I was trying to answer a question raised by one of my Reddit readers regarding properties, and realized that I have overlooked a fine point about descriptors in my post. Oops.

As was originally (and correctly) written in the post, descriptors are objects whose type has their tp_descr_get and/or tp_descr_set slots set to non-NULL. However, I also wrote, incorrectly, that descriptors take precedence over regular instance attributes (i.e., attributes in the object’s __dict__). This is partly correct but misleading, as it doesn’t distinguish non-data descriptors from data-descriptors. An object is said to be a data descriptor if its type has its tp_descr_set slot implemented (there’s no particularly special term for a non-data descriptor). Only data descriptors override regular object attributes, non-data descriptors do not.

I apologize for the mistake, but alas can’t promise it will never happen again. I’ve created an Errata Policy page which explains how I intend to treat mistakes that will be found from now on. Briefly, it says that the policy is to correct the original page as if the mistake never existed, and if the mistake is significant enough, also write a separate post (like this one) which explains what happened. I plan not to tag errata posts with all the tags of the original post (I’ve done it with this post as an exception), so unless you explicitly register to the RSS feed of the Errata category in my blog, you won’t get further errata via Planet Python or other aggregation services that rely on the ‘python’ or ‘pythons-innards’ tags.

I hope you will keep enjoying and relying on this blog.

Python’s Innards: Objects 102

2010/05/19 § 4 Comments

Welcome to Object 102, the third post in our series of Python internals and a direct continuation to the earlier post, Objects 101 (reading this post without reading 101 totally voids your warranty, so maybe you should head there first if you haven’t yet). In this post we will touch upon a central subject we have tiptoed around so far – attributes. You have surely used attributes in the past if you’ve done any Python programming: an object’s attributes are other objects related to it and accessible by invoking the . (dot) operator, like so: >>> my_object.attribute_name. Let’s begin with a quick overview of observable attribute access behaviour in Python, a behaviour determined by the accessed object’s type (as we’ve demonstrated ad nauseum about the behaviour of all operations on an object).

A type can define one (or more) specially named methods that will customize attribute access to its instances, these special methods are listed here (and we already know they will be wired into the type’s slots using fixup_slot_dispatchers when the type is created… you did read Objects 101, right?). These methods can do pretty much anything they want; whether writing your type in C or pure Python, you could write such methods that store and retrieve the attributes in some insane backend if you were so inclined, you could radio the attributes to the International Space Station and back or even store them in a relational database. However, under less exciting circumstances these methods simply store the attribute as a key/value pair (attribute name/attribute value) in some object-specific dictionary when an attribute is set and retrieve the attribute from that dictionary when an attribute is get (or raise an AttributeError if the dictionary doesn’t have a key matching the requested attribute’s name). This is so straightforward and lovely, thanks for listening, this concludes Objects 102.

Not. My friends, the fecal material is accelerating rapidly towards the spinning wind generation device, let me tell you that much. Let’s get into trouble together by looking closely at how this works in the interpreter, asking annoying questions as we often do. Below is a collapsed example of simple use of attributes, you can either read it in code (the surprising bits highlighted) or proceed to the English description of it in the next paragraph (the surprising bits are emphasized).

>>> print(object.__dict__)
{'__ne__': <slot wrapper '__ne__' of 'object' objects>, ... , '__ge__': <slot wrapper '__ge__' of 'object' objects>}
>>> object.__ne__ is object.__dict__['__ne__']
True
>>> o = object()
>>> o.__class__
<class 'object'>
>>> o.a = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'object' object has no attribute 'a'
>>> o.__dict__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'object' object has no attribute '__dict__'
>>> class C:
...     A = 1
... 
>>> C.__dict__['A']
1
>>> C.A
1
>>> o2 = C()
>>> o2.a = 1
>>> o2.__dict__
{'a': 1}
>>> o2.__dict__['a2'] = 2
>>> o2.a2
2
>>> C.__dict__['A2'] = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'dict_proxy' object does not support item assignment
>>> C.A2 = 2
>>> C.__dict__['A2'] is C.A2
True
>>> type(C.__dict__) is type(o2.__dict__)
False
>>> type(C.__dict__)
<class 'dict_proxy'>
>>> type(o2.__dict__)
<class 'dict'>
>>> 

In English: we can see that object (as in, the most basic built-in type which we’ve discussed before) has a private dictionary, and we see that stuff we access on object as an attribute is identical to what we find in object.__dict__. We might have been somewhat surprised to see that instances of object (o, in the example) don’t support arbitrary attribute assignment and don’t have a __dict__ at all, though they do support some attribute access (try o.__class__, o.__hash__, etc; these do return things). After that we created our own class, C, derived from object and adding an attribute A, and saw that A was accessible via C.A and C.__dict__['A'] just the same, as expected. We then instantiated o2 from C, and demonstrated that as expected, attribute assignment on it indeed mutates its __dict__ and vice versa (i.e., mutations to its __dict__ are exposed as attributes). We were then probably more surprised to learn that even though attribute assignment on the class (C.A2) worked fine, our class’ __dict__ is actually read-only. Finally, we saw that our class’ __dict__ is not of the same type as our object’s __dict__, but rather an unfamiliar beast called dict_proxy. And if all that wasn’t enough, recall the mystery from the end of Objects 101: if plain object instances like o have no __dict__, and C extends object without adding anything significant, why do instances of C like o2 suddenly do have a __dict__?

Curiouser and curiouser indeed, but worry not: all will be revealed in due time. First, we shall look at the implementation of a type’s __dict__. Looking at the definition of PyObjectType (a zesty and highly recommended exercise), we see a slot called tp_dict, ready to accept a pointer to a dictionary. All types must have this slot, and all types have a dictionary placed there when ./Objects/typeobject.c: PyType_Ready is called on them, either when the interpreter is first initialized (remember Py_Initialize? It invokes _Py_ReadyTypes which calls PyType_Ready on all known types) or when the type is created dynamically by the user (type_new calls PyType_Ready on the newborn type before returning). In fact, every name you bind within a class statement will turn up in the newly created type’s __dict__ (see ./Objects/typeobject.c: type_new: type->tp_dict = dict = PyDict_Copy(dict);). Recall that types are objects too, so they have a type, their type is type and it has slots with functions that satisfy attribute access in a manner befitting types. These functions use the dictionary each type has and pointed to by tp_dict to store/retrieve the attributes, that is, getting attributes on a type is directly wired to dictionary assignment for the type instance’s private dictionary pointed to by the type’s structure. Reasonably similar stuff happens when attributes are set or deleted (in the eyes of the interpreter, deletion is just a set operation of NULL value, by the way). So far I hope it’s been rather simple, and explains types’ attribute retrieval.

Before we proceed to explain how instances’ attribute access works, allow me to introduce an unfamiliar (but very important!) term: Descriptor. Descriptors play a special role in instances’ attribute access, and I’d like to provide their bare definition here for a moment to simplify things in the next paragraph. An object is said to be a descriptor if it’s type has one or two slots (tp_descr_get and/or tp_descr_set) filled with non-NULL value. These slots are wired to the special method names __get__, __set__ and __delete__, when the type is defined in pure Python (i.e., if you create a class which has a __get__ method it will be wired to its tp_descr_get slot, and if you instantiate an object from that class, the object is a descriptor). Finally, an object is said to be a data descriptor if its type has a non-NULL tp_descr_set slot (there’s no particularly special term for a non-data descriptor). Descriptors play a special role in attribute access, as we shall soon see, I will provide further explanations and relevant links to documentation soon.

So we’ve defined descriptors, and we know how types’ dictionaries and attribute access work. But most objects aren’t types, that is to say, their type isn’t type, it’s something more mundane like int or dict or a user defined class. All these rely on generic attribute access functions, which are either set on the type explicitly or inherited from the type’s base when the type is created (that’s slot inheritance; this was also covered in Object 101). The generic attribute-getting function (PyObject_GenericGetAttr) and its algorithm is like so: (a) search the accessed instance’s type’s dictionary, and then all the type’s bases’ dictionaries. If a data descriptor was found, invoke it’s tp_desr_get function and return the results. If something else is found, set it aside (we’ll call it X). (b) Now search the object’s dictionary, and if something is found, return it. (c) If nothing was found in the object’s dictionary, inspect X, if one was set aside at all; if X is a non-data descriptor, invoke it’s tp_descr_get function and return the result, and if it’s a plain object it returns it. (d) Finally, if nothing was found, it raise an AttributeError exception.

So we learn that descriptors can execute code when they’re accessed as an attribute (so when you do foo = o.a or o.a = foo, a runs code). A powerful notion, that, and it’s used in several cases to implement some of Python’s more ‘magical’ features. Data-descriptors are even more powerful, as they take precedence over instance attributes (if you have an object o of class C, class C has a foo data-descriptor and o has a foo instance attribute, when you do o.foo the descriptor will take precedence). You can read about what are descriptors and how to implement descriptors. The former link (‘what are’) is particularly recommended: while it might seem daunting at first, if you read it slowly and literally you will see it’s actually simple and more concisely written than my drivel. There’s also a terrific article by Raymond Hettinger which describes descriptors in Python 2.x; other than the disappearance of unbound methods, it’s still very relevant to py3k and is highly recommended. While descriptors are really important and you’re advised to take the time to understand them, for brevity and due to the well written resources I’ve just mentioned I will explain them no further, other than show you how they behave in the interpreter (super simple example!):

>>> class ShoutingInteger(int):
...     # __get__ implements the tp_descr_get slot
...     def __get__(self, instance, owner):
...             print('I was gotten from %s (instance of %s)'
...                   % (instance, owner))
...             return self
... 
>>> class Foo:
...     Shouting42 = ShoutingInteger(42)
... 
>>> foo = Foo()
>>> 100 - foo.Shouting42
I was gotten from <__main__.Foo object at 0xb7583c8c> (instance of <class __main__.'foo'>)
58
# Remember: descriptors are only searched on types!
>>> foo.Silent666 = ShoutingInteger(666)
>>> 100 - foo.Silent666
-566
>>>

Let’s take a moment to note that we have completed our understanding of Python’s Object Oriented inheritance: since attributes are searched in the object’s type, and then in all its bases, then we now understand that accessing attribute A on object O instantiated from class C1 which inherits C2 which inherits C3 can return A either from O, C1, C2 or C3, depending on something called the method resolution order and described rather well here. This way of resolving attributes, when coupled with slot inheritance, is enough to explain most of Python’s inheritance functionality (though, as always, the Devil is in the details).

We know quite a lot now, but we still don’t know where the reference to these sneaky instance dictionaries are stored. We’ve seen the definition of PyObject, and it most definitely didn’t have a pointer to a dictionary, so where is the reference the object’s dictionary stored? The answer is rather interesting. If you look closely at the definition of PyTypeObject (it’s good clean and wholesome! read it every day!), you will see a field called tp_dictoffset. This field provides a byte offset into the C-structure allocated for objects instantiated from this type; at this offset, a pointer to a regular Python dictionary should be found. Under normal circumstances, when creating a new type, the size of the memory region necessary to allocate objects of that type will be calculated, and that size will be larger than the size of vanilla PyObject. The extra room will typically be used (among other things) to store the pointer to the dictionary (all this happens in ./Objects/typeobject.c: type_new, see may_add_dict = base->tp_dictoffset == 0; onwards). Using gdb, we can easily see this extra room and the object’s private dictionary:

>>> class C: pass
... 
>>> o = C()
>>> o.foo = 'bar'
>>> o
<__main__.C object at 0x846b06c>
>>>
# break into GDB, see 'metablogging'->'tools' above
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0012d422 in __kernel_vsyscall ()
(gdb) p ((PyObject *)(0x846b06c))->ob_type->tp_dictoffset
$1 = 16
(gdb) p *((PyObject **)(((char *)0x846b06c)+16))
$3 = {u'foo': u'bar'}
(gdb) 

We have created a new class, instantiated an object from it and set some attribute on the object (o.foo = 'bar'), broke into gdb, dereferenced the object’s type (C) and checked its tp_dictoffset (it was 16), and then checked what’s to be found at the address pointed to by the pointer located at 16 bytes’ offset from the object’s C-structure, and indeed we found there a dictionary object with the key foo pointing to the value bar. Of course, if you check tp_dictoffset on a type which doesn’t have a __dict__, like object, you will find that it is zero. You sure have a warm fuzzy feeling now, don’t you.

What could be misleading here is how type dictionaries and the common instance dictionaries look so similar, while in fact they’re rather differently implemented and allocated. Several riddles remain. Let’s recap and see what we’re missing: I define a class C inheriting object and doing nothing much else in Python, and then I instantiate o from that class, causing the extra memory for the dictionary pointer to be allocated at tp_dictoffset (room is reserved from the get-go, but the dictionary itself is allocated on first use of any kind; smart buggers). I then type in my interpreter o.__dict__, which byte-compiles to the LOAD_ATTR opcode, which causes the PyObject_GetAttr function to be called, which dereferences the type of o and finds the slot tp_getattro, which causes the default attribute searching mechanism described earlier in this post and implemented in PyObject_GenericGetAttr. So when all that happens, what returns my object’s dictionary? I know where the dictionary is stored, but I can see that __dict__ isn’t recursively inside itself, so there’s a chicken and egg problem here; who gives me my dictionary when I access __dict__ if it is not in my dictionary?

Someone who has precedence over the object’s dictionary – a descriptor. Check this out:

>>> class C: pass
... 
>>> o = C()
>>> o.__dict__
{}
>>> C.__dict__['__dict__']
<attribute '__dict__' of 'C' objects>
>>> type(C.__dict__['__dict__'])
<class 'getset_descriptor'>
>>> C.__dict__['__dict__'].__get__(o, C)
{}
>>> C.__dict__['__dict__'].__get__(o, C) is o.__dict__
True
>>> 

Gee. Seems like there’s something called getset_descriptor (it’s in ./Objects/typeobject.c), which are groups of functions implementing the descriptor protocol and meant to be attached to an object placed in type’s __dict__. This descriptor will intercept all attribute access to o.__dict__ on instances of this type, and will return whatever it wants, in our case, a reference to the dictionary found at the tp_dictoffset of o. This is also the explanation of the dict_proxy business we’ve seen earlier. If in tp_dict there’s a pointer to a plain dictionary, what causes it to be returned wrapped in this read only proxy, and why? The __dict__ descriptor of the type’s type type does it.

Resuming that example:

>>> type(C)
<class 'type'>
>>> type(C).__dict__['__dict__']
<attribute '__dict__' of 'type' objects>
>>> type(C).__dict__['__dict__'].__get__(C, type)
<dict_proxy object at 0xb767e494>

This descriptor is a function that wraps the dictionary in a simple object that mimics regular dictionaries’ behaviour but only allows read only access to the dictionary it wraps. And why is it so important to prevent people from messing with a type’s __dict__? Because a type’s namespace might hold them specially named methods, like __sub__. When you create a type with these specially named methods or when you set them on the type as an attribute, the function update_one_slot will patch these methods into one of the type’s slots, as we’ve seen in 101 for the subtraction operation. If you were to add these methods straight into the type’s __dict__, they won’t be wired to any slot, and you’ll have a type that looks like it has a certain behaviour (say, has __sub__ in its dictionary), but doesn’t behave that way.

How sad it is that we’re way over the dreaded attention span deadline of 2,000 words, and I still didn’t talk about __slots__, which I really wanted to. How about you go read about them yourself, cowboy? You have at your disposal everything you need to figure them out alone! Read the aforementioned link, play with __slots__ a bit in the interpreter, maybe look at the code or research them in gdb; have fun. In the next post, I think we’ll leave objects alone for a bit and talk about interpreter state and thread state. I hope it sounds interesting, but even if it doesn’t you really should read it – chicks totally dig people who know about these matters, I can certainly promise you that much.

Switching to mercurial: taming zsh

2010/05/14 § 12 Comments

A quick one, so you’ll know not all my posts must have so many words. PEP 385 is materializing, and its time to learn Mercurial. I can’t say I’m a Mercurial expert, but I thought migrating all my git-oriented-zsh-gizmos would help me along the way. The conversion is almost done and had just one somewhat noteworthy tidbit.

A while back I copied from a friend a rather elaborate zsh prompt (not as elaborate as some people’s…), which includes my current git branch (if any) in it. The code to make it looked like this:

parse_git_branch() {
        git branch --no-color 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/(\1)/'
}

function _rprompt_git {
    local git_branch="$(parse_git_branch)"
    if [ -n "$git_branch" ]; then
        result=":%{$fg[blue]%}$git_branch$rst"
    fi
    echo -n $result
}

I’ve naïvely added a Mercurial equivalent to parse_git_branch using hg bookmarks or hg branch, and retrofitted/redubbed _rprompt_git into _rprompt_dvcs. It worked well, but was slow. I’m not the kind of person to give hg grief over reasonable speed differences with the speed monster, but you can’t wait 0.15 seconds for your friggin’ prompt, now can you (this is not an invitation for a git/hg/bzr performance holy war in the comments, people). Removing a call to hg by using $ hg root just once, storing that value and using cat to get the actual branch/bookmark didn’t speed things up enough. #mercurial on freenode was kind but didn’t know how to help, other than suggest I buy a really fast computer… Blah, I’ll have to jot something out in C.

One bitbucket account later and a bit of tinkering led to fast_hg_root, which is a dumb C program which acts as a (fast) replacement to hg root. So now my dvcs-prompt-related code includes:

parse_git_branch() {
        git branch --no-color 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/(\1)/'
}

parse_hg_branch() {
        if ! HG_ROOT=$(fast_hg_root) 2> /dev/null; then
            # not an HG repository, quit
            return
        fi
        BOOKMARK=$(cat "$HG_ROOT"/.hg/bookmarks.current 2> /dev/null)
        if [ -n "$BOOKMARK" ]; then
            # have a current bookmark, display that
            echo $BOOKMARK
            return
        fi
        # display the current branch or 'default'
        cat "$HG_ROOT"/.hg/branch 2> /dev/null || echo 'default'
}

function _rprompt_dvcs {
    local git_branch="$(parse_git_branch)"
    if [ -n "$git_branch" ]; then
        result=":%{$fg[blue]%}$git_branch$rst"
    else
        local hg_branch="$(parse_hg_branch)"
        if [ -n "$hg_branch" ]; then
            result=":%{$fg[green]%}$hg_branch$rst"
        fi
    fi
    echo -n $result
}

Which leads to my terminal sessions on my netbook monica looking like so:
DVCS aware zsh prompts

C’est tout. If someone is interested, I’ll open a repository with more of my .zsh.d stuff. By the way, anyone knows of a good way to capture a terminal into HTML, preserving ANSI color? I know of ttyrec and found HTML::FromANSI, but was hoping for a finished program or at least, uhm, a library less in Perl. :p

Python’s Innards: Objects 101

2010/05/12 § 16 Comments

As I said in the introduction to this series (which was rather successful; thank you everyone, your hits and comments literally keep me going!) – today’s post will be about Python 3.x’s implementation of objects. When I set out to write this post, I thought we’ll start with objects because it would be a gentle start for our series. After reading all the code I had to read to write this post, I can hardly say Python’s object system is, uhm, ‘gentle’ (and I definitely can’t say I grokked it). However, if anything, I was only further convinced that the implementation of objects is a good place for us to start because as we’ll see in later posts it’s so fundamental to the innards of Python, yet I suspect it’s not very often understood in its full glory even among Python veterans. Objects are not very tightly coupled with anything else in Python (I hardly visited ./Python for this post, I frequented mostly ./Objects and ./Include), so I found it easiest to look at the implementation of objects as if they’re unrelated to the ‘rest’, as if they’re a general purpose C API for creating an object subsystem. Maybe you will benefit from that line of thought, too: remember these are just a bunch of structures and some functions to manipulate them.

Without further ado, let’s start. Mostly everything in Python is an object, from integer to dictionaries, from user defined classes to built-in ones, from stack frames to code objects. Given a pointer to a piece of memory, the very least you must expect of it to treat it as an object are just a couple of fields defined in a C structure called ./Include/object.h: PyObject:

typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

Many objects extend this structure to accommodate other variables required to represent the object’s value, but these two fields must always exist: a reference count and type (in special debug builds, a couple other esoteric fields are added to track references).

The reference count is an integer which counts how many times the object is referenced. >>> a = b = c = object() instantiates an empty object and binds it to three different names: a, b and c. Each of these names creates another reference to it even though the object is allocated only once. Binding the object to yet another name or adding the object to a list will create another reference – but will not create another object! There is much more to say about reference counting, but that’s less central to the overall object system and more related to Garbage Collection. I’d rather consider writing a separate post about that later than elaborate here, we’ve a heady post ahead of us. However, before we leave this subject for now, I’d just like to note that we can now better understand the ./Include/object.h: Py_DECREF macro we’ve seen used in the introduction and didn’t know how to explain: It simply decrements ob_refcnt (and initiates deallocation, if ob_refcnt hit zero). That’s all we’ll say about reference counting for now.

Which leaves us with ob_type, a pointer to an object’s type, a central piece of Python’s object model (bear this deep in mind: in Python 3, type and class effectively mean the same thing; historic reasons lead to the usage of one name over the other depending on context). Every object has exactly one type, which never changes during the lifetime of the object (under extremely unusual conditions it can be changed, no API exists for that and if you handle type-changing objects you’re not likely to be reading this). Possibly most importantly, the type of an object (and only the type of an object) determines what can be done with an object (see the collapsed snippet at the end of this paragraph for a demonstration in code). Recall from the introduction that when the interpreter evaluates the subtraction opcode, a single C function (PyNumber_Subtract) will be called regardless of whether its operands are an integer and an integer, an integer and a float or even something nonsensical (subtract an exception from a dictionary).

# n2w: the type, not the instance, determines what can be done with an instance
>>> class Foo(object):
...     "I don't have __call__, so I can't be called"
... 
>>> class Bar(object):
...     __call__ = lambda *a, **kw: 42
... 
>>> foo = Foo()
>>> bar = Bar()
>>> foo()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Foo' object is not callable
>>> bar()
42
# will adding __call__ to foo help?
>>> foo.__call__ = lambda *a, **kw: 42
>>> foo()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Foo' object is not callable
# how about adding it to Foo?
>>> Foo.__call__ = lambda *a, **kw: 42
>>> foo()
42
>>> 

Initially, this is a peculiar thing. How can a single C function be used to handle any kind of object that is thrown at it? It can receive a void * pointer (actually it receives a PyObject * pointer, which is also opaque insofar as the object’s data is concerned), but how will it know how to manipulate the object it is given? In the object’s type lies the answer. A type is in itself a Python object (it also has a reference count and a type of its own, the type of almost all types is type), but in addition to the refcount and the type of the type, there are many more fields in the C structure describing type objects. This page has some information about types as well as type‘s structure’s definition, which you can also find it at ./Include/object.h: PyTypeObject, I suggest you refer to the definition occasionally as you read this post. Many of the fields a type object has are called slots and they point to functions (or to structures that point to a bunch of related functions). These functions are what will actually be called when Python C-API functions are invoked to operate on an object instantiated from that type. So while you think you’re calling PyNumber_Subtract on both a, say, int and a float, in reality what happens is that the types of it operands are dereferenced and the type-specific subtraction function in the ‘subtraction’ slot is used. So we see that the C-API functions aren’t generic, but rather rely on types to abstract the details away and appear as if they can work on anything (valid work is also just to raise a TypeError).

Let’s see this play out in detail: PyNumber_Subtract calls a generic two-argument function called ./Object/abstract.c: binary_op, and tells it to operate on the number-like slot nb_subtract (similar slots exists for other functionality, like, say, the number-like slot nb_negative or the sequence-like slot sq_length). binary_op is an error-checking wrapper around binary_op1, the real ‘do work’ function. ./Objects/abstract.c: binary_op1 (an eye-opening read in itself) receives BINARY_SUBTRACT‘s operands as v and w, and then tries to dereference v->ob_type->tp_as_number, a structure pointing to many numeric slots which represents how v can be used as a number. binary_op1 will expect to find at tp_as_number->nb_subtract a C function that will either do the subtraction or return the special value Py_NotImplemented, to signal that these operands are ‘insubtracticable’ in relation to one another (this will cause a TypeError exception to be raised).

If you want to change how objects behave, you can write an extension in C which will statically define its own PyObjectType structure in code and fill the slots away as you see fit. But when we create our own types in Python (make no mistake, >>> class Foo(list): pass creates a new type, class and type are the same thing), we don’t manually allocate a C structure and we don’t fill up its slots. How come these types behave just like built-in types? The answer is inheritance, where typing plays a significant role. See, Python arrives with some built-in types, like list or dict. As we said, these types have a certain set of functions populating their slots and thus objects instantiated from them behave in a certain way, like a mutable sequence of values or like a mapping of keys to values. When you define a new type in Python, a new C structure for that type is dynamically allocated on the heap (like any other object) and its slots are filled from whichever type it is inheriting, which is also called its base (what about multiple inheritance, you may ask? some other post, I might answer). Since the slots are copied over, the newly created sub-type has mostly identical functionality to its base. Python also arrives with a featureless base object type called object (PyBaseObject_Type in C), which has mostly null slots and which you can extend without inheriting any particular functionality.

So you never really ‘create’ a type in pure Python, you always inherit one (if you define a class without inheriting anything explicitly, you will implicitly inherit object; in Python 2.x, not inheriting anything explicitly leads to the creation of a so called ‘classic class’, which is out of our scope). Of course, you don’t have to inherit everything. You can, obviously, mutate the behaviour of a type created in pure Python, as I’ve demonstrated in the code snippet earlier in this post. By setting the special method __call__ on our class Bar, I made instances of that class callable. Someone, sometime during the creation of our class, noticed this __call__ method exists and wired it into our newly created type’s tp_call slot. ./Objects/typeobject.c: type_new, an elaborate and central function, is that function. We shall revisit that function, at length, in “Objects 102” (or 103, or 104…), but for now, let’s look at a small line right at the end, after the new type has been fully created and just before returning: fixup_slot_dispatchers(type);. This function iterates over the correctly named methods defined for the newly created type and wires them to the correct slots in the type’s structure, based on their particular name (where are these methods stored? later!).

Another thing remains unanswered in the sea of small details: we’ve demonstrated already that setting the method __call__ on a type after it’s created will also make objects instantiated from that type callable (even objects already instantiated from that type). How come? With elegance, my friends. Recall that a type is an object, and that the type of a type is type (if your head is spinning, try: >>> class Foo(list): pass ; type(Foo)). So when we do stuff to a class (I could use the word type instead of class just as well, but since we use the word ‘type’ so much in different context, I’ll call our type a class for a bit), like calling a class, or subtracting a class, or, indeed, setting an attribute on a class, what happens is that the class’ object’s ob_type member is dereferenced, finding that the class’ type is type. Then the type->tp_setattro slot is used to do the actual attribute setting. So a class, like an integer or a list can have its own attribute-setting function. And the type-specific attribute-setting function (./Objects/typeobject.c: type_setattro, if you didn’t know its name and want to befriend it in Facebook) calls the very same function that fixup_slot_dispatchers uses to actually do the fixup work (update_one_slot) after it has set a new attribute on a class. Another details sussed out!

This is it as far as our introduction to Python objects is concerned. I hope you enjoyed the ride and that you’re still with me. I have to admit this post was far harder to write than I initially imagined (and without Antoine Pitrou‘s and Mark Dickins‘s jolly help late at night in #python-dev, I might have given up!). I’ve left so much out that I’m appalled: which operand’s slot is used in binary opcodes? what happens in multiple inheritance and all the goryminute details it entails? what about this thing called metaclass? what about __slots__ and weak references? what about the actual implementation of all the built-in objects, how do the dictionary, the list, the set and their friends actually do their functional work? And finally, how about this little gem:

>>> a = object()
>>> class C(object): pass
... 
>>> b = C()
>>> a.foo = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'object' object has no attribute 'foo'
>>> b.foo = 5
>>>

How come I can set an arbitrary attribute to b, which is an instance of C, which is a class inheriting object and not changing anything, and yet I can’t do the same with a, an instance of that very same object? Some wise crackers can say: b has a __dict__ and a doesn’t, and that’s true, but where in Guido’s name did this new (and totally non-trivial!) functionality come from if I didn’t inherit it?!

Ha! I’m delighted you asked! The answers are coming, but at another post.


A short reading list on the matter, for those so inclined: Try the documentation of the data model for an explanation on how to use things from Python; the C-API documentation about abstract and concrete objects for an explanation on using things from C; descrintro, AKA Unifying types and classes in Python 2.2, an archeological finding which is long, mind-boggling and hugely relevant (I argue it should be added as an easter egg to the interpreter, I suggest >>> import THAT); but above all else, read ./Objects/typeobject.c. Read it over and over, until you cry yourself to sleep. Sweet dreams.

Searching mailman archives offline (python-dev, anyone?)

2010/05/07 § 8 Comments

Since I’m a newcomer to python-dev, I often need to search the python-dev mailman archives. While I did find this way to do it (using Google with site:), it’s no good for offline searches (and at best it’s a kludge for online searches, too, IMHO). I’m offline quite a lot these days, since cellular 3G isn’t what it never used to be, and as long as I’m travelling the world, pre-paid cellular 3G is even worse. So, I set out looking for a proper solution to search python-dev in an offline manner.

Initially I just downloaded the whole mailing list archive with the shell-concoction listed below, and used grep to fish out what I needed. Obviously, if you’re reading this to look into something which isn’t python-dev (but why would you?!), replace MAILMAN_URL with wherever the mailing list you care about is archived.

MAILMAN_URL=http://mail.python.org/pipermail/python-dev/
for FILENAME in $(wget -O - -q $MAILMAN_URL |
                         egrep -o 'href="[^"]+.txt.gz"' |
                         cut -f2 -d\" )
do
    wget $MAILMAN_URL/$FILENAME
    gunzip $FILENAME
done

Naturally, after a (short) while I realized I need a proper mailbox search utility. I’ve been using Debian for ages, but the pure richness of Ubuntu’s repositories has only recently made my brain rewire ‘task: find new software’ to apt-cache search. So I did, and indeed apt-cache search mbox search found mairix, a “a program for indexing and searching email messages stored in maildir, MH or mbox folders.”. Sweet.

mairix has a slightly odd usage pattern and is geared towards people fleunt in mutt (which I’m not) so it, ugh, took me a while to realize it’s the tool I need and how to use it (gory details below). To sum things up, with mairix, you (a) index all the mail you’d like to search in one invocation and (b) run mairix with a search query, which creates a new mailbox (mbox/Maildir/MH) only with the results. You can later view that mailbox with your favorite reader, but the only one that I know of that would make sense in this context is probably, indeed, mutt.

Initially I set mairix up to index the mboxes as they were, but then I realized that due to the limitations of the mbox format, mairix has to copy every matching message to the results mailbox. If I were to use Maildir, for example, where every message is a file, it would generate a search-result-Maildir made of symlinks, which sounds better. So how do you convert all these mbox’s to Maildir? apt-cache search convert mbox maildir found mb2md for me. I placed all the mbox’s in a directory called mbox, and created a directory called maildir, and ran: mb2md -s $(pwd)/mbox -d $(pwd)/maildir. It chugged along unhappily (spewed a ton of error messages), but seemed to have worked (it later occurred to me that some emails might have been lost, I’m not sure), and a few minutes later I had all of python-dev’s archives in Maildir format.

Now I can use mairix! I setup my .mairixrc like so:

base=/home/teolicy/Projects/python-internals/mail/maildir
maildir=...
database=/home/teolicy/.mairixdb
mfolder=/tmp/mairix-results

The maildir=... bit means “recurse under base and index the maildirs within”. The mfolder line says where to put the resulting mailbox from your last search. I recon the other parameters are rather obvious, but see mairixrc(5) for details if you need something else. Warning! Obviously, if you’re going to index something that’s private, don’t place the results in /tmp!

Now I just had to run mairix with no arguments, and a few (rather short) moments later all the emails in the archive were indexed. To use mairix, you type something like: mairix d:3m- s:gil f:antoine which means, “find all messages in the last three months where the subject has ‘gil’ in it and the sender has ‘antoine’ in it”. The results will be stored in /tmp/mairix-results, which you can read using mutt -f /tmp/mairix-results. I encourage you to read mairix(1), but if you don’t, be aware that the useful -t switch will pull in whole threads into the results, not just matched messages. I use it more often than not.

Two small things remained. The first, for some reason which I didn’t care enough to research, mutt kept complaining I don’t have a ~/Mail folder on startup. Placing set folder=/tmp/mairix-results in my .muttrc made it go away. <sheepish>I didn’t really read what that means</sheepish>, so if that setting eats your homework, well, you deserve it. Also, I wrote a simple function for my zshrc file that reads something like:

mairix() {
    /usr/bin/env mairix -o /tmp/mairix-results $* &&
    mutt -Rf /tmp/mairix-results
}

It makes the whole thing easier.

That’s it. I’d feel pretty happy with myself, having an itch scratched so nicely, unless I was so dumbheaded as to fail to see that mairix is essentially the tool I was looking for in the first place. After about three minutes in its manpage, I figured it’s “unwieldy crap”, and started writing my own mailbox search engine in Python, based on whoosh. Fortunately, after a couple of days of mellow hacking, and having learned of the horrors that are email algorithm (email just sucks, you know?), it dawned on me that I’m slowly changing my design thus that I’m bloody rewriting mairix, so I ditched my effort, spent a few more minutes reading mairix’s manpage and not dismissing it unconsciously all the time as crap and realized it’s exactly what I needed. I learned some from the experience about free text searching in Python and RFC2822 and stuff, but honestly, I wish I weren’t such an arse in the first place. There, I confessed.

Below you can find all the stuff written here in easily copy-pastable form, you lazy bastard. Note this isn’t a script, as it doesn’t check for any kind of error, so it’s up to you to make sure this doesn’t botch your computer or whatever.

ARCHIVE_LOCATION=$HOME/python-dev
MAILMAN_URL=http://mail.python.org/pipermail/python-dev/

echo installing mutt, mairix, mb2md
sudo apt-get install mutt mairix mb2md

echo creating directories
mkdir -p $ARCHIVE_LOCATION/mbox
mkdir -p $ARCHIVE_LOCATION/maildir
cd $ARCHIVE_LOCATION/mbox

echo downloading $MAILMAN_URL
for FILENAME in $(wget -O - -q $MAILMAN_URL |
                         egrep -o 'href="[^"]+.txt.gz"' |
                         cut -f2 -d\")
do
    echo downloading $FILENAME
    wget -q $MAILMAN_URL/$FILENAME
    gunzip $FILENAME
done

echo converting to maildir
cd $ARCHIVE_LOCATION
mb2md -s $(pwd)/mbox -d $(pwd)/maildir 2>/dev/null 1>/dev/null

echo removing converted mailboxes
rm -fr $ARCHIVE_LOCATION/mbox
mv $ARCHIVE_LOCATION/maildir/* $ARCHIVE_LOCATION/maildir/.??* $ARCHIVE_LOCATION
rmdir $ARCHIVE_LOCATION/maildir

echo setting up mairixrc and muttrc
cat << EOF > ~/.mairixrc
base=$ARCHIVE_LOCATION
maildir=...
database=$HOME/.mairixdb
mfolder=/tmp/mairix-results
EOF

cat << EOF > ~/.muttrc
set folder=/tmp/mairix-results
EOF

echo indexing archive
mairix

echo 'mairix is all set-up; maybe you want to use this function:'
echo 'mairix() {'
echo '  /usr/bin/env mairix -o /tmp/mairix-results $* &&'
echo '   mutt -Rf /tmp/mairix-results'
echo '}'

The question of updates remains; a simple script should be able to do the trick, and maybe I’ll write it sometime. Or not.

Where Am I?

You are currently viewing the archives for May, 2010 at NIL: .to write(1) ~ help:about.