Python’s Innards: Objects 102

Welcome to Object 102, the third post in our series of Python internals and a direct continuation to the earlier post, Objects 101 (reading this post without reading 101 totally voids your warranty, so maybe you should head there first if you haven’t yet). In this post we will touch upon a central subject we have tiptoed around so far – attributes. You have surely used attributes in the past if you’ve done any Python programming: an object’s attributes are other objects related to it and accessible by invoking the . (dot) operator, like so: >>> my_object.attribute_name. Let’s begin with a quick overview of observable attribute access behaviour in Python, a behaviour determined by the accessed object’s type (as we’ve demonstrated ad nauseum about the behaviour of all operations on an object).

A type can define one (or more) specially named methods that will customize attribute access to its instances, these special methods are listed here (and we already know they will be wired into the type’s slots using fixup_slot_dispatchers when the type is created… you did read Objects 101, right?). These methods can do pretty much anything they want; whether writing your type in C or pure Python, you could write such methods that store and retrieve the attributes in some insane backend if you were so inclined, you could radio the attributes to the International Space Station and back or even store them in a relational database. However, under less exciting circumstances these methods simply store the attribute as a key/value pair (attribute name/attribute value) in some object-specific dictionary when an attribute is set and retrieve the attribute from that dictionary when an attribute is get (or raise an AttributeError if the dictionary doesn’t have a key matching the requested attribute’s name). This is so straightforward and lovely, thanks for listening, this concludes Objects 102.

Not. My friends, the fecal material is accelerating rapidly towards the spinning wind generation device, let me tell you that much. Let’s get into trouble together by looking closely at how this works in the interpreter, asking annoying questions as we often do. Below is a collapsed example of simple use of attributes, you can either read it in code (the surprising bits highlighted) or proceed to the English description of it in the next paragraph (the surprising bits are emphasized).

 >>> print(object.__dict__)
{‘__ne__’: <slot wrapper ‘__ne__’ of ‘object’ objects>, … , ‘__ge__’: <slot wrapper ‘__ge__’ of ‘object’ objects>}
>>> object.__ne__ is object.__dict__[‘__ne__’]
True
>>> o = object()
>>> o.__class__
<class ‘object’>
>>> o.a = 1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: ‘object’ object has no attribute ‘a’
>>> o.__dict__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: ‘object’ object has no attribute ‘__dict__’
>>> class C:
... A = 1
...
>>> C.__dict__[‘A’]
1
>>> C.A
1
>>> o2 = C()
>>> o2.a = 1
>>> o2.__dict__
{‘a’: 1}
>>> o2.__dict__[‘a2’] = 2
>>> o2.a2
2
>>> C.__dict__[‘A2’] = 2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ‘dict_proxy’ object does not support item assignment
>>> C.A2 = 2
>>> C.__dict__[‘A2’] is C.A2
True
>>> type(C.__dict__) is type(o2.__dict__)
False
>>> type(C.__dict__)
<class ‘dict_proxy’>
>>> type(o2.__dict__)
<class ‘dict’>
>>>

In English: we can see that object (as in, the most basic built-in type which we’ve discussed before) has a private dictionary, and we see that stuff we access on object as an attribute is identical to what we find in object.__dict__. We might have been somewhat surprised to see that instances of object (o, in the example) don’t support arbitrary attribute assignment and don’t have a __dict__ at all, though they do support some attribute access (try o.__class__, o.__hash__, etc; these do return things). After that we created our own class, C, derived from object and adding an attribute A, and saw that A was accessible via C.A and C.__dict__['A'] just the same, as expected. We then instantiated o2 from C, and demonstrated that as expected, attribute assignment on it indeed mutates its __dict__ and vice versa (i.e., mutations to its __dict__ are exposed as attributes). We were then probably more surprised to learn that even though attribute assignment on the class (C.A2) worked fine, our class’ __dict__ is actually read-only. Finally, we saw that our class’ __dict__ is not of the same type as our object’s __dict__, but rather an unfamiliar beast called dict_proxy. And if all that wasn’t enough, recall the mystery from the end of Objects 101: if plain object instances like o have no __dict__, and C extends object without adding anything significant, why do instances of C like o2 suddenly do have a __dict__?

Curiouser and curiouser indeed, but worry not: all will be revealed in due time. First, we shall look at the implementation of a type’s __dict__. Looking at the definition of PyObjectType (a zesty and highly recommended exercise), we see a slot called tp_dict, ready to accept a pointer to a dictionary. All types must have this slot, and all types have a dictionary placed there when ./Objects/typeobject.c: PyType_Ready is called on them, either when the interpreter is first initialized (remember Py_Initialize? It invokes _Py_ReadyTypes which calls PyType_Ready on all known types) or when the type is created dynamically by the user (type_new calls PyType_Ready on the newborn type before returning). In fact, every name you bind within a class statement will turn up in the newly created type’s __dict__ (see ./Objects/typeobject.c: type_new: type->tp_dict = dict = PyDict_Copy(dict);). Recall that types are objects too, so they have a type, their type is type and it has slots with functions that satisfy attribute access in a manner befitting types. These functions use the dictionary each type has and pointed to by tp_dict to store/retrieve the attributes, that is, getting attributes on a type is directly wired to dictionary assignment for the type instance’s private dictionary pointed to by the type’s structure. Reasonably similar stuff happens when attributes are set or deleted (in the eyes of the interpreter, deletion is just a set operation of NULL value, by the way). So far I hope it’s been rather simple, and explains types’ attribute retrieval.

Before we proceed to explain how instances’ attribute access works, allow me to introduce an unfamiliar (but very important!) term: Descriptor. Descriptors play a special role in instances’ attribute access, and I’d like to provide their bare definition here for a moment to simplify things in the next paragraph. An object is said to be a descriptor if it’s type has one or two slots (tp_descr_get and/or tp_descr_set) filled with non-NULL value. These slots are wired to the special method names __get__, __set__ and __delete__, when the type is defined in pure Python (i.e., if you create a class which has a __get__ method it will be wired to its tp_descr_get slot, and if you instantiate an object from that class, the object is a descriptor). Finally, an object is said to be a data descriptor if its type has a non-NULL tp_descr_set slot (there’s no particularly special term for a non-data descriptor). Descriptors play a special role in attribute access, as we shall soon see, I will provide further explanations and relevant links to documentation soon.

So we’ve defined descriptors, and we know how types’ dictionaries and attribute access work. But most objects aren’t types, that is to say, their type isn’t type, it’s something more mundane like int or dict or a user defined class. All these rely on generic attribute access functions, which are either set on the type explicitly or inherited from the type’s base when the type is created (that’s slot inheritance; this was also covered in Object 101). The generic attribute-getting function (PyObject_GenericGetAttr) and its algorithm is like so: (a) search the accessed instance’s type’s dictionary, and then all the type’s bases’ dictionaries. If a data descriptor was found, invoke it’s tp_desr_get function and return the results. If something else is found, set it aside (we’ll call it X). (b) Now search the object’s dictionary, and if something is found, return it. (c) If nothing was found in the object’s dictionary, inspect X, if one was set aside at all; if X is a non-data descriptor, invoke it’s tp_descr_get function and return the result, and if it’s a plain object it returns it. (d) Finally, if nothing was found, it raise an AttributeError exception.

So we learn that descriptors can execute code when they’re accessed as an attribute (so when you do foo = o.a or o.a = foo, a runs code). A powerful notion, that, and it’s used in several cases to implement some of Python’s more ‘magical’ features. Data-descriptors are even more powerful, as they take precedence over instance attributes (if you have an object o of class C, class C has a foo data-descriptor and o has a foo instance attribute, when you do o.foo the descriptor will take precedence). You can read about what are descriptors and how to implement descriptors. The former link (‘what are’) is particularly recommended: while it might seem daunting at first, if you read it slowly and literally you will see it’s actually simple and more concisely written than my drivel. There’s also a terrific article by Raymond Hettinger which describes descriptors in Python 2.x; other than the disappearance of unbound methods, it’s still very relevant to py3k and is highly recommended. While descriptors are really important and you’re advised to take the time to understand them, for brevity and due to the well written resources I’ve just mentioned I will explain them no further, other than show you how they behave in the interpreter (super simple example!):

 >>> class ShoutingInteger(int):
... # __get__ implements the tp_descr_get slot
... def __get__(self, instance, owner):
... print(‘I was gotten from %s (instance of %s)’
... % (instance, owner))
... return self
...
>>> class Foo:
... Shouting42 = ShoutingInteger(42)
...
>>> foo = Foo()
>>> 100 – foo.Shouting42
I was gotten from <__main__.Foo object at 0xb7583c8c> (instance of <class __main__.’foo’>)
58
# Remember: descriptors are only searched on types!
>>> foo.Silent666 = ShoutingInteger(666)
>>> 100 – foo.Silent666
-566
>>>

Let’s take a moment to note that we have completed our understanding of Python’s Object Oriented inheritance: since attributes are searched in the object’s type, and then in all its bases, then we now understand that accessing attribute A on object O instantiated from class C1 which inherits C2 which inherits C3 can return A either from O, C1, C2 or C3, depending on something called the method resolution order and described rather well here. This way of resolving attributes, when coupled with slot inheritance, is enough to explain most of Python’s inheritance functionality (though, as always, the Devil is in the details).

We know quite a lot now, but we still don’t know where the reference to these sneaky instance dictionaries are stored. We’ve seen the definition of PyObject, and it most definitely didn’t have a pointer to a dictionary, so where is the reference the object’s dictionary stored? The answer is rather interesting. If you look closely at the definition of PyTypeObject (it’s good clean and wholesome! read it every day!), you will see a field called tp_dictoffset. This field provides a byte offset into the C-structure allocated for objects instantiated from this type; at this offset, a pointer to a regular Python dictionary should be found. Under normal circumstances, when creating a new type, the size of the memory region necessary to allocate objects of that type will be calculated, and that size will be larger than the size of vanilla PyObject. The extra room will typically be used (among other things) to store the pointer to the dictionary (all this happens in ./Objects/typeobject.c: type_new, see may_add_dict = base->tp_dictoffset == 0; onwards). Using gdb, we can easily see this extra room and the object’s private dictionary:

 >>> class C: pass

>>> o = C()
>>> o.foo = ‘bar’
>>> o
<__main__.C object at 0x846b06c>
>>>
# break into GDB, see ‘metablogging’->’tools’ above
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0012d422 in __kernel_vsyscall ()
(gdb) p ((PyObject *)(0x846b06c))->ob_type->tp_dictoffset
$1 = 16
(gdb) p *((PyObject **)(((char *)0x846b06c)+16))
$3 = {u’foo’: u’bar’}
(gdb)

We have created a new class, instantiated an object from it and set some attribute on the object (o.foo = 'bar'), broke into gdb, dereferenced the object’s type (C) and checked its tp_dictoffset (it was 16), and then checked what’s to be found at the address pointed to by the pointer located at 16 bytes’ offset from the object’s C-structure, and indeed we found there a dictionary object with the key foo pointing to the value bar. Of course, if you check tp_dictoffset on a type which doesn’t have a __dict__, like object, you will find that it is zero. You sure have a warm fuzzy feeling now, don’t you.

What could be misleading here is how type dictionaries and the common instance dictionaries look so similar, while in fact they’re rather differently implemented and allocated. Several riddles remain. Let’s recap and see what we’re missing: I define a class C inheriting object and doing nothing much else in Python, and then I instantiate o from that class, causing the extra memory for the dictionary pointer to be allocated at tp_dictoffset (room is reserved from the get-go, but the dictionary itself is allocated on first use of any kind; smart buggers). I then type in my interpreter o.__dict__, which byte-compiles to the LOAD_ATTR opcode, which causes the PyObject_GetAttr function to be called, which dereferences the type of o and finds the slot tp_getattro, which causes the default attribute searching mechanism described earlier in this post and implemented in PyObject_GenericGetAttr. So when all that happens, what returns my object’s dictionary? I know where the dictionary is stored, but I can see that __dict__ isn’t recursively inside itself, so there’s a chicken and egg problem here; who gives me my dictionary when I access __dict__ if it is not in my dictionary?

Someone who has precedence over the object’s dictionary – a descriptor. Check this out:

 >>> class C: pass

>>> o = C()
>>> o.__dict__
{}
>>> C.__dict__[‘__dict__’]
<attribute ‘__dict__’ of ‘C’ objects>
>>> type(C.__dict__[‘__dict__’])
<class ‘getset_descriptor’>
>>> C.__dict__[‘__dict__’].__get__(o, C)
{}
>>> C.__dict__[‘__dict__’].__get__(o, C) is o.__dict__
True
>>>

Gee. Seems like there’s something called getset_descriptor (it’s in ./Objects/typeobject.c), which are groups of functions implementing the descriptor protocol and meant to be attached to an object placed in type’s __dict__. This descriptor will intercept all attribute access to o.__dict__ on instances of this type, and will return whatever it wants, in our case, a reference to the dictionary found at the tp_dictoffset of o. This is also the explanation of the dict_proxy business we’ve seen earlier. If in tp_dict there’s a pointer to a plain dictionary, what causes it to be returned wrapped in this read only proxy, and why? The __dict__ descriptor of the type’s type type does it.

Resuming that example:

 >>> type(C)
<class ‘type’>
>>> type(C).__dict__[‘__dict__’]
<attribute ‘__dict__’ of ‘type’ objects>
>>> type(C).__dict__[‘__dict__’].__get__(C, type)
<dict_proxy object at 0xb767e494>

This descriptor is a function that wraps the dictionary in a simple object that mimics regular dictionaries’ behaviour but only allows read only access to the dictionary it wraps. And why is it so important to prevent people from messing with a type’s __dict__? Because a type’s namespace might hold them specially named methods, like __sub__. When you create a type with these specially named methods or when you set them on the type as an attribute, the function update_one_slot will patch these methods into one of the type’s slots, as we’ve seen in 101 for the subtraction operation. If you were to add these methods straight into the type’s __dict__, they won’t be wired to any slot, and you’ll have a type that looks like it has a certain behaviour (say, has __sub__ in its dictionary), but doesn’t behave that way.

How sad it is that we’re way over the dreaded attention span deadline of 2,000 words, and I still didn’t talk about __slots__, which I really wanted to. How about you go read about them yourself, cowboy? You have at your disposal everything you need to figure them out alone! Read the aforementioned link, play with __slots__ a bit in the interpreter, maybe look at the code or research them in gdb; have fun. In the next post, I think we’ll leave objects alone for a bit and talk about interpreter state and thread state. I hope it sounds interesting, but even if it doesn’t you really should read it – chicks totally dig people who know about these matters, I can certainly promise you that much.


Comments

5 responses to “Python’s Innards: Objects 102”

  1. What’s great about your posts is that I’ve been programming in Python for about 10 years, and yet I learned something new from each of your “Guide’s Python” posts. Keep em coming!

  2. Nick Coghlan Avatar
    Nick Coghlan

    For your “MRO” link, I suggest using the following instead of (or as well as) your current one:
    http://www.python.org/download/releases/2.3/mro/

    The 2.2 MRO algorithm had a few issues that were fixed in 2.3.

  3. Nick Coghlan Avatar
    Nick Coghlan

    Also, your gdb example has HTML coming through (the link in the comment is being printed as raw HTML instead of as a link)

    1. Thanks for both comments.

      I updated the MRO link and removed the URL from the gdb example altogether, WordPress’ spell checker had a tendency to screw up my source examples (so I stoped using it [sic]).

  4. Fantastic posts. Thank you for all your work!

    I just need clarification.
    Should the word I’ve highlighted (**type**) in para 5 be the type of type?
    Or else, should the word “instance’s” be removed?

    paragraph 5:
    These functions use the dictionary each type has and pointed to by tp_dict to store/retrieve the attributes, that is, getting attributes on a type is directly wired to dictionary assignment for the **type** instance’s private dictionary pointed to by the type’s structure.