2010/06/05 § Leave a comment
When discussing what I called “Pythonic threads”, which are threads created and managed by Python, which have a PyThreadState structure allocated to them and that are able to call into the Python API and run Python code, I said: “theoretically threads can be created which will not be under the interpreter’s control; these threads won’t have a PyThreadState structure and must never call a Python API; this is not very common“. This is sort of correct, but misleading: such threads are quite obvious to have in a non-Python applications that has a Python interpreter embedded in it (something Graham probably knows a thing or two about :). These ‘foreign’ (I don’t know of formal term) threads indeed can’t call the Python API and run Python code unless they’re ‘migrated’ (don’t know of a formal term) to come under the control of a Python interpreter, having the correct bookkeeping structures allocated and initialized for them.
Graham also noted correctly that such ‘migration’ is easier when there’s just one Python interpreter in the process, and harder if there’s more than one. If you have just one interpreter, you can use the (somewhat confusingly named) PyGILState_Ensure and PyGILState_Release calls to do the job. These calls allow a thread to ensure it has the GIL (blocking if necessary until it can be acquired), but importantly can be called by a thread that isn’t under Python’s control at all; the necessary work to allow the thread to call Python’s API will be done for you. These calls assume there is only one Python interpreter (interp_head points to a PyInterpreterState which points to NULL), hence they mustn’t be used in a multi-interpreter process, if that is your case you’ll have to ‘migrate’ the threat manually with lower level primitives.
As per my errata policy, I’ve chosen to update the original post as if the error never happened and write this post note the mistake for future reference and for those subscribed to the RSS feed of the Errata category.
I’d like to thank Graham for the important correction.
2010/05/26 § 11 Comments
We started our series discussing the basics of Python’s object system (Objects 101 and 102), and it’s time to move on. Though we’re not done with objects by any stretch of the imagination, when I think of Python’s implementation I visualize this big machine with a conveyor belt feeding opcodes into a hulking processing plant with cranes and cooling towers sticking out, and I just have to friggin’ peer inside already. So whether our knowledge of objects is complete or not, we’ll put them aside for a bit and look into the Interpreter State and the Thread State structures (both implemented in ./Python/pystate.c). I may be naïve, but I chose these structures because I’d like this post to be a simple basis for our understanding of actual bytecode evaluation that we’ll begin in the next few posts. Soon enough we’ll pry open Pandora boxes like frames, namespaces and code objects, and before we tackle these concepts I’d like us see the broad-picture of how the data structures that bind them together are laid out. Finally, note that this post is a tad heavier on OS terminology, I assume at least passing familiarity with it (kernel, process, thread, etc).
As you probably know, in many operating systems user-space code is executed by an abstraction called threads that run inside another abstraction called processes (this includes most Unices/Unix-likes and the decent members of the Windows family). The kernel is in charge of setting up and tearing down these processes and execution threads, as well as deciding which thread will run on which logical CPU at any given time. When a process invokes Py_Initialize another abstraction comes into play, and that is the interpreter. Any Python code that runs in a process is tied to an interpreter, you can think of the interpreter as the root of all other concepts we’ll discuss. Python’s code base supports initializing two (or more) completely separate interpreters that share little state with one another. This is rather rarely done (never in the vanilla executable), because too much subtly shared state of the interpreter core and of C extensions exists between these ‘insulated’ interpreters. Still, since support exists for it and for completeness’ sake, I’ll try anyway to write this post from the standpoint of a multi-interpreter world. Anyhow, we said all execution of code occurs in a thread (or threads), and Python’s Virtual Machine is no exception. However, Python’s Virtual Machine itself is something which supports the notion of threading, so Python has its own abstraction to represent Python threads. This abstraction’s implementation is fully reliant on the kernel’s threading mechanisms, so both the kernel and Python are aware of each Python thread and Python threads execute as separate kernel-managed threads, running in parallel with all other threads in the system. Uhm, almost.
There’s an elephant in that particular living room, and that is the Global Interpreter Lock or the GIL. Due to many reasons which we’ll cover briefly today and revisit at length some other time, many aspects of Python’s CPython implementation are not thread safe. This is has some benefits, like simplifying the implementation of easy-to-screw-up pieces of code and guaranteed atomicity of many Python operations, but it also means that a mechanism must be put in place to prevent two (or more) Pythonic threads from executing in parallel, lest they corrupt each other’s data. The GIL is a process-wide lock which must be held by a thread if it wants to do anything Pythonic – effectively limiting all such work to a single thread running on a single logical CPU at a time. Threads in Python multitask cooperatively by relinquishing the GIL voluntarily so other threads can do Pythonic work; this cooperation is built-in to the evaluation loop, so ordinarily authors of Python code and some extensions don’t need to do something special to make cooperation work (from their point of view, they are preempted). Do note that while a thread doesn’t use any of Python’s APIs it can (and many threads do) run in parallel to another Pythonic thread. We will discuss the GIL again briefly later in this post and at length at a later time, but for the time being I can refer the interested readers to this excellent PyCon lecture by David Beazely for additional information about how the GIL works (how the GIL works is not the main subject of the lecture, but the explanation of how it works is very good).
With the concepts of a process (OS abstraction), interpreter(s) (Python abstraction) and threads (an OS abstraction and a Python abstraction) in mind, let’s go inside-out by zooming out from a single opcode outwards to the whole process. This should give us a good overview, since so far we mainly went inwards from the implementation of some object-centric opcodes to the actual implementation of how they operate on objects. Let’s look again at the disassembly of the bytecode generated for the simple statement spam = eggs - 1:
# what's 'diss'? see 'tools' under 'metablogging' above! >>> diss("spam = eggs - 1") 1 0 LOAD_NAME 0 (eggs) 3 LOAD_CONST 0 (1) 6 BINARY_SUBTRACT 7 STORE_NAME 1 (spam) 10 LOAD_CONST 1 (None) 13 RETURN_VALUE >>>
In addition to the actual ‘do work’ opcode BINARY_SUBTRACT, we see opcodes like LOAD_NAME (eggs) and STORE_NAME (spam). It seems obvious that evaluating such opcodes requires some storage room: eggs has to be loaded from somewhere, spam has to be stored somewhere. The inner-most data structures in which evaluation occurs are the frame object and the code object, and they point to this storage room. When you’re “running” Python code, you’re actually evaluating frames (recall ceval.c: PyEval_EvalFrameEx). For now we’re happy to lump frame objects and code objects together; in reality they are rather distinct, but we’ll explore that some other time. In this code-structure-oriented post, the main thing we care about is the f_back field of the frame object (though many others exist). In frame n this field points to frame n-1, i.e., the frame that called us (the first frame that was called in any particular thread, the top frame, points to NULL).
This stack of frames is unique to every thread and is anchored to the thread-specific structure ./Include.h/pystate.h: PyThreadState, which includes a pointer to the currently executing frame in that thread (the most recently called frame, the bottom of the stack). PyThreadState is allocated and initialized for every Python thread in a process by _PyThreadState_Prealloc just before new thread creation is actually requested from the underlying OS (see ./Modules/_threadmodule.c: thread_PyThread_start_new_thread and >>> from _thread import start_new_thread). Threads can be created which will not be under the interpreter’s control; these threads won’t have a PyThreadState structure and must never call a Python API. This isn’t so common in a Python application but is more common when Python is embedded into another application. It is possible to ‘Pythonize’ such foreign threads that weren’t originally created by Python code in order to allow them to run Python code (PyThreadState will have to be allocated for them). APIs exist that can do such a migration so long as only one interpreter exists, it is also possible though harder to do it manually in a multi-interpreter environment. I hope to revisit these APIs and their operation in a later post, possibly one about embedding. Finally, a bit like all frames are tied together in a backward-going stack of previous-frame pointers, so are all thread states tied together in a linked list of PyThreadState *next pointers.
The list of thread states is anchored to the interpreter state structure which owns these threads. The interpreter state structure is defined at ./Include.h/pystate.h: PyInterpreterState, and it is created when you call Py_Initialize to initialize the Python VM in a process or Py_NewInterpreter to create a new interpreter state for multi-interpreter processes. Mostly as an exercise to sharpen your understanding, note carefully that Py_NewInterpreter does not return an interpreter state – it returns a (newly created) PyThreadState for the single automatically created thread of the newly created interpreter. There’s no sense in creating a new interpreter state without at least one thread in it, much like there’s no sense in creating a new process with no threads in it. Similarly to the list of threads anchored to its interpreter, so does the interpreter structure have a next field which forms a list by linking the interpreters to one another.
This pretty much sums up our zooming out from the resolution of a single opcode to the whole process: opcodes belong to currently evaluating code objects (currently evaluating is specified as opposed to code objects which are just lying around as data, waiting for the opportunity to be called), which belong to currently evaluating frames, which belong to Pythonic threads, which belong to interpreters. The anchor which holds the root of this structure is the static variable ./Python/pystate.c: interp_head, which points to the first interpreter state (through that all interpreters are reachable, through each of them all thread states are reachable, and so fourth). The mutex head_mutex protects interp_head and the lists it points to so they won’t be corrupt by concurrent modifications from multiple threads (I want it to be clear that this lock is not the GIL, it’s just the mutex for interpreter and thread states). The macros HEAD_LOCK and HEAD_UNLOCK control this lock. interp_head is typically used when one wishes to add/remove interpreters or threads and for special purposes. That’s because accessing an interpreter or a thread through the head variable would get you an interpreter state rather than the interpreter state owning the currently running thread (just in case there’s more than one interpreter state).
A more useful variable similar to interp_head is ./Python/pystate.c: _PyThreadState_Current which points to the currently running thread state (important terms and conditions apply, see soon). This is how code typically accesses the correct interpreter state for itself: first find its your own thread’s thread state, then dereference its interp field to get to your interpreter. There are a couple of functions that let you access this variable (get its current value or swap it with a new one while retaining the old one) and they require that you hold the GIL to be used. This is important, and serves as an example of CPython’s lack of thread safety (a rather simple one, others are hairier). If two threads are running and there was no GIL, to which thread would this variable point? “The thread that holds the GIL” is an easy answer, and indeed, the one that’s used. _PyThreadState_Current is set during Python’s initialization or during a new thread’s creation to the thread state structure that was just created. When a Pythonic thread is bootstrapped and starts running for the very first time it can assume two things: (a) it holds the GIL and (b) it will find a correct value in _PyThreadState_Current. As of that moment the Pythonic thread should not relinquish the GIL and let other threads run without first storing _PyThreadState_Current somewhere, and should immediately re-acquire the GIL and restore _PyThreadState_Current to its old value when it wants to resume running Pythonic code. This behaviour is what keeps _PyThreadState_Current correct for GIL-holding threads and is so common that macros exist to do the save-release/acquire-restore idioms (Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS). There’s much more to say about the GIL and additional APIs to handle it and it’s probably also interesting to contrast it with other Python implementation (Jython and IronPython are thread safe and do run Pythonic threads concurrently). But we’ll leave all that to a later post.
We now have all pieces in place, so here’s a little diagram I quickly jotted1 which shows the relation between the state structures within a single process hosting Python as described so far. We have in this example two interpreters with two threads each, you can see each of these threads points to its own call stack of frames.
Lovely, isn’t it. Anyway, something we didn’t discuss at all is why these structures are needed. I mean, what’s in them? The reason we didn’t discuss the contents so far and will only briefly mention it now is that I wanted the structure to be clear more than the features. We will discuss the roles of each of these objects as we discuss the feature that relies on that role; for example, interpreter states contain several fields dealing with imported modules of that particular interpreter, so we can talk about that when we talk about importing. That said, I wouldn’t want to leave you completely in the dark, so we’ll briefly add that in addition to managing imports they hold bunch of pointers related to handling Unicode codecs, a field to do with dynamic linking flags and a field to do with TSC usage for profiling (see last bullet here), I didn’t look into it much.
Thread states have more fields but to me they were more easily understood. Not too surprisingly, they have fields that deal with things that relate to the execution flow of a particular thread and are of too broad a scope to fit particular frame. Take for example the fields recursion_depth, overflow and recursion_critical, which are meant to trap and raise a RuntimeError during overly deep recursions before the stack of the underlying platform is exhausted and the whole process crashes. In addition to these fields, this structure accommodates fields related to profiling and tracing, exception handling (exceptions can be thrown across frames), a general purpose per-thread dictionary for extensions to store arbitrary stuff in and counters to do with deciding when a thread ran too much and should voluntarily relinquish the GIL to let other threads run.
I think this pretty much sums up what I have to say about the general layout of a Python process. I hope it was indeed simple enough to follow, I plan on getting into rough waters soon and wanted this post to sink in well with you (and me!). The next two (maybe three) posts will start chasing a real heavyweight, the frame object, which will force us to discuss namespaces and code objects. Engage.
I would like to thank Antoine Pitrou and Nick Coghlan for reviewing this article; any mistakes that slipped through are my own.