Python’s Innards: pystate

We started our series discussing the basics of Python’s object system (Objects 101 and 102), and it’s time to move on. Though we’re not done with objects by any stretch of the imagination, when I think of Python’s implementation I visualize this big machine with a conveyor belt feeding opcodes into a hulking processing plant with cranes and cooling towers sticking out, and I just have to friggin’ peer inside already. So whether our knowledge of objects is complete or not, we’ll put them aside for a bit and look into the Interpreter State and the Thread State structures (both implemented in ./Python/pystate.c). I may be naïve, but I chose these structures because I’d like this post to be a simple basis for our understanding of actual bytecode evaluation that we’ll begin in the next few posts. Soon enough we’ll pry open Pandora boxes like frames, namespaces and code objects, and before we tackle these concepts I’d like us see the broad-picture of how the data structures that bind them together are laid out. Finally, note that this post is a tad heavier on OS terminology, I assume at least passing familiarity with it (kernel, process, thread, etc).

As you probably know, in many operating systems user-space code is executed by an abstraction called threads that run inside another abstraction called processes (this includes most Unices/Unix-likes and the decent members of the Windows family). The kernel is in charge of setting up and tearing down these processes and execution threads, as well as deciding which thread will run on which logical CPU at any given time. When a process invokes Py_Initialize another abstraction comes into play, and that is the interpreter. Any Python code that runs in a process is tied to an interpreter, you can think of the interpreter as the root of all other concepts we’ll discuss. Python’s code base supports initializing two (or more) completely separate interpreters that share little state with one another. This is rather rarely done (never in the vanilla executable), because too much subtly shared state of the interpreter core and of C extensions exists between these ‘insulated’ interpreters. Still, since support exists for it and for completeness’ sake, I’ll try anyway to write this post from the standpoint of a multi-interpreter world. Anyhow, we said all execution of code occurs in a thread (or threads), and Python’s Virtual Machine is no exception. However, Python’s Virtual Machine itself is something which supports the notion of threading, so Python has its own abstraction to represent Python threads. This abstraction’s implementation is fully reliant on the kernel’s threading mechanisms, so both the kernel and Python are aware of each Python thread and Python threads execute as separate kernel-managed threads, running in parallel with all other threads in the system. Uhm, almost.

There’s an elephant in that particular living room, and that is the Global Interpreter Lock or the GIL. Due to many reasons which we’ll cover briefly today and revisit at length some other time, many aspects of Python’s CPython implementation are not thread safe. This is has some benefits, like simplifying the implementation of easy-to-screw-up pieces of code and guaranteed atomicity of many Python operations, but it also means that a mechanism must be put in place to prevent two (or more) Pythonic threads from executing in parallel, lest they corrupt each other’s data. The GIL is a process-wide lock which must be held by a thread if it wants to do anything Pythonic – effectively limiting all such work to a single thread running on a single logical CPU at a time. Threads in Python multitask cooperatively by relinquishing the GIL voluntarily so other threads can do Pythonic work; this cooperation is built-in to the evaluation loop, so ordinarily authors of Python code and some extensions don’t need to do something special to make cooperation work (from their point of view, they are preempted). Do note that while a thread doesn’t use any of Python’s APIs it can (and many threads do) run in parallel to another Pythonic thread. We will discuss the GIL again briefly later in this post and at length at a later time, but for the time being I can refer the interested readers to this excellent PyCon lecture by David Beazely for additional information about how the GIL works (how the GIL works is not the main subject of the lecture, but the explanation of how it works is very good).

With the concepts of a process (OS abstraction), interpreter(s) (Python abstraction) and threads (an OS abstraction and a Python abstraction) in mind, let’s go inside-out by zooming out from a single opcode outwards to the whole process. This should give us a good overview, since so far we mainly went inwards from the implementation of some object-centric opcodes to the actual implementation of how they operate on objects. Let’s look again at the disassembly of the bytecode generated for the simple statement spam = eggs - 1:


# what's 'diss'? see 'tools' under 'metablogging' above!
>>> diss("spam = eggs - 1")
  1           0 LOAD_NAME                0 (eggs)
              3 LOAD_CONST               0 (1)
              6 BINARY_SUBTRACT
              7 STORE_NAME               1 (spam)
             10 LOAD_CONST               1 (None)
             13 RETURN_VALUE
>>>

In addition to the actual ‘do work’ opcode BINARY_SUBTRACT, we see opcodes like LOAD_NAME (eggs) and STORE_NAME (spam). It seems obvious that evaluating such opcodes requires some storage room: eggs has to be loaded from somewhere, spam has to be stored somewhere. The inner-most data structures in which evaluation occurs are the frame object and the code object, and they point to this storage room. When you’re “running” Python code, you’re actually evaluating frames (recall ceval.c: PyEval_EvalFrameEx). For now we’re happy to lump frame objects and code objects together; in reality they are rather distinct, but we’ll explore that some other time. In this code-structure-oriented post, the main thing we care about is the f_back field of the frame object (though many others exist). In frame n this field points to frame n-1, i.e., the frame that called us (the first frame that was called in any particular thread, the top frame, points to NULL).

This stack of frames is unique to every thread and is anchored to the thread-specific structure ./Include.h/pystate.h: PyThreadState, which includes a pointer to the currently executing frame in that thread (the most recently called frame, the bottom of the stack). PyThreadState is allocated and initialized for every Python thread in a process by _PyThreadState_Prealloc just before new thread creation is actually requested from the underlying OS (see ./Modules/_threadmodule.c: thread_PyThread_start_new_thread and >>> from _thread import start_new_thread). Threads can be created which will not be under the interpreter’s control; these threads won’t have a PyThreadState structure and must never call a Python API. This isn’t so common in a Python application but is more common when Python is embedded into another application. It is possible to ‘Pythonize’ such foreign threads that weren’t originally created by Python code in order to allow them to run Python code (PyThreadState will have to be allocated for them). APIs exist that can do such a migration so long as only one interpreter exists, it is also possible though harder to do it manually in a multi-interpreter environment. I hope to revisit these APIs and their operation in a later post, possibly one about embedding. Finally, a bit like all frames are tied together in a backward-going stack of previous-frame pointers, so are all thread states tied together in a linked list of PyThreadState *next pointers.

The list of thread states is anchored to the interpreter state structure which owns these threads. The interpreter state structure is defined at ./Include.h/pystate.h: PyInterpreterState, and it is created when you call Py_Initialize to initialize the Python VM in a process or Py_NewInterpreter to create a new interpreter state for multi-interpreter processes. Mostly as an exercise to sharpen your understanding, note carefully that Py_NewInterpreter does not return an interpreter state – it returns a (newly created) PyThreadState for the single automatically created thread of the newly created interpreter. There’s no sense in creating a new interpreter state without at least one thread in it, much like there’s no sense in creating a new process with no threads in it. Similarly to the list of threads anchored to its interpreter, so does the interpreter structure have a next field which forms a list by linking the interpreters to one another.

This pretty much sums up our zooming out from the resolution of a single opcode to the whole process: opcodes belong to currently evaluating code objects (currently evaluating is specified as opposed to code objects which are just lying around as data, waiting for the opportunity to be called), which belong to currently evaluating frames, which belong to Pythonic threads, which belong to interpreters. The anchor which holds the root of this structure is the static variable ./Python/pystate.c: interp_head, which points to the first interpreter state (through that all interpreters are reachable, through each of them all thread states are reachable, and so fourth). The mutex head_mutex protects interp_head and the lists it points to so they won’t be corrupt by concurrent modifications from multiple threads (I want it to be clear that this lock is not the GIL, it’s just the mutex for interpreter and thread states). The macros HEAD_LOCK and HEAD_UNLOCK control this lock. interp_head is typically used when one wishes to add/remove interpreters or threads and for special purposes. That’s because accessing an interpreter or a thread through the head variable would get you an interpreter state rather than the interpreter state owning the currently running thread (just in case there’s more than one interpreter state).

A more useful variable similar to interp_head is ./Python/pystate.c: _PyThreadState_Current which points to the currently running thread state (important terms and conditions apply, see soon). This is how code typically accesses the correct interpreter state for itself: first find its your own thread’s thread state, then dereference its interp field to get to your interpreter. There are a couple of functions that let you access this variable (get its current value or swap it with a new one while retaining the old one) and they require that you hold the GIL to be used. This is important, and serves as an example of CPython’s lack of thread safety (a rather simple one, others are hairier). If two threads are running and there was no GIL, to which thread would this variable point? “The thread that holds the GIL” is an easy answer, and indeed, the one that’s used. _PyThreadState_Current is set during Python’s initialization or during a new thread’s creation to the thread state structure that was just created. When a Pythonic thread is bootstrapped and starts running for the very first time it can assume two things: (a) it holds the GIL and (b) it will find a correct value in _PyThreadState_Current. As of that moment the Pythonic thread should not relinquish the GIL and let other threads run without first storing _PyThreadState_Current somewhere, and should immediately re-acquire the GIL and restore _PyThreadState_Current to its old value when it wants to resume running Pythonic code. This behaviour is what keeps _PyThreadState_Current correct for GIL-holding threads and is so common that macros exist to do the save-release/acquire-restore idioms (Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS). There’s much more to say about the GIL and additional APIs to handle it and it’s probably also interesting to contrast it with other Python implementation (Jython and IronPython are thread safe and do run Pythonic threads concurrently). But we’ll leave all that to a later post.

We now have all pieces in place, so here’s a little diagram I quickly jotted1 which shows the relation between the state structures within a single process hosting Python as described so far. We have in this example two interpreters with two threads each, you can see each of these threads points to its own call stack of frames.

A Diagram of Python's State Structures
A Diagram of Python’s State Structures

Lovely, isn’t it. Anyway, something we didn’t discuss at all is why these structures are needed. I mean, what’s in them? The reason we didn’t discuss the contents so far and will only briefly mention it now is that I wanted the structure to be clear more than the features. We will discuss the roles of each of these objects as we discuss the feature that relies on that role; for example, interpreter states contain several fields dealing with imported modules of that particular interpreter, so we can talk about that when we talk about importing. That said, I wouldn’t want to leave you completely in the dark, so we’ll briefly add that in addition to managing imports they hold bunch of pointers related to handling Unicode codecs, a field to do with dynamic linking flags and a field to do with TSC usage for profiling (see last bullet here), I didn’t look into it much.

Thread states have more fields but to me they were more easily understood. Not too surprisingly, they have fields that deal with things that relate to the execution flow of a particular thread and are of too broad a scope to fit particular frame. Take for example the fields recursion_depth, overflow and recursion_critical, which are meant to trap and raise a RuntimeError during overly deep recursions before the stack of the underlying platform is exhausted and the whole process crashes. In addition to these fields, this structure accommodates fields related to profiling and tracing, exception handling (exceptions can be thrown across frames), a general purpose per-thread dictionary for extensions to store arbitrary stuff in and counters to do with deciding when a thread ran too much and should voluntarily relinquish the GIL to let other threads run.

I think this pretty much sums up what I have to say about the general layout of a Python process. I hope it was indeed simple enough to follow, I plan on getting into rough waters soon and wanted this post to sink in well with you (and me!). The next two (maybe three) posts will start chasing a real heavyweight, the frame object, which will force us to discuss namespaces and code objects. Engage.


I would like to thank Antoine Pitrou and Nick Coghlan for reviewing this article; any mistakes that slipped through are my own.


1 This is a vicious lie. I worked far harder than I’d like to on this diagram (and desperately need your encouragement in the comments for it ), trying several tools along the way (dia, inkscape, Google Docs drawing and gaphor). If you know how to make diagrams like this one, especially in SVG format, using FOSS tools that run on Ubuntu and don’t suck, please let me know. While I’m ranting, if you knows CSS and WordPress.com and can figure out why the right border of the diagram is chopped, also let me know. Hrmf.


Comments

11 responses to “Python’s Innards: pystate”

  1. You say ‘note that theoretically threads can be created which will not be under the interpreter’s control; these threads won’t have a PyThreadState structure and must never call a Python API; this is not very common’.

    I’d suggest that in embedded systems this is more common than you think. It is quite likely to be the case in an embedded system that you will have foreign threads created outside of the interpreter which will be calling into the interpreter as opposed to always having Python threads that call out.

    For the main interpreter, so long as they use the PyGILState_Ensure()/PyGILState_Release() functions appropriately around the call into Python API, the thread state will be created on demand.

    It gets more complicated for sub interpreters as the PyGILState_???() functions only work for main interpreter. In sub interpreters you have to create the thread state objects manually through other APIs and this can get tricky as you need to avoid having multiple thread states for same thread against same sub interpreter. To ensure that thread local storage persists across calls into Python API, you should reuse the thread state rather than destroying it after each use.

    For an example of thread state usage and differences between main interpreter/sub interpreter usage, suggest you have a look at code for mod_wsgi.

    You should also perhaps look more at what PyGILState_Ensure()/PyGILState_Release() do in contect of main interpreter and how they allow foreign threads with no existing thread state to call into Python API.

    1. Thanks for the detailed comment, Graham!

      re. non-Python threads and embedded systems: This is the obvious use-case for such threads. I didn’t do a survey, but I tend to think there is far more CPython code running unembedded than otherwise. I might be wrong with that assumption, but when I said ‘not very common’, I was thinking about that.

      re. PyGILState_*, in writing each article I often tiptoe around terms which I don’t want to introduce, so as not to confuse my readers (or myself, I’m new here!). This article isn’t about the GIL, it’s about pystate.c. I know the topics are related, but then again, if I were to chase every related topic I’d never publish! When I said ‘There’s much more to say about the GIL and additional APIs to handle it’, these APIs were exactly what I was talking about, but, alas, we’ll have to get there some other day indeed.

      On a side-ish note, I’m honored to see the owner of mod_wsgi here! Perhaps you’d care to read a bit about Labour, a WSGI server ‘durability benchmark’ following Ian Bicking’s suggestion, I’d greatly appreciate your comment on that. Labour doesn’t support mod_wsgi (yet?), but I sorta despaired of the effort as I felt not many people are interested and I wasn’t certain I’m heading in the right direction.

      1. The PyGILState_* functions are a bit deceptively named as the prime reasons for their existence is the creation of a thread state when none exists against the thread already and the subsequent acquiring and release of that thread state object without actually having a reference to it. The GIL is really a side issue and they could have implemented the PyGILState_Ensure() function with a prerequisite that you have already separately acquired the GIL and it would still have had a purpose. It was likely just more convenient to have it also acquire the GIL as well to avoid a separate function call.

        As to durability testing of WSGI servers, there are people interested in it, including myself, but I just don’t have to the time to do much these days on anything let alone that area. I imagine it is the same for others who care as well, other things just take priority.

        1. Hmm. Yes, I understand what you’re saying, I’ll see how I can update the post accordingly.

          re. durability testing, I wasn’t expecting active participation (not that I’d object :), but more some comment regarding if the effort is in the right direction. That said, I’m aware that comments take time to write, too. It can wait.

  2. Re: your CSS question, the reason your right border is chopped is because:

    1. The image + border is greather than 460px, the size of the #content div ( in the markup).

    2. The CSS style of your content div is overflow: hidden, which means when a block element of greater width than it appears, it will actually be “hidden” inside the content area.

    So, your three options: remove overflow: hidden from #content in your CSS style, make #content width bigger relative to your sidebar, or make the image + border less than 460px to fit inside your content area.

    1. Damn, the Internet is a cool thing. 🙂

      Thanks for the help, your solution did the trick (and made me understand a WordPress.com setting).

  3. About your plea for help with diagramming on Linux. I had the exactly same problem, having tried all these tools you’ve mentioned to easily create diagrams. At work I’m using Visio all the time to easily draw any diagram I have enough creativity to envision.

    Eventually I broke down and bought a discounted version of Visio 2007 via work, and this is what I use now, happily (but guiltily) ever after. I run it on Windows, of course, but I guess it can also be done through something like Wine (or worst things worst, a Windows installation on top of VirtualBox)

    1. How depressing. I too need a good Linux based diagramming tool. I can’t believe we don’t have one. Inkscape really should do the job but every time I try to learn it I give up in frustration.

      1. Tracy, Inkscape is a wonderful piece of software, but it is not a diagramming tool! Trying to use it as such can only bring you misery. Inkscape is great for drawings and shapes, which are the atoms of diagrams, but not for connecting those.

  4. coot1864 Avatar
    coot1864

    diagrams: you could use tikz latex package (google “tikz diagrams” to see lots of examples). Sphinx (the python documentation module http://www.sphinx-doc.org) has an extension to make tikz diagrams https://bitbucket.org/philexander/tikz .