Python’s Innards: Hello, ceval.c!

Sep 2, 2010

—

The “Python’s Innards” series owes its existence, at least in part, to hearing one of the Python-Fu masters in my previous workplace say something about a switch statement so large that it was needed to break it up just so some compilers won’t choke on it. I remember thinking then: “Choke the compiler with a switch? Hrmf, let me see that code.” Turns out that this switch can be found in ./Python/ceval.c: PyEval_EvalFrameEx and it switches over the current opcode, invoking its implementation. If I had to summarize all of CPython into one line, I’d probably choose that switch (actually I’d refuse, but humour me by assuming I was at gunpoint or something). This choice is rather subjective, as arguably there are more complex/interesting bits in Python’s object system (explored here and there) or parser/compiler related code. But I can’t help seeing that line, and its surrounding function and file, as the ‘do-work’ heart of CPython.

The reason I didn’t start the series from this heart is that I thought it would be too hard (mostly for the author…). Thanks to what we (well, at least I) learned in the previous posts, I think we can now understand it quite well. I’ll try to link backwards as necessary throughout the article, but if you haven’t followed the series so far, you’d probably do much better if you went back and read some of the previous articles before tackling this one. Also, for brevity’s sake in this post, I won’t qualify the file ./Python/ceval.c and the function PyEval_EvalFrameEx in it. Finally, remember that usually in the series when I quote code, I may note that I edited it, and in that case I often prefer clarity and brevity over accuracy; this is true for this post as well, only much more so, excerpts here might bear only slight resemblance to the real code.

So, where were we… Ah, yes, monstrous switch statement. Well, as I said, this switch can be found in the rather lengthy file ceval.c, in the rather lengthy function PyEval_EvalFrameEx, which takes more than half the file’s lines (it’s roughly 2,250 lines, the file is about 4,400). PyEval_EvalFrameEx implements CPython’s evaluation loop, which is to say that it’s a function that takes a frame object and iterates over each of the opcodes in its associated code object, evaluating (interpreting, executing) each opcode within the context of the given frame (this context is chiefly the associated namespaces and interpreter/thread states). There’s more to ceval.c than PyEval_EvalFrameEx, and we may discuss some of the other bits later in this post (or perhaps a follow-up post), but PyEval_EvalFrameEx is obviously the most important part of it.

Having described the evaluation loop in the previous paragraph, let’s see what it looks like in C (edited):

 PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)
{
    /* variable declaration and initialization stuff */
    for (;;) {
        /* do periodic housekeeping once in a few opcodes */
        opcode = NEXTOP();
        if (HAS_ARG(opcode)) oparg = NEXTARG();
        switch (opcode) {
            case NOP:
                goto fast_next_opcode;
            /* lots of more complex opcode implementations */
            default:
                /* become rather unhappy */
        }
        /* handle exceptions or runtime errors, if any */
    }
    /* we are finished, pop the frame stack */
    tstate->frame = f->f_back;
    return retval;
}

As you can see, iteration over opcodes is infinite (forever: fetch next opcode, do stuff), breaking out of the loop must be done explicitly. CPython (reasonably) assumes that evaluated bytecode is correct in the sense that it terminates itself by raising an exception, returning a value, etc. Indeed, if you were to synthesize a code object without a RETURN_VALUE at its end and execute it (exercise to reader: how?¹), you’re likely to execute rubbish, reach the default handler (raises a SystemError) or maybe even segfault the interpreter (I didn’t check this thoroughly, but it looks plausible).

The evaluation loop may look fairly simple so far, but I kept back an important piece: I snipped about 1,450 lines of opcode implementations from within that big switch, all of them presumably more complex than a NOP. In order for you to be able to get a feel for what more serious opcode implementations look like, here’s the (edited) implementation of three more opcodes, illustrating a few more principles:

 case BINARY_SUBTRACT:
                w = *–stack_pointer; /* value stack POP */
                v = stack_pointer[-1];
                x = PyNumber_Subtract(v, w);
                stack_pointer[-1] = x; /* value stack SET_TOP */
                if (x != NULL) continue;
                break;
            case LOAD_CONST:
                x = PyTuple_GetItem(f->f_code->co_consts, oparg);
                *stack_pointer++ = x; /* value stack PUSH */
                goto fast_next_opcode;
            case SETUP_LOOP:
            case SETUP_EXCEPT:
            case SETUP_FINALLY:
                PyFrame_BlockSetup(f, opcode, INSTR_OFFSET() + oparg,
                           STACK_LEVEL());
                continue;

We see several things. First, we see a typical value manipulation opcode, BINARY_SUBTRACT. This opcode (and many others) works with values on the value stack as well as with a few temporary variables, using CPython’s C-API abstract object layer (in our case, a function from the number-like object abstraction) to replace the two top values on the value stack with the single value resulting from subtraction. As you can see, a small set of temporary variables, such as v, w and x are used (and reused, and reused…) as the registers of the CPython VM. The variable stack_pointer represents the current bottom of the stack (the next free pointer in the stack). This variable is initialized at the beginning of the function like so: stack_pointer = f->f_stacktop;. In essence, together with the room reserved in the frame object for that purpose, the value stack is this pointer. To make things simpler and more readable, the real (unedited by me) code of ceval.c defines several value stack manipulation/observation macros, like PUSH, TOP or EMPTY. They do what you imagine from their names.

Next, we see a very simple opcode that loads values from somewhere into the valuestack. I chose to quote LOAD_CONST because it’s very brief and simple, although it’s not really a namespace related opcode. “Real” namespace opcodes load values into the value stack from a namespace and store values from the value stack into a namespace; LOAD_CONST loads constants, but doesn’t fetch them from a namespace and has no STORE_CONST counterpart (we explored all this at length in the article about namespaces). The final opcode I chose to show is actually the single implementation of several different control-flow related opcodes (SETUP_LOOP, SETUP_EXCEPT and SETUP_FINALLY), which offload all details of their implementation to the block stack manipulation function PyFrame_BlockSetup; we discussed the block stack in our discussion of interpreter stacks.

Something we can observe looking at these implementations is that different opcodes exit the switch statement differently. Some simply break, and let the code after the switch resume. Some use continue to start the for loop from the beginning. Some goto various labels in the function. Each exit has different semantic meaning. If you break out of the switch (the ‘normal’ route), various checks will be made to see if some special behaviour should be performed – maybe a code block has ended, maybe an exception was raised, maybe we’re ready to return a value. Continuing the loop or going to a label lets certain opcodes take various shortcuts; no use checking for an exception after a NOP or a LOAD_CONST, for instance.

That’s pretty much it. I can’t really say we’re done (not at all), but this is pretty much the gist of PyEval_EvalFrameEx. Simple, eh? Well, yeah, simple, but I lied a bit with the editing to make it simpler. For example, if you look at the code itself, you will see that none of the case expressions for the big switch are really there. The code for the NOP opcode is actually (remember this series is about Python 3.x unless noted otherwise, so this snippet is from Python 3.1.2):

 TARGET(NOP)
    FAST_DISPATCH();

TARGET? FAST_DISPATCH? What are these? Let me explain. Things may become clearer if we’d look for a moment at the implementation of the NOP opcode in ceval.c of Python 2.x. Over there the code for NOP looks more like the samples I’ve shown you so far, and it actually seems to me that the code of ceval.c gets simpler and simpler as we look backwards at older revisions of it. The reason is that although I think PyEval_EvalFrameEx was originally written as a really exceptionally straightforward piece of code, over the years some necessary complexity crept into it as various optimizations and improvements were implemented (I’ll collectively call them ‘additions’ from now on, for lack of a better term).

To further complicate matters, many of these additions are compiled conditionally with preprocessor directives, so several things are implemented in more than one way in the same source file. In the larger code samples I quoted above, I liberally expanded some preprocessor directives using their least complex expansion. However, depending on compilation flags, these and other preprocessor directives might expand to something else, possibly a more complicated something. I can understand trading simplicity to optimize a tight loop which is used very often, and the evaluation loop is probably one of the more used loops in CPython (and probably as tight as its contributors could make it). So while this is all very warranted, it doesn’t help the readability of the code.

Anyway, I’d like to enumerate these additions here explicitly (some in more depth than others); this should aid future discussion of ceval.c, as well as prevent me from feeling like I’m hiding too many important things with my free spirited editing of quoted code. Fortunately, most if not all these additions are very well commented -actually, some of the explanations below will be just summaries or even taken verbatim from these comments, as I believe that they’re accurate (eek!). So, as you read PyEval_EvalFrameEx (and indeed ceval.c in general), you’re likely to run into any of these:

“Threaded Code” (Computed-GOTOs)

Let’s start with the addition that gave us TARGET, FAST_DISPATCH and a few other macros. The evaluation loop uses a “switch” statement, which decent compilers optimize as a single indirect branch instruction with a lookup table of addresses. Alas, since we’re switching over rapidly changing opcodes (it’s uncommon to have the same opcode repeat), this would have an adverse effect on the success rate of CPU branch prediction. Fortunately gcc supports the use of C-goto labels as values, which you can generally pass around and place in an array (restrictions apply!). Using an array of adresses in memory obtained from labels, as you can see in ./Python/opcode_targets.h, we create an explicit jump table and place an explicit indirect jump instruction at the end of each opcode. This improves the success rate of CPU prediction and can yield as much as 20% boost in performance.

Thus, for example, the NOP opcode is implemented in the code like so:

 TARGET(NOP)
    FAST_DISPATCH();

In the simpler scenario, this would expand to a plain case statement and a goto, like so:


        case NOP:
            goto fast_next_opcode;

But when threaded code is in use, that snippet would expand to (I highlighted the lines where we actually move on to the next opcode, using the dispatch table of label-values):


        TARGET_NOP:
            opcode = NOP;
            if (HAS_ARG(NOP))
                oparg = NEXTARG();
        case NOP:
            {
                if (!_Py_TracingPossible) {
                    f->f_lasti = INSTR_OFFSET();
                    goto *opcode_targets[*next_instr++];
                }
                goto fast_next_opcode;
            }

Same behaviour, somewhat more complicated implementation, up to 20% faster Python. Nifty.

Opcode Prediction

Some opcodes tend to come in pairs. For example, COMPARE_OP is often followed by JUMP_IF_FALSE or JUMP_IF_TRUE, themselves often followed by a POP_TOP. What’s more, there are situations where you can determine that a particular next-opcode can be run immediately after the execution of the current opcode, without going through the ‘outer’ (and expensive) parts of the evaluation loop. PREDICT (and a few others) are a set of macros that explicitly peek at the next opcode and jump to it if possible, shortcutting most of the loop in this fashion (i.e., if (*next_instr == op) goto PRED_##op). Note that there is no relation to real hardware here, these are simply hardcoded conditional jumps, not an exploitation of some mechanism in the underlying CPU (in particular, it has nothing to do with “Threaded Code” described above).

Low Level Tracing

An addition primarily geared towards those developing CPython (or suffering from a horrible, horrible bug). Low Level Tracing is controlled by the LLTRACE preprocessor name, which is enabled by default on debug builds of CPython (see --with-pydebug). As explained in ./Misc/SpecialBuilds.txt: when this feature is compiled-in, PyEval_EvalFrameEx checks the frame’s global namespace for the variable __lltrace__. If such a variable is found, mounds of information about what the interpreter is doing are sprayed to stdout, such as every opcode and opcode argument and values pushed onto and popped off the value stack. Not useful very often, but very useful when needed.

This is the what the low level trace output looks like (slightly edited):


>>> def f():
...     global a
...     return a - 5
... 
>>> dis(f)
  3           0 LOAD_GLOBAL              0 (a) 
              3 LOAD_CONST               1 (5) 
              6 BINARY_SUBTRACT      
              7 RETURN_VALUE         
>>> exec(f.__code__, {'__lltrace__': 'foo', 'a': 10})
0: 116, 0
push 10
3: 100, 1
push 5
6: 24
pop 5
7: 83
pop 5
# trace of the end of exec() removed
>>>

As you can guess, you’re seeing a real-time disassembly of what’s going through the VM as well as stack operations. For example, the first line says: line 0, do opcode 116 (LOAD_GLOBAL) with the operand 0 (expands to the global variable a), and so on, and so forth. This is a bit like (well, little more than) adding a bunch of printf calls to the heart of VM.

Advanced Profiling

Under this heading I’d like to briefly discuss several profiling related additions. The first relies on the fact that some processors (notably Pentium descendants and at least some PowerPCs) have built-in wall time measurement capabilities which are cheap and precise (correct me if I’m wrong). As an aid in the development of a high-performance CPython implementation, Python 2.4’s ceval.c was instrumented with the ability to collect per-opcode profiling statistics using these counters. This instrumentation is controlled by the somewhat misnamed --with-tsc configuration flag (TSC is an Intel Pentium specific name, and this feature is more general than that). Calling sys.settscdump(True) on an instrumented interpreter will cause the function ./Python/ceval.c: dump_tsc to print these statistics every time the evaluation loop loops.

The second advanced profiling feature is Dynamic Execution Profiling. This is only available if Python was built with the DYNAMIC_EXECUTION_PROFILE preprocessor name. As ./Tools/scripts/analyze_dxp.py says, [this] will tell you which opcodes have been executed most frequently in the current process, and, if Python was also built with -DDXPAIRS, will tell you which instruction _pairs_ were executed most frequently, which may help in choosing new instructions. One last thing to add here is that enabling Dynamic Execution Profiling implicitly disables the “Threaded Code” addition.

The third and last addition in this category is function call profiling, controlled by the preprocessor name CALL_PROFILE. Quoting ./Misc/SpecialBuilds.txt again: When this name is defined, the ceval mainloop and helper functions count the number of function calls made. It keeps detailed statistics about what kind of object was called and whether the call hit any of the special fast paths in the code.

Extra Safety Valves

Two preprocessor names, USE_STACKCHECK and CHECKEXC include extra assertions. Testing an interpreter with these enabled may catch a subtle bug or regression, but they are usually disabled as they’re too expensive.

These are the additions I found, grepping ceval.c for #ifdef. I think we’ll call it a day here, although we’re by no means finished. For example, I’d like to devote a separate post to exceptions, which is where we can discuss the tail of the evaluation loop (everything after the big switch and before the end of the big for), which we merely skimmed today. I’d also like to devote a whole post to locking and synchronization (including the GIL), which we touched upon before but never covered properly. Last but really not least, there’s about 2,000 other lines in ceval.c which we didn’t cover today; none of them are as important as PyEval_EvalFrameEx, but we need to talk at least about some of them.

All these things taken into account, I think we can say that today we finally conquered the evaluation loop. This isn’t the end of the series, far from it, but I do see it as a milestone. “Hooray”, I believe the saying goes. I hope you’re enjoying the show, thanks for the supportive comments (they keep me going), and I’ll see you in the next post.

I would like to thank Nick Coghlan for reviewing this article; any mistakes that slipped through are my own.

¹Lazy or timid readers may choose to defer to Nick Coghlan’s example of one way he did it; I urge you not to look there and solve it on your own, it’s rather easy.

Comments

8 responses to “Python’s Innards: Hello, ceval.c!”

Nick Coghlan

September 3, 2010

And the curious can search ceval.c for CASE_TOO_BIG to see that yes, indeed, there is a #define that allows configure and/or pyport.h to break the switch into two separate pieces 🙂
Eli Bendersky

September 6, 2010

Yaniv, nice job differentiating between the simple core and the optimized & complicated outer shell. As always, optimization and clarity are in a clash here and optimization wins. When trying to understand how these things work under the hood, optimization tricks are a big stumbling block, and simplifying them as much as possible is crucial.
Abafei Software

March 28, 2011

Yasher Koach!

Amazingly explained!
Martin

April 15, 2013

I was very much enjoying this series. Sad to see you never finished with the promised articles on Exceptions, the GIL, etc. maybe this comment will be enough to drive you onwards, 2 years later. 😉
1. Yaniv Aknin
  
  April 15, 2013
  
  Glad you enjoyed it, I’ll do my best to carry on. I became a father in those past two years… 🙂
  1. pdehaye
    
    November 24, 2013
    
    Congratulations on becoming a father. I will second those encouragements to continue the series… You are very gifted at explaining those concepts, and reducing them to their essence.
Andrei Kopats

September 10, 2013

Yaniv, it would be nice, if you created page with index (Table of contents) of Python’s Innards articles.
I found only page with all articles: http://tech.blog.aknin.name/category/my-projects/pythons-innards/, but it is very difficult to find titles on long page, especially when surfing from phone.
soundbbg

July 3, 2014

Grate post! I really love it! Could I make sure one thing about generator?

Every Generator Object has a Frame Object when created, Frame has Value Stack to store values. So every time run a generator by using next(), function gen_iternext called and it call gen_send_ex actually.

Generator Object is a controller of it’s own Frame, it run the code (I image frame as a code object, so it will be easier to understand) and PyEval_EvalFrameEx then goes to `yield` (opcode YIELD_VALUE), do something and pop the result from stack and return. Frame will not be cleaned and stores in memory. So context is saved.

Next time Generator call next, still the same, run the code and PyEval_EvalFrameEx begins from Frame and Frame is not empty, so we have some value and context. Then it goes to `yield`, still do something and pop the result.

Only then Generator Object ref count equals 0, it will be cleaned by GC, include it’s own Frame.

Am I right? Or something wrong?