Python first appeared in 1991, which makes it one of the older major languages still in widespread use, yet most developers have never looked under the hood. We type python my_script.py and assume magic happens. It does not. What happens is a carefully orchestrated sequence of parsing, compiling, interpreting, memory allocation, and locking, all inside a C program called CPython.
This post walks through every layer of CPython: how your source code becomes bytecode, how the stack machine executes it, how the GIL prevents the CPU from catching fire, how async/await enables concurrency without threads, and how the descriptor protocol powers @property, @staticmethod, and the entire class system. We end with the bleeding edge — NoGIL, subinterpreters, and PyPy’s JIT.
Each major section includes an interactive demo so you can see these mechanisms in action rather than just reading about them.
When people say “Python,” they almost always mean CPython, the C-based reference implementation developed by the Python core team under the umbrella of the Python Software Foundation. CPython’s job is to read your .py files and execute them. But it is not the only implementation:
| Implementation | Language | Key Feature |
|---|---|---|
| CPython | C | Reference implementation, C extension API |
| PyPy | RPython (subset of Python) | JIT compiler, often 4-7x faster |
| Jython | Java | Runs on JVM, Java interop |
| IronPython | C# | Runs on .NET CLR |
| MicroPython | C | For microcontrollers, minimal memory |
CPython dominates because every new Python feature is designed for it first. C extensions (NumPy, pandas, TensorFlow) target its C API. The Python ecosystem runs on CPython.
Three design decisions define how CPython behaves at runtime: the Global Interpreter Lock (GIL), the reference-counting memory manager, and the stack-based bytecode VM. All three were made early in Python’s history, and they shape every program you write.
Python is called an “interpreted language,” but that is only half the story. CPython does compile your source code — just not to machine code. It compiles to bytecode, an intermediate representation that the CPython virtual machine interprets at runtime.
The full pipeline is:
.py source file --> Parser --> AST --> Compiler --> Bytecode (.pyc) --> CEval loop --> OS
Each step transforms your code into a lower-level representation. The tokenizer turns characters into tokens, and the parser builds an Abstract Syntax Tree (AST) from them. The compiler walks the AST and emits bytecode instructions. The CEval loop is a giant C switch statement that reads each bytecode and executes the corresponding C code.
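The .pyc files you see in __pycache__/ are cached bytecode for imported modules. If a module’s source file has not changed, CPython skips the parse and compile steps entirely and loads the bytecode directly. The script you pass on the command line is compiled on every run and is not cached.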
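The pipeline can be driven by hand from Python itself. The following is a minimal sketch using the standard ast module and the built-in compile() and exec(); the one-line source string and the file name my_script.py are made up for illustration.

```python
import ast
import importlib.util

source = "x = 2 + 3\n"

# Step 1: parse the source text into an AST.
tree = ast.parse(source, filename="<example>")

# Step 2: compile the AST into a code object, which holds the bytecode.
code = compile(tree, filename="<example>", mode="exec")

# Step 3: hand the code object to the interpreter loop.
namespace = {}
exec(code, namespace)

print(type(tree).__name__)   # Module
print(type(code).__name__)   # code
print(namespace["x"])        # 5

# Where the bytecode cache for an imported module would be written.
# "my_script.py" is a hypothetical file name; the exact path depends
# on the interpreter version.
print(importlib.util.cache_from_source("my_script.py"))
```

The raw bytecode is available as code.co_code, a bytes object, which is the same payload that ends up inside a .pyc file.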
This two-phase design (compile to bytecode, then interpret) is the same strategy used by Java (JVM bytecode), Lua, and early versions of JavaScript. It gives you portability across architectures while keeping the implementation simpler than a full native compiler.
CPython is a stack-based VM. Most bytecode instructions operate on a stack: they pop values, compute something, and push the result back. This is different from register-based VMs (like Lua 5’s) where operations name registers directly.
Consider a simple function:
def add(a, b):
    return a + b
CPython compiles this to roughly:
import dis
dis.dis(add)
# Output on CPython 3.10 and earlier (3.11+ shows BINARY_OP in place of BINARY_ADD):
#   2           0 LOAD_FAST                0 (a)
#               2 LOAD_FAST                1 (b)
#               4 BINARY_ADD
#               6 RETURN_VALUE
Here is what happens at each bytecode:
- LOAD_FAST a: Push the value of local variable a onto the stack.
- LOAD_FAST b: Push the value of b onto the stack. The stack now holds [a, b].
- BINARY_ADD: Pop the top two values (b, then a), compute a + b, and push the result.
- RETURN_VALUE: Pop the top value and return it to the caller.

The stack is the VM’s scratch space. It lives in the frame of the running function. Every function call pushes a new frame onto the call stack (not the data stack), and that frame has its own data stack, local variables array, and instruction pointer.
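To make the push and pop behavior concrete, here is a toy stack machine that understands only the four instructions discussed above plus LOAD_CONST. It is a sketch for illustration, not how CPython is implemented: the run function and the tuple-based instruction format are invented for this example.

```python
def run(instructions, local_vars):
    """Execute a tiny subset of stack-machine instructions (a toy, not CPython)."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])      # push a local variable
        elif opname == "LOAD_CONST":
            stack.append(arg)                  # push a constant
        elif opname == "BINARY_ADD":
            right = stack.pop()                # top of stack is the right operand
            left = stack.pop()
            stack.append(left + right)
        elif opname == "RETURN_VALUE":
            return stack.pop()                 # result is whatever is on top
        else:
            raise ValueError(f"unsupported instruction: {opname}")
    raise RuntimeError("program ended without RETURN_VALUE")


program = [
    ("LOAD_FAST", "a"),
    ("LOAD_FAST", "b"),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
result = run(program, {"a": 2, "b": 3})
print(result)  # 5
```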
The demo below lets you step through any of several example functions bytecode by bytecode, watching the stack grow and shrink with each instruction.
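Other common bytecodes include STORE_FAST (pop stack into local variable), LOAD_CONST (push a constant like a number or string), CALL_FUNCTION (pop N arguments and call a function; replaced by CALL in 3.11), and BUILD_LIST (pop N values and build a list). Opcode names and numbers change between releases. The instruction set for your version is listed in the dis module documentation and defined in Include/opcode.h (Include/opcode_ids.h from 3.13) in the CPython source.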
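Because the instruction set shifts between versions, it is often more useful to inspect bytecode programmatically than to memorize opcode names. A small sketch using dis.get_instructions from the standard library; the exact opcode names printed depend on the interpreter version, so none are shown here.

```python
import dis


def add(a, b):
    return a + b


# Each Instruction carries the offset, the opcode name, and a readable argument.
instructions = list(dis.get_instructions(add))
for ins in instructions:
    print(ins.offset, ins.opname, ins.argrepr)

opnames = [ins.opname for ins in instructions]

# The compiler also records how deep the data stack can get for this function.
print("max stack depth:", add.__code__.co_stacksize)
```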
The GIL is arguably the most controversial feature of CPython. It is a single mutex that protects access to Python objects, preventing two threads from executing Python bytecode at the same time.
Why does it exist? CPython’s memory management uses reference counting (see next section). Each object has a refcount field that tracks how many references point to it. When the refcount hits zero, the object is deallocated. Multiple threads modifying an object’s refcount simultaneously would corrupt memory. The simplest fix is a global lock: only one thread runs Python code at a time.
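The GIL is handed over periodically. Before Python 3.2, the interpreter offered to release it every 100 “ticks” (roughly, bytecode instructions), tunable with sys.setcheckinterval(). Since Python 3.2, a thread that is waiting for the GIL sets a drop request once the switch interval has elapsed (5 ms by default, see sys.getswitchinterval() and sys.setswitchinterval()), and the running thread releases the GIL at its next safe point in the eval loop. Threads also give up the GIL voluntarily around blocking I/O.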
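The impact on performance depends on the workload:
- CPU-bound Python code: threads take turns holding the GIL, so adding threads does not add throughput.
- I/O-bound code: a thread releases the GIL while it waits on I/O, so other threads can run in the meantime.
- C extensions: native code can release the GIL around long-running work with the Py_BEGIN_ALLOW_THREADS macro. NumPy does this, so heavy matrix operations run without the GIL.

The demo below shows two threads competing for the GIL. In CPU mode, they alternate every ~5ms. In I/O mode, the working thread releases the GIL when it hits an I/O operation, letting the other thread run immediately.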
The GIL is the reason Python has multiprocessing (separate processes, each with its own GIL) as an alternative to threading (same process, shared GIL). It is also why async/await became so important — it provides concurrency without needing threads at all.
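The effect on CPU-bound threads is easy to measure. The sketch below times the same pure-Python loop run twice sequentially and then in two threads; countdown and N are names chosen for this example. On a standard GIL-enabled build the threaded run is usually no faster, but exact timings depend on the machine and interpreter, so no numbers are claimed here.

```python
import sys
import threading
import time


def countdown(n):
    # Pure-Python CPU work: holds the GIL the whole time it runs.
    while n > 0:
        n -= 1


N = 1_000_000

start = time.perf_counter()
countdown(N)
countdown(N)
sequential = time.perf_counter() - start

threads = [threading.Thread(target=countdown, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

switch_interval = sys.getswitchinterval()
print(f"switch interval: {switch_interval * 1000:.1f} ms")
print(f"sequential: {sequential:.3f} s, two threads: {threaded:.3f} s")
```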
Every Python object is represented in C as PyObject*, a pointer to a struct that begins with two fields:
typedef struct _object {
    Py_ssize_t ob_refcnt;    // reference count
    PyTypeObject *ob_type;   // pointer to the type object
} PyObject;
ob_refcnt is the reference count. When you write:
a = [1, 2, 3] # refcount = 1 (a points to it)
b = a # refcount = 2 (both a and b point to it)
del a # refcount = 1 (b still points to it)
del b # refcount = 0 (deallocated)
Every assignment (b = a), function argument pass, and container insertion increments the refcount. Every del, rebinding, or container removal decrements it. When the refcount hits zero, CPython immediately frees the object’s memory.
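These increments and decrements can be observed with sys.getrefcount. A minimal sketch: the absolute numbers may include a temporary reference held by the call itself, so the example only compares differences between readings.

```python
import sys

a = [1, 2, 3]
before = sys.getrefcount(a)        # baseline (may include a temporary reference)

b = a                              # a second name for the same list
after_alias = sys.getrefcount(a)

container = [a]                    # storing it in a container adds a reference
after_insert = sys.getrefcount(a)

del b                              # dropping the name removes one reference
container.clear()                  # emptying the container removes the other
after_release = sys.getrefcount(a)

print(after_alias - before)        # 1
print(after_insert - after_alias)  # 1
print(after_release - before)      # 0
```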
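This is deterministic memory management. Unlike a tracing GC (like Java’s or Go’s) that runs periodically, CPython frees objects as soon as their last reference goes away. This is why code such as open(path).read() closes the file promptly in CPython: the temporary file object is finalized the moment its refcount hits zero. Context managers (with open(...) as f:) do not depend on this. They close the file explicitly when the block exits, which is why they are the portable choice across implementations.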
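The timing of deallocation can be made visible with a __del__ method. This sketch assumes CPython's reference counting; other implementations may run the finalizer later. Resource and the events list are names invented for the example.

```python
events = []


class Resource:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        # Runs when the last reference disappears (immediately on CPython).
        events.append(f"finalized {self.name}")


r = Resource("r1")
alias = r

del r
events.append("after del r")        # still alive: alias holds a reference

del alias
events.append("after del alias")    # finalizer has already run by now

print(events)
# ['after del r', 'finalized r1', 'after del alias']
```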
Reference counting has one major flaw: cycles. If object A points to B and B points to A (but nothing else points to either), both keep a refcount of 1, so reference counting alone will never free them:
class Node:
    def __init__(self):
        self.next = None
a = Node()
b = Node()
a.next = b
b.next = a
del a # refcount of a = 1 (b.next still points to it)
del b # refcount of b = 1 (a.next still points to it)
For this, CPython has a separate garbage collector (the gc module). It runs periodically, finds cycles, and frees unreachable cyclic garbage. The GC is generational (three generations) — most objects die young and are collected quickly; survivors are promoted.
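The difference between the two mechanisms can be checked directly. In this sketch a weak reference is used as a probe, since it observes an object without keeping it alive; automatic collection is switched off for the duration so that only the explicit gc.collect() call can reclaim the cycle.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.next = None


gc.disable()                     # keep the automatic collector out of the demo
try:
    a = Node()
    b = Node()
    a.next = b
    b.next = a
    probe = weakref.ref(a)       # observes a without adding a strong reference

    del a, b                     # refcounts drop to 1 each, not 0
    alive_after_del = probe() is not None

    gc.collect()                 # cycle detector finds and frees the pair
    alive_after_collect = probe() is not None
finally:
    gc.enable()

print(alive_after_del)      # True
print(alive_after_collect)  # False
```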
The gc module implements a generational collector. Objects are placed into generation 0 when created. If they survive a collection, they move to generation 1, then generation 2. Older generations are collected less frequently.
The GC runs when the number of newly allocated objects minus the number of deallocated objects exceeds a threshold (default 700 for generation 0). You can inspect and tune these thresholds:
import gc
gc.get_threshold() # (700, 10, 10)
# Manually trigger a collection
gc.collect()
The GC only tracks container objects (list, dict, set, tuple, and custom class instances). Atomic objects like strings and integers hold no references to other objects, so they cannot take part in a cycle and the GC ignores them. Reference counting handles all non-cyclic garbage immediately.
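Which objects the collector tracks can be queried with gc.is_tracked. A short sketch; Point is a throwaway class for the example, and only cases whose result is stable across versions are shown.

```python
import gc


class Point:
    pass


tracked = {
    "int": gc.is_tracked(42),            # atomic: cannot reference other objects
    "str": gc.is_tracked("hello"),       # atomic as well
    "list": gc.is_tracked([]),           # container: may end up in a cycle
    "instance": gc.is_tracked(Point()),  # instances can hold arbitrary references
}
print(tracked)
# {'int': False, 'str': False, 'list': True, 'instance': True}
```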
The demo below lets you create objects, link them (creating references), delete references, and watch the refcount change in real time. A “Force GC” button triggers cycle detection and shows if the GC freed any cyclic garbage.
Understanding refcounts and the GC explains many Python behaviors: why del sometimes frees memory immediately (no cycle) and sometimes does nothing (still referenced elsewhere); why gc.collect() occasionally recovers significant memory in long-running apps; and why cyclic data structures such as doubly linked lists and graphs with back-references often use weakref, so their memory is released promptly instead of waiting for the next GC pass.
Python’s async/await system (asyncio) is a cooperative multitasking framework. Tasks voluntarily yield control at await points. No extra threads are involved: a single thread runs an event loop that schedules tasks.
Here is the core mechanism:
import asyncio
async def fetch_data(url):
    print(f"Fetching {url}")
    await asyncio.sleep(1)  # yield control, resume after 1 second
    print(f"Done {url}")
    return f"data from {url}"

async def main():
    # Both tasks run concurrently on a single thread
    task1 = asyncio.create_task(fetch_data("a.com"))
    task2 = asyncio.create_task(fetch_data("b.com"))
    r1 = await task1
    r2 = await task2
    return r1, r2
asyncio.run(main())
When await asyncio.sleep(1) is reached, the task suspends and returns control to the event loop. The event loop checks its list of scheduled callbacks and runs any task whose sleep has expired. Because there is no thread switch (no OS involvement), the overhead per task switch is tiny: a few function calls in user space, with no kernel-level context switch.
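The interleaving becomes visible if each task records when it starts and finishes. A small sketch with two hypothetical workers and made-up delays; asyncio.gather runs both on the same thread and returns results in argument order.

```python
import asyncio

log = []


async def worker(name, delay):
    log.append(f"{name} start")
    await asyncio.sleep(delay)      # suspends here, the loop runs other tasks
    log.append(f"{name} done")
    return name


async def main():
    return await asyncio.gather(worker("slow", 0.2), worker("fast", 0.05))


results = asyncio.run(main())
print(log)      # ['slow start', 'fast start', 'fast done', 'slow done']
print(results)  # ['slow', 'fast']
```

Both workers start before either finishes, and the shorter sleep completes first even though it was scheduled second.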
Compare with time.sleep(1):
import time
def blocking_fetch(url):
    print(f"Fetching {url}")
    time.sleep(1)  # BLOCKS the entire thread
    print(f"Done {url}")
    return f"data from {url}"
time.sleep(1) blocks the OS thread. If you are using threads, that thread is stuck. If you are using asyncio, the entire event loop is stuck — no other task can run. This is why you must never call blocking functions inside async code without using asyncio.to_thread() or loop.run_in_executor() to offload them to a thread pool.
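Offloading looks like this in practice. A minimal sketch using asyncio.to_thread (available from Python 3.9); the shortened sleep and the URLs are placeholders, and the measured time will vary by machine.

```python
import asyncio
import time


def blocking_fetch(url):
    time.sleep(0.1)                 # stands in for a blocking network call
    return f"data from {url}"


async def main():
    start = time.perf_counter()
    # Each call runs in a worker thread, so the event loop stays responsive.
    results = await asyncio.gather(
        asyncio.to_thread(blocking_fetch, "a.com"),
        asyncio.to_thread(blocking_fetch, "b.com"),
    )
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results)
print(f"elapsed: {elapsed:.2f} s")
```

Because time.sleep releases the GIL while waiting, the two calls overlap and the total is typically close to one sleep rather than two.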
Beneath the surface, asyncio relies on an I/O selector (epoll/kqueue/IOCP on different platforms) to poll file descriptors for I/O readiness.
The event loop is a simple loop:
# Simplified pseudocode for the event loop
while True:
    if no more tasks:
        break
    # Run all ready callbacks
    for callback in ready_queue:
        callback()
    # Poll for I/O with the shortest timer as timeout
    timeout = get_next_timer_time()
    events = selector.select(timeout)
    # Add I/O callbacks to ready queue
    for event in events:
        ready_queue.append(event.callback)
    # Fire expired timers
    for timer in expired_timers():
        ready_queue.append(timer.callback)
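A runnable miniature of the same idea fits in a few dozen lines. This is a toy scheduler built on plain generators, not asyncio's implementation: it handles timers only, and a time.sleep call stands in for the selector.select(timeout) step. The names sleep, task, and run are invented for the sketch.

```python
import heapq
import time
from collections import deque


def sleep(seconds):
    """Toy awaitable: hands a wake-up deadline to the loop."""
    yield time.monotonic() + seconds


def task(name, delay, log):
    log.append(f"{name} start")
    yield from sleep(delay)          # plays the role of "await"
    log.append(f"{name} done")


def run(tasks):
    ready = deque(tasks)
    timers = []                      # heap of (deadline, sequence, task)
    seq = 0                          # tie-breaker so tasks are never compared
    while ready or timers:
        # Run every ready task until it yields a deadline or finishes.
        while ready:
            current = ready.popleft()
            try:
                deadline = next(current)
            except StopIteration:
                continue
            heapq.heappush(timers, (deadline, seq, current))
            seq += 1
        # Wait for the earliest timer (a real loop would poll I/O here).
        if timers:
            deadline, _, sleeper = heapq.heappop(timers)
            time.sleep(max(0.0, deadline - time.monotonic()))
            ready.append(sleeper)


log = []
run([task("slow", 0.1, log), task("fast", 0.02, log)])
print(log)  # ['slow start', 'fast start', 'fast done', 'slow done']
```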
The demo below visualizes this: multiple tasks running on a single event loop, yielding at await points, and resuming when their I/O or sleep completes.
Async/await gives you concurrency without threads and without the GIL limitation. It is ideal for I/O-bound workloads (web servers, API clients, database drivers) but useless for CPU-bound computation (still needs multiprocessing or a C extension).
Python’s class system is built on a simple mechanism called the descriptor protocol. A descriptor is any object that implements __get__(), __set__(), or __delete__(). When an attribute of a class is a descriptor, Python invokes these methods instead of the normal get/set/delete behavior.
class Descriptor:
    def __get__(self, instance, owner):
        print(f"__get__: instance={instance}, owner={owner}")
        return 42

    def __set__(self, instance, value):
        print(f"__set__: instance={instance}, value={value}")

class MyClass:
    attr = Descriptor()

obj = MyClass()
obj.attr       # Calls Descriptor.__get__(obj, MyClass) -> 42
obj.attr = 99  # Calls Descriptor.__set__(obj, 99)
This is how @property works. Inside a class body, the decorator form
    @property
    def name(self):
        return self._name
is syntactic sugar for:
name = property(fget=lambda self: self._name)
The property class is itself a descriptor. Its __get__ calls the getter function you provided. Its __set__ calls the setter. If no setter is provided, attribute assignment raises AttributeError.
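The same protocol lets you write reusable, property-like building blocks. Below is a sketch of a validating data descriptor; Positive and Order are hypothetical names, and __set_name__ (Python 3.6+) is used to learn which attribute the descriptor was assigned to.

```python
class Positive:
    """Data descriptor that only accepts numbers greater than zero."""

    def __set_name__(self, owner, name):
        # Called once at class creation; store values under a private name.
        self.storage = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self              # accessed on the class, not an instance
        return getattr(instance, self.storage)

    def __set__(self, instance, value):
        if value <= 0:
            raise ValueError(f"{self.storage[1:]} must be positive, got {value}")
        setattr(instance, self.storage, value)


class Order:
    quantity = Positive()

    def __init__(self, quantity):
        self.quantity = quantity     # goes through Positive.__set__


order = Order(3)
order.quantity = 5

error = None
try:
    order.quantity = -1
except ValueError as exc:
    error = str(exc)

print(order.quantity)  # 5
print(error)           # quantity must be positive, got -1
```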
@staticmethod and @classmethod are also descriptors:
- staticmethod.__get__ returns the underlying function unchanged (no self binding)
- classmethod.__get__ binds the method to the class (not the instance)

The MRO (Method Resolution Order) is the order in which Python searches base classes when looking up an attribute. It is computed using the C3 linearization algorithm and stored in ClassName.__mro__:
class A: pass
class B(A): pass
class C(A): pass
class D(B, C): pass
print(D.__mro__)
# (<class '__main__.D'>, <class '__main__.B'>, <class '__main__.C'>, <class '__main__.A'>, <class 'object'>)
Python searches D, then B, then C, then A, then object. At each step, if the attribute is a descriptor on that class, the descriptor’s __get__ is invoked.
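The MRO also determines where super() goes next, which is what makes cooperative multiple inheritance work. A short sketch on the same diamond shape; the who method is invented for the example.

```python
class A:
    def who(self):
        return ["A"]


class B(A):
    def who(self):
        return ["B"] + super().who()


class C(A):
    def who(self):
        return ["C"] + super().who()


class D(B, C):
    def who(self):
        return ["D"] + super().who()


chain = D().who()
mro_names = [cls.__name__ for cls in D.__mro__]

print(chain)      # ['D', 'B', 'C', 'A']
print(mro_names)  # ['D', 'B', 'C', 'A', 'object']
```

Note that super() inside B calls C, not A: the next class is taken from the MRO of the instance's type, not from B's own base list.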
The descriptor protocol is the foundation of properties, bound methods, super(), __slots__, and classmethods. Every time you call obj.method(), a descriptor is involved — functions are descriptors whose __get__ returns a bound method object.
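The binding step can be performed by hand, which shows there is nothing special about method calls. A sketch with a hypothetical Greeter class: the raw objects are pulled from the class __dict__ (bypassing the descriptor machinery) and their __get__ is invoked explicitly.

```python
class Greeter:
    def hello(self):
        return "hello"

    @staticmethod
    def util():
        return "static"

    @classmethod
    def make(cls):
        return cls()


g = Greeter()

# A plain function stored on the class: __get__ produces a bound method.
raw_function = Greeter.__dict__["hello"]
bound = raw_function.__get__(g, Greeter)

# staticmethod.__get__ hands back the wrapped function, with no binding.
static_result = Greeter.__dict__["util"].__get__(g, Greeter)

# classmethod.__get__ binds to the class instead of the instance.
class_bound = Greeter.__dict__["make"].__get__(g, Greeter)

print(bound())                        # hello
print(bound.__self__ is g)            # True
print(static_result())                # static
print(class_bound.__self__ is Greeter)  # True
```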
Python’s import statement triggers a multi-step process:
Finder: sys.meta_path lists finder objects. The default finders are _frozen_importlib.BuiltinImporter (for built-in modules like sys), _frozen_importlib.FrozenImporter (frozen modules), and _frozen_importlib_external.PathFinder (for filesystem imports). Each finder checks if it can handle the given module name.
Loader: If a finder locates the module, it returns a module spec that names a loader. The loader is responsible for loading the module: reading source, compiling bytecode, and executing it.
Execution: The loader executes the module’s code in its own namespace (a new dict). All names defined at module level become attributes of the module object.
Caching: The loaded module is cached in sys.modules, which the import statement consults before any finder runs. Subsequent imports of the same module return the cached object, which is why import is idempotent within a process.
You can see the cached modules:
python -c "import sys; print(list(sys.modules.keys())[:10])"
The import system is extensible. You can write custom finders and loaders to import from databases, URLs, or dynamically generated code. Tools like pytest and importlib.metadata use this hook system.
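As an illustration of that extensibility, here is a minimal sketch of a finder and loader pair that imports modules from an in-memory dictionary. SOURCES, DictLoader, DictFinder, and the module name demo_inmemory_mod are all hypothetical; the base classes and spec_from_loader come from the standard importlib package.

```python
import importlib.abc
import importlib.util
import sys

# Hypothetical in-memory "filesystem": module name -> source text.
SOURCES = {"demo_inmemory_mod": "GREETING = 'hello from memory'\n"}


class DictLoader(importlib.abc.Loader):
    def create_module(self, spec):
        return None                  # fall back to default module creation

    def exec_module(self, module):
        source = SOURCES[module.__name__]
        code = compile(source, f"<memory:{module.__name__}>", "exec")
        exec(code, module.__dict__)  # run the code in the module's namespace


class DictFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, fullname, path=None, target=None):
        if fullname in SOURCES:
            return importlib.util.spec_from_loader(fullname, DictLoader())
        return None                  # not ours: let other finders try


finder = DictFinder()
sys.meta_path.append(finder)
try:
    import demo_inmemory_mod
    first = demo_inmemory_mod
    import demo_inmemory_mod as second   # answered from sys.modules
finally:
    sys.meta_path.remove(finder)

print(first.GREETING)    # hello from memory
print(first is second)   # True
```

The second import never reaches the finder, because the module is already present in sys.modules.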
Putting it all together, CPython’s architecture from source to execution:
Python source (.py)
|
v
Tokenizer --> tokens
|
v
Parser --> Abstract Syntax Tree (AST)
|
v
Compiler --> Bytecode (code object)
|
v
CEval loop (switch on bytecodes)
|
+---> Memory allocator (obmalloc / pymalloc)
+---> Garbage collector (generational)
+---> GIL (switches every ~5ms)
+---> Threading (pthreads / Windows threads)
+---> Async I/O (selectors / epoll / kqueue)
+---> Import system (finders + loaders)
+---> C extension API (PyObject*, PyTypeObject)
|
v
Operating system
Each component lives in its own C source file. Exact paths move between CPython versions (in recent releases, for example, the GIL code itself sits in Python/ceval_gil.c):
- Python/ceval.c: the CEval loop
- Python/pystate.c: thread state and GIL management
- Objects/obmalloc.c: object allocator
- Modules/gcmodule.c: garbage collector
- Python/import.c: import machinery
- Python/ast.c: AST construction

The demo below lets you click through each layer of this architecture, with a brief explanation of what happens at each stage.
The GIL has been called “the thing that makes Python Python and also the thing that limits Python.” After decades of debate, PEP 703 (the “no-gil” proposal) was accepted for Python 3.13 in 2023, with a multi-year rollout plan.
NoGIL makes the GIL optional. In a build with the GIL disabled, threads can execute Python bytecode in parallel on multiple cores.
Adoption will be gradual. Python 3.13 ships NoGIL as an experimental build option (the --disable-gil configure flag). Full default-on is targeted for Python 3.17 or later. C extensions that implicitly rely on the GIL to protect their own internal state will need updates.
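Code that needs to know which kind of interpreter it is running on can ask at runtime. A hedged sketch: Py_GIL_DISABLED is the build-time configuration variable for free-threaded builds, and sys._is_gil_enabled() exists on Python 3.13 and later; on older versions the helper below simply assumes the GIL is on. The gil_status function is a name made up for this example.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled is only present on 3.13+, so look it up defensively.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = probe() if probe is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


status = gil_status()
print(status)
```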
Another approach to GIL avoidance is subinterpreters (PEP 554 proposed the Python-level API, and PEP 684 gave each interpreter its own GIL in Python 3.12). Subinterpreters can run truly in parallel on multiple cores, but they share no state; communication must go through channels or shared memory. Think of them as lightweight processes within the same CPython process.
PyPy takes a radically different approach from CPython. Instead of interpreting bytecode in C, PyPy is written in RPython (a restricted subset of Python) and includes a Just-In-Time (JIT) compiler.
PyPy’s execution pipeline:
The result: PyPy is typically 4-7x faster than CPython on pure Python code. However:
PyPy’s JIT is most effective on tight loops with predictable types. Numerical code in pure Python sees dramatic speedups. Code that spends most time in C extensions (like NumPy) sees no benefit.
CPython’s extensibility comes from its stable C API. A minimal extension looks like:
#include <Python.h>
static PyObject* greet(PyObject* self, PyObject* args) {
    const char* name;
    if (!PyArg_ParseTuple(args, "s", &name))
        return NULL;
    return PyUnicode_FromFormat("Hello %s!", name);
}

static PyMethodDef methods[] = {
    {"greet", greet, METH_VARARGS, "Return a greeting."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "hello", NULL, -1, methods
};

PyMODINIT_FUNC PyInit_hello(void) {
    return PyModule_Create(&module);
}
Compile with:
python3-config --cflags --ldflags  # Show compiler and linker flags
gcc -shared -fPIC -o hello.so hello.c $(python3-config --cflags --ldflags)
Then use from Python:
import hello
print(hello.greet("world")) # "Hello world!"
When writing C extensions, you can release the GIL for long-running computations using Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS. This is how NumPy achieves parallelism — it releases the GIL before entering BLAS routines.
Modern alternatives to raw C extensions include Cython (Python-like syntax that compiles to C with CPython bindings), cffi (C Foreign Function Interface, pure Python), and pybind11 (modern C++11 bindings).
CPython is simpler than it looks. The core is a stack-based bytecode interpreter written in C, with reference counting for memory management and a global lock for thread safety. The GIL limits parallelism but makes the implementation reliable and C extensions easy to write.
Async/await side-steps the GIL entirely by providing cooperative multitasking within a single thread. The descriptor protocol gives you a clean mechanism for attribute access control that powers the entire class system.
The next few years will be transformative for Python. NoGIL will unlock true multi-core parallelism for threaded Python code. Subinterpreters offer an alternative communication-through-isolation model. PyPy continues to push the boundary of pure-Python performance.
Understanding these internals makes you a better Python developer. You will know when threading helps and when it hurts. You will understand why async def exists and when to use it. You will be able to write safer, faster, more memory-efficient code — and debug it when things go wrong.