Python Internals: GIL, Asyncio, and How Python Really Works

Python has been in widespread use for more than three decades, yet most developers have never looked under the hood. We type python my_script.py and assume magic happens. It does not. What happens is a carefully orchestrated sequence of parsing, compiling, interpreting, memory allocating, and locking, all inside a C program called CPython.

This post walks through every layer of CPython: how your source code becomes bytecode, how the stack machine executes it, how the GIL prevents the CPU from catching fire, how async/await enables concurrency without threads, and how the descriptor protocol powers @property, @staticmethod, and the entire class system. We end with the bleeding edge — NoGIL, subinterpreters, and PyPy’s JIT.

Each major section includes an interactive demo so you can see these mechanisms in action rather than just reading about them.

CPython and the Python Ecosystem

When people say “Python,” they almost always mean CPython — the C-based reference implementation maintained by the Python Software Foundation. CPython’s job is to read your .py files and execute them. But it is not the only implementation:

Implementation   Language                     Key Feature
---------------  ---------------------------  ------------------------------------------
CPython          C                            Reference implementation, C extension API
PyPy             RPython (subset of Python)   JIT compiler, often several times faster on pure-Python code
Jython           Java                         Runs on JVM, Java interop
IronPython       C#                           Runs on .NET CLR
MicroPython      C                            For microcontrollers, minimal memory

CPython dominates because every new Python feature is designed for it first. C extensions (NumPy, pandas, TensorFlow) target its C API. The Python ecosystem runs on CPython.

What makes CPython unique among these is the Global Interpreter Lock (GIL), the reference-counting memory manager, and the stack-based bytecode VM. These three decisions — made early in Python’s history — shape every program you write.

Interpreted, Compiled, or Both?

Python is called an “interpreted language,” but that is only half the story. CPython does compile your source code — just not to machine code. It compiles to bytecode, an intermediate representation that the CPython virtual machine interprets at runtime.

The full pipeline is:

.py source file --> Parser --> AST --> Compiler --> Bytecode (.pyc) --> CEval loop --> OS

Each step transforms your code into a lower-level representation. The parser turns characters into tokens and builds an Abstract Syntax Tree (AST). The compiler walks the AST and emits bytecode instructions. The CEval loop is a giant C switch statement that reads each bytecode and executes the corresponding C code.
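The stages above can be driven by hand from Python itself. The following is a minimal sketch using the standard ast module and the built-in compile() and exec(); the source string and the "<example>" filename are made up for illustration.

```python
import ast
import dis

source = "def add(a, b):\n    return a + b\n"

# Step 1: parse the source text into an Abstract Syntax Tree.
tree = ast.parse(source, filename="<example>")
print(type(tree.body[0]).__name__)  # FunctionDef

# Step 2: compile the AST into a code object (the in-memory form of bytecode).
module_code = compile(tree, "<example>", "exec")

# Step 3: execute the code object; this defines `add` in the namespace dict.
namespace = {}
exec(module_code, namespace)
add = namespace["add"]

# The function's own bytecode is stored as raw bytes on its code object.
print(add.__code__.co_code.hex())
dis.dis(add)       # the exact opcodes printed depend on the CPython version
print(add(2, 3))   # 5
```

The hex string printed in the middle is the same bytecode that dis.dis() renders in readable form.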

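The .pyc files you see in __pycache__/ are cached bytecode for imported modules. If a module’s source file has not changed, CPython skips the parse+compile step entirely and loads the bytecode directly. The script you run directly (python my_script.py) is compiled in memory on every run and is not cached.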

This two-phase design (compile to bytecode, then interpret) is the same strategy used by Java (JVM bytecode), Lua, and the interpreter tiers of many JavaScript engines. It gives you portability across architectures while keeping the implementation simpler than a full native compiler.

The Stack-Based Virtual Machine

CPython is a stack-based VM. Most bytecode instructions operate on a stack: they pop values, compute something, and push the result back. This is different from register-based VMs (like Lua 5’s) where operations name registers directly.

Consider a simple function:

def add(a, b):
    return a + b

CPython compiles this to roughly:

import dis
dis.dis(add)

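# Output on CPython 3.10 and earlier (3.11+ emits BINARY_OP instead of
# BINARY_ADD and adds a RESUME instruction):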
#   2           0 LOAD_FAST                0 (a)
#               2 LOAD_FAST                1 (b)
#               4 BINARY_ADD
#               6 RETURN_VALUE

Here is what happens at each bytecode:

  1. LOAD_FAST a — Push the value of local variable a onto the stack.
  2. LOAD_FAST b — Push the value of b onto the stack. Stack now has [a, b].
  3. BINARY_ADD — Pop the top two values (b, then a), compute a + b, push the result.
  4. RETURN_VALUE — Pop the top value and return it to the caller.

The stack is the VM’s scratch space. It lives in the function’s frame, not in a global structure. Every function call pushes a new frame onto the call stack (not the data stack), and that frame has its own data stack, local variables array, and instruction pointer.
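To make the push/pop behaviour concrete, here is a toy stack machine written in Python. It is only a sketch: the run() function and the (opname, argument) tuple format are invented for this example and are far simpler than CPython’s real eval loop, but the four instructions behave as described in the list above.

```python
def run(instructions, local_vars):
    """Execute a list of (opname, argument) pairs on a toy stack machine."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])      # push a local variable
        elif opname == "LOAD_CONST":
            stack.append(arg)                  # push a constant
        elif opname == "BINARY_ADD":
            right = stack.pop()                # top of stack is the right operand
            left = stack.pop()
            stack.append(left + right)
        elif opname == "RETURN_VALUE":
            return stack.pop()                 # hand the result to the caller
        else:
            raise ValueError(f"unknown instruction: {opname}")
    raise RuntimeError("program ended without RETURN_VALUE")

# The bytecode of add(a, b), written out by hand.
program = [
    ("LOAD_FAST", "a"),
    ("LOAD_FAST", "b"),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
print(run(program, {"a": 2, "b": 3}))  # 5
```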

The demo below lets you step through any of several example functions bytecode by bytecode, watching the stack grow and shrink with each instruction.

[Interactive demo: Python Bytecode VM. Steps through the four instructions of add(a, b) and shows the stack after each one.]

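Other common bytecodes include STORE_FAST (pop stack into local variable), LOAD_CONST (push a constant like a number or string), CALL_FUNCTION (pop N arguments and call a function; replaced by CALL in CPython 3.11), and BUILD_LIST (pop N values and build a list). Opcode names and numbers change between releases: the list for your interpreter is documented in the dis module, and the numeric definitions live in Include/opcode.h (Include/opcode_ids.h in recent versions of the CPython source).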

The Global Interpreter Lock

The GIL is arguably the most controversial feature of CPython. It is a single mutex that protects access to Python objects, preventing two threads from executing Python bytecode at the same time.

Why does it exist? CPython’s memory management uses reference counting (see next section). Each object has a refcount field that tracks how many references point to it. When the refcount hits zero, the object is deallocated. Multiple threads modifying an object’s refcount simultaneously would corrupt memory. The simplest fix is a global lock: only one thread runs Python code at a time.

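A thread does not hold the GIL indefinitely. Python 2 and early Python 3 releases forced a check every 100 “ticks” (roughly, bytecode instructions). Since CPython 3.2 the GIL is time-based: a thread that has been waiting for longer than the switch interval (5 ms by default, readable with sys.getswitchinterval() and adjustable with sys.setswitchinterval()) sets a drop request, and the running thread releases the GIL at its next safe point in the eval loop. The old mechanism was called the check interval; the current one is the switch interval.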

The impact on performance depends on the workload:

  • CPU-bound tasks suffer under the GIL. Two threads computing primes will not finish faster than one — they alternate, each holding the GIL for ~5ms at a time.
  • I/O-bound tasks are barely affected. When a thread does I/O (reads a file, makes an HTTP request), it releases the GIL voluntarily. Other threads can run while the I/O completes.
  • C extensions can release the GIL explicitly using the Py_BEGIN_ALLOW_THREADS macro. NumPy does this — heavy matrix operations run without the GIL.
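The CPU-bound case is easy to observe yourself. The sketch below times the same pure-Python loop run twice in sequence and then in two threads; the workload size N and the helper names are arbitrary choices for illustration. On a regular (GIL-enabled) build the two timings are usually close, but the exact numbers depend on your machine and interpreter, so none are shown.

```python
import sys
import threading
import time

def count_down(n):
    """Pure-Python CPU-bound loop; it never releases the GIL voluntarily."""
    while n > 0:
        n -= 1
    return n

def timed(fn):
    """Return the wall-clock seconds taken by calling fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def sequential(n):
    count_down(n)
    count_down(n)

def threaded(n):
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

N = 1_000_000  # arbitrary workload size chosen for illustration
print("switch interval:", sys.getswitchinterval())  # 0.005 by default
print(f"sequential:  {timed(lambda: sequential(N)):.3f}s")
print(f"two threads: {timed(lambda: threaded(N)):.3f}s")
```

Swapping count_down for a function that sleeps or waits on a socket would show the opposite result, because blocking I/O releases the GIL.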

The demo below shows two threads competing for the GIL. In CPU mode, they alternate every ~5ms. In I/O mode, the working thread releases the GIL when it hits an I/O operation, letting the other thread run immediately.

[Interactive demo: Global Interpreter Lock (GIL). Simulates Thread A and Thread B competing for the GIL on a timeline, in CPU-bound and I/O-bound modes.]

The GIL is the reason Python has multiprocessing (separate processes, each with its own GIL) as an alternative to threading (same process, shared GIL). It is also why async/await became so important — it provides concurrency without needing threads at all.

Reference Counting: Python’s Primary Memory Manager

Every Python object is represented in C as PyObject*, a pointer to a struct that begins with two fields:

typedef struct _object {
    Py_ssize_t ob_refcnt;  // reference count
    PyTypeObject *ob_type; // pointer to the type object
} PyObject;

ob_refcnt is the reference count. When you write:

a = [1, 2, 3]   # refcount = 1 (a points to it)
b = a           # refcount = 2 (both a and b point to it)
del a           # refcount = 1 (b still points to it)
del b           # refcount = 0 (deallocated)

Every assignment (b = a), function argument pass, and container insertion increments the refcount. Every del, rebinding, or container removal decrements it. When the refcount hits zero, CPython immediately frees the object’s memory.
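You can watch these increments and decrements with sys.getrefcount(). The absolute number it returns includes temporary references created by the call itself and differs between CPython versions, so this sketch compares differences against a baseline instead of printing raw counts.

```python
import sys

a = [1, 2, 3]
baseline = sys.getrefcount(a)    # includes temporaries made by the call itself

b = a                            # a second name for the same list
after_alias = sys.getrefcount(a)

holder = {"key": a}              # storing it in a container adds a reference
after_container = sys.getrefcount(a)

del holder                       # freeing the dict releases its reference
del b                            # unbinding the name releases another
after_cleanup = sys.getrefcount(a)

print(after_alias - baseline)      # 1
print(after_container - baseline)  # 2
print(after_cleanup - baseline)    # 0
```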

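This is deterministic memory management. Unlike a tracing GC (like Java’s or Go’s) that runs periodically, CPython frees an object as soon as its last reference disappears (cycles aside, as covered next). This is why code that never closes a file explicitly, such as open(path).read(), usually still releases the file promptly on CPython: the file object is finalized the moment its refcount hits zero. Do not rely on that, though. Context managers (with open(...) as f:) close the file at the end of the block through __exit__, regardless of refcounts, and behave the same on implementations without reference counting.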

Reference counting has one major flaw: cycles. If object A points to B and B points to A (but nothing else points to either), both have refcount 1 and will never be freed:

class Node:
    def __init__(self):
        self.next = None

a = Node()
b = Node()
a.next = b
b.next = a
del a  # refcount of a = 1 (b.next still points to it)
del b  # refcount of b = 1 (a.next still points to it)

For this, CPython has a separate garbage collector (the gc module). It runs periodically, finds cycles, and frees unreachable cyclic garbage. The GC is generational (three generations) — most objects die young and are collected quickly; survivors are promoted.
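The cycle from the Node example can be observed with a weak reference, which lets us check whether an object is still alive without keeping it alive. In this sketch the automatic collector is switched off temporarily so that the only collection is the explicit gc.collect() call; the variable name probe is just for illustration.

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.next = None

gc.disable()                  # keep the automatic collector out of the demo
a = Node()
b = Node()
a.next = b
b.next = a
probe = weakref.ref(a)        # a weak reference does not keep `a` alive

del a, b                      # refcounts stay at 1 because of the cycle
alive_before = probe() is not None

gc.collect()                  # run the cycle detector explicitly
alive_after = probe() is not None
gc.enable()

print(alive_before)           # True
print(alive_after)            # False
```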

The Generational Garbage Collector

The gc module implements a generational collector. Objects are placed into generation 0 when created. If they survive a collection, they move to generation 1, then generation 2. Older generations are collected less frequently.

The GC runs when the number of newly allocated objects minus the number of deallocated objects exceeds a threshold (default 700 for generation 0). You can inspect and tune these thresholds:

import gc
gc.get_threshold()   # (700, 10, 10) has long been the default; recent releases may differ

# Manually trigger a collection
gc.collect()

The GC only tracks container objects (list, dict, set, tuple, and custom class instances), because only objects that hold references to other objects can take part in a cycle. Objects such as strings and integers hold no references, so the GC ignores them. Reference counting handles all non-cyclic garbage immediately.
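The gc module exposes this distinction through gc.is_tracked(). A short sketch follows; the Point class is a throwaway example. Some containers are tracked lazily as an optimization, so the examples stick to cases that are unambiguous.

```python
import gc

class Point:
    pass

print(gc.is_tracked([]))         # True: a list can hold references
print(gc.is_tracked(Point()))    # True: instances can hold references
print(gc.is_tracked("hello"))    # False: a str holds no references
print(gc.is_tracked(42))         # False: an int holds no references
print(gc.get_count())            # current per-generation counters (values vary)
```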

The demo below lets you create objects, link them (creating references), delete references, and watch the refcount change in real time. A “Force GC” button triggers cycle detection and shows if the GC freed any cyclic garbage.

[Interactive demo: CPython Memory Management. Shows five sample objects (a list, a str, a dict, and two Node instances) with their sizes and live refcounts, plus a memory log.]

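Understanding refcounts and the GC explains many Python behaviors: why del sometimes frees memory immediately (it removed the last reference) and sometimes does nothing (the object is still referenced elsewhere); why gc.collect() occasionally recovers significant memory in long-running apps; and why cyclic data structures such as doubly linked lists, trees with parent pointers, and graphs often use weakref for back-references, so that they are freed promptly by reference counting instead of waiting for the cycle collector.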

Async/Await and the Event Loop

Python’s async/await system (asyncio) is a cooperative multitasking framework. Tasks voluntarily yield control at await points. No additional threads are involved: a single thread runs an event loop that schedules tasks.

Here is the core mechanism:

import asyncio

async def fetch_data(url):
    print(f"Fetching {url}")
    await asyncio.sleep(1)   # yield control, resume after 1 second
    print(f"Done {url}")
    return f"data from {url}"

async def main():
    # Both tasks run concurrently on a single thread
    task1 = asyncio.create_task(fetch_data("a.com"))
    task2 = asyncio.create_task(fetch_data("b.com"))
    r1 = await task1
    r2 = await task2
    return r1, r2

asyncio.run(main())

When await asyncio.sleep(1) is reached, the task suspends and returns control to the event loop. The event loop checks its list of scheduled callbacks and runs any task whose sleep has expired. Because switching tasks is an ordinary function return inside one thread, with no OS-level context switch, the overhead per task switch is small compared with switching between threads.

Compare with time.sleep(1):

import time

def blocking_fetch(url):
    print(f"Fetching {url}")
    time.sleep(1)            # BLOCKS the entire thread
    print(f"Done {url}")
    return f"data from {url}"

time.sleep(1) blocks the OS thread. If you are using threads, that thread is stuck. If you are using asyncio, the entire event loop is stuck — no other task can run. This is why you must never call blocking functions inside async code without using asyncio.to_thread() or loop.run_in_executor() to offload them to a thread pool.
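Here is a minimal sketch of the offloading pattern, assuming Python 3.9+ for asyncio.to_thread(). The heartbeat() coroutine is a made-up stand-in for other work the loop should keep doing while the blocking call runs in a worker thread.

```python
import asyncio
import time

def blocking_fetch(url):
    time.sleep(0.2)               # stands in for a blocking library call
    return f"data from {url}"

async def heartbeat(log):
    # Other work that should keep running on the event loop.
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.05)

async def main():
    log = []
    # to_thread() runs the blocking function in a worker thread, so the
    # event loop stays free to run heartbeat() in the meantime.
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_fetch, "a.com"),
        heartbeat(log),
    )
    return result, log

result, log = asyncio.run(main())
print(result)  # data from a.com
print(log)     # ['tick 0', 'tick 1', 'tick 2']
```

If blocking_fetch() were called directly inside a coroutine instead, the heartbeat could not tick until the call returned.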

Beneath the surface, asyncio uses:

  • The selectors module to poll file descriptors for I/O readiness (it wraps epoll, kqueue, or select depending on the platform; on Windows the default proactor loop uses IOCP instead)
  • A heap of scheduled callbacks (timers)
  • A queue of ready callbacks

The event loop is a simple loop:

# Simplified pseudocode for the event loop
while True:
    if no more tasks:
        break
    # Run all ready callbacks
    for callback in ready_queue:
        callback()
    # Poll for I/O with the shortest timer as timeout
    timeout = get_next_timer_time()
    events = selector.select(timeout)
    # Add I/O callbacks to ready queue
    for event in events:
        ready_queue.append(event.callback)
    # Fire expired timers
    for timer in expired_timers():
        ready_queue.append(timer.callback)
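The pseudocode can be turned into a small runnable toy. This sketch is not how asyncio is implemented: it uses plain generators in place of coroutines, handles timers only (no I/O polling), and every name in it (sleep, worker, run_loop) is invented for the example. It does show the two data structures at work, a ready queue and a timer heap.

```python
import heapq
import time
from collections import deque

def sleep(seconds):
    """Yielding a number asks the toy loop to resume this task later."""
    yield seconds

def worker(name, delay, log):
    log.append(f"{name} start")
    yield from sleep(delay)
    log.append(f"{name} done")

def run_loop(tasks):
    ready = deque(tasks)           # tasks that can run right now
    timers = []                    # heap of (wake_time, sequence, task)
    seq = 0                        # tie-breaker so tasks are never compared
    while ready or timers:
        # Run every task that is currently ready until its next yield.
        for _ in range(len(ready)):
            task = ready.popleft()
            try:
                delay = next(task)
            except StopIteration:
                continue           # task finished
            seq += 1
            heapq.heappush(timers, (time.monotonic() + delay, seq, task))
        # No I/O in this sketch: just wait for the earliest timer.
        if not ready and timers:
            wait = timers[0][0] - time.monotonic()
            if wait > 0:
                time.sleep(wait)
        # Move expired timers to the ready queue.
        now = time.monotonic()
        while timers and timers[0][0] <= now:
            ready.append(heapq.heappop(timers)[2])

log = []
run_loop([worker("A", 0.05, log), worker("B", 0.01, log)])
print(log)  # ['A start', 'B start', 'B done', 'A done']
```

Task B finishes first even though A started first, because the loop resumes tasks in order of their wake-up time.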

The demo below visualizes this: multiple tasks running on a single event loop, yielding at await points, and resuming when their I/O or sleep completes.

[Interactive demo: Asyncio Event Loop. Runs four tasks, fetch(url_a), fetch(url_b), compute(), and read_file(), on one loop and logs each yield and resume.]

Async/await gives you concurrency without threads, so tasks never contend for the GIL. It does not give you parallelism: only one task runs at any instant. It is ideal for I/O-bound workloads (web servers, API clients, database drivers) but does not help CPU-bound computation, which still needs multiprocessing or a C extension.

The Descriptor Protocol

Python’s class system is built on a simple mechanism called the descriptor protocol. A descriptor is any object that implements __get__(), __set__(), or __delete__(). When an attribute of a class is a descriptor, Python invokes these methods instead of the normal get/set/delete behavior.

class Descriptor:
    def __get__(self, instance, owner):
        print(f"__get__: instance={instance}, owner={owner}")
        return 42

    def __set__(self, instance, value):
        print(f"__set__: instance={instance}, value={value}")

class MyClass:
    attr = Descriptor()

obj = MyClass()
obj.attr          # Calls Descriptor.__get__(obj, MyClass) -> 42
obj.attr = 99     # Calls Descriptor.__set__(obj, 99)

This is how @property works:

@property
def name(self):
    return self._name

Is syntactic sugar for:

name = property(fget=lambda self: self._name)

The property class is itself a descriptor. Its __get__ calls the getter function you provided. Its __set__ calls the setter. If no setter is provided, attribute assignment raises AttributeError.
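To see that nothing magical is going on, here is a simplified pure-Python stand-in for property. MyProperty and the User class are hypothetical names for this sketch; the real built-in is implemented in C and also supports deleters and docstrings.

```python
class MyProperty:
    """Hypothetical, simplified stand-in for the built-in property."""

    def __init__(self, fget=None, fset=None):
        self.fget = fget
        self.fset = fset

    def __get__(self, instance, owner=None):
        if instance is None:
            return self                # accessed on the class itself
        return self.fget(instance)

    def __set__(self, instance, value):
        if self.fset is None:
            raise AttributeError("can't set attribute")
        self.fset(instance, value)

    def setter(self, fset):
        # Return a new descriptor that keeps the getter and adds a setter.
        return type(self)(self.fget, fset)

class User:
    def __init__(self, name):
        self._name = name

    @MyProperty
    def name(self):                    # read-only: no setter defined
        return self._name

    @MyProperty
    def shout(self):
        return self._name.upper()

    @shout.setter
    def shout(self, value):
        self._name = value.lower()

u = User("ada")
print(u.name)       # ada
u.shout = "GRACE"
print(u.name)       # grace
try:
    u.name = "x"
except AttributeError as exc:
    print("read-only:", exc)   # read-only: can't set attribute
```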

@staticmethod and @classmethod are also descriptors:

  • staticmethod.__get__ returns the underlying function unchanged (no self binding)
  • classmethod.__get__ binds the method to the class (not the instance)
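You can invoke these __get__ methods by hand to see what attribute access does for you. In this sketch the Greeter class is a made-up example; vars(Greeter) is used to fetch the raw class attributes without triggering the descriptor protocol.

```python
class Greeter:
    def hello(self):
        return f"hello from {type(self).__name__}"

    @staticmethod
    def util(x):
        return x * 2

    @classmethod
    def create(cls):
        return cls()

g = Greeter()

# vars(Greeter) gives the raw class attributes, before any descriptor call.
raw_hello = vars(Greeter)["hello"]      # a plain function
raw_util = vars(Greeter)["util"]        # a staticmethod object
raw_create = vars(Greeter)["create"]    # a classmethod object

bound = raw_hello.__get__(g, Greeter)   # what g.hello evaluates to
print(bound())                          # hello from Greeter
print(bound.__self__ is g)              # True

print(raw_util.__get__(g, Greeter)(21))      # 42
made = raw_create.__get__(None, Greeter)()   # what Greeter.create() does
print(type(made).__name__)                   # Greeter
```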

The MRO (Method Resolution Order) is the order in which Python searches base classes when looking up an attribute. It is computed using the C3 linearization algorithm and stored in ClassName.__mro__:

class A: pass
class B(A): pass
class C(A): pass
class D(B, C): pass

print(D.__mro__)
# (<class 'D'>, <class 'B'>, <class 'C'>, <class 'A'>, <class 'object'>)

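Python searches D, then B, then C, then A, then object. If the attribute found is a descriptor, its __get__ is invoked. Data descriptors (those that define __set__ or __delete__, such as property) take precedence over the instance’s own __dict__; non-data descriptors such as plain functions can be shadowed by an instance attribute of the same name.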

The descriptor protocol is the foundation of properties, bound methods, super(), __slots__, and classmethods. Every time you call obj.method(), a descriptor is involved — functions are descriptors whose __get__ returns a bound method object.

[Interactive demo: Descriptor Protocol. Shows a MyClass(BaseClass, Mixin) definition using @property, @staticmethod, @classmethod, and a Validator(maxlen=100) descriptor, along with its Method Resolution Order.]

The Import System

Python’s import statement triggers a multi-step process:

  1. Finder: sys.meta_path lists finder objects. The default finders are _frozen_importlib.BuiltinImporter (for built-in modules like sys), _frozen_importlib.FrozenImporter (frozen modules), and _frozen_importlib_external.PathFinder (for filesystem imports). Each finder checks if it can handle the given module name.

  2. Loader: If a finder locates the module, it returns a module spec that names the loader to use. The loader is responsible for loading the module: reading source, compiling bytecode, and executing it.

  3. Execution: The loader executes the module’s code in its own namespace (a new dict). All names defined at module level become attributes of the module object.

  4. Caching: The module is stored in sys.modules, which is consulted before any finder runs. Subsequent imports of the same module return the cached object, so the module’s code executes only once per process.

You can see the cached modules:

python -c "import sys; print(list(sys.modules.keys())[:10])"

The import system is extensible. You can write custom finders and loaders to import from databases, URLs, or dynamically generated code. Tools like pytest and importlib.metadata use this hook system.
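As an illustration of that extensibility, the sketch below imports a module whose source lives in a dictionary instead of on disk. DictFinder, VIRTUAL_SOURCES, and the module name virtual_hello are all invented for this example; the base classes and spec_from_loader() come from the standard importlib package.

```python
import importlib.abc
import importlib.util
import sys

# Hypothetical in-memory "filesystem": module name -> source text.
VIRTUAL_SOURCES = {"virtual_hello": "GREETING = 'hi from memory'\n"}

class DictFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finder and loader in one class, serving modules from VIRTUAL_SOURCES."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname in VIRTUAL_SOURCES:
            return importlib.util.spec_from_loader(fullname, self)
        return None                       # let the next finder try

    def create_module(self, spec):
        return None                       # use the default module creation

    def exec_module(self, module):
        source = VIRTUAL_SOURCES[module.__name__]
        code = compile(source, f"<virtual {module.__name__}>", "exec")
        exec(code, module.__dict__)       # run it in the module's namespace

# Appended last, so it is only consulted when the default finders fail.
sys.meta_path.append(DictFinder())

import virtual_hello
print(virtual_hello.GREETING)             # hi from memory
print("virtual_hello" in sys.modules)     # True
```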

CPython Architecture Overview

Putting it all together, CPython’s architecture from source to execution:

Python source (.py)
     |
     v
  Tokenizer  --> tokens
     |
     v
  Parser  --> Abstract Syntax Tree (AST)
     |
     v
  Compiler  --> Bytecode (code object)
     |
     v
  CEval loop (switch on bytecodes)
     |
     +---> Memory allocator (obmalloc / pymalloc)
     +---> Garbage collector (generational)
     +---> GIL (switches every ~5ms)
     +---> Threading (pthreads / Windows threads)
     +---> Async I/O (selectors / epoll / kqueue)
     +---> Import system (finders + loaders)
     +---> C extension API (PyObject*, PyTypeObject)
     |
     v
  Operating system

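Each component lives in its own C source file (exact paths move between CPython releases):

  • Python/ceval.c: the CEval loop
  • Python/ceval_gil.c: GIL acquisition and release (Python/ceval_gil.h in older releases)
  • Python/pystate.c: interpreter and thread state
  • Objects/obmalloc.c: object allocator
  • Modules/gcmodule.c: garbage collector (Python/gc.c in recent releases)
  • Python/import.c: import machinery (much of it is written in Python, in Lib/importlib/)
  • Python/ast.c: AST validation and helpers (since 3.9 the AST itself is built by the PEG parser in Parser/)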

The demo below lets you click through each layer of this architecture, with a brief explanation of what happens at each stage.

[Interactive demo: CPython Architecture. A clickable pipeline: Python Source (.py), Tokenizer (tokenize.c), Parser (ast.c), Compiler (compile.c), Bytecode (.pyc, marshal.c), CEval Loop (ceval.c), and Runtime Services.]

The Future: GIL Removal and NoGIL

The GIL has been called “the thing that makes Python Python and also the thing that limits Python.” After decades of debate, PEP 703 (the “no-gil” proposal) was accepted in 2023 with a multi-year rollout plan, and the first experimental free-threaded build shipped with Python 3.13.

NoGIL makes the GIL optional. When enabled:

  • Reference counts are updated atomically when an object is touched by a thread other than its owner
  • The owning thread keeps a separate, non-atomic count (“biased” reference counting), which avoids atomic operations and cache line bouncing on the common path
  • The cyclic GC pauses all other threads while it runs, so objects cannot be mutated during cycle detection
  • Free-threaded builds are ABI-incompatible with C extensions compiled for regular CPython

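Adoption will be gradual. Python 3.13 ships the free-threaded build as experimental (--disable-gil configure flag). The rollout is phased, and no release has been fixed for making it the default. C extensions that rely on the GIL to protect their own internal state (for example, unsynchronized global or per-object data) will need updates, and every extension has to be rebuilt for the free-threaded ABI.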
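If you want to know what your own interpreter is doing, the following sketch reports it. It assumes CPython: the Py_GIL_DISABLED build variable and the sys._is_gil_enabled() function exist only on 3.13 and later, so the helper (gil_status is a made-up name) falls back to "GIL enabled" on older versions.

```python
import sys
import sysconfig

def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; older versions always have a GIL.
    checker = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = checker() if checker is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}

print(gil_status())
```

Even on a free-threaded build the GIL can be switched back on at runtime, for example when an extension that does not declare free-threading support is imported, which is why the two values are reported separately.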

Another approach to GIL avoidance is subinterpreters. PEP 684 (Python 3.12) gave each subinterpreter its own GIL, and PEP 554 proposed a Python-level API for working with them. They can run truly in parallel on multiple cores, but they share no state: communication must go through channels or shared memory. Think of them as lightweight processes within the same CPython process.

PyPy and JIT Compilation

PyPy takes a radically different approach from CPython. Instead of interpreting bytecode in C, PyPy is written in RPython (a restricted subset of Python) and includes a Just-In-Time (JIT) compiler.

PyPy’s execution pipeline:

  1. Parse and compile to bytecode (same as CPython)
  2. Interpret the bytecode while a tracing JIT watches which paths the code takes most frequently
  3. Hot loops are compiled to machine code (for example x86-64 or ARM64) at runtime
  4. The machine code runs without interpreter overhead

The result: PyPy is often several times faster than CPython on pure Python code, with the exact factor depending heavily on the workload. However:

  • C extensions must be recompiled for PyPy’s C API (not all work)
  • Memory usage is usually higher (the JIT generates and caches machine code)
  • Startup time is slower (JIT warmup)
  • Some edge cases behave differently (PyPy uses a tracing GC instead of reference counting, so objects are not finalized the moment their last reference disappears)

PyPy’s JIT is most effective on tight loops with predictable types. Numerical code in pure Python sees dramatic speedups. Code that spends most time in C extensions (like NumPy) sees no benefit.

Writing C Extensions

CPython’s extensibility comes from its stable C API. A minimal extension looks like:

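#define PY_SSIZE_T_CLEAN
#include <Python.h>

/* hello.greet(name): parse one str argument and return a new str. */
static PyObject* greet(PyObject* self, PyObject* args) {
    const char* name;
    /* "s" converts a Python str to a UTF-8 C string. On failure an
       exception is already set, so just propagate NULL. */
    if (!PyArg_ParseTuple(args, "s", &name))
        return NULL;
    return PyUnicode_FromFormat("Hello %s!", name);
}

/* Method table: Python-visible name, C function, calling convention, docstring. */
static PyMethodDef methods[] = {
    {"greet", greet, METH_VARARGS, "Return a greeting."},
    {NULL, NULL, 0, NULL}  /* sentinel */
};

/* Module definition: name, docstring (none), state size (-1 = global state). */
static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "hello", NULL, -1, methods
};

/* Entry point: the name must be PyInit_<module name>. */
PyMODINIT_FUNC PyInit_hello(void) {
    return PyModule_Create(&module);
}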

Compile with:

python3-config --cflags --ldflags  # Show the compiler and linker flags
gcc -shared -fPIC $(python3-config --cflags) hello.c -o hello.so $(python3-config --ldflags)

Then use from Python:

import hello
print(hello.greet("world"))  # "Hello world!"

When writing C extensions, you can release the GIL for long-running computations using Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS. This is how NumPy achieves parallelism — it releases the GIL before entering BLAS routines.

Modern alternatives to raw C extensions include Cython (Python-like syntax that compiles to C with CPython bindings), cffi (C Foreign Function Interface, which calls C from Python using plain C declarations), and pybind11 (modern C++11 bindings).

Final Thoughts

CPython is simpler than it looks. The core is a stack-based bytecode interpreter written in C, with reference counting for memory management and a global lock for thread safety. The GIL limits parallelism but makes the implementation reliable and C extensions easy to write.

Async/await avoids GIL contention by providing cooperative multitasking within a single thread, although it does not add parallelism. The descriptor protocol gives you a clean mechanism for attribute access control that powers the entire class system.

The next few years will be transformative for Python. NoGIL will unlock true multi-core parallelism for threaded Python code. Subinterpreters offer an alternative communication-through-isolation model. PyPy continues to push the boundary of pure-Python performance.

Understanding these internals makes you a better Python developer. You will know when threading helps and when it hurts. You will understand why async def exists and when to use it. You will be able to write safer, faster, more memory-efficient code — and debug it when things go wrong.