Every program you write — from a tiny shell script to a database server — runs on top of an operating system. The OS is the invisible layer between your code and the raw hardware. It abstracts the CPU, memory, disk, and network into simple APIs like files, processes, and sockets.
Most developers interact with the OS through system calls without thinking about what happens beneath. When you call read(), fork(), or malloc(), a sophisticated machinery of interrupts, page tables, and scheduler algorithms activates. Understanding this machinery separates developers who guess at performance from those who can predict it.
We will walk through the four pillars of OS fundamentals: processes and threads, virtual memory, system calls, and CPU scheduling. Each section builds on the last, starting from the simplest analogy and working up to kernel internals.
Imagine a kindergarten classroom with one teacher and thirty children. Each child wants to use the crayons, the building blocks, the swing, and the storybook corner — all at the same time. The teacher cannot clone herself, so she must decide: who gets the crayons for the next ten minutes? Who goes to the swing? How does she make sure no child hoards the purple crayon forever?
The operating system is that teacher. The children are processes (running programs). The crayons, blocks, and swing are the CPU, memory, disk, and network.
The OS has two primary jobs:
These two goals are in tension. Giving one process unrestricted access to the CPU improves its throughput but starves others. Strict isolation adds overhead. The OS must balance them in a way that feels seamless to the user.
In the early days of computing (1950s-60s), there was no OS. Programs were loaded one at a time by an operator. The machine ran one job, printed the result, and the operator loaded the next. This was batch processing — no multitasking, no protection, no interactivity. The OS evolved because that model was wildly inefficient.
The most fundamental protection mechanism in any modern OS is the split between two privilege levels: user mode and kernel mode.
User mode (Ring 3): Applications run here. They cannot execute privileged CPU instructions, cannot directly access hardware devices, and cannot touch memory that belongs to other processes. If a program tries, the CPU raises an exception and the OS kills the process.
Kernel mode (Ring 0): The OS kernel runs here. It can execute any instruction, access any memory location, and talk to any device. The kernel is trusted — it is the smallest, most carefully audited piece of software on the system.
When a user program needs to do something privileged (read a file, allocate memory, send a network packet), it cannot just do it directly. It must ask the kernel through a system call.
When a user program needs kernel services (reading a file, allocating memory, sending network data), it makes a system call. This transitions from user mode to kernel mode and back.
This demo shows the full lifecycle of a system call. When your program calls read(), the libc wrapper places the syscall number in RAX, executes the syscall instruction, and the CPU atomically switches to kernel mode. The kernel handles the request and returns.
The switch itself is fast (about 100-200 nanoseconds for a simple syscall), but it involves saving registers, changing the stack pointer, and verifying the caller’s arguments. On x86-64 Linux, the syscall instruction reads the kernel entry point from the MSR_LSTAR model-specific register. The kernel entry assembly saves all registers, calls the C dispatcher, and eventually executes sysretq to return.
A process is a running instance of a program. It is more than just the code in memory. It includes:
The kernel represents each process with a Process Control Block (PCB) — a data structure stored in kernel memory. The PCB is the kernel’s bookkeeping entry for the process. On Linux, this is the task_struct, a large struct defined in include/linux/sched.h.
Every process has a PCB: a kernel data structure holding PID, state, registers, and memory map. Click Fork to create a child process. Click state buttons to transition.
The demo above shows a PCB with its key fields: PID, registers, memory map, program counter, and state. Click Fork to create a child process with copy-on-write memory. Transition between running, ready, and blocked states.
A process is always in exactly one of these states:
Transitions happen constantly. A running process that issues a disk read becomes blocked. When the disk finishes, the process moves back to ready. The scheduler then decides when it runs again.
On Unix-like systems, processes are created with the fork() system call. Fork creates an almost exact copy of the calling process — the parent and child are identical except for:
fork() returns the child’s PID to the parent, and 0 to the childAfter fork, the child typically calls exec() to load a new program, replacing its address space.
#include <unistd.h>
#include <stdio.h>
#include <sys/wait.h>
int main() {
pid_t pid = fork();
if (pid == 0) {
// Child process
printf("Child: PID=%d, Parent PID=%d\n", getpid(), getppid());
char *args[] = {"/bin/ls", "-l", NULL};
execvp(args[0], args);
} else if (pid > 0) {
// Parent process
printf("Parent: child has PID=%d\n", pid);
wait(NULL); // Wait for child to finish
printf("Parent: child exited\n");
} else {
perror("fork failed");
return 1;
}
return 0;
}
The key insight is that fork does not copy physical memory immediately. It uses copy-on-write (COW): the parent and child share the same physical pages, marked read-only. When either tries to write, the CPU raises a page fault and the kernel copies the page. This makes fork fast — it only copies page table entries, not the actual memory.
A thread is a lightweight unit of execution within a process. Threads share the same address space (code, data, heap) but have their own stack and registers.
| Aspect | Process | Thread |
|---|---|---|
| Address space | Separate | Shared |
| Creation overhead | High (fork + page tables) | Low (clone with sharing) |
| Context switch | Expensive (TLB flush) | Cheaper (same address space) |
| Isolation | Full | None (one crash kills all) |
| IPC | Pipes, sockets, shared memory | Direct memory access |
| Scheduling unit | Yes | Yes (kernel threads) or No (user threads) |
On Linux, both processes and threads are created with the clone() system call. The difference is in the flags: CLONE_VM for shared memory space (thread), no CLONE_VM for separate (process). At the kernel level, both are task_struct instances.
The CPU scheduler decides which ready process gets the CPU next. This decision happens frequently — typically every 4-100 milliseconds (a time slice or quantum).
Different scheduling algorithms yield different performance. The Gantt chart below shows execution order. Average waiting time and turnaround time measure efficiency.
The demo above lets you switch between four classic scheduling algorithms and see the resulting Gantt chart. Each algorithm makes different tradeoffs between fairness, throughput, and response time.
The simplest algorithm: the first process to arrive runs to completion. It is a FIFO queue.
FCFS suffers from the convoy effect: one long CPU-bound process blocks all shorter processes behind it. Average waiting time can be very high.
Run the process with the shortest total CPU burst first. This minimizes average waiting time (provably optimal for non-preemptive scheduling), but it requires knowing the burst time in advance, which is impossible in practice.
Operating systems approximate SJF using exponential averaging: predict the next burst based on past behavior.
Give each process a fixed time quantum, then rotate. This is the most common algorithm in general-purpose OSes. The quantum must balance responsiveness (short quantum) with efficiency (long quantum, less context switch overhead).
A very short quantum (1ms) means excellent interactivity but high overhead (context switches every 1ms). A long quantum (100ms) gives better throughput but makes the system feel sluggish for interactive tasks.
Each process has a priority. Higher-priority processes run first. Priorities can be static (set by the user) or dynamic (adjusted by the OS to prevent starvation).
Linux uses a hybrid: the Completely Fair Scheduler (CFS), which tracks “vruntime” for each process and always picks the one with the smallest vruntime. This ensures fairness without a traditional priority system. Nice values affect the rate at which vruntime advances, not a static priority.
Two key metrics:
The demo calculates both for each algorithm so you can compare. A good scheduler minimizes both.
Virtual memory is the single most important abstraction the OS provides. It gives every process its own private, contiguous address space, independent of physical memory layout.
Before virtual memory: programs addressed physical memory directly. A program compiled to run at address 0x1000 had to be loaded at exactly that physical address. Two programs could not run simultaneously without coordinated memory allocation.
With virtual memory: every process sees addresses from 0 to 2^64-1 (on 64-bit systems). The OS and hardware (MMU) translate these virtual addresses to physical addresses on the fly.
Each process sees a contiguous virtual address space. The MMU translates virtual pages to physical frames using the page table. The TLB caches recent translations for speed.
Memory is divided into fixed-size blocks called pages (typically 4 KB) and frames (same-size blocks in physical memory). The OS maintains a page table for each process that maps virtual page numbers (VPNs) to physical frame numbers (PFNs).
When a program accesses address 0x7fff_a100:
0x7fff_a (top bits)On x86-64, page tables have 4 levels (PML4, PDPT, PD, PT). A virtual address is split into 5 parts: 4 page table indices (9 bits each) plus a 12-bit offset. Each level points to the next, reducing memory consumption because only needed entries are allocated.
Page table walks are expensive — 4 memory accesses per translation on x86-64. The TLB is a hardware cache of recent VPN-to-PFN mappings. It has 32-512 entries and can be accessed in 1-2 CPU cycles.
On a TLB hit (90-99% of accesses), translation is instant. On a TLB miss, the hardware walks the page tables. On a context switch, the TLB must be flushed (process A’s mappings are meaningless for process B), which is why context switches are expensive.
When a program accesses a virtual address whose page table entry has the valid bit clear, the MMU raises a page fault. The kernel’s page fault handler runs:
Demand paging means pages are only loaded from disk when they are accessed, not when the process starts. Combined with copy-on-write, this means processes start instantly and only use memory they actually touch.
# Check memory mappings of a running process
cat /proc/$(pgrep bash)/maps
# Output shows mapped regions
# 00400000-00452000 r-xp 00000000 08:01 12345 /bin/bash
# 00651000-00652000 r--p 00051000 08:01 12345 /bin/bash
# 00652000-00658000 rw-p 00052000 08:01 12345 /bin/bash
# 7ffc00000000-7ffc00021000 rw-p 00000000 00:00 0 [stack]
When the scheduler decides to switch from process A to process B, a context switch occurs. This is the most concrete cost of multitasking.
A context switch saves the state of the running process and restores the next. This happens 100-10,000 times per second. Each switch costs ~1-100 microseconds of pure overhead.
The sequence is:
During a context switch, the CPU does zero productive work. This is pure overhead. Modern systems can do 300,000+ context switches per second, each costing about 1-10 microseconds. The TLB flush alone can cost hundreds of cycles.
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
long elapsed_us(struct timeval start, struct timeval end) {
return (end.tv_sec - start.tv_sec) * 1000000 +
(end.tv_usec - start.tv_usec);
}
int main() {
int p[2];
pipe(p);
struct timeval start, end;
char buf[1] = {0};
gettimeofday(&start, NULL);
for (int i = 0; i < 10000; i++) {
write(p[1], buf, 1);
read(p[0], buf, 1);
}
gettimeofday(&end, NULL);
long total = elapsed_us(start, end);
printf("Context switch (pipe round trip): %ld us\n", total / 20000);
return 0;
}
This program measures the round-trip time through a pipe (two context switches: parent writes, child reads, child writes, parent reads). On a typical system, each context switch costs about 2-5 microseconds.
Compile with gcc ctx.c -o ctx && ./ctx.
If the OS only ran when user programs voluntarily yielded the CPU, a misbehaving program could loop forever and freeze the system. Interrupts are the hardware mechanism that lets the OS regain control.
A timer interrupt fires at a fixed frequency (typically 100-1000 Hz on Linux, i.e., every 1-10 ms). The interrupt controller (APIC) sends a signal to the CPU. The CPU finishes the current instruction, saves the instruction pointer, and jumps to the interrupt handler registered by the OS.
This is how preemptive multitasking works: processes do not need to be cooperative. The timer fires, the scheduler runs, and the current process may be preempted regardless of its behavior.
Interrupts are categorized as:
int instruction (used for system calls on some architectures)Each interrupt has a vector number. The kernel maintains an Interrupt Descriptor Table (IDT) that maps vectors to handler functions.
# Check interrupt statistics
cat /proc/interrupts
# Example output (simplified):
# CPU0 CPU1
# 0: 1234 IO-APIC timer
# 1: 10 IO-APIC i8042
# 8: 0 IO-APIC rtc0
# 130: 5678 PCI-MSI nvme0q0
Virtual memory enables two powerful features that make physical RAM go much further: demand paging and swapping.
Demand paging: When a process starts, the kernel loads only the first page of code. The rest stays on disk. As the process accesses more code and data, page faults bring in pages lazily. This means startup is fast and processes only use memory for the pages they actually touch.
Swapping: When physical memory is full, the kernel can move a page from physical memory to a designated swap area on disk. The page table entry is marked invalid but stores the disk location. When the process accesses that page, the page fault handler reads it back from swap.
The page replacement policy decides which page to evict when freeing a frame. Linux uses a variant of LRU (Least Recently Used) with two lists: active and inactive. Pages are promoted to active when accessed frequently and demoted to inactive when idle. On memory pressure, inactive pages are evicted first.
The mmap() system call maps a file into a process’s address space, creating a direct mapping between file blocks and virtual pages. Reading from the mapped region triggers page faults that load file data into memory. Writing to the mapped region modifies memory, and the kernel writes changes back to disk (either when msync() is called or when the mapping is removed).
This means file I/O can be as simple as dereferencing a pointer:
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
int main() {
int fd = open("data.txt", O_RDWR);
char *data = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
// Now data[0] is the first byte of the file
printf("First char: %c\n", data[0]);
// Modify the file through memory
data[0] = 'X';
msync(data, 4096, MS_SYNC); // Flush to disk
munmap(data, 4096);
close(fd);
return 0;
}
Memory-mapped files are used extensively by databases, language runtimes, and the dynamic linker. They combine the convenience of memory access with the persistence of files.
The Linux scheduler has evolved significantly. Early versions used an O(n) scheduler that iterated over all processes. In the 2.6 kernel, it was replaced with an O(1) scheduler that used two priority arrays (active and expired) and could select the next process in constant time.
In 2007, Linux 2.6.23 introduced the Completely Fair Scheduler (CFS), which is still the default today. CFS does not use traditional priority queues. Instead:
// Simplified CFS pseudo-code
struct task_struct *cfs_pick_next(struct cfs_rq *cfs_rq) {
struct rb_node *left = cfs_rq->tasks_timeline.rb_leftmost;
struct task_struct *se = rb_entry(left, struct task_struct, run_node);
return se;
}
The key insight is that CFS achieves fairness not through static time slices but by continuously tracking how much each process has run relative to others. A nice value of -20 means a process runs roughly 1.5x more than default, while +19 means about 0.5x.
Let us trace a complete read() call from a user program down to disk and back:
char buf[512]; n = read(fd, buf, sizeof(buf));RDI=fd, RSI=&buf, RDX=512, puts syscall number 0 in RAXsyscall instruction: saves RIP to RCX and RFLAGS to R11, loads kernel stack, jumps to entry entry_SYSCALL_64do_syscall_64()do_syscall_64(): uses RAX to index into sys_call_table, calls sys_read()sys_read(): calls ksys_read() which does fdget(), vfs_read()vfs_read(): checks permissions, calls file->f_op->read_iter (e.g., ext4_file_read_iter)io_schedule() which marks the process as blocked, selects a new process via schedule()sysretq: restores user registers, switches to Ring 3, returns to user code with RAX = number of bytes readThis is the full path. Most of the time, step 8 succeeds (page cache hit, ~97% of reads on a warm system), and the whole operation takes about 1-5 microseconds. On a cache miss (step 9), a single disk seek adds ~4-15 milliseconds — 1000x slower.
A modern OS is a study in tradeoffs. Virtual memory gives every process its own address space but requires TLB flushes on context switches. Preemptive multitasking ensures fairness but adds the overhead of timer interrupts and scheduling decisions. System calls provide security but cost a mode switch.
The brilliance is that these mechanisms compose into a system that feels simple to the programmer. When you call malloc(), the kernel might be:
Your code does not know and does not care. The OS handles it.
Understanding these internals helps you write better software. You know that fork() is cheap until the child writes memory. You know that too many threads cause expensive context switches. You know that random file access on a spinning disk is 1000x slower than sequential. And you know that every system call, no matter how small, costs at minimum a mode switch and register save.
fork() return two different values? What would happen if copy-on-write did not exist?mmap() outperform read() for large sequential reads? When would read() be faster?man 2 syscalls: The full list of Linux system calls/proc/*/maps and /proc/*/status: Inspect memory and process state of any running process