Go is one of the few mainstream languages that bake concurrency into their syntax. The go keyword is not a library call; it is a language primitive that starts a function call running concurrently. Every Go program, even a trivial one, starts with a main goroutine and has the full scheduler at its disposal.
Most languages achieve concurrency through OS threads (Java, C++) or an event loop (Node.js, Python asyncio). Go does neither. Instead, it implements an M:N scheduler that multiplexes M goroutines across N OS threads. The runtime manages this entirely behind the scenes.
The library catalog analogy: OS threads are like bookshelves — each one costs real space and you cannot have thousands of them in a room. Goroutines are like library index cards — cheap, plentiful, and the librarian (the Go scheduler) pulls out the right card when needed.
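Goroutines are lightweight "threads" multiplexed onto OS threads. Each goroutine starts with a stack of about 2 KB, versus the 1 MB or more typically reserved for an OS thread. Goroutines yield cooperatively at channel operations, blocking syscalls, and the preemption checks made on function entry; since Go 1.14 the runtime can also preempt them asynchronously.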
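To make the "cheap and plentiful" claim concrete, here is a minimal sketch that starts ten thousand goroutines and waits for all of them. The helper name spawnAndSum is made up for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// spawnAndSum starts n goroutines; each adds its own index to a shared
// total under a mutex, and the function waits for all of them to finish.
func spawnAndSum(n int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			mu.Lock()
			total += v
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return total
}

func main() {
	// 10,000 goroutines; the same number of OS threads would be far costlier.
	fmt.Println(spawnAndSum(10000)) // 50005000 (sum of 1..10000)
}
```

The index is passed as an argument so each goroutine works on its own copy, regardless of the Go version's loop-variable semantics.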
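The Go scheduler has three abstractions: G, M, and P.
- G (goroutine): the unit of work, with its own stack and scheduling state.
- M (machine): an OS thread that executes goroutines.
- P (processor): a scheduling context with a local run queue. The number of P’s is set by GOMAXPROCS (default: number of CPU cores).

The critical insight: P is not a physical processor. It is a resource that bounds the amount of parallelism. At most GOMAXPROCS goroutines run Go code simultaneously, because each running G needs a P, and there are only GOMAXPROCS P’s.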
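The number of P's can be inspected and changed from user code. The sketch below uses only runtime.NumCPU and runtime.GOMAXPROCS from the standard library; the helper name currentParallelism is invented for the example.

```go
package main

import (
	"fmt"
	"runtime"
)

// currentParallelism reports the number of P's without changing it:
// runtime.GOMAXPROCS(0) only queries the current setting.
func currentParallelism() int {
	return runtime.GOMAXPROCS(0)
}

func main() {
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:  ", currentParallelism())

	// Setting a new value returns the previous one, so it can be restored.
	prev := runtime.GOMAXPROCS(2)
	fmt.Println("previous:", prev, "now:", currentParallelism())
	runtime.GOMAXPROCS(prev)
}
```

Lowering GOMAXPROCS limits how many goroutines run in parallel; it does not limit how many goroutines can exist.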
Every M runs a scheduling loop:
// Pseudocode of the scheduler loop
func schedule(m *M) {
	for {
		// 1. Try local run queue of attached P
		if g := m.p.localQueue.pop(); g != nil {
			execute(g)
			continue
		}
		// 2. Try global run queue
		if g := globalQueue.pop(); g != nil {
			execute(g)
			continue
		}
		// 3. Try to steal from another P
		if g := stealWork(m.p); g != nil {
			execute(g)
			continue
		}
		// 4. Park the M (idle)
		park()
	}
}
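When a goroutine blocks on a channel operation or a mutex, the scheduler parks that G and picks the next runnable one from the P’s local queue. The M does not block; only the G does. A blocking syscall is different: the M itself blocks in the kernel, and the runtime hands its P to another M so the remaining goroutines keep running.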
The GMP model: Goroutines (G) are scheduled onto Machines (M, OS threads) by Processors (P, scheduling context). Each P has a local run queue. Idle M's steal work from other P's local queues.
A goroutine transitions through three primary states:
RUNNABLE → RUNNING → RUNNABLE (preemption or yield)
RUNNING → BLOCKED → RUNNABLE (channel/syscall/mutex → unblocked)
The blocked state is what makes goroutines efficient. When a goroutine blocks on a channel, the runtime takes it off the M and parks it on the channel’s wait queue. The M picks the next goroutine without ever performing a blocking OS thread operation.
Each goroutine starts with a tiny 2 KB stack (compared to 1 MB+ for an OS thread). The stack is dynamically growable — when a goroutine needs more stack space, the runtime copies the entire stack to a larger buffer.
// Stack growth is transparent
func recursive(n int) int {
	if n == 0 {
		return 0
	}
	// Each call expands the stack as needed
	return n + recursive(n-1)
}
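The stack copy mechanism: most function prologues check whether enough stack remains. If not, the runtime allocates a new stack (typically twice the size), copies the existing frames across, adjusts any pointers into the old stack, and frees it.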
Before Go 1.14, the scheduler was purely cooperative. A goroutine would run until it made a function call that triggered a scheduling point, such as:
- time.Sleep
- runtime.Gosched()

The problem: a tight loop without any function calls could starve other goroutines:
// Pre-Go 1.14: this loop never yields
for i := 0; i < 1e9; i++ {
	// No function calls, so no scheduling point
}
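Go 1.14 introduced asynchronous preemption. When a goroutine has been running for too long (on the order of 10 ms), the runtime’s sysmon thread sends a SIGURG signal to the M running it. The signal handler interrupts the goroutine and, if it is stopped at a point where preemption is safe, hands control to the scheduler. This ensures no goroutine can monopolize a P.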
Sysmon → SIGURG → M receives signal → goroutine yields → scheduler picks next G
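Not every point in execution is safe to preempt. The runtime tracks “safe points”: locations where the goroutine’s stack and registers are in a consistent state that the garbage collector can scan. If the signal arrives at an unsafe point, the handler returns without preempting. The preemption request stays pending, so the goroutine yields at its next cooperative check (function entry) or on a later signal.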
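The effect of asynchronous preemption can be observed with a small experiment. This is a sketch that assumes Go 1.14 or later on a platform with signal-based preemption, and that preemption has not been disabled via GODEBUG. The function name spinnerIsPreempted is made up for the example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// spinnerIsPreempted runs a call-free busy loop on a single P and reports
// whether the calling goroutine was still scheduled in time to stop it.
func spinnerIsPreempted() bool {
	prev := runtime.GOMAXPROCS(1) // one P: spinner and caller must share it
	defer runtime.GOMAXPROCS(prev)

	var stop int32
	done := make(chan struct{})
	go func() {
		// Busy loop; the atomic load is typically compiled inline,
		// so there is no function prologue to act as a yield point.
		for atomic.LoadInt32(&stop) == 0 {
		}
		close(done)
	}()

	time.Sleep(50 * time.Millisecond) // the spinner now occupies the only P
	atomic.StoreInt32(&stop, 1)       // reached only if the spinner was preempted

	select {
	case <-done:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("spinner preempted:", spinnerIsPreempted())
}
```

With a purely cooperative scheduler the caller would never wake from time.Sleep, because the spinner never reaches a scheduling point.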
Each P has a local run queue — a lock-free ring buffer that holds up to 256 goroutines. The scheduler pops from the local queue first because it is fast (no lock needed for the M that owns the P).
The global run queue is a linked list protected by a mutex. Goroutines from go statements that cannot fit in a P’s local queue go here. The scheduler checks the global queue every 61st scheduling iteration to ensure fairness.
// Scheduling decision (pseudocode)
func schedule(p *P) *G {
	p.ticks++
	// Every 61st iteration, check the global queue first
	if p.ticks%61 == 0 {
		if g := globalQueue.pop(); g != nil {
			return g
		}
	}
	// Prefer local queue
	if g := p.localQueue.pop(); g != nil {
		return g
	}
	// Fall back to global
	if g := globalQueue.pop(); g != nil {
		return g
	}
	return nil
}
The 61:1 ratio ensures that goroutines in the global queue eventually get CPU time, preventing starvation.
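The fairness rule can be tried out in a toy model. The sketch below is not the runtime's code: toyP, pop, and firstGlobalTick are hypothetical names, and goroutines are represented by plain strings. It keeps the local queue permanently busy and reports the tick at which the goroutine waiting in the global queue finally runs.

```go
package main

import "fmt"

// toyP is a hypothetical stand-in for a P: a tick counter plus a local queue.
type toyP struct {
	tick  int
	local []string
}

// pop removes and returns the head of q, or "" if q is empty.
func pop(q *[]string) string {
	if len(*q) == 0 {
		return ""
	}
	g := (*q)[0]
	*q = (*q)[1:]
	return g
}

// next mimics the rule from the text: on every 61st tick the global
// queue is checked before the local one.
func (p *toyP) next(global *[]string) string {
	p.tick++
	if p.tick%61 == 0 {
		if g := pop(global); g != "" {
			return g
		}
	}
	if g := pop(&p.local); g != "" {
		return g
	}
	return pop(global)
}

// firstGlobalTick returns the tick at which the global goroutine runs
// while the local queue never drains.
func firstGlobalTick() int {
	p := &toyP{}
	global := []string{"global-G"}
	for {
		p.local = append(p.local, "local-G") // keep the local queue busy
		if p.next(&global) == "global-G" {
			return p.tick
		}
	}
}

func main() {
	fmt.Println(firstGlobalTick()) // 61
}
```

Without the periodic check, the loop in firstGlobalTick would never terminate, which is exactly the starvation the rule prevents.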
When an M finishes executing all goroutines in its P’s local queue and the global queue is empty, it does not go idle immediately. It picks another P at random and tries to steal half of its local queue.
// Work stealing: steal half the goroutines from a random P
func stealWork(p *P) *G {
	for _, victim := range randomOrder(allProcessors) {
		if victim != p && victim.localQueue.len() > 0 {
			n := victim.localQueue.len() / 2
			for i := 0; i < n; i++ {
				g := victim.localQueue.popBack()
				p.localQueue.push(g)
			}
			return p.localQueue.pop()
		}
	}
	return nil
}
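In this simplified sketch, work stealing uses popBack on the victim (removing from the tail) and pushes to the thief’s queue, which leaves the FIFO order of the victim’s remaining goroutines intact. The real runtime differs in detail: its steal routine (runqgrab) takes a batch from the head of the victim’s ring buffer, that is, the goroutines that have waited longest.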
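A runnable toy version of the simplified pseudocode makes the "steal half" step easy to trace. The queue type and stealHalf are illustrative names; the goroutines are just integers, and none of this is the runtime's actual data structure.

```go
package main

import "fmt"

// queue is a toy run queue of goroutine IDs; index 0 is the head.
type queue []int

// stealHalf mirrors the pseudocode: pop half of the victim's entries
// from the tail and push them onto the thief's queue.
func stealHalf(thief, victim *queue) {
	n := len(*victim) / 2
	for i := 0; i < n; i++ {
		last := len(*victim) - 1
		*thief = append(*thief, (*victim)[last])
		*victim = (*victim)[:last]
	}
}

func main() {
	victim := queue{1, 2, 3, 4, 5, 6}
	var thief queue
	stealHalf(&thief, &victim)
	fmt.Println(thief, victim) // [6 5 4] [1 2 3]
}
```

The victim keeps its oldest entries in their original order, and the thief ends up with enough work to avoid stealing again right away.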
An M that fails to steal work does not immediately block. It spins for a while, checking the global queue and attempting steals in a loop. Spinning ensures low latency when a new goroutine becomes runnable. To keep this from wasting CPU, the runtime caps the number of spinning M’s at roughly half the number of busy P’s.
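The runtime spawns a dedicated OS thread called sysmon (system monitor). It runs without a P, independently of the scheduler, and performs several critical tasks, including preempting goroutines that have run for too long and retaking P’s from M’s that are stuck in long syscalls.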
Sysmon is the runtime’s safety net. It ensures forward progress even if goroutines forget to yield and M’s get stuck on long syscalls.
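When a goroutine makes a blocking syscall (e.g., a file read), the M executing it blocks in the kernel. The runtime detaches the P from that M and hands it to another M, waking or creating one if needed, so the P’s remaining goroutines keep running. When the syscall returns, the goroutine tries to reacquire a P; if none is free, it is placed on the global run queue. Network I/O takes a cheaper path: the network poller parks only the goroutine, so no thread blocks.
This is why Go can handle thousands of blocked I/O operations with only GOMAXPROCS P’s. An M may block on a syscall, but the P and the scheduler continue working.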
A channel in Go is a pointer to an hchan struct in the runtime:
// Simplified hchan struct
type hchan struct {
	qcount   uint           // total data in buf
	dataqsiz uint           // size of circular buffer
	buf      unsafe.Pointer // pointer to buffer array
	elemsize uint16         // size of each element
	closed   uint32         // channel closed flag
	elemtype *_type         // element type
	sendx    uint           // send index in buffer
	recvx    uint           // receive index in buffer
	recvq    waitq          // list of blocked receivers
	sendq    waitq          // list of blocked senders
	lock     mutex          // protects all fields
}
An unbuffered channel (make(chan T)) has dataqsiz = 0. A send blocks until a matching receive is ready, and vice versa. The runtime matches them directly — the sender copies its value into the receiver’s stack frame without going through a buffer.
// Unbuffered channel: handoff
ch := make(chan int)
// G1:
ch <- 42 // blocks until G2 receives
// G2:
x := <-ch // unblocks G1, receives 42 directly
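The fragment above can be turned into a complete program. This minimal sketch wraps the handoff in a function called handoff, a name chosen only for the example.

```go
package main

import "fmt"

// handoff sends v over an unbuffered channel from a second goroutine
// and returns the value the caller receives.
func handoff(v int) int {
	ch := make(chan int) // unbuffered: no buffer slots
	go func() {
		ch <- v // blocks until the receive below is ready
	}()
	return <-ch // rendezvous with the sender
}

func main() {
	fmt.Println(handoff(42)) // 42
}
```

Whichever side arrives first waits for the other, so the send and the receive always complete together.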
The runtime’s chansend and chanrecv functions operate on the same hchan struct. When a goroutine calls ch <- x:
1. Lock the channel.
2. If recvq is not empty (someone is waiting to receive), dequeue a receiver, copy the value directly, unlock.
3. If buf has space, copy the value into the buffer, unlock.
4. Otherwise, enqueue on sendq, park the goroutine, unlock.

Receiving follows the reverse order: check sendq first (direct handoff), then buf, then park on recvq.
A buffered channel (make(chan T, N)) uses a circular buffer. Sends add to the buffer until full. Receives remove from the buffer until empty.
Visualizing hchan:
+-------------------------------+
| hchan                         |
| buf: [10, 20, _, _, _]        |
| sendx: 2 (next write slot)    |
| recvx: 0 (next read slot)     |
| sendq: [] (buffer has room)   |
| recvq: []                     |
| lock: unlocked                |
+-------------------------------+
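The buffered behaviour is easy to check from user code with len, cap, and a non-blocking send. In this sketch, trySend is a helper written for the illustration, not a standard library function.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether it succeeded.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // buffer full and no receiver waiting
	}
}

func main() {
	ch := make(chan int, 2)

	fmt.Println(trySend(ch, 10)) // true
	fmt.Println(trySend(ch, 20)) // true
	fmt.Println(trySend(ch, 30)) // false: the buffer is full

	fmt.Println(len(ch), cap(ch)) // 2 2
	fmt.Println(<-ch)             // 10 (FIFO order)
}
```

A plain ch <- 30 at the third step would park the goroutine on sendq until a receiver freed a slot.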
Key rules:
- close() wakes up all goroutines in recvq

Channels are Go's built-in communication primitive. The runtime `hchan` struct holds a buffer, send and receive wait queues, and a mutex. Unbuffered channels block until both sides are ready. Buffered channels block only when the buffer is full (send) or empty (recv).
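The close rule can be observed directly. In the sketch below, waitForClose is a made-up helper that starts several receivers on one channel, closes it, and counts how many of them return with ok == false.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// waitForClose starts n receivers on one channel, closes it, and returns
// how many of them observed the close (ok == false).
func waitForClose(n int) int {
	ch := make(chan int)
	var wg sync.WaitGroup
	var woke int32
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A receiver parked before close is woken by it; one that
			// arrives afterwards returns immediately. Both see ok == false.
			if _, ok := <-ch; !ok {
				atomic.AddInt32(&woke, 1)
			}
		}()
	}
	close(ch)
	wg.Wait()
	return int(atomic.LoadInt32(&woke))
}

func main() {
	fmt.Println(waitForClose(5)) // 5
}
```

This is why closing a channel works as a broadcast signal, whereas a single send wakes only one receiver.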
The select statement is one of Go’s most sophisticated runtime features. It lets a goroutine wait on multiple channel operations simultaneously.
select {
case v := <-ch1:
	// handle v
case ch2 <- x:
	// handle send
case <-ch3:
	// handle receive
default:
	// none ready
}
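At runtime, a select is evaluated roughly as follows:
1. Shuffle the order in which the cases are checked (using fastrand).
2. Lock all channels involved.
3. For each receive case: is sendq non-empty or buf non-empty?
4. For each send case: is recvq non-empty or buf non-full?
5. If no case is ready and default exists, unlock all channels and execute default.
6. Otherwise, enqueue the goroutine on every channel’s sendq/recvq, unlock, and park until one of the operations can proceed.

When multiple cases are ready simultaneously, Go picks one uniformly at random: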
// Go's runtime selects randomly among ready cases
// This prevents one case from always being favored
select {
case <-ch1:
	// equally likely as ch2 or ch3
case <-ch2:
	// equally likely
case <-ch3:
	// equally likely
}
This is a critical design decision. Without randomization, a select that checks cases in order would always prefer the first ready case, leading to starvation. Go’s select scatters the order per-call using runtime.fastrand.
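The random choice is easy to see empirically. The following sketch keeps two channels ready at the same time and counts which case wins. The function countSelections is hypothetical, and the exact split varies from run to run.

```go
package main

import "fmt"

// countSelections makes both channels ready before every select and
// counts how often each case is chosen.
func countSelections(rounds int) (a, b int) {
	chA := make(chan struct{}, 1)
	chB := make(chan struct{}, 1)
	for i := 0; i < rounds; i++ {
		chA <- struct{}{}
		chB <- struct{}{}
		select {
		case <-chA:
			a++
			<-chB // drain the other channel for the next round
		case <-chB:
			b++
			<-chA
		}
	}
	return a, b
}

func main() {
	a, b := countSelections(10000)
	fmt.Println("case A:", a, "case B:", b) // roughly half each
}
```

If select always took the first ready case, the second counter would stay at zero.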
A receive from or send to a nil channel blocks forever. In a select, nil channels are never selected — the runtime skips them entirely.
var nilCh chan int
select {
case <-nilCh:
	// never selected
case <-readyCh:
	// this case is selected if readyCh has data
default:
	// or this runs
}
This is useful for dynamically disabling cases. Set a channel variable to nil, and its case in a select becomes inert.
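A common use of this trick is merging two channels until both are closed. The sketch below is one way to write it; merge is an illustrative function, not part of the standard library.

```go
package main

import "fmt"

// merge collects values from a and b until both are closed. Once a
// channel is drained, its variable is set to nil so its case goes inert.
func merge(a, b <-chan int) []int {
	var out []int
	for a != nil || b != nil {
		select {
		case v, ok := <-a:
			if !ok {
				a = nil // disable this case
				continue
			}
			out = append(out, v)
		case v, ok := <-b:
			if !ok {
				b = nil // disable this case
				continue
			}
			out = append(out, v)
		}
	}
	return out
}

func main() {
	a := make(chan int, 2)
	b := make(chan int, 2)
	a <- 1
	a <- 2
	close(a)
	b <- 3
	b <- 4
	close(b)
	fmt.Println(len(merge(a, b))) // 4 (interleaving order varies)
}
```

Without the nil assignment, a closed channel would stay permanently ready and the loop would spin on its case.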
The `select` statement lets a goroutine wait on multiple channel operations. When multiple cases are ready, Go's runtime picks one uniformly at random. Nil channels are never selected. The `default` case fires immediately if no channel is ready.
Go’s sync.Mutex uses a two-level strategy: spin briefly, then park the goroutine on a runtime semaphore. (Futexes sit one level lower: on Linux the runtime uses them to put idle OS threads to sleep.)
// Simplified mutex implementation
type Mutex struct {
	state int32  // locked=1 | starving? | woken? | waiters count
	sema  uint32 // semaphore for parking goroutines
}

func (m *Mutex) Lock() {
	// Fast path: try to acquire immediately
	if atomic.CompareAndSwapInt32(&m.state, 0, 1) {
		return // locked!
	}
	// Slow path: spin then park
	m.lockSlow()
}
The lockSlow path: if brief spinning does not acquire the lock, the goroutine calls runtime_Semacquire. This parks the goroutine and puts it on a semaphore queue.
Go 1.9 added starvation mode. If a goroutine waits for more than 1 ms to acquire a mutex, the mutex enters starvation mode:
Normal mode:
Goroutine A holds mutex
Goroutine B spins briefly, then parks
Goroutine A unlocks, B wakes up
Starvation mode:
Goroutine B has been waiting >1ms
Mutex enters starvation
A unlocks and hands off directly to B
No spinning allowed
Go’s garbage collector is concurrent and runs alongside application goroutines. The GC interacts with the scheduler at several points.
STW → Concurrent Mark → STW → Concurrent Sweep
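GC cycles and their stop-the-world cost can be observed from user code through runtime.MemStats. This is a minimal sketch; gcCyclesAfterForce is a name invented for the example, and runtime.GC is used only to force a cycle for demonstration.

```go
package main

import (
	"fmt"
	"runtime"
)

// gcCyclesAfterForce forces one collection and returns how many GC
// cycles completed between the two MemStats snapshots.
func gcCyclesAfterForce() uint32 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	runtime.GC() // blocks until a full collection has finished
	runtime.ReadMemStats(&after)
	return after.NumGC - before.NumGC
}

func main() {
	fmt.Println("cycles completed:", gcCyclesAfterForce())

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// PauseTotalNs sums the stop-the-world pauses over all cycles so far.
	fmt.Println("total STW pause (ns):", m.PauseTotalNs)
}
```

In a long-running service, comparing NumGC and PauseTotalNs over time gives a rough picture of how often the collector runs and how short its pauses are.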
The runtime creates dedicated goroutines for concurrent marking. These GC workers are scheduled like any other goroutine, but they have special priority:
// Pseudocode: GC workers are scheduled alongside user goroutines.
// Dedicated workers target roughly 25% of the P's during marking.
func gcBgMarkWorker() {
	for {
		// Wait for GC mark phase
		<-gcBgMarkWorkerPool
		// Mark work
		markRoots()
		drainWorkBuffers()
	}
}
When an application goroutine allocates memory faster than the GC can mark, the runtime forces the goroutine to help (GC assist):
Allocation → GC Assist → Mark some objects → Continue
This creates feedback: if you allocate more, you help more, which keeps the GC pace matched to allocation rate.
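During concurrent marking, the runtime needs to track pointer writes. Since Go 1.8 it uses a hybrid write barrier that combines a Dijkstra-style insertion barrier with a Yuasa-style deletion barrier:
// Write barrier (simplified): runs before a pointer is written
func writeBarrier(dst *unsafe.Pointer, src unsafe.Pointer) {
	if gcphase == GC_MARK {
		shade(*dst) // deletion barrier: shade the pointer being overwritten
		shade(src)  // insertion barrier: shade the new pointer
	}
	*dst = src
}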
The write barrier ensures that concurrent marking does not miss reachable objects. It activates during the GC mark phase and deactivates during sweep.
Go’s memory model defines the synchronization guarantees between goroutines. Without synchronization, writes in one goroutine are not guaranteed to be visible in another — this is a data race.
The basic rule: a read is guaranteed to observe a write only if the write happens-before the read, and establishing that order between goroutines requires a synchronization operation.
Without synchronization:
G1: x = 42 | G2: print(x) // DATA RACE: may print 0 or 42
With channel synchronization:
G1: x = 42 | G2: <-ch // happens-before
G1: ch <- 1 | G2: print(x) // guaranteed to print 42
A send on a channel happens-before the corresponding receive completes. This means any write before the send is visible to the goroutine after the receive.
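The same guarantee as a complete program: the sketch below writes a variable in one goroutine and reads it in another after a channel receive. The function name publish is chosen for the example.

```go
package main

import "fmt"

// publish writes x in a second goroutine, signals over a channel, and
// reads x only after the receive has completed.
func publish() int {
	var x int
	ch := make(chan struct{})
	go func() {
		x = 42           // write
		ch <- struct{}{} // the send happens-before the receive completes
	}()
	<-ch     // after this, the write to x is guaranteed visible
	return x // always 42, with no data race
}

func main() {
	fmt.Println(publish()) // 42
}
```

Removing the channel operations would turn the read of x into a data race that go run -race reports.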
// Mutex happens-before example
var mu sync.Mutex
var x int
// G1:
mu.Lock()
x = 42
mu.Unlock() // happens-before G2's Lock
// G2:
mu.Lock() // synchronizes with G1's Unlock
print(x) // sees 42, provided G1's Unlock came before this Lock
mu.Unlock()
Go has a built-in race detector:
go run -race main.go
go build -race main.go
The race detector instruments every memory access. At runtime, it reports any unsynchronized read/write to the same memory location:
WARNING: DATA RACE
Read by goroutine 2:
main.readX()
/path/main.go:15 +0x3a
Previous write by goroutine 1:
main.writeX()
/path/main.go:10 +0x34
The race detector adds significant overhead (typically 5-10x memory and 2-20x execution time), so use it in tests and staging rather than in production.
Go's memory model defines when a write in one goroutine is guaranteed to be visible to a read in another. Without synchronization you have a data race. Channels, mutexes, and other sync primitives establish happens-before edges.
A goroutine leak occurs when goroutines are created but never exit. The goroutine stays in a blocked state forever, consuming stack memory.
// Leak 1: Send on a channel nobody receives from
func leak() {
	ch := make(chan int)
	go func() {
		ch <- 42 // blocks forever: nobody reads
	}()
}

// Leak 2: Receive from a channel nobody sends to
func leak2() {
	ch := make(chan int)
	go func() {
		<-ch // blocks forever: nobody sends
	}()
}

// Leak 3: Blocked on a mutex that never unlocks
func leak3() {
	var mu sync.Mutex
	mu.Lock()
	go func() {
		mu.Lock() // blocks forever: the first Lock is never released
	}()
}
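The net/http/pprof package exposes goroutine profiles:
go tool pprof http://localhost:6060/debug/pprof/goroutine
The goroutine count can also be polled with runtime.NumGoroutine():
import (
	"log"
	"runtime"
	"time"
)

// monitor logs the goroutine count every 10 seconds;
// a count that only ever grows suggests a leak.
func monitor() {
	for {
		n := runtime.NumGoroutine()
		log.Printf("goroutines: %d", n)
		time.Sleep(10 * time.Second)
	}
}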
Use go.uber.org/goleak to detect leaked goroutines in tests:
func TestNoLeak(t *testing.T) {
	defer goleak.VerifyNone(t)
	// test code
}
Use select with default or timeouts for non-blocking channel ops:
// Safe pattern: context cancellation prevents leak
func safe(ctx context.Context, ch chan int) {
	select {
	case ch <- 42:
	case <-ctx.Done():
		// context cancelled, don't block
	}
}
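To check that the cancellation pattern really lets the goroutine exit, the sketch below starts a producer that nobody receives from and then cancels its context. The names producer and cancelledProducerExits are made up for this example, and the one-second timeout is an arbitrary safety net.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// producer tries to send one value but gives up when ctx is cancelled.
// It closes exited on return so callers can observe that it finished.
func producer(ctx context.Context, ch chan<- int, exited chan<- struct{}) {
	defer close(exited)
	select {
	case ch <- 42:
	case <-ctx.Done():
	}
}

// cancelledProducerExits reports whether the goroutine exits after
// cancellation even though nobody ever receives from ch.
func cancelledProducerExits() bool {
	ctx, cancel := context.WithCancel(context.Background())
	ch := make(chan int) // never read: a bare send would leak
	exited := make(chan struct{})
	go producer(ctx, ch, exited)

	cancel()
	select {
	case <-exited:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("producer exited:", cancelledProducerExits())
}
```

The same structure works with context.WithTimeout when the caller wants an upper bound on how long the goroutine may wait.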
Go’s runtime is a self-contained library compiled into every binary. There is no VM, no JIT, and no runtime dependency to install.
Toolchain:
go build
→ go tool compile (source → AST → SSA → .o)
→ go tool asm (assembly → .o)
→ go tool pack (archive .a files)
→ go tool link (resolve symbols, embed runtime)
→ static executable
Runtime in the binary:
┌─────────────────────────────────┐
│ Go Binary │
│ ┌───────────────────────────┐ │
│ │ GMP Scheduler │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐│ │
│ │ │ G's │ │ M's │ │ P's ││ │
│ │ └─────┘ └─────┘ └─────┘│ │
│ │ Sysmon (preemption) │ │
│ ├───────────────────────────┤ │
│ │ GC (Concurrent) │ │
│ │ Marker Sweeper Worker │ │
│ ├───────────────────────────┤ │
│ │ Memory Allocator │ │
│ │ mcache mcentral mheap │ │
│ ├───────────────────────────┤ │
│ │ Network Poller │ │
│ │ (epoll/kqueue/IOCP) │ │
│ └───────────────────────────┘ │
│ │ │
│ syscall/SIGURG │
│ │ │
│ OS Kernel (Linux/macOS) │
└─────────────────────────────────┘
Go's runtime is compiled into every binary. It includes the GMP scheduler, concurrent garbage collector, memory allocator, and the sysmon thread. No VM, no runtime dependency — just a static binary.
| Concept | Mechanism | Key Insight |
|---|---|---|
| Goroutines | M:N scheduling | Lightweight stacks, growable, multiplexed onto threads |
| GMP | G, M, P | P limits parallelism, M executes, G is the workload |
| Scheduling | Cooperative + preemptive | Yield at channel ops, async preemption via SIGURG |
| Work stealing | Random steal from P queues | Balances load, keeps M’s busy |
| Channels | hchan struct | Buffer + sendq/recvq + mutex, direct handoff when possible |
| Select | Scrambled case checking | Random selection among ready cases, nil channels skipped |
| Mutex | Spin + semaphore | Brief spin avoids context switch, starvation mode for fairness |
| GC | Concurrent mark-sweep | GC assists, write barrier, sub-ms pauses |
| Memory model | Happens-before | Channels, mutexes, and atomics establish ordering |