Network¶
A network is a distributed message-passing substrate with no global clock, no shared memory, and no guarantee that messages arrive.
Three properties dominate everything that follows:
- Latency is variable (not just slow or fast)
- Loss is normal (not exceptional)
- Visibility is partial (you never see the whole system)
If your design ignores even one of these, it will eventually fail.
```mermaid
flowchart TB
    classDef l7 fill:#1f2937,color:#fff,stroke:#38bdf8,stroke-width:2px
    classDef l4 fill:#111827,color:#fff,stroke:#22c55e,stroke-width:2px
    classDef l3 fill:#020617,color:#fff,stroke:#f97316,stroke-width:2px
    classDef infra fill:#020617,color:#e5e7eb,stroke:#a855f7,stroke-width:2px
    C[🧑 Client]
    L7["🌐 Application Layer<br/>HTTP • gRPC • WS"]
    L4["📦 Transport Layer<br/>TCP • UDP"]
    L3["🛰️ Network Layer<br/>IP • Routing"]
    I["🏗️ Infra<br/>NICs • VPC • Cables"]
    C --> L7 --> L4 --> L3 --> I
    class L7 l7
    class L4 l4
    class L3 l3
    class I infra
```
**Engineer note:** Failures propagate upward. Symptoms appear at L7; causes usually live below.
2. Layer 3 — IP and Routing (Weak by Design)¶
IP exists to solve addressing and forwarding at global scale. Everything else is explicitly out of scope.
What IP Guarantees¶
- Each packet has a destination address
- Routers make a best-effort attempt to forward it
What IP Refuses to Guarantee¶
- Delivery
- Order
- Duplication avoidance
- Timing
This is not a limitation — it is the reason the internet works.
```mermaid
sequenceDiagram
    participant S as 📤 Sender
    participant R1 as 🛰️ Router A
    participant R2 as 🛰️ Router B
    participant D as 📥 Destination
    S->>R1: Packet #41
    R1->>R2: Packet #41
    R2->>D: Packet #41
    S->>R1: Packet #42
    Note over R1,R2: dropped silently
```
**Production implication:** If your system treats packet loss as exceptional, it will behave pathologically under load.
3. Layer 4 — Transport: Turning Loss into Semantics¶
Transport protocols exist to simulate properties that the network layer refuses to provide.
TCP: Reliability via State and Backpressure¶
TCP converts loss into waiting.
```mermaid
sequenceDiagram
    participant C as 🧑 Client
    participant S as 🖥️ Server
    C->>S: SYN
    S->>C: SYN-ACK
    C->>S: ACK
    Note over C,S: Connection state established
    C->>S: Data (seq=100)
    Note over S: Packet lost
    C->>S: Retransmit (seq=100)
```
Properties:
- Ordered byte stream
- Retransmission
- Congestion control
Hidden cost:
One missing packet can stall everything behind it.
This is why tail latency explodes before throughput collapses.
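The ordered-stream guarantee is visible even on loopback. A minimal Python sketch (payloads and the echo server are illustrative): the kernel may segment, drop, and retransmit packets underneath, but the application only ever sees one ordered byte stream.

```python
import socket
import threading

def run_echo(sock):
    # Accept one connection and echo everything back until EOF.
    conn, _ = sock.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)

# Listen on an ephemeral loopback port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=run_echo, args=(server,), daemon=True).start()

# The client writes two chunks; however they are segmented on the wire,
# they arrive as one in-order stream.
client = socket.create_connection(server.getsockname(), timeout=5)
client.sendall(b"seq=100 ")
client.sendall(b"seq=101")
client.shutdown(socket.SHUT_WR)

received = b""
while True:
    chunk = client.recv(4096)
    if not chunk:
        break
    received += chunk
client.close()

print(received)  # b'seq=100 seq=101' -- bytes arrive in order
```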
UDP: Exposing Reality Directly¶
UDP removes transport-level illusions.
```mermaid
sequenceDiagram
    participant App as 🎮 App
    participant Net as 🌐 Network
    App-->>Net: Packet A
    App-->>Net: Packet B
    Note over Net: delivery not guaranteed
```
Use UDP only when:
- latency is more important than completeness
- the application understands loss
If you add retries, ordering, and congestion logic on top, you are rebuilding TCP — without decades of tuning.
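What "the application understands loss" looks like in practice, as a minimal sketch: the client treats a missing reply as a normal outcome, not an error. On loopback the reply almost always arrives; on a real network the `except` branch is routine.

```python
import socket

# A UDP "request" with an application-level deadline: the transport
# guarantees nothing, so loss handling lives in the application.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(0.2)              # without this, a lost reply blocks forever
client.sendto(b"ping", server.getsockname())

# The server happens to reply here; on the real internet it might not,
# and the client would simply time out.
data, peer = server.recvfrom(64)
server.sendto(b"pong", peer)

try:
    reply, _ = client.recvfrom(64)
except socket.timeout:
    reply = None                    # treat loss as a normal outcome

client.close()
server.close()
print(reply)
```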
4. Layer 7 — Application Protocols (Where Humans Interact)¶
This is the layer engineers think they control.
```mermaid
flowchart LR
    Client[🧑 Client]
    API[🌐 API]
    SVC[⚙️ Service]
    DB[(🗄️ DB)]
    Client --> API --> SVC --> DB
    DB --> SVC --> API --> Client
```
Failures at this layer present as:
- timeouts
- partial data
- retry amplification
Almost never as crashes.
5. API Styles and Operational Reality¶
REST — Boring, Observable, Forgiving¶
REST aligns well with how failures actually happen:
- stateless requests
- independent retries
- cacheable responses
It survives because it degrades gracefully.
gRPC — Tight Contracts, Tight Coupling¶
```mermaid
flowchart LR
    A[⚙️ Service A]
    B[⚙️ Service B]
    A -->|📦 Protobuf| B
    B -->|📦 Binary| A
```
Benefits:
- explicit schemas
- efficient encoding
- streaming
Costs:
- harder debugging
- shared failure domains
**Engineer rule:** Use gRPC where teams share operational ownership.
6. Long‑Lived Connections and State Leakage¶
WebSockets¶
```mermaid
sequenceDiagram
    Client->>Server: HTTP Upgrade
    Server->>Client: 101 Switching Protocols
    Client-->>Server: Message
    Server-->>Client: Message
```
Each open connection consumes:
- memory
- file descriptors
- load balancer state
State scales linearly. Traffic rarely does.
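A back-of-envelope sketch of that linear scaling. The per-connection memory figure below is an illustrative assumption, not a measurement; the point is that whichever resource runs out first caps the instance.

```python
# Capacity estimate for long-lived connections: each open socket
# costs memory and one file descriptor.
def max_connections(mem_bytes, fd_limit, per_conn_bytes=64 * 1024):
    # The binding constraint is whichever budget is exhausted first.
    return min(fd_limit, mem_bytes // per_conn_bytes)

# Hypothetical instance: 2 GiB of headroom, a 65536 fd limit.
print(max_connections(mem_bytes=2 * 2**30, fd_limit=65536))  # 32768 -- memory binds first
```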
7. Load Balancers: Concentrated Power¶
```mermaid
flowchart TB
    classDef lb fill:#0f172a,color:#fff,stroke:#38bdf8,stroke-width:2px
    classDef svc fill:#020617,color:#e5e7eb,stroke:#22c55e,stroke-width:2px
    U[👥 Users]
    LB[⚖️ Load Balancer]
    A[⚙️ Svc A]
    B[⚙️ Svc B]
    C[⚙️ Svc C]
    U --> LB
    LB --> A
    LB --> B
    LB --> C
    class LB lb
    class A,B,C svc
```
Load balancers:
- hide instance failure
- smooth traffic spikes
- introduce a new critical dependency
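The first two bullets can be sketched in a few lines; the third falls out of the sketch. A minimal health-aware round-robin balancer (instance names are hypothetical; real balancers add health checks, draining, and weighting):

```python
# Round-robin over healthy instances only: instance failure is hidden
# from callers -- until the balancer itself has nothing left to offer.
class RoundRobinBalancer:
    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self.i = 0

    def mark_down(self, instance):
        self.healthy.discard(instance)   # hide instance failure from callers

    def pick(self):
        if not self.healthy:
            # The new critical dependency: when this raises, everyone is down.
            raise RuntimeError("no healthy backends")
        for _ in range(len(self.instances)):
            instance = self.instances[self.i % len(self.instances)]
            self.i += 1
            if instance in self.healthy:
                return instance

lb = RoundRobinBalancer(["a", "b", "c"])
print([lb.pick() for _ in range(3)])   # ['a', 'b', 'c']
lb.mark_down("b")
print([lb.pick() for _ in range(2)])   # ['a', 'c'] -- the failure is invisible
```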
L4 vs L7¶
| Type | Sees | Typical Failure |
|---|---|---|
| L4 | TCP | resets, stuck connections |
| L7 | HTTP | misroutes, retry storms |
8. Geography, Latency, and Physics¶
```mermaid
flowchart LR
    User[🧑 User]
    EU[🇪🇺 EU Region]
    US[🇺🇸 US Region]
    User --> EU --> US
```
Facts that do not negotiate:
- distance adds latency
- cross-region calls fail more often
Design response:
- isolate regions
- replicate asynchronously
- cache aggressively
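"Distance adds latency" has a hard floor you can compute. Light in fiber travels at roughly two thirds of c, about 200 km per millisecond; the distance below is an illustrative great-circle estimate.

```python
# Physics floor on round-trip time: straight fiber, zero queuing,
# zero processing. Real RTTs are always worse.
FIBER_KM_PER_MS = 200.0

def min_rtt_ms(distance_km):
    return 2 * distance_km / FIBER_KM_PER_MS

print(round(min_rtt_ms(5600)))   # 56 -- roughly New York to London, best case
```

No amount of engineering beats this number down; caching and regional isolation work by avoiding the trip, not shortening it.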
9. Failure Handling: Engineering Discipline¶
Timeouts¶
Every network call must have a timeout. No exceptions.
A missing timeout is a distributed resource leak.
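The leak is easy to demonstrate. A minimal sketch: a server that accepts the connection but never responds. Without `timeout=`, the `recv` below blocks forever and the caller's resources leak; with it, control returns and the caller can fail fast.

```python
import socket

# A listener that completes the TCP handshake (via the backlog)
# but never writes a byte -- the shape of many real hangs.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname(), timeout=0.2)
try:
    client.recv(1)          # server never writes; without a timeout this blocks forever
    timed_out = False
except socket.timeout:
    timed_out = True        # control returns; the caller can retry or fail fast
finally:
    client.close()
    server.close()

print(timed_out)            # True
```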
Retries¶
```mermaid
flowchart LR
    Call --> Timeout
    Timeout --> Retry1
    Retry1 --> Retry2
    Retry2 --> Saturation
```
Rules:
- retry only idempotent operations
- exponential backoff
- jitter always
Most large outages are retry-driven.
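The backoff and jitter rules together are a few lines. A common sketch is exponential backoff with full jitter (the base and cap values below are illustrative):

```python
import random

def backoff_delay(attempt, base=0.1, cap=10.0):
    # Double the window each attempt, cap it, then pick a random point
    # inside it so retries from many clients spread out instead of
    # arriving in synchronized waves.
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)

delays = [round(backoff_delay(n), 3) for n in range(6)]
print(delays)   # random, but each bounded by min(10.0, 0.1 * 2**n)
```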
Circuit Breakers¶
```mermaid
stateDiagram-v2
    Closed --> Open: error threshold
    Open --> HalfOpen: cooldown
    HalfOpen --> Closed: success
    HalfOpen --> Open: failure
```
Circuit breakers do not prevent failure. They prevent failure propagation.
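The state machine above fits in one small class. A minimal sketch with illustrative thresholds; production breakers also track success rates over rolling windows.

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def allow(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "half-open"   # let one probe through
                return True
            return False                   # shed load; contain the failure
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

The calling code checks `allow()` before each request and reports the outcome; every refused call is load the failing dependency never sees.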
10. Failure Timeline: How Real Incidents Unfold¶
Most outages do not start with a crash. They start with latency drift.
A Typical Production Timeline¶
```mermaid
sequenceDiagram
    participant U as Users
    participant API as API
    participant DB as Database
    U->>API: Request
    API->>DB: Query (slow)
    Note over DB: Latency increases
    DB-->>API: Response (late)
    API-->>U: Timeout
    U->>API: Retry
    Note over API: Load amplifies
```
What actually happened:
- DB slowed slightly (not down)
- Timeouts triggered retries
- Retries increased load
- Latency cascaded into failure
The root cause was not the database. It was the retry policy.
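The amplification is just arithmetic. If each attempt times out with probability `p` and the client retries up to `max_retries` times, the expected load multiplier is a geometric sum (the timeout rates below are illustrative):

```python
# Expected attempts per request: 1 + p + p^2 + ... + p^max_retries.
# A small latency drift that pushes p up drags load up with it.
def expected_attempts(p, max_retries):
    return sum(p ** k for k in range(max_retries + 1))

# Illustrative: a DB slowdown takes the timeout rate from 1% to 50%.
print(round(expected_attempts(0.01, 2), 4))   # 1.0101 -- barely visible
print(expected_attempts(0.50, 2))             # 1.75 -- retries amplify the slowdown
```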
11. Observability by Layer (What to Measure)¶
Good metrics align with layers.
| Layer | What to Measure | Why |
|---|---|---|
| L7 | Latency percentiles | User pain lives here |
| L4 | Connection errors | Detect saturation |
| L3 | Packet loss | Rare, but catastrophic |
```mermaid
flowchart TB
    Metrics[📊 Metrics]
    Logs[📜 Logs]
    Traces[🧵 Traces]
    Metrics --> Decision
    Logs --> Decision
    Traces --> Decision
```
If you cannot explain an outage with these three, you are blind.
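Why the table insists on latency percentiles rather than averages: a minimal nearest-rank sketch over synthetic samples shows how a healthy median coexists with an ugly tail.

```python
import math

def percentile(samples, p):
    # Nearest-rank method: the value at the ceil(p% * n)-th position.
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Synthetic L7 latencies: 97 fast requests, 3 slow ones.
latencies_ms = [10] * 97 + [250, 800, 1200]
print(percentile(latencies_ms, 50))   # 10  -- the median hides the pain
print(percentile(latencies_ms, 99))   # 800 -- the tail is where users live
```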
12. Design Checklist (Used Before Shipping)¶
Before introducing a network call, answer these:
- What is the timeout?
- Is the operation idempotent?
- What happens on partial failure?
- Where does backpressure appear?
- Can this retry amplify load?
If any answer is "we’ll see", the design is incomplete.
13. Vocabulary That Signals Seniority¶
Precision matters in incident reviews.
- Latency vs Delay: latency includes queuing
- Failure vs Fault: faults cause failures
- Load vs Traffic: load is resource pressure
- Availability vs Reliability: uptime vs correctness
Using these correctly changes conversations.
Final Notes¶
- Networks fail quietly
- Latency hides before it hurts
- Defaults encode opinions you didn’t choose
This document is conservative because production systems punish optimism.