Game servers are deceptively hard. From the outside it looks like you just need a box that hosts matches, but the moment you try to build ranked matchmaking, you realize you need to solve three very different networking problems at once: players need to authenticate and check leaderboards, game servers need to register themselves and report results, and the matchmaker needs a persistent connection to push match assignments in real time.
These are different kinds of communication. REST works great for request-response stuff like auth and profiles, but it falls apart when you need the server to push data to a client unprompted. And you definitely don't want game servers polling an HTTP endpoint every 500ms to ask "any new players for me?"
So I built RankedServer. It's async Rust on Tokio, and it runs an Axum HTTP server, a TCP server-to-server (S2S) channel, and a TCP rendezvous (RDV) service all in one process.
S2S and RDV are actually the same protocol underneath. Both use a custom binary framing with a 57-byte header:
pub struct MessageHeader {
    pub payload_len: u32,  // 4 bytes
    pub hmac: [u8; 32],    // 32 bytes - HMAC-SHA256
    pub nonce: [u8; 12],   // 12 bytes - AES-GCM nonce
    pub msg_type: u8,      // 1 byte
    pub timestamp_ms: u64, // 8 bytes
}
Payloads are encrypted with AES-256-GCM. The HMAC covers the header fields (minus the HMAC itself) and the ciphertext, so you can't tamper with the message type or timestamp without invalidating the tag. Each connection tracks the last 512 nonces in a rolling window and rejects duplicates, and timestamps can't drift more than 120 seconds from the server's clock.
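Sketched out, the replay check is simple. This assumes a plain VecDeque for the window; the real structure is equivalent, though a hash set alongside a ring buffer would avoid the linear scan:

use std::collections::VecDeque;

const NONCE_WINDOW: usize = 512;
const MAX_DRIFT_MS: u64 = 120_000;

struct ReplayGuard {
    recent_nonces: VecDeque<[u8; 12]>,
}

impl ReplayGuard {
    /// Returns true if the frame passes both freshness checks.
    fn check(&mut self, nonce: [u8; 12], timestamp_ms: u64, now_ms: u64) -> bool {
        // Reject anything more than 120 seconds from our clock, in either direction.
        if now_ms.abs_diff(timestamp_ms) > MAX_DRIFT_MS {
            return false;
        }
        // Reject a nonce we've already seen in the rolling window.
        if self.recent_nonces.contains(&nonce) {
            return false;
        }
        // Record it, evicting the oldest entry once the window is full.
        if self.recent_nonces.len() == NONCE_WINDOW {
            self.recent_nonces.pop_front();
        }
        self.recent_nonces.push_back(nonce);
        true
    }
}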
The only difference between S2S and RDV is who's connecting and how they authenticate. S2S is for game servers, which authenticate with Ed25519 certificates. RDV is for players, who authenticate with temporary tickets issued at HTTP login. Everything below the auth layer is identical.
The whole system assumes game servers will eventually get compromised. They run on machines you don't fully control, players poke at them constantly, and it only takes one exploit. So the question I kept coming back to was: what happens when one of your game servers is hostile?
The channel itself is pretty simple. Game servers connect over TCP, authenticate with their certificate, and get a persistent encrypted connection. They send heartbeats, report match results, and stream logs. The master sends back commands: allow players to join, kick someone, force-start a match, shut down.
pub enum MsgType {
    Heartbeat = 0x21,       // Server -> Master: "I'm alive, here's my state"
    MatchResult = 0x22,     // Server -> Master: game ended, here are stats
    MatchStarted = 0x24,    // Server -> Master: match began
    CmdKickPlayer = 0x33,   // Master -> Server: remove this player
    CmdAllowPlayers = 0x36, // Master -> Server: these players may join
    CmdShutdown = 0x34,     // Master -> Server: shut down gracefully
    LogInfo = 0x11,         // Server -> Master: log streaming
    // ...
}
Each connection gets its own Tokio task trio: a reader loop that parses and decrypts inbound frames, a writer task that encrypts and sends outbound messages via an mpsc channel, and a handler task that processes decrypted messages without blocking I/O. The handler enforces certificate identity too. If a heartbeat claims a username that doesn't match the authenticated certificate owner, the server overrides it. When a connection drops, a RemoveOnDrop guard automatically marks the server offline in the database. RAII means you can't forget to clean up.
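In sketch form, with the crypto elided and the names invented for illustration, the trio looks like this:

use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;
use tokio::sync::mpsc;

// Shape of the per-connection split; framing, HMAC, and AES-GCM elided.
async fn spawn_connection_tasks(stream: TcpStream) {
    let (mut rd, mut wr) = stream.into_split();
    let (out_tx, mut out_rx) = mpsc::channel::<Vec<u8>>(64);
    let (in_tx, mut in_rx) = mpsc::channel::<Vec<u8>>(64);

    // Reader: the header parse, HMAC check, and decrypt live here.
    tokio::spawn(async move {
        let mut buf = [0u8; 4096];
        while let Ok(n) = rd.read(&mut buf).await {
            if n == 0 || in_tx.send(buf[..n].to_vec()).await.is_err() {
                break;
            }
        }
        // Connection gone: the RemoveOnDrop guard fires about here.
    });

    // Writer: the only task that touches the write half; encrypt + frame here.
    tokio::spawn(async move {
        while let Some(frame) = out_rx.recv().await {
            if wr.write_all(&frame).await.is_err() {
                break;
            }
        }
    });

    // Handler: business logic on decrypted messages, no socket I/O of its own.
    tokio::spawn(async move {
        while let Some(msg) = in_rx.recv().await {
            // Dispatch on MsgType, enforce cert identity, queue replies...
            let _ = out_tx.send(msg).await; // placeholder echo
        }
    });
}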
But none of that helps when the server itself gets owned. The attacker already has a valid, authenticated connection. That's where the certificate system comes in: each game server gets its own Ed25519 certificate with a unique UUID. Server gets compromised? Revoke that UUID in the database and its next heartbeat gets rejected. The other servers don't even know anything happened.
Each certificate is a 133-byte binary file:
// S2SC magic (4) + version (1) + UUID (16) + issued_at (8)
// + expires_at (8) + secret (32) + Ed25519 signature (64)
pub struct CertificateFile {
    pub uuid: [u8; 16],
    pub issued_at: i64,
    pub expires_at: i64,
    pub secret: [u8; 32],
    pub signature: [u8; 64],
}
The master server holds an Ed25519 signing key. When a cert is issued, the secret is generated randomly, the cert is signed, and the secret is stored in the database encrypted at rest with AES-256-GCM using a separate DB encryption key. The game server gets the .s2scert file and uses the embedded secret to HMAC-sign its auth handshake. The master decrypts the stored secret to verify. The secret itself never goes over the wire during auth, only an HMAC proof of it.
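The proof itself is just an HMAC keyed by the cert secret. Simplified here (the real transcript covers more than the bare challenge, and verify_proof leans on the constant_time_eq shown further down):

use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

// Game server side: prove you hold the cert secret without transmitting it.
fn auth_proof(secret: &[u8; 32], server_challenge: &[u8]) -> [u8; 32] {
    let mut mac = HmacSha256::new_from_slice(secret).expect("HMAC takes any key length");
    mac.update(server_challenge);
    let tag = mac.finalize().into_bytes();
    let mut out = [0u8; 32];
    out.copy_from_slice(&tag);
    out
}

// Master side: decrypt the stored secret, recompute, compare in constant time.
fn verify_proof(stored_secret: &[u8; 32], challenge: &[u8], proof: &[u8; 32]) -> bool {
    constant_time_eq(&auth_proof(stored_secret, challenge), proof)
}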
This gives you three ways to kill a compromised server: revoke its certificate UUID, let it expire, or rotate the master signing key and every cert dies at once. There's a CLI tool (certtool) for issuing, revoking, and inspecting certificates so you're not doing it by hand. Game servers can optionally embed the master public key to verify they're talking to the real master during handshake.
RDV handles the player side. Players connect over TCP after logging in through the HTTP API, authenticate using their session ticket, and send JoinQueue / LeaveQueue messages. The server pushes MatchFound, MatchStarted, MatchCancelled, and MatchReport back. Same binary framing as S2S, but authentication is ticket-based rather than certificate-based. The ticket is a SHA-256-derived key stored server-side after login. The handshake includes a challenge-response exchange, and the session key is derived from the auth key, client challenge, and server challenge response via HMAC, so even if someone intercepts the handshake, they can't derive the session key without the original ticket.
The matchmaker runs on a background tick every 500ms:
use tokio::time::{sleep, Duration};

// `this` is the shared matchmaker handle (a cloned Arc) moved into the task.
let tick = this.clone();
tokio::spawn(async move {
    loop {
        sleep(Duration::from_millis(500)).await;
        if let Err(e) = tick.matchmaker_tick().await {
            tracing::warn!("RDV tick error: {}", e);
        }
    }
});
Each tick runs inside a SQLite BEGIN IMMEDIATE transaction to prevent race conditions. It cleans up stale queue entries, expires old allowlist reservations, finds available servers, and batch-assigns queued players. One ordering detail that took me a while to get right: the matchmaker sends CmdAllowPlayers to the game server before telling clients where to connect. Without this, a fast client can join and get kicked because the game server hasn't received the allowlist yet.
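The shape of it, with mpsc channels standing in for the real connection handles and a stand-in payload type:

use std::net::SocketAddr;
use tokio::sync::mpsc::{self, error::SendError};

// Stand-in for the real command/notification payloads.
#[derive(Clone, Debug)]
struct MatchAssignment {
    server_addr: SocketAddr,
    player_ids: Vec<u32>,
}

async fn assign_match(
    s2s_tx: &mpsc::Sender<MatchAssignment>,    // to the game server's writer
    rdv_txs: &[mpsc::Sender<MatchAssignment>], // one per matched player
    assignment: MatchAssignment,
) -> Result<(), SendError<MatchAssignment>> {
    // 1. CmdAllowPlayers first: the game server must hold the allowlist
    //    before any client can possibly show up at its door.
    s2s_tx.send(assignment.clone()).await?;

    // 2. Only then push MatchFound to the waiting clients.
    for tx in rdv_txs {
        tx.send(assignment.clone()).await?;
    }
    Ok(())
}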
Players who disconnect and reconnect get routed back to their existing match if one is active. The system checks the allowlist and active matches before putting someone back in the queue.
There's also a TrueSkill-inspired rating engine in there. I wrote a separate post about that.
The admin dashboard runs on a separate port (23646, "ADMIN" on a phone keypad) with its own security. The IP allowlist is checked the moment a connection is accepted, so if you're not on the list, the connection is dropped before a single byte is exchanged; you never even reach the TLS handshake. Past that, there's mTLS with client certificate verification, and then bearer token auth at the application layer. The token is read from TBQ_ADMIN_TOKEN at startup and immediately scrubbed from the process environment so it isn't inherited by child processes or readable through later environment lookups. All admin actions are audit logged to both the database and structured logging with the IP, client certificate CN, action, and result.
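The token handling is a few lines. Note that remove_var needs an unsafe block on the 2024 edition, because mutating the environment races with concurrent reads:

use std::env;

// Read the admin token once at startup, then scrub it so child processes
// and later environment reads can't see it.
fn load_admin_token() -> Option<String> {
    let token = env::var("TBQ_ADMIN_TOKEN").ok()?;
    // Safe here: called at startup, before any other threads exist.
    unsafe { env::remove_var("TBQ_ADMIN_TOKEN") };
    Some(token)
}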
From the dashboard you can see connected servers, view the matchmaking queue, look up players, issue and revoke certificates, send commands to game servers, and toggle maintenance mode. It's where all the operational stuff lives.
Both S2S and RDV derive session keys the same way. The session key isn't the auth key directly. It's an HMAC of the auth key, the client's random challenge, and the server's challenge response. So even if the same certificate reconnects immediately, it gets a different session key, and compromising one connection's key doesn't give you the others. All HMAC verifications use constant-time comparisons:
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut v = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        v |= x ^ y;
    }
    v == 0
}
I'll keep this short because "why Rust" has been argued to death elsewhere. For this project specifically:
Three protocol servers, hundreds of concurrent connections, background matchmaker ticks, and per-connection task trios all run in a single process. Tokio's task model maps well to this: each connection costs a few cheap tasks rather than OS threads, and mpsc channels separate I/O from business logic cleanly.
When you're manually parsing binary frames with from_le_bytes and byte-offset slicing, Rust's type system catches the off-by-one errors that would be silent memory corruption in C. The MsgType enum with explicit discriminants means adding a new message type forces you to handle it everywhere.
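For a taste, here's the header parse. Offsets follow the field order shown earlier, little-endian assumed:

// Offsets: payload_len 0..4, hmac 4..36, nonce 36..48, msg_type 48,
// timestamp_ms 49..57 -- 57 bytes total.
fn parse_header(buf: &[u8; 57]) -> MessageHeader {
    MessageHeader {
        payload_len: u32::from_le_bytes(buf[0..4].try_into().unwrap()),
        hmac: buf[4..36].try_into().unwrap(),
        nonce: buf[36..48].try_into().unwrap(),
        msg_type: buf[48],
        timestamp_ms: u64::from_le_bytes(buf[49..57].try_into().unwrap()),
    }
}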
The DashMap<u32, ConnectionHandle> that tracks live S2S connections is accessed from accept loops, handler tasks, the matchmaker, and admin commands, all concurrently. In Go you'd need careful mutex discipline. In Rust the compiler enforces it.
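A flavor of what that looks like; ConnectionHandle here is a stub, just enough to show the access pattern:

use std::sync::Arc;
use dashmap::DashMap;

// Stand-in for the real handle.
struct ConnectionHandle {
    outbound: tokio::sync::mpsc::Sender<Vec<u8>>,
}

type Connections = Arc<DashMap<u32, ConnectionHandle>>;

// Callable concurrently from the accept loop, handler tasks, the matchmaker,
// and admin commands; DashMap shards its locks internally.
fn try_queue_frame(conns: &Connections, server_id: u32, frame: Vec<u8>) -> bool {
    match conns.get(&server_id) {
        Some(handle) => handle.outbound.try_send(frame).is_ok(),
        None => false,
    }
}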
I rolled my own crypto framing because I wanted to understand how it works. In production I'd probably just use TLS 1.3 and save myself the headache of managing nonce windows and HMAC ordering. The custom framing gives me more control over the wire format, but it's also more surface area for me to screw up.
SQLite is fine for matchmaking state at this scale, but it wouldn't survive hundreds of concurrent game servers. I'd move to Postgres with advisory locks if this needed to grow. The matchmaker tick loop is also single-threaded by design, which keeps it simple but means matchmaking throughput is bottlenecked on one task.
The whole thing runs as a single process, which is easy to deploy but is a single point of failure. No clustering, no failover. For a small community that's fine. For anything bigger you'd want at least a hot standby.
Source is at github.com/dannyisbad/ranked-server.