Daily Dejavu

How to Build a Production CI/CD Pipeline with GitHub Actions — Staging, Secrets, and Rollback

banditz — Mon, 20 Apr 2026 09:48:00 GMT

The deploy process at too many companies looks like this: SSH into the production server, `cd /var/www/app`, `git pull origin main`, `npm install`, `npm run build`, restart the service, and refresh the browser to see if anything broke. If it did, you SSH back in, `git log`, `git revert`, and hope you caught the right commit.

This works until it doesn't. And when it doesn't, it fails spectacularly — a typo in a config file that takes the site down, a missing dependency that wasn't in the repo, a migration that runs against production before anyone's ready. No audit trail, no approval gate, no rollback beyond "revert and pray."

CI/CD isn't about fancy tools. It's about making deployments boring, predictable, and reversible.

## The Pipeline Structure

A production pipeline has four stages:

1. **Test** — run the full test suite. If anything fails, stop.
2. **Build** — create the deployment artifact (Docker image, compiled bundle, whatever ships).
3. **Deploy to Staging** — automatically deploy to an environment that mirrors production.
4. **Deploy to Production** — after staging verification, deploy with an approval gate.

Here's the GitHub Actions workflow structure:

    # .github/workflows/deploy.yml
    name: Deploy Pipeline

    on:
      push:
        branches: [main]
      workflow_dispatch:   # Manual trigger

    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4

          - name: Set up Node.js
            uses: actions/setup-node@v4
            with:
              node-version: '20'
              cache: 'npm'

          - run: npm ci
          - run: npm run lint
          - run: npm test

      build:
        needs: test
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4

          - name: Build Docker image
            run: |
              docker build -t myapp:${{ github.sha }} .
              docker tag myapp:${{ github.sha }} myapp:latest

          - name: Push to registry
            run: |
              echo "${{ secrets.REGISTRY_PASSWORD }}" | docker login -u "${{ secrets.REGISTRY_USER }}" --password-stdin
              docker push myapp:${{ github.sha }}

      deploy-staging:
        needs: build
        runs-on: ubuntu-latest
        environment: staging
        steps:
          - name: Deploy to staging
            uses: appleboy/ssh-action@v1
            with:
              host: ${{ secrets.STAGING_HOST }}
              username: deploy
              key: ${{ secrets.SSH_PRIVATE_KEY }}
              script: |
                docker pull myapp:${{ github.sha }}
                docker stop myapp || true
                docker rm myapp || true
                docker run -d --name myapp \
                  --env-file /opt/myapp/.env \
                  -p 3000:3000 \
                  myapp:${{ github.sha }}

          - name: Smoke test staging
            run: |
              sleep 10
              curl -sf https://staging.yourdomain.com/health || exit 1

      deploy-production:
        needs: deploy-staging
        runs-on: ubuntu-latest
        environment: production     # Requires approval
        steps:
          - name: Deploy to production
            uses: appleboy/ssh-action@v1
            with:
              host: ${{ secrets.PRODUCTION_HOST }}
              username: deploy
              key: ${{ secrets.SSH_PRIVATE_KEY }}
              script: |
                docker pull myapp:${{ github.sha }}
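                # Keep the previous version for rollback. Remove the older
                # myapp-previous first so the rename cannot collide with it.
                docker rm -f myapp-previous || true
                docker stop myapp || true
                docker rename myapp myapp-previous || true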
                docker run -d --name myapp \
                  --env-file /opt/myapp/.env \
                  -p 3000:3000 \
                  myapp:${{ github.sha }}

          - name: Verify production
            run: |
              sleep 15
              curl -sf https://yourdomain.com/health || exit 1

The `needs` keyword creates dependencies between jobs. `deploy-staging` won't run unless `build` succeeds. `deploy-production` won't run unless `deploy-staging` succeeds. One failed test stops the entire pipeline.
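To make the effect of `needs` concrete, here is a minimal sketch that models the job graph from the workflow above as a plain object and works out which jobs never start once something upstream fails. The `skippedJobs` helper is hypothetical and only mirrors the default behavior (a job with an `if:` condition can override it).

```javascript
// Job graph from the workflow above: job name -> jobs it `needs`.
const needs = {
  test: [],
  build: ["test"],
  "deploy-staging": ["build"],
  "deploy-production": ["deploy-staging"],
};

// Given the set of failed jobs, return every job that will not run
// because a job it depends on (directly or indirectly) failed.
function skippedJobs(needs, failed) {
  const blocked = new Set();
  let changed = true;
  while (changed) {
    changed = false;
    for (const [job, deps] of Object.entries(needs)) {
      if (blocked.has(job) || failed.has(job)) continue;
      if (deps.some((dep) => failed.has(dep) || blocked.has(dep))) {
        blocked.add(job);
        changed = true;
      }
    }
  }
  return [...blocked];
}

console.log(skippedJobs(needs, new Set(["test"])));
// [ 'build', 'deploy-staging', 'deploy-production' ]
```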

## Secrets Management

Never hardcode credentials in workflow files. GitHub provides encrypted secrets at the repository and environment level.

Go to **Settings → Secrets and variables → Actions** and add:

- `REGISTRY_USER` / `REGISTRY_PASSWORD` — Docker registry credentials
- `SSH_PRIVATE_KEY` — deploy user's SSH key
- `STAGING_HOST` / `PRODUCTION_HOST` — server addresses

Secrets are automatically masked in logs. If a secret value appears in the output, GitHub replaces it with `***`.

For cloud deployments (AWS, GCP, Azure), use **OIDC federation** instead of long-lived credentials. This generates short-lived tokens for each workflow run:

    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::123456789012:role/github-deploy
        aws-region: us-east-1

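No access keys are stored anywhere. GitHub authenticates to the cloud provider with its OIDC identity; the job must grant `permissions: id-token: write` so it can request that identity token. The resulting credentials are short-lived and expire on their own (after an hour by default with this action), whether or not the workflow is still running.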

## The Approval Gate

In the GitHub repository, go to **Settings → Environments → production**. Enable:

- **Required reviewers** — specify team members who must approve before production deploy
- **Wait timer** — optional delay before deploy executes (gives time for last-minute checks)
- **Deployment branches** — restrict to `main` only

When the pipeline reaches `deploy-production`, it pauses and sends a notification to the required reviewers. They can inspect the staging deployment, check test results, and either approve or reject.

This prevents the "I accidentally merged to main and it deployed to production" scenario. Someone has to actively approve every production deployment.

## Database Migrations

The trickiest part of CI/CD is database migrations. The rule:

**Every migration must be backward compatible.**

This means the old application code must work with the new schema. Why? Because during deployment, there's a window where the old code is still running against the new database. And during rollback, the new schema must work with the old code.

Practically, this means:

- **Adding a column**: safe. Old code ignores it.
- **Removing a column**: do it in two deploys. First deploy: stop using the column in code. Second deploy: remove the column from the schema.
- **Renaming a column**: spread it over several deploys. Add the new column, copy the data across, switch the code to the new column, and only then remove the old one.
- **Adding a NOT NULL column**: add it with a default value, or make it nullable first and tighten the constraint in a later deploy.

Run migrations as a separate step before deploying the application:

    - name: Run migrations
      uses: appleboy/ssh-action@v1
      with:
        host: ${{ secrets.PRODUCTION_HOST }}
        username: deploy
        key: ${{ secrets.SSH_PRIVATE_KEY }}
        script: |
          docker run --rm \
            --env-file /opt/myapp/.env \
            myapp:${{ github.sha }} \
            npm run migrate

## Rollback

A deployment pipeline without rollback is a deployment prayer.

Create a separate rollback workflow:

    # .github/workflows/rollback.yml
    name: Rollback Production
    on:
      workflow_dispatch:
        inputs:
          version:
            description: 'Git SHA to rollback to'
            required: true

    jobs:
      rollback:
        runs-on: ubuntu-latest
        environment: production
        steps:
          - name: Rollback production
            uses: appleboy/ssh-action@v1
            with:
              host: ${{ secrets.PRODUCTION_HOST }}
              username: deploy
              key: ${{ secrets.SSH_PRIVATE_KEY }}
              script: |
                docker pull myapp:${{ github.event.inputs.version }}
                docker stop myapp || true
                docker rm myapp || true
                docker run -d --name myapp \
                  --env-file /opt/myapp/.env \
                  -p 3000:3000 \
                  myapp:${{ github.event.inputs.version }}

          - name: Verify rollback
            run: |
              sleep 15
              curl -sf https://yourdomain.com/health || exit 1

Trigger it manually from the Actions tab and enter the SHA of the last known good version. The approval gate still applies.

**Critical:** test your rollback process regularly. An untested rollback is not a rollback — it's a hope. Run a rollback drill quarterly. Deploy a known version, roll back, verify.

## The CI/CD Maturity Progression

If you're currently at "SSH and git pull," don't try to build the full pipeline in one day. Progress through these stages:

**Stage 1:** Automated tests in GitHub Actions. Tests run on every push. You still deploy manually, but you know the code passes tests before you do.

**Stage 2:** Automated staging deploys. Code that passes tests is automatically deployed to staging. You manually deploy to production.

**Stage 3:** Production approval gates. Staging works, you've built confidence. Add the production deploy with manual approval.

**Stage 4:** Monitoring and rollback. Add health checks, deployment verification, and a tested rollback process.

Each stage builds on the previous one. Don't skip stages. The confidence you build at each level is what makes the next level safe.

---

If you found this guide helpful, check out our other resources:

- [Docker Networking Explained](/devops/docker-networking-bridge-host-overlay-explained)

Docker Networking Explained — Bridge vs Host vs Overlay and When to Use Each

banditz — Mon, 20 Apr 2026 08:55:00 GMT

You spin up two containers. A web app and a database. You try to connect the app to the database using the container name as the hostname. It doesn’t work. Connection refused. You try the container’s IP address — it works. But the IP changes every time the container restarts.

This is the most common Docker networking confusion, and it happens because of the default bridge network’s limitation that nobody tells you about until you hit it.

Docker has six network modes. Most people use one (the default bridge) and don’t realize it’s the worst one for multi-container applications.

## The Default Bridge: Why It’s a Trap

When Docker installs, it creates a network called `bridge`, backed by a Linux bridge device called `docker0`. Every container started without a `--network` flag connects to this default bridge.

    docker run -d --name web nginx
    docker run -d --name db postgres:16

Both containers are on the default bridge. They can reach each other by IP:

    docker inspect web | grep IPAddress
    # "IPAddress": "172.17.0.2"
    docker exec db ping 172.17.0.2
    # Works

But:

    docker exec db ping web
    # ping: bad address 'web'

The default bridge does not provide DNS resolution between containers. This is the single most confusing thing about Docker networking, and it catches everyone.

The reason is historical. The default bridge predates Docker’s embedded DNS server. For backward compatibility, Docker kept it without DNS. Every user-created bridge network gets DNS automatically.

Never use the default bridge for multi-container applications. Always create a custom bridge.

## Custom Bridge Networks: The Right Way

    docker network create mynet
    docker run -d --name web --network mynet nginx
    docker run -d --name db --network mynet postgres:16

Now:

    docker exec web ping db
    # PING db (172.20.0.3): 56 data bytes (works)
    docker exec db ping web
    # Works too

Custom bridge networks provide:

- **DNS resolution**: containers find each other by name
- **Better isolation**: containers on different networks can’t communicate
- **Connect/disconnect on the fly**: `docker network connect mynet existing-container`

Docker Compose does this automatically. Every docker-compose.yml creates a custom bridge for the project. All services can reach each other by service name:

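    services:
      web:
        image: nginx
        ports:
          - "80:80"
      api:
        image: myapp
        environment:
          DATABASE_URL: postgresql://user:pass@db:5432/mydb
      db:
        image: postgres:16
        environment:
          # User and database names must match the connection string above.
          POSTGRES_USER: user
          POSTGRES_PASSWORD: pass
          POSTGRES_DB: mydb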

Notice `db` in the connection string. Compose’s network makes `db` resolvable from any service in the same file. The `db` service has no `ports` mapping, so nothing is published on the host: other containers on the network can reach it, but clients outside the Docker host cannot. That’s a security advantage.

## Host Mode: When You Need Raw Performance

    docker run -d --network host nginx

Host mode removes network isolation entirely. The container shares the host’s network stack — same IP, same interfaces, no NAT.

Nginx in the container now listens on port 80 of the host directly. No port mapping needed. No NAT overhead.

When to use host mode:

- Network monitoring tools that need access to all host interfaces
- Applications with many dynamic ports where mapping each one is impractical
- Maximum network performance (eliminates NAT/iptables processing)
- Applications that need to bind to the host’s specific network interface

When NOT to use host mode:

- Multiple containers needing the same port: they’ll conflict
- When you need network isolation between containers
- On macOS or Windows: host networking is a Linux feature, and Docker Desktop offers it only as an opt-in setting in recent versions

For most web applications, the performance difference between bridge and host is negligible — under 1% latency difference. You’d need to be pushing millions of packets per second (network appliances, high-frequency trading) for host mode to matter.

## Overlay Networks: Multi-Host Communication

If you have containers running on different machines that need to communicate, overlay networks handle it:

    docker swarm init
    docker network create --driver overlay --attachable production
    docker service create --name api --network production --replicas 3 myapp

Overlay networks:

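- Span multiple Docker hosts
- Tunnel traffic between hosts using VXLAN (encryption of application traffic is opt-in, via `--opt encrypted` when creating the network)
- Provide DNS-based service discovery across hosts
- Are what Docker Swarm services use to talk to each other across hosts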

Each container gets an IP on the overlay network and can reach other containers by name, regardless of which physical host they’re running on.

For single-host deployments (most development environments and many production setups), bridge networks are simpler and sufficient. Use overlay only when you genuinely have multi-host requirements.

## None and Macvlan: The Specialty Modes

**None mode** gives complete network isolation:

    docker run --network none myapp

No network interfaces except loopback. Use it for containers that process data from volumes but should never make network connections, such as security-sensitive batch jobs.

**Macvlan** makes containers appear as physical devices on your network:

    docker network create -d macvlan \
        --subnet=192.168.1.0/24 \
        --gateway=192.168.1.1 \
        -o parent=eth0 macnet

Containers get IPs from your physical network’s subnet. Other devices on the network see them as separate machines. Useful for legacy applications that expect to be on the physical network, or for containers that need to be directly accessible without port mapping.

## Debugging Docker Networking

When containers can’t communicate, work through this sequence:

**1. Check which network the containers are on:**

    docker inspect mycontainer | grep -A 20 Networks

If they’re on different networks, they can’t communicate directly.

**2. Check DNS resolution:**

    docker exec mycontainer nslookup other-container

If this fails and the name is spelled correctly, the containers are most likely on the default bridge. Create a custom network.

**3. Check connectivity:**

    docker exec mycontainer ping other-container
    docker exec mycontainer curl -s http://other-container:80/

**4. Use netshoot for advanced debugging:**

    docker run --rm -it --network mynet nicolaka/netshoot

This image bundles the usual network debugging tools: curl, ping, dig, nslookup, tcpdump, netstat, ss, iperf, and more.

**5. Check the iptables rules Docker created:**

    sudo iptables -L DOCKER -n -v

Docker manages iptables rules for port mapping and inter-network isolation. Corrupted rules can cause mysterious connectivity issues. Restarting Docker often fixes them.

**6. Check port mapping:**

    docker port mycontainer

This shows which host ports map to which container ports. If it is empty and you expected a mapping, you forgot `-p` when starting the container.

## The Networking Decision Tree

- Single container, no communication needed → default bridge is fine
- Multiple containers on one host → custom bridge network
- Need maximum network performance → host mode (Linux only)
- Containers across multiple hosts → overlay network
- Need containers on the physical network → macvlan
- Need complete isolation → none

For 90% of deployments, a custom bridge network is the right answer. Docker Compose creates one automatically. Start there, and only reach for other modes when you have a specific reason.

How to Implement API Rate Limiting — Token Bucket, Sliding Window, and the Edge Cases Nobody Warns You About

banditz — Fri, 17 Apr 2026 06:53:00 GMT

It happens to every API eventually. You launch without rate limiting because “we’ll add it later.” Three weeks later, someone discovers your API and hits it 50,000 times in a minute. Maybe it’s a bot scraping your data. Maybe it’s a customer’s broken retry loop. Maybe it’s a competitor stress-testing your infrastructure. The result is the same: your servers are drowning, your database is on fire, and legitimate users can’t get a response.

Rate limiting isn’t a nice-to-have. It’s the bouncer at the door that keeps your API from being abused into oblivion.

## The Algorithms: Pick the Right One

There are four common rate limiting algorithms. Each has tradeoffs.

### Fixed Window Counter

The simplest approach. Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window. Reject when the count exceeds the limit.

    Window 12:00-12:01: 98/100 requests used
    Window 12:01-12:02: 0/100 requests used

The problem: a burst at the window boundary. If a client sends 100 requests at 12:00:59 and another 100 at 12:01:01, they’ve sent 200 requests in 2 seconds while technically staying within the per-minute limit. This can overload your server even though the rate limit is “working.”
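The boundary burst is easy to reproduce. Here is a minimal in-memory sketch of a fixed window counter for a single client, with timestamps passed in as milliseconds so the scenario is deterministic. The `createFixedWindowLimiter` name is made up for illustration, and the sketch never evicts old windows.

```javascript
// Fixed window counter for one client: one counter per window start time.
function createFixedWindowLimiter(limit, windowMs) {
  const counts = new Map(); // windowStart (ms) -> requests seen in that window
  return function allow(nowMs) {
    const windowStart = Math.floor(nowMs / windowMs) * windowMs;
    const used = counts.get(windowStart) || 0;
    if (used >= limit) return false;
    counts.set(windowStart, used + 1);
    return true;
  };
}

const allow = createFixedWindowLimiter(100, 60_000);
let accepted = 0;
// 100 requests one second before the window boundary...
for (let i = 0; i < 100; i++) if (allow(59_000)) accepted++;
// ...and 100 more one second after it.
for (let i = 0; i < 100; i++) if (allow(61_000)) accepted++;

console.log(accepted); // 200: every request passed, all within two seconds
```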

### Sliding Window Log

Stores the timestamp of every request. When a new request arrives, count timestamps within the last window duration and reject if over the limit. Most accurate, but stores every timestamp — memory grows linearly with traffic.
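As a rough sketch of the idea (single client, in memory, hypothetical `createSlidingLogLimiter` name), the log is just an array of timestamps that gets trimmed on every request. Feeding it the same boundary burst as before shows the difference from a fixed window:

```javascript
// Sliding window log for one client: keep every accepted request's timestamp.
function createSlidingLogLimiter(limit, windowMs) {
  const timestamps = []; // ascending, in milliseconds
  return function allow(nowMs) {
    // Drop timestamps that are no longer inside the trailing window.
    while (timestamps.length > 0 && timestamps[0] <= nowMs - windowMs) {
      timestamps.shift();
    }
    if (timestamps.length >= limit) return false;
    timestamps.push(nowMs);
    return true;
  };
}

const allowLog = createSlidingLogLimiter(100, 60_000);
let acceptedBefore = 0;
let acceptedAfter = 0;
for (let i = 0; i < 100; i++) if (allowLog(59_000)) acceptedBefore++;
for (let i = 0; i < 100; i++) if (allowLog(61_000)) acceptedAfter++;

console.log(acceptedBefore, acceptedAfter); // 100 0
```

The second burst is rejected because the first 100 timestamps are still inside the trailing 60 seconds. The cost is visible too: the array holds one entry per accepted request.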

### Sliding Window Counter

A hybrid. Uses fixed windows but weights the previous window’s count based on overlap. If 70% of the current window has elapsed, count 30% of the previous window’s requests plus 100% of the current window’s. Good accuracy without per-request storage.
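The weighting is a one-line formula. A small sketch, with hypothetical function names and example counts chosen only for illustration:

```javascript
// Estimate the request count over the trailing window from two fixed-window
// counters. elapsedFraction is how far we are into the current window (0..1).
function slidingWindowEstimate(prevCount, currCount, elapsedFraction) {
  return prevCount * (1 - elapsedFraction) + currCount;
}

function allowRequest(prevCount, currCount, elapsedFraction, limit) {
  return slidingWindowEstimate(prevCount, currCount, elapsedFraction) < limit;
}

// 70% into the current window: the previous window saw 80 requests,
// the current one has seen 40 so far. 80 * 0.3 + 40 is about 64.
const estimate = slidingWindowEstimate(80, 40, 0.7);
console.log(estimate.toFixed(1)); // 64.0
```

Only two counters per client are stored, which is why this variant scales better than the log.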

### Token Bucket

A bucket holds tokens, refilled at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing controlled bursts up to that capacity while enforcing the average rate over time.
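Before moving the logic into a shared store, it helps to see it in plain code. This is a minimal single-process sketch with the clock passed in as seconds; the `TokenBucket` class and its method names are illustrative, not from any library.

```javascript
class TokenBucket {
  constructor(capacity, refillPerSecond, nowSeconds) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full, so an initial burst is allowed
    this.lastRefill = nowSeconds;
  }

  tryConsume(nowSeconds) {
    // Add the tokens earned since the last call, capped at capacity.
    const elapsed = Math.max(0, nowSeconds - this.lastRefill);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSecond);
    this.lastRefill = Math.max(this.lastRefill, nowSeconds);
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Capacity 5, refilling 1 token per second, created at t = 0.
const bucket = new TokenBucket(5, 1, 0);
const burst = [0, 0, 0, 0, 0, 0].map((t) => bucket.tryConsume(t));
const later = bucket.tryConsume(2);

console.log(burst.join()); // true,true,true,true,true,false
console.log(later);        // true: two tokens were refilled after 2 seconds
```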

For most APIs, token bucket or sliding window counter is the right choice. Token bucket if you want to allow controlled bursts (most APIs should). Sliding window counter if you want strict per-window enforcement.

## Implementing Token Bucket with Redis

Redis is the standard backing store for rate limiting because it’s fast, atomic, and shared across application servers.

The token bucket needs two pieces of data per client: the number of remaining tokens and the last refill timestamp.

Here’s the logic in a Redis Lua script (atomic execution, no race conditions):

    -- Token bucket rate limiter (Lua script for Redis)
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])      -- bucket capacity
    local refill_rate = tonumber(ARGV[2])     -- tokens per second
    local now = tonumber(ARGV[3])             -- current timestamp in seconds

    local data = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or max_tokens
    local last_refill = tonumber(data[2]) or now

    -- Add the tokens earned since the last refill, capped at capacity.
    -- math.max guards against a caller whose clock runs behind.
    local elapsed = math.max(0, now - last_refill)
    tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

    local allowed = 0
    if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
    end

    -- HSET takes multiple field/value pairs since Redis 4.0; HMSET is deprecated.
    redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)

    return { allowed, math.floor(tokens) }

This runs atomically in Redis. No race conditions even with 50 application servers hitting the same key simultaneously. The EXPIRE ensures cleanup of inactive clients.

Call it from your application:

    const result = await redis.eval(luaScript, 1,
        `ratelimit:${clientId}`,
        100,           // max 100 tokens (burst capacity)
        1.67,          // refill at 1.67/sec, roughly 100/min
        Date.now() / 1000
    );
    const [allowed, remaining] = result;

## Setting Rate Limit Headers

Good APIs tell clients their rate limit status on every response. This lets well-behaved clients self-regulate instead of blindly hitting the limit.

The widely used de facto headers (the IETF draft standardizes similar `RateLimit-*` fields without the `X-` prefix):

    X-RateLimit-Limit: 100
    X-RateLimit-Remaining: 67
    X-RateLimit-Reset: 1712345678

When the limit is exceeded, return 429 Too Many Requests with a Retry-After header:

    HTTP/1.1 429 Too Many Requests
    Retry-After: 30
    X-RateLimit-Limit: 100
    X-RateLimit-Remaining: 0
    X-RateLimit-Reset: 1712345678

    {
        "error": "rate_limit_exceeded",
        "message": "Too many requests. Retry after 30 seconds.",
        "retry_after": 30
    }

Include the Retry-After value in both the header and body. Some HTTP clients check headers, some parse the body.
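One way to keep the headers consistent is to build them in a single place from the limiter’s result. A minimal sketch follows; the `rateLimitHeaders` function and the shape of its input (`allowed`, `limit`, `remaining`, `resetEpochSeconds`) are assumptions for illustration, not part of any framework.

```javascript
// Build response headers from a limiter result. Times are Unix seconds.
function rateLimitHeaders(result, nowEpochSeconds) {
  const headers = {
    "X-RateLimit-Limit": String(result.limit),
    "X-RateLimit-Remaining": String(Math.max(0, result.remaining)),
    "X-RateLimit-Reset": String(result.resetEpochSeconds),
  };
  if (!result.allowed) {
    // Retry-After is a delay in seconds; never tell a client to retry in 0.
    const delay = Math.ceil(result.resetEpochSeconds - nowEpochSeconds);
    headers["Retry-After"] = String(Math.max(1, delay));
  }
  return headers;
}

const rejected = rateLimitHeaders(
  { allowed: false, limit: 100, remaining: 0, resetEpochSeconds: 1712345678 },
  1712345648
);
console.log(rejected["Retry-After"]); // 30
```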

## Edge Cases That Break Naive Implementations

### Client identification

Most tutorials rate limit by IP address. This breaks immediately in production because:

- Corporate NATs: thousands of users share one IP. Rate limiting by IP blocks an entire company.
- VPNs and proxies: same problem.
- Mobile carriers: CGNAT means millions of users share IP pools.

Rate limit by API key for authenticated endpoints. For unauthenticated endpoints (login, registration), rate limit by a combination of IP + other signals (user agent, fingerprint).

For login endpoints specifically, rate limit by both IP and target username. The per-IP limit slows an attacker who tries common passwords against thousands of usernames from one address (password spraying); the per-username limit stops a distributed attack that hammers a single account from many addresses.
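In code, this usually comes down to choosing which limiter keys a request counts against. Here is a small sketch, assuming a hypothetical request object with `apiKey`, `ip`, `path`, and `username` fields; a request would be allowed only if every returned key is under its limit.

```javascript
// Decide which rate limit buckets a request should be counted against.
function rateLimitKeys(req) {
  // Authenticated traffic: the API key identifies the client, not the IP.
  if (req.apiKey) return [`ratelimit:key:${req.apiKey}`];

  const keys = [`ratelimit:ip:${req.ip}`];
  // Login attempts are also counted per target account.
  if (req.path === "/api/login" && req.username) {
    keys.push(`ratelimit:login:${req.username.toLowerCase()}`);
  }
  return keys;
}

console.log(rateLimitKeys({ ip: "203.0.113.7", path: "/api/login", username: "Alice" }));
// [ 'ratelimit:ip:203.0.113.7', 'ratelimit:login:alice' ]
```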

### Per-endpoint limits

Not all endpoints are equal. Your `/search` endpoint might hit the database hard and need aggressive limiting. Your `/health` endpoint should probably never be limited. Your `/login` endpoint needs very tight limits to prevent brute force.

Apply different limits per endpoint or endpoint group:

    /api/search     → 30 requests/minute
    /api/users      → 100 requests/minute
    /api/login      → 5 requests/minute
    /api/health     → no limit
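A table like this maps naturally onto a small lookup. The sketch below uses the limits listed above; the `limitFor` helper and the fallback of 100 requests per minute for unlisted paths are assumptions, and the path is expected without its query string.

```javascript
const LIMITS = [
  { prefix: "/api/search", perMinute: 30 },
  { prefix: "/api/users", perMinute: 100 },
  { prefix: "/api/login", perMinute: 5 },
  { prefix: "/api/health", perMinute: Infinity }, // never limited
];
const DEFAULT_PER_MINUTE = 100; // assumed fallback for unlisted endpoints

// Match the exact path or anything nested under it (e.g. /api/users/42),
// but not lookalikes such as /api/searching.
function limitFor(path) {
  const rule = LIMITS.find(
    (r) => path === r.prefix || path.startsWith(r.prefix + "/")
  );
  return rule ? rule.perMinute : DEFAULT_PER_MINUTE;
}

console.log(limitFor("/api/login"));    // 5
console.log(limitFor("/api/users/42")); // 100
```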

### Distributed systems

If you have multiple application servers behind a load balancer, each server needs to check the same rate limit counter. This is why Redis (or any shared store) is necessary. With in-memory rate limiting on each server, a client whose requests are spread across the servers can send up to N × server_count requests per window.

### Time synchronization

If your application servers have slightly different clocks, timestamp-based algorithms can behave inconsistently. Use the Redis server’s time (`redis.call('TIME')`) instead of the application server’s clock.

## Graceful Degradation

If Redis goes down, what happens to your rate limiter?

Bad answer: all requests are blocked because the rate limiter can’t check counts.

Good answer: fall back to a local in-memory rate limiter with generous limits, or allow requests through with logging and alerting.

    async function checkRateLimit(clientId) {
        try {
            return await redisRateLimiter.check(clientId);
        } catch (error) {
            logger.warn('Redis rate limiter unavailable, falling back');
            return localRateLimiter.check(clientId);
        }
    }

Rate limiting should protect your API, not become a single point of failure. Design it to fail open (allow requests) rather than fail closed (block everything), with monitoring to alert you when the fallback activates.

## The Rate Limiting Checklist

- Pick the algorithm: token bucket for most APIs
- Use Redis: shared state across servers, atomic operations
- Identify clients properly: API key, not just IP
- Set per-endpoint limits: tight for auth, generous for reads
- Return proper headers: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`
- Return 429 with `Retry-After`: don’t just drop connections
- Handle Redis failure: fall back, don’t fail closed
- Monitor: track 429 rates, identify abusive patterns
- Document your limits: developers need to know them upfront

Rate limiting is one of those things that’s boring until you need it. And when you need it, you need it ten minutes ago. Build it before the bot finds your API. It’s a lot less stressful that way.

How to Implement JWT Authentication Properly — Access Tokens, Refresh Tokens, and Common Mistakes

banditz — Fri, 09 Jan 2026 06:03:00 GMT

Every JWT tutorial on the internet follows the same pattern. User sends credentials. Server signs a JWT. Client stores it in localStorage. Client sends it in the Authorization header. Tutorial ends.

That implementation has at least three serious problems.

First, localStorage is accessible to any JavaScript on the page. One XSS vulnerability — one unsanitized input, one compromised third-party script — and the attacker reads and exfiltrates the token.

Second, no refresh mechanism. The token either expires quickly (user logs in constantly) or slowly (stolen token has a long window).

Third, no revocation. If the user changes their password or you detect suspicious activity, you can’t invalidate the token. It’s valid until it expires.

## The Access / Refresh Token Pattern

OAuth2 standardized this approach with two tokens:

**Access Token**: short-lived (15 minutes). Sent with every API request. If stolen, the attacker has at most a 15-minute window.

**Refresh Token**: long-lived (7-30 days). Stored securely. Only sent to a refresh endpoint. Used to get new access tokens.

The flow:

1. User logs in with credentials
2. Server validates, generates access token (15 min) and refresh token (7 days)
3. Client stores both securely
4. Client sends access token with API requests
5. When access token expires, client sends refresh token to refresh endpoint
6. Server validates refresh token, issues new access token (and rotates refresh token)
7. When refresh token expires, user must log in again

## Step 1: Token Claims

    {
      "iss": "api.yourdomain.com",
      "sub": "user_12345",
      "aud": "yourdomain.com",
      "exp": 1712345678,
      "iat": 1712344778,
      "jti": "unique-id-abc123",
      "role": "admin"
    }

- `iss`: who created the token
- `sub`: user ID
- `aud`: intended audience
- `exp`: expiration (Unix timestamp)
- `iat`: issued at
- `jti`: unique token ID (needed for revocation)

Never put sensitive data in the payload. JWTs are base64url-encoded, not encrypted. Anyone who holds the token can decode the payload. No passwords, no SSNs, no credit card numbers.

Keep it minimal. User ID and role. That’s it. Look up everything else server-side.
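To see how little protection the encoding gives, here is a short Node.js sketch that reads a payload without any key. The token is assembled by hand with a placeholder signature, and `decodeJwtPayload` is a hypothetical helper; it performs no verification and must never be used to trust a token.

```javascript
// Decode (not verify) the payload segment of a compact JWT.
function decodeJwtPayload(token) {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("expected header.payload.signature");
  return JSON.parse(Buffer.from(parts[1], "base64url").toString("utf8"));
}

// Assemble an example token; the third segment is a placeholder, not a real signature.
const encode = (obj) => Buffer.from(JSON.stringify(obj)).toString("base64url");
const exampleToken = [
  encode({ alg: "RS256", typ: "JWT" }),
  encode({ sub: "user_12345", role: "admin" }),
  "signature-not-checked",
].join(".");

console.log(decodeJwtPayload(exampleToken));
// { sub: 'user_12345', role: 'admin' }
```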

## Step 2: Signing Algorithm

Use asymmetric signing for production.

RS256 (RSA + SHA-256) or ES256 (ECDSA + SHA-256). The auth server holds the private key and signs tokens. API servers have only the public key and can verify tokens but cannot create them.

This matters in a microservices architecture. If you use HS256 (symmetric), every service that verifies tokens has the shared secret. Compromise one service, compromise the signing key, and the attacker can mint arbitrary tokens.

With RS256/ES256, compromising an API server doesn’t help — the attacker can verify tokens but can’t sign new ones. Only the auth server can do that.

Generate an RS256 key pair:

    openssl genrsa -out private.pem 2048
    openssl rsa -in private.pem -pubout -out public.pem

The auth server uses `private.pem` to sign. API servers use `public.pem` to verify.

## Step 3: Secure Token Storage

Where NOT to store tokens:

- `localStorage`: accessible to any JavaScript. XSS = game over.
- `sessionStorage`: same problem; it is just scoped to a single tab and cleared when the tab closes.
- Regular cookies without `httpOnly`: JavaScript can still read them.
- URL parameters: visible in server logs, browser history, referrer headers.

Where to store them:

For web apps: `httpOnly` cookies with `Secure` and `SameSite` attributes.

    res.cookie('access_token', token, {
        httpOnly: true,
        secure: true,
        sameSite: 'strict',
        maxAge: 900000   // 15 minutes
    });

`httpOnly` means JavaScript cannot access the cookie, so XSS can’t steal it. `Secure` means HTTPS only. `SameSite: strict` prevents the cookie from being sent on cross-site requests, mitigating CSRF.

The browser sends the cookie automatically with every request to your domain. Your API reads it from the cookie header instead of the Authorization header. No JavaScript involved in token handling at all.

For mobile apps: Use platform secure storage — iOS Keychain, Android Keystore. These are hardware-backed encrypted storage that the OS protects.

For SPAs calling APIs on a different site: This is the hard case. `SameSite: strict` cookies are not sent on cross-site requests (an API on another subdomain of the same registrable domain still counts as same-site). You need `SameSite=None; Secure` with proper CORS configuration. Or use the Backend For Frontend (BFF) pattern, where your SPA talks to its own backend, and that backend handles tokens and proxies API calls.

## Step 4: The Refresh Flow

When the access token expires, the client receives a 401 from the API. The client then sends the refresh token to a dedicated endpoint:

    POST /auth/refresh
    Cookie: refresh_token=eyJ...

The server:

1. Validates the refresh token (signature, expiration, not revoked)
2. Issues a new access token
3. Rotates the refresh token: issues a new refresh token and invalidates the old one
4. Returns both to the client

Refresh token rotation is critical for detecting theft. Here’s why:

If an attacker steals the refresh token and uses it before the legitimate user, the attacker gets a new token pair and the old refresh token is invalidated. When the legitimate user tries to refresh with the old token, it fails — and you know the refresh token was compromised. At this point, invalidate all tokens for that user and force re-authentication.

Store refresh tokens in a database (not just in-memory). Each refresh token maps to a user ID and a token family. When a refresh is requested:

1. Look up the token in the database
2. If it’s valid: issue new pair, mark old token as used, store new token
3. If it’s already been used: someone is replaying a stolen token. Invalidate the entire family. Force the user to log in again.
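The three rules above fit in a few lines. This sketch keeps the records in an in-memory `Map` as a stand-in for the database table and uses a counter for IDs; both are simplifications (a real system would use durable storage and cryptographically random token values), expiry checks are left out, and every name here is illustrative.

```javascript
// tokenId -> { userId, familyId, used, revoked }
const refreshTokens = new Map();
let counter = 0;
const nextId = (prefix) => `${prefix}_${++counter}`; // placeholder for a random ID

function issueRefreshToken(userId, familyId = nextId("family")) {
  const tokenId = nextId("rt");
  refreshTokens.set(tokenId, { userId, familyId, used: false, revoked: false });
  return tokenId;
}

function rotateRefreshToken(tokenId) {
  const record = refreshTokens.get(tokenId);
  if (!record || record.revoked) return { ok: false, reason: "invalid" };

  if (record.used) {
    // A rotated token came back: treat it as theft and revoke the whole family.
    for (const other of refreshTokens.values()) {
      if (other.familyId === record.familyId) other.revoked = true;
    }
    return { ok: false, reason: "reuse_detected" };
  }

  record.used = true;
  return { ok: true, refreshToken: issueRefreshToken(record.userId, record.familyId) };
}

const first = issueRefreshToken("user_12345");
const rotated = rotateRefreshToken(first);            // normal refresh
const replay = rotateRefreshToken(first);             // old token presented again
const afterReplay = rotateRefreshToken(rotated.refreshToken);

console.log(rotated.ok, replay.reason, afterReplay.reason);
// true reuse_detected invalid
```

After the replay, even the newest token in the family stops working, which is what forces the user back to the login screen.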

## Step 5: Token Revocation

JWTs are stateless: on their own, they can’t be invalidated server-side. The whole point is that the server doesn’t store session state. But sometimes you need to revoke a token immediately: logout, password change, compromised account.

**Option 1: Token blocklist.**

Store revoked token JTIs (unique IDs) in Redis with a TTL matching the token’s remaining lifetime. On every API request, check the blocklist:

    const isRevoked = await redis.get(`blocklist:${tokenJti}`);
    if (isRevoked) return res.status(401).json({ error: 'Token revoked' });

This trades some statefulness for the ability to revoke. The blocklist is small (only active tokens that have been explicitly revoked) and the TTL ensures automatic cleanup.

**Option 2: Short access tokens + refresh token revocation.**

Keep access tokens at 5 minutes. For logout, only revoke the refresh token (delete it from the database). The access token expires naturally within 5 minutes. This is simpler but has a 5-minute window where the old access token still works.

**Option 3: Token versioning.**

Store a tokenVersion counter on the user record. Include it in the JWT payload. On every request, compare the token’s version to the database. If the user’s version has been incremented (due to password change, forced logout, etc.), reject the token. This requires a database lookup per request, which somewhat defeats the stateless advantage — but it’s a pragmatic compromise.

Common Mistakes

Mistake 1: Storing the secret in code. Use environment variables or a secrets manager. Never commit signing keys to version control.

Mistake 2: Using HS256 in a microservice architecture. Every service needs the shared secret. One compromised service = all services compromised. Use RS256/ES256.

Mistake 3: Not validating aud and iss claims. A token signed by your auth server but intended for a different service should be rejected. Always validate audience and issuer.

Mistake 4: Putting too much in the payload. Every byte is sent with every request. JWTs over 1KB are a sign you’re using them wrong.

Mistake 5: Never rotating refresh tokens. If a refresh token is valid for 30 days and gets stolen on day 1, the attacker has 29 days of access. Rotation limits this to a single use.

Mistake 6: No refresh token at all. A long-lived access token is the worst of both worlds — can’t be revoked and gives a long attack window. Use the two-token pattern.
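Mistake 3 is the easiest one to show in code. The sketch below checks the claims of a token whose signature has already been verified by your JWT library. The function name and parameters are illustrative, and it is written in Python only to keep it short.

```python
def claims_are_acceptable(claims, expected_issuer, expected_audience, now):
    """Validate iss, aud and exp on an already signature-verified payload."""
    if claims.get("iss") != expected_issuer:
        return False
    aud = claims.get("aud")
    # The aud claim may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if expected_audience not in audiences:
        return False
    exp = claims.get("exp")
    return isinstance(exp, (int, float)) and now < exp
```

Most JWT libraries can do these checks for you if you pass the expected issuer and audience to their verify call, which is usually the better choice.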

The two-token pattern with httpOnly cookies, asymmetric signing, and refresh token rotation is the production standard. It’s not the simplest implementation, but it’s the one that doesn’t get you into the security news.


How to Diagnose and Fix MySQL Replication Lag

banditz — Wed, 07 Jan 2026 05:26:00 GMT

It starts with a support ticket. A user updated their profile but the change isn’t showing. They refresh — old data is back. Your application reads from a MySQL replica that’s lagging behind the primary, returning stale data.

You SSH into the replica:

SHOW REPLICA STATUS\G

You find the line that matters:

Seconds_Behind_Source: 347

Nearly 6 minutes behind. Every read from this server returns data that’s 6 minutes old. For a profile update that’s annoying. For e-commerce inventory, a customer is buying a product that sold out 5 minutes ago.

How Replication Works

Three components:

Binlog on the primary. Every write (INSERT, UPDATE, DELETE, DDL) is recorded in the binary log.

IO thread on the replica. Connects to the primary, reads binlog events, writes them to the local relay log.

SQL thread on the replica. Reads relay log events and executes them. This is where data changes on the replica.

Lag happens at two points: IO thread can’t fetch fast enough (network), or SQL thread can’t replay fast enough (most common).

Step 1: Read SHOW REPLICA STATUS

SHOW REPLICA STATUS\G

On MySQL before 8.0.22, use SHOW SLAVE STATUS\G, which reports the same fields under their older names (Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master, Read_Master_Log_Pos, Exec_Master_Log_Pos).

Critical fields:

Replica_IO_Running: must be Yes. If No, check network, credentials, and whether the primary’s binlog was purged.

Replica_SQL_Running: must be Yes. If No, check Last_SQL_Error.

Seconds_Behind_Source: the lag in seconds. Approximate.

Read_Source_Log_Pos: how far the IO thread has read in the primary’s binlog.

Exec_Source_Log_Pos: how far the SQL thread has executed. The gap between the two is queued work.

Relay_Log_Space: total size of the relay log files on disk. Large and growing = SQL thread can’t keep up.

Step 2: IO Thread or SQL Thread?

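Check the primary’s position:

-- On primary (on MySQL 8.2 and later, SHOW MASTER STATUS is deprecated in favor of SHOW BINARY LOG STATUS):

SHOW MASTER STATUS\G

If the replica’s Read_Source_Log_Pos (Read_Master_Log_Pos in older output) is close to the primary’s Position in the same binlog file, the IO thread is fine and the SQL thread is the bottleneck.

If the read position is far behind the primary: suspect network bandwidth, slow primary storage, or SSL overhead on the IO thread.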
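The comparison can be written down as a tiny decision function. This Python sketch assumes all three positions are byte offsets in the same binlog file, and the 1 MB tolerance is an arbitrary cutoff picked for illustration. The function name is hypothetical.

```python
def find_bottleneck(primary_pos, read_pos, exec_pos, tolerance=1_000_000):
    """Classify lag from binlog byte offsets taken at roughly the same moment.

    primary_pos: Position from the primary's status output
    read_pos:    how far the replica's IO thread has read
    exec_pos:    how far the replica's SQL thread has executed
    """
    io_gap = primary_pos - read_pos    # events not yet fetched
    sql_gap = read_pos - exec_pos      # events fetched but not yet applied
    if io_gap > tolerance:
        return "io_thread"
    if sql_gap > tolerance:
        return "sql_thread"
    return "in_sync"
```

If the file names differ between primary and replica, the offsets are not comparable and the replica is at least one whole binlog file behind.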

In practice, most replication lag comes from the SQL thread. Let’s fix it.

Step 3: Why the SQL Thread Is Slow

Cause 1: Single-threaded replay.

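Before MySQL 8.0.27, the replica applied events on a single thread by default (8.0.27 and later default to 4 workers). The primary processes thousands of concurrent transactions across 32 cores, but a single-threaded replica replays them one at a time.

SHOW VARIABLES LIKE 'replica_parallel_workers';

If 0 or 1, you’re effectively single-threaded. On versions before 8.0.26 the variable is named slave_parallel_workers.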

Fix — enable parallel replication:

STOP REPLICA;

SET GLOBAL replica_parallel_workers = 4;

SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';

SET GLOBAL replica_preserve_commit_order = ON;

START REPLICA;

In my.cnf:

[mysqld]

replica_parallel_workers = 4

replica_parallel_type = LOGICAL_CLOCK

replica_preserve_commit_order = ON

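Start with 4 workers. Going above 8-16 rarely helps. On recent versions you can omit the replica_parallel_type line: LOGICAL_CLOCK is the default from 8.0.27, and the variable is deprecated as of 8.0.29.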

Cause 2: Missing indexes on the replica.

With row-based replication, each row change is found by primary key on the replica. No primary key = full table scan per change. Multiply by thousands of changes per second and the SQL thread grinds.

Check what SQL thread is doing:

SHOW PROCESSLIST;

If you see slow UPDATE or DELETE from the system user, check that table’s indexes:

SHOW CREATE TABLE the_table;

Add missing primary keys.

Cause 3: Large transactions.

A single transaction updating 500,000 rows is written to the binlog as one massive transaction. The replica applies it as a single unit, blocking everything behind it.

Fix on the application side — batch operations:

-- Instead of:

UPDATE orders SET status = 'archived' WHERE created_at < '2024-01-01';

-- Do:

UPDATE orders SET status = 'archived'

WHERE created_at < '2024-01-01' AND status != 'archived'

LIMIT 1000;

Loop until no rows affected. Each batch commits separately.
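Here is what that loop can look like in application code. This is a sketch in Python using the standard-library sqlite3 module as a stand-in for a MySQL driver, so it selects the batch with a subquery instead of UPDATE ... LIMIT. The function name and the id column are assumptions.

```python
import sqlite3


def archive_in_batches(conn, cutoff, batch_size=1000):
    """Archive old orders in small transactions; return the number of batches run."""
    batches = 0
    while True:
        cur = conn.execute(
            """
            UPDATE orders SET status = 'archived'
            WHERE id IN (
                SELECT id FROM orders
                WHERE created_at < ? AND status != 'archived'
                LIMIT ?
            )
            """,
            (cutoff, batch_size),
        )
        conn.commit()  # each batch commits separately, so the replica sees small transactions
        if cur.rowcount == 0:
            return batches
        batches += 1
```

On a busy system you may also want a short sleep between batches so the replica has time to catch up.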

Cause 4: DDL operations.

ALTER TABLE on a large table blocks the SQL thread for its entire duration. Monitor lag during DDL — it’s expected and temporary.

Step 4: Monitoring That Works

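Seconds_Behind_Source (Seconds_Behind_Master in older versions) lies. It measures how far the SQL thread is behind the events the IO thread has already fetched, so it can show 0 between event bursts even when the replica is behind the primary.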

Use pt-heartbeat from Percona Toolkit:

On primary:

pt-heartbeat --update --database heartbeat --create-table

On replica:

pt-heartbeat --monitor --database heartbeat

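The --update process writes a timestamp row on the primary every second, and --monitor on the replica compares the replicated timestamp with the current time. That is measured lag, not an estimate.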

Alert thresholds:

< 1s — acceptable
1-10s — watch
10-60s — investigate
> 60s — critical
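If you feed the measured lag into your own alerting, the thresholds above map to a small function. A Python sketch follows. The function names are illustrative, and putting exactly 10 seconds in the "investigate" bucket is my assumption, since the ranges above overlap at the boundaries.

```python
def heartbeat_lag(heartbeat_ts, now_ts):
    """Lag = current time on the replica minus the last replicated heartbeat timestamp."""
    return max(0.0, now_ts - heartbeat_ts)


def lag_severity(lag_seconds):
    """Map a lag in seconds to the alert levels listed above."""
    if lag_seconds < 1:
        return "acceptable"
    if lag_seconds < 10:
        return "watch"
    if lag_seconds <= 60:
        return "investigate"
    return "critical"
```

This only works if the clocks on primary and replica are synchronized, which is also true for pt-heartbeat itself.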

The Diagnostic Sequence

1. SHOW REPLICA STATUS: threads running? What’s the error?
2. Compare positions: IO thread keeping up?
3. SHOW PROCESSLIST: what’s the SQL thread doing?
4. Check parallel replication: workers > 1?
5. Check table indexes: primary keys present?
6. Check for DDL: ALTER TABLE in progress?

Most lag resolves with parallel replication and proper indexing. Single-threaded replay is by far the most common cause, and it’s the easiest fix: one config change on older versions, and the default from MySQL 8.0.27 onward.

How to Audit Linux File Permissions and Find Security Holes

banditz — Mon, 05 Jan 2026 05:16:00 GMT

There’s a move every sysadmin has seen at least once. The application throws a permission denied error. Somebody runs chmod 777 on the directory. The error goes away. Everyone moves on.

Except now that directory is world-writable. Every user, every service, every process — they can all read, modify, and execute everything inside it. If that directory has config files, any compromised service can rewrite them. If it has scripts, anyone can inject code.

File permissions are boring. Nobody gets excited about rwxr-xr--. But misconfigured permissions are consistently in the top vectors for Linux privilege escalation. Not because the attacks are sophisticated — because the mistakes are simple.

The Permission Model in 60 Seconds

Run ls -la and you see:

-rw-r--r--  1 root root  1542 Mar 15 09:20 /etc/nginx/nginx.conf

drwx------  2 deploy deploy  4096 Mar 15 09:22 /home/deploy/.ssh

Breaking down -rw-r--r--:

First character = file type (- file, d directory, l symlink). Then three groups: owner (rw-), group (r--), others (r--).

r = read (4), w = write (2), x = execute (1), - = none (0)

So -rw-r--r-- in numeric = 644: owner reads/writes, everyone else reads.
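The arithmetic is mechanical enough to write as a function. A Python sketch follows, with a made-up function name. It handles only r, w, x and -, not the special s and t bits.

```python
def symbolic_to_octal(perm):
    """Convert an ls-style mode string such as '-rw-r--r--' to '644'."""
    if len(perm) == 10:
        perm = perm[1:]  # drop the leading file-type character
    if len(perm) != 9:
        raise ValueError("expected 9 permission characters")
    weights = {"r": 4, "w": 2, "x": 1, "-": 0}
    digits = []
    for i in range(0, 9, 3):  # owner, group, others
        digits.append(str(sum(weights[c] for c in perm[i:i + 3])))
    return "".join(digits)
```

For example, symbolic_to_octal("drwx------") gives "700", which matches the .ssh directory in the listing above.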

For directories, x means you can enter it and access files. A directory with r-- lets you list filenames but not read files. With --x you can access files if you know their names but not list them.

Sane defaults: 644 for files, 755 for directories.

Step 1: Find World-Writable Files

World-writable = anyone on the system can modify the file. Lowest-hanging fruit for attackers.

Find every world-writable file:

sudo find / -type f -perm -o+w \

    -not -path "/proc/*" \

    -not -path "/sys/*" \

    -not -path "/dev/*" \

    2>/dev/null

For directories:

sudo find / -type d -perm -o+w \

    -not -path "/proc/*" \

    -not -path "/sys/*" \

    -not -path "/dev/*" \

    -not -path "/tmp" \

    -not -path "/var/tmp" \

    2>/dev/null

/tmp and /var/tmp are supposed to be world-writable, as are /dev/shm and /dev/mqueue, which is why /dev is excluded here too. Everything else needs investigation.
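If you would rather run this check from a script (for a cron job or a monitoring agent), the same test is a single bit mask. A Python sketch follows. The function name is illustrative, and it makes no attempt to skip /proc or /sys, so point it at specific directories.

```python
import os
import stat


def world_writable(root):
    """Yield paths under root whose 'others' write bit is set, skipping symlinks."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # vanished or unreadable, same as 2>/dev/null above
            if stat.S_ISLNK(mode):
                continue  # symlink modes are not meaningful on Linux
            if mode & stat.S_IWOTH:
                yield path
```

Calling list(world_writable("/etc")) should normally return an empty list.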

Red flags:

Config files (.conf, .ini, .env): anyone can change app behavior
Scripts (.sh, .py, .php): code injection vector
Cron files: a writable cron script runs as whichever user owns the job, often root
Web app files: writable PHP = direct code execution

Fix:

sudo chmod o-w /path/to/file

# Or set correct permissions

sudo chmod 644 /path/to/file

Step 2: Hunt for SUID and SGID Binaries

SUID means when you run this binary, it executes with the file owner’s permissions — not yours.

If root owns a SUID binary, anyone who runs it gets root-level execution. That’s by design for passwd (needs root for /etc/shadow) and sudo. The problem is unexpected SUID binaries.

Find all SUID:

sudo find / -type f -perm -4000 2>/dev/null

Find all SGID:

sudo find / -type f -perm -2000 2>/dev/null

A clean Linux server typically has a short list like this (exact paths and entries vary by distribution):

/usr/bin/passwd

/usr/bin/sudo

/usr/bin/su

/usr/bin/mount

/usr/bin/umount

/usr/bin/chfn

/usr/bin/chsh

/usr/bin/newgrp

/usr/bin/gpasswd

/usr/bin/pkexec

Everything else is suspicious. GTFOBins lists Linux binaries exploitable with SUID — find, vim, python, bash, nmap, and many more.
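Comparing against a known-good list is easy to automate. Below is a Python sketch that takes (path, mode) pairs, for example collected with os.lstat, and reports anything with SUID or SGID set that is not on the allowlist. The allowlist simply copies the list above, and the names are illustrative.

```python
import stat

# Copied from the list above; adjust for your distribution.
EXPECTED_SUID = {
    "/usr/bin/passwd", "/usr/bin/sudo", "/usr/bin/su",
    "/usr/bin/mount", "/usr/bin/umount", "/usr/bin/chfn",
    "/usr/bin/chsh", "/usr/bin/newgrp", "/usr/bin/gpasswd",
    "/usr/bin/pkexec",
}


def unexpected_suid(entries, allowlist=EXPECTED_SUID):
    """entries: iterable of (path, st_mode). Return sorted paths that need a look."""
    special = stat.S_ISUID | stat.S_ISGID
    return sorted(
        path for path, mode in entries
        if mode & special and path not in allowlist
    )
```

Saving the output after a clean install and diffing it on each audit catches new SUID binaries without maintaining the list by hand.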

If python3 had SUID set (it never should):

python3 -c 'import os; os.setuid(0); os.system("/bin/bash")'

Instant root shell.

Remove unnecessary SUID:

sudo chmod u-s /path/to/binary

Step 3: Check Sensitive Files

SSH files:

sudo stat -c '%a %U %G %n' /home/*/.ssh /home/*/.ssh/* 2>/dev/null

Required:

~/.ssh/ — 700
~/.ssh/authorized_keys — 644 or 600
~/.ssh/id_ed25519 (private key) — 600 (SSH refuses looser permissions)
~/.ssh/config — 600

System credentials:

sudo stat -c '%a %U %G %n' /etc/passwd /etc/shadow /etc/group /etc/gshadow

/etc/passwd — 644 root:root
/etc/shadow — 640 root:shadow (contains password hashes)
/etc/group — 644 root:root
/etc/gshadow — 640 root:shadow

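If /etc/shadow is world-readable, any user can read hashes and attempt offline cracking. The 640 root:shadow values are the Debian/Ubuntu defaults; RHEL-family systems ship /etc/shadow as 000 root:root, which is stricter and should be left alone.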
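"At most 640" is a bit-mask comparison, not a numeric one: 604 is numerically smaller than 640 but still world-readable. The Python sketch below makes that explicit. The MAX_MODES dict copies the values listed above, and the function names are illustrative.

```python
import stat

# Maximum modes from the list above (Debian/Ubuntu layout).
MAX_MODES = {
    "/etc/passwd": 0o644,
    "/etc/shadow": 0o640,
    "/etc/group": 0o644,
    "/etc/gshadow": 0o640,
}


def excess_bits(actual_mode, max_mode):
    """Return the permission bits set in actual_mode that max_mode does not allow."""
    return stat.S_IMODE(actual_mode) & 0o777 & ~max_mode


def too_permissive(observed, limits=MAX_MODES):
    """observed: {path: st_mode}. Return {path: octal string of the excess bits}."""
    return {
        path: oct(excess_bits(mode, limits[path]))
        for path, mode in observed.items()
        if path in limits and excess_bits(mode, limits[path])
    }
```

A stricter mode than the limit, such as 600 or 000 on /etc/shadow, passes this check.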

Web server:

Web directories should be 755, files 644. Config files owned by root at 640. Never make web files writable by the web server user unless absolutely necessary (uploads directory only).

Step 4: Find Orphaned Files

Files without valid owners = deleted user accounts or improper management:

sudo find / -nouser -o -nogroup 2>/dev/null | grep -v '/proc\|/sys'

Risk: if a new user gets the recycled UID, they inherit all orphaned files.

Fix:

sudo chown root:root /path/to/orphaned-file

Step 5: Set Proper Defaults with umask

Check current:

umask

Common values:

022 — new files 644, directories 755 (standard)
027 — new files 640, directories 750 (others can’t read)
077 — new files 600, directories 700 (only owner)
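Those numbers come from masking bits off the base modes: 666 for new files and 777 for new directories. A two-line Python sketch of the arithmetic, with a made-up function name:

```python
def default_modes(umask):
    """Return (file_mode, dir_mode) that a process with this umask creates by default."""
    return 0o666 & ~umask, 0o777 & ~umask
```

For umask 027 this gives 640 for files and 750 for directories, as in the list above. Programs can still request tighter modes explicitly; the umask only removes bits, it never adds them.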

For servers, 027 is ideal. Set in /etc/login.defs:

UMASK 027

For systemd services:

[Service]

UMask=0027

The Permission Audit Checklist

Run quarterly on every production server:

World-writable files: find / -type f -perm -o+w, then fix everything outside /tmp
SUID/SGID binaries: find / -type f -perm /6000, then compare against a known-good list
SSH permissions: directories 700, private keys 600
/etc/shadow: 640 at most, never world-readable
Web files: directories 755, files 644, config owned by root
Orphaned files: find / -nouser -o -nogroup, then reassign
Verify umask: should be 022 or 027

For automated auditing, Lynis does comprehensive permission checks. Install with sudo apt install lynis and run sudo lynis audit system.

Permissions aren’t glamorous. Nobody puts “I audited file permissions” on conference slides. But I’ve seen production databases exposed because /etc/shadow was 644. I’ve seen web servers owned because deploy scripts left everything world-writable. I’ve seen root shells from SUID on python3.

The boring stuff prevents the exciting breaches.

How to Harden SSH and Stop Brute Force Attacks on Linux Servers — Read the Logs Before You Lock the Door

banditz — Sat, 03 Jan 2026 04:14:00 GMT

I’m going to tell you something that’ll either scare you or make you shrug depending on how long you’ve been managing Linux servers: the VPS you spun up 20 minutes ago is already being attacked.

Not by a person sitting in a dark room with a hoodie — by bots. Thousands of them. Automated scripts running on compromised machines across the globe, systematically scanning every single IP address on the internet, looking for SSH servers running on port 22 with password authentication enabled. They try root/admin, root/password, admin/123456, deploy/deploy, ubuntu/ubuntu, and about ten thousand other common credential combos.

This isn’t hypothetical. Go look at your auth log right now:

sudo tail -200 /var/log/auth.log

If your server has been online for more than an hour with SSH on port 22, you’ll see a wall of failed login attempts from IP addresses you’ve never seen before. That’s the background radiation of the internet. It never stops. And if your only defense is a password — even a decent one — you’re playing a statistical game that you’ll eventually lose.

What the Bots Are Actually Doing

Before you start hardening things, it helps to understand the attack pattern. Most people picture brute force as one bot trying password1, password2, password3 at lightning speed against a single account. That’s the version from 2005.

Modern SSH brute force is smarter. Here’s what actually shows up in your logs:

Mar 28 04:17:32 web01 sshd[12847]: Invalid user admin from 185.224.128.47 port 42816

Mar 28 04:17:34 web01 sshd[12849]: Invalid user test from 185.224.128.47 port 42920

Mar 28 04:17:36 web01 sshd[12851]: Invalid user oracle from 185.224.128.47 port 43018

Mar 28 04:17:38 web01 sshd[12853]: Invalid user postgres from 185.224.128.47 port 43112

Mar 28 04:17:40 web01 sshd[12855]: Failed password for root from 185.224.128.47 port 43200 ssh2

Notice what’s happening. The bot isn’t just hammering root. It’s cycling through usernames — admin, test, oracle, postgres, deploy, git, jenkins, ubuntu — because these are default accounts that exist on millions of servers, and people frequently leave them with weak or default passwords.

The smarter bots also throttle their attempts. Instead of 100 attempts per second (which is trivially detectable), they’ll do 3 attempts, wait 60 seconds, try 3 more. Some distribute the attack across multiple IPs from the same botnet, so no single IP triggers rate limiting.

On RHEL/CentOS systems, the same information lives in /var/log/secure instead of /var/log/auth.log. The format is identical.

To see a summary of who’s been trying to get in:

grep "Failed password" /var/log/auth.log | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head -20

This gives you a ranked list of the most persistent attacking IPs. On a server that’s been online for a week, don’t be surprised if the top IP has thousands of attempts.
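The awk one-liner depends on the IP being the fourth field from the end. If you want something a little more explicit, for example to feed a dashboard, here is a Python sketch that does the same count with a regular expression. The function name is illustrative, and the pattern is based only on the log lines shown above.

```python
import re
from collections import Counter

# Matches both "for root from IP" and "for invalid user admin from IP".
FAILED = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def top_attackers(lines, limit=20):
    """Return [(ip, failed_attempts), ...] sorted by attempts, highest first."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(limit)
```

Usage would be along the lines of top_attackers(open("/var/log/auth.log")), run with enough privileges to read the log.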

Now let’s shut them down properly.

Step 1: Switch to SSH Key-Only Authentication

This is the nuclear option against brute force, and it should be the first thing you do on any server. Not the last. Not “eventually.” First.

SSH key authentication replaces passwords with a cryptographic key pair. Your private key stays on your local machine, your public key goes on the server. When you connect, the server challenges your client to prove it has the private key without ever transmitting it. No password is sent. No password can be guessed. Brute force doesn’t work because there’s nothing to brute force.

Generate a key pair on your local machine (if you don’t already have one):

ssh-keygen -t ed25519 -C "you@yourdomain.com"

Ed25519 is the modern choice: it’s fast, its keys are much shorter than RSA keys, and it offers strong security without any key-size decisions. If you’re dealing with legacy systems that don’t support ed25519, fall back to RSA with a 4096-bit key:

ssh-keygen -t rsa -b 4096

Copy the public key to your server:

ssh-copy-id -i ~/.ssh/id_ed25519.pub user@your-server-ip

Test key login before disabling passwords. Open a new terminal and connect:

ssh user@your-server-ip

If you get in without being asked for a password, key auth is working. Keep this session open as a safety net.

Now disable password authentication. Edit the SSH daemon config:

sudo nano /etc/ssh/sshd_config

Find and set these three directives:

PasswordAuthentication no

PubkeyAuthentication yes

ChallengeResponseAuthentication no

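That third one is important. ChallengeResponseAuthentication can bypass PasswordAuthentication on some PAM configurations, effectively keeping password login alive even when you think you’ve disabled it. Set it to no. On OpenSSH 8.7 and later the directive is named KbdInteractiveAuthentication, and the old name is kept as a deprecated alias.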

Also disable empty passwords while you’re in there:

PermitEmptyPasswords no

Test the config before applying:

sudo sshd -t

If it returns silently with no errors, restart (on Debian and Ubuntu the service unit is named ssh, so use that name if sshd is not found):

sudo systemctl restart sshd

From this point on, every brute force attempt in the world is wasted effort against your server. The bots will keep trying — they don’t know you’ve disabled passwords — but every single attempt fails instantly. Check your auth log after a few minutes:

sudo tail -50 /var/log/auth.log

You’ll see the attempts now ending with Connection closed by authenticating user or Disconnected from authenticating user instead of Failed password. They can’t even get to the password prompt.

Step 2: Lock Down sshd_config

Key-only auth handles brute force, but there’s more to SSH security than just authentication. Open /etc/ssh/sshd_config and layer these additional restrictions:

Disable root login:

PermitRootLogin no

Even with key-only auth, allowing direct root login is bad practice. If an attacker somehow gets your private key (stolen laptop, compromised backup), they’d have immediate root access. Force the use of a regular user account that escalates with sudo.

If you absolutely must allow root key-based login (some automation tools require it), use:

PermitRootLogin prohibit-password

This allows root login with SSH keys but not passwords. It’s a compromise, not ideal, but better than yes.

Restrict which users can log in:

AllowUsers deployer admin

This is a whitelist. Only the users listed here can SSH in. Everyone else — even users with valid system accounts and SSH keys — gets rejected. This is powerful because it means a compromised application user (like www-data or postgres) can’t be leveraged for SSH access even if an attacker manages to plant a key in their authorized_keys file.

Limit authentication attempts and timing:

MaxAuthTries 3

LoginGraceTime 20

MaxAuthTries 3 means after 3 failed authentication attempts within a single connection, the server disconnects. LoginGraceTime 20 gives only 20 seconds to complete authentication — if you haven’t authenticated in 20 seconds, you’re disconnected. Legitimate users authenticate in under 2 seconds; only bots and manual attackers need more time.

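Drop unresponsive connections:

ClientAliveInterval 300

ClientAliveCountMax 2

The server sends a keepalive probe through the encrypted channel after 300 seconds (5 minutes) without data from the client. If the client fails to answer 2 consecutive probes, the connection is terminated. This cleans up sessions whose client has vanished (closed laptop, dropped network) within about 10 minutes, so they don’t sit open indefinitely. It does not log out a user who is merely idle, because a live client answers the probes automatically.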

Disable unnecessary features:

X11Forwarding no

AllowTcpForwarding no

AllowAgentForwarding no

Unless you specifically need X11 forwarding (running graphical apps over SSH), TCP forwarding (tunneling), or agent forwarding (chaining SSH connections), disable them. Each enabled feature is a potential attack vector. Turn them off by default and enable them selectively when needed.

After making all changes, test and restart:

sudo sshd -t

sudo systemctl restart sshd

Always — and I mean always — keep an existing SSH session open while you restart sshd. If you made a typo that prevents new connections, your existing session stays alive and lets you fix it. If you close your only session before testing, and the new config has an error, you’re locked out and praying your VPS provider has a console rescue option.
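Once the file has a dozen hardening directives in it, it is easy to lose track of what is actually set. The Python sketch below scans config text and reports directives that differ from the values recommended above. It is deliberately naive: the DESIRED dict and function name are illustrative, and it ignores Include files and Match blocks. Running sshd -T as root prints the effective configuration and is the more reliable source.

```python
# Values recommended in this guide; keys are lowercased directive names.
DESIRED = {
    "passwordauthentication": "no",
    "permitrootlogin": "no",
    "permitemptypasswords": "no",
    "x11forwarding": "no",
}


def audit_sshd_config(text, desired=DESIRED):
    """Return {directive: found_value_or_None} for settings that differ from desired."""
    seen = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        seen.setdefault(key, value)  # sshd uses the first value it finds for a keyword
    return {k: seen.get(k) for k, v in desired.items() if seen.get(k) != v}
```

An empty dict means every directive in DESIRED is present with the expected value. A value of None means the directive is missing and the compiled-in default applies.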

Step 3: Configure Fail2ban

Key-only auth makes brute force ineffective, but the bots don’t know that. They’ll keep hammering your server, consuming bandwidth and filling your logs. Fail2ban fixes this by watching your auth log and automatically firewall-blocking IPs that fail too many times.

Install fail2ban:

sudo apt install fail2ban      # Debian/Ubuntu

sudo dnf install fail2ban      # RHEL/CentOS/Fedora

Create a local config (never edit jail.conf directly — it gets overwritten on updates):

sudo cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local

Edit the local config:

sudo nano /etc/fail2ban/jail.local

Find the [sshd] section and configure it:

[sshd]

enabled = true

port = ssh

filter = sshd

logpath = /var/log/auth.log

maxretry = 3

findtime = 600

bantime = 3600

This means: if an IP fails 3 times within 10 minutes (600 seconds), ban it for 1 hour (3600 seconds). For most servers, this is a reasonable starting point.
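To see how maxretry, findtime, and bantime interact, here is a toy model of the rule in Python. It is not fail2ban's code, only a sketch of the sliding-window logic those three settings describe, with illustrative names.

```python
from collections import defaultdict, deque


class BanTracker:
    """Toy model of a jail: ban after maxretry failures within findtime seconds."""

    def __init__(self, maxretry=3, findtime=600, bantime=3600):
        self.maxretry = maxretry
        self.findtime = findtime
        self.bantime = bantime
        self.failures = defaultdict(deque)  # ip -> timestamps of recent failures
        self.banned_until = {}              # ip -> time at which the ban ends

    def is_banned(self, ip, now):
        return self.banned_until.get(ip, 0) > now

    def record_failure(self, ip, now):
        """Register a failed login; return True if this failure triggers a ban."""
        if self.is_banned(ip, now):
            return False
        window = self.failures[ip]
        window.append(now)
        # Forget failures that have fallen out of the findtime window.
        while window and window[0] <= now - self.findtime:
            window.popleft()
        if len(window) >= self.maxretry:
            self.banned_until[ip] = now + self.bantime
            window.clear()
            return True
        return False
```

The model also shows why throttled bots slip through: three failures spaced more than findtime apart never fill the window, which is what the recidive jail and longer findtime values are for.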

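For servers that take a real beating, you can get more aggressive (keep comments on their own lines, because fail2ban’s config format does not treat # as an inline comment marker):

# 24-hour ban

bantime = 86400

# 3 failures within an hour

findtime = 3600

maxretry = 3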

Add the recidive jail for repeat offenders. This catches IPs that get banned, wait out the ban, and come back. Add this to the bottom of jail.local:

[recidive]

enabled = true

logpath = /var/log/fail2ban.log

banaction = %(banaction_allports)s

bantime = 604800

findtime = 86400

maxretry = 3

This watches fail2ban’s own log. If an IP gets banned 3 times within 24 hours, the recidive jail bans them for a full week across all ports. Persistent bots learn the hard way.

Start and enable fail2ban:

sudo systemctl enable fail2ban

sudo systemctl start fail2ban

Check the status:

sudo fail2ban-client status sshd

You’ll see the number of currently banned IPs and the total number of bans since fail2ban started. On a fresh server with SSH on port 22, expect this number to climb quickly.

Unban an IP (if you accidentally lock yourself out):

sudo fail2ban-client set sshd unbanip 203.0.113.50

Pro tip: whitelist your own IP so fail2ban never bans you, even if you fat-finger your key passphrase multiple times:

# In jail.local, under [DEFAULT]

ignoreip = 127.0.0.1/8 ::1 203.0.113.50

Replace 203.0.113.50 with your actual IP or IP range.

Step 4: Reduce the Noise — Change the Port

I’ll be upfront about this: changing the SSH port is not a security measure. It’s a noise reduction measure. Any attacker who specifically targets your server will find your SSH port within seconds using a port scan. Changing it from 22 to 2222 or 4822 or 39122 doesn’t add meaningful security against a determined attacker.

What it does do is eliminate 99% of the automated bot traffic that exclusively targets port 22. After changing the port, your auth.log goes from thousands of daily entries to nearly zero. This makes it much easier to spot real threats among the remaining entries.

Change the port in sshd_config:

sudo nano /etc/ssh/sshd_config

Change:

Port 22

To:

Port 4822

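Open the new port in the firewall BEFORE restarting SSH (otherwise you lock yourself out). Leave the port 22 rule in place for now and remove it only after you have confirmed a login on the new port:

For UFW:

sudo ufw allow 4822/tcp

# After confirming the new port works:

sudo ufw delete allow 22/tcp

For iptables (-I inserts at the top of the chain, so the rule lands ahead of any final DROP or REJECT rule):

sudo iptables -I INPUT -p tcp --dport 4822 -j ACCEPT

# After confirming the new port works:

sudo iptables -D INPUT -p tcp --dport 22 -j ACCEPT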

Update fail2ban to watch the new port. In jail.local, change the sshd section:

[sshd]

port = 4822

Restart everything:

sudo sshd -t

sudo systemctl restart sshd

sudo systemctl restart fail2ban

Connect from a new terminal using the new port:

ssh -p 4822 user@your-server-ip

Keep your old session alive until you confirm the new connection works.

Step 5: IP Whitelisting and Port Knocking (For the Paranoid)

If you always connect from the same IP or IP range — say, your home IP or your office VPN — you can lock SSH down to only accept connections from those addresses.

With UFW:

sudo ufw allow from 203.0.113.0/24 to any port 4822

This means only IPs in the 203.0.113.0/24 range can even reach the SSH port. Everyone else gets a timeout. The bots don’t even see that SSH exists on your server.

The problem with IP whitelisting is that your IP might change (dynamic ISP, travel, different networks). If your IP changes and you haven’t updated the whitelist, you’re locked out.

Port knocking solves this problem. It keeps the SSH port completely closed — invisible to port scans — until a client sends a specific sequence of connection attempts to other ports in the correct order.

Install knockd:

sudo apt install knockd

Configure it:

sudo nano /etc/knockd.conf

[options]

    UseSyslog

[openSSH]

    sequence = 7000,8000,9000

    seq_timeout = 15

    command = /usr/sbin/ufw allow from %IP% to any port 4822

    tcpflags = syn

[closeSSH]

    sequence = 9000,8000,7000

    seq_timeout = 15

    command = /usr/sbin/ufw delete allow from %IP% to any port 4822

    tcpflags = syn

This configuration works like a secret handshake. To open SSH, you knock on ports 7000, 8000, 9000 in that exact order within 15 seconds. To close it afterward, knock in reverse: 9000, 8000, 7000.

From your local machine, knock with:

knock your-server-ip 7000 8000 9000

ssh -p 4822 user@your-server-ip

# When done:

knock your-server-ip 9000 8000 7000

Or if you don’t have the knock client, you can use nmap:

nmap -Pn --host-timeout 201 --max-retries 0 -p 7000 your-server-ip

nmap -Pn --host-timeout 201 --max-retries 0 -p 8000 your-server-ip

nmap -Pn --host-timeout 201 --max-retries 0 -p 9000 your-server-ip

Port knocking is overkill for most setups. But if you’re managing a server that handles sensitive data, or you just like knowing that your SSH port doesn’t show up in casual port scans, it’s a worthwhile layer to add.
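If the "secret handshake" description feels abstract, the logic is a small state machine per source IP. The Python sketch below is a simplified model of the idea, not knockd's actual implementation, and the class and method names are made up.

```python
class KnockListener:
    """Track each IP's progress through the knock sequence."""

    def __init__(self, sequence=(7000, 8000, 9000), seq_timeout=15):
        self.sequence = tuple(sequence)
        self.seq_timeout = seq_timeout
        self.progress = {}  # ip -> (index of next expected port, time of first knock)

    def knock(self, ip, port, now):
        """Return True when ip has just completed the full sequence in time."""
        index, started = self.progress.get(ip, (0, now))
        if index > 0 and now - started > self.seq_timeout:
            index, started = 0, now  # too slow: start over
        if port == self.sequence[index]:
            index += 1
            if index == len(self.sequence):
                self.progress.pop(ip, None)
                return True  # this is where the firewall rule would be added
            self.progress[ip] = (index, started)
        elif port == self.sequence[0]:
            self.progress[ip] = (1, now)  # wrong port, but it restarts the sequence
        else:
            self.progress.pop(ip, None)   # wrong port: forget this IP's progress
        return False
```

A port scanner that hits 7000, 8000, and 9000 in order would also open the door, which is why longer or less obvious sequences are worth considering.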

The SSH Hardening Checklist

Here’s the complete order of operations for any new Linux server:

1. Generate SSH keys on your local machine (if you haven’t already)
2. Copy the public key to the server with ssh-copy-id
3. Test key login before touching any config
4. Disable password authentication: PasswordAuthentication no
5. Disable root login: PermitRootLogin no
6. Restrict users: AllowUsers with only the accounts that need SSH
7. Limit attempts and timeouts: MaxAuthTries 3, LoginGraceTime 20
8. Install fail2ban: configure with reasonable ban times
9. Change the SSH port: reduces log noise from automated scanners
10. Whitelist IPs or set up port knocking: if you connect from known locations

Do them in this order. Each layer addresses a different threat, and together they make SSH brute force a non-issue. The bots will keep scanning, the botnets will keep running, but your server is no longer a target that can yield results.

I’ve been watching these logs for decades. The attacks never stop, they only evolve. But a properly hardened SSH setup hasn’t changed much in that time either — because the fundamentals work. Keys beat passwords. Automatic bans beat manual blocking. And a healthy dose of paranoia beats blind trust every single time.


How to Find and Fix Slow Queries in PostgreSQL — Read EXPLAIN ANALYZE Before You Add Random Indexes

banditz — Thu, 01 Jan 2026 04:41:00 GMT

There’s a ritual that happens in every engineering team eventually. Someone notices an API endpoint is slow. Someone else looks at the database and says “we need an index.” They add an index on what seems like the right column, deploy it, and… the query is still slow. Or it’s faster for that one query but now three other queries have mysteriously gotten worse.

This happens because adding an index without reading the query plan is like prescribing medicine without diagnosing the patient. Sometimes the problem is a missing index. But just as often, the problem is stale statistics, a badly written query, a join that the planner is executing in the wrong order, or an ORM that’s generating SQL you’d be embarrassed to write by hand.

PostgreSQL gives you the exact diagnostic tool to figure this out. It’s called EXPLAIN ANALYZE, and if you’re not using it every time you investigate a slow query, you’re guessing. Let’s stop guessing.

Step 1: Find the Queries That Actually Matter

Before you optimize anything, you need to know what to optimize. The slowest query isn’t necessarily the one that matters most. A query that takes 2 seconds but runs once a day isn’t as urgent as a query that takes 50ms but runs 100,000 times a day. The second one consumes way more server time overall.

Enable pg_stat_statements — this extension tracks execution statistics for every query that runs on your server.

Add it to postgresql.conf:

shared_preload_libraries = 'pg_stat_statements'

Restart PostgreSQL:

sudo systemctl restart postgresql

Create the extension:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Now find the queries that consume the most total time:

SELECT

    substring(query, 1, 120) AS query_preview,

    calls,

    round(total_exec_time::numeric, 2) AS total_ms,

    round(mean_exec_time::numeric, 2) AS avg_ms,

    rows

FROM pg_stat_statements

ORDER BY total_exec_time DESC

LIMIT 10;

This shows your top 10 resource consumers. The total_exec_time column is what matters: it’s the cumulative time spent on that query across all calls. A query with 5ms average execution time and 2 million calls has consumed 10,000 seconds of total server time. That matters a lot more than the 800ms query that runs 50 times a day. (On PostgreSQL 12 and earlier, these columns are named total_time and mean_time.)
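The arithmetic behind that ranking is just calls multiplied by average time. A short Python sketch with the numbers from this section follows; the function names and the tuple layout are illustrative.

```python
def total_seconds(calls, mean_ms):
    """Cumulative server time for a query, in seconds."""
    return calls * mean_ms / 1000


def rank_by_total_time(stats):
    """stats: list of (label, calls, mean_ms). Highest cumulative time first."""
    return sorted(stats, key=lambda row: row[1] * row[2], reverse=True)
```

Here total_seconds(2_000_000, 5) comes out at 10,000 seconds, while total_seconds(50, 800) is only 40, which is why the fast but frequent query ranks first.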

For finding queries that are slow per-execution (the ones your users actually feel):

SELECT

    substring(query, 1, 120) AS query_preview,

    calls,

    round(mean_exec_time::numeric, 2) AS avg_ms,

    round(max_exec_time::numeric, 2) AS max_ms

FROM pg_stat_statements

WHERE mean_exec_time > 100

ORDER BY mean_exec_time DESC

LIMIT 10;

This catches the individually slow queries — anything averaging over 100ms. The max_exec_time column is useful too; if a query averages 50ms but maxes at 12,000ms, it has an intermittent performance problem likely related to lock contention, resource exhaustion, or cold cache.

You can also enable slow query logging directly in PostgreSQL as a safety net:

# In postgresql.conf

log_min_duration_statement = 1000    # Log any query taking > 1 second

This writes slow queries directly to the PostgreSQL log file, which is useful if pg_stat_statements isn’t available or if you want to see the exact parameters used in slow queries.

Step 2: Read the Execution Plan (This Is Where Understanding Begins)

You’ve identified a slow query. Now run it through EXPLAIN ANALYZE:

EXPLAIN (ANALYZE, BUFFERS)

SELECT o.id, o.total, c.name

FROM orders o

JOIN customers c ON c.id = o.customer_id

WHERE o.status = 'completed'

AND o.created_at > '2025-01-01'

ORDER BY o.created_at DESC

LIMIT 20;

The BUFFERS option adds information about how many disk pages were read, which helps distinguish between I/O problems and CPU problems.

Here’s what an output might look like:

    Limit  (cost=15234.52..15234.57 rows=20 width=52) (actual time=892.45..892.48 rows=20 loops=1)
      ->  Sort  (cost=15234.52..15456.23 rows=88682 width=52) (actual time=892.44..892.46 rows=20 loops=1)
            Sort Key: o.created_at DESC
            Sort Method: top-N heapsort  Memory: 27kB
            ->  Hash Join  (cost=12.50..13012.34 rows=88682 width=52) (actual time=0.82..845.20 rows=89542 loops=1)
                  Hash Cond: (o.customer_id = c.id)
                  ->  Seq Scan on orders o  (cost=0.00..11842.00 rows=88682 width=44) (actual time=0.04..780.32 rows=89542 loops=1)
                        Filter: ((status = 'completed') AND (created_at > '2025-01-01'))
                        Rows Removed by Filter: 410458
                  ->  Hash  (cost=10.00..10.00 rows=200 width=12) (actual time=0.42..0.42 rows=200 loops=1)
                        ->  Seq Scan on customers c  (cost=0.00..10.00 rows=200 width=12) (actual time=0.01..0.18 rows=200 loops=1)
    Planning Time: 0.38 ms
    Execution Time: 892.72 ms

Now let’s read this like a professional:

The bottleneck is obvious. The Seq Scan on orders o line shows actual time=0.04..780.32. That’s 780ms spent reading the entire orders table sequentially, all 500,000 rows, to find the 89,542 that match the filter. That’s about 87% of the total execution time (780 of 893ms) in one node.

The estimated vs actual rows match. rows=88682 estimated, rows=89542 actual. That’s close enough — the statistics are fine. The planner isn’t making a bad decision because of stale stats; it’s making the only decision it can because there’s no suitable index.

The Seq Scan on customers is fine. It takes 0.18ms because the table has only 200 rows. For a table that fits in a handful of pages, a sequential scan is usually the fastest access path. Don’t add indexes to a 200-row table for speed; beyond the primary key, it’s pointless overhead.

The sort is cheap. top-N heapsort with 27kB memory means PostgreSQL used an efficient algorithm for the LIMIT 20 ORDER BY — it found the top 20 without sorting all 89,542 matching rows.

The diagnosis: this query needs a composite index on orders(status, created_at) to avoid the sequential scan.

    CREATE INDEX idx_orders_status_created ON orders (status, created_at);

Run EXPLAIN ANALYZE again after creating the index:

    ->  Index Scan using idx_orders_status_created on orders o
          (cost=0.42..1823.56 rows=88682 width=44)
          (actual time=0.03..45.21 rows=89542 loops=1)
          Index Cond: ((status = 'completed') AND (created_at > '2025-01-01'))

780ms → 45ms. The sequential scan is gone, replaced by an index scan that reads only the matching rows.
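
On a busy production table, a plain CREATE INDEX blocks writes to the table while it builds. A sketch of the non-blocking variant follows, using the same index name as above. It assumes you run it outside a transaction block and can tolerate a slower build.

```sql
-- Builds the index without blocking INSERT/UPDATE/DELETE on orders.
-- Cannot run inside a transaction block, and takes longer than a plain build.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_status_created
    ON orders (status, created_at);

-- A failed concurrent build leaves an INVALID index behind; list any such indexes.
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;
```

If the second query returns the new index, drop it and rebuild; an invalid index is never used for reads but still costs writes.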

## Step 3: When Stale Statistics Are the Real Problem

Sometimes the execution plan shows something strange: the estimated row count is wildly different from the actual row count.

    ->  Seq Scan on events  (cost=0.00..25.00 rows=5 width=32) (actual time=0.04..312.45 rows=147823 loops=1)

The planner estimated 5 rows. The actual result was 147,823. That’s not a rounding error — the planner is working with completely wrong statistics and making terrible decisions as a result.

When estimated rows are much lower than actual rows, the planner tends to choose nested loop joins (good for small sets, terrible for large ones) and avoids using hash joins or merge joins that would be much more efficient. The entire execution plan downstream of the bad estimate is suboptimal.

Fix it with ANALYZE:

    ANALYZE events;

This collects fresh statistics about the table’s data distribution — how many rows, how many distinct values per column, most common values, histogram boundaries. After running ANALYZE, the planner has accurate information and can make better decisions.
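
To see what the planner actually learned, you can read the collected statistics back from the pg_stats view. This is a small sketch that assumes the events table lives in the public schema; adjust schemaname if yours does not.

```sql
-- Per-column statistics gathered by the last ANALYZE of events.
-- The 'public' schema is an assumption for illustration.
SELECT
    attname,
    null_frac,          -- fraction of rows where the column is NULL
    n_distinct,         -- > 0: estimated distinct count; < 0: fraction of row count
    most_common_vals,
    most_common_freqs
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'events';
```

If a column you filter on shows an n_distinct or most_common_vals list that looks nothing like the real data, that column is a likely source of bad row estimates.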

If specific columns have unusual distributions (lots of NULLs, extreme skew, or a huge number of distinct values), increase the statistics target for those columns:

    ALTER TABLE events ALTER COLUMN event_type SET STATISTICS 1000;
    ANALYZE events;

The default statistics target is 100, which means PostgreSQL samples 100 × 300 = 30,000 rows to build histograms. Increasing it to 1000 means 300,000 rows are sampled, giving more accurate statistics for high-cardinality or skewed columns.

Check if autovacuum is keeping up:

    SELECT
        relname,
        n_live_tup,
        n_dead_tup,
        last_autoanalyze,
        last_autovacuum
    FROM pg_stat_user_tables
    WHERE n_dead_tup > 1000
    ORDER BY n_dead_tup DESC;

If last_autoanalyze was a long time ago and n_dead_tup is high, autovacuum isn’t keeping up. For high-churn tables, tune it:

    ALTER TABLE events SET (
        autovacuum_analyze_scale_factor = 0.02,
        autovacuum_vacuum_scale_factor = 0.05
    );

This triggers ANALYZE after roughly 2% of the table changes (instead of the default 10%) and VACUUM after roughly 5% (instead of the default 20%).
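
The exact trigger point is a base threshold plus the scale factor times the table's estimated row count. The sketch below works that out for events, assuming the 0.02 and 0.05 scale factors set above and the server-wide base thresholds (50 rows each by default).

```sql
-- How close is events to its next automatic ANALYZE and VACUUM?
-- 0.02 and 0.05 are the per-table scale factors assumed from the ALTER TABLE above.
SELECT
    s.relname,
    c.reltuples::bigint                               AS estimated_rows,
    s.n_mod_since_analyze,
    (current_setting('autovacuum_analyze_threshold')::int
        + 0.02 * c.reltuples)::bigint                 AS analyze_trigger_rows,
    s.n_dead_tup,
    (current_setting('autovacuum_vacuum_threshold')::int
        + 0.05 * c.reltuples)::bigint                 AS vacuum_trigger_rows
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.relname = 'events';
```

When n_mod_since_analyze stays well above analyze_trigger_rows for long stretches, autovacuum workers are likely too busy or too throttled to get to the table. Note that reltuples can be -1 on recent versions for a table that has never been analyzed, which makes the result meaningless until the first ANALYZE.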

## Step 4: Query Patterns That No Index Can Fix

Some queries are slow not because of missing indexes but because the SQL itself prevents efficient execution. No amount of indexing fixes a fundamentally bad query pattern.

Functions on indexed columns:

    -- PostgreSQL CANNOT use an index on created_at here
    SELECT * FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2026;

The function EXTRACT() is evaluated on every single row. The index on created_at exists but is useless because PostgreSQL would need an index on EXTRACT(YEAR FROM created_at) — which doesn’t exist.

Rewrite as a range:

    -- This uses the index on created_at
    SELECT * FROM orders
    WHERE created_at >= '2026-01-01'
    AND created_at < '2027-01-01';

Same results. But now PostgreSQL does a quick index range scan instead of reading the entire table.
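
If the query text can't be changed (it is generated by a tool you don't control, say), the other option is to build the index the function needs. A sketch, under the assumption that created_at is a timestamp without time zone; the index name is made up for illustration.

```sql
-- Expression index matching EXTRACT(YEAR FROM created_at) exactly.
-- Assumes created_at is "timestamp without time zone". For timestamptz the
-- result depends on the session time zone, so the expression is not immutable
-- and PostgreSQL rejects it in an index definition.
CREATE INDEX idx_orders_created_year
    ON orders ((EXTRACT(YEAR FROM created_at)));
```

The range rewrite is still the better default: it reuses the plain index on created_at and serves month or day ranges as well, while the expression index only helps queries that use this exact expression.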

OFFSET pagination:

    SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 10000;

This looks efficient: “give me 20 rows starting at position 10,000.” But PostgreSQL must produce the first 10,020 rows in sorted order, then throw away the first 10,000. The deeper you paginate, the slower it gets. At OFFSET 100,000, it’s producing 100,020 rows to return 20.

Use keyset pagination instead:

    SELECT * FROM orders
    WHERE created_at < '2026-03-15T10:30:00Z'
    ORDER BY created_at DESC
    LIMIT 20;

The WHERE clause on created_at replaces the OFFSET. The query starts reading from the right position in the index and returns 20 rows immediately, regardless of which “page” you’re on. Page 1 and page 5,000 take the same amount of time.
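
One caveat: if two orders can share the same created_at, paging on that column alone can skip or repeat rows at page boundaries. A common remedy is to add a unique tiebreaker. The sketch below assumes orders.id is unique; the id value 48213 and the index name are placeholders standing in for the last row of the previous page.

```sql
-- Hypothetical supporting index so the row comparison maps to one index scan.
CREATE INDEX idx_orders_created_id ON orders (created_at, id);

-- Next page: everything strictly "before" the last row already shown,
-- comparing (created_at, id) as a pair.
SELECT *
FROM orders
WHERE (created_at, id) < ('2026-03-15T10:30:00Z', 48213)
ORDER BY created_at DESC, id DESC
LIMIT 20;
```

The client passes back both values from the last row it received, and the ORDER BY has to list the same columns in the same direction as the comparison.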

SELECT * through an ORM:

    SELECT * FROM orders
    JOIN customers ON customers.id = orders.customer_id
    JOIN order_items ON order_items.order_id = orders.id
    JOIN products ON products.id = order_items.product_id;

ORMs love eager loading, and eager loading loves SELECT * with multiple JOINs. The result set explodes — if an order has 5 items, you get 5 rows per order, each containing every column from all four tables. Most of that data is duplicated and never used.

The fix depends on what you actually need. If you only need order totals and customer names:

    SELECT o.id, o.total, c.name
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.created_at > '2026-01-01';

Selecting only the columns you need means smaller result sets, less data transfer, and potentially index-only scans (where PostgreSQL can answer the query entirely from the index without touching the table at all).

Correlated subqueries:

    SELECT *,
        (SELECT COUNT(*) FROM order_items WHERE order_id = orders.id) AS item_count
    FROM orders
    WHERE status = 'completed';

That subquery runs once for every row in the outer query. If the outer query returns 50,000 rows, the subquery executes 50,000 times. Replace it with a JOIN:

    -- Grouping by o.id alone is valid only if id is the primary key of orders
    SELECT o.*, COUNT(oi.id) AS item_count
    FROM orders o
    LEFT JOIN order_items oi ON oi.order_id = o.id
    WHERE o.status = 'completed'
    GROUP BY o.id;

One query, one pass. The planner can use a hash join and process everything in bulk.
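
A variation that may be worth comparing is to aggregate order_items first and join the result. This is only a sketch of an alternative shape, not necessarily faster; it does not rely on o.id being a declared primary key, and the alias ic is arbitrary.

```sql
-- Count items per order once, then attach the counts to completed orders.
-- COALESCE turns "no matching items" (NULL from the LEFT JOIN) into 0.
SELECT o.*, COALESCE(ic.item_count, 0) AS item_count
FROM orders o
LEFT JOIN (
    SELECT order_id, COUNT(*) AS item_count
    FROM order_items
    GROUP BY order_id
) ic ON ic.order_id = o.id
WHERE o.status = 'completed';
```

Note that the inner query counts items for every order, completed or not, so on a table where completed orders are a small minority the previous form may win. Run both through EXPLAIN ANALYZE before choosing.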

## Step 5: Index Strategy — When, What, and How Many

After you’ve confirmed through EXPLAIN ANALYZE that a missing index is genuinely the bottleneck, be strategic about what you create.

Composite indexes — column order matters:

    -- Good: status is the equality filter, created_at is the range filter
    CREATE INDEX idx_orders_status_created ON orders (status, created_at);

    -- Less useful: reversed order doesn't help if you're filtering by status first
    CREATE INDEX idx_orders_created_status ON orders (created_at, status);

The general rule: equality columns first, range columns second. PostgreSQL uses a composite index most efficiently when the query constrains its leading column: an index on (status, created_at) helps queries filtering on status alone, but an index on (created_at, status) is of little use to them, because the matching rows are scattered across the whole index.
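
A quick way to check the prefix rule on your own data is to run EXPLAIN on a few filter shapes against the (status, created_at) index. The sketch below only states what one would typically expect; the planner's actual choice depends on table size, data distribution, and PostgreSQL version.

```sql
-- Equality on the leading column: a good fit for idx_orders_status_created.
EXPLAIN SELECT id FROM orders WHERE status = 'pending';

-- Equality on the leading column plus a range on the second: the ideal case.
EXPLAIN SELECT id FROM orders
WHERE status = 'pending' AND created_at > '2026-01-01';

-- Only the second column is constrained: typically a Seq Scan, or a scan of
-- most of the index, unless another index on created_at exists.
EXPLAIN SELECT id FROM orders WHERE created_at > '2026-01-01';
```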

Partial indexes — index only what matters:

    CREATE INDEX idx_orders_pending ON orders (customer_id, created_at)
    WHERE status = 'pending';

This index is smaller than a full index because it only includes rows where status = 'pending'. If your query always filters for pending orders, this index is both smaller (faster to scan, less memory) and more precise than a full index.
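
The catch is that the planner only considers a partial index when the query's WHERE clause implies the index predicate. A short sketch of a query that qualifies and one that does not; the customer_id value 42 is a placeholder.

```sql
-- Eligible for idx_orders_pending: the filter includes status = 'pending'.
SELECT id, created_at
FROM orders
WHERE status = 'pending'
  AND customer_id = 42
ORDER BY created_at DESC;

-- Not eligible: rows with other statuses are missing from the index,
-- so it cannot answer a query that might need them.
SELECT id, created_at
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC;
```

Be careful with prepared statements that pass the status as a parameter: when the planner settles on a generic plan, it cannot prove the parameter equals 'pending' and will not use the partial index.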

Covering indexes (index-only scans):

    CREATE INDEX idx_orders_covering ON orders (status, created_at)
    INCLUDE (id, total, customer_id);

The INCLUDE columns are stored in the index but not used for searching. If your query only selects id, total, and customer_id, PostgreSQL can answer it from the index alone. This is called an index-only scan, and it’s usually the fastest execution path available. It only skips the table for pages the visibility map marks as all-visible, so it works best on tables that are vacuumed regularly.
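
To confirm that the covering index really delivers an index-only scan, check the plan after a vacuum. A minimal sketch, reusing the filter values from the earlier examples:

```sql
-- Refresh the visibility map and statistics first.
VACUUM (ANALYZE) orders;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, total, customer_id
FROM orders
WHERE status = 'completed'
  AND created_at > '2026-01-01';
-- In the output, look for "Index Only Scan using idx_orders_covering"
-- and check the "Heap Fetches" line beneath it.
```

A Heap Fetches count near zero means the table was barely touched. A count close to the number of rows returned means PostgreSQL still had to visit the heap for visibility checks, usually because the table changes faster than it is vacuumed.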

Don’t over-index. Every index on a table slows down writes. Each INSERT must add an entry to every index. An UPDATE that changes any indexed column writes a new entry into every index on the table, not just the one on the changed column. Each DELETE leaves dead entries behind that VACUUM must later clean out of every index. A table with 15 indexes pays that cost 15 times on every insert.
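
To get a feel for where you stand, a query along these lines lists the tables carrying the most index weight. It is a sketch built on pg_stat_user_indexes; the LIMIT of 20 is arbitrary.

```sql
-- Tables ranked by total index size, with the number of indexes on each.
SELECT
    schemaname,
    relname                                           AS table_name,
    count(*)                                          AS index_count,
    pg_size_pretty(sum(pg_relation_size(indexrelid))) AS total_index_size
FROM pg_stat_user_indexes
GROUP BY schemaname, relname
ORDER BY sum(pg_relation_size(indexrelid)) DESC
LIMIT 20;
```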

Check for unused indexes periodically:

    SELECT
        indexrelname AS index_name,
        idx_scan AS times_used,
        pg_size_pretty(pg_relation_size(indexrelid)) AS size
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    AND indexrelname NOT LIKE '%pkey%'
    ORDER BY pg_relation_size(indexrelid) DESC;

These are indexes that have never been used since the last statistics reset. If they’ve been unused for months, drop them. They’re consuming disk space, memory, and write performance for zero benefit.
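
Before dropping anything, it helps to know how long the counters have been accumulating and to make sure the index isn't enforcing a constraint. The sketch below is one cautious way to go about it; idx_orders_created_status is used purely as an example name for the index being dropped.

```sql
-- When were this database's cumulative statistics last reset?
SELECT stats_reset
FROM pg_stat_database
WHERE datname = current_database();

-- Unused indexes, excluding any that enforce uniqueness
-- (primary keys and unique constraints show idx_scan = 0 too if never read).
SELECT
    s.schemaname,
    s.relname      AS table_name,
    s.indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(s.indexrelid)) AS size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;

-- Drop without blocking reads and writes on the table.
-- Cannot run inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_created_status;
```

Usage counters are tracked per server, so an index that looks unused on the primary may still be serving queries on a read replica. Check there as well.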

## The Complete Slow Query Diagnostic Sequence

When a query is slow, run through this:

1. **Find it:** use pg_stat_statements to identify the query by total time or average time
2. **Read the plan:** EXPLAIN (ANALYZE, BUFFERS) on the query
3. **Check estimated vs actual rows:** if they’re wildly different, run ANALYZE on the tables involved
4. **Look for Seq Scans on large tables:** if the filter is selective (returns < 15% of rows), an index is likely needed
5. **Check the query pattern:** look for functions on indexed columns, OFFSET pagination, SELECT *, correlated subqueries
6. **Add indexes strategically:** composite, partial, covering, based on what the plan tells you
7. **Verify the fix:** run EXPLAIN ANALYZE again and confirm the plan improved

Don’t skip step 3. I’ve seen teams spend days tuning queries that were slow because of stale statistics. A single ANALYZE command fixed the whole thing in under a second.

And don’t skip step 7. I’ve seen people create indexes that PostgreSQL ignores because the query pattern doesn’t match. You haven’t fixed anything until EXPLAIN ANALYZE confirms it.

Database performance is a discipline, not a guessing game. The tools exist. The plans are readable. The fixes are usually straightforward once you know where to look. The hard part isn’t the fix — it’s convincing yourself to actually look at the plan instead of throwing indexes at the wall and hoping one sticks.
