<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Daily Dejavu</title>
    <link>https://dailydejavu.com</link>
    <description>Latest updates from Daily Dejavu</description>
    <language>en-us</language>
    <lastBuildDate>Mon, 20 Apr 2026 10:01:37 GMT</lastBuildDate>
    <atom:link href="https://dailydejavu.com/feed" rel="self" type="application/rss+xml" />
    <item>
      <title><![CDATA[How to Build a Production CI/CD Pipeline with GitHub Actions — Staging, Secrets, and Rollback]]></title>
      <link>https://dailydejavu.com/devops/build-cicd-pipeline-github-actions-staging-production</link>
      <guid isPermaLink="true">https://dailydejavu.com/devops/build-cicd-pipeline-github-actions-staging-production</guid>
      <pubDate>Mon, 20 Apr 2026 09:48:00 GMT</pubDate>
      <dc:creator><![CDATA[banditz]]></dc:creator>
      <description><![CDATA[Your deploy process is SSH into server, git pull, pray. Here's how to build a real CI/CD pipeline with GitHub Actions — automated tests, staging deploys, production gates, and rollback that works.
]]></description>
      <content:encoded><![CDATA[<pre><code class="hljs">The deploy process at too many companies looks like this: SSH into the production server, <span class="hljs-code">`cd /var/www/app`</span>, <span class="hljs-code">`git pull origin main`</span>, <span class="hljs-code">`npm install`</span>, <span class="hljs-code">`npm run build`</span>, restart the service, and refresh the browser to see if anything broke. If it did, you SSH back in, <span class="hljs-code">`git log`</span>, <span class="hljs-code">`git revert`</span>, and hope you caught the right commit.

This works until it doesn't. And when it doesn't, it fails spectacularly — a typo in a config file that takes the site down, a missing dependency that wasn't in the repo, a migration that runs against production before anyone's ready. No audit trail, no approval gate, no rollback beyond "revert and pray."

CI/CD isn't about fancy tools. It's about making deployments boring, predictable, and reversible.

<span class="hljs-section">## The Pipeline Structure</span>

A production pipeline has four stages:

<span class="hljs-bullet">1.</span> <span class="hljs-strong">**Test**</span> — run the full test suite. If anything fails, stop.
<span class="hljs-bullet">2.</span> <span class="hljs-strong">**Build**</span> — create the deployment artifact (Docker image, compiled bundle, whatever ships).
<span class="hljs-bullet">3.</span> <span class="hljs-strong">**Deploy to Staging**</span> — automatically deploy to an environment that mirrors production.
<span class="hljs-bullet">4.</span> <span class="hljs-strong">**Deploy to Production**</span> — after staging verification, deploy with an approval gate.

Here's the GitHub Actions workflow structure:

<span class="hljs-code">    # .github/workflows/deploy.yml
    name: Deploy Pipeline
</span>
<span class="hljs-code">    on:
      push:
        branches: [main]
      workflow_dispatch:   # Manual trigger
</span>
<span class="hljs-code">    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
</span>
<span class="hljs-bullet">          -</span> name: Set up Node.js
<span class="hljs-code">            uses: actions/setup-node@v4
            with:
              node-version: '20'
              cache: 'npm'
</span>
<span class="hljs-bullet">          -</span> run: npm ci
<span class="hljs-bullet">          -</span> run: npm run lint
<span class="hljs-bullet">          -</span> run: npm test

<span class="hljs-code">      build:
        needs: test
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
</span>
<span class="hljs-bullet">          -</span> name: Build Docker image
<span class="hljs-code">            run: |
              docker build -t myapp:${{ github.sha }} .
              docker tag myapp:${{ github.sha }} myapp:latest
</span>
<span class="hljs-bullet">          -</span> name: Push to registry
<span class="hljs-code">            run: |
              echo "${{ secrets.REGISTRY_PASSWORD }}" | docker login -u "${{ secrets.REGISTRY_USER }}" --password-stdin
              docker push myapp:${{ github.sha }}
</span>
<span class="hljs-code">      deploy-staging:
        needs: build
        runs-on: ubuntu-latest
        environment: staging
        steps:
          - name: Deploy to staging
            uses: appleboy/ssh-action@v1
            with:
              host: ${{ secrets.STAGING_HOST }}
              username: deploy
              key: ${{ secrets.SSH_PRIVATE_KEY }}
              script: |
                docker pull myapp:${{ github.sha }}
                docker stop myapp || true
                docker rm myapp || true
                docker run -d --name myapp \
                  --env-file /opt/myapp/.env \
                  -p 3000:3000 \
                  myapp:${{ github.sha }}
</span>
<span class="hljs-bullet">          -</span> name: Smoke test staging
<span class="hljs-code">            run: |
              sleep 10
              curl -sf https://staging.yourdomain.com/health || exit 1
</span>
<span class="hljs-code">      deploy-production:
        needs: deploy-staging
        runs-on: ubuntu-latest
        environment: production     # Requires approval
        steps:
          - name: Deploy to production
            uses: appleboy/ssh-action@v1
            with:
              host: ${{ secrets.PRODUCTION_HOST }}
              username: deploy
              key: ${{ secrets.SSH_PRIVATE_KEY }}
              script: |
                docker pull myapp:${{ github.sha }}
                # Keep the previous version for rollback
                # (remove any older myapp-previous first, or the rename fails)
                docker rm -f myapp-previous || true
                docker rename myapp myapp-previous || true
                docker stop myapp-previous || true
                docker run -d --name myapp \
                  --env-file /opt/myapp/.env \
                  -p 3000:3000 \
                  myapp:${{ github.sha }}
</span>
<span class="hljs-bullet">          -</span> name: Verify production
<span class="hljs-code">            run: |
              sleep 15
              curl -sf https://yourdomain.com/health || exit 1
</span>
The <span class="hljs-code">`needs`</span> keyword creates dependencies between jobs. <span class="hljs-code">`deploy-staging`</span> won't run unless <span class="hljs-code">`build`</span> succeeds. <span class="hljs-code">`deploy-production`</span> won't run unless <span class="hljs-code">`deploy-staging`</span> succeeds. One failed test stops the entire pipeline.
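
As a rough illustration of how that chain behaves, the sketch below models the four jobs as a plain object and walks them in order. The `simulate` function and the `jobs` object are hypothetical helpers written for this example, not part of GitHub Actions, and they only cover the default behaviour (no `if: always()` overrides).

```javascript
// Hypothetical model of the workflow above: each job lists the jobs it needs.
const jobs = {
  test: [],
  build: ["test"],
  "deploy-staging": ["build"],
  "deploy-production": ["deploy-staging"],
};

// Returns the jobs that would start, in order, given a set of failing jobs.
// A job starts only if every job it needs ran and succeeded.
function simulate(jobGraph, failing) {
  const succeeded = new Set();
  const ran = [];
  // Keys are visited in declaration order, which is already dependency
  // order for this linear pipeline.
  for (const [name, needs] of Object.entries(jobGraph)) {
    if (!needs.every((dep) => succeeded.has(dep))) continue; // skipped
    ran.push(name);
    if (!failing.has(name)) succeeded.add(name);
  }
  return ran;
}

console.log(simulate(jobs, new Set()));           // all four jobs start
console.log(simulate(jobs, new Set(["test"])));   // only "test" starts
```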

<span class="hljs-section">## Secrets Management</span>

Never hardcode credentials in workflow files. GitHub provides encrypted secrets at the repository and environment level.

Go to <span class="hljs-strong">**Settings → Secrets and variables → Actions**</span> and add:

<span class="hljs-bullet">-</span> <span class="hljs-code">`REGISTRY_USER`</span> / <span class="hljs-code">`REGISTRY_PASSWORD`</span> — Docker registry credentials
<span class="hljs-bullet">-</span> <span class="hljs-code">`SSH_PRIVATE_KEY`</span> — deploy user's SSH key
<span class="hljs-bullet">-</span> <span class="hljs-code">`STAGING_HOST`</span> / <span class="hljs-code">`PRODUCTION_HOST`</span> — server addresses

Secrets are automatically masked in logs. If a secret value appears in the output, GitHub replaces it with <span class="hljs-code">`***`</span>.

For cloud deployments (AWS, GCP, Azure), use <span class="hljs-strong">**OIDC federation**</span> instead of long-lived credentials. This generates short-lived tokens for each workflow run:

<span class="hljs-bullet">    -</span> name: Configure AWS credentials
<span class="hljs-code">      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::123456789012:role/github-deploy
        aws-region: us-east-1
</span>
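No access keys are stored anywhere. GitHub authenticates with the cloud provider using its OIDC identity, which requires granting the job the <span class="hljs-code">`id-token: write`</span> permission. The resulting credentials are short-lived and expire on their own (by default after about an hour for this action), so nothing long-lived can leak.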

<span class="hljs-section">## The Approval Gate</span>

In the GitHub repository, go to <span class="hljs-strong">**Settings → Environments → production**</span>. Enable:

<span class="hljs-bullet">-</span> <span class="hljs-strong">**Required reviewers**</span> — specify team members who must approve before production deploy
<span class="hljs-bullet">-</span> <span class="hljs-strong">**Wait timer**</span> — optional delay before deploy executes (gives time for last-minute checks)
<span class="hljs-bullet">-</span> <span class="hljs-strong">**Deployment branches**</span> — restrict to <span class="hljs-code">`main`</span> only

When the pipeline reaches <span class="hljs-code">`deploy-production`</span>, it pauses and sends a notification to the required reviewers. They can inspect the staging deployment, check test results, and either approve or reject.

This prevents the "I accidentally merged to main and it deployed to production" scenario. Someone has to actively approve every production deployment.

<span class="hljs-section">## Database Migrations</span>

The trickiest part of CI/CD is database migrations. The rule:

<span class="hljs-strong">**Every migration must be backward compatible.**</span>

This means the old application code must work with the new schema. Why? Because during deployment, there's a window where the old code is still running against the new database. And during rollback, the new schema must work with the old code.

Practically, this means:

<span class="hljs-bullet">-</span> <span class="hljs-strong">**Adding a column**</span> — safe. Old code ignores it.
<span class="hljs-bullet">-</span> <span class="hljs-strong">**Removing a column**</span> — do it in two deploys. First deploy: stop using the column in code. Second deploy: remove the column from the schema.
<span class="hljs-bullet">-</span> <span class="hljs-strong">**Renaming a column**</span> — also two deploys. Add new column, migrate data, update code to use new column, then remove old column.
<span class="hljs-bullet">-</span> <span class="hljs-strong">**Adding a NOT NULL column**</span> — add with a default value or make it nullable first.
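
The rules above can be sketched as a small review helper. This is only an illustration: the operation shapes (`type`, `nullable`, `default`, `stillUsedByCode`) are made-up names for this example, and real migration tools use their own formats.

```javascript
// Hypothetical operation shapes, invented for this sketch.
function isBackwardCompatible(op) {
  switch (op.type) {
    case "addColumn":
      // Safe if old code can keep inserting rows without knowing the column.
      return op.nullable === true || op.default !== undefined;
    case "dropColumn":
      // Safe only once no deployed code still reads or writes the column.
      return op.stillUsedByCode === false;
    case "renameColumn":
      // A direct rename breaks old code immediately; split it into
      // add column, backfill, switch code, drop column across deploys.
      return false;
    default:
      return false; // unknown operations need a human review
  }
}

console.log(isBackwardCompatible({ type: "addColumn", nullable: true }));  // true
console.log(isBackwardCompatible({ type: "addColumn", nullable: false })); // false
console.log(isBackwardCompatible({ type: "renameColumn" }));               // false
```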

Run migrations as a separate step before deploying the application:

<span class="hljs-bullet">    -</span> name: Run migrations
<span class="hljs-code">      uses: appleboy/ssh-action@v1
      with:
        host: ${{ secrets.PRODUCTION_HOST }}
        username: deploy
        key: ${{ secrets.SSH_PRIVATE_KEY }}
        script: |
          docker run --rm \
            --env-file /opt/myapp/.env \
            myapp:${{ github.sha }} \
            npm run migrate
</span>
<span class="hljs-section">## Rollback</span>

A deployment pipeline without rollback is a deployment prayer.

Create a separate rollback workflow:

<span class="hljs-code">    # .github/workflows/rollback.yml
    name: Rollback Production
    on:
      workflow_dispatch:
        inputs:
          version:
            description: 'Git SHA to rollback to'
            required: true
</span>
<span class="hljs-code">    jobs:
      rollback:
        runs-on: ubuntu-latest
        environment: production
        steps:
          - name: Rollback production
            uses: appleboy/ssh-action@v1
            with:
              host: ${{ secrets.PRODUCTION_HOST }}
              username: deploy
              key: ${{ secrets.SSH_PRIVATE_KEY }}
              script: |
                docker pull myapp:${{ github.event.inputs.version }}
                docker stop myapp || true
                docker rm myapp || true
                docker run -d --name myapp \
                  --env-file /opt/myapp/.env \
                  -p 3000:3000 \
                  myapp:${{ github.event.inputs.version }}
</span>
<span class="hljs-bullet">          -</span> name: Verify rollback
<span class="hljs-code">            run: |
              sleep 15
              curl -sf https://yourdomain.com/health || exit 1
</span>
Trigger it manually from the Actions tab and enter the SHA of the last known good version. The approval gate still applies.

<span class="hljs-strong">**Critical:**</span> test your rollback process regularly. An untested rollback is not a rollback — it's a hope. Run a rollback drill quarterly. Deploy a known version, roll back, verify.
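
The verify steps in both workflows use a fixed `sleep` followed by a single `curl`, which can fail if the container takes a little longer to start. One possible alternative is a polling check, sketched below. `waitForHealthy` is a hypothetical helper for this example; it assumes Node 18+ for the global `fetch`, and the `fetchFn` parameter exists only so the function can be exercised without a network.

```javascript
// Polls a health endpoint until it answers with a 2xx status or the
// attempts run out. Returns true on success, false otherwise.
async function waitForHealthy(url, { attempts = 10, delayMs = 3000, fetchFn = fetch } = {}) {
  for (let i = 1; i <= attempts; i++) {
    try {
      const res = await fetchFn(url);
      if (res.ok) return true;
    } catch (err) {
      // Connection errors while the container is still starting: retry.
    }
    if (i < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```

A workflow step could call `waitForHealthy("https://yourdomain.com/health")` from a small script and exit non-zero when it resolves to `false`.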

<span class="hljs-section">## The CI/CD Maturity Progression</span>

If you're currently at "SSH and git pull," don't try to build the full pipeline in one day. Progress through these stages:

<span class="hljs-strong">**Stage 1:**</span> Automated tests in GitHub Actions. Tests run on every push. You still deploy manually, but you know the code passes tests before you do.

<span class="hljs-strong">**Stage 2:**</span> Automated staging deploys. Code that passes tests is automatically deployed to staging. You manually deploy to production.

<span class="hljs-strong">**Stage 3:**</span> Production approval gates. Staging works, you've built confidence. Add the production deploy with manual approval.

<span class="hljs-strong">**Stage 4:**</span> Monitoring and rollback. Add health checks, deployment verification, and a tested rollback process.

Each stage builds on the previous one. Don't skip stages. The confidence you build at each level is what makes the next level safe.

---

If you found this guide helpful, check out our other resources:

<span class="hljs-bullet">-</span> [<span class="hljs-string">Docker Networking Explained</span>](<span class="hljs-link">/devops/docker-networking-bridge-host-overlay-explained</span>)
</code></pre>
]]></content:encoded>
    </item>
    <item>
      <title><![CDATA[Docker Networking Explained — Bridge vs Host vs Overlay and When to Use Each]]></title>
      <link>https://dailydejavu.com/devops/docker-networking-bridge-host-overlay-explained</link>
      <guid isPermaLink="true">https://dailydejavu.com/devops/docker-networking-bridge-host-overlay-explained</guid>
      <pubDate>Mon, 20 Apr 2026 08:55:00 GMT</pubDate>
      <dc:creator><![CDATA[banditz]]></dc:creator>
      <description><![CDATA[Your containers can't talk to each other. Or they can, but only by IP and not by name. Or port mapping isn't working. Docker networking has six modes and most people only understand one of them.]]></description>
      <content:encoded><![CDATA[<p>You spin up two containers: a web app and a database. You try to connect the app to the database using the container name as the hostname. It doesn’t work, because the name doesn’t resolve. You try the container’s IP address and it works. But the IP can change every time the container restarts.</p>
<p>This is the most common Docker networking confusion, and it happens because of the default bridge network’s limitation that nobody tells you about until you hit it.</p>
<p>Docker has six network modes. Most people use one (the default bridge) and don’t realize it’s the worst one for multi-container applications.</p>
<h2 id="the-default-bridge-why-its-a-trap"><a class="header-anchor" href="#the-default-bridge-why-its-a-trap" target="_blank" rel="noopener noreferrer">The Default Bridge — Why It’s a Trap</a></h2>
<p>When Docker installs, it creates a network called <code>bridge</code> backed by a Linux bridge device called <code>docker0</code>. Every container started without a <code>--network</code> flag connects to this default bridge.</p>
<pre><code>docker run -d --name web nginx
# The postgres image exits at startup unless a superuser password is set
docker run -d --name db -e POSTGRES_PASSWORD=pass postgres:16
</code></pre>
<p>Both containers are on the default bridge. They can reach each other by IP:</p>
<pre><code>docker inspect web | grep IPAddress
# "IPAddress": "172.17.0.2"

docker exec db ping 172.17.0.2
# Works
</code></pre>
<p>But:</p>
<pre><code>docker exec db ping web
# ping: bad address 'web'
</code></pre>
<p><strong>The default bridge does not provide DNS resolution between containers.</strong> This is the single most confusing thing about Docker networking, and it catches everyone.</p>
<p>The reason is historical. The default bridge predates Docker’s embedded DNS server. For backward compatibility, Docker kept it without DNS. Every user-created bridge network gets DNS automatically.</p>
<p><strong>Never use the default bridge for multi-container applications.</strong> Always create a custom bridge.</p>
<h2 id="custom-bridge-networks-the-right-way"><a class="header-anchor" href="#custom-bridge-networks-the-right-way" target="_blank" rel="noopener noreferrer">Custom Bridge Networks — The Right Way</a></h2>
<pre><code>docker network create mynet
docker run -d --name web --network mynet nginx
docker run -d --name db --network mynet -e POSTGRES_PASSWORD=pass postgres:16
</code></pre>
<p>Now:</p>
<pre><code>docker exec web ping db
# PING db (172.20.0.3): 56 data bytes (works)

docker exec db ping web
# Works too
</code></pre>
<p>Custom bridge networks provide:</p>
<ul>
<li><strong>DNS resolution</strong> — containers find each other by name</li>
<li><strong>Better isolation</strong> — containers on different networks can’t communicate</li>
<li><strong>Connect/disconnect on the fly</strong> — <code>docker network connect mynet existing-container</code></li>
</ul>
<p><strong>Docker Compose does this automatically.</strong> Every <code>docker-compose.yml</code> creates a custom bridge for the project. All services can reach each other by service name:</p>
<pre><code>services:
  web:
    image: nginx
    ports:
      - "80:80"
  api:
    image: myapp
    environment:
      DATABASE_URL: postgresql://user:pass@db:5432/mydb
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: pass
</code></pre>
<p>Notice <code>db</code> in the connection string. Compose’s network makes <code>db</code> resolvable from any service in the same file. The <code>db</code> service has no <code>ports</code> mapping — it’s only accessible from other containers on the network, not from the host or outside. That’s a security advantage.</p>
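<p>To make the point concrete, here is a minimal sketch that parses the connection string from the Compose file with the standard <code>URL</code> class. It only illustrates that the host part is the service name; how the <code>api</code> service actually reads <code>DATABASE_URL</code> is not specified above.</p>

```javascript
// The string below is the DATABASE_URL value from the Compose file above.
const databaseUrl = new URL("postgresql://user:pass@db:5432/mydb");

// The hostname is the Compose service name, not an IP address.
// Docker's embedded DNS resolves it on the project network.
console.log(databaseUrl.hostname); // "db"
console.log(databaseUrl.port);     // "5432"
console.log(databaseUrl.pathname); // "/mydb"
```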
<h2 id="host-mode-when-you-need-raw-performance"><a class="header-anchor" href="#host-mode-when-you-need-raw-performance" target="_blank" rel="noopener noreferrer">Host Mode — When You Need Raw Performance</a></h2>
<pre><code>docker run -d --network host nginx
</code></pre>
<p>Host mode removes network isolation entirely. The container shares the host’s network stack — same IP, same interfaces, no NAT.</p>
<p>Nginx in the container now listens on port 80 of the host directly. No port mapping needed. No NAT overhead.</p>
<p><strong>When to use host mode:</strong></p>
<ul>
<li>Network monitoring tools that need access to all host interfaces</li>
<li>Applications with many dynamic ports where mapping each one is impractical</li>
<li>Maximum network performance (eliminates NAT/iptables processing)</li>
<li>Applications that need to bind to the host’s specific network interface</li>
</ul>
<p><strong>When NOT to use host mode:</strong></p>
<ul>
<li>Multiple containers needing the same port — they’ll conflict</li>
<li>When you need network isolation between containers</li>
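<li>On macOS or Windows, where host mode is not native: it was designed for Linux, and Docker Desktop supports it only as an opt-in feature in recent versions</li>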
</ul>
<p>For most web applications, the performance difference between bridge and host is negligible — under 1% latency difference. You’d need to be pushing millions of packets per second (network appliances, high-frequency trading) for host mode to matter.</p>
<h2 id="overlay-networks-multi-host-communication"><a class="header-anchor" href="#overlay-networks-multi-host-communication" target="_blank" rel="noopener noreferrer">Overlay Networks — Multi-Host Communication</a></h2>
<p>If you have containers running on different machines that need to communicate, overlay networks handle it:</p>
<pre><code>docker swarm init
docker network create --driver overlay --attachable production
docker service create --name api --network production --replicas 3 myapp
</code></pre>
<p>Overlay networks:</p>
<ul>
<li>Span multiple Docker hosts</li>
<li>Create VXLAN tunnels between hosts (application traffic is encrypted only if the network is created with <code>--opt encrypted</code>)</li>
<li>Provide DNS-based service discovery across hosts</li>
<li>Required for Docker Swarm services</li>
</ul>
<p>Each container gets an IP on the overlay network and can reach other containers by name, regardless of which physical host they’re running on.</p>
<p>For single-host deployments (most development environments and many production setups), bridge networks are simpler and sufficient. Use overlay only when you genuinely have multi-host requirements.</p>
<h2 id="none-and-macvlan-the-specialty-modes"><a class="header-anchor" href="#none-and-macvlan-the-specialty-modes" target="_blank" rel="noopener noreferrer">None and Macvlan — The Specialty Modes</a></h2>
<p><strong>None mode</strong> — complete network isolation:</p>
<pre><code>docker run --network none myapp
</code></pre>
<p>No network interfaces except loopback. Use for containers that process data from volumes but should never make network connections. Security-sensitive batch jobs.</p>
<p><strong>Macvlan</strong> — containers appear as physical devices on your network:</p>
<pre><code>docker network create -d macvlan \
    --subnet=192.168.1.0/24 \
    --gateway=192.168.1.1 \
    -o parent=eth0 macnet
</code></pre>
<p>Containers get IPs from your physical network’s subnet. Other devices on the network see them as separate machines. Useful for legacy applications that expect to be on the physical network, or for containers that need to be directly accessible without port mapping.</p>
<h2 id="debugging-docker-networking"><a class="header-anchor" href="#debugging-docker-networking" target="_blank" rel="noopener noreferrer">Debugging Docker Networking</a></h2>
<p>When containers can’t communicate, work through this sequence:</p>
<p><strong>Check which network the containers are on:</strong></p>
<pre><code>docker inspect mycontainer | grep -A 20 Networks
</code></pre>
<p>If they’re on different networks, they can’t communicate directly.</p>
<p><strong>Check DNS resolution:</strong></p>
<pre><code>docker exec mycontainer nslookup other-container
</code></pre>
<p>If this fails, the containers are most likely on the default bridge or on different networks. Create a custom network and attach both containers to it.</p>
<p><strong>Check connectivity:</strong></p>
<pre><code>docker exec mycontainer ping other-container
docker exec mycontainer curl -s http://other-container:80/
</code></pre>
<p><strong>Use netshoot for advanced debugging:</strong></p>
<pre><code>docker run --rm -it --network mynet nicolaka/netshoot
</code></pre>
<p>This image has every network debugging tool: <code>curl</code>, <code>ping</code>, <code>dig</code>, <code>nslookup</code>, <code>tcpdump</code>, <code>netstat</code>, <code>ss</code>, <code>iperf</code>, and more.</p>
<p><strong>Check iptables rules Docker created:</strong></p>
<pre><code>sudo iptables -L DOCKER -n -v
</code></pre>
<p>Docker manages iptables rules for port mapping and inter-network isolation. Corrupted rules can cause mysterious connectivity issues. Restarting Docker often fixes them.</p>
<p><strong>Check port mapping:</strong></p>
<pre><code>docker port mycontainer
</code></pre>
<p>Shows which host ports map to which container ports. If empty and you expected a mapping, you forgot <code>-p</code> when starting the container.</p>
<h2 id="the-networking-decision-tree"><a class="header-anchor" href="#the-networking-decision-tree" target="_blank" rel="noopener noreferrer">The Networking Decision Tree</a></h2>
<ul>
<li><strong>Single container, no communication needed</strong> → default bridge is fine</li>
<li><strong>Multiple containers on one host</strong> → custom bridge network</li>
<li><strong>Need maximum network performance</strong> → host mode (Linux only)</li>
<li><strong>Containers across multiple hosts</strong> → overlay network</li>
<li><strong>Need containers on the physical network</strong> → macvlan</li>
<li><strong>Need complete isolation</strong> → none</li>
</ul>
<p>For 90% of deployments, a custom bridge network is the right answer. Docker Compose creates one automatically. Start there, and only reach for other modes when you have a specific reason.</p>
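<p>The decision tree can also be written down as a small function. This is a sketch under assumptions: the requirement flags are made-up names, and the order in which they are checked is one reasonable reading of the list above, not something Docker prescribes.</p>

```javascript
// Requirement flags are hypothetical names chosen for this sketch.
function chooseNetworkMode({
  isolated = false,
  physicalNetwork = false,
  multiHost = false,
  maxPerformance = false,
  multiContainer = false,
} = {}) {
  if (isolated) return "none";
  if (physicalNetwork) return "macvlan";
  if (multiHost) return "overlay";
  if (maxPerformance) return "host"; // Linux only
  if (multiContainer) return "custom bridge";
  return "default bridge";
}

console.log(chooseNetworkMode({ multiContainer: true })); // "custom bridge"
console.log(chooseNetworkMode({ multiHost: true }));      // "overlay"
console.log(chooseNetworkMode());                         // "default bridge"
```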
]]></content:encoded>
    </item>
    <item>
      <title><![CDATA[How to Implement API Rate Limiting — Token Bucket, Sliding Window, and the Edge Cases Nobody Warns You About]]></title>
      <link>https://dailydejavu.com/backend-engineering/implement-api-rate-limiting-token-bucket-sliding-window</link>
      <guid isPermaLink="true">https://dailydejavu.com/backend-engineering/implement-api-rate-limiting-token-bucket-sliding-window</guid>
      <pubDate>Fri, 17 Apr 2026 06:53:00 GMT</pubDate>
      <dc:creator><![CDATA[banditz]]></dc:creator>
      <description><![CDATA[Your API has no rate limiting and someone just hit it 50,000 times in a minute. Here's how to pick the right algorithm, implement it with Redis, and handle the edge cases that break naive implementations.]]></description>
      <content:encoded><![CDATA[<p>It happens to every API eventually. You launch without rate limiting because “we’ll add it later.” Three weeks later, someone discovers your API and hits it 50,000 times in a minute. Maybe it’s a bot scraping your data. Maybe it’s a customer’s broken retry loop. Maybe it’s a competitor stress-testing your infrastructure. The result is the same: your servers are drowning, your database is on fire, and legitimate users can’t get a response.</p>
<p>Rate limiting isn’t a nice-to-have. It’s the bouncer at the door that keeps your API from being abused into oblivion.</p>
<h2 id="the-algorithms-pick-the-right-one"><a class="header-anchor" href="#the-algorithms-pick-the-right-one" target="_blank" rel="noopener noreferrer">The Algorithms — Pick the Right One</a></h2>
<p>There are four common rate limiting algorithms. Each has tradeoffs.</p>
<p><strong>Fixed Window Counter</strong></p>
<p>The simplest approach. Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window. Reject when the count exceeds the limit.</p>
<pre><code>Window 12:00-12:01: 98/100 requests used
Window 12:01-12:02: 0/100 requests used
</code></pre>
<p>The problem: a burst at the window boundary. If a client sends 100 requests at 12:00:59 and another 100 at 12:01:01, they’ve sent 200 requests in 2 seconds while technically staying within the per-minute limit. This can overload your server even though the rate limit is “working.”</p>
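<p>The boundary burst is easy to reproduce with a minimal in-memory counter. The sketch below is for illustration only: <code>FixedWindowLimiter</code> is a hypothetical class, it runs in a single process, and it never cleans up old windows.</p>

```javascript
// Minimal in-memory fixed window counter (single process, illustration only).
class FixedWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.counts = new Map(); // "clientId:windowIndex" -> request count
  }

  allow(clientId, nowMs) {
    const windowIndex = Math.floor(nowMs / this.windowMs);
    const key = `${clientId}:${windowIndex}`;
    const count = this.counts.get(key) || 0;
    if (count >= this.limit) return false;
    this.counts.set(key, count + 1);
    return true;
  }
}

// 100 requests at 0:59 and 100 more at 1:01 fall into different windows.
const fixedLimiter = new FixedWindowLimiter(100, 60_000);
let accepted = 0;
for (let i = 0; i < 100; i++) if (fixedLimiter.allow("client-a", 59_000)) accepted++;
for (let i = 0; i < 100; i++) if (fixedLimiter.allow("client-a", 61_000)) accepted++;
console.log(accepted); // 200
```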
<p><strong>Sliding Window Log</strong></p>
<p>Stores the timestamp of every request. When a new request arrives, count timestamps within the last window duration and reject if over the limit. Most accurate, but stores every timestamp — memory grows linearly with traffic.</p>
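<p>A minimal in-memory sketch of the same idea follows. <code>SlidingWindowLog</code> is a hypothetical class for illustration; it keeps one array of timestamps per client, which is exactly the memory cost described above.</p>

```javascript
class SlidingWindowLog {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.log = new Map(); // clientId -> array of request timestamps (ms)
  }

  allow(clientId, nowMs) {
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that have fallen out of the window.
    const recent = (this.log.get(clientId) || []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs);
    this.log.set(clientId, recent);
    return allowed;
  }
}

// The same boundary burst that slips past a fixed window is caught here.
const logLimiter = new SlidingWindowLog(100, 60_000);
let passed = 0;
for (let i = 0; i < 100; i++) if (logLimiter.allow("client-a", 59_000)) passed++;
for (let i = 0; i < 100; i++) if (logLimiter.allow("client-a", 61_000)) passed++;
console.log(passed); // 100: the second burst is rejected
```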
<p><strong>Sliding Window Counter</strong></p>
<p>A hybrid. Uses fixed windows but weights the previous window’s count based on overlap. If 70% of the current window has elapsed, count 30% of the previous window’s requests plus 100% of the current window’s. Good accuracy without per-request storage.</p>
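<p>The weighting is simple arithmetic. The sketch below works through it with made-up numbers (80 requests in the previous window, 40 so far in the current one); <code>estimateCount</code> is a hypothetical helper name.</p>

```javascript
// Weighted estimate used by the sliding window counter.
// elapsedFraction is how much of the current window has passed (0 to 1).
function estimateCount(previousCount, currentCount, elapsedFraction) {
  return previousCount * (1 - elapsedFraction) + currentCount;
}

// 70% into the current window: count 30% of the previous window's 80
// requests plus all 40 requests of the current window.
const estimate = estimateCount(80, 40, 0.7);
console.log(estimate);       // about 64 (80 * 0.3 + 40)
console.log(estimate < 100); // true: under a limit of 100, so the request passes
```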
<p><strong>Token Bucket</strong></p>
<p>A bucket holds tokens, refilled at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing controlled bursts up to that capacity while enforcing the average rate over time.</p>
<p>For most APIs, <strong>token bucket</strong> or <strong>sliding window counter</strong> is the right choice. Token bucket if you want to allow controlled bursts (most APIs should). Sliding window counter if you want strict per-window enforcement.</p>
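<p>Before moving to Redis, it can help to see the token bucket in plain code. The class below is a single-process sketch with hypothetical names; the current time is passed in as an argument so the behaviour is easy to follow and test.</p>

```javascript
class TokenBucket {
  constructor(capacity, refillPerSecond) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;  // start with a full bucket
    this.lastRefill = null;  // seconds; set on the first call
  }

  tryConsume(nowSeconds) {
    if (this.lastRefill === null) this.lastRefill = nowSeconds;
    // Add the tokens earned since the last call, capped at capacity.
    const elapsed = Math.max(0, nowSeconds - this.lastRefill);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSecond);
    this.lastRefill = nowSeconds;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const bucket = new TokenBucket(5, 1); // burst of 5, refill 1 token per second
const burst = [];
for (let i = 0; i < 6; i++) burst.push(bucket.tryConsume(0));
console.log(burst);                // five true values, then false
console.log(bucket.tryConsume(2)); // true: 2 tokens refilled after 2 seconds
```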
<h2 id="implementing-token-bucket-with-redis"><a class="header-anchor" href="#implementing-token-bucket-with-redis" target="_blank" rel="noopener noreferrer">Implementing Token Bucket with Redis</a></h2>
<p>Redis is the standard backing store for rate limiting because it’s fast, atomic, and shared across application servers.</p>
<p>The token bucket needs two pieces of data per client: the number of remaining tokens and the last refill timestamp.</p>
<p>Here’s the logic in a Redis Lua script (atomic execution, no race conditions):</p>
<pre><code>-- Token bucket rate limiter (Lua script for Redis)
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])      -- bucket capacity
local refill_rate = tonumber(ARGV[2])      -- tokens per second
local now = tonumber(ARGV[3])              -- current timestamp (seconds)

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_tokens
local last_refill = tonumber(data[2]) or now

-- Calculate tokens to add since last refill (never negative on clock skew)
local elapsed = math.max(0, now - last_refill)
local new_tokens = elapsed * refill_rate
tokens = math.min(max_tokens, tokens + new_tokens)

local allowed = 0
if tokens &gt;= 1 then
    tokens = tokens - 1
    allowed = 1
end

-- HSET accepts multiple field/value pairs (Redis 4.0+); HMSET is deprecated
redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)

return { allowed, math.floor(tokens) }
</code></pre>
<p>This runs atomically in Redis. No race conditions even with 50 application servers hitting the same key simultaneously. The <code>EXPIRE</code> ensures cleanup of inactive clients.</p>
<p>Call it from your application:</p>
<pre><code>const result = await redis.eval(luaScript, 1,
    `ratelimit:${clientId}`,
    100,           // max 100 tokens (burst capacity)
    1.67,          // refill at 1.67/sec = 100/min
    Date.now() / 1000
);
const [allowed, remaining] = result;
</code></pre>
<h2 id="setting-rate-limit-headers"><a class="header-anchor" href="#setting-rate-limit-headers" target="_blank" rel="noopener noreferrer">Setting Rate Limit Headers</a></h2>
<p>Good APIs tell clients their rate limit status on every response. This lets well-behaved clients self-regulate instead of blindly hitting the limit.</p>
<p>Widely used headers (a de facto convention; the <a href="https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/" target="_blank" rel="noopener noreferrer">IETF draft</a> standardizes similar fields without the <code>X-</code> prefix):</p>
<pre><code>X-RateLimit-Limit: 100
X-RateLimit-Remaining: 67
X-RateLimit-Reset: 1712345678
</code></pre>
<p>When the limit is exceeded, return <strong>429 Too Many Requests</strong> with a <code>Retry-After</code> header:</p>
<pre><code>HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1712345678

{
    "error": "rate_limit_exceeded",
    "message": "Too many requests. Retry after 30 seconds.",
    "retry_after": 30
}
</code></pre>
<p>Include the <code>Retry-After</code> value in both the header and body. Some HTTP clients check headers, some parse the body.</p>
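<p>One way to produce these headers from a limiter result is sketched below. The shape of <code>result</code> (<code>allowed</code>, <code>limit</code>, <code>remaining</code>, <code>resetEpochSeconds</code>) is an assumption made for this example; the Lua script above returns only <code>allowed</code> and the remaining tokens, so the other fields would have to come from your own configuration.</p>

```javascript
// Builds response headers from a limiter result. The field names on
// `result` are assumptions for this sketch.
function rateLimitHeaders(result, nowEpochSeconds) {
  const headers = {
    "X-RateLimit-Limit": String(result.limit),
    "X-RateLimit-Remaining": String(Math.max(0, result.remaining)),
    "X-RateLimit-Reset": String(result.resetEpochSeconds),
  };
  if (!result.allowed) {
    // Retry-After is expressed in whole seconds from now, at least 1.
    const wait = Math.ceil(result.resetEpochSeconds - nowEpochSeconds);
    headers["Retry-After"] = String(Math.max(1, wait));
  }
  return headers;
}

const blocked = rateLimitHeaders(
  { allowed: false, limit: 100, remaining: 0, resetEpochSeconds: 1712345678 },
  1712345648
);
console.log(blocked["Retry-After"]); // "30"
```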
<h2 id="edge-cases-that-break-naive-implementations"><a class="header-anchor" href="#edge-cases-that-break-naive-implementations" target="_blank" rel="noopener noreferrer">Edge Cases That Break Naive Implementations</a></h2>
<p><strong>Client identification.</strong></p>
<p>Most tutorials rate limit by IP address. This breaks immediately in production because:</p>
<ul>
<li>Corporate NATs: thousands of users share one IP. Rate limiting by IP blocks an entire company.</li>
<li>VPNs and proxies: same problem.</li>
<li>Mobile carriers: CGNAT means millions of users share IP pools.</li>
</ul>
<p>Rate limit by <strong>API key</strong> for authenticated endpoints. For unauthenticated endpoints (login, registration), rate limit by a combination of IP + other signals (user agent, fingerprint).</p>
<p>For login endpoints specifically, rate limit by both IP and target username. The per-IP limit slows an attacker who tries common passwords against thousands of usernames from a single IP; the per-username limit slows a distributed attack that targets one account from many IPs.</p>
<p><strong>Per-endpoint limits.</strong></p>
<p>Not all endpoints are equal. Your <code>/search</code> endpoint might hit the database hard and need aggressive limiting. Your <code>/health</code> endpoint should probably never be limited. Your <code>/login</code> endpoint needs very tight limits to prevent brute force.</p>
<p>Apply different limits per endpoint or endpoint group:</p>
<pre><code>/api/search     → 30 requests/minute
/api/users      → 100 requests/minute
/api/login      → 5 requests/minute
/api/health     → no limit
</code></pre>
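<p>The table above could be encoded as a small lookup. This is only a sketch: the default of 100 for unlisted endpoints is an assumption, and <code>null</code> stands for "never limited".</p>

```javascript
// Limits mirror the table above; `null` means the endpoint is never limited.
const ENDPOINT_LIMITS = [
    { prefix: '/api/search', perMinute: 30 },
    { prefix: '/api/users', perMinute: 100 },
    { prefix: '/api/login', perMinute: 5 },
    { prefix: '/api/health', perMinute: null },
];
const DEFAULT_LIMIT = 100; // assumed fallback for unlisted endpoints

function limitFor(path) {
    // Match the endpoint itself or anything nested under it, but not lookalikes.
    const match = ENDPOINT_LIMITS.find(
        (entry) => path === entry.prefix || path.startsWith(entry.prefix + '/')
    );
    return match ? match.perMinute : DEFAULT_LIMIT;
}
```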
<p><strong>Distributed systems.</strong></p>
<p>If you have multiple application servers behind a load balancer, each server needs to check the same rate limit counter. This is why Redis (or any shared store) is necessary. In-memory rate limiting on each server means a client can send up to N × server_count requests per window, because each server counts only the requests it sees.</p>
<p><strong>Time synchronization.</strong></p>
<p>If your application servers have slightly different clocks, timestamp-based algorithms can behave inconsistently. Use the Redis server’s time (<code>redis.call('TIME')</code> inside a Lua script) instead of the application server’s clock.</p>
<h2 id="graceful-degradation"><a class="header-anchor" href="#graceful-degradation" target="_blank" rel="noopener noreferrer">Graceful Degradation</a></h2>
<p>If Redis goes down, what happens to your rate limiter?</p>
<p><strong>Bad answer:</strong> all requests are blocked because the rate limiter can’t check counts.</p>
<p><strong>Good answer:</strong> fall back to a local in-memory rate limiter with generous limits, or allow requests through with logging and alerting.</p>
<pre><code>async function checkRateLimit(clientId) {
    try {
        return await redisRateLimiter.check(clientId);
    } catch (error) {
        logger.warn('Redis rate limiter unavailable, falling back');
        return localRateLimiter.check(clientId);
    }
}
</code></pre>
<p>Rate limiting should protect your API, not become a single point of failure. Design it to fail open (allow requests) rather than fail closed (block everything), with monitoring to alert you when the fallback activates.</p>
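<p>The article does not show what <code>localRateLimiter</code> looks like. One possible shape is a fixed-window counter held in process memory. The class name, the limit of 300, and the injectable <code>now</code> parameter are all assumptions made for illustration:</p>

```javascript
// Fixed-window counter kept in process memory. Fallback only: each server counts alone.
class LocalRateLimiter {
    constructor(limit, windowMs) {
        this.limit = limit;
        this.windowMs = windowMs;
        this.windows = new Map(); // clientId -> { start, count }
    }

    // `now` is injectable so the behavior can be tested without waiting.
    check(clientId, now = Date.now()) {
        let current = this.windows.get(clientId);
        if (!current || now - current.start >= this.windowMs) {
            current = { start: now, count: 0 };
            this.windows.set(clientId, current);
        }
        current.count += 1;
        return {
            allowed: current.count <= this.limit,
            remaining: Math.max(0, this.limit - current.count),
        };
    }
}

// Generous limit, since the other servers are counting separately during the outage.
const localRateLimiter = new LocalRateLimiter(300, 60000);
```

<p>A production version would also evict idle entries so the map cannot grow without bound.</p>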
<h2 id="the-rate-limiting-checklist"><a class="header-anchor" href="#the-rate-limiting-checklist" target="_blank" rel="noopener noreferrer">The Rate Limiting Checklist</a></h2>
<ol>
<li><strong>Pick the algorithm</strong> — token bucket for most APIs</li>
<li><strong>Use Redis</strong> — shared state across servers, atomic operations</li>
<li><strong>Identify clients properly</strong> — API key, not just IP</li>
<li><strong>Set per-endpoint limits</strong> — tight for auth, generous for reads</li>
<li><strong>Return proper headers</strong> — X-RateLimit-Limit, Remaining, Reset</li>
<li><strong>Return 429 with Retry-After</strong> — don’t just drop connections</li>
<li><strong>Handle Redis failure</strong> — fall back, don’t fail closed</li>
<li><strong>Monitor</strong> — track 429 rates, identify abusive patterns</li>
<li><strong>Document your limits</strong> — developers need to know them upfront</li>
</ol>
<p>Rate limiting is one of those things that’s boring until you need it. And when you need it, you need it ten minutes ago. Build it before the bot finds your API. It’s a lot less stressful that way.</p>
<hr />
<p>If you found this guide helpful, check out our other resources:</p>
<ul>
<li><a href="/backend-engineering/implement-jwt-authentication-access-refresh-tokens" target="_blank" rel="noopener noreferrer">How to Implement JWT Authentication Properly</a></li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title><![CDATA[How to Implement JWT Authentication Properly — Access Tokens, Refresh Tokens, and Common Mistakes]]></title>
      <link>https://dailydejavu.com/backend-engineering/implement-jwt-authentication-access-refresh-tokens</link>
      <guid isPermaLink="true">https://dailydejavu.com/backend-engineering/implement-jwt-authentication-access-refresh-tokens</guid>
      <pubDate>Fri, 09 Jan 2026 06:03:00 GMT</pubDate>
      <dc:creator><![CDATA[banditz]]></dc:creator>
      <description><![CDATA[Most JWT tutorials store tokens in localStorage and skip refresh flow. Here's how to implement JWT auth that actually holds up in production — proper token pairs, secure storage, and revocation.
]]></description>
      <content:encoded><![CDATA[<p>Every JWT tutorial on the internet follows the same pattern. User sends credentials. Server signs a JWT. Client stores it in <code>localStorage</code>. Client sends it in the <code>Authorization</code> header. Tutorial ends.</p>
<p>That implementation has at least three serious problems.</p>
<p>First, <code>localStorage</code> is accessible to any JavaScript on the page. One XSS vulnerability — one unsanitized input, one compromised third-party script — and the attacker reads and exfiltrates the token.</p>
<p>Second, no refresh mechanism. The token either expires quickly (user logs in constantly) or slowly (stolen token has a long window).</p>
<p>Third, no revocation. If the user changes their password or you detect suspicious activity, you can’t invalidate the token. It’s valid until it expires.</p>
<h2 id="the-access-refresh-token-pattern"><a class="header-anchor" href="#the-access-refresh-token-pattern" target="_blank" rel="noopener noreferrer">The Access / Refresh Token Pattern</a></h2>
<p>OAuth2 standardized this approach with two tokens:</p>
<p><strong>Access Token</strong> — Short-lived (15 minutes). Sent with every API request. If stolen, 15-minute window.</p>
<p><strong>Refresh Token</strong> — Long-lived (7-30 days). Stored securely. Only sent to a refresh endpoint. Used to get new access tokens.</p>
<p>The flow:</p>
<ol>
<li>User logs in with credentials</li>
<li>Server validates, generates access token (15 min) and refresh token (7 days)</li>
<li>Client stores both securely</li>
<li>Client sends access token with API requests</li>
<li>When access token expires, client sends refresh token to refresh endpoint</li>
<li>Server validates refresh token, issues new access token (and rotates refresh token)</li>
<li>When refresh token expires, user must log in again</li>
</ol>
<h2 id="step-1-token-claims"><a class="header-anchor" href="#step-1-token-claims" target="_blank" rel="noopener noreferrer">Step 1: Token Claims</a></h2>
<pre><code>{
  "iss": "api.yourdomain.com",
  "sub": "user_12345",
  "aud": "yourdomain.com",
  "exp": 1712345678,
  "iat": 1712344778,
  "jti": "unique-id-abc123",
  "role": "admin"
}
</code></pre>
<ul>
<li><code>iss</code> — who created the token</li>
<li><code>sub</code> — user ID</li>
<li><code>aud</code> — intended audience</li>
<li><code>exp</code> — expiration (Unix timestamp)</li>
<li><code>iat</code> — issued at</li>
<li><code>jti</code> — unique token ID (needed for revocation)</li>
</ul>
<p><strong>Never put sensitive data in the payload.</strong> A signed JWT is base64url-encoded, not encrypted. Anyone who holds the token can decode the payload. No passwords, no SSNs, no credit card numbers.</p>
<p><strong>Keep it minimal.</strong> User ID and role. That’s it. Look up everything else server-side.</p>
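<p>To make both points concrete, here is a sketch that builds the minimal claim set and then decodes a payload without any key. The helper names are hypothetical, and the issuer and audience values are the placeholders from the example above:</p>

```javascript
const crypto = require('node:crypto');

// Build the minimal claim set. Times are Unix seconds.
function buildClaims(userId, role, ttlSeconds, nowSeconds) {
    return {
        iss: 'api.yourdomain.com',
        sub: userId,
        aud: 'yourdomain.com',
        exp: nowSeconds + ttlSeconds,
        iat: nowSeconds,
        jti: crypto.randomUUID(), // unique ID, used later for revocation
        role,
    };
}

// Decoding needs no key at all, which is why secrets must never go in the payload.
function decodePayload(token) {
    const payloadSegment = token.split('.')[1];
    return JSON.parse(Buffer.from(payloadSegment, 'base64url').toString('utf8'));
}
```

<p>Note that <code>decodePayload</code> does not verify anything. It only shows what any holder of the token can read.</p>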
<h2 id="step-2-signing-algorithm"><a class="header-anchor" href="#step-2-signing-algorithm" target="_blank" rel="noopener noreferrer">Step 2: Signing Algorithm</a></h2>
<p><strong>Use asymmetric signing for production.</strong></p>
<p>RS256 (RSA + SHA-256) or ES256 (ECDSA + SHA-256). The auth server holds the private key and signs tokens. API servers have only the public key and can verify tokens but cannot create them.</p>
<p>This matters in a microservices architecture. If you use HS256 (symmetric), every service that verifies tokens has the shared secret. Compromise one service, compromise the signing key, and the attacker can mint arbitrary tokens.</p>
<p>With RS256/ES256, compromising an API server doesn’t help — the attacker can verify tokens but can’t sign new ones. Only the auth server can do that.</p>
<p>Generate an RS256 key pair:</p>
<pre><code>openssl genrsa -out private.pem 2048
openssl rsa -in private.pem -pubout -out public.pem
</code></pre>
<p>The auth server uses <code>private.pem</code> to sign. API servers use <code>public.pem</code> to verify.</p>
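<p>The split between signing and verifying can be sketched with Node’s built-in <code>crypto</code> module. This is an illustration of the mechanics, not a replacement for a maintained JWT library: <code>signJwt</code> and <code>verifyJwt</code> are hypothetical names, claim checks such as <code>exp</code> are left out, and the key pair is generated in-process only to keep the example self-contained.</p>

```javascript
const crypto = require('node:crypto');

const b64url = (text) => Buffer.from(text, 'utf8').toString('base64url');
const fromB64url = (segment) => Buffer.from(segment, 'base64url').toString('utf8');

// Auth server: sign "header.payload" with the private key (RS256 = RSA PKCS#1 v1.5 + SHA-256).
function signJwt(claims, privateKey) {
    const header = b64url(JSON.stringify({ alg: 'RS256', typ: 'JWT' }));
    const payload = b64url(JSON.stringify(claims));
    const signature = crypto.sign('sha256', Buffer.from(`${header}.${payload}`), privateKey);
    return `${header}.${payload}.${signature.toString('base64url')}`;
}

// API server: only the public key is needed. Returns the claims, or null if invalid.
function verifyJwt(token, publicKey) {
    try {
        const parts = token.split('.');
        if (parts.length !== 3) return null;
        const [header, payload, signature] = parts;
        // Pin the algorithm; never let the token choose how it is verified.
        if (JSON.parse(fromB64url(header)).alg !== 'RS256') return null;
        const valid = crypto.verify(
            'sha256',
            Buffer.from(`${header}.${payload}`),
            publicKey,
            Buffer.from(signature, 'base64url')
        );
        return valid ? JSON.parse(fromB64url(payload)) : null;
    } catch {
        return null; // malformed token
    }
}

// Demo key pair; in production the keys would come from private.pem and public.pem.
const { privateKey, publicKey } = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 });
const token = signJwt({ sub: 'user_12345', role: 'admin' }, privateKey);
const verifiedClaims = verifyJwt(token, publicKey);
```

<p>A service holding only <code>publicKey</code> can run <code>verifyJwt</code> but has nothing it could pass to <code>signJwt</code>, which is the property the asymmetric setup is meant to provide.</p>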
<h2 id="step-3-secure-token-storage"><a class="header-anchor" href="#step-3-secure-token-storage" target="_blank" rel="noopener noreferrer">Step 3: Secure Token Storage</a></h2>
<p><strong>Where NOT to store tokens:</strong></p>
<ul>
<li><code>localStorage</code> — accessible to any JavaScript. XSS = game over.</li>
<li><code>sessionStorage</code> — same problem, just doesn’t persist across tabs.</li>
<li>Regular cookies without <code>httpOnly</code> — JavaScript can still read them.</li>
<li>URL parameters — visible in server logs, browser history, referrer headers.</li>
</ul>
<p><strong>Where to store them:</strong></p>
<p><strong>For web apps:</strong> httpOnly cookies with <code>Secure</code> and <code>SameSite</code> attributes.</p>
<pre><code>res.cookie('access_token', token, {
    httpOnly: true,
    secure: true,
    sameSite: 'strict',
    maxAge: 900000   // 15 minutes
});
</code></pre>
<p><code>httpOnly</code> means JavaScript cannot access the cookie, so XSS can’t read it. <code>Secure</code> means HTTPS only. <code>SameSite: strict</code> stops the browser from sending the cookie on cross-site requests, which mitigates CSRF.</p>
<p>The browser sends the cookie automatically with every request to your domain. Your API reads it from the cookie header instead of the Authorization header. No JavaScript involved in token handling at all.</p>
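<p>On the server side, reading the token then means parsing the <code>Cookie</code> header. In Express the <code>cookie-parser</code> middleware does this for you; the hand-rolled <code>readCookie</code> below is a hypothetical helper that only shows what happens underneath:</p>

```javascript
// Extract one cookie value from a raw Cookie header, e.g. "theme=dark; access_token=eyJ...".
function readCookie(cookieHeader, name) {
    if (!cookieHeader) return null;
    for (const pair of cookieHeader.split(';')) {
        const index = pair.indexOf('=');
        if (index === -1) continue;
        if (pair.slice(0, index).trim() === name) {
            return decodeURIComponent(pair.slice(index + 1).trim());
        }
    }
    return null;
}
```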
<p><strong>For mobile apps:</strong> Use platform secure storage: iOS Keychain or Android Keystore. Both are encrypted stores protected by the OS, and hardware-backed on most modern devices.</p>
<p><strong>For SPAs calling APIs on different domains:</strong> This is the hard case. <code>SameSite: strict</code> cookies are not sent on cross-site requests, so you need <code>SameSite=None; Secure</code> with proper CORS configuration, plus separate CSRF protection. Or use the Backend For Frontend (BFF) pattern where your SPA talks to its own backend, and that backend handles tokens and proxies API calls.</p>
<h2 id="step-4-the-refresh-flow"><a class="header-anchor" href="#step-4-the-refresh-flow" target="_blank" rel="noopener noreferrer">Step 4: The Refresh Flow</a></h2>
<p>When the access token expires, the client receives a 401 from the API. The client then sends the refresh token to a dedicated endpoint:</p>
<pre><code>POST /auth/refresh
Cookie: refresh_token=eyJ...
</code></pre>
<p>The server:</p>
<ol>
<li>Validates the refresh token (signature, expiration, not revoked)</li>
<li>Issues a new access token</li>
<li><strong>Rotates the refresh token</strong> — issues a new refresh token and invalidates the old one</li>
<li>Returns both to the client</li>
</ol>
<p><strong>Refresh token rotation</strong> is critical for detecting theft. Here’s why:</p>
<p>If an attacker steals the refresh token and uses it before the legitimate user, the attacker gets a new token pair and the old refresh token is invalidated. When the legitimate user tries to refresh with the old token, it fails — and you know the refresh token was compromised. At this point, invalidate all tokens for that user and force re-authentication.</p>
<p><strong>Store refresh tokens in a database</strong> (not just in-memory). Each refresh token maps to a user ID and a token family. When a refresh is requested:</p>
<ol>
<li>Look up the token in the database</li>
<li>If it’s valid: issue new pair, mark old token as used, store new token</li>
<li>If it’s already been used: someone is replaying a stolen token. Invalidate the entire family. Force the user to log in again.</li>
</ol>
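<p>The rotation and reuse-detection steps above can be sketched with an in-memory map standing in for the database table. Several things here are assumptions made for brevity: the class name, the use of opaque random strings as refresh tokens, and the omission of expiry handling.</p>

```javascript
const crypto = require('node:crypto');

// In-memory stand-in for the database table: token -> { userId, familyId, used }.
class RefreshTokenStore {
    constructor() {
        this.tokens = new Map();
    }

    // Called at login (no familyId) or during rotation (existing familyId).
    issue(userId, familyId = crypto.randomUUID()) {
        const token = crypto.randomBytes(32).toString('base64url');
        this.tokens.set(token, { userId, familyId, used: false });
        return token;
    }

    // Called by the refresh endpoint. Returns a new token, or null if the user must log in again.
    rotate(token) {
        const record = this.tokens.get(token);
        if (!record) return null; // unknown or already revoked
        if (record.used) {
            // Replay of a rotated token: assume theft and revoke the whole family.
            for (const [key, value] of this.tokens) {
                if (value.familyId === record.familyId) this.tokens.delete(key);
            }
            return null;
        }
        record.used = true;
        return this.issue(record.userId, record.familyId);
    }
}
```

<p>Keeping used tokens around (marked <code>used</code>) rather than deleting them is what makes the replay detectable.</p>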
<h2 id="step-5-token-revocation"><a class="header-anchor" href="#step-5-token-revocation" target="_blank" rel="noopener noreferrer">Step 5: Token Revocation</a></h2>
<p>JWTs are stateless, so on their own they can’t be invalidated server-side. The whole point is that the server doesn’t store session state. But sometimes you need to revoke a token immediately: logout, password change, compromised account.</p>
<p><strong>Option 1: Token blocklist.</strong></p>
<p>Store revoked token JTIs (unique IDs) in Redis with a TTL matching the token’s remaining lifetime. On every API request, check the blocklist:</p>
<pre><code>const isRevoked = await redis.get(`blocklist:${tokenJti}`);
if (isRevoked) return res.status(401).json({ error: 'Token revoked' });
</code></pre>
<p>This trades some statefulness for the ability to revoke. The blocklist is small (only active tokens that have been explicitly revoked) and the TTL ensures automatic cleanup.</p>
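<p>The write side of the blocklist comes down to one subtraction: the TTL is the token’s <code>exp</code> minus the current time. A sketch, with <code>blocklistEntry</code> as a hypothetical helper name:</p>

```javascript
// Returns the Redis key and TTL (seconds) for revoking a token, or null if it already expired.
function blocklistEntry(claims, nowSeconds) {
    const ttl = claims.exp - nowSeconds;
    if (ttl <= 0) return null; // expired tokens are rejected anyway, so no entry is needed
    return { key: `blocklist:${claims.jti}`, ttl };
}
```

<p>How the entry is stored depends on your Redis client. With ioredis, for example, it would be <code>redis.set(entry.key, '1', 'EX', entry.ttl)</code>.</p>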
<p><strong>Option 2: Short access tokens + refresh token revocation.</strong></p>
<p>Keep access tokens at 5 minutes. For logout, only revoke the refresh token (delete it from the database). The access token expires naturally within 5 minutes. This is simpler but has a 5-minute window where the old access token still works.</p>
<p><strong>Option 3: Token versioning.</strong></p>
<p>Store a <code>tokenVersion</code> counter on the user record. Include it in the JWT payload. On every request, compare the token’s version to the database. If the user’s version has been incremented (due to password change, forced logout, etc.), reject the token. This requires a database lookup per request, which somewhat defeats the stateless advantage — but it’s a pragmatic compromise.</p>
<h2 id="common-mistakes"><a class="header-anchor" href="#common-mistakes" target="_blank" rel="noopener noreferrer">Common Mistakes</a></h2>
<p><strong>Mistake 1: Storing the secret in code.</strong> Use environment variables or a secrets manager. Never commit signing keys to version control.</p>
<p><strong>Mistake 2: Using HS256 in a microservice architecture.</strong> Every service needs the shared secret. One compromised service = all services compromised. Use RS256/ES256.</p>
<p><strong>Mistake 3: Not validating <code>aud</code> and <code>iss</code> claims.</strong> A token signed by your auth server but intended for a different service should be rejected. Always validate audience and issuer.</p>
<p><strong>Mistake 4: Putting too much in the payload.</strong> Every byte is sent with every request. JWTs over 1KB are a sign you’re using them wrong.</p>
<p><strong>Mistake 5: Never rotating refresh tokens.</strong> If a refresh token is valid for 30 days and gets stolen on day 1, the attacker has 29 days of access. Rotation limits this to a single use.</p>
<p><strong>Mistake 6: No refresh token at all.</strong> A long-lived access token is the worst of both worlds — can’t be revoked and gives a long attack window. Use the two-token pattern.</p>
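<p>The claim checks from Mistake 3 are short enough to sketch in full. This assumes the signature has already been verified; <code>claimError</code> and the expected values are illustrative, using the placeholder issuer and audience from the claims example:</p>

```javascript
// Expected values are placeholders matching the claim example earlier in the article.
const EXPECTED = { iss: 'api.yourdomain.com', aud: 'yourdomain.com' };

// Returns null when the claims are acceptable, otherwise a short reason string.
function claimError(claims, nowSeconds, expected = EXPECTED) {
    if (claims.iss !== expected.iss) return 'wrong issuer';
    // `aud` may be a single string or an array of strings.
    const audiences = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
    if (!audiences.includes(expected.aud)) return 'wrong audience';
    if (typeof claims.exp !== 'number' || nowSeconds >= claims.exp) return 'expired';
    return null;
}
```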
<p>The two-token pattern with httpOnly cookies, asymmetric signing, and refresh token rotation is the production standard. It’s not the simplest implementation, but it’s the one that doesn’t get you into the security news.</p>
]]></content:encoded>
    </item>
    <item>
      <title><![CDATA[How to Diagnose and Fix MySQL Replication Lag]]></title>
      <link>https://dailydejavu.com/database/diagnose-fix-mysql-replication-lag</link>
      <guid isPermaLink="true">https://dailydejavu.com/database/diagnose-fix-mysql-replication-lag</guid>
      <pubDate>Wed, 07 Jan 2026 05:26:00 GMT</pubDate>
      <dc:creator><![CDATA[banditz]]></dc:creator>
      <description><![CDATA[Your replica is 300 seconds behind. Reads return stale data and users see ghost records. Here's how to find the cause and fix it without breaking replication.
]]></description>
      <content:encoded><![CDATA[<p>It starts with a support ticket. A user updated their profile but the change isn’t showing. They refresh — old data is back. Your application reads from a MySQL replica that’s lagging behind the primary, returning stale data.</p>
<p>You SSH into the replica:</p>
<pre><code>SHOW REPLICA STATUS\G
</code></pre>
<p>You find the line that matters:</p>
<pre><code>Seconds_Behind_Source: 347
</code></pre>
<p>Nearly 6 minutes behind. Every read from this server returns data that’s 6 minutes old. For a profile update that’s annoying. For e-commerce inventory, a customer is buying a product that sold out 5 minutes ago.</p>
<h2 id="how-replication-works"><a class="header-anchor" href="#how-replication-works" target="_blank" rel="noopener noreferrer">How Replication Works</a></h2>
<p>Three components:</p>
<p><strong>Binlog on the primary.</strong> Every write (INSERT, UPDATE, DELETE, DDL) is recorded in the binary log.</p>
<p><strong>IO thread on the replica.</strong> Connects to the primary, reads binlog events, writes them to the local <strong>relay log</strong>.</p>
<p><strong>SQL thread on the replica.</strong> Reads relay log events and executes them. This is where data changes on the replica.</p>
<p>Lag happens at two points: IO thread can’t fetch fast enough (network), or SQL thread can’t replay fast enough (most common).</p>
<h2 id="step-1-read-show-replica-status"><a class="header-anchor" href="#step-1-read-show-replica-status" target="_blank" rel="noopener noreferrer">Step 1: Read SHOW REPLICA STATUS</a></h2>
<pre><code>SHOW REPLICA STATUS\G
</code></pre>
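<p>On MySQL before 8.0.22, use <code>SHOW SLAVE STATUS\G</code>, which reports the same fields under their older names (<code>Slave_IO_Running</code>, <code>Seconds_Behind_Master</code>, <code>Read_Master_Log_Pos</code>, and so on).</p>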
<p>Critical fields:</p>
<p><strong>Replica_IO_Running</strong>: must be <code>Yes</code>. If <code>No</code>, check network, credentials, and whether the primary’s binlog was purged.</p>
<p><strong>Replica_SQL_Running</strong>: must be <code>Yes</code>. If <code>No</code>, check <code>Last_SQL_Error</code>.</p>
<p><strong>Seconds_Behind_Source</strong>: the lag in seconds. Approximate, and <code>NULL</code> when the SQL thread isn’t running.</p>
<p><strong>Read_Source_Log_Pos</strong>: how far the IO thread has read in the primary’s binlog.</p>
<p><strong>Exec_Source_Log_Pos</strong>: how far the SQL thread has executed. The gap between the two is queued work.</p>
<p><strong>Relay_Log_Space</strong>: total size of the relay logs on disk. Large and growing means the SQL thread can’t keep up.</p>
<h2 id="step-2-io-thread-or-sql-thread"><a class="header-anchor" href="#step-2-io-thread-or-sql-thread" target="_blank" rel="noopener noreferrer">Step 2: IO Thread or SQL Thread?</a></h2>
<p>Check the primary’s position:</p>
<pre><code>-- On primary:
SHOW MASTER STATUS\G
-- MySQL 8.2 and later: SHOW BINARY LOG STATUS\G
</code></pre>
<p>If the replica’s <code>Source_Log_File</code> and <code>Read_Source_Log_Pos</code> match or closely trail the primary’s <code>File</code> and <code>Position</code>, the IO thread is fine. <strong>SQL thread is the bottleneck.</strong></p>
<p>If <code>Read_Source_Log_Pos</code> is far behind: network bandwidth, slow primary storage, or SSL overhead on the IO thread.</p>
<p>In practice, most replication lag comes from the SQL thread. Let’s fix it.</p>
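<p>The comparison can be written down as a small decision function. This is a sketch under assumptions: the input objects use the field names printed by <code>SHOW REPLICA STATUS</code> on MySQL 8.0.22+ (for example <code>Read_Source_Log_Pos</code>, formerly <code>Read_Master_Log_Pos</code>), and the 1 MiB slack is an arbitrary illustrative threshold.</p>

```javascript
// `replica` holds fields from SHOW REPLICA STATUS; `primary` holds File and Position
// from the primary. `slackBytes` is how far the IO thread may trail before it counts.
function findBottleneck(replica, primary, slackBytes = 1024 * 1024) {
    if (replica.Replica_IO_Running !== 'Yes') return 'io-thread-stopped';
    if (replica.Replica_SQL_Running !== 'Yes') return 'sql-thread-stopped';

    // IO thread: compare what has been fetched with what the primary has written.
    const ioBehind = replica.Source_Log_File !== primary.File
        || primary.Position - replica.Read_Source_Log_Pos > slackBytes;
    if (ioBehind) return 'io-thread';

    // SQL thread: compare what has been executed with what has been fetched.
    const sqlBehind = replica.Relay_Source_Log_File !== replica.Source_Log_File
        || replica.Read_Source_Log_Pos - replica.Exec_Source_Log_Pos > 0;
    return sqlBehind ? 'sql-thread' : 'caught-up';
}
```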
<h2 id="step-3-why-the-sql-thread-is-slow"><a class="header-anchor" href="#step-3-why-the-sql-thread-is-slow" target="_blank" rel="noopener noreferrer">Step 3: Why the SQL Thread Is Slow</a></h2>
<p><strong>Cause 1: Single-threaded replay.</strong></p>
<p>Before MySQL 8.0.27, replicas replay events on one thread by default. The primary processes thousands of concurrent transactions across 32 cores, but the replica replays them one at a time.</p>
<pre><code>SHOW VARIABLES LIKE 'replica_parallel_workers';
</code></pre>
<p>If <code>0</code> or <code>1</code>, you’re single-threaded.</p>
<p><strong>Fix — enable parallel replication:</strong></p>
<pre><code>STOP REPLICA;
SET GLOBAL replica_parallel_workers = 4;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_preserve_commit_order = ON;
START REPLICA;
</code></pre>
<p>In <code>my.cnf</code>:</p>
<pre><code>[mysqld]
replica_parallel_workers = 4
replica_parallel_type = LOGICAL_CLOCK
replica_preserve_commit_order = ON
</code></pre>
<p>Start with 4 workers. Going above 8-16 rarely helps.</p>
<p><strong>Cause 2: Missing indexes on the replica.</strong></p>
<p>With row-based replication, each row change is found by primary key on the replica. No primary key = full table scan per change. Multiply by thousands of changes per second and the SQL thread grinds.</p>
<p>Check what the SQL thread is doing:</p>
<pre><code>SHOW PROCESSLIST;
</code></pre>
<p>If you see slow UPDATE or DELETE from the system user, check that table’s indexes:</p>
<pre><code>SHOW CREATE TABLE the_table;
</code></pre>
<p>Add missing primary keys.</p>
<p><strong>Cause 3: Large transactions.</strong></p>
<p>A single transaction updating 500,000 rows is written to the binlog as one unit when it commits. The replica replays it as one operation, blocking everything behind it.</p>
<p>Fix on the application side — batch operations:</p>
<pre><code>-- Instead of:
UPDATE orders SET status = 'archived' WHERE created_at &lt; '2024-01-01';

-- Do:
UPDATE orders SET status = 'archived'
WHERE created_at &lt; '2024-01-01' AND status != 'archived'
LIMIT 1000;
</code></pre>
<p>Loop until no rows affected. Each batch commits separately.</p>
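<p>On the application side the loop could look like the sketch below. The <code>execute</code> parameter is a hypothetical function that runs one statement in autocommit mode and resolves to an object with an <code>affectedRows</code> count; wire it to whatever MySQL client you use. The pause between batches is also an assumption, added to give the replica time to drain.</p>

```javascript
// `execute(sql)` is a hypothetical adapter that resolves to { affectedRows }.
async function archiveInBatches(execute, batchSize = 1000, pauseMs = 100) {
    if (!Number.isInteger(batchSize) || batchSize <= 0) {
        throw new Error('batchSize must be a positive integer');
    }
    const sql = "UPDATE orders SET status = 'archived' "
        + "WHERE created_at < '2024-01-01' AND status != 'archived' "
        + `LIMIT ${batchSize}`;
    let total = 0;
    for (;;) {
        const { affectedRows } = await execute(sql); // each call commits on its own
        total += affectedRows;
        if (affectedRows === 0) return total;
        // Brief pause so the replica can catch up between batches.
        await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
}
```

<p>Because each batch is its own small transaction, the replica can interleave other work between batches instead of stalling behind one huge one.</p>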
<p><strong>Cause 4: DDL operations.</strong></p>
<p><code>ALTER TABLE</code> on a large table blocks the SQL thread for its entire duration. Monitor lag during DDL — it’s expected and temporary.</p>
<h2 id="step-4-monitoring-that-works"><a class="header-anchor" href="#step-4-monitoring-that-works" target="_blank" rel="noopener noreferrer">Step 4: Monitoring That Works</a></h2>
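<p><code>Seconds_Behind_Source</code> (<code>Seconds_Behind_Master</code> on older versions) lies. It is derived from the timestamp of the event the SQL thread is currently applying, so it can show 0 whenever the SQL thread has drained the relay log, even if the IO thread is still far behind the primary.</p>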
<p>Use <strong>pt-heartbeat</strong> from <a href="https://docs.percona.com/percona-toolkit/" target="_blank" rel="noopener noreferrer">Percona Toolkit</a>:</p>
<p>On primary:</p>
<pre><code>pt-heartbeat --update --database heartbeat --create-table
</code></pre>
<p>On replica:</p>
<pre><code>pt-heartbeat --monitor --database heartbeat
</code></pre>
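<p>The <code>--update</code> process writes a timestamp on the primary every second, and <code>--monitor</code> compares the replicated timestamp with the replica’s clock. That is true end-to-end lag with no estimation, provided both clocks are synchronized.</p>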
<p><strong>Alert thresholds:</strong></p>
<ul>
<li><strong>&lt; 1s</strong> — acceptable</li>
<li><strong>1-10s</strong> — watch</li>
<li><strong>10-60s</strong> — investigate</li>
<li><strong>&gt; 60s</strong> — critical</li>
</ul>
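<p>The thresholds above translate directly into a severity function for an alerting script. Two choices here are assumptions: the overlapping boundaries (10 and 60 seconds) are assigned to the "investigate" band, and a missing lag value is treated as critical because it usually means replication has stopped.</p>

```javascript
// Maps measured lag in seconds to the alert levels listed above.
function lagSeverity(lagSeconds) {
    // null or NaN: no measurement available, treated as critical (assumed policy).
    if (lagSeconds === null || Number.isNaN(lagSeconds)) return 'critical';
    if (lagSeconds > 60) return 'critical';
    if (lagSeconds >= 10) return 'investigate';
    if (lagSeconds >= 1) return 'watch';
    return 'acceptable';
}
```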
<h2 id="the-diagnostic-sequence"><a class="header-anchor" href="#the-diagnostic-sequence" target="_blank" rel="noopener noreferrer">The Diagnostic Sequence</a></h2>
<ol>
<li><strong>SHOW REPLICA STATUS</strong> — threads running? What’s the error?</li>
<li><strong>Compare positions</strong> — IO thread keeping up?</li>
<li><strong>SHOW PROCESSLIST</strong> — what’s the SQL thread doing?</li>
<li><strong>Check parallel replication</strong> — workers &gt; 1?</li>
<li><strong>Check table indexes</strong> — primary keys present?</li>
<li><strong>Check for DDL</strong> — ALTER TABLE in progress?</li>
</ol>
<p>Most lag resolves with parallel replication and proper indexing. The single-threaded SQL thread is by far the most common cause, and it’s the easiest fix: one config change, which only became the default in MySQL 8.0.27.</p>
]]></content:encoded>
    </item>
    <item>
      <title><![CDATA[How to Audit Linux File Permissions and Find Security Holes]]></title>
      <link>https://dailydejavu.com/cyber-security/audit-linux-file-permissions-find-security-holes</link>
      <guid isPermaLink="true">https://dailydejavu.com/cyber-security/audit-linux-file-permissions-find-security-holes</guid>
      <pubDate>Mon, 05 Jan 2026 05:16:00 GMT</pubDate>
      <dc:creator><![CDATA[banditz]]></dc:creator>
      <description><![CDATA[Someone ran chmod 777 to fix a permissions error. Now your server has a gaping hole. Here's how to audit file permissions and find dangerous misconfigurations before attackers do.]]></description>
      <content:encoded><![CDATA[<p>There’s a move every sysadmin has seen at least once. The application throws a permission denied error. Somebody runs <code>chmod 777</code> on the directory. The error goes away. Everyone moves on.</p>
<p>Except now that directory is world-writable. Every user, every service, every process — they can all read, modify, and execute everything inside it. If that directory has config files, any compromised service can rewrite them. If it has scripts, anyone can inject code.</p>
<p>File permissions are boring. Nobody gets excited about <code>rwxr-xr--</code>. But misconfigured permissions are consistently in the top vectors for Linux privilege escalation. Not because the attacks are sophisticated — because the mistakes are simple.</p>
<h2 id="the-permission-model-in-60-seconds"><a class="header-anchor" href="#the-permission-model-in-60-seconds" target="_blank" rel="noopener noreferrer">The Permission Model in 60 Seconds</a></h2>
<p>Run <code>ls -la</code> and you see:</p>
<pre><code>-rw-r--r--  1 root root  1542 Mar 15 09:20 /etc/nginx/nginx.conf
drwx------  2 deploy deploy  4096 Mar 15 09:22 /home/deploy/.ssh
</code></pre>
<p>Breaking down <code>-rw-r--r--</code>:</p>
<p>First character = file type (<code>-</code> file, <code>d</code> directory, <code>l</code> symlink). Then three groups: <strong>owner</strong> (<code>rw-</code>), <strong>group</strong> (<code>r--</code>), <strong>others</strong> (<code>r--</code>).</p>
<ul>
<li><code>r</code> = read (4), <code>w</code> = write (2), <code>x</code> = execute (1), <code>-</code> = none (0)</li>
</ul>
<p>So <code>-rw-r--r--</code> in numeric = <strong>644</strong>: owner reads/writes, everyone else reads.</p>
<p>For directories, <code>x</code> means you can enter it and access files. A directory with <code>r--</code> lets you list filenames but not read files. With <code>--x</code> you can access files if you know their names but not list them.</p>
<p>Sane defaults: <strong>644</strong> for files, <strong>755</strong> for directories.</p>
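<p>The symbolic-to-numeric conversion is mechanical enough to sketch in a few lines. <code>symbolicToOctal</code> is a hypothetical helper that takes the nine permission characters (without the leading file-type character) and also understands the special-bit letters <code>s</code>, <code>S</code>, <code>t</code>, and <code>T</code>:</p>

```javascript
// Converts the nine permission characters from `ls -l` (e.g. "rw-r--r--") to octal.
function symbolicToOctal(perms) {
    if (!/^([r-][w-][xsS-]){2}[r-][w-][xtT-]$/.test(perms)) {
        throw new Error(`not a permission string: ${perms}`);
    }
    let mode = 0;
    for (let i = 0; i < 9; i++) {
        const ch = perms[i];
        // r, w, x set the regular bit; lowercase s and t mean x is set as well.
        if ('rwxst'.includes(ch)) mode |= 1 << (8 - i);
        // s or S in the owner slot is SUID, in the group slot SGID.
        if ('sS'.includes(ch)) mode |= i === 2 ? 0o4000 : 0o2000;
        // t or T in the others slot is the sticky bit.
        if ('tT'.includes(ch)) mode |= 0o1000;
    }
    return mode.toString(8).padStart(3, '0');
}
```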
<h2 id="step-1-find-world-writable-files"><a class="header-anchor" href="#step-1-find-world-writable-files" target="_blank" rel="noopener noreferrer">Step 1: Find World-Writable Files</a></h2>
<p>World-writable = anyone on the system can modify the file. Lowest-hanging fruit for attackers.</p>
<p>Find every world-writable file:</p>
<pre><code>sudo find / -type f -perm -o+w \
    -not -path "/proc/*" \
    -not -path "/sys/*" \
    -not -path "/dev/*" \
    2&gt;/dev/null
</code></pre>
<p>For directories:</p>
<pre><code>sudo find / -type d -perm -o+w \
    -not -path "/proc/*" \
    -not -path "/sys/*" \
    -not -path "/tmp" \
    -not -path "/var/tmp" \
    2&gt;/dev/null
</code></pre>
<p><code>/tmp</code> and <code>/var/tmp</code> are supposed to be world-writable (mode 1777, where the sticky bit stops users from deleting each other’s files). Everything else needs investigation.</p>
<p><strong>Red flags:</strong></p>
<ul>
<li>Config files (<code>.conf</code>, <code>.ini</code>, <code>.env</code>): anyone can change app behavior</li>
<li>Scripts (<code>.sh</code>, <code>.py</code>, <code>.php</code>): code injection vector</li>
<li>Cron files: a writable cron script runs as the job’s owner, often root</li>
<li>Web app files: writable PHP = direct code execution</li>
</ul>
<p><strong>Fix:</strong></p>
<pre><code>sudo chmod o-w /path/to/file

# Or set correct permissions
sudo chmod 644 /path/to/file
</code></pre>
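<p>If you want the same check inside an application or deploy script, the world-writable test is a single bit mask on the file mode. A sketch using Node’s <code>fs</code> module; the two function names are hypothetical:</p>

```javascript
const fs = require('node:fs');

// True when the "others" write bit (octal 002) is set on the path.
function isWorldWritable(path) {
    return (fs.statSync(path).mode & 0o002) !== 0;
}

// Equivalent of `chmod o-w`: clears only the others-write bit and keeps everything else.
function removeWorldWrite(path) {
    const mode = fs.statSync(path).mode & 0o7777;
    fs.chmodSync(path, mode & ~0o002);
}
```

<p>Clearing one bit is safer than blindly setting 644, because it preserves execute bits on scripts and binaries.</p>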
<h2 id="step-2-hunt-for-suid-and-sgid-binaries"><a class="header-anchor" href="#step-2-hunt-for-suid-and-sgid-binaries" target="_blank" rel="noopener noreferrer">Step 2: Hunt for SUID and SGID Binaries</a></h2>
<p>SUID means when you run this binary, it executes with the <strong>file owner’s</strong> permissions — not yours.</p>
<p>If root owns a SUID binary, anyone who runs it gets root-level execution. That’s by design for <code>passwd</code> (needs root for <code>/etc/shadow</code>) and <code>sudo</code>. The problem is <strong>unexpected</strong> SUID binaries.</p>
<p>Find all SUID:</p>
<pre><code>sudo find / -type f -perm -4000 2&gt;/dev/null
</code></pre>
<p>Find all SGID:</p>
<pre><code>sudo find / -type f -perm -2000 2&gt;/dev/null
</code></pre>
<p>A minimal Linux server typically has a short list like this (the exact set varies by distribution and installed packages):</p>
<pre><code>/usr/bin/passwd
/usr/bin/sudo
/usr/bin/su
/usr/bin/mount
/usr/bin/umount
/usr/bin/chfn
/usr/bin/chsh
/usr/bin/newgrp
/usr/bin/gpasswd
/usr/bin/pkexec
</code></pre>
<p><strong>Treat everything else as suspicious until you’ve verified it.</strong> <a href="https://gtfobins.github.io/" target="_blank" rel="noopener noreferrer">GTFOBins</a> lists Linux binaries that are exploitable when SUID is set, including <code>find</code>, <code>vim</code>, <code>python</code>, <code>bash</code>, <code>nmap</code>, and many more.</p>
<p>If <code>python3</code> had SUID set (it never should):</p>
<pre><code>python3 -c 'import os; os.setuid(0); os.system("/bin/bash")'
</code></pre>
<p>Instant root shell.</p>
<p><strong>Remove unnecessary SUID:</strong></p>
<pre><code>sudo chmod u-s /path/to/binary
</code></pre>
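<p>Comparing the <code>find</code> output against a baseline is easy to automate. The sketch below assumes the baseline is the list shown earlier in this step; in practice you would extend it with whatever your distribution legitimately ships. The function name is hypothetical.</p>

```javascript
// Baseline taken from the list above; extend it with what your distribution ships.
const EXPECTED_SUID = new Set([
    '/usr/bin/passwd', '/usr/bin/sudo', '/usr/bin/su', '/usr/bin/mount', '/usr/bin/umount',
    '/usr/bin/chfn', '/usr/bin/chsh', '/usr/bin/newgrp', '/usr/bin/gpasswd', '/usr/bin/pkexec',
]);

// `findOutput` is the raw text printed by the find command, one path per line.
function unexpectedSuid(findOutput, expected = EXPECTED_SUID) {
    return findOutput
        .split('\n')
        .map((line) => line.trim())
        .filter((line) => line !== '' && !expected.has(line));
}
```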
<h2 id="step-3-check-sensitive-files"><a class="header-anchor" href="#step-3-check-sensitive-files" target="_blank" rel="noopener noreferrer">Step 3: Check Sensitive Files</a></h2>
<p><strong>SSH files:</strong></p>
<pre><code>sudo stat -c '%a %U %G %n' /home/*/.ssh /home/*/.ssh/* 2&gt;/dev/null
</code></pre>
<p>Required:</p>
<ul>
<li><code>~/.ssh/</code> — <strong>700</strong></li>
<li><code>~/.ssh/authorized_keys</code> — <strong>644</strong> or <strong>600</strong></li>
<li><code>~/.ssh/id_ed25519</code> (private key) — <strong>600</strong> (SSH refuses looser permissions)</li>
<li><code>~/.ssh/config</code> — <strong>600</strong></li>
</ul>
<p><strong>System credentials:</strong></p>
<pre><code>sudo stat -c '%a %U %G %n' /etc/passwd /etc/shadow /etc/group /etc/gshadow
</code></pre>
<ul>
<li><code>/etc/passwd</code> — <strong>644</strong> root:root</li>
<li><code>/etc/shadow</code> — <strong>640</strong> root:shadow (contains password hashes)</li>
<li><code>/etc/group</code> — <strong>644</strong> root:root</li>
<li><code>/etc/gshadow</code> — <strong>640</strong> root:shadow</li>
</ul>
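<p>If <code>/etc/shadow</code> is world-readable, any user can read hashes and attempt offline cracking. The modes above are the Debian and Ubuntu defaults; RHEL-family systems ship <code>/etc/shadow</code> as <code>000</code> owned by root:root, which is stricter and equally fine.</p>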
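<p>The expected modes can be checked mechanically by treating each listed value as a maximum: a file passes if it grants no bit beyond it. The sketch parses the output of the <code>stat -c '%a %U %G %n'</code> command shown above; the function name and the result shape are assumptions.</p>

```javascript
// Maximum allowed mode per file, taken from the list above. Stricter modes pass.
const MAX_MODES = {
    '/etc/passwd': 0o644,
    '/etc/shadow': 0o640,
    '/etc/group': 0o644,
    '/etc/gshadow': 0o640,
};

// `statOutput` is the output of `stat -c '%a %U %G %n' ...`, one file per line.
function findTooPermissive(statOutput, maxModes = MAX_MODES) {
    const problems = [];
    for (const line of statOutput.trim().split('\n')) {
        const [octal, , , ...nameParts] = line.trim().split(/\s+/);
        const name = nameParts.join(' ');
        const max = maxModes[name];
        if (max === undefined) continue; // not a file we have a rule for
        const actual = parseInt(octal, 8);
        // Any bit set in `actual` but not in `max` is an extra permission.
        if ((actual & ~max) !== 0) {
            problems.push({ name, actual: octal, allowed: max.toString(8) });
        }
    }
    return problems;
}
```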
<p><strong>Web server:</strong></p>
<p>Web directories should be <strong>755</strong>, files <strong>644</strong>. Config files owned by root at <strong>640</strong>. Never make web files writable by the web server user unless absolutely necessary (uploads directory only).</p>
<h2 id="step-4-find-orphaned-files"><a class="header-anchor" href="#step-4-find-orphaned-files" target="_blank" rel="noopener noreferrer">Step 4: Find Orphaned Files</a></h2>
<p>Files without valid owners = deleted user accounts or improper management:</p>
<pre><code>sudo find / \( -path /proc -o -path /sys \) -prune -o \( -nouser -o -nogroup \) -print 2&gt;/dev/null
</code></pre>
<p>Risk: if a new user gets the recycled UID, they inherit all orphaned files.</p>
<p><strong>Fix:</strong></p>
<pre><code>sudo chown root:root /path/to/orphaned-file
</code></pre>
<h2 id="step-5-set-proper-defaults-with-umask"><a class="header-anchor" href="#step-5-set-proper-defaults-with-umask" target="_blank" rel="noopener noreferrer">Step 5: Set Proper Defaults with umask</a></h2>
<p>Check current:</p>
<pre><code>umask
</code></pre>
<p>Common values:</p>
<ul>
<li><strong>022</strong> — new files 644, directories 755 (standard)</li>
<li><strong>027</strong> — new files 640, directories 750 (others can’t read)</li>
<li><strong>077</strong> — new files 600, directories 700 (only owner)</li>
</ul>
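<p>These values follow from one rule: new files start from 666 and new directories from 777, and the umask clears bits from both. A short sketch of the arithmetic, with <code>modesForUmask</code> as a hypothetical helper:</p>

```javascript
// New files start from 666 and new directories from 777; the umask clears bits from both.
function modesForUmask(umask) {
    const mask = parseInt(umask, 8);
    const toOctal = (mode) => mode.toString(8).padStart(3, '0');
    return {
        file: toOctal(0o666 & ~mask),
        directory: toOctal(0o777 & ~mask),
    };
}
```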
<p>For servers, <strong>027</strong> is ideal. Set in <code>/etc/login.defs</code>:</p>
<pre><code>UMASK 027
</code></pre>
<p>For systemd services:</p>
<pre><code>[Service]
UMask=0027
</code></pre>
<h2 id="the-permission-audit-checklist"><a class="header-anchor" href="#the-permission-audit-checklist" target="_blank" rel="noopener noreferrer">The Permission Audit Checklist</a></h2>
<p>Run quarterly on every production server:</p>
<ol>
<li><strong>World-writable files</strong>: <code>find / -type f -perm -o+w</code>, then fix everything outside /tmp</li>
<li><strong>SUID/SGID binaries</strong>: <code>find / -type f -perm /6000</code>, then compare against a known-good list</li>
<li><strong>SSH permissions</strong>: directories 700, private keys 600</li>
<li><strong>/etc/shadow</strong>: 640 or stricter, never world-readable</li>
<li><strong>Web files</strong>: directories 755, files 644, config owned by root</li>
<li><strong>Orphaned files</strong>: <code>find / -nouser -o -nogroup</code>, then reassign</li>
<li><strong>Verify umask</strong>: should be 022 or 027</li>
</ol>
<p>For automated auditing, <a href="https://cisofy.com/lynis/" target="_blank" rel="noopener noreferrer">Lynis</a> does comprehensive permission checks. Install with <code>sudo apt install lynis</code> and run <code>sudo lynis audit system</code>.</p>
<p>Permissions aren’t glamorous. Nobody puts “I audited file permissions” on conference slides. But I’ve seen production databases exposed because <code>/etc/shadow</code> was 644. I’ve seen web servers owned because deploy scripts left everything world-writable. I’ve seen root shells from SUID on <code>python3</code>.</p>
<p>The boring stuff prevents the exciting breaches.</p>
]]></content:encoded>
    </item>
    <item>
      <title><![CDATA[How to Harden SSH and Stop Brute Force Attacks on Linux Servers — Read the Logs Before You Lock the Door]]></title>
      <link>https://dailydejavu.com/cyber-security/harden-ssh-stop-brute-force-attacks-linux</link>
      <guid isPermaLink="true">https://dailydejavu.com/cyber-security/harden-ssh-stop-brute-force-attacks-linux</guid>
      <pubDate>Sat, 03 Jan 2026 04:14:00 GMT</pubDate>
      <dc:creator><![CDATA[banditz]]></dc:creator>
      <description><![CDATA[Every Linux server with SSH exposed to the internet gets hammered by brute force bots within hours of going live. Most admins just change the port and call it a day. But if you actually read /var/log/auth.log, you'd see that the bots adapt — they rotate usernames.]]></description>
      <content:encoded><![CDATA[<p>I’m going to tell you something that’ll either scare you or make you shrug depending on how long you’ve been managing Linux servers: the VPS you spun up 20 minutes ago is already being attacked.</p>
<p>Not by a person sitting in a dark room with a hoodie — by bots. Thousands of them. Automated scripts running on compromised machines across the globe, systematically scanning every single IP address on the internet, looking for SSH servers running on port 22 with password authentication enabled. They try <code>root/admin</code>, <code>root/password</code>, <code>admin/123456</code>, <code>deploy/deploy</code>, <code>ubuntu/ubuntu</code>, and about ten thousand other common credential combos.</p>
<p>This isn’t hypothetical. Go look at your auth log right now:</p>
<pre><code>sudo tail -200 /var/log/auth.log
</code></pre>
<p>If your server has been online for more than an hour with SSH on port 22, you’ll see a wall of failed login attempts from IP addresses you’ve never seen before. That’s the background radiation of the internet. It never stops. And if your only defense is a password — even a decent one — you’re playing a statistical game that you’ll eventually lose.</p>
<h2 id="what-the-bots-are-actually-doing"><a class="header-anchor" href="#what-the-bots-are-actually-doing" target="_blank" rel="noopener noreferrer">What the Bots Are Actually Doing</a></h2>
<p>Before you start hardening things, it helps to understand the attack pattern. Most people picture brute force as one bot trying <code>password1</code>, <code>password2</code>, <code>password3</code> at lightning speed against a single account. That’s the version from 2005.</p>
<p>Modern SSH brute force is smarter. Here’s what actually shows up in your logs:</p>
<pre><code>Mar 28 04:17:32 web01 sshd[12847]: Invalid user admin from 185.224.128.47 port 42816

Mar 28 04:17:34 web01 sshd[12849]: Invalid user test from 185.224.128.47 port 42920

Mar 28 04:17:36 web01 sshd[12851]: Invalid user oracle from 185.224.128.47 port 43018

Mar 28 04:17:38 web01 sshd[12853]: Invalid user postgres from 185.224.128.47 port 43112

Mar 28 04:17:40 web01 sshd[12855]: Failed password for root from 185.224.128.47 port 43200 ssh2
</code></pre>
<p>Notice what’s happening. The bot isn’t just hammering <code>root</code>. It’s cycling through usernames — <code>admin</code>, <code>test</code>, <code>oracle</code>, <code>postgres</code>, <code>deploy</code>, <code>git</code>, <code>jenkins</code>, <code>ubuntu</code> — because these are default accounts that exist on millions of servers, and people frequently leave them with weak or default passwords.</p>
<p>The smarter bots also throttle their attempts. Instead of 100 attempts per second (which is trivially detectable), they’ll do 3 attempts, wait 60 seconds, try 3 more. Some distribute the attack across multiple IPs from the same botnet, so no single IP triggers rate limiting.</p>
<p>On RHEL/CentOS systems, the same information lives in <code>/var/log/secure</code> instead of <code>/var/log/auth.log</code>. The format is identical.</p>
<p>To see a summary of who’s been trying to get in:</p>
<pre><code>sudo grep "Failed password" /var/log/auth.log | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head -20
</code></pre>
<p>This gives you a ranked list of the most persistent attacking IPs. On a server that’s been online for a week, don’t be surprised if the top IP has thousands of attempts.</p>
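<p>If you also want to see which usernames each IP rotated through, the same log lines can be parsed with a short script. This is a sketch under assumptions: the regular expressions only cover the two line shapes shown above, the sample lines are the ones from this article, and <code>summarize</code> is a made-up helper name.</p>

```python
import re
from collections import Counter, defaultdict

# Only the two sshd message shapes shown in this article are handled.
FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+")
INVALID = re.compile(r"Invalid user (\S+) from (\S+) port \d+")


def summarize(lines):
    """Return (failed-password lines per IP, distinct usernames per IP)."""
    failures = Counter()
    usernames = defaultdict(set)
    for line in lines:
        failed = FAILED.search(line)
        match = failed or INVALID.search(line)
        if not match:
            continue
        user, ip = match.groups()
        usernames[ip].add(user)      # a set, so repeated log lines do not double count
        if failed:
            failures[ip] += 1        # mirrors the grep "Failed password" pipeline
    return failures, usernames


SAMPLE = [
    "Mar 28 04:17:32 web01 sshd[12847]: Invalid user admin from 185.224.128.47 port 42816",
    "Mar 28 04:17:34 web01 sshd[12849]: Invalid user test from 185.224.128.47 port 42920",
    "Mar 28 04:17:36 web01 sshd[12851]: Invalid user oracle from 185.224.128.47 port 43018",
    "Mar 28 04:17:38 web01 sshd[12853]: Invalid user postgres from 185.224.128.47 port 43112",
    "Mar 28 04:17:40 web01 sshd[12855]: Failed password for root from 185.224.128.47 port 43200 ssh2",
]

failures, usernames = summarize(SAMPLE)
for ip, count in failures.most_common(20):
    print(ip, count, sorted(usernames[ip]))
```

<p>To run it against a real log, pass an open file object instead of <code>SAMPLE</code>. A long username list from a single IP is the rotation pattern described above.</p>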
<p>Now let’s shut them down properly.</p>
<h2 id="step-1-switch-to-ssh-key-only-authentication"><a class="header-anchor" href="#step-1-switch-to-ssh-key-only-authentication" target="_blank" rel="noopener noreferrer">Step 1: Switch to SSH Key-Only Authentication</a></h2>
<p>This is the nuclear option against brute force, and it should be the first thing you do on any server. Not the last. Not “eventually.” First.</p>
<p>SSH key authentication replaces passwords with a cryptographic key pair. Your private key stays on your local machine, your public key goes on the server. When you connect, the server challenges your client to prove it has the private key without ever transmitting it. No password is sent. No password can be guessed. Brute force doesn’t work because there’s nothing to brute force.</p>
<p><strong>Generate a key pair on your local machine</strong> (if you don’t already have one):</p>
<pre><code>ssh-keygen -t ed25519 -C "you@yourdomain.com"
</code></pre>
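<p>Ed25519 is the modern choice: its keys and signatures are much shorter than RSA’s, signing is faster, and a 256-bit Ed25519 key offers security comparable to an RSA key of roughly 3072 bits. If you’re dealing with legacy systems that don’t support ed25519, fall back to RSA with a 4096-bit key:</p>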
<pre><code>ssh-keygen -t rsa -b 4096
</code></pre>
<p><strong>Copy the public key to your server:</strong></p>
<pre><code>ssh-copy-id -i ~/.ssh/id_ed25519.pub user@your-server-ip
</code></pre>
<p><strong>Test key login before disabling passwords.</strong> Open a new terminal and connect:</p>
<pre><code>ssh user@your-server-ip
</code></pre>
<p>If you get in without being asked for a password, key auth is working. Keep this session open as a safety net.</p>
<p><strong>Now disable password authentication.</strong> Edit the SSH daemon config:</p>
<pre><code>sudo nano /etc/ssh/sshd_config
</code></pre>
<p>Find and set these three directives:</p>
<pre><code>PasswordAuthentication no

PubkeyAuthentication yes

ChallengeResponseAuthentication no
</code></pre>
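<p>That third one is important. <code>ChallengeResponseAuthentication</code> can bypass <code>PasswordAuthentication</code> on some PAM configurations, effectively keeping password login alive even when you think you’ve disabled it. Set it to <code>no</code>. On OpenSSH 8.7 and later the directive is called <code>KbdInteractiveAuthentication</code>; the old name is still accepted as a deprecated alias.</p>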
<p>Also disable empty passwords while you’re in there:</p>
<pre><code>PermitEmptyPasswords no
</code></pre>
<p>Test the config before applying:</p>
<pre><code>sudo sshd -t
</code></pre>
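<p>If it returns silently with no errors, restart the daemon (on Debian and Ubuntu the unit is usually named <code>ssh</code> rather than <code>sshd</code>):</p>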
<pre><code>sudo systemctl restart sshd
</code></pre>
<p>From this point on, every brute force attempt in the world is wasted effort against your server. The bots will keep trying — they don’t know you’ve disabled passwords — but every single attempt fails instantly. Check your auth log after a few minutes:</p>
<pre><code>sudo tail -50 /var/log/auth.log
</code></pre>
<p>You’ll see the attempts now ending with <code>Connection closed by authenticating user</code> or <code>Disconnected from authenticating user</code> instead of <code>Failed password</code>. They can’t even get to the password prompt.</p>
<h2 id="step-2-lock-down-sshd_config"><a class="header-anchor" href="#step-2-lock-down-sshd_config" target="_blank" rel="noopener noreferrer">Step 2: Lock Down sshd_config</a></h2>
<p>Key-only auth handles brute force, but there’s more to SSH security than just authentication. Open <code>/etc/ssh/sshd_config</code> and layer these additional restrictions:</p>
<p><strong>Disable root login:</strong></p>
<pre><code>PermitRootLogin no
</code></pre>
<p>Even with key-only auth, allowing direct root login is bad practice. If an attacker somehow gets your private key (stolen laptop, compromised backup), they’d have immediate root access. Force the use of a regular user account that escalates with <code>sudo</code>.</p>
<p>If you absolutely must allow root key-based login (some automation tools require it), use:</p>
<pre><code>PermitRootLogin prohibit-password
</code></pre>
<p>This allows root login with SSH keys but not passwords. It’s a compromise, not ideal, but better than <code>yes</code>.</p>
<p><strong>Restrict which users can log in:</strong></p>
<pre><code>AllowUsers deployer admin
</code></pre>
<p>This is a whitelist. Only the users listed here can SSH in. Everyone else — even users with valid system accounts and SSH keys — gets rejected. This is powerful because it means a compromised application user (like <code>www-data</code> or <code>postgres</code>) can’t be leveraged for SSH access even if an attacker manages to plant a key in their <code>authorized_keys</code> file.</p>
<p><strong>Limit authentication attempts and timing:</strong></p>
<pre><code>MaxAuthTries 3

LoginGraceTime 20
</code></pre>
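<p><code>MaxAuthTries 3</code> means after 3 failed authentication attempts within a single connection, the server disconnects. Every key your client offers counts as an attempt, so if your agent holds more than three keys, point the client at the right one with <code>IdentityFile</code> and <code>IdentitiesOnly yes</code> in <code>~/.ssh/config</code>. <code>LoginGraceTime 20</code> gives only 20 seconds to complete authentication; if you haven’t authenticated by then, you’re disconnected. With a key loaded in an agent, legitimate logins complete in a second or two, so the limit mostly affects bots and stalled connections.</p>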
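<p><strong>Disconnect unresponsive sessions:</strong></p>
<pre><code>ClientAliveInterval 300

ClientAliveCountMax 2
</code></pre>
<p>The server sends a keepalive probe through the encrypted channel every 300 seconds (5 minutes). If the client fails to answer 2 consecutive probes, the connection is terminated, so a session whose client has vanished (closed laptop, dropped network) is cleaned up after about 10 minutes instead of sitting open indefinitely. A client that is still connected answers the probes automatically, so these settings do not log out a user who is merely idle at the prompt.</p>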
<p><strong>Disable unnecessary features:</strong></p>
<pre><code>X11Forwarding no

AllowTcpForwarding no

AllowAgentForwarding no
</code></pre>
<p>Unless you specifically need X11 forwarding (running graphical apps over SSH), TCP forwarding (tunneling), or agent forwarding (chaining SSH connections), disable them. Each enabled feature is a potential attack vector. Turn them off by default and enable them selectively when needed.</p>
<p>After making all changes, test and restart:</p>
<pre><code>sudo sshd -t

sudo systemctl restart sshd
</code></pre>
<p>Always — and I mean always — keep an existing SSH session open while you restart sshd. If you made a typo that prevents new connections, your existing session stays alive and lets you fix it. If you close your only session before testing, and the new config has an error, you’re locked out and praying your VPS provider has a console rescue option.</p>
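<p>Once the daemon is back up, it helps to confirm that the settings you meant to apply are the ones actually in effect. <code>sshd -T</code> prints the effective configuration, and the sketch below compares that text against the values from Steps 1 and 2. It assumes the output is one lowercase keyword followed by its value per line, which is worth verifying on your OpenSSH version; <code>EXPECTED</code> and <code>find_mismatches</code> are names invented for this example.</p>

```python
# Values taken from Steps 1 and 2 of this article; adjust to your own policy.
EXPECTED = {
    "passwordauthentication": "no",
    "pubkeyauthentication": "yes",
    "permitemptypasswords": "no",
    "permitrootlogin": "no",
    "maxauthtries": "3",
    "x11forwarding": "no",
    "allowtcpforwarding": "no",
    "allowagentforwarding": "no",
}


def find_mismatches(effective_config, expected=EXPECTED):
    """Return {keyword: (wanted, actual)} for every setting that differs.

    `effective_config` is the text printed by `sshd -T` (assumed format:
    "keyword value" per line). A keyword that is absent shows up as None.
    """
    actual = {}
    for line in effective_config.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            actual.setdefault(key.lower(), value.strip().lower())
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }
```

<p>Capture the text with <code>sudo sshd -T</code> and pass it in as a string. An empty dictionary means every checked directive matches; anything else lists the wanted value next to what the daemon is really using.</p>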
<h2 id="step-3-configure-fail2ban"><a class="header-anchor" href="#step-3-configure-fail2ban" target="_blank" rel="noopener noreferrer">Step 3: Configure Fail2ban</a></h2>
<p>Key-only auth makes brute force ineffective, but the bots don’t know that. They’ll keep hammering your server, consuming bandwidth and filling your logs. Fail2ban fixes this by watching your auth log and automatically firewall-blocking IPs that fail too many times.</p>
<p><strong>Install fail2ban:</strong></p>
<pre><code>sudo apt install fail2ban      # Debian/Ubuntu

sudo dnf install fail2ban      # RHEL/CentOS/Fedora
</code></pre>
<p><strong>Create a local config</strong> (never edit <code>jail.conf</code> directly — it gets overwritten on updates):</p>
<pre><code>sudo cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local
</code></pre>
<p><strong>Edit the local config:</strong></p>
<pre><code>sudo nano /etc/fail2ban/jail.local
</code></pre>
<p>Find the <code>[sshd]</code> section and configure it:</p>
<pre><code>[sshd]

enabled = true

port = ssh

filter = sshd

logpath = /var/log/auth.log

maxretry = 3

findtime = 600

bantime = 3600
</code></pre>
<p>This means: if an IP fails 3 times within 10 minutes (600 seconds), ban it for 1 hour (3600 seconds). For most servers, this is a reasonable starting point.</p>
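<p>How the three numbers interact is easier to see in code. The sketch below is a simplified model of the rule just described, not fail2ban’s actual implementation: it takes a time-ordered list of failures and reports when each IP would be banned. The function name <code>find_bans</code> and the event format are assumptions made for this example.</p>

```python
from collections import defaultdict, deque


def find_bans(failures, maxretry=3, findtime=600, bantime=3600):
    """Simplified model of maxretry/findtime/bantime.

    `failures` is an iterable of (timestamp_seconds, ip) sorted by time.
    Returns a list of (ip, ban_start, ban_end) tuples.
    """
    recent = defaultdict(deque)   # ip -> failure timestamps inside the window
    banned_until = {}
    bans = []
    for ts, ip in failures:
        if banned_until.get(ip, 0) > ts:
            continue                          # already banned, traffic is dropped
        window = recent[ip]
        window.append(ts)
        while ts - window[0] > findtime:      # forget failures older than findtime
            window.popleft()
        if len(window) >= maxretry:
            banned_until[ip] = ts + bantime
            bans.append((ip, ts, ts + bantime))
            window.clear()
    return bans


events = [(0, "A"), (0, "B"), (100, "A"), (200, "A"), (400, "B"), (800, "B")]
print(find_bans(events))
```

<p>Host A fails three times inside ten minutes and is banned for an hour. Host B also fails three times, but its first failure has aged out of the window by the time the third arrives, which is exactly the throttling trick the smarter bots rely on and the reason a longer <code>findtime</code> can be worth it.</p>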
<p>For servers that take a real beating, you can get more aggressive:</p>
<pre><code># 24-hour ban

bantime = 86400

# 3 failures within an hour

findtime = 3600

maxretry = 3
</code></pre>
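<p><strong>Enable the recidive jail</strong> for repeat offenders. This catches IPs that get banned, wait out the ban, and come back. Because <code>jail.local</code> started as a copy of <code>jail.conf</code>, it should already contain a <code>[recidive]</code> section; edit that one rather than adding a duplicate (add it at the bottom only if it is missing), so that it reads:</p>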
<pre><code>[recidive]

enabled = true

logpath = /var/log/fail2ban.log

banaction = %(banaction_allports)s

bantime = 604800

findtime = 86400

maxretry = 3
</code></pre>
<p>This watches fail2ban’s own log. If an IP gets banned 3 times within 24 hours, the recidive jail bans them for a full week across all ports. Persistent bots learn the hard way.</p>
<p><strong>Start and enable fail2ban:</strong></p>
<pre><code>sudo systemctl enable fail2ban

sudo systemctl start fail2ban
</code></pre>
<p><strong>Check the status:</strong></p>
<pre><code>sudo fail2ban-client status sshd
</code></pre>
<p>You’ll see the number of currently banned IPs and the total number of bans since fail2ban started. On a fresh server with SSH on port 22, expect this number to climb quickly.</p>
<p><strong>Unban an IP</strong> (if you accidentally lock yourself out):</p>
<pre><code>sudo fail2ban-client set sshd unbanip 203.0.113.50
</code></pre>
<p>Pro tip: whitelist your own IP so fail2ban never bans you, even if a misconfigured client or script racks up several failed logins in a row:</p>
<pre><code># In jail.local, under [DEFAULT]

ignoreip = 127.0.0.1/8 ::1 203.0.113.50
</code></pre>
<p>Replace <code>203.0.113.50</code> with your actual IP or IP range.</p>
<h2 id="step-4-reduce-the-noise-change-the-port"><a class="header-anchor" href="#step-4-reduce-the-noise-change-the-port" target="_blank" rel="noopener noreferrer">Step 4: Reduce the Noise — Change the Port</a></h2>
<p>I’ll be upfront about this: changing the SSH port is not a security measure. It’s a noise reduction measure. Any attacker who specifically targets your server will find your SSH port within seconds using a port scan. Changing it from 22 to 2222 or 4822 or 39122 doesn’t add meaningful security against a determined attacker.</p>
<p>What it does do is eliminate 99% of the automated bot traffic that exclusively targets port 22. After changing the port, your auth.log goes from thousands of daily entries to nearly zero. This makes it much easier to spot real threats among the remaining entries.</p>
<p><strong>Change the port in sshd_config:</strong></p>
<pre><code>sudo nano /etc/ssh/sshd_config
</code></pre>
<p>Change:</p>
<pre><code>Port 22
</code></pre>
<p>To:</p>
<pre><code>Port 4822
</code></pre>
<p><strong>Update the firewall BEFORE restarting SSH</strong> (otherwise you lock yourself out):</p>
<p>For UFW:</p>
<pre><code>sudo ufw allow 4822/tcp

sudo ufw delete allow 22/tcp
</code></pre>
<p>For iptables:</p>
<pre><code>sudo iptables -A INPUT -p tcp --dport 4822 -j ACCEPT

sudo iptables -D INPUT -p tcp --dport 22 -j ACCEPT
</code></pre>
<p><strong>Update fail2ban</strong> to watch the new port. In <code>jail.local</code>, change the sshd section:</p>
<pre><code>[sshd]

port = 4822
</code></pre>
<p><strong>Restart everything:</strong></p>
<pre><code>sudo sshd -t

sudo systemctl restart sshd

sudo systemctl restart fail2ban
</code></pre>
<p>Connect from a new terminal using the new port:</p>
<pre><code>ssh -p 4822 user@your-server-ip
</code></pre>
<p>Keep your old session alive until you confirm the new connection works.</p>
<h2 id="step-5-ip-whitelisting-and-port-knocking-for-the-paranoid"><a class="header-anchor" href="#step-5-ip-whitelisting-and-port-knocking-for-the-paranoid" target="_blank" rel="noopener noreferrer">Step 5: IP Whitelisting and Port Knocking (For the Paranoid)</a></h2>
<p>If you always connect from the same IP or IP range — say, your home IP or your office VPN — you can lock SSH down to only accept connections from those addresses.</p>
<p><strong>With UFW:</strong></p>
<pre><code>sudo ufw allow from 203.0.113.0/24 to any port 4822
</code></pre>
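<p>This means only IPs in the 203.0.113.0/24 range can even reach the SSH port. With UFW’s default deny policy, everyone else gets a timeout, and the bots don’t even see that SSH exists on your server. For the restriction to take effect, remove the blanket rule added in Step 4 with <code>sudo ufw delete allow 4822/tcp</code>; otherwise the port stays open to everyone.</p>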
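<p>If CIDR notation is unfamiliar, Python’s standard <code>ipaddress</code> module is a quick way to check which addresses a range covers before you commit it to a firewall rule. The network below is the documentation range used in this article, and <code>is_allowed</code> is a hypothetical helper.</p>

```python
import ipaddress

# The example range from the UFW rule above.
ALLOWED = ipaddress.ip_network("203.0.113.0/24")


def is_allowed(address, network=ALLOWED):
    """Return True if `address` falls inside the allowed network."""
    return ipaddress.ip_address(address) in network


print(ALLOWED.num_addresses)          # how many addresses the rule admits
print(is_allowed("203.0.113.50"))     # inside the range
print(is_allowed("203.0.114.1"))      # outside the range
```

<p>Running the check against the address you are currently connecting from is a cheap way to avoid locking yourself out with a mistyped prefix.</p>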
<p>The problem with IP whitelisting is that your IP might change (dynamic ISP, travel, different networks). If your IP changes and you haven’t updated the whitelist, you’re locked out.</p>
<p><strong>Port knocking</strong> solves this problem. It keeps the SSH port completely closed — invisible to port scans — until a client sends a specific sequence of connection attempts to other ports in the correct order.</p>
<p>Install <code>knockd</code>:</p>
<pre><code>sudo apt install knockd
</code></pre>
<p>Configure it by editing <code>/etc/knockd.conf</code> (for example with <code>sudo nano /etc/knockd.conf</code>) so that it contains:</p>
<pre><code>[options]

    UseSyslog

[openSSH]

    sequence = 7000,8000,9000

    seq_timeout = 15

    command = /usr/sbin/ufw allow from %IP% to any port 4822

    tcpflags = syn

[closeSSH]

    sequence = 9000,8000,7000

    seq_timeout = 15

    command = /usr/sbin/ufw delete allow from %IP% to any port 4822

    tcpflags = syn
</code></pre>
<p>This configuration works like a secret handshake. To open SSH, you knock on ports 7000, 8000, 9000 in that exact order within 15 seconds. To close it afterward, knock in reverse: 9000, 8000, 7000.</p>
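<p>The matching logic can be modelled in a few lines. This is a toy illustration of the idea, not how <code>knockd</code> is implemented: it tracks, per source IP, how far through the sequence a client has got and starts over on a wrong port or when the timeout (measured here from the first knock) expires. <code>KnockTracker</code> is a name invented for this sketch.</p>

```python
class KnockTracker:
    """Toy model of sequence matching with a timeout, keyed by source IP."""

    def __init__(self, sequence, seq_timeout):
        self.sequence = list(sequence)
        self.seq_timeout = seq_timeout
        self.progress = {}   # ip -> (index of next expected port, time of first knock)

    def knock(self, ip, port, now):
        """Record one knock; return True when `ip` completes the sequence."""
        index, started = self.progress.get(ip, (0, now))
        if index == 0 or now - started > self.seq_timeout:
            index, started = 0, now            # fresh attempt, or too slow: start over
        if port == self.sequence[index]:
            index += 1
        else:
            # A wrong port resets progress, unless it is itself a valid first knock.
            index = 1 if port == self.sequence[0] else 0
            started = now
        if index == len(self.sequence):
            self.progress.pop(ip, None)
            return True                        # this is where the firewall rule would be added
        self.progress[ip] = (index, started)
        return False


tracker = KnockTracker([7000, 8000, 9000], seq_timeout=15)
print([tracker.knock("198.51.100.7", port, t) for t, port in enumerate([7000, 8000, 9000])])
```

<p>Only the final knock of a correct, timely sequence returns <code>True</code>. A port scanner that sweeps ports in numeric order could in principle stumble over an ascending sequence like 7000, 8000, 9000, which is one reason to pick a less regular sequence in production.</p>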
<p>From your local machine, knock with:</p>
<pre><code>knock your-server-ip 7000 8000 9000

ssh -p 4822 user@your-server-ip

# When done:

knock your-server-ip 9000 8000 7000
</code></pre>
<p>Or if you don’t have the <code>knock</code> client, you can use <code>nmap</code>:</p>
<pre><code>nmap -Pn --host-timeout 201 --max-retries 0 -p 7000 your-server-ip

nmap -Pn --host-timeout 201 --max-retries 0 -p 8000 your-server-ip

nmap -Pn --host-timeout 201 --max-retries 0 -p 9000 your-server-ip
</code></pre>
<p>Port knocking is overkill for most setups. But if you’re managing a server that handles sensitive data, or if you just want the satisfaction of knowing that your SSH port is completely invisible to the entire internet, it’s a satisfying layer to add.</p>
<h2 id="the-ssh-hardening-checklist"><a class="header-anchor" href="#the-ssh-hardening-checklist" target="_blank" rel="noopener noreferrer">The SSH Hardening Checklist</a></h2>
<p>Here’s the complete order of operations for any new Linux server:</p>
<ol>
<li><strong>Generate SSH keys</strong> on your local machine (if you haven’t already)</li>
<li><strong>Copy the public key</strong> to the server with <code>ssh-copy-id</code></li>
<li><strong>Test key login</strong> before touching any config</li>
<li><strong>Disable password authentication</strong> — <code>PasswordAuthentication no</code></li>
<li><strong>Disable root login</strong> — <code>PermitRootLogin no</code></li>
<li><strong>Restrict users</strong> — <code>AllowUsers</code> with only the accounts that need SSH</li>
<li><strong>Limit attempts and timeouts</strong> — <code>MaxAuthTries 3</code>, <code>LoginGraceTime 20</code></li>
<li><strong>Install fail2ban</strong> — configure with reasonable ban times</li>
<li><strong>Change the SSH port</strong> — reduces log noise from automated scanners</li>
<li><strong>Whitelist IPs or set up port knocking</strong> — if you connect from known locations</li>
</ol>
<p>Do them in this order. Each layer addresses a different threat, and together they make SSH brute force a non-issue. The bots will keep scanning, the botnets will keep running, but your server is no longer a target that can yield results.</p>
<p>I’ve been watching these logs for decades. The attacks never stop, they only evolve. But a properly hardened SSH setup hasn’t changed much in that time either — because the fundamentals work. Keys beat passwords. Automatic bans beat manual blocking. And a healthy dose of paranoia beats blind trust every single time.</p>
]]></content:encoded>
    </item>
    <item>
      <title><![CDATA[How to Find and Fix Slow Queries in PostgreSQL — Read EXPLAIN ANALYZE Before You Add Random Indexes]]></title>
      <link>https://dailydejavu.com/database/find-fix-slow-queries-postgresql-explain-analyze</link>
      <guid isPermaLink="true">https://dailydejavu.com/database/find-fix-slow-queries-postgresql-explain-analyze</guid>
      <pubDate>Thu, 01 Jan 2026 04:41:00 GMT</pubDate>
      <dc:creator><![CDATA[banditz]]></dc:creator>
      <description><![CDATA[Your API endpoint takes 4 seconds to respond. Your dashboard loads like it's 2003. Someone suggests "just add an index" without even looking at the query plan. Stop. The problem might not be a missing index at all — it could be stale statistics, a bad join order, an ORM generating garbage SQL.]]></description>
      <content:encoded><![CDATA[<p>There’s a ritual that happens in every engineering team eventually. Someone notices an API endpoint is slow. Someone else looks at the database and says “we need an index.” They add an index on what seems like the right column, deploy it, and… the query is still slow. Or it’s faster for that one query but now three other queries have mysteriously gotten worse.</p>
<p>This happens because adding an index without reading the query plan is like prescribing medicine without diagnosing the patient. Sometimes the problem is a missing index. But just as often, the problem is stale statistics, a badly written query, a join that the planner is executing in the wrong order, or an ORM that’s generating SQL you’d be embarrassed to write by hand.</p>
<p>PostgreSQL gives you the exact diagnostic tool to figure this out. It’s called <code>EXPLAIN ANALYZE</code>, and if you’re not using it every time you investigate a slow query, you’re guessing. Let’s stop guessing.</p>
<h2 id="step-1-find-the-queries-that-actually-matter"><a class="header-anchor" href="#step-1-find-the-queries-that-actually-matter" target="_blank" rel="noopener noreferrer">Step 1: Find the Queries That Actually Matter</a></h2>
<p>Before you optimize anything, you need to know what to optimize. The slowest query isn’t necessarily the one that matters most. A query that takes 2 seconds but runs once a day isn’t as urgent as a query that takes 50ms but runs 100,000 times a day. The second one consumes way more server time overall.</p>
<p><strong>Enable pg_stat_statements</strong> — this extension tracks execution statistics for every query that runs on your server.</p>
<p>Add it to <code>postgresql.conf</code>:</p>
<pre><code>shared_preload_libraries = 'pg_stat_statements'
</code></pre>
<p>Restart PostgreSQL:</p>
<pre><code>sudo systemctl restart postgresql
</code></pre>
<p>Create the extension:</p>
<pre><code>CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
</code></pre>
<p>Now find the queries that consume the most total time:</p>
<pre><code>SELECT

    substring(query, 1, 120) AS query_preview,

    calls,

    round(total_exec_time::numeric, 2) AS total_ms,

    round(mean_exec_time::numeric, 2) AS avg_ms,

    rows

FROM pg_stat_statements

ORDER BY total_exec_time DESC

LIMIT 10;
</code></pre>
<p>This shows your top 10 resource consumers. The <code>total_exec_time</code> column is what matters: it’s the cumulative time spent on that query across all calls, in milliseconds. (Before PostgreSQL 13 the columns were named <code>total_time</code> and <code>mean_time</code>.) A query with 5ms average execution time and 2 million calls has consumed 10,000 seconds of total server time. That matters a lot more than the 800ms query that runs 50 times a day.</p>
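<p>The comparison is plain multiplication, but it is worth seeing side by side. The snippet below uses the two queries from the paragraph above; the helper name and the labels are made up for illustration.</p>

```python
def total_seconds(avg_ms, calls):
    """Cumulative server time for a query, in seconds."""
    return avg_ms * calls / 1000


# (average milliseconds, number of calls) for the two queries discussed above
workload = {
    "5 ms query, 2,000,000 calls": (5, 2_000_000),
    "800 ms query, 50 calls": (800, 50),
}

ranked = sorted(workload.items(), key=lambda item: total_seconds(*item[1]), reverse=True)
for label, (avg_ms, calls) in ranked:
    print(f"{label}: {total_seconds(avg_ms, calls):,.0f} s total")
```

<p>The “fast” query costs 250 times more server time than the “slow” one, which is why the first report sorts by total time rather than by average.</p>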
<p>For finding queries that are slow per-execution (the ones your users actually feel):</p>
<pre><code>SELECT

    substring(query, 1, 120) AS query_preview,

    calls,

    round(mean_exec_time::numeric, 2) AS avg_ms,

    round(max_exec_time::numeric, 2) AS max_ms

FROM pg_stat_statements

WHERE mean_exec_time &gt; 100

ORDER BY mean_exec_time DESC

LIMIT 10;
</code></pre>
<p>This catches the individually slow queries: anything averaging over 100ms. The <code>max_exec_time</code> column is useful too; if a query averages 150ms but maxes at 12,000ms, it has an intermittent performance problem, likely related to lock contention, resource exhaustion, or a cold cache.</p>
<p>You can also enable slow query logging directly in PostgreSQL as a safety net:</p>
<pre><code># In postgresql.conf

log_min_duration_statement = 1000    # Log any query taking &gt; 1 second
</code></pre>
<p>This writes slow queries directly to the PostgreSQL log file, which is useful if <code>pg_stat_statements</code> isn’t available or if you want to see the exact parameters used in slow queries.</p>
<h2 id="step-2-read-the-execution-plan-this-is-where-understanding-begins"><a class="header-anchor" href="#step-2-read-the-execution-plan-this-is-where-understanding-begins" target="_blank" rel="noopener noreferrer">Step 2: Read the Execution Plan (This Is Where Understanding Begins)</a></h2>
<p>You’ve identified a slow query. Now run it through <code>EXPLAIN ANALYZE</code>:</p>
<pre><code>EXPLAIN (ANALYZE, BUFFERS)

SELECT o.id, o.total, c.name

FROM orders o

JOIN customers c ON c.id = o.customer_id

WHERE o.status = 'completed'

AND o.created_at &gt; '2025-01-01'

ORDER BY o.created_at DESC

LIMIT 20;
</code></pre>
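<p>The <code>BUFFERS</code> option adds per-node counts of how many pages were found in shared buffers (hits) and how many had to be read in from outside them (reads), which helps distinguish I/O problems from CPU problems. (The buffer lines are not shown in the sample output that follows.)</p>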
<p>Here’s what an output might look like:</p>
<pre><code>Limit  (cost=15234.52..15234.57 rows=20 width=52) (actual time=892.45..892.48 rows=20 loops=1)

  -&gt;  Sort  (cost=15234.52..15456.23 rows=88682 width=52) (actual time=892.44..892.46 rows=20 loops=1)

        Sort Key: o.created_at DESC

        Sort Method: top-N heapsort  Memory: 27kB

        -&gt;  Hash Join  (cost=12.50..13012.34 rows=88682 width=52) (actual time=0.82..845.20 rows=89542 loops=1)

              Hash Cond: (o.customer_id = c.id)

              -&gt;  Seq Scan on orders o  (cost=0.00..11842.00 rows=88682 width=44) (actual time=0.04..780.32 rows=89542 loops=1)

                    Filter: ((status = 'completed') AND (created_at &gt; '2025-01-01'))

                    Rows Removed by Filter: 410458

              -&gt;  Hash  (cost=10.00..10.00 rows=200 width=12) (actual time=0.42..0.42 rows=200 loops=1)

                    -&gt;  Seq Scan on customers c  (cost=0.00..10.00 rows=200 width=12) (actual time=0.01..0.18 rows=200 loops=1)

Planning Time: 0.38 ms

Execution Time: 892.72 ms
</code></pre>
<p>Now let’s read this like a professional:</p>
<p><strong>The bottleneck is obvious.</strong> The <code>Seq Scan on orders o</code> line shows <code>actual time=0.04..780.32</code>. That’s 780ms spent reading the entire <code>orders</code> table sequentially (all 500,000 rows) to find the 89,542 that match the filter. That’s about 87% of the total execution time in one node.</p>
<p><strong>The estimated vs actual rows match.</strong> <code>rows=88682</code> estimated, <code>rows=89542</code> actual. That’s close enough — the statistics are fine. The planner isn’t making a bad decision because of stale stats; it’s making the only decision it can because there’s no suitable index.</p>
<p><strong>The Seq Scan on customers is fine.</strong> It takes 0.18ms because the table has only 200 rows. For a table this small, a sequential scan is typically as fast as or faster than an index lookup, so the planner will usually choose it even when an index exists. Adding extra indexes to a 200-row table is pointless overhead.</p>
<p><strong>The sort is cheap.</strong> <code>top-N heapsort</code> with 27kB memory means PostgreSQL used an efficient algorithm for <code>ORDER BY ... LIMIT 20</code>: it kept only the top 20 rows in memory instead of fully sorting all 89,542 matching rows.</p>
<p><strong>The diagnosis:</strong> this query needs a composite index on <code>orders(status, created_at)</code> to avoid the sequential scan.</p>
<pre><code>CREATE INDEX idx_orders_status_created ON orders (status, created_at);
</code></pre>
<p>Run <code>EXPLAIN ANALYZE</code> again after creating the index:</p>
<pre><code>-&gt;  Index Scan using idx_orders_status_created on orders o

      (cost=0.42..1823.56 rows=88682 width=44)

      (actual time=0.03..45.21 rows=89542 loops=1)

      Index Cond: ((status = 'completed') AND (created_at &gt; '2025-01-01'))
</code></pre>
<p>780ms → 45ms. The sequential scan is gone, replaced by an index scan that reads only the matching rows.</p>
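<p>On long plans it can help to let a script point at the expensive node. The sketch below is a rough text parser, not an official tool: it assumes the default text format of <code>EXPLAIN ANALYZE</code> shown in this step, and the function names are invented for the example. Note that the time reported for a node includes its children, so a parent always looks at least as expensive as the node beneath it.</p>

```python
import re

# Matches plan lines of the form: Node Name  (cost=...) (actual time=a..b rows=n loops=k)
NODE = re.compile(
    r"([A-Za-z][^(]*?)\s+\(cost=[^)]*\)\s+"
    r"\(actual time=([\d.]+)\.\.([\d.]+) rows=(\d+) loops=(\d+)\)"
)


def node_times(plan_text):
    """Return [(node name, inclusive milliseconds)] in plan order."""
    nodes = []
    for line in plan_text.splitlines():
        match = NODE.search(line)
        if match:
            name, _start, end, _rows, loops = match.groups()
            # actual time is reported per loop, so multiply by the loop count
            nodes.append((name.strip(), float(end) * int(loops)))
    return nodes


def share_of_total(plan_text):
    """Return {node name: percent of Execution Time}, rounded to one decimal."""
    total = float(re.search(r"Execution Time: ([\d.]+) ms", plan_text).group(1))
    return {name: round(100 * ms / total, 1) for name, ms in node_times(plan_text)}


PLAN = """\
Limit  (cost=15234.52..15234.57 rows=20 width=52) (actual time=892.45..892.48 rows=20 loops=1)
  ->  Sort  (cost=15234.52..15456.23 rows=88682 width=52) (actual time=892.44..892.46 rows=20 loops=1)
        ->  Hash Join  (cost=12.50..13012.34 rows=88682 width=52) (actual time=0.82..845.20 rows=89542 loops=1)
              ->  Seq Scan on orders o  (cost=0.00..11842.00 rows=88682 width=44) (actual time=0.04..780.32 rows=89542 loops=1)
Execution Time: 892.72 ms
"""

for name, percent in share_of_total(PLAN).items():
    print(f"{percent:5.1f}%  {name}")
```

<p>Reading from the bottom up, the deepest node that already accounts for most of the total is the one to fix. Here that is the sequential scan on <code>orders</code>.</p>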
<h2 id="step-3-when-stale-statistics-are-the-real-problem"><a class="header-anchor" href="#step-3-when-stale-statistics-are-the-real-problem" target="_blank" rel="noopener noreferrer">Step 3: When Stale Statistics Are the Real Problem</a></h2>
<p>Sometimes the execution plan shows something strange: the estimated row count is wildly different from the actual row count.</p>
<pre><code>-&gt;  Seq Scan on events  (cost=0.00..25.00 rows=5 width=32) (actual time=0.04..312.45 rows=147823 loops=1)
</code></pre>
<p>The planner estimated 5 rows. The actual result was 147,823. That’s not a rounding error — the planner is working with completely wrong statistics and making terrible decisions as a result.</p>
<p>When estimated rows are much lower than actual rows, the planner tends to choose nested loop joins (good for small sets, terrible for large ones) and avoids using hash joins or merge joins that would be much more efficient. The entire execution plan downstream of the bad estimate is suboptimal.</p>
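<p>A quick way to spot this in a long plan is to compute the ratio between estimated and actual rows for each node. The snippet below is a sketch that assumes the text plan format shown above; <code>misestimate</code> is a made-up helper, and what counts as “too far off” is a judgment call rather than a PostgreSQL rule.</p>

```python
import re

# Captures the planner's row estimate and the actual row count from one plan line.
ROWS = re.compile(r"\(cost=[^)]*rows=(\d+)[^)]*\)\s+\(actual time=[^)]*rows=(\d+)")


def misestimate(plan_line):
    """Return how many times off the estimate was (always 1.0 or more)."""
    estimated, actual = map(int, ROWS.search(plan_line).groups())
    estimated, actual = max(estimated, 1), max(actual, 1)   # avoid dividing by zero
    return max(estimated, actual) / min(estimated, actual)


stale = "->  Seq Scan on events  (cost=0.00..25.00 rows=5 width=32) (actual time=0.04..312.45 rows=147823 loops=1)"
healthy = "->  Seq Scan on orders o  (cost=0.00..11842.00 rows=88682 width=44) (actual time=0.04..780.32 rows=89542 loops=1)"

print(f"events: off by a factor of {misestimate(stale):,.0f}")
print(f"orders: off by a factor of {misestimate(healthy):.2f}")
```

<p>The <code>orders</code> scan from Step 2 is within about 1% of its estimate, while the <code>events</code> scan is off by four orders of magnitude. That gap, not the scan type, is the signal that statistics need attention.</p>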
<p><strong>Fix it with ANALYZE:</strong></p>
<pre><code>ANALYZE events;
</code></pre>
<p>This collects fresh statistics about the table’s data distribution — how many rows, how many distinct values per column, most common values, histogram boundaries. After running ANALYZE, the planner has accurate information and can make better decisions.</p>
<p>If specific columns have unusual distributions (lots of NULLs, extreme skew, or a huge number of distinct values), increase the statistics target for those columns:</p>
<pre><code>ALTER TABLE events ALTER COLUMN event_type SET STATISTICS 1000;

ANALYZE events;
</code></pre>
<p>The default statistics target is 100, which means PostgreSQL samples 100 × 300 = 30,000 rows to build histograms. Increasing it to 1000 means 300,000 rows are sampled, giving more accurate statistics for high-cardinality or skewed columns.</p>
<p><strong>Check if autovacuum is keeping up:</strong></p>
<pre><code>SELECT

    relname,

    n_live_tup,

    n_dead_tup,

    last_autoanalyze,

    last_autovacuum

FROM pg_stat_user_tables

WHERE n_dead_tup &gt; 1000

ORDER BY n_dead_tup DESC;
</code></pre>
<p>If <code>last_autoanalyze</code> was a long time ago and <code>n_dead_tup</code> is high, autovacuum isn’t keeping up. For high-churn tables, tune it:</p>
<pre><code>ALTER TABLE events SET (

    autovacuum_analyze_scale_factor = 0.02,

    autovacuum_vacuum_scale_factor = 0.05

);
</code></pre>
<p>This triggers ANALYZE after roughly 2% of the table changes (instead of the default 10%) and VACUUM after roughly 5% (instead of the default 20%).</p>
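<p>To see what those percentages mean in rows, here is the trigger formula worked through for a hypothetical 10-million-row table. Autovacuum fires once the number of changed rows exceeds a fixed base threshold (50 by default) plus the scale factor times the table’s row count; the table size and the helper name below are assumptions for illustration.</p>

```python
def trigger_threshold(row_count, scale_factor, base_threshold=50):
    """Changed rows needed before autovacuum acts: base + scale_factor * rows."""
    return base_threshold + round(scale_factor * row_count)


ROWS = 10_000_000   # hypothetical table size

settings = [
    ("ANALYZE, default 0.10", 0.10),
    ("ANALYZE, tuned 0.02", 0.02),
    ("VACUUM, default 0.20", 0.20),
    ("VACUUM, tuned 0.05", 0.05),
]
for label, factor in settings:
    print(f"{label}: after {trigger_threshold(ROWS, factor):,} changed rows")
```

<p>On a table this size the defaults let a million rows change before statistics are refreshed, which is plenty of room for the planner’s estimates to drift.</p>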
<h2 id="step-4-query-patterns-that-no-index-can-fix"><a class="header-anchor" href="#step-4-query-patterns-that-no-index-can-fix" target="_blank" rel="noopener noreferrer">Step 4: Query Patterns That No Index Can Fix</a></h2>
<p>Some queries are slow not because of missing indexes but because the SQL itself prevents efficient execution. No amount of indexing fixes a fundamentally bad query pattern.</p>
<p><strong>Functions on indexed columns:</strong></p>
<pre><code>-- PostgreSQL CANNOT use an index on created_at here

SELECT * FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2026;
</code></pre>
<p>The function <code>EXTRACT()</code> is evaluated on every single row. The index on <code>created_at</code> exists but is useless because PostgreSQL would need an index on <code>EXTRACT(YEAR FROM created_at)</code> — which doesn’t exist.</p>
<p>Rewrite as a range:</p>
<pre><code>-- This uses the index on created_at

SELECT * FROM orders

WHERE created_at &gt;= '2026-01-01'

AND created_at &lt; '2027-01-01';
</code></pre>
<p>Same results. But now PostgreSQL does a quick index range scan instead of reading the entire table.</p>
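<p>In application code, the same rewrite means computing the two boundaries up front and passing them as parameters. A minimal sketch follows; the <code>%s</code> placeholder style is an assumption (it is what psycopg uses), and <code>year_range</code> is a hypothetical helper.</p>

```python
from datetime import date


def year_range(year):
    """Return the half-open interval [Jan 1 of year, Jan 1 of next year)."""
    return date(year, 1, 1), date(year + 1, 1, 1)


start, end = year_range(2026)

# Inclusive lower bound, exclusive upper bound: no gap or overlap between years.
sql = "SELECT * FROM orders WHERE created_at >= %s AND created_at < %s"
params = (start, end)
print(sql, params)
```

<p>The half-open interval matters: an inclusive upper bound of December 31 would silently drop rows timestamped later that day.</p>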
<p><strong>OFFSET pagination:</strong></p>
<pre><code>SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 10000;
</code></pre>
<p>This looks efficient: “give me 20 rows starting at position 10,000.” But PostgreSQL must produce the first 10,020 rows in sorted order, then throw away the first 10,000. Without an index on <code>created_at</code>, that means sorting the whole table first. The deeper you paginate, the slower it gets. At OFFSET 100,000, it’s reading 100,020 rows to return 20.</p>
<p>Use keyset pagination instead:</p>
<pre><code>SELECT * FROM orders
WHERE created_at &lt; '2026-03-15T10:30:00Z'
ORDER BY created_at DESC
LIMIT 20;
</code></pre>
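<p>The <code>WHERE</code> clause on <code>created_at</code> replaces the OFFSET: you pass in the <code>created_at</code> value of the last row from the previous page. With an index on <code>created_at</code>, the query starts reading from the right position in the index and returns 20 rows immediately, regardless of which “page” you’re on. Page 1 and page 5,000 take roughly the same amount of time.</p>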
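<p>One caveat: if several orders share the same <code>created_at</code>, a single-column cursor can skip rows at page boundaries. A common variant adds a unique tie-breaker column. The sketch below assumes <code>orders.id</code> is a unique integer; the index name and the cursor values (taken from the last row of the previous page) are placeholders:</p>
<pre><code>-- Index that matches the sort order of the paginated query
CREATE INDEX idx_orders_created_id ON orders (created_at DESC, id DESC);

-- Row comparison: continue strictly after the last (created_at, id) seen
SELECT * FROM orders
WHERE (created_at, id) &lt; ('2026-03-15T10:30:00Z', 48213)
ORDER BY created_at DESC, id DESC
LIMIT 20;
</code></pre>
<p>The row comparison orders pairs lexicographically, so rows with an identical timestamp are separated by <code>id</code> instead of being dropped.</p>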
<p><strong>SELECT * through an ORM:</strong></p>
<pre><code>SELECT * FROM orders
JOIN customers ON customers.id = orders.customer_id
JOIN order_items ON order_items.order_id = orders.id
JOIN products ON products.id = order_items.product_id;
</code></pre>
<p>ORMs love eager loading, and eager loading loves <code>SELECT *</code> with multiple JOINs. The result set explodes — if an order has 5 items, you get 5 rows per order, each containing every column from all four tables. Most of that data is duplicated and never used.</p>
<p>The fix depends on what you actually need. If you only need order totals and customer names:</p>
<pre><code>SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at &gt; '2026-01-01';
</code></pre>
<p>Selecting only the columns you need means smaller result sets, less data transfer, and potentially index-only scans (where PostgreSQL can answer the query entirely from the index without touching the table at all).</p>
<p><strong>Correlated subqueries:</strong></p>
<pre><code>SELECT *,
    (SELECT COUNT(*) FROM order_items WHERE order_id = orders.id) AS item_count
FROM orders
WHERE status = 'completed';
</code></pre>
<p>That subquery runs once for every row in the outer query. If the outer query returns 50,000 rows, the subquery executes 50,000 times. Replace it with a JOIN:</p>
<pre><code>SELECT o.*, COUNT(oi.id) AS item_count
FROM orders o
LEFT JOIN order_items oi ON oi.order_id = o.id
WHERE o.status = 'completed'
GROUP BY o.id;
</code></pre>
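<p>One query, one pass. The planner can use a hash join and process everything in bulk. Selecting <code>o.*</code> while grouping only by <code>o.id</code> works because PostgreSQL treats the remaining columns as functionally dependent on the primary key, so this assumes <code>id</code> is the primary key of <code>orders</code>.</p>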
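<p>An equivalent formulation aggregates <code>order_items</code> once in a derived table and joins the result. This is a sketch of that variant, not a guaranteed improvement. It does not rely on <code>id</code> being the primary key, and <code>COALESCE</code> keeps orders without items at a count of 0:</p>
<pre><code>SELECT o.*, COALESCE(ic.item_count, 0) AS item_count
FROM orders o
LEFT JOIN (
    -- One aggregation pass over order_items
    SELECT order_id, COUNT(*) AS item_count
    FROM order_items
    GROUP BY order_id
) ic ON ic.order_id = o.id
WHERE o.status = 'completed';
</code></pre>
<p>Which form is faster depends on how many orders match the filter, so compare both plans with <code>EXPLAIN ANALYZE</code>.</p>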
<h2 id="step-5-index-strategy-when-what-and-how-many"><a class="header-anchor" href="#step-5-index-strategy-when-what-and-how-many" target="_blank" rel="noopener noreferrer">Step 5: Index Strategy — When, What, and How Many</a></h2>
<p>After you’ve confirmed through <code>EXPLAIN ANALYZE</code> that a missing index is genuinely the bottleneck, be strategic about what you create.</p>
<p><strong>Composite indexes — column order matters:</strong></p>
<pre><code>-- Good: status is the equality filter, created_at is the range filter
CREATE INDEX idx_orders_status_created ON orders (status, created_at);

-- Less useful: reversed order doesn't help if you're filtering by status first
CREATE INDEX idx_orders_created_status ON orders (created_at, status);
</code></pre>
<p>The general rule: equality columns first, range columns second. PostgreSQL uses a composite index most efficiently when the query constrains its leading columns. An index on <code>(status, created_at)</code> helps queries filtering on <code>status</code> alone, but an index on <code>(created_at, status)</code> is of little use for them, because the matching rows are scattered across the whole index.</p>
<p><strong>Partial indexes — index only what matters:</strong></p>
<pre><code>CREATE INDEX idx_orders_pending ON orders (customer_id, created_at)
WHERE status = 'pending';
</code></pre>
<p>This index is smaller than a full index because it only includes rows where <code>status = 'pending'</code>. If your query always filters for pending orders, this index is both smaller (faster to scan, less memory) and more precise than a full index.</p>
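<p>The planner only considers a partial index when the query’s <code>WHERE</code> clause implies the index predicate. A short illustration against the index above; the customer ID and the <code>'shipped'</code> status are made-up values:</p>
<pre><code>-- Can use idx_orders_pending: the filter matches the index predicate
SELECT id, created_at FROM orders
WHERE status = 'pending'
AND customer_id = 42
ORDER BY created_at DESC;

-- Cannot use idx_orders_pending: shipped rows are not in the index
SELECT id, created_at FROM orders
WHERE status = 'shipped'
AND customer_id = 42;
</code></pre>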
<p><strong>Covering indexes (index-only scans):</strong></p>
<pre><code>CREATE INDEX idx_orders_covering ON orders (status, created_at)
INCLUDE (id, total, customer_id);
</code></pre>
<p>The <code>INCLUDE</code> columns are stored in the index but not used for searching. If your query only references the key columns and the included columns (<code>id</code>, <code>total</code>, and <code>customer_id</code>), PostgreSQL can answer it from the index alone, skipping the table for every page the visibility map marks as all-visible. This is called an index-only scan, and on a recently vacuumed table it’s usually the cheapest way to read those rows.</p>
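<p>To check whether a query actually gets an index-only scan, look at the plan. A sketch using the covering index above; the status value and date are illustrative:</p>
<pre><code>-- Refresh the visibility map so index-only scans can skip the heap
VACUUM (ANALYZE) orders;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, total, customer_id
FROM orders
WHERE status = 'completed'
AND created_at &gt;= '2026-01-01';
</code></pre>
<p>In the output, look for an <code>Index Only Scan</code> node and its <code>Heap Fetches</code> count. A high count means many pages are not yet marked all-visible and PostgreSQL is still visiting the table.</p>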
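<p><strong>Don’t over-index.</strong> Every index on a table slows down writes. Each <code>INSERT</code> must add an entry to every index. An <code>UPDATE</code> that changes any indexed column can’t use PostgreSQL’s heap-only tuple (HOT) optimization, so the new row version gets an entry in every index, not just the one on the changed column. A <code>DELETE</code> leaves dead entries behind in every index for <code>VACUUM</code> to clean up later. Index maintenance grows roughly in proportion to the number of indexes, so a table with 15 indexes pays that cost 15 times over on every insert.</p>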
<p>Check for unused indexes periodically:</p>
<pre><code>SELECT
    indexrelname AS index_name,
    idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%pkey%'
ORDER BY pg_relation_size(indexrelid) DESC;
</code></pre>
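<p>These are indexes that have not been scanned since the last statistics reset. If they’ve been unused for months, drop them, after checking two things: that the index doesn’t enforce a unique constraint, and that it isn’t used on a read replica, since usage statistics are tracked per server. Otherwise they’re consuming disk space, memory, and write performance for zero benefit.</p>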
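<p>Matching on the name <code>pkey</code> misses unique indexes that enforce constraints under other names. A variant that filters on the catalog flags instead, as a sketch. The index in the <code>DROP</code> statement is just an example name:</p>
<pre><code>SELECT
    s.schemaname,
    s.relname AS table_name,
    s.indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(s.indexrelid)) AS size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
AND NOT i.indisprimary
AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;

-- Drop without blocking writes (must run outside a transaction block)
DROP INDEX CONCURRENTLY idx_orders_created_status;
</code></pre>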
<h2 id="the-complete-slow-query-diagnostic-sequence"><a class="header-anchor" href="#the-complete-slow-query-diagnostic-sequence" target="_blank" rel="noopener noreferrer">The Complete Slow Query Diagnostic Sequence</a></h2>
<p>When a query is slow, run through this:</p>
<ol>
<li><strong>Find it</strong> — use <code>pg_stat_statements</code> to identify the query by total time or average time</li>
<li><strong>Read the plan</strong> — <code>EXPLAIN (ANALYZE, BUFFERS)</code> on the query</li>
<li><strong>Check estimated vs actual rows</strong> — if they’re wildly different, run <code>ANALYZE</code> on the tables involved</li>
<li><strong>Look for Seq Scans on large tables</strong> — if the filter is selective (returns &lt; 15% of rows), an index is likely needed</li>
<li><strong>Check the query pattern</strong> — look for functions on indexed columns, OFFSET pagination, SELECT *, correlated subqueries</li>
<li><strong>Add indexes strategically</strong> — composite, partial, covering — based on what the plan tells you</li>
<li><strong>Verify the fix</strong> — run <code>EXPLAIN ANALYZE</code> again and confirm the plan improved</li>
</ol>
<p>Don’t skip step 3. I’ve seen teams spend days tuning queries that were slow because of stale statistics. A single <code>ANALYZE</code> command fixed the whole thing in under a second.</p>
<p>And don’t skip step 7. I’ve seen people create indexes that PostgreSQL ignores because the query pattern doesn’t match. You haven’t fixed anything until <code>EXPLAIN ANALYZE</code> confirms it.</p>
<p>Database performance is a discipline, not a guessing game. The tools exist. The plans are readable. The fixes are usually straightforward once you know where to look. The hard part isn’t the fix — it’s convincing yourself to actually look at the plan instead of throwing indexes at the wall and hoping one sticks.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>