Adding Eyes to the Cluster — Jumpbox, Ansible Monitoring, and a Live Health Dashboard

A ten-server cluster spread across three cities is only as good as your ability to see what is happening inside it. After moving our database layer into a private VPC, we had strong network isolation — but no centralised visibility. No single place to check if all servers were healthy, if replication was lagging, or if someone was hammering SSH. We added a dedicated jumpbox, automated Ansible monitoring, a live health dashboard, and intrusion prevention across the cluster. Here is what we built and why.

The Problem: Blind Spots in a Distributed Cluster

Our HA WordPress cluster had nine servers across Sydney, Brisbane, and Melbourne — six web nodes, three database nodes, and an anycast load balancer tying them together. The initial build focused on redundancy and performance. The VPC migration locked down the database layer. But we were missing something fundamental: observability.

To check if a server was healthy, we had to SSH into it manually and run commands. To verify replication, we had to connect to each replica and inspect SHOW REPLICA STATUS. To see if anyone was brute-forcing SSH, we had to grep auth logs on each server individually. Across nine servers in three cities, this does not scale.

We needed three things:

  • A secure entry point — a single server that can reach everything, without exposing the database layer
  • Automated health checks — something that polls every server on a schedule and records the results
  • A dashboard — a single page that shows green or red for every server, updated automatically

The Jumpbox: A Dedicated Operations Hub

Previously, we accessed database servers by bouncing SSH through whichever web server was in the right region. This worked but had problems — web servers should serve web traffic, not act as SSH relays. And with six possible jump points, there was no consistent operational entry.

We built a dedicated jumpbox — a lightweight server in Brisbane whose sole purpose is operational access and monitoring. It sits on the same cross-region VPC as every other server, giving it direct private network connectivity to all nine nodes.

Property   Value
Role       Jumpbox + monitoring hub
Region     Brisbane
Size       1 vCPU, 2 GB RAM
VPC IP     10.241.0.254 (infrastructure subnet)
Public IP  [redacted]
OS         Ubuntu 24.04

The jumpbox has SSH keys for every server in the cluster. From it, we can reach any web server or database server over the VPC — no public internet involved. This is the only server that can SSH into the private database nodes.
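In day-to-day use this pattern is typically captured in an SSH client config so that private nodes are one hop away. A sketch (host aliases and the user are illustrative; the public IP stays redacted, and each private node's HostName is its VPC IP, e.g. wp-db-primary below):

```
# ~/.ssh/config on an operator's machine (illustrative names)
Host jumpbox
    HostName <jumpbox-public-ip>
    User ops

# Reach a private database node through the jumpbox over the VPC
Host wp-db-primary
    HostName 10.241.1.1
    User ops
    ProxyJump jumpbox
```

With this in place, `ssh wp-db-primary` transparently bounces through the jumpbox rather than requiring a manual two-step connection.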

Single Point of Entry, Not Single Point of Failure

The jumpbox is the operational gateway — it is how we manage the cluster, not how the cluster serves traffic. If the jumpbox goes down, the website continues to run normally. Load balancing, web serving, database replication — none of these depend on it. We lose monitoring visibility and SSH access to the DB layer until it is restored, but the site stays up. BinaryLane’s VNC console provides emergency access if needed.

The Tenth Server

Adding the jumpbox brought our cluster to ten servers. The VPC addressing scheme made this clean — the jumpbox sits in its own infrastructure subnet, separate from both the web tier and the database tier:

Subnet          Range       Purpose                      Servers
Infrastructure  10.241.0.x  Jumpbox, monitoring          1
Database        10.241.1.x  MariaDB primary + replicas   3
Web             10.241.2.x  Nginx + PHP-FPM + WordPress  6

Ten servers, three subnets, three cities, one VPC. Every server can reach every other server over private IPs — and the jumpbox can see them all.
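An Ansible inventory mirroring this layout could look like the following sketch. The group names are assumptions; the host names and VPC IPs match the cluster tables in this article:

```yaml
# inventory.yml - hypothetical layout mirroring the three subnets
all:
  children:
    infrastructure:
      hosts:
        jumpbox:           { ansible_host: 10.241.0.254 }
    database:
      hosts:
        wp-db-primary:     { ansible_host: 10.241.1.1 }
        wp-db-replica:     { ansible_host: 10.241.1.2 }
        wp-db-replica-mel: { ansible_host: 10.241.1.3 }
    web:
      hosts:
        wp-web-1-syd:      { ansible_host: 10.241.2.1 }
        wp-web-2-syd:      { ansible_host: 10.241.2.2 }
        wp-web-3-bne:      { ansible_host: 10.241.2.3 }
        wp-web-4-bne:      { ansible_host: 10.241.2.4 }
        wp-web-5-mel:      { ansible_host: 10.241.2.5 }
        wp-web-6-mel:      { ansible_host: 10.241.2.6 }
```

Grouping by subnet means playbooks can target `web`, `database`, or `all` directly, which maps cleanly onto the tiered checks below.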

Ansible: Automated Health Checks

With the jumpbox in place, we installed Ansible as the monitoring engine. Ansible is a natural fit — it connects to servers over SSH (which the jumpbox already does), runs commands, collects results, and can template output into any format we want.

Every five minutes, a cron job on the jumpbox runs a health check playbook that polls all ten servers simultaneously. Here is what it checks:

Every Server (All 10)

Check                  Method                            Healthy When
CPU usage              /proc/stat sampled over 1 second  < 90%
Memory usage           free -m                           < 90%
Disk usage             df /                              < 85%
Load average           /proc/loadavg                     Reported (informational)
Uptime                 /proc/uptime                      Reported (informational)
Auth failures (5 min)  auth.log grep                     Reported (informational)
Listening ports        ss -tlnp                          Only expected ports
Suspicious processes   Process scan for crypto miners    None found
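The CPU check can be sketched in shell: two /proc/stat samples taken one second apart, with the busy percentage computed from the deltas. This is a simplified stand-in for the actual playbook task, assuming a Linux host:

```shell
#!/bin/sh
# Read aggregate CPU counters: total jiffies and idle time (idle + iowait).
read_cpu() {
  awk '/^cpu /{idle=$5+$6; total=0; for(i=2;i<=9;i++) total+=$i; print total, idle}' /proc/stat
}
set -- $(read_cpu); t1=$1; i1=$2
sleep 1
set -- $(read_cpu); t2=$1; i2=$2
# Busy % = (non-idle delta) / (total delta). Sampling over an interval
# avoids the misleading readings an instantaneous snapshot can give.
busy=$(( ((t2 - t1) - (i2 - i1)) * 100 / (t2 - t1) ))
echo "cpu_busy_percent=$busy"
```

The interval-based approach matters here, as noted later in this article: measuring at a single instant can accidentally capture the monitoring process's own overhead.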

Web Servers (6 Nodes)

Check            Method                         Healthy When
Nginx            systemctl is-active            Active
PHP-FPM          systemctl is-active            Active
WordPress        curl https://localhost/        HTTP 200
Health endpoint  curl https://localhost/health  HTTP 200
SSL certificate  openssl s_client               > 7 days until expiry
fail2ban         systemctl is-active            Active
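The certificate check reduces to "days until notAfter". A sketch, assuming GNU date; in the live check the date string would come from piping `openssl s_client` into `openssl x509 -noout -enddate`, while here a fixed future date stands in for a real TLS connection:

```shell
#!/bin/sh
# Days remaining given a certificate's notAfter string, e.g. from:
#   openssl s_client -connect host:443 </dev/null | openssl x509 -noout -enddate
days_left() {
  end_epoch=$(date -d "$1" +%s)   # GNU date parses the openssl date format
  now_epoch=$(date +%s)
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Illustrative fixed date instead of a live certificate fetch.
days=$(days_left "Jun  1 12:00:00 2031 GMT")
echo "days_until_expiry=$days"
[ "$days" -gt 7 ] && echo "cert_ok" || echo "cert_expiring"
```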

Database Servers (3 Nodes)

Check         Method               Healthy When
MariaDB       systemctl is-active  Active
Connectivity  mysql -e "SELECT 1"  Success
Thread count  SHOW STATUS          Reported (informational)
DB uptime     SHOW STATUS          Reported (informational)

Replica Servers (2 Nodes)

Check            Method                 Healthy When
IO thread        SHOW REPLICA STATUS    Running
SQL thread       SHOW REPLICA STATUS    Running
Replication lag  Seconds_Behind_Master  < 30 seconds
Last error       Last_Error             Empty
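Parsing those four fields can be sketched as below. The sample text stands in for live output from `mysql -e "SHOW REPLICA STATUS\G"`; MariaDB still reports these fields under the Slave_* names, and the thresholds match the table above:

```shell
#!/bin/sh
# Sample of the relevant SHOW REPLICA STATUS\G fields (illustrative values).
status="Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 3
Last_Error: "

# Extract one field's value by key.
field() { printf '%s\n' "$status" | awk -v k="$1" -F': *' '$1 ~ k {print $2}'; }

io=$(field  'Slave_IO_Running$')
sql=$(field 'Slave_SQL_Running$')
lag=$(field 'Seconds_Behind_Master')
err=$(field 'Last_Error')

# Healthy when both threads run, lag is under 30 s, and no error is set.
if [ "$io" = Yes ] && [ "$sql" = Yes ] && [ "$lag" -lt 30 ] && [ -z "$err" ]; then
  echo replication_ok
else
  echo replication_problem
fi
```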

That is over 80 individual checks running every five minutes, automatically, across three cities. The results are assembled into a single JSON file that captures the complete health state of the cluster at that moment.
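The schedule itself is a single crontab entry on the jumpbox. A sketch, with hypothetical file paths:

```
# /etc/cron.d/cluster-health (illustrative paths)
*/5 * * * * root ansible-playbook -i /opt/monitoring/inventory.yml /opt/monitoring/health-check.yml >> /var/log/health-check.log 2>&1
```

Logging to a file means a silent failure of the playbook itself leaves a trail, and the dashboard's timestamp (described below) reveals when the cron job stops producing fresh data.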

Why Ansible Instead of Prometheus or Zabbix?

Traditional monitoring stacks like Prometheus require agents on every server, a time-series database, and a separate visualisation layer like Grafana. For a ten-server cluster, that is significant overhead — both in compute resources and operational complexity. Ansible is already on the jumpbox for management tasks, connects over SSH (no agents needed), and can write its output as a simple JSON file. A lightweight HTML dashboard reads that JSON. Total added infrastructure beyond the jumpbox itself: zero new servers, zero new services, zero new dependencies.

The Dashboard: Live Cluster Health at a Glance

The health check JSON powers a live dashboard served from the jumpbox itself. It is a single HTML page — no frameworks, no build tools, no dependencies — that fetches the status JSON every thirty seconds and renders the entire cluster state visually.
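The status JSON the page polls might be shaped something like this hypothetical excerpt — the field names are assumptions, not the actual template output:

```json
{
  "generated_at": "2025-06-01T03:25:01Z",
  "servers": {
    "wp-web-1-syd": {
      "region": "SYD", "role": "web", "healthy": true,
      "cpu_pct": 12, "mem_pct": 41, "disk_pct": 33,
      "services": { "nginx": "active", "php-fpm": "active", "fail2ban": "active" }
    },
    "wp-db-replica": {
      "region": "BNE", "role": "db-replica", "healthy": true,
      "replication": { "io": "Yes", "sql": "Yes", "lag_seconds": 0, "last_error": "" }
    }
  }
}
```

Keeping the whole cluster state in one flat file is what lets the dashboard stay a single dependency-free HTML page.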

The dashboard is protected by HTTP basic authentication and accessible only at the jumpbox’s public IP address, which we do not publish. It is an internal operations tool, not a public-facing page.

Here is what it shows:

  • Top banner — green “All Systems Operational” or red “N servers reporting issues”
  • Server cards — grouped by region (Sydney, Brisbane, Melbourne), each card shows the server name, role, and health status with a green or red border
  • Metric bars — CPU, memory, and disk usage as visual progress bars on each card, colour-coded from green through yellow to red
  • Service badges — small indicators for each service (nginx, PHP-FPM, WordPress, SSL, fail2ban, MariaDB) showing active or inactive
  • Replication panel — dedicated section for the two database replicas showing IO thread, SQL thread, lag in seconds, and any replication errors
  • Security overview — auth failure counts, root login attempts, fail2ban coverage, and any suspicious process alerts
  • Timestamp — when the data was last collected, so you can tell if the cron job is running

When everything is healthy, the page is a wall of green. When something breaks, the affected server card turns red and the top banner changes immediately. At a glance, you know the state of your entire cluster.

fail2ban: Intrusion Prevention

While building the monitoring, we also activated fail2ban on all six web servers. fail2ban watches authentication logs and automatically blocks IP addresses that show signs of brute-force attacks — too many failed SSH login attempts from the same source results in a temporary firewall ban.
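A minimal sshd jail of the kind described looks like the following sketch; the values shown are typical defaults rather than the cluster's actual settings:

```ini
# /etc/fail2ban/jail.local (illustrative values)
[sshd]
enabled  = true
maxretry = 5       ; failed attempts before a ban
findtime = 10m     ; window in which failures are counted
bantime  = 1h      ; length of the temporary firewall ban
```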

The database servers do not need fail2ban — they have no public IP addresses and are unreachable from the internet. The jumpbox’s firewall restricts SSH to expected sources. But the six web servers, with their public IPs serving HTTP and HTTPS traffic, are the front line. fail2ban gives them automatic protection against credential stuffing and brute-force attacks.

The Ansible health checks monitor fail2ban on every web server. If it stops running, the dashboard shows it immediately.

What the Security Posture Looks Like Now

Each phase of this project has tightened the security model. Here is the layered picture:

Layer                  What It Does                                                 Added In
Anycast load balancer  Distributes traffic, hides origin IPs                        Phase 1
BinaryLane firewalls   Allow only SSH, HTTP, HTTPS, ICMP — deny everything else     Phase 1
Private VPC            All database traffic on isolated private network             VPC migration
Removed DB public IPs  Database servers unreachable from internet entirely          VPC migration
Dedicated jumpbox      Single controlled SSH entry point for all servers            This phase
fail2ban               Automatic brute-force IP banning on web servers              This phase
Automated monitoring   Detects anomalies: unexpected ports, processes, auth spikes  This phase

Defence in Depth

No single layer is the security strategy — they work together. The load balancer absorbs volumetric attacks. Firewalls block unexpected ports. The VPC isolates internal traffic. Private IPs make database servers unreachable. The jumpbox controls SSH access. fail2ban blocks brute-force attempts. And the monitoring layer watches all of it, alerting when something looks wrong. An attacker would need to defeat every layer to reach anything valuable.

The Full Architecture: Ten Servers, Three Cities

With the jumpbox and monitoring in place, the complete cluster architecture looks like this:

Server             Role            Region  VPC IP        Public IP
jumpbox            Operations hub  BNE     10.241.0.254  [redacted]
wp-web-1-syd       Web             SYD     10.241.2.1    Behind LB
wp-web-2-syd       Web             SYD     10.241.2.2    Behind LB
wp-web-3-bne       Web             BNE     10.241.2.3    Behind LB
wp-web-4-bne       Web             BNE     10.241.2.4    Behind LB
wp-web-5-mel       Web             MEL     10.241.2.5    Behind LB
wp-web-6-mel       Web             MEL     10.241.2.6    Behind LB
wp-db-primary      DB Primary      SYD     10.241.1.1    None
wp-db-replica      DB Replica      BNE     10.241.1.2    None
wp-db-replica-mel  DB Replica      MEL     10.241.1.3    None

For the full interactive topology — server connections, replication flows, VPC routing, and HyperDB read/write splitting — see the architecture diagram.

AI-Managed, Start to Finish

Like every phase of this infrastructure, the jumpbox deployment, Ansible monitoring setup, dashboard creation, and fail2ban activation were performed entirely by Claude using the BinaryLane MCP and SSH MCP.

Here is what Claude did in this phase:

Step  Tool            What Happened
1     BinaryLane MCP  Created the jumpbox server in Brisbane and joined it to the VPC
2     BinaryLane MCP  Configured stateless firewall — SSH, HTTP, HTTPS, ICMP, DNS with explicit deny
3     SSH MCP         Deployed SSH keys so the jumpbox can reach all 9 cluster servers over VPC
4     SSH MCP         Installed Ansible, nginx, and apache2-utils on the jumpbox
5     SSH MCP         Wrote the full Ansible inventory — all 10 servers with VPC IPs, roles, and regions
6     SSH MCP         Created five health check task files covering server, app, database, replication, and security checks
7     SSH MCP         Wrote a Jinja2 template to assemble check results into structured JSON
8     SSH MCP         Built the dashboard — a single HTML/CSS/JS file with live polling and visual status indicators
9     SSH MCP         Configured nginx with basic auth to serve the dashboard securely
10    SSH MCP         Installed and activated fail2ban on all 6 web servers via Ansible ad-hoc commands
11    SSH MCP         Set up a cron job for five-minute automated health checks
12    Both            Ran the first health check, verified all 10 servers healthy, confirmed dashboard rendering

Claude also identified and fixed issues during the build — discovering that the CPU measurement was capturing Ansible’s own process overhead (producing false 100% readings), that the MariaDB connectivity check needed output redirection to parse correctly, and that fail2ban needed a fresh install rather than just activation. Each issue was diagnosed, fixed, and verified without human intervention.

What We Monitor vs What We Could Monitor

The current monitoring covers the essentials — the checks that tell you whether the cluster is serving traffic correctly and whether anything looks suspicious. But this is a foundation, not a ceiling. The same Ansible framework can be extended to check:

  • SSL certificate auto-renewal — alert if Let’s Encrypt renewal fails before expiry hits the 7-day threshold
  • WordPress update status — flag when core, theme, or plugin updates are available
  • Disk I/O latency — detect storage performance degradation before it affects users
  • Log anomaly detection — pattern matching for unusual error rates in nginx or PHP-FPM logs
  • Cross-region latency — measure VPC round-trip times between cities to detect network issues

Adding a new check means writing one Ansible task and adding a field to the JSON template. The dashboard picks it up automatically.
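For example, a disk I/O latency check from the list above could be one more task in the playbook. This is a sketch — the task name, the registered variable, and the vmstat column are assumptions, not the cluster's actual code:

```yaml
# One hypothetical extra health check task.
- name: Sample I/O wait percentage
  ansible.builtin.shell: "vmstat 1 2 | tail -1 | awk '{print $16}'"
  register: iowait_pct
  changed_when: false

# Then expose it as one more field in the Jinja2 JSON template:
#   "iowait_pct": {{ iowait_pct.stdout | default("0") }}
```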

Cost

The jumpbox is the only new infrastructure cost. It runs on BinaryLane’s std-1vcpu plan — the same size as the database servers. Ansible, nginx, fail2ban, and the dashboard are all open-source software running on existing resources. The total cluster cost, including the new jumpbox, is approximately $70/month for ten servers, an anycast load balancer, and automated monitoring across three Australian cities.

💡 Try It Yourself

The MCP servers that make AI-managed infrastructure possible are open-source:

Install them, point Claude at your BinaryLane account, and start building. From a single server to a multi-city monitored cluster — the AI handles the infrastructure so you can focus on what matters.