A ten-server cluster spread across three cities is only as good as your ability to see what is happening inside it. After moving our database layer into a private VPC, we had strong network isolation — but no centralised visibility. No single place to check if all servers were healthy, if replication was lagging, or if someone was hammering SSH. We added a dedicated jumpbox, automated Ansible monitoring, a live health dashboard, and intrusion prevention across the cluster. Here is what we built and why.
The Problem: Blind Spots in a Distributed Cluster
Our HA WordPress cluster had nine servers across Sydney, Brisbane, and Melbourne — six web nodes, three database nodes, and an anycast load balancer tying them together. The initial build focused on redundancy and performance. The VPC migration locked down the database layer. But we were missing something fundamental: observability.
To check whether a server was healthy, we had to SSH into it manually and run commands. To verify replication, we had to connect to each replica and inspect SHOW REPLICA STATUS. To see if anyone was brute-forcing SSH, we had to grep the auth logs on each server individually. Across nine servers in three cities, this does not scale.
We needed three things:
- A secure entry point — a single server that can reach everything, without exposing the database layer
- Automated health checks — something that polls every server on a schedule and records the results
- A dashboard — a single page that shows green or red for every server, updated automatically
The Jumpbox: A Dedicated Operations Hub
Previously, we accessed database servers by bouncing SSH through whichever web server was in the right region. This worked, but it had problems — web servers should serve web traffic, not act as SSH relays. And with six possible jump points, there was no consistent operational entry point.
We built a dedicated jumpbox — a lightweight server in Brisbane whose sole purpose is operational access and monitoring. It sits on the same cross-region VPC as every other server, giving it direct private network connectivity to all nine nodes.
| Property | Value |
|---|---|
| Role | Jumpbox + monitoring hub |
| Region | Brisbane |
| Size | 1 vCPU, 2 GB RAM |
| VPC IP | 10.241.0.254 (infrastructure subnet) |
| Public IP | [redacted] |
| OS | Ubuntu 24.04 |
The jumpbox has SSH keys for every server in the cluster. From it, we can reach any web server or database server over the VPC — no public internet involved. This is the only server that can SSH into the private database nodes.
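This pattern maps naturally onto SSH's ProxyJump feature. Below is a sketch of what an operator's ~/.ssh/config might look like — the username, host aliases, and jumpbox address are illustrative placeholders, not the cluster's real values:

```
# Sketch of an operator's ~/.ssh/config; user, aliases, and the
# jumpbox address are hypothetical placeholders.
Host jumpbox
    HostName <jumpbox-public-ip>
    User ops

# Private database nodes have no public IPs, so hop through the jumpbox
# (assumes the operator's key is authorised on the target node)
Host wp-db-primary
    HostName 10.241.1.1
    User ops
    ProxyJump jumpbox
```

With something like this in place, a plain `ssh wp-db-primary` tunnels transparently through the jumpbox and onto the VPC.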
Single Point of Entry, Not Single Point of Failure
The jumpbox is the operational gateway — it is how we manage the cluster, not how the cluster serves traffic. If the jumpbox goes down, the website continues to run normally. Load balancing, web serving, database replication — none of these depend on it. We lose monitoring visibility and SSH access to the DB layer until it is restored, but the site stays up. BinaryLane’s VNC console provides emergency access if needed.
The Tenth Server
Adding the jumpbox brought our cluster to ten servers. The VPC addressing scheme made this clean — the jumpbox sits in its own infrastructure subnet, separate from both the web tier and the database tier:
| Subnet | Range | Purpose | Servers |
|---|---|---|---|
| Infrastructure | 10.241.0.x | Jumpbox, monitoring | 1 |
| Database | 10.241.1.x | MariaDB primary + replicas | 3 |
| Web | 10.241.2.x | Nginx + PHP-FPM + WordPress | 6 |
Ten servers, three subnets, three cities, one VPC. Every server can reach every other server over private IPs — and the jumpbox can see them all.
Ansible: Automated Health Checks
With the jumpbox in place, we installed Ansible as the monitoring engine. Ansible is a natural fit — it connects to servers over SSH (which the jumpbox already does), runs commands, collects results, and can template output into any format we want.
Every five minutes, a cron job on the jumpbox runs a health check playbook that polls all ten servers simultaneously. Here is what it checks:
Every Server (All 10)
| Check | Method | Healthy When |
|---|---|---|
| CPU usage | /proc/stat sampled over 1 second | < 90% |
| Memory usage | free -m | < 90% |
| Disk usage | df / | < 85% |
| Load average | /proc/loadavg | Reported (informational) |
| Uptime | /proc/uptime | Reported (informational) |
| Auth failures (5 min) | auth.log grep | Reported (informational) |
| Listening ports | ss -tlnp | Only expected ports |
| Suspicious processes | Process scan for crypto miners | None found |
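In Ansible terms, each row of this table becomes a small task. A minimal sketch of what the disk check might look like — task names, the registered variable, and file layout are assumptions, not the production playbook:

```yaml
# Illustrative base-check tasks; names are hypothetical.
- name: Sample root filesystem usage
  command: df --output=pcent /
  register: disk_raw
  changed_when: false

- name: Evaluate disk health (healthy when < 85%)
  set_fact:
    disk_pct: "{{ disk_raw.stdout_lines[-1] | trim | replace('%', '') | int }}"
    disk_ok: "{{ (disk_raw.stdout_lines[-1] | trim | replace('%', '') | int) < 85 }}"
```

The CPU, memory, and load checks follow the same shape: run a command, register the output, derive a boolean against the threshold.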
Web Servers (6 Nodes)
| Check | Method | Healthy When |
|---|---|---|
| Nginx | systemctl is-active | Active |
| PHP-FPM | systemctl is-active | Active |
| WordPress | curl https://localhost/ | HTTP 200 |
| Health endpoint | curl https://localhost/health | HTTP 200 |
| SSL certificate | openssl s_client | > 7 days until expiry |
| fail2ban | systemctl is-active | Active |
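The service checks follow the same register-and-evaluate pattern. A hedged sketch — curl flags and variable names are assumptions:

```yaml
# Illustrative web-tier checks; task and variable names are hypothetical.
- name: Check nginx service state
  command: systemctl is-active nginx
  register: nginx_state
  changed_when: false
  failed_when: false          # record the state, do not abort the play

- name: Check WordPress responds locally
  # -k because the certificate is issued for the site name, not localhost
  command: curl -sk -o /dev/null -w '%{http_code}' https://localhost/
  register: wp_http
  changed_when: false

- name: Evaluate (healthy when HTTP 200)
  set_fact:
    wp_ok: "{{ wp_http.stdout == '200' }}"
```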
Database Servers (3 Nodes)
| Check | Method | Healthy When |
|---|---|---|
| MariaDB | systemctl is-active | Active |
| Connectivity | mysql -e "SELECT 1" | Success |
| Thread count | SHOW STATUS | Reported (informational) |
| DB uptime | SHOW STATUS | Reported (informational) |
Replica Servers (2 Nodes)
| Check | Method | Healthy When |
|---|---|---|
| IO thread | SHOW REPLICA STATUS | Running |
| SQL thread | SHOW REPLICA STATUS | Running |
| Replication lag | Seconds_Behind_Master | < 30 seconds |
| Last error | Last_Error | Empty |
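These four rows might translate into a task pair like the sketch below. The exact column names vary between MariaDB versions (Slave_* vs Replica_*), so treat the patterns as assumptions:

```yaml
# Illustrative replication check; regex patterns and variable names
# are hypothetical, and column names differ by MariaDB version.
- name: Query replica status
  command: mysql -e "SHOW REPLICA STATUS\G"
  register: repl_raw
  changed_when: false

- name: Evaluate replication lag (healthy when < 30 s)
  set_fact:
    lag_s: "{{ repl_raw.stdout | regex_search('Seconds_Behind_Master:\\s*(\\d+)', '\\1') | first | int }}"
    lag_ok: "{{ (repl_raw.stdout | regex_search('Seconds_Behind_Master:\\s*(\\d+)', '\\1') | first | int) < 30 }}"
```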
That is over 80 individual checks running every five minutes, automatically, across three cities. The results are assembled into a single JSON file that captures the complete health state of the cluster at that moment.
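The article does not reproduce the file itself, but a status JSON of this kind might plausibly take a shape like the following — every field name here is a hypothetical illustration:

```json
{
  "generated_at": "2025-01-01T00:00:00Z",
  "cluster_healthy": true,
  "servers": {
    "wp-web-1-syd": {
      "role": "web",
      "region": "SYD",
      "healthy": true,
      "cpu_pct": 12,
      "mem_pct": 41,
      "disk_pct": 37,
      "services": { "nginx": "active", "php-fpm": "active", "fail2ban": "active" }
    },
    "wp-db-replica-mel": {
      "role": "db-replica",
      "region": "MEL",
      "healthy": true,
      "replication": { "io": "Yes", "sql": "Yes", "lag_s": 0, "last_error": "" }
    }
  }
}
```

A flat file like this is trivial for both the Jinja2 template to produce and the dashboard to consume.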
Why Ansible Instead of Prometheus or Zabbix?
Traditional monitoring stacks like Prometheus require an exporter or agent on every server, a time-series database, and a separate visualisation layer like Grafana. For a ten-server cluster, that is significant overhead — both in compute resources and operational complexity. Ansible is already on the jumpbox for management tasks, connects over SSH (no agents needed), and can write its output as a simple JSON file. A lightweight HTML dashboard reads that JSON. Total added infrastructure: zero new servers, zero new services, zero new dependencies.
The Dashboard: Live Cluster Health at a Glance
The health check JSON powers a live dashboard served from the jumpbox itself. It is a single HTML page — no frameworks, no build tools, no dependencies — that fetches the status JSON every thirty seconds and renders the entire cluster state visually.
The dashboard is protected by HTTP basic authentication and accessible only at the jumpbox’s public IP address, which we do not publish. It is an internal operations tool, not a public-facing page.
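The nginx side of this is small. A minimal sketch of the server block it implies — the document root, port, and realm name are assumptions, and the htpasswd file would be generated with the htpasswd tool from apache2-utils:

```nginx
# Illustrative sketch; paths, port, and realm name are assumptions.
server {
    listen 80;
    root /var/www/dashboard;        # single HTML page plus status JSON
    index index.html;

    auth_basic "Cluster Dashboard";
    auth_basic_user_file /etc/nginx/.htpasswd;  # created with htpasswd
}
```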
Here is what it shows:
- Top banner — green “All Systems Operational” or red “N servers reporting issues”
- Server cards — grouped by region (Sydney, Brisbane, Melbourne), each card shows the server name, role, and health status with a green or red border
- Metric bars — CPU, memory, and disk usage as visual progress bars on each card, colour-coded from green through yellow to red
- Service badges — small indicators for each service (nginx, PHP-FPM, WordPress, SSL, fail2ban, MariaDB) showing active or inactive
- Replication panel — dedicated section for the two database replicas showing IO thread, SQL thread, lag in seconds, and any replication errors
- Security overview — auth failure counts, root login attempts, fail2ban coverage, and any suspicious process alerts
- Timestamp — when the data was last collected, so you can tell if the cron job is running
When everything is healthy, the page is a wall of green. When something breaks, the affected server card turns red and the top banner changes immediately. At a glance, you know the state of your entire cluster.
fail2ban: Intrusion Prevention
While building the monitoring, we also activated fail2ban on all six web servers. fail2ban watches authentication logs and automatically blocks IP addresses that show signs of brute-force attacks — too many failed SSH login attempts from the same source results in a temporary firewall ban.
The database servers do not need fail2ban — they have no public IP addresses and are unreachable from the internet. The jumpbox’s firewall restricts SSH to expected sources. But the six web servers, with their public IPs serving HTTP and HTTPS traffic, are the front line. fail2ban gives them automatic protection against credential stuffing and brute-force attacks.
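For reference, a typical fail2ban SSH jail looks like the sketch below. The thresholds shown are common defaults, not necessarily the values used on this cluster:

```ini
# Illustrative /etc/fail2ban/jail.local; values are assumptions.
[sshd]
enabled  = true
maxretry = 5      # failed attempts before a ban
findtime = 10m    # window in which failures are counted
bantime  = 1h     # length of the temporary firewall ban
```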
The Ansible health checks monitor fail2ban on every web server. If it stops running, the dashboard shows it immediately.
What the Security Posture Looks Like Now
Each phase of this project has tightened the security model. Here is the layered picture:
| Layer | What It Does | Added In |
|---|---|---|
| Anycast load balancer | Distributes traffic, hides origin IPs | Phase 1 |
| BinaryLane firewalls | Allow only SSH, HTTP, HTTPS, ICMP — deny everything else | Phase 1 |
| Private VPC | All database traffic on isolated private network | VPC migration |
| Removed DB public IPs | Database servers unreachable from internet entirely | VPC migration |
| Dedicated jumpbox | Single controlled SSH entry point for all servers | This phase |
| fail2ban | Automatic brute-force IP banning on web servers | This phase |
| Automated monitoring | Detects anomalies: unexpected ports, processes, auth spikes | This phase |
Defence in Depth
No single layer is the security strategy — they work together. The load balancer absorbs volumetric attacks. Firewalls block unexpected ports. The VPC isolates internal traffic. Private IPs make database servers unreachable. The jumpbox controls SSH access. fail2ban blocks brute-force attempts. And the monitoring layer watches all of it, alerting when something looks wrong. An attacker would need to defeat every layer to reach anything valuable.
The Full Architecture: Ten Servers, Three Cities
With the jumpbox and monitoring in place, the complete cluster architecture looks like this:
| Server | Role | Region | VPC IP | Public IP |
|---|---|---|---|---|
| jumpbox | Operations hub | BNE | 10.241.0.254 | [redacted] |
| wp-web-1-syd | Web | SYD | 10.241.2.1 | Behind LB |
| wp-web-2-syd | Web | SYD | 10.241.2.2 | Behind LB |
| wp-web-3-bne | Web | BNE | 10.241.2.3 | Behind LB |
| wp-web-4-bne | Web | BNE | 10.241.2.4 | Behind LB |
| wp-web-5-mel | Web | MEL | 10.241.2.5 | Behind LB |
| wp-web-6-mel | Web | MEL | 10.241.2.6 | Behind LB |
| wp-db-primary | DB Primary | SYD | 10.241.1.1 | None |
| wp-db-replica | DB Replica | BNE | 10.241.1.2 | None |
| wp-db-replica-mel | DB Replica | MEL | 10.241.1.3 | None |
For the full interactive topology — server connections, replication flows, VPC routing, and HyperDB read/write splitting — see the architecture diagram.
AI-Managed, Start to Finish
Like every phase of this infrastructure, the jumpbox deployment, Ansible monitoring setup, dashboard creation, and fail2ban activation were performed entirely by Claude using the BinaryLane MCP and SSH MCP.
Here is what Claude did in this phase:
| Step | Tool | What Happened |
|---|---|---|
| 1 | BinaryLane MCP | Created the jumpbox server in Brisbane and joined it to the VPC |
| 2 | BinaryLane MCP | Configured stateless firewall — SSH, HTTP, HTTPS, ICMP, DNS with explicit deny |
| 3 | SSH MCP | Deployed SSH keys so the jumpbox can reach all 9 cluster servers over VPC |
| 4 | SSH MCP | Installed Ansible, nginx, and apache2-utils on the jumpbox |
| 5 | SSH MCP | Wrote the full Ansible inventory — all 10 servers with VPC IPs, roles, and regions |
| 6 | SSH MCP | Created five health check task files covering server, app, database, replication, and security checks |
| 7 | SSH MCP | Wrote a Jinja2 template to assemble check results into structured JSON |
| 8 | SSH MCP | Built the dashboard — a single HTML/CSS/JS file with live polling and visual status indicators |
| 9 | SSH MCP | Configured nginx with basic auth to serve the dashboard securely |
| 10 | SSH MCP | Installed and activated fail2ban on all 6 web servers via Ansible ad-hoc commands |
| 11 | SSH MCP | Set up a cron job for five-minute automated health checks |
| 12 | Both | Ran the first health check, verified all 10 servers healthy, confirmed dashboard rendering |
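The five-minute schedule in step 11 comes down to a single crontab line on the jumpbox. A sketch with hypothetical paths:

```
# Illustrative crontab entry; playbook and log paths are assumptions.
*/5 * * * * ansible-playbook -i /opt/monitoring/inventory.ini /opt/monitoring/health.yml >> /var/log/health-check.log 2>&1
```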
Claude also identified and fixed issues during the build — discovering that the CPU measurement was capturing Ansible’s own process overhead (producing false 100% readings), that the MariaDB connectivity check needed output redirection to parse correctly, and that fail2ban needed a fresh install rather than just activation. Each issue was diagnosed, fixed, and verified without human intervention.
What We Monitor vs What We Could Monitor
The current monitoring covers the essentials — the checks that tell you whether the cluster is serving traffic correctly and whether anything looks suspicious. But this is a foundation, not a ceiling. The same Ansible framework can be extended to check:
- SSL certificate auto-renewal — alert when a Let’s Encrypt renewal fails, well before expiry crosses the 7-day warning threshold
- WordPress update status — flag when core, theme, or plugin updates are available
- Disk I/O latency — detect storage performance degradation before it affects users
- Log anomaly detection — pattern matching for unusual error rates in nginx or PHP-FPM logs
- Cross-region latency — measure VPC round-trip times between cities to detect network issues
Adding a new check means writing one Ansible task and adding a field to the JSON template. The dashboard picks it up automatically.
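As a concrete example of that extension pattern, the cross-region latency check might look like this sketch — the target host, task names, and variables are assumptions:

```yaml
# Hypothetical new check: VPC round-trip time to the Sydney DB primary.
- name: Measure VPC RTT
  command: ping -c 3 -q 10.241.1.1
  register: rtt_raw
  changed_when: false

# ping -q summary line looks like: "rtt min/avg/max/mdev = a/b/c/d ms"
- name: Extract average RTT in milliseconds
  set_fact:
    vpc_rtt_ms: "{{ rtt_raw.stdout | regex_search('= [\\d.]+/([\\d.]+)/', '\\1') | first | float }}"
```

Surfacing it on the dashboard would then be one new field in the Jinja2 JSON template.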
Cost
The jumpbox is the only new infrastructure cost. It runs on BinaryLane’s std-1vcpu plan — the same size as the database servers. Ansible, nginx, fail2ban, and the dashboard are all open-source software running on existing resources. The total cluster cost, including the new jumpbox, is approximately $70/month for ten servers, an anycast load balancer, and automated monitoring across three Australian cities.
💡 Try It Yourself
The MCP servers that make AI-managed infrastructure possible are open-source:
- BinaryLane MCP: github.com/termau/binarylane-mcp
- SSH MCP: github.com/termau/ssh-mcp
Install them, point Claude at your BinaryLane account, and start building. From a single server to a multi-city monitored cluster — the AI handles the infrastructure so you can focus on what matters.