A ten-server cluster spread across three cities is only as good as your ability to see what is happening inside it. After moving our database layer into a private VPC, we had strong network isolation — but no centralised visibility. No single place to check if all servers were healthy, if replication was lagging, or if someone was hammering SSH. We added a dedicated jumpbox, automated Ansible monitoring, a live health dashboard, and intrusion prevention across the cluster. Here is what we built and why.
The Problem: Blind Spots in a Distributed Cluster
Our HA WordPress cluster had nine servers across Sydney, Brisbane, and Melbourne — six web nodes, three database nodes, and an anycast load balancer tying them together. The initial build focused on redundancy and performance. The VPC migration locked down the database layer. But we were missing something fundamental: observability.
To check whether a server was healthy, we had to SSH into it manually and run commands. To verify replication, we had to connect to each replica and inspect SHOW REPLICA STATUS. To see if anyone was brute-forcing SSH, we had to grep the auth logs on each server individually. Across nine servers in three cities, this does not scale.
We needed three things:
- A secure entry point — a single server that can reach everything, without exposing the database layer
- Automated health checks — something that polls every server on a schedule and records the results
- A dashboard — a single page that shows green or red for every server, updated automatically
The Jumpbox: A Dedicated Operations Hub
Previously, we accessed database servers by bouncing SSH through whichever web server was in the right region. This worked, but it had problems — web servers should serve web traffic, not act as SSH relays. And with six possible jump points, there was no consistent operational entry point.
We built a dedicated jumpbox — a lightweight server in Brisbane whose sole purpose is operational access and monitoring. It sits on the same cross-region VPC as every other server, giving it direct private network connectivity to all nine nodes.
| Property | Value |
|---|---|
| Role | Jumpbox + monitoring hub |
| Region | Brisbane |
| Size | 1 vCPU, 2 GB RAM |
| VPC IP | 10.241.0.254 (infrastructure subnet) |
| Public IP | [redacted] |
| OS | Ubuntu 24.04 |
The jumpbox has SSH keys for every server in the cluster. From it, we can reach any web server or database server over the VPC — no public internet involved. This is the only server that can SSH into the private database nodes.
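This pattern maps naturally onto SSH's ProxyJump feature. Below is a sketch of what an operator's ~/.ssh/config might look like — the username, host aliases, and jumpbox address are illustrative placeholders, not the cluster's real values:

```
# Sketch of an operator's ~/.ssh/config; user, aliases, and the
# jumpbox address are hypothetical placeholders.
Host jumpbox
    HostName <jumpbox-public-ip>
    User ops

# Private database nodes have no public IPs, so hop through the jumpbox
# (assumes the operator's key is authorised on the target node)
Host wp-db-primary
    HostName 10.241.1.1
    User ops
    ProxyJump jumpbox
```

With something like this in place, a plain `ssh wp-db-primary` tunnels transparently through the jumpbox and onto the VPC.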
Single Point of Entry, Not Single Point of Failure
The jumpbox is the operational gateway — it is how we manage the cluster, not how the cluster serves traffic. If the jumpbox goes down, the website continues to run normally. Load balancing, web serving, database replication — none of these depend on it. We lose monitoring visibility and SSH access to the DB layer until it is restored, but the site stays up. BinaryLane’s VNC console provides emergency access if needed.
The Tenth Server
Adding the jumpbox brought our cluster to ten servers. The VPC addressing scheme made this clean — the jumpbox sits in its own infrastructure subnet, separate from both the web tier and the database tier:
| Subnet | Range | Purpose | Servers |
|---|---|---|---|
| Infrastructure | 10.241.0.x | Jumpbox, monitoring | 1 |
| Database | 10.241.1.x | MariaDB primary + replicas | 3 |
| Web | 10.241.2.x | Nginx + PHP-FPM + WordPress | 6 |
Ten servers, three subnets, three cities, one VPC. Every server can reach every other server over private IPs — and the jumpbox can see them all.
Ansible: Automated Health Checks
With the jumpbox in place, we installed Ansible as the monitoring engine. Ansible is a natural fit — it connects to servers over SSH (which the jumpbox already does), runs commands, collects results, and can template output into any format we want.
Every five minutes, a cron job on the jumpbox runs a health check playbook that polls all ten servers simultaneously. Here is what it checks:
Every Server (All 10)
| Check | Method | Healthy When |
|---|---|---|
| CPU usage | /proc/stat sampled over 1 second | < 90% |
| Memory usage | free -m | < 90% |
| Disk usage | df / | < 85% |
| Load average | /proc/loadavg | Reported (informational) |
| Uptime | /proc/uptime | Reported (informational) |
| Auth failures (5 min) | auth.log grep | Reported (informational) |
| Listening ports | ss -tlnp | Only expected ports |
| Suspicious processes | Process scan for crypto miners | None found |
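In Ansible terms, each row of this table becomes a small task. A minimal sketch of what the disk check might look like — task names, the registered variable, and file layout are assumptions, not the production playbook:

```yaml
# Illustrative base-check tasks; names are hypothetical.
- name: Sample root filesystem usage
  command: df --output=pcent /
  register: disk_raw
  changed_when: false

- name: Evaluate disk health (healthy when < 85%)
  set_fact:
    disk_pct: "{{ disk_raw.stdout_lines[-1] | trim | replace('%', '') | int }}"
    disk_ok: "{{ (disk_raw.stdout_lines[-1] | trim | replace('%', '') | int) < 85 }}"
```

The CPU, memory, and load checks follow the same shape: run a command, register the output, derive a boolean against the threshold.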
Web Servers (6 Nodes)
| Check | Method | Healthy When |
|---|---|---|
| Nginx | systemctl is-active | Active |
| PHP-FPM | systemctl is-active | Active |
| WordPress | curl https://localhost/ | HTTP 200 |
| Health endpoint | curl https://localhost/health | HTTP 200 |
| SSL certificate | openssl s_client | > 7 days until expiry |
| fail2ban | systemctl is-active | Active |
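The service checks follow the same register-and-evaluate pattern. A hedged sketch — curl flags and variable names are assumptions:

```yaml
# Illustrative web-tier checks; task and variable names are hypothetical.
- name: Check nginx service state
  command: systemctl is-active nginx
  register: nginx_state
  changed_when: false
  failed_when: false          # record the state, do not abort the play

- name: Check WordPress responds locally
  # -k because the certificate is issued for the site name, not localhost
  command: curl -sk -o /dev/null -w '%{http_code}' https://localhost/
  register: wp_http
  changed_when: false

- name: Evaluate (healthy when HTTP 200)
  set_fact:
    wp_ok: "{{ wp_http.stdout == '200' }}"
```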
Database Servers (3 Nodes)
| Check | Method | Healthy When |
|---|---|---|
| MariaDB | systemctl is-active | Active |
| Connectivity | mysql -e "SELECT 1" | Success |
| Thread count | SHOW STATUS | Reported (informational) |
| DB uptime | SHOW STATUS | Reported (informational) |
Replica Servers (2 Nodes)
| Check | Method | Healthy When |
|---|---|---|
| IO thread | SHOW REPLICA STATUS | Running |
| SQL thread | SHOW REPLICA STATUS | Running |
| Replication lag | Seconds_Behind_Master | < 30 seconds |
| Last error | Last_Error | Empty |
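These four rows might translate into a task pair like the sketch below. The exact column names vary between MariaDB versions (Slave_* vs Replica_*), so treat the patterns as assumptions:

```yaml
# Illustrative replication check; regex patterns and variable names
# are hypothetical, and column names differ by MariaDB version.
- name: Query replica status
  command: mysql -e "SHOW REPLICA STATUS\G"
  register: repl_raw
  changed_when: false

- name: Evaluate replication lag (healthy when < 30 s)
  set_fact:
    lag_s: "{{ repl_raw.stdout | regex_search('Seconds_Behind_Master:\\s*(\\d+)', '\\1') | first | int }}"
    lag_ok: "{{ (repl_raw.stdout | regex_search('Seconds_Behind_Master:\\s*(\\d+)', '\\1') | first | int) < 30 }}"
```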
That is over 80 individual checks running every five minutes, automatically, across three cities. The results are assembled into a single JSON file that captures the complete health state of the cluster at that moment.
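The article does not reproduce the file itself, but a status JSON of this kind might plausibly take a shape like the following — every field name here is a hypothetical illustration:

```json
{
  "generated_at": "2025-01-01T00:00:00Z",
  "cluster_healthy": true,
  "servers": {
    "wp-web-1-syd": {
      "role": "web",
      "region": "SYD",
      "healthy": true,
      "cpu_pct": 12,
      "mem_pct": 41,
      "disk_pct": 37,
      "services": { "nginx": "active", "php-fpm": "active", "fail2ban": "active" }
    },
    "wp-db-replica-mel": {
      "role": "db-replica",
      "region": "MEL",
      "healthy": true,
      "replication": { "io": "Yes", "sql": "Yes", "lag_s": 0, "last_error": "" }
    }
  }
}
```

A flat file like this is trivial for both the Jinja2 template to produce and the dashboard to consume.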
Why Ansible Instead of Prometheus or Zabbix?
Traditional monitoring stacks like Prometheus require an exporter or agent on every server, a time-series database, and a separate visualisation layer like Grafana. For a ten-server cluster, that is significant overhead — both in compute resources and operational complexity. Ansible is already on the jumpbox for management tasks, connects over SSH (no agents needed), and can write its output as a simple JSON file. A lightweight HTML dashboard reads that JSON. Total added infrastructure: zero new servers, zero new services, zero new dependencies.
The Dashboard: Live Cluster Health at a Glance
The health check JSON powers a live dashboard served from the jumpbox itself. It is a single HTML page — no frameworks, no build tools, no dependencies — that fetches the status JSON every thirty seconds and renders the entire cluster state visually.
The dashboard is protected by HTTP basic authentication and accessible only at the jumpbox’s public IP address, which we do not publish. It is an internal operations tool, not a public-facing page.
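The nginx side of this is small. A minimal sketch of the server block it implies — the document root, port, and realm name are assumptions, and the htpasswd file would be generated with the htpasswd tool from apache2-utils:

```nginx
# Illustrative sketch; paths, port, and realm name are assumptions.
server {
    listen 80;
    root /var/www/dashboard;        # single HTML page plus status JSON
    index index.html;

    auth_basic "Cluster Dashboard";
    auth_basic_user_file /etc/nginx/.htpasswd;  # created with htpasswd
}
```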
Here is what it shows:
- Top banner — green “All Systems Operational” or red “N servers reporting issues”
- Server cards — grouped by region (Sydney, Brisbane, Melbourne), each card shows the server name, role, and health status with a green or red border
- Metric bars — CPU, memory, and disk usage as visual progress bars on each card, colour-coded from green through yellow to red
- Service badges — small indicators for each service (nginx, PHP-FPM, WordPress, SSL, fail2ban, MariaDB) showing active or inactive
- Replication panel — dedicated section for the two database replicas showing IO thread, SQL thread, lag in seconds, and any replication errors
- Security overview — auth failure counts, root login attempts, fail2ban coverage, and any suspicious process alerts
- Timestamp — when the data was last collected, so you can tell if the cron job is running
When everything is healthy, the page is a wall of green. When something breaks, the affected server card turns red and the top banner changes immediately. At a glance, you know the state of your entire cluster.
fail2ban: Intrusion Prevention
While building the monitoring, we also activated fail2ban on all six web servers. fail2ban watches authentication logs and automatically blocks IP addresses that show signs of brute-force attacks — too many failed SSH login attempts from the same source results in a temporary firewall ban.
The database servers do not need fail2ban — they have no public IP addresses and are unreachable from the internet. The jumpbox’s firewall restricts SSH to expected sources. But the six web servers, with their public IPs serving HTTP and HTTPS traffic, are the front line. fail2ban gives them automatic protection against credential stuffing and brute-force attacks.
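For reference, a typical fail2ban SSH jail looks like the sketch below. The thresholds shown are common defaults, not necessarily the values used on this cluster:

```ini
# Illustrative /etc/fail2ban/jail.local; values are assumptions.
[sshd]
enabled  = true
maxretry = 5      # failed attempts before a ban
findtime = 10m    # window in which failures are counted
bantime  = 1h     # length of the temporary firewall ban
```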
The Ansible health checks monitor fail2ban on every web server. If it stops running, the dashboard shows it immediately.
What the Security Posture Looks Like Now
Each phase of this project has tightened the security model. Here is the layered picture:
| Layer | What It Does | Added In |
|---|---|---|
| Anycast load balancer | Distributes traffic, hides origin IPs | Phase 1 |
| BinaryLane firewalls | Allow only SSH, HTTP, HTTPS, ICMP — deny everything else | Phase 1 |
| Private VPC | All database traffic on isolated private network | VPC migration |
| Removed DB public IPs | Database servers unreachable from internet entirely | VPC migration |
| Dedicated jumpbox | Single controlled SSH entry point for all servers | This phase |
| fail2ban | Automatic brute-force IP banning on web servers | This phase |
| Automated monitoring | Detects anomalies: unexpected ports, processes, auth spikes | This phase |
Defence in Depth
No single layer is the security strategy — they work together. The load balancer absorbs volumetric attacks. Firewalls block unexpected ports. The VPC isolates internal traffic. Private IPs make database servers unreachable. The jumpbox controls SSH access. fail2ban blocks brute-force attempts. And the monitoring layer watches all of it, alerting when something looks wrong. An attacker would need to defeat every layer to reach anything valuable.
The Full Architecture: Ten Servers, Three Cities
With the jumpbox and monitoring in place, the complete cluster architecture looks like this:
| Server | Role | Region | VPC IP | Public IP |
|---|---|---|---|---|
| jumpbox | Operations hub | BNE | 10.241.0.254 | [redacted] |
| wp-web-1-syd | Web | SYD | 10.241.2.1 | Behind LB |
| wp-web-2-syd | Web | SYD | 10.241.2.2 | Behind LB |
| wp-web-3-bne | Web | BNE | 10.241.2.3 | Behind LB |
| wp-web-4-bne | Web | BNE | 10.241.2.4 | Behind LB |
| wp-web-5-mel | Web | MEL | 10.241.2.5 | Behind LB |
| wp-web-6-mel | Web | MEL | 10.241.2.6 | Behind LB |
| wp-db-primary | DB Primary | SYD | 10.241.1.1 | None |
| wp-db-replica | DB Replica | BNE | 10.241.1.2 | None |
| wp-db-replica-mel | DB Replica | MEL | 10.241.1.3 | None |
For the full interactive topology — server connections, replication flows, VPC routing, and HyperDB read/write splitting — see the architecture diagram.
AI-Managed, Start to Finish
Like every phase of this infrastructure, the jumpbox deployment, Ansible monitoring setup, dashboard creation, and fail2ban activation were performed entirely by Claude using the BinaryLane MCP and SSH MCP.
Here is what Claude did in this phase:
| Step | Tool | What Happened |
|---|---|---|
| 1 | BinaryLane MCP | Created the jumpbox server in Brisbane and joined it to the VPC |
| 2 | BinaryLane MCP | Configured stateless firewall — SSH, HTTP, HTTPS, ICMP, DNS with explicit deny |
| 3 | SSH MCP | Deployed SSH keys so the jumpbox can reach all 9 cluster servers over VPC |
| 4 | SSH MCP | Installed Ansible, nginx, and apache2-utils on the jumpbox |
| 5 | SSH MCP | Wrote the full Ansible inventory — all 10 servers with VPC IPs, roles, and regions |
| 6 | SSH MCP | Created five health check task files covering server, app, database, replication, and security checks |
| 7 | SSH MCP | Wrote a Jinja2 template to assemble check results into structured JSON |
| 8 | SSH MCP | Built the dashboard — a single HTML/CSS/JS file with live polling and visual status indicators |
| 9 | SSH MCP | Configured nginx with basic auth to serve the dashboard securely |
| 10 | SSH MCP | Installed and activated fail2ban on all 6 web servers via Ansible ad-hoc commands |
| 11 | SSH MCP | Set up a cron job for five-minute automated health checks |
| 12 | Both | Ran the first health check, verified all 10 servers healthy, confirmed dashboard rendering |
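The five-minute schedule in step 11 comes down to a single crontab line on the jumpbox. A sketch with hypothetical paths:

```
# Illustrative crontab entry; playbook and log paths are assumptions.
*/5 * * * * ansible-playbook -i /opt/monitoring/inventory.ini /opt/monitoring/health.yml >> /var/log/health-check.log 2>&1
```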
Claude also identified and fixed issues during the build — discovering that the CPU measurement was capturing Ansible’s own process overhead (producing false 100% readings), that the MariaDB connectivity check needed output redirection to parse correctly, and that fail2ban needed a fresh install rather than just activation. Each issue was diagnosed, fixed, and verified without human intervention.
What We Monitor vs What We Could Monitor
The current monitoring covers the essentials — the checks that tell you whether the cluster is serving traffic correctly and whether anything looks suspicious. But this is a foundation, not a ceiling. The same Ansible framework can be extended to check:
- SSL certificate auto-renewal — alert when a Let’s Encrypt renewal fails, well before expiry crosses the 7-day warning threshold
- WordPress update status — flag when core, theme, or plugin updates are available
- Disk I/O latency — detect storage performance degradation before it affects users
- Log anomaly detection — pattern matching for unusual error rates in nginx or PHP-FPM logs
- Cross-region latency — measure VPC round-trip times between cities to detect network issues
Adding a new check means writing one Ansible task and adding a field to the JSON template. The dashboard picks it up automatically.
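As a concrete example of that extension pattern, the cross-region latency check might look like this sketch — the target host, task names, and variables are assumptions:

```yaml
# Hypothetical new check: VPC round-trip time to the Sydney DB primary.
- name: Measure VPC RTT
  command: ping -c 3 -q 10.241.1.1
  register: rtt_raw
  changed_when: false

# ping -q summary line looks like: "rtt min/avg/max/mdev = a/b/c/d ms"
- name: Extract average RTT in milliseconds
  set_fact:
    vpc_rtt_ms: "{{ rtt_raw.stdout | regex_search('= [\\d.]+/([\\d.]+)/', '\\1') | first | float }}"
```

Surfacing it on the dashboard would then be one new field in the Jinja2 JSON template.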
Cost
The jumpbox is the only new infrastructure cost. It runs on BinaryLane’s std-1vcpu plan — the same size as the database servers. Ansible, nginx, fail2ban, and the dashboard are all open-source software running on existing resources. The total cluster cost, including the new jumpbox, is approximately $70/month for ten servers, an anycast load balancer, and automated monitoring across three Australian cities.
💡 Try It Yourself
The MCP servers that make AI-managed infrastructure possible are open-source:
- BinaryLane MCP: github.com/termau/binarylane-mcp
- SSH MCP: github.com/termau/ssh-mcp
Install them, point Claude at your BinaryLane account, and start building. From a single server to a multi-city monitored cluster — the AI handles the infrastructure so you can focus on what matters.