TotallyWildAi · agent-observability

Self-hosted AWS observability, in one Terraform stack per environment.

Prometheus + Grafana + a CloudWatch exporter + a Cost Explorer exporter on Fargate, fronted by Cloudflare Tunnel + Cloudflare Access. No public IP, no inbound firewall rule, no static AWS credentials in CI. Dashboards and alert rules ship as code from the repo and auto-provision on container start.

SSO-gated. The live Grafana sits behind Cloudflare Access — your email needs to be on the allowlist before the demo. Ask for an invite, or browse the screenshots below to see what's running without signing in.

5 Fargate services per env
3 cloud APIs scraped
5 built-in alert rules
~$55/mo per env, ap-southeast-2
0 public IPs / inbound rules

1. Infrastructure as code, per environment

Approach. A single Terraform root deploys the whole stack onto an existing AWS ECS cluster (or provisions a dedicated cluster if asked). Backend config is supplied at init time via -backend-config so the same code targets any AWS account; one S3 state file per environment, with native S3 conditional-write locking (no DynamoDB lock table). Container images pinned to specific versions in var.container_images.
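
As a rough illustration of the per-environment backend file described above, the sketch below renders a partial S3 backend config and the matching init command. The bucket name and state key layout are hypothetical; use_lockfile is the S3 backend setting for native lockfile locking in recent Terraform releases (1.10 and later), and the repo's real backend file may differ.

```python
def render_backend_hcl(bucket: str, env: str, region: str) -> str:
    """Return the text of a partial S3 backend config for one environment."""
    settings = {
        "bucket": f'"{bucket}"',
        # One state object per environment; this key layout is an assumption.
        "key": f'"agent-observability/{env}/terraform.tfstate"',
        "region": f'"{region}"',
        # S3-native lockfile instead of a DynamoDB lock table.
        "use_lockfile": "true",
        "encrypt": "true",
    }
    width = max(len(name) for name in settings)
    return "".join(
        f"{name.ljust(width)} = {value}\n" for name, value in settings.items()
    )


def init_command(backend_file: str) -> list[str]:
    """Build the terraform init invocation that points at the rendered file."""
    return ["terraform", "init", f"-backend-config={backend_file}"]


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    print(render_backend_hcl("example-tfstate-bucket", "test", "ap-southeast-2"))
    print(" ".join(init_command("../envs/test.backend.hcl")))
```

Because the backend values live outside the Terraform code, pointing the same root at another account only requires a different rendered file.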

2. AWS metrics & cost in one timeseries store

Approach. Two purpose-built exporters write to one Prometheus. yace (Yet Another CloudWatch Exporter) covers operational metrics — ECS, RDS, ALB, NLB, ElastiCache, optional AWS/Billing. cost_explorer_exporter + a Python forecast sidecar hit the Cost Explorer API for richer cost data than CloudWatch can provide. Both register with Cloud Map; Prometheus DNS-discovers them.

What's scraped

| Exporter | Source | Surfaces |
| --- | --- | --- |
| yace | CloudWatch (cloudwatch:GetMetricData) | ECS CPU/mem, ECS/ContainerInsights task counts, RDS CPU/conns/freeable-mem/IOPS, ALB p50/p95/p99 + 5xx + healthy-host, NLB flow + healthy-host, ElastiCache CPU/mem/hits/misses/network |
| yace | CloudWatch AWS/Billing (us-east-1) | Estimated charges total + per-service (backup signal; Cost Explorer is the primary) |
| cost-exporter | Cost Explorer ce:GetCostAndUsage | MTD unblended cost by service, credits applied, daily-granularity trend |
| cost-forecast sidecar | Cost Explorer ce:GetCostForecast | End-of-month forecast + 30-day forecast with 80% prediction interval, UNBLENDED & AMORTIZED |

Cost Explorer is billed at $0.01 per API request, so polling intervals default to 8h (usage) and 24h (forecast) to keep the API spend small. At one request per poll that is 4 requests per day, or about $1.20 over a 30-day month.

Grafana panel showing end-of-month cost forecast with 80% prediction interval and MTD usage by service
The forecast panel on the single-pane-of-glass dashboard — sums end-of-month and 30-day forecasts from ce:GetCostForecast, sourced from the aws_cost_forecast_usd{period,metric} series.
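
A minimal sketch of how a forecast sidecar along these lines could turn ce:GetCostForecast responses into the aws_cost_forecast_usd{period,metric} series. The get_cost_forecast call and its response fields are the boto3 Cost Explorer client's; the function names, the "eom" label value and the lower-cased metric label are assumptions, not the repo's actual code.

```python
from datetime import date, timedelta


def end_of_month_window(today: date) -> tuple[str, str]:
    """Return (start, end) ISO dates from today to the first of next month.

    Cost Explorer treats the end date as exclusive.
    """
    first_of_next = (today.replace(day=1) + timedelta(days=32)).replace(day=1)
    return today.isoformat(), first_of_next.isoformat()


def fetch_forecast(ce_client, start: str, end: str, metric: str) -> dict[str, float]:
    """Call GetCostForecast once and return mean, lower and upper bounds in USD."""
    response = ce_client.get_cost_forecast(
        TimePeriod={"Start": start, "End": end},
        Metric=metric,  # e.g. "UNBLENDED_COST" or "AMORTIZED_COST"
        Granularity="MONTHLY",
        PredictionIntervalLevel=80,
    )
    periods = response["ForecastResultsByTime"]
    return {
        "mean": float(response["Total"]["Amount"]),
        # Summing bounds across periods is a simplification that only matters
        # when the window spans more than one calendar month.
        "lower": sum(float(p["PredictionIntervalLowerBound"]) for p in periods),
        "upper": sum(float(p["PredictionIntervalUpperBound"]) for p in periods),
    }


def render_metric(period: str, metric: str, mean_usd: float) -> str:
    """Format one sample in the Prometheus text exposition format."""
    label = metric.removesuffix("_COST").lower()  # assumed label convention
    return f'aws_cost_forecast_usd{{period="{period}",metric="{label}"}} {mean_usd:.2f}\n'


def main() -> None:
    import boto3  # third-party; imported here so the helpers stay dependency-free

    ce = boto3.client("ce", region_name="us-east-1")
    start, end = end_of_month_window(date.today())
    for metric in ("UNBLENDED_COST", "AMORTIZED_COST"):
        forecast = fetch_forecast(ce, start, end, metric)
        print(render_metric("eom", metric, forecast["mean"]), end="")
```

Each call to main() issues two billable Cost Explorer requests (one per metric), which is why a long polling interval matters for a sidecar like this.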

3. Dashboards & alerts as code

Approach. Drop a JSON file in dashboards/; Terraform discovers it at plan time via fileset(), ships it as an env var on the Grafana container, and the entrypoint writes it to /tmp/provisioning/dashboards/ where Grafana auto-loads it on startup. Five Grafana unified-alerting rules ship the same way, inlined in the Grafana module. No custom image build, no UI clicks to reproduce a deploy.
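
The entrypoint step can be pictured with the sketch below, written in Python purely for illustration. It assumes the dashboards arrive in a single environment variable holding a JSON object of file name to dashboard JSON text; the variable name GRAFANA_DASHBOARDS_JSON and that encoding are hypothetical, and the repo's actual entrypoint is not shown here.

```python
import json
import os
from pathlib import Path


def write_dashboards(payload: str, target_dir: Path) -> list[Path]:
    """Write each dashboard from a JSON object of {file name: dashboard JSON text}."""
    dashboards = json.loads(payload)
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, body in sorted(dashboards.items()):
        # Fail fast on a malformed dashboard instead of at Grafana load time.
        json.loads(body)
        # Keep only the base name so a crafted key cannot escape target_dir.
        path = target_dir / Path(name).name
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written


def main() -> None:
    # Hypothetical variable name; the default is an empty JSON object.
    payload = os.environ.get("GRAFANA_DASHBOARDS_JSON", "{}")
    for path in write_dashboards(payload, Path("/tmp/provisioning/dashboards")):
        print(f"provisioned {path}")
```

Writing the files at container start is what lets a stock Grafana image be used: the dashboards travel in the task definition rather than in a custom image layer.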

What's running

The single-pane-of-glass Grafana dashboard showing ECS, RDS, ALB, ElastiCache and AWS cost panels
The auto-provisioned dashboard at /dashboards/aws-single-pane-of-glass — every panel sourced from a single Prometheus datasource Grafana wired up on its own at task start.

Alert rules

| Rule | Fires when | Severity |
| --- | --- | --- |
| ecs_service_unhealthy | running tasks < desired for 5m | critical |
| rds_cpu_high | RDS CPU > 80% for 10m | warning |
| alb_5xx_spike | ALB 5xx rate > 5/min for 5m | warning |
| monthly_forecast_exceeds_budget | end-of-month forecast > alert_monthly_budget_usd for 1h | info |
| prometheus_scrape_target_down | up == 0 for 2m | warning |
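
Each rule above pairs a condition with a hold duration ("for 5m", "for 2m"). The sketch below illustrates that semantics only: an alert fires when the condition has been continuously true for at least the hold period, and any healthy sample resets the clock. It is a simplified model written for this page, not Grafana's evaluator, and the function name is hypothetical.

```python
def fires(samples: list[tuple[float, bool]], hold_seconds: float) -> bool:
    """Return True if the condition has held continuously for hold_seconds.

    samples is a time-ordered list of (unix_timestamp, condition_is_true),
    one entry per rule evaluation.
    """
    breach_start = None
    for timestamp, breached in samples:
        if breached:
            if breach_start is None:
                breach_start = timestamp  # start of the current breach streak
        else:
            breach_start = None  # a healthy evaluation resets the streak
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


if __name__ == "__main__":
    # up == 0 for 2m, evaluated once a minute: three consecutive breaches.
    print(fires([(0, True), (60, True), (120, True)], hold_seconds=120))
    # A single healthy scrape in the middle resets the two-minute clock.
    print(fires([(0, True), (60, False), (120, True)], hold_seconds=120))
```

The hold period is what keeps a single missed scrape or a brief CPU spike from paging anyone.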

Rules evaluate but won't notify until a contact point is configured in the Grafana UI — each option (Slack webhook, SMTP, PagerDuty key) is an env-specific secret, so we deliberately don't bake one into Terraform.

Grafana Alerting page showing the 5 built-in rules in the agent-obs/infrastructure group
Grafana → Alerting → Alert rules — all 5 rules provisioned from terraform/modules/grafana/main.tf, no UI clicks to reproduce.

4. Zero-trust browser access — no public IP, no inbound rule

Approach. A cloudflared Fargate task in the VPC maintains a persistent outbound tunnel to the Cloudflare edge. Browser traffic hits the public Grafana hostname, Cloudflare Access challenges the request through the configured SSO provider, then the request rides the tunnel back to Grafana on its private IP. No public ALB, no IGW route for Grafana, no inbound port 443 exposed anywhere on the AWS side.

Browser → Cloudflare edge (SSO challenge) → persistent outbound tunnel established by cloudflared → Grafana on a private IP. No public ALB, no IGW route for Grafana, no inbound port 443 exposed on AWS.

5. CI/CD via GitHub Actions + OIDC — no static AWS keys

Approach. One workflow, two jobs. plan runs on every PR touching terraform/** or dashboards/**; apply runs only on push to main, gated behind a GitHub Environment named test that can require reviewer approval. AWS auth is OIDC: a federated trust between token.actions.githubusercontent.com and an IAM role created by this same Terraform stack. No static AWS credentials are stored in GitHub Secrets.
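
For readers unfamiliar with the federated trust, the sketch below builds the kind of IAM trust policy such a deploy role typically carries. The account ID and repository name are placeholders, and scoping the sub claim to pull requests plus the test environment is an assumption about how the role might be restricted, not a copy of the stack's actual policy.

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, environment: str) -> dict:
    """Build a trust policy letting GitHub Actions assume a role via OIDC."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience requested by aws-actions/configure-aws-credentials.
                        f"{provider}:aud": "sts.amazonaws.com",
                        # A list under StringEquals matches any one of its values.
                        f"{provider}:sub": [
                            f"repo:{repo}:pull_request",
                            f"repo:{repo}:environment:{environment}",
                        ],
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and repository, for illustration only.
    policy = github_oidc_trust_policy(
        "123456789012", "example-org/agent-observability", "test"
    )
    print(json.dumps(policy, indent=2))
```

Pinning the sub claim is what stops other repositories, or other branches and environments of the same repository, from assuming the role with their own GitHub-issued tokens.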

Pipeline

.github/workflows/terraform.yml

  1. materialise env files: envs/test.tfvars rendered from the TFVARS_TEST repository variable, envs/test.backend.hcl emitted by a heredoc inside the workflow itself (single source of truth, no drift)
  2. aws-actions/configure-aws-credentials — assumes AWS_DEPLOY_ROLE_ARN via OIDC, no static keys
  3. terraform init with -backend-config=../envs/test.backend.hcl
  4. terraform validate
  5. terraform plan — uploaded as artifact on PRs for review
  6. terraform apply -auto-approve tfplan — push-to-main only, behind the test environment gate

The Cloudflare side still uses an API token (CLOUDFLARE_API_TOKEN in GitHub Secrets) — Cloudflare doesn't offer GitHub OIDC federation yet, so it's the one static credential the workflow can't avoid.

Bootstrap

  1. First apply runs locally with developer AWS creds — provisions the IAM OIDC provider (skippable if your account already has one) and the GitHub deploy role.
  2. Capture the role ARN from terraform output github_actions_role_arn.
  3. GitHub repo settings — set AWS_DEPLOY_ROLE_ARN and TFVARS_TEST as repository variables, CLOUDFLARE_API_TOKEN as a repository secret.
  4. Future deploys run from CI — PR opens, plan attached as artifact, merge to main runs apply.

Capability scorecard

| Capability | Status | Notes |
| --- | --- | --- |
| Infrastructure as code, per env | Fully demonstrated | One Terraform root, S3 state w/ native locking, container images pinned, EFS-backed Prom & Grafana |
| AWS metrics & cost collection | Fully demonstrated | CloudWatch via yace (6 namespaces) + Cost Explorer with forecast sidecar, all into one Prometheus |
| Dashboards & alerts as code | Fully demonstrated | 1 dashboard + 5 alert rules auto-provisioned on container start, no custom image required |
| Zero-trust browser access | Fully demonstrated | Cloudflare Tunnel + Access SSO, 0 public IPs, 0 inbound firewall rules, tunnel token in Secrets Manager |
| CI/CD via OIDC | Fully demonstrated | Plan on PR, apply on merge, AWS auth via OIDC; Cloudflare API token is the one unavoidable static cred |