# Design an Archival Strategy for Health Data
Health checks are high-volume and noisy. Pharox leaves persistence to storage adapters so you can decide how much history to keep and how to compact it. Use the playbook below to capture actionable history without letting raw events grow unbounded.
## Goals
- Keep enough raw events to debug incidents and provider issues.
- Summarise old data into compact rollups that power dashboards and SLOs.
- Make retention explicit and automated so operators are not surprised by storage growth.
## Recommended Data Model

### Raw events (short retention)

- Table keyed by `(checked_at, proxy_id)` with columns for `pool_id`, `status`, `latency_ms`, `status_code`, and `error_message`.
- Partition by day (or week) on `checked_at` to make drops cheap.
- Index on `(pool_id, checked_at)` for pool-level queries; add `(proxy_id, checked_at)` if you need per-proxy history.
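As a sketch, the raw-event table above might be declared like this in PostgreSQL; column types, the sample partition name, and its date bounds are illustrative assumptions, not a fixed schema:

```sql
-- Raw health-check events, partitioned by day for cheap drops.
CREATE TABLE health_check_event (
    checked_at    timestamptz NOT NULL,
    proxy_id      bigint      NOT NULL,
    pool_id       bigint      NOT NULL,
    status        text        NOT NULL,   -- e.g. ACTIVE / SLOW / INACTIVE
    latency_ms    integer,
    status_code   smallint,
    error_message text,
    PRIMARY KEY (checked_at, proxy_id)    -- must include the partition key
) PARTITION BY RANGE (checked_at);

-- One partition per day; create upcoming partitions from a scheduled job.
CREATE TABLE health_check_event_2024_06_01
    PARTITION OF health_check_event
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02');

-- Pool-level query path; add (proxy_id, checked_at) if you need it.
CREATE INDEX ON health_check_event (pool_id, checked_at);
```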
### Rollups (long retention)

- Hourly or daily grain keyed by `(window_start, pool_id)` and optionally `proxy_id`.
- Metrics to store: counts per status, success/error rate, p50/p95 latency, max latency, and the most common error reason.
- Built from raw events and safe to regenerate if you need to change the aggregation logic.
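A matching rollup table could be sketched as follows; the column types are assumptions, but the key mirrors the `(window_start, pool_id)` grain described above and backs the `ON CONFLICT` upsert in the job sketch:

```sql
-- Hourly rollups of health-check events, kept for months.
CREATE TABLE health_check_rollup_hourly (
    window_start   timestamptz NOT NULL,
    pool_id        bigint      NOT NULL,
    total          bigint      NOT NULL,
    active         bigint      NOT NULL,
    slow           bigint      NOT NULL,
    inactive       bigint      NOT NULL,
    error_rate     double precision,
    latency_p50_ms double precision,
    latency_p95_ms double precision,
    latency_max_ms integer,
    PRIMARY KEY (window_start, pool_id)  -- enables idempotent upserts
);
```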
## Retention Policy
- Choose a raw retention window (7–30 days) that matches your debugging needs.
- Drop old partitions on a schedule (cron/Airflow/RQ worker). Avoid ad hoc deletes on large tables.
- Keep rollups for months since they are compact; expire them separately if needed.
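One way to implement the scheduled partition drop, assuming daily partitions named `health_check_event_YYYY_MM_DD` as in the sketch above (the naming scheme and 14-day window are assumptions; adapt both to your setup):

```sql
-- Drop raw partitions older than the retention window (14 days here).
-- Relies on zero-padded date suffixes so name order matches date order.
DO $$
DECLARE
    part text;
BEGIN
    FOR part IN
        SELECT c.relname
        FROM pg_inherits i
        JOIN pg_class c ON c.oid = i.inhrelid
        WHERE i.inhparent = 'health_check_event'::regclass
          AND c.relname < format(
                'health_check_event_%s',
                to_char(now() - interval '14 days', 'YYYY_MM_DD'))
    LOOP
        EXECUTE format('DROP TABLE %I', part);
    END LOOP;
END $$;
```

Run this from whatever scheduler you already use (cron, Airflow, an RQ worker); dropping a partition is metadata-only and far cheaper than `DELETE` on a large table.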
## Rollup Job Sketch (PostgreSQL)
```sql
-- Materialise hourly rollups; run each hour after the window closes.
INSERT INTO health_check_rollup_hourly (
    window_start,
    pool_id,
    total,
    active,
    slow,
    inactive,
    error_rate,
    latency_p50_ms,
    latency_p95_ms,
    latency_max_ms
)
SELECT
    date_trunc('hour', checked_at) AS window_start,
    pool_id,
    count(*) AS total,
    count(*) FILTER (WHERE status = 'ACTIVE') AS active,
    count(*) FILTER (WHERE status = 'SLOW') AS slow,
    count(*) FILTER (WHERE status = 'INACTIVE') AS inactive,
    count(*) FILTER (WHERE status = 'INACTIVE')::float / NULLIF(count(*), 0) AS error_rate,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms) AS latency_p50_ms,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS latency_p95_ms,
    max(latency_ms) AS latency_max_ms
FROM health_check_event
WHERE checked_at >= date_trunc('hour', now()) - interval '1 hour'
  AND checked_at < date_trunc('hour', now())
GROUP BY 1, 2
ON CONFLICT (window_start, pool_id) DO UPDATE
SET
    total = EXCLUDED.total,
    active = EXCLUDED.active,
    slow = EXCLUDED.slow,
    inactive = EXCLUDED.inactive,
    error_rate = EXCLUDED.error_rate,
    latency_p50_ms = EXCLUDED.latency_p50_ms,
    latency_p95_ms = EXCLUDED.latency_p95_ms,
    latency_max_ms = EXCLUDED.latency_max_ms;
```
Notes:

- Use `ON CONFLICT` so reruns are idempotent.
- Replace `percentile_cont` with approximate quantiles if your adapter supports them (e.g., `tdigest` or `approx_percentile`).
- For multi-tenant systems, include `consumer_id` or `org_id` in the keys.
## Operational Checklist
- [ ] Schedule raw partition drops (e.g., keep 14 days) and monitor failures.
- [ ] Schedule rollup jobs after each window closes; alert on lag.
- [ ] Validate rollups feed your dashboards (Prometheus scraper can scrape rollup exporters or you can expose the rollup table via a metrics bridge).
- [ ] Document the windows and retention in your adapter README so downstream services know what is available.
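A simple lag check for the alerting item above; this queries the rollup table sketched earlier and the two-hour threshold is an assumption to tune:

```sql
-- Hours since the newest rollup window; alert if lag exceeds your SLO
-- (e.g. > 2 hours for an hourly job).
SELECT extract(epoch FROM now() - max(window_start)) / 3600 AS lag_hours
FROM health_check_rollup_hourly;
```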
## Adapter Integration Tips

- Keep archival logic in the adapter layer or a co-located worker; avoid baking it into `ProxyManager`.
- When exporting metrics, emit counters/histograms from rollups rather than raw rows to reduce Prometheus pressure.
- For cloud databases with native TTL/partitioning, use the managed features instead of custom jobs; the shape above still applies.