Long-term metrics archive with an index for point queries.
Prometheus alone is not built for retention beyond a couple of weeks, Thanos is an operational nightmare, and Splunk Metrics costs more than the data itself.
MILA packs time series into binary .mila blocks with Gorilla compression, archives them to S3 under WORM, and exposes PromQL-compatible queries through a hibernatable query layer, without a permanent 24/7 cluster.
Large organizations generate terabytes of metrics daily, and regulators make many of them keep that data for years. No existing solution does this economically: either you pay for 24/7 compute, or you have data without an index.
Excellent for hot search, but for 90+ day retention you need Thanos or Cortex — meaning Cassandra, Bigtable, or an object store plus a 3-person ops team. The operational cost exceeds the value of long retention.
Enterprise SaaS with a six-figure annual minimum, every metric counted. The customer pays for 24/7 compute on data nobody reads. The worst ROI in the observability category.
Great compression, great queries — but you still maintain a cluster continuously because queries need a hot in-memory index. Storage is cheap, compute is expensive. For multi-year retention the economics don't work.
The cheapest archive, but a point query takes an hour. Without an index on metric name and label set, every query is a full scan. An auditor asks for one series from a year ago — waits 60 minutes for 4 data points.
A binary format optimized for time-series, nightly D+1 compaction, a hibernatable PromQL-compatible query layer — all in one S3-native stack, compatible with existing Prometheus / Grafana tooling.
Delta-of-delta on timestamps, XOR on float64 values. A raw point is 16 bytes (8-byte timestamp + 8-byte value) and compresses to about 1.37 bytes on average, roughly 12×. Expect around 10× for regularly scraped metrics. Less S3 space and faster queries, since there is less to read and decode.
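A minimal Python sketch of the two transforms may help show why regular metrics compress so well. It covers only the arithmetic (delta-of-delta on timestamps, XOR of IEEE-754 bit patterns), not the variable-length bit packing that follows in Gorilla, and the function names are illustrative assumptions rather than MILA's API.

```python
import struct

def delta_of_delta(timestamps):
    """Return (first_timestamp, first_delta, list of delta-of-deltas)."""
    if len(timestamps) < 2:
        return (timestamps[0] if timestamps else None), None, []
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def float_bits(x):
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    """XOR each value's bit pattern with the previous value's bit pattern."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]

# A 15 s scrape interval with one sample arriving a second late.
ts = [1700000000, 1700000015, 1700000030, 1700000045, 1700000061]
first_ts, first_delta, dods = delta_of_delta(ts)

# A gauge that barely moves: identical neighbours XOR to zero.
vals = [42.0, 42.0, 42.5, 42.5]
xors = xor_stream(vals)
```

With a steady scrape interval most delta-of-deltas are zero, and an unchanged value XORs to zero, so both streams are dominated by values an encoder can store in a single bit.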
Per-block dictionary: unique label values mapped to 2-byte IDs. Repeating labels (host, region, service) take 2 bytes instead of 50+. Built into the .mila format, transparent to the query layer.
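As an illustration of the idea, the sketch below maps label values to 2-byte IDs within one block. The class name, the big-endian layout, and the overflow behaviour are assumptions made for the example; the actual .mila dictionary layout is not described here.

```python
import struct

class LabelDictionary:
    """Per-block dictionary: each unique label value gets a 2-byte ID."""

    MAX_ENTRIES = 0x10000  # a 2-byte ID can address 65,536 values

    def __init__(self):
        self._ids = {}      # value -> integer ID
        self._values = []   # integer ID -> value

    def encode(self, value):
        """Return the 2-byte ID for a value, assigning one on first sight."""
        if value not in self._ids:
            if len(self._values) >= self.MAX_ENTRIES:
                raise OverflowError("dictionary full for this block")
            self._ids[value] = len(self._values)
            self._values.append(value)
        return struct.pack(">H", self._ids[value])

    def decode(self, raw):
        """Map a 2-byte ID back to the original label value."""
        (index,) = struct.unpack(">H", raw)
        return self._values[index]

d = LabelDictionary()
region_id = d.encode("eu-central-1")
service_id = d.encode("payments-gateway-production")
```

Every later occurrence of `eu-central-1` in the block costs the same 2 bytes, whatever the length of the original string.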
A single CronJob merges the last 24h of staging, promotes to WORM archive. No distributed coordination, no race conditions, no distributed locks. Proven pattern from ELSA, ported 1:1.
Zero coordination
Query nodes scale to zero when idle for 30 min. Cold start <10 s. You pay for storage, not for an idle compute cluster. An auditor queries once a month; MILA sits at 720h × $0/h the rest of the time.
$0 idle compute
PromQL subset covering 90% of dashboard queries: selectors, aggregations, rate, over_time, histogram_quantile. Grafana datasource plugin: zero changes to existing dashboards.
S3 Object Lock COMPLIANCE per block, GDPR pseudonymization at ingest for PII labels (user_id, customer_id), chain-hashed audit trail, per-series legal hold. Built in, not bolted on.
MiFID II · HIPAA · GDPR
The defining property of MILA: the query layer does not need to run continuously.
Ingestors collect metrics via Prometheus remote_write. A nightly CronJob builds .mila blocks. S3 holds the archive.
Redis metastore and query pods are needed only when someone is querying. An auditor queries once a month, an analyst writes a report once a quarter, a prosecutor requests data once a year. You pay for storage, not for an idle cluster.
Real scenarios for organizations that must archive billions of metric points for years, with the ability to answer a regulator's question in minutes, not hours.
Best execution reporting requires retention of tick data and execution quality metrics for 5–7 years. A raw tick stream is millions of points per day per instrument. Splunk pricing at that volume is prohibitive, and raw Parquet has no index to answer a single auditor question.
Carriers report service KPIs (call drop rate, throughput, latency per cell, per service) to regulators. Volume: millions of series × minute granularity × 3–5 year retention. Classic TSDB is uneconomical, raw archive has no index.
Smart meters send readings every 15 minutes. 1M devices × 96 readings/day ≈ 100M points/day. Energy regulators require 4–10 years of retention. A classic use case for long-term archival: everyone wants to keep the data, few know how.
ICU monitors, wearables, medical devices generate continuous patient telemetry. HIPAA requires 6+ years of retention with an access audit trail. Classic TSDB breaks under the volume, classic EHR has no index on metrics.
MILA meets regulatory requirements through architecture, not through end-of-project configuration. Immutability, audit trail, and PII label pseudonymization are first-class concepts of the system.
.mila blocks locked immediately after promotion from staging. COMPLIANCE mode default (un-deletable), GOVERNANCE with MFA for selected use cases. Retention configurable per tenant per metric pattern.
Pseudonymization at ingest (default): PII labels (user_id, customer_id, email_hash) are hashed with HMAC-SHA256 using a per-tenant secret. Tombstone fallback (opt-in) for retroactive removal of existing data.
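A minimal sketch of keyed hashing for PII labels, using Python's standard `hmac` and `hashlib` modules. The function name, the hex output format, and the hard-coded label set are assumptions for illustration; how MILA stores and rotates per-tenant secrets is not covered.

```python
import hashlib
import hmac

# Labels treated as PII in this sketch (taken from the examples in the text).
PII_LABELS = {"user_id", "customer_id", "email_hash"}

def pseudonymize_labels(labels, tenant_secret):
    """Replace PII label values with HMAC-SHA256 digests keyed per tenant."""
    out = {}
    for name, value in labels.items():
        if name in PII_LABELS:
            mac = hmac.new(tenant_secret, value.encode("utf-8"), hashlib.sha256)
            out[name] = mac.hexdigest()
        else:
            out[name] = value  # non-PII labels pass through unchanged
    return out

sample = {"user_id": "u-1842", "region": "eu-central-1"}
protected = pseudonymize_labels(sample, b"tenant-a-secret")
```

The same input under the same tenant secret always yields the same digest, so series stay joinable after hashing, while a different tenant's secret produces unrelated values.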
Config-driven YAML per metric pattern + regulation. MiFID II metrics 5–7 years, HIPAA monitoring 6+ years, internal metrics 1 year. Audited separately — its own Object Lock for policy changes.
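The lookup such a policy implies could be sketched as first-match-wins over glob patterns. The metric name patterns below are hypothetical, the YAML schema itself is not shown, and the retention figures simply take the upper or lower bounds quoted above.

```python
from fnmatch import fnmatchcase

# (metric pattern, regulation, retention in years); first match wins.
# Patterns are hypothetical examples, not MILA's shipped configuration.
RETENTION_RULES = [
    ("mifid_*", "MiFID II", 7),
    ("hipaa_*", "HIPAA", 6),
    ("*", "internal", 1),
]

def retention_for(metric_name, rules=RETENTION_RULES):
    """Return (regulation, years) for the first pattern matching the metric."""
    for pattern, regulation, years in rules:
        if fnmatchcase(metric_name, pattern):
            return regulation, years
    raise LookupError(f"no retention rule matches {metric_name!r}")
```

Keeping a catch-all rule last means every metric resolves to some retention period, which is easier to audit than an implicit default.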
Every operation (COMPACTION, READ, EXPORT, RETAIN, LEGAL_HOLD) logged with a cryptographic link to the preceding entry. Embedded AuditTrailWriter, no circular dependencies. Detection of audit log tampering is deterministic.
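The chaining can be illustrated with a short sketch in which each entry's hash covers the previous entry's hash. The record fields, the JSON serialization, and the all-zero genesis value are assumptions for the example, not the AuditTrailWriter format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry

def entry_hash(prev_hash, operation, detail):
    """Hash an entry together with the hash of its predecessor."""
    payload = json.dumps(
        {"prev": prev_hash, "op": operation, "detail": detail},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append(log, operation, detail):
    """Append an entry linked to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "op": operation, "detail": detail,
                "hash": entry_hash(prev, operation, detail)})

def verify(log):
    """Return the index of the first broken entry, or None if the chain holds."""
    prev = GENESIS
    for index, entry in enumerate(log):
        expected = entry_hash(entry["prev"], entry["op"], entry["detail"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return index
        prev = entry["hash"]
    return None
```

Editing or removing any entry changes the hash its successor points to, so verification is a single linear pass with no external state beyond the log itself.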
Entity-scoped hold: deletion lock for specific series (e.g., all metrics of customer X). The system automatically identifies affected blocks and applies an S3 Object Lock legal hold flag. Release requires four-eyes and MFA.
Per-metric per-tenant cardinality budget with overflow alerts. Sampled mode fallback protects against metric explosion (a misconfigured client flooding with millions of unique label sets). The auditor sees not only what we store, but also why.
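One way to picture the budget is a per-(tenant, metric) set of known label sets with a hard cap. The class below is a simplified in-memory sketch under that assumption; the names are hypothetical and the sampling policy applied to overflowing series is not modelled.

```python
class CardinalityBudget:
    """Track distinct label sets per (tenant, metric) against a fixed limit."""

    def __init__(self, limit):
        self.limit = limit
        self._series = {}  # (tenant, metric) -> set of frozen label sets

    def admit(self, tenant, metric, labels):
        """Return "accept" for in-budget series, "sampled" once over budget."""
        known = self._series.setdefault((tenant, metric), set())
        label_set = frozenset(labels.items())
        if label_set in known:
            return "accept"          # existing series never counts twice
        if len(known) < self.limit:
            known.add(label_set)
            return "accept"          # new series, still within budget
        return "sampled"             # budget exhausted: fall back and alert

    def usage(self, tenant, metric):
        """Number of distinct series currently admitted for this metric."""
        return len(self._series.get((tenant, metric), ()))
```

Because existing series are always accepted, a misconfigured client can only degrade its own new series, not the ones already being archived.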
Four layers, no shared state — ingest, format, compaction, and query run independently, scale separately, and rebuild from S3 if any is lost.
Prometheus remote_write v2.0, OTLP HTTP, custom HMAC POST. Per-tenant rate limiting, cardinality protection, authentication. Staging area locally + S3 backup every 5 min.
.mila blocks: Gorilla compression (delta-of-delta on timestamps, XOR on float64 values), dictionary-encoded labels, sparse index by time, bloom filter (ELBF v1, shared with ELSA). Range-GET friendly.
Singleton (zero coordination overhead). K-way merge sorted by (series_id, timestamp). Atomic upload with S3 conditional PUT. Redis metastore rebuild via Lua scripts.
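The merge step can be sketched with Python's `heapq.merge`, which lazily combines inputs that are already sorted. The tuple layout `(series_id, timestamp, value)` and the duplicate-dropping rule are assumptions for the example; the source only states the sort key.

```python
import heapq

def kway_merge(runs):
    """Merge sorted runs of (series_id, timestamp, value) into one stream.

    Points sharing a (series_id, timestamp) key are emitted once; the copy
    from the earliest run wins. Dropping duplicates is an assumption here,
    chosen so that re-merging the same staging files gives the same output.
    """
    last_key = None
    for point in heapq.merge(*runs, key=lambda p: (p[0], p[1])):
        key = (point[0], point[1])
        if key == last_key:
            continue
        last_key = key
        yield point

staging_a = [(1, 100, 0.5), (1, 115, 0.6), (2, 100, 7.0)]
staging_b = [(1, 115, 0.6), (1, 130, 0.7), (3, 100, 1.0)]
merged = list(kway_merge([staging_a, staging_b]))
```

The merge holds only one pending point per input run in memory, so a single process can compact a full day of staging files without loading them all at once.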
Quarkus pods. PromQL subset: selectors, aggregations, rate, over_time. Scale to zero on 30 min idle. Cold start <10 s. Auto-routing to continuous aggregates.
| Component | Technology | Rationale |
|---|---|---|
| Runtime | Java 21 + Quarkus | Native image, non-blocking I/O, reactive |
| Object storage | S3-compatible | AWS S3, MinIO, Ceph/RGW — identical to DES/ELSA |
| Block format | .mila append-only | [HEADER][BLOCKS][INDEX][FOOTER] — optimized for time-series |
| Value compression | Gorilla (DoD + XOR) | 5–10× ratio · 16B raw → 1.37B per point on average |
| Label compression | Dictionary + Snappy | Repeating labels 2 bytes vs 50+ bytes raw |
| Metastore cache | Redis | Sorted sets per (tenant, metric); rebuild <5 min from S3 |
| Bloom filter | ELBF v1 | Shared with ELSA · independent of Guava version |
| WORM | S3 Object Lock | COMPLIANCE mode + Extended Retention Management |
| Authentication | Prometheus basic / OTLP bearer / HMAC | Multi-protocol entry · per-tenant RBAC |
| Compaction | K8s CronJob (singleton) | No distributed locks · idempotent restart safe |
| Query | PromQL subset + Grafana plugin | Zero migration for Prometheus dashboards |
| Observability | Prometheus + Grafana | External Prometheus for self-monitoring (no recursion) |
Reach out to discuss deploying MILA in your organization: migration from Prometheus / Thanos / VictoriaMetrics, Grafana integration, or a TCO reduction plan for multi-year regulatory metric retention.