MILA

The challenge

Does this sound familiar?

Every organization generates terabytes of metrics daily, and regulators require keeping them for years. No existing solution does this economically: either you pay for 24/7 compute, or you hold data without an index.

⏱️

Prometheus breaks after 14 days

Excellent for hot search, but for 90+ day retention you need Thanos or Cortex — meaning Cassandra, Bigtable, or an object store plus a 3-person ops team. The operational cost exceeds the value of long retention.

💸

Splunk Metrics costs more than the data

Enterprise SaaS with a six-figure annual minimum, every metric counted. The customer pays for 24/7 compute on data nobody reads. The worst ROI in the observability category.

🏗️

VictoriaMetrics — still 24/7 compute

Great compression, great queries — but you still maintain a cluster continuously because queries need a hot in-memory index. Storage is cheap, compute is expensive. For multi-year retention the economics don't work.

🗄️

Raw Parquet in S3 — no index

The cheapest archive, but a point query takes an hour. Without an index on metric name and label set, every query is a full scan. An auditor who asks for one series from a year ago waits 60 minutes for 4 data points.

Solution

How MILA solves these problems

A binary format optimized for time-series, nightly D+1 compaction, a hibernatable PromQL-compatible query layer — all in one S3-native stack, compatible with existing Prometheus / Grafana tooling.

🗜️

Gorilla compression for values

Delta-of-delta on timestamps, XOR on float64 values. On average, 16 raw bytes (an 8-byte timestamp plus an 8-byte value) shrink to 1.37 bytes per point, a compression ratio of roughly 10× for regular metrics. Less S3 space and faster queries, since there is less to decode.

10× compression
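
As an illustration of the two transforms named above, here is a minimal Python sketch. It shows only the delta-of-delta and XOR steps, not the variable-length bit packing that produces the final byte savings, and the function names are hypothetical rather than part of the .mila format.

```python
import struct


def delta_of_delta(timestamps):
    """Return (first timestamp, first delta, list of delta-of-deltas)."""
    if not timestamps:
        return None, None, []
    if len(timestamps) == 1:
        return timestamps[0], None, []
    first_delta = timestamps[1] - timestamps[0]
    dods = []
    prev_delta = first_delta
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        dods.append(delta - prev_delta)  # 0 for perfectly regular scrapes
        prev_delta = delta
    return timestamps[0], first_delta, dods


def float_bits(value):
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def xor_stream(values):
    """Keep the first value's bits, then XOR each value with its predecessor."""
    bits = [float_bits(v) for v in values]
    if not bits:
        return []
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]
```

For a metric scraped at a fixed interval, most delta-of-deltas are zero, and slowly changing values produce XOR results with long runs of zero bits. That regularity is what a Gorilla-style bit packer exploits.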

📚

Dictionary encoding for labels

Per-block dictionary: unique label values mapped to 2-byte IDs. Repeating labels (host, region, service) take 2 bytes instead of 50+. Built into the .mila format, transparent to the query layer.

3–5× label savings
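
A per-block dictionary of this kind could look roughly like the Python sketch below. It is an assumption-laden illustration: the real .mila layout is not specified here, the function names are hypothetical, and the sketch interns label names as well as values for simplicity.

```python
import struct


def encode_labels(label_sets):
    """Encode label sets as (dictionary, list of packed 2-byte ID sequences)."""
    dictionary = {}

    def intern(text):
        if text not in dictionary:
            if len(dictionary) > 0xFFFF:
                raise ValueError("more than 65,536 distinct strings in one block")
            dictionary[text] = len(dictionary)
        return dictionary[text]

    encoded = []
    for labels in label_sets:
        ids = []
        for name, value in sorted(labels.items()):
            ids.extend((intern(name), intern(value)))
        # Each ID is one big-endian unsigned 16-bit integer (2 bytes).
        encoded.append(struct.pack(f">{len(ids)}H", *ids))
    return dictionary, encoded


def decode_labels(dictionary, packed):
    """Inverse of encode_labels for a single packed label set."""
    reverse = {idx: text for text, idx in dictionary.items()}
    ids = struct.unpack(f">{len(packed) // 2}H", packed)
    return {reverse[ids[i]]: reverse[ids[i + 1]] for i in range(0, len(ids), 2)}
```

In this sketch a label set with two labels occupies 8 bytes regardless of how long the strings are, because each string is stored once in the dictionary.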

🌙

D+1 nightly compaction

A single CronJob merges the last 24h of staging and promotes the result to the WORM archive. No distributed coordination, no race conditions, no distributed locks. Proven pattern from ELSA, ported 1:1.

Zero coordination

💤

Hibernatable query layer

Query nodes scale to zero when idle for 30 min. Cold start <10 s. You pay for storage, not for an idle compute cluster. An auditor queries once a month — MILA sits at 720h × $0/h the rest of the time.

$0 idle compute

📊

PromQL compatibility

PromQL subset covering 90% of dashboard queries: selectors, aggregations, rate, the *_over_time functions, histogram_quantile. Grafana datasource plugin: zero changes to existing dashboards.

Grafana ready

🔒

Compliance-grade from the ground up

S3 Object Lock COMPLIANCE per block, GDPR pseudonymization at ingest for PII labels (user_id, customer_id), chain-hashed audit trail, per-series legal hold. Built in, not bolted on.

MiFID II · HIPAA · GDPR

Unique economic property

PromQL without a permanent cluster

The defining property of MILA: the query layer does not need to run continuously. Ingestors collect metrics via Prometheus remote_write. A nightly CronJob builds .mila blocks. S3 holds the archive.

The Redis metastore and query pods are needed only when someone is querying. An auditor queries once a month, an analyst writes a report once a quarter, a prosecutor requests data once a year. You pay for storage, not for an idle cluster.

Always on — minimal footprint

Ingestors (HPA 1-N, low utilization)
Nightly CronJob (once daily, ~30 min)
S3 — only source of truth

On-demand — spin up when needed

Redis metastore (rebuild from S3 in <5 min)
Query nodes (stateless Quarkus pods)
Tear down when query completes

SCENARIO: MiFID II BEST EXECUTION AUDIT

# Tuesday 14:32 — DPO requests one year of tick data metrics

$ curl -G 'https://mila/api/v1/query_range' \
    --data-urlencode 'query=fix_execution{venue="MTF",sym="VOD.L"}' \
    --data-urlencode 'start=2025-05-21' --data-urlencode 'end=2026-05-21' \
    --data-urlencode 'step=1h'

→ Query node warming up... (cold start)

→ Redis metastore rebuild (4m 17s, 2.3B series)

→ Query executing across 365 days...

✓ 8,760 points · 12 series · 47.2 MB · 6.4s

✓ Result delivered to Grafana dashboard

# 30 min later — no new queries

→ Query nodes scale to zero

→ Redis dropped from cluster

✓ Cost: $0/h idle

# Cost of the scenario:

# Storage (1y, 2B series, S3-IA): $185/mo

# Query compute (5m wake + 6s): $0.03

# vs Splunk Metrics (24/7): ~$8,400/mo

Use cases

Who needs MILA?

Real scenarios for organizations that must archive billions of metric points for years, with the ability to answer a regulator's question in minutes, not hours.

🏦

Fintech & banking

Tick data and MiFID II best execution

Best execution reporting requires retention of tick data and execution quality metrics for 5–7 years. A raw tick stream is millions of points per day per instrument. Splunk is prohibitively expensive at this volume, and raw Parquet offers no index for answering a single auditor question.

Retention — 5–7 years per MiFID II Art. 27, MAR, KNF
Volume — millions of points/day/instrument, billions of series
Continuous aggregates — tick → 1m → 1h → 1d, auto-routing query
Grafana plugin — existing execution quality dashboards work unchanged

📡

Telecommunications

Quality-of-Service and regulator reporting

Carriers report service KPIs (call drop rate, throughput, latency per cell, per service) to regulators. Volume: millions of series × minute granularity × 3–5 year retention. A classic TSDB is uneconomical, and a raw archive has no index.

Retention — 3–5 years per NSI Poland, EU telco regs
Granularity — cell-level + service-level minute-by-minute
Cardinality protection — millions of distinct cell_id × service × time
Continuous aggregates — minute → hour → day for long retention

⚡

IoT and energy

Smart metering and sensor telemetry

Smart meters send readings every 15 minutes, which is 96 readings per device per day. 1M devices = ~100M points/day. 4–10 year retention per energy regulator. A classic use case for long-term archival: everyone wants to keep the data, few know how.

Retention — 4–10 years per URE Poland, EU electricity directive
Volume — ~100M points/day for 1M devices
Ingest — OTLP, custom HMAC from the network edge
Cardinality — meter_id (PII) pseudonymization at ingest

🏥

Healthcare

Continuous patient monitoring

ICU monitors, wearables, and medical devices generate continuous patient telemetry. HIPAA requires 6+ years of retention with an access audit trail. A classic TSDB breaks under the volume, and a classic EHR has no index on metrics.

Retention — 6+ years per HIPAA, medical device regs
PII — patient_id in labels, pseudonymization at ingest
Access audit — AURA integration: who accessed the telemetry
Per-device — vitals × device × time, millions of series

Regulatory compliance

Compliance built in, not bolted on

MILA meets regulatory requirements through architecture, not through end-of-project configuration. Immutability, audit trail, and PII label pseudonymization are first-class concepts of the system.

MiFID II

DORA

HIPAA

NIS2

GDPR

ISO 27001

✓

S3 Object Lock — WORM at the infrastructure level

.mila blocks locked immediately after promotion from staging. COMPLIANCE mode default (un-deletable), GOVERNANCE with MFA for selected use cases. Retention configurable per tenant per metric pattern.

✓

GDPR Art. 17 + WORM — conflict resolved

Pseudonymization at ingest (default): PII labels (user_id, customer_id, email_hash) are hashed with HMAC-SHA256 using a per-tenant secret, so the locked blocks never contain the raw identifier. Tombstone fallback (opt-in) for retroactive removal of existing data.

Dual approach
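
The default pseudonymization path can be sketched in a few lines of Python using the standard hmac module. The set of PII label names comes from the description above; the function name, the hex output encoding, and the way the tenant secret is passed in are assumptions made for illustration.

```python
import hashlib
import hmac

# Label names treated as PII, taken from the description above.
PII_LABELS = frozenset({"user_id", "customer_id", "email_hash"})


def pseudonymize(labels, tenant_secret, pii_labels=PII_LABELS):
    """Return a copy of labels with PII values replaced by HMAC-SHA256 digests."""
    result = {}
    for name, value in labels.items():
        if name in pii_labels:
            digest = hmac.new(tenant_secret, value.encode("utf-8"), hashlib.sha256)
            result[name] = digest.hexdigest()
        else:
            result[name] = value
    return result
```

Because the digest is keyed, the same identifier maps to the same pseudonym within a tenant (so series stay joinable), while a different tenant secret yields an unrelated value.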

✓

Retention policies per regulation

Config-driven YAML per metric pattern + regulation. MiFID II metrics 5–7 years, HIPAA monitoring 6+ years, internal metrics 1 year. Policy changes are audited separately, with their own Object Lock.
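
To make the pattern-to-retention mapping concrete, here is a hedged Python sketch of a first-match lookup. The list stands in for the parsed YAML configuration; the glob patterns, the specific year values chosen within the stated ranges, and the function name are all hypothetical.

```python
from fnmatch import fnmatchcase

# Stand-in for the parsed YAML: (metric pattern, regulation, retention in years).
# Patterns and exact year values are illustrative assumptions.
POLICIES = [
    ("fix_*", "MiFID II", 7),
    ("patient_*", "HIPAA", 6),
    ("*", "internal", 1),
]


def retention_for(metric_name, policies=POLICIES):
    """Return (regulation, years) for the first pattern matching the metric."""
    for pattern, regulation, years in policies:
        if fnmatchcase(metric_name, pattern):
            return regulation, years
    raise LookupError(f"no retention policy matches {metric_name!r}")
```

Ordering matters in this sketch: the catch-all pattern sits last so that regulated metrics are matched by their more specific rule first.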

✓

Audit trail with chain hash

Every operation (COMPACTION, READ, EXPORT, RETAIN, LEGAL_HOLD) logged with a cryptographic link to the preceding entry. Embedded AuditTrailWriter, no circular dependencies. Detection of audit log tampering is deterministic.
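
A chain-hashed log of this kind can be sketched as follows. This is a minimal Python illustration, not the AuditTrailWriter itself: the entry fields, the JSON serialization, the SHA-256 choice, and the all-zero genesis hash are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash preceding the first entry


def _digest(body):
    """Hash a canonical JSON rendering of the entry body."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(log, operation, details):
    """Append an entry whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"operation": operation, "details": details, "prev_hash": prev_hash}
    entry = dict(body, hash=_digest(body))
    log.append(entry)
    return entry


def verify(log):
    """Return True only if every entry links to its predecessor and is unmodified."""
    prev_hash = GENESIS
    for entry in log:
        body = {key: entry[key] for key in ("operation", "details", "prev_hash")}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True
```

Changing or removing any entry breaks the hash of every later entry, which is why verification is a deterministic recomputation rather than a heuristic check.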

✓

Legal hold per metric series

Entity-scoped hold: deletion lock for specific series (e.g., all metrics of customer X). The system automatically identifies affected blocks and applies an S3 Object Lock legal hold flag. Release requires four-eyes and MFA.

✓

Cardinality protection

Per-metric per-tenant cardinality budget with overflow alerts. Sampled mode fallback protects against metric explosion (a misconfigured client flooding with millions of unique label sets). The auditor sees not only what we store, but also why.
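
A cardinality budget of this shape could be tracked roughly as in the Python sketch below. The class and method names are hypothetical, and the sketch only reports overflow; raising the alert or switching to sampled mode is left to the caller.

```python
class CardinalityBudget:
    """Track distinct label sets per (tenant, metric) against a fixed limit."""

    def __init__(self, limit):
        self.limit = limit
        self._seen = {}  # (tenant, metric) -> set of canonical label-set keys

    def admit(self, tenant, metric, labels):
        """Return True if the series fits the budget, False on overflow."""
        series = self._seen.setdefault((tenant, metric), set())
        key = tuple(sorted(labels.items()))
        if key in series:
            return True  # already-known series never count twice
        if len(series) >= self.limit:
            return False  # caller would alert or fall back to sampled mode
        series.add(key)
        return True

    def cardinality(self, tenant, metric):
        """Number of distinct series currently admitted for this metric."""
        return len(self._seen.get((tenant, metric), ()))
```

Existing series keep flowing after the budget is exhausted; only new label sets are rejected, which contains a misconfigured client without dropping healthy data.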

Under the hood

Architecture designed for the long haul

Four layers, no shared state — ingest, format, compaction, and query run independently, scale separately, and rebuild from S3 if any is lost.

Ingestion layer

Multi-protocol

Prometheus remote_write v2.0, OTLP HTTP, custom HMAC POST. Per-tenant rate limiting, cardinality protection, authentication. Staging area locally + S3 backup every 5 min.

Binary format

Block `.mila`

Gorilla compression on float64 (delta-of-delta + XOR), dictionary-encoded labels, sparse index by time, bloom filter (ELBF v1, shared with ELSA). Range-GET friendly.
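
To show why a sparse time index makes a block range-GET friendly, here is a hedged Python sketch that maps a query window to a byte range. It assumes the index is a sorted list of (first timestamp, byte offset) pairs, one per chunk; that layout and the function name are illustrative, not the documented .mila index.

```python
from bisect import bisect_right


def byte_range(index, data_end, start_ts, end_ts):
    """Return the half-open byte range [begin, end) covering the query window.

    index: sorted list of (first_timestamp, byte_offset), one entry per chunk.
    data_end: byte offset just past the last chunk.
    Returns None when the window ends before the first chunk starts.
    """
    firsts = [ts for ts, _ in index]
    # Last chunk starting at or before start_ts (clamped to the first chunk).
    lo = max(bisect_right(firsts, start_ts) - 1, 0)
    # First chunk starting strictly after end_ts.
    hi = bisect_right(firsts, end_ts)
    if hi <= lo:
        return None
    begin = index[lo][1]
    end = index[hi][1] if hi < len(index) else data_end
    return begin, end
```

The result translates directly into an HTTP Range header of the form bytes=begin-(end-1), so a point query fetches a few chunks instead of the whole block.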

D+1 compaction

Nightly CronJob

Singleton (zero coordination overhead). K-way merge sorted by (series_id, timestamp). Atomic upload with S3 conditional PUT. Redis metastore rebuild via Lua scripts.
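
The K-way merge step can be illustrated with Python's heapq.merge, which lazily merges already-sorted inputs. The tuple layout and function name are assumptions, and dropping repeated (series_id, timestamp) keys is an added assumption intended to keep a rerun idempotent.

```python
import heapq


def merge_staging(*sorted_runs):
    """Merge runs of (series_id, timestamp, value) sorted by (series_id, timestamp).

    When several runs contain the same key, the point from the earliest
    run wins, because heapq.merge is stable across its inputs.
    """
    last_key = None
    merged = heapq.merge(*sorted_runs, key=lambda point: (point[0], point[1]))
    for series_id, timestamp, value in merged:
        key = (series_id, timestamp)
        if key == last_key:
            continue  # duplicate point, e.g. from a replayed staging file
        last_key = key
        yield series_id, timestamp, value
```

Because the merge is a generator over sorted inputs, memory use depends on the number of runs rather than the number of points, which suits a single nightly job.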

Query layer

PromQL + hibernate

Quarkus pods. PromQL subset: selectors, aggregations, rate, the *_over_time functions, histogram_quantile. Scale to zero on 30 min idle. Cold start <10 s. Auto-routing to continuous aggregates.
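
One plausible routing rule for continuous aggregates is to serve a query from the coarsest rollup whose interval does not exceed the query step. The Python sketch below assumes that rule and the tick, 1m, 1h, 1d tiers mentioned for the fintech use case; the names and the rule itself are illustrative, not confirmed behavior.

```python
# (tier name, interval in seconds), coarsest first; "raw" is the unaggregated data.
RESOLUTIONS = [("1d", 86_400), ("1h", 3_600), ("1m", 60), ("raw", 0)]


def pick_resolution(step_seconds):
    """Return the coarsest tier whose interval is no larger than the query step."""
    if step_seconds < 0:
        raise ValueError("step must be non-negative")
    for name, interval in RESOLUTIONS:
        if interval <= step_seconds:
            return name
    return "raw"  # unreachable with the table above, kept as a safe default
```

Under this rule the one-year, step=1h audit query from the scenario would read the hourly tier rather than raw ticks.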

Component	Technology	Rationale
Runtime	`Java 21 + Quarkus`	Native image, non-blocking I/O, reactive
Object storage	`S3-compatible`	AWS S3, MinIO, Ceph/RGW — identical to DES/ELSA
Block format	`.mila append-only`	[HEADER][BLOCKS][INDEX][FOOTER] — optimized for time-series
Value compression	`Gorilla (DoD + XOR)`	5–10× ratio · 16B raw → 1.37B per point on average
Label compression	`Dictionary + Snappy`	Repeating labels 2 bytes vs 50+ bytes raw
Metastore cache	`Redis`	Sorted sets per (tenant, metric); rebuild <5 min from S3
Bloom filter	`ELBF v1`	Shared with ELSA · independent of Guava version
WORM	`S3 Object Lock`	COMPLIANCE mode + Extended Retention Management
Authentication	`Prometheus basic / OTLP bearer / HMAC`	Multi-protocol entry · per-tenant RBAC
Compaction	`K8s CronJob (singleton)`	No distributed locks · idempotent, restart-safe
Query	`PromQL subset + Grafana plugin`	Zero migration for Prometheus dashboards
Observability	`Prometheus + Grafana`	External Prometheus for self-monitoring (no recursion)