Datavision.pl — metric & time-series archive

MILA

Metric Indexed Long Archive

Long-term metrics archive with an index for point queries. Prometheus breaks after 14 days, Thanos is an operational nightmare, Splunk Metrics costs more than the data itself.

Packs time-series into binary .mila blocks with Gorilla compression, archives to S3 with WORM, exposes PromQL-compatible queries through a hibernatable query layer — without a permanent 24/7 cluster.

PromQL compatible · MiFID II · DORA · HIPAA · GDPR Art. 17 · Gorilla
Gorilla (10× compression) → .mila time-series block → S3
The challenge

Does this sound familiar?

Every organization generates terabytes of metrics daily. Every one wants to keep them for years because of regulators. No existing solution does this economically — either you pay for 24/7 compute, or you have data without an index.

⏱️

Prometheus breaks after 14 days

Excellent for hot queries, but for 90+ day retention you need Thanos or Cortex, which means Cassandra, Bigtable, or an object store, plus a 3-person ops team. The operational cost exceeds the value of long retention.

💸

Splunk Metrics costs more than the data

Enterprise SaaS with a six-figure annual minimum, every metric counted. The customer pays for 24/7 compute on data nobody reads. The worst ROI in the observability category.

🏗️

VictoriaMetrics — still 24/7 compute

Great compression, great queries — but you still maintain a cluster continuously because queries need a hot in-memory index. Storage is cheap, compute is expensive. For multi-year retention the economics don't work.

🗄️

Raw Parquet in S3 — no index

The cheapest archive, but a point query takes an hour. Without an index on metric name and label set, every query is a full scan. An auditor asks for one series from a year ago and waits 60 minutes for 4 data points.

10–15× cheaper than Splunk Metrics (3-year TCO, 100k samples/s)
<3 s query latency, cold (single series, 30 days)
5–10× compression ratio (Gorilla + dictionary encoding)
$0 of permanent compute (hibernatable query layer)
Solution

How MILA solves these problems

A binary format optimized for time-series, nightly D+1 compaction, a hibernatable PromQL-compatible query layer — all in one S3-native stack, compatible with existing Prometheus / Grafana tooling.

🗜️

Gorilla compression for values

Delta-of-delta on timestamps, XOR on float64 values. Average 16 raw bytes → 1.37 bytes per point. Typical 10× compression ratio for regular metrics. Less S3 space, faster queries (less to decode).

10× compression
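
As a rough illustration of the two transforms named above, the sketch below computes delta-of-delta values for timestamps and XORs consecutive float64 values. It only shows why regular series compress well (most outputs are zero or have few significant bits). The bit-level packing of Gorilla and the actual .mila layout are not shown, and the function names are illustrative assumptions.

```python
import struct

def delta_of_delta(timestamps):
    """Return [first timestamp, first delta, then delta-of-delta values]."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)  # 0 whenever the scrape interval is steady
        prev_delta = delta
    return out

def float_bits(x):
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    """Return [raw bits of first value, then XOR of each value with its predecessor]."""
    bits = [float_bits(v) for v in values]
    return bits[:1] + [a ^ b for a, b in zip(bits, bits[1:])]

# A series scraped every 15 s, with one sample arriving 1 s late.
dod = delta_of_delta([1000, 1015, 1030, 1045, 1061])   # [1000, 15, 0, 0, 1]
# A value that stays flat and then moves slightly.
xors = xor_stream([12.5, 12.5, 12.5, 13.0])            # zeros while the value is unchanged
```

A real encoder would then write each zero as a single bit and store only the significant bits of the non-zero results, which is where the per-point savings come from.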
📚

Dictionary encoding for labels

Per-block dictionary: unique label values mapped to 2-byte IDs. Repeating labels (host, region, service) take 2 bytes instead of 50+. Built into the .mila format, transparent to the query layer.

3–5× label savings
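
A minimal sketch of per-block dictionary encoding, assuming each distinct label value gets a 2-byte big-endian ID in order of first appearance. The function names and the byte order are assumptions for illustration, not the .mila specification.

```python
import struct

def encode_block_labels(label_values):
    """Map each distinct label value to a 2-byte ID; return (table, payload)."""
    ids = {}
    payload = bytearray()
    for value in label_values:
        if value not in ids:
            if len(ids) > 0xFFFF:
                raise OverflowError("more than 65,536 distinct values in one block")
            ids[value] = len(ids)
        payload += struct.pack(">H", ids[value])  # 2 bytes per occurrence
    return list(ids), bytes(payload)

def decode_block_labels(table, payload):
    """Inverse of encode_block_labels."""
    count = len(payload) // 2
    return [table[i] for i in struct.unpack(f">{count}H", payload)]

values = ["eu-west-1", "api-gateway", "eu-west-1", "api-gateway", "eu-west-1"]
table, payload = encode_block_labels(values)
raw_size = sum(len(v) for v in values)  # 49 bytes as plain strings
packed_size = len(payload)              # 10 bytes, plus the table stored once per block
```

The saving grows with repetition: the dictionary is paid for once per block, while every further occurrence costs 2 bytes.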
🌙

D+1 nightly compaction

A single CronJob merges the last 24h of staging data and promotes it to the WORM archive. No distributed coordination, no race conditions, no distributed locks. A proven pattern from ELSA, ported 1:1.

Zero coordination
💤

Hibernatable query layer

Query nodes scale to zero when idle for 30 min. Cold start <10 s. You pay for storage, not for an idle compute cluster. An auditor queries once a month — MILA sits at 720h × $0/h the rest of the time.

$0 idle compute
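
The idle rule described above can be sketched as a small state machine. This is a sketch under stated assumptions: the class name, the single-replica wake-up, and the periodic `tick` call are hypothetical, and a real deployment would more likely delegate this to an autoscaler.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_S = 30 * 60  # 30 minutes of inactivity, as stated in the text

@dataclass
class QueryLayerScaler:
    replicas: int = 0
    last_query_at: float = 0.0

    def on_query(self, now: float) -> None:
        """Wake the query layer if it is hibernating and record the activity."""
        if self.replicas == 0:
            self.replicas = 1
        self.last_query_at = now

    def tick(self, now: float) -> None:
        """Called periodically; scales to zero once the idle timeout has passed."""
        if self.replicas > 0 and now - self.last_query_at >= IDLE_TIMEOUT_S:
            self.replicas = 0

scaler = QueryLayerScaler()
scaler.on_query(now=0.0)     # first query wakes one replica
scaler.tick(now=1799.0)      # still within the idle window
scaler.tick(now=1800.0)      # 30 minutes without queries: back to zero
```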
📊

PromQL compatibility

PromQL subset covering 90% of dashboard queries: selectors, aggregations, rate, the *_over_time functions, histogram_quantile. Grafana datasource plugin: zero changes to existing dashboards.

Grafana ready
🔒

Compliance-grade from the ground up

S3 Object Lock COMPLIANCE per block, GDPR pseudonymization at ingest for PII labels (user_id, customer_id), chain-hashed audit trail, per-series legal hold. Built in, not bolted on.

MiFID II · HIPAA · GDPR
Unique economic property

PromQL without a permanent cluster

The defining property of MILA: the query layer does not need to run continuously. Ingestors collect metrics via Prometheus remote_write. A nightly CronJob builds .mila blocks. S3 holds the archive.

The Redis metastore and query pods are needed only when someone is querying. An auditor queries once a month, an analyst writes a report once a quarter, a prosecutor requests data once a year. You pay for storage, not for an idle cluster.

Always on — minimal footprint
  • Ingestors (HPA 1-N, low utilization)
  • Nightly CronJob (once daily, ~30 min)
  • S3 — only source of truth
On-demand — spin up when needed
  • Redis metastore (rebuild from S3 in <5 min)
  • Query nodes (stateless Quarkus pods)
  • Tear down when query completes
SCENARIO: MiFID II BEST EXECUTION AUDIT
# Tuesday 14:32 — DPO requests one year of tick data metrics
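$ curl -G 'https://mila/api/v1/query_range' \
    --data-urlencode 'query=fix_execution{venue="MTF",sym="VOD.L"}' \
    --data-urlencode 'start=2025-05-21T00:00:00Z' \
    --data-urlencode 'end=2026-05-21T00:00:00Z' \
    --data-urlencode 'step=1h'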
→ Query node warming up... (cold start)
→ Redis metastore rebuild (4m 17s, 2.3B series)
→ Query executing across 365 days...
✓ 8,760 points per series · 12 series · 47.2 MB · 6.4s
✓ Result delivered to Grafana dashboard
# 30 min later — no new queries
→ Query nodes scale to zero
→ Redis dropped from cluster
✓ Cost: $0/h idle
# Cost of the scenario:
# Storage (1y, 2B series, S3-IA): $185/mo
# Query compute (5m wake + 6s): $0.03
# vs Splunk Metrics (24/7): ~$8,400/mo
Use cases

Who needs MILA?

Real scenarios for organizations that must archive billions of metric points for years, with the ability to answer a regulator's question in minutes, not hours.

🏦
Fintech & banking

Tick data and MiFID II best execution

Best execution reporting requires retention of tick data and execution quality metrics for 5–7 years. A raw tick stream is millions of points per day per instrument. Splunk is an enterprise heart attack; raw Parquet has no index to answer a single auditor question.

  • Retention — 5–7 years per MiFID II Art. 27, MAR, KNF
  • Volume — millions of points/day/instrument, billions of series
  • Continuous aggregates — tick → 1m → 1h → 1d, auto-routing query
  • Grafana plugin — existing execution quality dashboards work unchanged
📡
Telecommunications

Quality-of-Service and regulator reporting

Carriers report service KPIs (call drop rate, throughput, latency per cell, per service) to regulators. Volume: millions of series × minute granularity × 3–5 year retention. A classic TSDB is uneconomical; a raw archive has no index.

  • Retention — 3–5 years per NSI Poland, EU telco regs
  • Granularity — cell-level + service-level minute-by-minute
  • Cardinality protection — millions of distinct cell_id × service × time
  • Continuous aggregates — minute → hour → day for long retention
IoT and energy

Smart metering and sensor telemetry

Smart meters send readings every 15 minutes. 1M devices = ~100M points/day. 4–10 year retention per energy regulator. A textbook case for long-term archival: everyone wants to keep the data, few know how.

  • Retention — 4–10 years per URE Poland, EU electricity directive
  • Volume — ~100M points/day for 1M devices
  • Ingest — OTLP, custom HMAC from the network edge
  • Cardinality — meter_id (PII) pseudonymization at ingest
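
The volume figure above can be checked with a few lines of arithmetic. The sketch assumes 16 raw bytes per point (an 8-byte timestamp plus an 8-byte float64) and the 1.37 bytes per point average quoted elsewhere on this page. It ignores labels, indexes, and block overhead, and real meter data may compress differently.

```python
DEVICES = 1_000_000
READINGS_PER_DAY = 24 * 60 // 15   # one reading every 15 minutes -> 96
RAW_BYTES_PER_POINT = 16           # assumption: 8-byte timestamp + 8-byte float64
PACKED_BYTES_PER_POINT = 1.37      # Gorilla average quoted on this page

points_per_day = DEVICES * READINGS_PER_DAY  # 96,000,000, i.e. the "~100M" above
raw_gb_per_year = points_per_day * RAW_BYTES_PER_POINT * 365 / 1e9
packed_gb_per_year = points_per_day * PACKED_BYTES_PER_POINT * 365 / 1e9

print(f"{points_per_day:,} points/day")
print(f"raw: {raw_gb_per_year:.1f} GB/year, packed: {packed_gb_per_year:.1f} GB/year")
```

Under these assumptions, values alone come to roughly 561 GB per year raw versus roughly 48 GB per year packed.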
🏥
Healthcare

Continuous patient monitoring

ICU monitors, wearables, and medical devices generate continuous patient telemetry. HIPAA requires 6+ years of retention with an access audit trail. A classic TSDB breaks under the volume; a classic EHR has no index on metrics.

  • Retention — 6+ years per HIPAA, medical device regs
  • PII — patient_id in labels, pseudonymization at ingest
  • Access audit — AURA integration: who accessed the telemetry
  • Per-device — vitals × device × time, millions of series
Regulatory compliance

Compliance built in, not bolted on

MILA meets regulatory requirements through architecture, not through end-of-project configuration. Immutability, audit trail, and PII label pseudonymization are first-class concepts of the system.

MiFID II
DORA
HIPAA
NIS2
GDPR
ISO 27001

S3 Object Lock — WORM at the infrastructure level

.mila blocks are locked immediately after promotion from staging. COMPLIANCE mode by default (un-deletable), GOVERNANCE with MFA for selected use cases. Retention is configurable per tenant and per metric pattern.

GDPR Art. 17 + WORM — conflict resolved

Pseudonymization at ingest (default): PII labels (user_id, customer_id, email_hash) are hashed with HMAC-SHA256 using a per-tenant secret. Tombstone fallback (opt-in) for retroactive removal of existing data.

Dual approach
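
A minimal sketch of pseudonymization at ingest, assuming the PII label names listed above and a hex-encoded digest as the stored value. The function name, the label set, and the output encoding are illustrative assumptions. Only the use of HMAC-SHA256 with a per-tenant secret comes from the text.

```python
import hashlib
import hmac

PII_LABELS = {"user_id", "customer_id", "email_hash"}  # example names from the text

def pseudonymize(labels: dict, tenant_secret: bytes) -> dict:
    """Replace PII label values with their HMAC-SHA256 under the tenant's secret."""
    out = {}
    for name, value in labels.items():
        if name in PII_LABELS:
            digest = hmac.new(tenant_secret, value.encode("utf-8"), hashlib.sha256)
            out[name] = digest.hexdigest()
        else:
            out[name] = value  # non-PII labels pass through unchanged
    return out

sample = {"customer_id": "C-1042", "region": "eu-west-1"}
stored = pseudonymize(sample, tenant_secret=b"tenant-a-secret")
```

Because the mapping is deterministic for a given tenant secret, the same customer always lands in the same series, so queries and legal holds still work on the pseudonymized label.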

Retention policies per regulation

Config-driven YAML per metric pattern + regulation. MiFID II metrics 5–7 years, HIPAA monitoring 6+ years, internal metrics 1 year. Policy changes are audited separately and protected by their own Object Lock.
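
How a pattern-based policy might be resolved can be sketched as a first-match lookup. The patterns, metric names, and the exact year values below are hypothetical (picked from the ranges given above), and the table stands in for whatever the YAML configuration would contain.

```python
from fnmatch import fnmatchcase

# Hypothetical policy table: (metric pattern, regulation, retention in years).
# Order matters: the first matching pattern wins, so the catch-all goes last.
POLICIES = [
    ("fix_execution*", "MiFID II", 7),
    ("patient_*", "HIPAA", 6),
    ("*", "internal", 1),
]

def retention_for(metric_name):
    """Return (regulation, years) for the first policy whose pattern matches."""
    for pattern, regulation, years in POLICIES:
        if fnmatchcase(metric_name, pattern):
            return regulation, years
    raise LookupError(f"no retention policy for {metric_name}")

policy = retention_for("fix_execution_latency")  # ("MiFID II", 7)
```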

Audit trail with chain hash

Every operation (COMPACTION, READ, EXPORT, RETAIN, LEGAL_HOLD) logged with a cryptographic link to the preceding entry. Embedded AuditTrailWriter, no circular dependencies. Detection of audit log tampering is deterministic.
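
The chain-hash idea can be illustrated with a short sketch: each entry stores the hash of its predecessor, so changing any earlier entry invalidates every hash after it. The entry fields, the JSON serialization, and the all-zero genesis value are assumptions for illustration, not the AuditTrailWriter format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"

def _digest(operation, detail, prev_hash):
    body = {"op": operation, "detail": detail, "prev": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def append_entry(log, operation, detail):
    """Append an entry linked to the hash of the preceding one."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"op": operation, "detail": detail, "prev": prev_hash,
                "hash": _digest(operation, detail, prev_hash)})

def verify(log):
    """Recompute every link; any edited, removed, or reordered entry fails."""
    prev_hash = GENESIS
    for entry in log:
        expected = _digest(entry["op"], entry["detail"], entry["prev"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, "COMPACTION", "block 2025-05-21")
append_entry(log, "READ", "fix_execution, 365 days")
append_entry(log, "EXPORT", "auditor request")
intact = verify(log)            # True for an untouched log
```

Anchoring the latest hash somewhere outside the log (for example in a locked object) is what makes truncating the tail detectable as well.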

Legal hold per metric series

Entity-scoped hold: deletion lock for specific series (e.g., all metrics of customer X). The system automatically identifies affected blocks and applies an S3 Object Lock legal hold flag. Release requires four-eyes and MFA.

Cardinality protection

Per-metric per-tenant cardinality budget with overflow alerts. Sampled mode fallback protects against metric explosion (a misconfigured client flooding with millions of unique label sets). The auditor sees not only what we store, but also why.
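
A cardinality budget can be sketched as a per-(tenant, metric) set of known label sets with a hard cap. The class name, the in-memory set, and the returned status strings are hypothetical. What the caller does on "over_budget" (raise an alert, switch to sampled mode) is left out.

```python
from collections import defaultdict

class CardinalityBudget:
    def __init__(self, max_series):
        self.max_series = max_series
        self.seen = defaultdict(set)  # (tenant, metric) -> known label sets

    def admit(self, tenant, metric, labels):
        """Accept samples for known series; cap the number of new series."""
        series = self.seen[(tenant, metric)]
        key = frozenset(labels.items())
        if key in series:
            return "accepted"
        if len(series) >= self.max_series:
            return "over_budget"  # caller would alert and fall back to sampling
        series.add(key)
        return "accepted"

budget = CardinalityBudget(max_series=2)
results = [
    budget.admit("acme", "http_requests", {"host": "a"}),
    budget.admit("acme", "http_requests", {"host": "b"}),
    budget.admit("acme", "http_requests", {"host": "c"}),  # third distinct series
    budget.admit("acme", "http_requests", {"host": "a"}),  # already known
]
```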

Under the hood

Architecture designed for the long haul

Four layers, no shared state — ingest, format, compaction, and query run independently, scale separately, and rebuild from S3 if any is lost.

Ingestion layer

Multi-protocol

Prometheus remote_write v2.0, OTLP HTTP, custom HMAC POST. Per-tenant rate limiting, cardinality protection, authentication. Staging area locally + S3 backup every 5 min.

Binary format

Block .mila

Gorilla compression on float64 (delta-of-delta + XOR), dictionary-encoded labels, sparse index by time, bloom filter (ELBF v1, shared with ELSA). Range-GET friendly.
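
Why a sparse time index makes a block "Range-GET friendly" can be shown with a small lookup. The index layout below (first timestamp of each chunk and its byte offset) and all numbers are hypothetical. The sketch only demonstrates turning a time window into one byte range, and it assumes the window overlaps the block.

```python
from bisect import bisect_right

# Hypothetical sparse index: (first timestamp in chunk, byte offset), sorted by time.
SPARSE_INDEX = [(0, 0), (3600, 4096), (7200, 9000), (10800, 15500)]
DATA_END = 20000  # byte offset where the data section ends

def byte_range(start_ts, end_ts):
    """Return the inclusive byte range covering every chunk that may hold the window."""
    starts = [ts for ts, _ in SPARSE_INDEX]
    first = max(bisect_right(starts, start_ts) - 1, 0)  # chunk containing start_ts
    last = bisect_right(starts, end_ts)                 # first chunk starting after end_ts
    lo = SPARSE_INDEX[first][1]
    hi = SPARSE_INDEX[last][1] if last < len(SPARSE_INDEX) else DATA_END
    return lo, hi - 1  # inclusive end, as in an HTTP Range header

lo, hi = byte_range(4000, 8000)
range_header = f"bytes={lo}-{hi}"  # "bytes=4096-15499"
```

The query node would send that header in a single ranged read instead of downloading the whole block.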

D+1 compaction

Nightly CronJob

Singleton (zero coordination overhead). K-way merge sorted by (series_id, timestamp). Atomic upload with S3 conditional PUT. Redis metastore rebuild via Lua scripts.
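
The k-way merge step can be sketched with the standard library's `heapq.merge`, assuming each staging run is already sorted by (series_id, timestamp). The tuple layout and the rule of keeping the first copy of a duplicate sample are assumptions for illustration.

```python
import heapq

def merge_staging(*sorted_runs):
    """Merge sorted runs of (series_id, timestamp, value) into one sorted list."""
    merged = heapq.merge(*sorted_runs, key=lambda s: (s[0], s[1]))
    out = []
    for series_id, ts, value in merged:
        if out and out[-1][0] == series_id and out[-1][1] == ts:
            continue  # assumption: keep the first copy of a duplicate sample
        out.append((series_id, ts, value))
    return out

run_a = [(1, 100, 0.5), (1, 200, 0.6), (2, 100, 9.0)]
run_b = [(1, 150, 0.55), (1, 200, 0.6), (3, 100, 1.0)]
block = merge_staging(run_a, run_b)
```

`heapq.merge` reads its inputs lazily, so memory use depends on the number of runs rather than their total size. The list built here is only for the example.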

Query layer

PromQL + hibernate

Quarkus pods. PromQL subset: selectors, aggregations, rate, over_time. Scale to zero on 30 min idle. Cold start <10 s. Auto-routing to continuous aggregates.
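
Auto-routing to continuous aggregates could work along these lines: pick the coarsest aggregate whose resolution does not exceed the query step, and fall back to raw data otherwise. The tier names and the selection rule are assumptions. The text only states that routing is automatic.

```python
# Hypothetical aggregate tiers: (name, resolution in seconds), coarsest first.
TIERS = [("1d", 86_400), ("1h", 3_600), ("1m", 60)]

def pick_tier(step_seconds):
    """Return the coarsest tier that still resolves the requested step."""
    for name, resolution in TIERS:
        if resolution <= step_seconds:
            return name
    return "raw"  # step is finer than every aggregate

tier = pick_tier(3_600)  # a step=1h query can be served from the 1h aggregate
```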

Component | Technology | Rationale
Runtime | Java 21 + Quarkus | Native image, non-blocking I/O, reactive
Object storage | S3-compatible | AWS S3, MinIO, Ceph/RGW (identical to DES/ELSA)
Block format | .mila append-only | [HEADER][BLOCKS][INDEX][FOOTER], optimized for time-series
Value compression | Gorilla (DoD + XOR) | 5–10× ratio · 16B raw → 1.37B per point on average
Label compression | Dictionary + Snappy | Repeating labels 2 bytes vs 50+ bytes raw
Metastore cache | Redis | Sorted sets per (tenant, metric); rebuild <5 min from S3
Bloom filter | ELBF v1 | Shared with ELSA · independent of Guava version
WORM | S3 Object Lock | COMPLIANCE mode + Extended Retention Management
Authentication | Prometheus basic / OTLP bearer / HMAC | Multi-protocol entry · per-tenant RBAC
Compaction | K8s CronJob (singleton) | No distributed locks · idempotent, restart-safe
Query | PromQL subset + Grafana plugin | Zero migration for Prometheus dashboards
Observability | Prometheus + Grafana | External Prometheus for self-monitoring (no recursion)

Start archiving metrics with a guarantee of finding them in the future

Reach out to discuss deploying MILA in your organization — migration from Prometheus / Thanos / VictoriaMetrics, Grafana integration, or a TCO compression plan for multi-year regulatory metric retention.