Datavision.pl — file packing & archival

VERA

Versioned Easy Repository Archive

Billions of small files, one stateless consolidation system. Deterministic routing, immutable shards, S3 as the only source of truth.

Engineered for scale where traditional object stores stop being economical. Built-in compliance with SEC 17a-4, HIPAA, GDPR, and SOC 2 — from the very first shard.

SEC 17a-4 HIPAA GDPR Art. 17 SOC 2 S3 WORM Stateless
The challenge

Does this sound familiar?

Every organization storing billions of small files faces the same tension: costs scale linearly with object count, while regulators demand years of retention with documented immutability.

💸

Costs grow linearly with object count

S3 is economical for MB–GB objects, not KB ones. A billion 50 KB files is not just storage — it's a billion LIST, GET, and PUT calls. The API bill can exceed the storage bill itself.
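A rough back-of-the-envelope check of that claim can be sketched as follows. The unit prices are illustrative assumptions (S3 prices vary by region and change over time), and `s3_cost_estimate` is a hypothetical helper, not part of VERA.

```python
def s3_cost_estimate(n_files, avg_kb, put_per_1k=0.005, get_per_1k=0.0004,
                     storage_per_gb_month=0.023):
    """Return (PUT cost, cost of one full read, monthly storage cost) in USD.

    All prices are assumed defaults; substitute the rates for your region.
    """
    put_cost = n_files / 1000 * put_per_1k
    get_cost = n_files / 1000 * get_per_1k
    storage_gb = n_files * avg_kb / 1_000_000  # decimal units: 1 GB = 1,000,000 KB
    storage_cost = storage_gb * storage_per_gb_month
    return put_cost, get_cost, storage_cost


put_cost, get_cost, storage_cost = s3_cost_estimate(1_000_000_000, 50)
print(f"PUT: ${put_cost:,.0f}  full read: ${get_cost:,.0f}  "
      f"storage/month: ${storage_cost:,.0f}")
```

Under these assumed prices, uploading a billion 50 KB files costs about $5,000 in PUT requests, against about $1,150 per month to store the resulting 50 TB.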

🐌

Backup, replication, recovery — days instead of hours

Every bulk operation requires billions of API calls. Full region replication takes weeks. Disaster recovery for 100M files is a multi-month project. Each object handled separately — without consolidation, there is no scaling.

⚖️

GDPR Art. 17 vs WORM — an apparent conflict

Regulators require immutability (S3 Object Lock COMPLIANCE). GDPR mandates the right to erasure. Without an application layer, each such request demands a manual workaround. The auditor sees either broken WORM or unerased data.

🗄️

Distributed state kills scaling

Classic archival systems need a central database mapping file → location. That database becomes the bottleneck, the coordinator, and the single point of failure. The more files, the bigger the problem: coordination overhead grows with every file added.

  • 60–80% S3 cost reduction (consolidation + compression)
  • 1000× fewer objects in S3 (from billions to millions)
  • <10ms lookup without a database (deterministic routing)
  • 100% shard integrity (CRC32 + WORM + audit trail)
Solution

How VERA solves these problems

Intelligent consolidation of small files into immutable .vera containers, deterministic routing without a database, S3 Object Lock, and GDPR tombstones — all in one system.

📦

Intelligent packing into shards

Thousands of small files are merged into binary .vera containers optimized for S3. Millions of S3 objects instead of billions, so backup, replication, and lifecycle policies finally run in reasonable time.

1000× fewer objects

Range-GET instead of full reads

Index stored in the shard footer. Fetching a single file out of a billion is two S3 calls: one for the index (cached), one for the byte range. No full container download, no external database.

<100ms latency
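The footer-index read path described above might look roughly like this. It is a minimal in-memory sketch: the magic bytes, the 16-byte footer, and the JSON index are assumptions made for illustration, and `range_get` stands in for an S3 GET with a Range header.

```python
import json
import struct

FOOTER = struct.Struct("<QQ")  # assumed footer: index offset, index length
HEADER = b"VERA0001"           # assumed magic bytes


def build_shard(files):
    """Pack {name: bytes} into one blob laid out as [HEADER][DATA][INDEX][FOOTER]."""
    blob = bytearray(HEADER)
    index = {}
    for name, data in files.items():
        index[name] = [len(blob), len(data)]  # absolute offset, length
        blob += data
    index_bytes = json.dumps(index).encode("utf-8")
    index_offset = len(blob)
    blob += index_bytes
    blob += FOOTER.pack(index_offset, len(index_bytes))
    return bytes(blob)


def range_get(blob, start, length):
    """Stand-in for a ranged S3 GET: returns only the requested bytes."""
    return blob[start:start + length]


def read_index(blob):
    """Read the footer, then the index it points to (cacheable per shard)."""
    footer = range_get(blob, len(blob) - FOOTER.size, FOOTER.size)
    index_offset, index_length = FOOTER.unpack(footer)
    return json.loads(range_get(blob, index_offset, index_length))


def read_file(blob, index, name):
    """Fetch exactly the byte range of one file."""
    offset, length = index[name]
    return range_get(blob, offset, length)


shard = build_shard({"a.json": b'{"id": 1}', "b.txt": b"hello"})
index = read_index(shard)
print(read_file(shard, index, "b.txt"))
```

In this sketch the index lookup takes two small reads (footer, then index). Against real S3, a single suffix range over the tail of the object could fetch both at once, and a cached index leaves one ranged read per file.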
🛡️

GDPR + WORM — conflict resolved

Per-shard metadata sidecar (.vera.meta) with file tombstones. Data logically removed, physical shard untouched. Once a threshold is crossed — repack to a new shard. The auditor sees a documented mutation, not broken WORM.

48h SLA for Art. 17
🗜️

Intelligent zstd / lz4 compression

Per-file heuristics: zstd for text data (3–6× ratio), skip already-compressed formats (JPEG, MP4, ZIP). No CPU overhead where compression yields nothing, aggressive compression where it pays off.

40–70% less space
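One way such a per-file heuristic could work is sketched below. The 10% gain threshold and the function names are assumptions. zlib from the standard library is used only as a stand-in to estimate compressibility, and the choice between zstd and lz4 is left out.

```python
import zlib

# Magic-byte prefixes of formats that are already compressed.
COMPRESSED_MAGIC = (b"\xff\xd8\xff", b"PK\x03\x04")  # JPEG, ZIP


def is_already_compressed(head):
    """Detect JPEG, ZIP, and MP4 from the first bytes of a file."""
    if head.startswith(COMPRESSED_MAGIC):
        return True
    return head[4:8] == b"ftyp"  # MP4 / ISO base media file


def choose_codec(head, min_gain=0.10):
    """Return 'zstd' or 'none' based on a trial compression of a sample.

    zlib is a stdlib stand-in here; it only estimates whether compressing pays off.
    """
    if not head or is_already_compressed(head):
        return "none"
    ratio = len(zlib.compress(head, 1)) / len(head)
    return "zstd" if ratio <= 1 - min_gain else "none"


print(choose_codec(b"caller,callee,duration\n" * 200))
print(choose_codec(b"\xff\xd8\xff\xe0" + b"\x00" * 64))
```

Sampling only the first few kilobytes keeps the check cheap compared with compressing every file twice.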
🔒

Compliance-grade from the ground up

S3 Object Lock (COMPLIANCE), Extended Retention Management, chain-hashed audit log, Ed25519/RSA PKI for operator authentication, OpenBao/Vault for secrets. No bolting compliance on at the end of the project.

SEC / HIPAA / GDPR / SOC 2
📈

Horizontal scaling without a coordinator

Thousands of concurrent packing workers, each running independently. Work distribution via queues (SQS / Kafka / RabbitMQ) is planned. Kubernetes HPA is the native scaling model. No leader election, no race conditions.

10,000+ files/min
Architectural foundation

Statelessness as a hard requirement

The defining property of VERA: no database mapping a file to its location. The function locate_shard(uid, created_at, n_bits) is pure: it returns a deterministic shard address with no coordination and no consultation of any external state.

Every process — packer, reader, migrator — independently knows where a file lives. Scaling to thousands of workers means adding pods to K8s, not changing architecture. S3 is the only source of truth. Everything else is reconstructible.
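To make the idea of a pure routing function concrete, here is a minimal sketch. Only the signature comes from the text above. The scheme itself (a month partition plus the top n_bits of a SHA-256 hash of the uid) and the key layout are assumptions, and the real VERA layout may differ.

```python
import hashlib
from datetime import datetime, timezone


def locate_shard(uid, created_at, n_bits):
    """Map (uid, created_at) to a shard key using only its arguments.

    Assumed scheme: month partition from created_at plus the top n_bits
    of a SHA-256 hash of uid. created_at should be timezone-aware.
    """
    if not 1 <= n_bits <= 32:
        raise ValueError("n_bits must be between 1 and 32")
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") >> (32 - n_bits)
    width = (n_bits + 3) // 4  # hex digits needed to print n_bits
    month = created_at.astimezone(timezone.utc).strftime("%Y/%m")
    return f"shards/{month}/{bucket:0{width}x}.vera"


created = datetime(2025, 3, 14, 9, 30, tzinfo=timezone.utc)
print(locate_shard("inv_2025_8a4f", created, n_bits=12))
```

Because the result depends only on the arguments, any packer, reader, or migrator computes the same key without talking to another process.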

Persistent — minimal footprint
  • .vera shards in S3 (immutable, WORM)
  • .vera.meta sidecars (GDPR tombstones)
  • Migration watermark in business DB
Reconstructible — spin up when needed
  • Packer workers (K8s CronJob / HPA)
  • Retriever API (stateless pods)
  • Index cache (16 GB, rebuild from S3)
SCENARIO: 100M FILE MIGRATION
# Day 1 — 100M file backlog in business DB
$ kubectl scale deployment des-packer --replicas=100
→ 100 workers start, each picks its watermark slot
$ des-admin migrate --mode=continuous \
    --source=postgres://biz/files
→ Cutoff: 2026-02-01 · Page size: 10,000
→ Rate: 12,400 files/min · ETA: 5d 14h
# No coordinator — each worker runs independently.
# Pod crash breaks nothing, watermark is atomic.
$ curl GET /api/v1/files/inv_2025_8a4f
✓ Shard located <10ms (deterministic routing)
✓ Range-GET → 42 KB returned in 87 ms
$ kubectl scale deployment des-packer --replicas=10
✓ Scaled down · no re-coordination needed
# Peak cost: 100 pods × 6 days vs constant cluster
Use cases

Who needs VERA?

Real scenarios for organizations that must archive billions of small files for years, with documented compliance.

🏦
Fintech & banking

Transaction document archival

Banks and investment firms must retain billions of PDFs, receipts, and transaction scans for 5–7 years (SEC 17a-4, KNF). On vanilla S3, the API bill grows faster than the data volume itself.

  • Typical scale — 5B PDFs × 50 KB ≈ 250 TB + billions of API calls
  • SEC 17a-4 / KNF — immutability at the S3 Object Lock level
  • ROI — typically 60–80% reduction in storage TCO
  • Extended Retention — per-document legal hold without copying
🏥
Healthcare

Medical records and DICOM imaging

Hospitals archive hundreds of millions of patient records, test results, and DICOM thumbnails. HIPAA requires 6+ years of retention with an access log; GDPR gives the patient the right to erase personal data.

  • Typical scale — hundreds of millions of small XML/JSON/DICOM files
  • HIPAA audit trail — every access logged, chain-hashed
  • GDPR Art. 17 — tombstone in the sidecar, repack on threshold
  • Compression — 70%+ reduction for structured data
📡
Telecommunications

Call Detail Records (CDR) and billing logs

Carriers generate billions of CDRs daily. The retention requirement is 3–5 years, with point access on law-enforcement request. Volume keeps growing, and every day adds to the backlog.

  • Volume — millions of files per hour, petabyte scale
  • Compression — 5–10× for textual CDR data
  • Lifecycle — automatic S3 Standard → Glacier transition
  • Multi-region — independent zones, no global state
⚖️
Public administration

State archives and government records

Central agencies, courts, social security, and tax authorities hold billions of applications, requests, and administrative decisions. Retention requirements run 10–50 years, and records must stay accessible for audits and proceedings throughout the retention period.

  • Long retention — 10–50 years with documented integrity
  • Low operating cost — open source, no vendor lock-in
  • Transparency — audit log available to auditors / oversight
  • Glacier integration — automatic tier transitions
Regulatory compliance

Compliance built in, not bolted on

VERA meets regulatory requirements through architecture, not through end-of-project configuration. Immutability, audit trail, and GDPR tombstones are first-class concepts of the system.

SEC 17a-4
HIPAA
GDPR
SOC 2
ISO 27001

S3 Object Lock — WORM at the infrastructure level

Shards are locked immediately upon finalization. GOVERNANCE mode (deletion only with MFA and a break-glass permission) or COMPLIANCE mode (cannot be deleted or overwritten by anyone, including the root account, until retention expires). Retention is configurable per shard.
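As an illustration of locking a shard at upload time, the sketch below builds the request parameters for an S3 PutObject call with Object Lock. The helper name, bucket, and key are hypothetical. The parameter names are those of the S3 PutObject API, and how VERA actually issues the call is not specified here.

```python
from datetime import datetime, timedelta, timezone

VALID_MODES = {"GOVERNANCE", "COMPLIANCE"}


def object_lock_put_args(bucket, key, body, mode, retention_days, now=None):
    """Build keyword arguments for an S3 PutObject call with Object Lock set."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown Object Lock mode: {mode}")
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ObjectLockMode": mode,
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


args = object_lock_put_args("archive-bucket", "shards/2025/03/0a1.vera",
                            b"...", "COMPLIANCE", retention_days=7 * 365)
print(args["ObjectLockMode"], args["ObjectLockRetainUntilDate"].date())
```

With boto3 these arguments could be passed as `boto3.client("s3").put_object(**args)`, provided Object Lock is enabled on the bucket.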

GDPR Art. 17 + WORM — conflict resolved

Per-shard .vera.meta sidecar with tombstones. Data logically removed immediately, physical repack after threshold (typically 30% tombstones). 48h SLA from request to loss of visibility.
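The tombstone logic can be sketched in a few lines. `ShardMeta` is a hypothetical in-memory view of a .vera.meta sidecar. The 30% default mirrors the typical threshold mentioned above, and the on-disk format of the sidecar is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class ShardMeta:
    """In-memory view of a hypothetical .vera.meta sidecar."""
    total_files: int
    tombstones: set = field(default_factory=set)

    def erase(self, uid):
        """Logically remove a file; the shard itself is not touched."""
        self.tombstones.add(uid)

    def is_visible(self, uid):
        """Readers check the sidecar before serving a file."""
        return uid not in self.tombstones

    def needs_repack(self, threshold=0.30):
        """True once the tombstoned share reaches the repack threshold."""
        if self.total_files == 0:
            return False
        return len(self.tombstones) / self.total_files >= threshold


meta = ShardMeta(total_files=10)
meta.erase("f1")
print(meta.is_visible("f1"), meta.needs_repack())
```

The shard object stays byte-for-byte identical under WORM. Only the sidecar changes until a repack writes the surviving files into a new shard.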

Unique approach

Extended Retention Management — per-file legal hold

Extend retention of an individual file without copying the entire shard. Uses S3 Object Lock retention period on a subset. Full API integration with business systems.

Full audit trail with chain hash

Every operation (PACK, READ, DELETE, REPACK, RETAIN) is logged with a cryptographic link to the preceding entry. Detection of deletion or tampering of the audit log is deterministic.
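A chain-hashed log of this kind can be sketched as follows. The record fields, the all-zero genesis value, and the choice of SHA-256 are assumptions for illustration, not the documented VERA log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed previous-hash value for the first entry


def entry_hash(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(log, op, target):
    """Append an operation such as PACK or READ, linked to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {"seq": len(log), "op": op, "target": target}
    log.append({**record, "prev": prev_hash, "hash": entry_hash(prev_hash, record)})


def verify(log):
    """Return the index of the first broken entry, or None if the chain is intact."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        record = {"seq": entry["seq"], "op": entry["op"], "target": entry["target"]}
        if (entry["seq"] != i or entry["prev"] != prev_hash
                or entry["hash"] != entry_hash(prev_hash, record)):
            return i
        prev_hash = entry["hash"]
    return None


log = []
for op in ("PACK", "READ", "DELETE"):
    append(log, op, "inv_2025_8a4f")
print(verify(log))
```

Editing or removing any entry in the middle breaks every later link. Detecting truncation of the newest entries additionally needs the latest hash to be anchored somewhere outside the log.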

Operator authentication — SSH-style PKI

Every administrator and workflow authenticated via Ed25519 or RSA. No baked-in credentials. Secrets in OpenBao / Vault with dynamic credentials and short TTLs. Zero-downtime key rotation.

Export with integrity proof

Every export includes integrity_proof.json with shard hashes, WORM confirmation, and, where applicable, repack anchor entries. Ready to hand to an auditor, financial regulator, or law enforcement.

Under the hood

Architecture designed for the long haul

Four layers, no shared state: packing, routing, retrieval, and lifecycle management run independently and scale separately.

Packing layer

Packer Workers

Watermark-based migration from the source DB. Per-file heuristic compression. Append-only writes to the shard. Stateless — each worker claims a slot from the watermark and runs independently.
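How a worker might claim its slot atomically is sketched below. SQLite stands in for the business database, and the `watermark` table, its columns, and `claim_slot` are hypothetical names. The actual VERA schema is not described here.

```python
import sqlite3


def init(conn):
    """Create a single-row watermark table starting at id 0."""
    conn.execute("CREATE TABLE watermark ("
                 "id INTEGER PRIMARY KEY CHECK (id = 1), next_id INTEGER NOT NULL)")
    conn.execute("INSERT INTO watermark VALUES (1, 0)")


def claim_slot(conn, page_size, max_id):
    """Atomically reserve the next [start, end) range of source ids, or None when done."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        (start,) = conn.execute("SELECT next_id FROM watermark WHERE id = 1").fetchone()
        if start >= max_id:
            conn.execute("ROLLBACK")
            return None
        end = min(start + page_size, max_id)
        conn.execute("UPDATE watermark SET next_id = ? WHERE id = 1", (end,))
        conn.execute("COMMIT")
        return start, end
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode,
# so the explicit BEGIN / COMMIT above control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
init(conn)
print(claim_slot(conn, page_size=10, max_id=25))
```

Because the read and the update happen inside one transaction, two workers cannot receive the same range. A worker that crashes after claiming leaves its range unprocessed, so a real system would also need a way to re-issue abandoned slots.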

Routing layer

Deterministic Router

Pure function locate_shard(uid, created_at, n_bits). No lookup database. O(1) with no external dependencies. Every process independently knows where to find any file.

Read layer

Retriever API

FastAPI / Quarkus. S3 Range-GET based on the shard footer index. 16 GB index cache (rebuilt from S3). Stateless pods behind a load balancer, K8s HPA scales with traffic.

Lifecycle layer

Repack & GDPR

A .vera.meta sidecar holds tombstones. A CronJob watches the threshold and triggers a repack to a new shard. Repack anchor entries preserve audit-trail continuity. Old shards are archived to Glacier.

Component | Technology | Rationale
Runtime (v1.x) | Python 3.11 + FastAPI | Proven stability, rich SQL integration
Runtime (v2.x) | Java 21 + Quarkus | Native image, non-blocking I/O, ~2.5B files/year scale
Object storage | S3-compatible | AWS S3, MinIO, Ceph/RGW, Wasabi, Dell EMC, NetApp
Shard format | .vera append-only | [HEADER][DATA][INDEX][FOOTER] · Range-GET friendly
Compression | zstd / lz4 (auto) | Per-file heuristic · skip already-compressed formats
Database source | SQLAlchemy | PostgreSQL · MySQL · SQLite · watermark migration
Secrets | OpenBao / Vault | Dynamic credentials · short TTL · MFA for break-glass
Authentication | Ed25519 / RSA PKI | SSH-style signing · zero hardcoded credentials
Orchestration | Kubernetes | CronJob for repack · HPA for retriever · Helm charts
Queue (planned) | SQS / Kafka / RabbitMQ | Distributed work · event-driven scaling
Observability | Prometheus + Grafana | Alert: des_pack_lag >30min → PagerDuty
Datavision.pl

Start saving on S3 storage today

Reach out to discuss deploying VERA in your organization, get a quote, or ask a technical question about architecture, integration, or migration planning.