Billions of small files, one stateless consolidation system.
Deterministic routing, immutable shards, S3 as the only source of truth.
Engineered for scale where traditional object stores stop being economical.
Built-in compliance with SEC 17a-4, HIPAA, GDPR, and SOC 2 — from the very first shard.
Every organization storing billions of small files faces the same tension: costs scale linearly with object count, while regulators demand years of retention with documented immutability.
S3 is economical for MB–GB objects, not KB ones. A billion 50 KB files is not just storage — it's a billion LIST, GET, and PUT calls. The API bill can exceed the storage bill itself.
Every bulk operation requires billions of API calls. Full region replication takes weeks. Disaster recovery for 100M files is a multi-month project. Each object is handled separately, so without consolidation nothing scales.
Regulators require immutability (S3 Object Lock COMPLIANCE). GDPR mandates the right to erasure. Without an application layer, each such request demands a manual workaround. The auditor sees either broken WORM or unerased data.
Classic archival systems need a central database mapping file → location. That database becomes the bottleneck, the coordinator, the single point of failure. More files, bigger problem. The coordination cost grows with every file added.
Intelligent consolidation of small files into immutable .vera containers, deterministic routing without a database, S3 Object Lock, and GDPR tombstones — all in one system.
Thousands of small files merged into binary .vera containers optimized for S3. Single S3 objects instead of billions — backup, replication, and lifecycle policies finally operate in reasonable time.
Index stored in the shard footer. Fetching a single file out of a billion is two S3 calls: one for the index (cached), one for the byte range. No full container download, no external database.
<100ms latency
Per-shard metadata sidecar (.vera.meta) with file tombstones. Data logically removed, physical shard untouched. Once a threshold is crossed, the shard is repacked into a new one. The auditor sees a documented mutation, not broken WORM.
Per-file heuristics: zstd for text data (3–6× ratio), skip already-compressed formats (JPEG, MP4, ZIP). No CPU overhead where compression yields nothing, aggressive compression where it pays off.
40–70% less space
S3 Object Lock (COMPLIANCE), Extended Retention Management, chain-hashed audit log, Ed25519/RSA PKI for operator authentication, OpenBao/Vault for secrets. No bolting compliance on at the end of the project.
SEC / HIPAA / GDPR / SOC 2
Thousands of concurrent packing workers, each running independently. Work distributed via queues (SQS / Kafka / RabbitMQ). Kubernetes HPA as the native scaling model. No leader election, no race conditions.
10,000+ files/min
The defining property of VERA: no database mapping a file to its location.
The function locate_shard(uid, created_at, n_bits) is pure: it returns a deterministic shard address with no coordination and no consultation of any external state.
Every process — packer, reader, migrator — independently knows where a file lives. Scaling to thousands of workers means adding pods to K8s, not changing architecture. S3 is the only source of truth. Everything else is reconstructible.
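To make the idea of deterministic routing concrete, here is a minimal sketch of what a pure locate_shard could look like. Only the signature comes from the text above; the hashing scheme (SHA-256 of the uid, top n_bits used as a bucket), the monthly time partition, and the `shards/YYYY/MM/<bucket>.vera` key layout are assumptions made for illustration, not VERA's actual routing rule.

```python
import hashlib
from datetime import datetime, timezone


def locate_shard(uid: str, created_at: datetime, n_bits: int) -> str:
    """Map a file to a shard key using only its own attributes.

    Hypothetical layout: shards/<year>/<month>/<bucket>.vera
    """
    if not 0 < n_bits <= 32:
        raise ValueError("n_bits must be in 1..32")
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    # Keep the top n_bits of the first 32 bits of the hash as the bucket number.
    bucket = int.from_bytes(digest[:4], "big") >> (32 - n_bits)
    # Normalize to UTC so every worker derives the same time partition.
    ts = created_at.astimezone(timezone.utc)
    width = (n_bits + 3) // 4  # hex digits needed to print n_bits
    return f"shards/{ts:%Y/%m}/{bucket:0{width}x}.vera"


created = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
print(locate_shard("invoice-42", created, 8))
```

Because the result depends only on the arguments, a packer and a reader running on different pods compute the same key without talking to each other. In this sketch, changing n_bits changes the mapping, so the value in force for a given period would itself have to be derivable from configuration rather than from a lookup.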
Real scenarios for organizations that must archive billions of small files for years, with documented compliance.
Banks and investment firms must retain billions of PDFs, receipts, and transaction scans for 5–7 years (SEC 17a-4, KNF). On vanilla S3, the API bill grows faster than the data volume itself.
Hospitals archive hundreds of millions of patient records, test results, and DICOM thumbnails. HIPAA requires 6+ years of retention with an access log; GDPR gives the patient the right to erase personal data.
Carriers generate billions of call detail records (CDRs) daily. The retention requirement is 3–5 years, with point access on law-enforcement request. Volume keeps growing, and every day adds to the backlog.
Central agencies, courts, social security, and tax authorities hold billions of applications, requests, and administrative decisions. Retention requirements run 10–50 years, and records must stay accessible for audits and proceedings throughout the entire retention period.
VERA meets regulatory requirements through architecture, not through end-of-project configuration. Immutability, audit trail, and GDPR tombstones are first-class concepts of the system.
Shards are locked immediately upon finalization. GOVERNANCE mode (deletion with MFA + break-glass permission) or COMPLIANCE (un-deletable under any circumstances). Retention configurable per shard.
Per-shard .vera.meta sidecar with tombstones. Data logically removed immediately, physical repack after threshold (typically 30% tombstones). 48h SLA from request to loss of visibility.
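The tombstone mechanism described above can be sketched as a small in-memory model. The class name `ShardMeta`, its fields, and the choice to measure the threshold by file count (rather than by bytes) are assumptions for illustration; the source only states that a repack happens after a threshold, typically 30% tombstones.

```python
from dataclasses import dataclass, field


@dataclass
class ShardMeta:
    """Hypothetical in-memory view of a .vera.meta sidecar."""

    total_files: int
    tombstones: set[str] = field(default_factory=set)

    def erase(self, uid: str) -> None:
        # Logical removal only: the locked shard object is never touched.
        self.tombstones.add(uid)

    def is_visible(self, uid: str) -> bool:
        return uid not in self.tombstones

    def tombstone_ratio(self) -> float:
        if self.total_files == 0:
            return 0.0
        return len(self.tombstones) / self.total_files

    def needs_repack(self, threshold: float = 0.30) -> bool:
        # Assumed rule: repack once the share of erased files reaches the threshold.
        return self.tombstone_ratio() >= threshold


meta = ShardMeta(total_files=10)
meta.erase("patient-001")
print(meta.is_visible("patient-001"), meta.needs_repack())
```

A reader would consult `is_visible` before serving a byte range, which is what makes the erasure take effect immediately, while a periodic job would check `needs_repack` to decide when the physical rewrite is worth its cost.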
Extend retention of an individual file without copying the entire shard. Uses S3 Object Lock retention period on a subset. Full API integration with business systems.
Every operation (PACK, READ, DELETE, REPACK, RETAIN) is logged with a cryptographic link to the preceding entry. Detection of deletion or tampering of the audit log is deterministic.
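As an illustration of how a chain-hashed log makes tampering detectable, the sketch below links each entry to the hash of the previous one. The entry fields, the all-zero genesis value, and the use of SHA-256 over canonical JSON are assumptions; the source does not specify VERA's log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's link


def _entry_hash(prev_hash: str, op: str, payload: dict) -> str:
    # Canonical JSON so the same entry always hashes to the same value.
    body = json.dumps(
        {"prev": prev_hash, "op": op, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log: list, op: str, payload: dict) -> dict:
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "prev": prev_hash,
        "op": op,
        "payload": payload,
        "hash": _entry_hash(prev_hash, op, payload),
    }
    log.append(entry)
    return entry


def verify_chain(log: list) -> int:
    """Return the index of the first broken entry, or -1 if the chain is intact."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        expected = _entry_hash(entry["prev"], entry["op"], entry["payload"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return i
        prev_hash = entry["hash"]
    return -1


log: list = []
append_entry(log, "PACK", {"shard": "a"})
append_entry(log, "DELETE", {"uid": "x"})
print(verify_chain(log))
```

Editing or removing an entry in the middle breaks the link at that position. Note that a chain like this cannot detect truncation of its tail on its own; in this sketch that would require anchoring the latest hash somewhere outside the log.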
Every administrator and workflow authenticated via Ed25519 or RSA. No baked-in credentials. Secrets in OpenBao / Vault with dynamic credentials and short TTLs. Zero-downtime key rotation.
Every export includes integrity_proof.json with shard hashes, WORM confirmation, and, where applicable, repack anchor entries. Ready to hand to an auditor, financial regulator, or law enforcement.
Four layers, no shared state — packing, indexing, retrieval, and lifecycle management run independently and scale separately.
Watermark-based migration from the source DB. Per-file heuristic compression. Append-only writes to the shard. Stateless — each worker claims a slot from the watermark and runs independently.
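The per-file compression heuristic mentioned above might look roughly like the following. This is a sketch under stated assumptions: the function name `pack_payload` and the extension list are hypothetical, and the standard-library zlib stands in for zstd so the example runs without third-party packages.

```python
import zlib
from pathlib import PurePosixPath

# Assumed list of formats that are already compressed.
SKIP_EXTENSIONS = {".jpg", ".jpeg", ".mp4", ".zip"}


def pack_payload(name: str, data: bytes) -> tuple[str, bytes]:
    """Return (codec, bytes_to_store) for one file.

    zlib is a stand-in here; the system described above uses zstd / lz4.
    """
    ext = PurePosixPath(name).suffix.lower()
    if ext in SKIP_EXTENSIONS:
        # Recompressing these formats costs CPU and gains nothing.
        return "raw", data
    compressed = zlib.compress(data, level=6)
    if len(compressed) >= len(data):
        # Tiny or high-entropy inputs can grow when compressed; store as-is.
        return "raw", data
    return "zlib", compressed


codec, stored = pack_payload("statement.txt", b"transaction approved\n" * 200)
print(codec, len(stored))
```

Recording the codec next to each file lets a reader decide how to decode a byte range without guessing, and the size check covers inputs whose extension gives no hint that they are incompressible.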
Pure function locate_shard(uid, created_at, n_bits). No lookup database. O(1) with no external dependencies. Every process independently knows where to find any file.
FastAPI / Quarkus. S3 Range-GET based on the shard footer index. 16 GB index cache (rebuilt from S3). Stateless pods behind a load balancer, K8s HPA scales with traffic.
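The retrieval path can be illustrated with an in-memory stand-in for S3. The binary layout below (a 4-byte magic header, a JSON index, and a 16-byte footer holding the index offset and length) is an assumption made for the sketch, as are all function names; only the overall [HEADER][DATA][INDEX][FOOTER] ordering and the Range-GET approach come from the text.

```python
import json
import struct

MAGIC = b"VERA"                # hypothetical header
FOOTER = struct.Struct(">QQ")  # hypothetical footer: index offset, index length


def build_shard(files: dict[str, bytes]) -> bytes:
    """Pack files as [HEADER][DATA][INDEX][FOOTER]."""
    data = bytearray(MAGIC)
    index = {}
    for uid, blob in files.items():
        index[uid] = [len(data), len(blob)]  # offset and length of this file
        data += blob
    index_bytes = json.dumps(index).encode("utf-8")
    footer = FOOTER.pack(len(data), len(index_bytes))
    return bytes(data) + index_bytes + footer


def range_get(obj: bytes, start: int, end: int) -> bytes:
    """Stand-in for an S3 GET with 'Range: bytes=start-end' (end inclusive)."""
    return obj[start:end + 1]


def load_index(obj: bytes) -> dict:
    size = len(obj)  # with S3 this would come from the object's metadata
    footer = range_get(obj, size - FOOTER.size, size - 1)
    idx_off, idx_len = FOOTER.unpack(footer)
    return json.loads(range_get(obj, idx_off, idx_off + idx_len - 1))


def read_file(obj: bytes, index: dict, uid: str) -> bytes:
    offset, length = index[uid]
    return range_get(obj, offset, offset + length - 1)


shard = build_shard({"a": b"hello", "b": b"world!"})
index = load_index(shard)  # a retriever pod would cache this per shard
print(read_file(shard, index, "b"))
```

Once the index for a shard is cached, each file read costs a single ranged request regardless of the shard's size. The sketch reads the footer and the index separately for clarity; a real reader could likely fetch both with one suffix range request over the tail of the object.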
A .vera.meta sidecar holds tombstones. A CronJob watches the threshold and triggers a repack to a new shard. Repack anchor entries preserve audit-trail continuity. Old shards are archived to Glacier.
| Component | Technology | Rationale |
|---|---|---|
| Runtime (v1.x) | Python 3.11 + FastAPI | Proven stability, rich SQL integration |
| Runtime (v2.x) | Java 21 + Quarkus | Native image, non-blocking I/O, ~2.5B files/year scale |
| Object storage | S3-compatible | AWS S3, MinIO, Ceph/RGW, Wasabi, Dell EMC, NetApp |
| Shard format | .vera append-only | [HEADER][DATA][INDEX][FOOTER] — Range-GET friendly |
| Compression | zstd / lz4 (auto) | Per-file heuristic · skip already-compressed formats |
| Database source | SQLAlchemy | PostgreSQL · MySQL · SQLite · watermark migration |
| Secrets | OpenBao / Vault | Dynamic credentials · short TTL · MFA for break-glass |
| Authentication | Ed25519 / RSA PKI | SSH-style signing · zero hardcoded credentials |
| Orchestration | Kubernetes | CronJob for repack · HPA for retriever · Helm charts |
| Queue (planned) | SQS / Kafka / RabbitMQ | Distributed work · event-driven scaling |
| Observability | Prometheus + Grafana | Alert: des_pack_lag >30min → PagerDuty |
Reach out to discuss deploying VERA in your organization, get a quote, or ask a technical question about architecture, integration, or migration planning.