CQRS & Event Sourcing Architecture for 4.5B Record Scale
How we decoupled reads from writes using CQRS and Event Sourcing to process 4.5 billion records with sub-second query latency.
The Challenge (The Bottleneck)
Business Pain
The client, a major player in the FinTech/Insurance sector, was effectively flying blind. The core ledger had grown so large that generating end-of-month reconciliation reports took 48+ hours and frequently timed out. Real-time decision-making, crucial for fraud detection and risk assessment, was impossible due to massive data latency.
Technical Pain
The legacy relational database was being forced to act as a “Jack of all trades” (OLTP + OLAP), creating a classic resource contention nightmare:
- Lock Contention: Heavy write operations during business hours were locking rows, causing read queries from the dashboard to hang or time out.
- Indexing Paralysis: A single monolithic table had grown to 4.5 billion rows. Adding a new index to optimize queries took days and required planned downtime, freezing product iteration.
- Tight Coupling: The monolithic architecture coupled data ingestion directly to the user interface. A spike in transaction volume directly degraded the user experience for support staff.
The Architecture (The Solution)
Strategy
We moved away from the CRUD mindset and implemented a strict CQRS (Command Query Responsibility Segregation) pattern combined with Event Sourcing. This allowed us to scale reads and writes independently.
The Logic
Decoupling Writes from Reads: We separated the “Write Side” (Commands) from the “Read Side” (Queries). The write model focused solely on high-speed ingestion and validation, free of complex join logic.
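To make the split concrete, here is a minimal, self-contained Python sketch of the command/query separation, assuming a simplified payment ledger. The names (RecordPayment, WriteSide, ReadSide) are illustrative, not the client's actual code.

```python
# Minimal CQRS sketch: the write side validates commands and emits events;
# the read side answers queries from a separate, denormalized view.
# All names here are illustrative.
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RecordPayment:          # Command: expresses intent to change state
    payment_id: str
    user_id: str
    amount: float


@dataclass(frozen=True)
class PaymentInitiated:       # Event: immutable fact, appended to the log
    payment_id: str
    user_id: str
    amount: float


class WriteSide:
    """Handles commands only: validate, then append an event. No joins, no reporting."""
    def __init__(self, event_log: List[PaymentInitiated]):
        self.event_log = event_log

    def handle(self, cmd: RecordPayment) -> None:
        if cmd.amount <= 0:
            raise ValueError("amount must be positive")
        self.event_log.append(PaymentInitiated(cmd.payment_id, cmd.user_id, cmd.amount))


class ReadSide:
    """Answers queries from a projection kept up to date by consuming events."""
    def __init__(self):
        self.balance_view: Dict[str, float] = {}   # user_id -> balance

    def apply(self, event: PaymentInitiated) -> None:
        self.balance_view[event.user_id] = self.balance_view.get(event.user_id, 0.0) + event.amount

    def get_user_balance(self, user_id: str) -> float:
        return self.balance_view.get(user_id, 0.0)


if __name__ == "__main__":
    log: List[PaymentInitiated] = []
    writes, reads = WriteSide(log), ReadSide()
    writes.handle(RecordPayment("p-1", "u-42", 125.00))
    for evt in log:                      # in production this is an asynchronous consumer
        reads.apply(evt)
    print(reads.get_user_balance("u-42"))   # 125.0
```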
The Immutable Truth: The “Source of Truth” shifted from a mutable relational database to an append-only Event Store (using Kafka/Pulsar). This ensured a perfect audit trail of every state change (e.g., PaymentInitiated, PaymentSettled), enabling “Time Travel” debugging.
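As an illustration of the append-only model, the sketch below publishes domain events to a Kafka topic using kafka-python. The topic name, broker address, and event schema are assumptions for this example; the same idea applies to a Pulsar producer.

```python
# Sketch: appending immutable domain events to Kafka as the source of truth.
# Topic name, schema, and broker address are illustrative assumptions.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",            # do not acknowledge a state change until it is durable
)

def append_event(event_type: str, payment_id: str, payload: dict) -> None:
    """Append-only: events are never updated or deleted, only added."""
    event = {
        "type": event_type,            # e.g. "PaymentInitiated", "PaymentSettled"
        "payment_id": payment_id,
        "occurred_at": time.time(),
        "payload": payload,
    }
    # Keying by payment_id keeps all events for one payment in order
    # within a single partition.
    producer.send("ledger.payment-events", key=payment_id, value=event)

append_event("PaymentInitiated", "p-1", {"user_id": "u-42", "amount": 125.00})
append_event("PaymentSettled", "p-1", {"settled_by": "clearing-house"})
producer.flush()
```

Because the log is never mutated, replaying it from offset zero reproduces any historical state, which is what makes the “Time Travel” debugging mentioned above possible.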
Polyglot Persistence (Projections): We acknowledged that no single database fits every need. We built specialized “Read Models” (Projections) by consuming the event stream (see the sketches after this list):
- Elasticsearch: Optimized for fuzzy search and complex filtering criteria used by support staff.
- Data Lake (S3/Parquet): Optimized for batched OLAP workloads. We used partition strategies (Year/Month/Day) to enable the Data Science team to query years of history using Athena/Presto without touching production.
- Redis: Optimized for key-value lookups (e.g., “Get User Balance”) with sub-millisecond latency.
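Each read model is fed by a projection worker that tails the event stream. The sketch below, assuming kafka-python, redis-py, and the official Elasticsearch client, shows one such worker maintaining the Redis balance view and the Elasticsearch search index; topic, index, and key names are illustrative, and idempotency, retries, and checkpointing are omitted for brevity.

```python
# Sketch of a projection worker: consume the event stream and maintain
# two read models (Redis for balance lookups, Elasticsearch for search).
# Names and addresses are illustrative; error handling is omitted.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch
import redis

consumer = KafkaConsumer(
    "ledger.payment-events",
    bootstrap_servers="localhost:9092",
    group_id="projections-v1",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",   # a new projection can rebuild itself by replaying history
)
balances = redis.Redis(host="localhost", port=6379, decode_responses=True)
search = Elasticsearch("http://localhost:9200")

for message in consumer:
    event = message.value
    if event["type"] == "PaymentInitiated":
        payload = event["payload"]
        # Redis read model: O(1) "Get User Balance" lookups.
        balances.incrbyfloat(f"balance:{payload['user_id']}", payload["amount"])
        # Elasticsearch read model: fuzzy search and filtering for support staff.
        search.index(
            index="payments",
            id=event["payment_id"],
            document={
                "user_id": payload["user_id"],
                "amount": payload["amount"],
                "occurred_at": event["occurred_at"],
            },
        )
```

Because each projection runs in its own consumer group and can replay from the earliest offset, a new or corrected read model can be rebuilt from history without touching the write side.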
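For the Data Lake projection, the Year/Month/Day layout can be produced by writing Hive-style partitioned Parquet. The sketch below uses pyarrow; the bucket name and columns are assumptions, and it presumes S3 credentials are already configured in the environment.

```python
# Sketch: write batched events as Parquet partitioned by year/month/day so
# Athena/Presto can prune partitions instead of scanning billions of rows.
# Bucket and column names are illustrative.
import pyarrow as pa
import pyarrow.dataset as ds

batch = pa.table({
    "payment_id": ["p-1", "p-2"],
    "user_id": ["u-42", "u-7"],
    "amount": [125.00, 80.50],
    "year": [2023, 2023],
    "month": [11, 11],
    "day": [14, 15],
})

# Produces Hive-style paths such as .../year=2023/month=11/day=14/part-0.parquet,
# which Athena and Presto use for partition pruning.
ds.write_dataset(
    batch,
    base_dir="s3://ledger-data-lake/payments",
    format="parquet",
    partitioning=["year", "month", "day"],
    partitioning_flavor="hive",
    existing_data_behavior="overwrite_or_ignore",
)
```

A query that filters on year, month, and day then scans only the matching partitions rather than the full history.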
The Outcome
- Operational Velocity: Report generation time dropped from 48 hours to near real-time (<5 seconds of lag).
- Scalability: The system successfully handled the 4.5B record load. Write spikes no longer affected read performance, as they were buffered by the event bus.
- Data Democratization: Enabled the Data Science team to run complex, resource-intensive queries against the Data Lake without affecting production performance or risking UI timeouts.