Discovering Kappa Architecture the hard way

instead of learning from

The Lambda architecture
Questioning the Lambda Architecture

Honza @Novoj Novotný

Problem introduction

Generating click / scroll heatmaps

DEMO

Constraints

  • big but simple structured data (time, x, y, url, viewport)
  • constant input stream - 24h/day
  • heatmaps must be returned within a second
  • minor data loss is acceptable

Architecture: relational access

  • MySQL - click / scroll per row
  • SQL group by to get aggregated data
  • gradients generated by client
    • storage effective
    • computational load moved to client
  • working prototype in 24 hours

Observations

  • input stream serialization
    sufficient performance
  • generating heatmap data
    noticeably slowing from 500k records
    becoming unusable from 1m records upwards (takes seconds)
  • client handles maximum of thousands gradients in realtime
    we need to preprocess data on server
  • algorithm for excluding uninteresting points based on DB triggers
    not sufficient, programmatic access necessary

Architecture: index + diff

  • MySQL
    • record per row only for current day
    • precomputed indexes for previous days / months
    • clearing journal table after index computation
      we need to keep row count low
  • current day computed on the fly - the old way
  • night jobs compute day/month indexes
    • Kryo serialized binary in MySQL BLOB
  • SQL reads several rows + map / reduce

Observations

  • input stream serialization
    sufficient performance
  • querying (reducing) milions of records
    within 1 secs
  • do we need ACID properties for our task?
    not at all - choosing db with less guarantees might add performance boost
  • jobs are potential bottleneck
    we need to ensure that daily data are converted to indexes on regular basis
    when to execute jobs (time zones)?!
    unpredictable load peaks or data processing delays
    how to repair incorrect indexes?

Lambda Architecture

lambda-architecture.net

Lambda Architecture example

count hashtag appearances in tweets by day / hour
lambda-architecture.net

  1. Tweets are ingested from Kafka
  2. Trident (STORM) saves data to HDFS
    Trident (STORM) computes counts and stores them in memory
  3. Hadoop MapReduce procesess files on HDFS and generates others with counts of hashtags by date
  4. SploutSQL indexes file with counts and deploys it to the SploutSQL cluster
  5. Trident (STORM - DRPC) handles queries by combining suqueries to memory state and SploutSQL indexes

Questioning Lambda Architecture

Pros

  • keeping original data log enables reprocessing original data in case of bug introduction or algorithm evolution
  • beats CAP theorem by combining multiple systems with different tradeoffs?!? #probablyNot

Cons

  • you need to implement application logic twice → Hadoop MapReduce jobs + Trident (STORM) implementation #costly #hardToMaintain #bugProne
  • you may use abstraction (SummingBird for example) but you will operate on least common denominator #anotherLevelOfAbstraction
  • anyway it requires deep knowledge of both subsystems - realtime / batch

Architecture: streaming access

  • Mongo DB instead of MySQL
    replicated cluster (write/read node) + arbiter on balancer
  • chunked flat files = journal
    journal ZIPped and backed up
  • indexes for day / month computed on the fly
    merged with MongoDB index on EhCache evict
  • Kryo serialized blobs in Mongo DB binary field
    storage and network effective, must be updated as a whole
  • querying several documents + live EhCache index
    → map / reduce
  • unified processing logic
  • no nightly jobs
    cache evict distributes batch updates through all the time

Observations

  • input stream serialization
    performance 3k reqs/sec
  • handles milions of records query
    within 0,5 secs
  • runs pretty well on commodity HW
    several hundreds CZK/month
  • better scaling possibility
    reading from secondaries
    sharding
  • algorithm evolution
    replay tool can easily reprocess files from journal via original streaming API

Performance testing

Hits per second
Hits per second
Traffic in bytes
Traffic in bytes
CPU
CPU load
memory
Memory

Current database size

no BIG data yet, no SMALL data already

MongoDB stats


{
    "db" : "monkeyTracker",
    "objects" : 3908201,
    "avgObjSize" : 395.4592550383156,
    "dataSize" : 1545534256,
    "storageSize" : 1913036800,
    "indexSize" : 756549808,
    "fileSize" : 4226809856,
}
                        

Records processed since 11/2014

Month Clicks Scrolls
November 4,641,660 2,668,661
December 8,016,352 3,940,576
January 8,088,716 4,557,283
February 9,759,176 5,012,504
Total 33,931,572 17,402,555

Kappa Architecture

Questioning the Lambda Architecture (LinkedIn)
www.Kappa-Architecture.com

Not without problems ...

  • exactly once strategy
  • connectors
  • maturity

Try MonkeyTracker on your own!

MonkeyTracker

Honza Novotný, FG Forrest
@novoj
http://blog.novoj.net