Data Stream Management

####1. DSMS

  1. Abstract: a DSMS (Data Stream Management System) handles real-time, dynamic data and supports relatively fixed, real-time, continuous queries. A DBMS is pull-based; a DSMS is push-based.
  2. Applications: sensor networks, network traffic analysis, financial tickers, transaction log analysis, network click event analysis, etc.
  3. Data stream: a sequence of tuples, each consisting of a set of attributes, similar to a row in a database table. Includes transactional data streams and log data streams.
  4. Motivation:
    • massive data sets
    • high detail data
    • near real-time analysis
    • traditional data feeds, near real-time.
    • anomaly detection
    • disk performance: memory I/O is much faster than disk I/O.
  5. Requirements
    • data model and query semantics.
    • query processing
    • data reduction (the most important requirement for big data)
    • real-time reactions
    • long running queries
    • scalability
  6. DSMS architecture
    • functional architecture: input module -> system monitor -> query tree (quick query) -> output module.
    • 3-level architecture: most processing happens online, on data held in memory.
  7. data models
    • tuples (relation-based and object-based)
    • window model
      • sliding (most used)
      • jumping
      • overlapping
    • timestamps
  8. Queries
    • DBMS: one-time queries; DSMS: continuous queries.
    • DBMS: exact query answers; DSMS: approximate query answers, via data reduction (summary, poly (wavelet transform, approximate search, BlinkDB, query optimization)) and batch processing.
  9. Query language
    • SQL-like
    • model definition language.
    • Selections and projections: both are straightforward.
    • Joins: as in SQL.
    • Aggregations: e.g. group, sum, count, min, max.
  10. query optimization
    • stream rate based
    • resource based
    • QoS based
  11. Data reduction techniques
    • aggregation
    • load shedding
    • sampling
    • sketches
    • wavelets
    • histograms.
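
The sliding window model and the continuous, push-style queries listed in the outline above can be illustrated with a minimal sketch. It assumes a count-based window (the last `size` tuples) and a running average as the aggregate; the class and function names are hypothetical and are not taken from any particular DSMS.

```python
from collections import deque


class SlidingWindowAverage:
    """Count-based sliding window that keeps a running average (illustrative only)."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.window = deque()  # holds at most `size` most recent values
        self.total = 0.0       # running sum of the values in the window

    def push(self, value):
        """Accept one new tuple and return the updated aggregate."""
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            # Evict the oldest value so memory use stays bounded.
            self.total -= self.window.popleft()
        return self.total / len(self.window)


def continuous_query(stream, size):
    """Emit a new answer every time a tuple arrives (push model)."""
    query = SlidingWindowAverage(size)
    for value in stream:
        yield query.push(value)


if __name__ == "__main__":
    for answer in continuous_query([1, 2, 3, 4, 5], size=3):
        print(answer)
```

The query keeps only `size` values in memory regardless of how long the stream runs, which is also a simple form of data reduction by aggregation.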

####2. How to build a DSMS

####3. Introduction to Storm

  1. Storm

Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is easy, scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

  1. Storm solution
    • in memory
    • Permanent process
    • Real-time handling
    • DAG model
    • queue + worker.
  2. Spout and Bolt.
    • Storm in action
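
The "queue + worker" and DAG ideas behind spouts and bolts can be sketched in plain Python. This is not Storm's API: it uses only the standard library, and every name in it (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) is hypothetical. A real stream is unbounded; here a sentinel ends the toy stream so the example terminates.

```python
import queue
import threading
from collections import Counter

SENTINEL = None  # assumption: marks the end of the toy stream


def sentence_spout(out_q, sentences):
    """Spout: the source of the stream; emits one tuple per sentence."""
    for sentence in sentences:
        out_q.put(sentence)
    out_q.put(SENTINEL)


def split_bolt(in_q, out_q):
    """Bolt: consumes sentences and emits one tuple per word."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)  # pass the end marker downstream
            break
        for word in item.split():
            out_q.put(word)


def count_bolt(in_q, counts):
    """Bolt: consumes words and keeps a running count per word."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        counts[item] += 1


def run_topology(sentences):
    """Wire spout -> split bolt -> count bolt, one worker thread per node."""
    q1, q2 = queue.Queue(), queue.Queue()
    counts = Counter()
    workers = [
        threading.Thread(target=sentence_spout, args=(q1, sentences)),
        threading.Thread(target=split_bolt, args=(q1, q2)),
        threading.Thread(target=count_bolt, args=(q2, counts)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counts


if __name__ == "__main__":
    print(run_topology(["a b a", "b c"]))
```

Each node runs as a long-lived worker and communicates only through in-memory queues, which mirrors the "in memory", "permanent process", and "queue + worker" points in the list above.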

####keywords: query tree, NVM, BlinkDB, PCM, Storm, graph stream, data reduction, hash algorithms.

Published: June 09 2015
