Big Data Modelling & Management

Data Model

Basic Data Operations: selection, projection, union and join
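As a rough illustration of three of these operations, here is a minimal Python sketch over lists of dicts. The relations `people` and `cities` and the helper names are hypothetical, made up for this example only.

```python
# Hypothetical toy relations, represented as lists of dicts.
people = [
    {"id": 1, "name": "Ada", "city_id": 10},
    {"id": 2, "name": "Bob", "city_id": 20},
    {"id": 3, "name": "Cy", "city_id": 10},
]
cities = [
    {"city_id": 10, "city": "Paris"},
    {"city_id": 20, "city": "Rome"},
]

def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """Projection: keep only the named columns of every row."""
    return [{c: r[c] for c in columns} for r in rows]

def join(left, right, key):
    """Join: pair up rows from both relations that agree on the key."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

in_paris = select(join(people, cities, "city_id"), lambda r: r["city"] == "Paris")
names = project(in_paris, ["name"])
print(names)
```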

Structure

Person{
firstName: string,
lastName: string,
DOB: date
}
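One possible way to express the same structure in code is a Python dataclass. This is only a sketch; the sample values are invented.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Person:
    # Field names mirror the structure in the notes.
    firstName: str
    lastName: str
    DOB: date

# Invented sample record.
p = Person("Ada", "Lovelace", date(1815, 12, 10))
print(p)
```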

Relational Data Model

Represented as a table; each relational tuple is a row in the table

Atomic: represents one unit of information and cannot be decomposed further

Header tells the constraints:

ID: Int Primary Key | Fname: string Not null

Foreign Key: refers to the primary key in the parent table; a foreign key need not be unique
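The constraints above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the `dept` and `emp` tables and their rows are hypothetical, and SQLite only enforces foreign keys once the pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE emp ("
    " id INTEGER PRIMARY KEY,"
    " fname TEXT NOT NULL,"
    " dept_id INTEGER REFERENCES dept(id))"
)
conn.execute("INSERT INTO dept VALUES (1, 'Sales')")

# Two child rows may share one foreign-key value: it need not be unique.
conn.execute("INSERT INTO emp VALUES (1, 'Ada', 1)")
conn.execute("INSERT INTO emp VALUES (2, 'Bob', 1)")

try:
    conn.execute("INSERT INTO emp VALUES (3, 'Cy', 99)")  # there is no dept 99
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

emp_count = conn.execute("SELECT COUNT(*) FROM emp").fetchone()[0]
print(fk_rejected, emp_count)
```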

Union Operation: UNION removes duplicate records (where all columns in the results are the same), UNION ALL does not.
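The difference is easy to see with Python's built-in sqlite3 module. The two one-column tables `a` and `b` below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION drops the duplicate row (2); UNION ALL keeps both copies.
union_rows = conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x"
).fetchall()
union_all_rows = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x"
).fetchall()
print(union_rows)
print(union_all_rows)
```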

Structured vs. Semi-structured (as tree-structured)

Web pages: HTML, XML (a generalization of HTML in which the beginning and end markers within the angle brackets can be any string), JSON

XML allows the querying of both schema and data

JSON: key-value pairs grouped into tuples inside curly braces; square brackets indicate arrays

Treat semi-structured data as a tree -- this allows navigational access to the data
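A small sketch of the tree view, using Python's json module. The sample document and the `walk` helper are made up for illustration; each leaf is reached by a path of keys and array indexes.

```python
import json

# Invented sample document.
doc = json.loads('{"person": {"name": "Ada", "phones": ["555-0100", "555-0101"]}}')

def walk(node, path=()):
    """Yield (path, leaf_value) pairs by walking the tree depth-first."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))
    else:
        yield path, node

leaves = list(walk(doc))
for path, value in leaves:
    print(path, value)
```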

Vector Space Model

Text Data

Document vector model, also called the "vector model"

TF-IDF: term frequency and inverse document frequency

$IDF = \log_2\left(\frac{\text{number of documents}}{\text{document frequency}}\right)$

Using log2 is mostly a convention in many areas of computer science; IDF acts as a penalty for terms that are too widely used.
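A minimal sketch of the formula in Python. The document counts are made up; note how a term that appears in every document gets an IDF of zero, which is the penalty described above.

```python
import math

def idf(num_documents, document_frequency):
    """Inverse document frequency with a base-2 logarithm."""
    return math.log2(num_documents / document_frequency)

rare = idf(3, 1)        # term appears in 1 of 3 documents
common = idf(3, 2)      # term appears in 2 of 3 documents
everywhere = idf(3, 3)  # term appears in every document
print(rare, common, everywhere)
```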

Term Frequency
[Figure: term frequency]

TF-IDF Matrix

Length of d1 = sqrt(0.584^2 * 3) = 1.011
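The notes do not list the documents behind this number. The sketch below assumes a three-document corpus ("new york times", "new york post", "los angeles times"), chosen only because it gives an IDF of about 0.585 for the terms of d1, which matches the 0.584 used above after truncation.

```python
import math

# Assumed corpus (not given in the notes).
docs = {
    "d1": "new york times",
    "d2": "new york post",
    "d3": "los angeles times",
}
vocab = sorted({t for text in docs.values() for t in text.split()})

def idf(term):
    df = sum(term in text.split() for text in docs.values())
    return math.log2(len(docs) / df)

def tfidf_vector(text):
    tokens = text.split()
    max_tf = max(tokens.count(t) for t in tokens)
    # Term frequency is normalized by the most frequent term in the document.
    return [tokens.count(t) / max_tf * idf(t) for t in vocab]

matrix = {name: tfidf_vector(text) for name, text in docs.items()}
length_d1 = math.sqrt(sum(w * w for w in matrix["d1"]))
print(vocab)
print(round(length_d1, 3))
```

With unrounded IDF values the length comes out near 1.013 rather than 1.011; the small gap comes from truncating 0.585 to 0.584.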

Query Vector

If the query term is "new new york", then the query vector would be:

[0 0 (2/2)(0.584) 0 0 (1/2)(0.584)] corresponding to

[angeles los new post times york], where each factor is $\frac{\text{number of occurrences of the term}}{\max_i \text{frequency}_i}$

A similarity function between two vectors measures how close they are to each other: use cosine similarity
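A minimal cosine similarity sketch in Python. The query vector is the one from the notes; the document vector `d1` is an assumption (a document containing new, york, and times once each, consistent with the three 0.584 entries in the length computation).

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Term order: [angeles, los, new, post, times, york]
query = [0, 0, (2 / 2) * 0.584, 0, 0, (1 / 2) * 0.584]  # "new new york"
d1 = [0, 0, 0.584, 0, 0.584, 0.584]                     # assumed document vector
similarity = cosine_similarity(query, d1)
print(round(similarity, 4))
```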

Query Term Weighting

Every query term may optionally be associated with a weighting term

Q = York times^2 post^5, so that wt(York) = 1/(1+2+5) = 1/8 = 0.125, wt(times) = 2/8 = 0.25, and wt(post) = 5/8 = 0.625.

Multiply the query vector with these weights.
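The weighting above can be sketched as follows. The boosts come from the example query; the TF-IDF values in `query_vector` are hypothetical placeholders.

```python
def normalize_boosts(boosts):
    """Turn raw boosts (an unmarked term counts as 1) into weights that sum to 1."""
    total = sum(boosts.values())
    return {term: boost / total for term, boost in boosts.items()}

weights = normalize_boosts({"york": 1, "times": 2, "post": 5})

# Hypothetical TF-IDF entries of the query vector, one per query term.
query_vector = {"york": 0.584, "times": 0.584, "post": 1.584}
weighted_query = {t: v * weights[t] for t, v in query_vector.items()}
print(weights)
print(weighted_query)
```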

Graph Data Model

Connected Network, Anomalous Network, Connected Components, Shortest Path
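Two of these graph operations can be sketched in plain Python. The adjacency list below is a made-up undirected graph with two components; breadth-first search gives a shortest path when edges are unweighted.

```python
from collections import deque

# Hypothetical undirected graph as an adjacency list.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def connected_components(graph):
    """Collect the sets of nodes that are mutually reachable."""
    seen, components = set(), []
    for node in graph:
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:
            cur = stack.pop()
            if cur not in component:
                component.add(cur)
                stack.extend(graph[cur])
        seen |= component
        components.append(component)
    return components

print(shortest_path(graph, "a", "c"))
print(connected_components(graph))
```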

Other Data Models

Array as a data model; arrays of vectors (retrieve the whole vector at once, as with image data)

Data Format vs. Data Model

For example, CSV is a data format, but if it is plotted as a graph, then it's a data model.

Data Stream

Managing and processing data in motion is a typical capability of a streaming data system.

Dynamic steering is often a part of streaming data management and processing -- dynamically changing the next steps or direction of an application through a continuous computational process using streaming (e.g. a self-driving car).

Data-at-rest vs. Data-in-motion

Data-at-rest: mostly static data from one or more sources; collected prior to analysis; batch processing
Data-in-motion: analyzed as it is generated (e.g. sensor data from a self-driving car); stream processing

Property of Streaming data processing: Unbounded size, but finite time and space

The modelling and management of streaming data should enable computations on one data element or on a small window of recent data elements; no interaction with the data source.
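A minimal sketch of a computation over a small window of recent elements, using a bounded deque so that memory stays fixed however long the stream runs. The readings and the window size are made up.

```python
from collections import deque

def windowed_mean(stream, size):
    """Yield the mean of the last `size` elements after each new element arrives."""
    window = deque(maxlen=size)  # old elements fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

readings = iter([10, 20, 30, 40])  # stands in for an unbounded source
means = list(windowed_mean(readings, size=2))
print(means)
```

The generator only consumes what the source pushes to it and never asks the source for anything, which matches the "no interaction" property.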

Lambda Architecture

The streaming wheel over the real-time data is managed and kept until those data elements are pushed into a batch system and become available for access as batch data.

Big Data Challenges: scalability, data replication, durability and fault tolerance. The two main challenges are avoiding data loss and enabling real-time analytical tasks.

Streaming data changes can be periodic (evenings, weekends) and sporadic (major events, breaking news)

What's Streaming? Utilizing real-time data to compute and change the state of an application continuously.

Streaming data must be handled differently than static data

Size --> unbounded

Size and Frequency --> Unpredictable

Processing --> Fast and Simple

Properties of Working with Streaming Data

Does not ping the source interactively for a response upon receiving the data
Small time windows for working with data
Data manipulation is near real time
Independent computations that do not rely on previous or future data

Data Lake as a massive storage depository

load data from source --> store raw data --> add data model (structure) on read

Enable batch processing of streaming data

Schema on write vs schema on read

Organize data streams, data lakes and data warehouses on a spectrum of big data management and storage

In a Traditional Data Warehouse: schema-on-write -- transform and structure before load; any application using the data needs to know the format in order to retrieve and use the data.

Data Lake: schema-on-read -- data is stored as raw data until it is read by an application, at which point the application assigns structure. All data is stored for a potentially unknown use at a later time; object storage.
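A toy sketch of schema-on-read in Python. The in-memory list `raw_store` is a hypothetical stand-in for a data lake, and the sensor records are invented; nothing is parsed at write time, and the reading application decides which fields matter.

```python
import json

raw_store = []  # hypothetical stand-in for a data lake: raw records kept as-is

def ingest(line):
    """Write path: store the raw record with no parsing or validation."""
    raw_store.append(line)

def read_temperatures():
    """Read path: the application assigns structure and picks the fields it needs."""
    for line in raw_store:
        record = json.loads(line)
        if "temp_c" in record:
            yield float(record["temp_c"])

ingest('{"sensor": "s1", "temp_c": "21.5"}')
ingest('{"sensor": "s2", "humidity": 40}')
temps = list(read_temperatures())
print(temps)
```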

[Figure: object storage]

Data Lake Summary

A big data storage architecture
collects all data for current and future analysis
transforms data format only when needed
supports all types of big data users
infrastructure components which evolve over time based on application-specific needs (perhaps most important)

DBMS

Advantages

Declarative query languages
Data Independence
Efficient access through optimization
Data Integrity and security

ACID properties of transaction (buying tickets): Atomicity, consistency, isolation and durability

Atomicity: all of the changes that we need to make for these promotions must happen together, as a single unit.

Consistency: data must be valid according to all defined rules including constraints

Durability: once a transaction is committed, it remains committed even after a power loss.

Concurrency (isolation): many users can access data simultaneously without conflict; transactions must behave as if they were executed serially
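Atomicity and consistency can be observed with Python's built-in sqlite3 module. The sketch uses a hypothetical account transfer rather than the ticket example: when the second update violates the CHECK constraint, the first update is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 50), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Apply both updates as one unit; return False if the unit was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

failed = transfer(conn, "alice", "bob", 80)     # would overdraw alice
succeeded = transfer(conn, "alice", "bob", 30)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(failed, succeeded, balances)
```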

Traditional DBMSs handle big data through parallel and distributed database systems

Distributed DBMS: data is stored across several sites, each site managed by a DBMS capable of running independently.

MapReduce-style systems: complex data processing over a cluster of machines; the number of machines can go up to thousands

Mixed Solutions: DBMS on HDFS; Relational operations in MapReduce Systems like Spark; Streaming input to DBMS

BDMS: Big Data Management System

Desired Characteristics:

A flexible, semistructured data model (schema first to schema never)
Support for today's common "big data types" (textual, temporal, spatial data values)
A full query language (expectedly at least the power of SQL)
An efficient parallel query runtime
Wide range of query sizes
Continuous data ingestion (stream ingesting)
Scale gracefully to manage and query large volumes of data (use large clusters)

ACID properties hard to maintain in a BDMS

BASE relaxes ACID: BA as basic availability; S as soft state (the state of the system is very likely to change over time); E as eventual consistency (the system will eventually become consistent once it stops receiving input)

CAP Theorem: a distributed computer system (one whose network may be partitioned, so that nodes in different parts of the network have different content) cannot achieve all three of the following at the same time:

Consistency: all nodes see the same data at any time

Availability: every request receives a response about success or failure

Partition Tolerance: the system continues to operate despite arbitrary partitioning due to network failures

Most of the big data systems available today adhere to BASE properties, although several modern systems do offer stricter ACID properties, or at least several of them.

Redis: an enhanced key-value store (in-memory data structure store for fast processing); supports strings, hashes, lists, sets, sorted sets; used by Twitter

Look-up Problem:

(key: string, value: string). Redis strings are binary-safe and can be up to 512 MB in size, so binary content such as an image can itself be used as the key.
Keys may have internal structure and expiry, such as user.commercial.entertainment and user.commercial.entertainment.movie-industry
(key: string, value: list): userID:[ tweetID1, tweetID2, ...] and ziplists (used by Redis) to compress lists in memory without changing the content --> significant reduction in memory use
(key:string, value: attribute-value pairs): Redis hashes
Redis scalability: Range partitioning (e.g. user record numbers are divided into bins, and each bin goes to a different machine); Hash partitioning (e.g. H-function("key string") % 10 = 2, so the record goes to machine 2)
Redis replication: Master-Slave mode replication (clients write to master, master replicates to slaves; clients read from the slaves to scale-up read performance; slaves are mostly consistent)
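The two partitioning schemes can be sketched in plain Python. This is not Redis code: the bin boundaries are hypothetical, and `zlib.crc32` is only a stand-in for the unspecified H-function in the notes.

```python
import bisect
import zlib

NUM_MACHINES = 10
# Hypothetical exclusive upper bounds: machine 0 holds records below 1000, and so on.
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]

def range_partition(record_number):
    """Pick a machine by finding the bin that contains the record number."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, record_number)

def hash_partition(key, num_machines=NUM_MACHINES):
    """Pick a machine from a stable hash of the key, modulo the machine count."""
    return zlib.crc32(key.encode("utf-8")) % num_machines

print(range_partition(999), range_partition(1000), range_partition(2500))
print(hash_partition("key string"))
```

Range partitioning keeps neighbouring keys together, while hash partitioning spreads keys more evenly but scatters ranges.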

Aerospike: a distributed NoSQL database and key value store for web-scale applications

I/O optimized for SSD storage
Data Types: standard scalar, lists, maps (like a hash table), geospatial, large objects
KV store operations allow for geospatial queries such as point-in-polygon or restaurants within 3 miles
AQL: SQL-like language (SELECT name, age FROM users.profiles)
Aerospike ensures ACID guarantees (Consistency -- all copies of data items are in sync )

MongoDB: dominant store for JSON-style semi-structured data

AsterixDB: provides ACID guarantees

type as open: the actual data can have more attributes than specified in the schema
geo: point? means this attribute is optional
access external data (real-time data from files in a directory path)

[Figure: AsterixDB feed]

Solr: for large-scale text data searching

Basic challenges with text: defining a match (capitalization, structural punctuation, nominal variations, synonyms, abbreviations, initialisms)
[Figure: text matching challenges]

Engine built on Lucene, using an inverted index (Vocabulary: all terms in a collection of documents; Occurrence: for each term in the collection, a list of doc IDs with [positions of occurrence])
Functionality: Enterprise Search Platform; Inverted Index; Faceted Search and Term Highlighting
Tokenizer (as filters)
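The inverted index idea can be sketched in a few lines of Python. This is not Lucene or Solr code: the two documents are made up, and the "tokenizer" is just lowercasing plus a whitespace split.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Toy tokenizer: lowercase, then split on whitespace.
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index

docs = {1: "New York Times", 2: "New York Post"}
index = build_inverted_index(docs)
print(dict(index["new"]))
print(dict(index["post"]))
```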

[Figure: tokenizer]

Vertica -- A columnar DBMS
[Figure: Vertica]

Space Efficiency -- column stores keep columns in sorted order, so contiguous repeated values do not have to be stored individually; Run-length encoding (e.g. 1/1/2007 - 16 records)

Frame-of-reference encoding (fix a number and only record the difference)
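Both encodings can be sketched in plain Python. This is an illustration of the ideas only, not Vertica's actual storage format; the sample column values are made up, and taking the minimum as the reference is one possible choice.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def frame_of_reference_encode(values, reference=None):
    """Store one reference number plus each value's difference from it."""
    reference = min(values) if reference is None else reference
    return reference, [v - reference for v in values]

def frame_of_reference_decode(reference, deltas):
    return [reference + d for d in deltas]

dates = ["1/1/2007"] * 16 + ["1/2/2007"] * 3  # sorted column with long runs
rle = run_length_encode(dates)
reference, deltas = frame_of_reference_encode([1003, 1001, 1007])
print(rle)
print(reference, deltas)
```

Run-length encoding pays off because sorted columns contain long runs, and frame-of-reference encoding pays off when the differences need fewer bits than the raw values.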

Column-Groups: frequently co-accessed columns behave as mini row-stores within the column store

Update Performance Slower: internal conversion from row representation to column representation, followed by compression. (Slowness can be perceptible for large uploads or updates.)

Enhanced Suite of Analytical Operations allows for many more statistical functions than in a classical DBMS; analytical computations happen inside the database instead of in an external application
[Figure: Vertica analytics]

Vertica and Distributed R: uses a master node (schedules tasks and sends code) and worker nodes (maintain data partitions, not necessarily of equal size, and compute); uses a data structure called dArray, or distributed array
[Figure: workflow]