
Theorems and definitions

PACeLC theorem

Abadi’s “Consistency Tradeoffs in Modern Distributed Database System Design”, 2012.

Theorem: in case of a network Partition (P), the system must choose between Availability (A) and Consistency (C); Else (E), under normal operation, it must choose between low Latency (L) and Consistency (C).

|    | PC | PA |
|----|----|----|
| EC | MongoDB, BigTable/HBase and fully ACID$^{[1]}$ systems: VoltDB/H-Store, Megastore, MySQL Cluster | Most in-memory datagrids (Apache Ignite, Hazelcast IMDG) |
| EL | PNUTS | DynamoDB, Cassandra, Riak, Cosmos DB |

PACeLC is an extension of the CAP theorem.

ACID DataBase properties (in short)

A transaction is a sequence of database operations that satisfies the following rules:

- Atomicity: the transaction is applied entirely or not at all.
- Consistency: the transaction brings the database from one valid state to another valid state (constraints are preserved).
- Isolation: concurrent transactions do not interfere with each other; the outcome is as if they had run one after the other (up to the chosen isolation level).
- Durability: once committed, a transaction's effects survive crashes.

Frameworks’ notes

Spark

See dedicated notes

Hadoop’s MapReduce

Execution steps of a MapReduce job containing 1 Mapper and 1 Reducer (steps in bold rely on hook classes exposed to the user for extension):
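The Mapper and Reducer hooks can be illustrated without the Java hook classes by using Hadoop Streaming; below is a minimal word-count sketch in Python (the script name, paths and submission command are hypothetical, not taken from these notes).

#!/usr/bin/env python3
# Minimal Hadoop Streaming word count: the same script acts as mapper or reducer
# depending on its first argument.
import sys
from itertools import groupby

def mapper():
    # map step: emit one "<word>\t1" line per word
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # reduce step: lines arrive sorted by key (shuffle/sort), so consecutive
    # identical words can be summed with groupby
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

# Hypothetical submission:
# mapred streaming -input in/ -output out/ \
#   -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py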

ElasticSearch

1h Elasticsearch Tutorial & Getting Started

Parallels with distributed relational databases

| Elastic Search | Relational DataBase |
|----------------|---------------------|
| Cluster | DataBase engine & servers |
| Index | Database |
| Type | Table |
| Document | Record |
| Property | Column |
| Node | Node |
| Index Shard | DB Partition |
| Shard's Replica | Partition's Replica |
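To make the parallel concrete, here is a small sketch of the Elasticsearch REST API called from Python with requests (index name, document and local URL are made up): creating an index maps to creating a database, indexing a document maps to inserting a record, and a search maps to a SELECT.

import requests

ES = "http://localhost:9200"  # hypothetical local node

# "CREATE DATABASE" ~ create an index (number of shards/replicas chosen here)
requests.put(f"{ES}/library", json={"settings": {"number_of_shards": 1, "number_of_replicas": 1}})

# "INSERT INTO" ~ index a document (its properties ~ columns)
requests.put(f"{ES}/library/_doc/1", json={"title": "Dune", "year": 1965})

# "SELECT ... WHERE" ~ search query
hits = requests.post(f"{ES}/library/_search", json={"query": {"match": {"title": "dune"}}}).json()
print(hits["hits"]["hits"])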

Hadoop

JAVA_HOME said to be missing but is not

Getting ERROR: JAVA_HOME is not set and could not be found. even though the variable is exported in your .bashrc and/or .profile? -> You also need to export it by uncommenting (and filling in) the corresponding line in <hadoop home>/etc/hadoop/hadoop-env.sh.

Google Cloud Platform

Submit and monitor a Dataproc spark job (YARN)

Submit

  1. Submit your Spark app from your terminal using the gcloud CLI
    gcloud dataproc jobs submit spark --cluster=<dataproc-cluster-name>  --region=<e.g. europe-west1> --jars gs://<path-to-jarname.jar> --class com.package.name.to.MainClassName --properties 'spark.executor.cores=2,[...]' -- <arg1 for MainClass> <arg2 for MainClass> [...]
    
  2. Visit the page describing your master node VM in Google Cloud Console at:
    https://console.cloud.google.com/compute/instancesDetail/zones/<region, e.g.: europe-west1-c>/instances/<dataproc cluster name>-m
    
  3. Copy the ephemeral external IP address of the master node VM; we will use 104.155.87.75 as an example from now on
  4. Open an SSH connection to that master VM, opening a dynamic (SOCKS) proxy on local port 1080:
    ssh -A -D 1080 104.155.87.75
    
  5. Launch a Chrome/Chromium instance with a proxy on port 1080 (the one going through our SSH connection)
    chromium-browser --user-data-dir --proxy-server="socks5://127.0.0.1:1080"
    
  6. Use this browser to access the YARN cluster UI at http://<dataproc cluster name>-m:8088/cluster/ or the Spark History Server at http://<dataproc cluster name>-m:18080

BigQuery vs BigTable

BigQuery

See: Google’s BigQuery Under the Hood podcast and blog article

Run query using bq cli

cat /path/to/query.sql | bq query --format json > path/to/output

Find and cancel job

bq ls -j -a --max_results=1000
bq cancel --location=europe-west1 "{JOB_ID}"

Get and Update table schema
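One way to do it from Python with the google-cloud-bigquery client, as a sketch (project, dataset, table and column names are made up; the bq CLI can do the same):

from google.cloud import bigquery

client = bigquery.Client()

# Get the current schema
table = client.get_table("my-project.my_dataset.my_table")  # hypothetical table id
print([(field.name, field.field_type) for field in table.schema])

# Update it: adding a NULLABLE column is an allowed in-place schema change
table.schema = list(table.schema) + [
    bigquery.SchemaField("new_col", "STRING", mode="NULLABLE")
]
client.update_table(table, ["schema"])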

Check the slot time consumed by your personal last 10 queries

SELECT
 job_id
 , (CASE WHEN TIMESTAMP_DIFF(end_time,start_time,MILLISECOND) = 0 THEN NULL ELSE total_slot_ms / TIMESTAMP_DIFF(end_time,start_time,MILLISECOND) END) as num_slot
 , total_slot_ms
 , *
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE user_email='foo@bar'
ORDER BY creation_time DESC
LIMIT 10

Set partition expiration

Example: set it to 7 days (7 × 24 × 3600 = 604800 seconds):

bq update --time_partitioning_expiration 604800 --time_partitioning_type DAY project:dataset.table

Delta Lake

DeltaLog & ACID guarantees

0) the DeltaLog

Deltalog = Delta Lake’s transaction log.

The DeltaLog is an ordered collection of JSON files. It acts as the single source of truth, giving users access to the latest version of a DeltaTable's state.
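A quick way to see the DeltaLog in action, as a sketch assuming the delta-spark package is installed and using a throwaway local path:

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# assumes `pip install pyspark delta-spark` (with matching versions)
builder = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta_demo"  # hypothetical table location

# each successful write adds one ordered JSON commit file under <path>/_delta_log/
spark.range(10).write.format("delta").mode("overwrite").save(path)
spark.range(10, 20).write.format("delta").mode("append").save(path)
# -> _delta_log/00000000000000000000.json, 00000000000000000001.json, ...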

1) Atomicity

A commit is atomic because it amounts to adding a single new entry to the DeltaLog: either the commit's JSON file ends up in the log (the change is visible) or it does not (nothing happened); data files written by a failed commit are never referenced and are simply ignored.

2) Consistency

The consistency of a DeltaTable is guaranteed by Delta Lake's strong schema checking: writes whose schema does not match the table's schema are rejected.

3) Isolation

Concurrency of commits is managed to ensure their isolation. An optimistic concurrency control is applied:

- a commit records the version of the DeltaTable it started from,
- it stages its changes (writing new data files) without taking any lock,
- at commit time it tries to append the next log entry; if another commit got there first, the operation is accepted only if it does not conflict with what was committed in the meantime, otherwise it is retried or fails.

4) Durability

Commits containing actions that mutate the DeltaTable's data need to finish their writes/deletions on the underlying Parquet files (stored on the filesystem) before being considered successfully completed, which makes them durable.

Further readings:

Diving Into Delta Lake: Unpacking The Transaction Log

ACID properties

Useful ports

| service | port |
|---------|------|
| YARN jobs UI | 8088 |
| Spark UI | 4040 |
| Spark History Server (Applications level) | 18080 |
| Jupyter notebook | 8888 |

File formats

Parquet

Sources:

Hierarchy

A Parquet file is organized hierarchically: a file contains row groups, a row group contains one column chunk per column, and a column chunk is divided into pages (the unit of encoding and compression).

Compression

A compression algorithm is applied by Parquet on top of the encodings, the compression unit being the page.

Here are the supported algorithms when set through the Spark configuration "spark.sql.parquet.compression.codec": none/uncompressed, snappy (the default), gzip, lzo, brotli, lz4 and zstd.
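For instance in PySpark (a sketch; codec choices and output paths are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# every Parquet page written from now on will be zstd-compressed
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
spark.range(1_000_000).write.mode("overwrite").parquet("/tmp/zstd_table")

# the codec can also be set per write, overriding the session configuration
spark.range(1_000_000).write.mode("overwrite").option("compression", "gzip").parquet("/tmp/gzip_table")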

Encodings

https://github.com/apache/parquet-format/blob/master/Encodings.md

Parquet automatically figures out which encoding to use for each column, depending on the data type (which must be supported by the encoding) and on the characteristics of the values (heuristics decide whether a given encoding is worth using for the data at hand). Dictionary encoding is the default that Parquet tries on each column: it is considered the most efficient encoding when its trigger conditions are met (a small number of distinct values).

Dictionary encoding

If dictionary encoding is enabled, a Parquet row group looks like this: each column chunk's data pages consist of references (indexes) into a dictionary page stored in the same column chunk, and that dictionary page holds the column's distinct values encoded with plain encoding.

Parquet falls back to plain encoding if it turns out there are too many unique values for dictionary encoding to be worthwhile.

In Spark, one can disable dictionary encoding with the configuration "parquet.enable.dictionary" -> "false". This can be useful because dictionary encoding may take the place of a more efficient encoding (for example, dictionary encoding being chosen instead of delta string encoding on BYTE_ARRAY columns); see the sketch below.
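A sketch of what that looks like in PySpark (paths are made up; the option is passed down to the Parquet writer through the per-write Hadoop configuration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).selectExpr("uuid() AS id")  # high-cardinality strings

# default write: Parquet tries dictionary encoding first on every column
df.write.mode("overwrite").parquet("/tmp/with_dictionary")

# same data with dictionary encoding disabled, letting Parquet pick another
# encoding (e.g. delta string encoding on BYTE_ARRAY columns)
(df.write
   .mode("overwrite")
   .option("parquet.enable.dictionary", "false")
   .parquet("/tmp/without_dictionary"))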

Delta encoding

Supported types are INT32 and INT64. This encoding leverages the similarity between successive values, for instance when the integers represent timestamps.

Delta strings encoding

The supported type is BYTE_ARRAY. Also known as incremental encoding, it is another delta encoding: instead of re-writing each value in full, it stores the length of the prefix shared with the previous value plus the remaining suffix, which can compress very similar strings (e.g. close HTML pages) nicely.
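Outside Spark, both delta encodings can be requested explicitly when writing Parquet with pyarrow, as in the sketch below (it assumes a recent pyarrow where write_table exposes column_encoding; per-column encodings require disabling dictionary encoding):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    # increasing integers (e.g. timestamps): a good fit for delta encoding
    "ts": pa.array([1_700_000_000 + i for i in range(1_000)], type=pa.int64()),
    # strings sharing long prefixes: a good fit for delta string (incremental) encoding
    "url": [f"https://example.com/articles/2024/{i:04d}" for i in range(1_000)],
})

pq.write_table(
    table,
    "/tmp/delta_encoded.parquet",  # hypothetical output path
    use_dictionary=False,          # per-column encodings require this
    column_encoding={"ts": "DELTA_BINARY_PACKED", "url": "DELTA_BYTE_ARRAY"},
)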

Data warehouse vs RDBMS vs Key-value store

|  | RDBMS | Data warehouse | Key-Value Store |
|---|---|---|---|
| targeted workload | OLTP | OLAP | OLTP |
| supports relational model | yes | yes | no |
| main feature | performant random access using indexes | decouples scaling of storage and processing resources | in-memory distributed storage for super fast and scalable key retrieval |
| typical data model | 3NF | dimensional model | key-value pairs |
| data format | row-based | columnar | hashmap |
| examples | MySQL, Postgres, Oracle | BigQuery, Snowflake, Redshift | Aerospike, Redis, Ignite |

Cloud-based Data Warehouse solutions

Type Snowflake, BigQuery

basic bricks

Databricks' lakehouse offering is built with the following bricks:

Fun

One ZB (zettabyte, 10**21 bytes, i.e. 10**12 GB or a billion TB) of data kept for one month in archive-class cloud storage costs about one billion dollars (10**12 GB at roughly $0.001 per GB per month ≈ $10**9).