My favorite data engineering papers/articles

Following this tweet, I've decided to keep track of papers related to data engineering and distributed systems.

I have read and enjoyed many of the papers and articles on this list; others are recommendations I received after making the list public.

📑 MapReduce: Simplified Data Processing on Large Clusters
📑 Spark: Cluster Computing with Working Sets
📑 Kafka: a Distributed Messaging System for Log Processing
📑 Dremel: Interactive Analysis of Web-Scale Datasets

💡 A paper describing the technology behind Google BigQuery
📑 Procella: Unifying serving and analytical data at YouTube
📑 The Log: What every software engineer should know about real-time data's unifying abstraction

💡 Not a paper but an article from one of the Kafka creators. He explains the log, the basic data structure at the heart of many databases and modern distributed systems.
📑 Making reliable distributed systems in the presence of software errors
📑 Conflict-free Replicated Data Types
📑 Delta State Replicated Data Types
📑 Time, Clocks, and the Ordering of Events in a Distributed System
📑 Dynamo: Amazon’s Highly Available Key-value Store
📑 Linearizability: A Correctness Condition for Concurrent Objects
📑 Space/Time Trade-offs in Hash Coding with Allowable Errors

💡 Not an especially interesting read, but I had to mention Bloom filters, and this is the original paper. Perhaps the most surprising data structure I have discovered while working with data.
📚 Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

💡 Not a paper but a book. THE book for anyone interested in databases, data, and distributed systems.
📑 Naiad: A Timely Dataflow System

💡 The precursor paper to Materialize
📑 The Snowflake Elastic Data Warehouse
📑 The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing