My favorite data engineering papers/articles
Following this tweet, I've decided to keep track of papers related to data engineering and distributed systems.
I have read and enjoyed many of the papers and articles in this list, but some of them are recommendations I received after making the list public.
📑 MapReduce: Simplified Data Processing on Large Clusters
📑 Spark: Cluster Computing with Working Sets
📑 Kafka: a Distributed Messaging System for Log Processing
📑 Dremel: Interactive Analysis of Web-Scale Datasets
💡 A paper describing the technology behind Google BigQuery
📑 Procella: Unifying serving and analytical data at YouTube
📑 The Log: What every software engineer should know about real-time data's unifying abstraction
💡 Not a paper but an article from one of the Kafka creators. He explains the log, the basic data structure at the heart of many databases and modern distributed systems.
📑 Making reliable distributed systems in the presence of software errors
📑 Conflict-free Replicated Data Types
📑 Delta State Replicated Data Types
📑 Time, Clocks, and the Ordering of Events in a Distributed System
📑 Dynamo: Amazon’s Highly Available Key-value Store
📑 Linearizability: A Correctness Condition for Concurrent Objects
📑 Space/Time Trade-offs in Hash Coding with Allowable Errors
💡 Not an especially engaging read, but I had to mention Bloom filters, and this is the original paper. Perhaps the most surprising data structure I have discovered while working with data.
📚 Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
💡 Not a paper but a book. THE book for anyone interested in databases, data, and distributed systems.
📑 Naiad: A Timely Dataflow System
💡 The precursor paper to Materialize
📑 The Snowflake Elastic Data Warehouse
📑 The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
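
💡 For anyone curious why Bloom filters (see the Space/Time Trade-offs paper above) feel so surprising, here is a minimal sketch of the idea in Python. It is only an illustration, not the construction from the paper: the class name `BloomFilter`, the method names, the default sizes, and the use of salted SHA-256 to derive bit positions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: never a false negative, sometimes a false positive."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        # Sizes are arbitrary defaults chosen for the example.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True only if all of the item's bits are set; other items may have set them.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("user-42")
print(seen.might_contain("user-42"))  # True: an added item is always reported
```

A `False` answer from `might_contain` is definitive, while a `True` answer only means "probably". That trade of a small error rate for a large saving in memory is what the paper's title refers to.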