My favorite data engineering papers/articles
It's pretty common to see people here sharing papers about small advancements in NLP, Computer Vision, Machine Learning, Deep Learning, and similar fields. Let me share a list of Data papers I've enjoyed reading lately.
— Jordi Villar (@jrdi) November 3, 2021
Following this tweet, I've decided to keep track of papers related to data engineering and distributed systems.
I've read and enjoyed many of the papers and articles on this list; some are recommendations I received after making the list public.
📄 MapReduce: Simplified Data Processing on Large Clusters
📄 Spark: Cluster Computing with Working Sets
📄 Kafka: a Distributed Messaging System for Log Processing
📄 Dremel: Interactive Analysis of Web-Scale Datasets
💡 A paper describing the technology behind Google BigQuery
📄 Procella: Unifying serving and analytical data at YouTube
📄 The Log: What every software engineer should know about real-time data's unifying abstraction
💡 Not a paper but an article by one of Kafka's creators. He explains the log, a simple data structure that is key to many databases and modern distributed systems (see the toy sketch at the end of this list).
📄 Making reliable distributed systems in the presence of software errors
📄 Conflict-free Replicated Data Types
📄 Delta State Replicated Data Types
📄 Time, Clocks, and the Ordering of Events in a Distributed System
📄 Dynamo: Amazon's Highly Available Key-value Store
📄 Linearizability: A Correctness Condition for Concurrent Objects
📄 Space/Time Trade-offs in Hash Coding with Allowable Errors
💡 Not especially exciting to read, but Bloom filters had to be on the list and this is the original paper. Perhaps the most surprising data structure I've discovered while working with data (see the minimal sketch at the end of this list).
📄 Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
💡 Not a paper but a book. THE book for anyone interested in databases, data, and distributed systems.
📄 Naiad: A Timely Dataflow System
💡 The precursor paper to Materialize.
📄 The Snowflake Elastic Data Warehouse
📄 The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
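
As promised in the note on "The Log" above, here's a toy Python sketch of the idea: an append-only sequence of records that consumers read by offset. Everything in it (the class, `append`, `read_from`) is invented for illustration; it's not Kafka's API or code from the article.

```python
class Log:
    """A toy append-only log: an ordered, immutable sequence of records."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        # Writers only ever add to the end; the returned offset is the
        # record's permanent position in the log.
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int) -> list:
        # Readers replay records in order from any offset, so a consumer
        # that remembers its offset can resume exactly where it left off.
        return self._records[offset:]


log = Log()
log.append({"event": "signup", "user": 1})
log.append({"event": "click", "user": 1})
for record in log.read_from(0):
    print(record)
```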
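And the Bloom filter sketch mentioned above: m bits and k hash functions, trading a small false-positive rate for a lot of space. The salted SHA-256 hashing is an assumption made for this example; the original paper doesn't prescribe a hash function.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _indexes(self, item: str):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.bits[idx] = True

    def might_contain(self, item: str) -> bool:
        # False negatives are impossible; false positives are possible,
        # with a rate that grows as the filter fills up.
        return all(self.bits[idx] for idx in self._indexes(item))


bf = BloomFilter()
bf.add("user-42")
assert bf.might_contain("user-42")  # always True once added
print(bf.might_contain("user-99"))  # usually False, occasionally a false positive
```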