DDIA
Legendary book about building scalable data systems.
- Chapters 1-4 are a must read.
- Chapter 1 defines what system characteristics we care about: scalability, reliability, maintainability.
- Chapter 2 discusses data models and query languages - from user's perspective how do you work with data.
- Chapter 3 discusses storage and retrieval - how data is stored on disk, B+ trees, LSM trees - from DB-engineer's perspective. DB has two jobs to store data and to return data when queried.
- Chapter 4 discusses encoding and evolution - how data is encoded on disk, serialization formats (JSON, XML, Protobuf, Avro, Thrift), schema evolution. This is about data representation on a computer.
- Part II is about distributed data (a bit harder to read). Chapter 7 is worth reading to understand what transactions are (ACID, where the 'C' is not related to transactions..)
- Chapter 5& 6 are about replication and partitioning and are ok. Chapter 8-9 not worth so much (or at least I'm not at the stage where I could understand and relate to them).
- Chapter 10 about batch processing is great. Starts from Unix batch, explain MapReduce and all this will help you understand the internals of Spark.