S3 vs DL vs Redshift
S3 (Simple Storage Service)
- pure storage, no SQL or any analytics
- Object storage: NOT columnar, log-structured (LSM), or B-tree based.
- S3 does not understand tables, rows, columns, indexes, or trees.
- All structuring is defined by how you write/read files (e.g., CSV, Parquet, JSON), not by S3 itself!
- stores objects within buckets
- an object is a file and any metadata that describes the file
- a bucket is a container for objects
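The bucket/object model above can be sketched as a toy in-memory store. This is only an illustration: `ToyBucket`, `ToyObject`, `put`, `get`, and `list_keys` are hypothetical names, not the S3 API. The point is that the store keeps opaque bytes plus metadata under a string key and never interprets the contents.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObject:
    """An object: raw bytes plus the metadata that describes them."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class ToyBucket:
    """Hypothetical stand-in for a bucket: a flat key -> object mapping."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, key, data, **metadata):
        # The store never parses `data`; it only keeps the bytes.
        self._objects[key] = ToyObject(data, dict(metadata))

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # "Folders" are just key prefixes; there is no real directory tree.
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = ToyBucket("example-bucket")
bucket.put("logs/2024/01/events.csv", b"id,amount\n1,10\n2,20\n", content_type="text/csv")
bucket.put("logs/2024/02/events.csv", b"id,amount\n3,30\n", content_type="text/csv")

obj = bucket.get("logs/2024/01/events.csv")
print(obj.metadata["content_type"])    # metadata travels with the object
print(bucket.list_keys("logs/2024/"))  # prefix listing, not a directory walk
```

Any notion of rows or columns exists only in how a reader chooses to parse `obj.data`.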
DataLake
- Still Object Storage (in S3)
- Built on top of S3
- An architectural concept or solution—not a product or "service".
- Centralizes and organizes data (usually on S3)
- In the AWS console you go to the S3 service and see many buckets; some of those buckets may serve as data lakes.
- DLs often support query tools (Athena/Databricks) for direct analytics!
Note on S3/DL storage
- DL and S3 are object stores. The objects (files) within may have structure, for example a columnar format like Parquet (or a row-based format like Avro), and you can define a schema on them.
- But: The underlying object store (S3) remains ignorant of the structure; it just stores files. The columnar "magic" happens inside the files themselves.
- You cannot have B-tree or LSM-tree storage indexes in S3/DL itself; for that you need a full-fledged database management system.
Redshift
- Amazon Redshift is a fully managed cloud data warehouse service
- It is not just storage: it is a full-featured SQL analytics database.
- Columnar Storage:
- Redshift stores table data in a columnar format (not row-based, not LSM, not B-tree).
- This means data is stored column by column, which is highly efficient for analytical queries (scanning/aggregating specific columns in very large datasets).
See "How does Yelp store data?" for practical usage.
Note on parquet vs csv
- Parquet files are columnar, meaning they store data by column rather than by row.
- This is different from CSV files, which are row-based.
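A minimal sketch of the difference, using plain Python structures rather than real CSV or Parquet files. The sample records and function names are made up for illustration; real Parquet adds row groups, encoding, and compression on top of this basic idea.

```python
# The same three records in two layouts (made-up sample data).
rows = [
    {"id": 1, "city": "SF", "amount": 10.0},
    {"id": 2, "city": "NYC", "amount": 20.0},
    {"id": 3, "city": "SF", "amount": 30.0},
]

# Row-based (CSV-like): the values of each record are stored together.
row_layout = [list(r.values()) for r in rows]

# Columnar (Parquet-like): the values of each column are stored together.
column_layout = {name: [r[name] for r in rows] for name in rows[0]}


def sum_amount_row_based(layout):
    """Sum the amount field; every value of every record has to be read."""
    values_touched = 0
    total = 0.0
    for record in layout:
        values_touched += len(record)
        total += record[2]  # "amount" is the third field
    return total, values_touched


def sum_amount_columnar(layout):
    """Sum the amount field; only the one needed column is read."""
    column = layout["amount"]
    return sum(column), len(column)


print(sum_amount_row_based(row_layout))   # (total, number of values read)
print(sum_amount_columnar(column_layout))
```

Both functions return the same total, but the columnar version touches a third of the values, which is why analytical scans over a few columns favor columnar files.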
Bottom line
- S3 = object storage, NO index/tree, NO enforced schema, NO columnar unless your files are.
- Data Lake = organization + schema/catalog on S3; columnar if files are (e.g., Parquet); NO B-tree/LSM tree.
- Databases = implement their own storage engines internally (B-trees in many row-oriented databases, a columnar engine in Redshift) as part of their design for querying rows and columns efficiently.
Pricing
Data Lake
- Athena charges based on the amount of data scanned, so inefficient queries or large raw datasets can quickly increase costs. (Spark engines are typically billed for compute time instead, but scanning more data still means more compute.)
- No infrastructure to manage: with Athena you don't pay for servers or clusters, just for the queries you run.
- Best for:
- Ad hoc queries
- Infrequent or unpredictable workloads
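The pay-per-data-scanned model can be worked through with a little arithmetic. In this sketch the price per terabyte, the 2 TB dataset, and the 20-column layout are all assumptions chosen for illustration; check current pricing for your region before relying on any number.

```python
# ASSUMPTION: illustrative price per terabyte scanned, not an authoritative rate.
PRICE_PER_TB_SCANNED_USD = 5.00
BYTES_PER_TB = 1024 ** 4


def scan_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_SCANNED_USD):
    """Cost of one query under a pay-per-data-scanned model."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Hypothetical 2 TB raw dataset stored as CSV: the query scans every byte.
full_scan = scan_cost_usd(2 * BYTES_PER_TB)

# Same data in a columnar format where the query reads 2 of 20 equally sized
# columns (ignoring compression, which would usually shrink this further).
columnar_scan = scan_cost_usd(2 * BYTES_PER_TB * 2 // 20)

print(f"full scan:     ${full_scan:.2f}")
print(f"columnar scan: ${columnar_scan:.2f}")
```

Under these assumptions, reading only the needed columns cuts the per-query cost by roughly a factor of ten, which is why file format and partitioning matter so much for scan-billed engines.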
Redshift
- Provisioned clusters: Pay for reserved compute/storage capacity (hourly, regardless of usage).
- Yelp has 4 clusters!!
- Or Serverless: Pay for the compute time used while queries run (billed per second), plus storage.
- Best for:
- Frequent, complex analytics
- Large-scale, repeated reporting
- BI dashboards and heavy workloads
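The choice between paying hourly for a provisioned cluster and paying only while queries run comes down to a break-even calculation. The sketch below uses placeholder rates that are NOT real Redshift prices and ignores storage charges; only the shape of the comparison is the point.

```python
# ASSUMPTION: both rates are placeholders for illustration, not real prices.
PROVISIONED_USD_PER_HOUR = 1.00      # billed every hour the cluster exists
SERVERLESS_USD_PER_BUSY_HOUR = 3.00  # billed only while queries are running

HOURS_PER_MONTH = 730


def provisioned_monthly_cost():
    """Cost is fixed: the cluster is billed whether or not it is used."""
    return PROVISIONED_USD_PER_HOUR * HOURS_PER_MONTH


def serverless_monthly_cost(busy_hours):
    """Cost scales with the hours of actual query activity."""
    return SERVERLESS_USD_PER_BUSY_HOUR * busy_hours


def break_even_busy_hours():
    """Busy hours per month above which provisioned becomes cheaper."""
    return provisioned_monthly_cost() / SERVERLESS_USD_PER_BUSY_HOUR


for busy in (50, 250, 600):
    if serverless_monthly_cost(busy) < provisioned_monthly_cost():
        cheaper = "serverless"
    else:
        cheaper = "provisioned"
    print(f"{busy:>3} busy hours/month -> {cheaper}")
```

With these placeholder rates, light or bursty usage favors serverless and sustained heavy usage favors a provisioned cluster, which matches the "best for" list above.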
HDFS
Hadoop Distributed File System. Like S3, it is used to store large amounts of data.
- Distributed file system that runs on clusters of computers.
- Data stored in HDFS can be of any type; the system automatically splits files into blocks and stores them redundantly across many machines.
- HDFS can be faster than S3 when compute runs on the same nodes that hold the data (data locality avoids much of the network latency).
- HDFS is a distributed file system you run and maintain on your own cluster, primarily for on-prem or managed Hadoop/Spark environments.
- S3 is not a file system, it is object storage. S3 is a managed service by AWS. HDFS is open source and you have to manage it yourself. It runs on a cluster of connected servers.
- Legacy/on-prem-focused orgs: more likely to use HDFS.
- Cloud/data-driven orgs (most modern tech): strongly favor S3.
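How HDFS splits a file and stores it redundantly can be sketched as follows. The 128 MB block size and replication factor of 3 are common HDFS defaults (both are configurable). The `plan_blocks` function, the node names, and the round-robin placement are simplifying assumptions; real HDFS uses rack-aware placement decided by the NameNode.

```python
import math

# Common HDFS defaults; both are configurable per cluster or per file.
BLOCK_SIZE_BYTES = 128 * 1024 ** 2
REPLICATION_FACTOR = 3


def plan_blocks(file_size_bytes, datanodes,
                block_size=BLOCK_SIZE_BYTES, replication=REPLICATION_FACTOR):
    """Split a file into blocks and assign each block to distinct nodes.

    Returns a list of (block_index, block_size_bytes, replica_nodes).
    Placement is simple round-robin, purely for illustration.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    num_blocks = math.ceil(file_size_bytes / block_size)
    plan = []
    for i in range(num_blocks):
        # The last block holds whatever is left and may be smaller.
        size = min(block_size, file_size_bytes - i * block_size)
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append((i, size, replicas))
    return plan


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
plan = plan_blocks(300 * 1024 ** 2, nodes)
for block_id, size, replicas in plan:
    print(block_id, size // 1024 ** 2, "MiB", replicas)

raw_bytes_stored = sum(size * len(replicas) for _, size, replicas in plan)
print(raw_bytes_stored // 1024 ** 2, "MiB of raw disk used")
```

A 300 MiB file therefore occupies three blocks and three times its size in raw disk, which is the price of surviving the loss of individual machines.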