S3 vs DL vs Redshift
S3 (Simple Storage Service)
- pure storage: no SQL, no built-in analytics
- Object Storage. NOT columnar, log-structured (LSM), or B-tree based
- S3 does not understand tables, rows, columns, indexes, or trees.
- All structuring is defined by how you write/read files (e.g., CSV, Parquet, JSON), not by S3 itself!
- stores objects within buckets
- an object is a file and any metadata that describes the file
- a bucket is a container for objects
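To make the "objects in buckets" point concrete, here is a minimal boto3 sketch; the bucket and key names are made up, and the point is that S3 stores and returns opaque bytes with no notion of rows or columns.

```python
# Minimal boto3 sketch (hypothetical bucket/key names): S3 only sees bytes.
import boto3

s3 = boto3.client("s3")

# Upload: S3 stores the bytes plus metadata; it has no idea this is CSV.
s3.put_object(
    Bucket="my-example-bucket",          # a bucket is a container for objects
    Key="raw/events/2024-01-01.csv",     # an object = file bytes + metadata
    Body=b"user_id,event\n1,click\n2,view\n",
)

# Download: you get the same opaque bytes back; any "schema" is up to you.
obj = s3.get_object(Bucket="my-example-bucket", Key="raw/events/2024-01-01.csv")
print(obj["Body"].read().decode())
```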
Data Lake
- Still Object Storage (in S3)
- Built on top of S3
- An architectural concept or solution—not a product or "service".
- Centralizes and organizes data (usually on S3)
- in AWS you go to the S3 service and see many buckets; some of those buckets might be data lakes
- DLs often support query tools (Athena/Databricks) for direct analytics!
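As a hedged illustration of "query tools on top of the lake": Athena can run SQL directly over files in S3 via boto3. The database, table, and result location below are hypothetical and assume a Glue catalog table already describes the files.

```python
# Hypothetical Athena query over Parquet files sitting in an S3 data lake.
# Assumes a Glue/Athena table "events" already points at those files.
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT event, COUNT(*) AS cnt FROM events GROUP BY event",
    QueryExecutionContext={"Database": "my_datalake_db"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-example-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution / get_query_results for output
```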
Note on S3/DL storage
- DL and S3 are object stores. The objects (files) within might have structure, for example columnar formats like Parquet or Avro, and you can define a schema on them.
- But: The underlying object store (S3) remains ignorant of the structure; it just stores files. The columnar "magic" happens inside the files themselves.
- you cannot have B-tree or LSM-tree storage indexing in S3/DL; for that you need a full-fledged database management system.
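A small PySpark sketch of that idea: the schema and columnar layout travel inside the Parquet files, while S3 just stores them as objects. The paths are hypothetical.

```python
# PySpark sketch: the schema lives in the Parquet files, not in S3 itself.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-on-s3").getOrCreate()

df = spark.createDataFrame(
    [(1, "click"), (2, "view")],
    ["user_id", "event"],
)

# Write columnar Parquet objects to a hypothetical data-lake path.
df.write.mode("overwrite").parquet("s3a://my-example-bucket/lake/events/")

# Reading back recovers the schema from the Parquet footers; S3 never knew it.
spark.read.parquet("s3a://my-example-bucket/lake/events/").printSchema()
```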
Redshift
- Amazon Redshift is a fully managed cloud data warehouse service
- It is not just storage; it is a full-featured SQL analytics database
- Columnar Storage:
- Redshift stores table data in a columnar format (not row-based, not LSM, not B-tree).
- This means data is stored column by column, which is highly efficient for analytical queries (scanning/aggregating specific columns in very large datasets).
See "How does Yelp store data?" for practical usage
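Since Redshift speaks the PostgreSQL wire protocol, a psycopg2 sketch like the one below can show the kind of column-scanning aggregate it is designed for; the endpoint, credentials, and sales table are all hypothetical.

```python
# Hypothetical connection to a Redshift cluster (Postgres-compatible protocol).
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439,
    dbname="analytics",
    user="analyst",
    password="...",
)

with conn.cursor() as cur:
    # Aggregating one column: columnar storage lets Redshift scan just the
    # event_date and revenue columns instead of reading whole rows.
    cur.execute(
        """
        SELECT event_date, SUM(revenue)
        FROM sales
        GROUP BY event_date
        ORDER BY event_date
        """
    )
    for row in cur.fetchall():
        print(row)
```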
Note on parquet vs csv
- parquet files are columnar, meaning they store data in columns rather than rows.
- This is different from CSV files, which are row-based.
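A quick pandas illustration of the difference (file names made up; to_parquet needs pyarrow or fastparquet installed): a Parquet reader can fetch just the columns it needs, which a CSV reader cannot do without scanning every row.

```python
# Same data written row-based (CSV) and columnar (Parquet).
import pandas as pd

df = pd.DataFrame(
    {"user_id": [1, 2, 3], "event": ["click", "view", "click"], "revenue": [0.0, 1.5, 2.0]}
)

df.to_csv("events.csv", index=False)   # rows written one after another
df.to_parquet("events.parquet")        # columns stored as separate chunks

# Column pruning: read only what you need from the Parquet file.
revenue_only = pd.read_parquet("events.parquet", columns=["revenue"])
print(revenue_only)
```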
Bottom line
- S3 = object storage, NO index/tree, NO enforced schema, NO columnar unless your files are.
- Data Lake = organization + schema/catalog on S3; columnar if files are (e.g., Parquet); NO B-tree/LSM tree.
- Databases (like Redshift) = implement B-trees/columnar engines internally as part of their design for querying rows and columns efficiently.
Pricing
Data Lake
- Athena and Spark charge based on the amount of data scanned, so inefficient queries or large raw datasets can quickly increase costs.
- No infrastructure to manage: You don’t pay for servers or clusters—just for the queries you run.
- Best for:
- Ad hoc queries
- Infrequent or unpredictable workloads
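A back-of-the-envelope sketch of what scan-based pricing means in practice. The $5-per-TB-scanned figure is Athena's commonly published on-demand rate, but it varies by region, and the data sizes here are made up, so treat the numbers as illustrative only.

```python
# Illustrative only: with scan-based pricing, cost tracks bytes scanned per query.
PRICE_PER_TB_SCANNED = 5.00  # assumed on-demand Athena rate in USD; check your region

def query_cost(tb_scanned: float) -> float:
    return tb_scanned * PRICE_PER_TB_SCANNED

# Scanning 2 TB of raw CSV vs ~0.2 TB after converting to partitioned Parquet.
print(query_cost(2.0))   # 10.0 USD -> raw, unpartitioned data
print(query_cost(0.2))   # 1.0 USD  -> columnar format + partition pruning
```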
Redshift
- Provisioned clusters: Pay for reserved compute/storage capacity (hourly, regardless of usage). Yelp has 4 clusters!!
- Serverless: Pay for compute seconds used per query, plus storage.
- Best for:
- Frequent, complex analytics
- Large-scale, repeated reporting
- BI dashboards and heavy workloads
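A purely hypothetical break-even sketch of why pay-per-query suits sporadic use while reserved capacity suits constant load; none of the rates below are real AWS prices.

```python
# Hypothetical rates, chosen only to show the trade-off (not real AWS prices).
CLUSTER_COST_PER_HOUR = 3.00         # assumed provisioned-cluster rate (USD)
COST_PER_QUERY_SERVERLESS = 0.50     # assumed average serverless cost per query (USD)

def monthly_provisioned() -> float:
    return CLUSTER_COST_PER_HOUR * 24 * 30   # billed whether or not you run queries

def monthly_serverless(queries_per_month: int) -> float:
    return COST_PER_QUERY_SERVERLESS * queries_per_month

for q in (100, 1_000, 10_000):
    print(q, monthly_serverless(q), monthly_provisioned())
# Few queries -> pay-per-query is far cheaper; constant heavy load -> the flat
# provisioned cost wins.
```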
Example optimization
Hey! After some experiment runs I've updated and chosen these Spark parameters:
spark.executor.memory: 28g
spark.executor.memoryOverhead: 3g
spark.executor.cores: 4
spark.dynamicAllocation.initialExecutors: 8
spark.dynamicAllocation.minExecutors: 8
spark.dynamicAllocation.maxExecutors: 32
spark.driver.memory: 28g
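One way to wire these values into the job, sketched with SparkSession.builder; in practice the platform may pass them via spark-submit --conf or a job config file instead, so this is just an assumed code-based variant (the app name is hypothetical).

```python
# Sketch: applying the parameters above programmatically.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("capi-dqs-job")  # hypothetical job name
    .config("spark.executor.memory", "28g")
    .config("spark.executor.memoryOverhead", "3g")
    .config("spark.executor.cores", "4")
    .config("spark.dynamicAllocation.enabled", "true")  # assumed already on for the platform
    .config("spark.dynamicAllocation.initialExecutors", "8")
    .config("spark.dynamicAllocation.minExecutors", "8")
    .config("spark.dynamicAllocation.maxExecutors", "32")
    .config("spark.driver.memory", "28g")  # driver memory usually has to be set before JVM start, e.g. via spark-submit
    .getOrCreate()
)
```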
Run log
Run time 360 seconds, cost $1.41 (previously it was $7)
CPU average usage by the whole Pod is ~30-40%. I won't decrease the number of executors further since 8 to 32 dynamically allocated is already low compared to other jobs in this service.
Memory average usage by the whole Pod is 50-65% - this is OK since we want some headroom left
I use a dynamic number of executors instead of a fixed count since CAPI has variable-size input (as we onboard/churn clients, the number of conversions can change a lot; the matches data also depends heavily on the quality of the conversions the client sends to us, which can vary over time). There are also a few joins/group-bys in the DQS job that require a variable amount of resources
Spark recommends the executor memoryOverhead to be about 6-10% of the container size (spark.executor.memory + spark.executor.memoryOverhead). I chose ~10% to be on the safe side
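Just to spell out the arithmetic behind that ~10% figure with the chosen values:

```python
executor_memory_gb = 28
memory_overhead_gb = 3

container_gb = executor_memory_gb + memory_overhead_gb   # 31 GB total container size
overhead_ratio = memory_overhead_gb / container_gb
print(f"{overhead_ratio:.1%}")   # ~9.7%, i.e. roughly the 10% end of the 6-10% guideline
```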