S3 vs DL vs Redshift

S3 (Simple Storage Service)

  • pure storage, no SQL or built-in analytics
  • Object storage. NOT columnar, log-structured (LSM), or B-tree based
  • S3 does not understand tables, rows, columns, indexes, or trees.
  • All structuring is defined by how you write/read files (e.g., CSV, Parquet, JSON), not by S3 itself (see the boto3 sketch below)!
  • stores objects within buckets
  • an object is a file and any metadata that describes the file
  • a bucket is a container for objects
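
A minimal sketch of this with boto3 (the bucket and key names are made up): S3 just accepts bytes plus optional metadata and has no idea what is inside the file.

    import boto3

    s3 = boto3.client("s3")

    # S3 only sees bytes + metadata; it has no notion of rows, columns, or schema.
    s3.put_object(
        Bucket="my-example-bucket",          # hypothetical bucket
        Key="raw/events/2024-01-01.json",    # "folders" are just key prefixes
        Body=b'{"user_id": 1, "action": "click"}',
        Metadata={"source": "web"},          # arbitrary key/value metadata on the object
    )

    # Reading it back returns raw bytes; any parsing (JSON, CSV, Parquet) is on you.
    obj = s3.get_object(Bucket="my-example-bucket", Key="raw/events/2024-01-01.json")
    payload = obj["Body"].read()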

DataLake

  • Still Object Storage (in S3)
  • Built on top of S3
  • An architectural concept or solution—not a product or "service".
  • Centralizes and organizes data (usually on S3)
  • in AWS you go to the S3 service and see many buckets; some of those buckets might be data lakes
  • DLs often support query tools (Athena, Databricks) for running analytics directly on the files (see the Athena sketch below)!
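
For example, files sitting in S3 can be queried in place with Athena. A rough sketch with boto3 (the Glue database, table, and results bucket are invented for illustration):

    import boto3

    athena = boto3.client("athena")

    # Athena reads the Parquet/CSV files directly from S3, using a schema registered
    # in the Glue catalog; the data never moves into a database.
    resp = athena.start_query_execution(
        QueryString="SELECT action, COUNT(*) AS n FROM events GROUP BY action",
        QueryExecutionContext={"Database": "my_datalake_db"},                  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},     # hypothetical bucket
    )
    query_id = resp["QueryExecutionId"]
    # Results land as files in OutputLocation; poll get_query_execution() for completion.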

Note on S3/DL storage

  • DL and S3 are object stores. The objects (files) inside may have internal structure, for example a columnar format like Parquet or Avro, and you can define a schema on them.
  • But: the underlying object store (S3) remains ignorant of that structure; it just stores files. The columnar "magic" happens inside the files themselves (see the sketch below).
  • You cannot have B-tree or LSM-tree storage indexing in S3/a DL; for that you need a full-fledged database management system.
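
To make that concrete, a small sketch with pyarrow: the schema and columnar layout are encoded inside the Parquet file itself, while the object store that holds the file only ever sees opaque bytes.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a tiny table and write it as Parquet; the columnar layout and the schema
    # live inside the file, not in the object store that will hold it.
    table = pa.table({"user_id": [1, 2, 3], "amount": [9.5, 12.0, 3.25]})
    pq.write_table(table, "payments.parquet")

    # The schema can be read back from the file's own metadata.
    print(pq.read_schema("payments.parquet"))
    # user_id: int64
    # amount: double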

Redshift

  • Amazon Redshift is a fully managed cloud data warehouse service
  • It is not just storage - it is a full-featured SQL analytics database
  • Columnar Storage:
    • Redshift stores table data in a columnar format (not row-based, not LSM, not B-tree).
    • This means data is stored column by column, which is highly efficient for analytical queries (scanning/aggregating specific columns in very large datasets); see the query sketch below.
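
As an illustration (hypothetical cluster, credentials, and table), a typical analytical query sent through the redshift_connector driver; because storage is columnar, Redshift only has to read the columns the query touches:

    import redshift_connector

    # Hypothetical cluster and credentials.
    conn = redshift_connector.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        database="analytics",
        user="analyst",
        password="...",
    )
    cur = conn.cursor()

    # Columnar storage means only order_date and amount are scanned,
    # even if the orders table has hundreds of columns.
    cur.execute("""
        SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS revenue
        FROM orders
        GROUP BY 1
        ORDER BY 1
    """)
    rows = cur.fetchall()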

See How does Yelp store data? for practical usage

Note on parquet vs csv

  • Parquet files are columnar, meaning they store data by column rather than by row.
  • This is different from CSV files, which are row-based (see the pandas sketch below).
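
A quick sketch with pandas of what that buys you in practice (file names are made up): with Parquet a reader can pull just the columns it needs, while a CSV has to be parsed row by row in full.

    import pandas as pd

    df = pd.DataFrame({"user_id": range(1000), "amount": [1.0] * 1000, "country": ["US"] * 1000})

    # CSV: row-oriented text; reading any one column still means parsing every full row.
    df.to_csv("sales.csv", index=False)
    all_rows = pd.read_csv("sales.csv")

    # Parquet: columnar binary; the reader fetches only the requested column.
    df.to_parquet("sales.parquet", index=False)
    amounts_only = pd.read_parquet("sales.parquet", columns=["amount"])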

Bottom line

  • S3 = object storage, NO index/tree, NO enforced schema, NO columnar unless your files are.
  • Data Lake = organization + schema/catalog on S3; columnar if files are (e.g., Parquet); NO B-tree/LSM tree.
  • Databases (like Redshift) = B-tree/columnar storage engines implemented internally as part of their design, for querying rows and columns efficiently.

Pricing

Data Lake

  • Athena and Spark charge based on the amount of data scanned, so inefficient queries or large raw datasets can quickly increase costs.
  • No infrastructure to manage: You don’t pay for servers or clusters—just for the queries you run.
  • Best for
    • Ad hoc queries
    • Infrequent or unpredictable workloads

Redshift

  • Provisioned clusters: pay for reserved compute/storage capacity (billed hourly, regardless of usage).
    • Yelp has 4 clusters!!
  • Serverless (the alternative): pay for compute seconds used per query, plus storage.
  • Best for:
    • Frequent, complex analytics
    • Large-scale, repeated reporting
    • BI dashboards and heavy workloads
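
As a back-of-envelope comparison of the two pricing models (the per-TB and per-node-hour prices below are illustrative assumptions, not current AWS list prices):

    # Rough cost model: scan-priced query engine (Athena) vs. an always-on provisioned cluster (Redshift).
    ATHENA_PRICE_PER_TB_SCANNED = 5.00      # assumed $/TB scanned
    REDSHIFT_NODE_PRICE_PER_HOUR = 0.25     # assumed $/node-hour
    NODES = 4                               # e.g. a small provisioned cluster

    def athena_monthly_cost(tb_scanned_per_month: float) -> float:
        return tb_scanned_per_month * ATHENA_PRICE_PER_TB_SCANNED

    def redshift_monthly_cost(hours: float = 730) -> float:
        # A provisioned cluster bills for every hour whether or not queries run.
        return NODES * REDSHIFT_NODE_PRICE_PER_HOUR * hours

    print(athena_monthly_cost(20))    # 20 TB scanned/month  -> 100.0
    print(redshift_monthly_cost())    # 4 nodes, 730 h/month -> 730.0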

Example optimization

Hey! After some experiment runs I've updated and chosen these Spark parameters:

    spark.executor.memory: 28g
    spark.executor.memoryOverhead: 3g
    spark.executor.cores: 4
    spark.dynamicAllocation.initialExecutors: 8
    spark.dynamicAllocation.minExecutors: 8
    spark.dynamicAllocation.maxExecutors: 32
    spark.driver.memory: 28g
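
For reference, a sketch of how these settings would be applied from a PySpark entry point (the real job may instead pass them via spark-submit --conf flags or the platform's job config):

    from pyspark.sql import SparkSession

    # Same values as above, applied programmatically.
    spark = (
        SparkSession.builder
        .appName("capi-dqs-job")                                # hypothetical app name
        .config("spark.executor.memory", "28g")
        .config("spark.executor.memoryOverhead", "3g")
        .config("spark.executor.cores", "4")
        .config("spark.dynamicAllocation.enabled", "true")      # assumed on, since min/max executors are set
        .config("spark.dynamicAllocation.initialExecutors", "8")
        .config("spark.dynamicAllocation.minExecutors", "8")
        .config("spark.dynamicAllocation.maxExecutors", "32")
        .config("spark.driver.memory", "28g")
        .getOrCreate()
    )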

Run log

Run time: 360 seconds, cost: $1.41 (previously it was $7)

Average CPU usage by the whole Pod is ~30-40%. I won't decrease the number of executors further, since 8 to 32 dynamically allocated is already low compared to other jobs in this service.

Average memory usage by the whole Pod is 50-65%; this is OK since we want some headroom left.

I use a dynamic number of executors instead of a fixed one since CAPI has variable-size input (as we onboard/churn clients, the number of conversions can change a lot; matches data also depends heavily on the quality of the conversions the client sends us, which can vary over time). There are also a few joins/group-bys in the DQS job that require a variable amount of resources.

Spark recommends the executor memoryOverhead to be about 6-10% of the container size (spark.executor.memory + spark.executor.memoryOverhead). I chose ~10% to be on the safe side; a quick check of that ratio is below.
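
Sanity check of that ratio with the chosen values:

    executor_memory_gb = 28
    memory_overhead_gb = 3
    container_gb = executor_memory_gb + memory_overhead_gb   # 31 GB total container size

    overhead_fraction = memory_overhead_gb / container_gb
    print(f"{overhead_fraction:.1%}")   # 9.7%, i.e. roughly the 10% upper end of the guideline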