**Is your idea related to a problem? Please describe.**
Our load tests currently read from public S3 buckets, so we have no control over the data they contain.

**Describe the solution you'd like**
Move all load-test benchmarking data into buckets we own.

P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.


My suggestion for creating meaningful load tests is to generate data along these dimensions:

  1. Number, size, and type of objects:
  • Few large objects (1 GB+ each)
  • Many small objects (1M+ objects)
  • Various formats (CSV, Parquet, JSON...)
  2. Partitioning:
  • No partitioning: all objects under the same partition
  • Moderate partitioning: balanced partitions/objects (i.e. the common, sensible use case)
  • High partitioning: many partitions, each with few objects
  3. Data types:
  • String, Int, Null, List, Dict...
  • "Realistic" data using Faker