Serverless Data Pipeline – S3, Glue, Athena
Fully serverless analytics pipeline using AWS Glue, Athena, and S3 with infrastructure defined as code.
Designed a pipeline to ingest raw CSV data, automatically discover schemas, and make curated datasets queryable with low operational overhead.
- S3 (raw, processed, Athena results)
- AWS Glue crawlers, database, ETL job (PySpark)
- Amazon Athena for interactive queries
- IAM roles and policies
- AWS CDK (Python) for IaC
- GitHub Actions for CI (lint, tests, cdk synth)
I built the ETL job to cast column types, compute derived metrics (e.g., total price), and write Parquet output partitioned by order date. I then codified the Glue resources with CDK and added CI checks to catch issues before deployment.
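The per-row logic of that ETL step can be sketched in plain Python. This is only an illustration, not the PySpark job itself: the column names (`order_id`, `quantity`, `unit_price`, `order_date`), the sample rows, and the helper names are all assumptions, since the actual schema is not shown here. It casts the raw CSV strings to typed values, derives a total price, and groups records under Hive-style partition keys of the form `order_date=YYYY-MM-DD`, which is the layout Athena can prune on.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from decimal import Decimal


def transform_row(row: dict) -> dict:
    """Cast raw CSV strings to typed values and add a derived total_price.

    Column names are hypothetical; the real dataset's schema may differ.
    """
    quantity = int(row["quantity"])
    unit_price = Decimal(row["unit_price"])  # Decimal avoids float rounding on money
    order_date = datetime.strptime(row["order_date"], "%Y-%m-%d").date()
    return {
        "order_id": row["order_id"],
        "quantity": quantity,
        "unit_price": unit_price,
        "total_price": quantity * unit_price,  # derived metric
        "order_date": order_date,
    }


def partition_rows(raw_csv: str) -> dict:
    """Group transformed rows under Hive-style keys (order_date=YYYY-MM-DD)."""
    partitions = defaultdict(list)
    for row in csv.DictReader(io.StringIO(raw_csv)):
        record = transform_row(row)
        key = f"order_date={record['order_date'].isoformat()}"
        partitions[key].append(record)
    return dict(partitions)


# Made-up sample data for illustration only.
SAMPLE = """order_id,quantity,unit_price,order_date
A1,2,9.99,2024-01-15
A2,1,100.00,2024-01-15
A3,3,0.50,2024-01-16
"""

if __name__ == "__main__":
    for key, records in partition_rows(SAMPLE).items():
        print(key, [str(r["total_price"]) for r in records])
```

In a Spark job the same three steps would typically be expressed as column casts, a derived column, and a partitioned Parquet write, with each partition key becoming a folder prefix under the processed bucket.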