An end-to-end data pipeline that extracts raw Mercedes-Benz global car sales data from Kaggle, transforms it into three analytical fact tables, and loads the results as CSV reports, ready for dashboards or further analysis.
## Project Structure

```
.
├── main.py                       # Pipeline entry point
├── requirements.txt
├── README.md
├── reports/                      # Output CSVs written here after load
│   ├── fact_annual_metrics.csv
│   ├── fact_fuel_performance.csv
│   └── fact_model_profitability.csv
├── src/
│   ├── config/
│   │   ├── logging.py            # Logging setup
│   │   └── settings.py           # Constants: paths, column lists, body type map
│   ├── extract/
│   │   └── read_source.py        # Pulls data from Kaggle via kagglehub
│   ├── transform/
│   │   └── data_cleaning.py      # Validation, cleaning, and aggregation
│   └── load/
│       └── load_data.py          # Writes fact tables to CSV
└── tests/                        # Full unittest suite
    ├── extract/
    │   └── test_source.py
    ├── transform/
    │   ├── test_transformation_schema.py
    │   └── test_transformation.py
    ├── load/
    │   └── test_loading.py
    ├── full_pipeline.py
    └── utils.py
```
## Pipeline Flow

```
Kaggle Dataset
      │
      ▼
[Extract]   read_source.py   → raw DataFrame
      │
      ▼
[Transform] data_cleaning.py → fact_annual_metrics
                             → fact_fuel_performance
                             → fact_model_profitability
      │
      ▼
[Load]      load_data.py     → reports/*.csv
```
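The three stages above can be sketched as a thin orchestration in the style of `main.py`. The function bodies here are illustrative stand-ins; the real implementations live in `src/extract/read_source.py`, `src/transform/data_cleaning.py`, and `src/load/load_data.py`, and their names and signatures may differ.

```python
from pathlib import Path
import pandas as pd

def extract() -> pd.DataFrame:
    # Stand-in for read_source.py, which pulls the dataset from Kaggle
    # via kagglehub; here we just fabricate two rows.
    return pd.DataFrame({
        "Year":  [2020, 2021],
        "Model": ["C-Class", "E-Class"],
        "Units": [120, 150],
        "Price": [42000.0, 55000.0],
    })

def transform(raw: pd.DataFrame) -> dict[str, pd.DataFrame]:
    # Stand-in for data_cleaning.py: one dict entry per fact table
    # (only fact_annual_metrics is sketched here).
    raw = raw.assign(Revenue=raw["Units"] * raw["Price"])
    annual = raw.groupby("Year", as_index=False).agg(
        total_units=("Units", "sum"),
        total_revenue=("Revenue", "sum"),
    )
    return {"fact_annual_metrics": annual}

def load(facts: dict[str, pd.DataFrame], out_dir: str = "reports") -> None:
    # Stand-in for load_data.py: one CSV per fact table.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for name, table in facts.items():
        table.to_csv(Path(out_dir) / f"{name}.csv", index=False)

if __name__ == "__main__":
    load(transform(extract()))
```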
## Fact Tables

| Table | Grain | Key Metrics |
|---|---|---|
| `fact_annual_metrics` | Year | Total units sold, total revenue, avg price, YoY revenue growth % |
| `fact_fuel_performance` | Year × Fuel Type | Units sold, avg price, avg horsepower, market share % |
| `fact_model_profitability` | Year × Model × Body Type | Units sold, total revenue, avg horsepower |
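As a rough illustration of how the first two tables' derived metrics can fall out of a pandas `groupby`, here is a toy example; the column names and rounding are assumptions, not the pipeline's actual schema.

```python
import pandas as pd

# Toy rows standing in for the cleaned Kaggle data.
sales = pd.DataFrame({
    "Year":      [2020, 2020, 2021, 2021],
    "Fuel Type": ["Petrol", "Diesel", "Petrol", "Diesel"],
    "Units":     [100, 50, 120, 80],
    "Price":     [40000.0, 60000.0, 45000.0, 65000.0],
})
sales["Revenue"] = sales["Units"] * sales["Price"]

# fact_annual_metrics: one row per Year, with YoY revenue growth %.
annual = sales.groupby("Year", as_index=False).agg(
    total_units=("Units", "sum"),
    total_revenue=("Revenue", "sum"),
    avg_price=("Price", "mean"),
)
annual["yoy_revenue_growth_pct"] = (
    annual["total_revenue"].pct_change() * 100
).round(2)

# fact_fuel_performance: one row per Year × Fuel Type, with market share %
# computed against each year's total units (shares sum to 100 per year).
fuel = sales.groupby(["Year", "Fuel Type"], as_index=False).agg(
    units=("Units", "sum"),
    avg_price=("Price", "mean"),
)
fuel["market_share_pct"] = (
    fuel["units"] / fuel.groupby("Year")["units"].transform("sum") * 100
).round(2)
```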
## Getting Started

### 1. Clone the repository

```bash
git clone https://github.com/franchierchie/End-to-End-Data-Solution-Mercedes-Global-Car-Sales-2020-2025.git
cd End-to-End-Data-Solution-Mercedes-Global-Car-Sales-2020-2025
```

### 2. Create and activate a virtual environment

```bash
python -m venv .venv
source .venv/bin/activate   # macOS / Linux
.venv\Scripts\activate      # Windows
```

### 3. Install dependencies

```bash
pip install -r requirements.txt
```

### 4. Authenticate with Kaggle

The pipeline uses kagglehub to download the dataset. Authenticate once before running:
Place your API token at `~/.kaggle/kaggle.json` (`%USERPROFILE%\.kaggle\kaggle.json` on Windows):

```bash
mkdir -p ~/.kaggle
echo '{"username":"<your_kaggle_username>","key":"<your_kaggle_api_key>"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```

Or export the environment variables:

```bash
export KAGGLE_USERNAME=<your_kaggle_username>
export KAGGLE_KEY=<your_kaggle_api_key>
```

You can generate an API key from your Kaggle account settings.
## Run the Pipeline

```bash
python main.py
```

Output CSVs are written to the `reports/` directory.
## Run the Tests

```bash
python -m unittest discover tests -v
```

The test suite is self-contained: it does not require a Kaggle connection or the real project package to be installed. All data source calls are replaced by an in-memory fixture.
### Test Coverage

| Area | What is tested |
|---|---|
| Extract | Returns a DataFrame, non-empty, all required columns present, correct dtypes |
| Validate schema | Raises ValueError for missing columns; error message names the missing column |
| Transform | Column renaming, Sales Volume removal, Body Type mapping (known & unknown models), aggregation correctness, YoY growth, market share sums to 100%, rounding, null-Model guard, deduplication |
| Load | Three CSVs are created, non-empty, round-trip correctly; raises ValueError for any empty fact table |
| End-to-end | Full pipeline with standard data; edge cases: single year, single fuel type, all-unknown models, YoY growth direction, CSV row count consistency |
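The schema-validation row above can be sketched as a small `unittest` case in the style of `tests/transform/test_transformation_schema.py`. The `validate_schema` helper and `REQUIRED_COLUMNS` list here are hypothetical; the real names live in the project's `src/transform` and test modules.

```python
import unittest
import pandas as pd

# Illustrative required-column list; the pipeline defines its own in
# src/config/settings.py.
REQUIRED_COLUMNS = ["Year", "Model", "Fuel Type", "Price"]

def validate_schema(df: pd.DataFrame) -> None:
    # Raise a ValueError that names the first missing required column.
    for col in REQUIRED_COLUMNS:
        if col not in df.columns:
            raise ValueError(f"Missing required column: {col}")

class TestValidateSchema(unittest.TestCase):
    def test_missing_column_is_named_in_error(self):
        df = pd.DataFrame({"Year": [2020], "Model": ["C-Class"], "Price": [42000.0]})
        with self.assertRaises(ValueError) as ctx:
            validate_schema(df)
        self.assertIn("Fuel Type", str(ctx.exception))

    def test_complete_schema_passes(self):
        df = pd.DataFrame({col: [0] for col in REQUIRED_COLUMNS})
        validate_schema(df)  # must not raise

if __name__ == "__main__":
    unittest.main(exit=False)
```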
See `requirements.txt` for the full dependency list.