Skip to content

franchierchie/End-to-End-Data-Solution-Mercedes-Global-Car-Sales-2020-2025

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🚗 Mercedes Global Car Sales ETL Pipeline (2020–2025)

An end-to-end data pipeline that extracts raw Mercedes-Benz global car sales data from Kaggle, transforms it into three analytical fact tables, and loads the results as CSV reports — ready for dashboards or further analysis.


📁 Project Structure

.
├── main.py                        # Pipeline entry point
├── requirements.txt
├── README.md
├── reports/                       # Output CSVs written here after load
│   ├── fact_annual_metrics.csv
│   ├── fact_fuel_performance.csv
│   └── fact_model_profitability.csv
├── src/
│   ├── config/
│   │   ├── logging.py             # Logging setup
│   │   └── settings.py            # Constants: paths, column lists, body type map
│   ├── extract/
│   │   └── read_source.py         # Pulls data from Kaggle via kagglehub
│   ├── transform/
│   │   └── data_cleaning.py       # Validation, cleaning, and aggregation
│   └── load/
│       └── load_data.py           # Writes fact tables to CSV
└── tests/                         # Full unittest suite
    ├── extract/
    │   └── test_source.py
    ├── transform/
    │   └── test_transformation_schema.py
    │   └── test_transformation.py
    ├── load/
    │    └── test_loading.py
    ├── full_pipeline.py
    └── utils.py

⚙️ Pipeline Overview

Kaggle Dataset
     │
     ▼
[Extract]  read_source.py          → raw DataFrame
     │
     ▼
[Transform] data_cleaning.py       → fact_annual_metrics
                                   → fact_fuel_performance
                                   → fact_model_profitability
     │
     ▼
[Load]  load_data.py               → reports/*.csv

Fact Tables Produced

Table Grain Key Metrics
fact_annual_metrics Year Total units sold, total revenue, avg price, YoY revenue growth %
fact_fuel_performance Year × Fuel Type Units sold, avg price, avg horsepower, market share %
fact_model_profitability Year × Model × Body Type Units sold, total revenue, avg horsepower

🛠️ Setup

1. Clone the repository

git clone https://github.com/franchierchie/End-to-End-Data-Solution-Mercedes-Global-Car-Sales-2020-2025.git
cd End-to-End-Data-Solution-Mercedes-Global-Car-Sales-2020-2025

2. Create and activate a virtual environment

python -m venv .venv
source .venv/bin/activate        # macOS / Linux
.venv\Scripts\activate           # Windows

3. Install dependencies

pip install -r requirements.txt

4. Configure Kaggle credentials

The pipeline uses kagglehub to download the dataset. Authenticate once before running:

pip install kaggle
kaggle config set -n username -v <your_kaggle_username>
kaggle config set -n key     -v <your_kaggle_api_key>

Or export the environment variables:

export KAGGLE_USERNAME=<your_kaggle_username>
export KAGGLE_KEY=<your_kaggle_api_key>

You can generate an API key from your Kaggle account settings.


▶️ Running the Pipeline

python main.py

Output CSVs are written to the reports/ directory.


🧪 Running the Tests

python -m unittest tests/test_etl_pipeline.py -v

The test suite is self-contained — it does not require a Kaggle connection or the real project package to be installed. All data source calls are replaced by an in-memory fixture.

Test Coverage

Area What is tested
Extract Returns a DataFrame, non-empty, all required columns present, correct dtypes
Validate schema Raises ValueError for missing columns; error message names the missing column
Transform Column renaming, Sales Volume removal, Body Type mapping (known & unknown models), aggregation correctness, YoY growth, market share sums to 100 %, rounding, null-Model guard, deduplication
Load Three CSVs are created, non-empty, round-trip correctly; raises ValueError for any empty fact table
End-to-end Full pipeline with standard data; edge cases: single year, single fuel type, all-unknown models, YoY growth direction, CSV row count consistency

📦 requirements.txt

See requirements.txt for the full dependency list.

About

End-to-end data pipeline project using Python, Kaggle API, and Power BI. Features automated ETL processing, modular unit testing, and multidimensional data modeling to analyze global car sales trends (2020-2025).

Topics

Resources

Stars

Watchers

Forks

Contributors

Languages