Lightning-fast data preprocessing and feature engineering for machine learning
Gators is a lightning-fast data preprocessing and feature engineering library built on top of Polars, designed to streamline your entire ML workflow from raw data to production-ready models. It leverages Polars' multi-core processing to keep transformations fast.
Built by the PSP Data Team at PayPal, Gators makes data preprocessing and feature engineering both faster and simpler.
- Lightning Fast: Built on Polars for multi-core parallel processing
- Unified API: Consistent sklearn-style .fit() and .transform() interface
- Production Ready: Deploy the same Python code from notebook to production
- Comprehensive: 75+ preprocessing transformers spanning data cleaning, encoding, feature generation, imputation, discretization, and scaling
- Pipeline Support: Chain transformers seamlessly with the Pipeline class
- Easy to Learn: If you know sklearn, you already know Gators
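To make the sklearn-style contract concrete, here is a minimal sketch in plain Python. `MeanCenterer` is a hypothetical class written only for this illustration and is not part of Gators; it shows the general idea that `fit` learns state from training data and `transform` applies that state to any data.

```python
from statistics import mean


class MeanCenterer:
    """Hypothetical toy transformer (not a Gators class)."""

    def fit(self, values):
        # Learn state from the training data only.
        self.mean_ = mean(values)
        return self  # returning self allows fit(...).transform(...)

    def transform(self, values):
        # Apply the learned state to any data, including unseen data.
        return [v - self.mean_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train = [1.0, 2.0, 3.0]
centerer = MeanCenterer().fit(train)
centered_train = centerer.transform(train)
centered_new = centerer.transform([4.0])  # reuses the mean learned from train
```

Separating the two steps is what lets the same fitted object be reused on new data without recomputing statistics.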
Clean and prepare your data with powerful transformers:
- CastColumns - Convert column data types
- CorrelationFilter - Remove highly correlated features
- DropColumns - Remove specified columns
- DropConstantColumns - Remove columns with constant values
- DropDuplicateColumns - Remove duplicate columns
- DropDuplicateRows - Remove duplicate rows
- DropHighNaNRatio - Remove columns with high missing value ratio
- DropLowCardinality - Remove low cardinality columns
- HighCardinalityFilter - Filter high cardinality features
- OutlierFilter - Detect and filter outliers
- RenameColumns - Rename columns
- Replace - Replace values in data
- VarianceFilter - Remove low variance features
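As an illustration of two of the cleaning ideas above (dropping columns with a high missing-value ratio and dropping constant columns), here is a plain-Python sketch. The helper names and the strict `>` comparison against the threshold are assumptions made for this example; they are not taken from the Gators implementation.

```python
def high_null_columns(columns, threshold):
    """Return names of columns whose share of None values exceeds threshold."""
    return [
        name
        for name, values in columns.items()
        if sum(v is None for v in values) / len(values) > threshold
    ]


def constant_columns(columns):
    """Return names of columns that hold a single distinct value."""
    return [name for name, values in columns.items() if len(set(values)) == 1]


data = {
    "age": [31, None, 45, 52],             # 25% missing: kept
    "notes": [None, None, None, "vip"],    # 75% missing: dropped
    "country": ["US", "US", "US", "US"],   # constant: dropped
}

to_drop = set(high_null_columns(data, threshold=0.5)) | set(constant_columns(data))
cleaned = {name: values for name, values in data.items() if name not in to_drop}
```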
Transform categorical variables with advanced encoding techniques:
- BinaryEncoder - Binary representation encoding
- CatBoostEncoder - CatBoost-style encoding
- CountEncoder - Frequency-based encoding
- LeaveOneOutEncoder - Leave-one-out encoding
- OneHotEncoder - Classic one-hot encoding
- OrdinalEncoder - Order-based encoding
- RareCategoryEncoder - Handle rare categories intelligently
- TargetEncoder - Target-based encoding for supervised learning
- WOEEncoder - Weight of Evidence encoding
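For readers new to these techniques, the sketch below shows count encoding and a basic (unsmoothed) target encoding in plain Python. It illustrates the general ideas only; the variable names are made up for the example, and Gators' encoders may differ in details such as smoothing or handling of unseen categories.

```python
from collections import Counter, defaultdict

categories = ["a", "b", "a", "c", "a", "b"]
target = [1, 0, 1, 0, 0, 1]

# Count encoding: replace each category by how often it appears.
counts = Counter(categories)
count_encoded = [counts[c] for c in categories]

# Target encoding: replace each category by the mean target observed for it.
target_sums = defaultdict(float)
for c, y in zip(categories, target):
    target_sums[c] += y
target_means = {c: target_sums[c] / counts[c] for c in counts}
target_encoded = [target_means[c] for c in categories]
```

Because target encoding uses the label, the mapping should be learned on training data only and then applied to validation or production data.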
Create powerful numeric features:
- ComparisonFeatures - Generate comparison features
- ConditionFeatures - Create conditional features
- DistanceFeatures - Calculate distance features
- GroupLagFeatures - Generate lag features by group
- GroupScalingFeatures - Scale features within groups
- GroupStatisticsFeatures - Calculate group statistics
- IsNull - Generate null indicator features
- MathFeatures - Apply mathematical operations (add, subtract, multiply, divide)
- PlanRotationFeatures - Rotate features in feature space
- PolynomialFeatures - Generate polynomial combinations
- RatioFeatures - Create ratio features between columns
- RowStatisticsFeatures - Calculate row-wise statistics
- RuleFeatures - Apply custom business rules
- ScalarMathFeatures - Apply scalar operations
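A few of these feature types can be sketched for a single row in plain Python. The column names, the output naming scheme, and the choice to return `None` on division by zero are assumptions for illustration, not Gators behavior.

```python
from itertools import combinations_with_replacement
from math import prod

row = {"amount": 20.0, "balance": 80.0}

# Ratio feature: return None on division by zero (one possible policy).
ratio = row["amount"] / row["balance"] if row["balance"] != 0 else None

# Degree-2 polynomial terms: every product of two columns, repeats allowed.
poly = {
    "*".join(names): prod(row[name] for name in names)
    for names in combinations_with_replacement(row, 2)
}

# Row-wise statistics across the numeric columns.
row_stats = {
    "min": min(row.values()),
    "max": max(row.values()),
    "sum": sum(row.values()),
}
```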
Extract insights from text data:
- CharacterStatistics - Extract character-level statistics
- CombineFeatures - Combine string features
- Contains - Check if string contains pattern
- Endswith - Check if string ends with pattern
- ExtractSubstring - Extract substring from text
- InteractionFeatures - Generate string interaction features
- Length - Calculate string length
- Lower - Convert text to lowercase
- NGram - Generate n-gram features
- Occurrences - Count pattern occurrences
- PatternDetector - Detect patterns in text
- Split - Split strings
- SplitExtract - Split and extract from strings
- Startswith - Check if string starts with pattern
- Upper - Convert text to uppercase
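The operations behind several of these string features map onto ordinary Python string methods, as the sketch below shows for one value. The helper `char_ngrams` and the feature names are hypothetical, and the character-level n-grams here are an assumption; the Gators NGram transformer may work on words instead.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


email = "Ann@Example.com"

features = {
    "length": len(email),
    "lower": email.lower(),
    "contains_at": "@" in email,
    "endswith_com": email.endswith(".com"),
    "dot_occurrences": email.count("."),
    "domain": email.split("@")[1],  # split, then extract the second part
    "bigrams": char_ngrams("gator", 2),
}
```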
Unlock temporal patterns:
- BusinessTimeFeatures - Business hours/days calculations
- CyclicFeatures - Circular encoding for cyclical time features
- DiffFeatures - Calculate time differences
- DurationToDatetime - Convert duration to datetime
- HolidayFeatures - Detect and encode holidays
- OrdinalFeatures - Extract year, month, day, hour, etc.
- TimeBinFeatures - Bin times into categories
- TimeWindowFeatures - Generate time window features
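Circular encoding is worth a short worked example, since it is less obvious than extracting the hour or month. The sketch below uses the standard sine/cosine formulation in plain Python; the function name `cyclic_encode` is made up for this illustration and is not the Gators API.

```python
import math


def cyclic_encode(value, period):
    """Map a cyclical value (e.g. hour of day) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


h0 = cyclic_encode(0, 24)    # midnight
h12 = cyclic_encode(12, 24)  # noon
h23 = cyclic_encode(23, 24)  # 11 pm

# On the raw scale 23 and 0 are 23 units apart; on the circle they are neighbors.
near = math.dist(h23, h0)
far = math.dist(h12, h0)
```

With this encoding, 23:00 ends up closer to midnight than noon does, which a raw integer hour cannot express.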
Handle missing data intelligently:
- BooleanImputer - Impute boolean columns
- GroupByImputer - Group-based imputation strategies
- NumericImputer - Impute numeric columns (mean, median, mode, constant)
- StringImputer - Impute string columns (mode, constant)
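The difference between a global strategy and a group-based strategy can be seen in a small plain-Python sketch. The data and variable names are invented for the example, and this is only an illustration of median imputation, not the Gators implementation.

```python
from collections import defaultdict
from statistics import median

ages = [30, None, 50, 40, None]
groups = ["a", "a", "b", "b", "b"]

# Global median imputation: one fill value for the whole column.
global_fill = median(v for v in ages if v is not None)
imputed = [global_fill if v is None else v for v in ages]

# Group-based imputation: one fill value per group.
observed = defaultdict(list)
for g, v in zip(groups, ages):
    if v is not None:
        observed[g].append(v)
group_fill = {g: median(vs) for g, vs in observed.items()}
group_imputed = [group_fill[g] if v is None else v for g, v in zip(groups, ages)]
```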
Convert continuous variables into bins:
- CustomDiscretizer - Custom bin edges
- EqualLengthDiscretizer - Equal-width binning
- EqualSizeDiscretizer - Equal-frequency binning
- GeometricDiscretizer - Geometric progression binning
- KMeansDiscretizer - K-means clustering-based binning
- QuantileDiscretizer - Quantile-based binning
- TreeBasedDiscretizer - Decision tree-based binning
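Equal-width and equal-frequency binning behave very differently on skewed data. The plain-Python sketch below contrasts them on a column with one outlier; the edge conventions (right-open bins via `bisect_right`, edges picked at rank positions) are assumptions for this example and may not match how Gators defines its bins.

```python
from bisect import bisect_right

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
n_bins = 3

# Equal-width: split the range [min, max] into n_bins intervals of equal length.
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins
width_edges = [lo + width * i for i in range(1, n_bins)]
width_bins = [bisect_right(width_edges, v) for v in values]

# Equal-frequency: pick interior edges so each bin holds about the same count.
ordered = sorted(values)
freq_edges = [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]
freq_bins = [bisect_right(freq_edges, v) for v in values]
```

Here the outlier pushes almost every value into the first equal-width bin, while equal-frequency binning still yields three bins of three values each.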
Normalize your features:
- ArcsinSquarerootScaler - Arcsine square root transformation
- ArcsinhScaler - Inverse hyperbolic sine transformation
- BoxCox - Box-Cox power transformation
- LogScaler - Logarithmic scaling
- MinmaxScaler - Min-max normalization
- PowerScaler - Power transformation
- StandardScaler - Standardization (z-score normalization)
- YeoJonhson - Yeo-Johnson power transformation
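The formulas behind three of these scalers are short enough to write out. The sketch below uses plain Python and the population standard deviation, which is an assumption; Gators may use a different convention.

```python
import math
from statistics import mean, pstdev

train = [2.0, 4.0, 6.0]

# Standardization (z-score): subtract the mean, divide by the standard deviation.
mu, sigma = mean(train), pstdev(train)
standardized = [(v - mu) / sigma for v in train]

# Min-max normalization: map the observed range onto [0, 1].
lo, hi = min(train), max(train)
minmax = [(v - lo) / (hi - lo) for v in train]

# Inverse hyperbolic sine: log-like compression that also accepts zero and negatives.
arcsinh = [math.asinh(v) for v in [-10.0, 0.0, 10.0]]
```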
Chain all transformers together:
- Pipeline - sklearn-compatible pipeline for chaining transformers
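Conceptually, a pipeline fits each step on the output of the previous one and replays the fitted steps on new data. The plain-Python sketch below shows that mechanism; `ToyPipeline`, `FillNone`, and `MinMax` are hypothetical classes written for this illustration and are not the Gators classes.

```python
class FillNone:
    """Hypothetical step: replace None with a constant."""

    def __init__(self, value):
        self.value = value

    def fit(self, values):
        return self  # nothing to learn

    def transform(self, values):
        return [self.value if v is None else v for v in values]


class MinMax:
    """Hypothetical step: scale to [0, 1] using the range seen during fit."""

    def fit(self, values):
        self.lo_, self.hi_ = min(values), max(values)
        return self

    def transform(self, values):
        return [(v - self.lo_) / (self.hi_ - self.lo_) for v in values]


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, transformer) pairs

    def fit_transform(self, values):
        # Each step is fitted on the output of the previous step.
        for _, step in self.steps:
            values = step.fit(values).transform(values)
        return values

    def transform(self, values):
        # Replay the already fitted steps on new data.
        for _, step in self.steps:
            values = step.transform(values)
        return values


pipe = ToyPipeline([("fill", FillNone(0.0)), ("scale", MinMax())])
train_out = pipe.fit_transform([None, 5.0, 10.0])
new_out = pipe.transform([None, 20.0])  # uses the range learned from training data
```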
import polars as pl
from gators.data_cleaning import DropHighNaNRatio, VarianceFilter
from gators.encoders import OneHotEncoder
from gators.imputers import NumericImputer
from gators.scalers import StandardScaler
from gators.pipeline import Pipeline
# Load your data
X = pl.read_csv("data.csv")
# Build a preprocessing pipeline
pipeline = Pipeline([
    ('drop_nan', DropHighNaNRatio(threshold=0.5)),
    ('impute', NumericImputer(strategy='median')),
    ('variance', VarianceFilter(threshold=0.01)),
    ('encode', OneHotEncoder()),
    ('scale', StandardScaler())
])
# Fit and transform
X_processed = pipeline.fit_transform(X)
# Deploy the same pipeline in production!

pip install gators

Or install from source:
git clone https://github.com/paypal/gators.git
cd gators
pip install -e .

For detailed documentation, tutorials, and API reference, visit:
https://paypal.github.io/gators/
Gators is perfect for:
- Fraud Detection - Extensive feature engineering for anomaly detection
- Risk Modeling - Create powerful predictive features
- Customer Analytics - Transform complex customer data
- Time Series - Rich datetime feature engineering
- NLP Tasks - String feature extraction and encoding
We welcome contributions! Please check out our contributing guidelines.
Gators is licensed under the Apache License 2.0. See the LICENSE file for details.
Developed by the PSP Data Team at PayPal.
Built by data scientists, for data scientists