feat(naive_bayes): add CategoricalNB classifier #1936

Open
uttam12331 wants to merge 1 commit into online-ml:main from uttam12331:feat/1399-categorical-nb

@uttam12331

Closes #1399.

Adds an online CategoricalNB naive Bayes classifier for categorical features (e.g. {"weather": "sunny", "humidity": "high"}), mirroring scikit-learn's CategoricalNB but learning incrementally. It fills the gap alongside the existing MultinomialNB/BernoulliNB/ComplementNB/GaussianNB.

Model

For each feature f, category v and class c:

P(x_f = v | c) = (count[f][(c, v)] + alpha) / (class_count[c] + alpha * n_categories(f))

with the empirical class prior P(c) = class_count[c] / N. New categories seen after the first observations are handled gracefully (online setting).
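
The formula above can be sketched in plain Python. This is an illustrative sketch only, not the class added in this PR: the name `TinyCategoricalNB`, its method signatures, and the choice to take `n_categories(f)` as the number of distinct values seen so far for feature `f` are all assumptions made for the example, and it assumes `alpha > 0`.

```python
import math
from collections import Counter, defaultdict


class TinyCategoricalNB:
    """Hypothetical minimal model; names and signatures are illustrative."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                  # assumed > 0 so no probability is zero
        self.class_count = Counter()        # class_count[c]
        self.count = defaultdict(Counter)   # count[f][(c, v)]
        self.categories = defaultdict(set)  # distinct values seen per feature

    def learn_one(self, x: dict, y):
        self.class_count[y] += 1
        for f, v in x.items():
            self.count[f][(y, v)] += 1
            self.categories[f].add(v)
        return self

    def p_feature_given_class(self, f, v, c) -> float:
        # (count[f][(c, v)] + alpha) / (class_count[c] + alpha * n_categories(f))
        num = self.count[f][(c, v)] + self.alpha
        den = self.class_count[c] + self.alpha * len(self.categories[f])
        return num / den

    def predict_proba_one(self, x: dict) -> dict:
        n = sum(self.class_count.values())
        log_joint = {
            c: math.log(self.class_count[c] / n)  # empirical prior P(c)
            + sum(math.log(self.p_feature_given_class(f, v, c)) for f, v in x.items())
            for c in self.class_count
        }
        # Normalize across classes with the log-sum-exp trick.
        m = max(log_joint.values())
        total = sum(math.exp(lj - m) for lj in log_joint.values())
        return {c: math.exp(lj - m) / total for c, lj in log_joint.items()}
```

In this sketch a value that was never observed contributes only the `alpha` term to the numerator, so prediction does not raise and the class probabilities still sum to one.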

Implementation

  • learn_one / learn_many (mini-batch via MiniBatchClassifier), joint_log_likelihood / joint_log_likelihood_many, plus p_class / p_feature_given_class helpers — consistent with the other NB classes.
  • Registered in naive_bayes/__init__.py.

Verification

  • Matches scikit-learn's CategoricalNB to machine precision (max predict_proba difference ~4e-16 across alphas) — covered by a new test_categorical_vs_sklearn (parametrized over alpha).
  • learn_many produces an identical model to repeated learn_one (test_categorical_learn_many_vs_learn_one).
  • Unseen categories at predict time don't raise and keep probabilities normalized (test_categorical_handles_unseen_feature_value).
  • A runnable doctest example in the class docstring.
  • Full river/naive_bayes suite passes (50 tests), the generic estimator checks pass (test_estimators.py -k CategoricalNB, 53 checks), and ruff check / ruff format are clean.

@MaxHalford MaxHalford left a comment

Member

Nice addition!

for c in self.classes_
}

def learn_many(self, X: pd.DataFrame, y: pd.Series):

Member

Could you support Narwhals? We're in the process of moving all mini-batch methods to Narwhals instead of pandas

Comment on lines +138 to +139
for (_, row), label in zip(X.iterrows(), y):
    self.learn_one(row.to_dict(), label)

Member

We do not want mini-batch methods to be for loops over the inputs. Mini-batch methods must use vectorization, else they're not bringing anything to the table.
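
One possible shape for a vectorized update, sketched under assumptions: it uses pandas rather than the Narwhals API requested above, the helper name `batch_counts` is hypothetical, and the `(class, value)` key layout follows the `count[f][(c, v)]` notation from the PR description. The only Python-level loop is over columns; the per-row counting is delegated to `groupby(...).size()`.

```python
from collections import Counter, defaultdict

import pandas as pd


def batch_counts(X: pd.DataFrame, y: pd.Series):
    """Hypothetical helper: count a whole mini-batch without iterating over rows."""
    # Class counts for the batch in one vectorized call.
    class_counts = Counter({c: int(n) for c, n in y.value_counts().items()})

    feature_counts = defaultdict(Counter)
    for feature in X.columns:  # loop over columns, not rows
        # Size of each observed (class, value) group for this feature.
        sizes = X[feature].groupby([y, X[feature]]).size()
        for (label, value), n in sizes.items():
            feature_counts[feature][(label, value)] += int(n)
    return class_counts, feature_counts
```

The returned counters could then be merged into the model's running counts with `Counter.update`, which adds counts rather than replacing them.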
