Skip to content

configuring KnativeServing to allow zero initial scale when installed#1774

Merged
openshift-merge-bot[bot] merged 3 commits into
opendatahub-io:mainfrom
brettmthompson:feature/RHOAIENG-18965-changing-replicas-to-0-causes-new-pod-creation
Apr 18, 2025
Merged

configuring KnativeServing to allow zero initial scale when installed#1774
openshift-merge-bot[bot] merged 3 commits into
opendatahub-io:mainfrom
brettmthompson:feature/RHOAIENG-18965-changing-replicas-to-0-causes-new-pod-creation

Conversation

@brettmthompson
Copy link
Copy Markdown
Contributor

@brettmthompson brettmthompson commented Mar 19, 2025

Description

This PR is required to resolve RHOAIENG-18965.

The cause of "Changing 0 replicas to 0 causes a new pod creation" is the default KnativeServing configuration, which does not allow for this functionality. By default KnativeServing is configured to not allow zero initial scale and has the default initial scale value as 0. This means that for any knative revision to be considered ready at least one replica must first be created. In the case of changing an inference service's replicas to 0, a new revision is created with initial scale as the default value of 1, which causes a single pod to be spun up. Once the initial scale is reached this pod is then destroyed to achieve the desired state of 0 replicas.

Per the Knative Configuring Scale Bounds documentation, to prevent pods from being created when an inference service is configured to have 0 replicas, the following is required:

  1. KnativeServing must be configured to allow zero initial scale. Covered in this PR.
  2. The autoscaling.knative.dev/initial-scale: "0" annotation must be added to created knative revisions. Covered in PR.

How Has This Been Tested?

  • Unit Tests
  • Manual Integration Tests

Test Strategy:

Inference Service:

  1. Deploy ODH model serving component in a cluster
  2. Deploy sample ISVC
  3. Update the KnativeServing operator and apply the following value under the config key:
autoscaler:
      allow-zero-initial-scale: "true"
  1. Scale the ODH operator to 0, this will prevent the change in step 5 from being reverted by the operator
  2. Edit the kserve-controller-manager-deployment and update the image to be quay.io/bretthom/kserve-dev:autoscaling-odh
  3. Edit the ISVC to change minReplicas to 0.
  4. Observe that no new pods are created for the new knative revision

Inference Service:

  1. Deploy ODH model serving component in a cluster
  2. Deploy sample IG
  3. Update the KnativeServing operator and apply the following value under the config key:
autoscaler:
      allow-zero-initial-scale: "true"
  1. Scale the ODH operator to 0, this will prevent the change in step 5 from being reverted by the operator
  2. Edit the kserve-controller-manager-deployment and update the image to be quay.io/bretthom/kserve-dev:autoscaling-odh
  3. Edit the IG to add the autoscaling.knative.dev/min-scale: "0" annotation
  4. Observe that no new pods are created for the new knative revision

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@brettmthompson
Copy link
Copy Markdown
Contributor Author

/test opendatahub-operator-e2e

Copy link
Copy Markdown
Member

@carlkyrillos carlkyrillos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes make sense and look good to me but it'd be good to have someone with more operator/kserve experience take a look as well.
CC: @zdtsw

@brettmthompson
Copy link
Copy Markdown
Contributor Author

/cc @Jooho

@openshift-ci openshift-ci Bot requested a review from Jooho March 19, 2025 21:12
@zdtsw zdtsw requested review from israel-hdez and spolti March 20, 2025 06:44
@zdtsw
Copy link
Copy Markdown
Member

zdtsw commented Mar 20, 2025

The changes make sense and look good to me but it'd be good to have someone with more operator/kserve experience take a look as well. CC: @zdtsw

i added other serving guys as reviewers here as well.
i am fine with the change, just wait for ack from any of them.

@zdtsw
Copy link
Copy Markdown
Member

zdtsw commented Mar 20, 2025

btw, if this is not needed for rhoai 2.19, we can wait till next week to sync downstream, right?
@brettmthompson

@brettmthompson
Copy link
Copy Markdown
Contributor Author

@zdtsw RHOAIENG-18965 is marked as fix version RHOAI_2.20.0

@Jooho
Copy link
Copy Markdown
Contributor

Jooho commented Mar 20, 2025

The changes looks good to me and I tested it on my end, it is working fine.
However, I want to make sure

  • the odh operator can update this object properly when odh is upgraded?
  • Do knative pods need to be restarted to apply the new configuration

@lburgazzoli
Copy link
Copy Markdown
Member

* the odh operator can update this object properly when odh is upgraded?

It should, ideally, there should be an e2e test to ensure it happens

* Do knative pods need to be restarted to apply the new configuration

since this changes the spec of the KnativeServing, I would expect the related controller would act upon the change. This can be tested independently of the odh operator.

Copy link
Copy Markdown
Contributor

@Jooho Jooho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Mar 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carlkyrillos, spolti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Jooho
Copy link
Copy Markdown
Contributor

Jooho commented Mar 24, 2025

@zdtsw the failed use case is not related to this change. Could you please review the e2e ci?

@zdtsw
Copy link
Copy Markdown
Member

zdtsw commented Mar 24, 2025

/test opendatahub-operator-e2e

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 25.71%. Comparing base (8f96e1d) to head (b82ef58).
Report is 21 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1774   +/-   ##
=======================================
  Coverage   25.71%   25.71%           
=======================================
  Files         173      173           
  Lines       11892    11892           
=======================================
  Hits         3058     3058           
  Misses       8545     8545           
  Partials      289      289           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@spolti
Copy link
Copy Markdown
Member

spolti commented Mar 24, 2025

@brettmthompson can you please fix the conflict?

@brettmthompson brettmthompson force-pushed the feature/RHOAIENG-18965-changing-replicas-to-0-causes-new-pod-creation branch from c77e3c8 to f82c35a Compare March 24, 2025 18:00
@brettmthompson brettmthompson force-pushed the feature/RHOAIENG-18965-changing-replicas-to-0-causes-new-pod-creation branch from f82c35a to 4c3d3d8 Compare March 24, 2025 19:17
Copy link
Copy Markdown
Contributor

@israel-hdez israel-hdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brettmthompson Please, make sure that docs for manual setup are updated.

@lburgazzoli
Copy link
Copy Markdown
Member

/test opendatahub-operator-e2e

@hdefazio
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label Apr 18, 2025
@openshift-merge-bot openshift-merge-bot Bot merged commit 1fcd734 into opendatahub-io:main Apr 18, 2025
10 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in ODH Platform Planning Apr 18, 2025
@grdryn
Copy link
Copy Markdown
Member

grdryn commented Apr 23, 2025

/cherry-pick rhoai

@openshift-cherrypick-robot
Copy link
Copy Markdown

@grdryn: new pull request created: #1877

Details

In response to this:

/cherry-pick rhoai

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.