250731 sync kserve/master into odh/master batch3 #820

Merged: openshift-merge-bot[bot] merged 10 commits into opendatahub-io:master from mholder6:250731_sync_upstream_batch3 on Aug 6, 2025

Conversation

mholder6 commented Aug 5, 2025

What this PR does / why we need it:
Synced the kserve/master branch into the odh/master branch [batch3]
[RHOAIENG-29333]

Summary by CodeRabbit

  • New Features

    • Added support for configuring and deploying secure, parallelized LLM inference services with new Kubernetes resource templates, including decode, prefill, worker, router, and scheduler configurations.
    • Introduced options to return raw logits or probabilities for Hugging Face model outputs via new CLI flags.
    • Enhanced LLM inference service configuration with additional parallelism options (data, dataLocal, dataRPCPort, expert).
  • Bug Fixes

    • Improved validation for predictor configurations to disallow invalid name fields in standard predictors.
    • Fixed test assertions and output expectations to align with updated raw logits and probability outputs.
  • Refactor

    • Modularized Hugging Face model output processing for better clarity and extensibility.
    • Replaced hardcoded configuration strings with constants in OpenTelemetry Collector setup for maintainability.
  • Tests

    • Added and updated tests to cover new output modes (raw logits) and configuration presets.
    • Expanded end-to-end and unit tests for new LLM inference service features and configuration options.
  • Chores

    • Added new Kubernetes manifests and Kustomization files for easier resource management and deployment.
    • Updated Makefile and deployment scripts for improved resource handling and test reporting.

bartoszmajsak and others added 10 commits July 19, 2025 20:43
…ve#4601)

Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
…r different namespaces (kserve#4593)

Signed-off-by: Vincent Hou <shou73@bloomberg.net>
Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: HutakiHare <sunniessheep@gmail.com>
…r_cipn (kserve#4607)

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>
Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>
Co-authored-by: Bartosz Majsak <bartosz.majsak@gmail.com>
Co-authored-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>
Co-authored-by: Bartosz Majsak <bartosz.majsak@gmail.com>
Co-authored-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Co-authored-by: Sivanantham <90966311+sivanantha321@users.noreply.github.com>
…io#784) (kserve#4619)

Signed-off-by: Bartosz Majsak <bartosz.majsak@gmail.com>
coderabbitai Bot commented Aug 5, 2025

Walkthrough

This change introduces new and updated Kubernetes CRD schemas, resource templates, and supporting Go logic for large language model (LLM) inference services, including new parallelism fields and expert flags. It adds several new Go utility, lifecycle, and defaulting methods, refactors and expands Python model postprocessing for raw logits, and enhances test coverage for new output modes and controller behaviors.
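
As a reference, here is a minimal Go sketch of the expanded parallelism spec, inferred from the field summary below; exact names, tags, and validation markers in pkg/apis/serving/v1alpha1/llm_inference_service_types.go may differ:

package v1alpha1

type ParallelismSpec struct {
	// Existing fields, narrowed from *int64 to *int32 in this change.
	Tensor   *int32 `json:"tensor,omitempty"`
	Pipeline *int32 `json:"pipeline,omitempty"`
	// New data-parallelism fields.
	Data        *int32 `json:"data,omitempty"`
	DataLocal   *int32 `json:"dataLocal,omitempty"`
	DataRPCPort *int32 `json:"dataRPCPort,omitempty"`
	// New expert-parallelism toggle (may be a plain bool in the actual type).
	Expert *bool `json:"expert,omitempty"`
}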

Changes

Cohort / File(s) Change Summary
CRD Schema Updates
config/crd/full/serving.kserve.io_llminferenceserviceconfigs.yaml, config/crd/full/serving.kserve.io_llminferenceservices.yaml, charts/llmisvc-crd/templates/serving.kserve.io_llminferenceserviceconfigs.yaml, charts/llmisvc-crd/templates/serving.kserve.io_llminferenceservices.yaml
Added data, dataLocal, dataRPCPort (int32), and expert (bool) fields to parallelism and prefill.parallelism; changed pipeline and tensor types to int32; removed required: [model] from specs.
LLM Inference Service Config Templates
config/llmisvc/config-llm-*.yaml, charts/llmisvc-resources/templates/config-llm-*.yaml, config/llmisvc/kustomization.yaml
Introduced multiple new LLMInferenceServiceConfig resource templates for various LLM workloads (decode, prefill, worker, router, scheduler) with detailed pod specs, volume mounts, probes, and TLS settings. Added a kustomization file listing these resources.
Makefile
Makefile
Modified manifests target to copy config/llmisvc templates to Helm chart templates and remove kustomization.yaml after copying.
Go API Types & Utilities
pkg/apis/serving/v1alpha1/llm_inference_service_types.go, pkg/apis/serving/v1alpha1/zz_generated.deepcopy.go, pkg/apis/serving/v1alpha1/llm_inference_service_types_func.go, pkg/apis/serving/v1alpha1/v1alpha1.go
Updated ParallelismSpec types and added new fields. Added utility methods for parallelism and reference checks. Added GVK variables for new kinds. Updated deepcopy logic for new fields.
Go API Defaulting & Lifecycle
pkg/apis/serving/v1alpha1/llm_inference_service_defaults.go, pkg/apis/serving/v1alpha1/llm_inference_service_lifecycle.go
Added defaulting for model name and detailed condition/lifecycle management methods for LLMInferenceService resources.
LLM Service Controller Logic
pkg/controller/v1alpha1/llmisvc/config_merge.go, pkg/controller/v1alpha1/llmisvc/utils.go, pkg/controller/v1alpha1/llmisvc/sample.go
Added config merging logic for base and preset overlays, utility for workload label selectors, and a sample LLMInferenceService resource generator.
LLM Service Controller Tests
pkg/controller/v1alpha1/llmisvc/config_merge_test.go, pkg/controller/v1alpha1/llmisvc/config_presets_test.go
Updated tests for int32 parallelism fields; added comprehensive preset validation test loading and checking all config YAML files.
InferenceGraph Controller Enhancements
pkg/controller/v1alpha1/inferencegraph/controller.go, pkg/controller/v1alpha1/inferencegraph/controller_test.go, pkg/controller/v1alpha1/inferencegraph/knative_reconciler.go
Added/updated logic and tests to support "force stop" annotation, updating status conditions and controlling Knative service lifecycle accordingly.
Python Huggingface Server: Raw Logits Support
python/huggingfaceserver/huggingfaceserver/__main__.py, python/huggingfaceserver/huggingfaceserver/encoder_model.py
Added --return_raw_logits CLI flag (mutually exclusive with --return_probabilities); refactored model output postprocessing to support raw logits and modular task-specific handling.
Python Huggingface Server Tests
python/huggingfaceserver/tests/test_model.py, python/huggingfaceserver/tests/test_output.py
Added fixtures and tests for raw logits output; updated expected output structures and test assertions for new output modes.
E2E and Integration Test Updates
test/e2e/predictor/test_huggingface.py, test/e2e/predictor/test_output.py, test/e2e/predictor/test_autoscaling.py, test/e2e/logger/test_raw_logger.py, test/scripts/gh-actions/run-e2e-tests.sh
Split Huggingface sequence classification tests for logits and probabilities; updated test output variable names; adjusted KEDA metric query assertions; minor test decorator and script output tweaks.
KEDA & OpenTelemetry Reconciler Updates
pkg/controller/v1beta1/inferenceservice/reconcilers/keda/keda_reconciler.go, pkg/controller/v1beta1/inferenceservice/reconcilers/keda/keda_reconciler_test.go, pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler.go, pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler_test.go, pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go
Enhanced KEDA metric query construction to include namespace and deployment; refactored OpenTelemetry collector config to use constants and always include resource detection and transform processors; updated tests accordingly.
Agent Model Watcher Test
pkg/agent/watcher_test.go
Refactored test server/client initialization for model event tests to use dynamic URLs and correct client association.
InferenceService v1beta1 Validation
pkg/apis/serving/v1beta1/inference_service_validation.go, pkg/apis/serving/v1beta1/inference_service_validation_test.go
Added validation to disallow name field in standard predictors and corresponding test.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant K8s API Server
    participant Controller
    participant LLMInferenceServiceConfig
    participant LLMInferenceService

    User->>K8s API Server: Apply LLMInferenceService manifest
    K8s API Server->>Controller: Watch/notify new LLMInferenceService
    Controller->>LLMInferenceServiceConfig: Merge base refs, overlays, presets
    Controller->>LLMInferenceService: Set defaults, update conditions
    Controller->>K8s API Server: Create/patch workloads, router, scheduler
    Controller->>LLMInferenceService: Update status/conditions (Ready/Stopped)

sequenceDiagram
    participant User
    participant PythonServer
    participant Model
    participant Client

    User->>PythonServer: Run with --return_raw_logits or --return_probabilities
    Client->>PythonServer: Send inference request
    PythonServer->>Model: Run model forward pass
    Model-->>PythonServer: Return outputs
    PythonServer->>PythonServer: Postprocess (raw logits or probabilities)
    PythonServer-->>Client: Return formatted result

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~75+ minutes

Possibly related PRs

  • opendatahub-io/kserve#805: Also modifies the Makefile's manifests target for "llmisvc" resources, enhancing manifest generation steps—directly related to build and resource handling logic.
  • opendatahub-io/kserve#752: Introduces the initial LLMInferenceServiceConfig and LLMInferenceService CRDs; this PR extends and builds upon those foundational definitions.

Suggested labels

approved, lgtm, tide/merge-method-merge

Suggested reviewers

  • brettmthompson

Poem

In the warren where code bunnies dwell,
New configs and CRDs ring the bell.
Parallel fields and expert flags sprout,
Raw logits hop in, probabilities out!
With templates and tests, we leap ever higher—
This LLM patch is a rabbit’s desire!

((\
( -.-)
o_(")(")

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.38.6)
pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go
🔧 golangci-lint (2.2.2)

Error: can't load config: can't unmarshal config by viper (flags, file): 1 error(s) decoding:

  • 'output.formats' expected a map, got 'slice'

The command was terminated due to this error.


openshift-ci Bot commented Aug 5, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mholder6

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot added the approved label Aug 5, 2025
mholder6 added the tide/merge-method-merge ("Change lgtm to use a merge commit") label Aug 5, 2025

coderabbitai Bot left a comment

Actionable comments posted: 28

🧹 Nitpick comments (30)
test/scripts/gh-actions/run-e2e-tests.sh (1)

34-38: Quote shell variables when passing to pytest

$PARALLELISM and $NETWORK_LAYER may contain whitespace or other shell-special characters coming from CI parameterization. Quoting avoids accidental word-splitting and protects against edge cases.

-    pytest -m "$MARKER" --ignore=qpext --log-cli-level=INFO -n $PARALLELISM --dist worksteal --network-layer $NETWORK_LAYER --ignore=explainer/
+    pytest -m "$MARKER" --ignore=qpext --log-cli-level=INFO -n "$PARALLELISM" --dist worksteal --network-layer "$NETWORK_LAYER" --ignore=explainer/

Same applies to the else branch.

python/huggingfaceserver/huggingfaceserver/encoder_model.py (2)

354-385: Well-structured sequence classification processing.

The method correctly handles three output modes (raw logits, probabilities, indices) with proper CPU conversion and type handling.

Consider extracting the CPU conversion pattern into a small helper to reduce duplication:

def _to_cpu(self, tensor: Tensor) -> Tensor:
    return tensor.cpu() if tensor.is_cuda else tensor

433-474: Consider including decoded tokens for better interpretability.

The methods correctly process logits and probabilities, but the output format uses token IDs as keys. Consider including decoded tokens for better interpretability.

Example enhancement:

token = self._tokenizer.decode([token_id])
token_logits.append({"token": token, "token_id": token_id, "logit": logit.item()})
pkg/controller/v1alpha1/inferencegraph/knative_reconciler.go (1)

95-99: Consider using the existing utility function for consistency.

While the implementation is correct, there's an existing utility function utils.GetForceStopRuntime that performs the same logic. However, since this checks the annotation on the Knative service template rather than the InferenceGraph resource itself, the current implementation may be intentional.

If the intention is to check the InferenceGraph's annotation rather than the Knative service template's annotation, consider using the utility:

-forceStopRuntime := false
-if val, exist := desired.Spec.Template.Annotations[constants.StopAnnotationKey]; exist {
-    forceStopRuntime = strings.EqualFold(val, "true")
-}
+// If checking the IG's annotation was intended
+// forceStopRuntime := utils.GetForceStopRuntime(ig)
pkg/controller/v1alpha1/inferencegraph/controller_test.go (1)

1542-1765: Well-structured test implementation with comprehensive coverage!

The test implementation is thorough and follows best practices. The helper functions are reusable and the test scenarios comprehensively cover the stop annotation behavior.

Consider using the defined timeout constant instead of hardcoding the duration:

-}, time.Second*10, interval).Should(BeTrue(), "%T %s should not be created", obj, objKey.Name)
+}, timeout, interval).Should(BeTrue(), "%T %s should not be created", obj, objKey.Name)
config/llmisvc/kustomization.yaml (1)

14-14: Add missing newline at end of file.

YAML files should end with a newline character for proper formatting.

   - config-llm-worker-data-parallel.yaml
+
charts/llmisvc-crd/templates/serving.kserve.io_llminferenceservices.yaml (1)

129-146: Duplicate schema block—extract with $ref to avoid drift

The prefill.parallelism block duplicates the exact structure just defined for spec.parallelism. Duplicated schemas easily diverge over time.

Consider hoisting this definition to components.schemas.<Parallelism> and referencing it twice:

components:
  schemas:
    Parallelism:
      type: object
      properties:
        ...
      required: []   # if any
...
spec:
  properties:
    parallelism:
      $ref: '#/components/schemas/Parallelism'
    prefill:
      properties:
        parallelism:
          $ref: '#/components/schemas/Parallelism'

Reduces maintenance burden and guarantees both locations stay identical.

charts/llmisvc-resources/templates/config-llm-template.yaml (1)

1-82: Add missing newline at end of file.

The YAML file is missing a newline character at the end, which violates standard formatting conventions.

          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+
config/llmisvc/config-llm-router-route.yaml (1)

1-38: Add missing newline at end of file.

The YAML file is missing a newline character at the end, which violates standard formatting conventions.

                request: 0s
+
charts/llmisvc-resources/templates/config-llm-router-route.yaml (1)

1-38: Add missing newline at end of file.

The YAML file is missing a newline character at the end, which violates standard formatting conventions.

                request: 0s
+
config/llmisvc/config-llm-template.yaml (1)

35-40: Harden the securityContext

Good start with allowPrivilegeEscalation: false and dropping MKNOD, but best-practice for public clusters is also to:

  • run as non-root
  • set readOnlyRootFilesystem: true (you already mount /home for writes)
  • drop all caps and add back only what is required (none for vLLM).
         securityContext:
           allowPrivilegeEscalation: false
-          capabilities:
-            drop:
-              - MKNOD
+          runAsNonRoot: true
+          readOnlyRootFilesystem: true
+          capabilities:
+            drop: ["ALL"]
config/llmisvc/config-llm-prefill-template.yaml (2)

16-29: Potentially duplicated model specification

vllm serve already receives the model name as positional arg (line 18) and via --served-model-name (lines 20-22). If both are supplied the binary currently ignores one, but future versions could error. Consider keeping only one form:

-          command:
-            - vllm
-            - serve
-            - "{{ .Spec.Model.Name }}"
+          command: ["vllm", "serve"]

30-40: Align terminationMessagePolicy with other presets

Other LLM templates use FallbackToLogsOnError; this one sets File. Unless there is a specific reason, use the same policy everywhere to keep operator behaviour predictable.

config/crd/full/serving.kserve.io_llminferenceservices.yaml (1)

129-139: Duplicate schema fragment – consider $ref or helper definition

The prefill.parallelism block repeats the exact structure of spec.model.parallelism.
While valid, duplication increases maintenance overhead and the risk of drift.
You can DRY this by defining a reusable Parallelism schema in components.schemas and $ref-ing it in both places.

config/llmisvc/config-llm-worker-data-parallel.yaml (1)

122-123: Add missing trailing newline

yamllint flags a missing EOL. Some tooling (e.g. git diff –stat) warns on this.

-secret:
-  secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+secret:
+  secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+
config/llmisvc/config-llm-decode-worker-data-parallel.yaml (1)

168-170: Missing trailing newline

Add a final EOL to satisfy yamllint.

config/llmisvc/config-llm-prefill-worker-data-parallel.yaml (1)

122-124: Add newline at EOF

Insert a trailing newline for lint cleanliness.

config/llmisvc/config-llm-scheduler.yaml (2)

64-68: Only requests, no limits defined

While the scheduler container declares resources.requests, it omits limits. Defining limits prevents potential resource ballooning in failure scenarios.


83-89: EOF newline missing

Add a terminating newline to comply with yamllint.

charts/llmisvc-resources/templates/config-llm-prefill-template.yaml (3)

37-40: TerminationMessagePolicy File may hide crash logs

With File, the termination message comes only from the container's termination message file, so crash output in the logs is not surfaced. FallbackToLogsOnError (as in other presets) falls back to the tail of the container log on error, providing richer debugging information.

-terminationMessagePolicy: File
+terminationMessagePolicy: FallbackToLogsOnError

38-39: Security hardening

Consider adding:

securityContext:
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  capabilities:
    drop:
      - ALL

to align with the stricter posture in the scheduler preset.


79-81: Missing trailing newline

Insert EOL to satisfy lint.

charts/llmisvc-resources/templates/config-llm-worker-data-parallel.yaml (1)

123-123: Add newline at end of file

YAML files should end with a newline character per best practices.

-          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+
charts/llmisvc-resources/templates/config-llm-scheduler.yaml (1)

89-89: Add newline at end of file

YAML files should end with a newline character per best practices.

-        terminationGracePeriodSeconds: 30
+        terminationGracePeriodSeconds: 30
+
config/llmisvc/config-llm-decode-template.yaml (2)

131-131: Add newline at end of file

YAML files should end with a newline character per best practices.

-          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+

79-79: Consider adding security context for init container

The init container has an empty security context. Consider adding security restrictions similar to the main container for defense in depth.

-        securityContext: { }
+        securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          capabilities:
+            drop:
+              - ALL
charts/llmisvc-resources/templates/config-llm-decode-worker-data-parallel.yaml (2)

170-170: Add newline at end of file

YAML files should end with a newline character per best practices.

-          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+

7-54: Consider adding security context for init container

The routing sidecar init container lacks an explicit security context. For consistency and defense in depth, consider adding security restrictions.

After line 16, add:

        securityContext:
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          capabilities:
            drop:
              - ALL
pkg/controller/v1alpha1/llmisvc/config_merge.go (1)

214-216: Address TODO: Improve error messages with available configs.

The TODO comment indicates that error messages should list available LLMInferenceServiceConfig resources when the requested config is not found. This would significantly improve debugging experience.

Would you like me to implement this enhancement or create an issue to track it?
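
For reference, one possible shape for such an error (hypothetical helper; the real config_merge.go API may differ):

package llmisvc

import (
	"fmt"
	"sort"
	"strings"
)

// configNotFoundError surfaces the configs that do exist, so a typo in a
// preset name is easy to spot from the error alone.
func configNotFoundError(name, namespace string, available []string) error {
	sort.Strings(available)
	return fmt.Errorf(
		"LLMInferenceServiceConfig %q not found in namespace %q (available: %s)",
		name, namespace, strings.Join(available, ", "),
	)
}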

charts/llmisvc-resources/templates/config-llm-decode-template.yaml (1)

131-131: Add newline at end of file.

YAMLlint reports missing newline character at the end of the file.

-          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
+
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd3f39a and 20cbd00.

📒 Files selected for processing (53)
  • Makefile (1 hunks)
  • charts/llmisvc-crd/templates/serving.kserve.io_llminferenceserviceconfigs.yaml (1 hunks)
  • charts/llmisvc-crd/templates/serving.kserve.io_llminferenceservices.yaml (1 hunks)
  • charts/llmisvc-resources/templates/config-llm-decode-template.yaml (1 hunks)
  • charts/llmisvc-resources/templates/config-llm-decode-worker-data-parallel.yaml (1 hunks)
  • charts/llmisvc-resources/templates/config-llm-prefill-template.yaml (1 hunks)
  • charts/llmisvc-resources/templates/config-llm-prefill-worker-data-parallel.yaml (1 hunks)
  • charts/llmisvc-resources/templates/config-llm-router-route.yaml (1 hunks)
  • charts/llmisvc-resources/templates/config-llm-scheduler.yaml (1 hunks)
  • charts/llmisvc-resources/templates/config-llm-template.yaml (1 hunks)
  • charts/llmisvc-resources/templates/config-llm-worker-data-parallel.yaml (1 hunks)
  • config/crd/full/serving.kserve.io_llminferenceserviceconfigs.yaml (1 hunks)
  • config/crd/full/serving.kserve.io_llminferenceservices.yaml (1 hunks)
  • config/llmisvc/config-llm-decode-template.yaml (1 hunks)
  • config/llmisvc/config-llm-decode-worker-data-parallel.yaml (1 hunks)
  • config/llmisvc/config-llm-prefill-template.yaml (1 hunks)
  • config/llmisvc/config-llm-prefill-worker-data-parallel.yaml (1 hunks)
  • config/llmisvc/config-llm-router-route.yaml (1 hunks)
  • config/llmisvc/config-llm-scheduler.yaml (1 hunks)
  • config/llmisvc/config-llm-template.yaml (1 hunks)
  • config/llmisvc/config-llm-worker-data-parallel.yaml (1 hunks)
  • config/llmisvc/kustomization.yaml (1 hunks)
  • pkg/agent/watcher_test.go (1 hunks)
  • pkg/apis/serving/v1alpha1/llm_inference_service_defaults.go (1 hunks)
  • pkg/apis/serving/v1alpha1/llm_inference_service_lifecycle.go (1 hunks)
  • pkg/apis/serving/v1alpha1/llm_inference_service_types.go (2 hunks)
  • pkg/apis/serving/v1alpha1/llm_inference_service_types_func.go (1 hunks)
  • pkg/apis/serving/v1alpha1/v1alpha1.go (1 hunks)
  • pkg/apis/serving/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • pkg/apis/serving/v1beta1/inference_service_validation.go (2 hunks)
  • pkg/apis/serving/v1beta1/inference_service_validation_test.go (2 hunks)
  • pkg/controller/v1alpha1/inferencegraph/controller.go (2 hunks)
  • pkg/controller/v1alpha1/inferencegraph/controller_test.go (4 hunks)
  • pkg/controller/v1alpha1/inferencegraph/knative_reconciler.go (2 hunks)
  • pkg/controller/v1alpha1/llmisvc/config_merge.go (2 hunks)
  • pkg/controller/v1alpha1/llmisvc/config_merge_test.go (13 hunks)
  • pkg/controller/v1alpha1/llmisvc/config_presets_test.go (1 hunks)
  • pkg/controller/v1alpha1/llmisvc/sample.go (1 hunks)
  • pkg/controller/v1alpha1/llmisvc/utils.go (1 hunks)
  • pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go (2 hunks)
  • pkg/controller/v1beta1/inferenceservice/reconcilers/keda/keda_reconciler.go (3 hunks)
  • pkg/controller/v1beta1/inferenceservice/reconcilers/keda/keda_reconciler_test.go (16 hunks)
  • pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler.go (4 hunks)
  • pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler_test.go (7 hunks)
  • python/huggingfaceserver/huggingfaceserver/__main__.py (2 hunks)
  • python/huggingfaceserver/huggingfaceserver/encoder_model.py (4 hunks)
  • python/huggingfaceserver/tests/test_model.py (4 hunks)
  • python/huggingfaceserver/tests/test_output.py (1 hunks)
  • test/e2e/logger/test_raw_logger.py (1 hunks)
  • test/e2e/predictor/test_autoscaling.py (1 hunks)
  • test/e2e/predictor/test_huggingface.py (4 hunks)
  • test/e2e/predictor/test_output.py (1 hunks)
  • test/scripts/gh-actions/run-e2e-tests.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: bartoszmajsak
PR: opendatahub-io/kserve#729
File: test/overlays/llm-istio-experimental/istiod.yaml:2559-2566
Timestamp: 2025-07-10T14:58:38.489Z
Learning: In the kserve repository, files in `test/overlays/llm-istio-experimental/` directory are considered external files by the maintainers and should not be flagged for YAML formatting issues or other code quality concerns.
📚 Learning: in the kserve repository, files in `test/overlays/llm-istio-experimental/` directory are considered ...
Learnt from: bartoszmajsak
PR: opendatahub-io/kserve#729
File: test/overlays/llm-istio-experimental/istiod.yaml:2559-2566
Timestamp: 2025-07-10T14:58:38.489Z
Learning: In the kserve repository, files in `test/overlays/llm-istio-experimental/` directory are considered external files by the maintainers and should not be flagged for YAML formatting issues or other code quality concerns.

Applied to files:

  • Makefile
  • config/llmisvc/kustomization.yaml
  • charts/llmisvc-resources/templates/config-llm-template.yaml
  • config/llmisvc/config-llm-template.yaml
  • charts/llmisvc-crd/templates/serving.kserve.io_llminferenceservices.yaml
  • config/llmisvc/config-llm-prefill-template.yaml
  • charts/llmisvc-resources/templates/config-llm-worker-data-parallel.yaml
  • config/llmisvc/config-llm-prefill-worker-data-parallel.yaml
  • charts/llmisvc-resources/templates/config-llm-prefill-template.yaml
  • config/llmisvc/config-llm-decode-template.yaml
  • config/llmisvc/config-llm-scheduler.yaml
  • config/llmisvc/config-llm-decode-worker-data-parallel.yaml
  • pkg/controller/v1alpha1/llmisvc/config_presets_test.go
  • charts/llmisvc-resources/templates/config-llm-scheduler.yaml
  • config/llmisvc/config-llm-worker-data-parallel.yaml
  • charts/llmisvc-resources/templates/config-llm-prefill-worker-data-parallel.yaml
  • charts/llmisvc-resources/templates/config-llm-decode-worker-data-parallel.yaml
📚 Learning: the file `test/overlays/llm-istio-experimental/istiod.yaml` is considered external and irrelevant fo...
Learnt from: bartoszmajsak
PR: opendatahub-io/kserve#729
File: test/overlays/llm-istio-experimental/istiod.yaml:46-65
Timestamp: 2025-07-10T14:58:54.997Z
Learning: The file `test/overlays/llm-istio-experimental/istiod.yaml` is considered external and irrelevant for review in this repository.

Applied to files:

  • config/llmisvc/kustomization.yaml
  • charts/llmisvc-resources/templates/config-llm-worker-data-parallel.yaml
  • pkg/controller/v1alpha1/llmisvc/config_presets_test.go
📚 Learning: the file `test/overlays/llm-istio-experimental/istio_base.yaml` is considered external and irrelevan...
Learnt from: bartoszmajsak
PR: opendatahub-io/kserve#729
File: test/overlays/llm-istio-experimental/istio_base.yaml:6854-6869
Timestamp: 2025-07-10T14:58:14.161Z
Learning: The file `test/overlays/llm-istio-experimental/istio_base.yaml` is considered external and irrelevant for review in this repository.

Applied to files:

  • config/llmisvc/kustomization.yaml
📚 Learning: in pkg/apis/serving/v1beta1/configmap.go, the getstorageinitializerconfigs function was moved from a...
Learnt from: israel-hdez
PR: opendatahub-io/kserve#738
File: pkg/apis/serving/v1beta1/configmap.go:0-0
Timestamp: 2025-07-17T16:19:36.771Z
Learning: In pkg/apis/serving/v1beta1/configmap.go, the GetStorageInitializerConfigs function was moved from another package to prevent circular imports, which explains why it may contain existing patterns rather than being newly written code.

Applied to files:

  • config/llmisvc/kustomization.yaml
  • pkg/apis/serving/v1alpha1/v1alpha1.go
  • pkg/controller/v1alpha1/llmisvc/utils.go
  • pkg/apis/serving/v1beta1/inference_service_validation_test.go
  • pkg/controller/v1beta1/inferenceservice/reconcilers/keda/keda_reconciler.go
  • pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler.go
  • pkg/agent/watcher_test.go
  • pkg/controller/v1alpha1/llmisvc/sample.go
  • pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler_test.go
  • pkg/controller/v1beta1/inferenceservice/reconcilers/keda/keda_reconciler_test.go
  • pkg/controller/v1alpha1/llmisvc/config_presets_test.go
  • pkg/controller/v1alpha1/inferencegraph/controller_test.go
  • pkg/controller/v1alpha1/llmisvc/config_merge.go
  • pkg/apis/serving/v1alpha1/llm_inference_service_types_func.go
📚 Learning: in pkg/controller/v1alpha1/utils/utils.go, the condition `minreplicas == nil && constants.defaultmin...
Learnt from: brettmthompson
PR: opendatahub-io/kserve#709
File: pkg/controller/v1alpha1/utils/utils.go:118-128
Timestamp: 2025-06-30T22:30:38.045Z
Learning: In pkg/controller/v1alpha1/utils/utils.go, the condition `minReplicas == nil && constants.DefaultMinReplicas == 0` in ValidateInitialScaleAnnotation function is intentionally kept even though currently unreachable (DefaultMinReplicas = 1) to handle potential future changes to DefaultMinReplicas value, implementing defensive programming to limit future code changes.

Applied to files:

  • pkg/apis/serving/v1alpha1/llm_inference_service_defaults.go
  • pkg/controller/v1alpha1/inferencegraph/knative_reconciler.go
  • pkg/controller/v1alpha1/inferencegraph/controller.go
  • pkg/controller/v1alpha1/inferencegraph/controller_test.go
📚 Learning: in the kserve llminferenceservice controller, the discoverurls function in router_discovery.go inten...
Learnt from: bartoszmajsak
PR: opendatahub-io/kserve#757
File: pkg/controller/llmisvc/router_discovery.go:45-49
Timestamp: 2025-07-21T17:40:11.003Z
Learning: In the KServe LLMInferenceService controller, the DiscoverURLs function in router_discovery.go intentionally returns nil, nil for missing Gateways rather than errors. This allows the reconciler to continue processing and update status conditions through the separate validation logic in ref_validation.go, preventing reconciliation loops that would occur if errors were returned.

Applied to files:

  • charts/llmisvc-resources/templates/config-llm-router-route.yaml
  • config/llmisvc/config-llm-router-route.yaml
  • pkg/controller/v1alpha1/inferencegraph/knative_reconciler.go
  • config/llmisvc/config-llm-scheduler.yaml
  • charts/llmisvc-resources/templates/config-llm-scheduler.yaml
  • pkg/controller/v1alpha1/inferencegraph/controller.go
  • pkg/controller/v1alpha1/inferencegraph/controller_test.go
  • pkg/controller/v1alpha1/llmisvc/config_merge.go
  • pkg/apis/serving/v1alpha1/llm_inference_service_lifecycle.go
  • pkg/apis/serving/v1alpha1/llm_inference_service_types_func.go
🧬 Code Graph Analysis (9)
pkg/apis/serving/v1alpha1/v1alpha1.go (1)
pkg/apis/serving/v1beta1/v1beta1.go (1)
  • SchemeGroupVersion (38-38)
pkg/controller/v1alpha1/llmisvc/config_merge_test.go (1)
pkg/apis/serving/v1alpha1/llm_inference_service_types.go (2)
  • WorkloadSpec (92-115)
  • ParallelismSpec (247-266)
pkg/controller/v1alpha1/llmisvc/utils.go (1)
pkg/apis/serving/v1alpha1/llm_inference_service_types.go (1)
  • LLMInferenceServiceSpec (60-89)
pkg/apis/serving/v1alpha1/llm_inference_service_defaults.go (1)
pkg/apis/serving/v1alpha1/llm_inference_service_types.go (1)
  • LLMInferenceService (40-46)
pkg/apis/serving/v1beta1/inference_service_validation_test.go (3)
pkg/apis/serving/v1beta1/inference_service.go (2)
  • InferenceService (137-145)
  • InferenceServiceSpec (24-36)
pkg/apis/serving/v1beta1/predictor.go (2)
  • PredictorSpec (32-69)
  • PredictorExtensionSpec (88-105)
pkg/apis/serving/v1beta1/predictor_sklearn.go (1)
  • SKLearnSpec (27-30)
pkg/controller/v1alpha1/llmisvc/sample.go (1)
pkg/apis/serving/v1alpha1/llm_inference_service_types.go (13)
  • LLMInferenceService (40-46)
  • LLMInferenceServiceSpec (60-89)
  • LLMModelSpec (118-141)
  • LLMStorageSpec (270-281)
  • WorkloadSpec (92-115)
  • ParallelismSpec (247-266)
  • RouterSpec (153-174)
  • GatewayRoutesSpec (177-181)
  • HTTPRouteSpec (185-195)
  • GatewaySpec (198-203)
  • UntypedObjectReference (286-291)
  • SchedulerSpec (223-232)
  • InferencePoolSpec (236-244)
pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler_test.go (1)
pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler.go (16)
  • AnnotationPrometheusPort (47-47)
  • KeyJobName (77-77)
  • JobNameOtelCollector (42-42)
  • KeyScrapeInterval (78-78)
  • KeyStaticConfigs (79-79)
  • KeyTargets (80-80)
  • PrometheusReceiver (43-43)
  • KeyConfig (75-75)
  • KeyScrapeConfigs (76-76)
  • ProcessorFilterMetrics (41-41)
  • KeyMetrics (71-71)
  • KeyInclude (72-72)
  • KeyMetricNames (74-74)
  • ProcessorResourcedetectionEnv (39-39)
  • ProcessorTransform (40-40)
  • PipelineMetrics (59-59)
pkg/controller/v1alpha1/inferencegraph/controller.go (2)
pkg/utils/utils.go (1)
  • GetForceStopRuntime (220-239)
pkg/apis/serving/v1beta1/inference_service_status.go (2)
  • Stopped (132-132)
  • StoppedISVCReason (205-205)
pkg/apis/serving/v1alpha1/llm_inference_service_types_func.go (1)
pkg/apis/serving/v1alpha1/llm_inference_service_types.go (5)
  • GatewayRoutesSpec (177-181)
  • GatewaySpec (198-203)
  • HTTPRouteSpec (185-195)
  • InferencePoolSpec (236-244)
  • ParallelismSpec (247-266)
🪛 YAMLlint (1.37.1)
config/llmisvc/kustomization.yaml

[error] 14-14: no new line character at the end of file

(new-line-at-end-of-file)

charts/llmisvc-resources/templates/config-llm-template.yaml

[error] 82-82: no new line character at the end of file

(new-line-at-end-of-file)

charts/llmisvc-resources/templates/config-llm-router-route.yaml

[error] 38-38: no new line character at the end of file

(new-line-at-end-of-file)

config/llmisvc/config-llm-template.yaml

[error] 82-82: no new line character at the end of file

(new-line-at-end-of-file)

config/llmisvc/config-llm-router-route.yaml

[error] 38-38: no new line character at the end of file

(new-line-at-end-of-file)

config/llmisvc/config-llm-prefill-template.yaml

[error] 81-81: no new line character at the end of file

(new-line-at-end-of-file)

charts/llmisvc-resources/templates/config-llm-worker-data-parallel.yaml

[error] 123-123: no new line character at the end of file

(new-line-at-end-of-file)

config/llmisvc/config-llm-prefill-worker-data-parallel.yaml

[error] 124-124: no new line character at the end of file

(new-line-at-end-of-file)

charts/llmisvc-resources/templates/config-llm-prefill-template.yaml

[error] 81-81: no new line character at the end of file

(new-line-at-end-of-file)

config/llmisvc/config-llm-decode-template.yaml

[error] 131-131: no new line character at the end of file

(new-line-at-end-of-file)

config/llmisvc/config-llm-scheduler.yaml

[error] 89-89: no new line character at the end of file

(new-line-at-end-of-file)

config/llmisvc/config-llm-decode-worker-data-parallel.yaml

[error] 170-170: no new line character at the end of file

(new-line-at-end-of-file)

charts/llmisvc-resources/templates/config-llm-decode-template.yaml

[error] 131-131: no new line character at the end of file

(new-line-at-end-of-file)

charts/llmisvc-resources/templates/config-llm-scheduler.yaml

[error] 89-89: no new line character at the end of file

(new-line-at-end-of-file)

config/llmisvc/config-llm-worker-data-parallel.yaml

[error] 123-123: no new line character at the end of file

(new-line-at-end-of-file)

charts/llmisvc-resources/templates/config-llm-prefill-worker-data-parallel.yaml

[error] 124-124: no new line character at the end of file

(new-line-at-end-of-file)

charts/llmisvc-resources/templates/config-llm-decode-worker-data-parallel.yaml

[error] 170-170: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (82)
test/e2e/logger/test_raw_logger.py (1)

70-71: Decorator alignment looks good

Adding the @pytest.mark.asyncio(scope="session") marker brings this test in line with test_kserve_logger, ensuring both use the same session-scoped event loop. No further issues spotted in this segment.

test/scripts/gh-actions/run-e2e-tests.sh (1)

29-29: LGTM – unconditional log line simplifies flow

Always printing the chosen parallelism is clearer than the previous conditional branch.
No further action needed.

test/e2e/predictor/test_autoscaling.py (1)

683-683: LGTM! Test assertion updated to match new metric query format.

The test correctly updates the expected metricQuery to use the new fully qualified Prometheus query format with namespace and deployment label selectors. This change aligns with the corresponding updates in the KEDA reconciler logic.

pkg/controller/v1beta1/inferenceservice/reconcilers/keda/keda_reconciler.go (3)

70-71: LGTM! Function signature enhancement for metric isolation.

Adding the componentMeta parameter enables the function to construct namespace and deployment-specific metric queries, which improves metric isolation in multi-tenant environments.


159-164: Excellent improvement for metric isolation and security.

The new metric query construction properly isolates metrics by wrapping the base query with sum() and adding namespace and deployment label selectors. This prevents cross-tenant metric leakage and ensures each InferenceService only scales based on its own metrics.

The Prometheus query syntax is correct and follows best practices for metric aggregation.
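
For illustration, the wrapping could be assembled roughly like this (hypothetical helper; the actual getKedaMetrics implementation may differ, and this assumes the base query is a bare metric name to which label matchers can be appended):

package keda

import "fmt"

// isolatedMetricQuery scopes a base PromQL query to the owning namespace
// and deployment, then aggregates with sum() so only this service's
// series drive scaling.
func isolatedMetricQuery(baseQuery, namespace, deployment string) string {
	return fmt.Sprintf(
		`sum(%s{namespace=%q, deployment=%q})`,
		baseQuery, namespace, deployment,
	)
}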


196-196: LGTM! Function call updated to pass required parameter.

The call to getKedaMetrics correctly passes the componentMeta parameter, enabling the enhanced metric query construction.

pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go (2)

10358-10372: LGTM! Well-structured OpenTelemetry processor configurations.

The added processor configurations follow OpenTelemetry best practices:

  • resourcedetection/env properly configures environment-based resource detection with reasonable timeout
  • transform processor correctly maps Kubernetes resource attributes to metric attributes using standard attribute names

10387-10387: LGTM! Proper OpenTelemetry pipeline processor ordering.

The processors are correctly ordered in the metrics pipeline:

  1. resourcedetection/env - Resource detection first
  2. transform - Attribute transformation using detected resources
  3. filter/metrics - Filtering last after enrichment

This follows OpenTelemetry best practices for data processing flow.

pkg/controller/v1beta1/inferenceservice/reconcilers/keda/keda_reconciler_test.go (3)

58-61: LGTM: Consistent test setup for new function signature

The addition of componentMeta initialization across all test cases properly supports the updated getKedaMetrics function signature. The consistent use of "test-component" and "test-namespace" ensures uniform test data.

Also applies to: 73-76, 90-93, 258-261, 288-291, 318-321, 409-412, 424-427, 453-456, 481-484, 510-513, 539-542, 585-588, 621-624


65-65: LGTM: Function calls updated for new signature

All calls to getKedaMetrics are correctly updated to include the new componentMeta parameter as the first argument, maintaining consistency across the test suite.

Also applies to: 80-80, 97-97, 280-280, 310-310, 339-339, 418-418, 445-445, 473-473, 502-502, 531-531, 570-570, 609-609, 646-646


102-103: LGTM: Enhanced metric query validation

The updated expected metric queries correctly validate the new enhanced format that includes namespace and deployment selectors wrapped in a sum() function. This improvement provides better metric targeting and aggregation capabilities.

Also applies to: 651-651

pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler_test.go (3)

50-50: LGTM: Improved maintainability with constants

The replacement of hardcoded string literals with predefined constants (e.g., AnnotationPrometheusPort, KeyJobName, KeyScrapeInterval) enhances code maintainability and reduces the risk of typos in test assertions.

Also applies to: 59-64, 79-84, 100-112


117-141: LGTM: Comprehensive processor configuration validation

The enhanced test logic properly validates the new processor configuration behavior:

  • Always includes ProcessorResourcedetectionEnv and ProcessorTransform processors
  • Conditionally includes ProcessorFilterMetrics when metric names are specified
  • Correctly validates pipeline processor ordering

This provides thorough coverage of the processor configuration logic.


211-225: LGTM: Consistent constant usage throughout tests

The consistent use of constants in existing collector configurations maintains the refactoring pattern and improves overall code quality.

Also applies to: 258-264

pkg/controller/v1beta1/inferenceservice/reconcilers/otel/otel_reconciler.go (3)

38-85: LGTM: Excellent refactoring with comprehensive constants

The introduction of well-organized constants significantly improves code maintainability by:

  • Eliminating hardcoded strings throughout the codebase
  • Following Go naming conventions consistently
  • Providing clear semantic meaning for configuration values
  • Reducing the risk of typos and inconsistencies

This is a high-quality refactoring that enhances code readability and maintainability.


118-151: LGTM: Well-architected processor configuration

The enhanced processor configuration logic provides a solid foundation for metric processing:

  • Always includes essential processors: resourcedetection/env for Kubernetes environment detection and transform for adding namespace/deployment/pod attributes
  • Conditional filtering: Adds filter/metrics processor only when specific metric names are specified
  • Proper pipeline ordering: Ensures processors are applied in the correct sequence

This architecture ensures consistent metric labeling while providing flexibility for metric filtering when needed.
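
A compact sketch of that ordering rule (constant values assumed from the processor names mentioned in this review):

package otel

const (
	ProcessorResourcedetectionEnv = "resourcedetection/env"
	ProcessorTransform            = "transform"
	ProcessorFilterMetrics        = "filter/metrics"
)

// buildProcessorPipeline always enables detection and transformation,
// and appends filtering last so it runs on the enriched stream.
func buildProcessorPipeline(metricNames []string) []string {
	processors := []string{ProcessorResourcedetectionEnv, ProcessorTransform}
	if len(metricNames) > 0 {
		processors = append(processors, ProcessorFilterMetrics)
	}
	return processors
}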


163-200: LGTM: Clean and maintainable collector configuration

The OpenTelemetry Collector configuration is now much more maintainable with consistent use of constants. The configuration properly includes:

  • Prometheus receiver with configurable scrape settings
  • OTLP exporter with appropriate TLS configuration
  • Dynamic processor configuration based on the enhanced logic
  • Well-structured service pipeline connecting all components

The use of constants makes the configuration more readable and less prone to configuration errors.

test/e2e/predictor/test_output.py (1)

404-409: LGTM! Variable renaming aligns with new raw logits functionality.

The renaming from huggingface_sequence_classification_with_probabilities_expected_output to huggingface_sequence_classification_with_raw_logits_expected_output correctly reflects the change in output format. The values are appropriate for raw logits (pre-softmax outputs).
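
For context, raw logits $z$ are the pre-softmax scores, and the former probability outputs relate to them by

$p_i = \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$

so the renamed fixture's values are simply the unnormalized counterparts of the old ones.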

python/huggingfaceserver/tests/test_output.py (1)

18-179: LGTM! Variable renaming and structure simplification are appropriate.

The renaming to bert_token_classification_return_raw_logits_expected_output correctly reflects the new functionality. The simplification of the nested structure by removing the extra list level around the logits dictionaries makes the data structure cleaner and more straightforward while maintaining the correct logit values for token classification.

python/huggingfaceserver/huggingfaceserver/__main__.py (2)

142-154: LGTM! Well-implemented mutually exclusive CLI arguments.

The implementation correctly uses a mutually exclusive group to prevent conflicts between --return_probabilities and --return_raw_logits. The help text clearly specifies supported tasks and the argument handling follows established patterns.


303-303: LGTM! Proper argument forwarding to model constructor.

The return_raw_logits argument is correctly forwarded from the CLI arguments to the HuggingfaceEncoderModel constructor, maintaining consistency with other boolean flags.

test/e2e/predictor/test_huggingface.py (3)

43-43: LGTM! Import updated to match new expected output variable.

The import correctly references the renamed expected output variable for raw logits.


577-640: LGTM! Well-structured test for raw logits functionality.

The test function is properly renamed and configured to test the --return_raw_logits flag. The output parsing and assertion logic correctly handles the raw logits format, and proper cleanup is maintained.


642-697: LGTM! Comprehensive test coverage for both output modes.

The addition of a separate test function for probabilities ensures both output modes are properly validated. The test correctly uses the --return_probabilities flag and validates the expected probability output format.

python/huggingfaceserver/tests/test_model.py (6)

31-31: LGTM! Import updated for raw logits expected output.

The import correctly references the renamed expected output variable for token classification raw logits.


100-111: LGTM! Well-configured fixture for sequence classification raw logits.

The new fixture properly sets up the model with return_raw_logits=True and follows the established pattern for model fixtures.


127-139: LGTM! Well-configured fixture for token classification raw logits.

The new fixture properly sets up the token classification model with return_raw_logits=True and maintains consistency with other fixtures.


284-304: LGTM! Comprehensive test for sequence classification raw logits.

The test function properly validates raw logits output with appropriate floating-point approximations. The expected values appear to be valid logit values for binary classification.


308-316: LGTM! Proper test for token classification raw logits.

The test function correctly validates token classification raw logits output using the imported expected output, maintaining consistency with the test pattern.


281-281: LGTM! Updated assertion for probability output format.

The assertion correctly validates the probability output format as a dictionary with class probabilities.

python/huggingfaceserver/huggingfaceserver/encoder_model.py (6)

16-16: LGTM!

The additional type imports (Tuple, Union) are properly used in the new helper methods' type annotations.


294-329: Excellent refactoring for improved modularity!

The delegation of task-specific processing to dedicated helper methods significantly improves code organization and maintainability. The clear docstring and consistent pattern across all ML tasks make the code much easier to understand and extend.


330-352: LGTM! Clean input normalization.

The method properly handles both Tensor and InferResponse types, ensuring consistent output format for downstream processing.


476-514: LGTM! Consistent token classification processing.

The method follows the same pattern as sequence classification, properly handling three output modes with appropriate CPU conversion and type handling.


516-539: LGTM! Standard text embedding processing.

The method correctly implements mean pooling with attention mask and L2 normalization, which are standard practices for text embeddings.
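
For reference, with token embeddings $h_t$ and attention mask $m_t$, the pooling described here computes

$e = \frac{\sum_t m_t h_t}{\sum_t m_t}, \qquad \hat{e} = \frac{e}{\lVert e \rVert_2}$

i.e. a mask-weighted mean followed by L2 normalization to unit length.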


105-105: Mutual exclusivity enforced at CLI level

The --return_probabilities and --return_raw_logits flags are both added to the same output_format_group (created via parser.add_mutually_exclusive_group()) in __main__.py. This guarantees you cannot set both flags at once when invoking the server via CLI, so no additional runtime validation is required.

pkg/controller/v1alpha1/inferencegraph/knative_reconciler.go (2)

106-116: LGTM!

The deletion logic correctly handles the force stop scenario by checking for existing deletion timestamps to avoid duplicate deletion attempts and properly exits the retry loop.


141-146: LGTM!

The code correctly prevents Knative service creation when the stop annotation is set, while maintaining consistent status reporting.

pkg/controller/v1alpha1/inferencegraph/controller.go (1)

157-158: LGTM!

Correctly uses the shared utility function to retrieve the stop annotation value early in the reconciliation process.

pkg/agent/watcher_test.go (2)

706-710: LGTM! Good test refactoring.

Moving the test server and HTTPSProvider client initialization earlier in the scenario iteration is a good improvement. This ensures the client is properly associated with the test server before it's used for model operations.


719-719: Excellent improvement using dynamic test server URLs.

Replacing hardcoded example URLs with dynamic test server URLs (ts.URL + "/test.tar" and ts.URL + "/test.zip") makes the tests more robust and self-contained. This ensures the tests use the actual mock server instead of potentially failing external endpoints.

Also applies to: 726-726

Makefile (1)

99-101: LGTM! Proper integration of LLM inference service templates.

The addition of these commands correctly integrates the newly added LLM inference service configurations into the Helm chart packaging process. The pattern follows the existing approach used for other components (rbac, localmodel, etc.), ensuring consistency in the build process.

pkg/controller/v1alpha1/llmisvc/config_merge_test.go (1)

290-290: LGTM! Proper type alignment with updated API schema.

The changes from ptr.To[int64] to ptr.To[int32] for parallelism fields (Tensor and Pipeline) correctly align with the updated ParallelismSpec struct definition where these fields are now *int32. This ensures type consistency with the schema changes introduced in the CRD definitions.

Also applies to: 298-298, 307-308, 382-383, 428-429, 475-476, 535-536, 567-568, 617-618, 648-649, 710-711, 967-967, 995-995, 1028-1029

pkg/apis/serving/v1beta1/inference_service_validation_test.go (1)

21-21: LGTM!

The strings import is correctly added to support the new test function that uses strings.Contains().

config/llmisvc/kustomization.yaml (1)

1-14: LGTM! Well-structured Kustomization file.

The Kustomization configuration is properly structured with:

  • Correct API version and kind
  • Appropriate namespace (kserve)
  • Comprehensive list of LLM inference service resource configurations

This aligns well with the broader PR changes introducing LLM inference service components.

pkg/apis/serving/v1alpha1/v1alpha1.go (1)

43-45: LGTM! Proper GVK variable definitions.

The new LLMInferenceServiceGVK and LLMInferenceServiceConfigGVK variables are correctly defined following the established pattern in the codebase. These provide standardized GroupVersionKind identifiers for the new LLM inference service resources, which will be useful for controller logic and type identification throughout the system.
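
Following the referenced pattern, the new variables presumably look like this (sketch; assumes SchemeGroupVersion is this package's serving.kserve.io/v1alpha1):

package v1alpha1

var (
	LLMInferenceServiceGVK       = SchemeGroupVersion.WithKind("LLMInferenceService")
	LLMInferenceServiceConfigGVK = SchemeGroupVersion.WithKind("LLMInferenceServiceConfig")
)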

charts/llmisvc-crd/templates/serving.kserve.io_llminferenceservices.yaml (1)

119-123: Downgrade from int64 to int32 may break existing objects

pipeline and tensor were previously int64. Narrowing to int32 shrinks the valid range from ≈9.2 quintillion (2^63-1) to ≈2.1 billion (2^31-1) and will reject any existing CRs using larger values.

Double-check that:

  1. No production manifests exceed 2 147 483 647.
  2. API go-types are updated (*int32) everywhere to stay consistent.

If there is any doubt, keep int64 or provide a conversion webhook.

pkg/apis/serving/v1alpha1/llm_inference_service_defaults.go (1)

27-33: LGTM! Clean defaulting implementation.

The defaulting logic follows Kubernetes best practices by implementing the apis.Defaultable interface correctly. The nil and empty string checks ensure robust handling, and using ptr.To for pointer creation is the recommended approach.
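
A minimal sketch of the pattern, with a hypothetical field and default value (the real ones live in the kserve type):

package main

import (
	"context"
	"fmt"

	"k8s.io/utils/ptr"
)

type LLMInferenceServiceConfig struct {
	Replicas *int32
	Target   string
}

// SetDefaults fills unset fields, mirroring the apis.Defaultable contract.
func (c *LLMInferenceServiceConfig) SetDefaults(_ context.Context) {
	if c.Replicas == nil {
		c.Replicas = ptr.To[int32](1)
	}
	if c.Target == "" {
		c.Target = "default"
	}
}

func main() {
	cfg := &LLMInferenceServiceConfig{}
	cfg.SetDefaults(context.Background())
	fmt.Println(*cfg.Replicas, cfg.Target)
}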

pkg/apis/serving/v1beta1/inference_service_validation.go (2)

128-131: LGTM! Proper integration with existing validation flow.

The new validation is correctly integrated into the main validation workflow with proper error handling and early return.


153-185: LGTM! Comprehensive validation for standard predictors.

The validation function correctly:

  • Logs the predictor for debugging purposes
  • Checks all standard predictor types for the disallowed name field
  • Uses consistent error messaging
  • Follows the existing validation patterns in the codebase

The explicit switch statement approach is preferable to reflection for this use case as it's more readable and type-safe.
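
A hedged sketch of the switch-based check; the predictor fields shown are stand-ins, not the full v1beta1 spec:

package main

import (
	"errors"
	"fmt"
)

type component struct{ Name string }

type PredictorSpec struct {
	SKLearn    *component
	Tensorflow *component
	PyTorch    *component
}

var errNameDisallowed = errors.New("name field is not allowed for standard predictors")

// validateStandardPredictorName rejects a non-empty name on any standard predictor.
func validateStandardPredictorName(p PredictorSpec) error {
	switch {
	case p.SKLearn != nil && p.SKLearn.Name != "":
		return errNameDisallowed
	case p.Tensorflow != nil && p.Tensorflow.Name != "":
		return errNameDisallowed
	case p.PyTorch != nil && p.PyTorch.Name != "":
		return errNameDisallowed
	}
	return nil
}

func main() {
	fmt.Println(validateStandardPredictorName(PredictorSpec{SKLearn: &component{Name: "x"}}))
}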

pkg/controller/v1alpha1/llmisvc/utils.go (1)

25-32: LGTM! Well-structured label selector generation.

The function follows Kubernetes labeling best practices using standard labels (app.kubernetes.io/part-of, app.kubernetes.io/name) and appropriate custom labels (kserve.io/component). The unused LLMInferenceServiceSpec parameter appears intentional for future extensibility or interface consistency.
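
In shape, the function presumably resembles the following sketch; the label values are assumptions inferred from the keys named above:

package main

import "fmt"

// selectorLabels builds the selector for a given service and component.
func selectorLabels(serviceName, component string) map[string]string {
	return map[string]string{
		"app.kubernetes.io/part-of": "llminferenceservice",
		"app.kubernetes.io/name":    serviceName,
		"kserve.io/component":       component,
	}
}

func main() {
	fmt.Println(selectorLabels("my-llm", "workload"))
}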

pkg/apis/serving/v1alpha1/zz_generated.deepcopy.go (1)

1163-1187: LGTM! Auto-generated deep copy methods are correct.

The generated deep copy logic correctly handles:

  • Type changes from *int64 to *int32 for Tensor and Pipeline fields
  • New *int32 fields: Data, DataLocal, and DataRPCPort
  • Proper nil checking and pointer allocation for all fields

These changes align with the API schema updates for enhanced parallelism configuration.
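
For readers unfamiliar with deepcopy-gen output, the pattern it emits for pointer fields looks like this hand-written equivalent (illustration only):

package v1alpha1

type ParallelismSpec struct {
	Tensor, Pipeline, Data, DataLocal, DataRPCPort *int32
}

// DeepCopyInto copies the receiver into out, allocating fresh pointers so the
// copy does not alias the original's fields.
func (in *ParallelismSpec) DeepCopyInto(out *ParallelismSpec) {
	*out = *in
	if in.Data != nil {
		out.Data = new(int32)
		*out.Data = *in.Data
	}
	// The same nil-check-then-allocate pattern repeats for Tensor, Pipeline,
	// DataLocal, and DataRPCPort.
}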

config/crd/full/serving.kserve.io_llminferenceserviceconfigs.yaml (3)

100-105: Changing pipeline / tensor from int64int32 may break existing CRs

Previous releases allowed values up to 2^63-1; truncating to 32-bit will reject any CRs that relied on larger counts. Confirm no user workloads depend on >2^31-1 values or bump the CRD version and document the breaking change.


122-127: Double-check backward-compat risk for prefill.parallelism.pipeline|tensor

As with Lines 100-105, confirm that reducing the integer width will not invalidate pre-existing LLMInferenceServiceConfig objects that specify large values.


88-97: Add schema validation and descriptions for new parallelism fields

Please update the CRD to forbid zero/negative values and out-of-range ports, and mirror these changes in the Helm chart templates.

• File: config/crd/full/serving.kserve.io_llminferenceserviceconfigs.yaml (lines 88–97)

   data:
     format: int32
     type: integer
+      description: Number of data-parallel workers (must be ≥ 1)
+      minimum: 1
   dataLocal:
     format: int32
     type: integer
+      description: Number of local data-parallel shards on this node (must be ≥ 1)
+      minimum: 1
   dataRPCPort:
     format: int32
     type: integer
+      description: RPC port used by data-parallel workers (1–65535)
+      minimum: 1
+      maximum: 65535
   expert:
     type: boolean
+      description: Enables expert parallelism optimization

• Ensure the same description, minimum/maximum entries are added to the corresponding Helm chart templates so CRD and chart stay in sync.
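
If the Go types use kubebuilder markers, the same bounds can be generated rather than hand-edited; a sketch under that assumption:

package v1alpha1

type ParallelismSpec struct {
	// Data is the number of data-parallel workers.
	// +kubebuilder:validation:Minimum=1
	// +optional
	Data *int32 `json:"data,omitempty"`

	// DataRPCPort is the RPC port used by data-parallel workers.
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=65535
	// +optional
	DataRPCPort *int32 `json:"dataRPCPort,omitempty"`
}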

charts/llmisvc-crd/templates/serving.kserve.io_llminferenceserviceconfigs.yaml (1)

100-105: Potentially breaking change: pipeline / tensor downgraded from int64 → int32

Changing the format from int64 to int32 tightens the allowed numeric range to ±2 147 483 647.
Any existing CRs or automation that used larger counts (common in large-scale GPU clusters) will now fail validation on upgrade.

Please double-check backward-compatibility requirements; if larger values must remain supported, keep format: int64 or introduce dual-field migration logic.

charts/llmisvc-resources/templates/config-llm-template.yaml (3)

8-27: LGTM! Well-configured container with security best practices.

The container configuration properly enables SSL/TLS with certificate mounting and uses appropriate security settings. The vllm serve configuration with SSL refresh and proper certificate paths is well-structured.


35-39: Good security practices implemented.

The security context properly restricts privilege escalation and drops the MKNOD capability, following container security best practices.


42-59: Health probe configuration is appropriate.

The liveness and readiness probes are well-configured with reasonable timeouts and use HTTPS for security. The initial delay and failure thresholds are sensible for LLM inference services.

config/llmisvc/config-llm-router-route.yaml (2)

36-38: Consider the implications of zero timeouts.

Both backendRequest and request timeouts are set to 0s, which typically means no timeout. For LLM inference services that can have long processing times, this may be intentional, but ensure this aligns with your infrastructure's timeout policies.

Verify that zero timeouts are appropriate for your LLM inference workloads and won't conflict with upstream proxy or load balancer timeout settings.


10-35: Routing configuration looks correct.

The Gateway API HTTP route configuration properly references the parent Gateway, sets up path prefix matching with URL rewrite, and routes to the correct InferencePool backend. The templating for dynamic names is well-structured.

pkg/apis/serving/v1alpha1/llm_inference_service_types.go (3)

62-64: Good clarification for optional model field.

Making the Model field optional for LLMInferenceServiceConfig kind makes sense for template configurations. The comment and annotation clearly document this behavior.


254-266: Well-designed new parallelism fields.

The new data parallelism fields (Data, DataLocal, DataRPCPort) are properly typed as *int32 for consistency and include appropriate documentation. The Expert field as a boolean is appropriate for a feature flag.


250-253: ParallelismSpec Tensor/Pipeline fields safe as int32

A full-code search confirms that Tensor and Pipeline are only used within the ParallelismSpec struct, its autogenerated deep-copy methods, and helper functions in llm_inference_service_types_func.go. No other code treats them as int64. All environment-variable handling, arithmetic for GPU/Ray node counts, and defaults operate on int32 (or small literal values), so overflow beyond 2³¹-1 is effectively impossible in practice. No further changes required.

charts/llmisvc-resources/templates/config-llm-router-route.yaml (2)

36-38: Consider the implications of zero timeouts.

Both backendRequest and request timeouts are set to 0s, which typically means no timeout. For LLM inference services that can have long processing times, this may be intentional, but ensure this aligns with your infrastructure's timeout policies.

This appears to be the same configuration as config/llmisvc/config-llm-router-route.yaml. Verify that zero timeouts are appropriate for your LLM inference workloads.


1-35: Router configuration is consistent with config version.

This Helm template version of the router configuration maintains the same structure and templating as the config version, which is good for consistency.

pkg/controller/v1alpha1/llmisvc/sample.go (4)

29-36: Well-documented sample function.

The function provides a comprehensive sample that covers all major components of an LLMInferenceService. The purpose and usage are clearly documented.


63-68: Good use of new parallelism fields.

The sample properly demonstrates the new parallelism fields including Data, DataLocal, Tensor, and Expert, which aligns with the API type changes.


81-90: Reasonable resource specifications.

The CPU and memory resource requests and limits are appropriate for different workload types, with prefill pods getting more resources than main containers, which aligns with LLM inference patterns.


169-223: Comprehensive router configuration sample.

The router specification includes all major components (HTTP routes, gateway references, scheduler with inference pool) and demonstrates the full capabilities of the routing system.

config/llmisvc/config-llm-template.yaml (1)

80-82: Verify that the TLS secret is created before this template is rendered

The template expects {{ ChildName .ObjectMeta.Name "-kserve-self-signed-certs" }} to exist; if the controller does not create it beforehand, the Pod will stay Pending.
Please double-check the controller logic or Helm hooks.

charts/llmisvc-resources/templates/config-llm-prefill-worker-data-parallel.yaml (2)

23-24: Arithmetic expression may break when the template renders spaces

START_RANK=$(( ${LWS_WORKER_INDEX:-0} * {{ or .Spec.Prefill.Parallelism.DataLocal 1 }} ))

If .Spec.Prefill.Parallelism.DataLocal is missing, the rendered line becomes:

$(( ${LWS_WORKER_INDEX:-0} * 1 ))

…but if the field is present, the Go template inserts a space-prefixed integer (due to the YAML indent), yielding:

$(( ${LWS_WORKER_INDEX:-0} *  2 )) (note the double space). Bash still parses this, but trimming is safer:

-START_RANK=$(( ${LWS_WORKER_INDEX:-0} * {{ or .Spec.Prefill.Parallelism.DataLocal 1 }} ))
+START_RANK=$(( ${LWS_WORKER_INDEX:-0} * {{ trimPrefix " " (printf "%v" (or .Spec.Prefill.Parallelism.DataLocal 1)) }} ))

76-81: Review added capabilities

IPC_LOCK and SYS_RAWIO are powerful. Confirm they are strictly necessary; otherwise drop them to keep the attack surface minimal.

config/llmisvc/config-llm-worker-data-parallel.yaml (1)

75-81: Questionable need for SYS_RAWIO capability

SYS_RAWIO is a privileged capability that allows raw I/O port access and is rarely required by user-space applications. Retaining it widens the pod’s attack surface.

If the container does not actually perform low-level I/O, drop this capability (and ideally set readOnlyRootFilesystem: true & runAsNonRoot) to align with the “least privilege” principle.

config/llmisvc/config-llm-decode-worker-data-parallel.yaml (1)

123-128: Same privileged capability concern as worker config

SYS_RAWIO should be dropped unless strictly necessary. Consider mirroring the more restrictive policy used in the scheduler preset (drop: [ALL], readOnlyRootFilesystem: true, runAsNonRoot: true).

config/llmisvc/config-llm-prefill-worker-data-parallel.yaml (1)

76-82: Privileged capabilities should be reviewed

Unless vllm requires SYS_RAWIO, drop it and enable readOnlyRootFilesystem plus runAsNonRoot.

charts/llmisvc-resources/templates/config-llm-worker-data-parallel.yaml (1)

1-123: LGTM! Well-structured worker data parallel configuration

The configuration properly handles leader/worker differentiation using LWS_WORKER_INDEX and includes appropriate security controls, health probes, and TLS support.

pkg/controller/v1alpha1/llmisvc/config_presets_test.go (1)

41-354: Comprehensive test coverage for configuration presets

The test properly validates all preset configurations, checks for well-known configs, and verifies variable substitution. Good use of table-driven tests with the detailed expected configuration for the decode-worker-data-parallel preset.

charts/llmisvc-resources/templates/config-llm-scheduler.yaml (1)

70-78: Excellent security configuration

The security context follows best practices with read-only filesystem, non-root user, dropped capabilities, and seccomp profile. This provides strong security isolation for the scheduler component.

config/llmisvc/config-llm-decode-template.yaml (1)

105-108: TLS configuration uses self-signed certificates

The configuration uses --decoder-tls-insecure-skip-verify=true and --prefiller-tls-insecure-skip-verify=true, which is appropriate for self-signed certificates referenced in the volume mount. This aligns with the secret name pattern.

charts/llmisvc-resources/templates/config-llm-decode-worker-data-parallel.yaml (1)

54-170: Well-structured decode worker configuration with proper parallelism support

The configuration correctly implements leader/worker differentiation, includes comprehensive parallelism options (expert, tensor, data), and has appropriate health probes and TLS support.

pkg/apis/serving/v1alpha1/llm_inference_service_types_func.go (1)

64-75: Verify the parallelism size calculation logic.

The calculation on line 69 divides Data by DataLocal to compute the effective parallelism size. Please verify:

  1. This performs integer division, which truncates; is this intended?
  2. The formula assumes DataLocal represents the local group size within data parallelism; please confirm this interpretation.

Consider adding a comment to document the calculation logic:

 func (p *ParallelismSpec) GetSize() *int32 {
 	if p == nil {
 		return nil
 	}
 	if p.IsDataParallel() {
+		// Calculate number of data parallel groups: total data parallel / local group size
 		return ptr.To(max(ptr.Deref(p.Data, 1), 1) / max(ptr.Deref(p.DataLocal, 1), 1))
 	}
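
A worked example of the formula as written, assuming the same max/Deref semantics (Go 1.21+ builtin max):

package main

import (
	"fmt"

	"k8s.io/utils/ptr"
)

func size(data, dataLocal *int32) int32 {
	return max(ptr.Deref(data, 1), 1) / max(ptr.Deref(dataLocal, 1), 1)
}

func main() {
	fmt.Println(size(ptr.To[int32](8), ptr.To[int32](2))) // 4 data-parallel groups
	fmt.Println(size(ptr.To[int32](7), ptr.To[int32](2))) // 3: integer division truncates
}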

@spolti (Member) commented Aug 6, 2025

/lgtm

openshift-ci[bot] added the lgtm label on Aug 6, 2025
openshift-merge-bot[bot] merged commit 7b04aea into opendatahub-io:master on Aug 6, 2025
29 checks passed
github-project-automation[bot] moved this from New/Backlog to Done in ODH Model Serving Planning on Aug 6, 2025
mholder6 deleted the 250731_sync_upstream_batch3 branch on August 6, 2025 at 15:36
