fix(scheduler): Uses ClusterRole for TokenReview and SAR resources#760
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe changes expand RBAC permissions to include Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant K8s API
participant LLMInferenceServiceReconciler
participant Scheduler Service Account
User->>K8s API: Create/Update/Delete LLMInferenceService
K8s API->>LLMInferenceServiceReconciler: Trigger Reconcile
alt Resource is being deleted
LLMInferenceServiceReconciler->>LLMInferenceService: Check finalizer
alt Finalizer present
LLMInferenceServiceReconciler->>Scheduler Service Account: Cleanup (reconcileSchedulerServiceAccount)
LLMInferenceServiceReconciler->>LLMInferenceService: Remove finalizer and update
end
LLMInferenceServiceReconciler-->>K8s API: Exit reconciliation
else Resource is not being deleted
LLMInferenceServiceReconciler->>LLMInferenceService: Add finalizer if missing
LLMInferenceServiceReconciler->>Scheduler Service Account: Reconcile Service Account, Role, RoleBinding, and Auth Delegator ClusterRoleBinding
end
Estimated code review effort4 (~75 minutes) Suggested labels
Poem
✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
| ownerLogLine, | ||
| owner.GetNamespace(), owner.GetName(), | ||
| ) | ||
| if len(expected.GetNamespace()) != 0 { |
There was a problem hiding this comment.
@bartoszmajsak Is there a better way for cluster-scoped resources?
There was a problem hiding this comment.
nit
| if len(expected.GetNamespace()) != 0 { | |
| if expected.GetNamespace() == "" { |
There was a problem hiding this comment.
You can try apiutil.IsObjectNamespaced(expected, c.Scheme(), c.RESTMapper())
https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/client/apiutil/apimachinery.go#L63
There was a problem hiding this comment.
it seems to make a full (non-cached) remote call, I'd not use it for now as the non-empty for namespace is quite reliable already?
There was a problem hiding this comment.
I looked at the code and I saw caching mentioned here https://github.com/kubernetes-sigs/controller-runtime/blob/e08f24a1e29fb84e4b059dd471b1c8b4928751e7/pkg/client/apiutil/apimachinery.go#L75-L77 and subsequently here: https://github.com/kubernetes-sigs/controller-runtime/blob/e08f24a1e29fb84e4b059dd471b1c8b4928751e7/pkg/client/apiutil/restmapper.go#L170-L172
Which would mean it's going to be cached as we are using shared client? Or am I missing something?
e5f942e to
9e1a29c
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
| ownerLogLine, | ||
| owner.GetNamespace(), owner.GetName(), | ||
| ) | ||
| if len(expected.GetNamespace()) != 0 { |
There was a problem hiding this comment.
nit
| if len(expected.GetNamespace()) != 0 { | |
| if expected.GetNamespace() == "" { |
| ownerLogLine, | ||
| owner.GetNamespace(), owner.GetName(), | ||
| ) | ||
| if len(expected.GetNamespace()) != 0 { |
There was a problem hiding this comment.
You can try apiutil.IsObjectNamespaced(expected, c.Scheme(), c.RESTMapper())
https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/client/apiutil/apimachinery.go#L63
|
|
||
| func (r *LLMInferenceServiceReconciler) reconcileSchedulerAuthDelegatorBinding(ctx context.Context, llmSvc *v1alpha1.LLMInferenceService, sa *corev1.ServiceAccount) error { | ||
| authDelegatorBinding := r.expectedSchedulerAuthDelegatorBinding(llmSvc, sa) | ||
| if !llmSvc.DeletionTimestamp.IsZero() || llmSvc.Spec.Router == nil || llmSvc.Spec.Router.Scheduler == nil { |
There was a problem hiding this comment.
!llmSvc.DeletionTimestamp.IsZero() - I noticed we don't do it in other places. How about moving this check to Delete func as a guard to all types of resources? That would simplify the code below too.
There was a problem hiding this comment.
But notice that this condition is over llmSvc instead of authDelegatorBinding. I did this way just to not tamper with the garbage collector for the owned resources which should automatically do the deletion so that, to some extent, it preserves behavior.
In practice, I'm not sure how useful the guard in Delete will be. I see the pattern in the code is to delete the built expected resource; i.e. it doesn't query for the actual one, so I don't think we will have the timestamp. In any case, I agree that having the guard is good. I'll dd it. Yet, IMO this check over llmSvc should stay, unless we are OK with the controller trying the delete despite the ownership?
There was a problem hiding this comment.
I did this way just to not tamper with the garbage collector for the owned resources which should automatically do the deletion
I was thinking to move this down to Delete and guard in the same fashion - if expected is owned by existing resource (owner) we would have a timestamp on owner.
But I see now this is about cluster-scoped resource that I missed going quickly through the code. Apologies for the initial noise.
I am not sure about using Delete for such a case. This method expects an owner and here it is misleading. Perhaps we should just delete it using available client? The same applies to Reconcile. Thinking about it more - maybe we need separate methods for cluster-scoped resources instead of patching the behaviour of the code path not intended for this use-case?
Thoughts @pierDipi?
There was a problem hiding this comment.
Maybe what we want is to allow nil owner for Reconcile and Delete without having separate paths for cluster-scoped resources?
There was a problem hiding this comment.
I am a bit alergic to nils as I've seen many methods with couple of those passed out in the wild. That is making the intent blurry, but we can refactor later.
There was a problem hiding this comment.
I'm not seeing any actionable change. Let me know if I'm misunderstanding.
There was a problem hiding this comment.
but we can refactor later.
follow-up :)
8d836f3 to
3c3efc7
Compare
|
@bartoszmajsak @pierDipi feedback applied. |
| } | ||
|
|
||
| func (r *LLMInferenceServiceReconciler) finalize(ctx context.Context, llmSvc *v1alpha1.LLMInferenceService) error { | ||
| return r.reconcileSchedulerServiceAccount(ctx, llmSvc) |
There was a problem hiding this comment.
| return r.reconcileSchedulerServiceAccount(ctx, llmSvc) | |
| if err := r.reconcileSchedulerServiceAccount(ctx, llmSvc); err != nil { | |
| return fmt.Errorf("failed to finalize scheduler service account: %w", err) | |
| } | |
| return nil |
|
|
||
| func (r *LLMInferenceServiceReconciler) reconcileSchedulerAuthDelegatorBinding(ctx context.Context, llmSvc *v1alpha1.LLMInferenceService, sa *corev1.ServiceAccount) error { | ||
| authDelegatorBinding := r.expectedSchedulerAuthDelegatorBinding(llmSvc, sa) | ||
| if !llmSvc.DeletionTimestamp.IsZero() || llmSvc.Spec.Router == nil || llmSvc.Spec.Router.Scheduler == nil { |
There was a problem hiding this comment.
Maybe what we want is to allow nil owner for Reconcile and Delete without having separate paths for cluster-scoped resources?
| func (r *LLMInferenceServiceReconciler) expectedSchedulerAuthDelegatorBinding(llmSvc *v1alpha1.LLMInferenceService, sa *corev1.ServiceAccount) *rbacv1.ClusterRoleBinding { | ||
| crb := &rbacv1.ClusterRoleBinding{ | ||
| ObjectMeta: metav1.ObjectMeta{ | ||
| Name: kmeta.ChildName(llmSvc.GetName(), "-epp-auth-rb"), |
There was a problem hiding this comment.
we might want to factor in the namespace given the cluster-scoped nature of the object
There was a problem hiding this comment.
| Name: kmeta.ChildName(llmSvc.GetName(), "-epp-auth-rb"), | |
| Name: kmeta.ChildName(llmSvc.GetNamespace(), "-" + llmSvc.GetName() + "-epp-auth-rb"), |
Namepsace as "prefix" to have some hierarchy
The llm-d scheduler uses cluster-scoped TokenReviews and SubjectAccessReview resources to validate for access to the metrics endpoint. This fixes the RBAC privileges to correctly enable scraping metrics. Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
[skip ci] Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com> Signed-off-by: Edgar Hernández <ehernand@redhat.com>
Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
3c3efc7 to
f4d9031
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: israel-hdez, pierDipi The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
7617345
into
opendatahub-io:release-v0.15
…pendatahub-io#760) * fix(scheduler): Uses ClusterRole for TokenReview and SAR resources The llm-d scheduler uses cluster-scoped TokenReviews and SubjectAccessReview resources to validate for access to the metrics endpoint. This fixes the RBAC privileges to correctly enable scraping metrics. Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> * Code review fixes: pierDipi [skip ci] Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com> Signed-off-by: Edgar Hernández <ehernand@redhat.com> * Code review fixes: bartoszmajsak & pierDipi Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> * Code review fixes: pierDipi Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> * Fixing test flakiness Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> --------- Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> Signed-off-by: Edgar Hernández <ehernand@redhat.com> Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* [RHOAIENG-30277] Add router status Signed-off-by: Andres Llausas <allausas@redhat.com> [RHOAIENG-30277] Add router status Signed-off-by: Andres Llausas <allausas@redhat.com> * [RHOAIENG-30277] Add router status Signed-off-by: Andres Llausas <allausas@redhat.com> * fix(scheduler): Uses ClusterRole for TokenReview and SAR resources (#760) * fix(scheduler): Uses ClusterRole for TokenReview and SAR resources The llm-d scheduler uses cluster-scoped TokenReviews and SubjectAccessReview resources to validate for access to the metrics endpoint. This fixes the RBAC privileges to correctly enable scraping metrics. Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> * Code review fixes: pierDipi [skip ci] Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com> Signed-off-by: Edgar Hernández <ehernand@redhat.com> * Code review fixes: bartoszmajsak & pierDipi Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> * Code review fixes: pierDipi Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> * Fixing test flakiness Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> --------- Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> Signed-off-by: Edgar Hernández <ehernand@redhat.com> Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com> * add prebuilt OSSM installation guide (#793) * add prebuilt OSSM installation guide Signed-off-by: Jooho Lee <jlee@redhat.com> * follow up comments Signed-off-by: Jooho Lee <jlee@redhat.com> * remove the authz label Signed-off-by: Jooho Lee <jlee@redhat.com> * fix scripts Signed-off-by: Jooho Lee <jlee@redhat.com> * fix small part Signed-off-by: Jooho Lee <jlee@redhat.com> --------- Signed-off-by: Jooho Lee <jlee@redhat.com> * remove uidModelcar value from inferenceservice-config configmap inodh (#804) * remove uidModelcar value from inferenceservice-config configmap inodh Signed-off-by: Brett Thompson <196701379+brettmthompson@users.noreply.github.com> * adding uidModelcar value back to default configmap Signed-off-by: Brett Thompson <196701379+brettmthompson@users.noreply.github.com> --------- Signed-off-by: Brett Thompson <196701379+brettmthompson@users.noreply.github.com> * update gatewayclass controller name (#811) Signed-off-by: Jooho Lee <jlee@redhat.com> * [RHOAIENG-30277] Add router status Signed-off-by: Andres Llausas <allausas@redhat.com> * [RHOAIENG-30277] Addressed Pierangelo's comments Signed-off-by: Andres Llausas <allausas@redhat.com> * [RHOAIENG-30277] Addressed Pierangelo's comments and Bartosz comments Signed-off-by: Andres Llausas <allausas@redhat.com> * [RHOAIENG-30277] Fixed failing tests Signed-off-by: Andres Llausas <allausas@redhat.com> * [RHOAIENG-30277] Fixed failing tests Signed-off-by: Andres Llausas <allausas@redhat.com> * [RHOAIENG-30277] Addressed nit pick from Pierangelo Signed-off-by: Andres Llausas <allausas@redhat.com> * Ran precommit Signed-off-by: Andres Llausas <allausas@redhat.com> * Ran make precommit Signed-off-by: Andres Llausas <allausas@redhat.com> --------- Signed-off-by: Andres Llausas <allausas@redhat.com> Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com> Signed-off-by: Edgar Hernández <ehernand@redhat.com> Signed-off-by: Jooho Lee <jlee@redhat.com> Signed-off-by: Brett Thompson <196701379+brettmthompson@users.noreply.github.com> Co-authored-by: Edgar Hernández <ehernand@redhat.com> Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com> Co-authored-by: Jooho Lee <jlee@redhat.com> Co-authored-by: Brett Thompson <196701379+brettmthompson@users.noreply.github.com>
What this PR does / why we need it:
The llm-d scheduler uses cluster-scoped
TokenReviewsandSubjectAccessReviewresources to validate for access to the metrics endpoint. This fixes the RBAC privileges to correctly enable scraping metrics.Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)format, will close the issue(s) when PR gets merged):Related to https://issues.redhat.com/browse/RHOAIENG-28166
Type of changes
Please delete options that are not relevant.
Feature/Issue validation/testing:
Manual testing
Checklist:
Summary by CodeRabbit
New Features
Bug Fixes
Chores