NETOBSERV-2675: Centralized K8s informers to reduce API load #1221
leandroberetta wants to merge 8 commits into
Conversation
Skipping CI for Draft Pull Request.
@leandroberetta: This pull request references NETOBSERV-2675 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough: Adds a k8scache subsystem: a new flp-informers binary pushes Kubernetes resource events to processors via a bidirectional gRPC cache protocol; flowlogs-pipeline can host a k8scache server applying updates into a new KubernetesStore used for enrichment/lookups. Manifests, Make targets, proto, client/server, discovery, handlers, and tests included.
Sequence Diagram(s):
sequenceDiagram
participant Informer as flp-informers
participant KubeAPI as Kubernetes API
participant Client as k8scache Client
participant Processor as flowlogs-pipeline (processor)
participant Server as k8scache Server
participant KStore as KubernetesStore
Informer->>KubeAPI: List/Watch resources
KubeAPI-->>Informer: Resource events
Informer->>Client: OnAdd/OnUpdate/OnDelete (enqueue)
Client->>Processor: Dial & Open StreamUpdates (per-processor)
Client->>Processor: Send CacheUpdate (versioned)
Processor->>Server: StreamUpdates receives CacheUpdate
Server->>KStore: Apply Add/Update/Delete
KStore-->>Server: Applied
Server->>Client: Send SyncAck (success + version)
sequenceDiagram
participant Discovery as Processor Discovery
participant ClientQueue as k8scache Client
participant Broadcaster as Broadcaster Loop
participant ProcConn as Processor Connection
participant Server as k8scache Server
participant DS as Datasource
Discovery->>ClientQueue: Discover pods -> AddProcessorWithTimeout
Informer->>ClientQueue: Enqueue update
ClientQueue->>Broadcaster: Dequeue update (version++)
Broadcaster->>ProcConn: Send update to healthy connections (concurrent)
ProcConn->>Server: StreamUpdates RPC receives CacheUpdate
Server->>DS: ApplyCacheAddOrUpdate/Delete
DS-->>Server: ✓
Server->>ProcConn: SyncAck (version)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 14
🧹 Nitpick comments (7)
proto/k8scache.proto (2)
1-5: Address Buf lint: package directory mismatch. Static analysis reports: Files with package "k8scache" must be within a directory "k8scache" relative to root. Consider moving k8scache.proto to proto/k8scache/k8scache.proto or adjusting the package name.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proto/k8scache.proto` around lines 1 - 5, The proto package declaration "package k8scache" and option go_package = "./k8scache" conflict with Buf lint expecting the proto file to live in a matching directory; either move this proto file into a directory named "k8scache" under proto (e.g., proto/k8scache/k8scache.proto) or change the package/option to match its current directory. Update the package line (package k8scache) and the option go_package accordingly if you choose to rename, or relocate the file into a "k8scache" folder if you want to keep the package and go_package as-is, then re-run buf lint to confirm the mismatch is resolved.
55-69: `operation` field is ambiguous when `is_snapshot = true`. The `operation` field is only relevant for incremental updates, but nothing prevents setting it on snapshots. Consider using a `oneof` to make this mutually exclusive, or document explicitly that `operation` is ignored for snapshots.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proto/k8scache.proto` around lines 55 - 69, CacheUpdate's operation field is ambiguous when is_snapshot = true; change CacheUpdate to make snapshot vs incremental mutually exclusive by introducing a oneof (e.g., oneof update_kind { bool is_snapshot = 2; OperationType operation = 4; }) or a more explicit oneof with message Snapshot {} and message Incremental { OperationType operation = 1; } so that OperationType cannot be set on snapshots; update the CacheUpdate definition and any code that reads CacheUpdate (parsers/handlers) to check the oneof variant instead of reading both is_snapshot and operation.
cmd/flowlogs-pipeline/main.go (2)
269-275: Consider adding server address to startup log before `Serve` blocks. The log at line 271 runs in the goroutine, but if `Serve` fails immediately, the startup message may not appear. Moving the log before the goroutine or adding a startup confirmation would improve observability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/flowlogs-pipeline/main.go` around lines 269 - 275, The startup info log currently runs inside the goroutine and may never appear if grpcServer.Serve(listener) fails immediately; move or duplicate the log.WithField("address", address).Info("starting K8s cache sync server") so it runs synchronously before launching the goroutine (or emit an immediate confirmation log right before go func) and keep the existing error log that wraps grpcServer.Serve's returned error; reference grpcServer.Serve, listener, address and the log.WithField call to locate and update the code.
240-278: Unreachable code after `log.Fatal`. Line 266 (`return nil`) is unreachable since `log.Fatal` calls `os.Exit(1)`. While harmless, it's dead code.
Proposed fix
 listener, err := net.Listen("tcp", address)
 if err != nil {
 	log.WithError(err).WithField("address", address).Fatal("failed to start K8s cache server")
-	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/flowlogs-pipeline/main.go` around lines 240 - 278, In startK8sCacheServer, the return nil after log.WithError(...).Fatal(...) is unreachable; remove that dead code (the return nil following the Fatal call) from the net.Listen error branch so the function doesn't contain unreachable statements in the error handling for net.Listen in startK8sCacheServer.
pkg/pipeline/transform/kubernetes/k8scache/server_test.go (2)
80-96: Test name mismatch with actual coverage. `TestBackwardCompatibility_DatasourceLookup` tests the FakeInformers path (via `SetupStubs`), not the `KubernetesStore` integration. The test name suggests it validates backward compatibility with the new store, but it only exercises the old informer-backed lookups. Consider renaming to `TestDatasourceLookup_ViaInformers` or adding a parallel test that verifies lookups work via `KubernetesStore` when set.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/server_test.go` around lines 80 - 96, Test name TestBackwardCompatibility_DatasourceLookup is misleading because it exercises the informer-backed path (via SetupStubs/setupTestDatasource) rather than the KubernetesStore integration; either rename the test to TestDatasourceLookup_ViaInformers to reflect it calls setupTestDatasource and exercises ds.IndexLookup and ds.GetNodeByName, or add a new parallel test that creates a KubernetesStore (use the same fixture setup flow you use for KubernetesStore integration), performs the same lookups against that store, and asserts identical results so the KubernetesStore path is validated.
235-253: DELETE operation not verified. The test sends a DELETE for `pod1` but never verifies the entry was removed from the cache. Consider adding assertions similar to `TestKubernetesCacheServer_ReceivesAdd` that confirm cache state after operations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/server_test.go` around lines 235 - 253, The test sends a DELETE CacheUpdate (OperationType_OPERATION_DELETE) for pod1 but never asserts the cache state; update the test (around server.StreamUpdates and after GetCurrentVersion()) to verify the server's cache no longer contains the ResourceEntry for Kind "Pod", Name "pod1", Namespace "default" (similar to assertions in TestKubernetesCacheServer_ReceivesAdd), e.g., use the cache lookup or server-provided accessors to assert the entry is absent and optionally assert the remaining entry count; keep checks for version via server.GetCurrentVersion() and existing ACK/SYNC message assertions.pkg/pipeline/transform/kubernetes/k8scache/client.go (1)
271-283: `SyncRequest.LastVersion` is ignored. The handler logs `LastVersion` but doesn't use it. Per the comment, this is by design (no snapshots). However, the protocol includes `LastVersion`, which suggests catch-up was considered. If intentional, consider removing `LastVersion` from the proto or documenting why it exists but isn't used.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/client.go` around lines 271 - 283, The handler handleSyncRequest currently logs SyncRequest.LastVersion but doesn't use it; decide whether to remove the unused field from the protocol or explicitly document why it's ignored: either (A) remove LastVersion from the proto and regenerate types (and update any processor code that sets it), or (B) keep the proto and add a clear comment in Client.handleSyncRequest stating that LastVersion is intentionally ignored (no initial snapshots sent), update the log message to reflect "ignored LastVersion", and add a TODO or link to design rationale so future readers know this is by-design; reference symbols: handleSyncRequest, SyncRequest.LastVersion, processorConnection, Client.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/flowlogs-pipeline/main.go`:
- Around line 251-252: The assignment to the Datasource.KubernetesStore pointer
is unsynchronized and can race with goroutines started by mainPipeline.Run; fix
by making pointer access atomic or mutex-protected: add a sync.RWMutex field (or
use atomic.Pointer[*KubernetesStore]) on Datasource, update SetKubernetesStore
to perform a write under the mutex (or atomic.StorePointer), and change all
places that read d.KubernetesStore (the enrichment/lookup code that runs in
pipeline goroutines and any methods that dereference KubernetesStore) to use the
corresponding read path (mutex RLock/RUnlock or atomic.LoadPointer) so
reads/writes are properly synchronized. Ensure SetKubernetesStore, the
Datasource.KubernetesStore field, and the enrichment/lookup code that references
it are all updated consistently.
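
As a rough illustration of the synchronization asked for above, here is a minimal sketch of a mutex-guarded pointer; the Datasource and KubernetesStore shapes below are stand-ins for the PR's actual types, not the real implementation.

package datasource

import "sync"

// KubernetesStore is a stand-in for the PR's store type.
type KubernetesStore struct{}

// Datasource guards the store pointer so a late SetKubernetesStore cannot race
// with lookups running in pipeline goroutines.
type Datasource struct {
	mu    sync.RWMutex
	store *KubernetesStore
}

// SetKubernetesStore swaps the store under the write lock.
func (d *Datasource) SetKubernetesStore(s *KubernetesStore) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.store = s
}

// getStore is the read path that enrichment/lookup code should go through.
func (d *Datasource) getStore() *KubernetesStore {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.store
}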
In `@cmd/flp-informers/main.go`:
- Around line 77-80: The flag/env binding currently calls
viper.BindPFlags(rootCmd.PersistentFlags()) but the rest of the code continues
to read values directly from the opts struct (opts.LogLevel, opts.Kubeconfig,
opts.ProcessorSelector, opts.ProcessorPort, opts.ResyncInterval), so environment
overrides never take effect; fix by configuring key mapping with
viper.SetEnvKeyReplacer(strings.NewReplacer("-", "_")), then after Cobra parses
flags call either viper.Unmarshal(&opts) to populate the opts struct from Viper
or replace direct reads with viper.GetString/viper.GetInt for the named keys
(e.g., "log-level", "kubeconfig", "processor-selector", "processor-port",
"resync-interval") so env vars like FLP_INFORMERS_PROCESSOR_SELECTOR override
flag values.
- Around line 180-185: OnDelete in cacheEventHandler currently does a direct
type assertion to *model.ResourceMetaData and ignores
cache.DeletedFinalStateUnknown tombstones, so missed-delete events leave stale
cache entries; change OnDelete to unwrap tombstones by checking for
cache.DeletedFinalStateUnknown and extracting its Obj into a
*model.ResourceMetaData (falling back to a direct *model.ResourceMetaData
assertion), handle nil after extraction, then proceed with the existing delete
logic so deletions from tombstones are processed.
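
Going back to the flag/environment binding item for this file above, a minimal sketch of the Viper wiring it describes; the opts fields and flag names are illustrative and may not match the PR exactly.

package main

import (
	"strings"

	"github.com/spf13/cobra"
	"github.com/spf13/viper"
)

// opts mirrors the kind of struct the review mentions; fields are illustrative.
var opts struct {
	LogLevel          string
	ProcessorSelector string
}

func bindConfig(rootCmd *cobra.Command) error {
	rootCmd.PersistentFlags().String("log-level", "info", "log level")
	rootCmd.PersistentFlags().String("processor-selector", "", "label selector for processor pods")

	viper.SetEnvPrefix("flp_informers")
	// Map flag names like "log-level" to env suffixes like LOG_LEVEL.
	viper.SetEnvKeyReplacer(strings.NewReplacer("-", "_"))
	viper.AutomaticEnv()
	if err := viper.BindPFlags(rootCmd.PersistentFlags()); err != nil {
		return err
	}

	// Read through Viper rather than the raw flag values, so that e.g.
	// FLP_INFORMERS_PROCESSOR_SELECTOR overrides --processor-selector.
	opts.LogLevel = viper.GetString("log-level")
	opts.ProcessorSelector = viper.GetString("processor-selector")
	return nil
}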
In `@contrib/kubernetes/deployment-flp-informers.yaml`:
- Around line 62-65: The ClusterRoleBinding subjects block currently pins the
ServiceAccount to namespace "default" (subjects -> kind: ServiceAccount, name:
flp-informers, namespace: default), which breaks RBAC when deployed into other
namespaces; update the ClusterRoleBinding so the subject's namespace is not
hardcoded — either remove the namespace field or set it to the release/manifest
namespace (so the ServiceAccount in the target namespace receives the
ClusterRole) ensuring the binding references the same namespace as the
deployment.
In `@contrib/kubernetes/deployment-k8scache.yaml`:
- Around line 94-96: The ClusterRole granting apiGroups: ["*"], resources:
["*"], verbs: ["*"] for flowlogs-pipeline is effectively cluster-admin and
should be narrowed; update the ClusterRole named flowlogs-pipeline to only
include the specific apiGroups, resources and verbs the k8scache processor
actually needs (e.g., namespaced resources like
pods/endpoints/configmaps/secrets or watching CRDs) and limit scope to the
target namespace via a Role or by restricting resourceNames and verbs
(get/list/watch/create/update/patch/exec as required) instead of wildcards;
ensure any ClusterRoleBindings are replaced with RoleBindings if cluster-wide
access is not required.
- Around line 102-105: The ClusterRoleBinding's subjects block hardcodes
namespace: default which breaks when the manifest is applied to a different
namespace; update the subjects entry for the ServiceAccount (name:
flowlogs-pipeline) so the namespace is not hardcoded but uses the manifest's
target namespace (e.g., reference the resource's metadata.namespace or the
project's templating variable like {{ .Release.Namespace }} / ${NAMESPACE}), or
remove the namespace field so the binding targets the ServiceAccount in the
current namespace, ensuring the subjects block dynamically matches the
ServiceAccount's namespace.
In `@pkg/pipeline/transform/kubernetes/informers/cache_export.go`:
- Around line 4-21: The AddEventHandler method on Informers currently registers
handlers for pods, nodes, and services but omits replicaSets and deployments;
update Informers.AddEventHandler to also call
k.replicaSets.AddEventHandler(handler) and
k.deployments.AddEventHandler(handler) (guarded by nil checks like the others)
and return any errors from those calls so replica set and deployment events are
registered and kept in sync with pods/nodes/services.
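
A possible shape for the extended handler registration, assuming a recent client-go (where AddEventHandler returns a registration and an error) and that the informer fields are SharedIndexInformer values as the review text implies; the struct here is a stand-in, not the PR's definition.

package informers

import "k8s.io/client-go/tools/cache"

// Informers is a stand-in; only the fields used below are shown.
type Informers struct {
	pods, nodes, services, replicaSets, deployments cache.SharedIndexInformer
}

// AddEventHandler registers the handler on every informer, including replica
// sets and deployments, skipping any informer that was not configured.
func (k *Informers) AddEventHandler(handler cache.ResourceEventHandler) error {
	for _, inf := range []cache.SharedIndexInformer{
		k.pods, k.nodes, k.services, k.replicaSets, k.deployments,
	} {
		if inf == nil {
			continue
		}
		if _, err := inf.AddEventHandler(handler); err != nil {
			return err
		}
	}
	return nil
}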
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 59-105: Add reconnection/cleanup for processor connections: update
receiveFromProcessor (and/or AddProcessor) to detect stream errors where
pc.healthy is set false and either attempt automatic reconnection with
exponential backoff using the existing address (re-creating the grpc client,
stream and replacing pc.conn/pc.stream/pc.cancel and setting pc.healthy true
upon success) or, if retries exceed a threshold, remove the processorConnection
from the Client.connections map and close resources. Ensure synchronization with
c.mu when modifying c.connections, reuse the processorConnection.address, conn,
stream, cancel, and healthy symbols, and log retry attempts and final removal so
broadcasts no longer skip permanently-failed processors.
- Around line 227-240: The broadcast currently blocks on wg.Wait because a stuck
pc.stream.Send can hang forever; modify the goroutine that calls pc.stream.Send
to enforce a per-send timeout (use a context with timeout if the stream supports
Send with context, or run the Send in a nested goroutine and select on a
time.After timeout for that send). On timeout or send error, log the error with
clog.WithError(...).WithField("address", pc.address).Error(...), mark
pc.healthy.Store(false), and ensure wg.Done() is still called so the loop cannot
block indefinitely; keep references to processorConnection, pc.stream.Send,
pc.healthy, and connections to locate the change.
- Around line 72-75: The gRPC client currently uses insecure.NewCredentials() in
the grpc.NewClient call (using variables address and the MaxCallRecvMsgSize
option), leaving traffic unencrypted; replace this with TLS/mTLS by loading
client certificate and CA (e.g., from Kubernetes Secrets) and passing real
transport credentials to grpc.NewClient (use
grpc.WithTransportCredentials(credentials.NewTLS(tlsConfig)) or the mTLS
equivalent) or wire in code paths to use platform-provided mTLS (sidecar/service
mesh) when enabled; ensure certificate validation (RootCAs) and SNI are
configured and preserve existing call options like grpc.MaxCallRecvMsgSize.
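
For the transport-security item directly above, a hedged sketch of replacing the insecure credentials; the certificate paths are hypothetical mount points for a Secret, while grpc.NewClient and MaxCallRecvMsgSize mirror the review text.

package k8scache

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// dialSecure connects to a processor using TLS/mTLS instead of insecure credentials.
func dialSecure(address string, maxMsgSize int) (*grpc.ClientConn, error) {
	caPEM, err := os.ReadFile("/etc/k8scache/certs/ca.crt") // illustrative path
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("failed to parse CA certificate")
	}
	cert, err := tls.LoadX509KeyPair("/etc/k8scache/certs/tls.crt", "/etc/k8scache/certs/tls.key")
	if err != nil {
		return nil, err
	}
	tlsCfg := &tls.Config{
		Certificates: []tls.Certificate{cert}, // client certificate for mTLS
		RootCAs:      pool,                    // validate the server certificate
		MinVersion:   tls.VersionTLS12,
	}
	return grpc.NewClient(address,
		grpc.WithTransportCredentials(credentials.NewTLS(tlsCfg)),
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(maxMsgSize)),
	)
}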
In `@pkg/pipeline/transform/kubernetes/k8scache/server_test.go`:
- Around line 75-78: The tests use setupTestDatasource() which only returns a
Datasource with Informers set, but most tests (e.g.,
TestKubernetesCacheServer_ReceivesIncrementalUpdate and
TestKubernetesCacheServer_MultipleUpdates) never call SetKubernetesStore so
server methods ApplyCacheAddOrUpdate/ApplyCacheDelete have nowhere to persist
data; update setupTestDatasource() to attach a KubernetesStore (call
SetKubernetesStore on the Datasource) so every test has a real store, or
alternatively add explicit assertions in those tests that validate cache
contents after operations; modify either setupTestDatasource() to initialize the
store (preferred) or add checks in
TestKubernetesCacheServer_ReceivesIncrementalUpdate and
TestKubernetesCacheServer_MultipleUpdates to ensure the cache actually contains
the expected entries after ApplyCacheAddOrUpdate/ApplyCacheDelete.
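
A sketch of the preferred option above (initializing the store in the shared helper), reusing the fixtures already shown in this PR's integration tests; treat it as a suggestion, not the test file's actual content.

// setupTestDatasource builds a Datasource backed by fake informers plus a real
// KubernetesStore, so the server's Apply* calls have somewhere to persist data.
func setupTestDatasource() *datasource.Datasource {
	_, informers := inf.SetupStubs(testIPInfo, nil, testNodes)
	ds := &datasource.Datasource{Informers: informers}
	ds.SetKubernetesStore(datasource.NewKubernetesStore())
	return ds
}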
In `@pkg/pipeline/transform/kubernetes/k8scache/server.go`:
- Around line 100-112: The ACK path unconditionally stores
s.version.Store(update.Version) and sends a success ACK via
stream.Send(&SyncMessage{... SyncAck{ProcessorId: connectionID, Version:
update.Version, Success: true}}) even when an operation was dropped/unknown;
change the logic so s.version.Store and sending Success=true only happen for
successfully applied updates. Locate the switch/handler that logs
OPERATION_UNSPECIFIED/unknown values and make it return or set a failed result
for that update (or send an ACK with Success=false) instead of falling through
to the success ACK; ensure stream.Send and s.version.Store are executed only
when the apply/dispatch path indicates success (use the update result flag or
error from the handler to gate those calls, referencing s.version.Store,
stream.Send, SyncMessage_Ack, SyncAck, connectionID, and update.Version).
- Around line 45-53: The server currently sends a SyncRequest with lastVersion
from s.version.Load() but does not honor snapshot semantics for fresh processors
(version==0) and the sender drops initial adds; update the sync protocol so when
SyncRequest.LastVersion == 0 the server responds with a full snapshot
SyncMessage (use the existing Replace semantics) instead of/in addition to
incrementals, by wiring KubernetesStore.Replace into the response path and
ensuring stream.Send sends that full snapshot before any incremental events;
alternatively, introduce an per-connection initialSyncComplete flag so the
sender will not drop initial add events and will gate switching to
KubernetesStore until Replace has been applied, referencing SyncMessage,
SyncMessage_Request/SyncRequest, s.version.Load(), KubernetesStore.Replace, and
the stream.Send sender logic to locate and modify the code paths.
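
For the ACK-gating item above (server.go lines 100-112), a minimal sketch of acknowledging only successfully applied updates; the apply helper is hypothetical, while SyncMessage/SyncAck and s.version are the names used in this PR.

// handleUpdate applies a CacheUpdate and only advances the version when the
// apply path actually succeeded; the ACK reports Success accordingly.
func (s *KubernetesCacheServer) handleUpdate(
	stream KubernetesCacheService_StreamUpdatesServer,
	connectionID string,
	update *CacheUpdate,
) error {
	// apply is a hypothetical helper that dispatches ADD/UPDATE/DELETE and
	// returns an error for unknown or unspecified operations.
	applyErr := s.apply(update)

	if applyErr == nil {
		// Only record versions that were actually applied.
		s.version.Store(update.Version)
	}
	return stream.Send(&SyncMessage{
		Message: &SyncMessage_Ack{Ack: &SyncAck{
			ProcessorId: connectionID,
			Version:     update.Version,
			Success:     applyErr == nil,
		}},
	})
}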
In `@proto/k8scache.proto`:
- Around line 71-94: The proto is missing a secondary_net_names mapping needed
to preserve secondary network name resolution: add a map<string, string>
secondary_net_names field to the ResourceEntry message (alongside the existing
repeated secondary_net_keys) and bump the field number accordingly, then update
the conversion logic in converter.go (the code that maps SecondaryNetNames ->
ResourceEntry) and the informer population in informers.go to populate this new
map so FLP processors can resolve NetworkName from secondary network keys.
⛔ Files ignored due to path filters (2)
pkg/pipeline/transform/kubernetes/k8scache/k8scache.pb.go is excluded by !**/*.pb.go
pkg/pipeline/transform/kubernetes/k8scache/k8scache_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (18)
- .mk/development.mk
- README.md
- cmd/flowlogs-pipeline/main.go
- cmd/flp-informers/main.go
- contrib/docker/Dockerfile
- contrib/kubernetes/deployment-flp-informers.yaml
- contrib/kubernetes/deployment-k8scache.yaml
- pkg/config/config.go
- pkg/pipeline/transform/kubernetes/datasource/datasource.go
- pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go
- pkg/pipeline/transform/kubernetes/enrich.go
- pkg/pipeline/transform/kubernetes/informers/cache_export.go
- pkg/pipeline/transform/kubernetes/k8scache/client.go
- pkg/pipeline/transform/kubernetes/k8scache/converter.go
- pkg/pipeline/transform/kubernetes/k8scache/server.go
- pkg/pipeline/transform/kubernetes/k8scache/server_test.go
- proto/README.md
- proto/k8scache.proto
func (h *cacheEventHandler) OnDelete(obj interface{}) {
	meta, ok := obj.(*model.ResourceMetaData)
	if !ok {
		log.Warnf("unexpected object type in OnDelete: %T", obj)
		return
	}
🧩 Analysis chain
🌐 Web query:
Can a k8s.io/client-go SharedInformer OnDelete callback receive cache.DeletedFinalStateUnknown, and what is the recommended way to unwrap it before type assertions?
💡 Result:
Yes, a k8s.io/client-go SharedInformer OnDelete callback can receive cache.DeletedFinalStateUnknown. This occurs when the watch deletion event is missed (e.g., due to disconnection from the API server) and the deletion is detected during a subsequent re-list. The OnDelete handler receives the last known non-absent state of the object wrapped in DeletedFinalStateUnknown, or the tombstone directly. The recommended way to unwrap it before type assertions is to check the type and access the .Obj field, which contains the actual object (potentially stale). Use helper functions like DeletionHandlingMetaNamespaceKeyFunc for key extraction, which handles unwrapping internally. Example code pattern from client-go usage:

DeleteFunc: func(obj interface{}) {
	var pod *corev1.Pod
	switch t := obj.(type) {
	case cache.DeletedFinalStateUnknown:
		pod = t.Obj.(*corev1.Pod)
	case *corev1.Pod:
		pod = t
	default:
		runtime.HandleError(fmt.Errorf("unexpected type: %T", obj))
		return
	}
	// Now use pod safely
}

Alternatively, for key-based handling without the full object:

key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
if err != nil {
	runtime.HandleError(err)
	return
}
// Use key for workqueue, etc.
Citations:
- 1: https://pkg.go.dev/k8s.io/client-go/tools/cache
- 2: K8s Informers OnDelete: handle cache.DeletedFinalStateUnknown grafana/beyla#1242
- 3: https://oneuptime.com/blog/post/2026-01-07-go-client-go-kubernetes/view
- 4: panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Service kubernetes/ingress-nginx#10015
🏁 Script executed:
# First, check the full OnDelete function implementation
cat -n cmd/flp-informers/main.go | sed -n '175,195p'
Repository: netobserv/flowlogs-pipeline
Length of output: 954
🏁 Script executed:
# Search for any handling of cache.DeletedFinalStateUnknown in the codebase
rg "DeletedFinalStateUnknown" cmd/flp-informers/Repository: netobserv/flowlogs-pipeline
Length of output: 53
🏁 Script executed:
# Check the imports in main.go to see what packages are available
head -30 cmd/flp-informers/main.go
Repository: netobserv/flowlogs-pipeline
Length of output: 916
🏁 Script executed:
# Check what cache imports are available in main.go
rg "import|cache" cmd/flp-informers/main.go | head -20Repository: netobserv/flowlogs-pipeline
Length of output: 1081
🏁 Script executed:
# Look at the full import section to understand dependencies
sed -n '20,60p' cmd/flp-informers/main.go
Repository: netobserv/flowlogs-pipeline
Length of output: 1260
🏁 Script executed:
# Check what informers package is being used
rg "import.*informers" cmd/flp-informers/main.go -A 1 -B 1Repository: netobserv/flowlogs-pipeline
Length of output: 53
🏁 Script executed:
# Look for where the informer is created and handlers are added
rg "setupInformerHandlers|AddEventHandler" cmd/flp-informers/main.go -B 3 -A 8Repository: netobserv/flowlogs-pipeline
Length of output: 1055
🏁 Script executed:
# Check the informers package to see if it uses client-go
find . -name "informers.go" -o -name "informers" -type d | head -5
Repository: netobserv/flowlogs-pipeline
Length of output: 211
🏁 Script executed:
# Check the local informers package to understand how AddEventHandler works
cat -n ./pkg/pipeline/transform/kubernetes/informers/informers.go | head -100
Repository: netobserv/flowlogs-pipeline
Length of output: 3756
🏁 Script executed:
# Look for where the actual SharedInformer is being used
rg "SharedInformer|cache\.DeletedFinalStateUnknown" ./pkg/pipeline/transform/kubernetes/informers/Repository: netobserv/flowlogs-pipeline
Length of output: 1268
🏁 Script executed:
# Look for the AddEventHandler method in the informers package
rg "AddEventHandler" ./pkg/pipeline/transform/kubernetes/informers/ -B 3 -A 10Repository: netobserv/flowlogs-pipeline
Length of output: 2369
🏁 Script executed:
# Search for where informer.AddEventHandler or similar is called
rg "AddEventHandler|cache.ResourceEventHandlerFuncs" ./pkg/pipeline/transform/kubernetes/informers/ -B 2 -A 5Repository: netobserv/flowlogs-pipeline
Length of output: 1871
🏁 Script executed:
# Get the full informers.go file to understand event handler implementation
wc -l ./pkg/pipeline/transform/kubernetes/informers/informers.go
Repository: netobserv/flowlogs-pipeline
Length of output: 133
🏁 Script executed:
# Read the complete cache_export.go file
cat -n ./pkg/pipeline/transform/kubernetes/informers/cache_export.go
Repository: netobserv/flowlogs-pipeline
Length of output: 1027
Handle cache.DeletedFinalStateUnknown tombstones in OnDelete callback.
When a watch deletion is missed (e.g., due to API server disconnection), k8s.io/client-go sends cache.DeletedFinalStateUnknown containing the last known state. The current code's direct type assertion to *model.ResourceMetaData fails for these tombstones, causing the handler to return without processing the deletion. This leaves stale entries in the remote cache.
Unwrap tombstones before the assertion:
func (h *cacheEventHandler) OnDelete(obj interface{}) {
var meta *model.ResourceMetaData
switch t := obj.(type) {
case cache.DeletedFinalStateUnknown:
meta, _ = t.Obj.(*model.ResourceMetaData)
case *model.ResourceMetaData:
meta = t
default:
log.Warnf("unexpected object type in OnDelete: %T", obj)
return
}
if meta == nil {
return
}
// ... send delete ...
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/flp-informers/main.go` around lines 180 - 185, OnDelete in
cacheEventHandler currently does a direct type assertion to
*model.ResourceMetaData and ignores cache.DeletedFinalStateUnknown tombstones,
so missed-delete events leave stale cache entries; change OnDelete to unwrap
tombstones by checking for cache.DeletedFinalStateUnknown and extracting its Obj
into a *model.ResourceMetaData (falling back to a direct *model.ResourceMetaData
assertion), handle nil after extraction, then proceed with the existing delete
logic so deletions from tombstones are processed.
// Send SyncRequest with our current version so the client knows what to send
// (full snapshot if 0, or incrementals from that version)
lastVersion := s.version.Load()
err := stream.Send(&SyncMessage{
	Message: &SyncMessage_Request{
		Request: &SyncRequest{
			ProcessorId: connectionID,
			LastVersion: lastVersion,
		},
Fresh processors never receive a full cache.
Line 46 says version 0 should recover full state, but Lines 132-146 ignore snapshot semantics, and the sender side currently drops initial adds. Once the datasource switches lookups to KubernetesStore, a new/restarted processor misses every pre-existing pod/node/service until those objects change. There is already a Replace path in KubernetesStore; this needs to be wired into the sync protocol or gated behind an initial-sync completion signal.
Also applies to: 128-146
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/pipeline/transform/kubernetes/k8scache/server.go` around lines 45 - 53,
The server currently sends a SyncRequest with lastVersion from s.version.Load()
but does not honor snapshot semantics for fresh processors (version==0) and the
sender drops initial adds; update the sync protocol so when
SyncRequest.LastVersion == 0 the server responds with a full snapshot
SyncMessage (use the existing Replace semantics) instead of/in addition to
incrementals, by wiring KubernetesStore.Replace into the response path and
ensuring stream.Send sends that full snapshot before any incremental events;
alternatively, introduce an per-connection initialSyncComplete flag so the
sender will not drop initial add events and will gate switching to
KubernetesStore until Replace has been applied, referencing SyncMessage,
SyncMessage_Request/SyncRequest, s.version.Load(), KubernetesStore.Replace, and
the stream.Send sender logic to locate and modify the code paths.
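
One possible shape for the missing snapshot path, sketched on the informer-side client that answers the SyncRequest; the snapshot and currentVersion helpers are hypothetical, while CacheUpdate/IsSnapshot come from this PR's proto, so treat this as a sketch rather than the intended design.

// handleSyncRequest answers a processor's SyncRequest. LastVersion == 0 means
// the processor has no state (fresh start or restart), so the whole cache is
// replayed as a snapshot before incremental updates resume.
func (c *Client) handleSyncRequest(pc *processorConnection, req *SyncRequest) error {
	if req.LastVersion > 0 {
		return nil // processor already has state; incrementals are enough
	}
	update := &CacheUpdate{
		Version:    c.currentVersion(), // hypothetical accessor for the broadcaster's version counter
		IsSnapshot: true,
		Entries:    c.snapshot(), // hypothetical: all ResourceEntry values currently known to the informers
	}
	return pc.stream.Send(update)
}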
// ResourceEntry represents a single Kubernetes resource in the cache
message ResourceEntry {
  string kind = 1; // "Pod", "Node", or "Service"
  string namespace = 2;
  string name = 3;
  string uid = 4;

  // Resource-specific metadata
  string owner_name = 5;
  string owner_kind = 6;
  string host_name = 7;    // For Pods: the node name
  string host_ip = 8;      // For Pods: the node IP
  string network_name = 9; // For multi-network support

  repeated string ips = 10;                // IP addresses associated with this resource
  repeated string secondary_net_keys = 11; // Secondary network keys for CNI plugins

  map<string, string> labels = 12;
  map<string, string> annotations = 13;

  // Kubernetes metadata
  int64 creation_timestamp = 14; // Unix timestamp
  string resource_version = 15;  // K8s resource version
}
Missing secondary_net_names field causes data loss for multi-network support.
Per converter.go:56-80 and informers.go:133-137, SecondaryNetNames (a map[string]string) is actively used to resolve NetworkName from secondary network keys. The proto only has secondary_net_keys (repeated string) but omits the names mapping.
Without this field, FLP processors using the cache can't perform network name resolution for secondary networks.
Proposed fix
repeated string secondary_net_keys = 11; // Secondary network keys for CNI plugins
+ map<string, string> secondary_net_names = 16; // Maps secondary network keys to network names
map<string, string> labels = 12;
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:
// ResourceEntry represents a single Kubernetes resource in the cache
message ResourceEntry {
  string kind = 1; // "Pod", "Node", or "Service"
  string namespace = 2;
  string name = 3;
  string uid = 4;
  // Resource-specific metadata
  string owner_name = 5;
  string owner_kind = 6;
  string host_name = 7; // For Pods: the node name
  string host_ip = 8; // For Pods: the node IP
  string network_name = 9; // For multi-network support
  repeated string ips = 10; // IP addresses associated with this resource
  repeated string secondary_net_keys = 11; // Secondary network keys for CNI plugins
  map<string, string> labels = 12;
  map<string, string> annotations = 13;
  // Kubernetes metadata
  int64 creation_timestamp = 14; // Unix timestamp
  string resource_version = 15; // K8s resource version
}

Suggested:
// ResourceEntry represents a single Kubernetes resource in the cache
message ResourceEntry {
  string kind = 1; // "Pod", "Node", or "Service"
  string namespace = 2;
  string name = 3;
  string uid = 4;
  // Resource-specific metadata
  string owner_name = 5;
  string owner_kind = 6;
  string host_name = 7; // For Pods: the node name
  string host_ip = 8; // For Pods: the node IP
  string network_name = 9; // For multi-network support
  repeated string ips = 10; // IP addresses associated with this resource
  repeated string secondary_net_keys = 11; // Secondary network keys for CNI plugins
  map<string, string> secondary_net_names = 16; // Maps secondary network keys to network names
  map<string, string> labels = 12;
  map<string, string> annotations = 13;
  // Kubernetes metadata
  int64 creation_timestamp = 14; // Unix timestamp
  string resource_version = 15; // K8s resource version
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@proto/k8scache.proto` around lines 71 - 94, The proto is missing a
secondary_net_names mapping needed to preserve secondary network name
resolution: add a map<string, string> secondary_net_names field to the
ResourceEntry message (alongside the existing repeated secondary_net_keys) and
bump the field number accordingly, then update the conversion logic in
converter.go (the code that maps SecondaryNetNames -> ResourceEntry) and the
informer population in informers.go to populate this new map so FLP processors
can resolve NetworkName from secondary network keys.
Actionable comments posted: 9
♻️ Duplicate comments (1)
pkg/pipeline/transform/kubernetes/k8scache/client.go (1)
466-478: ⚠️ Potential issue | 🔴 Critical: Fresh processors still never receive the current cache state.
Lines 474-477 explicitly drop snapshot/replay. Combined with pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go, Lines 42-45 skipping initial-list ADDs, and pkg/pipeline/transform/kubernetes/datasource/datasource.go, Lines 11-45 switching lookups entirely to KubernetesStore, a restarted processor has no enrichment until each object changes again. This needs a snapshot/replace path or a gated fallback until initial sync completes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/client.go` around lines 466 - 478, handleSyncRequest currently never sends an initial snapshot, so restarted processors (processorConnection / SyncRequest) never receive existing cache state; update handleSyncRequest in Client to detect when a processor needs an initial snapshot (e.g., req.LastVersion==0 or explicit snapshot flag) and send a snapshot/replace sequence using the current cache from the KubernetesStore/KubernetesClient (iterate the store and emit ADD/REPLACE events or a single REPLACE message) before switching to incremental updates; alternatively implement a gated fallback where processorConnection is marked syncing until an initial snapshot is sent and lookups from datasource.KubernetesStore honor that sync-complete flag so processors get consistent enrichment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.mk/development.mk:
- Around line 69-73: The kubectl invocations in the Makefile targets (e.g.,
target deploy-flp-informers and the deploy-k8scache block) currently apply
templated manifests but do not pass the intended namespace, so resources are
created/read in the default kubectl namespace; update all kubectl calls (kubectl
apply, kubectl rollout status, kubectl get/delete/patch etc. referenced in the
deploy-k8scache section) to include -n $(NAMESPACE) (or
--namespace=$(NAMESPACE)) so the operations use the same templated %NAMESPACE%
as the manifest generation; ensure every kubectl command string in those targets
is changed consistently.
In `@Makefile`:
- Around line 33-34: The NAMESPACE assignment (variable NAMESPACE) should treat
an empty jsonpath result as "default" in addition to handling kubectl failure;
update the shell expression that runs kubectl config view --minify --output
'jsonpath={..namespace}' so that if the command returns empty string it falls
back to "default" (for example by capturing the output and using a shell
conditional or a piped substitution to replace an empty result with "default").
Ensure you modify the Makefile line that defines NAMESPACE so it covers both
non-zero exit and empty output cases.
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 135-168: Currently c.mu is held while dialing and creating the
stream (see c.mu, c.connections, getTransportCredentials, grpc.NewClient,
NewKubernetesCacheServiceClient, StreamUpdates), which can block broadcastUpdate
and back up updateChan; change the flow to: acquire c.mu and only check/record
existence, then release c.mu before calling
getTransportCredentials/grpc.NewClient and creating the stream, then re-acquire
c.mu to verify no other goroutine inserted a connection and either store the new
connection/stream into c.connections or, if one exists, close the newly-created
conn/stream and return; ensure you transfer the ctx/cancel handling correctly
when moving creation outside the lock so cancel/conn.Close are called on the new
resources if you drop them due to a race.
- Around line 336-400: The reconnect path races on pc.conn/pc.stream/pc.cancel
because reconnect sets those fields under pc.mu but broadcastUpdate,
receiveFromProcessor, RemoveProcessor, and Stop access/modify them without
locking; fix by ensuring all reads/writes of pc.conn, pc.stream, and pc.cancel
are performed while holding pc.mu (or via new accessor methods that acquire
pc.mu) — update broadcastUpdate, receiveFromProcessor, RemoveProcessor, and Stop
to lock pc.mu around any access/close/cancel of pc.conn, pc.stream, or pc.cancel
(or replace direct field access with synchronized getters/setters) so the mutex
consistently protects these fields along with reconnect.
- Around line 314-328: The timeout branch must actively tear down the blocked
send so the goroutine in receiveFromProcessor sees a receive error and
reconnects: when sendTimeout fires in the select (the case currently logging
"send operation timed out"), invoke the stream teardown (e.g., call
pc.stream.CloseSend() or cancel the stream/context associated with pc.stream),
handle/ log any CloseSend/cancel error, then mark pc.healthy.Store(false); this
ensures the goroutine blocked in pc.stream.Send(update) is unblocked and the
existing reconnect path will run. Use the same pc.stream reference used in the
goroutine and ensure any returned error from CloseSend/cancel is logged via
clog.WithError(...).WithField("address", pc.address).
In `@pkg/pipeline/transform/kubernetes/k8scache/discovery.go`:
- Around line 67-68: Before calling time.NewTicker in discovery.go, validate
cfg.ResyncInterval and fail fast for non-positive values: check if
cfg.ResyncInterval <= 0 and return a clear configuration error (e.g.,
fmt.Errorf("invalid ResyncInterval: %d", cfg.ResyncInterval)) instead of
creating the ticker; update the code around the ticker := time.NewTicker(...) /
defer ticker.Stop() to perform this validation first so time.NewTicker is never
called with a non-positive duration.
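
The suggested validation is small; a sketch, assuming ResyncInterval is a time.Duration on a Config struct (the struct here is a stand-in):

package k8scache

import (
	"fmt"
	"time"
)

// Config is a stand-in for the discovery configuration.
type Config struct {
	ResyncInterval time.Duration
}

// startDiscovery validates the resync period before creating the ticker,
// because time.NewTicker panics on non-positive durations.
func startDiscovery(cfg Config) error {
	if cfg.ResyncInterval <= 0 {
		return fmt.Errorf("invalid ResyncInterval %v: must be greater than zero", cfg.ResyncInterval)
	}
	ticker := time.NewTicker(cfg.ResyncInterval)
	defer ticker.Stop()
	// ... discovery loop ...
	return nil
}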
In `@pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go`:
- Around line 40-62: The test TestOnDelete_TombstoneEvent is reimplementing the
tombstone unwrapping instead of exercising the real handler; update the test to
call the production OnDelete method (the event handler function named OnDelete)
with a cache.DeletedFinalStateUnknown wrapping the test pod, then
capture/inspect the emitted metadata (e.g., via the same channel, logger, or
mock used by the handler) and assert Name and Namespace on that emitted
model.ResourceMetaData; do the same replacement for the other similar tests
(lines 64-90 and 93-156) so they drive the real OnDelete path rather than
performing the type-assertion inline.
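
A sketch of a test that drives the real OnDelete with a tombstone rather than re-implementing the unwrapping; the constructor, capture callback, and fixture helper here are hypothetical stand-ins for however the handler actually exposes its output.

func TestOnDelete_Tombstone_DrivesRealHandler(t *testing.T) {
	deleted := make(chan *model.ResourceMetaData, 1)
	// newCacheEventHandler with a capture callback is hypothetical; the real
	// handler may instead enqueue onto the k8scache client.
	h := newCacheEventHandler(func(meta *model.ResourceMetaData) { deleted <- meta })

	pod := testPodMeta("default", "test-pod") // hypothetical fixture returning *model.ResourceMetaData

	// Simulate a missed watch deletion: client-go wraps the last known state.
	h.OnDelete(cache.DeletedFinalStateUnknown{Key: "default/test-pod", Obj: pod})

	select {
	case meta := <-deleted:
		assert.Equal(t, "test-pod", meta.Name)
		assert.Equal(t, "default", meta.Namespace)
	case <-time.After(time.Second):
		t.Fatal("OnDelete did not process the tombstone")
	}
}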
In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go`:
- Around line 20-109: The tests only verify ACKs and not that the datasource was
mutated—ensure each integration test initializes a real store (call
ds.SetKubernetesStore(datasource.NewKubernetesStore()) where missing, e.g. in
TestIntegration_MultipleUpdatesFlow) and after receiving the server ACKs query
the store to assert the expected state: for the ADD path use the store lookup
API (by Uid, Name/Namespace or IP) to assert the ResourceEntry was persisted and
that Ips contains "10.0.0.100"; for DELETE/UPDATE paths assert the resource is
removed or updated accordingly. Locate the test functions
(TestIntegration_ServerReceivesAdd, TestIntegration_MultipleUpdatesFlow), the
datasource.Datasource instance, NewKubernetesStore, SetKubernetesStore,
KubernetesCacheServer, StreamUpdates, CacheUpdate and ResourceEntry to add these
post-ACK assertions.
- Around line 266-290: The test currently treats any successful stream.Recv() as
a valid ACK; update the ACK receive/validation after sending the ADD CacheUpdate
so that you assert the received message is the expected SyncAck and that its
fields are correct (e.g., type is SyncAck, Version matches the sent update or
expected ack version, and any status field indicates success). Locate the
Receive ACK block (the call to stream.Recv() used to validate the ACK) and
replace the loose nil-check with a typed assertion of the returned message
(ensure it's a SyncAck) and explicit checks on relevant fields to fail the test
on wrong message type or bad payload.
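
For the ACK-validation item, a fragment that could replace the loose nil-check, reusing the SyncMessage_Ack/SyncAck assertions already present in TestIntegration_ServerReceivesAdd; adjust variable names to match the local test.

	// Receive ACK and check it really is a successful SyncAck for our update,
	// rather than accepting any message from stream.Recv().
	ackMsg, err := stream.Recv()
	require.NoError(t, err)
	require.NotNil(t, ackMsg)

	ack, ok := ackMsg.Message.(*SyncMessage_Ack)
	require.True(t, ok, "expected SyncAck from server, got %T", ackMsg.Message)
	assert.True(t, ack.Ack.Success, "server should ACK successfully")
	assert.Equal(t, addUpdate.Version, ack.Ack.Version, "ACK version should match the sent update")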
---
Duplicate comments:
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 466-478: handleSyncRequest currently never sends an initial
snapshot, so restarted processors (processorConnection / SyncRequest) never
receive existing cache state; update handleSyncRequest in Client to detect when
a processor needs an initial snapshot (e.g., req.LastVersion==0 or explicit
snapshot flag) and send a snapshot/replace sequence using the current cache from
the KubernetesStore/KubernetesClient (iterate the store and emit ADD/REPLACE
events or a single REPLACE message) before switching to incremental updates;
alternatively implement a gated fallback where processorConnection is marked
syncing until an initial snapshot is sent and lookups from
datasource.KubernetesStore honor that sync-complete flag so processors get
consistent enrichment.
⛔ Files ignored due to path filters (1)
pkg/pipeline/transform/kubernetes/k8scache/k8scache.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (16)
- .mk/development.mk
- Makefile
- cmd/flp-informers/main.go
- contrib/kubernetes/deployment-flp-informers.yaml
- contrib/kubernetes/deployment-k8scache.yaml
- pkg/pipeline/transform/kubernetes/datasource/datasource.go
- pkg/pipeline/transform/kubernetes/informers/cache_export.go
- pkg/pipeline/transform/kubernetes/k8scache/client.go
- pkg/pipeline/transform/kubernetes/k8scache/converter.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go
- pkg/pipeline/transform/kubernetes/k8scache/integration_test.go
- pkg/pipeline/transform/kubernetes/k8scache/server.go
- pkg/pipeline/transform/kubernetes/k8scache/server_test.go
- proto/k8scache.proto
✅ Files skipped from review due to trivial changes (2)
- pkg/pipeline/transform/kubernetes/k8scache/converter.go
- contrib/kubernetes/deployment-flp-informers.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/pipeline/transform/kubernetes/datasource/datasource.go
- contrib/kubernetes/deployment-k8scache.yaml
- pkg/pipeline/transform/kubernetes/informers/cache_export.go
func TestIntegration_ServerReceivesAdd(t *testing.T) {
	// Setup test datasource
	_, informers := inf.SetupStubs(testIPInfo, nil, testNodes)
	ds := &datasource.Datasource{Informers: informers}
	ds.SetKubernetesStore(datasource.NewKubernetesStore())

	// Create cache server
	cacheServer := NewKubernetesCacheServer(ds)

	// Create gRPC server
	grpcServer := grpc.NewServer()
	RegisterKubernetesCacheServiceServer(grpcServer, cacheServer)

	// Start server on random port
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	require.NoError(t, err)
	defer listener.Close()

	address := listener.Addr().String()

	// Start server in background
	go func() {
		_ = grpcServer.Serve(listener)
	}()
	defer grpcServer.Stop()

	// Give server time to start
	time.Sleep(100 * time.Millisecond)

	// Create client connection
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.NewClient(address,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	require.NoError(t, err, "Failed to connect to server")
	defer conn.Close()

	// Create client
	client := NewKubernetesCacheServiceClient(conn)

	// Open bidirectional stream
	stream, err := client.StreamUpdates(ctx)
	require.NoError(t, err)

	// Client should receive SyncRequest from server first
	syncMsg, err := stream.Recv()
	require.NoError(t, err)
	require.NotNil(t, syncMsg)

	req, ok := syncMsg.Message.(*SyncMessage_Request)
	require.True(t, ok, "Expected SyncRequest from server")
	assert.Equal(t, int64(0), req.Request.LastVersion)

	// Client sends ADD update
	addUpdate := &CacheUpdate{
		Version:    1,
		IsSnapshot: false,
		Operation:  OperationType_OPERATION_ADD,
		Entries: []*ResourceEntry{
			{
				Kind:      "Pod",
				Namespace: "default",
				Name:      "test-pod",
				Uid:       "test-uid",
				Ips:       []string{"10.0.0.100"},
			},
		},
	}

	err = stream.Send(addUpdate)
	require.NoError(t, err)

	// Client should receive ACK from server
	ackMsg, err := stream.Recv()
	require.NoError(t, err)
	require.NotNil(t, ackMsg)

	ack, ok := ackMsg.Message.(*SyncMessage_Ack)
	require.True(t, ok, "Expected SyncAck from server")
	assert.True(t, ack.Ack.Success, "Server should ACK successfully")
	assert.Equal(t, int64(1), ack.Ack.Version)

	// Close stream
	err = stream.CloseSend()
	assert.NoError(t, err)

	fmt.Printf("✓ Integration test passed: server on %s received and acknowledged ADD update\n", address)
}
Make these integration tests assert cache state.
Both tests stop at ACK/version checks, so they can still pass if the gRPC path never mutates the datasource. TestIntegration_MultipleUpdatesFlow also skips KubernetesStore, which means it cannot verify ADD/DELETE effects end-to-end at all. Please back these cases with a store, include queryable fields like IPs, and assert the datasource state after each update.
Also applies to: 112-214
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go` around lines
20 - 109, The tests only verify ACKs and not that the datasource was
mutated—ensure each integration test initializes a real store (call
ds.SetKubernetesStore(datasource.NewKubernetesStore()) where missing, e.g. in
TestIntegration_MultipleUpdatesFlow) and after receiving the server ACKs query
the store to assert the expected state: for the ADD path use the store lookup
API (by Uid, Name/Namespace or IP) to assert the ResourceEntry was persisted and
that Ips contains "10.0.0.100"; for DELETE/UPDATE paths assert the resource is
removed or updated accordingly. Locate the test functions
(TestIntegration_ServerReceivesAdd, TestIntegration_MultipleUpdatesFlow), the
datasource.Datasource instance, NewKubernetesStore, SetKubernetesStore,
KubernetesCacheServer, StreamUpdates, CacheUpdate and ResourceEntry to add these
post-ACK assertions.
Actionable comments posted: 2
♻️ Duplicate comments (1)
pkg/pipeline/transform/kubernetes/k8scache/integration_test.go (1)
114-116: ⚠️ Potential issue | 🟠 Major: Missing `SetKubernetesStore`; store operations are no-ops. Without `ds.SetKubernetesStore(datasource.NewKubernetesStore())`, `ApplyCacheAddOrUpdate` and `ApplyCacheDelete` silently do nothing. The test verifies ACKs but not actual data persistence.
🐛 Proposed fix
 // Setup
 _, informers := inf.SetupStubs(testIPInfo, nil, testNodes)
 ds := &datasource.Datasource{Informers: informers}
+ds.SetKubernetesStore(datasource.NewKubernetesStore())
 cacheServer := NewKubernetesCacheServer(ds)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go` around lines 114 - 116, The test constructs a Datasource with Informers but never initializes its internal store, so calls to ApplyCacheAddOrUpdate and ApplyCacheDelete become no-ops; fix by calling ds.SetKubernetesStore(datasource.NewKubernetesStore()) before creating the KubernetesCacheServer (NewKubernetesCacheServer) so the Datasource has a real store and the ApplyCacheAddOrUpdate / ApplyCacheDelete behavior is exercised and persisted during the test.
🧹 Nitpick comments (3)
pkg/pipeline/transform/kubernetes/k8scache/client.go (2)
166-174: Consider early exit when client is stopping.
AddProcessor can be called by the discovery goroutine after Stop() has cancelled c.ctx. While not a data race, this wastes work establishing connections that will immediately fail.
💡 Proposed improvement
 func (c *Client) AddProcessor(address string) error {
+	// Early exit if client is stopping
+	select {
+	case <-c.ctx.Done():
+		return c.ctx.Err()
+	default:
+	}
+
 	// First check: quick lock to see if already connected
 	c.mu.Lock()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/client.go` around lines 166 - 174, Add an early-exit in AddProcessor to avoid dialing when the client is stopping: immediately after the initial existence check (or at the very start of AddProcessor) check c.ctx for cancellation (e.g., if c.ctx.Err() != nil or select on c.ctx.Done()) and return early (nil or a contextual error) so you don't proceed to establish a connection that will fail; update the logic around the existing lock/unlock and the code paths that follow to respect this canceled-context check before performing any network work or spawning connection goroutines.
409-497: Long-held lock during reconnection may block Stop().
pc.mu is held for the entire reconnect loop (up to ~2 minutes with backoff). Stop() → closeConnection() will block waiting for this lock. Consider releasing the lock during the time.After backoff or checking for shutdown before each retry.
💡 Suggested approach
Check for client shutdown at the start of each retry iteration before acquiring pc.mu:
 func (c *Client) reconnect(pc *processorConnection) bool {
-	pc.mu.Lock()
-	defer pc.mu.Unlock()
-
 	backoff := initialBackoff
 	for attempt := 1; attempt <= maxReconnectAttempts; attempt++ {
+		// Check if client is shutting down
+		select {
+		case <-c.ctx.Done():
+			clog.WithField("address", pc.address).Info("reconnection cancelled")
+			return false
+		default:
+		}
+
+		pc.mu.Lock()
 		pc.reconnectAttempts = attempt
 		// ... rest of attempt logic ...
+		pc.mu.Unlock()
+
+		// Wait before retry (outside lock)
+		select {
+		case <-time.After(backoff):
+		case <-c.ctx.Done():
+			return false
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/client.go` around lines 409 - 497, The reconnect loop holds pc.mu for the entire backoff loop, blocking Stop()/closeConnection(); to fix, restructure reconnect so it checks c.ctx.Done() at the start of each retry and does not hold pc.mu while sleeping: move pc.mu.Lock()/Unlock() inside the loop only around sections that mutate pc (e.g., updating pc.reconnectAttempts, closing/setting pc.conn, pc.stream, pc.cancel, and pc.healthy), perform the time.After(backoff) wait without holding pc.mu, and only acquire the lock immediately before touching pc fields (and release it promptly afterward); keep use of c.getTransportCredentials(), grpc.NewClient, client.StreamUpdates, and the final deletion from c.connections unchanged but ensure mutations to pc and the connections map occur under their respective locks.
pkg/pipeline/transform/kubernetes/k8scache/integration_test.go (1)
46-48: Consider replacing time.Sleep with a retry-based connection.
Sleep-based waits can cause flaky tests. Consider using the grpc.WithBlock() option or a retry loop for more reliable server-ready detection.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go` around lines 46 - 48, Replace the brittle time.Sleep(100 * time.Millisecond) used to wait for the test server with a deterministic connection attempt: remove the time.Sleep call and instead perform a blocking/retry dial using grpc.WithBlock() (or grpc.DialContext) with a context timeout (context.WithTimeout) or a short retry loop that repeatedly calls grpc.Dial until success, then proceed once the connection is established; update the test setup where time.Sleep is invoked to use grpc.WithBlock/ctx timeout or a retry loop around grpc.Dial to reliably detect server readiness.
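One dependency-free alternative to grpc.WithBlock is to poll the listener until it accepts connections before opening the stream; a minimal sketch (imports: fmt, net, time; the helper name is chosen here only for illustration):

// waitForServer replaces the fixed time.Sleep with a bounded poll: it retries a
// plain TCP dial against the gRPC listener until it accepts connections or the
// timeout expires.
func waitForServer(address string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", address, 100*time.Millisecond)
		if err == nil {
			_ = conn.Close()
			return nil
		}
		time.Sleep(20 * time.Millisecond)
	}
	return fmt.Errorf("gRPC server at %s not ready within %s", address, timeout)
}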
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 131-133: The tls.Config initialization in the client (the
tlsConfig variable in client.go) omits MinVersion, leaving protocol downgrade
risk; update the tls.Config used in the client (where tlsConfig is created) to
explicitly set MinVersion: tls.VersionTLS12 (or tls.VersionTLS13 for stricter
security) while keeping InsecureSkipVerify as currently sourced from
c.tlsConfig.InsecureSkipVerify so the config enforces a minimum TLS version for
connections.
In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go`:
- Around line 219-222: The test is missing initialization of the Kubernetes
store so the cache data path isn't exercised; call SetKubernetesStore on the
Datasource used by NewKubernetesCacheServer after creating ds
(datasource.Datasource) so the informer-backed store is wired before running the
integration test (mirror the fix applied in TestIntegration_MultipleUpdatesFlow
by invoking SetKubernetesStore with the appropriate informer/store stub returned
from inf.SetupStubs).
---
Duplicate comments:
In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go`:
- Around line 114-116: The test constructs a Datasource with Informers but never
initializes its internal store, so calls to ApplyCacheAddOrUpdate and
ApplyCacheDelete become no-ops; fix by calling
ds.SetKubernetesStore(datasource.NewKubernetesStore()) before creating the
KubernetesCacheServer (NewKubernetesCacheServer) so the Datasource has a real
store and the ApplyCacheAddOrUpdate / ApplyCacheDelete behavior is exercised and
persisted during the test.
---
Nitpick comments:
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 166-174: Add an early-exit in AddProcessor to avoid dialing when
the client is stopping: immediately after the initial existence check (or at the
very start of AddProcessor) check c.ctx for cancellation (e.g., if c.ctx.Err()
!= nil or select on c.ctx.Done()) and return early (nil or a contextual error)
so you don't proceed to establish a connection that will fail; update the logic
around the existing lock/unlock and the code paths that follow to respect this
canceled-context check before performing any network work or spawning connection
goroutines.
- Around line 409-497: The reconnect loop holds pc.mu for the entire backoff
loop, blocking Stop()/closeConnection(); to fix, restructure reconnect so it
checks c.ctx.Done() at the start of each retry and does not hold pc.mu while
sleeping: move pc.mu.Lock()/Unlock() inside the loop only around sections that
mutate pc (e.g., updating pc.reconnectAttempts, closing/setting pc.conn,
pc.stream, pc.cancel, and pc.healthy), perform the time.After(backoff) wait
without holding pc.mu, and only acquire the lock immediately before touching pc
fields (and release it promptly afterward); keep use of
c.getTransportCredentials(), grpc.NewClient, client.StreamUpdates, and the final
deletion from c.connections unchanged but ensure mutations to pc and the
connections map occur under their respective locks.
In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go`:
- Around line 46-48: Replace the brittle time.Sleep(100 * time.Millisecond) used
to wait for the test server with a deterministic connection attempt: remove the
time.Sleep call and instead perform a blocking/retry dial using grpc.WithBlock()
(or grpc.DialContext) with a context timeout (context.WithTimeout) or a short
retry loop that repeatedly calls grpc.Dial until success, then proceed once the
connection is established; update the test setup where time.Sleep is invoked to
use grpc.WithBlock/ctx timeout or a retry loop around grpc.Dial to reliably
detect server readiness.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d5a6fdb2-fc7f-4715-9a77-86e86d2025f3
📒 Files selected for processing (7)
- .mk/development.mk
- Makefile
- pkg/pipeline/transform/kubernetes/k8scache/client.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go
- pkg/pipeline/transform/kubernetes/k8scache/integration_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- Makefile
- pkg/pipeline/transform/kubernetes/k8scache/discovery.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go
tlsConfig := &tls.Config{
	InsecureSkipVerify: c.tlsConfig.InsecureSkipVerify,
}
Set TLS MinVersion to prevent downgrade attacks.
The tls.Config lacks MinVersion, defaulting to TLS 1.0 for servers (TLS 1.2 for clients). Explicitly set MinVersion: tls.VersionTLS12 (or tls.VersionTLS13 for stronger security).
🔒 Proposed fix
tlsConfig := &tls.Config{
InsecureSkipVerify: c.tlsConfig.InsecureSkipVerify,
+ MinVersion: tls.VersionTLS12,
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go` around lines 131 - 133,
The tls.Config initialization in the client (the tlsConfig variable in
client.go) omits MinVersion, leaving protocol downgrade risk; update the
tls.Config used in the client (where tlsConfig is created) to explicitly set
MinVersion: tls.VersionTLS12 (or tls.VersionTLS13 for stricter security) while
keeping InsecureSkipVerify as currently sourced from
c.tlsConfig.InsecureSkipVerify so the config enforces a minimum TLS version for
connections.
// Setup
_, informers := inf.SetupStubs(testIPInfo, nil, testNodes)
ds := &datasource.Datasource{Informers: informers}
cacheServer := NewKubernetesCacheServer(ds)
Missing SetKubernetesStore - same issue as TestIntegration_MultipleUpdatesFlow.
Add store initialization for complete data path coverage.
🐛 Proposed fix
// Setup
_, informers := inf.SetupStubs(testIPInfo, nil, testNodes)
ds := &datasource.Datasource{Informers: informers}
+ ds.SetKubernetesStore(datasource.NewKubernetesStore())
cacheServer := NewKubernetesCacheServer(ds)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Setup
_, informers := inf.SetupStubs(testIPInfo, nil, testNodes)
ds := &datasource.Datasource{Informers: informers}
ds.SetKubernetesStore(datasource.NewKubernetesStore())
cacheServer := NewKubernetesCacheServer(ds)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go` around lines
219 - 222, The test is missing initialization of the Kubernetes store so the
cache data path isn't exercised; call SetKubernetesStore on the Datasource used
by NewKubernetesCacheServer after creating ds (datasource.Datasource) so the
informer-backed store is wired before running the integration test (mirror the
fix applied in TestIntegration_MultipleUpdatesFlow by invoking
SetKubernetesStore with the appropriate informer/store stub returned from
inf.SetupStubs).
Force-pushed from de63e00 to ed2954b
Actionable comments posted: 3
🧹 Nitpick comments (5)
pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go (1)
41-60: Consider adding tests for OnAdd and OnUpdate handlers.
Only OnDelete is covered. OnAdd (with the isInInitialList flag) and OnUpdate have their own branching logic that should be tested.
Want me to generate test cases for OnAdd and OnUpdate handlers?
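A rough shape such tests could take, reusing the mock from the OnDelete test. The constructor and mock fields (NewEventHandler, mockClient.addedMeta, mockClient.updatedMeta) are taken from the prompt below, and corev1/metav1 stand for the usual k8s.io core/meta packages, so treat the exact signatures as assumptions:

// Sketch only: assumes the handler implements cache.ResourceEventHandler and that
// mockClient records the resources forwarded to the gRPC client.
func TestEventHandler_OnAdd(t *testing.T) {
	mock := &mockClient{}
	handler := NewEventHandler(mock) // assumed constructor, mirroring the OnDelete test

	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "p1", Namespace: "default"}}

	handler.OnAdd(pod, true)  // initial-list sync path
	handler.OnAdd(pod, false) // live event path
	assert.NotEmpty(t, mock.addedMeta, "OnAdd should forward the resource to the client")
}

func TestEventHandler_OnUpdate(t *testing.T) {
	mock := &mockClient{}
	handler := NewEventHandler(mock)

	oldPod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "p1", Namespace: "default"}}
	newPod := oldPod.DeepCopy()
	newPod.Labels = map[string]string{"team": "netobserv"}

	handler.OnUpdate(oldPod, newPod)
	assert.NotEmpty(t, mock.updatedMeta, "a real change should be forwarded as an update")
}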
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go` around lines 41 - 60, Add unit tests for the event handler's OnAdd and OnUpdate paths: create tests that call the handler's OnAdd with isInInitialList both true and false to verify that SendAdd on mockClient is called (or suppressed) appropriately, and add tests for OnUpdate covering both a real-change case (old vs new ResourceMetaData differ) that should call SendUpdate and a no-change case that should not; use the existing mockClient (methods SendAdd, SendUpdate, SendDelete) and the same setup used for the OnDelete test (construct the handler via NewEventHandler or the eventHandler type and assert mockClient.addedMeta/updatedMeta contents and sendError behavior).
.mk/development.mk (1)
142-148: Missing undeploy-k8scache target.
There's deploy-k8scache but no corresponding undeploy-k8scache target. The cleanup chain (local-deployments-cleanup-k8scache) uses the generic undeploy, which may not fully clean up the k8scache-specific deployment.
Proposed fix: Add undeploy-k8scache target
+.PHONY: undeploy-k8scache
+undeploy-k8scache: ## Undeploy FLP with k8scache server
+	sed 's|%IMAGE_TAG_BASE%|$(IMAGE_TAG_BASE)|g;s|%VERSION%|$(VERSION)|g;s|%NAMESPACE%|$(NAMESPACE)|g' contrib/kubernetes/deployment-k8scache.yaml > /tmp/deployment-k8scache.yaml
+	kubectl --ignore-not-found=true delete configmap flowlogs-pipeline-configuration -n $(NAMESPACE) || true
+	kubectl --ignore-not-found=true delete -f /tmp/deployment-k8scache.yaml -n $(NAMESPACE) || true
+
 .PHONY: local-deploy-k8scache
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.mk/development.mk around lines 142 - 148, There is a missing undeploy-k8scache Make target to mirror deploy-k8scache; add an undeploy-k8scache target that removes the k8scache-specific Kubernetes resources (delete the deployment created from /tmp/deployment-k8scache.yaml and the flowlogs-pipeline-configuration configmap) and wait for rollout/termination as needed; ensure the new target name is exactly undeploy-k8scache so local-deployments-cleanup-k8scache can call it, and reuse the same identifiers used in deploy-k8scache (e.g., flowlogs-pipeline configmap name and deploy/flowlogs-pipeline) so cleanup fully removes the k8scache artifacts.
pkg/pipeline/transform/kubernetes/k8scache/server.go (1)
67-81: Recv blocking in select-default may cause tight loop on slow receivers.
The select with a default case means that if ctx.Done() isn't signaled, it immediately calls stream.Recv(). This is fine since Recv() blocks, but the pattern is slightly unusual. Consider using a dedicated select on ctx.Done() inside the Recv goroutine if you need cancellation responsiveness during receives.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/server.go` around lines 67 - 81, The current loop uses select with a default and directly calls stream.Recv(), which can lead to a tight/select-default pattern; change to read from the stream in a dedicated goroutine that sends received updates and errors into channels, then in the main loop select between ctx.Done() and those channels (use symbols stream.Recv, ctx.Done, connectionID) so cancellation is handled promptly without relying on the default case—on ctx.Done() cancel/cleanup and on recv-channel handle io.EOF and errors the same way you currently log using connectionID.
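A generic sketch of that pattern (the stream and message types are reduced to the minimum; the real handler would use the generated StreamUpdates stream, and the imports context, errors, and io are assumed):

// recvStream is the small slice of the generated stream interface needed here.
type recvStream interface {
	Recv() (*CacheUpdate, error)
}

type recvResult struct {
	update *CacheUpdate
	err    error
}

// receiveLoop reads in a dedicated goroutine so the outer loop can select on
// ctx.Done() without a default branch, keeping cancellation responsive.
func receiveLoop(ctx context.Context, stream recvStream, apply func(*CacheUpdate) error) error {
	results := make(chan recvResult)
	go func() {
		for {
			u, err := stream.Recv()
			select {
			case results <- recvResult{update: u, err: err}:
			case <-ctx.Done():
				return
			}
			if err != nil {
				return
			}
		}
	}()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case r := <-results:
			if errors.Is(r.err, io.EOF) {
				return nil // client closed the stream
			}
			if r.err != nil {
				return r.err
			}
			if err := apply(r.update); err != nil {
				return err
			}
		}
	}
}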
pkg/pipeline/transform/kubernetes/k8scache/discovery.go (1)
88-121: No cleanup of connections when pods terminate.
discoverAndConnect adds processors but never removes stale ones. When a pod terminates, its processorConnection remains in the client until the connection fails. This is acceptable since broadcastUpdate skips unhealthy connections, but consider adding explicit cleanup for faster resource reclamation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/discovery.go` around lines 88 - 121, discoverAndConnect currently only adds processors and never removes stale connections; update it to reconcile the client's processor list with the current pod IPs: after listing pods build the set of desired addresses (fmt.Sprintf("%s:%d", pod.Status.PodIP, cfg.ProcessorPort) for running pods with PodIP) and then remove any client-side processors not in that set by calling a remove method (e.g., Client.RemoveProcessor or a new method you add) or by iterating client.processorConnections and closing/removing entries for addresses absent from the desired set; keep the existing AddProcessor calls for new addresses so discoverAndConnect performs full reconciliation.
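A possible reconciliation pass at the end of each discovery cycle; ListProcessors is a hypothetical accessor used only to make the sketch self-contained (the client currently exposes AddProcessor and RemoveProcessor):

// reconcileProcessors drops connections whose address no longer maps to a running
// processor pod. "desired" holds addresses built from the current pod list
// (pod IP plus processor port), exactly as discoverAndConnect already computes them.
func reconcileProcessors(client *Client, desired map[string]struct{}) {
	for _, addr := range client.ListProcessors() { // hypothetical accessor
		if _, ok := desired[addr]; !ok {
			client.RemoveProcessor(addr)
		}
	}
}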
pkg/pipeline/transform/kubernetes/k8scache/client.go (1)
236-252: Lock released and re-acquired in RemoveProcessor creates a window.
Lines 249-251 release c.mu, call pc.closeConnection(), then re-lock. Due to defer c.mu.Unlock() at line 238, the lock is re-acquired only to be immediately released. This is benign but confusing. Restructure to avoid the double unlock pattern.
Cleaner pattern
 func (c *Client) RemoveProcessor(address string) {
 	c.mu.Lock()
-	defer c.mu.Unlock()
-
 	pc, exists := c.connections[address]
 	if !exists {
+		c.mu.Unlock()
 		return
 	}
 
 	clog.WithField("address", address).Info("disconnecting from FLP processor")
 	delete(c.connections, address)
-
-	// Close connection after releasing client lock to avoid holding both locks
 	c.mu.Unlock()
+
+	// Close connection after releasing client lock to avoid holding both locks
 	pc.closeConnection()
-	c.mu.Lock()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/client.go` around lines 236 - 252, The RemoveProcessor method currently uses defer c.mu.Unlock() and also does an explicit c.mu.Unlock()/c.mu.Lock() around pc.closeConnection(), creating a confusing double-unlock/re-lock window; change the pattern to lock, look up pc, delete the map entry, then explicitly unlock before calling pc.closeConnection() (i.e., remove the defer and the later c.mu.Unlock()/c.mu.Lock()), so the map mutation (c.connections[address] deletion) happens under c.mu and the potentially blocking pc.closeConnection() runs without holding c.mu; reference Client.RemoveProcessor, c.mu, c.connections, and pc.closeConnection when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/flp-informers/main.go`:
- Around line 170-174: StartProcessorDiscovery is launched inside a goroutine so
any immediate errors (e.g., invalid discoveryConfig) are only logged and not
surfaced to startup; change this to detect and handle immediate failures by
calling k8scache.StartProcessorDiscovery synchronously for the initial startup
or by capturing the first error through a channel: invoke
k8scache.StartProcessorDiscovery(ctx, grpcClient, discoveryConfig) and if it
returns a non-nil error treat it as a fatal config/startup error (log and
return/exit), otherwise spawn the background goroutine or continue; reference
the existing goroutine invocation and variables grpcClient, discoveryConfig, and
k8scache.StartProcessorDiscovery when making the change.
- Around line 178-184: The shutdown path currently waits on sigChan but never
closes the Informers' stop channels, so informer watches keep running; after
detecting the shutdown signal (where sigChan is read in main) close the
Informers' stop channels (inf.stopChan and inf.mdStopChan) to signal informer
factories to stop — either defer closing them after InitFromConfig() returns or
explicitly call close(inf.stopChan) and close(inf.mdStopChan) immediately after
the <-sigChan read; reference the Informers instance returned by
InitFromConfig() (inf) so you close those two channels before exiting.
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 409-484: reconnect currently holds pc.mu for the entire retry
loop, including the time.After backoff, which blocks other methods like
pc.getStream()/pc.cancelStream(); modify reconnect so it does not hold pc.mu
during the sleep: acquire lock only when reading/updating pc fields (e.g., set
pc.reconnectAttempts, check/close pc.conn, cancel pc.cancel, update
pc.conn/pc.stream/pc.cancel and pc.healthy), then release the lock before the
select time.After(backoff) and re-acquire it (or re-lock) only when you need to
touch pc again; keep the existing error handling and backoff updates but ensure
any access to pc fields is protected by pc.mu while the actual wait is done
without the lock to avoid blocking broadcastUpdate/getStream/cancelStream.
---
Nitpick comments:
In @.mk/development.mk:
- Around line 142-148: There is a missing undeploy-k8scache Make target to
mirror deploy-k8scache; add an undeploy-k8scache target that removes the
k8scache-specific Kubernetes resources (delete the deployment created from
/tmp/deployment-k8scache.yaml and the flowlogs-pipeline-configuration configmap)
and wait for rollout/termination as needed; ensure the new target name is
exactly undeploy-k8scache so local-deployments-cleanup-k8scache can call it, and
reuse the same identifiers used in deploy-k8scache (e.g., flowlogs-pipeline
configmap name and deploy/flowlogs-pipeline) so cleanup fully removes the
k8scache artifacts.
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 236-252: The RemoveProcessor method currently uses defer
c.mu.Unlock() and also does an explicit c.mu.Unlock()/c.mu.Lock() around
pc.closeConnection(), creating a confusing double-unlock/re-lock window; change
the pattern to lock, look up pc, delete the map entry, then explicitly unlock
before calling pc.closeConnection() (i.e., remove the defer and the later
c.mu.Unlock()/c.mu.Lock()), so the map mutation (c.connections[address]
deletion) happens under c.mu and the potentially blocking pc.closeConnection()
runs without holding c.mu; reference Client.RemoveProcessor, c.mu,
c.connections, and pc.closeConnection when making the change.
In `@pkg/pipeline/transform/kubernetes/k8scache/discovery.go`:
- Around line 88-121: discoverAndConnect currently only adds processors and
never removes stale connections; update it to reconcile the client's processor
list with the current pod IPs: after listing pods build the set of desired
addresses (fmt.Sprintf("%s:%d", pod.Status.PodIP, cfg.ProcessorPort) for running
pods with PodIP) and then remove any client-side processors not in that set by
calling a remove method (e.g., Client.RemoveProcessor or a new method you add)
or by iterating client.processorConnections and closing/removing entries for
addresses absent from the desired set; keep the existing AddProcessor calls for
new addresses so discoverAndConnect performs full reconciliation.
In `@pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go`:
- Around line 41-60: Add unit tests for the event handler's OnAdd and OnUpdate
paths: create tests that call the handler's OnAdd with isInInitialList both true
and false to verify that SendAdd on mockClient is called (or suppressed)
appropriately, and add tests for OnUpdate covering both a real-change case (old
vs new ResourceMetaData differ) that should call SendUpdate and a no-change case
that should not; use the existing mockClient (methods SendAdd, SendUpdate,
SendDelete) and the same setup used for the OnDelete test (construct the handler
via NewEventHandler or the eventHandler type and assert
mockClient.addedMeta/updatedMeta contents and sendError behavior).
In `@pkg/pipeline/transform/kubernetes/k8scache/server.go`:
- Around line 67-81: The current loop uses select with a default and directly
calls stream.Recv(), which can lead to a tight/select-default pattern; change to
read from the stream in a dedicated goroutine that sends received updates and
errors into channels, then in the main loop select between ctx.Done() and those
channels (use symbols stream.Recv, ctx.Done, connectionID) so cancellation is
handled promptly without relying on the default case—on ctx.Done()
cancel/cleanup and on recv-channel handle io.EOF and errors the same way you
currently log using connectionID.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e49448c4-5beb-4126-83d4-0466d0974109
⛔ Files ignored due to path filters (2)
- pkg/pipeline/transform/kubernetes/k8scache/k8scache.pb.go is excluded by !**/*.pb.go
- pkg/pipeline/transform/kubernetes/k8scache/k8scache_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (24)
- .mk/development.mk
- Makefile
- README.md
- cmd/flowlogs-pipeline/main.go
- cmd/flp-informers/main.go
- contrib/docker/Dockerfile
- contrib/kubernetes/deployment-flp-informers.yaml
- contrib/kubernetes/deployment-k8scache.yaml
- pkg/config/config.go
- pkg/pipeline/transform/kubernetes/datasource/datasource.go
- pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go
- pkg/pipeline/transform/kubernetes/enrich.go
- pkg/pipeline/transform/kubernetes/informers/cache_export.go
- pkg/pipeline/transform/kubernetes/k8scache/client.go
- pkg/pipeline/transform/kubernetes/k8scache/converter.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery_test.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go
- pkg/pipeline/transform/kubernetes/k8scache/integration_test.go
- pkg/pipeline/transform/kubernetes/k8scache/server.go
- pkg/pipeline/transform/kubernetes/k8scache/server_test.go
- proto/README.md
- proto/k8scache.proto
✅ Files skipped from review due to trivial changes (6)
- README.md
- contrib/docker/Dockerfile
- pkg/pipeline/transform/kubernetes/enrich.go
- cmd/flowlogs-pipeline/main.go
- contrib/kubernetes/deployment-flp-informers.yaml
- pkg/pipeline/transform/kubernetes/k8scache/converter.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/config/config.go
- Makefile
- pkg/pipeline/transform/kubernetes/informers/cache_export.go
- contrib/kubernetes/deployment-k8scache.yaml
- pkg/pipeline/transform/kubernetes/k8scache/integration_test.go
- pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go
go func() {
	if err := k8scache.StartProcessorDiscovery(ctx, grpcClient, discoveryConfig); err != nil {
		log.WithError(err).Error("processor discovery stopped")
	}
}()
Discovery failure is logged but not surfaced.
If StartProcessorDiscovery returns an error immediately (e.g., invalid config), it's only logged. Consider checking the error and failing startup for configuration errors vs. transient errors.
Proposed fix
+ // Use error channel to catch immediate failures
+ errCh := make(chan error, 1)
go func() {
- if err := k8scache.StartProcessorDiscovery(ctx, grpcClient, discoveryConfig); err != nil {
- log.WithError(err).Error("processor discovery stopped")
+ errCh <- k8scache.StartProcessorDiscovery(ctx, grpcClient, discoveryConfig)
+ }()
+
+ // Give discovery a moment to fail fast on config errors
+ select {
+ case err := <-errCh:
+ if err != nil && !errors.Is(err, context.Canceled) {
+ log.WithError(err).Fatal("processor discovery failed")
}
- }()
+ case <-time.After(2 * time.Second):
+ // Discovery started successfully, continue
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/flp-informers/main.go` around lines 170 - 174, StartProcessorDiscovery is
launched inside a goroutine so any immediate errors (e.g., invalid
discoveryConfig) are only logged and not surfaced to startup; change this to
detect and handle immediate failures by calling k8scache.StartProcessorDiscovery
synchronously for the initial startup or by capturing the first error through a
channel: invoke k8scache.StartProcessorDiscovery(ctx, grpcClient,
discoveryConfig) and if it returns a non-nil error treat it as a fatal
config/startup error (log and return/exit), otherwise spawn the background
goroutine or continue; reference the existing goroutine invocation and variables
grpcClient, discoveryConfig, and k8scache.StartProcessorDiscovery when making
the change.
// Wait for shutdown signal
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
<-sigChan

log.Info("Shutdown signal received, stopping...")
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if Informers has a Stop method
ast-grep --pattern 'func ($_ *Informers) Stop() {
$$$
}'
Repository: netobserv/flowlogs-pipeline
Length of output: 53
🏁 Script executed:
cat -n cmd/flp-informers/main.go | sed -n '130,190p'
Repository: netobserv/flowlogs-pipeline
Length of output: 2332
🏁 Script executed:
# Find where Informers is defined
rg 'type Informers' --type go -A 5
Repository: netobserv/flowlogs-pipeline
Length of output: 635
🏁 Script executed:
# Check if there's a Stop method on any informers-related types
rg 'func.*Stop\(' --type go | grep -i informer
Repository: netobserv/flowlogs-pipeline
Length of output: 53
🏁 Script executed:
# Check for any cleanup/close methods on Informers
rg 'func.*Informers.*\(' --type go | grep -E '(Close|Release|Shutdown|Stop|Clean)'
Repository: netobserv/flowlogs-pipeline
Length of output: 53
🏁 Script executed:
# Look at the full Informers struct and its methods
cat -n pkg/pipeline/transform/kubernetes/informers/informers.go | head -100
Repository: netobserv/flowlogs-pipeline
Length of output: 3756
🏁 Script executed:
# Check what the Interface type is
rg 'type Interface interface' --type go -A 20 pkg/pipeline/transform/kubernetes/informers/
Repository: netobserv/flowlogs-pipeline
Length of output: 2045
🏁 Script executed:
# Check how stopChan and mdStopChan are used
rg 'stopChan|mdStopChan' --type go pkg/pipeline/transform/kubernetes/informers/informers.go
Repository: netobserv/flowlogs-pipeline
Length of output: 386
🏁 Script executed:
# Look for any goroutine starts or stoppage logic
rg 'go func|stopChan|for.*stopChan' --type go -B 2 -A 2 pkg/pipeline/transform/kubernetes/informers/informers.go
Repository: netobserv/flowlogs-pipeline
Length of output: 554
🏁 Script executed:
# Search for where stopChan or mdStopChan are closed
rg 'close\(.*stopChan\)' --type go
Repository: netobserv/flowlogs-pipeline
Length of output: 149
🏁 Script executed:
# Search across entire codebase for any Stop method or cleanup on Informers
rg 'func.*\*Informers.*\(' --type go pkg/pipeline/transform/kubernetes/informers/informers.go | head -20
Repository: netobserv/flowlogs-pipeline
Length of output: 1827
🏁 Script executed:
# Check the entire informers.go file length
wc -l pkg/pipeline/transform/kubernetes/informers/informers.go
Repository: netobserv/flowlogs-pipeline
Length of output: 131
Close the Informers stop channels on shutdown to stop watches and release resources.
The stopChan and mdStopChan created during InitFromConfig() are never closed. Closing these channels signals the underlying informer factories to stop their watches. Add the following before the function returns:
defer func() {
close(inf.stopChan)
close(inf.mdStopChan)
}()
Or close them explicitly when receiving the shutdown signal at line 181.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/flp-informers/main.go` around lines 178 - 184, The shutdown path
currently waits on sigChan but never closes the Informers' stop channels, so
informer watches keep running; after detecting the shutdown signal (where
sigChan is read in main) close the Informers' stop channels (inf.stopChan and
inf.mdStopChan) to signal informer factories to stop — either defer closing them
after InitFromConfig() returns or explicitly call close(inf.stopChan) and
close(inf.mdStopChan) immediately after the <-sigChan read; reference the
Informers instance returned by InitFromConfig() (inf) so you close those two
channels before exiting.
Force-pushed from ed2954b to 1117b70
Actionable comments posted: 6
♻️ Duplicate comments (2)
cmd/flp-informers/main.go (1)
192-196: ⚠️ Potential issue | 🟠 Major | Fail fast if processor discovery can't start.
StartProcessorDiscovery returns immediate config errors, but this goroutine only logs them. A bad kubeconfig or non-positive resync interval leaves flp-informers running with no processor connections. Treat non-context.Canceled startup errors as fatal.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/flp-informers/main.go` around lines 192 - 196, The anonymous goroutine that calls StartProcessorDiscovery currently only logs errors; change it to treat startup errors (except context.Canceled) as fatal so the process fails fast: after calling k8scache.StartProcessorDiscovery(ctx, grpcClient, discoveryConfig), if err != nil and !errors.Is(err, context.Canceled) then log the error and exit (e.g., log.WithError(err).Fatal or os.Exit(1)); keep the existing non-fatal handling for context.Canceled. Ensure you reference StartProcessorDiscovery, ctx, grpcClient and discoveryConfig and import/use errors/context as needed.
pkg/pipeline/transform/kubernetes/k8scache/client.go (1)
131-133: ⚠️ Potential issue | 🟠 Major | Set TLS MinVersion to prevent protocol downgrade attacks.
The tls.Config lacks MinVersion, defaulting to TLS 1.0 for servers. Explicitly set MinVersion: tls.VersionTLS12 (or tls.VersionTLS13 for stronger security) as flagged by static analysis.
Proposed fix
 tlsConfig := &tls.Config{
 	InsecureSkipVerify: c.tlsConfig.InsecureSkipVerify,
+	MinVersion:         tls.VersionTLS12,
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/client.go` around lines 131 - 133, The tls.Config constructed in the client (variable tlsConfig) currently only sets InsecureSkipVerify from c.tlsConfig; update the config to include a MinVersion to prevent protocol downgrade attacks by setting MinVersion: tls.VersionTLS12 (or tls.VersionTLS13 for stricter security). Locate the tlsConfig creation in the client.go code path that uses c.tlsConfig, add the MinVersion field to the tls.Config literal, and ensure any related tests or lints reflect the new minimum TLS version.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/flowlogs-pipeline/main.go`:
- Around line 241-243: The shutdown currently calls grpcServer.GracefulStop()
with no timeout which can hang because StreamUpdates is long-lived; modify
shutdown to perform a bounded graceful stop: start GracefulStop() in a
goroutine, wait for it to finish with a timeout (e.g., time.After or
context.WithTimeout), and if the timeout elapses call grpcServer.Stop() to
force-close active streams; locate the grpcServer usage in main.go and update
the shutdown block around grpcServer.GracefulStop() to implement the
timeout/fallback to Stop(), referencing grpcServer, GracefulStop, Stop, and the
StreamUpdates RPC.
In `@cmd/flp-informers/main.go`:
- Around line 163-164: The informer bootstrap currently constructs
api.NetworkTransformKubeConfig as an empty struct and calls
informers.NewConfig(apiConfig), which forces defaults and diverges from
processor-side settings; instead obtain the real NetworkTransformKubeConfig used
by the pipeline (e.g., load it from the same CLI/config source or shared
initializer) and pass that populated config to informers.NewConfig so managed
CNI, secondary networks and tracked kinds match processor enrichment (replace
the empty api.NetworkTransformKubeConfig{} used when creating infConfig).
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 219-223: The timeout cleanup calls stream.CloseSend() without
handling its error; modify the timeout handling block to capture the return
(e.g. err := stream.CloseSend()) and either log or explicitly ignore it (for
example log using the existing logger/processLogger with context like
"stream.CloseSend failed" if err != nil, or assign to _ to document intentional
ignore). Locate the cleanup where stream.CloseSend() is invoked and replace the
bare call with the error-aware version so the CloseSend error is handled or
explicitly ignored.
- Around line 272-289: The RemoveProcessor method currently defers c.mu.Unlock()
and also manually unlocks and re-locks c.mu, causing a double-unlock panic;
remove the defer and manage the mutex manually in RemoveProcessor: call
c.mu.Lock() at the top, look up pc := c.connections[address], if it doesn't
exist call c.mu.Unlock() and return, otherwise delete(c.connections, address),
then call c.mu.Unlock() before invoking pc.closeConnection(); do not re-lock
after close so the mutex is unlocked exactly once. Reference:
Client.RemoveProcessor, c.mu, c.connections, pc.closeConnection.
In `@pkg/pipeline/transform/kubernetes/k8scache/discovery.go`:
- Around line 114-118: Discovery currently dials
pod.Status.PodIP:cfg.ProcessorPort (address) which causes TLS verification to
fail because getTransportCredentials() (client.go) doesn't set
tls.Config.ServerName and Go will validate the IP against certificate SANs;
update the flow so that either (A) the discovery code (where address is built
and passed to client.AddProcessorWithTimeout) uses a DNS name that matches the
processor certificate (e.g., construct host like "<podHostname>.<namespace>..."
or a configured ProcessorHost) instead of pod.Status.PodIP, or (B) set
tls.Config.ServerName inside getTransportCredentials() when constructing
transport credentials (use a configured expected server name or derive one from
the processor metadata) so the TLS handshake validates against the certificate
CN/SAN; modify address construction in discovery.go and/or set ServerName in
getTransportCredentials() accordingly.
In `@proto/README.md`:
- Around line 15-16: The README's protoc examples use paths=source_relative but
pass ./proto/k8scache.proto which causes generated files to be nested under a
proto/ subdirectory; update the commands to add an import root and use the bare
filename (e.g., add -I ./proto and use k8scache.proto as input) so that the
outputs for the protoc --go_out and --go-grpc_out invocations land directly in
pkg/pipeline/transform/kubernetes/k8scache/ instead of pkg/.../k8scache/proto/.
---
Duplicate comments:
In `@cmd/flp-informers/main.go`:
- Around line 192-196: The anonymous goroutine that calls
StartProcessorDiscovery currently only logs errors; change it to treat startup
errors (except context.Canceled) as fatal so the process fails fast: after
calling k8scache.StartProcessorDiscovery(ctx, grpcClient, discoveryConfig), if
err != nil and !errors.Is(err, context.Canceled) then log the error and exit
(e.g., log.WithError(err).Fatal or os.Exit(1)); keep the existing non-fatal
handling for context.Canceled. Ensure you reference StartProcessorDiscovery,
ctx, grpcClient and discoveryConfig and import/use errors/context as needed.
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 131-133: The tls.Config constructed in the client (variable
tlsConfig) currently only sets InsecureSkipVerify from c.tlsConfig; update the
config to include a MinVersion to prevent protocol downgrade attacks by setting
MinVersion: tls.VersionTLS12 (or tls.VersionTLS13 for stricter security). Locate
the tlsConfig creation in the client.go code path that uses c.tlsConfig, add the
MinVersion field to the tls.Config literal, and ensure any related tests or
lints reflect the new minimum TLS version.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5022cd27-6a5c-470b-b91f-2f54ec42500d
⛔ Files ignored due to path filters (2)
- pkg/pipeline/transform/kubernetes/k8scache/k8scache.pb.go is excluded by !**/*.pb.go
- pkg/pipeline/transform/kubernetes/k8scache/k8scache_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (24)
- .mk/development.mk
- Makefile
- README.md
- cmd/flowlogs-pipeline/main.go
- cmd/flp-informers/main.go
- contrib/docker/Dockerfile
- contrib/kubernetes/deployment-flp-informers.yaml
- contrib/kubernetes/deployment-k8scache.yaml
- pkg/config/config.go
- pkg/pipeline/transform/kubernetes/datasource/datasource.go
- pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go
- pkg/pipeline/transform/kubernetes/enrich.go
- pkg/pipeline/transform/kubernetes/informers/cache_export.go
- pkg/pipeline/transform/kubernetes/k8scache/client.go
- pkg/pipeline/transform/kubernetes/k8scache/converter.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery_test.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go
- pkg/pipeline/transform/kubernetes/k8scache/integration_test.go
- pkg/pipeline/transform/kubernetes/k8scache/server.go
- pkg/pipeline/transform/kubernetes/k8scache/server_test.go
- proto/README.md
- proto/k8scache.proto
✅ Files skipped from review due to trivial changes (7)
- README.md
- contrib/docker/Dockerfile
- pkg/pipeline/transform/kubernetes/k8scache/server.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go
- pkg/pipeline/transform/kubernetes/k8scache/converter.go
- contrib/kubernetes/deployment-flp-informers.yaml
🚧 Files skipped from review as they are similar to previous changes (8)
- Makefile
- pkg/pipeline/transform/kubernetes/enrich.go
- pkg/config/config.go
- pkg/pipeline/transform/kubernetes/informers/cache_export.go
- .mk/development.mk
- contrib/kubernetes/deployment-k8scache.yaml
- pkg/pipeline/transform/kubernetes/k8scache/integration_test.go
- pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go
if grpcServer != nil {
	log.Info("stopping K8s cache sync server")
	grpcServer.GracefulStop()
🧩 Analysis chain
🏁 Script executed:
# First, let's find and examine the main.go file around lines 241-243
find . -name main.go -path "*/cmd/flowlogs-pipeline/*" | head -1
Repository: netobserv/flowlogs-pipeline
Length of output: 102
🏁 Script executed:
# Get context around the shutdown code
cat -n cmd/flowlogs-pipeline/main.go | sed -n '230,260p'
Repository: netobserv/flowlogs-pipeline
Length of output: 1253
🏁 Script executed:
# Find the gRPC service definition and StreamUpdates implementation
find . -type f -name "*.go" | xargs grep -l "StreamUpdates" | head -5
Repository: netobserv/flowlogs-pipeline
Length of output: 370
🏁 Script executed:
# Check for any timeout configurations related to gRPC server
rg "GracefulStop|time\.After|context" cmd/flowlogs-pipeline/ -A 2 -B 2Repository: netobserv/flowlogs-pipeline
Length of output: 889
🏁 Script executed:
# Examine the gRPC service definition for StreamUpdates
cat -n pkg/pipeline/transform/kubernetes/k8scache/k8scache_grpc.pb.go | grep -A 20 "StreamUpdates"
Repository: netobserv/flowlogs-pipeline
Length of output: 7670
🏁 Script executed:
# Also check the server.go implementation
cat -n pkg/pipeline/transform/kubernetes/k8scache/server.go
Repository: netobserv/flowlogs-pipeline
Length of output: 5814
Guard GracefulStop() with a timeout.
StreamUpdates is a long-lived bidirectional streaming RPC. Waiting indefinitely for active streams to drain can hang shutdown; use a bounded wait and fall back to Stop().
Suggested change
if grpcServer != nil {
log.Info("stopping K8s cache sync server")
- grpcServer.GracefulStop()
+ stopped := make(chan struct{})
+ go func() {
+ grpcServer.GracefulStop()
+ close(stopped)
+ }()
+ select {
+ case <-stopped:
+ case <-time.After(5 * time.Second):
+ log.Warn("timed out waiting for K8s cache sync streams to drain; forcing stop")
+ grpcServer.Stop()
+ }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if grpcServer != nil {
	log.Info("stopping K8s cache sync server")
	stopped := make(chan struct{})
	go func() {
		grpcServer.GracefulStop()
		close(stopped)
	}()
	select {
	case <-stopped:
	case <-time.After(5 * time.Second):
		log.Warn("timed out waiting for K8s cache sync streams to drain; forcing stop")
		grpcServer.Stop()
	}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/flowlogs-pipeline/main.go` around lines 241 - 243, The shutdown currently
calls grpcServer.GracefulStop() with no timeout which can hang because
StreamUpdates is long-lived; modify shutdown to perform a bounded graceful stop:
start GracefulStop() in a goroutine, wait for it to finish with a timeout (e.g.,
time.After or context.WithTimeout), and if the timeout elapses call
grpcServer.Stop() to force-close active streams; locate the grpcServer usage in
main.go and update the shutdown block around grpcServer.GracefulStop() to
implement the timeout/fallback to Stop(), referencing grpcServer, GracefulStop,
Stop, and the StreamUpdates RPC.
apiConfig := &api.NetworkTransformKubeConfig{} // Empty config - will use defaults
infConfig := informers.NewConfig(apiConfig)
Don't bootstrap central informers with an empty Kubernetes config.
informers.NewConfig derives managed CNI, secondary networks, and tracked kinds from api.NetworkTransformKubeConfig. Passing {} here hardcodes defaults, so centralized cache mode diverges from processor-side enrichment whenever the pipeline uses non-default Kubernetes settings.
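One possible shape for that, assuming the kube transform settings can be unmarshalled from the same pipeline configuration the processors consume; the file path, variable names, and JSON format here are illustrative only:

// Illustrative only: load the processors' kube transform settings instead of
// starting from an empty struct, so flp-informers tracks the same kinds and
// secondary networks as processor-side enrichment.
raw, err := os.ReadFile(kubeConfigPath) // hypothetical path to the shared settings
if err != nil {
	log.WithError(err).Fatal("cannot read kube transform settings")
}
var kubeCfg api.NetworkTransformKubeConfig
if err := json.Unmarshal(raw, &kubeCfg); err != nil {
	log.WithError(err).Fatal("cannot parse kube transform settings")
}
infConfig := informers.NewConfig(&kubeCfg)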
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/flp-informers/main.go` around lines 163 - 164, The informer bootstrap
currently constructs api.NetworkTransformKubeConfig as an empty struct and calls
informers.NewConfig(apiConfig), which forces defaults and diverges from
processor-side settings; instead obtain the real NetworkTransformKubeConfig used
by the pipeline (e.g., load it from the same CLI/config source or shared
initializer) and pass that populated config to informers.NewConfig so managed
CNI, secondary networks and tracked kinds match processor enrichment (replace
the empty api.NetworkTransformKubeConfig{} used when creating infConfig).
$ protoc --go_out=./pkg/pipeline/transform/kubernetes/k8scache --go_opt=paths=source_relative ./proto/k8scache.proto
$ protoc --go-grpc_out=./pkg/pipeline/transform/kubernetes/k8scache --go-grpc_opt=paths=source_relative ./proto/k8scache.proto
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and show the actual content
cat -n proto/README.md | head -20
Repository: netobserv/flowlogs-pipeline
Length of output: 821
🏁 Script executed:
# Look for the actual .pb.go files to understand the current layout
find . -name "*k8scache*.pb.go" -type f
Repository: netobserv/flowlogs-pipeline
Length of output: 195
🏁 Script executed:
# Check the proto directory structure
ls -la proto/ | grep -i k8s
Repository: netobserv/flowlogs-pipeline
Length of output: 135
🏁 Script executed:
# Check protoc documentation for paths=source_relative behavior
web_search "protoc paths=source_relative option behavior documentation"
Repository: netobserv/flowlogs-pipeline
Length of output: 121
🏁 Script executed:
# Search for any existing protoc invocations in the codebase to see patterns
rg "protoc.*--go_out" --type bash --type md -A 1 -B 1Repository: netobserv/flowlogs-pipeline
Length of output: 99
🏁 Script executed:
# Check if there's a Makefile or build script that might show correct protoc usage
find . -name "Makefile" -o -name "*.mk" -o -name "build.sh" | head -5
Repository: netobserv/flowlogs-pipeline
Length of output: 188
🏁 Script executed:
# Check the Makefile for protoc usage patterns
rg "protoc" Makefile -A 2 -B 2Repository: netobserv/flowlogs-pipeline
Length of output: 53
🏁 Script executed:
# Also check development.mk
rg "protoc" .mk/development.mk -A 2 -B 2Repository: netobserv/flowlogs-pipeline
Length of output: 53
🌐 Web query:
protoc paths=source_relative option behavior go output
💡 Result:
The --go_opt=paths=source_relative option (or --go_out=paths=source_relative,...) in the protoc command for Go code generation controls the directory structure of the output .pb.go files relative to the --go_out directory. By default (or with paths=import), output mirrors the Go package import path from the go_package option in the .proto file. With paths=source_relative, the output file is placed in the same relative directory as the input .proto file (after stripping --proto_path prefixes). For example:
protoc --proto_path=src --go_out=out --go_opt=paths=source_relative src/foo.proto src/bar/baz.proto
This generates:
- out/foo.pb.go
- out/bar/baz.pb.go
The compiler creates nested subdirectories as needed within the --go_out base directory but does not create the base output directory itself. This option is specified via --go_opt for protoc-gen-go (for .pb.go files). Use --go-grpc_opt=paths=source_relative similarly for protoc-gen-go-grpc (for _grpc.pb.go files).
Other modes:
- module=$PREFIX: Strips $PREFIX from the import path before placing in output.
- Without paths=...: Defaults to paths=import. Requires the go_package option in every .proto file for proper import paths.
Citations:
- 1: https://protobuf.dev/reference/go/go-generated-opaque/
- 2: https://protobuf.dev/reference/go/go-generated/
- 3: https://stackoverflow.com/questions/70731053/protoc-go-opt-paths-source-relative-vs-go-grpc-opt-paths-source-relative
- 4: https://stackoverflow.com/questions/72816024/protobuf-with-go-not-able-to-compile-to-the-desired-folder-with-paths-source-rel/72816972
These commands regenerate into the wrong path.
With paths=source_relative, the input path ./proto/k8scache.proto preserves the proto/ prefix in the output directory, placing files under pkg/pipeline/transform/kubernetes/k8scache/proto/ instead of directly in pkg/pipeline/transform/kubernetes/k8scache/. Use -I ./proto with just k8scache.proto as input so contributors regenerate into the current package.
Suggested doc fix
-$ protoc --go_out=./pkg/pipeline/transform/kubernetes/k8scache --go_opt=paths=source_relative ./proto/k8scache.proto
-$ protoc --go-grpc_out=./pkg/pipeline/transform/kubernetes/k8scache --go-grpc_opt=paths=source_relative ./proto/k8scache.proto
+$ protoc -I ./proto --go_out=./pkg/pipeline/transform/kubernetes/k8scache --go_opt=paths=source_relative k8scache.proto
+$ protoc -I ./proto --go-grpc_out=./pkg/pipeline/transform/kubernetes/k8scache --go-grpc_opt=paths=source_relative k8scache.proto
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
$ protoc -I ./proto --go_out=./pkg/pipeline/transform/kubernetes/k8scache --go_opt=paths=source_relative k8scache.proto
$ protoc -I ./proto --go-grpc_out=./pkg/pipeline/transform/kubernetes/k8scache --go-grpc_opt=paths=source_relative k8scache.proto
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 15-15: Dollar signs used before commands without showing output
(MD014, commands-show-output)
[warning] 16-16: Dollar signs used before commands without showing output
(MD014, commands-show-output)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@proto/README.md` around lines 15 - 16, The README's protoc examples use
paths=source_relative but pass ./proto/k8scache.proto which causes generated
files to be nested under a proto/ subdirectory; update the commands to add an
import root and use the bare filename (e.g., add -I ./proto and use
k8scache.proto as input) so that the outputs for the protoc --go_out and
--go-grpc_out invocations land directly in
pkg/pipeline/transform/kubernetes/k8scache/ instead of pkg/.../k8scache/proto/.
@coderabbitai review
✅ Actions performed: Review triggered.

@coderabbitai re-review
✅ Actions performed: Full review triggered.
Actionable comments posted: 5
♻️ Duplicate comments (4)
cmd/flp-informers/main.go (2)
203-207: ⚠️ Potential issue | 🟠 Major | Surface the first discovery error instead of burying it in the goroutine.
StartProcessorDiscovery fails fast for invalid config and kubeconfig setup, but this path only logs the error and keeps the process alive with no processor connections. Treat the first returned error as startup failure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/flp-informers/main.go` around lines 203 - 207, Start treating StartProcessorDiscovery failures as startup errors instead of silently logging them in the goroutine: remove the fire-and-forget goroutine and invoke or run StartProcessorDiscovery such that its first returned error is propagated to main (e.g., call StartProcessorDiscovery synchronously or start it in a goroutine but send its error on a channel and block on that channel during startup); when StartProcessorDiscovery(ctx, grpcClient, discoveryConfig) returns a non-nil error, log it and return/exit the process to fail fast. Ensure you reference StartProcessorDiscovery, grpcClient and discoveryConfig when changing the call site so the initial discovery error is surfaced.
173-174: ⚠️ Potential issue | 🟠 Major

Don't bootstrap centralized informers from an empty kube transform config.

pkg/pipeline/transform/kubernetes/informers/informers.go:362-431 uses this config for secondary-network extraction and ownership tracking. Hard-coding {} here makes the exported cache diverge from processor-side enrichment whenever the pipeline sets non-default Kubernetes options.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/flp-informers/main.go` around lines 173 - 174, Currently the informers are being bootstrapped with an empty api.NetworkTransformKubeConfig (api.NetworkTransformKubeConfig{}) which diverges from the pipeline's transform settings; instead locate and pass the same NetworkTransformKubeConfig instance the pipeline uses into informers.NewConfig (i.e., replace the hard-coded &api.NetworkTransformKubeConfig{} with the pipeline's existing config object or a shared variable), so NewConfig and the informers use identical Kubernetes options as the processor-side transform/ownership logic.
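A rough sketch of one way to do that, assuming a new flag on flp-informers pointing at the same transform kube config the processors consume; the flag name, file format, and yaml import are assumptions, not the PR's code:

// Sketch: load the shared kube transform config instead of hard-coding an empty one.
var kubeCfg api.NetworkTransformKubeConfig
if path := opts.TransformKubeConfigPath; path != "" { // hypothetical --transform-kube-config flag
	raw, err := os.ReadFile(path)
	if err != nil {
		log.WithError(err).Fatal("cannot read kube transform config")
	}
	if err := yaml.Unmarshal(raw, &kubeCfg); err != nil {
		log.WithError(err).Fatal("cannot parse kube transform config")
	}
}
// then pass &kubeCfg wherever &api.NetworkTransformKubeConfig{} is used today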
cmd/flowlogs-pipeline/main.go (1)

241-243: ⚠️ Potential issue | 🟠 Major

Bound GracefulStop() with a timeout.

StreamUpdates is long-lived, so shutdown can block forever here if a client never drains. Fall back to Stop() after a short deadline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/flowlogs-pipeline/main.go` around lines 241 - 243, The current shutdown calls grpcServer.GracefulStop() which can block indefinitely because StreamUpdates may never return; modify the shutdown to attempt a graceful stop with a timeout and fall back to grpcServer.Stop() if the timeout elapses. Implement this by launching a goroutine that calls grpcServer.GracefulStop() and closes a done channel when finished, then in the main goroutine select on that done channel vs time.After(a short deadline, e.g., 5s) and call grpcServer.Stop() if the deadline fires; keep the existing nil check on grpcServer and reference grpcServer, GracefulStop, Stop, and StreamUpdates in your changes.
pkg/pipeline/transform/kubernetes/k8scache/client.go (1)
135-137: ⚠️ Potential issue | 🟡 Minor

Set TLS minimum protocol version explicitly.

tls.Config does not set MinVersion. Set a floor (tls.VersionTLS12 or tls.VersionTLS13) to avoid protocol downgrade risk.

Suggested fix

 tlsConfig := &tls.Config{
     InsecureSkipVerify: c.tlsConfig.InsecureSkipVerify,
+    MinVersion:         tls.VersionTLS12,
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/client.go` around lines 135 - 137, The TLS config currently sets only InsecureSkipVerify in the tls.Config initialization; explicitly set a minimum TLS version to prevent protocol downgrades by adding MinVersion: tls.VersionTLS12 (or tls.VersionTLS13) to the tlsConfig struct initialization (i.e., update the tlsConfig := &tls.Config{ InsecureSkipVerify: c.tlsConfig.InsecureSkipVerify, MinVersion: tls.VersionTLS12 }).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/flp-informers/main.go`:
- Around line 184-189: The informer currently registers NewEventHandler but
never feeds it the informer’s existing objects because OnAdd in
pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go ignores
isInInitialList=true; fix by, immediately after calling
inf.AddEventHandler(handler) (and before logging registration success),
enumerate the informer’s current store (via the informer’s GetStore()/List() or
Indexer) and invoke the handler’s OnAdd for each existing object with
isInInitialList=false (or call the handler’s public add method) so the
KubernetesStore is seeded with the initial snapshot; reference NewEventHandler,
AddEventHandler and OnAdd to locate the code to change.
In `@pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go`:
- Around line 77-90: The loops in KubernetesStore.Replace (and the similar loops
that call storeKey at other update paths) may dereference nil entries and panic;
before calling storeKey or otherwise using meta, add a nil check (skip the
iteration if meta == nil) so malformed cache update slices are ignored, and
still call addToIndexes/removeFromIndexes only for non-nil metas; update the
loops in Replace and the other update/remove loops that reference storeKey,
addToIndexes, or removeFromIndexes to guard against nil entries.
- Around line 23-24: The byIP map currently violates the "first match wins"
contract because writers unconditionally overwrite entries and deletions remove
keys regardless of owner; update the code that writes to byIP so it does not
overwrite an existing entry (if byIP[ip] != nil, skip storing the new
ResourceMetaData), and change the deletion logic so it only removes a key when
the stored ResourceMetaData matches the resource being deleted (compare a stable
identifier on model.ResourceMetaData such as UID/ID or another unique field
rather than pointer equality). Ensure all places that assign to or delete from
byIP follow these two rules so the first-matched owner is preserved and deletes
only remove the intended owner.
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 364-372: The sendUpdate function can succeed after shutdown
because the send case may win the select even if c.ctx.Done() is already closed;
before attempting to enqueue, check the context and fail fast (e.g., if
c.ctx.Err() != nil return c.ctx.Err()), or perform a preliminary select on
c.ctx.Done() to return its error, then proceed to the existing select for
sending to c.updateChan (keeping the timeout case); reference sendUpdate,
c.updateChan, and c.ctx.Done()/c.ctx.Err() when applying the change so
Stop()-initiated cancellation is always honored.
- Around line 443-455: The send error branch in the select (case err := <-done)
currently only logs the error and sets pc.healthy false, which can leave the
stream open; modify that branch to also force teardown by calling
pc.cancelStream() so blocked Recv() calls unblock and trigger reconnection;
update the block handling err != nil in the same select where sendCtx and done
are used (reference: the select handling case err := <-done,
pc.healthy.Store(false), sendCtx.Done(), and the pc.cancelStream() method) to
call pc.cancelStream() after setting healthy to false and logging.
---
Duplicate comments:
In `@cmd/flowlogs-pipeline/main.go`:
- Around line 241-243: The current shutdown calls grpcServer.GracefulStop()
which can block indefinitely because StreamUpdates may never return; modify the
shutdown to attempt a graceful stop with a timeout and fall back to
grpcServer.Stop() if the timeout elapses. Implement this by launching a
goroutine that calls grpcServer.GracefulStop() and closes a done channel when
finished, then in the main goroutine select on that done channel vs time.After(a
short deadline, e.g., 5s) and call grpcServer.Stop() if the deadline fires; keep
the existing nil check on grpcServer and reference grpcServer, GracefulStop,
Stop, and StreamUpdates in your changes.
In `@cmd/flp-informers/main.go`:
- Around line 203-207: Start treating StartProcessorDiscovery failures as
startup errors instead of silently logging them in the goroutine: remove the
fire-and-forget goroutine and invoke or run StartProcessorDiscovery such that
its first returned error is propagated to main (e.g., call
StartProcessorDiscovery synchronously or start it in a goroutine but send its
error on a channel and block on that channel during startup); when
StartProcessorDiscovery(ctx, grpcClient, discoveryConfig) returns a non-nil
error, log it and return/exit the process to fail fast. Ensure you reference
StartProcessorDiscovery, grpcClient and discoveryConfig when changing the call
site so the initial discovery error is surfaced.
- Around line 173-174: Currently the informers are being bootstrapped with an
empty api.NetworkTransformKubeConfig (api.NetworkTransformKubeConfig{}) which
diverges from the pipeline's transform settings; instead locate and pass the
same NetworkTransformKubeConfig instance the pipeline uses into
informers.NewConfig (i.e., replace the hard-coded
&api.NetworkTransformKubeConfig{} with the pipeline's existing config object or
a shared variable), so NewConfig and the informers use identical Kubernetes
options as the processor-side transform/ownership logic.
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 135-137: The TLS config currently sets only InsecureSkipVerify in
the tls.Config initialization; explicitly set a minimum TLS version to prevent
protocol downgrades by adding MinVersion: tls.VersionTLS12 (or tls.VersionTLS13)
to the tlsConfig struct initialization (i.e., update the tlsConfig :=
&tls.Config{ InsecureSkipVerify: c.tlsConfig.InsecureSkipVerify, MinVersion:
tls.VersionTLS12 }).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: db98a847-8104-4140-be80-e8c51387672b
⛔ Files ignored due to path filters (2)
- pkg/pipeline/transform/kubernetes/k8scache/k8scache.pb.go is excluded by !**/*.pb.go
- pkg/pipeline/transform/kubernetes/k8scache/k8scache_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (24)
- .mk/development.mk
- Makefile
- README.md
- cmd/flowlogs-pipeline/main.go
- cmd/flp-informers/main.go
- contrib/docker/Dockerfile
- contrib/kubernetes/deployment-flp-informers.yaml
- contrib/kubernetes/deployment-k8scache.yaml
- pkg/config/config.go
- pkg/pipeline/transform/kubernetes/datasource/datasource.go
- pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go
- pkg/pipeline/transform/kubernetes/enrich.go
- pkg/pipeline/transform/kubernetes/informers/cache_export.go
- pkg/pipeline/transform/kubernetes/k8scache/client.go
- pkg/pipeline/transform/kubernetes/k8scache/converter.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery_test.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go
- pkg/pipeline/transform/kubernetes/k8scache/integration_test.go
- pkg/pipeline/transform/kubernetes/k8scache/server.go
- pkg/pipeline/transform/kubernetes/k8scache/server_test.go
- proto/README.md
- proto/k8scache.proto
✅ Files skipped from review due to trivial changes (7)
- README.md
- contrib/docker/Dockerfile
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go
- contrib/kubernetes/deployment-flp-informers.yaml
- pkg/pipeline/transform/kubernetes/k8scache/converter.go
- contrib/kubernetes/deployment-k8scache.yaml
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go
🚧 Files skipped from review as they are similar to previous changes (8)
- Makefile
- pkg/pipeline/transform/kubernetes/enrich.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery_test.go
- pkg/pipeline/transform/kubernetes/k8scache/server.go
- pkg/pipeline/transform/kubernetes/k8scache/integration_test.go
- pkg/pipeline/transform/kubernetes/datasource/datasource.go
- pkg/pipeline/transform/kubernetes/k8scache/server_test.go
- .mk/development.mk
// Setup informer event handlers to push updates via gRPC
handler := k8scache.NewEventHandler(grpcClient)
if err := inf.AddEventHandler(handler); err != nil {
	log.WithError(err).Fatal("failed to add event handlers")
}
log.Info("Informer event handlers registered for cache sync")
Processors never get the initial cache contents.
This wires pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go:46-64, whose OnAdd drops isInInitialList=true, and the PR explicitly avoids snapshots. A fresh processor/informer pair therefore starts with an empty KubernetesStore until each existing object changes, which breaks enrichment for steady-state workloads.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/flp-informers/main.go` around lines 184 - 189, The informer currently
registers NewEventHandler but never feeds it the informer’s existing objects
because OnAdd in pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go
ignores isInInitialList=true; fix by, immediately after calling
inf.AddEventHandler(handler) (and before logging registration success),
enumerate the informer’s current store (via the informer’s GetStore()/List() or
Indexer) and invoke the handler’s OnAdd for each existing object with
isInInitialList=false (or call the handler’s public add method) so the
KubernetesStore is seeded with the initial snapshot; reference NewEventHandler,
AddEventHandler and OnAdd to locate the code to change.
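A minimal sketch of the seeding step this comment asks for, assuming the informers expose their underlying cache.SharedIndexInformer instances (the accessor used below is hypothetical):

// Sketch: replay the informer's current contents through the same handler so the
// exported cache starts from a full snapshot instead of waiting for new events.
for _, informer := range inf.SharedInformers() { // hypothetical accessor
	for _, obj := range informer.GetStore().List() {
		handler.OnAdd(obj, false) // false: deliver like a live add, not an initial-list add
	}
}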
// index: IP -> meta (first match wins; pods may have multiple IPs)
byIP map[string]*model.ResourceMetaData
byIP does not implement the documented "first match wins" contract.
The comment on Line 23 says first match wins, but Line 65 overwrites any existing owner and Line 48 deletes the key outright. If two resources ever share an IP, updates/deletes can return the wrong object or drop the surviving mapping.
Also applies to: 47-49, 64-66
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go` around lines
23 - 24, The byIP map currently violates the "first match wins" contract because
writers unconditionally overwrite entries and deletions remove keys regardless
of owner; update the code that writes to byIP so it does not overwrite an
existing entry (if byIP[ip] != nil, skip storing the new ResourceMetaData), and
change the deletion logic so it only removes a key when the stored
ResourceMetaData matches the resource being deleted (compare a stable identifier
on model.ResourceMetaData such as UID/ID or another unique field rather than
pointer equality). Ensure all places that assign to or delete from byIP follow
these two rules so the first-matched owner is preserved and deletes only remove
the intended owner.
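A small sketch of index writes that honor the documented contract, assuming model.ResourceMetaData carries a stable UID (that field is a suggestion from the prompt above, not confirmed in the PR):

// Sketch: keep the first owner of an IP and only let that owner delete the mapping.
func (s *KubernetesStore) indexIP(ip string, meta *model.ResourceMetaData) {
	if _, taken := s.byIP[ip]; taken {
		return // first match wins
	}
	s.byIP[ip] = meta
}

func (s *KubernetesStore) unindexIP(ip string, meta *model.ResourceMetaData) {
	if existing, ok := s.byIP[ip]; ok && existing.UID == meta.UID {
		delete(s.byIP, ip) // only the recorded owner removes the entry
	}
}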
func (s *KubernetesStore) Replace(entries []*model.ResourceMetaData) {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.byKey = make(map[string]*model.ResourceMetaData)
	s.byIP = make(map[string]*model.ResourceMetaData)
	s.byNodeName = make(map[string]*model.ResourceMetaData)
	s.bySecondaryKey = make(map[string]*model.ResourceMetaData)

	for _, meta := range entries {
		key := storeKey(meta)
		s.byKey[key] = meta
		s.addToIndexes(meta)
	}
Guard nil entries before calling storeKey.
addToIndexes/removeFromIndexes already tolerate nil, but these loops dereference meta first. A malformed cache update containing a nil entry will panic the processor instead of being skipped or rejected.
Also applies to: 94-105, 110-120
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go` around lines
77 - 90, The loops in KubernetesStore.Replace (and the similar loops that call
storeKey at other update paths) may dereference nil entries and panic; before
calling storeKey or otherwise using meta, add a nil check (skip the iteration if
meta == nil) so malformed cache update slices are ignored, and still call
addToIndexes/removeFromIndexes only for non-nil metas; update the loops in
Replace and the other update/remove loops that reference storeKey, addToIndexes,
or removeFromIndexes to guard against nil entries.
case err := <-done:
	if err != nil {
		clog.WithError(err).WithField("address", pc.address).Error("failed to send update")
		pc.healthy.Store(false)
	}
case <-sendCtx.Done():
	clog.WithField("address", pc.address).Error("send operation timed out")
	pc.healthy.Store(false)

	// Cancel the stream context to ensure any blocked operations are unblocked
	// This will cause receiveFromProcessor to see an error and trigger reconnection
	pc.cancelStream()
}
Force stream teardown on send error path too.
At Line 443, send errors only set healthy=false. If Recv() doesn’t fail promptly, this connection can stay permanently skipped. Call pc.cancelStream() in the error branch as well.
Suggested fix
case err := <-done:
if err != nil {
clog.WithError(err).WithField("address", pc.address).Error("failed to send update")
pc.healthy.Store(false)
+ // Force Recv to unblock so reconnect path can run
+ pc.cancelStream()
}
case <-sendCtx.Done():
clog.WithField("address", pc.address).Error("send operation timed out")
pc.healthy.Store(false)
// Cancel the stream context to ensure any blocked operations are unblocked
// This will cause receiveFromProcessor to see an error and trigger reconnection
pc.cancelStream()
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go` around lines 443 - 455,
The send error branch in the select (case err := <-done) currently only logs the
error and sets pc.healthy false, which can leave the stream open; modify that
branch to also force teardown by calling pc.cancelStream() so blocked Recv()
calls unblock and trigger reconnection; update the block handling err != nil in
the same select where sendCtx and done are used (reference: the select handling
case err := <-done, pc.healthy.Store(false), sendCtx.Done(), and the
pc.cancelStream() method) to call pc.cancelStream() after setting healthy to
false and logging.
Actionable comments posted: 5
♻️ Duplicate comments (11)
pkg/pipeline/transform/kubernetes/k8scache/server.go (1)
136-138: ⚠️ Potential issue | 🟠 Major

Snapshot semantics explicitly ignored: fresh processors start empty.

The comment documents this design choice. On restart, processors rebuild cache from incoming events. This means a processor restart during low-activity periods may have stale/missing data until resources change.

Consider implementing initial sync (snapshot) when SyncRequest.LastVersion == 0 so restarted processors can catch up immediately. The is_snapshot field and KubernetesStore.Replace exist but aren't wired in.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/server.go` around lines 136 - 138, The comment notes snapshots are ignored causing restarts to miss state; implement initial sync so when SyncRequest.LastVersion == 0 the server treats the request as a snapshot: honor the is_snapshot flag and call KubernetesStore.Replace with the full resource set instead of only applying ADD/UPDATE events. Update the sync handling in server.go (the SyncRequest processing code path) to detect LastVersion == 0 (or is_snapshot) and perform a Replace on the store, populating version metadata and returning the appropriate response so restarted processors get an immediate, consistent cache state.
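For illustration, the snapshot branch could be as small as the sketch below; CacheUpdate, IsSnapshot, and resourceEntryToMeta follow names used elsewhere in this review, but the handler shape is assumed:

// Sketch: a snapshot replaces the whole store; everything else stays incremental.
func (s *KubernetesCacheServer) applyUpdate(update *CacheUpdate) {
	metas := make([]*model.ResourceMetaData, 0, len(update.Entries))
	for _, e := range update.Entries {
		metas = append(metas, resourceEntryToMeta(e))
	}
	if update.IsSnapshot {
		s.store.Replace(metas) // full resync for freshly (re)started processors
		return
	}
	s.store.AddOrUpdate(metas)
}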
cmd/flowlogs-pipeline/main.go (1)

241-244: ⚠️ Potential issue | 🟠 Major

Guard GracefulStop() with a timeout.

StreamUpdates is a long-lived bidirectional streaming RPC. Indefinite wait for active streams to drain can hang shutdown.

Suggested fix

 if grpcServer != nil {
 	log.Info("stopping K8s cache sync server")
-	grpcServer.GracefulStop()
+	stopped := make(chan struct{})
+	go func() {
+		grpcServer.GracefulStop()
+		close(stopped)
+	}()
+	select {
+	case <-stopped:
+	case <-time.After(5 * time.Second):
+		log.Warn("graceful stop timed out; forcing stop")
+		grpcServer.Stop()
+	}
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/flowlogs-pipeline/main.go` around lines 241 - 244, The shutdown currently calls grpcServer.GracefulStop() with no timeout which can hang waiting on long-lived StreamUpdates RPCs; fix by starting a goroutine that sleeps for a bounded timeout (e.g., a configurable shutdownTimeout duration) and calls grpcServer.Stop() to force-close if GracefulStop() doesn't finish in time, then call grpcServer.GracefulStop() in the main shutdown path and log both the graceful completion and the forced stop; reference grpcServer.GracefulStop(), grpcServer.Stop(), and the StreamUpdates RPC so the timeout protects long-lived streams.
pkg/pipeline/transform/kubernetes/k8scache/client.go (1)
134-137: ⚠️ Potential issue | 🟡 Minor

Set TLS MinVersion to prevent protocol downgrade.

Static analysis and past review both flag this. Without MinVersion, the default allows older TLS versions.

 tlsConfig := &tls.Config{
     InsecureSkipVerify: c.tlsConfig.InsecureSkipVerify,
+    MinVersion:         tls.VersionTLS12,
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/client.go` around lines 134 - 137, The tls.Config created in client.go (variable tlsConfig with InsecureSkipVerify: c.tlsConfig.InsecureSkipVerify) doesn't set MinVersion, allowing protocol downgrade; update the tls.Config initialization in the code that builds tlsConfig to set MinVersion to a safe minimum (e.g., tls.VersionTLS12 or use a configured minimum from c.tlsConfig.MinVersion) so the client enforces a modern TLS version and prevents downgrade attacks.
pkg/pipeline/transform/kubernetes/k8scache/integration_test.go (2)
113-116: ⚠️ Potential issue | 🟡 Minor

Missing SetKubernetesStore - cache mutations aren't tested.

Without store initialization, the gRPC path processes updates but doesn't verify they actually mutate the datasource. Add store setup for complete data path coverage.

 // Setup
 _, informers := inf.SetupStubs(testIPInfo, nil, testNodes)
 ds := &datasource.Datasource{Informers: informers}
+ds.SetKubernetesStore(datasource.NewKubernetesStore())
 cacheServer := NewKubernetesCacheServer(ds)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go` around lines 113 - 116, The test never initializes the datasource store so mutations via the gRPC path aren't validated; after creating ds := &datasource.Datasource{Informers: informers} call SetKubernetesStore on ds (or use the existing helper that wires the store for kube objects) before creating cacheServer with NewKubernetesCacheServer(ds) so updates applied through the server mutate ds.Store and the test can assert expected changes; ensure you reference and initialize the same store type used elsewhere in tests (e.g., the Kubernetes store) so subsequent gRPC update calls actually persist to datasource.Datasource.Store.
219-222: ⚠️ Potential issue | 🟡 Minor

Same issue: Missing SetKubernetesStore in multi-client test.

 // Setup
 _, informers := inf.SetupStubs(testIPInfo, nil, testNodes)
 ds := &datasource.Datasource{Informers: informers}
+ds.SetKubernetesStore(datasource.NewKubernetesStore())
 cacheServer := NewKubernetesCacheServer(ds)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go` around lines 219 - 222, The multi-client integration test is missing a call to set the shared store on the cache server; after creating the Datasource and cacheServer (Datasource.Informers and NewKubernetesCacheServer), call SetKubernetesStore on the cacheServer with the appropriate store instance (the informer store returned by inf.SetupStubs or the Datasource's store) so the multi-client path uses the same underlying store; update the test to obtain the store from inf.SetupStubs (or Datasource) and pass it into cacheServer.SetKubernetesStore(...) before running multi-client assertions.
cmd/flp-informers/main.go (3)
172-180: ⚠️ Potential issue | 🟠 Major

Empty NetworkTransformKubeConfig may cause configuration drift.

The informer is initialized with an empty config {}, using defaults. If processors use non-default Kubernetes settings (managed CNI, secondary networks, tracked kinds), the centralized informers won't track the same resources.

Consider loading the config from a shared source or making it configurable via CLI flags.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/flp-informers/main.go` around lines 172 - 180, The code creates an empty NetworkTransformKubeConfig (NetworkTransformKubeConfig{}) and passes it to informers.NewConfig, which can cause informer/processor config drift; instead load the real transformer kube config from the shared configuration or CLI before creating the informers: parse the existing shared config or add CLI flags to populate NetworkTransformKubeConfig, build infConfig via informers.NewConfig(populatedConfig) and then call inf.InitFromConfig(opts.Kubeconfig, &infConfig, opMetrics) so informers track the same managed CNI/secondary networks/tracked kinds as processors (refer to NetworkTransformKubeConfig, informers.NewConfig, informers.Informers, and inf.InitFromConfig).
203-207: ⚠️ Potential issue | 🟡 Minor

Discovery startup errors are silently logged.

If StartProcessorDiscovery returns an error immediately (e.g., invalid config), it's only logged. Configuration errors should fail startup rather than silently continue with no discovery.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/flp-informers/main.go` around lines 203 - 207, StartProcessorDiscovery errors are only logged inside a goroutine which can let the process continue without discovery; change the startup to surface failures: call StartProcessorDiscovery in a way that captures its immediate return value (e.g., run it synchronously or send its error to a startup channel) and if err != nil (from StartProcessorDiscovery(ctx, grpcClient, discoveryConfig)) then log the error with log.WithError and stop startup by returning the error from main or exiting non‑zero (or cancelling ctx and failing initialization). Ensure you reference the existing symbols StartProcessorDiscovery, grpcClient, discoveryConfig, ctx and log.WithError when implementing this propagation.
211-217: ⚠️ Potential issue | 🟠 Major

Informer stop channels are not closed on shutdown.

The Informers type has internal stopChan and mdStopChan that should be closed to stop watches and release resources. The current shutdown only cancels the context and stops the gRPC client.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/flp-informers/main.go` around lines 211 - 217, On shutdown, the code only cancels context and stops the gRPC client but never closes the Informers internal channels; update the shutdown handling (after receiving from sigChan and before exiting) to close the Informers' stop channels (e.g., close(informers.stopChan) and close(informers.mdStopChan)) or call a safe stop method on the Informers instance; ensure you guard against double-close (check nil or use a Stop() on Informers that uses sync.Once) so watches are terminated and resources released.
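A minimal sketch of such a guarded stop, assuming Informers grows an exported Stop() method (the field and method names are illustrative):

// Sketch: close the informer stop channels exactly once on shutdown.
func (i *Informers) Stop() {
	i.stopOnce.Do(func() { // stopOnce is a sync.Once added for this purpose
		close(i.stopChan)
		close(i.mdStopChan)
	})
}

// in main: defer informers.Stop() next to cancel() and grpcClient.Stop()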
pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go (3)

114-120: ⚠️ Potential issue | 🟡 Minor

Same nil guard needed in Delete.

 for _, meta := range entries {
+	if meta == nil {
+		continue
+	}
 	key := storeKey(meta)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go` around lines 114 - 120, In the Delete method loop over entries, add the same nil guard used elsewhere to avoid panics when the store map is nil: before iterating check that s != nil and s.byKey != nil (or return early if s.byKey == nil); then proceed to compute key via storeKey(meta), check s.byKey[key], call s.removeFromIndexes(existing) and delete(s.byKey, key) as before. Ensure you reference the Delete method, s.byKey, storeKey, and removeFromIndexes when making the change.
86-90: ⚠️ Potential issue | 🟡 Minor

Guard nil entries in Replace loop.

storeKey(meta) dereferences meta before addToIndexes can return early on nil. A malformed update slice with nil entries will panic.

 for _, meta := range entries {
+	if meta == nil {
+		continue
+	}
 	key := storeKey(meta)
 	s.byKey[key] = meta
 	s.addToIndexes(meta)
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go` around lines 86 - 90, The Replace loop can panic when entries contain nil because storeKey(meta) dereferences meta before addToIndexes can guard; update the loop in the Replace implementation to check for nil first (e.g., if meta == nil { continue }) and skip calling storeKey(meta), s.byKey[...] and s.addToIndexes(meta) for nil items so malformed update slices do not cause a panic; ensure the nil check is the first operation inside the loop covering calls to storeKey and addToIndexes.
98-105: ⚠️ Potential issue | 🟡 Minor

Same nil guard needed in AddOrUpdate.

 for _, meta := range entries {
+	if meta == nil {
+		continue
+	}
 	key := storeKey(meta)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go` around lines 98 - 105, Add a nil guard in AddOrUpdate: ensure the receiver and the byKey map are non-nil and return early if entries is nil; specifically, at the start of AddOrUpdate check if s == nil or entries == nil then return, and if s.byKey == nil initialize it (s.byKey = make(map[string]*Meta)) before the loop so storeKey, s.byKey[key] assignment, removeFromIndexes(existing) and addToIndexes(meta) are safe.
🧹 Nitpick comments (5)
proto/k8scache.proto (1)
1-5: Buf lint: package directory mismatch.

Static analysis reports the file should be in a k8scache/ directory to match the package name. This is a lint warning, not a functional issue.

If you want to satisfy Buf lint, either:

- Move proto to proto/k8scache/k8scache.proto
- Or add a Buf lint exception
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proto/k8scache.proto` around lines 1 - 5, The proto file declares package k8scache but currently lives at proto/k8scache.proto which triggers Buf lint for package-directory mismatch; fix by either moving the file into a k8scache subdirectory (e.g., relocate proto/k8scache.proto -> proto/k8scache/k8scache.proto) so the file path matches package k8scache, or add a Buf lint exception/config entry to ignore this rule for proto/k8scache.proto; ensure the package k8scache and option go_package remain unchanged when you move or update the lint config.
pkg/pipeline/transform/kubernetes/k8scache/converter.go (1)
14-32: Maps are shared, not copied.

Labels, Annotations, and SecondaryNetNames are assigned by reference. If the receiver modifies these maps, the source is also modified. Consider defensive copies if mutation is possible.

Optional: defensive map copy

+func copyMap(m map[string]string) map[string]string {
+	if m == nil {
+		return nil
+	}
+	out := make(map[string]string, len(m))
+	for k, v := range m {
+		out[k] = v
+	}
+	return out
+}
+
 // In resourceEntryToMeta:
-	Labels:      entry.Labels,
-	Annotations: entry.Annotations,
+	Labels:      copyMap(entry.Labels),
+	Annotations: copyMap(entry.Annotations),
 	// ...
-	SecondaryNetNames: entry.SecondaryNetNames,
+	SecondaryNetNames: copyMap(entry.SecondaryNetNames),

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/converter.go` around lines 14 - 32, The ResourceMetaData construction in converter.go assigns Labels, Annotations, and SecondaryNetNames by reference (from entry.Labels, entry.Annotations, entry.SecondaryNetNames), which can cause shared-mutability bugs; update the assignment in the block that creates meta (model.ResourceMetaData) to use defensive copies for maps and slices (e.g., create new map instances and copy each key/value for Labels and Annotations, and copy SecondaryNetNames into a new slice) or call small helper functions (e.g., copyStringMap, copyStringSlice) so mutations to ResourceMetaData do not affect the original entry.
pkg/pipeline/transform/kubernetes/k8scache/discovery.go (1)
93-105: Consider structured logging for discovery errors.

The error logging swallows context about which selector/namespace failed. This is acceptable for resilience but could complicate debugging.

 pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
 	LabelSelector: cfg.ProcessorSelector,
 })
 if err != nil {
-	log.WithError(err).Error("failed to list processor pods")
+	log.WithError(err).WithFields(log.Fields{
+		"namespace": namespace,
+		"selector":  cfg.ProcessorSelector,
+	}).Error("failed to list processor pods")
 	return
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/k8scache/discovery.go` around lines 93 - 105, The discoverAndConnect function logs discovery errors without context; update the error logging in the error branch after clientset.CoreV1().Pods(...).List to include namespace and cfg.ProcessorSelector so failures show which selector/namespace failed; locate the log.WithError(err).Error call in discoverAndConnect and replace it (or augment it) with a structured log entry that attaches fields like "namespace" and "selector" alongside the error for clearer diagnostics.
pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go (1)
142-151: GetNodeByName error semantics are inconsistent with caller expectations.

Returns errors.New("notFound") which is a sentinel string, not a typed error. The caller in enrich.go (line 143-146) treats any non-nil error as a cache miss and logs at Trace level, so this works but is unconventional.

Consider using a typed error or returning (nil, nil) for not-found since callers already handle nil nodeInfo.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go` around lines 142 - 151, The GetNodeByName method on KubernetesStore currently returns errors.New("notFound") for missing entries which is a plain string error and inconsistent with callers; update GetNodeByName (and its callers if needed) to either return (nil, nil) for a cache miss or use a package-level sentinel error (e.g., var ErrNotFound = errors.New("node not found")) so callers can distinguish "not found" from real errors; locate KubernetesStore.GetNodeByName and the byNodeName lookup and change the return for the not-found branch to use the chosen pattern, and if you add ErrNotFound update the caller in enrich.go to check for ErrNotFound explicitly instead of treating any non-nil error as a cache miss.
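A compact sketch of the sentinel-error option, reusing the store fields visible in this review; the exact lookup and locking shape is assumed:

// Sketch: a typed sentinel lets enrich.go tell a cache miss from a real failure.
var ErrNotFound = errors.New("not found in kubernetes store")

func (s *KubernetesStore) GetNodeByName(name string) (*model.ResourceMetaData, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if meta, ok := s.byNodeName[name]; ok {
		return meta, nil
	}
	return nil, ErrNotFound
}

// caller side: if errors.Is(err, datasource.ErrNotFound) { /* treat as cache miss */ }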
contrib/kubernetes/deployment-k8scache.yaml (1)

17-36: Consider adding resource requests/limits.

The container lacks resource constraints. Without them, the pod may be evicted under memory pressure or consume unbounded resources.

Example resource specification

 - name: flowlogs-pipeline
   image: %IMAGE_TAG_BASE%:%VERSION%
   args:
     - "--config=/etc/flowlogs-pipeline/flowlogs-pipeline.conf.yaml"
     - "--k8scache.port=9090"
     - "--k8scache.address=0.0.0.0"
+  resources:
+    requests:
+      memory: "256Mi"
+      cpu: "100m"
+    limits:
+      memory: "1Gi"
+      cpu: "1000m"
   ports:

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@contrib/kubernetes/deployment-k8scache.yaml` around lines 17 - 36, The flowlogs-pipeline container currently has no resource constraints; add a resources block under the container spec for the container named "flowlogs-pipeline" that defines both requests and limits (e.g., cpu and memory) to prevent eviction and unbounded usage; update the container spec near where image, args, ports, imagePullPolicy, and volumeMounts are defined to include resources.requests (minimum cpu/memory) and resources.limits (max cpu/memory) values appropriate for this workload.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/flowlogs-pipeline/main.go`:
- Around line 288-291: The call to log.WithError(err).WithField("address",
address).Fatal(...) makes the subsequent return nil unreachable; change the flow
so the error is returned instead of exiting: replace Fatal with a non-fatal
logger (e.g., Error or Warn) and return the actual error (return err) from the
surrounding function, or if you intend to terminate the process keep Fatal but
remove the unreachable return; update the log/return around
log.WithError(...).WithField(...) accordingly.
- Around line 316-319: The tls.Config used to build the server TLS settings
(tlsConfig := &tls.Config{...}) lacks a minimum protocol version, allowing
insecure TLS versions; update the tls.Config in main.go (tlsConfig) to include
MinVersion: tls.VersionTLS12 (or tls.VersionTLS13) so the server refuses TLS
<1.2, e.g., add MinVersion: tls.VersionTLS12 alongside Certificates and
ClientAuth in the tls.Config literal that uses cert.
- Around line 274-276: The return nil after log.WithError(err).Fatal("failed to
configure TLS for K8s cache server") is unreachable because Fatal exits the
process; remove the redundant return nil OR change the call to a non-fatal
logger (e.g., log.WithError(err).Error(...)) if you intend to return nil from
the function; update the code path around the TLS configuration (the
log.WithError(err).Fatal call and the subsequent return nil) accordingly.
In `@contrib/kubernetes/deployment-k8scache.yaml`:
- Around line 78-81: Fix the misspelled container port name in the Kubernetes
Service/Deployment spec by changing the name field value from "prometheous" to
"prometheus" (the entries involving port: 9102 and targetPort: 9102 and the
name: field). Update only the name string so it matches the correct "prometheus"
spelling wherever that service port is defined.
In `@README.md`:
- Around line 57-58: The README's flags list is missing the TLS-related k8scache
options; update the help output block to include --k8scache.tls-enabled,
--k8scache.tls-cert-path, --k8scache.tls-key-path, and --k8scache.tls-ca-path
alongside the existing --k8scache.address and --k8scache.port entries, showing
their default values and short descriptions (e.g. whether TLS is enabled, paths
for cert/key/CA), and regenerate the help output so the README accurately
documents all k8scache options.
---
Duplicate comments:
In `@cmd/flowlogs-pipeline/main.go`:
- Around line 241-244: The shutdown currently calls grpcServer.GracefulStop()
with no timeout which can hang waiting on long-lived StreamUpdates RPCs; fix by
starting a goroutine that sleeps for a bounded timeout (e.g., a configurable
shutdownTimeout duration) and calls grpcServer.Stop() to force-close if
GracefulStop() doesn't finish in time, then call grpcServer.GracefulStop() in
the main shutdown path and log both the graceful completion and the forced stop;
reference grpcServer.GracefulStop(), grpcServer.Stop(), and the StreamUpdates
RPC so the timeout protects long-lived streams.
In `@cmd/flp-informers/main.go`:
- Around line 172-180: The code creates an empty NetworkTransformKubeConfig
(NetworkTransformKubeConfig{}) and passes it to informers.NewConfig, which can
cause informer/processor config drift; instead load the real transformer kube
config from the shared configuration or CLI before creating the informers: parse
the existing shared config or add CLI flags to populate
NetworkTransformKubeConfig, build infConfig via
informers.NewConfig(populatedConfig) and then call
inf.InitFromConfig(opts.Kubeconfig, &infConfig, opMetrics) so informers track
the same managed CNI/secondary networks/tracked kinds as processors (refer to
NetworkTransformKubeConfig, informers.NewConfig, informers.Informers, and
inf.InitFromConfig).
- Around line 203-207: StartProcessorDiscovery errors are only logged inside a
goroutine which can let the process continue without discovery; change the
startup to surface failures: call StartProcessorDiscovery in a way that captures
its immediate return value (e.g., run it synchronously or send its error to a
startup channel) and if err != nil (from StartProcessorDiscovery(ctx,
grpcClient, discoveryConfig)) then log the error with log.WithError and stop
startup by returning the error from main or exiting non‑zero (or cancelling ctx
and failing initialization). Ensure you reference the existing symbols
StartProcessorDiscovery, grpcClient, discoveryConfig, ctx and log.WithError when
implementing this propagation.
- Around line 211-217: On shutdown, the code only cancels context and stops the
gRPC client but never closes the Informers internal channels; update the
shutdown handling (after receiving from sigChan and before exiting) to close the
Informers' stop channels (e.g., close(informers.stopChan) and
close(informers.mdStopChan)) or call a safe stop method on the Informers
instance; ensure you guard against double-close (check nil or use a Stop() on
Informers that uses sync.Once) so watches are terminated and resources released.
In `@pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go`:
- Around line 114-120: In the Delete method loop over entries, add the same nil
guard used elsewhere to avoid panics when the store map is nil: before iterating
check that s != nil and s.byKey != nil (or return early if s.byKey == nil); then
proceed to compute key via storeKey(meta), check s.byKey[key], call
s.removeFromIndexes(existing) and delete(s.byKey, key) as before. Ensure you
reference the Delete method, s.byKey, storeKey, and removeFromIndexes when
making the change.
- Around line 86-90: The Replace loop can panic when entries contain nil because
storeKey(meta) dereferences meta before addToIndexes can guard; update the loop
in the Replace implementation to check for nil first (e.g., if meta == nil {
continue }) and skip calling storeKey(meta), s.byKey[...] and
s.addToIndexes(meta) for nil items so malformed update slices do not cause a
panic; ensure the nil check is the first operation inside the loop covering
calls to storeKey and addToIndexes.
- Around line 98-105: Add a nil guard in AddOrUpdate: ensure the receiver and
the byKey map are non-nil and return early if entries is nil; specifically, at
the start of AddOrUpdate check if s == nil or entries == nil then return, and if
s.byKey == nil initialize it (s.byKey = make(map[string]*Meta)) before the loop
so storeKey, s.byKey[key] assignment, removeFromIndexes(existing) and
addToIndexes(meta) are safe.
In `@pkg/pipeline/transform/kubernetes/k8scache/client.go`:
- Around line 134-137: The tls.Config created in client.go (variable tlsConfig
with InsecureSkipVerify: c.tlsConfig.InsecureSkipVerify) doesn't set MinVersion,
allowing protocol downgrade; update the tls.Config initialization in the code
that builds tlsConfig to set MinVersion to a safe minimum (e.g.,
tls.VersionTLS12 or use a configured minimum from c.tlsConfig.MinVersion) so the
client enforces a modern TLS version and prevents downgrade attacks.
In `@pkg/pipeline/transform/kubernetes/k8scache/integration_test.go`:
- Around line 113-116: The test never initializes the datasource store so
mutations via the gRPC path aren't validated; after creating ds :=
&datasource.Datasource{Informers: informers} call SetKubernetesStore on ds (or
use the existing helper that wires the store for kube objects) before creating
cacheServer with NewKubernetesCacheServer(ds) so updates applied through the
server mutate ds.Store and the test can assert expected changes; ensure you
reference and initialize the same store type used elsewhere in tests (e.g., the
Kubernetes store) so subsequent gRPC update calls actually persist to
datasource.Datasource.Store.
- Around line 219-222: The multi-client integration test is missing a call to
set the shared store on the cache server; after creating the Datasource and
cacheServer (Datasource.Informers and NewKubernetesCacheServer), call
SetKubernetesStore on the cacheServer with the appropriate store instance (the
informer store returned by inf.SetupStubs or the Datasource's store) so the
multi-client path uses the same underlying store; update the test to obtain the
store from inf.SetupStubs (or Datasource) and pass it into
cacheServer.SetKubernetesStore(...) before running multi-client assertions.
In `@pkg/pipeline/transform/kubernetes/k8scache/server.go`:
- Around line 136-138: The comment notes snapshots are ignored causing restarts
to miss state; implement initial sync so when SyncRequest.LastVersion == 0 the
server treats the request as a snapshot: honor the is_snapshot flag and call
KubernetesStore.Replace with the full resource set instead of only applying
ADD/UPDATE events. Update the sync handling in server.go (the SyncRequest
processing code path) to detect LastVersion == 0 (or is_snapshot) and perform a
Replace on the store, populating version metadata and returning the appropriate
response so restarted processors get an immediate, consistent cache state.
---
Nitpick comments:
In `@contrib/kubernetes/deployment-k8scache.yaml`:
- Around line 17-36: The flowlogs-pipeline container currently has no resource
constraints; add a resources block under the container spec for the container
named "flowlogs-pipeline" that defines both requests and limits (e.g., cpu and
memory) to prevent eviction and unbounded usage; update the container spec near
where image, args, ports, imagePullPolicy, and volumeMounts are defined to
include resources.requests (minimum cpu/memory) and resources.limits (max
cpu/memory) values appropriate for this workload.
In `@pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go`:
- Around line 142-151: The GetNodeByName method on KubernetesStore currently
returns errors.New("notFound") for missing entries which is a plain string error
and inconsistent with callers; update GetNodeByName (and its callers if needed)
to either return (nil, nil) for a cache miss or use a package-level sentinel
error (e.g., var ErrNotFound = errors.New("node not found")) so callers can
distinguish "not found" from real errors — locate KubernetesStore.GetNodeByName
and the byNodeName lookup and change the return for the not-found branch to use
the chosen pattern, and if you add ErrNotFound update the caller in enrich.go to
check for ErrNotFound explicitly instead of treating any non-nil error as a
cache miss.
In `@pkg/pipeline/transform/kubernetes/k8scache/converter.go`:
- Around line 14-32: The ResourceMetaData construction in converter.go assigns
Labels, Annotations, and SecondaryNetNames by reference (from entry.Labels,
entry.Annotations, entry.SecondaryNetNames), which can cause shared-mutability
bugs; update the assignment in the block that creates meta
(model.ResourceMetaData) to use defensive copies for maps and slices (e.g.,
create new map instances and copy each key/value for Labels and Annotations, and
copy SecondaryNetNames into a new slice) or call small helper functions (e.g.,
copyStringMap, copyStringSlice) so mutations to ResourceMetaData do not affect
the original entry.
In `@pkg/pipeline/transform/kubernetes/k8scache/discovery.go`:
- Around line 93-105: The discoverAndConnect function logs discovery errors
without context; update the error logging in the error branch after
clientset.CoreV1().Pods(...).List to include namespace and cfg.ProcessorSelector
so failures show which selector/namespace failed; locate the
log.WithError(err).Error call in discoverAndConnect and replace it (or augment
it) with a structured log entry that attaches fields like "namespace" and
"selector" alongside the error for clearer diagnostics.
In `@proto/k8scache.proto`:
- Around line 1-5: The proto file declares package k8scache but currently lives
at proto/k8scache.proto which triggers Buf lint for package-directory mismatch;
fix by either moving the file into a k8scache subdirectory (e.g., relocate
proto/k8scache.proto -> proto/k8scache/k8scache.proto) so the file path matches
package k8scache, or add a Buf lint exception/config entry to ignore this rule
for proto/k8scache.proto; ensure the package k8scache and option go_package
remain unchanged when you move or update the lint config.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 57586583-f7a8-4dc5-8f57-e388b8d716ea
⛔ Files ignored due to path filters (2)
- pkg/pipeline/transform/kubernetes/k8scache/k8scache.pb.go is excluded by !**/*.pb.go
- pkg/pipeline/transform/kubernetes/k8scache/k8scache_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (24)
- .mk/development.mk
- Makefile
- README.md
- cmd/flowlogs-pipeline/main.go
- cmd/flp-informers/main.go
- contrib/docker/Dockerfile
- contrib/kubernetes/deployment-flp-informers.yaml
- contrib/kubernetes/deployment-k8scache.yaml
- pkg/config/config.go
- pkg/pipeline/transform/kubernetes/datasource/datasource.go
- pkg/pipeline/transform/kubernetes/datasource/kubernetesstore.go
- pkg/pipeline/transform/kubernetes/enrich.go
- pkg/pipeline/transform/kubernetes/informers/cache_export.go
- pkg/pipeline/transform/kubernetes/k8scache/client.go
- pkg/pipeline/transform/kubernetes/k8scache/converter.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery.go
- pkg/pipeline/transform/kubernetes/k8scache/discovery_test.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler.go
- pkg/pipeline/transform/kubernetes/k8scache/eventhandler_test.go
- pkg/pipeline/transform/kubernetes/k8scache/integration_test.go
- pkg/pipeline/transform/kubernetes/k8scache/server.go
- pkg/pipeline/transform/kubernetes/k8scache/server_test.go
- proto/README.md
- proto/k8scache.proto
log.WithError(err).Fatal("failed to configure TLS for K8s cache server")
return nil
}
Unreachable code after log.Fatal.
log.Fatal calls os.Exit(1), so return nil is never reached. Either use log.Error + return nil or remove the return.
Suggested fix
if cfg.TLSEnabled {
tlsConfig, err := createServerTLSConfig(cfg)
if err != nil {
- log.WithError(err).Fatal("failed to configure TLS for K8s cache server")
- return nil
+ log.WithError(err).Error("failed to configure TLS for K8s cache server")
+ os.Exit(1)
 }

Or simply remove return nil since Fatal terminates.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-log.WithError(err).Fatal("failed to configure TLS for K8s cache server")
-return nil
-}
+log.WithError(err).Error("failed to configure TLS for K8s cache server")
+os.Exit(1)
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/flowlogs-pipeline/main.go` around lines 274 - 276, The return nil after
log.WithError(err).Fatal("failed to configure TLS for K8s cache server") is
unreachable because Fatal exits the process; remove the redundant return nil OR
change the call to a non-fatal logger (e.g., log.WithError(err).Error(...)) if
you intend to return nil from the function; update the code path around the TLS
configuration (the log.WithError(err).Fatal call and the subsequent return nil)
accordingly.
if err != nil {
	log.WithError(err).WithField("address", address).Fatal("failed to start K8s cache server")
	return nil
}
Same issue: unreachable return nil after log.Fatal.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/flowlogs-pipeline/main.go` around lines 288 - 291, The call to
log.WithError(err).WithField("address", address).Fatal(...) makes the subsequent
return nil unreachable; change the flow so the error is returned instead of
exiting: replace Fatal with a non-fatal logger (e.g., Error or Warn) and return
the actual error (return err) from the surrounding function, or if you intend to
terminate the process keep Fatal but remove the unreachable return; update the
log/return around log.WithError(...).WithField(...) accordingly.
tlsConfig := &tls.Config{
	Certificates: []tls.Certificate{cert},
	ClientAuth:   tls.NoClientCert, // Default: no client cert required
}
Set MinVersion: tls.VersionTLS12 (or TLS 1.3) to prevent weak protocol negotiation.
Without MinVersion, Go defaults to TLS 1.0 for servers, which is insecure.
Suggested fix
tlsConfig := &tls.Config{
Certificates: []tls.Certificate{cert},
ClientAuth: tls.NoClientCert, // Default: no client cert required
+ MinVersion: tls.VersionTLS12,
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 tlsConfig := &tls.Config{
 	Certificates: []tls.Certificate{cert},
 	ClientAuth:   tls.NoClientCert, // Default: no client cert required
+	MinVersion:   tls.VersionTLS12,
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/flowlogs-pipeline/main.go` around lines 316 - 319, The tls.Config used to
build the server TLS settings (tlsConfig := &tls.Config{...}) lacks a minimum
protocol version, allowing insecure TLS versions; update the tls.Config in
main.go (tlsConfig) to include MinVersion: tls.VersionTLS12 (or
tls.VersionTLS13) so the server refuses TLS <1.2, e.g., add MinVersion:
tls.VersionTLS12 alongside Certificates and ClientAuth in the tls.Config literal
that uses cert.
--k8scache.address string   K8s cache sync server address (default "0.0.0.0")
--k8scache.port int         K8s cache sync server port (default: disabled)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check all k8scache flags defined in main.go
rg "k8scache\." cmd/flowlogs-pipeline/main.go -n

Repository: netobserv/flowlogs-pipeline
Length of output: 1048
Include TLS flags in the help output.
The code defines --k8scache.tls-enabled, --k8scache.tls-cert-path, --k8scache.tls-key-path, and --k8scache.tls-ca-path, but the README only documents the address and port flags. Regenerate the help output to include all k8scache options.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 57 - 58, The README's flags list is missing the
TLS-related k8scache options; update the help output block to include
--k8scache.tls-enabled, --k8scache.tls-cert-path, --k8scache.tls-key-path, and
--k8scache.tls-ca-path alongside the existing --k8scache.address and
--k8scache.port entries, showing their default values and short descriptions
(e.g. whether TLS is enabled, paths for cert/key/CA), and regenerate the help
output so the README accurately documents all k8scache options.
jpinsonneau
left a comment
Let's make the cache configurable (buffer / timeout / batch size) and ensure snapshots are sent when FLP starts / restarts.
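If that configurability is added, it could be a small options struct threaded into the client; the field names and defaults below are only a suggestion:

// Sketch: tunables for the k8scache client, settable via flags or config file.
type CacheClientConfig struct {
	UpdateBufferSize int           // capacity of the internal update queue
	SendTimeout      time.Duration // per-processor send deadline
	BatchSize        int           // max entries bundled into one CacheUpdate
	BatchInterval    time.Duration // max wait before flushing a partial batch
}

func DefaultCacheClientConfig() CacheClientConfig {
	return CacheClientConfig{
		UpdateBufferSize: 1000,
		SendTimeout:      10 * time.Second,
		BatchSize:        100,
		BatchInterval:    500 * time.Millisecond,
	}
}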
		// AddProcessorWithTimeout is idempotent (won't duplicate if already connected)
		// Use a 10-second timeout to avoid blocking the discovery loop for too long
		if err := client.AddProcessorWithTimeout(address, 10*time.Second); err != nil {
			log.WithError(err).WithField("pod", pod.Name).Error("failed to connect to processor")
		}
	}
}
We should also handle RemoveProcessor? FLPs may restart and change address.
Created a mechanism to remove stale processors.
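For reference, a rough sketch of what such a cleanup could look like, assuming the discovery loop can hand the client the set of currently valid addresses; the method name, the `mu` mutex, and the discovery wiring are hypothetical, only `connections`, `cancelStream`, and `clog` come from the code above:

```go
// RemoveStaleProcessors drops connections whose address is no longer reported
// by pod discovery, cancelling their streams so resources are released.
func (c *Client) RemoveStaleProcessors(current map[string]struct{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for addr, pc := range c.connections {
		if _, ok := current[addr]; ok {
			continue // still discovered, keep it
		}
		clog.WithField("address", addr).Info("removing stale processor")
		pc.cancelStream() // unblocks any in-flight Send/Recv on this stream
		delete(c.connections, addr)
	}
}
```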
```go
for _, pc := range connections {
	wg.Add(1)
	go func(pc *processorConnection) {
		defer wg.Done()

		// Get stream reference under lock
		stream := pc.getStream()
		if stream == nil {
			clog.WithField("address", pc.address).Warn("stream is nil, skipping send")
			pc.healthy.Store(false)
			return
		}

		// Create a context with timeout for this send operation
		sendCtx, sendCancel := context.WithTimeout(context.Background(), sendTimeout)
		defer sendCancel()

		// Use a channel to signal completion or timeout
		done := make(chan error, 1)
		go func() {
			select {
			case <-sendCtx.Done():
				// Context cancelled, exit goroutine to prevent leak
				return
			case done <- stream.Send(update):
				// Send completed (success or error)
			}
		}()

		select {
		case err := <-done:
			if err != nil {
				clog.WithError(err).WithField("address", pc.address).Error("failed to send update")
				pc.healthy.Store(false)
			}
		case <-sendCtx.Done():
			clog.WithField("address", pc.address).Error("send operation timed out")
			pc.healthy.Store(false)

			// Cancel the stream context to ensure any blocked operations are unblocked
			// This will cause receiveFromProcessor to see an error and trigger reconnection
			pc.cancelStream()
		}
	}(pc)
}
```
Suggested change:

```go
for _, pc := range connections {
	wg.Add(1)
	go func(pc *processorConnection) {
		defer wg.Done()
		// Get stream reference under lock
		stream := pc.getStream()
		if stream == nil {
			clog.WithField("address", pc.address).Warn("stream is nil, skipping send")
			pc.healthy.Store(false)
			return
		}
		// Race stream.Send against a timeout. stream.Send doesn't accept
		// a context, so we run it in a separate goroutine and cancel the
		// underlying stream on timeout to unblock it.
		done := make(chan error, 1)
		go func() {
			done <- stream.Send(update)
		}()
		select {
		case err := <-done:
			if err != nil {
				clog.WithError(err).WithField("address", pc.address).Error("failed to send update")
				pc.healthy.Store(false)
			}
		case <-time.After(sendTimeout):
			clog.WithField("address", pc.address).Error("send operation timed out")
			pc.healthy.Store(false)
			// Cancel the stream context so the blocked Send returns an error.
			// The goroutine will then write to `done` and exit (buffered channel).
			// This also causes receiveFromProcessor to see an error and trigger reconnection.
			pc.cancelStream()
		}
	}(pc)
}
```
```go
// handleSyncRequest handles a SyncRequest from a processor
func (c *Client) handleSyncRequest(pc *processorConnection, req *SyncRequest) {
	clog.WithFields(log.Fields{
		"address":      pc.address,
		"processor_id": req.ProcessorId,
		"last_version": req.LastVersion,
	}).Info("received SyncRequest from processor")

	// Note: We only send incremental updates (ADD/UPDATE/DELETE).
	// No initial snapshot is sent. Processors start with empty cache and build up
	// from incoming events. If a processor restarts, it loses enrichment until
	// resources are updated again (acceptable trade-off per design).
}
```
Does that mean every pod running before startup / restart will not be enriched ❓
I feel we should send a snapshot on SyncRequest when lastVersion is 0 to recover.
Initial snapshot re-added.
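A sketch of what sending a snapshot on an initial SyncRequest could look like; `listSnapshot` is a hypothetical helper returning the informers' current state, while the other names come from the excerpts above — this is not necessarily how the PR implements it:

```go
func (c *Client) handleSyncRequest(pc *processorConnection, req *SyncRequest) {
	clog.WithFields(log.Fields{
		"address":      pc.address,
		"processor_id": req.ProcessorId,
		"last_version": req.LastVersion,
	}).Info("received SyncRequest from processor")

	// A processor that just started (or restarted) reports LastVersion == 0,
	// so replay the full current state instead of waiting for new events.
	if req.LastVersion == 0 {
		snapshot := &CacheUpdate{
			Version:    c.version.Load(),
			IsSnapshot: true,
			Operation:  OperationType_OPERATION_ADD,
			Entries:    metaToResourceEntries(c.listSnapshot()), // hypothetical helper
		}
		if stream := pc.getStream(); stream != nil {
			if err := stream.Send(snapshot); err != nil {
				clog.WithError(err).WithField("address", pc.address).Error("failed to send snapshot")
				pc.healthy.Store(false)
			}
		}
	}
}
```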
```go
	processorID: config.ProcessorID,
	tlsConfig:   *config,
	connections: make(map[string]*processorConnection),
	updateChan:  make(chan *CacheUpdate, 100), // Buffer for updates
```
That may not be enough for big clusters, especially with the 5s timeout 🤔
We should make those configurable and test some values.
```go
	select {
	case c.updateChan <- update:
		return nil
	case <-time.After(5 * time.Second):
```
Let's make this configurable
```go
// SendAdd sends an ADD operation to all connected processors
func (c *Client) SendAdd(entries []*model.ResourceMetaData) error {
	version := c.version.Add(1)
	update := &CacheUpdate{
		Version:    version,
		IsSnapshot: false,
		Operation:  OperationType_OPERATION_ADD,
		Entries:    metaToResourceEntries(entries),
	}

	return c.sendUpdate(update)
}

// SendUpdate sends an UPDATE operation to all connected processors
func (c *Client) SendUpdate(entries []*model.ResourceMetaData) error {
	version := c.version.Add(1)
	update := &CacheUpdate{
		Version:    version,
		IsSnapshot: false,
		Operation:  OperationType_OPERATION_UPDATE,
		Entries:    metaToResourceEntries(entries),
	}

	return c.sendUpdate(update)
}

// SendDelete sends a DELETE operation to all connected processors
func (c *Client) SendDelete(entries []*model.ResourceMetaData) error {
	version := c.version.Add(1)
	update := &CacheUpdate{
		Version:    version,
		IsSnapshot: false,
		Operation:  OperationType_OPERATION_DELETE,
		Entries:    metaToResourceEntries(entries),
	}

	return c.sendUpdate(update)
}
```
We should batch messages instead of sending them one by one.
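One possible shape for that, assuming the broadcaster drains the queue and coalesces entries of the same operation; the function name, the `max` limit, and the flush interval are made-up, only `updateChan` and the `CacheUpdate` fields come from the excerpts above:

```go
// batchSameOp coalesces consecutive updates of the same operation type into one
// CacheUpdate, up to max entries or until flushAfter elapses. If it dequeues an
// update of a different operation, it returns it as leftover for the caller to
// send in the next round. Sketch only.
func (c *Client) batchSameOp(first *CacheUpdate, max int, flushAfter time.Duration) (batch, leftover *CacheUpdate) {
	batch = first
	timer := time.NewTimer(flushAfter)
	defer timer.Stop()
	for len(batch.Entries) < max {
		select {
		case next := <-c.updateChan:
			if next.Operation != batch.Operation {
				return batch, next
			}
			batch.Entries = append(batch.Entries, next.Entries...)
		case <-timer.C:
			return batch, nil
		}
	}
	return batch, nil
}
```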
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
jpinsonneau
left a comment
LGTM in terms of code! Time to test and get some metrics 😸
```go
	DiscoveryErrors: prometheus.NewCounter(prometheus.CounterOpts{
		Name: "flp_informers_discovery_errors_total",
		Help: "Total number of processor discovery errors",
	}),
```
It's nice to have this observability layer.
For errors, I think it's better to have a single error metric with a code label, so that if we already have a dashboard surfacing those errors, we don't have to update it every time we add more kinds of errors.
Currently that would be an error counter with an error code (or provenance) of "discovery" and another for UDN disambiguation.
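For example, something along these lines; the metric name, label, and label values are a suggestion, not necessarily what the PR ended up with:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// A single error counter partitioned by a "code" label: new error kinds only
// add a new label value, so existing dashboards keep working.
var informerErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "flp_informers_errors_total",
	Help: "Total number of flp-informers errors, by error code",
}, []string{"code"})

func init() {
	prometheus.MustRegister(informerErrors)
}

// Usage, e.g.:
//   informerErrors.WithLabelValues("discovery").Inc()
//   informerErrors.WithLabelValues("udn_disambiguation").Inc()
```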
Good idea, done.
```go
	// It is populated by the k8s cache sync gRPC server when receiving updates from flp-informers.
	// When k8scache is enabled, Informers is nil and only kubernetesStore is used.
	// Access is synchronized via atomic operations to prevent race conditions.
	kubernetesStore atomic.Pointer[KubernetesStore]
```
Does it really need to be synchronized? The KubernetesStore struct itself has a mutex; isn't that sufficient? (If I'm reading correctly, the kubernetesStore var is only set once, in main.go, plus in tests.)
That is true, it's not required. Removed.
```diff
 func (d *Datasource) IndexLookup(potentialKeys []string, ip string) *model.ResourceMetaData {
-	return d.Informers.IndexLookup(potentialKeys, ip)
+	if store := d.kubernetesStore.Load(); store != nil {
```
It would be cleaner to make Datasource an interface with two implementations (k8s-store and informers); that would avoid this pattern of "1. check store; 2. fall back to informers".
We can defer that to later.
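A rough sketch of that split; the interface and struct names are illustrative, and the assumption that KubernetesStore exposes an IndexLookup method mirrors the Informers API shown above:

```go
// Lookup abstracts where enrichment metadata comes from, so the processor can
// be wired with either local informers or the gRPC-fed store.
type Lookup interface {
	IndexLookup(potentialKeys []string, ip string) *model.ResourceMetaData
}

// storeDatasource serves lookups from the KubernetesStore fed by flp-informers.
type storeDatasource struct{ store *KubernetesStore }

func (s *storeDatasource) IndexLookup(keys []string, ip string) *model.ResourceMetaData {
	return s.store.IndexLookup(keys, ip)
}

// informerDatasource serves lookups straight from local informers.
type informerDatasource struct{ informers *Informers }

func (i *informerDatasource) IndexLookup(keys []string, ip string) *model.ResourceMetaData {
	return i.informers.IndexLookup(keys, ip)
}
```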
Yes, I will evaluate the effort, but I agree we can do it in a follow-up PR to focus on performance now.
@leandroberetta: The following test failed, say
Full PR test history. Your PR dashboard.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Description
Add k8scache: centralized K8s informers to reduce API load
This PR implements centralized Kubernetes metadata caching to reduce
K8s API load when running multiple FLP processor pods.
Architecture:
Deployment modes:
Benefits:
Dependencies
n/a
Checklist
To run a perfscale test, comment with:
/test flp-node-density-heavy-25nodes
Summary by CodeRabbit
New Features
Documentation
Tests