Prevent dynamic service deletion during upgrade #16151
Code Review
This pull request introduces a check to skip service registration if the underlying Kubernetes pod is terminating, passing the pod name to KubeDiscoveryService and querying its deletion timestamp. It also updates service update logic to check for changes in owner references. The review feedback highlights several improvement opportunities: removing debugging log prefixes ("Priyanshuuuu-" and "Priyanshuuu -"), downgrading a registration skip log from error to warning, and adding null checks for currentService.getMetadata() and the retrieved V1Pod object to prevent potential NullPointerExceptions.
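As an illustration of the terminating-pod check and the null handling the review asks for, here is a minimal sketch. `Pod` and `PodMeta` are hypothetical stand-ins for the Kubernetes client's `V1Pod` and `V1ObjectMeta`, and `TerminationCheck` is an invented name; this is not the code from the PR. It assumes that a pod counts as terminating exactly when its metadata carries a deletion timestamp.

```java
import java.time.OffsetDateTime;
import java.util.Optional;

// Hypothetical stand-in for V1ObjectMeta: only the field this check needs.
final class PodMeta {
  private final OffsetDateTime deletionTimestamp;

  PodMeta(OffsetDateTime deletionTimestamp) {
    this.deletionTimestamp = deletionTimestamp;
  }

  OffsetDateTime getDeletionTimestamp() {
    return deletionTimestamp;
  }
}

// Hypothetical stand-in for V1Pod.
final class Pod {
  private final PodMeta metadata;

  Pod(PodMeta metadata) {
    this.metadata = metadata;
  }

  PodMeta getMetadata() {
    return metadata;
  }
}

final class TerminationCheck {
  // Returns true only if the pod, its metadata, and the deletion timestamp
  // are all non-null. Optional.map yields an empty Optional as soon as one
  // step returns null, so no explicit null checks are needed.
  static boolean isTerminating(Pod pod) {
    return Optional.ofNullable(pod)
        .map(Pod::getMetadata)
        .map(PodMeta::getDeletionTimestamp)
        .isPresent();
  }
}
```

In this sketch a null pod or null metadata is treated as "not terminating"; whether that is the right default for the real service is a design decision the sketch does not settle.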
     * @param loadBalancerServiceList list of services that should be exposed via LoadBalancer
     * @param loadBalancerServiceAnnotations annotations to apply to LoadBalancer services
     */
    public KubeDiscoveryService(String namespace, String namePrefix, @Nullable String podName,
Avoid introducing a new constructor.
    private static final String SERVICE_TYPE_CLUSTER_IP = "ClusterIP";
    private static final String PAYLOAD_NAME = "cdap.service.payload";

    @Nullable
This should not be nullable.
        .map(V1ObjectMeta::getOwnerReferences)
        .orElse(Collections.emptyList());

    List<V1OwnerReference> safeExpectedOwners = expectedOwners == null ?
No need for a null check.
See:
    List<V1OwnerReference> safeExpectedOwners = expectedOwners == null ?
        Collections.emptyList() : expectedOwners;

    return currentOwners.size() != safeExpectedOwners.size()
Use a predefined Guava or Collection util method for comparing collections.
I have used a Set.
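For reference, an order-insensitive comparison along the lines of the Set approach mentioned in the reply could look like the sketch below. `OwnerRefs` and `hasChanged` are hypothetical names, not the PR's code, and the sketch assumes the element type implements `equals` and `hashCode` consistently.

```java
import java.util.Collection;
import java.util.HashSet;

final class OwnerRefs {
  // Compares two collections as sets: element order and duplicates are
  // ignored, so only a difference in membership counts as a change.
  static <T> boolean hasChanged(Collection<T> current, Collection<T> expected) {
    return !new HashSet<>(current).equals(new HashSet<>(expected));
  }
}
```

Because duplicates collapse in a set, this is only appropriate when repeated owner references carry no meaning; a size check or a multiset comparison would be needed otherwise.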
    }

    private boolean isPodTerminating() {
      if (podName == null) {
          .map(V1ObjectMeta::getDeletionTimestamp)
          .isPresent();

    } catch (Exception e) {
Return true only if e is an ApiException with 404.
Let other exceptions propagate.
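A sketch of the handling requested here, under stated assumptions: `FakeApiException` is a hypothetical stand-in for the Kubernetes client's `ApiException` (which exposes the HTTP status through `getCode()`), and `PodLookup` is an invented interface representing the pod query. The method signature differs from the PR's no-argument `isPodTerminating()` so that the example stays self-contained.

```java
// Hypothetical stand-in for the Kubernetes client's ApiException.
class FakeApiException extends Exception {
  private final int code;

  FakeApiException(int code, String message) {
    super(message);
    this.code = code;
  }

  int getCode() {
    return code;
  }
}

// Hypothetical lookup: returns true if the named pod has a deletion timestamp.
interface PodLookup {
  boolean hasDeletionTimestamp(String podName) throws FakeApiException;
}

final class PodState {
  // A 404 means the pod no longer exists, so it is treated as terminating.
  // Every other API failure is rethrown so the caller sees the real error
  // instead of a silently skipped registration.
  static boolean isPodTerminating(String podName, PodLookup lookup) throws FakeApiException {
    try {
      return lookup.hasDeletionTimestamp(podName);
    } catch (FakeApiException e) {
      if (e.getCode() == 404) {
        return true;
      }
      throw e;
    }
  }
}
```

Catching only the API exception, rather than `Exception`, keeps unrelated failures such as programming errors from being mistaken for a missing pod.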
    }

    private boolean hasOwnerReferencesChanged(V1Service currentService,
        @Nullable List<V1OwnerReference> expectedOwners) {
Remove the @Nullable annotation.
Context
During CDAP upgrades (such as metadata or appfabric), the Kubernetes Services dynamically created by CDAP pods were sometimes garbage-collected (deleted) by Kubernetes, leaving the active pods unreachable and causing 503 errors.

This occurred due to two separate issues:
1. KubeDiscoveryService skipped updating the Service. The Service's ownerReferences remained pointing to the old, inactive ReplicaSet. When GKE cleaned up the old ReplicaSet, the Service was automatically garbage-collected.
2. ... ownerReferences back to the dying ReplicaSet, causing GC deletion when the old ReplicaSet is pruned.

Testing