[occm] slow reconciliation of LoadBalancer service with multiple listeners on k8s node add/remove #2858

@kayrus

Description

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

LoadBalancer services with multiple listeners (ports; see the example below) take too much time to reconcile pool members.
This is due to the Octavia API design: if at least one of a load balancer's child resources is in PENDING_UPDATE, Octavia responds with 409 and does not allow modifying any other child resource.

OCCM has hardcoded retry and backoff parameters and checks resource statuses with an exponentially increasing delay:

klog.InfoS("Waiting for load balancer ACTIVE", "lbID", loadbalancerID)
steps := getTimeoutSteps("OCCM_WAIT_LB_ACTIVE_STEPS", waitLoadbalancerActiveSteps)
backoff := wait.Backoff{
	Duration: waitLoadbalancerInitDelay,
	Factor:   waitLoadbalancerFactor,
	Steps:    steps,
}

  • Duration = 1 second (initial wait time)
  • Factor = 1.2 (multiplier for each step)
  • Steps = 23 (maximum retries)
  • Total time to wait for each pool: 2-3m

Considering the above, the status update is usually detected on step 19-20, by which point the cumulative delay is between 2.5 and 3.1 minutes, even though the actual status change most likely happens within 2 minutes. If we capped the per-step delay at 10 s, we could save up to 10 minutes (roughly 40% of the 26-30 minute total wait) for 10 pools.
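As a sanity check, the cumulative wait implied by these parameters can be computed directly. This is a standalone sketch, not OCCM code; `cumulativeWait` is a helper introduced here for illustration, mirroring the `wait.Backoff` values above (Duration=1s, Factor=1.2, Steps=23):

```go
package main

import "fmt"

// cumulativeWait returns the total seconds slept after the first n backoff
// steps with an initial delay of 1 s and a factor of 1.2 per step.
func cumulativeWait(n int) float64 {
	delay, total := 1.0, 0.0
	for i := 0; i < n; i++ {
		total += delay
		delay *= 1.2
	}
	return total
}

func main() {
	for _, n := range []int{19, 20, 23} {
		s := cumulativeWait(n)
		fmt.Printf("after step %d: %.0f s (%.1f min)\n", n, s, s/60)
	}
	// after step 19: 155 s (2.6 min)
	// after step 20: 187 s (3.1 min)
	// after step 23: 326 s (5.4 min)
}
```

This matches the 2.5-3.1 minute window at steps 19-20 quoted above, and shows the full 23-step budget tops out at roughly 5.4 minutes per wait.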

What you expected to happen:

These delay settings should be configurable via service annotations. In addition, it should be possible to check the LB status every second, without an exponential delay.
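One possible shape for such annotations, shown on the document's example Service. Note that the annotation names below are hypothetical, invented here to illustrate the proposal; they do not exist in OCCM today:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: openstack-loadbalancer
  annotations:
    loadbalancer.openstack.org/protocol: "tcp"
    # Hypothetical tuning knobs (not implemented):
    loadbalancer.openstack.org/wait-active-init-delay: "1s"   # initial poll delay
    loadbalancer.openstack.org/wait-active-max-delay: "10s"   # upper cap on delay
    loadbalancer.openstack.org/wait-active-factor: "1.0"      # 1.0 = no exponential growth
spec:
  type: LoadBalancer
```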

How to reproduce it:

Create a service on a cluster with one node:

apiVersion: v1
kind: Service
metadata:
  name: openstack-loadbalancer
  annotations:
    loadbalancer.openstack.org/protocol: "tcp"
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - name: port-80
      protocol: TCP
      port: 80
      targetPort: 80
    - name: port-81
      protocol: TCP
      port: 81
      targetPort: 81
    - name: port-82
      protocol: TCP
      port: 82
      targetPort: 82
    - name: port-83
      protocol: TCP
      port: 83
      targetPort: 83
    - name: port-84
      protocol: TCP
      port: 84
      targetPort: 84
    - name: port-85
      protocol: TCP
      port: 85
      targetPort: 85
    - name: port-86
      protocol: TCP
      port: 86
      targetPort: 86
    - name: port-87
      protocol: TCP
      port: 87
      targetPort: 87
    - name: port-88
      protocol: TCP
      port: 88
      targetPort: 88
    - name: port-89
      protocol: TCP
      port: 89
      targetPort: 89
    - name: port-90
      protocol: TCP
      port: 90
      targetPort: 90

Add a new node to the cluster (or remove one); the service reconciliation will take ~30 minutes.

Anything else we need to know?:

See also #1770

Another approach to fixing this is to implement a batch update of pools in the upstream Octavia API.

Environment:

  • openstack-cloud-controller-manager(or other related binary) version:
  • OpenStack version:
  • Others:

    Labels

kind/bug (Categorizes issue or PR as related to a bug), lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale)