Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
LoadBalancer services with multiple listeners (ports, see the example below) take too long to reconcile pool members.
This is caused by the Octavia API design: if at least one child resource of a load balancer is in PENDING_UPDATE, Octavia responds with 409, which prevents modifying any other child resource.
OCCM has hardcoded retry and backoff parameters, so it polls resource statuses with an exponentially growing delay.
    klog.InfoS("Waiting for load balancer ACTIVE", "lbID", loadbalancerID)
    steps := getTimeoutSteps("OCCM_WAIT_LB_ACTIVE_STEPS", waitLoadbalancerActiveSteps)
    backoff := wait.Backoff{
        Duration: waitLoadbalancerInitDelay,
        Factor:   waitLoadbalancerFactor,
        Steps:    steps,
    }
- Duration = 1 second (initial wait time)
- Factor = 1.2 (multiplier for each step)
- Steps = 23 (maximum retries)
- Total time to wait for each pool: 2-3m
Considering the above, the status update is detected on step 19 or 20, where the cumulative backoff delay is between 2.5 and 3.1 minutes. Most likely the actual status update happens after about 2 minutes. If we cap the per-step delay at 10s, we can save up to 10 minutes (or 40%) out of the 26-30 minutes of total wait for 10 pools.
What you expected to happen:
These delay settings should be configurable via service annotations. In addition, it should be possible to check the LB status every second, without an exponential delay.
How to reproduce it:
Create a service on a cluster with one node:
    apiVersion: v1
    kind: Service
    metadata:
      name: openstack-loadbalancer
      annotations:
        loadbalancer.openstack.org/protocol: "tcp"
    spec:
      type: LoadBalancer
      selector:
        app: my-app
      ports:
      - name: port-80
        protocol: TCP
        port: 80
        targetPort: 80
      - name: port-81
        protocol: TCP
        port: 81
        targetPort: 81
      - name: port-82
        protocol: TCP
        port: 82
        targetPort: 82
      - name: port-83
        protocol: TCP
        port: 83
        targetPort: 83
      - name: port-84
        protocol: TCP
        port: 84
        targetPort: 84
      - name: port-85
        protocol: TCP
        port: 85
        targetPort: 85
      - name: port-86
        protocol: TCP
        port: 86
        targetPort: 86
      - name: port-87
        protocol: TCP
        port: 87
        targetPort: 87
      - name: port-88
        protocol: TCP
        port: 88
        targetPort: 88
      - name: port-89
        protocol: TCP
        port: 89
        targetPort: 89
      - name: port-90
        protocol: TCP
        port: 90
        targetPort: 90
Add a new node to the cluster (or remove one). The service reconciliation will take ~30m.
Anything else we need to know?:
See also #1770
Another way to fix this is to implement batch pool updates in the upstream Octavia API.
Environment:
- openstack-cloud-controller-manager (or other related binary) version:
- OpenStack version:
- Others:
Source of the backoff snippet quoted under "What happened": cloud-provider-openstack/pkg/util/openstack/loadbalancer.go, lines 167 to 173 at commit 251508d.