BUG REPORT INFORMATION
Description
Suppose a file exists at /home/core/test, and I then apply a new machineconfig that writes to /home/core/test/test. Since /home/core/test is a file, the MCO correctly detects that it cannot create a directory there, and the node degrades.
If I then delete the machineconfig that introduced this change, the MCC correctly detects that the worker pool should go back to targeting the previous machineconfig for the worker pool. However, the MCD running on the node does not detect this change. It fail-loops on Marking Degraded due to: failed to create directory "/home/core/test": mkdir /home/core/test: not a directory, leaving the node marked SchedulingDisabled and making no progress. In fact, since the annotation on the node never gets updated, even deleting the MCD pod doesn't fix the error: the new pod attempts the same update and fails with the same error.
To recover, we need to update the annotation on the node by hand to the previous desiredConfig, and then manually run oc adm uncordon on the node. This is obviously not the desired behaviour, as we should be able to recover automatically when the MC is deleted.
I haven't tested every degrade-recovery scenario, but I remember that we were able to recover from some cases before. I will test whether other types of degrades exhibit the same behaviour.
Steps to reproduce the issue:
- Create a file with a machineconfig snippet like:
...
storage:
  files:
  - contents:
      source: data:,hello%20world%0A
      verification: {}
    filesystem: root
    mode: 420
    path: /home/core/test
- Create a second file with another machineconfig like:
...
storage:
  files:
  - contents:
      source: data:,hello%20worlddd%0A
      verification: {}
    filesystem: root
    mode: 420
    path: /home/core/test/test
- Notice that the second machineconfig causes a degrade on one of the nodes
- Delete the second machineconfig, and notice that the node is unable to recover
Additional environment details (platform, options, etc.):
Reproduced so far on 4.4 Azure and AWS