Enable Network Observability on Day 0 #1908
---
title: enable-network-observability-on-day-0
authors:
  - "@stleerh"
reviewers:
  - "@jotak"
  - "@jpinsonneau"
  - "@memodi"
  - "Mike Fiedler"
  - "@pavolloffay"
  - "@jan--f"
  - "@abhat"
  - "@simonpasquier"
  - "@everettraven"
approvers:
  - "@jotak"
  - "@dave-tucker"
api-approvers:
  - "@jotak"
  - "@dave-tucker"
  - "@everettraven"
creation-date: 2025-09-30
last-updated: 2026-02-25
tracking-link:
  - https://issues.redhat.com/browse/OCPSTRAT-2469
see-also:
  - N/A
replaces:
  - N/A
superseded-by:
  - N/A
---
# Enable Network Observability on Day 0

## Summary

This feature enhancement makes Network Observability available by default on day 0; that is, Network Observability is up and running as soon as you create an OpenShift cluster with `openshift-install`. It installs the Network Observability Operator and creates a basic FlowCollector instance, and it also makes it easy to have Network Observability available on day 1. There is an option to turn this off.
## Motivation

Network Observability is an optional OLM operator that collects and stores traffic flow information and provides insights into your network traffic, including troubleshooting features such as packet drops, latencies, DNS tracking, and more.
### User Stories

* As a cluster admin or developer, I expect to be able to observe and manage my network traffic without having to install other components. It should just be there.
* As a cluster admin, I should be able to see the networking health of my cluster after creating it.
* As a customer support engineer, I want the customer to be aware that Network Observability exists and can provide insights into their network traffic, including the ability to troubleshoot a number of networking issues.
These are the related issues for this feature enhancement:

* (Feature) [OCPSTRAT-2469](https://issues.redhat.com/browse/OCPSTRAT-2469) Provide a default OpenShift install experience for Network Observability
* (Epic) [NETOBSERV-2454](https://issues.redhat.com/browse/NETOBSERV-2454) Install Network Observability operator by default on OpenShift clusters
* (Spike) [NETOBSERV-2236](https://issues.redhat.com/browse/NETOBSERV-2236) What it would take to enable Network Observability by default in the console
* (PoC) [NETOBSERV-2247](https://issues.redhat.com/browse/NETOBSERV-2247) Have network observability be available and enabled on day 0
### Goals

Being able to manage and observe the network in an OpenShift cluster is critical to maintaining the health and integrity of the network. Without it, there is no easy way to verify whether your changes are working as expected or whether your network is experiencing issues.

Currently, Network Observability is an optional operator, and a majority of customers do not have it installed. Those customers are missing out on features that they should have and have already paid for.

Network observability should be an integral part of networking, not thought of as a separate component. You shouldn't have to ask, "Do I need observability?" any more than you would ask, "Do I need security?" Because it requires resources, basic observability should exist by default, with additional features enabled as needed. There are a few scenarios where you might not want Network Observability, so there is an easy way to opt out.

There is no one-size-fits-all configuration for Network Observability. The goal is to keep this part simple while still providing as much value as possible given the constraints, and to make it easy to change parameters on day 2.
### Non-Goals

There are other proposals to make Network Observability more visible and prominent, such as displaying a panel that describes its features and provides a button to install it. However, this feature enhancement addresses [OCPSTRAT-2469](https://issues.redhat.com/browse/OCPSTRAT-2469), which explicitly calls for Network Observability to be up and running after install.

There is a separate effort to add Network Observability to the OpenShift Assisted Installer ([NETOBSERV-2486](https://issues.redhat.com/browse/NETOBSERV-2486)). That addresses some installation cases but not all.

Network Observability Operator manages its own components, such as the flow log pipeline. Therefore, there is no need to consider lifecycle management here, since that will not change.
## Proposal

This proposal changes three OpenShift repositories: [openshift/api](https://github.com/openshift/api), [openshift/cluster-network-operator](https://github.com/openshift/cluster-network-operator), and [openshift/installer](https://github.com/openshift/installer).

### Repository: openshift/api

The openshift/api repository is a shared repository for defining the API. This proposal adds a `networkObservability` field, with a nested `installationPolicy` field, under the `spec` section of the Network Custom Resource Definition (CRD).
```yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  networkObservability:
    installationPolicy: InstallAndEnable | DoNotInstall
```

> **Contributor:** There are two "Network" types. Although we have not been consistent, the idea was supposed to be that they serve different purposes. So if this is telling CNO whether or not to deploy the observability operator, then it belongs in the operator configuration.
>
> **Author:** This file was generated by the installer. If you have an install-config.yaml file and enter `openshift-install create manifests`, this file gets modified to include the Network Observability config from install-config.yaml.
>
> **Contributor:** Yes, but that all happens because you told openshift-install to do that, and I'm saying you should tell it to write that config to a new object instead.
>
> **Author:** We could do that, but it would make it harder for someone to not enable Network Observability.
Listing 1: Network manifest

Using an enumerated value rather than a simple true/false field allows flexibility for future growth.

If the value is `InstallAndEnable` or the field doesn't exist, Network Observability is enabled; that is, Network Observability will be installed and a FlowCollector custom resource will be created (more details below). If it is set to `DoNotInstall`, Network Observability is not enabled, or, to be precise, *nothing is done*: setting `DoNotInstall` does not remove an existing Network Observability installation.
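The defaulting rule described above (enabled unless explicitly set to `DoNotInstall`) can be sketched as a small Go helper. The type and constant names here are illustrative assumptions, not the final openshift/api definitions.

```go
package main

import "fmt"

// InstallationPolicy mirrors the proposed enum in the Network CRD.
// These names are assumptions for illustration only.
type InstallationPolicy string

const (
	InstallAndEnable InstallationPolicy = "InstallAndEnable"
	DoNotInstall     InstallationPolicy = "DoNotInstall"
)

// networkObservabilityEnabled captures the proposed semantics:
// enabled when the field is InstallAndEnable or absent (empty),
// disabled only when explicitly set to DoNotInstall.
func networkObservabilityEnabled(p InstallationPolicy) bool {
	return p != DoNotInstall
}

func main() {
	fmt.Println(networkObservabilityEnabled(""))           // unset defaults to enabled
	fmt.Println(networkObservabilityEnabled(DoNotInstall)) // explicitly opted out
}
```

In practice, invalid values would be rejected by CRD validation, so only the two enumerated values (or an absent field) reach this logic.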
### Repository: openshift/cluster-network-operator

The actual enabling of Network Observability is done in the Cluster Network Operator (CNO). The rationale is that network observability should be part of networking, as opposed to being part of general observability or a standalone entity. There is still a separation at the lower level, so the two can be developed independently and released at different times, particularly for bug fixes.

This proposal adds a new controller for observability in CNO and registers it with the manager. The controller is a single Go file whose Reconciler reads the state of the `installationPolicy` field. If it is set to `InstallAndEnable`, the controller does the following:
> **Member:** What if a user changes from `InstallAndEnable` to `DoNotInstall`?
>
> **Contributor:** Once NOO has been installed, the CNO has finished its work; switching back to `DoNotInstall` does nothing.
>
> **Author:** This is why it's called "DoNotInstall" rather than "Uninstall".
>
> **Member:** It would be beneficial to document that for future reference.
1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
> **Contributor:** How are you intending to evaluate this?
>
> **Author:** See PR #2925.
>
> **Member:** Can you elaborate please @stleerh?
2. Install NOO using OLM's OperatorGroup and Subscription.
> **Contributor:** Have you considered making this fully managed by the cluster-network-operator instead of utilizing OLM? If this is going to be considered a core component of OpenShift moving forward, it seems reasonable to move it away from being a layered product installed by OLM, so there is a tighter coupling between OCP version and NOO version. If you take the OLM approach, will you be pinning the version of NOO to a particular version to prevent automatic upgrades? Are we going to support customers modifying the OLM resources created by the CNO? Would we support running a newer version of NOO on an older version of OCP (since OLM supports over-the-air style updates and upgrades)? There are some nuances when leveraging OLM as the deployment mechanism and I'd like to better understand the interactions here.
>
> **Author:** We want Network Observability to be there when you have Cluster Network Operator (CNO), but it can run independent of that in the upstream version. We have one version of NOO that is backwards-compatible with all versions of OCP. We've been releasing a new X.Y version alongside the OCP version.
>
> **Member:** It would be helpful to document how this pattern has been used before, i.e. in CIO.
>
> **Member:** You call out OLM v0 constructs, but the implementation uses OLM v1. Also, what if someone installed NOO via OLM v0?
>
> **Contributor:** After discussing with the OLM team yesterday, we made a change to the implementation.
>
> **Author:** The goal was to avoid implementation details, so it will check if NOO is installed, regardless of whether that was done using OLM v0 or v1. Step 2 could have avoided implementation details by just saying "Install NOO".
>
> **Member:** What happens if the user mutates the resources created by CNO (i.e. the ClusterExtension)?
>
> **Contributor:** The user owns both the ClusterExtension and the FlowCollector. If the user deletes the ClusterExtension, NOO is removed and the CNO does not try to reinstall it.
>
> **Member:** Can you explain how the CNO and NOO lifecycles are supposed to interact? Should NOO enter a degraded state if it has failed to deploy for some reason?
>
> **Contributor:** The new controller first installs NOO, and then creates the FlowCollector object. When the FlowCollector object is created, NOO installs the network observability stack. If something fails during this phase, NOO enters a degraded state. But if something fails during NOO installation itself, NOO does not enter a degraded state; the ClusterExtension displays some errors.
3. Wait for NOO to be ready.
4. Create the "netobserv" namespace if it doesn't exist.
> **Author:** This is a discussion we've had in the past. I'll resurface this topic, and while this proposal might influence that decision, it has to be in agreement with what Network Observability does.
>
> **Contributor:** Like I mentioned in another comment, netobserv itself doesn't enforce a target namespace, it simply has a default; the CNO can enforce a namespace that differs from the default. So, +1 to what dave says: let's use openshift-network-observability here.
5. Check if a FlowCollector instance exists. If yes, exit.
6. Create a FlowCollector instance.
> **Contributor:** How would CNO signal to the user when one of the steps fails?
>
> **Author:** It will be in the log file.
>
> **Contributor:** What about reporting this via status conditions?
>
> **Contributor:** +1 for a status condition.
>
> **Author:** Yes, that can be done. A sentence was added about this in the last section.
>
> **Member:** Who owns the FlowCollector?
>
> **Contributor:** The FlowCollector is owned by the user. The user can change any field or even delete it to disable NOO.
>
> **Author:** Network Observability Operator (NOO) manages FlowCollector. It is very likely the user will want to change attributes of their FlowCollector instance. If the default features change, it will only affect future clusters that don't have Network Observability already enabled. We don't anticipate problems with different features on different clusters, since this is primarily a single-cluster component. Worst case, we can make a change to update the feature set, but that is something we want to avoid.
>
> **Member:** To summarize the discussion we had a few weeks ago, it was suggested that CNO deploys an empty FlowCollector, and that NOO is then responsible for establishing the default feature set, potentially writing that back to the CRD if it needs to. That way, CNO doesn't need to be aware of NOO defaults in future.
The Reconciler leverages the existing framework and reuses the concepts of client, scheme, and manager. Having a separate controller provides clear ownership. If the Network CR changes, the Reconciler repeats the steps above. Note that it doesn't monitor NOO or any of NOO's components for changes, and it doesn't do any upgrades; those remain the responsibility of NOO.
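The reconcile flow above can be modeled as a pure decision function over observed cluster state. This is only a sketch: the struct, field, and action names are hypothetical, and the real controller would use controller-runtime clients rather than strings.

```go
package main

import "fmt"

// ClusterState is a hypothetical snapshot of what the controller
// observes; the field names are illustrative only.
type ClusterState struct {
	Policy            string // installationPolicy from the Network CR
	NOOInstalled      bool   // step 1: is the operator already there?
	NamespaceExists   bool   // step 4 precondition
	FlowCollectorSeen bool   // step 5 precondition
}

// plan returns the ordered actions the controller would take for a
// given state, following the six steps in the proposal.
func plan(s ClusterState) []string {
	if s.Policy == "DoNotInstall" {
		return nil // opted out: do nothing, never uninstall
	}
	if s.NOOInstalled {
		return nil // step 1: NOO already installed, exit
	}
	actions := []string{"install NOO", "wait for NOO ready"} // steps 2-3
	if !s.NamespaceExists {
		actions = append(actions, "create namespace") // step 4
	}
	if !s.FlowCollectorSeen {
		actions = append(actions, "create FlowCollector") // steps 5-6
	}
	return actions
}

func main() {
	// Fresh cluster with the default policy: all steps run in order.
	fmt.Println(plan(ClusterState{Policy: "InstallAndEnable"}))
}
```

Modeling the decision as a pure function makes the "DoNotInstall never uninstalls" semantics discussed in the review thread easy to see and to unit-test.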
> **Member:** Networking component upgrade is managed via CNO, which, AIUI, has pinned versions for every release. That is a good reason to do so: say we roll out OVN-K8s 4.1.1 and, to observe a new feature, you need NOO 24.1.1 (and maybe a new feature toggled on in the FlowCollector). Managing this through CNO seems easier than asking the user to upgrade NOO from the catalog and edit the FlowCollector themselves.
>
> **Author:** I don't see an issue with NOO upgrading as needed for OVN upgrades. We can make that mostly transparent to the user. In fact, it would make things more complicated if you have to go through COO, because there might be other reasons why you want to upgrade Network Observability. And if you have two ways of upgrading, either through COO or NOO, that just makes things more confusing. Having the tie-in while still being separate and independent gives it the most flexibility.
>
> **Member:** Paraphrasing from our previous call, NOO will always upgrade itself to the latest (I don't recall if that was the default, or user opt-in). CNO doesn't care; we just keep networking up-to-date.
### Repository: openshift/installer

The openshift/installer repository contains the source code for the **openshift-install** binary. This proposal adds the same fields as in the Network CRD, but under the existing `networking` section of the **install-config.yaml** file.
> **Contributor:** The installer lets you pass arbitrary manifests (i.e., other objects to create at install time), and that can include this configuration.
>
> **Author:** Allowing it to be set in install-config.yaml is how it gets day 0 capability. It's the same mechanism that was used to select OVN-Kubernetes as the CNI back in the days when OpenShiftSDN was the default. We've been working with the installer team on this.
>
> **Contributor:** Right, but for example, if you wanted to override the ovn-kubernetes join subnet, you can't do that via install-config. You have to do it by creating a manifest, because most people aren't going to need to do that, so we don't expose it in install-config.
>
> **Author:** The "join subnet" is a very specific configuration for OVN-Kubernetes.
```yaml
apiVersion: v1
baseDomain: devcluster.openshift.com
networking:
  networkObservability:
    installationPolicy: InstallAndEnable | DoNotInstall
```
Listing 2: install-config.yaml

The `installationPolicy` value is passed on to CNO to set the field of the same name in the Network CR. If this field is set to `InstallAndEnable` or doesn't exist, the Network CR's `installationPolicy` field is set to `InstallAndEnable`. To *not* enable Network Observability, set it to `DoNotInstall`; this sets the Network CR's `installationPolicy` field to `DoNotInstall`.
> **Contributor:** (CNO is not involved at install time. The installer would need to create the Network CR itself.)
>
> **Author:** See my comment above.
### FlowCollector Custom Resource (CR)

Here is the FlowCollector Custom Resource (CR) that is instantiated.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    ebpf:
      features:
        - DNSTracking
      sampling: 400
    type: eBPF
  deploymentModel: Service
  loki:
    enable: false
  namespace: netobserv
```

> **Contributor:** I'm not very familiar with network observability, but should the `netobserv` namespace follow the convention of prefixing OpenShift-managed namespaces with `openshift-`?
>
> **Contributor:** That's a good point; the CRD default is "netobserv", but it could make more sense to use "openshift-netobserv" now that it's baked into the installer.
>
> **Author:** This was a discussion from many years ago (NETOBSERV-373). The original description was changed (see the History link). Nevertheless, it's a good point that should be reconsidered. However, this proposal would just follow whatever Network Observability decides to do.
>
> **Contributor:** I was rather thinking that, on the contrary, what we do here can deviate from upstream.
>
> **Contributor:** If it is going to be an openshift-owned thing, deviating from upstream and using an `openshift-` prefix seems reasonable.
>
> **Contributor:** @stleerh can you update the EP to use "openshift-netobserv" or "openshift-network-observability" here?
>
> **Author:** Let's get the code merged and then this EP can be updated to match it.
Listing 3: FlowCollector configuration

Other eBPF features were considered, but the criteria were to avoid features that require privileged mode and features that consume significant resources.
Summary:

* Sampling at 400
> **Contributor:** Does it depend on the in-cluster Prometheus stack? If yes, how does it play with #1880?
>
> **Author:** Yes. PM has said, "The default OpenShift experience should continue to include Prometheus, alerts, and baseline dashboards."
>
> **Contributor:** @simonpasquier I'm currently trying to understand how that plays with optional monitoring, but I don't think that's something tied to this EP (the question is equally relevant when netobserv or any other Red Hat operator is installed from OperatorHub).
>
> **Contributor:** My guess is that, when CMO is in lightweight mode like that, netobserv will show no data and potentially query failures.
>
> **Contributor:** To rephrase my question: is it worth deploying netobserv if there's no platform monitoring stack? I understand that users create sub-optimal setups, but it might be better to prevent it in the first place.
>
> **Contributor:** Whatever option we pick (never install NOO at day 0 when monitoring is disabled, or delegate the decision to the user), I'd rather see it documented somewhere, including what it entails in terms of features/limitations (and this EP seems to be the right place).
>
> **Author:** That should be documented in Network Observability. This EP just enables or does not enable Network Observability.
>
> **Contributor:** @simonpasquier what's the ETA for OptionalMonitoring? We may want to iterate once the feature is live, but at the moment it isn't, so there's little we can do, right?
>
> **Contributor:** On the doc aspects, I think it should be documented on both sides (netobserv & monitoring).
* No Loki
* No Kafka
* DNSTracking feature enabled
### Workflow Description

Network Observability is enabled by default on day 0 (the planning stage). You don't have to configure anything when using `openshift-install`: the Network Observability Operator is installed and a FlowCollector custom resource (CR) is created (Listing 3 above).

If you don't want Network Observability enabled, first create the **install-config.yaml** file using the command below.

`$ openshift-install create install-config`
Then add the following, as shown in Listing 4.

```yaml
networking:
  networkObservability:
    installationPolicy: DoNotInstall
```

Listing 4: Don't enable Network Observability in install-config.yaml
Here's an alternate approach. Using your **install-config.yaml** file, you can create manifests from it and make the change there instead. To create the manifests, enter:

`$ openshift-install create manifests`

This creates a **manifests** directory. Of particular relevance in this directory is a file named **cluster-network-02-config.yml**, which is the Network CR. Under the `spec` section, add the following, as shown in Listing 5.
```yaml
spec:
  networkObservability:
    installationPolicy: DoNotInstall
```

Listing 5: Don't enable Network Observability in the Network CR
Finally, to create the cluster, enter:

`$ openshift-install create cluster`

When you bring up the OpenShift web console, you should see that NOO is installed just as it would be if you had gone to **Ecosystem > Software Catalog** and installed **Network Observability** from Red Hat (not the Community version). In **Installed Operators**, there should be a row for **Network Observability**, and in the **Observe** menu, there should be a panel named **Network Traffic**.
|
|
||
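You can also verify from the command line with commands along these lines; the namespace is an assumption and may differ in your cluster:

```shell
# Confirm the operator's ClusterServiceVersion is installed
# (the namespace shown here is an assumption)
oc get csv -n openshift-netobserv-operator

# Confirm the default FlowCollector instance exists
oc get flowcollector cluster
```
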
The Technology Preview (TP) release will have a feature gate named `NetworkObservabilityInstall` that needs to be enabled. To enable this on day 0, enter:

```
$ openshift-install create install-config
$ openshift-install create manifests
```

Now create a file named **99-feature-gate.yml** in the **manifests** directory with the following:

```yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - NetworkObservabilityInstall
```

Then enter:

`$ openshift-install create cluster`

If you have a running cluster, you can update the feature gate by entering `oc edit featuregate` and making the changes shown above.

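Equivalently, the change can be applied non-interactively with a merge patch; this is a sketch, and note that setting `featureSet: CustomNoUpgrade` is irreversible and makes the cluster ineligible for upgrades:

```shell
# Enable the NetworkObservabilityInstall feature gate on a running cluster
oc patch featuregate cluster --type=merge \
  -p '{"spec":{"featureSet":"CustomNoUpgrade","customNoUpgrade":{"enabled":["NetworkObservabilityInstall"]}}}'
```
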
At General Availability (GA), the feature gate for this feature will be enabled by default, so you no longer need to modify the FeatureGate resource.

### API Extensions

See Listing 2 above for the changes to the Network CRD.

### Topology Considerations

#### Hypershift / Hosted Control Planes

This proposal doesn't change how Network Observability works in a Hosted Control Plane (HCP) environment. Network Observability is supported on hosted clusters and the management cluster; therefore, it will be enabled by default. No changes to HyperShift's controller behaviors are needed to support this.

#### Standalone Clusters

This proposal applies to standalone clusters.

#### Single-node Deployments or MicroShift

Due to resource constraints, Single Node OpenShift (SNO) is an exception: Network Observability will not be enabled by default there.

MicroShift is not supported, since Network Observability and CNO are not supported on that platform.

#### OpenShift Kubernetes Engine

OpenShift Kubernetes Engine is supported.

### Implementation Details/Notes/Constraints

### Risks and Mitigations

* Network Observability requires CPU, memory, and storage that the customer might not be aware of. See the Test Plan section for the target goals.

  **Mitigation:** The default setting stores only metrics, at a high sampling interval, to minimize the use of resources. If this isn’t sufficient, more fine-tuning and filtering can be done in the provided default configuration (e.g. filtering on specific interfaces only).

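As an illustration of that kind of fine-tuning, a FlowCollector could be trimmed along these lines. The field names follow the `flows.netobserv.io/v1beta2` API, but the values shown (sampling interval, interface names) are assumptions for illustration, not the actual defaults this enhancement will ship.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  deploymentModel: Direct   # no Kafka
  loki:
    enable: false           # metrics only; no flow log storage
  agent:
    type: eBPF
    ebpf:
      sampling: 400               # high sampling interval: 1 in 400 packets (assumed value)
      interfaces: ["br-ex"]       # assumption: restrict capture to specific interfaces
      features:
      - DNSTracking
```
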
* Some of the Network Observability features aren’t enabled in order to use minimal resources. Therefore, users might not know about these features.

  **Mitigation:** Determine what features, particularly related to troubleshooting, can be enabled with minimal CPU and memory impact. Mention other features in the panels.

### Drawbacks

Rather than actually installing NOO and creating the FlowCollector instance, it would be less risky and simpler to just display a panel or button that lets the user install and enable Network Observability. This resolves the awareness issue. However, it goes against the principle that networking and network observability should always go hand in hand and be there from the start.

## Alternatives (Not Implemented)

### Alternative #1: Make NOO a core component of OpenShift

Rather than have CNO enable Network Observability, take the existing Network Observability Operator (NOO) and install it by default in the cluster. Some logic would be needed to accept the values in `openshift-install` and decide whether NOO should be enabled or not.

The core components of OpenShift are operators like Cluster Network Operator (CNO) and Cluster Storage Operator (CSO). NOO is a much smaller component and should not reside at the top level. In addition, new operators are currently not being accepted into the OpenShift release payload, which rules out this alternative in practice.

### Alternative #2: Have COO enable Network Observability

Instead of CNO enabling Network Observability, have Cluster Observability Operator (COO) do it instead. COO is becoming the operator and the central place for core observability components to be installed. In addition, it provides services like metrics, Perses for dashboards, and troubleshooting via Korrel8r (Observability Signal Correlation).

A critical issue is that COO is itself an optional operator, so it can’t enable Network Observability on day 0, because it has to be installed first. The central question is, "Is Network Observability part of OpenShift Networking or part of Cluster Observability?" The answer is the former. Component-based observability, such as Network Observability, is a layer on top of COO rather than a part of COO.

### Alternative #3: Have CVO enable Network Observability

Similar to alternative #1, this explicitly suggests having the Cluster Version Operator (CVO) enable Network Observability. CVO currently manages larger-scope operators that represent core cluster functions, such as CNO and CSO, rather than specific operators like Network Observability.

## Test Plan

The test plan will consider the following:

- Different architectures
- Different cluster sizes
- Hosted Control Plane (HCP) environment
- e2e tests in [OpenShift Release Tooling](https://github.com/openshift/release)

Performance testing will be done to optimize the use of resources and to determine the specific FlowCollector settings, with the goal of using less than 5% of resources (CPU and memory), and an ideal target of less than 3%, including external components that are affected.

## Graduation Criteria

### Dev Preview -> Tech Preview

Network Observability reached GA back in January 2023. Because this feature simply enables Network Observability, which has already existed for 3+ years, the plan is to forgo the Dev Preview and go directly to Tech Preview.

### Tech Preview -> GA

There are many different customer scenarios and cluster profiles. The Tech Preview will allow us to gauge the customer responses and make optimizations to the FlowCollector configuration or even the Network CRD if necessary. To enable the feature gate for this feature, see the **Workflow Description** above.

Here are the GA requirements.

* [NETOBSERV-2533](https://issues.redhat.com/browse/NETOBSERV-2533) Performance testing in Loki-less mode with default settings
  - Provide guidance on CPU, memory, and storage resources
  - Measure the impact on Prometheus in the in-cluster monitoring stack
  - Optimize the default FlowCollector configuration
* [NETOBSERV-2534](https://issues.redhat.com/browse/NETOBSERV-2534) Have a way to pause Network Observability functions
* [NETOBSERV-2535](https://issues.redhat.com/browse/NETOBSERV-2535) Security audit on Network Observability code
* [NETOBSERV-2428](https://issues.redhat.com/browse/NETOBSERV-2428) New Service deployment model
* User-facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/)

### Removing a deprecated feature

N/A

## Upgrade / Downgrade Strategy

The upgrade strategy is treated like any other feature's. At Tech Preview, you will need to enable the feature gate for this feature. At GA, Network Observability will be enabled by default without additional user intervention.

On a downgrade, Network Observability will no longer be enabled. If the Network Observability Operator and/or a FlowCollector instance exist, they will remain and will not be removed.

## Version Skew Strategy

There are no issues with version skew, since the logic to enable Network Observability resides only in CNO.

## Operational Aspects of API Extensions

N/A

## Support Procedures

Check the CNO logs and search for "observability_controller.go" to determine whether Network Observability did or did not get enabled. This will also be reported in the status conditions.
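
A sketch of what that check might look like; the CNO deployment and namespace names shown are the standard ones, but treat them as assumptions for your environment:

```shell
# Search the CNO logs for messages from the Network Observability controller
oc logs -n openshift-network-operator deployment/network-operator \
  | grep observability_controller.go

# Inspect the reported status conditions (assumption: reported on the
# operator Network CR named "cluster")
oc get network.operator cluster -o jsonpath='{.status.conditions}'
```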