[update] add missing sleds to contact_support checks#10476

Open
karencfv wants to merge 3 commits into oxidecomputer:main from karencfv:find-missing-sleds

Conversation

@karencfv
Contributor

Follow up to #10271

Closes #4745

@karencfv karencfv marked this pull request as draft May 21, 2026 09:15
Comment thread nexus/src/app/update.rs
/// Build a map of version strings to the number of components on that
/// version
async fn component_version_counts(
async fn get_internal_update_status(
Contributor Author

Most of the work was already done here; I just extracted the bit that creates internal_views::UpdateStatus and used it to determine missing sleds.

@karencfv karencfv marked this pull request as ready for review May 21, 2026 09:41
Contributor

@sunshowers sunshowers left a comment

Looks good overall! Just a few questions and comments.

Comment thread nexus/src/app/update.rs
.iter()
.filter(|sled| {
// `unknown()` returns the represenation of the update status
// for a given sled ID that isn't present in inventory.
Contributor

Is this case also hit when a sled is present in inventory but there's no last reconciliation status?

Comment thread nexus/src/app/update.rs
Comment on lines +131 to +135
**sled
== internal_views::SledAgentUpdateStatus::unknown(
sled.sled_id,
)
})
Contributor

Hmm, looking at SledAgentUpdateStatus, the unknown state isn't recorded as an enum variant. This is pre-existing but I'm a little worried about the fragility of things like string equality here:

host_phase_2: HostPhase2Status {
boot_disk: Err("unknown".to_string()),
slot_a_version: TufRepoVersion::Unknown,
slot_b_version: TufRepoVersion::Unknown,
},
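A minimal sketch of one way to address this concern, under stated assumptions: the types below are simplified, hypothetical stand-ins and not the real `internal_views` definitions, and `BootDiskStatus` and `is_unknown` are names invented here for illustration. The idea is to give "unknown" its own variant so callers match on it instead of comparing against a sentinel string.

```rust
/// Hypothetical stand-in for the real `TufRepoVersion`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum TufRepoVersion {
    Unknown,
    Version(String),
}

/// Hypothetical replacement for an `Err("unknown")` sentinel string.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum BootDiskStatus {
    /// The sled was not found in inventory, so nothing is known.
    Unknown,
    /// Inventory reported an error while determining the boot disk.
    Error(String),
    /// Inventory reported the active slot.
    Slot(char),
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct HostPhase2Status {
    boot_disk: BootDiskStatus,
    slot_a_version: TufRepoVersion,
    slot_b_version: TufRepoVersion,
}

impl HostPhase2Status {
    /// Status used for a sled that is absent from inventory.
    fn unknown() -> Self {
        HostPhase2Status {
            boot_disk: BootDiskStatus::Unknown,
            slot_a_version: TufRepoVersion::Unknown,
            slot_b_version: TufRepoVersion::Unknown,
        }
    }

    /// True only for the dedicated variant, never for an error
    /// whose message happens to read "unknown".
    fn is_unknown(&self) -> bool {
        matches!(self.boot_disk, BootDiskStatus::Unknown)
    }
}
```

With this shape, a real inventory error whose text is "unknown" can no longer be mistaken for a sled that is missing from inventory.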

Comment thread nexus/src/app/update.rs
Comment on lines 595 to 601
/// - No sagas have been running for longer than an hour.
/// - An inventory collection exists
/// - No update is in progress, or an update is in progress and the last
/// blueprint created is not older than the value of
/// STUCK_UPDATE_THRESHOLD.
/// - All zpools are online.
/// - All enabled SMF services are in an online state.
Contributor

Does this comment need an update?

Comment thread nexus/src/app/update.rs
}

#[test]
fn test_problems_missing_sleds() {
Contributor

Do we need a reconfigurator-cli test for this?

Comment thread nexus/src/app/update.rs
}
};

let missing_sleds: BTreeSet<SledUuid> = self
Contributor

Does omdb need to show this information? Seems like for support it would be better than grepping through logs.

Comment thread nexus/src/app/update.rs
Comment on lines 712 to 718
let expected_sleds = self
.datastore()
.sled_list_all_batched(
opctx,
SledFilter::SpsUpdatedByReconfigurator,
)
.await?
Contributor

Hmm so this will do a live db query, but inventory is cached. Does this mean that when a sled is newly commissioned, contact_support might be true for a short period of time?
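To make the scenario in this question concrete, here is a rough sketch that models the check as a plain set difference. This is a simplification chosen for illustration, not the PR's actual filter on `SledAgentUpdateStatus::unknown()`; `SledId` and `missing_sleds` are hypothetical names, and whether the real code behaves this way depends on details not shown in this thread.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for `SledUuid`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct SledId(u32);

/// Sleds that the (live) expected set contains but the (possibly stale)
/// inventory snapshot does not.
fn missing_sleds(
    expected: &BTreeSet<SledId>,
    in_inventory: &BTreeSet<SledId>,
) -> BTreeSet<SledId> {
    expected.difference(in_inventory).copied().collect()
}
```

In this model, a sled that was added to the expected set after the inventory snapshot was taken is reported as missing until the next collection includes it.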

Comment thread nexus/src/app/update.rs
Comment on lines +748 to +751
async fn component_version_counts(
&self,
status: internal_views::UpdateStatus,
) -> Result<BTreeMap<String, usize>, Error> {
Contributor

Two notes:

  1. This doesn't need to be async or take &self, I think.
  2. Small nit: thoughts on reworking this to take a reference to UpdateStatus?
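The two notes above could look roughly like the following sketch. It assumes a heavily simplified, hypothetical `UpdateStatus` with a single `component_versions` field (the real `internal_views::UpdateStatus` is not shown in this thread), and it drops the `Result` wrapper on the assumption that counting cannot fail.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for `internal_views::UpdateStatus`.
struct UpdateStatus {
    /// One version string per tracked component.
    component_versions: Vec<String>,
}

/// Build a map of version strings to the number of components on that
/// version. A plain synchronous function that only borrows its input.
fn component_version_counts(status: &UpdateStatus) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for version in &status.component_versions {
        *counts.entry(version.clone()).or_insert(0) += 1;
    }
    counts
}
```

Borrowing leaves the caller free to keep using the status afterwards, for example to compute the missing sleds from the same value.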

Comment thread nexus/src/app/update.rs
.sleds
.iter()
.filter(|sled| {
// `unknown()` returns the represenation of the update status
Contributor

nit:

Suggested change
- // `unknown()` returns the represenation of the update status
+ // `unknown()` returns the representation of the update status

Development

Successfully merging this pull request may close these issues.

Good-enough (mvp) tool for rack health checks to support automated updates

2 participants