[update] add missing sleds to contact_support checks#10476
Conversation
| /// Build a map of version strings to the number of components on that | ||
| /// version | ||
| async fn component_version_counts( | ||
| async fn get_internal_update_status( |
There was a problem hiding this comment.
Most of the work was already done here, I just extracted the bit that was creating internal_views::UpdateStatus and used that to determine missing sleds.
sunshowers
left a comment
There was a problem hiding this comment.
Looks good overall! Just a few questions and comments.
| .iter() | ||
| .filter(|sled| { | ||
| // `unknown()` returns the represenation of the update status | ||
| // for a given sled ID that isn't present in inventory. |
There was a problem hiding this comment.
Is this case also hit when a sled is present in inventory but there's no last reconciliation status?
| **sled | ||
| == internal_views::SledAgentUpdateStatus::unknown( | ||
| sled.sled_id, | ||
| ) | ||
| }) |
There was a problem hiding this comment.
Hmm, looking at SledAgentUpdateStatus, the unknown state isn't recorded as an enum variant. This is pre-existing but I'm a little worried about the fragility of things like string equality here:
omicron/nexus/types/src/internal_api/views.rs
Lines 677 to 681 in 2fb3663
| /// - No sagas have been running for longer than an hour. | ||
| /// - An inventory collection exists | ||
| /// - No update is in progress, or an update is in progress and the last | ||
| /// blueprint created is not older than the value of | ||
| /// STUCK_UPDATE_THRESHOLD. | ||
| /// - All zpools are online. | ||
| /// - All enabled SMF services are in an online state. |
There was a problem hiding this comment.
Does this comment need an update?
| } | ||
|
|
||
| #[test] | ||
| fn test_problems_missing_sleds() { |
There was a problem hiding this comment.
Do we need a reconfigurator-cli test for this?
| } | ||
| }; | ||
|
|
||
| let missing_sleds: BTreeSet<SledUuid> = self |
There was a problem hiding this comment.
Does omdb need to show this information? Seems like for support it would be better than grepping through logs.
| let expected_sleds = self | ||
| .datastore() | ||
| .sled_list_all_batched( | ||
| opctx, | ||
| SledFilter::SpsUpdatedByReconfigurator, | ||
| ) | ||
| .await? |
There was a problem hiding this comment.
Hmm so this will do a live db query, but inventory is cached. Does this mean that when a sled is newly commissioned, contact_support might be true for a short period of time?
| async fn component_version_counts( | ||
| &self, | ||
| status: internal_views::UpdateStatus, | ||
| ) -> Result<BTreeMap<String, usize>, Error> { |
There was a problem hiding this comment.
Two notes:
- This doesn't need to either be
asyncor take&self, I think. - Small nit: thoughts on reworking this to take a reference to
UpdateStatus?
| .sleds | ||
| .iter() | ||
| .filter(|sled| { | ||
| // `unknown()` returns the represenation of the update status |
There was a problem hiding this comment.
nit:
| // `unknown()` returns the represenation of the update status | |
| // `unknown()` returns the representation of the update status |
Follow up to #10271
Closes #4745