Skip to content

rfc: secret sharing#2341

Open
davidweisse wants to merge 1 commit into
mainfrom
dav/rfc-13
Open

rfc: secret sharing#2341
davidweisse wants to merge 1 commit into
mainfrom
dav/rfc-13

Conversation

@davidweisse
Copy link
Copy Markdown
Member

No description provided.

@davidweisse davidweisse requested a review from burgerdev April 22, 2026 14:04
@davidweisse davidweisse added the no changelog PRs not listed in the release notes label Apr 22, 2026
Copy link
Copy Markdown
Member

@burgerdev burgerdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, great proposal!

Comment thread rfc/013-secret-sharing.md Outdated
Comment thread rfc/013-secret-sharing.md Outdated
Comment thread rfc/013-secret-sharing.md
Comment thread rfc/013-secret-sharing.md Outdated

### UserAPI Changes

For the implementation we can use a secret sharing scheme based on OpenBao's [Shamir](https://pkg.go.dev/github.com/openbao/openbao/sdk/v2/helper/shamir) go library.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason for using OpenBao's version over the original Hashiscorp version?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@burgerdev wdyt?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vault was relicensed to BSL: https://github.com/hashicorp/vault/blob/main/LICENSE. The headers still say MPL sometimes, but it's maybe safer to use the fork.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also looks like Bao is better maintained here: openbao/openbao#182

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For MarbleRun, we decided to vendor shamir.go because we didn't want to add a huge depdendency like Vault or OpenBao SDK to the go.mod just for this small, stable part of code.
We decided to take it from this commit: https://github.com/hashicorp/vault/tree/8d07273d14ae7f5a48cc96f66cc86615dea83390/shamir
It comes with the license next to it. These two files may just be copied like here https://github.com/edgelesssys/marblerun/tree/master/3rdparty/hashicorp/shamir
We also looked at the changes in OpenBao to it since then and concluded that none of them are needed.

Comment thread rfc/013-secret-sharing.md
```

This threshold defaults to the number of provided seed share owner public keys, so by default every seed share owner would need to be involved in the recovery process.
The lowest value would be 1, in which case the behavior is the same as the current implementation, allowing every seed share owner to recover the Coordinator on their own.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do note that SSS requires at least 2 shares, meaning the old implementation has to be kept for this case

Comment thread rfc/013-secret-sharing.md Outdated
Comment on lines +71 to +77
The `AuthorizeByManifest` function can now additionally take the seed shares as an argument to reconstruct the seed if the threshold is met and return the received seed share otherwise:

```go
type SecretSourceAuthorizer interface {
AuthorizeByManifest(ctx context.Context, manifest *manifest.Manifest, seedShares [][]byte) (seedengine.SeedEngine, crypto.PublicKey, []byte, error)
}
```
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the seed shares are not relevant for all implementations, wouldn't it make more sense to make this internal to seedAuthorizer?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

@sespiros sespiros self-requested a review May 5, 2026 12:05
Copy link
Copy Markdown
Member

@burgerdev burgerdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@burgerdev burgerdev requested review from msanft and thomasten May 6, 2026 08:53
Copy link
Copy Markdown
Member

@msanft msanft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, but I fail to see the practical motivation for this.

Comment thread rfc/013-secret-sharing.md
Comment thread rfc/013-secret-sharing.md
### UserAPI Changes

For the implementation we can use a secret sharing scheme based on OpenBao's [Shamir](https://pkg.go.dev/github.com/openbao/openbao/sdk/v2/helper/shamir) go library.
During the initial `contrast set`, the seed is split into shares using the specified threshold and the provided seed share owner public keys, and each share is encrypted with the corresponding public key and returned to the user.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a goal of Contrast to allow up-/downscaling of seed share owners on a live deployment? At least the latter wouldn't work anymore with this, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scaling down the seed share owners would need to generate a new seed (otherwise, assumptions about prior manifests' security might not hold anymore). At that point, it's probably easier to just start a fresh manifest history.

Comment thread rfc/013-secret-sharing.md
Comment on lines +24 to +25

For now, we can say the solution is to scale down the Coordinator to a single instance.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also what we do in MarbleRun btw

Comment thread rfc/013-secret-sharing.md Outdated

### UserAPI Changes

For the implementation we can use a secret sharing scheme based on OpenBao's [Shamir](https://pkg.go.dev/github.com/openbao/openbao/sdk/v2/helper/shamir) go library.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For MarbleRun, we decided to vendor shamir.go because we didn't want to add a huge depdendency like Vault or OpenBao SDK to the go.mod just for this small, stable part of code.
We decided to take it from this commit: https://github.com/hashicorp/vault/tree/8d07273d14ae7f5a48cc96f66cc86615dea83390/shamir
It comes with the license next to it. These two files may just be copied like here https://github.com/edgelesssys/marblerun/tree/master/3rdparty/hashicorp/shamir
We also looked at the changes in OpenBao to it since then and concluded that none of them are needed.

Comment thread rfc/013-secret-sharing.md

If the Coordinator is in recovery mode, this list is used to track the received seed shares via the `Recover` RPC of the user API.
Peer recovery still works as before, since there are no seed shares involved.
If a `Recover` call from the user is made where the newly received seed share combined with the already received shares meets the threshold, the seed is reconstructed and the Coordinator can be recovered normally.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should the Coordinator behave if the wrong seed is reconstructed because one of the inputs was wrong?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this would imply a malicious seed share owner? The Coordinator has no way to identify a wrong share, so it will create a new seed engine anyway. This should however be immediately noticeable when attesting the Coordinator with contrast verify because the root CA changed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the signature of the latest transition is checked during recovery, so directly after the seed engine has been created. If the seed engine is wrong, the user that called Recover last will receive an error verifying latest transition.... Since the Coordinator doesn't know which seed share is incorrect, maybe we have to transition to a new state with no seed shares and start again?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we can't verify the signature with the combined seed we should throw all shares away and start from scratch.

Comment thread rfc/013-secret-sharing.md
When the `Recover` RPC is called and the state is stale, the `ResetState` method of the guard is invoked, which takes a `SecretSourceAuthorizer` as an argument.
This interface currently only has the method `AuthorizeByManifest`, which validates the peer given the manifest from the store and returns the new seed engine and mesh CA key.
This is used both by the user API for `contrast recover` and by the mesh API during peer recovery.
The `AuthorizeByManifest` function now additionally returns the reconstructed seed if the threshold is met and the received seed share otherwise:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/and/or

Comment thread rfc/013-secret-sharing.md
Comment on lines +3 to +8
## Background

During an initial `contrast set` the secret seed is encrypted with each seed share owner's public key and returned to the user.
This means that each seed share owner has access to the complete seed, and thus the ability to recover the Coordinator in case of a restart.
The original idea was to use a secret sharing scheme to split the seed into multiple shares, and require a threshold of them to recover the seed.
This would have the advantage of not giving any single seed share owner access to the complete seed.
Copy link
Copy Markdown
Collaborator

@sespiros sespiros May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The background section should mention the motivation behind this change more explicitly (by incorporating Markus' comment about mutually distrusting data owners or the more theoretical benefit of multi-party recovery).

Another necessity that also emerges then is that we need to extend our threat model to include a new actor, the less-than-fully-trusted seed share owner. This needs to be described either as part of this proposal or at least be documented as follow-up TODO to our threat model (security.md and threat model docs page).

Comment thread rfc/013-secret-sharing.md
seedEngine *seedengine.SeedEngine
...

seedShares [][]byte // new field to track the received seed shares
Copy link
Copy Markdown
Collaborator

@sespiros sespiros May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overloading State's stale state with another type of "staleness", for when we are in the middle of recovery, is confusing. A manifestation of this "confusion" is having to change AuthorizeByManifest's contract to behave differently depending on whether we have shares in flight or not.

Reading the code there, it seems to me that Guard could be a more suitable home for the intermediate shares until a threshold is met. An example sketch of a design that seems more straight-forward to me:

type Guard struct {
    ...
    partialMu sync.Mutex
    // nil when no user recovery is in flight.
    // Replaced as a whole on each share submission..
    // Never mutated in place because callers retain the pointer as a snapshot after AddSeedShare returns.
    partialRecovery *PartialRecovery
}

// PartialRecovery is the immutable accumulator of seed shares submitted
// during a multi-party user recovery. It is bound to a specific recovery
// target (insecure latest transition + manifest) and a single salt.
type PartialRecovery struct {
    TransitionHash [history.HashSize]byte // hist.GetLatestInsecure() at first submission
    Salt           []byte                  // captured from the first submitter
    Shares         map[ShareID][]byte      // keyed by submitter pubkey. Treated as read-only
}

// ShareID identifies a seed share submitter by their hex-encoded RSA
// public key, matching an entry in Manifest.SeedshareOwnerPubKeys.
type ShareID = manifest.HexString

func (g *Guard) AddSeedShare(ctx context.Context, callerID ShareID, share, salt []byte) (*PartialRecovery, error)

Then AddSeedShare would implement the logic that checks for errors, checks the threshold etc. If the threshold is met, it would reconstruct the seed, build a seedAuthorizer, call Guard.ResetState etc. It could be made to also return the complete seed, this is just an example.

We would clear PartialRecovery in case of either success or error. This is also in line with the other dicsussion about what to do is the wrong seed is reconstructed etc.

Copy link
Copy Markdown
Collaborator

@sespiros sespiros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this proposal. A couple of comments inline. The 2 most important ones are:

  • the consideration around how this should affect our threat model
  • the suggestion to put the shares in Guard instead of State

The rest are really comments on the implementation.

Comment thread rfc/013-secret-sharing.md

```go
type SecretSourceAuthorizer interface {
AuthorizeByManifest(ctx context.Context, manifest *manifest.Manifest) (seedengine.SeedEngine, crypto.PublicKey, []byte, error)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is meant to be *seedengine.SeedEngine and *ecdsa.PrivateKey but ideally see my other comment that suggests not changing this method's contract at all.

Comment thread rfc/013-secret-sharing.md
If the Coordinator is in recovery mode, this list is used to track the received seed shares via the `Recover` RPC of the user API.
Peer recovery still works as before, since there are no seed shares involved.
If a `Recover` call from the user is made where the newly received seed share combined with the already received shares meets the threshold, the seed is reconstructed and the Coordinator can be recovered normally.
If the list contains some number of seed shares below the threshold and peer recovery triggers, the Coordinator is recovered via peer recovery as expected.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding from the proposal section was that in the first version we would need to scale down the coordinator to one instance which would be in non-recovered state right? So how is it possible for peer recovery to trigger mid-user-recovery via shares?

Is there a scenario where while we are scaling down to one instance, a non-recovered coordinator with seed shares accumulated, does peer recover from a recovered peer before the recovered one gets torn down?

Should it be specified more like:

  • if a recovered coordinator exists, use peer recovery
  • if no recovered coordinator exists, scale down to one and perform threshold user recovery
  • don't use --force for threshold recovery while recovered peers exist

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of an edge case: if one recovered Coordinator and one non-recovered Coordinator exist (e.g., 2 Coordinators and I restart one), and I call recover with --force, the non-recovered Coordinator can accumulate seed shares before peer recovery triggers. I think otherwise the behavior is the same as before, i.e., prefer peer recovery when --force is not specified. Scaling down to 1 for user recovery should then be documented in the docs.

Comment thread rfc/013-secret-sharing.md

We implement a secret sharing scheme for the seed, and each seed share owner only has an encrypted seed share.
To recover the Coordinator, a threshold of seed shares need to be combined to recover the seed.
This is done by each seed share owner calling `contrast recover` with their encrypted seed share, and the Coordinator combines the shares and recovers the seed if the threshold is met.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically seed share owners call contrast recover with a plain text seed share (but over encrypted transport/aTLS).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The seed share is encrypted on disk, so from a user perspective we call contrast recover with the encrypted seed share which is then decrypted by the CLI before sending it to the Coordinator.

Comment thread rfc/013-secret-sharing.md
If the Coordinator has enough seed shares to recover the seed but one of the seed shares is invalid, through, for example, a malicious seed share owner, the following signature verification of the latest transition will fail.
In that case, we don't know which seed shares are valid or invalid, so the only option is to reset the state and start the recovery process from scratch with no seed shares.

The proto messages for the `Recover` API remain basically unchanged, now including the encrypted seed share instead of the encrypted seed:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is obvious but I think it should be explicitly mentioned here that each share owner is expected to send the same Salt as part of RecoverRequest and that this should be validated by the coordinator (that's why I included the Salt in my suggested design change).

I have also made the comment in the proposal section but, wdym by "encrypted seed share", is this encrypted with something else that I have missed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I forgot about that. It only makes sense to also validate the salt.

The proto paragraph needs rewriting: the recover request now includes the seed share (not encrypted), and the SeedShare message type belongs to the SetManifestResponse, where the encrypted seed share is correct.

Comment thread rfc/013-secret-sharing.md
If the Coordinator has enough seed shares to recover the seed but one of the seed shares is invalid, through, for example, a malicious seed share owner, the following signature verification of the latest transition will fail.
In that case, we don't know which seed shares are valid or invalid, so the only option is to reset the state and start the recovery process from scratch with no seed shares.

The proto messages for the `Recover` API remain basically unchanged, now including the encrypted seed share instead of the encrypted seed:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at

cipherText, err := rsa.EncryptOAEP(sha256.New(), rand.Reader, pubKey, seed, []byte("seedshare"))

I wonder if we should have a more specific label for these new shares (I gather this is like aad in aead), that would also bind them to properties of their intended use like "intended to be used by sss", the threshold, total share count etc.

I see this as a robustness/ux improvement that would help in potential early mismatch detection/cleaner domain separation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do something like "seedshare-k/n", for a k out of n threshold scheme. Both the user and the Coordinator have access to the manifest, so they both know the threshold and number of seed share owners. I'm not sure how useful this is though. This would only help when a user accidentally runs contrast recover with a seed share from a threshold scheme with different parameters. Maybe even "seedshare-k-{hash(seedshare owner keys)}".

But keep in mind that the untrusted seed share owner can simply choose the label himself and forge an incorrect share with the correct label, so this would only be a UX improvement, not adding any robustness in that scenario.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no changelog PRs not listed in the release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants