Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
99 changes: 99 additions & 0 deletions rfc/013-secret-sharing.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
# Secret Sharing

## Background

During an initial `contrast set` the secret seed is encrypted with each seed share owner's public key and returned to the user.
This means that each seed share owner has access to the complete seed, and thus the ability to recover the Coordinator in case of a restart.
The original idea was to use a secret sharing scheme to split the seed into multiple shares, and require a threshold of them to recover the seed.
This would have the advantage of not giving any single seed share owner access to the complete seed.
Comment thread
msanft marked this conversation as resolved.
Comment on lines +3 to +8
Copy link
Copy Markdown
Collaborator

@sespiros sespiros May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The background section should mention the motivation behind this change more explicitly (by incorporating Markus' comment about mutually distrusting data owners or the more theoretical benefit of multi-party recovery).

Another necessity that also emerges then is that we need to extend our threat model to include a new actor, the less-than-fully-trusted seed share owner. This needs to be described either as part of this proposal or at least be documented as follow-up TODO to our threat model (security.md and threat model docs page).

It also comes with the drawback that the seed share owners would need to coordinate to recover the seed, which adds complexity to the recovery process.

## Proposal
Comment thread
burgerdev marked this conversation as resolved.

We implement a secret sharing scheme for the seed, and each seed share owner only has an encrypted seed share.
To recover the Coordinator, a threshold of seed shares need to be combined to recover the seed.
This is done by each seed share owner calling `contrast recover` with their encrypted seed share, and the Coordinator combines the shares and recovers the seed if the threshold is met.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically seed share owners call contrast recover with a plain text seed share (but over encrypted transport/aTLS).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The seed share is encrypted on disk, so from a user perspective we call contrast recover with the encrypted seed share which is then decrypted by the CLI before sending it to the Coordinator.


This approach only works if there is only one Coordinator instance, since `contrast recover` only connects to one Coordinator instance.
If there are multiple instances, the seed shares would need to be distributed to all of them.
In the current peer recovery process, Coordinators that enter a stale state call the `Recover` RPC of the mesh API of other Coordinators to recover.
In order to distribute the seed shares to all instances however, we need an RPC in the opposite direction, where the Coordinator in recovery can send the received seed shares to the other Coordinators.

When implementing this, we need to additionally consider concurrent recovery attempts on different Coordinator instances.
If 2 seed shares remain to meet the threshold and 2 different seed share owners call `contrast recover` at the same time, both Coordinators have to update each other's state without running into an error.

For now, we can say the solution is to scale down the Coordinator to a single instance.
Comment on lines +24 to +25
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also what we do in MarbleRun btw


### Manifest Changes

In the manifest, we add an additional field `SeedShareOwnerThreshold` which specifies the number of seed shares that need to be combined to recover the seed:

```json
{
...
"SeedShareOwnerThreshold": 3,
"SeedShareOwnerPubKeys": [ ... ]
}
```

This threshold defaults to the number of provided seed share owner public keys, so by default every seed share owner would need to be involved in the recovery process.
The lowest value would be 1, in which case the behavior is the same as the current implementation, allowing every seed share owner to recover the Coordinator on their own.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do note that SSS requires at least 2 shares, meaning the old implementation has to be kept for this case

Like the seed share owner public keys, this field can't be updated after the initial `contrast set`.

### UserAPI Changes

For the implementation we can use a secret sharing scheme based on [HashiCorp's](https://github.com/hashicorp/vault/tree/8d07273d14ae7f5a48cc96f66cc86615dea83390/shamir) or [OpenBao's](https://pkg.go.dev/github.com/openbao/openbao/sdk/v2/helper/shamir) Shamir go library.
During the initial `contrast set`, the seed is split into shares using the specified threshold and the provided seed share owner public keys, and each share is encrypted with the corresponding public key and returned to the user.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a goal of Contrast to allow up-/downscaling of seed share owners on a live deployment? At least the latter wouldn't work anymore with this, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scaling down the seed share owners would need to generate a new seed (otherwise, assumptions about prior manifests' security might not hold anymore). At that point, it's probably easier to just start a fresh manifest history.


For recovery, the Coordinator needs to track the received seed shares and combine them when the threshold is met.
For this, we add a new field to the `State` struct:

```go
type State struct {
seedEngine *seedengine.SeedEngine
...

seedShares [][]byte // new field to track the received seed shares
Copy link
Copy Markdown
Collaborator

@sespiros sespiros May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overloading State's stale state with another type of "staleness", for when we are in the middle of recovery, is confusing. A manifestation of this "confusion" is having to change AuthorizeByManifest's contract to behave differently depending on whether we have shares in flight or not.

Reading the code there, it seems to me that Guard could be a more suitable home for the intermediate shares until a threshold is met. An example sketch of a design that seems more straight-forward to me:

type Guard struct {
    ...
    partialMu sync.Mutex
    // nil when no user recovery is in flight.
    // Replaced as a whole on each share submission..
    // Never mutated in place because callers retain the pointer as a snapshot after AddSeedShare returns.
    partialRecovery *PartialRecovery
}

// PartialRecovery is the immutable accumulator of seed shares submitted
// during a multi-party user recovery. It is bound to a specific recovery
// target (insecure latest transition + manifest) and a single salt.
type PartialRecovery struct {
    TransitionHash [history.HashSize]byte // hist.GetLatestInsecure() at first submission
    Salt           []byte                  // captured from the first submitter
    Shares         map[ShareID][]byte      // keyed by submitter pubkey. Treated as read-only
}

// ShareID identifies a seed share submitter by their hex-encoded RSA
// public key, matching an entry in Manifest.SeedshareOwnerPubKeys.
type ShareID = manifest.HexString

func (g *Guard) AddSeedShare(ctx context.Context, callerID ShareID, share, salt []byte) (*PartialRecovery, error)

Then AddSeedShare would implement the logic that checks for errors, checks the threshold etc. If the threshold is met, it would reconstruct the seed, build a seedAuthorizer, call Guard.ResetState etc. It could be made to also return the complete seed, this is just an example.

We would clear PartialRecovery in case of either success or error. This is also in line with the other dicsussion about what to do is the wrong seed is reconstructed etc.

...

stale atomic.Bool
}
```

If the Coordinator is in recovery mode, this list is used to track the received seed shares via the `Recover` RPC of the user API.
Peer recovery still works as before, since there are no seed shares involved.
If a `Recover` call from the user is made where the newly received seed share combined with the already received shares meets the threshold, the seed is reconstructed and the Coordinator can be recovered normally.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should the Coordinator behave if the wrong seed is reconstructed because one of the inputs was wrong?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this would imply a malicious seed share owner? The Coordinator has no way to identify a wrong share, so it will create a new seed engine anyway. This should however be immediately noticeable when attesting the Coordinator with contrast verify because the root CA changed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the signature of the latest transition is checked during recovery, so directly after the seed engine has been created. If the seed engine is wrong, the user that called Recover last will receive an error verifying latest transition.... Since the Coordinator doesn't know which seed share is incorrect, maybe we have to transition to a new state with no seed shares and start again?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we can't verify the signature with the combined seed we should throw all shares away and start from scratch.

If the list contains some number of seed shares below the threshold and peer recovery triggers, the Coordinator is recovered via peer recovery as expected.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding from the proposal section was that in the first version we would need to scale down the coordinator to one instance which would be in non-recovered state right? So how is it possible for peer recovery to trigger mid-user-recovery via shares?

Is there a scenario where while we are scaling down to one instance, a non-recovered coordinator with seed shares accumulated, does peer recover from a recovered peer before the recovered one gets torn down?

Should it be specified more like:

  • if a recovered coordinator exists, use peer recovery
  • if no recovered coordinator exists, scale down to one and perform threshold user recovery
  • don't use --force for threshold recovery while recovered peers exist

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of an edge case: if one recovered Coordinator and one non-recovered Coordinator exist (e.g., 2 Coordinators and I restart one), and I call recover with --force, the non-recovered Coordinator can accumulate seed shares before peer recovery triggers. I think otherwise the behavior is the same as before, i.e., prefer peer recovery when --force is not specified. Scaling down to 1 for user recovery should then be documented in the docs.


When the `Recover` RPC is called and the state is stale, the `ResetState` method of the guard is invoked, which takes a `SecretSourceAuthorizer` as an argument.
This interface currently only has the method `AuthorizeByManifest`, which validates the peer given the manifest from the store and returns the new seed engine and mesh CA key.
This is used both by the user API for `contrast recover` and by the mesh API during peer recovery.
The `AuthorizeByManifest` function now additionally returns the reconstructed seed if the threshold is met and the received seed share otherwise:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/and/or


```go
type SecretSourceAuthorizer interface {
AuthorizeByManifest(ctx context.Context, manifest *manifest.Manifest) (seedengine.SeedEngine, crypto.PublicKey, []byte, error)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is meant to be *seedengine.SeedEngine and *ecdsa.PrivateKey but ideally see my other comment that suggests not changing this method's contract at all.

}
```

In case of peer recovery, the third output is always `nil` and the process stays the same as before.
For user recovery, the `SecretSourceAuthorizer` is initialized with the state's current seed shares and, if there are enough shares, reconstructs the seed and returns the seed engine with the recovered seed as before.
If there aren't enough shares yet, we only return the received seed share from the function without the seed engine and mesh key.
The function should already verify that the received seed share isn't present in the list of received seed shares.

If the returned seed engine is `nil` and we get a seed share that isn't already in the list of received seed shares, we return a new state.
This state has the `stale` boolean still set to `true` but now includes the list of received seed shares from before plus the new received seed share.
After updating the state, we return an error indicating that the threshold isn't met yet.
If we get a valid seed engine, we proceed as before.

If the Coordinator has enough seed shares to recover the seed but one of the seed shares is invalid, through, for example, a malicious seed share owner, the following signature verification of the latest transition will fail.
In that case, we don't know which seed shares are valid or invalid, so the only option is to reset the state and start the recovery process from scratch with no seed shares.

The proto messages for the `Recover` API remain basically unchanged, now including the encrypted seed share instead of the encrypted seed:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is obvious but I think it should be explicitly mentioned here that each share owner is expected to send the same Salt as part of RecoverRequest and that this should be validated by the coordinator (that's why I included the Salt in my suggested design change).

I have also made the comment in the proposal section but, wdym by "encrypted seed share", is this encrypted with something else that I have missed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I forgot about that. It only makes sense to also validate the salt.

The proto paragraph needs rewriting: the recover request now includes the seed share (not encrypted), and the SeedShare message type belongs to the SetManifestResponse, where the encrypted seed share is correct.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at

cipherText, err := rsa.EncryptOAEP(sha256.New(), rand.Reader, pubKey, seed, []byte("seedshare"))

I wonder if we should have a more specific label for these new shares (I gather this is like aad in aead), that would also bind them to properties of their intended use like "intended to be used by sss", the threshold, total share count etc.

I see this as a robustness/ux improvement that would help in potential early mismatch detection/cleaner domain separation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do something like "seedshare-k/n", for a k out of n threshold scheme. Both the user and the Coordinator have access to the manifest, so they both know the threshold and number of seed share owners. I'm not sure how useful this is though. This would only help when a user accidentally runs contrast recover with a seed share from a threshold scheme with different parameters. Maybe even "seedshare-k-{hash(seedshare owner keys)}".

But keep in mind that the untrusted seed share owner can simply choose the label himself and forge an incorrect share with the correct label, so this would only be a UX improvement, not adding any robustness in that scenario.


```proto
message SeedShare {
string PublicKey = 1;
bytes EncryptedSeedShare = 2;
}
```
2 changes: 2 additions & 0 deletions tools/vale/styles/config/vocabularies/edgeless/accept.txt
Original file line number Diff line number Diff line change
Expand Up @@ -131,6 +131,7 @@ podman
PPID
precalculator
Privatemode
proto
protobuf
proxied
QEMU
Expand All @@ -148,6 +149,7 @@ safeguardable
SBOM
Scaleway
serializable
Shamir
sigstore
snapshotters?
snapshotting
Expand Down