[RPC] VM error visibility indexing #26595

Open
jordanjennings-mysten wants to merge 15 commits into main from vm-error-visibility-jsonrpc-indexing

Conversation

@jordanjennings-mysten jordanjennings-mysten commented May 12, 2026

Contributor

Description

Adds JSON-RPC index support for VM execution error visibility.

This threads richer ExecutionErrorContext through fullnode execution, extracts execution error metadata/source for failed txs, and writes it to a new execution_error_metadata index by tx digest. Validators continue using the on-chain data ExecutionFailure path and drop the extra metadata.

The new index table is prunable by transaction sequence number and has a round-trip unit test.
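A minimal sketch of what "keyed by tx digest, prunable by transaction sequence number" could look like, using in-memory maps. The `ErrorMetadataIndex` type, its fields, and the `prune_below` method are hypothetical names for illustration; the PR's actual table layout and pruning code may differ.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical index: encoded metadata keyed by transaction digest, plus a
/// secondary map from transaction sequence number to digest used for pruning.
#[derive(Default)]
struct ErrorMetadataIndex {
    by_digest: HashMap<String, Vec<u8>>,
    digest_by_seq: BTreeMap<u64, String>,
}

impl ErrorMetadataIndex {
    /// Record metadata for a failed transaction.
    fn insert(&mut self, seq: u64, digest: &str, encoded: Vec<u8>) {
        self.by_digest.insert(digest.to_string(), encoded);
        self.digest_by_seq.insert(seq, digest.to_string());
    }

    /// Look up metadata by transaction digest.
    fn get(&self, digest: &str) -> Option<&Vec<u8>> {
        self.by_digest.get(digest)
    }

    /// Drop every entry whose sequence number is strictly below `min_seq`.
    fn prune_below(&mut self, min_seq: u64) {
        // `split_off` returns the entries with keys >= min_seq and leaves
        // the entries to prune behind in `self.digest_by_seq`.
        let retained = self.digest_by_seq.split_off(&min_seq);
        for digest in self.digest_by_seq.values() {
            self.by_digest.remove(digest);
        }
        self.digest_by_seq = retained;
    }
}

fn main() {
    let mut index = ErrorMetadataIndex::default();
    index.insert(10, "digest-a", b"first".to_vec());
    index.insert(20, "digest-b", b"second".to_vec());

    index.prune_below(15);
    assert!(index.get("digest-a").is_none());
    assert_eq!(index.get("digest-b"), Some(&b"second".to_vec()));
}
```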

Requires changes to sui-apis and sui-rust-sdk:
MystenLabs/sui-rust-sdk#267
MystenLabs/sui-apis#28

Test plan

sui core

  • test_execution_error_metadata_round_trip
  • test_empty_execution_error_metadata_is_ignored
  • test_validator_execution_does_not_store_error_metadata
  • execution_error_metadata_table_accepts_future_proto_schema

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • gRPC:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • Indexing Framework:


@jordanjennings-mysten jordanjennings-mysten temporarily deployed to sui-typescript-aws-kms-test-env May 12, 2026 06:56 — with GitHub Actions Inactive
Base automatically changed from vm-error-visibility-execution-and-type to main May 14, 2026 03:41
@jordanjennings-mysten jordanjennings-mysten force-pushed the vm-error-visibility-jsonrpc-indexing branch from a8424b1 to e96bb61 May 14, 2026 17:42
@jordanjennings-mysten jordanjennings-mysten temporarily deployed to sui-typescript-aws-kms-test-env May 14, 2026 17:42 — with GitHub Actions Inactive

@stefan-mysten stefan-mysten left a comment

Contributor

Left a couple of comments. I am not familiar with this codebase so I hope someone else will take a look, but it was interesting to read and learn. Thanks @jordanjennings-mysten.

Comment thread crates/sui-core/src/authority/authority_store.rs
Comment thread crates/sui-core/src/execution_cache/writeback_cache.rs
Comment thread crates/sui-core/src/execution_cache.rs
Comment thread crates/sui-core/src/authority.rs Outdated
&tracking_store.into_read_objects(),
);

let execution_error_metadata = execution_error_opt

Contributor

What's the possible size of this error metadata? If I recall correctly, this will be added onto fullnodes, I assume in a table somewhere. If it's very large, can it cause any disruption/issues during insertion/retrieval?

@jordanjennings-mysten jordanjennings-mysten May 18, 2026

Contributor Author

My napkin math says about 1 GB of storage per week, assuming 300 bytes per error and a 2% error rate on ~20 million txs a day. Depending on when the fullnode prunes, it could be a bit smaller, maybe more in the 0.33 GB range. This is based on a 5-hour sample window from a few weeks back.
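The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the rough figures quoted in the comment (20 million txs a day, 2% error rate, 300 bytes per error); `bytes_per_day` is a hypothetical helper, not code from the PR.

```rust
/// Back-of-the-envelope estimate of index growth, in bytes per day.
/// All inputs are the rough figures quoted in the comment above.
fn bytes_per_day(txs_per_day: u64, error_rate_percent: u64, bytes_per_error: u64) -> u64 {
    let failed_txs = txs_per_day * error_rate_percent / 100;
    failed_txs * bytes_per_error
}

fn main() {
    let per_day = bytes_per_day(20_000_000, 2, 300);
    let per_week = per_day * 7;
    println!("per day:  {} MB", per_day / 1_000_000); // prints "per day:  120 MB"
    println!("per week: {} MB", per_week / 1_000_000); // prints "per week: 840 MB"
}
```

With these inputs the index grows by roughly 120 MB a day, or about 0.84 GB a week, which matches the "about 1 GB" figure.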

Contributor

Thanks for sharing more concrete numbers. I was wondering whether there are any limits in the DB on how much data can be inserted into a single field of a column?

Contributor Author

I think the limits are quite large, but you would hit performance problems at some point since it's RocksDB.

Comment thread crates/sui-core/src/authority.rs
Comment thread crates/sui-core/src/authority.rs
Comment thread crates/sui-core/src/authority.rs Outdated
.or_else(|| {
self.store
.get_execution_error_metadata(digest)
.expect("db error")

@stefan-mysten stefan-mysten May 18, 2026

Contributor

This .expect leads to a panic, no? Is that required? I'd expect that this should return None if it cannot get execution_error_metadata rather than panicking.

Contributor Author

This was an existing pattern in get_unchanged_loaded_runtime_objects: if there's a TypedStoreError it bails, which arguably should panic since that error type is associated with serious DB errors. I do agree it's probably something to flag, though.
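To make the two options in this thread concrete, here is a small sketch contrasting the `.expect("db error")` pattern with a lenient lookup that reports the error and returns `None`. `Store`, `StoreError`, `lookup_or_panic`, and `lookup_lenient` are hypothetical stand-ins; only the method name `get_execution_error_metadata` comes from the diff, and its real signature is assumed.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a serious storage error such as `TypedStoreError`.
#[derive(Debug)]
struct StoreError(String);

/// Hypothetical in-memory store; `broken` simulates a failing database.
struct Store {
    rows: HashMap<String, String>,
    broken: bool,
}

impl Store {
    /// Assumed shape: Ok(None) means "no metadata", Err means the DB failed.
    fn get_execution_error_metadata(&self, digest: &str) -> Result<Option<String>, StoreError> {
        if self.broken {
            return Err(StoreError("simulated db failure".to_string()));
        }
        Ok(self.rows.get(digest).cloned())
    }
}

/// Pattern in the diff: a storage error aborts the process.
fn lookup_or_panic(store: &Store, digest: &str) -> Option<String> {
    store.get_execution_error_metadata(digest).expect("db error")
}

/// Alternative raised in review: report the error and treat metadata as absent.
fn lookup_lenient(store: &Store, digest: &str) -> Option<String> {
    match store.get_execution_error_metadata(digest) {
        Ok(found) => found,
        Err(err) => {
            eprintln!("failed to read execution error metadata: {}", err.0);
            None
        }
    }
}

fn main() {
    let mut rows = HashMap::new();
    rows.insert("digest-1".to_string(), "abort code 7".to_string());
    let healthy = Store { rows, broken: false };
    let broken = Store { rows: HashMap::new(), broken: true };

    assert_eq!(lookup_or_panic(&healthy, "digest-1").as_deref(), Some("abort code 7"));
    assert_eq!(lookup_lenient(&healthy, "missing"), None);
    // A broken store yields None here, where lookup_or_panic would panic.
    assert_eq!(lookup_lenient(&broken, "digest-1"), None);
}
```

The trade-off is that the lenient version hides a failing database behind an ordinary "no metadata" answer, which is the argument for panicking on serious storage errors.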

@amnn amnn left a comment

Contributor

This PR looks good to me, just some questions about the types.

I'm also curious what the roll out for this would be, because if you roll out just this change, and then iterate on the error messaging itself, we will end up with different fullnodes in the ecosystem producing different metadata for the same error based on the version of sui-node they are running.

The code otherwise looks good, thanks @jordanjennings-mysten

Comment thread crates/sui-core/src/authority.rs Outdated
gas_status,
effects,
timings,
execution_failure.map_err(ExecutionErrorContext::from),

Contributor

Would it be worthwhile to split the extra metadata into its own distinct type? I.e. have Authority::execute_transaction_to_effects return ExecutionError and then a separate, optional ExecutionErrorContext or ExecutionMetadata (whatever you want to call the extra information only)?

The benefit (for me) of using separate types, is that I would be able to clearly see which part of the richer data is dropped in the validator path, because it would be statically None here.

This also fits the onward data flow better because the execution error would need to be indexed with the rest of the base transaction effect data, while the metadata goes into its own table.
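A rough sketch of the suggested split, where the error and the extra metadata are separate values and the validator path sets the metadata to a literal `None`. Every type and function here (`ExecutionOutcome`, `ExecutionMetadata`, `execute`, `execute_as_validator`, `execute_as_fullnode`) is a simplified, hypothetical stand-in and not the real signature of `Authority::execute_transaction_to_effects`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the on-chain execution error.
#[derive(Debug, Clone, PartialEq)]
struct ExecutionError {
    kind: String,
}

/// Hypothetical off-chain metadata, kept apart from the error itself.
#[derive(Debug, Clone, PartialEq)]
struct ExecutionMetadata {
    attributes: BTreeMap<String, String>,
}

/// What execution hands back: the error and, separately, optional metadata.
struct ExecutionOutcome {
    result: Result<(), ExecutionError>,
    metadata: Option<ExecutionMetadata>,
}

/// Simulated execution that always fails and always produces metadata.
fn execute() -> ExecutionOutcome {
    let mut attributes = BTreeMap::new();
    attributes.insert("message".to_string(), "arithmetic overflow".to_string());
    ExecutionOutcome {
        result: Err(ExecutionError { kind: "MoveAbort".to_string() }),
        metadata: Some(ExecutionMetadata { attributes }),
    }
}

/// Validator path: the literal `None` makes the dropped data visible here.
fn execute_as_validator() -> ExecutionOutcome {
    let outcome = execute();
    ExecutionOutcome { result: outcome.result, metadata: None }
}

/// Fullnode path: the error travels with the effects, while the metadata
/// is kept so it can be written to its own index.
fn execute_as_fullnode() -> ExecutionOutcome {
    execute()
}

fn main() {
    let validator = execute_as_validator();
    let fullnode = execute_as_fullnode();

    // Both paths agree on the error; only the fullnode keeps the metadata.
    assert_eq!(validator.result, fullnode.result);
    assert!(validator.metadata.is_none());
    assert!(fullnode.metadata.is_some());
}
```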

Contributor

I would also recommend we try to align on the shape of the type using a protobuf message and storing the protobuf message itself in the DB so that we can iterate on the shape if need be.

Contributor

@bmwill, is it possible to do that today with TideHunter/TypedStore? I was personally less worried about the evolvability because the metadata structure is already quite generic, but agree that if it's possible, then using a more generic protobuf value would be nicer for future evolvability.

@jordanjennings-mysten jordanjennings-mysten May 19, 2026

Contributor Author

Added prost alongside Serialize and it seemed to work out. I expect we will add some protos somewhere at some point, but for now I just used the prost derive.

Comment thread sui-execution/src/executor.rs Outdated
TransactionEffects,
Vec<ExecutionTiming>,
Result<(), ExecutionError>,
Result<(), ExecutionErrorContext>,

Contributor

Should dev_inspect_transaction also use this type?

Contributor Author

Yup, I was going to split that out into its own PR to keep this one focused.

@jordanjennings-mysten jordanjennings-mysten marked this pull request as ready for review May 19, 2026 17:00
@jordanjennings-mysten jordanjennings-mysten temporarily deployed to sui-typescript-aws-kms-test-env May 19, 2026 17:00 — with GitHub Actions Inactive
@jordanjennings-mysten jordanjennings-mysten changed the title from [JSONRPC] VM error visibility indexing to [RPC] VM error visibility indexing May 19, 2026
@jordanjennings-mysten jordanjennings-mysten temporarily deployed to sui-typescript-aws-kms-test-env May 19, 2026 23:05 — with GitHub Actions Inactive
@jordanjennings-mysten

Contributor Author

> I'm also curious what the roll out for this would be, because if you roll out just this change, and then iterate on the error messaging itself, we will end up with different fullnodes in the ecosystem producing different metadata for the same error based on the version of sui-node they are running.

This is an interesting question. I hadn't considered versioning or rollout. A protocol version certainly feels like overkill. In the past I was interested in feature flagging (originally for the CLI, so it's unclear whether it would work here), but there seemed to be a preference not to introduce flags. I'll think about it some more.

@amnn amnn left a comment

Contributor

I don't think the traits on ExecutionErrorMetadata are having the desired effect (which also means the tests that were added are not catching the thing you want them to catch -- schema evolvability). Can you take another look?

Comment thread crates/sui-types/src/error.rs Outdated
pub attributes: BTreeMap<String, String>,
}

#[cfg(test)]

Contributor

Having tests in the middle of the implementation is a bit unconventional, can we move these to the end and just call the module tests?

Comment thread crates/sui-types/src/error.rs Outdated
pub(crate) type BoxError = Box<dyn std::error::Error + Send + Sync + 'static>;
pub type ExecutionErrorMetadata = BTreeMap<String, String>;

#[derive(Clone, Eq, PartialEq, JsonSchema, Serialize, Deserialize, prost::Message)]

Contributor

The fact that this type implements prost::Message does not mean that it's being serialized using proto when stored in the authority stores/indices -- I would expect that it's still being BCS encoded (with all the pitfalls around evolvability that this entails) unless you are manually encoding it to proto before writing it, which I didn't see happening in the code above.

This is where my earlier question to @bmwill came from -- whether there is an established pattern for storing proto into tidehunter/typed-store.
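The evolvability concern can be illustrated with two toy encodings. This is a sketch only: the positional format stands in for a BCS-style encoding and the tagged format for a proto-style encoding, but neither is the real wire format, and all function names are hypothetical. It assumes a newer writer adds a second field that an older reader does not know about.

```rust
/// Toy positional encoding (in the spirit of BCS, not the real format):
/// fields are written in declaration order with no field identifiers.
fn encode_positional(fields: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for field in fields {
        out.push(field.len() as u8); // toy limit: fields shorter than 256 bytes
        out.extend_from_slice(field.as_bytes());
    }
    out
}

/// A reader compiled against a one-field schema rejects any trailing bytes.
fn decode_positional_v1(bytes: &[u8]) -> Option<String> {
    let len = *bytes.first()? as usize;
    let body = bytes.get(1..1 + len)?;
    if bytes.len() != 1 + len {
        return None; // unexpected extra field from a newer writer
    }
    String::from_utf8(body.to_vec()).ok()
}

/// Toy tagged encoding (in the spirit of protobuf, not its wire format):
/// every field carries a numeric tag, so readers can skip unknown tags.
fn encode_tagged(fields: &[(u8, &str)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (tag, value) in fields {
        out.push(*tag);
        out.push(value.len() as u8);
        out.extend_from_slice(value.as_bytes());
    }
    out
}

/// A reader that only knows tag 1 (`message`) and ignores everything else.
fn decode_tagged_v1(bytes: &[u8]) -> Option<String> {
    let mut message = None;
    let mut pos = 0;
    while pos < bytes.len() {
        let tag = *bytes.get(pos)?;
        let len = *bytes.get(pos + 1)? as usize;
        let body = bytes.get(pos + 2..pos + 2 + len)?;
        if tag == 1 {
            message = Some(String::from_utf8(body.to_vec()).ok()?);
        }
        pos += 2 + len;
    }
    message
}

fn main() {
    // A newer writer adds a second field that the old reader has never seen.
    let positional = encode_positional(&["out of gas", "0x2::coin"]);
    let tagged = encode_tagged(&[(1, "out of gas"), (2, "0x2::coin")]);

    // The positional reader fails; the tagged reader skips the unknown field.
    assert_eq!(decode_positional_v1(&positional), None);
    assert_eq!(decode_tagged_v1(&tagged), Some("out of gas".to_string()));
}
```

Deriving a proto trait on the type does not change which of these two behaviors the stored bytes have; that depends on the encoder actually used at write time.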

Comment thread sui-execution/src/executor.rs Outdated
TransactionEffects,
Vec<ExecutionTiming>,
Result<(), ExecutionError>,
Result<(), ExecutionErrorContext>,

Contributor

Now that I understand the rest of this change better, I'm pretty sure you do not want to introduce ExecutionErrorContext at this level of abstraction. It bakes in the idea that you have access to a source error on exit from the execution layer when the aim is to abstract that away.

Instead, replicate the change inside authority here -- expose the error and the metadata, and make it the execution layer's responsibility to extract the necessary metadata.

You could package the error and metadata up into a new type, like you have here, or you could introduce the metadata as a new optional field, like in authority (which has smaller knock-on consequences).

amnn commented May 26, 2026

Contributor

I ended up chatting with @bmwill about this, here's a summary from that discussion:

  • The reasons to use proto overlap with the reasons to keep this type generic (which drove the current design where we have a string -> string map). It doesn't make much sense to encode the string -> string map as proto, instead we should define a structured metadata type in proto, and store it as encoded proto to take advantage of the format's type safety and schema evolvability.
  • To do this properly, the metadata type needs to be defined in a .proto file, with CI support to check for schema evolution issues, and then we can generate the Rust definition from it. This .proto file would not contain a generic string-to-string map, it would contain the exact fields we want to expose (for now, let's just say a message: string)
  • This could be folded into sui-rpc-api, or it could be a separate thing (although then you would need to rebuild all that CI infra).
  • When it comes to roll-out, I would say that we should iterate on this until we are happy with the first E2E use case (a human readable message), and then we can look into backfilling by replaying transactions -- cc @tzakian, this is similar to the stuff we were talking about last Wednesday for the dual-layer execution stuff.

@jordanjennings-mysten

Contributor Author

Ignore the review request. I didn't realize you had comments, since GitHub's notifications seem to have failed and didn't tell me you had commented. I'll take a look!

@jordanjennings-mysten

Contributor Author

> we can look into backfilling by replaying transactions -- cc @tzakian, this is similar to the stuff we were talking about last Wednesday for the dual-layer execution stuff.

Can we not build up the index first, then ship the API?

@jordanjennings-mysten

Contributor Author

ExecutionErrorContext no longer leaks through the executor interface, and execution now returns ExecutionError plus optional metadata. The DB path now stores encoded sui-rpc proto bytes in Vec<u8> instead of BCS-encoding the local metadata type, with conversions in the read/write functions.

Now relies on:
MystenLabs/sui-apis#28
and
MystenLabs/sui-rust-sdk#267

@amnn amnn left a comment

Contributor

Looks good to me, but it would still be good to get a review from @bmwill, as he is much more familiar with the inner workings of fullnodes.

Comment thread crates/sui-core/src/authority/authority_store_tables.rs
Comment thread crates/sui-core/src/authority/authority_store_tables.rs Outdated
Comment thread crates/sui-core/src/authority.rs Outdated
Comment thread crates/sui-types/src/rpc_proto_conversions.rs Outdated
Comment thread sui-execution/src/latest.rs Outdated
Comment thread sui-execution/src/v0.rs Outdated
4 participants