[RPC] VM error visibility indexing#26595
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
a8424b1 to
e96bb61
Compare
e96bb61 to
c32b90f
Compare
stefan-mysten
left a comment
There was a problem hiding this comment.
Left a couple of comments. I am not familiar with this codebase so I hope someone else will take a look, but it was interesting to read and learn. Thanks @jordanjennings-mysten.
| &tracking_store.into_read_objects(), | ||
| ); | ||
|
|
||
| let execution_error_metadata = execution_error_opt |
There was a problem hiding this comment.
What's the possible size of this error metadata? If I recall correctly, this will be added onto fullnodes, I assume in a table somewhere. If it's very large, can it cause any disruption/issues during insertion/retrieval?
There was a problem hiding this comment.
my napkin math says about 1 gb of storage per week with 300 bytes per error, 2% error rate on ~20 million txs a day, depending on when the fullnode prunes it could be a bit smaller, maybe more in the 0.33gb range. based on a 5 hour sample window a few weeks back.
There was a problem hiding this comment.
Thanks for sharing more concrete numbers. I was wondering if there's any limits in the DB of how much data can be inserted in a field in a column?
There was a problem hiding this comment.
I think they are quite large but you would hit performance problems at some point since its rocks db.
| .or_else(|| { | ||
| self.store | ||
| .get_execution_error_metadata(digest) | ||
| .expect("db error") |
There was a problem hiding this comment.
This .expect leads to a panic, no? Is that required? I'd expect that this should return None if it cannot get execution_error_metadata rather than panicking.
There was a problem hiding this comment.
this was an existing pattern in get_unchanged_loaded_runtime_objects if theres a TypedStoreError it bails which arguably should panic since that error type is associated serious db errors. I do agree probably something to flag though.
amnn
left a comment
There was a problem hiding this comment.
This PR looks good to me, just some questions about the types.
I'm also curious what the roll out for this would be, because if you roll out just this change, and then iterate on the error messaging itself, we will end up with different fullnodes in the ecosystem producing different metadata for the same error based on the version of sui-node they are running.
The code otherwise looks good, thanks @jordanjennings-mysten
| gas_status, | ||
| effects, | ||
| timings, | ||
| execution_failure.map_err(ExecutionErrorContext::from), |
There was a problem hiding this comment.
Would it be worthwhile to split the extra metadata into its own distinct type? I.e. have Authority::execute_transaction_to_effects return ExecutionError and then a separate, optional ExecutionErrorContext or ExecutionMetadata (whatever you want to call the extra information only)?
The benefit (for me) of using separate types, is that I would be able to clearly see which part of the richer data is dropped in the validator path, because it would be statically None here.
This also fits the onward data flow better because the execution error would need to be indexed with the rest of the base transaction effect data, while the metadata goes into its own table.
There was a problem hiding this comment.
I would also recommend we try to align on the shape of the type using a protobuf message and storing the protobuf message itself in the DB so that we can iterate on the shape if need be.
There was a problem hiding this comment.
@bmwill, is it possible to do that today with TideHunter/TypedStore? I was personally less worried about the evolvability because the metadata structure is already quite generic, but agree that if it's possible, then using a more generic protobuf value would be nicer for future evolvability.
There was a problem hiding this comment.
put prost and serialize and it seemed to work out, I expect we will add some protos somewhere at some point but for now just used the prost derive.
| TransactionEffects, | ||
| Vec<ExecutionTiming>, | ||
| Result<(), ExecutionError>, | ||
| Result<(), ExecutionErrorContext>, |
There was a problem hiding this comment.
Should dev_inspect_transaction also use this type?
There was a problem hiding this comment.
yup I was going to split that out into its own PR to keep this focused
This is an interesting question. I hadn't considered versioning/roll out. protocol version certainly feels overkill, I did have an interest in the past about feature flagging (unclear if that would work here originally for cli) but it seemed there was a preference to not introduce flags. I'll think about it some more. |
amnn
left a comment
There was a problem hiding this comment.
I don't think the traits on ExecutionErrorMetadata are having the desired effect (which also means the tests that were added are not catching the thing you want them to catch -- schema evolvability). Can you take another look?
| pub attributes: BTreeMap<String, String>, | ||
| } | ||
|
|
||
| #[cfg(test)] |
There was a problem hiding this comment.
Having tests in the middle of the implementation is a bit unconventional, can we move these to the end and just call the module tests?
| pub(crate) type BoxError = Box<dyn std::error::Error + Send + Sync + 'static>; | ||
| pub type ExecutionErrorMetadata = BTreeMap<String, String>; | ||
|
|
||
| #[derive(Clone, Eq, PartialEq, JsonSchema, Serialize, Deserialize, prost::Message)] |
There was a problem hiding this comment.
The fact that this type implements prost::Message does not mean that it's being serialized using proto when stored in the authority stores/indices -- I would expect that it's still being BCS encoded (with all the pitfalls around evolvability that this entails) unless you are manually encoding it to proto before writing it, which I didn't see happening in the code above.
This is where my earlier question to @bmwill came from -- whether there is an established pattern for storing proto into tidehunter/typed-store.
| TransactionEffects, | ||
| Vec<ExecutionTiming>, | ||
| Result<(), ExecutionError>, | ||
| Result<(), ExecutionErrorContext>, |
There was a problem hiding this comment.
Now that I understand the rest of this change better, I'm pretty sure you do not want to introduce ExecutionErrorContext at this level of abstraction. It bakes in the idea that you have access to a source error on exit from the execution layer when the aim is to abstract that away.
Instead, replicate the change inside authority here -- expose the error and the metadata, and make it the execution layer's responsibility to extract the necessary metadata.
You could package the error and metadata up into a new type, like you have here, or you could introduce the metadata as a new optional field, like in authority (which has smaller knock-on consequences).
|
I ended up chatting with @bmwill about this, here's a summary from that discussion:
|
|
ignore the review request.. I didn't realize you had comments since githubs notifications seem to have failed and didn't tell me you had commented. I'll take a look! |
can we not build up the index first then ship the API? |
|
ExecutionErrorContext no longer leaks through the executor interface, and execution now returns ExecutionError plus optional metadata. the DB path now stores encoded sui-rpc proto bytes in Vec instead of BCS-encoding the local metadata type, with conversions at the read/write functions. now relies on: |
Description
Adds JSON-RPC index support for VM execution error visibility.
This threads richer
ExecutionErrorContextthrough fullnode execution, extracts execution error metadata/source for failed txs, and writes it to a newexecution_error_metadataindex by tx digest. Validators continue using the on-chain dataExecutionFailurepath and drop the extra metadata.The new index table is prunable by transaction sequence number and has a round-trip unit test.
requires changes to sui-apis sui-rust-sdk:
MystenLabs/sui-rust-sdk#267
MystenLabs/sui-apis#28
Test plan
sui core
Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.