[NONEVM-706][Solana] - Refactor TXM + Rebroadcast Expired Tx functionality#946
Conversation
```go
id                   string
createTs             time.Time
state                TxState
lastValidBlockHeight uint64 // to track expiration
```
Out of curiosity, what is an invalid block height? Like a reorged out chain?
+1. I assume this is related to using the latest blockhash in the tx and tracking its expiration. But the variable name is confusing: it's unclear whether it tracks a particular block number or just a depth.
In Solana, each transaction includes a "recent blockhash" that expires when that blockhash is no longer "recent enough"
During transaction processing, Solana validators will check if the "recent blockhash" of each transaction is recorded within the 150 most recent stored hashes. If the transaction's "recent blockhash" is older than this maximum processing age, the transaction won't be processed.
To track the expiration of each transaction, we call the RPC method GetLatestBlockhash and store the return value lastValidBlockHeight within our pendingTx object. This value represents the height of the last block at which the returned blockhash value will be valid.
Each slot (the period of time in which a validator can produce a block) lasts between 400 ms and 600 ms. This means transactions can only use a given blockhash for 150 blocks, which is roughly 60 to 90 seconds after creation. After that, the network considers the transaction expired and you'll need a new blockhash to make it go through.
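A minimal sketch of the expiry check described above. The `pendingTx` struct and `isExpired` helper here are hypothetical stand-ins for the TXM's internal types; the real code stores `lastValidBlockHeight` from the `GetLatestBlockhash` RPC response on the pending tx.

```go
package main

import "fmt"

// pendingTx is a hypothetical mirror of the TXM's in-memory record.
// lastValidBlockHeight is captured from the GetLatestBlockhash RPC
// response when the tx is built.
type pendingTx struct {
	id                   string
	lastValidBlockHeight uint64
}

// isExpired reports whether the network will no longer accept the tx's
// blockhash: once the chain moves past lastValidBlockHeight (~150 blocks
// after the blockhash was issued), validators drop the tx, so a
// rebroadcast needs a fresh blockhash.
func isExpired(tx pendingTx, currBlockHeight uint64) bool {
	return currBlockHeight > tx.lastValidBlockHeight
}

func main() {
	tx := pendingTx{id: "abc", lastValidBlockHeight: 1000}
	fmt.Println(isExpired(tx, 1000)) // false: still valid at the boundary
	fmt.Println(isExpired(tx, 1001)) // true: blockhash no longer recent
}
```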
Ah yeah this became clear to me later in the PR review, thanks for clarifying though that makes sense
```diff
-broadcastedTxs          map[string]pendingTx  // transactions that require retry and bumping i.e. broadcasted, processed
 confirmedTxs            map[string]pendingTx  // transactions that require monitoring for re-org
 finalizedErroredTxs     map[string]finishedTx // finalized and errored transactions held onto for status
+broadcastedProcessedTxs map[string]pendingTx  // broadcasted and processed transactions that may require retry and bumping
```
In EVM, we just call these Unconfirmed.
Unconfirmed means tx was successfully broadcasted, but not yet mined into a block.
My preference would be to reuse that same name here too?
Names except Broadcasted were chosen to be consistent with the Solana Commitment Status
I feel it's good to reuse the same names across chains, but I also think Processed is clearer than Unconfirmed in this case. In my opinion, Unconfirmed does not make it clear whether the transaction was Processed or simply Broadcasted
Broadcasted: our TXM sent the tx
Processed: tx was Processed by the network and included in a block
@amit-momin @silaslenihan what do you think about Broadcasted vs Unconfirmed?
I agree with Juan. Unconfirmed might be misleading in the context of Solana since we have this Processed state in between. In EVM, we could consider a broadcasted transaction as Unconfirmed since the next step in the lifecycle is Confirmed
If Solana has these standard terms, then yes, better to align with that. I didn't know.
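A sketch of the lifecycle naming the thread settles on, aligned with Solana commitment levels. The enum below is illustrative only; the real TXM may define additional states (e.g. errored), and the exact type names are assumptions.

```go
package main

import "fmt"

// TxState sketches the tx lifecycle discussed above, keeping Solana's
// own commitment terminology (Processed/Confirmed/Finalized) rather
// than EVM's "Unconfirmed".
type TxState int

const (
	Broadcasted TxState = iota // TXM sent the tx; not yet seen in a block
	Processed                  // included in a block at processed commitment
	Confirmed                  // cluster-confirmed; monitored for re-org
	Finalized                  // rooted; terminal state
)

func (s TxState) String() string {
	return [...]string{"Broadcasted", "Processed", "Confirmed", "Finalized"}[s]
}

func main() {
	fmt.Println(Broadcasted, Processed) // Broadcasted Processed
}
```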
```diff
 }
 // validate id does not exist
-if _, exists := c.broadcastedTxs[tx.id]; exists {
+if _, exists := c.broadcastedProcessedTxs[tx.id]; exists {
```
What is this tx.id? Is it the one the caller provided as an idempotency key?
If this is checking whether that id exists, then we likely need to do better checks across many other statuses.
Yes! If set by caller we use that one. If not set, we generate a random UUID when enqueueing txes.
What checks do you think are missing?
A previously sent tx with the same id from the caller may still be enqueued in the DB and not yet broadcasted, or it may be confirmed, finalizedErrored, or just finalized, right?
We can't assume that a previously used id will only be in this broadcastedProcessedTxs list.
I think that's a fair point. We should check across all of the maps when adding a new transaction to our in-memory store
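The broader idempotency check suggested above could look like the sketch below. The struct fields mirror the maps shown in the diff; the `idExists` helper and its receiver name are assumptions, not the PR's actual code.

```go
package main

import "fmt"

// Hypothetical slices of the TXM's in-memory store; real field and
// type definitions may differ.
type pendingTx struct{ id string }
type finishedTx struct{ id string }

type pendingTxContext struct {
	broadcastedProcessedTxs map[string]pendingTx
	confirmedTxs            map[string]pendingTx
	finalizedErroredTxs     map[string]finishedTx
}

// idExists applies the idempotency check across every state map, not
// just broadcastedProcessedTxs, so a reused id is rejected wherever the
// earlier tx currently sits in its lifecycle.
func (c *pendingTxContext) idExists(id string) bool {
	if _, ok := c.broadcastedProcessedTxs[id]; ok {
		return true
	}
	if _, ok := c.confirmedTxs[id]; ok {
		return true
	}
	_, ok := c.finalizedErroredTxs[id]
	return ok
}

func main() {
	c := &pendingTxContext{
		broadcastedProcessedTxs: map[string]pendingTx{},
		confirmedTxs:            map[string]pendingTx{"a": {id: "a"}},
		finalizedErroredTxs:     map[string]finishedTx{},
	}
	fmt.Println(c.idExists("a")) // true: id already used by a confirmed tx
	fmt.Println(c.idExists("b")) // false: safe to enqueue
}
```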
```go
	return maps.Keys(c.sigToID)
}

// ListAllExpiredBroadcastedTxs returns all txs that are in broadcasted state and have expired for the given block height, compared against their lastValidBlockHeight.
```
nit: could we use blockNumber instead of blockHeight in all these places?
Block number is a more unambiguous term.
I also find it easier to think in terms of block number than block height.
Guys, do you think it makes sense to rename lastValidBlockHeight to lastValidBlockNumber inside our TXM? lastValidBlockHeight is the official name, but lastValidBlockNumber may be easier to work with.
Another alternative would be to add comments wherever we use the term block height, to be explicit that we are referring to a block number.
I'm ok either way but I thought both of those terms referred to the same thing? What's the ambiguity with block height? Since I thought they were the same, I was on the side of using the official name from the RPC response
Yes, if Solana assigns specific meanings to some terms, let's use those.
I didn't know that.
In this case, let's just add a comment saying what it means.
For example: lastValidBlockHeight is the highest block number up to which this tx is valid. After this, the tx gets treated as expired.
```diff
-if initBuildErr != nil {
-	return solanaGo.Transaction{}, "", solanaGo.Signature{}, initBuildErr
+// Build and sign initial transaction setting compute unit price and limit
+initTx, err := txm.buildTx(ctx, msg, 0)
```
Do we know what happens to the previous tx (the original one before the retry) that this is intending to replace?
Since this is updating the recentBlockHash, I would assume Solana treats this as a brand new tx, not something that replaces the previous one. So maybe Solana confirms both of these txs separately?
On EVM, we were sure only one would get confirmed, since they shared the same nonce. But here, they don't share anything.
This previous response makes sense here: #946 (comment)
We will proceed and Rebroadcast a tx when the first one has already expired and become invalid
```go
}

// sendWithRetry attempts to send a transaction with exponential backoff retry logic.
// It builds, signs and sends the initial tx with a new valid blockhash, and starts a retry routine with fee bumping if needed.
```
I commented below too, but if you use a new valid blockHash, are you sure Solana would still figure out and treat this as a replacement of the older Tx?
If not, then we may have both of these txs get included.
Also quoting #946 (comment) here
It won't be a "replacement" for the old TX, but the old TX became invalid in the eyes of validators and won't be picked up after 150 blocks have passed
50dd10f
```go
// Retries until context is cancelled by timeout or called externally.
// It uses the handleRetry helper function to handle each retry attempt.
func (txm *Txm) retryTx(ctx context.Context, msg pendingTx, currentTx solanaGo.Transaction, sigs *signatureList) {
	deltaT := 1 // initial delay in ms
```
Isn't 1 ms too aggressive?
I assume just the round-trip time of a Solana RPC call would be on the order of tens of milliseconds at least.
So after the initial send, we may want to wait at least 10 ms to give the initial tx some time to get confirmed.
Otherwise, with an initial deltaT of 1 ms, the next exponential backoffs are at 1, 2, 4, 8, 16, 32, 64, 128 ms.
But I see that the confirm logic, which checks tx confirmation, polls every 500 ms:
`ConfirmPollPeriod: config.MustNewDuration(500 * time.Millisecond)`
Seems like we will retry many times even before checking confirmation once.
This logic was carried over from the existing Data Feeds code. We chose to keep it the same so we don't introduce any regressions for DF. We could make this configurable in the future, especially if we need to reduce RPC calls, but it seems the aggressive retries were originally necessary to make sure our transaction gets out in front of the current validator. The multinode integration probably helps with that, so spamming this much may no longer be needed. I had the same thoughts as you initially, but was reluctant to relax the existing code in case it introduced any unforeseen issues.
We decided here to create a ticket to make this retry configurable. We will keep the default values as they are for now. This will give us the flexibility to adjust them based on the use case. For example, we could relax these configurations for CCIP and make them more aggressive for DF
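The retry schedule debated above can be sketched as a doubling backoff from a configurable starting delay. The function name, the cap parameter, and the defaults below are assumptions for illustration; the ticket mentioned above would make the real values configurable per use case.

```go
package main

import "fmt"

// backoffDelays returns the first n retry delays (in ms) of an
// exponential schedule: start at initialMs, double each step, and
// clamp at maxMs. With initialMs=1, all eight retries land within
// ~255 ms, i.e. well inside a single 500 ms ConfirmPollPeriod, which
// is the reviewer's concern.
func backoffDelays(initialMs, maxMs, n int) []int {
	delays := make([]int, 0, n)
	d := initialMs
	for i := 0; i < n; i++ {
		delays = append(delays, d)
		d *= 2
		if d > maxMs {
			d = maxMs
		}
	}
	return delays
}

func main() {
	fmt.Println(backoffDelays(1, 500, 8)) // [1 2 4 8 16 32 64 128]
}
```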
```go
// rebroadcastExpiredTxs attempts to rebroadcast all transactions that are in broadcasted state and have expired.
// An expired tx is one whose blockhash's lastValidBlockHeight (last valid block number) is smaller than the current block height (block number).
// If any error occurs during a rebroadcast attempt, it is discarded, and the function continues with the next transaction.
func (txm *Txm) rebroadcastExpiredTxs(ctx context.Context, client client.ReaderWriter) {
```
this feels like there's a potential race condition here

- tx is broadcast to the network and is included just before blockhash expiration
- confirmer does not see it yet because it waits for `confirmed` commitment level
- TXM marks a base transaction as expired because the block threshold has passed
- deletes the saved tx id from the in-memory storage, and rebroadcasts a tx with a new blockhash
- this could go on forever because it can never confirm that a successful tx is confirmed, because the tx.id is always deleted before
Wouldn't this be handled gracefully by the existing code?
The processConfirmations method is called before rebroadcastExpiredTxs, updating each signature status. Here we would detect if we have transitioned from Broadcasted to Processed.
Then inside rebroadcastExpiredTxs we call ListAllExpiredBroadcastedTxs, which only considers the ones in Broadcasted state.
I think we would avoid that scenario since we only call rebroadcastExpiredTxs after processing the latest confirmation status changes. Since this method only processes Broadcasted transaction, the confirmation logic would handle any last minute state changes before we assess transaction for expiry. So in your scenario, we would notice that the transaction advanced to Processed, mark it accordingly in our in-memory storage, then move on to the expiry check. The expiry check would notice there aren't any transactions in Broadcasted state that are expired and just move on.
what happens if a transaction skips the processed phase?
processed is just the view of a single RPC node - so if a transaction can go from processed to non-existent, then it might also be possible to go from non-existent to confirmed
it just feels odd to delete all previous attempts at a transaction and then recreate - rather than just updating the transaction with a new blockhash
We would advance the state from non-existent (internally what we consider Broadcasted) to Confirmed and also avoid rebroadcasting with a new hash in that case. Our state transition methods don't require us to move one step up at a time in case of slow polling or other reasons. We can even jump straight to Finalized. We only delete all previous attempts/sigs if we can confirm that the tx has a non-existent state and its blockhash has expired. I was thinking it was better to clean up those old attempts/sigs than keep them around and waste cycles on status checks on them since they're expired anyways.
ok - the chances of this race happening might be low
is there logging around a transaction signature that doesn't match any tx id? something that fires if an assumption is broken?
If one of the state transition methods gets a sig that doesn't match an ID in the internal map, it should return an ErrSigDoesNotExist error. Depending on where the method was called, we have some logging to surface an assumption violation, but not in all cases, like the confirmation expiration where we can maybe assume a different process cleaned up the signatures.
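The filter at the heart of this thread can be sketched as below: only txs still in Broadcasted state and past their lastValidBlockHeight qualify for rebroadcast, so anything the confirmer has already advanced to Processed (or beyond) is skipped. Type and function names here are assumptions modeled on the discussion, not the PR's exact code.

```go
package main

import "fmt"

type txState int

const (
	broadcasted txState = iota // sent, not yet seen in any block
	processed                  // seen at processed commitment
)

type pendingTx struct {
	id                   string
	state                txState
	lastValidBlockHeight uint64
}

// listAllExpiredBroadcastedTxs mirrors ListAllExpiredBroadcastedTxs:
// because processConfirmations runs first and advances states, a tx
// that was included at the last moment is no longer Broadcasted and
// will not be rebroadcast, which addresses the race described above.
func listAllExpiredBroadcastedTxs(txs []pendingTx, currHeight uint64) []pendingTx {
	var expired []pendingTx
	for _, tx := range txs {
		if tx.state == broadcasted && tx.lastValidBlockHeight < currHeight {
			expired = append(expired, tx)
		}
	}
	return expired
}

func main() {
	txs := []pendingTx{
		{id: "a", state: broadcasted, lastValidBlockHeight: 90},  // expired: rebroadcast
		{id: "b", state: processed, lastValidBlockHeight: 90},    // advanced: skip
		{id: "c", state: broadcasted, lastValidBlockHeight: 200}, // still valid: skip
	}
	for _, tx := range listAllExpiredBroadcastedTxs(txs, 100) {
		fmt.Println(tx.id) // prints only "a"
	}
}
```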
```go
	return &txCopy
}

func TestTxm_ExpirationRebroadcast(t *testing.T) {
```
this test only runs against mocked endpoints under ideal behavior - please add additional tests that run against a live validator
Hey! We've found that testing this scenario without mocking is challenging because we don't have a way to force a transaction to expire in a live validator.
The only solution we've identified involves modifying some methods to simulate this behavior for the tests. However, this approach is not ideal, as it would require adding additional flows solely to make the test pass and to trigger a rebroadcast in the live validator tests.
Do you have any ideas on how to achieve this that we might be missing?
you could potentially introduce some sort of proxy that takes the transaction and returns a transaction signature, but doesn't allow any more transactions through until it knows the first transaction blockhash has expired
Would it be okay to address this in a separate ticket under the Testing track? So we are able to get this one in before the freeze and finish the re-org work that depends on this PR to complete the implementation track
why does the freeze impact the timeline? afaik that's just for production environments, not development?
also, this PR touches significant portions of the most sensitive part of a solana integration; imo test coverage as the changes are made is a must
just clarifying a few things:
- there is a production code freeze, but that shouldn't impact development here
- i'd like to see an attempt at a proper integration tests, but it can be time-boxed to a day or two (if it still doesn't work, then a follow-up ticket is fine)
Thanks for clarifying, thought the freeze involved not merging new code in chainlink repo, even if it was development 😅. I'll start working on these integration tests
Description

- Refactor `confirm` and `sendWithRetry` funcs
- Add a new `TxExpirationRebroadcast` togglable routine: in the `confirm` loop, after checking confirmations, we determine whether we need to rebroadcast expired transactions that were not processed yet. Rebroadcasting a transaction calls `sendWithRetry` directly without enqueuing.

https://smartcontract-it.atlassian.net/browse/NONEVM-706

prev branch: #928

Merge together with:

Soak Tests: