Skip to content

[NONEVM-706][Solana] - Refactor TXM + Rebroadcast Expired Tx functionality#946

Merged
aalu1418 merged 60 commits into
developfrom
backup-branch-fee-bumping
Dec 19, 2024
Merged

[NONEVM-706][Solana] - Refactor TXM + Rebroadcast Expired Tx functionality#946
aalu1418 merged 60 commits into
developfrom
backup-branch-fee-bumping

Conversation

@Farber98
Copy link
Copy Markdown
Contributor

@Farber98 Farber98 commented Nov 26, 2024

Description

Refactor confirm and sendWithRetry funcs

Adding a new TxExpirationRebroadcast toggable routine:

  • Inside confirm loop, after checking confirmations, we will determine if we need to rebroadcast expired transactions that where not processed yet. Rebroadcasting a transaction involves:
    • Removing previous tx and creating a new one with updated blockhash
      • Consideration: txid will be maintained, in case it was set by caller
    • calling sendWithRetry directly without enqueuing
    • Consideration: Implicitly it may be "bumping" despite config being off. We may get a new higher price for the rebroadcasted tx when we build and get a new compute unit price.

https://smartcontract-it.atlassian.net/browse/NONEVM-706

prev branch: #928

Merge together with:

Soak Tests:

Comment thread pkg/solana/txm/pendingtx.go Outdated
id string
createTs time.Time
state TxState
lastValidBlockHeight uint64 // to track expiration
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, what is an invalid block height? Like a reorged out chain?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I assume this is related to using latest blockhash in tx, and tracking its expiration. But the variable name is confusing, if it tracks a particular block number, or just a depth.

Copy link
Copy Markdown
Contributor Author

@Farber98 Farber98 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Solana, each transaction includes a "recent blockhash" that expires when that blockhash is no longer "recent enough"

During transaction processing, Solana validators will check if the "recent blockhash" of each transaction is recorded within the 150 most recent stored hashes. If the transaction's "recent blockhash" is older than this maximum processing age, the transaction won't be processed.

To track the expiration of each transaction, we call the RPC method GetLatestBlockhash and store the return value lastValidBlockHeight within our pendingTx object. This value represents the height of the last block at which the returned blockhash value will be valid.

Each slot (period of time a validator can produce a block) fluctuates between 400 ms and 600 ms. This means transactions can only use a given blockhash for 150 blocks, which is between 60 to 90 seconds since creation. After this, it will be considered expired by the network and you'll need a new blockhash to make it go through

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah this became clear to me later in the PR review, thanks for clarifying though that makes sense

Comment thread pkg/solana/txm/pendingtx.go Outdated
silaslenihan
silaslenihan previously approved these changes Dec 10, 2024
Comment thread pkg/solana/txm/pendingtx.go Outdated
id string
createTs time.Time
state TxState
lastValidBlockHeight uint64 // to track expiration
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I assume this is related to using latest blockhash in tx, and tracking its expiration. But the variable name is confusing, if it tracks a particular block number, or just a depth.

broadcastedTxs map[string]pendingTx // transactions that require retry and bumping i.e broadcasted, processed
confirmedTxs map[string]pendingTx // transactions that require monitoring for re-org
finalizedErroredTxs map[string]finishedTx // finalized and errored transactions held onto for status
broadcastedProcessedTxs map[string]pendingTx // broadcasted and processed transactions that may require retry and bumping
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In EVM, we just call these Unconfirmed.
Unconfirmed means tx was successfully broadcasted, but not yet mined into a block.
My preference would be to reuse that same name here too?

Copy link
Copy Markdown
Contributor Author

@Farber98 Farber98 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Names except Broadcasted were chosen to be consistent with the Solana Commitment Status

I feel it's good to reuse the same names across chains, but I also think Processed is clearer than Unconfirmed in this case. In my opinion, Unconfirmed does not make it clear whether the transaction was Processed or simply Broadcasted

  • Broadcasted: Our TXM sent the tx
  • Processed: tx was Processed by network and included in a block

@amit-momin @silaslenihan what do you think about Broadcasted vs Unconfirmed?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Juan. Unconfirmed might be misleading in the context of Solana since we have this Processed state in between. In EVM, we could consider a broadcasted transaction as Unconfirmed since the next step in the lifecycle is Confirmed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Solana has these standard terms, then yes, better to align with that. I didn't know.

}
// validate id does not exist
if _, exists := c.broadcastedTxs[tx.id]; exists {
if _, exists := c.broadcastedProcessedTxs[tx.id]; exists {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this tx.id? Is it the one caller provided as an idempotency key?

If this is checking if this id exists, then we likely need to do better checks across many other status.

Copy link
Copy Markdown
Contributor Author

@Farber98 Farber98 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! If set by caller we use that one. If not set, we generate a random UUID when enqueueing txes.

What checks do you think are missing?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previously sent Tx by the caller with same id, may be still enqueued in the DB, not yet broadcasted, or may be confirmed, or finalizedErroredTx, or just finalized right?
We can't assume that a previously used id will only be in this broadcastedProcessedTxs list.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a fair point. We should check across all of the maps when adding a new transaction to our in-memory store

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add this

Comment thread pkg/solana/txm/pendingtx.go Outdated
return maps.Keys(c.sigToID)
}

// ListAllExpiredBroadcastedTxs returns all the txes that are in broadcasted state and have expired for given block height compared against their lastValidBlockHeight.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we use blockNumber instead of blockHeight in all these places?
Block number is a more unambiguous term.

Copy link
Copy Markdown
Contributor Author

@Farber98 Farber98 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's also easier in terms of block number than block height.

Guys, do you think it makes sense to rename lastValidBlockHeight to lastValidBlockNumber inside our TXM? The lastValidBlockHeight is the official one, but lastValidBlockNumber may be easier to deal with.

Another alternative would be to put comments when we use the term block height to be explicit that we are referring to block number

CC: @amit-momin @silaslenihan

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok either way but I thought both of those terms referred to the same thing? What's the ambiguity with block height? Since I thought they were the same, I was on the side of using the official name from the RPC response

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes if Solana defines a specific meaning to some terms, lets use that.
Didn't know that.
In this case, lets then just add a comment saying what it means.
For example: lastValidBlockHeight is the highest block number till which this Tx is valid. After this, this Tx gets treated as expired.

Comment thread pkg/solana/txm/txm.go Outdated
Comment thread pkg/solana/txm/txm.go
if initBuildErr != nil {
return solanaGo.Transaction{}, "", solanaGo.Signature{}, initBuildErr
// Build and sign initial transaction setting compute unit price and limit
initTx, err := txm.buildTx(ctx, msg, 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know what happens to the previous tx(original one before the retry) which this is intending to replace?

Since this is updating the recentBlockHash, I would assume Solana treats this as a brand new Tx, not something which replaces the previous one. So maybe Solana confirms both of these Txs separately?
On EVM, we were sure only 1 would get confirmed, since they shared the same nonce. But here, they don't share anything.

Copy link
Copy Markdown
Contributor Author

@Farber98 Farber98 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This previous response makes sense here: #946 (comment)

We will proceed and Rebroadcast a tx when the first one has already expired and become invalid

Comment thread pkg/solana/txm/txm.go Outdated
}

// sendWithRetry attempts to send a transaction with exponential backoff retry logic.
// It builds, signs and sends the initial tx with a new valid blockhash, and starts a retry routine with fee bumping if needed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented below too, but if you use a new valid blockHash, are you sure Solana would still figure out and treat this as a replacement of the older Tx?
If not, then we may have both of these txs get included.

Copy link
Copy Markdown
Contributor Author

@Farber98 Farber98 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also quoting #946 (comment) here

It won't be a "replacement" for the old TX, but the old TX became invalid in the eyes of validators and won't be picked up after 150 blocks have passed

Comment thread pkg/solana/txm/txm.go
// Retries until context cancelled by timeout or called externally.
// It uses handleRetry helper function to handle each retry attempt.
func (txm *Txm) retryTx(ctx context.Context, msg pendingTx, currentTx solanaGo.Transaction, sigs *signatureList) {
deltaT := 1 // initial delay in ms
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't 1 ms too aggressive?
I assume just a round-trip time to Solana RPC call would be in order of 10s of ms atleast.
So after initial send, we may want to wait atleast 10 ms to give the initial tx some time to get confirmed.

Otherwise if say we have initial deltaT as 1ms, then next exponential backoffs are at 1, 2 ,4 8, 16, 31, 64, 128 ms.

But I see that the confirm logic, which checks Tx confirmation, has a polling that's 500 ms:
ConfirmPollPeriod: config.MustNewDuration(500 * time.Millisecond)

Seems like we will retry many times even before trying to check confirmation once.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was logic that carried over from the existing code for Data Feeds. We chose to keep this the same so we don't introduce any regressions from DF. We could maybe make this configurable in the future especially if we need to reduce RPC calls but seems like this was out of necessity to make sure our transaction gets out in front of the current validator. The multinode integration probably helps a bit with that though so spamming this much maybe isn't needed anymore. I had the same thoughts as you initially but was reluctant to relax the existing code in case it introduced any unforeseen issues.

Copy link
Copy Markdown
Contributor Author

@Farber98 Farber98 Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided here to create a ticket to make this retry configurable. We will keep the default values as they are for now. This will give us the flexibility to adjust them based on the use case. For example, we could relax these configurations for CCIP and make them more aggressive for DF

amit-momin
amit-momin previously approved these changes Dec 11, 2024
Comment thread pkg/solana/txm/txm.go
// rebroadcastExpiredTxs attempts to rebroadcast all transactions that are in broadcasted state and have expired.
// An expired tx is one where it's blockhash lastValidBlockHeight (last valid block number) is smaller than the current block height (block number).
// If any error occurs during rebroadcast attempt, they are discarded, and the function continues with the next transaction.
func (txm *Txm) rebroadcastExpiredTxs(ctx context.Context, client client.ReaderWriter) {
Copy link
Copy Markdown
Collaborator

@aalu1418 aalu1418 Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like there's a potential race condition here

  • tx is broadcast to the network and is included just before blockhash expiration
  • confirmer does not see it yet because it waits for confirmed commitment level
  • TXM marks a base transaction as expired because the block threshold has passed
  • deletes the saved tx id from the in-memory storage, and rebroadcasts a tx with a new blockhash
  • this could go on forever because it can never confirm that a successful tx is confirmed because the tx.id is always deleted before

Copy link
Copy Markdown
Contributor Author

@Farber98 Farber98 Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this be handled gracefully by the existing code?

The processConfirmations method is called before rebroadcastExpiredTxs updating each signature status. Here we would detect if we have transitioned from Broadcasted to Processed.

Then inside rebroadcastExpiredTxs we call ListAllExpiredBroadcastedTxs, where we only consider the ones which have Broadcasted state

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would avoid that scenario since we only call rebroadcastExpiredTxs after processing the latest confirmation status changes. Since this method only processes Broadcasted transaction, the confirmation logic would handle any last minute state changes before we assess transaction for expiry. So in your scenario, we would notice that the transaction advanced to Processed, mark it accordingly in our in-memory storage, then move on to the expiry check. The expiry check would notice there aren't any transactions in Broadcasted state that are expired and just move on.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if a transaction skips the processed phase?
processed is just the view of a single RPC node - so if a transaction can go from processed to non-existent, then it might also be possible to go from non-existent to confirmed

it just feels odd to delete all previous attempts at a transaction and then recreate - rather than just updating the transaction with a new blockhash

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would advance the state from non-existent (internally what we consider Broadcasted) to Confirmed and also avoid rebroadcasting with a new hash in that case. Our state transition methods don't require us to move one step up at a time in case of slow polling or other reasons. We can even jump straight to Finalized. We only delete all previous attempts/sigs if we can confirm that the tx has a non-existent state and its blockhash has expired. I was thinking it was better to clean up those old attempts/sigs than keep them around and waste cycles on status checks on them since they're expired anyways.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - the chances of this race happening might be low

is there logging around a transaction signature that doesn't match any tx id? something that fires if an assumption is broken?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If one of the state transition methods gets a sig that doesn't match to an ID in the internal map, it should return a ErrSigDoesNotExist error. Depending where the method was called, we have some logging to surface an assumption violation. But not in all cases like the confirmation expiration where we can maybe assume a different process cleaned up the signatures.

return &txCopy
}

func TestTxm_ExpirationRebroadcast(t *testing.T) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this tests only runs against a mocked endpoints under ideal behavior - please add additional tests that run against a live validator

Copy link
Copy Markdown
Contributor Author

@Farber98 Farber98 Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! We've found that testing this scenario without mocking is challenging because we don't have a way to force a transaction to expire in a live validator.

The only solution we've identified involves modifying some methods to simulate this behavior for the tests. However, this approach is not ideal, as it would require adding additional flows solely to make the test pass and to trigger a rebroadcast in the live validator tests.

Do you have any ideas on how to achieve this that we might be missing?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could potentially introduce some sort of proxy that takes the transaction and returns a transaction signature, but doesn't allow any more transactions through until it knows the first transaction blockhash has expired

Copy link
Copy Markdown
Contributor Author

@Farber98 Farber98 Dec 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be okay to address this in a separate ticket under the Testing track? So we are able to get this one in before the freeze and finish the re-org work that depends on this PR to complete the implementation track

Copy link
Copy Markdown
Collaborator

@aalu1418 aalu1418 Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does the freeze impact the timeline? afaik that's just for production environments, not development?

also this PR touches significant portions of the most sensitive part of a solana integration, test coverage as the changes are made imo is a must

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just clarifying a few things:

  • there is a production code freeze, but that shouldn't impact development here
  • i'd like to see an attempt at a proper integration tests, but it can be time-boxed to a day or two (if it still doesn't work, then a follow-up ticket is fine)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying, thought the freeze involved not merging new code in chainlink repo, even if it was development 😅. I'll start working on these integration tests

Comment thread pkg/solana/txm/txm.go Outdated
Comment thread pkg/solana/txm/txm_integration_test.go Outdated
Comment thread pkg/solana/txm/txm_integration_test.go Outdated
Comment thread pkg/solana/relay.go Outdated
Comment thread pkg/solana/relay.go Outdated
@cl-sonarqube-production
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants