[NONEVM-706][Solana] - Refactor TXM + Rebroadcast Expired Tx functionality#946
Conversation
```go
id                   string
createTs             time.Time
state                TxState
lastValidBlockHeight uint64 // to track expiration
```
Out of curiosity, what is an invalid block height? Like a reorged out chain?
+1. I assume this is related to using the latest blockhash in the tx and tracking its expiration. But the variable name is confusing: it's unclear whether it tracks a particular block number or just a depth.
In Solana, each transaction includes a "recent blockhash" that expires when that blockhash is no longer "recent enough"
During transaction processing, Solana validators will check if the "recent blockhash" of each transaction is recorded within the 150 most recent stored hashes. If the transaction's "recent blockhash" is older than this maximum processing age, the transaction won't be processed.
To track the expiration of each transaction, we call the RPC method GetLatestBlockhash and store the return value lastValidBlockHeight within our pendingTx object. This value represents the height of the last block at which the returned blockhash value will be valid.
Each slot (the period of time in which a validator can produce a block) lasts between 400 ms and 600 ms. This means transactions can only use a given blockhash for 150 blocks, which is roughly 60 to 90 seconds after creation. After that, the network considers the transaction expired and you'll need a new blockhash to make it go through.
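A minimal sketch of the expiry check described above. The `pendingTx` struct and `isExpired` helper here are hypothetical stand-ins for the TXM's internal types; the real code stores `lastValidBlockHeight` from the `GetLatestBlockhash` RPC response on the pending tx.

```go
package main

import "fmt"

// pendingTx is a hypothetical mirror of the TXM's in-memory record.
// lastValidBlockHeight is captured from the GetLatestBlockhash RPC
// response when the tx is built.
type pendingTx struct {
	id                   string
	lastValidBlockHeight uint64
}

// isExpired reports whether the network will no longer accept the tx's
// blockhash: once the chain moves past lastValidBlockHeight (~150 blocks
// after the blockhash was issued), validators drop the tx, so a
// rebroadcast needs a fresh blockhash.
func isExpired(tx pendingTx, currBlockHeight uint64) bool {
	return currBlockHeight > tx.lastValidBlockHeight
}

func main() {
	tx := pendingTx{id: "abc", lastValidBlockHeight: 1000}
	fmt.Println(isExpired(tx, 1000)) // false: still valid at the boundary
	fmt.Println(isExpired(tx, 1001)) // true: blockhash no longer recent
}
```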
Ah yeah this became clear to me later in the PR review, thanks for clarifying though that makes sense
```diff
-broadcastedTxs          map[string]pendingTx  // transactions that require retry and bumping i.e. broadcasted, processed
 confirmedTxs            map[string]pendingTx  // transactions that require monitoring for re-org
 finalizedErroredTxs     map[string]finishedTx // finalized and errored transactions held onto for status
+broadcastedProcessedTxs map[string]pendingTx  // broadcasted and processed transactions that may require retry and bumping
```
In EVM, we just call these Unconfirmed.
Unconfirmed means tx was successfully broadcasted, but not yet mined into a block.
My preference would be to reuse that same name here too?
Names except Broadcasted were chosen to be consistent with the Solana Commitment Status
I feel it's good to reuse the same names across chains, but I also think Processed is clearer than Unconfirmed in this case. In my opinion, Unconfirmed does not make it clear whether the transaction was Processed or simply Broadcasted
Broadcasted: our TXM sent the tx
Processed: tx was Processed by the network and included in a block
@amit-momin @silaslenihan what do you think about Broadcasted vs Unconfirmed?
I agree with Juan. Unconfirmed might be misleading in the context of Solana since we have this Processed state in between. In EVM, we could consider a broadcasted transaction as Unconfirmed since the next step in the lifecycle is Confirmed
If Solana has these standard terms, then yes, better to align with that. I didn't know.
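A sketch of the lifecycle naming the thread settles on, aligned with Solana commitment levels. The enum below is illustrative only; the real TXM may define additional states (e.g. errored), and the exact type names are assumptions.

```go
package main

import "fmt"

// TxState sketches the tx lifecycle discussed above, keeping Solana's
// own commitment terminology (Processed/Confirmed/Finalized) rather
// than EVM's "Unconfirmed".
type TxState int

const (
	Broadcasted TxState = iota // TXM sent the tx; not yet seen in a block
	Processed                  // included in a block at processed commitment
	Confirmed                  // cluster-confirmed; monitored for re-org
	Finalized                  // rooted; terminal state
)

func (s TxState) String() string {
	return [...]string{"Broadcasted", "Processed", "Confirmed", "Finalized"}[s]
}

func main() {
	fmt.Println(Broadcasted, Processed) // Broadcasted Processed
}
```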
```diff
 }
 // validate id does not exist
-if _, exists := c.broadcastedTxs[tx.id]; exists {
+if _, exists := c.broadcastedProcessedTxs[tx.id]; exists {
```
What is this tx.id? Is it the one the caller provided as an idempotency key?
If this is checking whether that id exists, then we likely need to do better checks across many other statuses.
Yes! If set by caller we use that one. If not set, we generate a random UUID when enqueueing txes.
What checks do you think are missing?
A previously sent tx with the same id from the caller may still be enqueued in the DB and not yet broadcasted, or it may be confirmed, finalizedErrored, or just finalized, right?
We can't assume that a previously used id will only be in this broadcastedProcessedTxs list.
I think that's a fair point. We should check across all of the maps when adding a new transaction to our in-memory store
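The broader idempotency check suggested above could look like the sketch below. The struct fields mirror the maps shown in the diff; the `idExists` helper and its receiver name are assumptions, not the PR's actual code.

```go
package main

import "fmt"

// Hypothetical slices of the TXM's in-memory store; real field and
// type definitions may differ.
type pendingTx struct{ id string }
type finishedTx struct{ id string }

type pendingTxContext struct {
	broadcastedProcessedTxs map[string]pendingTx
	confirmedTxs            map[string]pendingTx
	finalizedErroredTxs     map[string]finishedTx
}

// idExists applies the idempotency check across every state map, not
// just broadcastedProcessedTxs, so a reused id is rejected wherever the
// earlier tx currently sits in its lifecycle.
func (c *pendingTxContext) idExists(id string) bool {
	if _, ok := c.broadcastedProcessedTxs[id]; ok {
		return true
	}
	if _, ok := c.confirmedTxs[id]; ok {
		return true
	}
	_, ok := c.finalizedErroredTxs[id]
	return ok
}

func main() {
	c := &pendingTxContext{
		broadcastedProcessedTxs: map[string]pendingTx{},
		confirmedTxs:            map[string]pendingTx{"a": {id: "a"}},
		finalizedErroredTxs:     map[string]finishedTx{},
	}
	fmt.Println(c.idExists("a")) // true: id already used by a confirmed tx
	fmt.Println(c.idExists("b")) // false: safe to enqueue
}
```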
```go
	return maps.Keys(c.sigToID)
}

// ListAllExpiredBroadcastedTxs returns all txs that are in broadcasted state and have expired for the given block height, compared against their lastValidBlockHeight.
```
nit: could we use blockNumber instead of blockHeight in all these places?
Block number is a more unambiguous term.
I also find it easier to think in terms of block number than block height.
Guys, do you think it makes sense to rename lastValidBlockHeight to lastValidBlockNumber inside our TXM? lastValidBlockHeight is the official name, but lastValidBlockNumber may be easier to work with.
Another alternative would be to add comments wherever we use the term block height, to be explicit that we are referring to a block number.
I'm ok either way but I thought both of those terms referred to the same thing? What's the ambiguity with block height? Since I thought they were the same, I was on the side of using the official name from the RPC response
Yes, if Solana assigns specific meanings to some terms, let's use those.
I didn't know that.
In this case, let's just add a comment saying what it means.
For example: lastValidBlockHeight is the highest block number up to which this tx is valid. After this, the tx gets treated as expired.
```diff
-if initBuildErr != nil {
-	return solanaGo.Transaction{}, "", solanaGo.Signature{}, initBuildErr
+// Build and sign initial transaction setting compute unit price and limit
+initTx, err := txm.buildTx(ctx, msg, 0)
```
Do we know what happens to the previous tx (the original one before the retry) that this is intending to replace?
Since this is updating the recentBlockHash, I would assume Solana treats this as a brand new tx, not something that replaces the previous one. So maybe Solana confirms both of these txs separately?
On EVM, we were sure only one would get confirmed, since they shared the same nonce. But here, they don't share anything.
This previous response makes sense here: #946 (comment)
We will proceed and Rebroadcast a tx when the first one has already expired and become invalid
```go
}

// sendWithRetry attempts to send a transaction with exponential backoff retry logic.
// It builds, signs and sends the initial tx with a new valid blockhash, and starts a retry routine with fee bumping if needed.
```
I commented below too, but if you use a new valid blockHash, are you sure Solana would still figure out and treat this as a replacement of the older Tx?
If not, then we may have both of these txs get included.
Also quoting #946 (comment) here
It won't be a "replacement" for the old TX, but the old TX became invalid in the eyes of validators and won't be picked up after 150 blocks have passed
50dd10f
```go
// Retries until context is cancelled by timeout or called externally.
// It uses the handleRetry helper function to handle each retry attempt.
func (txm *Txm) retryTx(ctx context.Context, msg pendingTx, currentTx solanaGo.Transaction, sigs *signatureList) {
	deltaT := 1 // initial delay in ms
```
Isn't 1 ms too aggressive?
I assume just the round-trip time of a Solana RPC call would be on the order of tens of milliseconds at least.
So after the initial send, we may want to wait at least 10 ms to give the initial tx some time to get confirmed.
Otherwise, with an initial deltaT of 1 ms, the next exponential backoffs are at 1, 2, 4, 8, 16, 32, 64, 128 ms.
But I see that the confirm logic, which checks tx confirmation, polls every 500 ms:
`ConfirmPollPeriod: config.MustNewDuration(500 * time.Millisecond)`
Seems like we will retry many times even before checking confirmation once.
This logic was carried over from the existing Data Feeds code. We chose to keep it the same so we don't introduce any regressions for DF. We could make this configurable in the future, especially if we need to reduce RPC calls, but it seems the aggressive retries were originally necessary to make sure our transaction gets out in front of the current validator. The multinode integration probably helps with that, so spamming this much may no longer be needed. I had the same thoughts as you initially, but was reluctant to relax the existing code in case it introduced any unforeseen issues.
We decided here to create a ticket to make this retry configurable. We will keep the default values as they are for now. This will give us the flexibility to adjust them based on the use case. For example, we could relax these configurations for CCIP and make them more aggressive for DF
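The retry schedule debated above can be sketched as a doubling backoff from a configurable starting delay. The function name, the cap parameter, and the defaults below are assumptions for illustration; the ticket mentioned above would make the real values configurable per use case.

```go
package main

import "fmt"

// backoffDelays returns the first n retry delays (in ms) of an
// exponential schedule: start at initialMs, double each step, and
// clamp at maxMs. With initialMs=1, all eight retries land within
// ~255 ms, i.e. well inside a single 500 ms ConfirmPollPeriod, which
// is the reviewer's concern.
func backoffDelays(initialMs, maxMs, n int) []int {
	delays := make([]int, 0, n)
	d := initialMs
	for i := 0; i < n; i++ {
		delays = append(delays, d)
		d *= 2
		if d > maxMs {
			d = maxMs
		}
	}
	return delays
}

func main() {
	fmt.Println(backoffDelays(1, 500, 8)) // [1 2 4 8 16 32 64 128]
}
```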
```go
// rebroadcastExpiredTxs attempts to rebroadcast all transactions that are in broadcasted state and have expired.
// An expired tx is one whose blockhash's lastValidBlockHeight (last valid block number) is smaller than the current block height (block number).
// If any error occurs during a rebroadcast attempt, it is discarded, and the function continues with the next transaction.
func (txm *Txm) rebroadcastExpiredTxs(ctx context.Context, client client.ReaderWriter) {
```
this feels like there's a potential race condition here

- tx is broadcast to the network and is included just before blockhash expiration
- confirmer does not see it yet because it waits for `confirmed` commitment level
- TXM marks a base transaction as expired because the block threshold has passed
- deletes the saved tx id from the in-memory storage, and rebroadcasts a tx with a new blockhash
- this could go on forever because it can never confirm that a successful tx is confirmed, because the tx.id is always deleted before
Wouldn't this be handled gracefully by the existing code?
The processConfirmations method is called before rebroadcastExpiredTxs, updating each signature status. Here we would detect if we have transitioned from Broadcasted to Processed.
Then inside rebroadcastExpiredTxs we call ListAllExpiredBroadcastedTxs, which only considers the ones in Broadcasted state.
I think we would avoid that scenario since we only call rebroadcastExpiredTxs after processing the latest confirmation status changes. Since this method only processes Broadcasted transaction, the confirmation logic would handle any last minute state changes before we assess transaction for expiry. So in your scenario, we would notice that the transaction advanced to Processed, mark it accordingly in our in-memory storage, then move on to the expiry check. The expiry check would notice there aren't any transactions in Broadcasted state that are expired and just move on.
what happens if a transaction skips the processed phase?
processed is just the view of a single RPC node - so if a transaction can go from processed to non-existent, then it might also be possible to go from non-existent to confirmed
it just feels odd to delete all previous attempts at a transaction and then recreate - rather than just updating the transaction with a new blockhash
We would advance the state from non-existent (internally what we consider Broadcasted) to Confirmed and also avoid rebroadcasting with a new hash in that case. Our state transition methods don't require us to move one step up at a time in case of slow polling or other reasons. We can even jump straight to Finalized. We only delete all previous attempts/sigs if we can confirm that the tx has a non-existent state and its blockhash has expired. I was thinking it was better to clean up those old attempts/sigs than keep them around and waste cycles on status checks on them since they're expired anyways.
ok - the chances of this race happening might be low
is there logging around a transaction signature that doesn't match any tx id? something that fires if an assumption is broken?
If one of the state transition methods gets a sig that doesn't match an ID in the internal map, it should return an ErrSigDoesNotExist error. Depending on where the method was called, we have some logging to surface an assumption violation, but not in all cases, like the confirmation expiration where we can maybe assume a different process cleaned up the signatures.
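The filter at the heart of this thread can be sketched as below: only txs still in Broadcasted state and past their lastValidBlockHeight qualify for rebroadcast, so anything the confirmer has already advanced to Processed (or beyond) is skipped. Type and function names here are assumptions modeled on the discussion, not the PR's exact code.

```go
package main

import "fmt"

type txState int

const (
	broadcasted txState = iota // sent, not yet seen in any block
	processed                  // seen at processed commitment
)

type pendingTx struct {
	id                   string
	state                txState
	lastValidBlockHeight uint64
}

// listAllExpiredBroadcastedTxs mirrors ListAllExpiredBroadcastedTxs:
// because processConfirmations runs first and advances states, a tx
// that was included at the last moment is no longer Broadcasted and
// will not be rebroadcast, which addresses the race described above.
func listAllExpiredBroadcastedTxs(txs []pendingTx, currHeight uint64) []pendingTx {
	var expired []pendingTx
	for _, tx := range txs {
		if tx.state == broadcasted && tx.lastValidBlockHeight < currHeight {
			expired = append(expired, tx)
		}
	}
	return expired
}

func main() {
	txs := []pendingTx{
		{id: "a", state: broadcasted, lastValidBlockHeight: 90},  // expired: rebroadcast
		{id: "b", state: processed, lastValidBlockHeight: 90},    // advanced: skip
		{id: "c", state: broadcasted, lastValidBlockHeight: 200}, // still valid: skip
	}
	for _, tx := range listAllExpiredBroadcastedTxs(txs, 100) {
		fmt.Println(tx.id) // prints only "a"
	}
}
```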
```go
	return &txCopy
}

func TestTxm_ExpirationRebroadcast(t *testing.T) {
```
this test only runs against mocked endpoints under ideal behavior - please add additional tests that run against a live validator
Hey! We've found that testing this scenario without mocking is challenging because we don't have a way to force a transaction to expire in a live validator.
The only solution we've identified involves modifying some methods to simulate this behavior for the tests. However, this approach is not ideal, as it would require adding additional flows solely to make the test pass and to trigger a rebroadcast in the live validator tests.
Do you have any ideas on how to achieve this that we might be missing?
you could potentially introduce some sort of proxy that takes the transaction and returns a transaction signature, but doesn't allow any more transactions through until it knows the first transaction blockhash has expired
Would it be okay to address this in a separate ticket under the Testing track? So we are able to get this one in before the freeze and finish the re-org work that depends on this PR to complete the implementation track
why does the freeze impact the timeline? afaik that's just for production environments, not development?
also, this PR touches significant portions of the most sensitive part of a solana integration; imo test coverage as the changes are made is a must
just clarifying a few things:
- there is a production code freeze, but that shouldn't impact development here
- i'd like to see an attempt at a proper integration tests, but it can be time-boxed to a day or two (if it still doesn't work, then a follow-up ticket is fine)
Thanks for clarifying, thought the freeze involved not merging new code in chainlink repo, even if it was development 😅. I'll start working on these integration tests
Description

- Refactor `confirm` and `sendWithRetry` funcs
- Add a new `TxExpirationRebroadcast` togglable routine: in the `confirm` loop, after checking confirmations, we determine whether we need to rebroadcast expired transactions that were not processed yet. Rebroadcasting a transaction calls `sendWithRetry` directly without enqueuing.

https://smartcontract-it.atlassian.net/browse/NONEVM-706

prev branch: #928

Merge together with:

Soak Tests: