Skip to content

netcom: Fix NPE in establishConnection when peer is null#479

Open
chrstnwhlrt wants to merge 1 commit intoLINBIT:masterfrom
chrstnwhlrt:master
Open

netcom: Fix NPE in establishConnection when peer is null#479
chrstnwhlrt wants to merge 1 commit intoLINBIT:masterfrom
chrstnwhlrt:master

Conversation

@chrstnwhlrt
Copy link
Copy Markdown

Problem

When establishConnection() is called with a SelectionKey that has no attached Peer object, the code attempts to extract diagnostic information from the channel's addresses. However, channel.getLocalAddress() and channel.getRemoteAddress() can return null, causing a NullPointerException at line 1003.

This can occur during a race condition between connection cleanup and reconnect attempts, where the peer attachment is removed while an OP_CONNECT event is still pending in the selector.

The unhandled NPE kills the entire SslConnector thread, preventing all further outbound SSL connections until the controller is restarted.

Fix

Replace the try-catch (ClassCastException) pattern with instanceof checks, which properly handle both null values and non-InetSocketAddress types.

Signed-off-by: Christian Wohlert <wohlert@appbase.hamburg>
@raltnoeder
Copy link
Copy Markdown
Member

I would actually also be interested in where the peer attachment gets removed, because as far as I remember, the NullPointerExceptions had been seen before, but a code section that removes the peer attachment was never found anywhere.
According to documentation, the only way to remove an attachment would be to call attach(null), and there seem to be only two attach calls on SelectionKey in all of LINSTOR's code (one in acceptConnection, which happens in the same thread as the one that selects, and the other one in connect, which synchronizes with the thread that selects and then performs actions based on what was selected), and both of them attach a Peer object that was just created in the code line directly above.

@chrstnwhlrt
Copy link
Copy Markdown
Author

@raltnoeder After deeper analysis, I believe the root cause is an unprotected selectNow():

Line 467 selectNow() is inside synchronized(syncObj) with a comment explaining why this is necessary.
Line 505 selectNow() is NOT synchronized - this appears to be an oversight.

The race condition window is small (between register() and attach()), but with fast connection failures (RST) and the selector thread calling selectNow() at high frequency, it can happen.

My PR fixes the secondary NPE in the error-reporting code, but the actual root cause seems to be Line 505 missing synchronized(syncObj).

@raltnoeder
Copy link
Copy Markdown
Member

That seems very plausible. I was even surprised to see the selectNow call in this place, because I wrote the original implementation many years ago, and it had exactly one select call and one selectNow call, for the reasons laid out in the comments. I remember looking for cases where the attachment would be removed, and did not find any, and the original two select/selectNow and associated synchronization seemed to make sense.
However, this unsynchronized selectNow call could definitely race with the register calls for outbound connections, where a new SelectionKey is registered and the peer attached afterwards within a synchronized block by another thread.
Thank you for your contribution!

@chrstnwhlrt
Copy link
Copy Markdown
Author

Thanks for confirming the analysis!

To summarize: this PR only addresses the secondary NPE in error reporting - it makes the logging more robust when the race condition occurs, but doesn't fix the root cause.

If you'd like, I can take a look at properly synchronizing the selectNow() call at line 505 to match the pattern at line 467. Just let me know if that would be helpful or if you prefer to handle it internally.

I found this race condition only in my home lab using a rpi 4 cluster which is really slow (compared to a proper deployment)..

@chrstnwhlrt
Copy link
Copy Markdown
Author

@raltnoeder Let me know if you need me to change anything to get this merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants