netcom: Fix NPE in establishConnection when peer is null #479
chrstnwhlrt wants to merge 1 commit into LINBIT:master
Signed-off-by: Christian Wohlert <wohlert@appbase.hamburg>
I would actually also be interested in where the peer attachment gets removed, because as far as I remember, the NullPointerExceptions had been seen before, but a code section that removes the peer attachment was never found anywhere.
@raltnoeder After deeper analysis, I believe the root cause is an unprotected […]. Line 467 […]. The race condition window is small (between […]). My PR fixes the secondary NPE in the error-reporting code, but the actual root cause seems to be Line 505 missing […].
That seems very plausible. I was even surprised to see the selectNow call in this place, because I wrote the original implementation many years ago, and it had exactly one select call and one selectNow call, for the reasons laid out in the comments. I remember looking for cases where the attachment would be removed, and did not find any, and the original two select/selectNow and associated synchronization seemed to make sense.
Thanks for confirming the analysis! To summarize: this PR only addresses the secondary NPE in error reporting. It makes the logging more robust when the race condition occurs, but doesn't fix the root cause. If you'd like, I can take a look at properly synchronizing the selectNow() call at line 505 to match the pattern at line 467. Just let me know if that would be helpful or if you prefer to handle it internally. I found this race condition only in my home lab, on a Raspberry Pi 4 cluster that is really slow compared to a proper deployment.
@raltnoeder Let me know if you need me to change anything to get this merged |
Problem

When establishConnection() is called with a SelectionKey that has no attached Peer object, the code attempts to extract diagnostic information from the channel's addresses. However, channel.getLocalAddress() and channel.getRemoteAddress() can return null, causing a NullPointerException at line 1003.

This can occur during a race condition between connection cleanup and reconnect attempts, where the peer attachment is removed while an OP_CONNECT event is still pending in the selector.

The unhandled NPE kills the entire SslConnector thread, preventing all further outbound SSL connections until the controller is restarted.

Fix

Replace the try-catch (ClassCastException) pattern with instanceof checks, which properly handle both null values and non-InetSocketAddress types.
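
To illustrate why instanceof covers both cases: in Java, `null instanceof T` is always false, so a single check rules out null and foreign SocketAddress subtypes before any cast or method call happens. The following is a minimal standalone sketch of that idea, not the code from this PR's diff; the class name AddressDescriber, the method describe, the "<unknown>" placeholder, and the port number are all hypothetical.

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;

public class AddressDescriber {
    /**
     * Builds a printable description of a socket address for diagnostics.
     * Hypothetical helper: never throws for null or for non-Inet address types.
     */
    static String describe(SocketAddress address) {
        String text;
        if (address instanceof InetSocketAddress) {
            // Safe cast: instanceof is false for null and for other subtypes
            InetSocketAddress inetAddress = (InetSocketAddress) address;
            text = inetAddress.getHostString() + ":" + inetAddress.getPort();
        } else if (address != null) {
            // Some other SocketAddress implementation: fall back to toString()
            text = address.toString();
        } else {
            // getLocalAddress() / getRemoteAddress() may return null
            text = "<unknown>";
        }
        return text;
    }

    public static void main(String[] args) {
        // Arbitrary example endpoint; createUnresolved avoids any DNS lookup
        System.out.println(describe(InetSocketAddress.createUnresolved("127.0.0.1", 12345)));
        System.out.println(describe(null));
    }
}
```

With a try-catch (ClassCastException) approach, a null address passes the cast without error and only fails later when a method is called on it, which is why the catch block never sees it.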