
hotfix - safeguard invalid scalar indices to prevent segmentation faults during LBC/IC generation #263

Open
guoqing-noaa wants to merge 1 commit into ufs-community:noaa/develop from guoqing-noaa:segfault_hotfix4ufs


guoqing-noaa (Collaborator) commented Jun 19, 2026

@keenan and I recently hit the same lbc task crash on Hera. All lbc files were generated correctly, but the task then failed with a segmentation fault:

+2s exrrfs_lbc.sh[101]: srun ./init_atmosphere_model.x
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread-2.28.s  000014DB92A91990  Unknown               Unknown  Unknown
libc-2.28.so       000014DB923C2811  cfree                 Unknown  Unknown
libpnetcdf.so.4.0  000014DB9D9D5198  for_deallocate_ha     Unknown  Unknown
init_atmosphere_m  00000000006491D8  Unknown               Unknown  Unknown
init_atmosphere_m  0000000000658966  Unknown               Unknown  Unknown
init_atmosphere_m  000000000065877A  Unknown               Unknown  Unknown
init_atmosphere_m  0000000000643855  Unknown               Unknown  Unknown
init_atmosphere_m  00000000005B38D5  Unknown               Unknown  Unknown
init_atmosphere_m  0000000000408FC0  Unknown               Unknown  Unknown
init_atmosphere_m  0000000000408BC6  Unknown               Unknown  Unknown
init_atmosphere_m  0000000000408B3D  Unknown               Unknown  Unknown
libc-2.28.so       000014DB92361865  __libc_start_main     Unknown  Unknown
init_atmosphere_m  0000000000408A5E  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred

This looks similar to behavior we saw before when running the model, where the executable could not terminate cleanly.

I tried a debug build and identified that the crash happened at line 8577 of mpas_init_atm_cases.F:

+2s exrrfs_lbc.sh[101]: srun ./init_atmosphere_model.x
forrtl: severe (408): fort: (3): Subscript #1 of the array SCALARS has value -1 which is less than the lower bound of 1

Image              PC                Routine            Line        Source
init_atmosphere_m  000000000081DDE8  init_atm_cases_mp        8577  mpas_init_atm_cases.F
init_atmosphere_m  0000000000556207  init_atm_cases_mp         404  mpas_init_atm_cases.F
init_atmosphere_m  000000000054C788  init_atm_core_mp_          92  mpas_init_atm_core.F
init_atmosphere_m  000000000040E70C  mpas_subdriver_mp         417  mpas_subdriver.F
init_atmosphere_m  00000000004085CD  MAIN__                     20  mpas.F
init_atmosphere_m  00000000004084FD  Unknown               Unknown  Unknown
libc-2.28.so       000014ED04092865  __libc_start_main     Unknown  Unknown
init_atmosphere_m  000000000040841E  Unknown               Unknown  Unknown
forrtl: severe (408): fort: (3): Subscript #1 of the array SCALARS has value -1 which is less than the lower bound of 1

and lines 8577-8581 read as follows:

        scalars(index_nc,:,iCell) = 0._RKIND
        scalars(index_ni,:,iCell) = 0._RKIND
        scalars(index_nr,:,iCell) = 0._RKIND
        scalars(index_ns,:,iCell) = 0._RKIND
        scalars(index_ng,:,iCell) = 0._RKIND

This suggests that index_nc having a value of -1 caused the crash.

In src/core_init_atmosphere/Registry.xml, when lbc_hydrometeors_gfs is true and lbc_hydrometeors_rrfs is false, only lbc_qv, lbc_qc, lbc_qr, lbc_qi, lbc_qs, and lbc_qg are available; lbc_nc, lbc_ni, and the other variables that belong only to lbc_hydrometeors_rrfs are not. Their indexes (index_nc, etc.) are therefore -1.

                <var_array name="lbc_scalars" type="real" dimensions="nVertLevels nCells Time" packages="lbcs">
                        <var name="lbc_qv" array_group="moist" units="kg kg^{-1}"
                             description="Water vapor mixing ratio on lateral boundary cells"/>

                        <var name="lbc_qc" array_group="moist" units="kg kg^{-1}"
                             description="Cloud water mixing ratio on lateral boundary cells"
                             packages="lbc_hydrometeors_rrfs;lbc_hydrometeors_gfs"/>

                        <var name="lbc_qr" array_group="moist" units="kg kg^{-1}"
                             description="Rain water mixing ratio on lateral boundary cells"
                             packages="lbc_hydrometeors_rrfs;lbc_hydrometeors_gfs"/>

                        <var name="lbc_qi" array_group="moist" units="kg kg^{-1}"
                             description="Lateral boundary tendency of ice mixing ratio"
                             packages="lbc_hydrometeors_rrfs;lbc_hydrometeors_gfs"/>

                        <var name="lbc_qs" array_group="moist" units="kg kg^{-1}"
                             description="Lateral boundary tendency of snow mixing ratio"
                             packages="lbc_hydrometeors_rrfs;lbc_hydrometeors_gfs"/>

                        <var name="lbc_qg" array_group="moist" units="kg kg^{-1}"
                             description="Lateral boundary tendency of graupel mixing ratio"
                             packages="lbc_hydrometeors_rrfs;lbc_hydrometeors_gfs"/>

                        <var name="lbc_qh" array_group="moist"  units="kg kg^{-1} s^{-1}"
                             description="Lateral boundary tendency of hail mixing ratio"
                             packages="lbc_hydrometeors_rrfs"/>

                        <var name="lbc_nr" array_group="number" units="kg^{-1}"
                             description="Lateral boundary tendency of rain number concentration"
                             packages="lbc_hydrometeors_rrfs"/>

                        <var name="lbc_ni" array_group="number" units="kg^{-1}"
                             description="Lateral boundary tendency of ice number concentration"
                             packages="lbc_hydrometeors_rrfs"/>

                        <var name="lbc_nc" array_group="number" units="kg^{-1}"
                             description="Lateral boundary tendency of droplet number concentration"
                             packages="lbc_hydrometeors_rrfs"/>
......
......
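
To make the -1 concrete, here is a toy sketch (not MPAS code) of a name-to-index lookup that returns -1 for any constituent that is not active. The function `find_index` and the hard-coded `active` list are hypothetical and only mirror the case where lbc_hydrometeors_gfs is the sole active hydrometeor package.

```fortran
program index_lookup_sketch
   implicit none
   ! Hypothetical list: constituents active when only lbc_hydrometeors_gfs is on.
   character(len=6), parameter :: active(6) = [character(len=6) :: &
      'lbc_qv', 'lbc_qc', 'lbc_qr', 'lbc_qi', 'lbc_qs', 'lbc_qg']

   print '(a,i0)', 'index of lbc_qc = ', find_index('lbc_qc')
   print '(a,i0)', 'index of lbc_nc = ', find_index('lbc_nc')

contains

   ! Return the 1-based position of name in active, or -1 if it is absent.
   integer function find_index(name) result(idx)
      character(len=*), intent(in) :: name
      integer :: i
      idx = -1
      do i = 1, size(active)
         if (trim(active(i)) == name) then
            idx = i
            return
         end if
      end do
   end function find_index

end program index_lookup_sketch
```

In this sketch lbc_nc is absent from the list, so its lookup yields -1, which is not a valid first subscript for an array whose lower bound is 1.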

This PR provides a hotfix by checking if (index_nc > 0) before setting scalars(index_nc,:,iCell) = 0._RKIND.
Similar safeguards are applied to IC generation where index_nwfa and index_nifa are only available for specific packages.
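
As an illustration of the guard described above, here is a minimal, self-contained sketch. It is not the actual diff: the array sizes, the index values, and the helper `zero_if_present` are assumptions made for the example, whereas the PR describes an inline `if (index_nc > 0)` check.

```fortran
program guarded_zeroing_sketch
   implicit none
   integer, parameter :: RKIND = selected_real_kind(12)
   ! Hypothetical dimensions, chosen only to keep the example small.
   integer, parameter :: nScalars = 6, nVertLevels = 3, nCells = 2
   real(kind=RKIND) :: scalars(nScalars, nVertLevels, nCells)
   ! Hypothetical index values: -1 marks a constituent whose package is inactive.
   integer :: index_qc = 2
   integer :: index_nc = -1, index_ni = -1, index_nr = -1
   integer :: iCell

   scalars = 1.0_RKIND

   do iCell = 1, nCells
      call zero_if_present(index_qc, iCell)   ! valid index: slice is zeroed
      call zero_if_present(index_nc, iCell)   ! -1: skipped, no out-of-bounds write
      call zero_if_present(index_ni, iCell)
      call zero_if_present(index_nr, iCell)
   end do

   print '(a,f6.1)', 'sum after guarded zeroing = ', sum(scalars)

contains

   ! Zero one constituent for one cell, but only if its index is valid.
   subroutine zero_if_present(idx, cell)
      integer, intent(in) :: idx, cell
      if (idx > 0) scalars(idx, :, cell) = 0.0_RKIND
   end subroutine zero_if_present

end program guarded_zeroing_sketch
```

Only the slice for the valid index is modified; assignments for indexes equal to -1 are skipped instead of writing outside the array.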

I don't have a full answer for why this crashes only on Hera and not on other machines.
One possible explanation is that memory/heap layout differs between machines: on other systems the out-of-bounds address may still fall within a valid memory page, so no SIGSEGV occurs and the program finishes "successfully".

Mandatory Questions

  • Does this PR include any additions or changes to external inputs (e.g., microphysics lookup tables, static data for gravity-wave drag -- things like that)?
    • no
  • Does this PR require updating one or more baselines for the CI tests? If so, what?
    • no

Reviews

  • Is this PR currently draft, still being worked on, or ready for review?
    • ready for review
  • For when this PR begins review, please list the developers/collaborators you'd like to prioritize for review; e.g.,:

