Long experiments crashing
SUMMARY
Several crashes have been reported in GLOBO runs that extend beyond the S2S time window. The problem is intermittent and shows up at multiple resolutions and with different compilers (ifort and gfortran). The crash surfaces in the radiation code, but it can be associated with NaN values emerging at the surface from the pole and propagating everywhere.
Investigation shows that surface pressure and its tendency suddenly become NaN.
We were able to reproduce the same crash both in the most recent branches (devel/coupling and devel/netcdf) and in the older version (main), which suggests that it does not depend on our recent code changes.
Changing compilers and MPI libraries also seems to affect the crash, but we speculate that this merely shifts the moment at which the code crashes. Reducing the optimization level at compile time seems to extend the length of the experiments, but our feeling is that this is just due to random changes in the numerical evolution of the model.
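To pin down when and where the NaN values first appear, a per-step check on surface pressure and its tendency may help. The following is only a minimal offline sketch in Python, not code from GLOBO: the function names (`first_nan`, `nan_report`), the field names (`ps`, `dps_dt`) and the assumption that the first and last latitude rows are the polar rows are all hypothetical and would need to be adapted to the actual model output.

```python
import math


def first_nan(field):
    """Return (row, col) of the first NaN in a 2D field, or None if there is none."""
    for j, row in enumerate(field):
        for i, value in enumerate(row):
            if math.isnan(value):
                return (j, i)
    return None


def nan_report(step, ps, dps_dt):
    """Summarise the NaN status of surface pressure and its tendency at one step.

    ps and dps_dt are 2D fields (lists of latitude rows) on the same grid.
    Assumption: rows 0 and nlat-1 are the polar rows of the grid.
    """
    nlat = len(ps)
    ps_hit = first_nan(ps)
    tend_hit = first_nan(dps_dt)
    # Use whichever field shows a NaN first; surface pressure takes priority.
    hit = ps_hit if ps_hit is not None else tend_hit
    at_pole = hit is not None and hit[0] in (0, nlat - 1)
    return {"step": step, "ps": ps_hit, "dps_dt": tend_hit, "at_pole": at_pole}
```

Calling `nan_report` on each saved time step and stopping at the first step where `ps` or `dps_dt` is not `None` would give the onset time and tell whether the first NaN sits on a polar row, which is the behaviour described above.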
BACKTRACE
The crash occurs as a segfault with ifort and as an invalid memory reference with gfortran; both backtraces point to the RRTM routines of the radiation code.
- GFORTRAN backtrace
Backtrace for this error:
#0 0x7f3a059188b0 in ???
#1 0x7f3a05917ae3 in ???
#2 0x7f3a0557283f in ???
#3 0x55a085722efd in rrtm_gasabs1a_140gp_
at /work/users/davini/globone/sources/rad-ecmwf-old/source/rrtm_gasabs1a_140gp.F90:134
#4 0x55a0856d7aa7 in rrtm_rrtm_140gp_
at /work/users/davini/globone/sources/rad-ecmwf-old/source/rrtm_rrtm_140gp.F90:237
#5 0x55a6f17204ec in radlsw_
at /work/users/davini/globone/sources/rad-ecmwf-old/source/radlsw.F90:1184
#6 0x55a0854c461c in radintec_
at /work/users/davini/globone/sources/globo/bolam.F90:11993
#7 0x55a085507de7 in bolam
at /work/users/davini/globone/sources/globo/bolam.F90:1294
#8 0x55a085489afe in main
at /work/users/davini/globone/sources/globo/bolam.F90:174
- IFORT backtrace
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
globo_cntr 0000000000776FDA Unknown Unknown Unknown
libpthread-2.28.s 00007FC5E7DAA730 Unknown Unknown Unknown
globo_cntr 00000000004B8D04 rrtm_taumol3_ 148 rrtm_taumol3.F90
globo_cntr 00000000006ABE6B rrtm_gasabs1a_140 81 rrtm_gasabs1a_140gp.F90
globo_cntr 000000000063172E rrtm_rrtm_140gp_ 234 rrtm_rrtm_140gp.F90
globo_cntr 000000000060CF73 radlsw_ 1177 radlsw.F90
globo_cntr 000000000046BD8E radintec_ 11994 bolam.F90
globo_cntr 00000000004136F4 MAIN__ 1294 bolam.F90
globo_cntr 0000000000404962 Unknown Unknown Unknown
EXPERIMENTS CRASH
Resolution | Vertical Levels | Timestep | Extra | Survived for... |
---|---|---|---|---|
KM078 (514x362) | L70 | 300s | default S2S | 61 days |
KM078 (514x362) | L70 | 300s | AMIP | 87 days |
KM078 (514x362) | L70 | 150s | | 120 days |
KM078 (514x362) | L50 | 300s | AMIP | 239 days |
KM156 (258x182) | L70 | 450s | | 296 days |
KM156 (258x182) | L50 | 450s | | 270 days |
KM156 (258x182) | L50 | 450s | anu2=anu2v=0.2 | 290 days |
KM156 (258x182) | L50 | 300s | | At least 5 years |
KM156 (258x182) | L50 | 300s | AMIP | 197 days |
KM156 (258x182) | L70 | 300s | | At least 5 years |
KM312 (130x92) | L70 | 600s | | 30 days |
KM312 (130x92) | L70 | 450s | | 3 years |
KM312 (130x92) | L50 | 450s | AMIP | 243 days |
Experiments performed by Piero Malguzzi on Ottovolante, with fixed SST (no seasonal cycle):
Resolution | Vertical Levels | Timestep | Extra | Survived for... |
---|---|---|---|---|
386x266 | L50 | 360s | Marginally stable run | **At least 1 year** |
386x266 | L70 | 480s | sigma thickness const. / ice thick. 0.7 | **At least 1 year** |