Now that my poster is finished, I am taking one last crack at getting CESM to run. Last time I wrote, I mentioned that the model execution was failing without giving any error messages (except the occasional “Segmentation fault”).
Michael Tobis thought that the problem had to do with mpiexec, so today I tried something new. I uninstalled mpich2 and replaced it with openmpi, which I had built manually (as opposed to installing via apt-get). Now, when the model fails, the ccsm.log file actually says something:
mpiexec noticed that process rank 0 with PID 1846 on node computer name
exited on signal 11 (Segmentation fault).
15 total processes killed (some possibly by mpiexec during cleanup)
Perhaps the problem is still with MPI. It seems unlikely that the segfault is due to a problem with the code itself (e.g. an undeclared variable), seeing as this version has been tested and used by NCAR. Maybe gcc is the issue, and I should play around with some compiler flags? Any suggestions would be welcome.
I take it you never did get in with gdb?
I’m working on it – currently slogging through manpages. -Kate
Update: I got gdb to run ccsm.exe, and this is what it told me:
Program received signal SIGSEGV, Segmentation fault.
0x00000001 in ?? ()
When I did the backtrace, it told me the same thing, but with a #0 before the pointer. What exactly does this mean? I have very little experience with pointers. -Kate
It means that you’re dealing with memory corruption.
I suspect that what happened is that there is a function which overwrote the stack with zeros, and then returned. When it returned, the zeros on the stack got popped off and placed in the registers used for tracking the location of the current stack frame and the program counter, and the computer then tried to execute code at address 0. You can’t actually do that (since you don’t have access to that memory), so you got a SIGSEGV and crashed.
Here’s what you do next:
Set breakpoints before you start executing the program. Pick locations during its initialization and during its run. When you hit one, note it, and then enter c (for continue) at the (gdb) prompt. When you crash, you’ll know that the crash happened between the last breakpoint you hit and the next one in the execution flow. You can then use binary search to narrow down where the crash is.
If you know the code reasonably well, you won’t have to iterate the binary search too many times, and you’ll be able to see where the crash is happening, get an understanding of why, and then address it.
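For anyone following along, that workflow might look roughly like this at the command line (a sketch only – the line numbers here are made up, and you would substitute breakpoints at real initialization and run-phase locations in the model source):

```shell
# Queue up some breakpoints in a gdb command file; the specific
# locations (ccsm_driver.f90 lines 50 and 120) are placeholders
# for real spots in the model's init and main loop.
cat > gdb_breaks.cmd <<'EOF'
break ccsm_driver.f90:50
break ccsm_driver.f90:120
run
EOF

# Then launch the model under gdb with those commands, type 'c' at
# each breakpoint, and 'bt' after the crash (uncomment to run for real):
#   gdb -x gdb_breaks.cmd ./ccsm.exe
```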
I haven’t worked directly with the CESM, but in the past I’ve worked a lot with the CCSM3.0 and the CAM3.5, as well as other atmosphere dynamical cores and models, so let me just say, I can sympathize with your frustration here! From my experience, before you go tweaking compiler flags, you need to try to pinpoint where the model is failing. A good way to do this is by liberally sprinkling PRINT statements like:
print *, "nice I made it to line xyz in file abc.f90"
Start at the very top of the code, and run your model in serial, trying to pinpoint exactly where the model is failing. Sometimes, there’s just an odd difference between the compiler you’re using and what the developers at NCAR used. For instance, I’ve been working with the pairwise homogenization algorithm used at NCDC with the USHCNv2 dataset this summer, and switching from their machines to compiling with gfortran on my own Ubuntu install necessitated a few tweaks to the original code; just kind of unavoidable sometimes.
Nevertheless, Fortran debugging output *sucks*, and using PRINT statements to capture bad sections of code is sometimes all you can do.
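One small convenience with this approach (a sketch, assuming the model’s stdout ends up in something like ccsm.log, and using a made-up checkpoint format): give every print a common prefix, then pull out the last checkpoint reached after a crash.

```shell
# Simulate a log where the model printed checkpoints before dying;
# in real use, ccsm.log would come from the model run itself.
cat > ccsm.log <<'EOF'
checkpoint: entered ccsm_driver
checkpoint: reached line 120 in abc.f90
Segmentation fault
EOF

# The last checkpoint line shows roughly how far the run got:
grep 'checkpoint:' ccsm.log | tail -n 1
# -> checkpoint: reached line 120 in abc.f90
```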
I put a print statement right after the variable declarations in ccsm_driver.f90, which is the top-level program, but it was never printed. So it looks like none of the files even started running. I think it’s an MPI issue. Now, to find a test .exe file… -Kate
I have no idea if playing with the compiler flags will help with this problem, but it’s bound to amuse you for an afternoon and will provide you with a lifetime supply of reasons why program A won’t work with program B when you’re moaning about it over coffee. ABI, fpstrict, -Wall, -ansi, -O3, omit-frame-pointer, pragma-pascal, etc, etc.
Does anyone have a handy link on how to install this or other models on a desktop under Linux, Windows, or Mac?
I blogged about my experiences installing Model E (which worked) and CESM (still ongoing, but progress made) on Ubuntu here. My supervisor Steve managed to get CESM running on his MacBook. He took notes here – scroll down to the comments for more info. I don’t know of any GCMs that would run on Windows (unless you count EdGCM), as a large part of their infrastructure code is written in UNIX shell scripts. -Kate
Did you try just replacing the model executable with “Hello, world” to check you get 16*”Hello, world”? That would at least confirm that your mpich is fine.
That works okay. So the problem seems to be with CESM. -Kate
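For reference, the kind of minimal MPI check described above might look something like this (a sketch, assuming an MPI wrapper compiler such as mpicc is on the path; the file name is arbitrary):

```shell
# Write a minimal MPI "Hello, world" in C; each rank prints one line,
# so launching 16 processes should print 16 greetings if the MPI
# installation itself is healthy.
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello, world from rank %d\n", rank);
    MPI_Finalize();
    return 0;
}
EOF

# Compile and run (uncomment on a machine with MPI installed):
#   mpicc hello_mpi.c -o hello_mpi
#   mpiexec -n 16 ./hello_mpi
```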
In that case, your next step could indeed be to stuff some PRINT statements very near the start. Complex climate models (at least HadCM3 was) are very sensitive to the exact compiler, machine, and even flavour of Linux they run on. It isn’t hard to get them to fall over in a hideous heap with the wrong options.
If your mpich is OK, but your code/compilation is broken (and it isn’t an interaction between the two), then another useful trick is simply to run a 1-node version of the code directly by hand at the prompt, without mpich. Then if it goes wrong you have some chance of seeing where (but you’ll have the pain of hacking the start-up scripts; you usually want to take the real scripts, and just before the mpirun, dump the whole env etc. to a file so you can run interactively).
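That env-dumping trick might look roughly like this (a sketch; the file names are placeholders, and the snippet would go into the real run script just before its MPI launch line):

```shell
# Inside the model's run script, just before the mpirun/mpiexec line,
# capture the environment and working directory so the same setup can
# be reproduced by hand in an interactive shell later:
env | sort > run_environment.txt
pwd > run_directory.txt

# Later, interactively, re-export the saved environment (works for
# simple VAR=value lines) and run a single-task model directly,
# bypassing the MPI launcher entirely (uncomment to use):
#   while read -r line; do export "$line"; done < run_environment.txt
#   cd "$(cat run_directory.txt)"
#   ./ccsm.exe
```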