I have made slight headway regarding my installation of CESM. It still isn’t running, but now it’s not running for a different reason than previously! Progress!
It appears that, at some point while porting, I mangled the scripts/ccsm_utils/Machines/mkbatch.kate file for my machine such that the actual call to launch the model wasn’t getting copied from mkbatch.kate to test.kate.run. A bit of trial and error fixed that problem.
I finally got Torque working. The only reason that jobs were getting stuck in the queue was that I didn’t start the pbs_sched daemon! It turns out that qsub isn’t related to the problems I was having, and isn’t necessary to run the model, but it’s nice to have it working just in case I need it in the future.
So, with the relevant call in test.kate.run as
mpiexec -n 16 ./ccsm.exe >&! ccsm.log.$LID
the command line output is
Wed July 6 11:02:33 EDT 2011 -- CSM EXECUTION BEGINS HERE
Wed July 6 11:02:34 EDT 2011 -- CSM EXECUTION HAS FINISHED
ls: No match.
Model did not complete - no cpl.log file present - exiting
The only log file created is ccsm.log, and it is completely empty.
I have MPICH2 installed, the command mpiexec seems to work fine, and I have mpd running. Regardless, I tried taking out mpiexec and calling the executable directly in test.kate.run:
./ccsm.exe >&! ccsm.log.$LID
The command line output becomes
Wed July 6 11:02:33 EDT 2011 -- CSM EXECUTION BEGINS HERE
Segmentation fault.
Wed July 6 11:02:34 EDT 2011 -- CSM EXECUTION HAS FINISHED
ls: No match.
Model did not complete - no cpl.log file present - exiting
Again, ccsm.log is empty, and there seems to be no trace of why the model is failing to launch beyond Segmentation fault. The CESM guide recommends setting the stack size to unlimited, which I did to no avail. Submitting test.kate.run using qsub produces the same messages, but in the output and error files, rather than the terminal.
Thoughts?
A segmentation fault means that a program tried to access memory it isn’t supposed to, say because of a pointer arithmetic bug. Depending on your ulimit settings, it may produce a core file which, if the program was compiled with debug symbols, you could open up to try and understand exactly why.
ls: No match.
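means that in csh, somebody did:
ls 〈wildcard pattern〉
where the pattern (a shell glob, not strictly a regular expression) didn’t match any files. csh expands the pattern itself and reports the failure before ls ever runs.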
I’m pretty sure that ls was looking for the cpl.log file, which was never created because the model didn’t get that far. -Kate
So let’s concentrate on the segmentation fault then.
Assuming that you’re using gcc as your compiler, what you need to do is:
1. make sure everything is compiled with the -ggdb command line option to gcc
2. Run gdb on your resulting executable
3. specify command line arguments to the executable using ‘set args’ inside gdb
4. run the program by issuing a ‘r’ command in gdb
5. when it crashes, issue a ‘bt’ backtrace command in gdb to read off which function and line the crash happens in
6. look at the source code to see what might be happening
7. use ‘print’ in gdb to examine local variables to obtain a better understanding
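The steps above can also be run in one go, without typing into the gdb prompt. This is only a sketch: it assumes gdb is installed and that the executable was built with -g or -ggdb, and the helper name backtrace_of is made up for illustration. It is written for sh/bash rather than the csh used by the run script.

```shell
#!/bin/sh
# Hypothetical helper: run a program under gdb in batch mode and print a
# backtrace if it crashes. Assumes a build with debug symbols (-g or -ggdb).
backtrace_of() {
    exe="$1"
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found on PATH"
        return 1
    fi
    if [ ! -x "$exe" ]; then
        echo "no executable at $exe"
        return 1
    fi
    shift
    # -batch: quit once the listed commands have finished
    # -ex:    execute a single gdb command ('run', then 'bt')
    # --args: everything after it is the program and its arguments
    gdb -batch -ex run -ex bt --args "$exe" "$@"
}
# Usage (from the run directory): backtrace_of ./ccsm.exe
```

If the program exits normally, the 'bt' step simply reports that there is no stack; if it segfaults, the backtrace lists the functions and, with debug symbols, the source lines involved.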
If the problem is that the stack is really overflowing, you might want to check and see if your system is permitting non-root users to increase their stack size (via ulimit) and whether other memory constraints that ulimit imposes might be an issue.
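To see what the system currently allows, something like the following prints both limits. It is a small sketch for sh/bash; the function name show_stack_limits is just an illustration. A non-root user can raise the soft limit only as far as the hard limit.

```shell
#!/bin/sh
# Print the soft and hard stack-size limits of the current shell.
# Values are in kilobytes, or the word "unlimited".
show_stack_limits() {
    echo "soft stack limit: $(ulimit -S -s)"
    echo "hard stack limit: $(ulimit -H -s)"
}
show_stack_limits
# To raise the soft limit for this shell and its children (sh/bash):
#   ulimit -s unlimited
# The csh/tcsh equivalent is:
#   limit stacksize unlimited
```

Limits are inherited by child processes, so the change has to be made in the shell that actually launches the model, or inside the run script itself.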
The model is compiled automatically, using both gcc and gfortran, through the build script. I can include compiler flags in the build script, so I’ll try that. Thanks for the advice. -Kate
I think the missing log file is secondary.
My suspicion at this point is MPI, not CESM.
Did you build MPI using the same compiler stack?
Have you tried a test MPI job?
mt
I have tried using mpich2 installed via apt-get, as well as built from the tarball on their website. No luck either way. I am not sure how to test MPI. I think I should find a test .exe file to try and submit it to mpirun, but I’m not sure where to get one. -Kate
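One way to test the launcher without hunting for a special executable is to run an ordinary program under mpiexec. The sketch below uses hostname; the helper name mpi_smoke_test is invented for illustration, and it is written for sh/bash. Note that this only checks mpiexec (and mpd, for MPICH2), not the MPI library that the model is linked against.

```shell
#!/bin/sh
# Hypothetical helper: launch N copies of a non-MPI program under mpiexec.
# This exercises the process launcher only, not MPI communication.
mpi_smoke_test() {
    n="${1:-4}"
    if ! command -v mpiexec >/dev/null 2>&1; then
        echo "mpiexec not found on PATH"
        return 1
    fi
    # A working launcher prints the host name once per process.
    mpiexec -n "$n" hostname
}
# Usage: mpi_smoke_test 4
```

If that prints the host name four times, the launcher is fine and the next step is a real MPI program. The MPICH2 source tarball should include a small example called cpi in its examples directory, which could be built with the same compilers used for CESM and run the same way.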