MPI Problem?

Now that my poster is finished, I am taking one last crack at getting CESM to run. Last time I wrote, I mentioned that the model execution was failing without giving any error messages (except the occasional “Segmentation fault”).

Michael Tobis thought that the problem had to do with mpiexec, so today I tried something new. I uninstalled mpich2 and replaced it with openmpi which I had built manually (as opposed to using apt-get). Now, when the model fails, the ccsm.log file actually says something:

mpiexec noticed that process rank 0 with PID 1846 on node computer name exited on signal 11 (Segmentation fault).
15 total processes killed (some possibly by mpiexec during cleanup)

Perhaps the problem is still with MPI. It seems unlikely that the segfault is due to a problem with the code itself (eg an undeclared variable), seeing as this version has been tested and used by NCAR. Maybe gcc is the issue, and I should play around with some compiler flags? Any suggestions would be welcome.

Progress?

I have made slight headway regarding my installation of CESM. It still isn’t running, but now it’s not running for a different reason than previously! Progress!

It appears that, at some point while porting, I mangled the scripts/ccsm_utils/Machines/mkbatch.kate file for my machine such that the actual call to launch the model wasn’t getting copied from mkbatch.kate to test.kate.run. A bit of trial and error fixed that problem.

I finally got Torque working. The only reason that jobs were getting stuck in the queue was that I didn’t start the pbs_sched daemon! It turns out that qsub isn’t related to the problems I was having, and isn’t necessary to run the model, but it’s nice to have it working just in case I need it in the future.

So, with the relevant call in test.kate.run as

mpiexec -n 16 ./ccsm.exe >&! ccsm.log.$LID

the command line output is

Wed July 6 11:02:33 EDT 2011 -- CSM EXECUTION BEGINS HERE
Wed July 6 11:02:34 EDT 2011 -- CSM EXECUTION HAS FINISHED
ls: No match.
Model did not complete - no cpl.log file present - exiting

The only log file created is ccsm.log, and it is completely empty.

I have MPICH2 installed, the command mpiexec seems to work fine, and I have mpd running. Regardless, I tried taking out mpiexec and calling the executable directly in test.kate.run:

./ccsm.exe >&! ccsm.log.$LID

The command line output becomes

Wed July 6 11:02:33 EDT 2011 -- CSM EXECUTION BEGINS HERE
Segmentation fault.
Wed July 6 11:02:34 EDT 2011 -- CSM EXECUTION HAS FINISHED
ls: No match.
Model did not complete - no cpl.log file present - exiting

Again, ccsm.log is empty, and there seems to be no trace of why the model is failing to launch beyond Segmentation fault. The CESM guide recommends setting the stack size to unlimited, which I did to no avail. Submitting test.kate.run using qsub produces the same messages, but in the output and error files, rather than the terminal.

Thoughts?