Progress?

I have made slight headway regarding my installation of CESM. It still isn’t running, but now it’s not running for a different reason than previously! Progress!

It appears that, at some point while porting, I mangled the scripts/ccsm_utils/Machines/mkbatch.kate file for my machine such that the actual call to launch the model wasn’t getting copied from mkbatch.kate to test.kate.run. A bit of trial and error fixed that problem.

I finally got Torque working. The only reason that jobs were getting stuck in the queue was that I didn’t start the pbs_sched daemon! It turns out that qsub isn’t related to the problems I was having, and isn’t necessary to run the model, but it’s nice to have it working just in case I need it in the future.

So, with the relevant call in test.kate.run as

mpiexec -n 16 ./ccsm.exe >&! ccsm.log.$LID

the command line output is

Wed July 6 11:02:33 EDT 2011 -- CSM EXECUTION BEGINS HERE
Wed July 6 11:02:34 EDT 2011 -- CSM EXECUTION HAS FINISHED
ls: No match.
Model did not complete - no cpl.log file present - exiting

The only log file created is ccsm.log, and it is completely empty.

I have MPICH2 installed, the command mpiexec seems to work fine, and I have mpd running. Regardless, I tried taking out mpiexec and calling the executable directly in test.kate.run:

./ccsm.exe >&! ccsm.log.$LID

The command line output becomes

Wed July 6 11:02:33 EDT 2011 -- CSM EXECUTION BEGINS HERE
Segmentation fault.
Wed July 6 11:02:34 EDT 2011 -- CSM EXECUTION HAS FINISHED
ls: No match.
Model did not complete - no cpl.log file present - exiting

Again, ccsm.log is empty, and there seems to be no trace of why the model is failing to launch beyond Segmentation fault. The CESM guide recommends setting the stack size to unlimited, which I did to no avail. Submitting test.kate.run using qsub produces the same messages, but in the output and error files, rather than the terminal.

Thoughts?

3 thoughts on “Progress?”

  1. A segmentation fault means that a program tried to access memory it isn’t supposed to, say because of a pointer arithmetic bug. Depending on your ulimit settings, it may produce a core file which, if the program was compiled with debug symbols, you could open up to try and understand exactly why.

    ls: No match.
    means that in csh, somebody ran:
    ls 〈glob pattern〉
    where the glob pattern (a wildcard such as *) didn’t match any files.

    I’m pretty sure that ls was looking for the cpl.log file, which was never created because the model didn’t get that far. -Kate

    • So let’s concentrate on the segmentation fault then.

      Assuming that you’re using gcc as your compiler, what you need to do is:
      1. make sure everything is compiled with the -ggdb command line option to gcc
      2. Run gdb on your resulting executable
      3. specify command line arguments to the executable using ‘set args’ inside gdb
      4. run the program by issuing a ‘r’ command in gdb
      5. when it crashes, issue a ‘bt’ backtrace command in gdb to see which function and line the crash happens on
      6. look at the source code to see what might be happening
      7. use ‘print’ in gdb to examine local variables to obtain a better understanding
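The steps above can also be run in one go with gdb's batch mode. This is only a sketch under some assumptions: the executable is ./ccsm.exe in the current directory, it was built with -ggdb, and it needs no command line arguments. The file names gdb_cmds.txt and gdb_backtrace.txt are made up for illustration.

```shell
# Write the gdb commands from the steps above into a file.
# gdb_cmds.txt is a hypothetical name chosen for this sketch.
cat > gdb_cmds.txt <<'EOF'
run
bt
info locals
EOF

# Run gdb non-interactively, but only if gdb and the executable are present.
# ./ccsm.exe is assumed to have been built with -ggdb.
if command -v gdb >/dev/null 2>&1 && [ -x ./ccsm.exe ]; then
    gdb -batch -x gdb_cmds.txt ./ccsm.exe > gdb_backtrace.txt 2>&1
    echo "gdb output written to gdb_backtrace.txt"
else
    echo "gdb or ./ccsm.exe not found; nothing was run"
fi
```

If the crash is reproduced, gdb_backtrace.txt should contain the backtrace followed by the local variables of the frame where the crash happened. Without debug symbols the backtrace will mostly show addresses rather than source lines.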

      If the problem is that the stack is really overflowing, you might want to check and see if your system is permitting non-root users to increase their stack size (via ulimit) and whether other memory constraints that ulimit imposes might be an issue.
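A quick way to see what the limits actually are, sketched here for a Bourne-style shell (sh or bash). The variable names are just for illustration. Note that limits set in one shell only apply to that shell and the processes it starts, so a limit raised in a terminal may not be the one the run script sees.

```shell
# Soft limits are what a process actually gets; hard limits are the ceiling
# a non-root user is allowed to raise the soft limit to.
soft_stack=$(ulimit -S -s)
hard_stack=$(ulimit -H -s)
core_size=$(ulimit -S -c)
echo "stack size: soft=$soft_stack hard=$hard_stack"
echo "core file size: soft=$core_size"

# Try raising the soft stack limit inside a subshell, so that the
# current shell's limits are left untouched by this check.
if ( ulimit -S -s unlimited ) 2>/dev/null; then
    raise_stack="allowed"
else
    raise_stack="refused"
fi
echo "raising the stack limit to unlimited: $raise_stack"

# csh/tcsh (which test.kate.run appears to use) spells these differently:
#   limit stacksize unlimited
#   limit coredumpsize unlimited
```

A core file size of 0 would explain why a segmentation fault leaves no core file behind.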

      The model is compiled automatically, using both gcc and gfortran, through the build script. I can include compiler flags in the build script, so I’ll try that. Thanks for the advice. -Kate

  2. I think the missing log file is secondary.

    My suspicion at this point is MPI, not CESM.

    Did you build MPI using the same compiler stack?

    Have you tried a test MPI job?

    mt

    I have tried using mpich2 installed via apt-get, as well as built from the tarball on their website. No luck either way. I am not sure how to test MPI. I think I should find a test .exe file to try and submit it to mpirun, but I’m not sure where to get one. -Kate
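One low-effort check needs no test executable at all: mpiexec will launch ordinary programs too, so running hostname through it at least shows whether the launcher and mpd can start processes. This is a sketch, not a full test. It does not exercise the MPI library itself, and the process count of 4 is arbitrary.

```shell
# mpdtrace lists the hosts in the mpd ring (MPICH2 with the mpd process manager).
if command -v mpdtrace >/dev/null 2>&1; then
    mpdtrace || echo "mpdtrace failed; is mpd running?"
fi

# Launch a plain, non-MPI program on 4 processes.
# Success means the launcher works; it says nothing about the MPI library.
if command -v mpiexec >/dev/null 2>&1; then
    if mpiexec -n 4 hostname; then
        mpi_status="launcher ok"
    else
        mpi_status="launcher failed"
    fi
else
    mpi_status="mpiexec not found"
fi
echo "$mpi_status"
```

If this prints the hostname four times, the next step would be a real MPI program compiled with the same compilers as the model. The MPICH2 source tarball should include small example programs for this, though that is worth checking against your copy.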
