Or, how to do your work without inconveniencing everyone

Often, people need to run Linux processes for hours or days ("long-running processes") and to have them continue running on lab machines after they log out. On occasion, people, knowing no better, decide to stay logged in - with their screens locked - for hours or days, precisely to ensure that their processes run to completion.

Unfortunately, this inconveniences other users, and specifically breaks the DoC screen locking policy (which, by the way, was requested by students, not imposed by CSG/DoC). The policy is:

    No-one should lock any lab computer for more than 30 minutes without discussing
    it with CSG first. The only exception to the 30 minute rule is if we have
    specifically allocated you a project machine.

Our Advice

So, here's our best advice about how to complete your work in the least disruptive way possible:

  1. If your problem can easily be parallelised into one algorithm operating independently on many different sets of data, we strongly recommend the Condor batch processing framework, which will handle all the donkey work for you (starting your jobs on idle machines, killing them off when an interactive user returns, and restarting them somewhere else). See our local Condor documentation for more information.
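
     For illustration, here's a minimal sketch of a Condor submit description
     file (the file names are hypothetical - see the local documentation for
     the exact procedure on our installation):

       # myjob.sub - run 10 independent copies of a non-interactive program,
       # each with its own standard output and error files
       universe   = vanilla
       executable = RUNME
       output     = out.$(Process)
       error      = err.$(Process)
       log        = condor.log
       queue 10

     Submit it with "condor_submit myjob.sub" and watch its progress with
     "condor_q".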

  2. If Condor is not appropriate, you need to find a machine (or machines) suitable for running your long-running processes. Shell servers are NEVER appropriate for long-running processes; the batch servers batch1.doc.ic.ac.uk and batch2.doc.ic.ac.uk are powerful machines and may well be suitable - but there are only two of them and everyone uses them!  Note that each batch server also has a large local disk (called /data) that you can use to build software and store data in: "mkdir /data/YOUR_USERNAME" and work inside that directory, as sketched below.  However, often a lab machine or two will do just fine - preferably a Linux-only machine (currently, any non-project machine in Huxley 219) to avoid unexpected reboots half way through your computation.
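
     For example, a first session on a batch server might look like this
     (a sketch, using the /data convention just described):

       ssh batch1.doc.ic.ac.uk    # log into one of the batch servers
       mkdir /data/$USER          # create your own directory on the local disk
       cd /data/$USER             # build software and store data in here

     Note that /data is a local disk: files you create on batch1 will not
     automatically appear on batch2.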

  3. If your program only has a GUI, this is a serious obstacle to automating it and running it while not logged on. Is there any way you can build a non-interactive, non-graphical version of the software, or use another piece of software instead? If at all possible, use a non-graphical piece of software. If you cannot eliminate the GUI, you may have no choice but to use a lab machine for long periods of time; in this case, please discuss it with us (in person or by email to help@doc.ic.ac.uk) before doing so. We'd like to know:

    • what you're trying to do,
    • how long you believe it will take,
    • why you really, truly can't get rid of the GUI,
    • which machine(s) you're thinking of using.
  4. Given a non-interactive program and a chosen machine (let's say ray15), it's useful to try to estimate how long the program should take to run (e.g. by trying smaller test cases), and also to ensure that your long-running process is not going to inconvenience other users of the same machine (e.g. by using all the RAM, CPU power or network bandwidth of the machine). This is important both for lab machines and batch1/batch2, and is completely your responsibility.
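
     One rough way of doing this (a sketch - "myprog" and its input files are
     hypothetical) is to time scaled-down test cases under GNU time and
     extrapolate:

       /usr/bin/time -v ./myprog small.dat     # reports elapsed time and
       /usr/bin/time -v ./myprog medium.dat    # "Maximum resident set size" (peak RAM)

     If doubling the input size quadruples the runtime, your algorithm is
     roughly quadratic, and you can extrapolate to the full dataset accordingly.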

  5. Now, work out how to run the program in an automatic fashion [this is also necessary for using Condor]:

    • first work out a single command that will run the program in an automatic fashion with the correct data inputs, and capture the output. To achieve this, you may need to use a mixture of:
      • command line arguments,
      • input redirection (CMD < inputfile),
      • output redirection (CMD > outputfile and CMD 2> errorfile [sh syntax]).
    • wrap the above "run it with no manual intervention" command up in a short shell script, called RUNME for instance, roughly of the form:

      #!/bin/sh
      CMD ARGS < inputfile > outputfile 2> errorfile

    • make it executable: chmod +x RUNME

  6. When you want to run your RUNME script on your chosen machine, and want to be able to log out and leave it running in the background:
    • run the following command (tcsh shell syntax):

      nice nohup ./RUNME </dev/null >&/dev/null &

      that means:

      • run the script RUNME
      • detached from keyboard (</dev/null)
      • and detached from screen [stdout and stderr] (>&/dev/null)
      • in the background (&)
      • make it immune to hangup signals (nohup)
      • reduce its scheduling priority (nice) to favour other processes
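
     If your login shell is sh or bash rather than tcsh, the equivalent
     command in Bourne-shell redirection syntax is:

       nice nohup ./RUNME </dev/null >/dev/null 2>&1 &
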
    • try logging out, and then back into the same machine again. Check that your RUNME process is still running with: ps auxww | grep RUNME

    • monitor RUNME for a while with top to check its resource utilization. Many algorithms spend some time initializing themselves, and then have one or more main stages with maximal utilization - and of course you will know when that occurs for your algorithm, yes?

    • you can check the outputfile periodically (but note that the most recently written - but not flushed - 4K of output may not appear in the output file until the 4K buffer is full or the program terminates)

    • please do check that the process has finished in a reasonable amount of time; consider killing it off if it fails to terminate in twice the time you estimated it would take. Please do not just leave it running for days when it should have finished in 2 hours!
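
     A few standard commands may help with monitoring and (if necessary)
     killing the job ("RUNME" and "outputfile" are from the example above):

       tail -f outputfile          # watch the output file as it grows
       top -u $USER                # check your processes' CPU and memory usage
       pkill -u $USER -f RUNME     # kill the runaway job (check with ps first!)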

If you follow these guidelines, you should be able to get your long-running work done, without unduly inconveniencing other users. And everyone should be happy.

May I emphasize one last time that Condor (when appropriate) is the ideal way of giving your ridiculously large amount of data-processing a massive parallel boost - using loads of idle lab (and staff desktop) machines to run your code! We have seen single-machine runtime estimates shrink by a factor of 400 when running under Condor. That's the simplest and most robust way of getting a massive speed-up that we know of!

Don't Forget to Optimize the Hotspots

Orthogonal to all this, if you have a long-running program that you've built from source code, or better yet have written yourself, it is almost always worth optimizing its performance on a relatively small dataset long before you start running it hundreds of times with much bigger datasets!

The place to start is to profile your program's runtime behaviour, to see where it's really spending its time. Few programmers fully understand the runtime behaviour of their own code, so the results of profiling will often surprise you. Then try to optimize the small part of the program that is taking the most time - often, very small changes to code can reduce runtime by a factor of 2 to 10!
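
For C programs compiled with gcc, for example, the classic profiling tool is
gprof (a sketch - "myprog.c" and its input are hypothetical):

    gcc -pg -O2 -o myprog myprog.c    # -pg adds profiling instrumentation
    ./myprog small.dat                # run normally; this writes gmon.out
    gprof myprog gmon.out | less      # report where the time actually went

The flat profile at the top of gprof's report lists functions by the time
spent in them; concentrate your optimization effort on the top entry or two.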

Some members of CSG have a lot of expertise in this area; it's well worth discussing such problems with us.