====== Job submission ======

Jobs are submitted to the cluster with **qsub**. In its most basic usage, qsub takes the job script as its only argument. To submit a job:

> qsub script.sh

Submitted jobs are identified by a job id (a unique id assigned by the cluster) and a job name (which defaults to the name of the script). The job id cannot be changed by the user. Users can set the job name to help distinguish multiple instances of the same script.

> qsub -N run_script.1 script.sh
> qsub -N run_script.2 script.sh

Scripts can also accept arguments from the command line during submission. These can be accessed within the script as $1, $2 ... $n (where n is the number of arguments). To submit a job with arguments:

> qsub -N run_script_all script.sh run1 run2 run3

To parallelize processing of the above script:

> qsub -N run_script.1 script.sh run1
> qsub -N run_script.2 script.sh run2
> qsub -N run_script.3 script.sh run3

====== Job restrictions ======

Each node has a finite amount of memory installed, and because the nodes are disk-less there are restrictions on the amount of RAM used. Currently, the default is to assign 10G of RAM per submitted job. This is done to prevent memory oversubscription and to better distribute the load across the available machines. If your job requires more than 10G, you may request a higher limit with the **"-l h_vmem"** directive; otherwise you don't have to do anything.

> qsub -N run_script.1 -l h_vmem=12G,vf=12G script.sh run1

The above example will request/reserve 12G of available memory. **"vf"** ensures your job will not be sent to a node unless it has the required amount of memory available. Also, if you exceed the requested amount of **"h_vmem"**, the grid engine will terminate the job and you will receive notice. In most cases you will not have to do anything, since 10G is a significant amount. The amount of RAM used by your jobs is listed as **"Max vmem"** in the emails sent from the cluster.
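The "Max vmem" figures are reported with G/M suffixes (e.g. 4.315G or 769.992M). When deciding how much h_vmem to request for the next submission, a small helper can normalize them to megabytes for comparison. This is only a sketch; the to_mb function name is ours, not a cluster utility:

```shell
# to_mb: convert an SGE-style memory figure such as "4.315G" or
# "769.992M" into a whole number of megabytes (truncated).
# Hypothetical helper -- not part of the cluster software.
to_mb() {
    echo "$1" | awk '/G$/ { printf "%d\n", $0 * 1024; next }
                     /M$/ { printf "%d\n", $0 + 0 }'
}
```

For example, to_mb 4.315G prints 4418, so a similar job could safely be resubmitted with -l h_vmem=5G rather than a much larger request.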
The restriction is in place to prevent memory from being over-allocated and jobs crashing an entire node, which would kill other users' jobs as well. You can also get the figure from previous jobs with qacct if you have the job number (there will be an entry for "maxvmem"):

> qacct -j JOBNUM
  maxvmem      4.315G

You can also request the information for currently running jobs with qstat (look for the "usage" line):

> qstat -j JOBNUM
  usage    1:    cpu=3:07:56:02, mem=36938.10261 GBs, io=15.57702, vmem=769.992M, maxvmem=1.451G

The maximum available on any node is ~750G, so if you request more than that, the job will just sit in the queue waiting indefinitely.

//Please do not request additional resources unless you absolutely need them. Requested resources are deducted from the amount available to everyone else, so requesting unneeded resources reduces the capacity of a given node for other users.//

There is a global limit on any single user of 60 slots and/or 1920G of RAM. There is also a 6G cumulative quota on all HOME directories.

====== Job status ======

The current status of a job can be checked with **qstat**. This returns the current list of jobs owned by the user.

> qstat
  job-ID  prior    name       user      state  submit/start at      queue                        slots  ja-task-ID
  ---------------------------------------------------------------------------------------------------------------
     918  0.00000  script.sh  deshmukh  r      10/09/2007 08:20:24  users.q@node6.biac.duke.edu  1
     919  0.00000  script.sh  deshmukh  qw     10/09/2007 08:20:26                               1

Each job listing has the following relevant properties:

| job-ID | Unique id assigned by the cluster. |
| name | Name of the job. Defaults to the name of the submitted script. |
| user | Username of the person who submitted the job. |
| state | Current state of the job: "r" (running) or "qw" (waiting in queue). |
| submit/start at | Submission time in the "qw" state and start time in the "r" state. |
| queue | Queue and node on which the job is running. This field is empty in the "qw" state. |
| slots | Number of processors the job will use. |

When a job has completed, it no longer appears in **qstat** listings. The status of all users' jobs can be checked with **qstatall**:

> qstatall
  Running jobs:
  job-ID    #  name        owner     start time           running in
  -----------------------------------------------------------------------------
  1294      1  script1.sh  deshmukh  10/09/2007 12:24:01  users.q
  1295      1  script2.sh  bizzell   10/09/2007 12:24:16  users.q

====== Job delete ======

A submitted job can be deleted with **qdel**. It takes the job-id (listed by **qstat**) as its argument.

> qdel 9999

All jobs for a particular user can be deleted with the following command:

> qdel -U username

====== Template Script ======

Jobs are usually written in bash. They are similar to local bash scripts in syntax and usage. In addition, they contain cluster-related directives on lines starting with "#$", which send job-related setup information to the cluster. Scripts also contain requests for access to experiment data.

The BIAC template script is a good starting point for testing job submission and as a base for all jobs. Begin by making a copy of the template script below. The template script requests access to an experiment folder and lists its contents. It needs a valid BIAC Experiment Name (case-sensitive) that is accessible by the user.

Submit myscript.sh using qsub:

> qsub -v EXPERIMENT=Dummy.01 myscript.sh

Run **qstat** to check the job status. The job will initially be in the "qw" state. Wait a few seconds and run qstat again. The job should be in the "r" state. If you don't see a listing, the job has completed. The results of the job should appear in the experiment folder under Analysis (eg: \\Server\BIAC\Dummy.01\Analysis) as myscript.sh.xxx.out (where xxx is the job id). If you don't see the file, check the experiment name that was provided at submission.
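As a side note on the qstat listings above: when many jobs need cleaning up at once, the job-ID column can be pulled out of a listing and fed to qdel. A minimal sketch, assuming the listing format shown above (two header lines, job-ID in the first column); the parse_job_ids name is ours, not a cluster command:

```shell
# parse_job_ids: print the first column (job-ID) of a qstat-style
# listing read from stdin, skipping the two header lines.
# Illustrative helper -- not part of the grid engine tools.
parse_job_ids() {
    awk 'NR > 2 { print $1 }'
}
```

Something like "qstat | parse_job_ids | xargs -n 1 qdel" would then delete each listed job, much as "qdel -U username" does for all of a user's jobs.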
The script is divided into multiple sections. The user sections are [[biac:cluster:submit#user directive|USER DIRECTIVE]] and [[biac:cluster:submit#user script|USER SCRIPT]]. The remaining sections are setup-related and don't require modification for most scripts; they are critical for access to your data.

==== USER DIRECTIVE ====

If you want mail notifications when your job completes or fails, you need to set the correct email address. Replace the dummy email address (user@somewhere.edu) with your own in the following line:

  #$ -M user@somewhere.edu

==== USER SCRIPT ====

  * Add your script in this section.
  * Within this section you can access the requested experiment folder using $EXPERIMENT. All paths are relative to this variable, eg: $EXPERIMENT/Data, $EXPERIMENT/Analysis. The $EXPERIMENT variable is a temporary directory path (assigned for a specific job) that points to the requested experiment directory. Do not use it in place of the actual experiment name (eg: Dummy.01) if that name is required within your script.

  # Correct - lists the contents of the experiment folder
  ls -l $EXPERIMENT
  # Correct - lists the contents of the Analysis folder in your experiment directory
  ls -l $EXPERIMENT/Analysis
  # Incorrect - the output will be "My experiment name is /path/to/experiment"
  # instead of the desired "My experiment name is Dummy.01"
  echo "My experiment name is $EXPERIMENT"

  * All terminal output is routed to the "Analysis" folder under the experiment directory, i.e. $EXPERIMENT/Analysis. To change this path, set the OUTDIR variable at the beginning of this section to another location under your experiment folder.

  OUTDIR=$EXPERIMENT/Analysis/ClusterLogs

  * On successful completion the job returns 0. If you need to set another return code, set the RETURNCODE variable in this section. To avoid conflict with system return codes, use a RETURNCODE higher than 100.
  RETURNCODE=110

  * Arguments to the USER SCRIPT are accessible in the usual fashion, eg: $1 $2 $3.

  #!/bin/sh

  # --- BEGIN GLOBAL DIRECTIVE --
  #$ -S /bin/sh
  #$ -o $HOME/$JOB_NAME.$JOB_ID.out
  #$ -e $HOME/$JOB_NAME.$JOB_ID.out
  #$ -m ea
  # -- END GLOBAL DIRECTIVE --

  # -- BEGIN PRE-USER --
  # Name of experiment whose data you want to access
  EXPERIMENT=${EXPERIMENT:?"Experiment not provided"}
  EXPERIMENT=`findexp $EXPERIMENT`
  EXPERIMENT=${EXPERIMENT:?"Returned NULL Experiment"}

  if [ $EXPERIMENT = "ERROR" ]
  then
      exit 32
  else
      # Timestamp
      echo "----JOB [$JOB_NAME.$JOB_ID] START [`date`] on HOST [$HOSTNAME]----"
      # -- END PRE-USER --

      # **********************************************************
      # -- BEGIN USER DIRECTIVE --
      # Send notifications to the following address
      #$ -M user@school.edu
      # -- END USER DIRECTIVE --

      # -- BEGIN USER SCRIPT --
      # User script goes here
      # List all files in the requested Experiment directory
      ls -l $EXPERIMENT
      # -- END USER SCRIPT --
      # **********************************************************

      # -- BEGIN POST-USER --
      echo "----JOB [$JOB_NAME.$JOB_ID] STOP [`date`]----"
      OUTDIR=${OUTDIR:-$EXPERIMENT/Analysis}
      mv $HOME/$JOB_NAME.$JOB_ID.out $OUTDIR/$JOB_NAME.$JOB_ID.out
      RETURNCODE=${RETURNCODE:-0}
      exit $RETURNCODE
  fi
  # -- END POST USER --

==== Notes ====

  * If you ever edit your scripts on a non-unix machine, please run dos2unix on them before submitting.
  * Sometimes there are hidden Windows characters that will prevent the script from running.
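Those hidden characters are usually carriage returns left over from Windows line endings. A quick check like the following (a sketch; the has_crlf name is ours, not a cluster tool) reports whether a script still needs dos2unix before submission:

```shell
# has_crlf: succeed (exit 0) if the named file contains any carriage
# return characters, the usual leftover from editing on Windows.
# Hypothetical helper -- run "dos2unix file" if it reports true.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}
```

For example, "has_crlf myscript.sh && dos2unix myscript.sh" cleans the script only when it actually needs it.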