Benchmarking
Quick-start to running benchmarks.
Benchmarks, i.e. executions of your program under test, can be generated automatically using the submit.pl script. This script has many options, including two different back-ends: Slurm and GNU Parallel. The submission script is written in Perl to ease the text-parsing tasks it contains, and it can easily be extended or changed to fit your specific situation.
Getting Simsala
The first step is to get the simsala script collection itself. This can be done by cloning the repository to some path. For this tutorial, it is also recommended to add it to the $PATH right away, so that the provided scripts can be used without full paths later.
cd /tmp
git clone https://gitlab.sai.jku.at/simsala/simsala.git
export PATH=$PATH:/tmp/simsala
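To verify the setup, check that the submission script is now found on the PATH:
which submit.pl   # should print /tmp/simsala/submit.pl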
Resource Limiting and Relation to RunLim
By default, RunLim is used as the resource limiter. The path to the runlim executable should be supplied via --runlim. Other tools may be added later, or may simply be used as a wrapper call around the executable, in the form of a script supplied to the -e executable parameter. The --time and --space options are translated to runlim parameters but are not required. If no runlim is found, the limiting parameters are ignored and the tasks are submitted directly.
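If runlim is unavailable, such a wrapper script supplied to -e could look like the following minimal sketch; the limits and the program path are illustrative, and ulimit -v takes kilobytes:
#!/bin/sh
# wrapper.sh: hypothetical resource-limiting wrapper around the program
# under test, passed to submit.pl via -e in place of a plain executable.
ulimit -t 3600      # limit CPU time to 3600 seconds
ulimit -v 8192000   # limit virtual memory to roughly 8000 MB (value in KB)
exec /path/to/program "$@"   # replace with the real program under test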
Examples
Submit Jobs to a Slurm Cluster
submit.pl -e <path to executable> -n <jobname> --time <timeout in seconds> -p "`ls -d /some/folder/with/data/*.data`" --cpus_per_task 2 > task.txt
This command immediately submits all matched files as jobs to Slurm and places the generated log files into the current working directory.
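Note that -p receives the expanded list of input files as a single quoted argument, so any command producing such a list works. As a sketch with hypothetical paths, assuming submit.pl splits the list on whitespace as the ls example suggests:
submit.pl -e ./solver -n baseline --time 3600 -p "$(find /data/benchmarks -name '*.data')" --cpus_per_task 2 > task.txt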
Generate Run-Script to be Executed Locally
submit.pl -e <path to executable> -n <jobname> --time <timeout in seconds> -p "`ls -d /some/folder/with/data/*.data`" --cpus_per_task 2 --gnuparallel > task.sh
chmod +x task.sh
Execute this script on the desired host using ./task.sh. Results will be placed into the current working directory.
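To run on a different machine, the generated script can simply be copied over; a sketch with a hypothetical host and directory, fetching the results back afterwards:
scp task.sh user@benchhost:~/experiment/
ssh user@benchhost 'cd ~/experiment && bash task.sh'
mkdir -p results && scp 'user@benchhost:~/experiment/*' results/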
Dealing With Log-Files and Conflicting Filenames
The logs are created by combining the task name with the name of the input file. If multiple directories are searched and some contain conflicting filenames, the --fullpaths option may help. It replaces the short filenames with the full absolute paths of the input files, converted to valid filenames.
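For illustration, assuming a hypothetical naming scheme of <jobname>-<filename>.log with slashes mapped to underscores (the exact scheme may differ):
# Without --fullpaths, these two inputs collide in one log name:
#   /data/setA/instance1.data  ->  myjob-instance1.data.log
#   /data/setB/instance1.data  ->  myjob-instance1.data.log
# With --fullpaths, the absolute path keeps them distinct:
#   /data/setA/instance1.data  ->  myjob-_data_setA_instance1.data.log
#   /data/setB/instance1.data  ->  myjob-_data_setB_instance1.data.log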
Directory Structure
It is recommended to have a simple directory structure to put the output of benchmarks in, as this eases analyzing the results afterwards. One experiment is a folder containing many different configurations, each configuration representing one call to submit.pl with many individual runs of the program under test. The experiment thus contains one folder per configuration, most commonly with a regular naming scheme that encodes the important parameters in the folder names.
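Such a layout might look as follows; the names are illustrative and match the bulk example below:
experiment/
  program1_depth10_inputset1/   # one configuration = one submit.pl call
  program1_depth10_inputset2/
  program1_depth20_inputset1/
  ...
inputs/
  inputset1/
  inputset2/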
Bulk-Benchmarks
When using folder names well, one can create benchmarks in bulk, greatly speeding up generating results. Let's say one has the programs program1, program2, and program3. They all have one configuration option, depth, which may be set arbitrarily. In this case, we just use 10, 20, and 30. All this should be run over multiple datasets, inputset1, inputset2, and inputset3. We assume the input sets to lie in a folder named inputs next to the experiment folder; the programs themselves are also next to the experiment folder.
One creates an experiment folder for this, using regular mkdir, like mkdir experiment. Then, after cd experiment, one creates all folders for the individual benchmarks:
mkdir program{1,2,3}_depth{10,20,30}_inputset{1,2,3}
This generates the whole folder structure. Now, submit.pl should be called for every folder with the correct parameters. This can also be achieved by combining bash with awk. First, we extract all information we need, using:
for d in *; do
  # Folder names follow the scheme <program>_depth<depth>_<inputset>,
  # so split on "_" to recover the three components.
  e="../../"$(echo $d | awk -F _ '{ print $1 }');
  depth=$(echo $d | awk -F _ '{ print substr($2,6) }');   # strip the "depth" prefix
  inp="../../inputs/"$(echo $d | awk -F _ '{ print $3 }');
  echo $e $depth $inp;
done
The command above gives the following structured output:
../../program1 10 ../../inputs/inputset1
../../program1 10 ../../inputs/inputset2
../../program1 10 ../../inputs/inputset3
../../program1 20 ../../inputs/inputset1
../../program1 20 ../../inputs/inputset2
../../program1 20 ../../inputs/inputset3
../../program1 30 ../../inputs/inputset1
../../program1 30 ../../inputs/inputset2
../../program1 30 ../../inputs/inputset3
../../program2 10 ../../inputs/inputset1
../../program2 10 ../../inputs/inputset2
../../program2 10 ../../inputs/inputset3
../../program2 20 ../../inputs/inputset1
../../program2 20 ../../inputs/inputset2
../../program2 20 ../../inputs/inputset3
../../program2 30 ../../inputs/inputset1
../../program2 30 ../../inputs/inputset2
../../program2 30 ../../inputs/inputset3
../../program3 10 ../../inputs/inputset1
../../program3 10 ../../inputs/inputset2
../../program3 10 ../../inputs/inputset3
../../program3 20 ../../inputs/inputset1
../../program3 20 ../../inputs/inputset2
../../program3 20 ../../inputs/inputset3
../../program3 30 ../../inputs/inputset1
../../program3 30 ../../inputs/inputset2
../../program3 30 ../../inputs/inputset3
These variables may now be used to easily call submit.pl! But first, we need to step into the correct directory, so that the working directory can serve as the target for the logs. This can be integrated into the command:
for d in *; do
  e="../../"$(echo $d | awk -F _ '{ print $1 }');
  depth=$(echo $d | awk -F _ '{ print substr($2,6) }');
  inp="../../inputs/"$(echo $d | awk -F _ '{ print $3 }');
  cd $d;
  cd ..;
done
Now, only submit.pl needs to be called. The script is assumed to be in the $PATH for easier explanation. We give every task a time limit of 1 hour (3600 seconds) and 8000 MB of space, which are by default enforced by RunLim. By default, Slurm assigns two CPUs to each task, which avoids potential hyperthreading issues. As submit.pl has no --depth parameter of its own, it passes the option on to the executable.
for d in *; do
  e="../../"$(echo $d | awk -F _ '{ print $1 }');
  depth=$(echo $d | awk -F _ '{ print substr($2,6) }');
  inp="../../inputs/"$(echo $d | awk -F _ '{ print $3 }');
  cd $d;
  submit.pl -n depthsweep -e $e --time 3600 --space 8000 -p "`ls -d $inp/*`" --depth $depth > task.txt;
  cd ..;
done
The above script now runs all tasks, and the logs are put into the corresponding folders. Now we can analyze this! If we don't have access to Slurm, this can also be done directly inline using GNU Parallel, like this:
for d in *; do
  e="../../"$(echo $d | awk -F _ '{ print $1 }');
  depth=$(echo $d | awk -F _ '{ print substr($2,6) }');
  inp="../../inputs/"$(echo $d | awk -F _ '{ print $3 }');
  cd $d;
  submit.pl -n depthsweep -e $e --time 3600 --space 8000 -p "`ls -d $inp/*`" --gnuparallel --depth $depth > task.sh;
  bash task.sh;
  cd ..;
done
This produces a bunch of output and takes a while, depending on what the programs program{1,2,3} are doing internally. To see what this looks like when done, this archive contains such a finished run.