Instructions to Configure and Run Quantum ESPRESSO Benchmarks

This document is intended to give a quick and simple introduction to the Quantum ESPRESSO Benchmark suite. The document is organized as follows:

1 - Brief Description of Quantum ESPRESSO
2 - Download and Install Quantum ESPRESSO Benchmark Suite
3 - List and Purpose of Datasets
4 - Run the Benchmarks
5 - Collect and Report the Results
6 - Benchmark Rules

1 - Brief Description of Quantum ESPRESSO

Quantum ESPRESSO (http://www.quantum-espresso.org) is an integrated suite of computer codes for electronic-structure calculations and materials modelling, based on density-functional theory, plane waves, and pseudopotentials (norm-conserving, ultrasoft, and projector-augmented wave). Quantum ESPRESSO stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization. It is freely available to researchers around the world under the terms of the GNU General Public License.

Quantum ESPRESSO builds upon newly restructured electronic-structure codes that have been developed and tested by some of the original authors of novel electronic-structure algorithms, and applied in the last twenty years by some of the leading materials modelling groups worldwide. Innovation and efficiency are still its main focus, with special attention paid to massively parallel architectures and a great effort devoted to user friendliness. Quantum ESPRESSO is evolving towards a distribution of independent and inter-operable codes in the spirit of an open-source project, where researchers active in the field of electronic-structure calculations are encouraged to participate by contributing their own codes or by implementing their own ideas into existing codes. Quantum ESPRESSO is written mostly in Fortran90, and parallelised using MPI and OpenMP.

2 - Download and Install Quantum ESPRESSO Benchmark Suite

For this benchmark suite, the latest version of Quantum ESPRESSO, 5.0.3, will be used. The code is publicly available from the Quantum ESPRESSO web site (www.quantum-espresso.org) or from the download pages of the developers' portal (qe-forge.org). No authentication/registration is required.

Tarballs of the Quantum ESPRESSO source code are available at:
http://www.quantum-espresso.org/download/

Patches for the GPU-enabled version are available at:
https://github.com/fspiga/QE-GPU

The following procedure obtains the source tree for the benchmark. From a Linux/UNIX terminal, issue the following commands:

> wget http://qe-forge.org/gf/download/frsrelease/116/403/espresso-5.0.2.tar.gz
> wget http://qe-forge.org/gf/download/frsrelease/116/405/PHonon-5.0.2.tar.gz
> wget http://qe-forge.org/gf/download/frsrelease/128/435/espresso-5.0.2-5.0.3.diff
> wget http://qe-forge.org/gf/download/frsrelease/135/453/QE-GPU-r216.tar.gz
> wget http://qe-forge.org/gf/download/frsrelease/142/452/QE-5.0.2_GPU-r216.patch
> tar xvzf espresso-5.0.2.tar.gz
> cd espresso-5.0.2
> tar xvzf ../PHonon-5.0.2.tar.gz
> tar xvzf ../QE-GPU-r216.tar.gz
> patch -p1 < ../espresso-5.0.2-5.0.3.diff
> patch -p1 < ../QE-5.0.2_GPU-r216.patch

The commands above set up the source tree, ready to be configured. Three different configuration procedures are possible: CPU serial, CPU parallel, and CPU parallel + GPU.

QE is self-contained; nevertheless, optimal performance is usually obtained by linking the code against external standard libraries: BLAS, LAPACK, FFTW, and BLACS/ScaLAPACK (for parallel builds).
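Once "./configure" has been run (see the subsections below), you can quickly check which external libraries were actually picked up by inspecting the generated "make.sys" file. A minimal check, assuming the QE 5.x make.sys variable names:

> cd espresso-5.0.2
> grep -E "BLAS_LIBS|LAPACK_LIBS|FFT_LIBS|SCALAPACK_LIBS" make.sys

If a library is not detected, configure falls back to QE's internal copies, which are typically slower.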
A parallel build requires MPI 1.1 and, optionally, OpenMP. QE also supports NVIDIA GPGPUs, which can be used to accelerate linear algebra subroutines and a limited number of other time-consuming subroutines.

2.1 - Configure serial build

A serial build is usually useful to test scalar/vector optimizations. Proceed as follows:

a) Set the environment: add the compiler executable path to the system PATH.

b) Build the executable:
- go inside the espresso-5.0.2 directory:
> cd espresso-5.0.2
- issue the commands:
> ./configure --disable-parallel
> make all

If everything goes fine, you should find the executable "pw.x" in the directory espresso-5.0.2/bin.

2.2 - Configure parallel build

The parallel build is the mainstream and default build of QE. To obtain a parallel build, proceed as follows:

a) Set the environment. Assuming you would like to use the Intel compiler suite and Intel MPI, set the following environment variables:

> export I_MPI_F77=ifort
> export I_MPI_CXX=icpc
> export I_MPI_ROOT=PATH_TO_INTEL_MPI
> export I_MPI_F90=ifort
> export I_MPI_CC=icc
> export INTELMPI_HOME=PATH_TO_INTEL_MPI
> export INTEL_HOME=PATH_TO_INTEL_SUITE
> export F90=ifort
> export F77=ifort
> export CXX=icpc
> export MKL_INC=PATH_TO_MKL_INCLUDE_SUBDIR
> export MKL_INCLUDE=PATH_TO_MKL_INCLUDE_SUBDIR
> export MKL_LIB=PATH_TO_MKL_LIB_SUBDIR
> export MKL_HOME=PATH_TO_MKL
> export MKLROOT=PATH_TO_MKL
> export LD_LIBRARY_PATH=PATH_TO_MKL_LIB_SUBDIR:PATH_TO_INTEL_MPI_LIB_DIR:PATH_TO_INTEL_SUITE_LIB_DIR
> export LIBPATH=PATH_TO_MKL_LIB_SUBDIR:PATH_TO_INTEL_MPI_LIB_DIR:PATH_TO_INTEL_SUITE_LIB_DIR
> export PATH=PATH_TO_INTEL_SUITE_BIN_DIR:$PATH

b) Build the executable:
- go inside the espresso-5.0.2 directory:
> cd espresso-5.0.2
- issue the command:
> ./configure --enable-openmp --with-scalapack
- edit the file "make.sys" and substitute the string:
  -lmkl_blacs_openmpi_lp64
  with the string:
  -lmkl_blacs_intelmpi_lp64
- issue the command:
> make all

If everything goes fine, you should find the executable "pw.x" in the directory espresso-5.0.2/bin.
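The "make.sys" edit in step b) can also be performed non-interactively, e.g. with sed. A minimal sketch, assuming the library string appears in make.sys exactly as shown above:

> sed -i 's/-lmkl_blacs_openmpi_lp64/-lmkl_blacs_intelmpi_lp64/' make.sys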
2.3 - Configure parallel + CUDA build

This kind of build is used to take advantage of GPGPU accelerators.

a) Set the environment. If you want to use the Intel compiler suite and CUDA, you have to set the following environment variables as appropriate:

> export I_MPI_F77=ifort
> export I_MPI_CXX=icpc
> export I_MPI_ROOT=PATH_TO_INTEL_MPI
> export I_MPI_F90=ifort
> export I_MPI_CC=icc
> export INTELMPI_HOME=PATH_TO_INTEL_MPI
> export INTEL_HOME=PATH_TO_INTEL_SUITE
> export F90=ifort
> export F77=ifort
> export CXX=icpc
> export MKL_INC=PATH_TO_MKL_INCLUDE_SUBDIR
> export MKL_INCLUDE=PATH_TO_MKL_INCLUDE_SUBDIR
> export MKL_LIB=PATH_TO_MKL_LIB_SUBDIR
> export MKL_HOME=PATH_TO_MKL
> export MKLROOT=PATH_TO_MKL
> export CUDA_SDK=PATH_TO_CUDA_SDK
> export CUDA_INCLUDE=PATH_TO_CUDA_INCLUDE_DIR
> export CUDA_HOME=PATH_TO_CUDA_HOME
> export CUDA_INC=PATH_TO_CUDA_INCLUDE_DIR
> export CUDA_LIB=PATH_TO_CUDA_LIB_DIR
> export CUDA_CFLAGS=--compiler-bindir=/usr/bin
> export NVCC_HOME=PATH_TO_NVCC_COMPILER_DIR
> export LD_LIBRARY_PATH=PATH_TO_CUDA_LIB_DIR:PATH_TO_MKL_LIB_SUBDIR:PATH_TO_INTEL_MPI_LIB_DIR:PATH_TO_INTEL_SUITE_LIB_DIR
> export LIBPATH=PATH_TO_CUDA_LIB_DIR:PATH_TO_MKL_LIB_SUBDIR:PATH_TO_INTEL_MPI_LIB_DIR:PATH_TO_INTEL_SUITE_LIB_DIR
> export PATH=PATH_TO_NVCC_COMPILER_DIR:PATH_TO_INTEL_SUITE_BIN_DIR:$PATH

b) Build the executable:
- issue the commands:
> cd espresso-5.0.2
> cd GPU
> ./configure --enable-parallel --enable-openmp --enable-cuda --with-gpu-arch=35 \
    --with-cuda-dir=${CUDA_HOME} --disable-magma --enable-profiling \
    --enable-phigemm --without-scalapack
- then edit the file PW/Makefile and substitute line 44:
  all : tldeps pw-gpu.x manypw-gpu.x
  with the line:
  all : tldeps pw-gpu.x
- finally, issue the commands:
> cd ..
> make -f Makefile.gpu pw-gpu

If everything goes fine, you should find the executable "pw-gpu.x" in the directory espresso-5.0.2/bin.

3 - List and Purpose of Datasets

Three datasets are provided together with this benchmark: a small one (SiO2.tar.gz), to be used to run benchmarks inside a single node (or device) and as a test bed for code changes; a medium one (AuSurf.tar.gz), to run benchmarks on more than one node; and a large one (AuSurf-large.tar.gz), to run benchmarks on many nodes.

- The SiO2.tar.gz test-case requires only a few gigabytes of main memory to run, and can scale easily up to 32 or 64 cores.
- The AuSurf.tar.gz test-case requires less than 32 gigabytes of main memory and 2 gigabytes of disk space; you will need 2 or 4 nodes to run it. It can scale up to 256 or 512 cores.
- The AuSurf-large.tar.gz test-case is 8 times larger than AuSurf.tar.gz, and is meant for benchmark runs on 1024, 2048, or even more cores.

Usually, if the architecture is well balanced (i.e. the network performance is good enough to support the node performance), Quantum ESPRESSO displays linear weak scalability with the AuSurf.tar.gz and AuSurf-large.tar.gz test-cases. Under this hypothesis, you can therefore extrapolate the performance of AuSurf-large.tar.gz from the performance results of AuSurf.tar.gz.

4 - Run the Benchmarks

Quantum ESPRESSO reads a number of command-line parameters, which control the internal distribution of the data structures, as well as the standard input. For parallel execution, Quantum ESPRESSO requires a system launcher command (e.g. mpirun or mpiexec) to distribute the Quantum ESPRESSO instances across the nodes. The command-line parameters relevant for this benchmark are:

-input MY_INPUT_FILE (tells Quantum ESPRESSO to read input from MY_INPUT_FILE)
-npool P (tells Quantum ESPRESSO to use P pools to distribute data. P should be less than or equal to the number of k-points, and maximum scalability is usually reached with P exactly equal to the number of k-points. You can read the output to find out the number of k-points of your system)
-ntg T (tells Quantum ESPRESSO to use T task groups to distribute the FFTs. Optimal performance can usually be reached with T ranging from 2 to 8)
-ndiag D (tells Quantum ESPRESSO to use D processors to perform parallel linear algebra computations with ScaLAPACK. D can range from 1 to the maximum number of MPI tasks; the optimal value for D depends on the bandwidth and latency of your network)
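For instance, to choose a suitable value for -npool you can extract the number of k-points from the output of a previous run. A minimal sketch, assuming the "number of k points" string printed by pw.x in QE 5.x and an illustrative output file name:

> grep "number of k points" ausurf.out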
Some examples of possible command lines to execute QE are reported below.

SiO2 test-case (MPI only):
a) mpirun -np 4 $QE_PATH/bin/pw.x < SiO2-50Ry.in > SiO2-50Ry.out
b) mpirun -np 4 $QE_PATH/bin/pw.x -input SiO2-50Ry.in > SiO2-50Ry.out
c) mpirun -np 16 $QE_PATH/bin/pw.x -ntg 2 -ndiag 16 < SiO2-50Ry.in > SiO2-50Ry.out

AuSurf test-case (MPI & OpenMP):
a) export OMP_NUM_THREADS=4; mpirun -np 16 $QE_PATH/bin/pw.x \
   -ntg 2 -ndiag 16 < ausurf.in > ausurf.out
b) export OMP_NUM_THREADS=4; mpirun -np 32 $QE_PATH/bin/pw.x \
   -ntg 4 -ndiag 16 < ausurf.in > ausurf.out

AuSurf-large test-case (MPI & OpenMP):
a) export OMP_NUM_THREADS=4; mpirun -np 128 $QE_PATH/bin/pw.x \
   -ntg 2 -ndiag 64 -npool 2 < ausurf-large.in > ausurf-large.out
b) export OMP_NUM_THREADS=4; mpirun -np 512 $QE_PATH/bin/pw.x \
   -ntg 4 -ndiag 64 -npool 8 < ausurf-large.in > ausurf-large.out

5 - Collect and Report the Results

5.1 - Validate Results

To validate a benchmark result, you have to check the value of the total energy at convergence (ETOT). Proceed as follows:

- inside the running directory, issue the command:
> grep "total energy =" MY_OUTPUT_FILE | tail -1
- you should see a string like:
  total energy = -XXXX.YYYYYYYY Ry
  or
  ! total energy = -XXXX.YYYYYYYY Ry
- note that if this string is not present, the result is not valid!
- the value XXXX.YYYYYYYY is the ETOT. It may vary depending on the number of tasks and command-line parameters, but its variation should be limited to the last 3 digits.

Reference values for the datasets are reported below, where Y can be any digit:

- for the SiO2 test-case, valid results should have ETOT: -2622.42376YYY Ry +-0.00001
- for the AuSurf test-case, valid results should have ETOT: -11427.0820YYYY Ry +-0.0001
- for the AuSurf-large test-case, valid results should have ETOT: -11408.2091YYYY Ry +-0.0001

5.2 - Collect and Report the Results

QE has built-in profiling and timing functions, so to evaluate the performance of a given execution you simply need to locate the execution wall time (PWSCF_WTIME), which can be found in the PWSCF timing string (e.g. "PWSCF : 1m52.95s CPU 0m32.17s WALL") at the end of the output. You can use the command grep "PWSCF :" and take the value labelled WALL. Here "h", "m" and "s" stand for hours, minutes and seconds.

The results should be recorded and reported using the following table, where a few sample records for the different test-cases are shown as an example:

Dataset      | Architecture | # Tasks | # Threads x Task | # GPU | -ntg | -ndiag | -npool | ETOT               | PWSCF_WTIME
-------------|--------------|---------|------------------|-------|------|--------|--------|--------------------|---------------
SiO2         | EURORA       | 1       | 8                | 2     | 1    | 1      | 1      | -2622.42376369 Ry  | 0h15m WALL
AuSurf-large | BGQ          | 1024    | 4                | 0     | 2    | 64     | 4      | -11408.20916560 Ry | 11m52.68s WALL
AuSurf       | EURORA       | 4       | 8                | 4     | 1    | 1      | 1      | -11427.08209914 Ry | 0h24m WALL
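A minimal sketch that extracts ETOT and PWSCF_WTIME from an output file for the report table; the grep patterns follow the output strings quoted above, and the file name is only illustrative:

#!/bin/bash
# Hypothetical helper: extract ETOT and wall time from a pw.x output file.
OUT=${1:-ausurf.out}
# the last "total energy =" line holds the converged value (see Section 5.1)
ETOT=$(grep "total energy =" "$OUT" | tail -1 | awk -F'=' '{print $2}')
# in "PWSCF : ... CPU ... WALL" the wall time is the next-to-last field
WTIME=$(grep "PWSCF *:" "$OUT" | awk '{print $(NF-1)}')
echo "ETOT: ${ETOT:-NOT FOUND (result is not valid)}"
echo "PWSCF_WTIME: ${WTIME:-NOT FOUND}"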
6 - Benchmark Rules

The following Quantum ESPRESSO Benchmark Suite rules have to be adhered to for the result of an execution to be considered valid.

- Only minor changes of the source code, especially due to portability issues, are allowed. In any case, no more than 10% of the source lines (not including comments and empty lines) can be modified. If any other changes are introduced, they must be reported with the results in order to be checked and validated.
- Replacement of the numerical libraries already supported and validated for Quantum ESPRESSO with alternative libraries is not allowed. However, the version of a library can be substituted with a more recent release. In such a case, the version number of the library has to be clearly mentioned when submitting the results.
- There is no restriction on the usage of compile-line options. Nevertheless, for each code, the compile-line options used must be reported with the final results.
- Use of the C pre-processor is allowed only for supported configure and make flags as defined in the file "make.sys".
- No changes are allowed to the input files provided with the benchmark suite.
- At least three results for each dataset, with different numbers of cores, should be provided to allow for an estimation of the scalability.
- Any valid combination of threads and tasks is allowed. Quantum ESPRESSO will report invalid combinations.
- Any valid combination of the parallelization parameters (-ntg, -ndiag, -npool) is allowed. Quantum ESPRESSO will report invalid combinations.
- Extrapolations are allowed for the largest dataset, AuSurf-large.tar.gz.
- Any information concerning non-standard execution (underutilised nodes, user-defined MPI topologies, MPI task affinity, etc.) must be reported.
- For each execution, the numerical results of a run must pass the validation check in order to be considered valid; a sketch automating this check is given below.
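A minimal sketch of such an automated validation check, assuming the AuSurf test-case; the reference value and tolerance are those quoted in Section 5.1, the grep pattern follows the same section, and the output file name is only a placeholder:

#!/bin/bash
# Hypothetical validation helper for the AuSurf test-case (Section 5.1).
OUT=${1:-ausurf.out}
REF=-11427.0820   # reference ETOT quoted in Section 5.1 (assumption: AuSurf)
TOL=0.0001        # tolerance quoted in Section 5.1
# the converged total energy is the next-to-last field of the last match
ETOT=$(grep "total energy =" "$OUT" | tail -1 | awk '{print $(NF-1)}')
if [ -z "$ETOT" ]; then
    echo "No 'total energy =' line found: the result is NOT valid"
    exit 1
fi
# compare |ETOT - REF| against the tolerance using awk floating-point math
if awk -v e="$ETOT" -v r="$REF" -v t="$TOL" \
       'BEGIN { d = e - r; if (d < 0) d = -d; exit !(d <= t) }'; then
    echo "ETOT = $ETOT Ry: within tolerance, result is valid"
else
    echo "ETOT = $ETOT Ry: outside tolerance, result is NOT valid"
fi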