Instructions to Configure and Run Quantum ESPRESSO Benchmarks

This document is intended to give a quick and simple introduction to the Quantum ESPRESSO Benchmark suite. The document is organized as follows:

1 - Brief Description of Quantum ESPRESSO
2 - Download and Install Quantum ESPRESSO Benchmark Suite
3 - List and Purpose of Datasets
4 - Run the Benchmarks
5 - Collect and Report the Results
6 - Benchmark Rules

1 - Brief Description of Quantum ESPRESSO

Quantum ESPRESSO (http://www.quantum-espresso.org) is an integrated suite of computer codes for electronic-structure calculations and materials modelling, based on density-functional theory, plane waves, and pseudopotentials (norm-conserving, ultrasoft, and projector-augmented wave). Quantum ESPRESSO stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization. It is freely available to researchers around the world under the terms of the GNU General Public License.

Quantum ESPRESSO builds upon newly restructured electronic-structure codes that have been developed and tested by some of the original authors of novel electronic-structure algorithms, and applied in the last twenty years by some of the leading materials modelling groups worldwide. Innovation and efficiency are still its main focus, with special attention paid to massively parallel architectures and a great effort devoted to user friendliness. Quantum ESPRESSO is evolving towards a distribution of independent and inter-operable codes in the spirit of an open-source project, where researchers active in the field of electronic-structure calculations are encouraged to participate by contributing their own codes or by implementing their own ideas into existing codes. Quantum ESPRESSO is written mostly in Fortran90, and parallelised using MPI and OpenMP.

2 - Download and Install Quantum ESPRESSO Benchmark Suite

For this benchmark suite, the latest version of Quantum ESPRESSO, 5.0.3, will be used. The code is publicly available from the Quantum ESPRESSO web site (www.quantum-espresso.org) or from the download pages of the developers' portal (qe-forge.org). No authentication/registration is required.

Tarballs of the Quantum ESPRESSO source code are available at:
http://www.quantum-espresso.org/download/

Patches for the GPU-enabled version are available at:
https://github.com/fspiga/QE-GPU

The following procedure obtains the source tree for the benchmark. From a Linux/UNIX terminal, issue the following commands:

> wget http://qe-forge.org/gf/download/frsrelease/116/403/espresso-5.0.2.tar.gz
> wget http://qe-forge.org/gf/download/frsrelease/116/405/PHonon-5.0.2.tar.gz
> wget http://qe-forge.org/gf/download/frsrelease/128/435/espresso-5.0.2-5.0.3.diff
> wget http://qe-forge.org/gf/download/frsrelease/135/453/QE-GPU-r216.tar.gz
> wget http://qe-forge.org/gf/download/frsrelease/142/452/QE-5.0.2_GPU-r216.patch
> tar xvzf espresso-5.0.2.tar.gz
> cd espresso-5.0.2
> tar xvzf ../PHonon-5.0.2.tar.gz
> tar xvzf ../QE-GPU-r216.tar.gz
> patch -p1 < ../espresso-5.0.2-5.0.3.diff
> patch -p1 < ../QE-5.0.2_GPU-r216.patch

The commands above set up the source tree, ready to be configured. Three different configuration procedures are possible: CPU serial, CPU parallel, and CPU parallel + GPU.

QE is self-contained; nevertheless, optimal performance is usually obtained by linking the code against external standard libraries: BLAS, LAPACK, FFTW, and BLACS/ScaLAPACK (for parallel builds).
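Once "./configure" has been run (see the subsections below), you can quickly check which external libraries were actually picked up by inspecting the generated "make.sys" file. A minimal check, assuming the QE 5.x make.sys variable names:

> cd espresso-5.0.2
> grep -E "BLAS_LIBS|LAPACK_LIBS|FFT_LIBS|SCALAPACK_LIBS" make.sys

If a library is not detected, configure falls back to QE's internal copies, which are typically slower.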
A parallel build requires MPI 1.1 and, optionally, OpenMP. QE also supports NVIDIA GPGPUs, which can be used to accelerate linear algebra subroutines and a limited number of other time-consuming subroutines.

2.1 - Configure serial build

A serial build is usually useful to test scalar/vector optimizations. Proceed as follows:

a) Set the environment: add the compiler executable path to the system PATH.

b) Build the executable:
- go inside the espresso-5.0.2 directory:
> cd espresso-5.0.2
- issue the commands:
> ./configure --disable-parallel
> make all

If everything goes fine, you should find the executable "pw.x" in the directory espresso-5.0.2/bin.

2.2 - Configure parallel build

The parallel build is the mainstream and default build of QE. To obtain a parallel build, proceed as follows:

a) Set the environment. Assuming you would like to use the Intel compiler suite and Intel MPI, set the following environment variables:

> export I_MPI_F77=ifort
> export I_MPI_CXX=icpc
> export I_MPI_ROOT=PATH_TO_INTEL_MPI
> export I_MPI_F90=ifort
> export I_MPI_CC=icc
> export INTELMPI_HOME=PATH_TO_INTEL_MPI
> export INTEL_HOME=PATH_TO_INTEL_SUITE
> export F90=ifort
> export F77=ifort
> export CXX=icpc
> export MKL_INC=PATH_TO_MKL_INCLUDE_SUBDIR
> export MKL_INCLUDE=PATH_TO_MKL_INCLUDE_SUBDIR
> export MKL_LIB=PATH_TO_MKL_LIB_SUBDIR
> export MKL_HOME=PATH_TO_MKL
> export MKLROOT=PATH_TO_MKL
> export LD_LIBRARY_PATH=PATH_TO_MKL_LIB_SUBDIR:PATH_TO_INTEL_MPI_LIB_DIR:PATH_TO_INTEL_SUITE_LIB_DIR
> export LIBPATH=PATH_TO_MKL_LIB_SUBDIR:PATH_TO_INTEL_MPI_LIB_DIR:PATH_TO_INTEL_SUITE_LIB_DIR
> export PATH=PATH_TO_INTEL_SUITE_BIN_DIR:$PATH

b) Build the executable:
- go inside the espresso-5.0.2 directory:
> cd espresso-5.0.2
- issue the command:
> ./configure --enable-openmp --with-scalapack
- edit the file "make.sys" and substitute the string:
  -lmkl_blacs_openmpi_lp64
  with the string:
  -lmkl_blacs_intelmpi_lp64
- issue the command:
> make all

If everything goes fine, you should find the executable "pw.x" in the directory espresso-5.0.2/bin.
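The "make.sys" edit in step b) can also be performed non-interactively, e.g. with sed. A minimal sketch, assuming the library string appears in make.sys exactly as shown above:

> sed -i 's/-lmkl_blacs_openmpi_lp64/-lmkl_blacs_intelmpi_lp64/' make.sys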
2.3 - Configure parallel + CUDA build

This kind of build is used to take advantage of GPGPU accelerators.

a) Set the environment. If you want to use the Intel compiler suite and CUDA, you have to set the following environment variables as appropriate:

> export I_MPI_F77=ifort
> export I_MPI_CXX=icpc
> export I_MPI_ROOT=PATH_TO_INTEL_MPI
> export I_MPI_F90=ifort
> export I_MPI_CC=icc
> export INTELMPI_HOME=PATH_TO_INTEL_MPI
> export INTEL_HOME=PATH_TO_INTEL_SUITE
> export F90=ifort
> export F77=ifort
> export CXX=icpc
> export MKL_INC=PATH_TO_MKL_INCLUDE_SUBDIR
> export MKL_INCLUDE=PATH_TO_MKL_INCLUDE_SUBDIR
> export MKL_LIB=PATH_TO_MKL_LIB_SUBDIR
> export MKL_HOME=PATH_TO_MKL
> export MKLROOT=PATH_TO_MKL
> export CUDA_SDK=PATH_TO_CUDA_SDK
> export CUDA_INCLUDE=PATH_TO_CUDA_INCLUDE_DIR
> export CUDA_HOME=PATH_TO_CUDA_HOME
> export CUDA_INC=PATH_TO_CUDA_INCLUDE_DIR
> export CUDA_LIB=PATH_TO_CUDA_LIB_DIR
> export CUDA_CFLAGS=--compiler-bindir=/usr/bin
> export NVCC_HOME=PATH_TO_NVCC_COMPILER_DIR
> export LD_LIBRARY_PATH=PATH_TO_CUDA_LIB_DIR:PATH_TO_MKL_LIB_SUBDIR:PATH_TO_INTEL_MPI_LIB_DIR:PATH_TO_INTEL_SUITE_LIB_DIR
> export LIBPATH=PATH_TO_CUDA_LIB_DIR:PATH_TO_MKL_LIB_SUBDIR:PATH_TO_INTEL_MPI_LIB_DIR:PATH_TO_INTEL_SUITE_LIB_DIR
> export PATH=PATH_TO_NVCC_COMPILER_DIR:PATH_TO_INTEL_SUITE_BIN_DIR:$PATH

b) Build the executable:
- issue the commands:
> cd espresso-5.0.2
> cd GPU
> ./configure --enable-parallel --enable-openmp --enable-cuda --with-gpu-arch=35 \
    --with-cuda-dir=${CUDA_HOME} --disable-magma --enable-profiling \
    --enable-phigemm --without-scalapack
- then edit the file PW/Makefile and substitute line 44:
  all : tldeps pw-gpu.x manypw-gpu.x
  with the line:
  all : tldeps pw-gpu.x
- finally, issue the commands:
> cd ..
> make -f Makefile.gpu pw-gpu

If everything goes fine, you should find the executable "pw-gpu.x" in the directory espresso-5.0.2/bin.

3 - List and Purpose of Datasets

Three datasets are provided together with this benchmark: a small one (SiO2.tar.gz), to be used to run benchmarks inside a single node (or device) and as a test bed for code changes; a medium one (AuSurf.tar.gz), to run benchmarks on more than one node; and a large one (AuSurf-large.tar.gz), to run benchmarks on many nodes.

- The SiO2.tar.gz test-case requires only a few gigabytes of main memory to run, and can scale easily up to 32 or 64 cores.
- The AuSurf.tar.gz test-case requires less than 32 gigabytes of main memory and 2 gigabytes of disk space; you will need 2 or 4 nodes to run it. It can scale up to 256 or 512 cores.
- The AuSurf-large.tar.gz test-case is 8 times larger than AuSurf.tar.gz, and is meant for benchmark runs on 1024, 2048, or even more cores.

Usually, if the architecture is well balanced (i.e. the network performance is good enough to support the node performance), Quantum ESPRESSO displays linear weak scalability with the AuSurf.tar.gz and AuSurf-large.tar.gz test-cases. Under this hypothesis, you can therefore extrapolate the performance of AuSurf-large.tar.gz from the performance results of AuSurf.tar.gz.

4 - Run the Benchmarks

Quantum ESPRESSO reads a number of command-line parameters, which control the internal distribution of the data structures, as well as the standard input. For parallel execution, Quantum ESPRESSO requires a system launcher command (e.g. mpirun or mpiexec) to distribute the Quantum ESPRESSO instances across the nodes. The command-line parameters relevant for this benchmark are:

-input MY_INPUT_FILE (tells Quantum ESPRESSO to read input from MY_INPUT_FILE)
-npool P (tells Quantum ESPRESSO to use P pools to distribute data. P should be less than or equal to the number of k-points, and maximum scalability is usually reached with P exactly equal to the number of k-points. You can read the output to find out the number of k-points of your system)
-ntg T (tells Quantum ESPRESSO to use T task groups to distribute the FFTs. Optimal performance can usually be reached with T ranging from 2 to 8)
-ndiag D (tells Quantum ESPRESSO to use D processors to perform parallel linear algebra computations with ScaLAPACK. D can range from 1 to the maximum number of MPI tasks; the optimal value for D depends on the bandwidth and latency of your network)
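For instance, to choose a suitable value for -npool you can extract the number of k-points from the output of a previous run. A minimal sketch, assuming the "number of k points" string printed by pw.x in QE 5.x and an illustrative output file name:

> grep "number of k points" ausurf.out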
Some examples of possible command lines to execute QE are reported below.

SiO2 test-case (MPI only):
a) mpirun -np 4 $QE_PATH/bin/pw.x < SiO2-50Ry.in > SiO2-50Ry.out
b) mpirun -np 4 $QE_PATH/bin/pw.x -input SiO2-50Ry.in > SiO2-50Ry.out
c) mpirun -np 16 $QE_PATH/bin/pw.x -ntg 2 -ndiag 16 < SiO2-50Ry.in > SiO2-50Ry.out

AuSurf test-case (MPI & OpenMP):
a) export OMP_NUM_THREADS=4; mpirun -np 16 $QE_PATH/bin/pw.x \
   -ntg 2 -ndiag 16 < ausurf.in > ausurf.out
b) export OMP_NUM_THREADS=4; mpirun -np 32 $QE_PATH/bin/pw.x \
   -ntg 4 -ndiag 16 < ausurf.in > ausurf.out

AuSurf-large test-case (MPI & OpenMP):
a) export OMP_NUM_THREADS=4; mpirun -np 128 $QE_PATH/bin/pw.x \
   -ntg 2 -ndiag 64 -npool 2 < ausurf-large.in > ausurf-large.out
b) export OMP_NUM_THREADS=4; mpirun -np 512 $QE_PATH/bin/pw.x \
   -ntg 4 -ndiag 64 -npool 8 < ausurf-large.in > ausurf-large.out

5 - Collect and Report the Results

5.1 - Validate Results

To validate a benchmark result, you have to check the value of the total energy at convergence (ETOT). Proceed as follows:

- inside the running directory, issue the command:
> grep "total energy =" MY_OUTPUT_FILE | tail -1
- you should see a string like:
  total energy = -XXXX.YYYYYYYY Ry
  or
  ! total energy = -XXXX.YYYYYYYY Ry
- note that if this string is not present, the result is not valid!
- the value XXXX.YYYYYYYY is the ETOT. It may vary depending on the number of tasks and command-line parameters, but its variation should be limited to the last 3 digits.

Reference values for the datasets are reported below, where Y can be any digit:

- for the SiO2 test-case, valid results should have ETOT: -2622.42376YYY Ry +-0.00001
- for the AuSurf test-case, valid results should have ETOT: -11427.0820YYYY Ry +-0.0001
- for the AuSurf-large test-case, valid results should have ETOT: -11408.2091YYYY Ry +-0.0001

5.2 - Collect and Report the Results

QE has built-in profiling and timing functions, so to evaluate the performance of a given execution you simply need to locate the execution wall time (PWSCF_WTIME), which can be found in the PWSCF timing string (e.g. "PWSCF : 1m52.95s CPU 0m32.17s WALL") at the end of the output. You can use the command grep "PWSCF :" and take the value labelled WALL. Here "h", "m" and "s" stand for hours, minutes and seconds.

The results should be recorded and reported using the following table, where a few sample records for the different test-cases are shown as an example:

Dataset      | Architecture | # Tasks | # Threads x Task | # GPU | -ntg | -ndiag | -npool | ETOT               | PWSCF_WTIME
-------------|--------------|---------|------------------|-------|------|--------|--------|--------------------|---------------
SiO2         | EURORA       | 1       | 8                | 2     | 1    | 1      | 1      | -2622.42376369 Ry  | 0h15m WALL
AuSurf-large | BGQ          | 1024    | 4                | 0     | 2    | 64     | 4      | -11408.20916560 Ry | 11m52.68s WALL
AuSurf       | EURORA       | 4       | 8                | 4     | 1    | 1      | 1      | -11427.08209914 Ry | 0h24m WALL
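A minimal sketch that extracts ETOT and PWSCF_WTIME from an output file for the report table; the grep patterns follow the output strings quoted above, and the file name is only illustrative:

#!/bin/bash
# Hypothetical helper: extract ETOT and wall time from a pw.x output file.
OUT=${1:-ausurf.out}
# the last "total energy =" line holds the converged value (see Section 5.1)
ETOT=$(grep "total energy =" "$OUT" | tail -1 | awk -F'=' '{print $2}')
# in "PWSCF : ... CPU ... WALL" the wall time is the next-to-last field
WTIME=$(grep "PWSCF *:" "$OUT" | awk '{print $(NF-1)}')
echo "ETOT: ${ETOT:-NOT FOUND (result is not valid)}"
echo "PWSCF_WTIME: ${WTIME:-NOT FOUND}"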
6 - Benchmark Rules

The following Quantum ESPRESSO Benchmark Suite rules have to be adhered to for the result of an execution to be considered valid.

- Only minor changes of the source code, especially due to portability issues, are allowed. In any case, no more than 10% of the source lines (not including comments and empty lines) can be modified. If any other changes are introduced, they must be reported with the results in order to be checked and validated.
- Replacement of the numerical libraries already supported and validated for Quantum ESPRESSO with alternative libraries is not allowed. However, the version of a library can be substituted with a more recent release. In such a case, the version number of the library has to be clearly mentioned when submitting the results.
- There is no restriction on the usage of compile-line options. Nevertheless, for each code, the compile-line options used must be reported with the final results.
- Use of the C pre-processor is allowed only for supported configure and make flags as defined in the file "make.sys".
- No changes are allowed to the input files provided with the benchmark suite.
- At least three results for each dataset, with different numbers of cores, should be provided to allow for an estimation of the scalability.
- Any valid combination of threads and tasks is allowed. Quantum ESPRESSO will report invalid combinations.
- Any valid combination of the parallelization parameters (-ntg, -ndiag, -npool) is allowed. Quantum ESPRESSO will report invalid combinations.
- Extrapolations are allowed for the largest dataset, AuSurf-large.tar.gz.
- Any information concerning non-standard execution (underutilised nodes, user-defined MPI topologies, MPI task affinity, etc.) must be reported.
- For each execution, the numerical results of a run must pass the validation check in order to be considered valid; a sketch automating this check is given below.
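A minimal sketch of such an automated validation check, assuming the AuSurf test-case; the reference value and tolerance are those quoted in Section 5.1, the grep pattern follows the same section, and the output file name is only a placeholder:

#!/bin/bash
# Hypothetical validation helper for the AuSurf test-case (Section 5.1).
OUT=${1:-ausurf.out}
REF=-11427.0820   # reference ETOT quoted in Section 5.1 (assumption: AuSurf)
TOL=0.0001        # tolerance quoted in Section 5.1
# the converged total energy is the next-to-last field of the last match
ETOT=$(grep "total energy =" "$OUT" | tail -1 | awk '{print $(NF-1)}')
if [ -z "$ETOT" ]; then
    echo "No 'total energy =' line found: the result is NOT valid"
    exit 1
fi
# compare |ETOT - REF| against the tolerance using awk floating-point math
if awk -v e="$ETOT" -v r="$REF" -v t="$TOL" \
       'BEGIN { d = e - r; if (d < 0) d = -d; exit !(d <= t) }'; then
    echo "ETOT = $ETOT Ry: within tolerance, result is valid"
else
    echo "ETOT = $ETOT Ry: outside tolerance, result is NOT valid"
fi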