CONTENTS
========

|__OUTPUT_DATASET
|  |__32
|  |  |__time1.log
|  |  |__time2.log
|  |__512
|     |__time1.log
|     |__time2.log
|__README
|__SRC
   |__openQCD-1.4.tar.gz

INTRODUCTION
============

openQCD is a public code widely used by researchers across Europe.
This benchmark measures the per-core speed of the Dirac operator
applied to a lattice of points.

COMPILATION
===========

Compiling the code requires a compiler (e.g. gcc) and an MPI library.

1) First define the following variables:

   export MPIR_HOME=/path/to/mpi/
   export MPI_HOME=${MPIR_HOME}
   export MPI_INCLUDE=${MPI_HOME}/include

2) After unpacking the source tar, edit the file

   openQCD-1.4/include/global.h

   and modify the following lines (the first lines after the comments
   and before the preprocessor directives):

   #define NPROC0 8
   #define NPROC1 8
   #define NPROC2 8
   #define NPROC3 8

   The number of cores that will be used is NPROC0*NPROC1*NPROC2*NPROC3.
   The included file is prepared for an 8*8*8*8 = 4096 core run.
   To run on fewer cores, decrease the values of these define lines
   while keeping the configuration as symmetric as possible.
   For example, to run on 128 cores:

   #define NPROC0 4
   #define NPROC1 4
   #define NPROC2 4
   #define NPROC3 2

   or, in short, 4*4*4*2. We suggest following this scheme:

   2*2*2*2      16 cores
   4*2*2*2      32 cores
   4*4*2*2      64 cores
   4*4*4*2     128 cores
   4*4*4*4     256 cores
   8*4*4*4     512 cores
   8*8*4*4    1024 cores

   The number of processes in each direction must be 1 or a multiple of 2.

3) Go to openQCD-1.4/devel/dirac and issue the command:

   make time1 time2 check2 check5 CC=mpicc CFLAGS='<CFLAGS_OPTIONS_HERE>'

   Please note that "-ansi" is included as a compiler option; it can be
   disregarded.

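The relation between the NPROC settings of step 2 and the resulting core
count can be checked before building. The following is a minimal sketch,
not part of openQCD: the function name count_cores is hypothetical, and it
assumes that global.h contains exactly one "#define NPROCn <value>" line
for each of n = 0..3, as shown above.

```shell
# count_cores FILE
# Print NPROC0*NPROC1*NPROC2*NPROC3 as defined in FILE (e.g. global.h).
# Fails (non-zero exit status) unless exactly four NPROC defines are found.
count_cores() {
    awk '
        BEGIN { total = 1; found = 0 }
        # Match lines of the form: #define NPROC0 8
        $1 == "#define" && $2 ~ /^NPROC[0-3]$/ { total *= $3; found++ }
        END {
            if (found != 4) exit 1
            print total
        }
    ' "$1"
}
```

For example, "count_cores openQCD-1.4/include/global.h" prints 128 for the
4*4*4*2 configuration. If your MPI launcher needs the process count on the
command line (many mpirun implementations accept "-np <count>"), this
value is the one to pass.
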
BENCHMARK RUN
=============

1) After building the executables "time1", "time2", "check2" and "check5"
   by following the instructions of Section "COMPILATION", please run the
   benchmarks by issuing the commands:

   mpirun ./time1
   sleep 2            (optional)
   mpirun ./time2
   sleep 2            (optional)
   mpirun ./check2
   sleep 2            (optional)
   mpirun ./check5

2) After completion of the timing runs, two files 'time1.log' and
   'time2.log' are created. The relevant numbers from the performance
   point of view are:

   time1.log:
   -----
   Time per lattice point for Dw():
   0.318 micro sec (6047 Mflops)

   time2.log:
   -----
   Time per lattice point for Dw_dble():
   0.616 micro sec (3117 Mflops)

3) Benchmark results must be reported as

   time = 0.8*time1 + 0.2*time2

   where time1 and time2 are the times in microseconds obtained by
   issuing the following commands:

   $ grep -A1 'Dw():' time1.log | tail -n 1
   $ grep -A1 'Dw_dble():' time2.log | tail -n 1

Please provide the result obtained for a run on 32 cores and for at least
three other configurations, including the largest one available to you.
For each run, please also provide the values of the "Maximal normalized
deviation" reported at the end of the output files "check2.log" and
"check5.log".
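
The weighted result of step 3 can be computed directly from the two log
files. The following is a minimal sketch, not part of openQCD: the function
names extract_time and weighted_time are hypothetical, and it assumes that
the time in microseconds is the first field on the line following the
matched label, as in the sample output of step 2.

```shell
# extract_time LABEL LOGFILE
# Print the first field of the line that follows the line matching LABEL.
extract_time() {
    grep -A1 "$1" "$2" | tail -n 1 | awk '{ print $1 }'
}

# weighted_time TIME1 TIME2
# Print 0.8*TIME1 + 0.2*TIME2 with four decimal places.
weighted_time() {
    awk -v t1="$1" -v t2="$2" 'BEGIN { printf "%.4f\n", 0.8 * t1 + 0.2 * t2 }'
}
```

Usage, from the directory containing the log files:

   t1=$(extract_time 'Dw():' time1.log)
   t2=$(extract_time 'Dw_dble():' time2.log)
   weighted_time "$t1" "$t2"

With the sample values of step 2 (0.318 and 0.616) this prints 0.3776.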