CONTENTS
========

|__OUTPUT_DATASET
|  |__32
|  |  |__time1.log
|  |  |__time2.log
|  |__512
|     |__time1.log
|     |__time2.log
|__README
|__SRC
   |__openQCD-1.4.tar.gz



INTRODUCTION
============

openQCD is a public code widely used by researchers across Europe.
The following benchmark is based on the speed per core of the
Dirac operator applied to a lattice of points.


COMPILATION
===========

Compiling the code requires a C compiler (e.g. gcc) and an MPI
library.


1) First, define the following environment variables:

export MPIR_HOME=/path/to/mpi/
export MPI_HOME=${MPIR_HOME}
export MPI_INCLUDE=${MPI_HOME}/include
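
If you are unsure where your MPI library is installed, the location
of the mpicc wrapper is usually a good hint (a sketch, assuming
mpicc is on your PATH):

   export MPIR_HOME=$(dirname $(dirname $(which mpicc)))
   export MPI_HOME=${MPIR_HOME}
   export MPI_INCLUDE=${MPI_HOME}/include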


2) After unpacking the source tarball, edit the file

   openQCD-1.4/include/global.h

and modify the following lines (the first lines after the comments
and before the rest of the preprocessor directives):

   #define NPROC0 8
   #define NPROC1 8
   #define NPROC2 8
   #define NPROC3 8

The number of cores that will be used is NPROC0*NPROC1*NPROC2*NPROC3.
The included file is prepared for an 8*8*8*8 = 4096-core run. If you
want to decrease the number of cores, simply decrease the values of
these #define lines, keeping the configuration as symmetric as
possible. For example, to run on 128 cores:

   #define NPROC0 4
   #define NPROC1 4
   #define NPROC2 4
   #define NPROC3 2

or, in short, 4*4*4*2. We suggest following this scheme:

   2*2*2*2   16 cores
   4*2*2*2   32 cores
   4*4*2*2   64 cores
   4*4*4*2  128 cores
   4*4*4*4  256 cores
   8*4*4*4  512 cores
   8*8*4*4 1024 cores

The number of processes in each direction must be 1 or a multiple 
of 2.
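
If you prefer to patch these values from the shell rather than with
an editor, here is a minimal sketch using GNU sed in-place editing
(it assumes a single space after each macro name and uses the
128-core layout from above):

   cd openQCD-1.4/include
   sed -i 's/#define NPROC0 .*/#define NPROC0 4/' global.h
   sed -i 's/#define NPROC1 .*/#define NPROC1 4/' global.h
   sed -i 's/#define NPROC2 .*/#define NPROC2 4/' global.h
   sed -i 's/#define NPROC3 .*/#define NPROC3 2/' global.h
   cd -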

3) Go to 

   openQCD-1.4/devel/dirac

and issue the command:

   make time1 time2 check2 check5 CC=mpicc CFLAGS='<CFLAGS_OPTIONS_HERE>' 
 
Please note that "-ansi" is included as a compiler option by the
Makefile and can be disregarded.
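
For instance, a typical invocation with gcc-style optimization flags
(the flags below are illustrative, not prescriptive; tune them for
your machine):

   make time1 time2 check2 check5 CC=mpicc CFLAGS='-O2 -march=native'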



BENCHMARK RUN
=============

1) After having obtained the executables "time1" and "time2" by 
following the instructions of Section "COMPILATION", please run 
the benchmarks by issuing the commands:

   mpirun ./time1
   sleep 2 (optional)
   mpirun ./time2
   sleep 2 (optional)
   mpirun ./check2
   sleep 2 (optional)
   mpirun ./check5
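
On a cluster managed by a batch scheduler, the same commands are
typically wrapped in a job script. A minimal sketch for SLURM (the
scheduler choice, task count and wall time are assumptions; adapt
them to your site):

   #!/bin/bash
   #SBATCH --job-name=openqcd-bench
   #SBATCH --ntasks=128
   #SBATCH --time=00:30:00

   cd openQCD-1.4/devel/dirac
   mpirun ./time1
   mpirun ./time2
   mpirun ./check2
   mpirun ./check5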


2) After completion of the run, the log files 'time1.log', 'time2.log',
'check2.log' and 'check5.log' are created. The relevant numbers from
the performance point of view are:

   time1.log:
   ----------

   Time per lattice point for Dw():
   0.318 micro sec (6047 Mflops)

   time2.log:
   ----------

   Time per lattice point for Dw_dble():
   0.616 micro sec (3117 Mflops)


3) Benchmark results must be reported as 
 	
   time = 0.8*time1 + 0.2*time2

where time1 and time2 are the times in microseconds obtained by
issuing the following commands:

   $ grep -A1 'Dw():' time1.log | tail -n 1
   $ grep -A1 'Dw_dble():' time2.log | tail -n 1 
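
A small shell sketch that extracts both numbers and evaluates the
weighted sum (it assumes the log line format shown above and that
awk and bc are available):

   t1=$(grep -A1 'Dw():' time1.log | tail -n 1 | awk '{print $1}')
   t2=$(grep -A1 'Dw_dble():' time2.log | tail -n 1 | awk '{print $1}')
   echo "time = $(echo "0.8*$t1 + 0.2*$t2" | bc -l) micro sec"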


Please provide us with the result obtained for a run on 32 cores and
with at least 3 other values, including the maximum available
configuration.
  
For each run, please also provide the values of the "Maximal
normalized deviation" reported at the end of the output files
"check2.log" and "check5.log".