---------------------------------------------------------------------------
                      GYSELA_BENCH general README file
---------------------------------------------------------------------------

---------------------------------------------------------------------------
  Contents
---------------------------------------------------------------------------

* Code Description
------------------
  A. License information
  B. General description
  C. Coding and Parallelization

* Building the Code
-------------------
  A. Preliminaries
  B. Compiling

* Running the Code
------------------
  A. A set of 2 SIMPLE test cases
  B. Strong scaling from 512 up to 8192 threads

* Timing Issues
---------------

* Verification of the Numerical Results
---------------------------------------

* Reporting the Results
-----------------------

---------------------------------------------------------------------------

* Code Description
------------------

A. License information

The code is under the CeCILL-B license:

Copyright Status
!**************************************************************
!  Copyright Euratom-CEA
!  Authors :
!     Virginie Grandgirard (virginie.grandgirard [at] cea.fr)
!     Chantal Passeron     (chantal.passeron [at] cea.fr)
!     Guillaume Latu       (guillaume.latu [at] cea.fr)
!     Xavier Garbet        (xavier.garbet [at] cea.fr)
!     Philippe Ghendrih    (philippe.ghendrih [at] cea.fr)
!     Yanick Sarazin       (yanick.sarazin [at] cea.fr)
!
!  This code GYSELA (for GYrokinetic SEmi-LAgrangian)
!  is a 5D gyrokinetic global full-f code for simulating
!  the plasma turbulence in a tokamak.
!
!  This software is governed by the CeCILL-B license
!  under French law and abiding by the rules of distribution
!  of free software. You can use, modify and redistribute
!  the software under the terms of the CeCILL-B license as
!  circulated by CEA, CNRS and INRIA at the following URL
!  "http://www.cecill.info".
!**************************************************************

B. General description

The GYSELA_BENCH code is based on a semi-Lagrangian scheme and solves 5D
gyrokinetic ion turbulence in tokamak plasmas.

C. Coding and Parallelization

This version of Gysela is implemented in Fortran 90, with some calls in C.
It is a hybrid code with two levels of parallelism: MPI and OpenMP. The
main output files are in HDF5 format, so the HDF5 library IS REQUIRED and
must provide Fortran support (i.e. the corresponding Fortran modules must
be available). Parallel (MPI) HDF5 support is not required: Gysela does not
exploit this feature.

* Building the Code
-------------------

A. Preliminaries

First, an 'ARCH' environment variable must be initialized with the machine
name. For example, in your '.bashrc' file, add something like
'export ARCH=occigen'. You also need the 'GNU make' tool to generate the
executable.
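For instance, on the occigen machine used as an example in the next
section, the corresponding lines of the '.bashrc' file could look like the
sketch below. The module names are site-specific (they are simply the 2015
occigen setting quoted later in this README); adapt them to your own
machine:

   # Derive ARCH from the host name (strips trailing digits),
   # or set it explicitly, e.g. 'export ARCH=occigen'
   export ARCH=$(hostname -s | tr -d [0-9])
   # Compiler, MPI and HDF5 (with Fortran support) environment
   module load intel/15.0.0.090 hdf5/1.8.14 bullxmpi/1.2.8.3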
To adapt the Makefile for a new platform, modify
buildsystem/defaults/make.include (the 'Makefile' directly uses this
'make.include' file):
   - add a new section according to your ARCH variable,
   - give your Fortran and C compilers and, if needed, preprocessor
     directives,
   - check the library locations:
        - MPI library
        - HDF5 library
   - check all the compiler flags for appropriate values on your system.

For example, here is the part of the buildsystem/defaults/make.include file
for a given machine (occigen/CINES/France - 2015 setting):

   #***************************************** occigen *****
   ifeq ($(ARCH), occigen)
     # Use this setting for env variable ARCH:
     #    export ARCH=$(hostname -s | tr -d [0-9] )
     # Use this module configuration in your .bashrc:
     #    module load intel/15.0.0.090 hdf5/1.8.14 bullxmpi/1.2.8.3
     BASEHDF5=/opt/software/libraries/hdf5/1.8.14
     MAKE=make
     F90=mpif90
     # Debug mode in compilation
     ifeq ($(MAKECMDGOALS),debug)
       F90FLAGS = -O2 -cpp -g -check bounds -fpe0 -DDEBUG -diag-disable 5462
     else
       F90FLAGS = -O3 -cpp -xAVX -diag-disable 5462
     endif
     # preprocessor directive
     TIMERFLAG = -DTIMER
     # OpenMP flag
     OMPFLAG = -openmp
     LIBS    =
     LDFLAGS = ${OMPFLAG}
     # C directive
     CFLAGS = -O3
     # HDF5 libraries
     HDF5INCLUDE = -I$(BASEHDF5)/include
     HDF5LIB = -L$(BASEHDF5)/lib -lhdf5_fortran -lhdf5 -lhdf5_hl -lhdf5hl_fortran -lz -lsz
     CC = mpicc
   else
   ...

B. Compiling

Go to the src/ directory. For a quick compilation, and to perform the
benchmark, issue:

   CE_DFLAGS=-DQUICK_COMPIL make -j timer

The executable gysela_ti.exe is then created.

The options available for make are the following:
   make           : by default, generates the executable file gysela.exe
   make debug     : compiles with debugging options
   make clean     : removes all object and module files
   make distclean : deletes all object, module and executable files
   make timer     : compiles with all timers activated and generates
                    gysela_ti.exe

For a standard compilation (with much less timer instrumentation), just
type 'make clean; make'; the executable gysela.exe is created.

* Running the Code
------------------

The input data files are available in the WebDAV directory:
https://hpc-forge.cineca.it/files/gara_Tier0_2015/public/GYSELA/INPUT_DATASET/

A. Two simple test cases are provided in the directories prefixed with SIMPLE.

   SIMPLE_8/DATA  : input file that can be run with 8 threads inside a
                    single MPI process
   SIMPLE_64/DATA : input file for 8 MPI processes, each containing 8 threads

To run the first example (SIMPLE_8), you can use the following command:

   mpirun -np 1 ${SRCDIR}/gysela.exe 1>> gysela_res.out 2>> gysela_res.err

You can also copy and adapt the "*cmd" batch scripts of this directory to
define a batch job (a minimal sketch is given below). The Gysela code will
automatically create 8 threads in each MPI process: the number of threads
is set according to the value of the variable "Nbthread" at line 6 of the
DATA file. It is possible to change the number of threads inside each MPI
process by modifying simultaneously the "bloc_phi" and "Nbthread" variables
in the DATA file. Some acceptable values for the number of threads are:
8, 16, 32. There is *no need* to specify the number of threads via the
OMP_NUM_THREADS environment variable: a call to omp_set_num_threads(Nbthread)
is done inside the Gysela code.
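For reference, a minimal batch script for the SIMPLE_64 case (8 MPI
processes with 8 threads each) might look like the sketch below. It assumes
a SLURM scheduler, 32 cores per node and a ${SRCDIR} variable pointing to
the src/ directory containing gysela.exe; none of these are mandated by
Gysela itself, so adapt the sketch to your own scheduler and node size (or
start from the "*cmd" scripts shipped with the dataset):

   #!/bin/bash
   #SBATCH --job-name=gysela_simple64
   #SBATCH --nodes=2              # assumption: 32 cores per node
   #SBATCH --ntasks=8             # 8 MPI processes
   #SBATCH --cpus-per-task=8      # 8 OpenMP threads per MPI process (Nbthread)
   #SBATCH --time=01:00:00

   # Run from the SIMPLE_64 directory that contains the DATA file.
   cd ${SLURM_SUBMIT_DIR}
   # No OMP_NUM_THREADS needed: Gysela calls omp_set_num_threads(Nbthread)
   # with the value read from the DATA file.
   mpirun -np 8 ${SRCDIR}/gysela.exe 1>> gysela_res.out 2>> gysela_res.err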
B. You can benchmark the Gysela code with a strong scaling experiment. This
test requires 23.5 GB per MPI process for the smallest run (STRONG_0512)
and less memory per MPI process for the larger runs. You can typically
deploy one or two MPI processes per compute node, depending on the number
of cores per node.

   STRONG_0512/DATA : input file for  32 MPI processes with 16 threads each
   STRONG_1024/DATA : input file for  64 MPI processes with 16 threads each
   STRONG_2048/DATA : input file for 128 MPI processes with 16 threads each
   STRONG_4096/DATA : input file for 256 MPI processes with 16 threads each
   STRONG_8192/DATA : input file for 512 MPI processes with 16 threads each

It is possible to change the number of threads inside each MPI process by
modifying simultaneously the "bloc_phi" and "Nbthread" variables in the
DATA file. The acceptable values for the number of threads are: 8, 16, 32.
There is no need to specify the number of threads via the OMP_NUM_THREADS
environment variable: a call to omp_set_num_threads(Nbthread) is done
inside the Gysela code.

Remark: it is possible to set up a larger strong scaling experiment on a
larger number of cores. To do so, multiply both the "Nproc_theta" and
"Ntheta" parameters in the input "DATA" file by a factor of 2 (a factor of
4 or 8 also works). This increases the number of MPI processes as well as
the mesh size along the theta direction of the simulation. You will also
have to multiply the number of MPI processes accordingly in your batch
submission script.

* Timing Issues
---------------

To retrieve the execution time of a Gysela run, you can use the following
command:

   $ grep "Total time (without" STRONG_*/gysela_res.out
   STRONG_0512/gysela_res.out: Total time (without init & diag) = 1170.81192967296
   STRONG_1024/gysela_res.out: Total time (without init & diag) = 598.070566773415

This gives the global execution time, excluding the initialization time and
the time spent saving the HDF5 files, which are typically not relevant for
this benchmark. Ideally, for the strong scaling experiment, the execution
time should be divided by two whenever the number of MPI processes is
multiplied by two.

* Verification of the Numerical Results
---------------------------------------

You can check whether you obtain correct results by comparing the
gysela_CL.out output file (which contains some macroscopic physics
variables) against the reference files provided:

   $ ls */gysela_CL.out
   SIMPLE_64/gysela_CL.out  SIMPLE_8/gysela_CL.out  STRONG_0512/gysela_CL.out

Typically, if you run the same case and look at the *last line* printed in
gysela_CL.out, it has to match the last line of the reference file (in each
column). Example:

   $ tail -n 1 SIMPLE_8*/gysela_CL.out
   ==> SIMPLE_8/gysela_CL.out <==
   4  6.000E+01  2.780996924424E+07  2.781040240991E+07  9.999844243293E-01 -1.694583348981E+01  6.404170071994E-01  8.248945523048E+07  1.809318846120E+06

For each floating point number in each column, only the first 8 digits are
considered significant.

* Reporting the Results
-----------------------

For the STRONG dataset, please provide at least four results, including the
reference configuration (512 cores, or the maximum configuration available).
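To gather the figures to report, a short shell loop such as the sketch
below can be used. It simply reuses the grep command shown in the "Timing
Issues" section; adapt the pattern if the set of STRONG_* directories you
actually ran differs:

   #!/bin/bash
   # Print, for each STRONG run, the directory name and the elapsed time
   # "Total time (without init & diag)" in seconds.
   for res in STRONG_*/gysela_res.out ; do
      time=$(grep "Total time (without" "$res" | awk -F'=' '{print $2}')
      printf "%-12s %s\n" "$(dirname "$res")" "$time"
   done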