Timing of Dw() and Dwhat()
--------------------------

64x32x32x32 lattice, 4x2x2x2 process grid, 16x16x16x16 local lattice
There are 32 MPI processes
The local size of the gauge field is 18 MB
The local size of a quark field is 6 MB
Using AVX instructions
Assuming SSE prefetch instructions fetch 64 bytes

Lattice parameters:
beta = 5.5
c0 = 1.0, c1 = 0.0
csw = 1.978

Open boundary conditions
cG = 0.55
cF = 0.9012

Time per lattice point for Dw():
0.362 micro sec (5305 Mflops)

Time per lattice point for Dwhat():
0.329 micro sec (5794 Mflops)