
Parallelization: Fork, Threads, MPI and OpenMP

Fork-safety

The Fimex library can, as of version 0.56, be used with forked processes. This requires the fork system call provided by Unix/Linux environments. Fimex processes can be forked just before the data fetching, which achieves very good scaling for reading data. An example of how to use getDataSlice with pre-forking can be found in share/doc/examples/parallelRead.cpp in the Examples section.
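A rough sketch of the pre-forking pattern (not the exact contents of parallelRead.cpp; readSlice, nSlices, and nProcs are placeholders):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

/* Placeholder for the real work, e.g. Fimex's
   reader->getDataSlice(varName, i); create the reader before forking
   so that all children inherit the same setup. */
static void readSlice(size_t i)
{
    std::printf("reading slice %zu in process %d\n", i, (int)getpid());
}

int main()
{
    const size_t nSlices = 16; /* placeholder: length of the unlimited dimension */
    const int nProcs = 4;      /* number of forked reader processes */

    for (int p = 0; p < nProcs; ++p) {
        const pid_t pid = fork();
        if (pid == 0) {
            /* child p reads every nProcs-th slice */
            for (size_t i = p; i < nSlices; i += nProcs)
                readSlice(i);
            _exit(0);
        } else if (pid < 0) {
            std::perror("fork");
            return 1;
        }
    }
    /* parent waits for all children to finish */
    for (int p = 0; p < nProcs; ++p)
        wait(nullptr);
    return 0;
}

Each child reads a disjoint subset of the slices and the parent simply waits for them to finish; since the processes share nothing after the fork, no locking is needed.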

Thread-safety

The Fimex library can be used in threaded environments. Fimex objects are generally not thread-safe, so each object should only be used from a single thread; several threads can, however, each create and use their own Fimex objects.
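For example, each thread can construct and use its own reader. A minimal sketch (makeReader is a hypothetical stand-in for the application's reader construction):

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

/* Per-thread worker: creates and uses its own Fimex object,
   so nothing is shared between threads. */
static void processFile(const std::string& fileName)
{
    /* auto reader = makeReader(fileName);  // hypothetical: one Fimex object per thread */
    /* ... use the reader only from this thread ... */
    std::printf("processing %s\n", fileName.c_str());
}

int main()
{
    const std::vector<std::string> files = {"a.nc", "b.nc", "c.nc"};
    std::vector<std::thread> workers;
    for (const auto& f : files)
        workers.emplace_back(processFile, f);
    for (auto& t : workers)
        t.join();
    return 0;
}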

In addition, all CDMReader::get*Data*() operations are thread-safe and the following code will work nicely:

size_t unlimSlices = unLimDim->getLength();
#pragma omp parallel for default(shared)
for (size_t i = 0; i < unlimSlices; ++i) {
    try {
        doSomething(reader->getDataSlice(varName, i));
    } catch (...) {
        // exceptions must not escape an OpenMP parallel region
    }
}

OpenMP

Fimex can be built with OpenMP parallelization support using the --enable-openmp flag of configure; a number of code parts are parallelized as of version 0.35. Often, however, the performance is limited by the I/O system.

On the fimex command line, the number of threads can be set using:

fimex --num_threads=2 -c test.cfg

When using the library, the number of threads can be set with the standard OpenMP call:

...
#ifdef _OPENMP
    omp_set_num_threads(2); /* set the thread count before any parallel region */
#endif
/* below starts the other fimex code */
...
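A minimal self-contained sketch of this pattern, using only standard OpenMP (the printf stands in for the Fimex calls):

#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
#ifdef _OPENMP
    omp_set_num_threads(2); /* set before the first parallel region */
#endif
    const int nSlices = 8; /* placeholder: number of data slices */
#pragma omp parallel for default(shared)
    for (int i = 0; i < nSlices; ++i) {
        /* placeholder for doSomething(reader->getDataSlice(varName, i)) */
        std::printf("processing slice %d\n", i);
    }
    return 0;
}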

MPI

To get MPI to work, fimex must be built with MPI support, and the underlying NetCDF and HDF5 libraries must be built with parallel (MPI-IO) support.

fimex can then be called with mpiexec -n 8 fimex and will use parallel MPI-IO to write the NetCDF files, with some caveats.
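For example, a complete invocation could look like this (file names are placeholders; --input.file and --output.file are the usual fimex options, assumed here to apply unchanged under MPI):

mpiexec -n 8 fimex --input.file=in.nc4 --output.file=out.nc4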

Performance reading an 11 GB compressed NetCDF-4 file on a 16-core (32-thread) 2.6 GHz machine connected to a Lustre parallel filesystem; the factor column gives the speedup relative to the previous row:

nproc  time [s]  factor
 1     158.7
 2      79.2     2.0
 4      52.2     1.5
 8      29.0     1.8
16      19.4     1.5
32      21.5     0.9

The cumulative speedup is best at 16 processes: 158.7/19.4 ≈ 8.2.

Reading the same 11 GB compressed NetCDF-4 file and writing it back as an uncompressed 37 GB NetCDF-4 file:

nproc  time [s]  factor
 1     232.0
 2     147.6     1.6
 4     116.9     1.3
 8      99.4     1.2
16     104.0     0.9
32     119.6     0.8

Here the cumulative speedup peaks at 8 processes: 232.0/99.4 ≈ 2.3.

Adding other, compute-intensive data manipulations to the processing chain will usually improve the scaling, since less of the total time is spent in I/O.