
Parallelization: Fork, Threads, MPI and OpenMP

Fork-safety

The Fimex library can, as of version 0.56, be used with forked processes. This requires the fork system call provided by Unix/Linux environments. Fimex processes can be forked just before the data fetching, which achieves very good scaling for reading data. An example of how to use getDataSlice with pre-forking can be found in share/doc/examples/parallelRead.cpp in the Examples section.
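A rough sketch of the pre-forking pattern (not the exact contents of parallelRead.cpp; readSlice, nSlices, and nProcs are placeholders):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

/* Placeholder for the real work, e.g. Fimex's
   reader->getDataSlice(varName, i); create the reader before forking
   so that all children inherit the same setup. */
static void readSlice(size_t i)
{
    std::printf("reading slice %zu in process %d\n", i, (int)getpid());
}

int main()
{
    const size_t nSlices = 16; /* placeholder: length of the unlimited dimension */
    const int nProcs = 4;      /* number of forked reader processes */

    for (int p = 0; p < nProcs; ++p) {
        const pid_t pid = fork();
        if (pid == 0) {
            /* child p reads every nProcs-th slice */
            for (size_t i = p; i < nSlices; i += nProcs)
                readSlice(i);
            _exit(0);
        } else if (pid < 0) {
            std::perror("fork");
            return 1;
        }
    }
    /* parent waits for all children to finish */
    for (int p = 0; p < nProcs; ++p)
        wait(nullptr);
    return 0;
}

Each child reads a disjoint subset of the slices and the parent simply waits for them to finish; since the processes share nothing after the fork, no locking is needed.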

Thread-safety

The Fimex library can be used in threaded environments. Fimex objects are generally not thread-safe, so each object should only be used from a single thread; several threads can, however, each create and use their own Fimex objects.
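For example, each thread can construct and use its own reader. A minimal sketch (makeReader is a hypothetical stand-in for the application's reader construction):

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

/* Per-thread worker: creates and uses its own Fimex object,
   so nothing is shared between threads. */
static void processFile(const std::string& fileName)
{
    /* auto reader = makeReader(fileName);  // hypothetical: one Fimex object per thread */
    /* ... use the reader only from this thread ... */
    std::printf("processing %s\n", fileName.c_str());
}

int main()
{
    const std::vector<std::string> files = {"a.nc", "b.nc", "c.nc"};
    std::vector<std::thread> workers;
    for (const auto& f : files)
        workers.emplace_back(processFile, f);
    for (auto& t : workers)
        t.join();
    return 0;
}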

In addition, all CDMReader::get*Data*() operations are thread-safe and the following code will work nicely:

size_t unlimSlices = unLimDim->getLength();
#pragma omp parallel for default(shared)
for (size_t i = 0; i < unlimSlices; ++i) {
    try {
        doSomething(reader->getDataSlice(varName, i));
    } catch (...) {
        // exceptions must not escape an OpenMP parallel region
    }
}

OpenMP

Fimex can be built with OpenMP parallelization support using the --enable-openmp flag of configure; a number of code parts are parallelized as of version 0.35. Often, however, the performance is limited by the I/O system.

On the fimex command line, the number of threads can be set using:

fimex --num_threads=2 -c test.cfg

When using the library, the number of threads can be set with the standard OpenMP call:

...
#ifdef _OPENMP
    omp_set_num_threads(2); /* set the thread count before any parallel region */
#endif
/* below starts the other fimex code */
...
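A minimal self-contained sketch of this pattern, using only standard OpenMP (the printf stands in for the Fimex calls):

#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
#ifdef _OPENMP
    omp_set_num_threads(2); /* set before the first parallel region */
#endif
    const int nSlices = 8; /* placeholder: number of data slices */
#pragma omp parallel for default(shared)
    for (int i = 0; i < nSlices; ++i) {
        /* placeholder for doSomething(reader->getDataSlice(varName, i)) */
        std::printf("processing slice %d\n", i);
    }
    return 0;
}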

MPI

To get MPI to work, fimex must be built with MPI support, and the underlying NetCDF and HDF5 libraries must be built with parallel (MPI-IO) support.

fimex can then be called with mpiexec -n 8 fimex and will use parallel MPI-IO to write the NetCDF files, with some caveats.
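For example, a complete invocation could look like this (file names are placeholders; --input.file and --output.file are the usual fimex options, assumed here to apply unchanged under MPI):

mpiexec -n 8 fimex --input.file=in.nc4 --output.file=out.nc4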

Performance reading an 11 GB compressed NetCDF-4 file on a 16-core (32-thread) 2.6 GHz machine connected to a Lustre parallel filesystem; the factor column gives the speedup relative to the previous row:

nproc  time [s]  factor
 1     158.7
 2      79.2     2.0
 4      52.2     1.5
 8      29.0     1.8
16      19.4     1.5
32      21.5     0.9

The cumulative speedup is best at 16 processes: 158.7/19.4 ≈ 8.2.

Reading the same 11 GB compressed NetCDF-4 file and writing it back as an uncompressed 37 GB NetCDF-4 file:

nproc  time [s]  factor
 1     232.0
 2     147.6     1.6
 4     116.9     1.3
 8      99.4     1.2
16     104.0     0.9
32     119.6     0.8

Here the cumulative speedup peaks at 8 processes: 232.0/99.4 ≈ 2.3.

Adding other, compute-intensive data manipulations to the processing chain will usually improve the scaling, since less of the total time is spent in I/O.