Data format
Accessing many small files, as well as reading and writing small chunks of data (a few kB), is generally inefficient and should be avoided to fully exploit the potential of modern shared file systems. This was exactly the situation when each tomographic projection was stored as a separate (TIFF) file, as was until recently typical at most tomographic microscopy beamlines around the world, mainly to take direct advantage of the APIs of commercial detectors. For high efficiency, a few large files (6–8 GB) are instead recommended, with data read or written in large chunks (MB).
In this context, a data format has been selected that permits fast I/O and compatibility with data from other synchrotron sources: we adopted the scientific data exchange format [13], based on the HDF5 technology [14]. HDF5, a versatile data model for complex data objects and metadata, is particularly well suited to maximizing I/O efficiency. It imposes no limits on file size or on the number of objects stored in a file, and it integrates features to optimize both access time and storage space.
In our current implementation, the raw data are written to an HDF5 file on disk in a sequential way using the direct chunk write function [15] and an n-bit filter. The HDF5 technology also supports parallel writing. We have so far not exploited this feature, to keep maximum flexibility with regard to possible compression approaches, currently under investigation for tomographic data. It could, however, be integrated into the current framework should increased writing performance be required.
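As an indication of how such sequential writing can be done, the sketch below uses h5py's direct chunk write with one chunk per projection frame; the file name, dataset path, frame size, and the use of unfiltered chunks (the n-bit filter is omitted here) are assumptions for illustration, not the actual TOMCAT writer.

```python
# Hypothetical sketch: sequential writing of raw frames with the HDF5 direct
# chunk write (h5py >= 2.8). write_direct_chunk bypasses the HDF5 filter
# pipeline, so the bytes must already be in the on-disk chunk layout; here
# they are written unfiltered for simplicity.
import numpy as np
import h5py

n_proj, rows, cols = 1501, 2016, 2016          # illustrative dataset size

with h5py.File('scan_0001.h5', 'w') as f:
    dset = f.create_dataset('exchange/data',    # assumed data-exchange path
                            shape=(n_proj, rows, cols),
                            dtype='uint16',
                            chunks=(1, rows, cols))   # one chunk per frame
    for i in range(n_proj):
        # stand-in for a frame delivered by the detector API
        frame = np.random.randint(0, 2**12, size=(rows, cols), dtype='uint16')
        dset.id.write_direct_chunk((i, 0, 0), frame.tobytes())
```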
The reconstruction pipeline instead reads the raw data from file in a parallel fashion. The theoretical limit of 5 GB/s (the limit of our current GPFS file server) has been achieved while reading from a large HDF5 file using the Python h5py library [16]. The chunking strategy used is optimized for fast single-frame access, the most natural and general approach for tomographic data. Other options, for specific applications (e.g., absorption tomography), could be advantageous and are under evaluation.
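The consequence of this per-frame chunking can be seen in a small h5py example (dataset path and indices are illustrative): reading a single projection touches exactly one chunk, while extracting a sinogram row directly would touch every chunk, which is why the pipeline reads large blocks of frames and reorganizes them in memory.

```python
import h5py

# assumed layout: dataset of shape (n_proj, rows, cols), chunks (1, rows, cols)
with h5py.File('scan_0001.h5', 'r') as f:
    data = f['exchange/data']
    projection = data[100]          # one chunk read: fast single-frame access
    sino_row = data[:, 512, :]      # touches every chunk: better served by
                                    # reading whole frame blocks and
                                    # reorganizing them in memory
```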
Pipeline description
Main core
A typical full-field tomographic dataset acquired in a few minutes at a third-generation synchrotron source consists of a few thousand angular views (e.g., 1500–2000), each with more than 2000 × 2000 pixels, and a collection of dark- and white- (or flat-) field images used for normalization. Such a raw dataset routinely exceeds 16 GB.
The post-processing pipeline consists of 2 main blocks: a pre-processing part generating the sinograms and the tomographic reconstruction function itself (Fig. 1).
Sinogram generator
In this first step, each angular view is corrected for the dark current of the detector and the background is normalized using the average of the acquired white-field images. In addition, the dataset is reorganized into sinograms, each containing the information needed to reconstruct one tomographic slice. If this operation is performed in a naïve way, every projection image needs to be opened and a small chunk of data read to generate a single sinogram, resulting in poor scalability due to the high I/O load. Furthermore, if the generation of the sinograms for a typical dataset (usually on the order of 2000) were fully parallelized, this step would produce 1500 × 2000 simultaneous random accesses to the shared file system where the angular projections are stored, a clearly non-optimized procedure that quickly becomes a bottleneck, in particular at the high data rates of cutting-edge detectors. To overcome this bottleneck, MPI is used here: larger chunks of raw data are read and sent to the dedicated computing nodes at once, significantly improving the performance. The reader-to-compute core ratio is determined empirically. A ratio between 1:6 and 1:8 is advantageous for medium-size clusters. For larger clusters this ratio will be smaller (many reading cores, each reading only little data, is not optimal); for smaller systems it will be larger, to avoid ending up with a single reading core. It is important, in particular for memory reasons, that the reader cores are spread evenly across the nodes of the cluster (an equal number on each node).
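The per-pixel correction itself reduces to a dark subtraction and a division by the averaged white fields, followed by a reorganization by detector row; a minimal NumPy sketch (array names and shapes are assumptions) is:

```python
import numpy as np

def normalize(projections, darks, flats, eps=1e-6):
    """Dark-current correction and flat-field normalization, broadcast
    over a whole block of projections at once.
    projections: (n_proj, rows, cols); darks, flats: (n, rows, cols)."""
    dark = darks.mean(axis=0)
    flat = flats.mean(axis=0)
    return (projections - dark) / np.clip(flat - dark, eps, None)

def sinogram(norm_projections, row):
    """Sinogram for detector row `row`: one line from every angular view."""
    return norm_projections[:, row, :]
```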
Figure 2 shows the skeleton of the developed sinogram generation software. The main application is started on all requested cores and performs the MPI environment and class instance initializations. Based on its MPI process rank, each core is designated as a reading or a computing core and the corresponding class method is called. The assigned reading cores then read the raw data from disk and send them to the computing cores, which generate the sinograms.
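A minimal mpi4py sketch of this reader/computer split is shown below; the rank assignment, block shapes, and message tags are illustrative only and do not reproduce the actual TOMCAT classes (run with at least two MPI ranks).

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N_READERS = max(1, size // 7)        # roughly the 1:6-1:8 reader:computer ratio
N_BLOCKS = 12                        # number of raw-data blocks (illustrative)
computers = list(range(N_READERS, size))

if rank < N_READERS:
    # reading core: fetch large blocks of raw projections and ship them out
    for b in range(rank, N_BLOCKS, N_READERS):
        block = np.random.rand(100, 512, 512).astype(np.float32)  # disk-read stand-in
        comm.Send(block, dest=computers[b % len(computers)], tag=b)
else:
    # computing core: receive the blocks assigned to this rank, build sinograms
    my_blocks = [b for b in range(N_BLOCKS) if computers[b % len(computers)] == rank]
    for b in my_blocks:
        block = np.empty((100, 512, 512), dtype=np.float32)
        comm.Recv(block, source=MPI.ANY_SOURCE, tag=b)
        sinograms = block.transpose(1, 0, 2)   # stand-in for correction + reorganization
```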
The computed sinograms can either be written to disk or piped directly into the tomographic reconstruction software. In the latter case, at least the correct center of rotation needs to be known to ensure high-quality tomographic reconstructions. An additional routine, to be run prior to the sinogram generation, has therefore been developed. It runs on a single node, but uses all of its available cores. Following [17], it computes an estimate of the center of rotation and any dependence of this value on the sinogram position within a dataset. If the center of rotation varies as a function of the sinogram number, implying an imperfect experimental alignment, the projections can be rotated by the computed angle to compensate for the misalignment. For tomographic scans performed with the rotation axis positioned at the edge of the field of view, with the aim of doubling the sample size that can be accommodated without resorting to local tomography, the routine also provides the projection overlap, a figure required for the automatic stitching of projections acquired at angular positions 180° apart. All these estimated parameters are written together with relevant scan information (e.g., number of projections) to a log file, where they are accessible to the sinogram generator run in the next step of the pipeline.
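As a simple illustration of the underlying geometry (this is not the registration-based method of [17]): in a parallel-beam scan over 180°, the projection acquired at 180°, mirrored horizontally, is a shifted copy of the projection at 0°, and half of that shift is the offset of the rotation axis from the detector centre. A hedged sketch using FFT cross-correlation:

```python
import numpy as np

def estimate_center(proj_0, proj_180):
    """Estimate the rotation axis position (in pixels) from two flat-corrected
    projections 180 degrees apart, via phase correlation with the flipped view."""
    flipped = proj_180[:, ::-1]
    corr = np.fft.ifft2(np.fft.fft2(proj_0) * np.conj(np.fft.fft2(flipped))).real
    shift_x = np.unravel_index(np.argmax(corr), corr.shape)[1]
    if shift_x > proj_0.shape[1] // 2:          # unwrap negative shifts
        shift_x -= proj_0.shape[1]
    return (proj_0.shape[1] - 1) / 2.0 + shift_x / 2.0
```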
Tomographic reconstruction algorithm
Although in the future we plan to expand the reconstruction capabilities with selected iterative algorithms, the post-processing pipeline as currently implemented at TOMCAT exclusively uses gridrec [18]. Despite being based on the Fourier transform method, this fast analytic reconstruction algorithm has been validated as a valuable alternative to standard filtered back-projection routines. The advantage of Fourier techniques lies in their intrinsically smaller number of required operations compared to other analytical methods. Gridrec is highly optimized for conventional CPU technology and achieves competitive reconstruction speed without requiring more specialized architectures such as GPUs.
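For illustration only, the sketch below shows the Fourier-slice idea on which gridrec builds: the 1D FFT of each projection samples the object's 2D Fourier transform along a radial line, so the polar samples can be placed on a Cartesian grid and inverted with a single 2D FFT. The naive nearest-neighbour gridding used here is an assumption for brevity and is much cruder than gridrec's convolution-based regridding, so it is not the gridrec algorithm and its image quality is correspondingly poor.

```python
import numpy as np

def fourier_slice_reconstruction(sinogram, angles):
    """sinogram: (n_angles, n_det); angles in radians. Illustrative only."""
    n_angles, n_det = sinogram.shape
    # 1D FFT of each projection = one radial line of the object's 2D FFT
    proj_fft = np.fft.fftshift(
        np.fft.fft(np.fft.ifftshift(sinogram, axes=1), axis=1), axes=1)
    freqs = np.fft.fftshift(np.fft.fftfreq(n_det))       # cycles per pixel
    grid = np.zeros((n_det, n_det), dtype=complex)
    hits = np.zeros((n_det, n_det))
    for a, theta in enumerate(angles):
        fx, fy = freqs * np.cos(theta), freqs * np.sin(theta)
        ix = np.clip(np.round((fx + 0.5) * n_det).astype(int), 0, n_det - 1)
        iy = np.clip(np.round((fy + 0.5) * n_det).astype(int), 0, n_det - 1)
        np.add.at(grid, (iy, ix), proj_fft[a])            # nearest-neighbour gridding
        np.add.at(hits, (iy, ix), 1)
    grid[hits > 0] /= hits[hits > 0]                      # crude density compensation
    return np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(grid))).real
```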
For integration into the pipeline, the original code has been adjusted to be compatible with multi-processing. For maximum flexibility, two interfaces to the same routine have been created. To permit the tomographic reconstruction of existing sinograms stored on the file system, the gridRecMPIWrapper launches as many instances of the standard gridrec executable as needed to process all sinogram files. To reduce the I/O load and achieve the highest speed, the gridrec C code, compiled as a shared library, is instead loaded from Python, so that the sinograms can be delivered to the reconstruction routine directly from memory.
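The in-memory path follows the usual ctypes pattern of loading a shared library and passing NumPy buffers; the library name, function name, and signature below are hypothetical and serve only to illustrate the mechanism, not the actual gridrec interface.

```python
import ctypes
import numpy as np

lib = ctypes.CDLL('./libgridrec.so')                     # hypothetical library name
lib.reconstruct_sinogram.argtypes = [                    # hypothetical signature
    np.ctypeslib.ndpointer(np.float32, flags='C_CONTIGUOUS'),  # sinogram in
    ctypes.c_int, ctypes.c_int,                                # n_angles, n_det
    ctypes.c_float,                                            # centre of rotation
    np.ctypeslib.ndpointer(np.float32, flags='C_CONTIGUOUS'),  # slice out
]

def reconstruct(sino, center):
    n_angles, n_det = sino.shape
    out = np.zeros((n_det, n_det), dtype=np.float32)
    lib.reconstruct_sinogram(np.ascontiguousarray(sino, np.float32),
                             n_angles, n_det, center, out)
    return out
```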
The pipeline framework has been conceived in a modular way, so that additional pre- and post-processing steps, as they appear in the literature, can easily be integrated at a later stage. Currently available is a routine suppressing anomalously bright spots (zingers) typically observed in projection data when intense polychromatic radiation is used. They are the consequence of scattered X-ray photons hitting the detector chip directly and depositing significantly more energy than visible-light photons. Zingers translate into line artifacts in the reconstructed slices. The removal routine, inspired by [19], works on sinograms: it isolates the anomalous pixels by thresholding and substitutes them through an interpolation scheme. Two functions addressing ring artifacts are also included; more will be offered in the future. Concentric (half) rings with a variety of different characteristics are infamously common in tomographic slices. They can have different origins related to bad (non-linear, dead) detector elements, damaged or dirty scintillator screens, and fluctuating background beam profiles. All these possible causes impair an accurate flat-field correction, leading to sinograms contaminated by vertical lines, which back-project to circles in tomographic reconstructions. Both implemented routines for the mitigation of these artifacts work in the sinogram domain. The first approach, based on [20], takes advantage of the unsharp-mask filter idea. The second technique [21] decomposes the sinogram in the wavelet/FFT domain so as to clearly separate the artifacts from real features; the artifact contribution is collapsed along the abscissa in Fourier space, where it can easily be suppressed. For user convenience, the pipeline also offers the possibility to reconstruct just a region of interest, save the results in different image formats, and reconstruct a rotated version of the scanned object. The signal-to-noise ratio and sharpness in the tomographic volume can be controlled simply by selecting different reconstruction filters (Ram-Lak, Hanning, Parzen,…) and adjusting their cut-off frequency.
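A minimal sketch of the zinger suppression step is given below; it substitutes the flagged pixels with their median-filtered value, which stands in for the interpolation scheme of the actual routine, and the threshold and window size are illustrative choices.

```python
import numpy as np
from scipy.ndimage import median_filter

def remove_zingers(sino, threshold=1.2, size=3):
    """Flag sinogram pixels exceeding a smoothed reference by `threshold`
    times and replace them with the locally median-filtered value."""
    smooth = median_filter(sino, size=size)
    zingers = sino > threshold * smooth
    cleaned = sino.copy()
    cleaned[zingers] = smooth[zingers]
    return cleaned
```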
Phase contrast
Propagation-based phase contrast
Single-distance propagation-based phase contrast, a technique exploiting the coherence of synchrotron radiation, is heavily used by the TOMCAT user community. Its experimental simplicity (no specific hardware required), coupled with computationally efficient phase retrieval algorithms and a significant improvement of the contrast-to-noise (and dose) ratio in tomographic volumes [22], makes it a very appealing tool, and about 50% of the TOMCAT users take advantage of it. Phase contrast imaging is particularly suited to investigating biological samples characterized by small cross sections for hard X-rays. It is also a very powerful method for increasing contrast in samples composed of materials with similar X-ray linear attenuation coefficients and is increasingly exploited for material science applications as well. It has also been shown that phase retrieval (requiring projections at one single distance) can largely compensate for sub-optimal experimental conditions, such as the low photon counts typical of time-resolved experiments [22], and is thus a fundamental tool for the study of dynamic processes.
The modular design and implementation of the pipeline facilitates the a posteriori integration of different phase retrieval algorithms as simple Python functions. Currently available are routines based on the Paganin [23] (with a deconvolution step partially restoring the deteriorated spatial resolution [24]), the MBA [25], and the Moosmann [26] approaches.
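A minimal sketch of a Paganin-type single-distance filter is given below; it omits the deconvolution step of [24], and the parameter values in the usage note are examples only.

```python
import numpy as np

def paganin_filter(projection, pixel_size, dist, wavelength, delta_beta):
    """projection: flat-field corrected intensity I/I0 (single distance).
    pixel_size, dist, wavelength in metres; delta_beta = delta/beta.
    Returns a quantity proportional to the projected thickness."""
    ny, nx = projection.shape
    fy = np.fft.fftfreq(ny, d=pixel_size)        # spatial frequencies, cycles/m
    fx = np.fft.fftfreq(nx, d=pixel_size)
    f2 = fy[:, None]**2 + fx[None, :]**2
    lowpass = 1.0 + np.pi * wavelength * dist * delta_beta * f2
    filtered = np.fft.ifft2(np.fft.fft2(projection) / lowpass).real
    return -np.log(np.clip(filtered, 1e-6, None))   # thickness up to a 1/mu factor
```

For example, `paganin_filter(proj, 6.5e-6, 0.05, 6.2e-11, 200)` would correspond to a 6.5 μm pixel size, 50 mm propagation distance, roughly 20 keV photons, and an assumed δ/β of 200.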
Grating interferometry
In contrast to simple single-distance phase retrieval techniques, grating interferometry provides quantitative information on the electron density distribution in a sample with higher sensitivity [27], albeit requiring a dedicated, rather complex setup and multiple projections at each angular position. These multiple projections encode information not only on the electron density distribution but also on the absorption and scattering properties of the investigated specimen. This complementary information can be separated by a pixelwise FFT analysis.
Such an X-ray grating interferometer is installed at the TOMCAT beamline [28], and the data manipulations and calculations required prior to tomographic reconstruction are integrated in the pipeline. For grating interferometry data, the post-processing pipeline includes an additional step before the sinogram generation, delivering 3 sets of tomographic projections based on 3 complementary contrast mechanisms: absorption, differential phase (DPC), and dark field. This stage is parallelized by distributing the computation for each angular position to individual cores. A wavelet-FFT filter [21] is used to remove residual horizontal stripes (related to beam vibrations) from the DPC projections to guarantee the highest reconstruction quality. These 3 datasets are then independently reconstructed following the traditional steps described above, using dedicated filters (e.g., the Hilbert filter for DPC reconstruction) where necessary. The entire process can be launched with one single command, in which the contrast of interest is specified.
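The pixelwise separation of the three signals amounts to comparing the first two Fourier components of the phase-stepping curve recorded with and without the sample; a minimal sketch (the array layout is an assumption: one full stepping period per scan, shape (n_steps, rows, cols)) is:

```python
import numpy as np

def analyze_stepping(steps_sample, steps_flat):
    """Pixelwise FFT analysis of a phase-stepping scan.
    Returns absorption (transmission), differential phase, and dark field."""
    Fs = np.fft.fft(steps_sample, axis=0)
    Ff = np.fft.fft(steps_flat, axis=0)

    a_s, a_f = np.abs(Fs[0]), np.abs(Ff[0])      # mean (0th harmonic)
    b_s, b_f = np.abs(Fs[1]), np.abs(Ff[1])      # 1st harmonic amplitude

    absorption = a_s / a_f                        # transmission image
    dpc = np.angle(Fs[1]) - np.angle(Ff[1])       # differential phase
    dpc = np.angle(np.exp(1j * dpc))              # wrap to (-pi, pi]
    dark_field = (b_s / a_s) / (b_f / a_f)        # visibility reduction
    return absorption, dpc, dark_field
```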
Software technologies
Most of the pipeline code is written in Python and is compatible with both the Enthought [29] and Anaconda [30] distributions. Python might not provide the ultimate computational speed and has some drawbacks (e.g., the Global Interpreter Lock) compared, for instance, to C. It is however very flexible and intuitive and does not require compilation, characteristics that will promote the further development of the code, to integrate new routines addressing new problems and needs, even by non-expert programmers such as beamline staff, after the initial implementation phase. Python also provides a large selection of fast, reliable, and easy-to-use scientific libraries. The pipeline implementation was, for instance, facilitated by the PyWavelets [31] and the more general NumPy libraries. NumPy array broadcasting is used extensively for standard arithmetic operations, guaranteeing C-like performance.
Raw data in TIFF or, preferably for highest performance, HDF5 format are read using the tifffile [32] and h5py [16] libraries, respectively.
Parallelization at the different stages of the pipeline is achieved using the Python implementation of the message passing interface, MPI for Python (Mpi4Py) [33]. The pipeline software can be run on a single multi-core machine but can also take advantage of high performance computing facilities. To access such facilities, and to optimally exploit the available computational resources on dedicated clusters, a batch-queuing system is mandatory. Our implementation works with both Sun Grid Engine (SGE, being discontinued) and SLURM (Simple Linux Utility for Resource Management) [34]. These cluster management and job scheduling systems are responsible for accepting, scheduling, dispatching, and managing the distributed execution of a large number of different jobs, including job arrays; job dependencies can be defined too. They also manage and schedule the allocation of distributed resources such as processors, memory, and disk space. Different priorities can be assigned to different jobs: on a dedicated beamline cluster with multiple simultaneous users, it is thus possible to take advantage of the computational resources for offline calculations without significantly affecting the performance of jobs related to an ongoing experiment.
Hardware
The TOMCAT beamline runs a few dedicated small clusters with a total of more than 100 cores, with different queues and priorities. At the Paul Scherrer Institute, 2 additional larger-scale computational facilities (more than 700 cores) can also be accessed via a queue system; the newer one will be opened (also remotely) to the user community. The post-processing pipeline can be deployed on all these different systems in a way that is almost transparent to the standard user.
The nodes of each cluster are interconnected by InfiniBand. To optimally exploit its power, the size of the dispatched MPI packages should be at least a few MB. InfiniBand is also used to connect the nodes to the GPFS storage, making the time spent on I/O operations negligible compared to the overall run time.
Graphical user interface (GUI)
The microtomography user community is very broad, and the beamline users have very diverse IT knowledge and experience, ranging from standard Windows users (most common), familiar with menus and buttons, to computer experts (rare). To enable users to reconstruct their tomographic data independently, without continuous support from the beamline staff, we have developed a simple graphical user interface (GUI) (Fig. 3). It allows easy tweaking of phase retrieval and reconstruction parameters and submission of the full reconstruction of a standard tomographic dataset to the computing cluster, without resorting to error-prone command-line operations. The users do not need to know where and in which format the raw data are stored, nor do they have to be familiar with high performance computing: a few button clicks are enough for reconstruction optimization and submission. For more complex dynamic experiments, for instance those producing single HDF5 files with multiple datasets, the current GUI is not adequate and reconstruction via the command line is still necessary. Work is ongoing to standardize the scripts steering ultrafast experiments and the data acquisition in these more elaborate cases. This standardization should facilitate extending the current GUI to the most common time-resolved experiments.
The GUI is written in Python/Jython and has been developed as a plugin for Fiji [35]. It has been necessary to implement only the aspects strictly related to the post-processing pipeline, while common tools for image analysis (histogram plot, line profile, filters, contrast enhancement,…) are readily available from the Fiji package.