Data analysis using AMI
AMI runs alongside the data acquisition, is user-configurable, and requires no user coding or preparation to produce an analysis. AMI refers to a collection of software implemented in C++ and Qt consisting of (1) a shared memory server, a generic application that receives datagrams from the DAQ private network via UDP, builds them into events, and pushes them into shared memory; (2) a custom application that receives these events from shared memory, performs analyses, and exports viewable data such as plots; and (3) online_ami, the Qt-based GUI that runs on the control room consoles and serves as a network client to the AMI server, receiving users’ analysis configurations and displaying the resulting plots.
At the start of a run, the monitoring automatically learns which detectors are available in the data and makes their raw data available to the user with the click of a button. AMI is the default tool for real-time online analysis and feedback.
Shared memory analysis takes advantage of the fact that the LCLS data acquisition system uses UDP multicasts to simultaneously send data to the data cache nodes, which save data to disk, and to the monitoring nodes, where data from the last 16–32 events are stored in a Unix shared memory buffer. The UDP multicasts are made pseudo-reliable by enabling hardware-based Ethernet pause frames to create backpressure in the network if buffers become full. If the monitoring code is too slow to analyze the full event rate, the oldest events are discarded, ensuring that the results are always derived from the most recent data. Processes running on multiple cores can connect to the same shared memory server, which distributes different events to the different processes on the node and serializes client requests with datagram handling. The analysis results are then collected by a custom collection application and displayed to the operator by the online_ami client. AMI runs on an instrument’s monitoring nodes, which typically contain over 40 CPU cores. There is one shared memory input per monitoring node, but multiple clients can coexist so that users may monitor the data on different consoles and using different criteria. The processing load is distributed across the monitoring nodes, but because each node receives complete events, it is capable of fully analyzing any given event.
Users primarily interact with the online_ami GUI and use it to display and analyze information on the fly. The GUI provides a set of simple operations that can be cascaded to achieve a variety of monitoring measures. It can be used to perform many standard tasks such as displaying detector images and waveforms; displaying data as histograms, strip charts, scatter plots, etc.; and performing averaging, filtering, and other generic manipulations of the data including region-of-interest selection, masking, projections, integration, contrast calculation, and hit finding. AMI can be used to view raw or corrected detector images and to perform tasks such as background subtraction, detector correlations, and event filtering. For example, an analysis may require that only events in which the beam energy is above a certain threshold and a laser is present be plotted. The plot can be further manipulated, overlaid on other plots, displayed as a table, or saved to a text file or an image. All of the scalar data associated with the event, such as the beam energy, beamline diode values, encoder readouts, and EPICS [10] data, are also available and can be combined in user-defined algebraic expressions. AMI supports single-event waveform plots and image projections, which can be averaged, subtracted, and filtered. AMI has an algorithm for simple edge finding using a constant fraction discriminator. Displays of waveforms and images can be manipulated by adding cursors and performing cursor math or waveform shape matching. Users may also integrate their own code to perform even more sophisticated or device-specific processing, either by building a C++ module plug-in for AMI or by writing Python code to run in the psana framework. AMI algorithms are available from our Subversion repository, https://confluence.slac.stanford.edu/display/PCDS/Software+Repository. Instructions for code development are documented here: https://confluence.slac.stanford.edu/display/PCDS/AMI+Online+Monitoring.
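As an illustration of the edge-finding idea, the following is a minimal sketch of a constant fraction discriminator in Python/NumPy. It is not AMI's actual implementation; the fraction, delay, and threshold values are placeholders.

```python
import numpy as np

def cfd_edges(waveform, fraction=0.35, delay=4, threshold=0.1):
    """Minimal constant-fraction-discriminator sketch: attenuate the
    waveform, subtract a delayed copy, and report sign changes of the
    difference that coincide with signal above a relative threshold."""
    w = np.asarray(waveform, dtype=float)
    delayed = np.roll(w, delay)
    delayed[:delay] = w[0]                         # pad the delayed copy
    diff = fraction * w - delayed
    crossings = np.where(np.diff(np.sign(diff)) != 0)[0]
    return [i for i in crossings if w[i] > threshold * w.max()]
```

In AMI this kind of operation runs per event on the monitoring nodes; here it simply returns candidate edge indices for a single waveform.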
AMI can be used both on live data from shared memory and on offline data read from disk, without any coding. Figures 3 and 4 show examples of AMI waveform analysis and image displays. AMI is a useful tool for generic online analysis and feedback, but psana, described next, is a more comprehensive analysis tool that supports more experiment-specific analyses.
Data analysis using psana
The software framework psana handles importing the science data into memory (either staged from disk or streamed directly from the detectors), calibration, distributing events to multiple nodes/cores for parallel processing, and collecting the results and making them persistent. The psana framework is responsible for loading and initializing all user modules, loading one of the input modules to read data from XTC or HDF5 [11] files, calling appropriate methods of user modules based on the data being processed, providing access to data as a set of C++ classes and Python classes, and providing other services, such as histogramming, to the user modules.
The core portion of psana is written largely in C++, but psana supports both C++ and Python as user interfaces. Over time, it has become clear that Python is the preferred user interface for several reasons. First, Python analyses can be developed quickly, and short development times are a necessity given the rapid turnover of LCLS experiments and the changing analysis requirements during an experiment. Second, C++ presents a steep learning curve for many users. Finally, the observed trend at US light-source facilities and free-electron lasers around the world is to use Python and its associated tools.
In addition to providing data access, psana also provides simple Python interfaces to complex algorithms. One commonly used example is the analysis code for the XTCAV detector [12], which is used to calculate lasing power as a function of time (on the femtosecond time scale) for each LCLS shot. Another example is the algorithm that computes the time separation between a pump laser and the LCLS shot [13]. Users are able to put together short Python building blocks to quickly express the complexity of their experiment. Many of these building blocks are publicly available on the web and so can be reused at any facility. We hope to include algorithms that are not LCLS-specific in globally available, photon-science-specific Python packages that can be reused across laboratories; one such candidate is the publicly available scikit-beam project [14]. Psana and all its algorithms are open source and freely available from our Subversion repository. Instructions for code development and collaborative tools are documented here: https://confluence.slac.stanford.edu/display/PCDS/Software+Repository.
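As a minimal sketch of this building-block style (not the XTCAV or timing-tool code itself), the following loops over events and retrieves calibrated detector data; the experiment, run, and detector names are placeholders.

```python
import psana

ds = psana.DataSource('exp=xpptut15:run=54')   # placeholder experiment and run
det = psana.Detector('cspad')                  # placeholder area detector name

for nevent, evt in enumerate(ds.events()):
    img = det.calib(evt)          # pedestal-, gain-, and common-mode-corrected array
    if img is None:               # the detector may be absent from a given event
        continue
    print(nevent, img.shape, img.mean())
    if nevent >= 10:              # look at the first few events only
        break
```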
For performance, we support running psana in parallel using OpenMPI [15] through the Python wrapper MPI4Py [16]. Several other photon science analysis packages [17] reuse the psana code when running at LCLS: OnDA [18], Hummingbird [19], cctbx.xfel (part of the Computational Crystallography Toolbox) [20], the CrystFEL package [21], and Cheetah [22].
Interfaces
The data acquisition system is obligated to record all possible information to the data files, but the resulting complexity makes navigating the data difficult for users. As a result, in addition to an interface that provides access to all data, we have found it useful to provide a simpler interface that exposes only the information that most users typically access. We have also used this interface to capture commonality among detectors; e.g., all area detectors are transformed at a low level into NumPy arrays, either two-dimensional for a standard camera or three-dimensional for multi-panel cameras. This is a powerful idea: metadata associated with a detector, such as pedestals, masks, and per-pixel gains, can be given the same array shape as the real data, and data corrections then become efficient single-line NumPy operations like addition and multiplication.
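A minimal sketch of this idea, using illustrative array shapes for a hypothetical 32-panel detector:

```python
import numpy as np

# Illustrative shapes for a hypothetical 32-panel detector
raw       = np.random.rand(32, 185, 388).astype(np.float32)  # raw ADU for one event
pedestals = np.zeros_like(raw)          # per-pixel dark level, same shape as the data
gain      = np.ones_like(raw)           # per-pixel gain correction
mask      = np.ones_like(raw)           # 1 = good pixel, 0 = bad pixel

corrected = (raw - pedestals) * gain * mask   # the whole correction is one line
```

Because the constants share the detector's array shape, switching to a different detector changes only the shapes, not the correction code.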
For performance, it is important that Python is able to call C++. For this, we have written Boost.Python (http://www.boost.org) converter methods for a few high-level classes that allow transfer of data between Python and C++ without copying large data. Memory management is done mostly in C++ using reference counts. We also use Boost.Python wrappers to call C++ class methods from Python. This allows for event analysis in a combination of C++ and Python, although the large majority of users only see the simpler Python interface.
Random access and parallelization with psana
MPI is a world standard for scientific parallelization across multiple nodes, with each node having many CPU cores. For most LCLS analyses, events can be analyzed in parallel, and I/O is a common bottleneck, which can be addressed using multiple cores/nodes. Most LCLS analyses parallelize trivially, with different cores processing different events. The psana MPI process running on a given core/node needs a way to jump to the events it will process; that is, it needs random access to the large data rather than having to read through all the data. To achieve this, the data acquisition system writes additional small files, called small-data XTC files, in which each piece of large data (e.g., a camera) is replaced with a file offset into the full-data files. These small-data files keep the same XTC format as the full data so that the same tools can be used to read them. When running with MPI, each core quickly reads these small-data files and then jumps to the appropriate big data for the events it should analyze by passing the big-data file offset to the fseek routine. The threshold for deciding whether data are large or small currently defaults to 1 kB, but it can be overridden on the command line of the data acquisition software that records the data.
Further performance gains can be obtained from this small-data approach. For example, when processing an event, one can query the beam quality (contained in the small-data files) and, if the X-ray shot power was too low, avoid spending the time to read the large data for that event. Psana has been structured so that this conditional fetching can be done with a simple Python “if” statement.
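A minimal sketch of this pattern, assuming the ':smd' small-data mode and the usual round-robin event assignment; the experiment, run, detector names, the EBeam accessor, and the energy threshold are placeholders.

```python
from mpi4py import MPI
import psana

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

ds = psana.DataSource('exp=xpptut15:run=54:smd')   # small-data files drive the event loop
det = psana.Detector('cspad')                      # placeholder big-data detector
ebeam = psana.Detector('EBeam')                    # placeholder machine-data source

ngood = 0
for nevent, evt in enumerate(ds.events()):
    if nevent % size != rank:                      # each rank takes every size-th event
        continue
    beam = ebeam.get(evt)
    if beam is None or beam.ebeamL3Energy() < 13000.0:   # cheap test on small data only
        continue                                   # skip reading the big camera data
    img = det.calib(evt)                           # triggers the file-offset jump (fseek)
    if img is not None:
        ngood += 1

total = comm.reduce(ngood, op=MPI.SUM, root=0)
if rank == 0:
    print('events passing the filter:', total)
```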
Psana also implements a user interface, based on random access, which accepts an event identifier and immediately returns the appropriate event. This identifier is the Unix seconds/nanoseconds timestamp plus a 17-bit 360 Hz “fiducial” counter as described previously.
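A minimal sketch of this interface, assuming psana's indexed (':idx') mode; the experiment, run, and event index are placeholders.

```python
import psana

ds = psana.DataSource('exp=xpptut15:run=54:idx')   # indexed mode enables random access
run = next(ds.runs())
times = run.times()             # per-event identifiers (timestamp plus fiducial)
evt = run.event(times[1000])    # jump directly to the event with that identifier
print(evt.get(psana.EventId))
```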
Real-time analysis with psana
Prompt analysis of the data is critical for LCLS experiments, because such information is required for important decisions, e.g., beam tuning, moving detectors/samples, and evaluating whether or not sufficient statistics have been accumulated. Psana data analysis can be run in real time in two different modes: a shared memory mode, in which psana receives the DAQ network-multicast data, or a live-file mode, in which the data are read from the FFB storage layer:
1. In the shared memory mode, psana reads events from a shared memory buffer on the monitoring node and uses MPI to launch processes on the different nodes for full 120 Hz analysis.

2. In the FFB mode, the data acquisition small-data XTC files can be analyzed with MPI while the data are being written. If the software catches up to the end of the live file in this mode without seeing an end-run message, it will briefly sleep and try to read new data. If no new data appear within a timeout period, the software assumes no more events will appear and behaves as if the run had ended normally, albeit with a warning message.
The two online analysis approaches are complementary: FFB allows the user to analyze all events, at the risk of falling behind, whereas shared memory has only a small buffer of events, meaning that the displayed data are always up to date, but there is no guarantee that all events will be seen by the analysis software, i.e., if the software is too slow, events will be dropped. Further, psana allows the user to run the same analysis code online against the shared memory, in quasi-real time against the files on the FFB, and offline against data stored on disk.
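A minimal sketch of this reuse is shown below; the data-source strings are illustrative assumptions (the exact shared memory and FFB syntax depends on the local configuration), and the analysis body is a placeholder.

```python
import psana

def analyze(data_source):
    ds = psana.DataSource(data_source)
    for evt in ds.events():
        pass                    # identical analysis code in all three modes

# analyze('shmem=psana.0:stop=no')          # online, shared memory buffer (assumed syntax)
# analyze('exp=xpptut15:run=54:smd:live')   # quasi-real time, live files on the FFB
# analyze('exp=xpptut15:run=54:smd')        # offline, files on disk
```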
Real-time visualization with psana
In addition to the standard matplotlib [38] methods for visualization in Python, we have used PyQtGraph to support real-time visualization because it has excellent interactive manipulation tools for plots together with fast graphics performance. The Python interface of the ZeroMQ (ZMQ) package [23] is used to transport data between the analysis code and the display, which may be on a remote machine. We use the publish/subscribe mechanism of ZMQ so that many real-time copies of plots may be displayed on different computers. To open a display, the subscriber uses a one-line command, which specifies the publisher’s hostname and port number, as well as a list of plot names.
Users can also create a multiplot which guarantees that all plots within the multiplot display coherent information, e.g., from the same LCLS events. In parallel jobs, typically one core is chosen to gather the results from the other cores via MPI and then publish the plots.
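A minimal sketch of the publish/subscribe transport (not the actual psana plotting module); the host, port, and plot name are placeholders, and the received array would typically be handed to a PyQtGraph widget for display.

```python
import zmq

def publish_plot(image, port=12322):
    """Publisher side, typically on the MPI rank that gathers results."""
    ctx = zmq.Context.instance()
    pub = ctx.socket(zmq.PUB)
    pub.bind('tcp://*:%d' % port)
    pub.send_string('detector_image', zmq.SNDMORE)   # plot name doubles as the topic
    pub.send_pyobj(image)

def receive_plot(host='analysis-node', port=12322):
    """Subscriber side, possibly on a remote console."""
    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.connect('tcp://%s:%d' % (host, port))
    sub.setsockopt_string(zmq.SUBSCRIBE, 'detector_image')
    sub.recv_string()                                # topic frame
    return sub.recv_pyobj()                          # the NumPy array to display
```

In practice the publisher sends repeatedly inside the event loop, so late-joining subscribers simply pick up the next update.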
Build/release system
We use the SCons tool [24] to build all core Python/C++ packages of psana. The RHEL 5/6/7 operating systems are currently supported. All psana core and external packages are distributed using a modified form of APT [25] that supports relocatable RPM files. The repositories are made world-readable via http, so any user can download/run the APT code from the SLAC servers and quite easily install all psana binaries on a supported operating system. With the recent emphasis on Python-based analysis, we are considering a more Python-oriented release system, such as Anaconda [26], which would allow easier inclusion of Python external packages.
Detector calibration
LCLS supports calibrations of several area detectors, many of which have multiple panels. These calibrations include pedestal subtraction, bad-pixel determination, and common-mode noise removal, where the noise varies coherently across several channels of a detector within one event. All corrections, e.g., pedestal values and common-mode noise parameters, are stored in a run-dependent manner. The calibration data are stored in a hierarchical directory structure: an experiment contains several detectors, each of which has several parameter types with run-associated data files. We considered storage in a database, but felt that a simple directory structure would allow for easier portability of analysis to remote institutions. Most of the constants are stored in text files, but we anticipate storing future constants in hierarchical HDF5 files. The same file-based constants are used by both offline and online analysis, including the AMI tool.
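A minimal sketch of the run-range lookup implied by this layout (the real psana directory and file naming is more detailed); the paths, detector name, and the '<first_run>-<last_run|end>.data' naming convention are assumptions.

```python
import os

def constants_file(calib_dir, detector, ctype, run):
    """Return the constants file whose run range covers the requested run,
    assuming files named '<first_run>-<last_run|end>.data'."""
    folder = os.path.join(calib_dir, detector, ctype)
    for name in sorted(os.listdir(folder)):
        first, last = os.path.splitext(name)[0].split('-')
        if int(first) <= run and (last == 'end' or run <= int(last)):
            return os.path.join(folder, name)
    return None

# e.g. constants_file('/path/to/calib', 'Cspad.0', 'pedestals', run=54)
```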
Command line and GUI tools are provided to compute pedestals, noise values, and bad-pixel lists. The graphical interface allows users to take appropriate multi-panel unassembled detector data, e.g., powder-pattern diffraction-ring data, and graphically adjust the positions/rotations of the panels to create geometry constants. Optical measurements with a microscope and sophisticated crystallographic techniques [27] are used to determine the geometry more precisely. The tools are used to deploy calibration constants that are valid for user-specified run ranges.
Geometry for multi-panel detectors is defined using a multi-level hierarchical approach as shown in Fig. 5; each component is positioned with parameters defining its rotation and translation in the parent frame. Multiple independent detectors can be placed in the correct position relative to each other using this approach. In many experiments, the origin is defined as the interaction point between the sample being studied and the laser shot.
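A minimal sketch of this frame chaining; the rotation angles, translations, and panel shape are placeholders, and rotation is shown about a single axis for brevity.

```python
import numpy as np

def to_parent(xyz, rotation_deg, translation):
    """Map N x 3 coordinates from a child frame into its parent frame
    using a rotation about the parent z axis plus a translation."""
    a = np.radians(rotation_deg)
    rz = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
    return xyz @ rz.T + np.asarray(translation)

panel    = np.zeros((185 * 388, 3))                         # pixel centers in the panel frame
quad     = to_parent(panel, 90.0, [21.0, 33.0, 0.0])        # panel -> quadrant
detector = to_parent(quad, 180.0, [0.0, 0.0, 0.0])          # quadrant -> detector
lab      = to_parent(detector, 0.0, [0.0, 0.0, 100.0])      # detector -> interaction point
```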
Data type and data format
The data acquisition system produces many data types, implemented as C++ classes, and often these data types change with time as improvements are made. These changes are handled by introducing a new type for each modification using a custom-built data definition language (DDL) that allows us to represent the various data types in a language-independent manner. These descriptions are then compiled into language-specific Python or C++ classes. The DDL files are shared in common with the data acquisition system software, which uses C++, to guarantee a consistent description of LCLS data types between online/offline Python/C++ code.
The LCLS data acquisition system saves data in XTC format which consists of a hierarchical set of small headers that encapsulate larger data, where each container is mapped to a C++ class using an enumerated type. In the case of a dropped packet or missing data contribution, the header metadata associated with the event is annotated appropriately. It is an append-only data format, and only supports little-endian machines.
All code for writing/reading XTC data is contained in a library called pdsdata, which has minimal dependencies. All data needed for analysis, including low-rate monitoring data like temperatures/voltages, exist in the XTC files. Because there are multiple files per run, easy user analysis requires a software framework like psana to manage the data reading. Psana presents the events from the multiple files to the user in time order, as well as performing offline event building when required. The DAQ system assembles data from different detectors belonging to the same FEL shot into an event in real time, so each XTC file is typically a sequence of complete events. However, some detectors are shared across multiple data acquisition systems (although not simultaneously); their data files are recorded separately and are not included in the online event building process. To make these detectors easily available to users’ analysis code, psana additionally performs an offline event build, associating these data with the data acquisition data via the same timestamp at the time the data are read for analysis.
Because some users prefer HDF5 for offline analysis, the system provides a user-selectable translation service, configurable from the LCLS web portal application, that runs automatically on the FFB queues and translates the raw XTC data to HDF5 as the data are being taken. The service produces raw or calibrated data organized into datasets based on each device rather than on events. In addition, the data are self-describing, with no software infrastructure required for analysis. An HDF5 file has a hierarchical organization consisting of groups and datasets: groups can contain other groups and datasets, and datasets contain complex multi-dimensional data. This allows easy navigation from the “top” of the file to any object in that file, for example, /groupA/groupB/dataset1.
Users can take the data files off-site and analyze them in MATLAB, Python, or any other system that reads HDF5. Users can also customize the output of the translator by providing a configuration file to specify which data types should be translated or by including code that generates n-dimensional arrays which will automatically be included by the translator in the output.
While users do not need a software framework to work with LCLS HDF5, they all need to write the same code to correlate data from different datasets; that is, they need to match timestamps between the different datasets that the translator writes. This is essentially the event building process that psana must do with certain detectors. It is anticipated that as part of the LCLS-II upgrade the data acquisition system will write HDF5 files directly, given two critical new features in the HDF5 1.10.x series, namely the ability to read while writing and the ability to write to multiple files in parallel and aggregate them into one virtual dataset.
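A minimal sketch of this timestamp matching using h5py; the file name and group/dataset paths are hypothetical, since the real layout depends on the translator output.

```python
import h5py
import numpy as np

with h5py.File('experiment_run0054.h5', 'r') as f:
    cam_time   = f['/groupA/camera/time'][:]      # per-event timestamps
    cam_data   = f['/groupA/camera/data']         # large image dataset, read lazily
    diode_time = f['/groupB/diode/time'][:]
    diode_data = f['/groupB/diode/data'][:]

    # Keep only events that appear in both datasets
    common, cam_idx, diode_idx = np.intersect1d(cam_time, diode_time,
                                                return_indices=True)
    for i, j in zip(cam_idx, diode_idx):
        image = cam_data[i]                       # image and diode value from the same event
        diode = diode_data[j]
        # ... correlate image with diode reading ...
```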
Analysis computing resources
LCLS has accumulated 11 PB of data since start-up in 2009, and 24% of these data are currently available on disk. Frequently, the data acquisition rate exceeds 1 GB/s. For analysis, we provide 80 nodes, each with 2 Xeon X5675 processors and 24 GB of memory. These nodes use a 40 Gb/s InfiniBand connection [28] to access data on Lustre file systems [6] providing a total of 3.7 PB of offline storage. Running experiments also have priority access to two additional farms of 20 nodes, each with 2 Xeon E5-2640 processors and 128 GB of memory. These nodes are used for prompt data analysis against the FFB layer and are reserved for the running experiment using the standard SLAC batch system. They can also be used for general lower-priority jobs, which are automatically suspended when the higher-priority jobs of the running experiment are submitted.
Case study: serial femtosecond crystallography
About one-third of beam time allocations at LCLS are currently awarded to serial femtosecond crystallography (SFX) experiments. With LCLS, it is possible to probe the sub-picosecond time domain, e.g., by triggering chemical changes with an optical pump/X-ray probe arrangement [29], or to observe sub-populations of conformational variation in the protein ensemble that are key to understanding enzyme mechanism and regulation [30].
The primary issue in XFEL crystallography processing pipelines is orchestrating the movement of images through the machine’s memory hierarchy as efficiently as possible while concurrently scheduling analysis tasks. This section describes the SFX pipeline based on cctbx.xfel [20], part of the Computational Crystallography Toolbox, but other tools, such as the CrystFEL package [21], are also available to LCLS users.
Raw data from the X-ray sensors and from various diagnostic detectors are streamed at a sustained transfer rate near 10 Gb/s. With present data rates (120 Hz repetition rate and average image size of 4.5 MB), steady-state parallel analysis has been demonstrated, with the data being processed at the same rate they are acquired, by distributing the individual images to separate cores over multiple nodes [31]. Structural information is derived from the diffraction data collected from a stream of individual crystals. The Bragg spot intensities on each diffraction pattern are measured using the program cctbx.xfel. Four steps are executed in sequence: spotfinding (the identification of bright X-ray diffraction spots), indexing (the determination of the initial lattice model), refinement (parameter optimization for the lattice model), and integration (best-fit intensity modeling for individual Bragg spots). Simple parallelism is achieved by allocating each image to a different core. This level of parallelization is sufficient to keep up with current data rates with current analysis techniques, hence there is no present need for intra-image parallelism.
The top-level data reduction code from cctbx is called from within a psana script, which uses MPI to distribute the data. Concurrent processing is performed on approximately 1200 cores, corresponding to about 50 TFLOPs. This basic algorithm in the feature extraction pipeline for SFX image data from LCLS requires ~10 s/image single-threaded on a Xeon processor. Each of the four steps in the algorithm takes ~2.5 s to complete. The overall cycle time from data acquisition to reduced data is about 10 min.
An alternative SFX pipeline using psocake for spotfinding takes approximately 1.1 s/image. The indexing and integration steps in CrystFEL take ~10 s/image; however, 95% of this time is spent reading an input HDF5 file containing the detector images and the spotfinding results, suggesting that large gains can be achieved by bypassing the file system.
The current algorithms for SFX use the coarse approximation that each Bragg spot is located at a discrete mathematical point on an idealized lattice, with signal represented by summation of nearby pixel intensities. It has been shown that more accurate analysis is possible with protocols needing 100- to 1000-fold more CPU time [32].
Psocake
Since a typical LCLS experiment has millions of snapshots to choose from, it is critical to provide a means to quickly select images of interest and set regions of interest using masks. Included in psana is a graphical user interface called psocake [33] for viewing area detector images (CsPad, pnCCD, Opal, etc.), which can be used to tune peak-finding parameters and examine the data more closely. For example, one can mouse over a detector pixel display and identify its x and y pixel position and ADU value. Regions of interest can be selected, masks can be drawn and applied, and events can be browsed using forward and back buttons. The user may save any displayed event as a NumPy array and can load and apply NumPy arrays to the image. For example, there is an option to launch an MPI job that saves a virtual powder pattern (mean, std, max) in a NumPy array. Users can click a button to optimize hit-finding parameters, hit-finding algorithms, and common-mode correction parameters for their experiment. Psocake and the algorithms are freely available from our Subversion repository: http://java.freehep.org/svn/repos/psdm/list/.
From within psocake, the user can tune hit-finding parameters and launch peak-finding jobs on multiple runs. The results of these jobs, the number of peaks found for each event, can be plotted (and refreshed) within psocake while the jobs are still running. By clicking on the plot, one can jump to the corresponding event and easily browse the most interesting images based on the number of peaks. Psocake also assists the user in performing crystal indexing using an accurate detector geometry. Figure 6 shows an example of the psocake tool being used to inspect peaks found in an image.
Architectural choices
The main difference between our system and other comparable systems, especially those found in high-energy physics (HEP) experiments, is the lack of a veto or trigger system. While a veto mechanism is part of the design, it was never deployed for the following reasons:
- Many LCLS experiments have hit rates close to 100%, i.e., most pulses produce useful events. This is fundamentally different from most HEP experiments, where the rate of a specific physics process is limited by the cross section of that process. This implies that the LCLS DAQ system had to be designed to handle the full machine rate.

- Experiments change on a weekly basis: these changes are often profound enough that adapting the veto/trigger parameters and algorithms to each experiment would represent a huge effort.

- At the 120 Hz repetition rate of the source, and given the average size and quantity of sensors, our current system can sustainably read out all data from all sensors at the full rate without the need for a mechanism to reduce the data on the fly.

- Finally, obtaining the buy-in and collaboration of the various experimental groups in determining the right parameters and algorithms for selecting data on the fly proved very difficult.
Because of the cost of building and maintaining a large storage system, we encourage the users, through the retention policy, to keep only the useful data on disk. Data may be reduced in offline processing and selectively saved to disk, although a full copy of the raw data is still preserved on tape.
Another characteristic of the LCLS data system is the presence of multiple storage layers (data cache, fast feedback, and offline, as shown in Fig. 1). As discussed above, it is critical for the users to be able to perform prompt analysis on the data. While the separation between quasi-real-time and offline processing resources can be handled relatively well via the enforcement of high- and low-priority processing queues, the storage aspect was best handled by the introduction of dedicated resources for the running experiment. The separation between data cache and fast feedback is dictated by the need to separate the DAQ writes from the user activities. We believe this separation will not be necessary in the future with the adoption of flash-based storage technologies that handle concurrent access from different sources much better.