CS has quickly become one of the most important discoveries in the digital age. The theory of CS, and numerous implementations, shows that a signal can be compressed at the time of measurement and accurately recovered at a later time in software. In imaging applications, the compression can be applied spatially to reduce the number of pixels that need to be measured. This can lead to an increase in sensing speed, a decrease in data size, and dose reduction in the case of electron microscopy [13]. In video applications, the time dimension can be compressed. By compressing the sensed data in time, the total frame rate of a camera system is multiplied by integrating a sequence of coded images into a single frame from the camera. In this section, the statistical models and microscope hardware for an approach to compressively sensing and recovering videos will be described.

The traditional approach in signal acquisition is to sample and then compress. This is motivated by the Nyquist-Shannon sampling theorem, which states that in order to accurately reconstruct a signal it must be sampled at a frequency at least twice the highest frequency present. Figure 1 shows a sum of three sine waves with different frequencies and amplitudes. By the sampling theorem, a rate of at least 128 would be required to reconstruct the signal. Yet, in the frequency domain, three samples are sufficient; the signal is said to be 3-sparse under the Fourier basis. One notion of the CS problem is to design a non-adaptive sensing scheme to measure signals in the basis that makes the signal as sparse as possible—effectively reducing the number of measurements below the Nyquist rate [18]. This approach has the benefit of eliminating the overhead of sensing the entire signal according to the sampling theorem. Usually the basis is chosen to be Fourier modes or wavelets, but it is also possible to discover the basis from the measurements [25].
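The sparsity claim can be checked numerically with a minimal sketch. The three frequencies and amplitudes below are assumed for illustration (they are not the values used in Fig. 1), and the naive `dft` helper is written out only to keep the example self-contained.

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Assumed test signal: three sines with integer frequencies (cycles per window).
n_samples = 128
components = {3: 1.0, 10: 0.5, 25: 0.25}  # frequency -> amplitude (illustrative)
signal = [
    sum(a * math.sin(2 * math.pi * f * t / n_samples) for f, a in components.items())
    for t in range(n_samples)
]

spectrum = dft(signal)
# A real signal has a mirrored spectrum, so only the first half is inspected.
half = spectrum[: n_samples // 2]
support = [k for k, c in enumerate(half) if abs(c) > 1e-6]
print(support)  # indices of the non-negligible Fourier coefficients
```

Although the signal has 128 time samples, only three coefficients in the inspected half of the spectrum are non-negligible, which is what 3-sparse under the Fourier basis means.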

### CS background

In imaging problems, the signal has two spatial dimensions, so the basis must also have two spatial dimensions. Often, small two-dimensional images (and higher) are referred to as patches. Figure 2 shows the two-dimensional Haar wavelet basis alongside the discrete cosine basis (DCT)—the real part of the Fourier transform. The basis patches along the top and left sides are the same as the one-dimensional basis elements, except they have been copied to fill the second dimension. The interior of the table is formed by combining the basis patches along the top and left edges into all of the possible two-dimensional variants.^{1}

There are conditions on the design of the sensing scheme^{2}, but in practical applications and in this paper the sensing scheme will simply omit pixels randomly. The measurements are linear so they can be represented as a matrix *Φ* and the true signal as a vector *x* (flattened from the two-dimensional image). Expressed mathematically,

$$ \boldsymbol{y} = \boldsymbol{\Phi}\boldsymbol{x}. $$

((1))

To omit pixels, each row of the sensing matrix contains a single 1 and each column contains at most one; another way of stating this is that the rows are randomly selected from the identity matrix without replacement. The representation in Fig. 3 includes zero rows for illustrative purposes, but the sensing matrix does not have those zero rows. Because the sensing matrix is missing rows, it is short and wide, that is, \(\boldsymbol {\Phi } \in \mathbb {R}^{Q\times P}, Q\ll P\), where *Q* is the dimension of the compressed measurement, \(\boldsymbol {y}\in \mathbb {R}^{Q}\), and *P* is the dimension of the signal \(\boldsymbol {x}\in \mathbb {R}^{P}\). The inverse problem of recovering *x* from *y* is underdetermined, so further assumptions must be imposed to guarantee a solution.
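A pixel-omitting sensing matrix of this kind can be sketched in a few lines. The function names, the toy 4×4 image, and the choice of *Q* = 6 are assumptions made for illustration only.

```python
import random

def make_sensing_matrix(q, p, rng):
    """Return a q x p matrix whose rows are distinct rows of the p x p identity."""
    kept = sorted(rng.sample(range(p), q))  # indices of the pixels that are measured
    phi = [[1.0 if col == pixel else 0.0 for col in range(p)] for pixel in kept]
    return phi, kept

def apply_matrix(phi, x):
    """Compute y = phi x for a list-of-lists matrix and a list vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in phi]

rng = random.Random(0)                 # fixed seed, for reproducibility only
x = [float(v) for v in range(10, 26)]  # a flattened 4x4 "image", so P = 16
phi, kept = make_sensing_matrix(q=6, p=len(x), rng=rng)
y = apply_matrix(phi, x)               # the Q = 6 measured pixel values
```

Multiplying by such a matrix simply copies the retained pixels into the measurement vector, so in practice the matrix never needs to be stored explicitly; a list of retained indices is enough.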

Equation (1) is somewhat deceiving in that it appears that a single signal is recovered from a single measurement. In fact, there is a set of measurements, \(\{\boldsymbol{y}_{1},\ldots,\boldsymbol{y}_{N}\}\), a set of sensing matrices, \(\{\boldsymbol{\Phi}_{1},\ldots,\boldsymbol{\Phi}_{N}\}\), and a set of signals \(\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{N}\}\),^{3} with the index *i* added to Eq. (1),

$$ \boldsymbol{y}_{i} = \boldsymbol{\Phi}_{i}\boldsymbol{x}_{i}. $$

((2))

In sensing problems where the signal is an image, the signals \(\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{N}\}\) are patches from the full image. Usually the patches are overlapping so that each pixel has a corresponding patch, except for the right and bottom regions of the image. Figure 4 is an illustration of the patches and how they overlap. The sensing matrices, measurements, and signals are all obtained by extracting patches from the corresponding full-size images. In the case of the signal, the CS algorithm will recover the patches \(\boldsymbol{x}_{i}\) and then the patches are put back together and the overlapping pixels are averaged.
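The patch extraction and the averaging of overlapping pixels can be sketched as follows. This is a minimal illustration with assumed helper names and a toy 4×5 image; the recovery step is replaced by the identity, so reassembly should reproduce the input.

```python
def extract_patches(image, size):
    """Return (top, left, patch) for every fully contained size x size patch."""
    rows, cols = len(image), len(image[0])
    patches = []
    for top in range(rows - size + 1):
        for left in range(cols - size + 1):
            patch = [row[left:left + size] for row in image[top:top + size]]
            patches.append((top, left, patch))
    return patches

def reassemble(patches, rows, cols):
    """Put patches back in place and average each pixel over the patches covering it."""
    total = [[0.0] * cols for _ in range(rows)]
    count = [[0] * cols for _ in range(rows)]
    for top, left, patch in patches:
        for r, patch_row in enumerate(patch):
            for c, value in enumerate(patch_row):
                total[top + r][left + c] += value
                count[top + r][left + c] += 1
    return [[total[r][c] / count[r][c] for c in range(cols)] for r in range(rows)]

image = [[float(r * 5 + c) for c in range(5)] for r in range(4)]  # toy 4x5 image
patches = extract_patches(image, size=3)
restored = reassemble(patches, rows=4, cols=5)
```

In an actual reconstruction each patch would be replaced by its recovered estimate before `reassemble` is called, and the averaging then smooths disagreements between neighbouring patches.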

#### Dictionary learning and sparse-CS

Dictionaries are another choice for the basis, but dictionaries do not have an analytical form like the Fourier or wavelet bases. Dictionary learning is a method to discover a frame^{4} from the data, which is referred to as the dictionary. The learned dictionary allows every patch to be represented by a weighted sum of a few^{5} dictionary *elements* or vectors (assuming overcompleteness). Because the overcomplete dictionary model enforces the use of only a few basis patches, the data is sparse under the dictionary. This approach is advantageous because the learned dictionary can guarantee a sparse representation, whereas choosing a Fourier basis, for example, does not guarantee sparsity. Two learned dictionaries are depicted in Fig. 5.

The first algorithm for dictionary learning was based on human vision [26]. More recently, a much faster variant was proposed, the *K*-SVD algorithm [27], and Mairal et al. have further improved the *K*-SVD-based approach and given a thorough review of dictionary learning [16]. Another approach, a part of the approach in this paper, is beta-process factor analysis (BPFA) [25]. BPFA has been used in compressive sensing of STEM images [13]. The relationship between optimization/maximum likelihood (*K*-SVD) approaches and Bayesian/sampling (BPFA) approaches is discussed after the details of the BPFA model are introduced.

Another approach that has been applied in image restoration tasks, and specifically to STEM image restoration, is the non-local means algorithm [16, 28]. Non-local means uses all of the image patches simultaneously to find a reweighting of the central pixel of each patch. Sparse representation, on the other hand, finds a subset of elements from a dictionary and the corresponding weights to reconstruct an entire patch (dictionary learning simultaneously finds a dictionary). Non-local means is a kernel density estimation method, and when employing the Gaussian kernel, it is closely related to the GMM, which will be explained in detail.

One of the approaches to guarantee that the solution of the underdetermined system of Eq. (2) is the desired solution is to assume there is a sparse representation under some basis/frame (e.g., Fourier, wavelets, or a learned dictionary). This means that

$$\begin{array}{*{20}l} \boldsymbol{x}_{i} &= \boldsymbol{D}\boldsymbol{w}_{i}, \end{array} $$

((3))

$$\begin{array}{*{20}l} \boldsymbol{y}_{i} &= \boldsymbol{\Phi}_{i} \boldsymbol{D}\boldsymbol{w}_{i}, \end{array} $$

((4))

where the columns of \(\boldsymbol{D}=[\boldsymbol{d}_{1},\ldots,\boldsymbol{d}_{K}]\) are the dictionary elements. The number of non-zero elements in \(\boldsymbol{w}_{i}\) is much less than the size of the basis *K* (number of columns in *D*), \(\text{nnz}(\boldsymbol{w}_{i})\ll K\). The choice of basis is important since it should induce sparsity in the \(\boldsymbol{w}_{i}\). The issue of the CS inverse being underdetermined is alleviated by finding solutions \(\boldsymbol{w}_{i}\) that are also sparse. In practical applications, the noise \(\boldsymbol{\epsilon}_{i}\) must also be considered:

$$ \boldsymbol{y}_{i} = \boldsymbol{\Phi}_{i}(\boldsymbol{D}\boldsymbol{w}_{i} + \boldsymbol{\epsilon}_{i}). $$

((5))

In the Fourier example above, the signal is recoverable as long as the noise amplitude is not larger than the amplitude of the smallest signal component. The same idea holds for sparse CS.

There are a few applications of sparse-CS in electron microscopy. The first was using \(\ell_{1}\) and total variation (TV) regularization to simulate compressive sensing on STEM images and speculate about the application to STEM tomography [12]. It has also been shown that TV regularization is useful in electron tomography [29]. Tomography is closely related to CS, and even more so in electron tomography where it is common to have a missing *wedge* of data due to the inability to acquire all of the projections. More recently, BPFA has been applied to STEM compressive sensing [13], and an optimization approach is reported for compressed STEM imaging and tomography in [24].

#### Manifold-CS

A more recent approach in CS is to assume that the signal is a manifold embedded in a high-dimensional space [30]. Essentially, the intrinsic dimension of the data is smaller than the ambient dimension. Manifold-CS enjoys higher accuracy because the model is more flexible than sparse-CS [31] (sparse-CS is a special case of manifold-CS). A simple example of a manifold is a tube or a sheet through a three-dimensional space that is not self-intersecting. The concept of two-dimensional materials, such as graphene, is similar to the concept of a manifold in an *N*-dimensional space. Another example of a manifold is face images [32]. As the face image changes from happy to angry, as the lighting changes from light to dark, or as the face turns from right to left, the coordinates of the data move along constrained sections of the ambient space—the face manifold. This is not the same as moving along the principal dimensions defined by a principal components analysis (PCA). Manifold approaches learn local structures, whereas PCA-like methods learn global structures.

The concept of compactness from mathematical topology ([33], Chapter 3) states that a set, such as a manifold, can be covered by a finite number of open sets from the *N*-dimensional space.^{6} There is no specific structure required for the covering sets, so they can be assumed to be Gaussian, i.e., ellipsoids. Figure 6 shows the covering of a one-dimensional manifold (a curve) through a two-dimensional space. It can be seen that in order to use this approach the centers, orientations, and radii of the ellipsoids must be determined. Furthermore, any point on the manifold can be approximated arbitrarily well by this method simply by increasing the number of ellipsoids and also shrinking them to have a tighter fit. Statistically, having too many ellipsoids can cause undesirable overfitting effects, and mathematically, the number of ellipsoids (if it can be determined) is closely related to the manifold condition number.

The manifold-CS model described above is known in statistics as a mixture of factor analyzers (MFA). MFA combines the Gaussian mixture model (GMM) and factor analysis. In MFA, the GMM determines the number of ellipsoids and the factor analyzer determines the statistics of each ellipsoid (location, orientation, and radii). Connecting the pixel omission example in Fig. 3 to the MFA is the final piece in CS-MFA. Figure 7 illustrates the omission of dimensions of the measured data. The compressed data lies along the *x*- and *y*-axes. The CS inversion process—recovering the signal from compressed measurements—must take compressed measurements and map them back to the signal manifold. The model parameters learned by the MFA make this feasible by constraining the inversion procedure to the manifold.

One difficulty with the standard version of the GMM and factor analysis is that the number of clusters and dimension of the basis must be set *a priori*. Cross-validation can be employed to determine the parameter settings, but it requires splitting the data into several sections and learning the model on each section. Bayesian nonparametrics [34] offers a solution to this problem by including these parameters in the inference of the model. The rest of this section will describe the mathematical details of the GMM, factor analysis, their nonparametric extensions, the MFA, and a description of the hardware needed for a TEM to collect data that can be inverted by CS-MFA.

### Gaussian mixture model

The approach in this paper for manifold-CS is to model the manifold as an MFA. The mixture part of the MFA finds the number of ellipsoids needed to cover the manifold. The mixture part of MFA is based on the GMM, a model for clustering real-valued data. Figure 8 shows a set of two-dimensional data that was generated from a GMM. The primary goal in clustering is to determine which cluster each item belongs to, and once this has been determined, cluster statistics such as mean and variance can be determined. Meeting this primary goal is easily accomplished by methods such as *K*-means. But the GMM goes beyond the primary goal by also finding the uncertainty parameters in the cluster assignments. In Fig. 8, several points lie in the overlap of two ellipses; with *K*-means they would simply be assigned to the nearest ellipse. In some applications, it may be important to know how strongly the algorithm believes a data point belongs to a cluster; this information can be inferred with the GMM.
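The difference between a hard *K*-means style assignment and the soft assignment available from a GMM can be illustrated with a one-dimensional sketch. It assumes the mixture weights, means, and precisions are already known; the numerical values and helper names are made up for illustration.

```python
import math

def normal_pdf(x, mean, precision):
    """Density of a univariate normal parameterised by mean and precision."""
    return math.sqrt(precision / (2 * math.pi)) * math.exp(-0.5 * precision * (x - mean) ** 2)

def responsibilities(x, weights, means, precisions):
    """Posterior probability that x belongs to each cluster (soft assignment)."""
    joint = [w * normal_pdf(x, m, p) for w, m, p in zip(weights, means, precisions)]
    total = sum(joint)
    return [j / total for j in joint]

def nearest_cluster(x, means):
    """Hard assignment to the closest mean, as K-means would do."""
    return min(range(len(means)), key=lambda t: abs(x - means[t]))

# Two overlapping clusters (assumed parameters) and a point between them.
weights, means, precisions = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
resp = responsibilities(1.9, weights, means, precisions)
hard = nearest_cluster(1.9, means)
```

For the point at 1.9 the hard assignment picks the first cluster outright, whereas the responsibilities show that the second cluster remains almost as plausible.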

The GMM is defined by the following hierarchical Bayesian model.^{7} In the GMM, the probability of a data point given the means \(\mu_{1},\ldots,\mu_{T}\), precisions (inverse variances) \(\tau_{1},\ldots,\tau_{T}\), and cluster weights \(\lambda_{1},\ldots,\lambda_{T}\) is

$$ p(x_{i}| \lambda_{1},\ldots, \lambda_{T}, \mu_{1},\ldots,\mu_{T}, \tau_{1},\ldots,\tau_{T}) = \sum_{t=1}^{T} \lambda_{t}\mathcal{N}\left(x_{i}; \mu_{t}, \tau_{t}^{-1}\right), $$

((6))

where *T* is the number of clusters and *t* is a specific cluster number. This says that the data point could lie in any of the clusters, so the probability is the sum over the probability of \(x_{i}\) being in each cluster. The rest of the hierarchy is defined as

$$\begin{array}{*{20}l} x_{i}|t(i) &\sim \mathcal{N}\left(\mu_{t(i)}, \tau^{-1}_{t(i)}\right) \end{array} $$

((7))

$$\begin{array}{*{20}l} \mu_{t} &\sim \mathcal{N}\left(a,b^{-1}\right) \end{array} $$

((8))

$$\begin{array}{*{20}l} \tau_{t} &\sim \mathcal{G}(c,d) \end{array} $$

((9))

$$\begin{array}{*{20}l} \lambda_{1}, \ldots, \lambda_{T} &\sim \text{Dirichlet}\,(\alpha/T, \ldots, \alpha/T) \end{array} $$

((10))

$$\begin{array}{*{20}l} t(i) &\sim \text{Multinomial}\,(1; \lambda_{1}, \ldots, \lambda_{T}) \end{array} $$

((11))

where *t*(*i*) is the cluster number of the *i*th data point and \(\mathcal {G}(\cdot,\cdot)\) is the gamma distribution, the conjugate prior for the precision of a normal distribution. The weight \(\lambda_{t}\) determines the proportion of the data in cluster *t*. In Eq. (7), the cluster is known, so the probability is simply defined by the statistics of that cluster. The mean and precision of each cluster are given by Eqs. (8)–(9). The hyperparameters *a*,*b*,*c*,*d* are usually determined using the mean and precision of the entire data set. The cluster proportions are sampled jointly from a symmetric Dirichlet distribution in Eq. (10). The Dirichlet distribution is a multivariate extension of the beta distribution, where each \(\lambda_{t}\in[0,1]\) and \(\sum _{t=1}^{T} \lambda _{t} = 1\). The parameter *α*>0 determines the decay rate of \(\lambda_{1},\ldots,\lambda_{T}\) and will be discussed more below. Finally, the latent cluster assignments are drawn from a multinomial distribution based on the cluster proportions. The multinomial distribution is a generalization of the Bernoulli distribution; *n* trials (data points) are performed with a chance of success in exactly one of *k* different categories (clusters).
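The hierarchy in Eqs. (7)–(11) can be read as a recipe for generating data, which the following sketch follows step by step for one-dimensional points. It assumes that \(\mathcal{G}(c,d)\) denotes shape *c* and rate *d* (the parameterization is not stated above), and all hyperparameter values are chosen arbitrarily for illustration.

```python
import random

def sample_gmm(n, T, alpha, a, b, c, d, rng):
    """Draw n one-dimensional points from the finite GMM of Eqs. (7)-(11)."""
    # Eq. (10): a symmetric Dirichlet draw, built from normalised gamma variates.
    g = [rng.gammavariate(alpha / T, 1.0) for _ in range(T)]
    total = sum(g)
    lam = [v / total for v in g]
    # Eq. (8): cluster means; b is a precision, so the standard deviation is b**-0.5.
    mu = [rng.gauss(a, b ** -0.5) for _ in range(T)]
    # Eq. (9): cluster precisions; ASSUMES G(c, d) means shape c and rate d.
    tau = [rng.gammavariate(c, 1.0 / d) for _ in range(T)]
    # Eq. (11): one categorical draw per data point.
    t = rng.choices(range(T), weights=lam, k=n)
    # Eq. (7): each point is drawn from the cluster it was assigned to.
    x = [rng.gauss(mu[j], tau[j] ** -0.5) for j in t]
    return lam, mu, tau, t, x

rng = random.Random(1)  # fixed seed, for reproducibility only
lam, mu, tau, t, x = sample_gmm(n=200, T=4, alpha=4.0, a=0.0, b=0.01, c=2.0, d=1.0, rng=rng)
```

Inference runs this recipe in reverse: given only `x`, the sampler has to recover plausible values of `lam`, `mu`, `tau`, and `t`.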

A common method of inference in Bayesian modeling is Gibbs sampling, a Markov chain Monte Carlo (MCMC) method. In order to use Gibbs sampling, it must be possible to sample each model parameter from its conditional distribution given all the other parameters. Each parameter is sampled iteratively until the model *mixes*; a model has mixed when the predicted distribution reaches a steady state. The samples taken before the model mixes are called *burn-in* and are thrown away. Samples taken after the burn-in phase can be used to compute statistical approximations, which will be used later. For the cluster assignments, the probability of *t*(*i*) can be analytically averaged over all possible \(\lambda_{1},\ldots,\lambda_{T}\). This is done by integrating the product of the distributions in Eqs. (10)–(11) with respect to \(\lambda_{1},\ldots,\lambda_{T}\). The result is that the probability of a data item being assigned to a particular cluster grows with the number of data items already assigned to that cluster:

$$ p(t(i)=j|\boldsymbol{t}(-i), \alpha) = \frac{n_{-ij} + \alpha/T}{n - 1 + \alpha}, $$

((12))

where *t*(−*i*) is the list of all cluster assignments except the *i*th and \(n_{-ij}\) is the number of items in cluster *j*, excluding item *i*.
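Equation (12) is straightforward to evaluate directly. The sketch below uses a made-up list of current assignments and an assumed function name; it only computes the prior term of Eq. (12), not the full Gibbs update, which would also multiply in the likelihood of the data point under each cluster.

```python
def assignment_probabilities(assignments, i, T, alpha):
    """Eq. (12): p(t(i) = j | t(-i), alpha) for every cluster j = 0, ..., T-1."""
    n = len(assignments)
    counts = [0] * T
    for idx, cluster in enumerate(assignments):
        if idx != i:               # exclude item i itself
            counts[cluster] += 1
    return [(counts[j] + alpha / T) / (n - 1 + alpha) for j in range(T)]

assignments = [0, 0, 0, 1, 2, 0]   # hypothetical current labels for six items
probs = assignment_probabilities(assignments, i=5, T=4, alpha=1.0)
```

Cluster 0 already holds three of the other five items and therefore receives the largest probability, while the empty cluster 3 keeps a small but non-zero probability through the α/*T* term.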

Returning to the number of clusters, it was previously mentioned that it is possible to infer the number of clusters using Bayesian nonparametrics. For the GMM, the nonparametric model is known as the infinite GMM and is produced by modifying the Dirichlet distribution to be a Dirichlet process (DP). There are a few analogies for the DP that have been well circulated in the statistics literature, the Chinese restaurant process (CRP) and the stick breaking process (SBP). In this paper, the CRP and SBP, which are equivalent to the DP, will be introduced; theoretical details of DP mixture models can be found in [17, 35, 36].

In the CRP, customers will choose a certain table with probability

$$\begin{array}{@{}rcl@{}} p(\text{occupied table } t) = \frac{n_{t}}{n-1+\alpha},\\ p(\text{new table}) = \frac{\alpha}{n-1+\alpha}, \end{array} $$

((13))

where *n* is the current number of customers, \(n_{t}\) is the number of customers at table *t*, and *α* is the parameter related to the rate at which new tables are set up. To form a draw from a CRP, an infinite sequence of customers is seated sequentially, and after every customer has been seated the proportion of customers at each table determines \(\{\lambda _{t}\}_{t=1}^{\infty }\). The CRP representation clearly shows the influence of *α* on the thickness of the tail of the proportions; increasing *α* increases the tail thickness. This countably infinite set of proportions replaces the finite number of proportions in the GMM. Informally, if *T*→*∞* in Eq. (12), then limiting cases are given by Eq. (13). Once the proportions have decayed past a certain level, the remaining proportions are set to zero and the number of tables (clusters) can be determined. Figure 9 depicts the seating arrangement and assignment probabilities for a new customer after several customers have been seated.
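The seating rule of Eq. (13) can be simulated for a finite number of customers. This is a minimal sketch with an assumed function name; a finite run only approximates the proportions that the infinite process would produce.

```python
import random

def chinese_restaurant_process(n_customers, alpha, rng):
    """Seat customers one at a time following Eq. (13); return the table sizes."""
    tables = []  # tables[t] = number of customers currently at table t
    for _ in range(n_customers):
        # Unnormalised weights: n_t for each occupied table, alpha for a new one.
        # Both share the normaliser n - 1 + alpha, so it can be left implicit.
        weights = tables + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(tables):
            tables.append(1)       # open a new table
        else:
            tables[choice] += 1    # join an occupied table
    return tables

rng = random.Random(2)  # fixed seed, for reproducibility only
tables = chinese_restaurant_process(100, alpha=1.0, rng=rng)
proportions = [count / 100 for count in tables]
```

Running the simulation with a larger `alpha` tends to open more tables, which is the thicker tail described in the text.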

As previously mentioned, the primary function of the CRP is to draw an infinite set of random proportions. Another way to think of this is the SBP. In the SBP, a random proportion is drawn from Beta(1,*α*) and broken off a stick of unit length. Proportions are drawn from Beta(1,*α*) and broken from the remaining stick until the stick is gone (infinitely small). This approach achieves the same result as the CRP, but the SBP samples the proportions directly. Mathematically, the SBP is defined as

$$\begin{array}{*{20}l} \lambda_{t} &= v_{t}\prod_{j=1}^{t-1} (1-v_{j}) \end{array} $$

((14))

$$\begin{array}{*{20}l} v_{t} &\sim \text{Beta}(1,\alpha) \end{array} $$

((15))

and replaces Eq. (10) in the infinite GMM. As with the CRP, the SBP can be terminated when the proportions are sufficiently small. Figure 10 illustrates the stick breaking process.
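A truncated version of the stick breaking construction in Eqs. (14)–(15) can be sketched as follows. The stopping tolerance, the cap on the number of pieces, and the function name are assumptions of this example rather than part of the model.

```python
import random

def stick_breaking(alpha, rng, tol=1e-6, max_sticks=10_000):
    """Break a unit stick until the remaining length is below tol."""
    weights = []
    remaining = 1.0                       # length of stick not yet broken off
    while remaining > tol and len(weights) < max_sticks:
        v = rng.betavariate(1.0, alpha)   # Eq. (15)
        weights.append(v * remaining)     # Eq. (14): remaining = prod_j (1 - v_j)
        remaining *= 1.0 - v
    return weights, remaining

rng = random.Random(5)  # fixed seed, for reproducibility only
weights, remaining = stick_breaking(alpha=2.0, rng=rng)
```

The pieces and the leftover stick always add up to one, so truncating at a small `tol` discards a negligible amount of probability mass.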

### Factor analysis

In the MFA approach to manifold-CS, a factor analyzer is used to determine the statistics of each ellipsoid covering the manifold. Factor analysis is a statistical method for discovering a basis/frame for a dataset. The probabilistic PCA model [37], one of the most common types of factor analysis, is given in the following equations:

$$ \begin{aligned} \boldsymbol{x}_{i} &= \boldsymbol{D}\boldsymbol{w}_{i} + \boldsymbol{\mu} + \boldsymbol{\epsilon}_{i}\\ \boldsymbol{d}_{k} &\sim \mathcal{N}(0, P^{-1}\boldsymbol{I}_{P}) \\ \boldsymbol{\epsilon}_{i} &\sim \mathcal{N}(0,\gamma_{\epsilon}^{-1}\boldsymbol{I}_{P}) \end{aligned} $$

((16))

where \(\boldsymbol {D}=[\boldsymbol {d}_{1}|\ldots |\boldsymbol {d}_{K}]\in \mathbb {R}^{P\times K}\), \(\boldsymbol {\mu }\in \mathbb {R}^{P}\) is the mean offset, \(\boldsymbol {w}_{i}\in \mathbb {R}^{K}\) are Gaussian distributed weights, \(\boldsymbol{\epsilon}_{i}\) are Gaussian noise, and \(\boldsymbol{I}_{P}\) is the *P*×*P* identity matrix. In PCA, the data \(\{\boldsymbol {x}_{i}\}_{i=1}^{N}\) is used to discover the matrix *D* whose column vectors span the space of the data (up to noise) and \(\boldsymbol{w}_{i}\) are the transformed representations of \(\boldsymbol{x}_{i}\). The algorithm has two parameters that need to be set: *K*, the number of dictionary-elements/factors, and \(\gamma_{\epsilon}\), the noise precision (inverse variance). The noise precision can also be modeled by a gamma random variable, so that it can also be inferred. Because the \(\boldsymbol{d}_{k}\) are Gaussian, the space discovered is ellipsoidal. This can be seen through the following reparameterization:

$$ \boldsymbol{x}_{i} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{DD}^{\top} + \gamma_{\epsilon}^{-1}\boldsymbol{I}_{P}). $$

((17))

Using the singular value decomposition (SVD), \(\boldsymbol {DD}^{\top } = \sum _{k=1}^{K} \sigma _{k} \boldsymbol {v}_{k} \boldsymbol {v}_{k}^{\top }\), where the singular vectors \(\boldsymbol{v}_{k}\) are orthonormal and the singular values \(\sigma_{k}>0\). The singular values are the radii of a *K*-dimensional ellipsoid and the singular vectors determine the orientation of each dimension (assuming \(\gamma _{\epsilon }^{-1} < \sigma _{K}\)). Figure 11 illustrates the singular values and the mean. Note that probabilistic PCA is different from PCA, which is simply a projection onto the top *K* principal components (either via SVD of the data or eigen-decomposition of the data covariance matrix) [37].
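The step from Eq. (16) to Eq. (17) can be spelled out as a short derivation. It assumes, as is standard for probabilistic PCA though not stated explicitly above, that the weights are standard normal, \(\boldsymbol{w}_{i}\sim\mathcal{N}(0,\boldsymbol{I}_{K})\), and independent of the noise, and it treats \(\boldsymbol{D}\) and \(\boldsymbol{\mu}\) as fixed:

$$ \begin{aligned} \mathbb{E}[\boldsymbol{x}_{i}] &= \boldsymbol{D}\,\mathbb{E}[\boldsymbol{w}_{i}] + \boldsymbol{\mu} + \mathbb{E}[\boldsymbol{\epsilon}_{i}] = \boldsymbol{\mu},\\ \text{Cov}[\boldsymbol{x}_{i}] &= \boldsymbol{D}\,\mathbb{E}\left[\boldsymbol{w}_{i}\boldsymbol{w}_{i}^{\top}\right]\boldsymbol{D}^{\top} + \mathbb{E}\left[\boldsymbol{\epsilon}_{i}\boldsymbol{\epsilon}_{i}^{\top}\right] = \boldsymbol{DD}^{\top} + \gamma_{\epsilon}^{-1}\boldsymbol{I}_{P}. \end{aligned} $$

The cross terms vanish because the weights and the noise are independent and zero mean. Since \(\boldsymbol{x}_{i}\) is an affine function of jointly Gaussian variables, it is itself Gaussian with this mean and covariance.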

As with the GMM, it is desirable to infer the number of dictionary elements necessary for the data. The solution is again Bayesian nonparametrics. In factor analysis, the Beta-Bernoulli process (BeBP) is employed to infer the number of dictionary elements. The BeBP exhibits two additional features beyond the ability to infer the number of dictionary elements. First, the BeBP induces sparsity on the weights \(\boldsymbol{w}_{i}\), and second, it allows information to be shared across the weights during inference. The finite Beta-Bernoulli hierarchy is defined as follows:

$$ z_{ki} \sim \text{Bernoulli}(\pi_{k}), \quad \pi_{k} \sim \text{Beta}\left(\frac{a}{K}, b\frac{K-1}{K}\right), $$

((18))

where *K* is the number of dictionary elements and *a*,*b* are hyperparameters. For each \(\boldsymbol {x}_{i}\in \mathbb {R}^{P}\), the latent binary vector \(\boldsymbol {z}_{i}\in \{0,1\}^{K}\) encodes which dictionary elements are used by \(\boldsymbol{x}_{i}\). The proportion \(\pi_{k}\) is the sharing mechanism and encodes the average use of basis vector *k* across all of the selection vectors \(\boldsymbol{z}_{i}\).

The metaphor used to describe the BeBP is the Indian Buffet Process (IBP). In the IBP, customers (data points) enter the restaurant and choose dishes (dictionary elements) from the buffet. The first customer chooses Poisson(*a*) dishes. The *i*th customer samples each old dish with probability *#*(previous samples)/*i* and samples Poisson(*a*/*i*) new dishes. This is the single parameter IBP with *b*=1. Figure 12 illustrates the process. As the number of customers *i* tends to infinity, the number of new dishes tends to zero. In practice, the IBP is truncated to a number of dishes sufficiently large (i.e., large enough that some dishes are unused with high probability—this is data dependent) and any dishes that are unused can be removed from the representation. Details about the IBP and BeBP can be found in [17, 25].
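The buffet metaphor translates directly into a small simulation of the single parameter IBP (*b* = 1). The function names are assumptions of this sketch, and because the Python standard library has no Poisson sampler, Knuth's multiplication method is written out as a helper.

```python
import math
import random

def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def indian_buffet_process(n_customers, a, rng):
    """Return each customer's set of dishes and the popularity of every dish."""
    dish_counts = []  # dish_counts[k] = customers who have taken dish k so far
    choices = []      # choices[i] = set of dish indices chosen by customer i
    for i in range(1, n_customers + 1):
        # Old dishes are sampled with probability (previous samples) / i.
        taken = {k for k, m in enumerate(dish_counts) if rng.random() < m / i}
        # New dishes: Poisson(a / i) of them, appended to the buffet.
        n_new = sample_poisson(a / i, rng)
        taken.update(range(len(dish_counts), len(dish_counts) + n_new))
        dish_counts.extend([0] * n_new)
        for k in taken:
            dish_counts[k] += 1
        choices.append(taken)
    return choices, dish_counts

rng = random.Random(4)  # fixed seed, for reproducibility only
choices, dish_counts = indian_buffet_process(20, a=3.0, rng=rng)
```

Each customer's set of dishes plays the role of the binary selection vector for one data point, and `len(dish_counts)` is the number of dictionary elements in use after 20 customers.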

Combining the BeBP with factor analysis results in the following beta process factor analysis [25]:

$$\begin{array}{*{20}l} \boldsymbol{x}_{i} &= \boldsymbol{D}\boldsymbol{w}_{i} + \boldsymbol{\epsilon}_{i} \end{array} $$

((19))

$$\begin{array}{*{20}l} \boldsymbol{d}_{k} &\sim \mathcal{N}(0, P^{-1}\boldsymbol{I}_{P}) \end{array} $$

((20))

$$\begin{array}{*{20}l} \boldsymbol{\epsilon}_{i} &\sim \mathcal{N}(0, \gamma_{\epsilon}^{-1}\boldsymbol{I}_{P}) \end{array} $$

((21))

$$\begin{array}{*{20}l} \gamma_{\epsilon} &\sim \mathcal{G}(c,d) \end{array} $$

((22))

$$\begin{array}{*{20}l} \boldsymbol{w}_{i} &= \boldsymbol{s}_{i} \circ \boldsymbol{z}_{i} \end{array} $$

((23))

$$\begin{array}{*{20}l} \boldsymbol{s}_{i} &\sim \mathcal{N}(0, \gamma_{s}^{-1}\boldsymbol{I}_{K}) \end{array} $$

((24))

$$\begin{array}{*{20}l} \gamma_{s} &\sim \mathcal{G}(e,f) \end{array} $$

((25))

$$\begin{array}{*{20}l} \boldsymbol{z}_{i} &\sim \prod_{k=1}^{K} \text{Bernoulli}(\pi_{k}) \end{array} $$

((26))

$$\begin{array}{*{20}l} \boldsymbol{\pi} &\sim \prod_{k=1}^{K} \text{Beta}\left(\frac{a}{K}, b\frac{K-1}{K}\right), \end{array} $$

((27))

where Eqs. (23)–(25) have replaced the expression for \(\boldsymbol{w}_{i}\) in the PCA model, ∘ is the element-wise Hadamard product, and the product notation in Eqs. (26) and (27) denotes independent draws. The mean *μ* has been omitted in Eq. (19), since in the case of a single factor analyzer, the mean can simply be subtracted from the data as a pre-processing step. When implementing the algorithm, the hyper-parameters *a*,…,*f* are set to so-called non-informative values.
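Read as a generative recipe, the BPFA hierarchy can be sampled from directly. The sketch below is an illustration under simplifying assumptions: the precisions \(\gamma_{\epsilon}\) and \(\gamma_{s}\) are fixed by hand instead of being drawn from the gamma priors of Eqs. (22) and (25), and all numerical values and names are arbitrary.

```python
import random

def sample_bpfa(n, P, K, a, b, gamma_eps, gamma_s, rng):
    """Draw n signals from the BPFA prior, Eqs. (19)-(27), with fixed precisions."""
    # Eq. (20): dictionary stored as P rows of K entries (column k is d_k).
    D = [[rng.gauss(0.0, P ** -0.5) for _ in range(K)] for _ in range(P)]
    # Eq. (27): usage probability of each dictionary element.
    pi = [rng.betavariate(a / K, b * (K - 1) / K) for _ in range(K)]
    X, Z, W = [], [], []
    for _ in range(n):
        z = [1 if rng.random() < pi[k] else 0 for k in range(K)]       # Eq. (26)
        s = [rng.gauss(0.0, gamma_s ** -0.5) for _ in range(K)]        # Eq. (24)
        w = [sk * zk for sk, zk in zip(s, z)]                          # Eq. (23)
        noise = [rng.gauss(0.0, gamma_eps ** -0.5) for _ in range(P)]  # Eq. (21)
        # Eq. (19): x = D w + noise, computed one signal entry at a time.
        x = [sum(D[p][k] * w[k] for k in range(K)) + noise[p] for p in range(P)]
        X.append(x)
        Z.append(z)
        W.append(w)
    return D, pi, X, Z, W

rng = random.Random(3)  # fixed seed, for reproducibility only
D, pi, X, Z, W = sample_bpfa(n=50, P=16, K=32, a=4.0, b=1.0,
                             gamma_eps=100.0, gamma_s=1.0, rng=rng)
avg_active = sum(sum(z) for z in Z) / len(Z)  # mean number of elements used per signal
```

Because most of the \(\pi_{k}\) drawn from Beta(*a*/*K*, *b*(*K*−1)/*K*) are small when *K* is large, each signal typically uses only a few dictionary elements, which is the sparsity described in the text.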

To make the connection to optimization approaches (e.g., *K*-SVD), the negative log likelihood is

$${} \begin{aligned} -\log p&(\boldsymbol{D}, \boldsymbol{S}, \boldsymbol{Z}, \boldsymbol{\pi} | \boldsymbol{X}, a,b,c,d,e,f)\\ &= \frac{\gamma_{\epsilon}}{2}\sum_{i=1}^{N} \|\boldsymbol{x}_{i} - \boldsymbol{D}(\boldsymbol{s}_{i} \circ \boldsymbol{z}_{i})\|_{2}^{2} + \frac{P}{2}\sum_{k=1}^{K} \|\boldsymbol{d}_{k}\|_{2}^{2} \\&\quad+ \frac{\gamma_{s}}{2}\sum_{i=1}^{N} \|\boldsymbol{s}_{i}\|_{2}^{2}\\ &-\log f_{\text{Beta-Bern}}(\boldsymbol{Z};a,b) -\log \text{Gamma}(\gamma_{\epsilon}| c,d) \\&-\log \text{Gamma}(\gamma_{s}| e,f) + \text{Const}, \end{aligned} $$

((28))

which is minimized to find the latent parameters. The first term is the least square error between the inferred parameters and the data while the second and third terms are commonly used as smoothing regularizers. The fourth term is the sparsifying regularizer, similar to the \(\ell_{1}\) norm. The BPFA model is commonly implemented using Gibbs sampling or variational Bayesian methods [25, 30]. It must be emphasized that Eq. (28) is not used by sampling algorithms and cannot be optimized with traditional approaches. For more details about beta process dictionary learning including the application to three-dimensional data, see [38].

### Mixture of factor analyzers

The MFA is realized by combining the GMM and the factor analyzer. The MFA is used to find an ellipsoidal covering of the signal manifold. Equations (19) and (21) can be combined to create an equivalent representation (with the mean no longer omitted)

$$\boldsymbol{x}_{i} \sim \mathcal{N}\left(\boldsymbol{D}\boldsymbol{w}_{i} + \boldsymbol{\mu}, \gamma_{\epsilon}^{-1}\boldsymbol{I}_{P}\right). $$

((29))

The new representation in Eq. (29) is the same format as the GMM. Now, the mixture of factor analyzers [30, 39, 40] can be introduced:

$$\begin{array}{*{20}l} \boldsymbol{x}_{i} &\sim \mathcal{N}\left(\boldsymbol{D}_{t(i)}\boldsymbol{w}_{i} + \boldsymbol{\mu}_{t(i)}, \gamma_{\epsilon,t(i)}^{-1}\boldsymbol{I}_{P}\right) \end{array} $$

((30))

$$\begin{array}{*{20}l}[0.5em] \boldsymbol{D}_{t(i)} &= \tilde{\boldsymbol{D}}_{t(i)}\boldsymbol{\Sigma}_{t(i)} \end{array} $$

((31))

$$\begin{array}{*{20}l} \tilde{\boldsymbol{d}}^{(t)}_{k} &\sim \mathcal{N}\left(0, P^{-1}\boldsymbol{I}_{P}\right) \end{array} $$

((32))

$$\begin{array}{*{20}l} \sigma^{(t)}_{kk} &\sim \mathcal{N}\left(0, \tau_{tk}^{-1}\right) \end{array} $$

((33))

$$\begin{array}{*{20}l}[0.5em] t(i) &\sim \text{SBP}(\alpha) \end{array} $$

((34))

$$\begin{array}{*{20}l}[0.5em] \boldsymbol{w}_{i} &= \boldsymbol{s}_{i} \circ \boldsymbol{z}_{t(i)} \end{array} $$

((35))

$$\begin{array}{*{20}l} \boldsymbol{s}_{i} &\sim \mathcal{N}\left(0, \gamma_{s,t(i)}^{-1}\boldsymbol{I}_{K}\right) \end{array} $$

((36))

$$\begin{array}{*{20}l} \boldsymbol{z}_{t} &\sim \text{IBP}(a,b) \end{array} $$

((37))

$$\begin{array}{*{20}l}[0.5em] \boldsymbol{\mu}_{t} &\sim \mathcal{N}\left(\boldsymbol{\mu}, \tau_{0}^{-1}\boldsymbol{I}_{P}\right) \end{array} $$

((38))

where \(\gamma_{\epsilon,t},\gamma_{s,t},\tau_{tk},\tau_{0}\) all have gamma hyperpriors. Equation (30) says that data point *i* is in a cluster with statistics given by factor analyzer *t*(*i*). Equations (31)–(33) give a basis representation where \(\boldsymbol{\Sigma}_{t(i)}\) is a diagonal matrix similar to a singular value matrix that weights the contributions of each basis vector. If some of the (diagonal) elements of \(\boldsymbol{\Sigma}_{t}\) are small relative to the noise variance, then that component *t*(*i*) will be low rank.

The MFA is also a block-sparse model, as can be seen by concatenating all of the means and bases together:

$$ \boldsymbol{x} = \left[\boldsymbol{\mu}_{1},\boldsymbol{D}_{1}|\ldots| \boldsymbol{\mu}_{T},\boldsymbol{D}_{T}\right] \left[\begin{array}{c} \boldsymbol{w}_{1}\\ \vdots\\ \boldsymbol{w}_{T} \end{array}\right] $$

((39))

where only one of the vectors \(\boldsymbol{w}_{t}\) is non-zero. In this way, only a single block or group is active, which also makes the representation sparse. If there is only a single ellipsoid in the model, then the sparse-CS formulation is recovered as a special case.

In addition to having a block-sparse structure, the nonparametric MFA usually infers bases that are low-rank, *K*<*P*. Low-rank Gaussian bases correspond to localized tubular manifolds. In [30] the fact that the signal is 1-block sparse is used to prove the reconstruction guarantee. Theorems for the separability of the components and satisfaction of the restricted isometry property (RIP) can also be found in [30]. Essentially, the number of measurements should be greater than a constant times the largest rank among all of the \(\boldsymbol{D}_{t}\) plus the log of the number of components. The largest rank is the intrinsic manifold dimension, while the number of components *T* is related to the manifold condition number.

### CS-MFA

In order to use the MFA for CS inversion, the probability of the signal given the measurements, *p*(*x*|*y*), needs to be determined. This requires the posterior predictive probability *p*(*x*) and the probability of the measurements given the signal *p*(*y*|*x*). The posterior predictive distribution is the distribution of a new (predicted) data point, obtained by averaging the likelihood over the posterior:

$$ \begin{aligned} p(\boldsymbol{x}) &= \int_{\hat{\boldsymbol{w}}} p\left(\boldsymbol{x}|\hat{\boldsymbol{w}}\right) p\left(\hat{\boldsymbol{w}}|\{\boldsymbol{x}_{i}\}_{i=1}^{N}, \ldots\right) d\hat{\boldsymbol{w}}\\ &= \int_{\hat{\boldsymbol{w}}} \sum_{t=1}^{T} \lambda_{t} \mathcal{N}\left(\boldsymbol{x}; \tilde{\boldsymbol{D}}_{t}(\boldsymbol{\Sigma}_{t} \text{diag}(\boldsymbol{z}_{t}))\hat{\boldsymbol{w}} + \boldsymbol{\mu}_{t}, \gamma_{\epsilon,t}^{-1}\boldsymbol{I}_{P}\right)\\&\quad \mathcal{N}\left(\hat{\boldsymbol{w}};\boldsymbol{\xi}_{t}, \boldsymbol{\Lambda}_{t}\right) d\hat{\boldsymbol{w}}\\ &= \sum_{t=1}^{T} \lambda_{t} \mathcal{N}\left(\boldsymbol{x}; \boldsymbol{\chi}_{t}, \boldsymbol{\Omega}_{t}\right), \end{aligned} $$

((40))

where

$$\begin{array}{*{20}l} \boldsymbol{\chi}_{t} &=\tilde{\boldsymbol{D}}_{t}(\boldsymbol{\Sigma}_{t} \text{diag}(\boldsymbol{z}_{t}))\boldsymbol{\xi}_{t} + \boldsymbol{\mu}_{t} \end{array} $$

((41))

$$\begin{array}{*{20}l} \boldsymbol{\Omega}_{t} &= \tilde{\boldsymbol{D}}_{t}(\boldsymbol{\Sigma}_{t} \text{diag}(\boldsymbol{z}_{t})) \boldsymbol{\Lambda}_{t} (\text{diag}(\boldsymbol{z}_{t}) \boldsymbol{\Sigma}_{t})\tilde{\boldsymbol{D}}_{t}^{\top} + \gamma_{\epsilon,t}^{-1}\boldsymbol{I}_{P}. \end{array} $$

((42))

The prior predictive distribution is obtained when *ξ*_{t}=0 and *Λ*_{t}=*I*_{P}; however, this is usually inaccurate, so the posterior parameters are obtained by calculating the mean and covariance of the Gibbs samples. The bases \(\tilde {\boldsymbol {D}}_{t}\) are also taken as the mean of the Gibbs samples.
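As an illustration of Eqs. (41) and (42), the following is a minimal NumPy sketch of how the predictive mean and covariance of a single component could be assembled. It assumes that *Σ*_{t} is diagonal and stored as a vector, and that the posterior mean and covariance of the factor scores have already been estimated (for example from Gibbs samples). The function name `predictive_component` and its argument names are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def predictive_component(D, sigma, z, xi, Lam, mu, gamma_eps):
    """Predictive mean chi_t and covariance Omega_t of one mixture component.

    D         : (P, K) basis for component t
    sigma     : (K,)   diagonal of Sigma_t (assumed diagonal in this sketch)
    z         : (K,)   binary indicators selecting the active basis columns
    xi        : (K,)   mean of the factor scores
    Lam       : (K, K) covariance of the factor scores
    mu        : (P,)   component mean
    gamma_eps : float  noise precision
    """
    # D @ diag(sigma * z): broadcasting scales each column of D
    B = D * (sigma * z)
    chi = B @ xi + mu                                        # Eq. (41)
    Omega = B @ Lam @ B.T + np.eye(D.shape[0]) / gamma_eps   # Eq. (42)
    return chi, Omega
```

Calling the function with `xi = np.zeros(K)` and `Lam = np.eye(K)` reproduces the prior predictive case, in which the predictive mean reduces to the component mean.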

The probability of the measurements given the signal is also known

$$ p(\boldsymbol{y}|\boldsymbol{x}) = \mathcal{N}\left(\boldsymbol{y}; \boldsymbol{\Phi x}, \boldsymbol{R}^{-1}\right), $$

((43))

where *R* is the noise precision of the compressed noise *Φε*. By invoking Bayes’s rule, the order of the conditional probability can be switched and, after another reparameterization, the desired probability is again an MFA.

$$\begin{array}{*{20}l} p(\boldsymbol{x}|\boldsymbol{y}) &= \frac{p(\boldsymbol{x})p(\boldsymbol{y}|\boldsymbol{x})}{\int p(\boldsymbol{x})p(\boldsymbol{y}|\boldsymbol{x})d\boldsymbol{x}}\\ &= \sum_{t=1}^{T} \tilde{\lambda}_{t} \mathcal{N}\left(\boldsymbol{x}; \tilde{\boldsymbol{\chi}}_{t}, \tilde{\boldsymbol{\Omega}}_{t}\right), \end{array} $$

((44))

where

$$\begin{array}{*{20}l} \tilde{\lambda}_{t} &= \frac{\lambda_{t}\mathcal{N}\left(\boldsymbol{y}; \boldsymbol{\Phi\chi}_{t}, \boldsymbol{R}^{-1} + \boldsymbol{\Phi\Omega}_{t}\boldsymbol{\Phi}^{\top}\right)}{\sum_{l=1}^{T} \lambda_{l}\mathcal{N}\left(\boldsymbol{y}; \boldsymbol{\Phi\chi}_{l}, \boldsymbol{R}^{-1} + \boldsymbol{\Phi\Omega}_{l}\boldsymbol{\Phi}^{\top}\right)} \end{array} $$

((45))

$$\begin{array}{*{20}l} \tilde{\boldsymbol{\Omega}}_{t} &= \left(\boldsymbol{\Phi}^{\top} \boldsymbol{R\Phi} + \boldsymbol{\Omega}_{t}^{-1}\right)^{-1} \end{array} $$

((46))

$$\begin{array}{*{20}l} \tilde{\boldsymbol{\chi}}_{t} &= \tilde{\boldsymbol{\Omega}}_{t}\left(\boldsymbol{\Phi}^{\top} \boldsymbol{R y} + \boldsymbol{\Omega}_{t}^{-1}\boldsymbol{\chi}_{t}\right). \end{array} $$

((47))

The representation in Eq. (44) admits an analytic CS inversion procedure: once the model parameters are learned (either offline or online [22, 41]), new signals are recovered by matrix–vector operations.
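The analytic inversion can be sketched in a few lines of NumPy. The sketch below uses the standard Gaussian conditioning identities: for each component the posterior covariance is (*Φ*ᵀ*RΦ* + *Ω*_{t}⁻¹)⁻¹, the posterior mean is that covariance applied to *Φ*ᵀ*Ry* + *Ω*_{t}⁻¹*χ*_{t}, and the posterior weights follow Eq. (45). Taking the weighted average of the component means as the point estimate is an assumption of this sketch, and the names `log_gauss` and `cs_mfa_invert` are hypothetical. Explicit inverses are used for readability; a practical implementation would likely use factorizations instead.

```python
import numpy as np

def log_gauss(y, mean, cov):
    """Log density of a multivariate normal N(y; mean, cov)."""
    d = y - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + y.size * np.log(2 * np.pi))

def cs_mfa_invert(y, Phi, R, lam, chi, Omega):
    """Posterior mixture p(x|y) and a point estimate of x.

    y     : (M,)      compressed measurement
    Phi   : (M, P)    sensing matrix
    R     : (M, M)    noise precision
    lam   : (T,)      mixture weights
    chi   : (T, P)    predictive component means
    Omega : (T, P, P) predictive component covariances
    """
    chi = np.asarray(chi, dtype=float)
    Omega = np.asarray(Omega, dtype=float)
    T = len(lam)
    R_inv = np.linalg.inv(R)
    log_w = np.empty(T)
    chi_post = np.empty_like(chi)
    Omega_post = np.empty_like(Omega)
    for t in range(T):
        # Unnormalized log weight: lambda_t * N(y; Phi chi_t, R^-1 + Phi Omega_t Phi^T)
        log_w[t] = np.log(lam[t]) + log_gauss(
            y, Phi @ chi[t], R_inv + Phi @ Omega[t] @ Phi.T)
        Om_inv = np.linalg.inv(Omega[t])
        # Posterior covariance, then posterior mean, of component t
        Omega_post[t] = np.linalg.inv(Phi.T @ R @ Phi + Om_inv)
        chi_post[t] = Omega_post[t] @ (Phi.T @ R @ y + Om_inv @ chi[t])
    # Normalize the weights in the log domain for numerical stability
    w = np.exp(log_w - log_w.max())
    lam_post = w / w.sum()
    x_hat = lam_post @ chi_post  # posterior mean of the mixture (assumed estimator)
    return lam_post, chi_post, Omega_post, x_hat
```

All quantities that depend only on the model and on *Φ* (the inverses and the covariance products) could be precomputed once, so that each new measurement costs only matrix–vector products.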

### Description of CS-TEM hardware

The coding scheme, called pixel-wise flutter-shutter, blocks pixels on the camera while it is integrating. A single pixel of the measured image has the following representation:

$$\begin{array}{*{20}l} \boldsymbol{Y}_{ij} &= \left[\boldsymbol{A}_{ij1},\boldsymbol{A}_{ij2},\ldots, \boldsymbol{A}_{ijL}\right] \left[\begin{array}{c} \boldsymbol{X}_{ij1}\\ \boldsymbol{X}_{ij2}\\ \vdots\\ \boldsymbol{X}_{ijL} \end{array} \right] \end{array} $$

((48))

The *A*_{ijℓ} are binary indicators of whether pixel *ij* is blocked in compressed frame *ℓ*, and *X* is the image. This representation can be consolidated as

$$\begin{array}{*{20}l} \boldsymbol{Y}_{ij} &=\boldsymbol{\Phi}_{ij}\boldsymbol{x}_{ij}, \end{array} $$

((49))

and the complete *Φ* is built by combining each pixel mask into a block diagonal matrix

$$\begin{array}{*{20}l} \boldsymbol{\Phi} &= \text{diag}\left(\boldsymbol{\Phi}_{1,1}, \boldsymbol{\Phi}_{1,2}, \ldots, \boldsymbol{\Phi}_{N_{x},N_{y}}\right), \end{array} $$

((50))

where the image size is *N*_{x}×*N*_{y} pixels. As previously mentioned, the images are broken down into patches so the data points *x*_{i} in the MFA model are of size 4×4×*L*.
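The measurement model of Eqs. (48) to (50) can be illustrated with a small NumPy sketch. It assumes that the video and the codes are stored as arrays of shape (*N*_{x}, *N*_{y}, *L*), and that an entry of 1 in the code means the pixel is exposed in that frame while 0 means it is blocked. The function names are hypothetical. The dense block-diagonal matrix is built only to make the structure explicit and is practical for small patches, not for full frames.

```python
import numpy as np

def flutter_shutter_measure(X, A):
    """Integrate L coded frames into one image: Y_ij = sum_l A_ijl * X_ijl.

    X, A : (Nx, Ny, L) arrays; A is binary (1 = exposed, 0 = blocked).
    """
    return np.sum(A * X, axis=2)

def build_phi(A):
    """Dense block-diagonal sensing matrix with one 1 x L block per pixel."""
    Nx, Ny, L = A.shape
    n_pix = Nx * Ny
    rows = A.reshape(n_pix, L)          # one code row per pixel, row-major order
    Phi = np.zeros((n_pix, n_pix * L))
    for p in range(n_pix):
        Phi[p, p * L:(p + 1) * L] = rows[p]
    return Phi

# Small example: a 4 x 4 patch observed over L = 8 frames
rng = np.random.default_rng(0)
X = rng.random((4, 4, 8))
A = (rng.random((4, 4, 8)) < 0.5).astype(float)
Y = flutter_shutter_measure(X, A)
Phi = build_phi(A)
# The matrix form reproduces the pixel-wise sums when X is flattened row-major
assert np.allclose(Phi @ X.reshape(-1), Y.reshape(-1))
```

With this ordering, the vectorized signal stacks the *L* time samples of each pixel contiguously, which is what makes *Φ* block diagonal.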

In order to obtain compressed measurements suitable for CS-MFA, the coded aperture compressive temporal imaging (CACTI) approach described in [23, 42] is used. CACTI was developed for optical video CS. In the CACTI camera system, the signal passes through a coded aperture that changes at a faster rate than the camera obtains images. This causes multiple coded images to be integrated into a single image. The aperture is set on a piezoelectric stage. The stage moves along either the *x*- or *y*-axis according to a triangle wave. During an up-stroke, one set of coded images is integrated, and another set is integrated during the down-stroke. A function generator is used to drive the piezo stage and trigger the image capture on the camera at the troughs and peaks of the triangle wave. The same setup is possible in TEM. The major difficulty in moving this approach to TEM is designing an aperture to block electrons rather than photons. Figure 13 shows an illustration of the TEM-CACTI system.

The benefit of placing the mask on a moving stage is that moving the mask creates a new encoding—essentially a new mask. If the position of the mask is known, then the encoding is known. This overcomes a difficulty in CS of using a new mask for every measurement. The compression ratio is determined by the range of motion of the mask. Effectively, moving *n* pixels (mask feature size) will give a factor of *n* compression, or *n* frames from 1.
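The idea that a single translated mask yields a sequence of known codes can be sketched as follows. This is a simplified model that assumes the mask moves by exactly one pixel per coded frame along one axis and that the mask is larger than the sensor along that axis; the continuous triangle-wave motion of the real stage is not modeled. The function name `shifted_codes` is hypothetical.

```python
import numpy as np

def shifted_codes(mask, n_frames):
    """Codes produced by translating one mask by one pixel per frame.

    mask     : (H + n_frames - 1, W) binary array covering the full travel
    n_frames : number of coded frames integrated into one camera image
    Returns an array of shape (H, W, n_frames); slice s is the part of the
    mask that covers the sensor after a shift of s pixels.
    """
    H = mask.shape[0] - n_frames + 1
    return np.stack([mask[s:s + H, :] for s in range(n_frames)], axis=2)

# Example: a random mask passing roughly half of the pixels, shifted over 8 frames
rng = np.random.default_rng(0)
n = 8
mask = (rng.random((32 + n - 1, 32)) < 0.5).astype(int)
A = shifted_codes(mask, n)   # shape (32, 32, 8): n frames from 1 camera image
```

Because every code is a window of the same physical mask, knowing the stage position for each frame is sufficient to know the full encoding.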

Another difficulty—present in CS for TEM, but not in optical CS—is that the part of the mask blocking the signal must be supported by a material transparent to electrons. Example masks that allow approximately 50 % of electrons to pass are shown in Fig. 14. An issue that might be raised about this approach is that 50 % of the image is discarded. The intent of our approach, however, is to increase the acquisition rate. It has been shown that image data can be discarded and subsequently recovered, both generally [25] and in electron microscopy [13]. Moreover, it might be possible to place the aperture before the specimen, which would give a decrease in dose and an increase in acquisition rate.