### The multislice and Bloch wave methods

We briefly outline here the required steps of previously published TEM simulation methods, and refer readers to Kirkland for more information on these methods [20]. We describe only the scattering of the electron beam while passing through a sample; probe-forming optics and the microscope transfer function mathematics are described in many other works. All elastic scattering TEM simulations aim to describe how an electron wavefunction \(\psi (\vec {r})\) evolves over the 3D coordinates \(\vec {r} = (x,y,z)\). The evolution of the slowly varying portion of the wavefunction along the optical axis *z* can be described by the Schrödinger equation for fast electrons [20]

$$\begin{aligned} \frac{\partial \psi (\vec {r})}{\partial z} = \frac{{i} \,\lambda }{4 \pi } {\nabla _{xy}}^2 \psi (\vec {r}) + {i} \,\sigma V(\vec {r}) \psi (\vec {r}), \end{aligned}$$

(1)

where *λ* is the relativistic electron wavelength, \({\nabla _{xy}}^2\) is the 2D Laplacian operator, *σ* is the relativistic beam-sample interaction constant, and \(V (\vec {r})\) is the electrostatic potential of the sample.
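As a rough numerical illustration of the two constants in Eq. 1, the sketch below evaluates the standard relativistic expressions for *λ* and *σ*. The function names, the choice of ångström and volt units, and the rounded physical constants are assumptions made for this example and do not come from the text above.

```python
import math

# Rounded physical constants (assumed values, for illustration only)
HC_KEV_ANGSTROM = 12.3984   # Planck constant times speed of light, in keV * Angstrom
M0C2_KEV = 510.999          # electron rest energy, in keV


def electron_wavelength(voltage_kv):
    """Relativistic electron wavelength in Angstrom for an accelerating voltage in kV."""
    e = voltage_kv  # kinetic energy in keV
    return HC_KEV_ANGSTROM / math.sqrt(e * (2.0 * M0C2_KEV + e))


def interaction_constant(voltage_kv):
    """Relativistic interaction constant sigma in rad / (V * Angstrom)."""
    e = voltage_kv
    lam = electron_wavelength(voltage_kv)
    relativistic_factor = (M0C2_KEV + e) / (2.0 * M0C2_KEV + e)
    # e * 1.0e3 converts the accelerating voltage from kV to V
    return 2.0 * math.pi / (lam * e * 1.0e3) * relativistic_factor


wavelength_80kv = electron_wavelength(80.0)
sigma_80kv = interaction_constant(80.0)
```

At 80 kV these expressions give a wavelength of roughly 0.0418 Å and an interaction constant of roughly 1.0 × 10⁻³ rad/(V·Å).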

The Bloch wave method uses a basis set that satisfies Eq. 1 everywhere inside the sample boundary, which is assumed to be periodic in all directions. This basis set is obtained from the eigendecomposition of a set of linear equations that approximate Eq. 1 up to some maximum scattering vector \(|q_{\mathrm{max}}|\). Then, for each required initial condition, such as different STEM probe positions on the sample surface, we compute the weighting coefficients for each element of the Bloch wave basis set. Finally, the exit wave after interaction with the sample is calculated by multiplying these coefficients by the basis set. This procedure can be written in terms of a scattering matrix **S** as [20]

$$\begin{aligned} \psi _f(\vec {r}) = \mathbf {S} \; \psi _0(\vec {r}), \end{aligned}$$

(2)

where \(\psi _0(\vec {r})\) and \(\psi _f(\vec {r})\) are the incident and exit wavefunctions, respectively. The Bloch wave method can be extremely efficient for very small simulations, where the field of view is on the scale of crystalline unit cells. High symmetry is also an asset for Bloch wave simulations, as it lets us limit the plane waves (beams) included in the basis set to a small number. However, for a large STEM simulation consisting of thousands or even millions of atoms, the **S**-matrix may contain billions or more entries, and calculating its eigendecomposition requires an impractical amount of time. Moreover, applying Eq. 2 many times for various electron probes could itself take a very long time. Thus, Bloch wave methods are typically only used for very small STEM simulations.

The most commonly employed method for large STEM simulations is the multislice algorithm. The multislice method alternates between solving the two terms on the right-hand side of Eq. 1, for thin slices of thickness *t* taken from the sample. The first term is interpreted as a Fresnel propagation operator, which can be efficiently applied in Fourier space as [20]

$$\begin{aligned} \Psi _{p+1}(\vec {q}) = \Psi _p(\vec {q}) \exp (- {i} \,\pi \lambda |\vec {q} \, |^2 t) \end{aligned}$$

(3)

where \(\Psi (q) = {\mathcal {F}}\{ \psi (r) \}\) is the Fourier transform of \(\psi (\vec {r})\), \(\vec {q}=(q_x,q_y)\) is the 2D coordinate vector for Fourier space, and the subscript *p* refers to the slice index. The second operator of Eq. 1 can be efficiently applied in real space as

$$\begin{aligned} \psi _{p+1}(\vec {r}) = \psi _p(\vec {r}) \exp \left[ {i} \,\sigma V_p^{\mathrm{2D}}(\vec {r}) \right] , \end{aligned}$$

(4)

where \(V_p^{\mathrm{2D}}(\vec {r})\) is the 2D electrostatic potential of all atoms inside slice *p*, integrated over the slice along the beam direction from the 3D potential. In practice, the atomic potentials are integrated into 2D potentials before the simulation, and then added directly to the slice potential, or applied using convolution [25]. These two steps describe how the electron wavefunction evolves slice-by-slice until it has interacted with the entire sample, applied sequentially as

$$\begin{aligned} \psi _{p+1}(\vec {r}) = {\mathcal {F}}^{-1} \left\{ {\mathcal {F}}\left\{ \psi _p(\vec {r}) e^{ {i} \,\sigma V_p^{\mathrm{2D}}(\vec {r}) } \right\} e^{- {i} \,\pi \lambda |\vec {q} \,|^2 t} \right\} , \end{aligned}$$

(5)

where \({\mathcal {F}}^{-1} \left\{ \right\}\) is the inverse Fourier transform. The multislice method is simple to implement and very accurate, but is not very efficient for large-scale STEM simulation. The reason is that although the atomic potentials can be reused for different probe positions, the remainder of the calculation (using Eq. 5 to propagate each probe through the sample) must be run independently. While this problem is amenable to parallelization, none of the calculations are shared between different probe positions, or different probe parameters such as defocus, convergence angle, or probe tilt. In the next section, we will show how a STEM simulation can be reformulated into an **S**-matrix approach, where the computational load of applying Eq. 5 can be shared between different probe configurations.
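The slice-by-slice update of Eq. 5 can be sketched in a few lines of NumPy. This is a minimal illustration and not the implementation used in this study: the function name `multislice`, its argument names, and the use of ångström units are assumptions, every slice is given the same thickness, and the antialiasing aperture normally applied in practice is omitted.

```python
import numpy as np


def multislice(psi0, potentials, sigma, wavelength, slice_thickness, pixel_size):
    """Propagate a 2D wavefunction through a stack of projected potentials (Eq. 5).

    psi0        : complex array (ny, nx), incident wavefunction
    potentials  : real array (n_slices, ny, nx), projected potential of each slice
    """
    ny, nx = psi0.shape
    # Spatial frequencies in the layout used by np.fft.fft2
    qy = np.fft.fftfreq(ny, d=pixel_size)
    qx = np.fft.fftfreq(nx, d=pixel_size)
    q2 = qy[:, None] ** 2 + qx[None, :] ** 2
    # Fresnel propagator of Eq. 3, identical for every slice here
    propagator = np.exp(-1j * np.pi * wavelength * q2 * slice_thickness)

    psi = psi0.astype(complex)
    for v2d in potentials:
        psi = psi * np.exp(1j * sigma * v2d)                 # transmission, Eq. 4
        psi = np.fft.ifft2(np.fft.fft2(psi) * propagator)    # propagation, Eq. 3
    return psi
```

Because both operators are pure phase factors, the total intensity of the wavefunction is conserved from slice to slice, which is a convenient sanity check for any implementation.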

### The PRISM algorithm for STEM simulations

The first step of the method proposed here is to separate all atomic coordinates of the simulation cell (which is assumed to be orthorhombic here) into slices, as shown in Fig. 1a. These slices can have unequal thickness to better match the atomic coordinates, but should not have thicknesses larger than the average atomic spacing as this could cause errors [20]. The second step is to calculate the 2D projected potentials \(V(\vec {r})\) for all slices, as shown in Fig. 1b.

Next, we choose an interpolation factor *f*. In practice, a different factor can be used in *x* and *y*, but for simplicity we will describe the simulation method for a square [in the (*x*, *y*) plane] simulation cell of size *d*. This factor *f* should be chosen to be large enough so that a square area with a side length of the simulation cell size divided by *f* can encompass all possible STEM probes after they pass through the cell. This can be estimated by numerically simulating a few probes using the conventional multislice method or the method described here. We then also choose a maximum incident probe semi-angle \(\alpha _{\mathrm{max}}\). Note that the simulation will include larger scattering angles than this value, and that this value should be equal to the largest desired probe semi-angle plus *f* times the Fourier space pixel size \(\Delta q\). We then determine a set of plane wave initial conditions to simulate using the multislice method, as shown in Fig. 1c. This set of plane waves corresponds to the incident electron probe

$$\begin{aligned} \Psi _{m,n}(\vec {q}) = \delta ( q_x - m f \Delta q, \; q_y - n f \Delta q ), \end{aligned}$$

(6)

where \(\sqrt{m^2+n^2} f \lambda \Delta q \le \alpha _{\mathrm{max}}\), \(\delta (\vec {q})\) is the delta function, and (*m*, *n*) are integers representing the plane wave index. Thus, we compute only a subset of all possible periodic plane waves for the simulation cell size, reducing the number of waves calculated by a factor of \(f^2\). These plane waves are stored in real space in a large array that we will refer to as the compact **S**-matrix, with the output plane waves defined as \(\mathbf {S}_{m,n}(\vec {r})\). These output wave dimensions can be reduced by a factor of 4, if the multislice simulation uses an antialiasing aperture positioned at half of the maximum scattering angle [20].
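To make the beam selection rule of Eq. 6 concrete, the following sketch enumerates the integer pairs (*m*, *n*) that satisfy the stated inequality. The function name `prism_beam_indices` and the example parameter values are hypothetical; only the selection condition itself is taken from the text.

```python
import math


def prism_beam_indices(f, delta_q, wavelength, alpha_max):
    """Return all (m, n) with sqrt(m^2 + n^2) * f * wavelength * delta_q <= alpha_max."""
    step = f * wavelength * delta_q            # angular spacing of retained beams (rad)
    m_max = int(alpha_max / step) + 1          # generous search bound; filtered below
    beams = []
    for m in range(-m_max, m_max + 1):
        for n in range(-m_max, m_max + 1):
            if math.hypot(m, n) * step <= alpha_max:
                beams.append((m, n))
    return beams


# Hypothetical example: 100 Angstrom cell (delta_q = 0.01 1/Angstrom), 20 mrad limit
beams_full = prism_beam_indices(1, 0.01, 0.0418, 0.02)
beams_f4 = prism_beam_indices(4, 0.01, 0.0418, 0.02)
```

Comparing `len(beams_full)` with `len(beams_f4)` shows the expected reduction by a factor close to \(f^2 = 16\); the ratio is not exact because the retained beams are counted inside a circular cutoff.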

Next, we calculate each converged electron probe at position \(\vec {r}_0 = (x_0,y_0)\) by first computing the required coefficients \(\alpha _{m,n}(\vec {r}_0)\) for each plane wave \(\mathbf {S}_{m,n}(\vec {r})\), and then multiplying these coefficients by the associated plane wave basis and summing over a square subregion with side length *d*/*f* centered around the probe. This is shown schematically in Fig. 1d. The subregion is bounded by

$$\begin{aligned}&x_0 - \frac{d}{2 f} \le x< x_0 + \frac{d}{2 f} \nonumber \\&y_0 - \frac{d}{2 f} \le y < y_0 + \frac{d}{2 f}, \end{aligned}$$

(7)

giving a cutout region having an area of \(d^2/f^2\), which should be periodically wrapped around the simulation cell boundaries. The wave coefficients are defined as

$$\begin{aligned} \alpha _{m,n}(\vec {r}_0) = A(\vec {q}) \exp \left[ - {i} \,\chi (\vec {q}) \right] \exp \left\{ - 2 {i} \,\pi \vec {q} \cdot \left[ x_0 - h \tan (\theta _x), \; y_0 - h \tan (\theta _y) \right] \right\} , \end{aligned}$$

(8)

where \(A(\vec {q})\) is the probe aperture function defined as

$$\begin{aligned} \begin{array}{llll} A(\vec {q}) = &{} 1 &{} \mathrm {where} &{} |\vec {q}| \le q_{\mathrm{probe}}\\ &{} 0 &{} \mathrm {elsewhere}. &{} \end{array} \end{aligned}$$

The probe can also contain coherent wave aberrations such as defocus \(C_1\) or 3rd order spherical aberration \(C_3\) described by the phase shift function [20]

$$\begin{aligned} \chi (\vec {q}) = \pi \lambda |\vec {q} \, |^2 C_1 + \frac{\pi }{2} \lambda ^3 |\vec {q} \,|^4 C_3 + \cdots \end{aligned}$$

(9)
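The aperture function and the phase shift of Eq. 9 are simple to evaluate numerically. The sketch below is one possible form, truncated after the \(C_3\) term; the function and argument names are assumptions, and all lengths are taken to be in the same unit (for example ångström).

```python
import math


def aperture(q_mag, q_probe):
    """Top-hat probe aperture A(q): 1 inside the probe cutoff, 0 elsewhere."""
    return 1.0 if q_mag <= q_probe else 0.0


def aberration_phase(q_mag, wavelength, c1=0.0, c3=0.0):
    """Phase shift chi(q) of Eq. 9, keeping only the defocus and C3 terms."""
    defocus_term = math.pi * wavelength * q_mag ** 2 * c1
    spherical_term = 0.5 * math.pi * wavelength ** 3 * q_mag ** 4 * c3
    return defocus_term + spherical_term
```

Higher-order aberrations would be added as further terms of the same form.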

Finally, the terms \(h \tan (\theta _x)\) and \(h \tan (\theta _y)\) shift the probe back to the center of a cutout region for a given simulation cell of height *h* and probe tilt angles \(\theta _x\) and \(\theta _y\). As shown in Fig. 1e, once the probe coefficients \(\alpha _{m,n}(\vec {r}_0)\) have been computed, the complex probe in real space \(\psi (\vec {r},\vec {r}_0)\) can be computed using the summation

$$\begin{aligned} \psi (\vec {r},\vec {r}_0) = \sum _{m,n} \mathbf {S}_{m,n}(\vec {r}) \; \alpha _{m,n}(\vec {r}_0), \end{aligned}$$

(10)

in the cutout region defined by Eq. 7. Note that this expression is simply an expanded form of Eq. 2. Equation 10 can be evaluated more quickly if we skip the addition of all terms where \(\alpha _{m,n}(\vec {r}_0)=0\). After the probe is computed, we can either output the full probe diffraction pattern, or more commonly integrate a subset of the probe intensity after taking its Fourier transform, as shown in Fig. 1f. Once the output signals of all probes have been tabulated, the simulation is complete. Our method is very similar to that proposed by Chen et al. [38]. However, where they include the tilts of the various beams in the propagation operator, we include them in the initial conditions of each beam, which removes the need for an offset term to relate the relative phases of the beams.
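The probe construction of Eqs. 7, 8 and 10 can be sketched as follows. This is an illustrative outline under stated assumptions rather than the code used in this study: the compact **S**-matrix is assumed to be stored as an array of shape (number of beams, *ny*, *nx*), beam wavevectors are stored as (\(q_x\), \(q_y\)) pairs, and all function and argument names are hypothetical.

```python
import numpy as np


def cutout_indices(center_pixel, n_pixels, f):
    """Pixel indices of the 1D window of Eq. 7, wrapped periodically around the cell."""
    width = n_pixels // f
    start = center_pixel - width // 2
    return [(start + k) % n_pixels for k in range(width)]


def probe_coefficients(q_beams, r0, wavelength, q_probe, c1=0.0, c3=0.0,
                       height=0.0, tilt=(0.0, 0.0)):
    """Coefficients alpha_{m,n}(r0) of Eq. 8 for beam wavevectors q_beams (n_beams, 2)."""
    q_mag = np.hypot(q_beams[:, 0], q_beams[:, 1])
    amp = (q_mag <= q_probe).astype(float)                    # aperture A(q)
    chi = (np.pi * wavelength * q_mag ** 2 * c1
           + 0.5 * np.pi * wavelength ** 3 * q_mag ** 4 * c3)  # Eq. 9
    shift = np.array([r0[0] - height * np.tan(tilt[0]),
                      r0[1] - height * np.tan(tilt[1])])
    return amp * np.exp(-1j * chi) * np.exp(-2j * np.pi * (q_beams @ shift))


def probe_from_s_matrix(s_matrix, alpha, rows, cols):
    """Evaluate the sum of Eq. 10 on the cutout given by pixel indices rows and cols."""
    psi = np.zeros((len(rows), len(cols)), dtype=complex)
    for s_beam, a in zip(s_matrix, alpha):
        if a != 0:                                   # skip beams outside the aperture
            psi += a * s_beam[np.ix_(rows, cols)]
    return psi
```

A simple check is to build the **S**-matrix for an empty cell, where each output wave is just the input plane wave \(\exp (2 \pi i \, \vec {q} \cdot \vec {r})\); the resulting probe intensity should then peak at the center of the cutout.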

### Simulation and analysis implementation

All simulations and analysis in this study were performed using custom Matlab code. The multislice methods and the atomic potentials employed were taken from Kirkland [20]. Thermal scattering effects were implemented using the frozen phonon approximation, which involves repeating the calculation with different phonon configurations (approximated with random atomic displacements) and summing the results incoherently.
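The incoherent summation used in the frozen phonon approximation can be illustrated with a short sketch. It assumes independent Gaussian displacements with a single standard deviation `sigma_u` for every atom and coordinate, and it takes a hypothetical callable `simulate_intensity` that returns a list of detector intensities for a given atomic configuration; neither detail is specified in the text.

```python
import random


def frozen_phonon_average(atoms, simulate_intensity, sigma_u, n_configs, seed=0):
    """Average simulated intensities over randomly displaced atomic configurations.

    atoms              : list of (x, y, z) coordinates
    simulate_intensity : callable mapping a list of atoms to a list of intensities
    """
    rng = random.Random(seed)
    total = None
    for _ in range(n_configs):
        # One frozen configuration: displace every coordinate independently
        displaced = [tuple(c + rng.gauss(0.0, sigma_u) for c in atom) for atom in atoms]
        intensity = simulate_intensity(displaced)
        # Sum intensities, not complex amplitudes, so the average is incoherent
        total = list(intensity) if total is None else [t + i for t, i in zip(total, intensity)]
    return [t / n_configs for t in total]
```

The key point is that intensities from the separate configurations are averaged only after the modulus squared of each wavefunction has been taken.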

An implementation of the PRISM algorithm for a sample consisting of a nanoparticle contained within a carbon nanotube is shown in Fig. 2a–f. Each of the panels in this figure corresponds to the same step as those given in Fig. 1a–f. In Fig. 2c, e and f, the wave phase is shown as the color hue, while the wave amplitude is shown by the brightness of each pixel. All simulations were performed using an 80 kV accelerating voltage, a slice thickness of 0.2 nm, a pixel size of 0.01 nm, and no spherical aberration in the electron probes.

### Calculation time for PRISM simulations

We will now approximate the computation time of the PRISM algorithm, relative to traditional multislice simulations. We will neglect the computation time of the sample projected potential slices, as this calculation time is equal for both methods. We will also not consider thermal scattering, since it increases the calculation time by an equal multiplier for both methods. For simplicity, we will assume a square simulation cell with a side length of *N* pixels, where *N* is a power of two. Each slice will require the transmission and propagation operations given in Eq. 5, which requires \(6 N \log _2(N)\) complex operations for the forward and inverse Fourier transforms and \(2 N^2\) operations to multiply the sample potential and the Fresnel propagation functions. If the entire STEM simulation consists of *P* unique probe positions and *H* slices through the sample, the total calculation time \(T_{\mathrm{multi}}\) required is

$$\begin{aligned} T_{\mathrm{multi}}=\,& H P \left[ 6 N \log _2(N) + 2 N^2 \right] \nonumber \\\approx\,&2 H P N^2, \end{aligned}$$

(11)

if the simulation cell is large, i.e., \(N \gg 1\). The PRISM method requires two parts to compute the scattering of all STEM probes. The first half of the algorithm requires \(B/f^2\) multislice simulations, where *B* is the number of beams included in the full-resolution simulation, which is reduced by the interpolation factor squared. The second half is the multiplication of the compact scattering matrix **S** for all beams (multislice plane waves computed in the previous step), which is required for *P* total probes, as in Eq. 10. This multiplication step is only required for the reduced number of beams \(B/f^2\), and the cutout region defined by Eq. 7 will reduce the number of multiplication operations to \(N^2/(4 f^2)\) (note the extra factor of 1/4 is due to storing only the part of **S** inside the anti-aliasing aperture). Therefore, the total calculation time \(T_{\mathrm{PRISM}}\) required for PRISM is

$$\begin{aligned} T_{\mathrm{PRISM}}= &\,\frac{H B}{f^2} \left[ 6 N \log _2(N) + 2 N^2 \right] + \frac{P B N^2}{4 f^4} \nonumber \\\approx &\,B N^2 \left[ \frac{2 H}{f^2}+\frac{P}{4 f^4}\right]. \end{aligned}$$

(12)

Note that for a STEM probe, the probe amplitude coefficients beyond the probe semi-angle are zero and so the number of beams *B* used in practice is often much lower than the number of possible beams. The speedup offered by the PRISM algorithm is therefore approximately equal to the ratio of Eqs. 11 and 12 given by

$$\begin{aligned} \frac{T_{\mathrm{multi}}}{T_{\mathrm{PRISM}}} = \frac{8 H P f^4}{B (8 H f^2+P)}. \end{aligned}$$

(13)

If the rate-limiting computation step for the PRISM algorithm is multiplying out the compact **S**-matrix, the speedup ratio does not depend on the number of probe positions *P* and the speedup will vary with \(f^4\). In the multislice and PRISM simulations given in the first results section below, the values of the terms of Eq. 13 were *H* = 40, *B* = \(10^4\), and *P* = \(10^5\). Plugging these numbers into Eq. 13 gives a speedup factor \(T_{\mathrm{multi}} / T_{\mathrm{PRISM}}\) of approximately 0.5, 8, 110, and 1100 for *f* = 2, 4, 8, and 16, respectively.
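The arithmetic behind these speedup estimates can be reproduced directly from Eq. 13. The function name `prism_speedup` below is a hypothetical label for this short check; the parameter values are those quoted above.

```python
def prism_speedup(h, p, b, f):
    """Approximate speedup T_multi / T_PRISM from Eq. 13."""
    return 8.0 * h * p * f ** 4 / (b * (8.0 * h * f ** 2 + p))


# H = 40 slices, P = 1e5 probes, B = 1e4 beams, as quoted in the text
speedups = {f: prism_speedup(h=40, p=1e5, b=1e4, f=f) for f in (2, 4, 8, 16)}
```

Evaluating the dictionary gives values of about 0.51, 7.8, 109 and 1150, consistent with the rounded figures quoted above. For *f* = 2 the ratio is below one, so PRISM is expected to be slower than multislice for this combination of parameters.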