PyRSB: Portable Performance on Multithreaded Sparse BLAS Operations

This article introduces PyRSB, a Python interface to the LIBRSB library. LIBRSB is a portable performance library offering so-called Sparse BLAS (Sparse Basic Linear Algebra Subprograms) operations for modern multicore CPUs. It is based on the Recursive Sparse Blocks (RSB) format, which is particularly well suited for matrices of large dimensions. PyRSB allows LIBRSB usage with an interface styled after that of SciPy's sparse matrix classes, and offers the extra benefit of exploiting multicore parallelism. This article introduces the concepts behind the RSB format and LIBRSB, and illustrates the usage of PyRSB. It concludes with a user-oriented overview of the speedup advantage of rsb_matrix over scipy.sparse.csr_matrix when running general sparse matrix-matrix multiplication on a modern shared-memory computer.


Introduction
Solving sparse linear systems is one of the most widespread problems in numerical scientific computing. The key to the timely solution of sparse linear systems by means of iterative methods resides in the fast multiplication of sparse matrices by dense matrices. More precisely, we mean the update C ← C + αAB (at the element level, C_{i,k} ← C_{i,k} + α A_{i,j} B_{j,k}), where B and C are dense rectangular matrices, A is a sparse rectangular matrix, and α is a scalar. If B and C are vectors (i.e. have one column only) we call this operation SpMV (short for Sparse Matrix-Vector product); otherwise, SpMM (short for Sparse Matrix-Matrix product).
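As a point of reference, the update can be written with dense NumPy arrays as follows (values are purely illustrative):

>>> import numpy
>>> A = numpy.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])  # a small matrix with explicit zeroes
>>> B = numpy.ones([3, 2])
>>> C = numpy.zeros([2, 2])
>>> alpha = 2.0
>>> C += alpha * (A @ B)   # the update C <- C + alpha*A*B, on dense operands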
PyRSB [PYRSB] is a package suited for problems where: i) much of the time is spent in SpMV or SpMM, ii) one wants to exploit multicore hardware, and iii) sparse matrices are large (i.e. occupy a significant fraction of a computer's memory).
The PyRSB interface is styled after that of the sparse matrix classes in SciPy [Virtanen20]. Unlike certain similarly scoped projects ([Abbasi18], [PyDataSparse]), PyRSB is restricted to two-dimensional matrices only.

Background: LIBRSB
LIBRSB [LIBRSB] is an LGPLv3-licensed library written primarily to speed up the solution of large sparse linear systems using iterative methods on shared-memory CPUs. It takes its name from the Recursive Sparse Blocks (RSB) data layout it uses. The RSB format is geared to execute multithreaded SpMV and SpMM as fast as possible. LIBRSB is not a solver library, but provides most of the functionality required to build one. It is usable via several languages: C, C++, Fortran, GNU Octave [SPARSERSB], and now Python, too. Bindings for the Julia language have been authored by D.C. Jones [RSB_JL].
It is available in pre-compiled form in popular GNU/Linux distributions like Ubuntu [UBUNTU], Debian [DEBIAN], and openSUSE [OPENSUSE]; this is the best way to have a LIBRSB installation to familiarize oneself with PyRSB. However, pre-compiled packages will likely miss compile-time optimizations. For this reason, the best performance will be obtained by building on the target computer. This can be achieved using one of the several source-based distributions offering LIBRSB, like Spack [SPACK], EasyBuild [EASYBUILD], or GUIX [GUIX]. LIBRSB has minimal dependencies, so even building by hand is trivial.
PyRSB [PYRSB] is a thin wrapper around LIBRSB based on Cython [Behnel11]. It aims at bringing native LIBRSB performance and most of its functionality to Python at minimal overhead.

Basic Sparse Matrix Formats
The explicit (dense) way to represent any numerical matrix is to list each of its numerical entries, whatever their value. This can be done in Python using e.g. scipy.matrix; see the example after this paragraph, showing a matrix with two rows and two columns, containing three nonzero elements and one zero element in the second row. Many scientific problems give rise to systems of linear equations with many (e.g. millions of) unknowns, but relatively few coefficients different from zero (e.g. <1%) in their matrix-form representation. It is usually the case that representing these zeroes in memory and processing them in linear algebraic operations does not impact the results, but takes compute time nevertheless. In these cases the matrix is usually referred to as sparse, and appropriate sparse data structures and algorithms are sought.
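A minimal dense example (values are illustrative; numpy.matrix is used here, of which scipy.matrix has historically been an alias):

>>> import numpy
>>> D = numpy.matrix([[1.0, 2.0], [0.0, 3.0]])  # three nonzeroes, one explicit zero in the second row
>>> D
matrix([[1., 2.],
        [0., 3.]])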
The most straightforward sparse data structure for a numeric matrix is one listing each of the non-zero elements, along with its coordinate location, by means of three arrays. This is called COO.
It is one of the classes in scipy.sparse; see the listing after this paragraph, whose output also illustrates conversion to dense. Even if yielding the same results, the algorithms beneath the dense and the sparse representations differ considerably. To carry out the C_{i,k} ← C_{i,k} + α A_{i,j} B_{j,k} updates, the scipy.sparse.coo_matrix implementation will get the matrix coefficients from the V array and their coordinates from the I and J arrays, and use those (notice the indirect access) to address the operands' elements.
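The following is a minimal sketch of that listing, using the same small matrix as in the dense example above:

>>> from scipy.sparse import coo_matrix
>>> V = [1.0, 2.0, 3.0]   # the non-zero values
>>> I = [0, 0, 1]         # their row indices
>>> J = [0, 1, 1]         # their column indices
>>> A = coo_matrix((V, (I, J)))
>>> A.todense()
matrix([[1., 2.],
        [0., 3.]])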
In contrast to that, a dense implementation like scipy.matrix does not use any index array: the location of each numerical value (including zeroes) is in direct relation with its row and column indices.
Beyond the V,I,J arrays, COO has no extra structure. COO serves well as an exchange format, and allows expressing many operations.
The second most straightforward format is CSR (Compressed Sparse Rows). In CSR, non-zero matrix elements and their column indices are laid consecutively row after row, in the respective arrays V and J. Differently than in COO, the row index information is compressed in a row pointers array P, dimensioned one plus the rows count. For each row index i, P[i] is the count of non-zero elements (nonzeroes) on the preceding rows; the count of nonzeroes at each row i is therefore P[i+1]-P[i]. CSR's P array allows direct access to each sparse row. This helps in expressing row-oriented operations. In the case of the SpMV operation, CSR encourages accumulation of partial results on a per-row basis.
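In SciPy's csr_matrix, the V, J and P arrays are exposed as the data, indices and indptr attributes; a short sketch with the same example matrix follows (the exact integer dtype of the index arrays may vary):

>>> from scipy.sparse import csr_matrix
>>> A = csr_matrix(([1.0, 2.0, 3.0], ([0, 0, 1], [0, 1, 1])))
>>> A.data     # V: the non-zero values, laid row after row
array([1., 2., 3.])
>>> A.indices  # J: their column indices
array([0, 1, 1], dtype=int32)
>>> A.indptr   # P: row pointers; row i holds indptr[i+1]-indptr[i] nonzeroes
array([0, 2, 3], dtype=int32)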
Notice that with COO, the index arrays' occupation is strictly proportional to the non-zeroes count of a matrix; with CSR, only the J indices array is. Consequently, a matrix with more nonzeroes than rows (as usual for most problems) will use less index space if represented by CSR. But for a particularly sparse block of such a matrix, that is not necessarily true. These considerations back the choice of using both COO and CSR within the RSB layout, described in the following section.

Fig. 1: Rendering of an RSB instance of the classical matrix bayer02 (sized 14k × 14k with 64k nonzeroes, from the SuiteSparse Matrix Collection [SSMC]); each sparse block is labeled with its own format (the 'H' prefix indicating use of a shorter integer type); each block's effectively non-empty rectangle is shown, in colour; greener blocks have fewer nonzeroes than average, rosier ones have more. Blocks' row and column ranges are highlighted (respectively magenta and green) on the blocks' sides. Note that larger blocks (like "9/9") may have fewer nonzeroes than smaller ones (like "4/9").

Recursive Sparse Blocks in a Nutshell
The Recursive Sparse Blocks (RSB) format in LIBRSB [Martone14] represents sparse matrices by exploiting a hierarchical data structure. The matrix is recursively subdivided in halves until the individual submatrices (also: sparse blocks or simply blocks) occupy approximately the amount of memory contained in the CPU caches. Each submatrix is then assigned the most appropriate format: COO if very sparse, CSR otherwise. Any operation on an RSB matrix is effectively a polyalgorithm, i.e. each block's contribution will use an algorithm specific to its format, and the intermediate results will be combined. For a more detailed description, please consult [Martone14] and further references from there.
The above details are useful to understand, but are not necessary in order to use PyRSB. To create an rsb_matrix object one proceeds just as with e.g. coo_matrix; see the listing after this paragraph. Direct conversion from scipy.sparse classes is also supported. Instantiating an RSB structure is computationally more demanding than with COO or CSR (in both memory and time); exploiting multiple cores and the savings from faster SpMMs shall make the extra construction time negligible.
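A minimal construction sketch follows; it assumes the SciPy-style (V, (I, J)) constructor and the pyrsb module name (check the PyRSB documentation for the exact signatures):

>>> from pyrsb import rsb_matrix
>>> from scipy.sparse import csr_matrix
>>> V = [1.0, 2.0, 3.0]                  # same triplets as in the COO example above
>>> I = [0, 0, 1]
>>> J = [0, 1, 1]
>>> A = rsb_matrix((V, (I, J)))          # COO-style construction
>>> A_csr = csr_matrix((V, (I, J)))
>>> A2 = rsb_matrix(A_csr)               # direct conversion from a scipy.sparse class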

Multi-threaded Sparse Matrix-Vector Multiplication with RSB
The following sequence of pictures (Fig. 2) schematizes eight states of a two-threaded SpMV on an RSB matrix consisting of four (non-empty sparse) blocks. At any moment, up to two blocks are the object of a concurrent SpMV (active). Each active block has a gray background; its row and column ranges are highlighted. Left of the matrix, an (out-of-horizontal-scale) result vector is depicted; for each of the active blocks, the corresponding active range (corresponding to its rows) is highlighted on the vector. Similarly, right of the matrix, the (out-of-horizontal-scale) operand vector is shown; its active ranges (corresponding to each block's column range) are highlighted. The idea behind the algorithm is that a thread won't write to a portion of the result array which is currently being updated by another thread. Beyond that, there is no further synchronization of threads.

Fig. 2: Eight states of a two-threaded SpMV on an RSB matrix of four sparse blocks (caption fragment): 3) upper left block is done (not active anymore); 4) upper right block becomes active; 5) upper right block is done; 6) lower left block is done; 7) lower right block is now active; 8) lower right block is done.
This algorithm applies to square as well as non-square matrices. It supports transposed operation (in which case the ranges of each block are swapped). Symmetric operation is supported, too; in this case, an additional transposed contribution is considered for each block.
As depicted in the first RSB illustration (Fig. 1), the order of the sparse blocks in memory proceeds along a space-filling curve. That order of processing the individual blocks can help to deliver data from the memory to the cores faster. For this reason the individual cores attempt to follow that order whenever possible.
To have enough work for each thread, RSB arranges to have more blocks than threads. For this and other trade-offs involved, as well as for a formal description of the multiplication algorithm, see [Martone14] and further literature about RSB listed there.

Fig. 3: A matrix and its SpMM operands, in columns-major order. The matrix consists of four sparse blocks, one of which is highlighted. The left-hand side and right-hand side operands consist of two vectors each. These are stored one column after the other (memory follows the blue line). Consequently, the two column portions of the operands pertaining to a given sparse block are not contiguous.

The SpMV algorithm sketched above is what happens under the hood in PyRSB. In practice, rsb_matrix is used in SpMV just as with the scipy.sparse classes seen earlier:
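A minimal SpMV sketch, reusing the 2 × 2 rsb_matrix A built in the construction example above:

>>> from numpy import ones
>>> x = ones([2, 1], dtype=A.dtype)   # a single dense right-hand side
>>> y = A * x                         # multithreaded SpMV under the hood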

Multi-threaded Sparse Matrix-Matrix Multiplication with RSB
With multiple column operands (in jargon, multiple right hand sides), the operation result is equivalent to that of performing correspondingly many SpMVs.
In these cases it comes naturally to lay the columns one after the other (consecutively) in memory, and have the resulting rectangular dense matrix as an operand to the SpMM. Here too the same notation as in the previous section is supported; see this example with 2 right-hand sides:

>>> from numpy import ones
>>> B = ones([2, 2], dtype=A.dtype)
>>> C = A * B

Let's look at how to deal with this when using the RSB layout. As anticipated, the individual right-hand sides may lie one after the other, as columns of a rectangular dense matrix. See Fig. 3, where a broken line follows the two operands' layout in memory, also by columns.
A straightforward SpMM implementation may run two individual SpMVs over the entire matrix, one column at a time. That would have the entire matrix (with all its blocks) read once per column.
A first RSB-specific optimization would be to run all the per-column SpMVs at a block level. That is, given a block, repeat the SpMVs over all the corresponding column portions. This would increase the chance of reusing cached matrix elements as the operands are visited. This reuse mechanism is exploited by LIBRSB-1.2. The by-columns layout (or order) is the recommended one for SpMM there.
The most convenient thing, though, would be to read the entire matrix only once. That is the case for LIBRSB-1.3 (scheduled for release in summer 2021): for small column counts, block-level SpMM goes through all the columns while reading a block exactly once.

Fig. 4: A matrix and its SpMM operands, in rows-major order. The matrix consists of four sparse blocks, one of which is highlighted. The left-hand side and right-hand side operands consist of two vectors each, interspersed (memory follows the blue line). Consequently, the two column portions of the operands pertaining to a given sparse block are contiguous.
The aforementioned SpMM algorithm is to be regarded as LIBRSB-specific internals, with not much user-level control over it. There is, however, another factor that plays a certain role in the efficiency of SpMM and where the PyRSB user does have a choice: the layout of the SpMM operands.

SpMM with Different Operands Layouts
The by-columns layout described earlier and shown in Fig. 3 appears to be the most natural one if one thinks of the columns as laid out in successive arrays. However, one may instead opt for a by-rows layout, shown in Fig. 4.
A by-rows layout can be thought of as interspersing all the columns, one index at a time. In the figure, the blue line follows their order in memory. At SpMM time, given one of the input columns, an element at a given index is multiplied by the nonzeroes located at that column index. Similarly, given one of the output columns, an element at a given index receives a contribution from the nonzeroes located at that row coordinate. With a by-rows layout of the operands, SpMM may proceed by reading a nonzero once, reading all right-hand sides at that row index (they are adjacent), and then updating the corresponding left-hand sides' elements (which are also adjacent). On current cache- and register-based CPUs, the locality induced by this layout often leads to a slightly faster operation than with a by-columns layout.
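To make the two memory orders concrete, here is a small NumPy-only illustration (ravel(order='K') lists the elements in the order they are stored in memory):

>>> import numpy
>>> X = numpy.arange(6).reshape(3, 2)          # 3 row indices, 2 right-hand sides
>>> X.ravel(order='K')                         # 'C' (by-rows): the 2 entries per row index are adjacent
array([0, 1, 2, 3, 4, 5])
>>> numpy.asfortranarray(X).ravel(order='K')   # 'F' (by-columns): each column is stored contiguously
array([0, 2, 4, 1, 3, 5])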
The by-columns and by-rows layouts go by the respective names of Fortran ('F') and C ('C') order. A user can choose which dense layout to use when creating operands for SpMM. Their physical layouts differ, but NumPy makes their results interoperable; see the example after this paragraph. While both layouts are supported, the 'C' layout is the recommended one for SpMM operands when using PyRSB with LIBRSB-1.3. Also notice that SpMV is a special case of SpMM with one left-hand side and one right-hand side, so the two layouts are equivalent there. In the following, we will often refer to the right-hand sides count as NRHS.
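A minimal sketch of SpMM with the two operand layouts, reusing the small rsb_matrix A built earlier (the numerical results coincide regardless of the order chosen):

>>> from numpy import ones, allclose
>>> nrhs = 4
>>> Bc = ones([2, nrhs], dtype=A.dtype, order='C')  # by-rows ('C') operand layout
>>> Bf = ones([2, nrhs], dtype=A.dtype, order='F')  # by-columns ('F') operand layout
>>> allclose(A * Bc, A * Bf)
True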

Using PyRSB: Environment Setup and Autotuning
Usage of PyRSB requires no knowledge beyond its documentation. However, the underlying LIBRSB library can be configured in a variety of ways, and this affects PyRSB. To begin using PyRSB, a distribution-provided installation shall suffice. For the best performance results, a native LIBRSB build is recommended. The next sections comment on some basic facts useful to control LIBRSB and make the most out of PyRSB.

Environment Variables
PyRSB does not use any environment variable directly; it is affected by them via the underlying LIBRSB and Python. By default, LIBRSB is built with shared-memory parallelism enabled via OpenMP [OPENMP]. As a consequence, a few dozen OpenMP environment variables (all prefixed by OMP_) apply to LIBRSB as well. Of these, the most important is the one setting the active threads count: OMP_NUM_THREADS. Administrators of HPC (High Performance Computing) systems customarily set this variable to recommended values. Even if it is unset, chances are good the OpenMP runtime will guess a sensible value. Most other OpenMP variables will be of less use to PyRSB, except one: setting OMP_DISPLAY_ENV=TRUE will get the current defaults printed at program start (very useful when debugging a configuration).
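These variables are normally exported from the shell before launching Python; the hedged sketch below sets them from within Python instead, which in general only takes effect if done before the OpenMP runtime is first initialized (e.g. before importing pyrsb):

>>> import os
>>> os.environ["OMP_NUM_THREADS"] = "8"     # OpenMP (and thus LIBRSB) threads count
>>> os.environ["OMP_DISPLAY_ENV"] = "TRUE"  # print the effective OpenMP defaults at runtime start-up
>>> import pyrsb                            # import only after setting the variables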
In addition to the above, there are environment variables affecting specifically LIBRSB. All of these are prefixed by RSB_, so as to avoid any clash. One recommended to end users is RSB_USER_SET_MEM_HIERARCHY_INFO, which is used to override the cache hierarchy information detected at runtime or hardcoded at build time. Essentially, one can use it to force a finer or coarser blocking. For its usage, and for verification of further LIBRSB defaults, please see its documentation (accessible from [LIBRSB]). Modifying the variables mentioned in this section will be mostly useful on very new or not fully configured systems, or for tuning a bit over the defaults.

RSB Autotuning Procedure for SpMM
Core count, cache sizes, operands' data layout, and matrix structure all play a role in RSB performance. The default blocks layout chosen when assembling an RSB instance may not be the most efficient one for the particular SpMM to follow. In practice, given an RSB instance and an SpMM context (vector and scalar operands info, transposition parameter, run-time threads count), it may be the case that a better-performing layout can be found by exploring slightly coarser or finer blockings. An automated (autotuning) procedure for this exists and is accessible via autotune. The following example shows how to use it on matrix audikw_1 from [SSMC].
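A hedged sketch of such a call; it assumes a local copy of audikw_1.mtx downloaded from [SSMC], and uses the autotune keyword arguments mentioned later in the experiments section (nrhs, order); consult the PyRSB documentation for the exact signature:

>>> from scipy.io import mmread
>>> from pyrsb import rsb_matrix
>>> A = rsb_matrix(mmread("audikw_1.mtx").tocsr())  # load the matrix, then convert to RSB
>>> A.autotune(nrhs=8, order='C')                   # search for a better blocking for this SpMM context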
The reader impatient to see further speedup figures achievable by autotune can already peek at Fig. 10.

Experiments with SpMM and Autotuning
The purpose of this section is to present statistics of the speedups one may encounter by using PyRSB instead of SciPy CSR in practical usage. In our choice of experiments, and in the exposition, we favour breadth over depth. Thus, differently than in a paper with HPC in focus, we focus on the achievable speedup, not on absolute performance. We also take shortcuts which we would not take otherwise, like mixing statistics from single-precision computations with double-precision ones, or real-valued with complex-valued ones. Also, the very focus of the article, namely directly comparing threaded RSB to serial CSR in SciPy, would be ill-posed were we interested in comparing the parallelism grade of the two implementations. On the plots that follow, samples are grouped by matrix; for each one, a five-number summary (minimum, first quartile, median, third quartile, maximum) is drawn with a box-and-whiskers representation.

Experimental Setup
We use an AMD EPYC 7742 node with 64 cores. Scaling of memory bandwidth in STREAM-like loops here is around 10×. Considering we are dealing with memory-bound operations, we chose OMP_NUM_THREADS=24, OMP_PROC_BIND=spread, and OMP_PLACES=cores. RSB_USER_SET_MEM_HIERARCHY_INFO was set to "L2:4/64/16000K,L1:8/64/32K". We use CSR from csr_matrix in SciPy e171a1 from Feb 20, 2021, PyRSB 8a6d603 from Jun 08, 2021, and a pre-release LIBRSB-1.3. For both, we use the -Ofast -march=native -mtune=native flags and gcc version 10.2.1 20210110 (Debian 10.2.1-6). We use matrices which were also used in [Martone14], available from https://sparse.tamu.edu/ ([SSMC]); see the table below. Many of these are symmetric; differently than rsb_matrix, csr_matrix does not support symmetric SpMM; therefore in both cases we expand their symmetry and perform only unsymmetric (general) SpMM. Before starting any measurement, we run autotune once on a temporary matrix to warm up the OpenMP environment. Then we do one non-timed warm-up SpMM before iterating for 0.2s and taking the fastest sample. We repeat this for each of the 28 matrices, right-hand sides count (NRHS) in {1, 2, 4, 8}, order among {'C', 'F'}, and BLAS numerical types in {C, D, S, Z}. When using rsb_matrix, we measure both non-autotuned and autotuned with autotune(nrhs=...,order=...,tmax=0). The above totals 28 · 4 · 2 · 4 = 896 records, with samples of SpMM and tuning timings. To avoid also timing the repeated allocation of the SpMM result (C in C=A * B), we allocate it once, and then, instead of the * operator, we use the functions underneath it, which take C as an argument (this can be of interest to many performance-conscious users).

SpMM Speedup: from csr_matrix to rsb_matrix
Figure 8 summarizes the speed ratio of non-autotuned rsb_matrix over csr_matrix. Speedup without RSB autotuning ranges from 4× to 64×, with median 15×. Half of the observed speedup cases fall between 11× and 20×. A streaming memory access benchmark we ran on this machine scaled up to circa 10×, which is just less than the observed median speedup (remember rsb_matrix is running with multiple cores, but csr_matrix cannot exploit that).
For the reader who is not familiar with SpMM performance: the memory access pattern of SpMM is typically very irregular, and largely dependent on the sparsity structure of the matrix. For this reason, for most layouts the multicore scaling of SpMM performance (in particular of SpMV) tends to be worse than streaming memory access scaling. But here we are comparing speed ratios of different algorithms, and these ratios differ as well. That reflects the better or worse aptness of a given format to a given matrix. For instance, matrix 17 has nonzeroes scattered quite regularly over the entire matrix, not much clustered: this favours RSB and the cache blocking induced by its structure rather than CSR (serial or not). Conversely, matrix 9 has most of its nonzeroes adjacent to some others, which is more CSR-friendly and contributes to the lesser improvement when switching to RSB here. See [Martone14] for more RSB-vs-CSR commentary. The speedups shown so far, i.e. those in Fig. 8, rely on default RSB layouts. As said earlier, the RSB format is best suited to scenarios with large matrices and repeated SpMM applications. These are also the scenarios where the usage of autotune, which refines the default layout according to the operands at hand, is most convenient. Figure 9 shows results with autotuned instances. Here autotune has been called for each combination of matrix, operands layout, NRHS, and numerical type. The median speedup over CSR here (circa 28.8×) is almost twice the one before autotuning.
With respect to non-autotuned RSB samples, the application of autotune brought a median improvement of 1.6×. This includes all samples, inclusive of the lower quartile with speedups between 1× (no speedup) and 1.2×, which we nevertheless regard as ineffective (see the next subsection's discussion). An overview of which matrices benefited more, and which less, from autotuning is given by Fig. 10. There is no clear trend to see here. We observe that most of the cases (70%) benefited from autotuning. It is worth mentioning that the longer the time limit chosen to run SpMM before taking each performance sample, the less fluctuation we would have encountered here; the times we chose were quite tight.
Speedups of tuned RSB vs CSR have median 29× with the 'C' layout, and 28.6× with 'F' layout; also within RSB the 'C' layout performs a few percentage points better than 'F'.
As seen in this section, autotuning can speed up RSB a further bit, but not always. The next section quantifies the cost of autotuning.

The Cost of Autotuning
For the purpose of this article, we chose to declare autotuning effective if it brings a speedup of 20% or more. With this threshold set, while 94.5% of the cases get some speedup, only 70% also qualify as effective.
What one observes among effectively autotuned cases (see Fig. 11) is that in 75% of those cases, merely 2.5 CSR iterations are enough to amortize the autotuning time. This is thanks to the large speedup going from (serial) CSR to (parallel) RSB.
If instead we measure the amortization in terms of non-autotuned RSB iterations (rather than serial CSR iterations), then the relative gain is smaller (because threaded non-autotuned RSB is already much faster than serial CSR), and consequently it takes more iterations to amortize the autotuning; see Fig. 12.
When autotuning was ineffective (30% of the cases with our 1.2× threshold, though only 5.5% exhibit no speedup at all), we regard its time as lost; in our test setup this was from a few dozen to a few hundred RSB iterations, with median 33; see Fig. 13. If expressed in terms of serial CSR iterations, these would be < 2.8 iterations in half of the cases, < 8 in 75% of the cases.
These results shall convince users that using autotune is a good option most of the time.

Conclusions and Future Work
Full utilization of the parallelism potential is important in achieving efficient operations on current CPUs. PyRSB does that by giving Python users transparent access to the shared-memory parallel performance library LIBRSB. Differently than the classes in current scipy.sparse, but with a very similar usage interface, PyRSB's rsb_matrix readily exploits shared-memory parallelism. This article's results section gave a wide sample of speedup statistics with respect to SciPy's csr_matrix on the SpMM operation. The observed median speedup with respect to csr_matrix exceeded the known memory bandwidth speedup on the machine; with autotuning, it doubled that, speaking for the good implementation in LIBRSB. Trade-off considerations in using PyRSB effectively by means of autotuning have also been delineated.
SpMM and autotuning are the workhorses of PyRSB, and we addressed their use here. Follow-up studies may address or reflect improvements on the LIBRSB side, special use cases, as well as mostly usability-related aspects on the PyRSB side, especially in striving for SciPy interoperability in the user interface. Comparing symmetric SpMM of PyRSB to that of specific symmetric formats in SciPy may also be of interest.

Fig. 12: If one were to start autotuning from RSB (thus with less improvement potential than with CSR), the amortization costs more iterations (here, the median is 38.4×, with 75% of the cases below 76×). Nevertheless, for many problems, where thousands of iterations are foreseen, this is perfectly acceptable.

Fig. 13: There is no guarantee that autotuning improves SpMM performance. Indeed, autotuning would be unnecessary if we were able to guess blockings optimal under all circumstances; without further analysis, one may even speculate that for the matrices where autotuning was ineffective, the default RSB blocking was already the best. In our experiment, ineffective autotuning searches cost 33× RSB (only 2.8× CSR) SpMM iterations in the median case. Note that for certain matrices (1, 16, 21) autotuning was always effective: this is why these have no associated box here.