Search for Extraterrestrial Intelligence: GPU Accelerated TurboSETI

A common technique adopted by the Search for Extraterrestrial Intelligence (SETI) community is monitoring electromagnetic radiation for signs of extraterrestrial technosignatures using ground-based radio observatories. The analysis is performed with TurboSETI, a Python-based program that detects narrowband drifting signals in the recordings that could indicate a technosignature. The data stream generated by a telescope can easily reach a rate of terabits per second. Our goal was to improve processing speed by writing a GPU-accelerated backend alongside the original CPU-based implementation of the de-doppler algorithm, which integrates the power of drifting signals. We discuss how we ported a CPU-only program to leverage the parallel capabilities of a GPU using CuPy, Numba, and custom CUDA kernels. The accelerated backend reached a speed-up of an order of magnitude over the CPU implementation.


Introduction
The Search for Extraterrestrial Intelligence (SETI) is a broad term used to describe the effort of locating any scientific proof of past or present technology that originated beyond the bounds of Earth. SETI can be performed in many ways: either actively, by deploying orbiters and rovers around planets and moons within the solar system, or passively, by either searching for biosignatures in exoplanet atmospheres or "listening" to technologically capable extraterrestrial civilizations. One of the most common techniques adopted by the SETI community is monitoring electromagnetic radiation for narrowband signs of technosignatures using ground-based radio observatories. This search can be performed in multiple ways: with equipment built primarily for this task, like the Allen Telescope Array (California, USA), by renting observation time, or in the background while the primary user is conducting other observations. Other radio observatories useful for this search include the MeerKAT Telescope (Northern Cape, South Africa), the Green Bank Telescope (West Virginia, USA), and the Parkes Telescope (New South Wales, Australia).
The operation of a radio telescope is similar to that of an optical telescope. Instead of using optics to concentrate light onto an optical sensor, a radio telescope concentrates electromagnetic waves onto an antenna using a large reflective structure called a "dish" ([Reb82]). The interaction between the metallic antenna and the electromagnetic wave generates a faint electrical current. This signal is then quantized by an analog-to-digital converter as voltages and transmitted to processing logic that extracts useful information from it. The data stream generated by a radio telescope can easily reach a rate of terabits per second because of the very wide bandwidth of radio spectrum being recorded.
The current workflow used by Breakthrough Listen, the largest scientific research program aimed at finding evidence of extraterrestrial intelligence, consists of pre-processing the incoming data and storing it as frequency-time binary files ([LCS+19]) in persistent storage for later analysis. This post-analysis is performed with a Python-based program called TurboSETI ([ESF+17]), which detects narrowband signals that could be drifting in frequency owing to the relative radial velocity between the observer on Earth and the transmitter. The offline processing speed of TurboSETI is directly related to the scientific output of an observation. Each voltage file ingested by TurboSETI is often on the order of a few hundred gigabytes. To process data efficiently without Python overhead, the program uses NumPy for near machine-level performance. To measure a potential signal's drift rate, TurboSETI uses a de-doppler algorithm to align the frequency axis according to a pre-set drift rate. Another algorithm called "hitsearch" ([ESF+17]) is then used to identify any signal present in the recorded spectrum. These two algorithms are the most resource-hungry elements of the pipeline, consuming almost 90% of the running time.
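The idea behind the de-doppler step can be illustrated with a brute-force shift-and-sum over a frequency-time array. This is only a minimal sketch under stated assumptions: the function names `dedoppler_sum` and `best_drift`, the array layout (time along rows, frequency along columns), and the drift unit (frequency bins per time step) are hypothetical choices made for illustration, and the sketch is not the algorithm as implemented in TurboSETI.

```python
import numpy as np


def dedoppler_sum(spectrogram, drift_bins_per_step):
    """Sum power along a straight drifting track for every start bin.

    spectrogram: 2-D array of power, shape (n_time, n_freq).
    drift_bins_per_step: trial drift rate, in frequency bins per time step.
    Returns a 1-D array of length n_freq with the integrated power.
    """
    n_time, n_freq = spectrogram.shape
    total = np.zeros(n_freq, dtype=spectrogram.dtype)
    for t in range(n_time):
        shift = int(round(drift_bins_per_step * t))
        # Undo the drift for this time step so the track lines up in one bin.
        # np.roll wraps around at the band edges; real data would need
        # padding or masking there (assumption: edges are ignored here).
        total += np.roll(spectrogram[t], -shift)
    return total


def best_drift(spectrogram, trial_drifts):
    """Try each trial drift rate and keep the one with the highest peak."""
    best = (None, -1, -np.inf)
    for drift in trial_drifts:
        summed = dedoppler_sum(spectrogram, drift)
        peak_bin = int(summed.argmax())
        peak = float(summed[peak_bin])
        if peak > best[2]:
            best = (drift, peak_bin, peak)
    return best


# Synthetic example: a unit-power tone starting in bin 10 that drifts by
# two bins per time step across an otherwise empty spectrogram.
demo = np.zeros((16, 64), dtype=np.float32)
for step in range(16):
    demo[step, 10 + 2 * step] = 1.0

drift, start_bin, peak = best_drift(demo, [-3, -2, -1, 0, 1, 2, 3])
print(drift, start_bin, peak)
```

When the trial drift matches the true drift, all of the signal's power lands in a single output bin, which is what makes a weak drifting tone stand out against noise. Because each trial drift rate is independent of the others, this kind of search is a natural candidate for parallel execution.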

Approach
Multiple methods were used in this effort to write a GPU-accelerated backend and to optimize the CPU implementation of TurboSETI. This section describes the three main ones.

CuPy
The original implementation of TurboSETI depends heavily on NumPy ([HMvdW+20]) for data processing. To keep the number of modifications as low as possible, we implemented the GPU-accelerated backend using CuPy ([OUN+17]). This open-source library offers GPU acceleration backed by NVIDIA CUDA and AMD ROCm while exposing a NumPy-style API. This enabled us to reuse most of the code between the CPU-based and GPU-based implementations.
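One common way to share code between a NumPy and a CuPy backend is to route array operations through a single module handle. The following is a minimal sketch of that pattern, not TurboSETI's actual code: the names `xp`, `normalized_power`, and `to_host` are hypothetical, and the fallback to NumPy when no GPU is usable is an assumption made so the example runs anywhere.

```python
import numpy as np

try:
    import cupy as cp

    # Importing CuPy can succeed on a machine without a usable GPU,
    # so probe for a device before selecting the GPU backend.
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device found")
    xp = cp
except Exception:
    # Assumption for this sketch: silently fall back to the CPU backend.
    xp = np


def normalized_power(block, xp=xp):
    """Normalize a block to zero mean and unit standard deviation.

    The same lines run on the CPU or the GPU, depending on whether
    `xp` is the numpy or the cupy module.
    """
    data = xp.asarray(block, dtype=xp.float32)
    return (data - data.mean()) / data.std()


def to_host(array):
    """Return a NumPy array, copying from the GPU if necessary."""
    # CuPy arrays expose .get() to copy to host memory; NumPy arrays do not.
    return array.get() if hasattr(array, "get") else np.asarray(array)


result = to_host(normalized_power([1.0, 2.0, 3.0, 4.0]))
print(type(result).__name__, result.dtype)
```

Keeping the host/device transfer in one helper makes it easier to see where data crosses between CPU and GPU memory, which is often the most expensive part of a GPU port.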

Numba
Some computationally heavy methods of the original CPU-based implementation of TurboSETI were written in Cython. This approach has disadvantages: the developer has to be familiar with Cython syntax to alter the code, and the code requires additional logic to be compiled at installation time. Consequently, we replaced Cython with pure Python methods decorated with the Numba ([LPS15]) accelerator. By leveraging the Just-In-Time (JIT) compiler from the Low Level Virtual Machine (LLVM) project, Numba can compile Python code into machine code and apply Single Instruction/Multiple Data (SIMD) instructions to achieve near machine-level speeds.
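As an illustration of what such a decorated method can look like, the sketch below compiles a plain Python loop with Numba's `njit` decorator. The function `find_peak` is a hypothetical example rather than a routine taken from TurboSETI, and the no-op fallback decorator is an assumption added only so the example still runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, run the plain Python
    # version. Supports both the bare @njit and the @njit(...) forms.
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit
def find_peak(spectrum):
    """Return (index, value) of the largest element of a 1-D array.

    Written as an explicit loop: slow in interpreted Python, but the
    kind of code a JIT compiler can turn into a tight native loop.
    """
    best_index = 0
    best_value = spectrum[0]
    for i in range(1, spectrum.shape[0]):
        if spectrum[i] > best_value:
            best_value = spectrum[i]
            best_index = i
    return best_index, best_value


spectrum = np.array([0.1, 3.0, 0.5, 2.5], dtype=np.float32)
peak_index, peak_value = find_peak(spectrum)
print(peak_index, float(peak_value))
```

The function body is ordinary Python, so it can be read, edited, and tested without a separate build step. Compilation happens the first time the function is called with a given argument type.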

Single-Precision Floating-Point
The original implementation of the software handled the input data as double-precision floating-point numbers. This made every mathematical operation take significantly longer because of the extended precision. The ultimate precision of the output product is inherently limited by the precision of the original input data, which in most cases is represented by 8-bit signed integers. Therefore, adding a single-precision floating-point mode decreased the processing time without compromising the useful precision of the output data.
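The precision argument can be checked directly with a small NumPy sketch. It assumes, as stated above, that the input samples are 8-bit signed integers. The variable names are illustrative, and the check covers only the conversion and squaring of individual samples, not rounding accumulated over long integrations.

```python
import numpy as np

# Every value an 8-bit signed sample can take.
samples = np.arange(-128, 128).astype(np.int8)

as_single = samples.astype(np.float32)
as_double = samples.astype(np.float64)

# Each 8-bit value converts to single precision without rounding.
conversion_is_exact = np.array_equal(as_single.astype(np.float64), as_double)

# Squared samples (at most 128**2 = 16384) also fit exactly in the
# 24-bit significand of a single-precision float.
power_is_exact = np.array_equal(
    (as_single * as_single).astype(np.float64), as_double * as_double
)

# Single precision uses half the memory, and so half the memory traffic.
memory_ratio = as_double.nbytes / as_single.nbytes

print(conversion_is_exact, power_is_exact, memory_ratio)
```

Halving the size of each element matters beyond arithmetic speed, since twice as many samples fit in GPU memory and in each transfer between host and device.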

Results
To compare the speed of the implementations, we used files from previous observations made at different observatories. Table 1 lists the time taken to process three different files in double-precision mode. The CPU implementation based on Numba is measurably faster than the original CPU implementation based on Cython. The GPU-accelerated backend, in turn, processed the data 6.8 to 9.3 times faster than the original CPU-based implementation. Table 2 lists the same measurements as Table 1 but in single-precision mode. The original Cython implementation was left out because it does not support single-precision mode. Here, the GPU-accelerated backend processed the same data 7.5 to 10.6 times faster than the Numba CPU-based implementation.
To illustrate the processing time improvement, a single observation containing 105 GB of data was processed in 12 hours by the original CPU-based TurboSETI implementation on an Intel i7-7700K CPU, and in just 1 hour and 45 minutes by the GPU-accelerated backend on an NVIDIA GTX 1070 Ti GPU.
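The speed-up implied by this example can be worked out from the figures quoted above. The short calculation below uses only those numbers; the variable names are illustrative.

```python
# Figures quoted in the text for a single 105 GB observation.
data_gb = 105.0
cpu_hours = 12.0
gpu_hours = 1.0 + 45.0 / 60.0  # 1 hour and 45 minutes

speedup = cpu_hours / gpu_hours
cpu_throughput_gb_per_hour = data_gb / cpu_hours
gpu_throughput_gb_per_hour = data_gb / gpu_hours

print(f"speed-up: {speedup:.2f}x")
print(f"CPU throughput: {cpu_throughput_gb_per_hour:.2f} GB/h")
print(f"GPU throughput: {gpu_throughput_gb_per_hour:.2f} GB/h")
```

The resulting ratio of roughly 6.9 sits at the lower end of the 6.8 to 9.3 range reported for the double-precision benchmark.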

Conclusion
The original implementation of TurboSETI worked exclusively on the CPU to process data. We implemented a GPU-accelerated backend to leverage the massively parallel capabilities of a graphics processing unit. The benchmarks show that the new CPU and GPU implementations take significantly less time to process observation data, resulting in more science being produced. Based on the results, the recommended configuration for running the program is single-precision calculations on a GPU device.

TABLE 1
Double-precision processing time benchmark with the Cython, Numba, and CuPy implementations.

TABLE 2
Single-precision processing time benchmark with the Numba and CuPy implementations.