

# Computing DTWs on CPU, GPU and FPGA with **SYCL**

Cristian Campos, Rafael Asenjo, Javier Hormigo and Angeles Navarro

> EasyChair preprints are intended for rapid dissemination of research results and are integrated with the rest of EasyChair.

September 13, 2024

# Computing DTWs on CPU, GPU and FPGA with SYCL

Cristian Campos<sup>1[0009–0000–3273–6133]</sup>, Rafael Asenjo<sup>1[0000–0002–1570–3863]</sup>, Javier Hormigo1[0000−0002−5454−6821], and Angeles Navarro1[0000−0002−4140−2589]

> Dept. of Computer Architecture, Univ. of Málaga, Spain {cricamfe,asenjo,hormigo,angeles}@ac.uma.es

Abstract. One of the most time-consuming kernels of an epileptic seizure detection app is the computation of the Dynamic Time Warping (DTW) Distance Matrix. In this paper, we explore the design space of heterogeneous CPU, GPU, and FPGA implementations of this kernel. First, we optimize the CPU implementation of the DTW Distance Matrix computation leveraging the latest C++26 SIMD library and compare it with the SYCL implementation that also exploits the SIMD units. Next, run the SYCL code on an on-chip GPU, iGPU, as well as on a discrete NVIDIA GPU, dGPU. Finally we present the SYCL implementation on an Intel FPGA. Our evaluations demonstrate that SYCL seems well suited to exploit the SIMD capabilities of modern CPU cores, and shows promising results for the accelerating devices.

Keywords: Heterogeneous architecture · SIMD · GPU · FPGA · SYCL · DTW · energy efficiency.

# 1 Introduction

Nowadays, embracing heterogeneous programming is also a must if we require performance and reduced energy consumption from current computing platforms. In this context, the "No transistor left behind" war cry conveys the idea of all devices helping in accelerating different parts of an application. To help in this regard, new heterogeneous programming models, such as  $\text{SYCL}$  [\[6\]](#page-12-0),  $\text{DPC++}$ , and oneAPI [\[9\]](#page-12-1), are emerging in order to ease the development of heterogeneous applications without compromising performance.

In this paper, we target a CPU, an integrated GPU, iGPU, an NVIDIA discrete GPU, dGPU, and an FPGA that we exploit to accelerate the most expensive function of an epileptic seizure detection algorithm. In the process, we tailor the function for the CPU, the GPUs and the FPGA devices, striving to reduce the energy consumption and paying attention also to the programmability. For the CPU, one of the latest standard  $C_{++}$  SIMD library [\[10\]](#page-12-2) and SYCL are used. For the GPUs and FPGA, we also leverage the SYCL compiler (a High-Level Synthesis compiler in the case of the FPGA).

Epilepsy is one of the most common neurological diseases globally [\[17\]](#page-12-3), which makes the detection of epileptic seizures a socially impacting problem. Our goal

is to devise a wearable with just two electrodes that can take an electroencephalography (EEG) signal and warn the patient of epileptic seizures. To that end, we have developed a seizure detection algorithm based on Dynamic Time Warping (DTW) [\[14\]](#page-12-4) distance matrix computation. This is a quite compute and data-intensive algorithm that requires fast execution when trained with patientspecific EEG recordings.

With all this, this paper proposes the following novel contributions:

- 1. A CPU version of the DTW Distance Matrix computation that leverages the latest C++26 SIMD library features and SYCL SIMDimization capabilities.
- 2. Two accelerators of the DTW Distance Matrix computation, one for GPU and another for FPGA, which take advantage of the latest oneAPI SYCL compiler and its High-Level Synthesis capabilities for the FPGA.
- 3. A validation of the SYCL programming model in terms of performance portability and energy efficiency in four different heterogeneous architectures, discussing the strengths and limitations of our implementations for each device.

# 2 Background and related work

We rely on Figure [1](#page-2-0) to illustrate key concepts that are required to understand the EEG analysis we propose. A signal or channel,  $S<sup>c</sup>$ , is defined as discrete-time



<span id="page-2-0"></span>Fig. 1. The problem at hand and notation.

sequence of real-valued numbers  $s_i^c \in \mathbb{R}$ , i.e.,  $S^c = \{s_i^c, 0 \le i \le n\}$ , where n is the length of the sequence and  $c$  is a channel id or label. Each number  $s_i^c$ represents the electrical potential between two electrodes sampled at a specific time  $t_i$ . Channel labels in the 10-20 system  $[15]$  are used to identify each channel. For example, in Figure [1,](#page-2-0) each channel sample  $S^{F3-C3}$  represents the electrical potential measured between electrodes F3 and C3.

A seizure, denoted by  $Z_{k,z}^c$ , is a subsequence in channel  $S^c$  that starts at the sample k and has a length z, i.e.  $Z_{k,z}^c = S_{k,z}^c$  when this subsequence has been labeled as a seizure. In the dataset used, ictal (seizure) and interictal (nonseizure) episodes are clearly identified through metadata, md, that specifies the onset and offset timestamps of each seizure, from which we gather the  $k$  and  $z$ values. The presence of seizures in  $S<sup>c</sup>$  segments the signal into ictal,  $S<sup>c+</sup>$ , and interictal,  $S^{c-}$ , subsequences, as we see in Figure [1.](#page-2-0)

A query on channel  $S^c$ ,  $Q^c$ , is the concatenation of all  $N_z^t$  seizures in  $S^{c_t}$ one after the other in the order they appear. This can be expressed as:  $Q<sup>c</sup>$   $(Z_{k_1,z_1}^c, Z_{k_2,z_2}^c, ..., Z_{k_{Nt},z_{Nt}}^c)$ . The total length of the query is  $n_q = \sum_{i=1}^{N_z^t} z_i$ . For example, in Figure [1,](#page-2-0)  $Q^{F\ddot{3}-C3}$  is the concatenation of the two seizures in F3-C3.

In order to find seizure patterns in a channel  $S<sup>c</sup>$ , we only consider subsequences with a fixed length, e, and a fixed stride, s. We call epochs to these particular subsequences that, in other words, virtually segment the channels into smaller equidistant and fixed-size sliding windows. More precisely, an epoch  $E_i^c = S_{k,e}^c$  with  $k \in \{i \cdot s; 0 \le i < n_e\}$ , being the number of epochs  $n_e =$  $\lfloor (n-e)/s \rfloor + 1$ . Similarly to the epochs, the patterns are the subsequences or sliding windows of size  $e$  and stride  $s$  that fill in a query  $Q<sup>c</sup>$ . This is, a pattern  $P_i^c = Q_{k,e}^c$  with  $k \in \{i \cdot s; \ 0 \le i < n_p\}$ , being  $n_p = \lfloor (n_q - e)/s \rfloor + 1$  the number of patterns in the query of length  $n_q$ . In Figure [1,](#page-2-0) we depict just two generic epochs of the subsequence  $S_{a,b}^{F3-C3}$ , labeled as  $E_i^{F3-C3}$  and  $E_{i+1}^{F3-C3}$  and the patterns  $P_i^{F3-C3}$  and  $P_{i+1}^{F3-C3}$  of a query subsequence  $Q_{c,d}^{F3-C3}$ .

Let us remember that our goal is to automatically identify one or several patterns that can discriminate between seizures and non-seizures in an EEG channel. Such a suitable pattern should include a well-conserved shape that is present in seizure epochs,  $E^+$ , but not in non-seizure ones,  $E^-$ . This can be achieved by comparing patterns and epochs by computing the distance between them in order to measure their similarities. However, there are many patterns in the query, so we need to compute a quality metric for all of them in order to find the best pattern (the one with the highest discriminative quality). In any case, first we have to compute the distance matrix, DM, between all patterns,  $P_i$ , and all epochs,  $E_j$ , as shown in Figure [2.](#page-3-0) Formally, we use  $d_{i,j} = d(P_i, E_j)$ for the DTW distances between the patterns and the epochs.



<span id="page-3-0"></span>Fig. 2. Distance Matrix, DM.

We define the distance,  $d(P_i^c, E_j^c)$ , between a pattern,  $P_i^c$  and an epoch,  $E_j^c$ , as the Dynamic Time Warping (DTW) [\[14](#page-12-4)[,13\]](#page-12-6) distance between the two subsequences. DTW is an increasingly used algorithm for measuring similarity between two temporal sequences that may vary in phase or speed. It is now recognized as one of the most reliable similarity metrics [\[3\]](#page-12-7). Its higher computational cost in comparison with cheaper distances (as the Euclidean Distance, ED), has spurred the development of many simplified and optimized variants. One of the first simplifications was the cDTW (constrained DTW) [\[13\]](#page-12-6) that limits the warping path to a band, known as the Sakoe-Chiba band or warping window,  $w$ , around the main diagonal of the cost matrix. When the warping window is set to 0, the cDTW degenerates into the ED. The cDTW is a good compromise between the ED and the DTW, as it is faster than the DTW but more accurate than the ED. In this work we use cDTW with warping window,  $w = 16$ , as the distance metric between patterns and epochs. When the cDTW is used to find the nearest neighbor, NN, for example to find the most similar epoch to a pattern, further optimizations have been proposed, such as lower-bounding, early abandoning, and pruning techniques [\[7\]](#page-12-8).

In principle, the number of DTW distances that we have to compute is equal to the number of patterns times the number of epochs, which can be prohibitive. For example, for each of the 23 channels of the first patient of the CHB-MIT dataset [\[5\]](#page-12-9), with  $e = 1024$  (4 seconds) and  $s = 256$  (1 second), there are more than 145K epochs and 269 patterns, which results in computing almost 40 million DTW distances. Running a widely available Python DTW library<sup>[1](#page-4-0)</sup> [\[4\]](#page-12-10) on an offthe-shelf laptop, a single DTW distance of two subsequences of 1024 samples can take tens of milliseconds, which would translate in several months to compute all the distances of a single channel of a single patient.

Our work strives to optimize the computation of the DTW distance matrix, DM, and validate its performance and efficiency on four different architectures: CPU, iGPU, dGPU, and FPGA. We are not aware of any previous work that has tackled this problem in such a comprehensive way. However, we can find related work in the literature that has addressed the problem of computing another kind of distance matrix in different ways. For instance, in [\[18\]](#page-12-11), the authors propose a parallel algorithm, STAMP, to compute the Matrix Profile, based on a distance matrix that is used to find similar subsequences in time series. SCRIMP [\[19\]](#page-12-12) supersedes STAMP computing, in parallel, the diagonals of the distance matrix. SCAMP [\[20\]](#page-12-13) leverages the Pearson correlation to compare subsequences, instead of using the ED. We contributed with a CPU+GPU implementation of SCRIMP and a CPU+FPGA implementation of SCAMP in [\[11\]](#page-12-14) and [\[12\]](#page-12-15), respectively. In [\[1\]](#page-11-0) the DTW is used to compute the matrix profile instead of the ED, but the distance matrix is not explicitly computed. However, in our work, we need the whole DTW distance matrix because it is consulted many times in order to compute the quality metrics to identify the best epilepsy seizure pattern.

# <span id="page-4-1"></span>3 CPU and GPU implementations

In Figure [3](#page-5-0) we show an example of the cDTW distance computation. In the top-left corner we see the Euclidean Distance between a hypothetical Epoch, E, and Pattern, P, both of size  $e = 6$ . Each point of the Pattern is paired with the corresponding one in the Epoch so that the Euclidean Distance is  $d_E = \sum_{i=0}^{e-1} (P_i - E_i)^2$ . However, below we see that using a warping window of  $w = 2$ , the points of P,  $P_i$ , can be paired with points of E,  $E_j$ , with  $j \in \{i-2, i-1\}$  $1, i, i + 1, i + 2$ . In this case, the cDTW distance is  $d_{DTW} = \sum_{i=0}^{e-1} (P_i - E_j)^2$ , where  $j$  is the index of the point in  $E$ , within the warping window, that minimizes

<span id="page-4-0"></span><sup>1</sup> See <https://dynamictimewarping.github.io/>

the distance. In this example,  $d_{DTW} = 4 + 0 + 1 + 1 + 1 + 0 + 1 + 1 = 9$ , and the warping path, highlighted in gray, identifies the pairs  $(i, j)$  that result in this DTW distance.



<span id="page-5-0"></span>Fig. 3. Euclidean distance, cDTW distance example, and SIMD data structures.

The cDTW algorithm takes two vectors of data points,  $P$  and  $E$ , and computes the distance between them. The distance is computed by traversing a virtual (not stored) DTW matrix, D, of  $e \times e$  from the top left corner to the bottom right one. The distance between  $P_i$  and  $E_j$  is computed as the Euclidean distance between them,  $d(P_i, E_j) = (P_i - E_j)^2$ , plus the minimum of the three adjacent cells in the distance matrix,  $N$ ,  $NW$ , and  $W$  in Figure [3](#page-5-0) (from North, North West, and West). The final distance is the value of the bottom right corner of the matrix. The elements of the matrix out of the diagonal  $\pm w$  are initialized to  $\infty$ .

As we see on the left side of Figure [3,](#page-5-0) we can save memory space by only storing the required information of the matrix in two vectors of size  $2 \cdot w + 1$ , (5 elements in our example). These vectors, PRE (from previous) and CUR (from current) store just the elements that are necessary to compute the band of the matrix. At each i iteration, PRE and CUR are swapped and CUR is filled with the new values. The values NW, N, and W represent neighboring cells in the matrix used to compute the minimum distance at each step. For example, in our example of Figure [3,](#page-5-0) at iteration  $i = 4$ , PRE={6,9,18,13,13} (from iteration 3) and for  $j = 4$  we have  $NW = 18$ ,  $N = 13$ ,  $W = 7$ , and  $d(P_4, E_4) = 1$ , so  $x_{4,4} = min(18, 13, 7) + 1 = 8.$ 

The computation of the distance matrix, DM, offers several sources of parallelism that we can exploit. For instance, on the CPU, the simplest parallel version uses OpenMP to parallelize the epoch dimension (the largest one) of the DM. This is our baseline so we call it BASE. Besides, the pattern dimension is also parallelizable, and in order to better exploit the cache hierarchy we can "SIMDimize" the traversal in this dimension by comparing several patterns with each epoch.

To achieve substantial speedup in the computation, we explored two alterna-tives: i) using the C++[2](#page-5-1)6 SIMD library<sup>2</sup> [\[10\]](#page-12-2), which we refer to as SIMD; and ii) using the SIMD features of SYCL, which we call SYCLCPU. Depending on the chosen alternative (BASE, SIMD, or SYCLCPU), different data types and implementations are selected to optimize the computation. In our implementa-

<span id="page-5-1"></span> $2$  Or std::experimental::simd until C++26 is released

tions, a dtw t type is defined differently depending on the chosen alternative. For instance, dtw\_t is a float in BASE, a native\_simd in SIMD, and a float16 in SYCLCPU. Also we include a constant SIMDw that is initialized with the SIMD width (lanes), which, for  $dt w_t = fl$  oat, can be set to 4, 8, or 16 floats per SIMD register on our platforms.

To manage the computation of the DTW in the different versions (BASE, SIMD, and SYCLCPU), we encapsulated the functionality within the DTW CPU class. This class allows for the calculation of the DTW distance, leveraging conditional compilation and the parametrization of the  $dt$   $\< L$ . On the far right of Figure [3,](#page-5-0) we see the SIMD data structures, mainly registers PRE and CUR, and variables NW, N, and W.

The constructor of the DTW CPU class initializes the input arrays with an epoch, E, and several patterns, P. The entries of the patterns have been previously rearranged so that consecutive values correspond to different patterns, optimizing the data layout for SIMD and SYCLCPU implementations. We implement a calculate dtw method to traverse the rows of the DTW matrix and the band registers, iterating through the matrix in a nested loop. During each iteration, the the distance between the current elements of the epoch and pattern is computed.

In the SIMD and SYCLCPU versions, a vectorized operation is performed by loading a Psimd vector with multiple pattern samples and calculating several distances in parallel, which takes advantage of the SIMD units for performance. In contrast, the BASE version computes a single distance per iteration. The computation uses three neighboring values from the previous and current band registers, which are initialized according to the boundaries of the DTW matrix. The function then calculates the distance and updates the current band register, followed by a swap of the current and previous registers to prepare for the next iteration.

A performance issue was identified in the where SIMD function of the SIMD  $C++$  library. By replacing where by  $stat:min$ , a significant performance improvement was observed (up to 3.62x on our evaluation platforms).

In the BASE and SIMD versions, the function calculate dtw() is called from a double nested loop that traverses both the epochs and the patterns, using OpenMP parallel for in the outer loop that traverses the epochs. In SYCLCPU, a data-parallel kernel approach is used so that the same function is called inside a kernel submitted to the CPU SYCL queue (that also feeds the 8 CPU cores of our platforms).

For the GPU implementation, SYCLGPU, taking advantage of the portability of SYCL, we took the BASE code and compiled it for the iGPU and for the NVIDIA dGPU. For the latest, we leverage the interoperability features of the Intel one API  $DPC++/C++$  Compiler, by adding the required compiler flags (as -fsycl-targets and -Xsycl-target-backend). As in the SYCLCPU version, the SYCLGPU one invokes the function calculate dtw() from a kernel which is now submitted to the corresponding GPU queue (the iGPU or

the dGPU). In Section [5](#page-7-0) we describe the methodology employed to identify the optimal global work and work-group sizes for each device.

# <span id="page-7-2"></span>4 FPGA implementation

The FPGA implementation is also written in SYCL and compiled with the oneAPI SYCL compiler, but the code used for the CPU and GPU has to be substantially rewritten to get the most out of the FPGA. Still, the SYCL language allows us fast prototyping and design space exploration.

The core of the proposed architecture is based on the basic DTW accelerator circuit presented in [\[8\]](#page-12-16). This circuit combines a very simple iterative scheme, pipelining, and computation interleaving to produce a very efficient circuit in terms of throughput using a very reduced area. It has been adapted to the Intel FPGA architecture to create our IP kernel that we sketch in Figure [4.](#page-7-1) This kernel is replicated as much as possible (24 times) to maximize the FPGA utilization. Each kernel comprises several "DTW Computation" modules, along with one "Epoch Generation", one "Pattern Generation", and one "Result Write-back" module. The computation modules work in parallel with different epochs, but the same pattern. Hence, the different epochs are sent in parallel to the computation modules whereas the patterns are sent serially from one computation module to the next. There are as many computation modules in a basic kernel as strides fit in an epoch (four in our case, with  $e = 1024$  and  $s = 256$ ). The synchronization of the whole system is data-driven based. That means that each module controls how much data needs to be read or produced and the queues control signals stall the modules whenever is required.

<span id="page-7-1"></span>

Fig. 4. Basic kernel for DTW computation and kernel replication on the FPGA.

### <span id="page-7-0"></span>5 Experimental results

The major goal of SYCL is to improve the programmer productivity by allowing different heterogeneous devices to be used in a single application. However, although optimizations in the kernel code may differ across the devices in order to exploit their specific capabilities as we have seen in the previous sections, it is yet to be proven that this programming model guarantees performance portability across devices. This is the relevant point that we want to quantify here

for our case of study. For this, in this section, we conduct a performance evaluation of the various kernel implementations previously discussed. Our metrics of interest are the throughput (measured as the number of DTWs per millisecond -DTW/ms-) and the energy efficiency (measured as the throughput per Joule -DTW/ms per Joule-). We also explore whether, despite the optimizations, hardware bottlenecks appear in our executions.

#### 5.1 Test bench

The experimental evaluation has been performed on two test benches: AlderLake and SkyLake. All results (time and energy) are reported as the average value of 5 runs. We use 8 cores/threads in all CPU runs.

AlderLake features an Intel Core i9-12900K CPU, running at 3.20 GHz, with 8 performance, P, cores and 8 efficiency, E, cores, and 30 MB L3. For our study, we use the taskset command to confine the threads only on the Pcores. This platform also includes the on-chip/integrated GPU, iGPU, Intel UHD Graphics 770 and the discrete GPU, dGPU, NVIDIA RTX 4070 Ti. It has 128 GB of RAM and runs on Ubuntu 22.04.5 LTS. We compile with the Intel oneAPI DPC++/C++ icpx compiler version 2024.0.2.

SkyLake has an Intel Core i7-7820X 3.60 GHz processor with 8 cores and 11 MB L3, plus 128 GB of DDR4 RAM. In addition, it has an FPGA Intel Stratix 10 MX with 32 HBM memory banks, each with 512 MB, totaling 16 GB. The system runs CentOS 8.1.1911. This unit lacks a graphics card, thus enabling the execution of all developed versions, with the exception of the GPU version, including the FPGA implementation. On this platform, the baseline and SIMD versions based on C++26 utilize the GCC 12.2.0 compiler, whereas the SYCLCPU and FPGA versions are compiled with the Intel oneAPI  $DPC++/C++$  Compiler 2022.0.0 (this is the latest version supported by our FPGA).

To evaluate energy consumption on both platforms, we use Intel Performance Counter Monitor (Intel PCM)<sup>[3](#page-8-0)</sup> to accurately measure CPU and GPU power usage. In addition, to monitor FPGA power usage, we utilize StratixMonitor-Lib [\[16\]](#page-12-17). We rely on the NVIDIA Management Library (NVML) [\[2\]](#page-12-18), a specialized API created by NVIDIA to measure numerous metrics of its graphics cards, including the ability to monitor power usage.

As a benchmark, we use a channel,  $S<sup>c</sup>$ , with 162 hours of EEG signal sampled at 256 samples per second which translates into  $n = 150*10^6$  samples. The query,  $Q<sup>c</sup>$ , that contains the epileptic seizures has  $n_q = 107, 263$  samples. Using epochs and patterns of length  $e = 1024$  and stride  $s = 256$ , we end up with a number of epochs,  $n_e = 589,823$  and a number of patterns,  $n_p = 415$ , which results in a Distance Matrix with more than 244 million of DTW distances.

#### 5.2 Performance evaluation

Figure [5](#page-9-0) depicts the throughput (DTW/ms) that our different implementations achieve on AlderLake and SkyLake. BASE represents a parallel OpenMP CPU

<span id="page-8-0"></span><sup>3</sup> See <https://github.com/intel/pcm/>



Fig. 5. Performance metrics on AlderLake and SkyLake: Throughput -DTW/ms-. The higher, the better.

implementation without the SIMD optimization that we take as the baseline. SIMD represents a CPU implementation based on the C++26 SIMD library, whereas SYCLCPU shows the results for a CPU implementation based on SYCL (see section [3\)](#page-4-1). In both cases, SIMDw=x states the number of SIMD lanes that have been configured in each evaluation: 4, 8, and 16 floats per SIMD register.

Additionally, in AlderLake we see the results for the SYCLGPU implementation (see section [3\)](#page-4-1) running on the integrated GPU (iGPU) and on the discrete GPU (dGPU). To tune these versions, we use the Intel GPU Occupancy Calculator tool and the CUDA occupancy calculator that help us to explore different iGPU/dGPU job size configurations. We found that the optimal configuration for the iGPU was a global work size of 4096x32 and a work-group size of 64x8, achieving ideally 96.2% of execution unit (EU) utilization. Also, after exploring different combinations for the dGPU, we found that a global size of 12288x128 and a block size of 256x1, maximize ideally the multiprocessors (SM) utilization, achieving 97.66% of occupancy in this case. The results shown in the figure correspond to these configurations.

On the other hand, in SkyLake we see the results for the SYCL FPGA implementation (see section [4\)](#page-7-2) running on the Stratix 10 MX. Ideally, our implementation is able to compute 24 DTWs per cycle.

Clearly, from Figure [5](#page-9-0) we see that the SIMDimized CPU implementations always outperform the baseline and that increasing the number of lanes improves performance in both platforms, as expected. Moreover, SIMD code based on the C++26 library performs slightly better than the SYCL version. The observed performance degradation in SYCL is due to the kernel enqueueing and launching, which represent up to 5% and 3% of overhead in AlderLake and SkyLake, respectively. In any case, the optimal SIMDw=16 version outperforms the baseline by 11.8x and 11.4x in AlderLake and SkyLake. In fact, we performed a roofline analysis using the Intel Advisor tool and found that the function that represents the hotspot in the optimal SIMDw=16 code is compute-bound, and it features a headroom of just 1.4x and 2.4x to the ideal ALU peak in AlderLake and SkyLake

<span id="page-9-0"></span>

respectively. In other words, there are no bottlenecks, and the SIMDimization is fully exploiting the CPU capabilities in our CPU implementations (SIMD library-based and SYCL).

In AlderLake we see that the SYCLGPU version running on the integrated GPU (iGPU) performs 2.5x faster than the baseline. The Intel Advisor tool reports a 75% of EU occupancy, which hints that SYCL is reasonably exploiting the iGPU capabilities. Moreover, this SYCLGPU implementation running on the discrete GPU (dGPU) achieves 5.1x improvement over the fastest SYCLCPU. We also carried out a roofline analysis of this version using the NVIDIA NSight Compute tool and found that the function that represents the hotspot is memory bound, achieving 81% and 43% of memory and SM occupancy, respectively. Thus, although far from the expected ideal SM occupancy, this version fully exploits the attainable dGPU memory bandwidth, which represents the bottleneck in this device.

In SkyLake we notice that the SYCL version running on the FPGA is 4.3x faster than the baseline, but still 2.5x slower than the best SYCLCPU version. From the Intel oneAPI FPGA Reports tool we discovered that the loop that represents the hostspot has an initiation interval of 1, it is pipelined and works at a frequency of 300 MHz. This loop is in fact the DTW computation module shown in Fig. [4,](#page-7-1) and from the tool we learn that although it is fully optimized, the FIFO-queues used to send one computed pattern from one computation module to the next are effectively the bottleneck of the implementation because they introduce several stalls. Other interesting result is that the area estimates report tell us that our kernel uses 38% of ALUTs, 35% of on-chip block RAM and 7% of DSPs. Despite the apparent availability of resources, the compiler fitter module (quartus fit) was able to perform the placement and routing only under this scenario.

#### 5.3 Energy efficiency evaluation

The energy consumption metrics for AlderLake and SkyLake are shown in Figure [6.](#page-11-1) The solid bars show energy consumption in Joules (the higher the value, the worse), while the patterned bars show the energy efficiency -DTW/ms per Joule - in log scale (the higher the value, the better the efficiency).

From the energy consumption point of view, the SYCLGPU implementation running on the iGPU and dGPU reports the smallest values on Alderlake. However, from the energy efficiency perspective, the two more efficient implementations are SYCLGPU on dGPU and SIMD with SIMDw=16 on the CPU. On Skylake, the SYCLFPGA implementation exhibits the lowest energy consumption and its energy efficiency is near the more efficient SIMD and SYCLCPU with SIMDw=16 on the CPU.

# 6 Conclusions

In this paper, we propose a novel DTW distance matrix algorithm that we tailor to four different architectures: CPU, iGPU, dGPU, and FPGA. We use



Fig. 6. Energy consumption metrics in AlderLake and SkyLake: Energy -Joules- and Energy Efficiency -DTW/ms per Joule-. The latter metric is in log scale, and the higher, the better efficiency.

SYCL as the heterogeneous programming paradigm and evaluate its performance portability and energy efficiency across devices. Our results demonstrate that SYCL seems well suited to exploit the available SIMD capabilities of modern CPU cores, both in terms of performance and energy efficiency. It also shows promising results for accelerating devices, such as integrated and discrete GPUs and FPGAs, although in these two latter devices the off-chip and on-chip memory bandwidth are the bottlenecks, respectively.

These results make the case for using SYCL to systematically define the kernel of our application, then apply device-specific optimizations, as illustrated in this work, and finally dispatch each variant to the corresponding device. In fact, our results tell us that heterogeneous executions in which CPUs, GPUs, and FPGAs collaborate simultaneously to accelerate our application make sense, and we will explore this issue in future work.

# Acknowledgments

This work was supported by the Ministry of Science, Innovation and Universities of Spain (Grant Numbers TED2021-131527B-I00 and PID2022-136575OB-I00).

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

# References

<span id="page-11-0"></span>1. Alaee, S., Kamgar, K., Keogh, E.: Matrix Profile XXII: Exact discovery of time series motifs under dtw. In: 2020 IEEE Intl. Conf. on Data Mining (ICDM) (2020)

<span id="page-11-1"></span>

- 12 C. Campos et al.
- <span id="page-12-18"></span>2. Developer, N.: Nvidia management library (nvml), [https://developer.](https://developer.nvidia.com/management-library-nvml) [nvidia.com/management-library-nvml](https://developer.nvidia.com/management-library-nvml)
- <span id="page-12-7"></span>3. Ding, H., et al.: Querying and mining of time series data: Experimental comparison of represent. and distance measures. Proc. VLDB Endow. 1(2), 1542–1552 (2008)
- <span id="page-12-10"></span>4. Giorgino, T.: Computing and visualizing dynamic time warping alignments in R: The dtw package. J. of Statistical Software (2009)
- <span id="page-12-9"></span>5. Goldberger AL, Amaral LAN, G.L.: CHB-MIT scalp EEG database (2010)
- <span id="page-12-0"></span>6. Group, T.K.S.W.: SYCL 2020 Specification (revision 3) (3 2021)
- <span id="page-12-8"></span>7. Herrmann, M., Webb, G.I.: Early abandoning and pruning for elastic distances including dynamic time warping. Data Mining and Knowledge Discovery 35(6), 2577–2601 (Nov 2021)
- <span id="page-12-16"></span>8. Hormigo-Jiménez, M., Hormigo, J.: High-throughput DTW accelerator with minimum area in AMD FPGA by HLS. In: 2023 38th Conf. on Design of Circuits and Integrated Sys. (DCIS) (2023)
- <span id="page-12-1"></span>9. Intel: Intel oneAPI Programming Guide (6 2020)
- <span id="page-12-2"></span>10. Kretz, M.: P0214R9: Data-parallel vector types & operations. Tech. rep., ISO C++ Committee, WG21, Parallelism TS 2 (2018)
- <span id="page-12-14"></span>11. Romero, J.C., Vilches, A., Rodríguez, A., Navarro, A., Asenjo, R.: ScrimpCo: scalable matrix profile on commodity heterogeneous processors. The Journal of Supercomputing (2020)
- <span id="page-12-15"></span>12. Romero, J.C., et al.: Efficient heterogeneous matrix profile on a CPU + high performance FPGA with integrated HBM. Future Generation Computer Systems 125, 10–23 (2021)
- <span id="page-12-6"></span>13. Sakoe, H., Chiba, S.: Dynamic program. algorithm optimization for spoken word recognition. IEEE T. on Acoustics, Speech, and Signal Processing 26(1) (1978)
- <span id="page-12-4"></span>14. Sakoe, H., Chiba, S.: A similarity evaluation of speech patterns by dynamic programming. In: Nat. Meeting of Institute of Electronic Communications Engineers of Japan. vol. 136 (1970)
- <span id="page-12-5"></span>15. Valentine, D.: Learning EEG, 10-20 system (2020), [https://www.](https://www.learningeeg.com/montages-and-technical-components) [learningeeg.com/montages-and-technical-components](https://www.learningeeg.com/montages-and-technical-components)
- <span id="page-12-17"></span>16. Vilches, A.: StratixMonitorLib (Jul 2020), [https://doi.org/10.5281/](https://doi.org/10.5281/zenodo.3948283) [zenodo.3948283](https://doi.org/10.5281/zenodo.3948283)
- <span id="page-12-3"></span>17. WHO: Optimizing brain health across the life course (August 2022)
- <span id="page-12-11"></span>18. Yeh, C.C.M., et al.: Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: Data Mining (ICDM), IEEE 16th Intl. Conf. on (2016)
- <span id="page-12-12"></span>19. Zhu, Y., et al.: Matrix Profile XI: SCRIMP++: Time series motif discovery at interactive speeds. 2018 IEEE International Conference on Data Mining (ICDM) pp. 837–846 (2018)
- <span id="page-12-13"></span>20. Zimmerman, Z., et al.: Matrix profile XIV: scaling time series motif discovery with GPUs to break a quintillion pairwise comparisons a day and beyond. In: ACM Symp. on Cloud Computing (2019)