I Introduction
The electrocardiogram (ECG) is a reliable, effective and noninvasive diagnostic tool that represents the electrophysiological pattern of depolarization and repolarization of the heart muscle during each heartbeat. Heartbeat classification based on ECG provides conclusive information to cardiologists about chronic cardiovascular diseases [1]. An intelligent system for diagnosing cardiovascular diseases is highly desirable because they are the leading cause of death around the globe [2].
Arrhythmia is a heart rhythm problem that occurs when the electrical pulses that coordinate heartbeats cause the heart to beat irregularly, i.e., either too slowly or too fast. Arrhythmias can be caused by coronary artery disease, high blood pressure, changes in the heart muscle (cardiomyopathy), valve disorders, etc.
Myocardial infarction (MI), also known as heart attack, is caused by a blockage of the blood supply through the coronary arteries to the myocardium. This blockage stops the supply of oxygen-rich blood to the heart muscle, which can be life-threatening for the patient [3].
Beat-by-beat examination of the ECG is vital for early diagnosis of cardiovascular conditions. However, differences in recording environments, variations of disease patterns among subjects, and the complex, nonstationary and noisy nature of the ECG signal [4] make heartbeat classification a challenging and laborious exercise for cardiologists [5]. Thus, computer-based methods are useful for automatic and autonomous detection of abnormalities in ECG heartbeat classification.
Conventional methods for heartbeat classification using the ECG signal rely mostly on handcrafted or manually extracted features obtained with signal processing techniques such as digital filter-based methods [6], mixture-of-experts methods [7], threshold-based methods [8], Principal Component Analysis (PCA) [9][10] and the wavelet transform [11]. Some of the classifiers used with these extracted features are Support Vector Machines (SVM) [12], Hidden Markov Models (HMM) [13] and Neural Networks [14]. The first disadvantage of these conventional methods is the separation of the feature extraction part from the pattern classification part. Furthermore, these methods need expert knowledge about the input data and the selected features [15]. Moreover, extracting features with the help of subject experts is a time-consuming process, and the features may not be invariant to noise, scaling and translation and thus can fail to generalize well to unseen data.
The exemplary performance of deep neural networks (DNNs) on ECG [16], and especially the performance of CNNs using 1D convolution [17] and 2D convolution [18], has recently attracted the attention of many researchers. Deep learning models are capable of automatically learning invariant and hierarchical features directly from the data and employ an end-to-end learning mechanism that takes data as input and produces a class prediction as output. Recent deep learning models use the 1D ECG signal or a 2D representation of the ECG obtained by transforming the signal to images or some matrix form. For 1D ECG classification, commonly used deep learning models are deep belief networks, restricted Boltzmann machines, autoencoders, CNNs [19] and recurrent neural networks (RNNs) [20]. For 2D ECG classification, CNNs are used and the input ECG data is transformed to images or some other 2D representation. It is shown experimentally in [21] that a 2D representation of ECG provides more accurate heartbeat classification than a 1D representation. In our previous work [22], a univariate ECG signal is transformed to images by segmenting the signal between successive R-R intervals and then stacking these R-R intervals row-wise to form images. Finally, multidomain multimodal fusion is performed to improve stress assessment. Experimental results showed that multidomain multimodal fusion achieved the highest performance compared to a single ECG modality.
Existing deep learning methods do not provide a robust fusion framework and rely mostly on concatenation [23] and decision-level fusion [24].
In this manuscript, we address the shortcomings of existing deep learning models for ECG heartbeat classification by proposing two fusion frameworks that extract and fuse complementary and discriminative features while also reducing dimensionality.
The proposed work makes the following significant contributions:

Two multimodal fusion frameworks for ECG heartbeat classification, called Multimodal Image Fusion (MIF) and Multimodal Feature Fusion (MFF), are proposed. At the input of these frameworks, we convert the heartbeats of raw ECG data into three types of two-dimensional (2D) images using Gramian Angular Field (GAF), Recurrence Plot (RP) and Markov Transition Field (MTF). The proposed fusion frameworks are computationally efficient, as they keep the size of the combined features similar to the size of the individual input modality features.

We transform heartbeats of the ECG signal to images using Gramian Angular Field (GAF), Recurrence Plot (RP) and Markov Transition Field (MTF) to conserve the correlated spatial-domain information among the data samples. These transformations result in an improvement in classification performance compared with existing approaches that transform ECG to images using spectrograms or methods involving time-frequency analysis (short-time Fourier transform or wavelet transform).
II Related Work
Deep learning models, especially CNNs, have been used over the years for ECG heartbeat classification to detect cardiovascular diseases such as arrhythmia and MI. These models include both 1D and 2D CNNs.
II-A One-dimensional CNN Approaches
Various models based on 1D CNNs have been proposed in the literature for ECG classification. In [25], an active learning model based on a 1D CNN is presented for arrhythmia detection using the ECG signal. Model performance is improved by using breaking-ties (BT) and modified BT algorithms. The authors in [26] proposed a model for adaptive real-time implementation of patient-specific ECG heartbeat classification based on a 1D CNN using end-to-end learning. In [27], a novel algorithm making use of an 11-layer deep CNN is proposed for automatic detection of MI using ECG beats with and without noise. A transfer learning method based on CNN is proposed in [28], where the information learned from the arrhythmia classification task is employed as a reference for the training of classifiers. A computationally intelligent method for patient screening and arrhythmia detection using CNN is proposed in [29]. The proposed method is capable of diagnosing arrhythmia conditions without expert domain knowledge or a feature selection mechanism. In [30], a wavelet transform based on Fourier-Bessel series expansion is proposed for the localization of ECG. The Fourier-Bessel spectrum of the ECG beats is separated into adjacent parts using fixed order ranges, and then a multiscale CNN is employed for MI classification of different categories. A Multi-Channel Lightweight Convolutional Neural Network (MCLCNN), which uses squeeze convolution, depthwise convolution and pointwise convolution, is proposed in [31] for MI classification. Two end-to-end deep learning models based on CNN, together called a two-stage hierarchical model, are proposed in [32]. Furthermore, generative adversarial networks (GANs) are used for data augmentation and to reduce the class imbalance. In [33], the authors proposed a neural network model for precise classification of heartbeats following the AAMI inter-patient standards. This model works in two steps. In the first step, the signals are preprocessed and features are extracted from them. In the second step, classification is performed by a two-layer classifier in which each layer consists of two independent fully connected neural networks. The experiments show that the proposed model precisely detects arrhythmia conditions. In [34], the authors proposed a complex deep learning model consisting of CNN and LSTM. This model classifies six types of ECG signals by processing ten-second ECG slices of the MIT-BIH arrhythmia dataset. Experimental results showed that the proposed model could be used by cardiologists to detect arrhythmia. In [35], the authors presented a CNN-based model for the diagnosis of congestive heart failure using ECG. The training and testing of the proposed model were carried out on publicly available ECG datasets. The performance of the proposed model demonstrates its suitability for congestive heart failure detection.
II-B Two-dimensional CNN Approaches
The outstanding performance of CNNs on 2D data such as images has motivated researchers to convert raw ECG data to images for improved results. In [21], the short-time Fourier transform is used to convert the ECG signal into time-frequency spectrograms that are used as input to a CNN for arrhythmia classification. Experimental results show that the 2D CNN achieved higher classification accuracy than the 1D CNN. In [36], the ECG signal is converted into spectro-temporal images that are sent as input to multiple dense convolutional neural networks to capture both beat-to-beat and single-beat information for analysis. The authors in [37] transformed heartbeat time intervals of ECG signals to images using the wavelet transform. These images are used to train a six-layer CNN for heartbeat classification. In [38], a generative neural network is used to convert the raw 1D ECG signal into a 2D image. These images are input to DenseNet, which produces highly accurate classification, with high sensitivity and specificity, for 4 classes of heartbeats. To distinguish abnormal ECG samples from normal ones, the authors in [39] used pretrained CNNs such as AlexNet, VGG16 and ResNet18 on spectrograms obtained from ECG. Using a transfer learning approach, the highest accuracy of 83.82% is achieved by AlexNet. In [40], multilead ECG signals are treated as 2D matrices for input to a novel model called multilead-CNN (MLCNN), which employs sub two-dimensional (2D) convolutional layers and lead asymmetric pooling (LAP) layers. In [41], the authors generated a dual beat coupling matrix from sections of heartbeats. This dual beat coupling matrix was then used as 2D input to a CNN classifier. In [42], a gray-level co-occurrence matrix (GLCM) obtained from ECG data is employed for feature vector description owing to its exceptional statistical feature extraction ability. In [43], ECG signals were segmented into heartbeats, and each heartbeat was transformed to a 2D grayscale image that was input to a CNN.
In [44], two-second segments of the ECG signal are transformed to recurrence plot images to classify arrhythmia in two steps using a deep learning model. In the first step, the noise and ventricular fibrillation (VF) categories are recognized, and in the second step, the atrial fibrillation (AF), normal, premature AF and premature VF labels are classified. Experimental results show the promising performance of the proposed method.
II-C Fusion-based Approaches
Fusing different modalities mitigates the weaknesses of the individual modalities, in both 1D and 2D forms, by integrating complementary information from the modalities to perform analysis and classification tasks accurately. In [45], a Multiscale Fusion convolutional neural network (MSCNN) is proposed for heartbeat classification using the ECG signal. It is a two-stream network consisting of 13 layers. The features obtained from the last convolutional layer are concatenated before classification. Another Deep Multiscale Fusion CNN (DMSFNet) is proposed in [46] for arrhythmia detection. The proposed model consists of a backbone network and two different scale-specific networks. Features obtained from the two scale-specific networks are fused using a spatial attention module. A patient-specific heartbeat classification network based on a customized CNN is proposed in [47]. The CNN contains an important module called multi-receptive field spatial feature extraction (MRFSFE). The MRFSFE module is designed for extracting multispatial deep features of the heartbeats using five parallel convolution layers with different receptive fields. These features are concatenated before being sent to the third convolutional layer for further processing. A two-stage serial fusion classifier system based on the SVM’s rejection option is proposed in [48]. The SVM’s distance outputs are related to a confidence measure, and ambiguous samples are rejected by the first-level SVM classifier. The rejected samples are then forwarded to a second-stage logistic regression classifier, and late fusion is performed for arrhythmia classification. The authors in [49] presented a unique feature fusion method called parallel graphical feature fusion, in which the focus is on the geometric features of the data. The original signal is first split into subspaces; multidimensional features are then extracted from these subspaces and mapped to points in a high-dimensional space. A multistage feature fusion framework based on CNN and an attention module was proposed in [50] for multiclass arrhythmia detection. Classification is performed by extracting features from different layers of the CNN. The combination of CNN and the attention module improves the discrimination power of the proposed model for ECG classification.
The shortcoming of the existing fusion methods is that they depend mostly on concatenation fusion. Concatenation leads to computational complexity and the curse of dimensionality, and hence to degradation in classification accuracy [51]. In this paper, we address these shortcomings of the existing literature and propose two fusion frameworks, called Multimodal Image Fusion (MIF) and Multimodal Feature Fusion (MFF), which extract and fuse the features while also reducing dimensionality. The proposed fusion frameworks are described in Section III.
III Materials and Methods
This section explains the proposed fusion frameworks, Multimodal Image Fusion (MIF) and Multimodal Feature Fusion (MFF). The common element of both proposed fusion frameworks is the ECG signal to image transformation, as shown in Figures 1 and 2. Therefore, in this section we first explain the ECG signal to image transformation and then MIF, MFF and the two important elements of MFF: the gated fusion network shown in Fig. 3 and the CNN architecture shown in Fig. 4.
III-A ECG Signal to Image Transformation
For each fusion framework, we transform the input heartbeats into three types of images called GAF, RP and MTF images.
III-A1 Formation of Images by Gramian Angular Field (GAF)
Converting heartbeats of the ECG into Gramian Angular Field (GAF) images maps the ECG into an angular coordinate system instead of the typical rectangular coordinate system.
Consider that $X$ is an ECG signal of $n$ samples such that $X = \{x_1, x_2, \ldots, x_n\}$. We normalize $X$ between 0 and 1 to get $\tilde{X}$. We then map the normalized ECG into the angular coordinate system by transforming the value into the angular cosine and the time stamp into the radius. The following equation expresses this encoding.
$$\phi_i = \arccos(\tilde{x}_i),\quad 0 \le \tilde{x}_i \le 1,\ \tilde{x}_i \in \tilde{X}; \qquad r_i = \frac{t_i}{N},\quad t_i \in \mathbb{N} \tag{1}$$
In the above equation, $\tilde{x}_i$ is the normalized sample of the ECG, $t_i$ is the time stamp for $\tilde{x}_i$ and $N$ is a constant to adjust the spread of the angular coordinate system. This encoding provides two benefits: it is bijective, and it conserves the spatial-domain relations through the $r$ coordinate [52]. Since the image location with respect to the ECG heartbeat samples is consistent along the principal diagonal, the original heartbeat samples of the ECG can be restored from the angular coordinates [53].
The angular viewpoint of the encoded image can be exploited by taking the sum or difference of the angles of each pair of samples to indicate the correlation among the various time stamps. The summation method used in this article is given by the following set of equations.
$$\mathrm{GAF} = \begin{bmatrix} \cos(\phi_1 + \phi_1) & \cdots & \cos(\phi_1 + \phi_n) \\ \vdots & \ddots & \vdots \\ \cos(\phi_n + \phi_1) & \cdots & \cos(\phi_n + \phi_n) \end{bmatrix} \tag{2}$$
$$\mathrm{GAF} = \tilde{X}^{\top} \cdot \tilde{X} - \sqrt{I - \tilde{X}^{2}}^{\top} \cdot \sqrt{I - \tilde{X}^{2}} \tag{3}$$
$I$ is the unit row vector in equation 3.
GAF images of five different categories for the MIT-BIH dataset are shown in Fig. 5.
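The summation encoding can be illustrated with a short pure-Python sketch. It is only a minimal example under stated assumptions: the function names `min_max_normalize` and `gasf` and the toy beat are hypothetical and are not taken from the implementation used in the experiments. Each pixel is computed as the cosine of the sum of two angles, where each angle is the arccosine of a sample normalized to [0, 1].

```python
import math

def min_max_normalize(signal):
    """Rescale the samples of one heartbeat to the range [0, 1]."""
    lo, hi = min(signal), max(signal)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in signal]
    return [(x - lo) / span for x in signal]

def gasf(signal):
    """Gramian angular (summation) field: G[i][j] = cos(phi_i + phi_j)."""
    x = min_max_normalize(signal)
    # Clamp before arccos to guard against rounding slightly outside [0, 1].
    phi = [math.acos(min(1.0, max(0.0, v))) for v in x]
    return [[math.cos(a + b) for b in phi] for a in phi]

# Toy heartbeat (illustrative values only, not real ECG data).
beat = [0.0, 0.2, 1.0, 0.3, 0.1]
G = gasf(beat)
```

The resulting matrix is symmetric, and its diagonal depends only on the individual normalized samples, which is why the original samples can be recovered from it.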
III-A2 Formation of Images by Recurrence Plot (RP)
ECG is a nonstationary signal; therefore, to visualize the recurrent behavior and to observe the recurrence pattern of the ECG signal [54], we encode ECG heartbeats into RP images. An RP image obtained from a heartbeat of ECG represents the distances between time points [55].
For the ECG signal $X$ defined in Section III-A1, the recurrence plot is given by
$$R_{i,j} = \Theta\left(\epsilon - \lVert x_i - x_j \rVert\right),\quad i, j = 1, \ldots, n \tag{4}$$
where $\epsilon$ is the threshold and $\Theta$ is the Heaviside function.
RP images of five different categories for the MIT-BIH dataset are shown in Fig. 5.
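A thresholded recurrence plot of a scalar signal can be sketched in a few lines of Python. The sketch assumes that the distance between two time points is the absolute difference of the two samples and that the Heaviside function returns 1 at zero; the function name `recurrence_plot` and the toy values are hypothetical.

```python
def recurrence_plot(signal, eps):
    """Binary recurrence plot: R[i][j] = 1 when |x_i - x_j| <= eps, else 0.

    This is the Heaviside step applied to (eps - distance), using the
    convention that the step equals 1 at zero (an assumption of this sketch).
    """
    return [[1 if eps - abs(a - b) >= 0 else 0 for b in signal] for a in signal]

# Toy example: samples 0 and 2 are closer than the threshold, sample 1 is not.
R = recurrence_plot([0.0, 0.5, 0.05], eps=0.1)
```

Omitting the threshold and keeping the raw distances instead would give a gray-scale variant of the same image.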
III-A3 ECG to Markov Transition Field (MTF) Image Conversion
For encoding ECG heartbeats into MTF images, we used the approach explained in [56]. Let $X$ be the ECG signal defined in Section III-A1. The first step is to define its $Q$ bins based on quantiles and assign every $x_i$ to the related bin $q_j$ ($j \in [1, Q]$). The second step is the construction of the weighted adjacency matrix $W$ by counting transitions among the quantile bins in the manner of a first-order Markov chain along the time axis. The weighted adjacency matrix in normalized form is called the Markov transition matrix; it is insensitive to the spatial-domain characteristics, resulting in information loss. To handle this loss of information, the Markov transition matrix is transformed to the Markov transition field (MTF) matrix by spreading out the transition likelihoods according to the spatial-domain locations. The MTF matrix is denoted by $M$ and is shown below:
$$M = \begin{bmatrix} w_{ij \mid x_1 \in q_i,\, x_1 \in q_j} & \cdots & w_{ij \mid x_1 \in q_i,\, x_n \in q_j} \\ \vdots & \ddots & \vdots \\ w_{ij \mid x_n \in q_i,\, x_1 \in q_j} & \cdots & w_{ij \mid x_n \in q_i,\, x_n \in q_j} \end{bmatrix} \tag{5}$$
where $w_{ij}$ is the frequency of transition of a point between the two quantiles $q_i$ and $q_j$. Since the formation of the transformed matrix depends on the transition probabilities of the elements, the MTF cannot be inverted to recover the original ECG signal.
Bins are quantile intervals over which the probability mass is the same. Any number of bins can be selected for the ECG to MTF conversion. We decided to take 10 bins, as the data is normalized between 0 and 1. These bins are defined during the formation of the weighted adjacency matrix, which is the first step in creating the MTF matrix shown in equation 5.
MTF images of five different categories for the MIT-BIH dataset are shown in Fig. 5.
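The two steps of the MTF construction (quantile binning followed by spreading the transition probabilities over all pairs of time positions) can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, and the rank-based binning, which breaks ties by sample order, is a simplification chosen for brevity.

```python
def quantile_bins(signal, n_bins):
    """Assign each sample to one of n_bins equally populated bins by rank."""
    order = sorted(range(len(signal)), key=lambda i: signal[i])
    bins = [0] * len(signal)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(signal)
    return bins

def transition_matrix(bins, n_bins):
    """Row-normalized first-order transition counts between consecutive samples."""
    W = [[0.0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        W[a][b] += 1.0
    for row in W:
        total = sum(row)
        if total > 0:
            for k in range(n_bins):
                row[k] /= total
    return W

def markov_transition_field(signal, n_bins=10):
    """M[i][j] is the transition probability from the bin of x_i to the bin of x_j."""
    bins = quantile_bins(signal, n_bins)
    W = transition_matrix(bins, n_bins)
    return [[W[bi][bj] for bj in bins] for bi in bins]

# Toy signal alternating between low and high values, encoded with two bins.
M = markov_transition_field([0.1, 0.9, 0.2, 0.8, 0.3, 0.7], n_bins=2)
```

In the toy example every low sample is followed by a high one and vice versa, so the field alternates between 0 and 1 along each row.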
For ECG to image transformation using the GAF, RP and MTF methods, we use the full length of the heartbeats to transform 1D information to 2D. Therefore, an ECG signal of any length can be transformed to images, which can then be resized using interpolation.
We can see from Fig. 5 that, for each kind of image (GAF, RP and MTF), the gray-scale images are more interpretable. These images show different patterns for each of the five categories of the MIT-BIH dataset. The x-y values of the 2D images are simply the pixel values of the GAF, RP and MTF images.
III-B Multimodal Image Fusion Framework
The Multimodal Image Fusion (MIF) framework is shown in Fig. 1. At the input, we transform the heartbeats of the raw ECG signal into three types of images, as described in Section III-A and shown in Fig. 5. The motivation for choosing GAF, MTF and RP is that they are three different statistical methods of transforming ECG to images, and they preserve the temporal information during the transformation. We combine these three gray-scale images to form a triple-channel image (GAF-RP-MTF). A triple-channel image is a color image in which the GAF, RP and MTF images are treated as three orthogonal channels, like the three colors of the RGB image space. However, this three-channel image is not produced by the conventional conversion of a gray-scale image to RGB; rather, in this paper all three gray-scale images are formed from the raw ECG data with different statistical methods. Thus, a three-channel image in the presented work carries the statistical dynamics of the ECG and is therefore more informative. Furthermore, a three-channel image can easily be used with off-the-shelf CNNs such as AlexNet.
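The formation of the triple-channel image amounts to stacking three equally sized gray-scale images along a channel axis. A minimal sketch, assuming images stored as nested Python lists and a hypothetical helper name `stack_channels`, is given below.

```python
def stack_channels(gaf, rp, mtf):
    """Combine three equally sized gray-scale images into one H x W x 3 image.

    Each pixel of the result is a (GAF, RP, MTF) triple, analogous to the
    (R, G, B) triple of a color image.
    """
    h, w = len(gaf), len(gaf[0])
    for img in (gaf, rp, mtf):
        if len(img) != h or any(len(row) != w for row in img):
            raise ValueError("all three images must share the same size")
    return [[(gaf[i][j], rp[i][j], mtf[i][j]) for j in range(w)] for i in range(h)]

# Toy 2 x 2 images (illustrative values only).
gaf_img = [[0.1, 0.2], [0.3, 0.4]]
rp_img = [[1, 0], [0, 1]]
mtf_img = [[0.5, 0.5], [0.25, 0.75]]
fused_image = stack_channels(gaf_img, rp_img, mtf_img)
```

The fused image has the same height and width as each input, so the input size seen by the CNN does not grow with the number of modalities beyond the three channels.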
III-C Multimodal Feature Fusion Framework
At the input of MFF, we transform ECG heartbeats into images as shown in Fig. 2. AlexNets are employed to learn features from each input imaging modality. We extract these learned features from the second-to-last fully connected layer (fc7) of each AlexNet and fuse them with an efficient Gated Fusion Network (GFN), the backbone of the proposed MFF, which fuses the features effectively while also taking care of their dimensionality. The fused features are the input of the SVM classifier, as shown in Fig. 2.
III-C1 Gated Fusion Network
The architecture of our proposed gated fusion network (GFN) is shown in Fig. 3. We have adapted this network from our previous work in [58]. The inputs to the GFN are the features extracted from the second-to-last fully connected layer (fc7) of each AlexNet, as shown in Fig. 2.
Let $F_1$, $F_2$ and $F_3$ be the features from the three imaging modalities, respectively. These features are then convolved with a high boost kernel, as shown in Fig. 3.
We used a high boost filter for convolution with the features since this filter precisely recognizes the important information in a feature and assigns a boosted value to every element of the features according to its importance [59]. The high boost filter is the difference between the scaled version and the low-pass version of the input image, as shown below in equation 6.
$$f_{hb} = A\,f - f_{lp} \tag{6}$$
where $A\,f$ and $f_{lp}$ are respectively the scaled version and the low-pass version of the image $f$.
In general, the high boost filter is given by
(7) 
where $A$ is the amplification factor that assigns the weights to the features during convolution.
The best filter performance is obtained for $A$ = 1. Other values of $A$ produce less amplification.
Thus, the following high boost kernel, which highlights the important characteristics, is selected empirically.
(8) 
High boost filter highlights the high frequency components while conserving the low frequency components.
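After convolution of the features with the high boost kernel $K$, the sigmoid function is used to generate the gated weights $w_1$, $w_2$ and $w_3$, respectively, as shown in Fig. 3. Finally, we take the pointwise product of the weights $w_1$, $w_2$ and $w_3$ with the features $F_1$, $F_2$ and $F_3$, respectively, to perform feature fusion and generate the fused features. The working of the GFN can be understood from the following equations.
$$w_1 = \sigma(F_1 \circledast K) \tag{9}$$
$$w_2 = \sigma(F_2 \circledast K) \tag{10}$$
$$w_3 = \sigma(F_3 \circledast K) \tag{11}$$
$$F = w_1 \odot F_1 + w_2 \odot F_2 + w_3 \odot F_3 \tag{12}$$
where
$\sigma$: sigmoid function
$\circledast$: convolution
$\odot$: pointwise multiplication
$F_i^{j}$: $j$th feature of the $i$th modality (an element of $F_i$)
$F$: fused feature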
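The gating mechanism can be sketched in pure Python as follows. Several points are assumptions of this sketch and not details confirmed by the text: the features are treated as 1D vectors, the kernel `HYPOTHETICAL_KERNEL` is a placeholder 1D sharpening kernel rather than the kernel selected empirically in this work, zero padding is used so that the output keeps the input length, and the gated features of the three modalities are combined by summation.

```python
import math

# Placeholder 1D kernel for illustration only; NOT the kernel used in the paper.
HYPOTHETICAL_KERNEL = [-1.0, 3.0, -1.0]

def conv1d_same(x, k):
    """Zero-padded 1D filtering that keeps the input length.

    The kernel is applied without flipping, which equals convolution
    for a symmetric kernel such as the placeholder above.
    """
    half = len(k) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for t, kv in enumerate(k):
            j = i + t - half
            if 0 <= j < len(x):
                acc += kv * x[j]
        out.append(acc)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_fusion(features, kernel=HYPOTHETICAL_KERNEL):
    """Fuse equal-length feature vectors as sum_i sigmoid(conv(F_i, K)) * F_i."""
    n = len(features[0])
    fused = [0.0] * n
    for f in features:
        gates = [sigmoid(v) for v in conv1d_same(f, kernel)]
        for j in range(n):
            fused[j] += gates[j] * f[j]
    return fused

# Toy feature vectors for three modalities (illustrative values only).
fused = gated_fusion([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]])
```

Because the gated features are summed rather than concatenated, the fused vector has the same length as each input vector, which is the property that keeps the dimensionality of the fused features equal to that of a single modality.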
III-C2 CNN Architecture
The architecture of the CNN used in the proposed MFF is shown in Fig. 4. It consists of three convolutional layers, two pooling layers, and a fully connected layer. The first convolutional layer has 16 kernels of size 5x5, followed by a pooling layer of size 2x2 and stride 2. The second and third convolutional layers have 32 kernels of size 5x5, followed by a 2x2 pooling layer with stride 2.
III-D Classification Task and Classifier
The classification task of the proposed methods is ECG heartbeat classification for arrhythmia and MI detection.
The classification metrics used are accuracy, precision and recall, as shown in Tables V, VI, VII and VIII. They are calculated using the following equations.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{13}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{14}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{15}$$
where
$TP$ = true positive
$TN$ = true negative
$FP$ = false positive
$FN$ = false negative
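As a worked example of these three metrics, the following sketch computes them from the four confusion counts. The function name and the counts are hypothetical and are not results from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts for one class of a test set.
accuracy, precision, recall = classification_metrics(tp=90, tn=890, fp=10, fn=10)
```

With these counts the accuracy is 0.98 while precision and recall are both 0.9, which illustrates how accuracy can stay high on an imbalanced test set even when the minority class is detected less reliably.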
We used a softmax classifier in the proposed MIF and a Support Vector Machine (SVM) classifier in the proposed MFF for the classification task.
The softmax classifier is a multiclass classifier or regressor used in machine learning. The score function of the softmax classifier computes class-specific probabilities whose sum is 1.
The mathematical representation of the score function of the softmax classifier is shown below.
$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \tag{16}$$
where $z$ is the input vector and the score function maps the exponent domain to the probabilities.
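The softmax score function can be sketched as follows. Subtracting the maximum score before exponentiation is a common numerical-stability step added in this sketch; it does not change the result. The function name and the toy scores are illustrative.

```python
import math

def softmax(scores):
    """Map a vector of class scores to probabilities that sum to 1."""
    m = max(scores)  # shift for numerical stability; the output is unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy scores for three classes (illustrative values only).
probs = softmax([1.0, 2.0, 3.0])
```

The class with the largest score always receives the largest probability, so the predicted class is the index of the maximum.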
In its simplest form, the score function of the SVM is a mapping of the input vector to the scores and is a simple matrix operation, as shown in equation 17.
$$f(x) = Wx + b \tag{17}$$
where $x$ is the input vector, $W$ is the weight matrix, whose size is determined by the input vector and the number of classes, and $b$ is the bias vector.
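The linear score function can be written directly as a matrix-vector product plus a bias. In the sketch below, the weight matrix has one row per class; the function names and the toy values are hypothetical.

```python
def linear_scores(W, x, b):
    """Class scores f(x) = W x + b, with one row of W and one bias per class."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def predict(W, x, b):
    """Index of the class with the highest score."""
    scores = linear_scores(W, x, b)
    return scores.index(max(scores))

# Toy two-class problem with two-dimensional features.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.5]
scores = linear_scores(W, [2.0, 1.0], b)
label = predict(W, [2.0, 1.0], b)
```

For a feature vector of length d and c classes, the weight matrix has c rows and d columns, which is the sense in which its size is determined by the input vector and the number of classes.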
Training Parameters  Values 
Momentum  0.9 
Initial Learn Rate  0.005 
Learn Rate Drop Factor  0.5 
Learn Rate Drop Period  10 
Regularization  0.004 
MiniBatchSize  128 
III-E Training and Optimization
We resize the images to 227 x 227 to perform experiments with AlexNet. We also perform experiments with a smaller but computationally efficient CNN, whose architecture is shown in Fig. 4, to show that the proposed frameworks can achieve comparable performance even with the smaller CNN. The comparison of computational cost between the two CNN models is provided in Table XI. We fine-tune AlexNet by reducing the size of the second-to-last fully connected layer 'fc7' from 4096 to 512 and the size of the last fully connected layer 'fc8' from 1000 to the number of classes in our datasets. The size of the 'fc7' layer of AlexNet is 4096, which is matched to the size of the classification layer, which is 1000. For the MIT-BIH and PTB datasets, we need a classification layer of size 5 and 2, respectively, owing to the number of classes in these datasets. Thus, to make 'fc7' compatible with the classification layer, we reduce its size to 512. The training parameters for AlexNet and the CNN are shown in Table I.
For optimization of the deep networks, we used the Stochastic Gradient Descent with Momentum (SGDM) algorithm. SGDM helps accelerate the gradient vectors in the right directions, thus leading to faster convergence. It is one of the most popular optimization algorithms, and many state-of-the-art models are trained using it.
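The SGDM update and the piecewise-constant learning rate schedule implied by Table I can be sketched as follows. The default values (momentum 0.9, initial learning rate 0.005, drop factor 0.5, drop period 10) are taken from Table I; the function names, the zero-based epoch indexing and the toy quadratic objective are assumptions of this sketch, and the regularization term is omitted for brevity.

```python
def learning_rate(epoch, initial=0.005, drop_factor=0.5, drop_period=10):
    """Multiply the learning rate by drop_factor after every drop_period epochs."""
    return initial * drop_factor ** (epoch // drop_period)

def sgdm_step(params, grads, velocity, lr, momentum=0.9):
    """One SGDM update: v <- momentum * v - lr * grad, then theta <- theta + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Toy objective f(w) = w^2 with gradient 2w, one update per epoch.
w, v = [1.0], [0.0]
for epoch in range(30):
    grad = [2.0 * w[0]]
    w, v = sgdm_step(w, grad, v, learning_rate(epoch))
```

The velocity term accumulates past gradients, so consecutive steps in a consistent direction grow larger than a plain gradient step with the same learning rate.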
IV Experimental Results
IV-A ECG Databases
Experiments are performed with the PhysioNet MIT-BIH Arrhythmia dataset [60] [61] for heartbeat classification and the PTB Diagnostic ECG dataset [62] for MI classification using both proposed fusion frameworks. For the experiments, ECG lead II data resampled at a sampling frequency of 125 Hz is used as the input.
We used the standardized form of both datasets provided in [63]. These datasets are already denoised, and the training and testing parts are provided in the form of standard ECG heartbeats. Furthermore, the five classes of arrhythmia and the MI localization have already been prepared and are provided as standard ECG heartbeats. Our study focuses on the ECG to image transformation and on the design of the proposed multimodal fusion frameworks. The main aim is to increase the overall performance of heartbeat classification. We did not attempt to model or solve for a specific type of noise.
We conducted our experiments in MATLAB R2020a on a desktop computer with an NVIDIA GTX 1070 GPU.
The experimental results are discussed in detail in Section V.
IV-A1 PhysioNet MIT-BIH Arrhythmia Dataset
Forty-seven subjects were involved in the collection of ECG signals for the dataset. The data was collected at a sampling rate of 360 Hz, and each beat is annotated by at least two experts. Using these annotations, five different beat categories are created in accordance with the Association for the Advancement of Medical Instrumentation (AAMI) EC57 standard [64], as shown in Table II.
Training a CNN requires a large number of samples. We use the same training and testing segments provided in [63] to train the CNNs. Since there is a class imbalance in the training part of the dataset, as is apparent from the numbers, we applied SMOTE [65] to upsample the minority classes (classes other than N) and finally settled on the numbers shown in the right column of Table III.
SMOTE is a data augmentation technique that is used to reduce overfitting during training and helps reduce the bias of the classifier.
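The core idea of SMOTE, generating a synthetic minority sample by interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched as follows. This is a simplified illustration and not the implementation used in the experiments: the function name, the neighbourhood size and the toy points are hypothetical.

```python
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic sample between a minority point and a near neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Sort the remaining minority points by squared Euclidean distance to the base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    u = rng.random()  # interpolation factor in [0, 1)
    return [a + u * (b - a) for a, b in zip(base, neighbour)]

# Toy minority class of four 2D points (illustrative values only).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(minority_points, k=2, rng=random.Random(0))
```

Each synthetic sample lies on the line segment between two existing minority samples, so upsampling enlarges the minority classes without simply duplicating beats.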
We perform experiments using both proposed fusion frameworks on the MIT-BIH dataset with the training and testing samples shown in Table IV and the training parameters shown in Table I. The experimental results are shown in Tables V and VI.
Modalities  Accuracy (%)  Precision (%)  Recall (%) 
GAF Images only  97.3  85  91 
RP Images only  97.2  82  93 
MTF Images only  91.5  86  89 
Concatenation Fusion  97  82  91 
Average Fusion  98.5  95  93.1 
Proposed MIF  98.6  93  92 
Proposed MFF  99.7  98  98 
Modalities  Accuracy (%)  Precision (%)  Recall (%) 
GAF Images only  98.4  98  96 
RP Images only  98  98  94 
MTF Images only  95.3  94  89 
Concatenation Fusion  97.4  95  95 
Average Fusion  98.5  97  98 
Proposed MIF  98.4  98  94 
Proposed MFF  99.2  98  98 
IV-A2 PTB Diagnostic ECG Dataset
Two hundred and ninety (290) subjects took part in the collection of ECG records for the PTB Diagnostic dataset. Of these, 148 are diagnosed with MI, 52 are healthy controls, and the rest are diagnosed with 7 different diseases. Each ECG record contains 12 leads sampled at a frequency of 1000 Hz. However, for our experiments, we used the lead II ECG recordings and worked with the healthy control and MI categories.
We perform experiments using both proposed fusion frameworks on the PTB dataset with the training and testing samples shown in Table IV and the training parameters shown in Table I. The training and testing parts of the dataset are provided in [63] to train the CNN models. The experimental results are shown in Tables VII and VIII.
V Discussion
We present the comparative results of the proposed frameworks and the state-of-the-art methods in Tables IX and X. As we can see, the proposed MFF framework achieves the highest accuracy on both datasets, and its precision and recall are higher than or comparable to those of the existing methods.
To justify the importance of the proposed fusion frameworks, we assess the performance of different components of the proposed frameworks on both datasets using concatenation and average fusion methods. We performed average fusion by assigning unity to all the weights, i.e., $w_1$ = 1, $w_2$ = 1 and $w_3$ = 1, in the gated fusion network. Since we have three modalities, taking the simple average gives an equal value of 0.333 for each weight. We also experimented with 0.333 and obtained the same results. Since the weights are equal in average fusion, we assign unity to every weight for simplicity. It is possible that better weights can be acquired through trainable weight coefficients; we plan to investigate this in the future. Tables V, VI, VII and VIII report the results of the different fusion methods along with the proposed fusion frameworks.
Previous Methods  Accuracy (%)  Precision (%)  Recall (%) 
Izci et al. [43]  97.96  -  - 
Dang et al. [23]  95.48  96.53  87.74 
Li et al. [47]  99.5  97.3  98.1 
Zhao et al. [49]  98.25  -  - 
Oliveria et al. [37]  95.3  -  - 
Huang et al. [21]  99  -  - 
Shaker et al. [32]  98  90  97.7 
Kachuee et al. [28]  93.4  -  - 
Xu et al. [66]  95.9  -  - 
He et al. [67]  98.3  -  - 
Qiao et al. [68]  99.3  -  - 
Proposed MIF  98.6  93  92 
Proposed MFF  99.7  98  98 
Previous Methods  Accuracy (%)  Precision (%)  Recall (%) 
Dicker et al. [39]  83.82  82  95 
Acharya et al. [27]  95.22  95.49  94.19 
Kojuri et al. [69]  95.6  97.9  93.3 
Kachuee et al. [28]  95.9  95.2  95.1 
Liu et al. [40]  96  97.37  95.4 
Sharma et al. [12]  96  99  93 
Chen et al. [31]  96.18  97.32  93.67 
Cao et al. [70]  96.65  -  - 
Ahamed et al. [71]  97.66  -  - 
Proposed MIF  98.4  98  94 
Proposed MFF  99.2  98  98 
The performance of concatenation fusion is poor as compared to other methods as shown by experimental results. Concatenation fusion creates high dimensional feature vector that leads to the additional computational cost and deterioration of information during classification [72].
We also provide the comparison of both proposed fusion frameworks in terms of inference speed as shown in Table XII. Inference speed is the time consumed by classifier to recognize one test sample. It is expressed in microseconds (s). It is observed that MFF yields high accuracy, precision and recall for both datasets as compared to MIF, however, MIF is computationally efficient in terms of inference speed.
Since we experiment with two different CNNs, we provide comparison between both CNNs in terms of computational cost as shown in Table XI. Since there is a trade off between accuracy and computational cost, we observe from Tables V, VI and XI that CNN, shown in Fig. 4, is less accurate than AlexNet but is computationally efficient.
We prefer SVM classifier over softmax classifier since we have experimentally proved in our previous work [73] that SVM performs better than softmax, which is typically built into any CNN framework. Softmax classifier reduces the cross entropy function while SVM employs a margin based function. The more rigorous nature of classification is the reason of better performance of SVM over softmax.
The comparisons in Tables IX and X are based on the datasets and the performance metrics. The testing conditions differ slightly in a few of the comparisons; however, it is still appropriate to compare the results.
A limitation of the proposed Multimodal Image Fusion (MIF) framework is that it requires exactly three different statistical grayscale images to create a triple-channel compound image. Since the Multimodal Feature Fusion (MFF) framework uses three separate AlexNet networks trained on GAF, RP and MTF images, it requires more time for training and inference.
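The three-channel requirement can be pictured as stacking the GAF, RP and MTF images along a channel axis. The sketch below uses plain Python lists to stay dependency-free; `fuse_channels` is a hypothetical helper, and the channel order (GAF, RP, MTF) is an assumption made for illustration.

```python
def fuse_channels(gaf, rp, mtf):
    """Stack three grayscale images (lists of rows) into one H x W x 3 image.

    Each pixel of the result holds [gaf_value, rp_value, mtf_value].
    The channel order is an assumption, not taken from the paper.
    """
    shapes = {(len(img), len(img[0])) for img in (gaf, rp, mtf)}
    if len(shapes) != 1:
        raise ValueError("all three images must share the same height and width")
    return [
        [[g, r, m] for g, r, m in zip(g_row, r_row, m_row)]
        for g_row, r_row, m_row in zip(gaf, rp, mtf)
    ]
```

With NumPy arrays the same operation is `np.stack([gaf, rp, mtf], axis=-1)`. In both cases the function needs exactly three inputs of identical size, which is the constraint described above.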
VI Conclusion
We proposed two computationally efficient multimodal fusion frameworks for ECG heartbeat classification, called Multimodal Image Fusion (MIF) and Multimodal Feature Fusion (MFF). At the input of these frameworks, we convert the ECG signal into three types of images using Gramian Angular Field (GAF), Recurrence Plot (RP) and Markov Transition Field (MTF). In MIF, we first perform image fusion by combining the three input images into a single three-channel image, which is used as input to the CNN. In MFF, highly informative cues are extracted from the penultimate layer of the CNNs, fused, and used as input to the SVM classifier. We demonstrate the superiority of the proposed fusion frameworks through experiments on PhysioNet's MIT-BIH dataset for five different arrhythmias and on the PTB diagnostics dataset for MI classification. Experimental results show that the proposed frameworks outperform the previous state of the art in terms of classification accuracy, precision and recall. The important finding of this study is that multimodal fusion improves the performance of the machine learning task compared with using the modalities individually.
References
 [1] L. Sun, Y. Lu, K. Yang, and S. Li, “Ecg analysis using multiple instance learning for myocardial infarction detection,” IEEE transactions on biomedical engineering, vol. 59, no. 12, pp. 3348–3356, 2012.
 [2] Y. Xia, X. Liu, D. Wu, H. Xiong, L. Ren, L. Xu, W. Wu, and H. Zhang, “Influence of beattobeat blood pressure variability on vascular elasticity in hypertensive population,” Scientific reports, vol. 7, no. 1, pp. 1–8, 2017.
 [3] U. R. Acharya, N. Kannathal, L. M. Hua, and L. M. Yi, “Study of heart rate variability signals at sitting and lying postures,” Journal of bodywork and Movement Therapies, vol. 9, no. 2, pp. 134–141, 2005.
 [4] U. R. Acharya, Y. Hagiwara, J. E. W. Koh, S. L. Oh, J. H. Tan, M. Adam, and R. San Tan, “Entropies for automated detection of coronary artery disease using ecg signals: A review,” Biocybernetics and Biomedical Engineering, vol. 38, no. 2, pp. 373–384, 2018.
 [5] Z. Zhang, J. Dong, X. Luo, K.S. Choi, and X. Wu, “Heartbeat classification using diseasespecific feature selection,” Computers in biology and medicine, vol. 46, pp. 79–89, 2014.
 [6] E. Pasolli and F. Melgani, “Active learning methods for electrocardiographic signal classification,” IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 6, pp. 1405–1416, 2010.
 [7] Y. H. Hu, S. Palreddy, and W. J. Tompkins, “A patientadaptable ecg beat classifier using a mixture of experts approach,” IEEE transactions on biomedical engineering, vol. 44, no. 9, pp. 891–900, 1997.
 [8] V. Chouhan and S. Mehta, “Thresholdbased detection of p and twave in ecg using new feature signal,” International Journal of Computer Science and Network Security, vol. 8, no. 2, pp. 144–153, 2008.
 [9] N. A. Bhaskar, “Performance analysis of support vector machine and neural networks in detection of myocardial infarction,” Procedia Computer Science, vol. 46, no. 4, pp. 20–30, 2015.

 [10] K.-i. Minami, H. Nakajima, and T. Toyoshima, “Real-time discrimination of ventricular tachyarrhythmia with fourier-transform neural network,” IEEE transactions on Biomedical Engineering, vol. 46, no. 2, pp. 179–185, 1999.
 [11] H. Khorrami and M. Moavenian, “A comparative study of dwt, cwt and dct transformations in ecg arrhythmias classification,” Expert systems with Applications, vol. 37, no. 8, pp. 5751–5757, 2010.

 [12] L. Sharma, R. Tripathy, and S. Dandapat, “Multiscale energy and eigenspace approach to detection and localization of myocardial infarction,” IEEE transactions on biomedical engineering, vol. 62, no. 7, pp. 1827–1837, 2015.
 [13] P.-C. Chang, J.-J. Lin, J.-C. Hsieh, and J. Weng, “Myocardial infarction classification with multi-lead ecg using hidden markov models and gaussian mixture models,” Applied Soft Computing, vol. 12, no. 10, pp. 3165–3175, 2012.
 [14] H. Lu, K. Ong, and P. Chia, “An automated ecg classification system based on a neuro-fuzzy system,” in Computers in Cardiology 2000. Vol. 27 (Cat. 00CH37163). IEEE, 2000, pp. 387–390.
 [15] K. A. Sidek, I. Khalil, and H. F. Jelinek, “Ecg biometric with abnormal cardiac conditions in remote monitoring system,” IEEE Transactions on systems, man, and cybernetics: systems, vol. 44, no. 11, pp. 1498–1509, 2014.

 [16] V. Krasteva, S. Ménétré, J.-P. Didon, and I. Jekova, “Fully convolutional deep neural networks with optimized hyperparameters for detection of shockable and non-shockable rhythms,” Sensors, vol. 20, no. 10, p. 2875, 2020.
 [17] I.C. Tanoh and P. Napoletano, “A novel 1d ccanet for ecg classification,” Applied Sciences, vol. 11, no. 6, p. 2758, 2021.
 [18] M. Wasimuddin, K. Elleithy, A. Abuzneid, M. Faezipour, and O. Abuzaghleh, “Multiclass ecg signal analysis using global averagebased 2d convolutional neural network modeling,” Electronics, vol. 10, no. 2, p. 170, 2021.
 [19] M. Längkvist, L. Karlsson, and A. Loutfi, “A review of unsupervised feature learning and deep learning for timeseries modeling,” Pattern Recognition Letters, vol. 42, pp. 11–24, 2014.
 [20] R. Salloum and C.C. J. Kuo, “Ecgbased biometrics using recurrent neural networks,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2062–2066.
 [21] J. Huang, B. Chen, B. Yao, and W. He, “Ecg arrhythmia classification using stftbased spectrogram and convolutional neural network,” IEEE Access, vol. 7, pp. 92 871–92 880, 2019.
 [22] Z. Ahmad and N. Khan, “Multilevel stress assessment using multidomain fusion of ecg signal,” in 42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, 2020, pp. 4518–4521.
 [23] H. Dang, M. Sun, G. Zhang, X. Zhou, Q. Chang, and X. Xu, “A novel deep convolutional neural network for arrhythmia classification,” in 2019 International Conference on Advanced Mechatronic Systems (ICAMechS). IEEE, 2019, pp. 7–11.
 [24] P. De Chazal, M. O’Dwyer, and R. B. Reilly, “Automatic classification of heartbeats using ecg morphology and heartbeat interval features,” IEEE transactions on biomedical engineering, vol. 51, no. 7, pp. 1196–1206, 2004.
 [25] Y. Xia and Y. Xie, “A novel wearable electrocardiogram classification system using convolutional neural networks and active learning,” IEEE Access, vol. 7, pp. 7989–8001, 2019.
 [26] S. Kiranyaz, T. Ince, and M. Gabbouj, “Realtime patientspecific ecg classification by 1d convolutional neural networks,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 3, pp. 664–675, 2015.
 [27] U. R. Acharya, H. Fujita, S. L. Oh, Y. Hagiwara, J. H. Tan, and M. Adam, “Application of deep convolutional neural network for automated detection of myocardial infarction using ecg signals,” Information Sciences, vol. 415, pp. 190–198, 2017.
 [28] M. Kachuee, S. Fazeli, and M. Sarrafzadeh, “Ecg heartbeat classification: A deep transferable representation,” in 2018 IEEE International Conference on Healthcare Informatics (ICHI). IEEE, 2018, pp. 443–444.
 [29] B. Pourbabaee, M. J. Roshtkhari, and K. Khorasani, “Deep convolutional neural networks and learning ecg features for screening paroxysmal atrial fibrillation patients,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 12, pp. 2095–2104, 2018.
 [30] R. K. Tripathy, A. Bhattacharyya, and R. B. Pachori, “Localization of myocardial infarction from multilead ecg signals using multiscale analysis and convolutional neural network,” IEEE Sensors Journal, vol. 19, no. 23, pp. 11 437–11 448, 2019.
 [31] Y. Chen, H. Chen, Z. He, C. Yang, and Y. Cao, “Multichannel lightweight convolution neural network for anterior myocardial infarction detection,” in 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). IEEE, 2018, pp. 572–578.
 [32] A. M. Shaker, M. Tantawi, H. A. Shedeed, and M. F. Tolba, “Generalization of convolutional neural networks for ecg classification using generative adversarial networks,” IEEE Access, vol. 8, pp. 35 592–35 605, 2020.
 [33] H. Wang, H. Shi, K. Lin, C. Qin, L. Zhao, Y. Huang, and C. Liu, “A highprecision arrhythmia classification method based on dual fully connected neural network,” Biomedical Signal Processing and Control, vol. 58, p. 101874, 2020.
 [34] C. Chen, Z. Hua, R. Zhang, G. Liu, and W. Wen, “Automated arrhythmia classification based on a combination network of cnn and lstm,” Biomedical Signal Processing and Control, vol. 57, p. 101819, 2020.
 [35] M. Porumb, E. Iadanza, S. Massaro, and L. Pecchia, “A convolutional neural network approach to detect congestive heart failure,” Biomedical Signal Processing and Control, vol. 55, p. 101597, 2020.
 [36] C. Hao, S. Wibowo, M. Majmudar, and K. S. Rajput, “Spectrotemporal feature based multichannel convolutional neural network for ecg beat classification,” in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2019, pp. 5642–5645.
 [37] A. T. Oliveira, E. G. Nobrega et al., “A novel arrhythmia classification method based on convolutional neural networks interpretation of electrocardiogram images,” in IEEE International conference on industrial technology. Piscataway, NJ, 2019.
 [38] M. M. Al Rahhal, Y. Bazi, H. Almubarak, N. Alajlan, and M. Al Zuair, “Dense convolutional networks with focal loss and image generation for electrocardiogram classification,” IEEE Access, vol. 7, pp. 182 225–182 237, 2019.
 [39] A. Diker, Z. Cömert, E. Avcı, M. Toğaçar, and B. Ergen, “A novel application based on spectrogram and convolutional neural network for ecg classification,” in 2019 1st International Informatics and Software Engineering Conference (UBMYK). IEEE, 2019, pp. 1–6.
 [40] W. Liu, M. Zhang, Y. Zhang, Y. Liao, Q. Huang, S. Chang, H. Wang, and J. He, “Realtime multilead convolutional neural network for myocardial infarction detection,” IEEE journal of biomedical and health informatics, vol. 22, no. 5, pp. 1434–1444, 2017.
 [41] X. Zhai and C. Tin, “Automated ecg classification using dual heartbeat coupling based on convolutional neural network,” IEEE Access, vol. 6, pp. 27 465–27 472, 2018.
 [42] W. Sun, N. Zeng, and Y. He, “Morphological arrhythmia automated diagnosis method using graylevel cooccurrence matrix enhanced convolutional neural network,” IEEE Access, vol. 7, pp. 67 123–67 129, 2019.
 [43] E. Izci, M. A. Ozdemir, M. Degirmenci, and A. Akan, “Cardiac arrhythmia detection from 2d ecg images by using deep learning technique,” in 2019 Medical Technologies Congress (TIPTEKNO). IEEE, 2019, pp. 1–4.
 [44] B. M. Mathunjwa, Y.T. Lin, C.H. Lin, M. F. Abbod, and J.S. Shieh, “Ecg arrhythmia classification by using a recurrence plot and convolutional neural network,” Biomedical Signal Processing and Control, vol. 64, p. 102262, 2021.
 [45] X. Fan, Q. Yao, Y. Cai, F. Miao, F. Sun, and Y. Li, “Multiscaled fusion of deep convolutional neural networks for screening atrial fibrillation from single lead short ecg recordings,” IEEE journal of biomedical and health informatics, vol. 22, no. 6, pp. 1744–1753, 2018.
 [46] R. Wang, J. Fan, and Y. Li, “Deep multiscale fusion neural network for multiclass arrhythmia detection,” IEEE Journal of Biomedical and Health Informatics, 2020.
 [47] F. Li, J. Wu, M. Jia, Z. Chen, and Y. Pu, “Automated heartbeat classification exploiting convolutional neural network with channelwise attention,” IEEE Access, vol. 7, pp. 122 955–122 963, 2019.
 [48] A. Uyar and F. Gurgen, “Arrhythmia classification using serial fusion of support vector machines and logistic regression,” in 2007 4th IEEE Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications. IEEE, 2007, pp. 560–565.
 [49] Y. Zhao, X. Yin, and Y. Xu, “Electrocardiograph (ecg) recognition based on graphical fusion with geometric algebra,” in 2017 4th International Conference on Information Science and Control Engineering (ICISCE). IEEE, 2017, pp. 1482–1486.
 [50] R. Wang, Q. Yao, X. Fan, and Y. Li, “Multiclass arrhythmia detection based on neural network with multistage features fusion,” in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC). IEEE, 2019, pp. 4082–4087.
 [51] N. Manshor, A. A. Halin, M. Rajeswari, and D. Ramachandram, “Feature selection via dimensionality reduction for object class recognition,” in 2011 2nd International Conference on Instrumentation, Communications, Information Technology, and Biomedical Engineering. IEEE, 2011, pp. 223–227.

 [52] Z. Wang and T. Oates, “Imaging time-series to improve classification and imputation,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
 [53] C.-L. Yang, Z.-X. Chen, and C.-Y. Yang, “Sensor classification using convolutional neural network by encoding multivariate time series as two-dimensional colored images,” Sensors, vol. 20, no. 1, p. 168, 2020.
 [54] J. Eckmann, S. O. Kamphorst, D. Ruelle et al., “Recurrence plots of dynamical systems,” World Scientific Series on Nonlinear Science Series A, vol. 16, pp. 441–446, 1995.
 [55] Recuplots and cnns for timeseries classification. [Online]. Available: https://www.kaggle.com/tigurius/recuplotsandcnnsfortimeseriesclassification
 [56] Z. Wang and T. Oates, “Encoding time series as images for visual inspection and classification using tiled convolutional neural networks,” in Workshops at the TwentyNinth AAAI Conference on Artificial Intelligence, 2015.

 [57] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [58] Z. Ahmad and N. Khan, “Cnn based multistage gated average fusion (mgaf) for human action recognition using depth and inertial sensors,” IEEE Sensors Journal, 2020.
 [59] H. B. Mitchell, Image fusion: theories, techniques and applications. Springer Science & Business Media, 2010.
 [60] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” circulation, vol. 101, no. 23, pp. e215–e220, 2000.
 [61] G. B. Moody and R. G. Mark, “The impact of the mitbih arrhythmia database,” IEEE Engineering in Medicine and Biology Magazine, vol. 20, no. 3, pp. 45–50, 2001.
 [62] R. Bousseljot, D. Kreiseler, and A. Schnabel, “Nutzung der ekgsignaldatenbank cardiodat der ptb über das internet,” Biomedizinische Technik/Biomedical Engineering, vol. 40, no. s1, pp. 317–318, 1995.
 [63] Ecg heartbeat categorization dataset. [Online]. Available: https://www.kaggle.com/shayanfazeli/heartbeat
 [64] Association for the Advancement of Medical Instrumentation, “Testing and reporting performance results of cardiac rhythm and st segment measurement algorithms,” ANSI/AAMI EC38, 1998.
 [65] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority oversampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
 [66] X. Xu, S. Jeong, and J. Li, “Interpretation of electrocardiogram (ecg) rhythm by combined cnn and bilstm,” IEEE Access, vol. 8, pp. 125 380–125 388, 2020.

 [67] R. He, Y. Liu, K. Wang, N. Zhao, Y. Yuan, Q. Li, and H. Zhang, “Automatic detection of qrs complexes using dual channels based on u-net and bidirectional long short-term memory,” IEEE Journal of Biomedical and Health Informatics, 2020.
 [68] F. Qiao, B. Li, Y. Zhang, H. Guo, W. Li, and S. Zhou, “A fast and accurate recognition of ecg signals based on elm-lrf and blstm algorithm,” IEEE Access, vol. 8, pp. 71 189–71 198, 2020.
 [69] J. Kojuri, R. Boostani, P. Dehghani, F. Nowroozipour, and N. Saki, “Prediction of acute myocardial infarction with artificial neural networks in patients with nondiagnostic electrocardiogram,” Journal of Cardiovascular Disease Research, vol. 6, no. 2, 2015.
 [70] Y. Cao, T. Wei, N. Lin, D. Zhang, and J. J. Rodrigues, “Multichannel lightweight convolutional neural network for remote myocardial infarction monitoring,” in 2020 IEEE Wireless Communications and Networking Conference Workshops (WCNCW). IEEE, 2020, pp. 1–6.
 [71] M. A. Ahamed, K. A. Hasan, K. F. Monowar, N. Mashnoor, and M. A. Hossain, “Ecg heartbeat classification using ensemble of efficient machine learning approaches on imbalanced datasets,” in 2020 2nd International Conference on Advanced Information and Communication Technology (ICAICT). IEEE, 2020, pp. 140–145.

 [72] E. Akbas and F. T. Y. Vural, “Automatic image annotation by ensemble of visual descriptors,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007, pp. 1–8.
 [73] Z. Ahmad and N. Khan, “Towards improved human action recognition using convolutional neural networks and multimodal fusion of depth and inertial sensor data,” in 2018 IEEE International Symposium on Multimedia (ISM). IEEE, 2018, pp. 223–230.