Browsing by Author "Thabtah, Fadi"
Item Associative classification approaches : review and comparison (World Scientific Publishing Co. Pte Ltd, 2014)
Abdelhamid, Neda; Thabtah, Fadi
Associative classification (AC) is a promising data mining approach that integrates classification and association rule discovery to build classification models (classifiers). In the last decade, several AC algorithms have been proposed, such as Classification based Association (CBA), Classification based on Predicted Association Rule (CPAR), Multi-class Classification using Association Rule (MCAR), Live and Let Live (L3) and others. These algorithms use different procedures for rule learning, rule sorting, rule pruning, classifier building and class allocation for test cases. This paper sheds light on and critically compares common AC algorithms with reference to the abovementioned procedures. Moreover, data representation formats in AC mining are discussed along with potential new research directions. © 2014 World Scientific Publishing Co.

Item Associative classification common research challenges (Institute of Electrical and Electronics Engineers Inc., 2016)
Abdelhamid, Neda; Jabbar, Ahmad Abdul; Thabtah, Fadi
Association rule mining involves discovering concealed correlations among variables, often from sales transactions, to help managers in key business decisions involving item shelving, sales and planning. In the last decade, association rule mining methods have been employed in deriving rules from classification datasets in different business domains. This has resulted in the emergence of a new classification approach called Associative Classification (AC), which often produces more predictive classifiers than classic approaches such as decision trees, greedy and rule induction. Nevertheless, AC suffers from noticeable challenges, some of which have been inherited from association rules and others of which result from the classifier-building phase.
These challenges include, but are not limited to, the massive numbers of candidate ruleitems found, the very large classifiers derived, the inability to handle multi-label datasets, and the design of rule pruning, ranking and prediction procedures. This article highlights and critically analyzes common challenges faced by AC algorithms that still persist. Hence, it opens the door for interested researchers to further investigate these challenges, hoping to enhance the overall performance of this approach and increase its applicability in research domains. © 2016 IEEE.

Item Autism screening: an unsupervised machine learning approach (Springer, 2022-12)
Thabtah, Fadi; Spencer, Robinson; Abdelhamid, Neda; Kamalov, Firuz; Wentzel, Carl; Ye, Yongsheng; Dayara, Thanu

Item Autocorrelation for time series with linear trend (Institute of Electrical and Electronics Engineers Inc., 2021-09-29)
Kamalov, Firuz; Thabtah, Fadi; Gurrib, Ikhlaas
The autocorrelation function (ACF) is a fundamental concept in time series analysis, including financial forecasting. In this note, we investigate the properties of the sample ACF for a time series with linear trend. In particular, we show that the sample ACF of the time series approaches 1 for all lags as the number of time steps increases. The theoretical results are supported by numerical experiments. Our result helps researchers better understand the ACF patterns and make correct ARMA selection. © 2021 IEEE.

Item Autoregressive and neural network models: A comparative study with linearly lagged series (Institute of Electrical and Electronics Engineers Inc., 2021-09-29)
Kamalov, Firuz; Gurrib, Ikhlaas; Thabtah, Fadi
Time series analysis such as stock price forecasting is an important part of financial research. In this regard, autoregressive (AR) and neural network (NN) models offer contrasting approaches to time series modeling.
Although AR models remain widely used, NN models and their variant long short-term memory (LSTM) networks have grown in popularity. In this paper, we compare the performance of AR, NN, and LSTM models in forecasting linearly lagged time series. To test the models we carry out extensive numerical experiments based on simulated data. The results of the experiments reveal that despite the inherent advantage of AR models in modeling linearly lagged data, NN models perform just as well as, if not better than, AR models. Furthermore, the NN models outperform LSTMs on the same data. We find that a simple multi-layer perceptron can achieve highly accurate out-of-sample forecasts. The study shows that NN models perform well even in the case of linearly lagged time series. © 2021 IEEE.

Item A clustering approach for autistic trait classification (Taylor and Francis Ltd, 2020-07-02)
Baadel, Said; Thabtah, Fadi; Lu, Joan
Machine learning (ML) techniques can be utilized by physicians, clinicians, as well as other users, to discover Autism Spectrum Disorder (ASD) symptoms based on historical cases and controls to enhance autism screening efficiency and accuracy. The aim of this study is to improve the performance of detecting ASD traits by reducing data dimensionality and eliminating redundancy in the autism dataset. To achieve this, a new semi-supervised ML framework approach called Clustering-based Autistic Trait Classification (CATC) is proposed that uses a clustering technique and that validates classifiers using classification techniques. The proposed method identifies potential autism cases based on their similarity traits as opposed to a scoring function used by many ASD screening tools. Empirical results on different datasets involving children, adolescents, and adults were verified and compared to other common machine learning classification techniques.
The results showed that CATC offers classifiers with higher predictive accuracy, sensitivity, and specificity rates than those of other intelligent classification approaches such as Artificial Neural Network (ANN), Random Forest, Random Trees, and Rule Induction. These classifiers are useful as they can be exploited by diagnosticians and other stakeholders involved in ASD screening. © 2020 Taylor & Francis Group, LLC.

Item Cybersecurity awareness: A critical analysis of education and law enforcement methods (Slovene Society Informatika, 2021)
Baadel, Said; Thabtah, Fadi; Lu, Joan
According to the international Anti-Phishing Working Group (APWG), phishing activities have abruptly risen over the last few years, and users are becoming more susceptible to online and mobile fraud. Machine Learning techniques have potential for building technical anti-phishing models, with a handful already implemented in the real time environment. However, the majority of them have yet to be applied in a real time environment and require domain experts to interpret the results. This gives conventional techniques a vital role as supportive tools for a wider audience, especially novice users. This paper reviews in depth common phishing countermeasures including legislation, law enforcement, hands-on training, and education, among others. A complete prevention layer based on the aforementioned approaches is suggested to increase awareness and report phishing to different stakeholders, including organizations, novice users, researchers, and computer security experts. Therefore, these stakeholders can understand the upsides and downsides of the current conventional approaches and the ways forward for improving them. © 2021 Slovene Society Informatika. 
All rights reserved.

Item Data imbalance in classification : experimental evaluation (Elsevier Inc., 2020-03)
Thabtah, Fadi; Hammoud, Suhel; Kamalov, Firuz; Gonsalves, Amanda

Item A dynamic rule-induction method for classification in data mining (Taylor and Francis Ltd., 2015)
Qabajeh, Issa; Chiclana, Francisco; Thabtah, Fadi
Rule induction (RI) produces classifiers containing simple yet effective ‘If–Then’ rules for decision makers. RI algorithms, normally based on PRISM, suffer from a few drawbacks mainly related to rule pruning and rule-sharing items (attribute values) in the training data instances. In response to the above two issues, a new dynamic rule induction (DRI) method is proposed. Whenever a rule is produced and its related training data instances are discarded, DRI updates the frequency of attribute values that are used to make the next in-line rule to reflect the data deletion. Therefore, the attribute value frequencies are dynamically adjusted each time a rule is generated rather than statically as in PRISM. This enables DRI to generate near perfect rules and realistic classifiers. Experimental results using different University of California Irvine data sets show competitive performance with regard to error rate and classifier size of DRI when compared to other RI algorithms. © 2015 Antai College of Economics and Management, Shanghai Jiao Tong University.

Item Feature Selection in Imbalanced Data (Springer Science and Business Media Deutschland GmbH, 2023-12)
Kamalov, Firuz; Thabtah, Fadi; Leung, Ho Hon
The traditional feature selection methods are not suitable for imbalanced data as they tend to be biased towards the majority class. This problem is particularly acute in the field of medical diagnostics and fraud detection where the class distribution is highly skewed. In this paper, we propose a novel filter approach using decision tree-based F1-score.
The F1-score incorporates the accuracy with respect to the minority class data and hence is a good measure in the case of imbalanced data. In the proposed implementation, the F1-score is calculated based on a 1-dimensional decision tree classifier resulting in a fast and effective feature evaluation method. Numerical experiments confirm that the proposed method achieves robust dimensionality reduction and accuracy results. In addition, the low computational complexity of the algorithm makes it a practical choice for big data applications. © 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.

Item A Feature Selection Method Based on Ranked Vector Scores of Features for Classification (Springer Science and Business Media Deutschland GmbH, 2017-12-01)
Kamalov, Firuz; Thabtah, Fadi
One of the major aspects of any classification process is selecting the relevant set of features to be used in a classification algorithm. This initial step in data analysis is called the feature selection process. Disposing of the irrelevant features from the dataset will reduce the complexity of the classification task and will increase the robustness of the decision rules when applied on the test set. This paper proposes a new filtering method that combines and normalizes the scores of three major feature selection methods: information gain, chi-squared statistic and inter-correlation. Our method utilizes the strengths of each of the aforementioned methods to maximum advantage while avoiding their drawbacks, especially the disparity of the results produced by these methods. Our filtering method stabilizes each variable score and gives it the true rank among the input data’s available variables. Hence it maximizes the stability in the variables’ scores without losing the overall accuracy of the predictive model.
A number of experiments on different datasets from various domains have shown that features chosen by the proposed method are highly predictive when compared with features selected by other existing filtering methods. The evaluation of the filtering phase was conducted via thorough experimentation using a number of predictive classification algorithms in addition to statistical analysis of the filtering methods’ scores. © 2017, Springer-Verlag GmbH Germany.

Item Least Loss: A simplified filter method for feature selection (2020-09)
Thabtah, Fadi; Kamalov, Firuz; Hammoud, Suhel; Shahamiri, Seyed Reza

Item Machine learning applications for COVID-19: a state-of-the-art review (Elsevier, 2022-01-01)
Kamalov, Firuz; Cherukuri, Aswani Kumar; Sulieman, Hana; Thabtah, Fadi

Item Machine learning applications to Covid-19: a state-of-the-art survey (Institute of Electrical and Electronics Engineers Inc., 2022)
Kamalov, Firuz; Cherukuri, Aswani Kumar; Thabtah, Fadi
There exists a large and rapidly growing body of literature related to applications of machine learning to Covid-19. Given the substantial volume of research, there is a need to organize and categorize the literature. In this paper, we provide the most up-to-date review as of the beginning of 2022. We propose an application-based taxonomy to group the existing literature and provide an analysis of the research in each category. We discuss the progress as well as the pitfalls of the existing research, and propose keys for improvement. © 2022 IEEE.

Item MCOKE: Multi-Cluster Overlapping K-Means Extension Algorithm (World Academy of Science, Engineering and Technology, 2015)
Baadel, Said; Thabtah, Fadi; Lu, Joan
Clustering involves the partitioning of n objects into k clusters. Many clustering algorithms use hard-partitioning techniques where each object is assigned to one cluster. In this paper we propose an overlapping algorithm MCOKE which allows objects to belong to one or more clusters.
The algorithm is different from fuzzy clustering techniques because objects that overlap are assigned a membership value of 1 (one) as opposed to a fuzzy membership degree. The algorithm is also different from other overlapping algorithms that require a similarity threshold be defined a priori, which can be difficult to determine by novice users.

Item Modeling discrete-time analytical models based on random early detection : exponential and linear (World Scientific Publishing Co. Pte Ltd, 2015)
Abdel-Jaber, Hussein; Thabtah, Fadi; Woodward, Mike
Congestion control is among the primary topics in computer networks, and the random early detection (RED) method is one of its common techniques. Nevertheless, RED suffers from drawbacks, in particular when its "average queue length" is set below the buffer's "minimum threshold" position, which makes the router buffer quickly overflow. To deal with this issue, this paper proposes two discrete-time queue analytical models that aim to utilize an instant queue length parameter as a congestion measure. This assigns mean queue length (mql) and average queueing delay smaller values than those for RED and eventually reduces buffer overflow. A comparison between RED and the proposed analytical models was conducted to identify the model that offers better performance. The proposed models outperform the classic RED with regard to mql and average queueing delay measures when congestion exists. This work also compares one of the proposed models (RED-Linear) with another analytical model named threshold-based linear reduction of arrival rate (TLRAR). The results of the mql, average queueing delay and the probability of packet loss for TLRAR deteriorate when heavy congestion occurs, whereas the results of our RED-Linear were not impacted, and this shows the superiority of our model. © 2015 World Scientific Publishing Company.

Item Mr-arm : a map-reduce association rule mining framework (World Scientific Publishing Co. 
Pte Ltd, 2013)
Thabtah, Fadi; Hammoud, Suhel
Association rule mining is one of the primary tasks in data mining; it discovers correlations among items in a transactional database. The majority of vertical and horizontal association rule mining algorithms have been developed to improve the frequent items discovery step, which places high demands on training time and memory usage, particularly when the input database is very large. In this paper, we overcome the problem of mining very large data by proposing a new parallel Map-Reduce (MR) association rule mining technique called MR-ARM that uses a hybrid data transformation format to quickly find frequent items and generate rules. The MR programming paradigm is becoming popular for large scale data intensive distributed applications due to its efficiency, simplicity and ease of use, and therefore the proposed algorithm develops a fast parallel distributed batch set intersection method for finding frequent items. Two implementations (Weka, Hadoop) of the proposed MR association rule algorithm have been developed and a number of experiments against small, medium and large data collections have been conducted. The ground bases of the comparisons are time required by the algorithm for: data initialisation, frequent items discovery, rule generation, etc. The results show that MR-ARM is a very useful tool for mining association rules from large datasets in a distributed environment. © 2013 World Scientific Publishing Company.

Item A new computational intelligence approach to detect autistic features for autism screening (Elsevier Ireland Ltd, 2018-09)
Thabtah, Fadi; Kamalov, Firuz; Rajab, Khairan

Item OMCOKE: A Machine Learning Outlier-based Overlapping Clustering Technique for Multi-Label Data Analysis (Slovene Society Informatika, 2022-11)
Baadel, Said; Thabtah, Fadi; Lu, Joan; Harguem, Saida
Clustering is one of the challenging machine learning techniques due to its unsupervised learning nature.
While many clustering algorithms constrain objects to single clusters, K-means overlapping partitioning clustering methods assign objects to multiple clusters by relaxing the constraints and allowing objects to belong to more than one cluster to better fit hidden structures in the data. However, when datasets contain outliers, they can significantly influence the mean distance of the data objects to their respective clusters, which is a drawback. Therefore, most researchers address this problem by simply removing the outliers. This can be problematic, especially in applications such as fraud detection or cybersecurity attack risk analysis. In this study, an alternative solution to this problem is proposed that captures outliers and stores them on-the-fly within a new cluster, instead of discarding them. The new algorithm is named Outlier-based Multi-Cluster Overlapping K-Means Extension (OMCOKE). Empirical results on real-life multi-label datasets were derived to compare OMCOKE’s performance with other common overlapping clustering techniques. The results show that OMCOKE produced a better precision rate compared to the considered clustering algorithms. This method can benefit various stakeholders as these outliers could have real-life applications in cybersecurity, fraud detection, and the anti-phishing of websites. © 2022 Slovene Society Informatika. All rights reserved.

Item Parallel associative classification data mining frameworks based mapreduce (World Scientific Publishing Co. Pte Ltd, 2015-06)
Thabtah, Fadi; Hammoud, Suhel; Abdel-Jaber, Hussein
Associative classification (AC) is a research topic that integrates association rules with classification in data mining to build classifiers. After dissemination of the Classification-based Association Rule algorithm (CBA), the majority of its successors have been developed to improve either CBA's prediction accuracy or the search for frequent ruleitems in the rule discovery step.
Both of these steps place high demands on processing time and memory, especially in cases of large training data sets or a low minimum support threshold value. In this paper, we overcome the problem of mining large training data sets by proposing a new learning method that repeatedly transforms data between line and item spaces to quickly discover frequent ruleitems, generate rules, and subsequently rank and prune rules. This new learning method has been implemented in a parallel Map-Reduce (MR) algorithm called MRMCAR, which can be considered the first parallel AC algorithm in the literature. The new learning method can be utilised in the different steps within any AC or association rule mining algorithm, and it scales well compared with current horizontal or vertical methods. Two versions of the learning method (Weka, Hadoop) have been implemented and a number of experiments against different data sets have been conducted. The ground bases of the comparisons are classification accuracy and time required by the algorithm for data initialization, frequent ruleitems discovery, rule generation and rule pruning. The results reveal that MRMCAR is superior to both current AC mining algorithms and rule based classification algorithms in improving the classification performance with respect to accuracy. © 2015 World Scientific Publishing Company.