Year: 2021 | Volume: 10 | Issue: 3 | Pages: 279-284
Prediction of treatment failure of tuberculosis using support vector machine with genetic algorithm
Keethansana Kanesamoorthy, Maheshi B Dissanayake
Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Peradeniya, Sri Lanka
Date of Submission: 12-Jun-2021
Date of Decision: 04-Jul-2021
Date of Acceptance: 31-Jul-2021
Date of Web Publication: 03-Sep-2021
Maheshi B Dissanayake
Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Peradeniya
Source of Support: None, Conflict of Interest: None
Background: Tuberculosis (TB) is a disease that mainly affects the human lungs. It can be fatal if treatment is delayed. This study investigates the prediction of treatment failure in TB patients, focusing on the features that contribute most to drug resistance. Methods: Support vector machine (SVM) is a relatively novel classification model that has also shown promising performance in regression applications. Genetic algorithm (GA) is a method for solving optimization problems. We considered lifestyle and treatment preference-related data collected from TB-positive patients in Yangon, Myanmar, to obtain a clear picture of TB drug resistance. In this article, TB drug resistance is analyzed and modelled using an SVM classifier. GA is used to enhance the overall performance of SVM by selecting the 20 most suitable features from the full set of 35 features. Further, the performance of four different SVM kernels is investigated to obtain the best performance. Results: Once the model was trained with SVM and GA, we fed unseen data into the model to predict the treatment resistance of the patient. The results show that SVM with GA is capable of achieving 67% accuracy in predicting treatment resistance in unseen data with only 20 features. Conclusion: The findings would, in turn, assist in developing an effective TB treatment plan in the future based on patients' lifestyle choices and social settings. In addition, the model developed in this research can be generalized to predict the outcome of drug therapy for many diseases in the future.
Keywords: Classification, drug resistance, genetic algorithm, support vector machine, tuberculosis
How to cite this article:
Kanesamoorthy K, Dissanayake MB. Prediction of treatment failure of tuberculosis using support vector machine with genetic algorithm. Int J Mycobacteriol 2021;10:279-84
How to cite this URL:
Kanesamoorthy K, Dissanayake MB. Prediction of treatment failure of tuberculosis using support vector machine with genetic algorithm. Int J Mycobacteriol [serial online] 2021 [cited 2021 Dec 8];10:279-84. Available from: https://www.ijmyco.org/text.asp?2021/10/3/279/325494
Introduction
Tuberculosis (TB) is a pandemic infection which mostly attacks the human lungs. The disease is caused by the bacterium Mycobacterium tuberculosis, and it can spread to other parts of the body as well. According to the World Health Organization Global TB Report 2017, approximately 6.6 million new cases of TB are reported every year. However, treated correctly, TB is curable. Lifestyle choices and failure to adhere to a strict treatment plan are identified as key bottlenecks associated with treating TB, especially in developing countries. In this article, we investigate a machine learning (ML)-based mechanism for the identification and prediction of features that are associated with treatment failure of TB. In our research, we have used a publicly available dataset of 356 TB patients from ten public township health centers in Yangon, the largest city of Myanmar.
Feature selection is the process of removing features from the original dataset that have limited influence on the defined task. This in turn reduces the computational complexity of the modeling and helps the decision-making model to focus on the most important features of the dataset. In this article, a genetic algorithm (GA) is proposed to select appropriate features for the prediction. GA is a flexible and exploratory algorithm based on the evolutionary theory of natural genetics. The fitness function in GA is an objective function, which scores how close an estimated solution is to the optimum solution of the desired problem. The solutions with higher fitness scores are given a higher chance to mate and yield “fitter” output solutions over generations, until the system reaches a stopping criterion. In the context of optimization algorithms, GA holds a higher recognition than other heuristic algorithms, due to its capability to act as a universal optimizer for many different problems.
A support vector machine (SVM) model is a supervised ML classifier which uses a representation of the data as points in space. Once trained using labelled data associated with the given problem, it is capable of predicting the class of unseen data with high accuracy. SVM, which can be used in both regression and classification problems, is a popular choice among data analysts due to the high accuracy it provides at low computational cost. Moreover, SVM commonly employs four kernels, namely Linear, Radial Basis Function (RBF), Polynomial (poly), and Sigmoid. With the correct kernel settings, it is capable of performing nonlinear classification as well.
In our proposed model, after estimating the most influential features using GA, the performance of the four SVM kernels, namely Linear, RBF, Polynomial and Sigmoid, in predicting treatment failure in TB is analyzed as well. In predicting treatment failure, SVM assigns each new record to one of three classes: no symptoms (denoted as 0), drug resistance (denoted as 1), or worsening TB symptoms (denoted as 2).
Methods
Modern research on TB mainly focuses on two areas, namely early diagnosis of TB and prognosis (natural course of the disease) of TB. With the integration of ML and deep learning into the health-care sector, especially diagnosis, extensive research has primarily investigated the feasibility of TB diagnosis by means of radiography or microscopic images. Further, a number of studies focus on the prognosis of TB, transmission of TB in society, and diagnosis of TB through clinical history.
The first systematic review of prediction models for TB treatment outcome evaluated 37 prediction models from the literature. These models predict cured versus unfavorable outcomes, such as treatment failure and death. In another study, three ML models, namely SVM, Random Forest (RF) and Neural Network, were used to predict the treatment completion outcome. For a dataset of 4213 records, the RF model achieved the highest accuracy, 76.32%. Furthermore, a deep learning model based on long short-term memory (LSTM) was employed to predict unfavorable outcomes in a dataset of 16,975 patient records. The proposed LSTM Real-time Adherence Predictor model achieved an area under the ROC curve of 0.743.
Sauer et al. investigated the possibility of predicting treatment outcome for TB, based upon clinical and demographic data of patients with TB. A range of ML models such as stepwise forward selection, stepwise backward elimination, backward elimination and forward selection, Least Absolute Shrinkage and Selection Operator regression, RF, SVM with linear kernel and polynomial kernel were tested in this study. Furthermore, they had utilized a multi-country dataset, gathered from Azerbaijan, Belarus, Moldova, Georgia, and Romania, maintained at the National Institute of Allergy and Infectious Diseases for their study. The most promising model out of all the test cases was forward stepwise selection with accuracy of 0.74, although most models tested performed at or above accuracy 0.7.
Khan et al. analyzed health care-seeking behavior and treatment-related practices of TB patients in a fragile health system. Here they randomly selected adult smear-positive pulmonary TB patients diagnosed between September 2014 and March 2015 at ten public township health centers in Yangon, Myanmar. The findings indicate that it is essential to include treatment adherence support to reduce the risk of drug-resistant TB.
In our research, we adopt the database compiled by Khan et al. in the aforementioned study. We analyze the collected data and forecast drug resistance among TB patients using ML, to improve the patient and disease management process, especially in low-resource environments. Furthermore, GA is employed to select an appropriate feature subset from the large feature set.
In the research presented, we focus our attention on features which help researchers to predict the effectiveness of drugs in low-resource environments. In general, clinical history records consist of many parameters, which we will refer to as features hereafter, related to the lifestyle, drug history, and medical history of the patient. However, only a few of these features help the health-care sector to identify and predict the effectiveness of the drugs used to treat the disease. Accordingly, attention was given to identifying the features which are associated with drug resistance, as well as to predicting drug resistance based on these features.
We propose a GA-based feature selection for SVM classifier to improve the prediction accuracy of the treatment failure of TB disease. GA is an optimization algorithm. In this context, it selects the optimum feature subset for the classification purpose. By adopting SVM with GA, the dimensionality of feature vector space is reduced through eliminating the redundant and irrelevant features. Hence, SVM with GA will help to improve the prediction accuracy of the model.
In the training phase of the proposed algorithm, the best feature subset was selected using GA with classification accuracy of the detection set as the fitness function. During the testing phase, the best features (20 features to be exact) were selected and analyzed to predict the outcome of the treatment, as worsening of symptoms, no symptoms and drug resistance, using SVM classifier. [Figure 1] shows the block diagram of our proposed method.
Four of the most commonly used kernels in SVM are the linear, polynomial, sigmoid and RBF kernels. In this research, we have further analyzed the most suitable kernel for the classification task at hand, i.e., to predict the resistance to TB drugs, given the clinically recorded history and lifestyle features of the patient. To evaluate the performance of our proposed algorithm, we have used the dataset of 356 patients compiled by Khan et al., which has been stripped of all personally identifiable information. Next, the input dataset is divided into a training set and a testing set with a ratio of 75% to 25%, respectively.
Feature selection using genetic algorithm
GA is an optimization approach which follows the concept behind Charles Darwin's theory of natural evolution. In biology, organisms have to compete for limited resources to survive. In artificial evolution, GA operates on a population of individuals to produce the best fit individuals. [Figure 2] shows the steps of the feature selection using GA.
Figure 2: The procedure of the feature selection using genetic algorithm
The creation and initialization of the individuals in the population is the first step of GA. As GA is a stochastic optimization method, the individual's genes are initialized at random as shown in [Figure 3]. Each individual is represented by binary genes, one per candidate feature (six genes for six features in this illustration), where each gene indicates whether the corresponding feature is included in the model.
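The initialization step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code; the function name `init_population` and the population size of 4 are hypothetical choices made for the example.

```python
import random

def init_population(pop_size, n_features, seed=None):
    """Create pop_size individuals, each a list of n_features random 0/1 genes.

    A gene value of 1 means the corresponding feature is kept, 0 means dropped.
    """
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(pop_size)]

# Illustrative sizes only: 4 individuals with 6 genes each, as in the 6-feature example.
population = init_population(pop_size=4, n_features=6, seed=0)
for individual in population:
    print(individual)
```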
After the initialization, a fitness value is assigned to each individual in the population. This value indicates “how fit” a given individual is as a potential solution. Hence, individuals with a high fitness value have a higher probability of being selected for recombination.
After a fitness value is assigned, the selection operator of GA chooses some of the individuals to be recombined for the next generation. The individuals with a higher fitness level are more likely to be chosen for recombination, while the number of individuals selected is half of the population size. It should be noted that the selector may pick the same individual more than once during the reproduction stage.
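One common way to realize such a selector is fitness-proportional (roulette-wheel) sampling with replacement. The sketch below assumes that scheme, which the article does not name explicitly; `select_parents` and the example fitness values are hypothetical.

```python
import random

def select_parents(population, fitness, rng=None):
    """Pick half of the population, with replacement, weighted by fitness.

    Sampling with replacement means the same individual may be chosen
    more than once. Fitness values are assumed to be positive.
    """
    rng = rng or random.Random()
    n_selected = len(population) // 2
    return rng.choices(population, weights=fitness, k=n_selected)

population = [[1, 0], [0, 1], [1, 1], [0, 0]]
fitness = [0.6, 0.5, 0.7, 0.4]  # e.g. validation accuracies (made-up values)
parents = select_parents(population, fitness, random.Random(1))
print(parents)
```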
The crossover operator recombines and reproduces the selected individuals to generate a new population. As shown in [Figure 4], the operator selects two individuals at random and swaps and combines their features to create four offspring. This process is repeated until the new population reaches the same size as the parent population.
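The text does not specify how the four offspring are formed from two parents. One possible reading, shown here purely as an assumption, is uniform crossover applied twice, where each random mask yields a child and its mirror image; the function name `crossover` is hypothetical.

```python
import random

def crossover(parent_a, parent_b, rng=None):
    """Create four offspring from two parents via two rounds of uniform crossover.

    Each round draws a random mask; one child takes parent_a's gene where the
    mask is True and parent_b's otherwise, and the mirror child does the opposite.
    """
    rng = rng or random.Random()
    offspring = []
    for _ in range(2):
        mask = [rng.random() < 0.5 for _ in parent_a]
        child = [a if m else b for a, b, m in zip(parent_a, parent_b, mask)]
        mirror = [b if m else a for a, b, m in zip(parent_a, parent_b, mask)]
        offspring += [child, mirror]
    return offspring

kids = crossover([1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0], random.Random(0))
for kid in kids:
    print(kid)
```

With half of the population selected as parents, producing four offspring per pair restores the original population size, which is consistent with the procedure described above.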
Since the crossover operation tends to create a new generation with low diversity, the mutation operator applies random changes to some features of the offspring to maintain diversity in the new population. In general, mutation is applied with a very low probability. However, it helps the overall process to find a global optimum solution, as the selection and crossover operators on their own may cause the algorithm to converge quickly to a local optimum. [Figure 5] shows the mutation operation on one of the generated offspring.
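For binary genes, mutation is typically implemented as an independent bit flip per gene. The sketch below assumes that form; the default rate of 0.01 is an illustrative value, not one reported in the article.

```python
import random

def mutate(individual, rate=0.01, rng=None):
    """Flip each binary gene independently with probability `rate`."""
    rng = rng or random.Random()
    return [1 - gene if rng.random() < rate else gene for gene in individual]

original = [1, 0, 1, 1, 0, 0]
mutated = mutate(original, rate=0.2, rng=random.Random(3))
print(original, "->", mutated)
```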
In this article, GA was initialized using 35 individuals (35 features) and logistic regression accuracy was used as fitness function. The above steps, from 1 to 5, were repeated for ten generations, and finally, 20 features were selected from 35 input features.
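A fitness function of this kind can be sketched with scikit-learn as below. This is an assumption-laden illustration rather than the authors' implementation: the data are synthetic (generated with `make_classification`, not the Yangon dataset), the hold-out split and `max_iter` value are arbitrary, and the name `fitness` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the study data:
# 356 records, 35 features, 3 outcome classes. NOT the real patient data.
X, y = make_classification(n_samples=356, n_features=35, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

def fitness(mask):
    """Validation accuracy of logistic regression on the features where mask == 1."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0  # an individual that selects no features cannot classify
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:, cols], y_train)
    return model.score(X_val[:, cols], y_val)

rng = np.random.default_rng(0)
mask = rng.integers(0, 2, size=35)  # one random individual
score = fitness(mask)
print(f"{int(mask.sum())} features selected, fitness = {score:.2f}")
```

Evaluating this function for every individual in each generation supplies the scores that drive selection.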
Support vector machine classification
SVM is a popular supervised ML algorithm which can be used for classification tasks. It is one of the most robust prediction methods found in the ML literature and is built upon the statistical learning framework.
The basic steps of SVM algorithm for binary classification are as follows:
- Selecting the best two hyperplanes that separate the training data
- Maximizing the distance (gap) between these hyperplanes
- Mapping the new data (test data) to a class such that it sits on one side of the gap with maximum separation distance.
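The steps above can be illustrated on a toy problem. The sketch below uses scikit-learn's `SVC` with a linear kernel on four hand-made points; the data and the large `C` value (chosen to approximate a hard margin) are assumptions for illustration only. For a linear SVM with weight vector w, the gap between the two margin hyperplanes is 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes: x = 0 (class 0) and x = 2 (class 1).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                     # normal vector of the separating hyperplane
margin = 2.0 / np.linalg.norm(w)     # distance between the two margin hyperplanes
pred = clf.predict([[0.5, 0.5], [1.8, 0.2]])  # new points fall on either side of the gap

print("margin width:", round(float(margin), 3))
print("predictions:", pred)
```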
Initially, the dataset was split into training and testing sets, with a ratio of 75%:25% of the entire dataset, respectively. Then, the values of the features were standardized to zero mean and unit standard deviation, so that all features are on a comparable scale for the SVM model. Once the SVM model was created, it was trained and fitted using labelled data in the training set to learn the best decision boundary for the classification task. In addition to performing linear classification, SVMs can efficiently perform nonlinear classification as well. The Linear, Polynomial, Sigmoid, and RBF kernels of SVM differ in the shape of the decision boundary they construct between the classes.
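The split and standardization can be sketched with scikit-learn as follows. The data are synthetic placeholders, and two choices here are assumptions not stated in the article: the split is stratified by class, and the scaler is fitted on the training set only so that no test-set statistics leak into training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 356 records, 20 selected features, 3 classes.
X, y = make_classification(n_samples=356, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# 75% training, 25% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Learn mean and standard deviation on the training data, then apply to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
```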
In this article, the SVM model was iterated through these four kernels, and the accuracies were compared to obtain the best decision boundary. For each of these kernels, a new SVM was instantiated and fitted with data in the Python environment during the simulations.
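A kernel comparison of this kind might look like the sketch below. It assumes scikit-learn's `SVC` with default hyperparameters and synthetic data, so the resulting accuracies say nothing about the values reported in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 20 GA-selected features (not the study data).
X, y = make_classification(n_samples=356, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

accuracy = {}
for kernel in ("linear", "rbf", "poly", "sigmoid"):
    # A fresh scaler + SVM is instantiated and fitted for every kernel.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    accuracy[kernel] = model.score(X_test, y_test)

best_kernel = max(accuracy, key=accuracy.get)
for kernel, acc in accuracy.items():
    print(f"{kernel:8s} accuracy = {acc:.2f}")
print("best kernel on this synthetic data:", best_kernel)
```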
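For a comprehensive performance evaluation of the proposed model, accuracy, precision, recall and F1-score are employed. In terms of the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), the accuracy of the SVM classifier is defined as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
whereas precision and recall are defined as
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.$$
The F1-score is the harmonic mean of precision and recall, which is dominated by the weaker of the two components:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$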
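As a worked illustration, the four metrics can be computed from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) using their standard definitions. The counts below are made up for the example and are not results from this study; the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for a single class treated as "positive".
m = classification_metrics(tp=40, tn=30, fp=10, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

For a three-class problem such as the one considered here, these quantities are usually computed per class (one class versus the rest) and then averaged.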
Results
[Figure 6] shows the variation of validation classification accuracy when we move from the worst feature subset to the best feature subset. Based on [Figure 6], the 20-feature subset, which yielded the highest classification accuracy, was selected. [Table 1] lists the optimum 20 features selected by the GA.
Figure 6: The variation of validation set classification accuracy with continuum
[Figure 7] illustrates the final classification accuracy of each kernel with and without GA-based feature selection. The overall accuracy of the SVM classifier was obtained by comparing the predicted output against the ground truth label. According to the results, the linear and RBF kernels outperform the polynomial and sigmoid kernels by a clear margin when all 35 features are used for the classification task. Using GA, we were able to reduce the effective feature count to 20. According to [Figure 7], GA improved the performance of the polynomial and linear kernels, while the sigmoid and RBF kernels showed only a very small improvement in classification accuracy. Furthermore, we tested the accuracy of the SVM kernels with 25 features and 15 features selected through GA as well. Among the subset sizes tested, selecting 20 features gave the best classification performance.
Figure 7: The variation of classification accuracy of SVM kernels with and without Genetic algorithm
Although classification accuracy is used heavily in the ML domain for performance evaluation, it alone does not give a complete picture of model performance. [Table 2] presents the classification report of the designed model, in terms of the accuracy, precision, recall, and F1-score for all four kernels. It was found that the sigmoid and polynomial kernels have lower values for precision and recall than the other kernels. When comparing the classification accuracy of these four kernels, the RBF and linear kernels found a decision boundary that classified almost 67% of the data correctly.
Table 2: The variation of accuracy, precision, recall, and F1 score with support vector machine kernels (output values were rounded to two decimal places)
Hence, as an overall observation, it is concluded that both the RBF and linear kernels with GA are more suitable for the drug resistance prediction task than the other two SVM kernels.
Discussion
This study aims to identify the key clinical features which affect drug resistance. [Table 1] shows the selected optimum subset of 20 features, which helps to understand drug resistance in TB. Accordingly, medical history such as HIV and diabetes, the frequency of missed doses, and belief in traditional healers impact the outcome of TB drug treatment. In the simulation run that selected 15 influential features, the lifestyle-related features “drink, smoke, HIV, diabetes, alcohol” were dropped after the GA stage. According to [Figure 7], with 15 features, the SVM shows poor accuracy with all tested kernels. Hence, it is essential to monitor lifestyle choices as well for accurate prediction and analysis of TB drug resistance.
Keane et al. carried out the first clinical and demographic data analysis to predict TB treatment failure. Another study reported that male sex and alcoholism play a major role in treatment failure. A further study established that smoking and age above 60 are other causes of an unresponsive outcome to the treatment, while yet another established that being male, family history of TB, and household size contributed to the cause. It should be noted that most studies reported different features as the key factors for treatment failure. These differences may be a result of the considerable heterogeneity present among the databases in terms of the features collected at clinical history recording, the demographic spread of the patients, divergent investigations, and the variation in the ML tools adopted.
Hence, to reach a generalized conclusion, the number of patients in the dataset, the types of features recorded, and the demographic spread of the dataset should be considerably expanded. Furthermore, the generated dataset should be analyzed in depth using a wider range of ML tools and optimization algorithms before drawing final conclusions on the factors causing drug resistance in TB. This would, in turn, pave the way to designing an effective TB treatment plan, especially for health systems running under restricted resources, such as those of developing countries. In addition, a model similar to the one proposed can be adopted in such restricted health systems to treat other infectious diseases which have high recovery and prevention possibilities through effective treatment.
Conclusion
This article proposes a TB drug resistance prediction model using SVM with GA for optimum feature selection from a clinically collected dataset. The performance of four different SVM kernels was analyzed for TB treatment failure prediction using the clinical history of the patient. The results show that the SVM classifier achieves a 6%–12% improvement in detection rate (with the polynomial and linear kernels) after selecting the top 20 features using GA. Furthermore, the accuracy, precision, recall, and F1-score values of the linear, RBF, polynomial and sigmoid kernels were compared. The RBF and linear kernels show a better prediction rate. In future work, different objective functions will be tested as the fitness function for GA to select the optimum features for SVM classification in detecting treatment failure. We also plan to generalize the findings by employing a larger dataset spanning a larger geographical area. In addition, the authors propose to add more features related to choices of drugs and lifestyle to the dataset and to investigate further the factors behind drug resistance in TB. This would, in turn, assist in developing an effective TB treatment plan in the future. Moreover, the model developed in this research can be generalized to predict the outcome of drug therapy for many diseases, given the relevant dataset and the SVM-GA model training proposed.
For the research presented we have utilized the freely available data produced by Khan et al. in M. S. Khan, C. Hutchison, R. J. Coker. Risk factors that may be driving the emergence of drug resistance in TB patients treated in Yangon, Myanmar, PLoS ONE 12(6): E0177999, 2017.
Ethical clearance was obtained from LSHTM Research Ethics Committee, the FHI 360 Protection of Human Subjects Committee and the Myanmar Ministry of Health. Written informed consent was sought from all participants.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
References
Khan MS, Hutchison C, Coker RJ. Risk factors that may be driving the emergence of drug resistance in tuberculosis patients treated in Yangon, Myanmar. PLoS One 2017;12:e0177999.
Mohammad G, Rezaeipanah A. “Intrusion Detection System Using Genetic Algorithm and Data Mining Techniques Based on the Reduction Features.” International Journal of Computer Applications Technology and Research 2016;6:461-6.
Mardle S, Pascoe S, Tamiz M. An investigation of genetic algorithms for the optimization of multi-objective fisheries bioeconomic models. International Transactions in Operational Research 2000;7:33-49.
Mccall J. Genetic algorithms for modelling and optimisation. Journal of Computational and Applied Mathematics 2005;184:205-22.
Katoch S, Chauhan SS, Kumar V. A review on genetic algorithm: Past, present, and future. Multimed Tools Appl 2021;80:8091-126.
Man KF, Tang KS, Kwong S. Genetic algorithms: Concepts and applications, IEEE transactions on Industrial Electronics, 1996;43:519-534.
Tharwat A. Parameter investigation of support vector machine classifier with kernel functions. Knowl Inf Syst 2019;61:1269-302.
Zhang Y. Support Vector Machine Classification Algorithm and Its Application. In: Liu C, Wang L, Yang A. (eds) Information Computing and Applications. ICICA 2012. Communications in Computer and Information Science, Springer, Berlin, Heidelberg; 2012;308.
Ghosh S, Dasgupta A, Swetapadma A. A Study on Support Vector Machine based Linear and Non-Linear Pattern Classification, 2019 International Conference on Intelligent Sustainable Systems (ICISS), 2019. p. 24-28.
Srivastava DK, Bhanbhu L. Data classification using support vector machine, J Theor Appl Inf Technol 2010;12:1-7.
Evgeniou T, Pontil M. Support Vector Machines: Theory and Applications. In Machine Learning and Its Applications, Advanced Lectures. Springer-Verlag, Berlin, Heidelberg, 2001. p. 249-57.
Lakhani P, Sundaram B. Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology 2017;284:574-82.
Rajaraman S, Candemir S, Xue Z, Alderson PO, Kohli M, Abuya J, et al. A Novel Stacked Generalization of Models for Improved TB Detection in Chest Radiographs. In Proceedings of the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, USA, 18–21 July, 2018. p. 718-21.
Hooda R, Sofat S, Kaur S, Mittal A, Meriaudeau F. Deep-Learning: A Potential Method for Tuberculosis Detection Using Chest Radiography. In Proceedings of the 2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), Kuching, Malaysia, 12–14 September, 2017. p. 497-502.
Dasanayaka C, Dissanayake MB. Deep learning methods for screening pulmonary tuberculosis using chest X-rays. Comput Methods Biomech Biomed Eng Imaging Vis 2021;9:39-49.
Sethi K, Parmar V, Suri M. Low-Power Hardware-Based Deep-Learning Diagnostics Support Case Study. In Proceedings of the 2018 IEEE Biomedical Circuits and Systems Conference (BioCAS), Cleveland, OH, USA, 17–19 October, 2018. p. 1-4.
Kant S, Srivastava MM. Towards Automated Tuberculosis Detection Using Deep Learning. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November, 2018. p. 1250-3.
Peetluk LS, Ridolfi FM, Rebeiro PF, Liu D, Rolla VC, Sterling TR. Systematic review of prediction models for pulmonary tuberculosis treatment outcomes in adults. BMJ Open 2021;11:e044687.
Jamal S, Khubaib M, Gangwar R, Grover S, Grover A, Hasnain SE. Artificial Intelligence and Machine learning based prediction of resistant and susceptible mutations in Mycobacterium tuberculosis. Sci Rep 2020;10:5487.
Wang S. Development of a Predictive Model of Tuberculosis Transmission among Household Contacts, Canadian Journal of Infectious Diseases and Medical Microbiology 2019;2019:7.
Zheng Y, Zhang L, Wang L, Rifhat R. Statistical methods for predicting tuberculosis incidence based on data from Guangxi, China. BMC Infect Dis 2020;20:300.
Goni I. Machine learning algorithm applied for predicting the presence of Mycobacterium tuberculosis. Int J Clin Dermatol 2020;3:4-7.
Tiwari A, Maji S. Machine Learning Techniques for Tuberculosis Prediction, International Conference on Advances in Engineering Science Management & Technology (ICAESMT); 2019.
Hussain OA, Junejo KN. Predicting treatment outcome of drug-susceptible tuberculosis patients using machine-learning models. Inform Health Soc Care 2019;44:135-51.
Killian JA, Wilder B, Sharma A, Choudhary V, Dilkina B, Tambe M. Learning to Prescribe Interventions for Tuberculosis Patients Using Digital Adherence Data. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August, 2019. p. 2430-8.
Sauer CM, Sasson D, Paik KE, McCague N, Celi LA, Sánchez Fernandez I. Feature selection and prediction of treatment failure in tuberculosis. PLoS One 2018;13:e0207491.
Keane VP, de Klerk N, Krieng T, Hammond G, Musk AW. Risk factors for the development of nonresponse to first-line treatment for tuberculosis in southern Vietnam. Int J Epidemiol 1997;26:1115-20.
Santha T, Garg R, Frieden TR, Chandrasekaran V, Subramani R, Gopi PG, et al. Risk factors associated with default, failure and death among tuberculosis patients treated in a DOTS programme in Tiruvallur District, South India, 2000. Int J Tuberc Lung Dis 2002;6:780-8.
Huang Q, Yin Y, Kuai S, Yan Y, Liu J, Zhang Y, et al. The value of initial cavitation to predict re-treatment with pulmonary tuberculosis. Eur J Med Res 2016;21:20.
Mohammadzadeh KA, Ghayoomi A, Maghsoudloo D. Evaluation of factors associated with failure of tuberculosis treatment under DOTS in northern Islamic Republic of Iran. East Mediterr Health J 2016;22:87-94.