Using Machine Learning Algorithms in Determining the Stage of Breast Cancer from Pathology Reports

Shirin Samadzad-Qushchi, Parinaz Eskandarian, Zahra Niazkhani, Ali Rashidi, Habibollah Pirnejad



Introduction: After a cancer diagnosis, the most important step is to determine the stage and grade of the cancer. Pathology reports are the main source for cancer staging, but they do not contain all the information needed for staging. In some settings, however, the text of these reports is the only information available. We investigated whether text mining methods can predict staging from pathology reports alone.

Material and Methods: A total of 698 pathology reports of breast cancer cases, together with their TNM staging, were collected from multiple centers in West Azerbaijan Province, Iran. After the semi-structured reports were prepared, their text was imported into a program written in Python 3. Three machine learning algorithms (Logistic Regression, support vector machine (SVM), and Naïve Bayes) were applied in a simple text mining pipeline. The performance of the algorithms was evaluated in terms of accuracy, precision, recall, and F1 score.
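To illustrate the kind of classifier named above, the following is a minimal sketch of a bag-of-words Naïve Bayes text classifier in plain Python. It is not the study's program: the whitespace tokenizer, the add-one smoothing, the class name `MultinomialNaiveBayes`, the toy report snippets, and the labels "early" and "advanced" are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return text.lower().split()


class MultinomialNaiveBayes:
    """Hypothetical bag-of-words Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # number of documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokenize(text):
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy snippets and labels; not taken from the study's reports.
train_texts = [
    "tumor size 1.5 cm lymph nodes negative",
    "small tumor no nodal involvement",
    "tumor size 6 cm multiple lymph nodes positive",
    "large tumor extensive nodal metastasis positive nodes",
]
train_labels = ["early", "early", "advanced", "advanced"]

model = MultinomialNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("large tumor with positive lymph nodes")
print(prediction)
```

The model multiplies per-word likelihoods under a conditional independence assumption, which keeps training to a single counting pass over the reports.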

Results: The Naïve Bayes algorithm achieved values above 91% on all evaluation criteria (accuracy, precision, recall, and F1 score), indicating that it classified the reports reliably. It outperformed SVM and Logistic Regression in terms of accuracy, recall, and F1 score. In addition, because it is a simpler model, Naïve Bayes required less training time and computation and showed faster inference.
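The four evaluation criteria can be made concrete with a short worked example. The sketch below computes accuracy together with macro-averaged precision, recall, and F1 score for a multi-class problem. The abstract does not state which averaging scheme was used, so macro averaging is an assumption here, as are the function name `macro_metrics` and the invented stage labels.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Invented labels for six hypothetical reports; not the study's data.
y_true = ["I", "I", "II", "II", "III", "III"]
y_pred = ["I", "II", "II", "II", "III", "I"]

metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

Macro averaging weights every stage equally, so a rare stage that is classified poorly lowers the score as much as a common one would.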

Conclusion: We suggest using the design proposed in this study to predict breast cancer staging where staging is needed but pathology reports are the only information available. This method may not be useful for the clinical management of cancer patients, but it can be safely used for epidemiological estimations.


Keywords: Breast Cancer; Pathology Reports; Text Mining; NLP; TNM Stage; Machine Learning




