A large language model diagnoses tuberculous pleural effusion in patients with pleural effusion through clinical feature landscapes

Abstract

Background

Tuberculous pleural effusion (TPE) is a challenging extrapulmonary manifestation of tuberculosis, with traditional diagnostic methods often involving invasive surgery and being time-consuming. While various machine learning and statistical models have been proposed for TPE diagnosis, these methods are typically limited by complexities in data processing and difficulties in feature integration. Therefore, this study aims to develop a diagnostic model for TPE using ChatGPT-4, a large language model (LLM), and compare its performance with traditional logistic regression and machine learning models. By highlighting the advantages of LLMs in handling complex clinical data, identifying interrelationships between features, and improving diagnostic accuracy, this study seeks to provide a more efficient and precise solution for the early diagnosis of TPE.

Methods

We conducted a cross-sectional study, collecting clinical data from 109 TPE and 54 non-TPE patients for analysis and selecting 73 features from over 600 initial variables. The performance of the LLM was compared with that of logistic regression and machine learning models (k-Nearest Neighbors, Random Forest, Support Vector Machines) using metrics such as area under the curve (AUC), F1 score, sensitivity, and specificity.

Results

The LLM showed comparable performance to machine learning models, outperforming logistic regression in sensitivity, specificity, and overall diagnostic accuracy. Key features such as adenosine deaminase (ADA) levels and monocyte percentage were effectively integrated into the model. We also developed a Python package (https://pypi.org/project/tpeai/) for rapid TPE diagnosis based on clinical data.

Conclusions

The LLM-based model offers a non-surgical, accurate, and cost-effective method for early TPE diagnosis. The Python package provides a user-friendly tool for clinicians, with potential for broader use. Further validation in larger datasets is needed to optimize the model for clinical application.

Introduction

Tuberculous pleural effusion (TPE) is a frequently encountered form of extrapulmonary tuberculosis, and its nonspecific clinical and imaging features present significant diagnostic challenges. Early and accurate diagnosis of TPE is critical for timely treatment, especially in regions with a high burden of tuberculosis. However, traditional diagnostic methods, such as pleural biopsy and pleural effusion (PE) analysis, often demonstrate limited sensitivity. This limitation underscores the need for more advanced diagnostic tools. While numerous studies have explored machine learning models for TPE diagnosis, the potential of large language models (LLMs) such as ChatGPT-4 has not yet been thoroughly investigated. This study aims to create a diagnostic model for TPE using ChatGPT-4 and compare its performance with traditional TPE diagnosis models based on logistic regression and machine learning methods. We also explore the performance differences between these approaches.

In many countries, TPE is a leading cause of PE and one of the most prevalent forms of extrapulmonary tuberculosis, posing a prominent public health issue in developing countries, including China [1, 2]. TPE is caused by Mycobacterium tuberculosis infection of the pleura and is characterized by a substantial accumulation of chronic effusion and inflammatory cells in the pleural cavity [3]. The combination of an elevated lymphocyte count, exudative PE, and increased adenosine deaminase (ADA) levels is crucial for TPE diagnosis. However, in early cases, neutrophils may predominate [4], ADA levels may be relatively low [5], and the optimal pleural fluid ADA threshold for TPE diagnosis varies across studies [1, 6]. The gold standard for diagnosing TPE is the detection of Mycobacterium tuberculosis in PE or pleural biopsy specimens [1]. However, pleural fluid microbiological cultures have low positivity rates and are time-consuming, often requiring up to eight weeks. Additionally, obtaining pleural specimens via thoracoscopy or percutaneous pleural biopsy is an invasive procedure that causes substantial trauma and carries risks of complications, such as iatrogenic pneumothorax [7]. Diagnosing TPE therefore remains challenging, highlighting the critical need for a less invasive, more accurate, and cost-effective method for early TPE diagnosis.

Recently, the use of artificial intelligence (AI) in healthcare has been expanding. Machine learning, a subset of AI, builds algorithms that learn from large and complex datasets, enabling computers to exhibit intelligent behavior [8]. Machine learning algorithms (MLAs), such as k-Nearest Neighbors (KNN), Random Forests (RF), and Support Vector Machines (SVM), can build efficient, objective, and accurate disease diagnosis models, and machine learning has shown broad potential for clinical diagnosis [9]. Zhou et al. proposed a new algorithm, CFDE, for feature selection in the clinical feature analysis of TPE. This algorithm demonstrated significant advantages in global optimization and feature selection; when combined with an SVM model, it effectively identified key clinical indicators associated with TPE, supporting early diagnosis and treatment [10]. Ren et al. explored diagnostic biomarkers for TPE and incorporated patient clinical features into MLAs, including logistic regression, SVM, RF, and KNN. The results showed that RF achieved an area under the curve (AUC) of 0.97, significantly higher than the AUC of pleural effusion ADA (0.89) [11]. Li et al. developed a new model, bGACO-SVM, to distinguish TPE from non-TPE; its performance differed from that of classical MLAs [12]. Additionally, Li et al. combined a new algorithm, FS-MFO-SVM, with feature selection for diagnosing TPE, achieving an average accuracy of 95%, an AUC of 0.9564, a sensitivity of 93.35%, and a specificity of 97.57% [13]. Despite these advancements, machine learning-based methods still face challenges in effectively integrating and analyzing complex, multi-dimensional clinical data, especially highly variable data.

LLMs are AI systems based on deep learning [14, 15]. By learning from vast amounts of data, they can analyze complex clinical information and provide medical diagnostic suggestions [16,17,18,19,20]. Significant progress has been made in applying LLMs to disease diagnosis and treatment. Studies have shown that LLMs such as ChatGPT can help clinicians quickly access and summarize large volumes of medical literature, enabling them to stay updated on recent studies of rare diseases and facilitating more precise diagnosis [21]. Tassallah et al. evaluated the performance of three LLMs—ChatGPT 3.5, ChatGPT-4, and Google Bard—in diagnosing conditions such as chylous tuberculosis and primary adrenal cortical insufficiency; these models outperformed the average diagnostic accuracy of physicians [22]. Zheng et al. pointed out that ChatGPT excelled in assisting the diagnosis of diseases such as primary pulmonary arterial hypertension and early-onset Parkinson’s disease, demonstrating the ability to quickly analyze medical literature and patient data while formulating personalized treatment plans [23]. Hu et al. evaluated ChatGPT-4’s ability to diagnose rare eye diseases in different scenarios and found that it helped primary care ophthalmologists diagnose rare eye conditions more quickly and accurately [24]. Additionally, Carlo et al. assessed the performance of various LLMs (ChatGPT 3.5, ChatGPT-4, Bing Chat, Google Bard, and Claude) in answering medical questions about diseases such as thymoma and Good’s syndrome; ChatGPT-4 and Bard outperformed the others in information accuracy, responsiveness, and clinical applicability [25]. These studies demonstrate that LLMs offer superior efficiency compared to traditional methods and may also provide advantages in diagnostic accuracy. However, while LLMs have shown promise in various clinical scenarios, their application to TPE diagnosis remains unexplored.

This study aims to bridge this gap by developing a diagnostic model for TPE using the LLM. We compare its performance with traditional diagnostic approaches, including logistic regression and various MLAs, to evaluate its ability to diagnose TPE. The results show that LLMs, particularly ChatGPT-4, excel at integrating clinical data and identifying potential relationships between complex features, offering new insights and support for the early diagnosis of TPE. Furthermore, we developed and published a ChatGPT-4-based diagnostic LLM software package for distinguishing between TPE and non-TPE, making it accessible for clinical use. Future refinement of this tool could significantly enhance diagnostic accuracy and efficiency, ultimately facilitating earlier diagnosis and more personalized treatment of TPE.

Materials and methods

Patients and study design

This cross-sectional study screened 38,885 patients hospitalized at the Affiliated Hospital of Jiujiang University between January 2011 and June 2024. Patients were eligible for enrollment if they met the following criteria: (1) a diagnosis of pleural effusion (PE) confirmed by ultrasound, chest computed tomography (CT), or X-ray; and (2) an etiological diagnosis of the PE confirmed by pleural biopsy. The exclusion criteria were: (1) anti-tuberculosis treatment prior to admission; (2) pregnancy; (3) incomplete clinical data (more than 20% missing); and (4) an unknown cause of PE. All patients included in the study were newly diagnosed with PE and had not received any prior treatment. We collected relevant demographic, laboratory, and clinical information from the hospital’s electronic clinical records system. In total, 163 patients were included in the final analysis: 109 with TPE and 54 with non-TPE. Initially, over 600 clinical features were screened for these 163 patients. Variables with more than 20% missing data were excluded, leaving 73 variables for analysis. Differences between variables were visualized using the ggplot2 package. The patient selection process and study flowchart are shown in Fig. 1.

Fig. 1

Patient selection and study flowchart. The flowchart illustrates the steps involved in developing a diagnostic model for tuberculous pleural effusion (TPE) using various modeling approaches. Data were collected from 38,885 patients admitted to the Department of Respiratory Medicine at the Affiliated Hospital of Jiujiang University between January 2011 and June 2024. A total of 245 patients with pleural effusion were retained after excluding 38,640 patients without pleural effusion. Among these, 163 patients underwent thoracoscopic biopsy and were included in the study. Patients were divided into two groups: the TPE group (n = 109), diagnosed with TPE, and the non-TPE group (n = 54), diagnosed with other types of pleural effusion (PE). The data were divided into training (n = 115) and test sets (n = 48) for model development and evaluation. LASSO regression was applied for variable selection. Three model types were developed: H2O AutoML models, including XGBoost, GBM, GLM, XRT, stacked-ensemble, and deep learning algorithms; traditional logistic regression models using forward, backward, and bidirectional stepwise regression; and ChatGPT-based large language models (LLMs), including ChatGPT-4o and ChatGPT-4. The diagnostic performance of these models was compared with eight previously published TPE diagnosis models to assess their accuracy and effectiveness. The final goal was to evaluate and compare the diagnostic performance of machine learning, traditional logistic regression, and LLMs in diagnosing TPE

Diagnostic criteria for TPE

The diagnosis of TPE is based on one of the following criteria: (a) the detection of Mycobacterium tuberculosis in pleural fluid or pleural tissue culture; (b) histological examination showing granulomatous inflammation in pleural biopsy, with Mycobacterium tuberculosis isolated from another site; or (c) histological examination showing granulomatous inflammation in pleural biopsy, with clinical response to anti-tuberculosis therapy [26].

Data collection and variable selection

We selected candidate variables based on key literature on TPE diagnostic models and our clinical experience. These variables were chosen for their clinical availability and non-surgical nature. The potential diagnostic variables included the following clinical characteristics: age, sex, routine PE parameters (color, turbidity, specific gravity, and the Rivalta test), biochemical parameters of PE (total protein, glucose, lactate dehydrogenase (LDH), adenosine deaminase (ADA), and albumin), serum biochemical parameters (C-reactive protein (CRP), erythrocyte sedimentation rate (ESR)), complete blood count (white blood cells (WBC), lymphocytes, neutrophils), and tumor markers (carcinoembryonic antigen (CEA), non-small cell lung cancer-related antigen, neuron-specific enolase (NSE)), among others. Samples were sourced from peripheral blood, PE, or pleural tissue collected during hospitalization. We collected clinical data from eligible patients using a structured data sheet customized for this study. These clinical data were obtained from the patients’ discharge records. Two experienced pulmonologists reviewed, refined, and cross-checked the clinical data. All data were collected by research staff who were blinded to the final outcome measurements.

Data preprocessing and feature selection

The cohort data used in this study contained missing values. Deleting all incomplete records would reduce the sample size, compromise data quality, and affect diagnostic results. Therefore, we excluded variables with more than 20% missing values. For variables with ≤ 20% missing values, we applied different imputation methods depending on the data type: the “norm” method for continuous data, “logreg” for binary data, and “polyreg” for multiclass data. These imputation methods were implemented using the “mice” package in R. All continuous variables were then converted into binary variables, with the optimal classification threshold determined by receiver operating characteristic (ROC) curve analysis. We used the coords function from the pROC package to select the optimal threshold, which is determined by the trade-off between sensitivity and specificity. This method finds the balance point between sensitivity (maximizing the identification of positive samples) and specificity (minimizing false positives), thereby optimizing classification performance (detailed in Code Section 1 of the Supplementary Materials).
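
The full code is provided in the Supplementary Materials; the following is a minimal R sketch of the two preprocessing steps described above, assuming a data frame df with a binary outcome column group. The column names pe_ada, sex, and pe_color are illustrative placeholders, not the study’s actual variable names.

```r
library(mice)
library(pROC)

# Per-variable imputation methods, as described in the text
meth <- make.method(df)
meth["pe_ada"]   <- "norm"     # continuous: Bayesian linear regression
meth["sex"]      <- "logreg"   # binary: logistic regression
meth["pe_color"] <- "polyreg"  # multiclass: polytomous regression
imp <- mice(df, m = 5, method = meth, seed = 123)
df_complete <- complete(imp)

# Dichotomize a continuous variable at the ROC-optimal threshold
# ("best" balances sensitivity and specificity via Youden's index)
roc_ada <- roc(df_complete$group, df_complete$pe_ada)
cutoff  <- coords(roc_ada, x = "best", ret = "threshold")$threshold
df_complete$pe_ada_bin <- as.integer(df_complete$pe_ada > cutoff)
```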

We then clustered the data using partial least squares discriminant analysis (PLS-DA). Variables with variable importance projection (VIP) values greater than 0.5 were extracted, resulting in 73 selected variables. We removed variables with area under the curve (AUC) values less than 0.6, leaving 17 variables. Next, we removed highly correlated variables with pairwise correlations greater than 0.5. We assessed multicollinearity using the variance inflation factor (VIF). Variables with a VIF value less than 5 were retained, reducing the number of variables to 16. Finally, we used least absolute shrinkage and selection operator (LASSO) regression, conducted with the “glmnet” package in R, to select 12 variables for the final analysis. Using the “caret” package, we randomly split the patients into a training and test set in a 7:3 ratio.
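
A hedged R sketch of the final LASSO step and the 7:3 split described above; candidate_vars stands in for the 16 retained variables and is a placeholder.

```r
library(glmnet)
library(caret)

x <- as.matrix(df_complete[, candidate_vars])  # the 16 retained binary variables
y <- df_complete$group                         # 1 = TPE, 0 = non-TPE

set.seed(123)
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 -> LASSO
coefs  <- coef(cv_fit, s = "lambda.min")
selected <- setdiff(rownames(coefs)[as.vector(coefs) != 0], "(Intercept)")

# Stratified 7:3 split into training and test sets
idx   <- createDataPartition(factor(y), p = 0.7, list = FALSE)
train <- df_complete[idx, ]
test  <- df_complete[-idx, ]
```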

Establishment and evaluation of the TPE machine learning diagnosis model

In this study, we used H2O AutoML to integrate a series of classical and advanced machine learning algorithms to effectively diagnose TPE. The algorithms employed included “XGBoost”, “GBM”, “GLM”, “XRT”, “DeepLearning”, and “StackedEnsemble” (detailed in Supplementary Materials). We comprehensively evaluated these algorithms to select the optimal model for disease diagnosis.

Hyperparameter tuning and cross-validation

To optimize model performance and prevent overfitting, we used fivefold cross-validation. This method splits the training dataset into five mutually exclusive subsets. Each subset serves as the validation set while the remaining four subsets are used for training. Additionally, the AutoML process automatically performs hyperparameter tuning to explore the best model configuration.

Model selection and evaluation

We set the maximum number of automatically generated models to 1000; 453 models were successfully generated. These models underwent hyperparameter tuning and fivefold cross-validation. We implemented an early stopping mechanism using AUC as the performance metric: training was automatically halted if the AUC improved by less than 0.001 for three consecutive training cycles, preventing overfitting. We also filtered out models with an AUC of 1, as this could indicate overfitting. The final model was selected based on the largest average AUC across the training and validation sets, ensuring optimal diagnostic performance and generalizability.
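
A sketch of this AutoML configuration in R under the settings stated above (fivefold cross-validation, up to 1000 models, AUC-based early stopping); the frame and column names are placeholders, and the authors’ exact call options are in the Supplementary Materials.

```r
library(h2o)
h2o.init()

train_h2o <- as.h2o(train)
train_h2o$group <- as.factor(train_h2o$group)  # binary classification target

aml <- h2o.automl(
  y                  = "group",
  training_frame     = train_h2o,
  max_models         = 1000,   # upper bound on generated models
  nfolds             = 5,      # fivefold cross-validation
  stopping_metric    = "AUC",
  stopping_tolerance = 0.001,  # stop if AUC gain < 0.001 ...
  stopping_rounds    = 3,      # ... over three consecutive rounds
  seed               = 123
)
lb <- h2o.get_leaderboard(aml, extra_columns = "ALL")
```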

Model performance evaluation and diagnostic interpretation

We evaluated model performance using ROC curves, the F1 score, and SHAP analysis (via the R package shapviz). First, we used the trained model to generate diagnoses on the test set (testdata) and extracted the diagnostic probabilities for the positive class (class 1). We then used the pROC package to generate the ROC curve and calculate the model’s AUC with its 95% confidence interval. The F1 score on the test set was calculated using the confusionMatrix function. We also visualized and saved the confusion matrix. Finally, we performed SHAP value analysis using the shapviz package to interpret the impact of each variable on the model’s diagnosis. This analysis helped explain the model’s assessment of TPE likelihood in individual patients.
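
The evaluation steps above might look as follows in R; the 0.5 classification threshold and object names are assumptions for illustration.

```r
library(pROC)
library(caret)
library(shapviz)

testdata <- as.h2o(test)
pred <- h2o.predict(aml@leader, testdata)
p1   <- as.vector(pred$p1)  # probability of the positive class (class 1)

roc_test <- roc(test$group, p1)
ci.auc(roc_test)  # AUC with 95% confidence interval

# Confusion matrix with F1 score (mode = "everything" reports F1)
cm <- confusionMatrix(factor(as.integer(p1 > 0.5)), factor(test$group),
                      positive = "1", mode = "everything")

# SHAP values for the leading model (shapviz supports H2O models)
sv <- shapviz(aml@leader, X_pred = testdata, X = test)
sv_importance(sv, kind = "beeswarm")  # beeswarm summary plot (cf. Fig. 3E)
sv_force(sv, row_id = 1)              # force plot for one sample (cf. Fig. 3F)
```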

Establishment of the traditional logistic regression model and comparison with previous models

Logistic regression model construction and variable selection

We constructed a logistic regression model to fit the data for effective TPE diagnosis. First, we built a full-variable logistic regression model using variables selected by LASSO in the training dataset. We applied three different variable selection strategies: forward selection, backward elimination, and stepwise regression. Forward selection adds variables progressively based on the minimum AIC value. Backward elimination removes non-significant variables step by step. Stepwise regression combines both strategies. During the variable selection process, we chose the logistic regression model from forward selection with the highest AUC for subsequent analysis.
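
In base R, the three strategies reduce to calls to step(), which selects by AIC; a minimal sketch, assuming the training data in train and the LASSO-selected variable names in selected:

```r
dat <- train[, c("group", selected)]
full_fit <- glm(group ~ ., data = dat, family = binomial)
null_fit <- glm(group ~ 1, data = dat, family = binomial)

fwd  <- step(null_fit, scope = formula(full_fit), direction = "forward")
bwd  <- step(full_fit, direction = "backward")
both <- step(full_fit, direction = "both")  # bidirectional stepwise
```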

Model evaluation and performance visualization

ROC curve and AUC calculation

We evaluated model performance using the ROC curve and calculated the AUC to quantify classification performance. The pROC package generated ROC curves for both the training and test sets. We recorded the AUC values and their 95% confidence intervals (CI). To assess the model’s reliability, we applied a bootstrap method with 1000 resamples. This produced multiple ROC curves to evaluate the model’s stability.
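
With pROC, both the AUC confidence interval and the 1000-resample bootstrap can be obtained directly; a sketch using the forward-selection fit from above:

```r
library(pROC)

p_train <- predict(fwd, type = "response")
roc_tr  <- roc(train$group, p_train)
ci.auc(roc_tr)                                       # DeLong 95% CI by default
ci.auc(roc_tr, method = "bootstrap", boot.n = 1000)  # bootstrap CI, 1000 resamples

# Bootstrapped sensitivity/specificity band across the curve (cf. Fig. 4G)
ci_sp <- ci.sp(roc_tr, sensitivities = seq(0, 1, 0.01), boot.n = 1000)
plot(roc_tr); plot(ci_sp, type = "shape")
```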

Decision curve analysis (DCA)

We performed DCA to assess the clinical applicability of the model at different thresholds. DCA evaluates the net benefit at various decision thresholds, helping us determine the model’s practical significance in specific clinical scenarios.
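
Several R packages implement DCA; the paper does not state which was used, so the following is a hedged sketch with the rmda package and placeholder predictor names:

```r
library(rmda)

dc <- decision_curve(group ~ pe_ada_bin + pe_albumin_bin, data = train,
                     family = binomial(),
                     thresholds = seq(0, 1, by = 0.01))
plot_decision_curve(dc, curve.names = "Logistic model")
```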

Variable importance and forest plot visualization

We used a forest plot to visually display each variable’s contribution to the model’s diagnosis. The forestplot package created the plot, displaying the importance of variables through their respective odds ratios (OR). The visual results also included the confidence intervals and significance levels for the variables.
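
A minimal forestplot sketch, using the three odds ratios reported in the Results (Fig. 4A) as example rows:

```r
library(forestplot)

or_tab <- data.frame(
  var   = c("PFB ADA", "PFB albumin", "Alkaline phosphatase"),
  or    = c(24.63, 10.75, 3.62),
  lower = c(5.22, 1.47, 0.73),
  upper = c(169.75, 121.88, 21.61)
)
forestplot(labeltext = or_tab$var,
           mean = or_tab$or, lower = or_tab$lower, upper = or_tab$upper,
           xlog = TRUE,   # ORs displayed on a log scale
           zero = 1)      # reference line at OR = 1
```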

Nomogram and individualized diagnosis

We used a nomogram to show how the model can be used for individualized diagnosis. The nomogram converts each variable into a scoring system to diagnose TPE in an individual. This approach enhances the interpretability and practical application of the model.
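
Nomograms for logistic models are commonly drawn with the rms package; the paper does not name the package used, so this is a hedged sketch with placeholder predictors:

```r
library(rms)

dd <- datadist(train); options(datadist = "dd")
lrm_fit <- lrm(group ~ pe_ada_bin + pe_albumin_bin + age_bin, data = train)
nom <- nomogram(lrm_fit, fun = plogis, funlabel = "Pr(TPE)")  # map score to probability
plot(nom)
```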

Comparison with published TPE diagnosis model performance

We extracted the variable sets from eight previously published TPE diagnosis models, refit them on our training dataset, and evaluated their diagnostic performance on the test dataset. We compared the AUC values of the different models to evaluate their classification performance.

Establishment and evaluation of the large language model (LLM) (ChatGPT-4) diagnosis model for TPE

To further explore the application of LLM (ChatGPT-4) in diagnosing TPE, we employed the following methods:

Variable importance scoring

First, we took the variable set selected by LASSO regression and asked ChatGPT-4 to assign an importance score (1 to 10) to each variable. This process was repeated 10 times, and we calculated the mean score for each variable across the 10 iterations. We ranked the variables in descending order of mean importance score and selected those with an average score greater than 5. A total of 8 key variables were identified: (a) biochemical parameters of PE: ADA, total protein, and albumin; (b) blood cell analysis parameters: lymphocyte count, neutrophil percentage, monocyte percentage, and neutrophil count; and (c) patient age. A sketch of how such repeated scoring can be requested programmatically is shown below.
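
The study does not publish its exact prompts or API parameters; the following R sketch (using httr and jsonlite) is one hypothetical way to request repeated importance scores through the OpenAI chat-completions API. The prompt wording, model name, and JSON reply format are assumptions.

```r
library(httr)
library(jsonlite)

score_vars_once <- function(vars) {
  # Hypothetical prompt; the authors' actual prompt is not published here
  prompt <- paste0(
    "Rate the importance (1-10) of each variable for diagnosing tuberculous ",
    "pleural effusion. Reply only with JSON {\"variable\": score}. Variables: ",
    paste(vars, collapse = ", "))
  resp <- POST(
    "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))),
    content_type_json(),
    body = toJSON(list(
      model    = "gpt-4",
      messages = list(list(role = "user", content = prompt))
    ), auto_unbox = TRUE))
  txt <- content(resp)$choices[[1]]$message$content
  unlist(fromJSON(txt))  # named numeric vector of scores
}

# Ten scoring rounds, then rank variables by mean score (as described above)
scores <- replicate(10, score_vars_once(selected))
sort(rowMeans(scores), decreasing = TRUE)
```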

Model training and diagnosis

We input the 8 key variables into ChatGPT-4 to train the model, ensuring it accurately learned and understood the data features. After training, we used the model to diagnose TPE.

Model evaluation

We evaluated the model’s performance using the ROC curve and the F1 score; calculating the AUC and F1 score quantified the model’s classification performance. Finally, we compared the diagnostic results of ChatGPT-4 with those of the best-performing machine learning models (such as Support Vector Machines (SVM) and Random Forests (RF)) and the traditional logistic regression model. This comparison helped us assess its relative strengths and limitations.

Development of the ChatGPT-4 diagnostic model python package for TPE

We developed a Python package named tpeai (https://pypi.org/project/tpeai/) to quickly differentiate between TPE and non-TPE using the diagnostic power of ChatGPT-4. By inputting a set of key variables related to the patient’s biochemistry and blood cell analysis, the model provides an intelligent diagnosis. The required variables include: pleural fluid biochemistry (ADA, total protein, albumin), blood cell analysis (lymphocyte count, neutrophil percentage, monocyte percentage, neutrophil count), and patient age. The model uses these inputs to generate a diagnosis through ChatGPT-4, helping clinicians quickly identify the type of PE and make informed diagnostic and treatment decisions.

Statistical methods

Statistical analyses and software development for this study were performed using R 4.2.3 and Python 3.10. We first tested continuous variables for normality. Data that followed a normal distribution are presented as mean ± standard deviation (SD). We used the independent samples t-test for pairwise comparisons and ANOVA for multiple group comparisons. For non-normally distributed data, the median and interquartile range [P25, P75] are presented. Group comparisons were made using the Mann–Whitney U test for two groups and the Kruskal–Wallis test for multiple groups. Categorical data are expressed as frequencies and percentages (%), with the Chi-square (χ2) test used to compare rates between groups. We set a significance level of α = 0.05 (two-tailed). A P-value of less than 0.05 was deemed statistically significant.
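
A compact R sketch of this test-selection logic for one continuous and one categorical variable (column names are placeholders):

```r
x_tpe <- df_complete$pe_ada[df_complete$group == 1]
x_non <- df_complete$pe_ada[df_complete$group == 0]

# Normality check, then the matching two-group test
if (shapiro.test(x_tpe)$p.value > 0.05 && shapiro.test(x_non)$p.value > 0.05) {
  t.test(x_tpe, x_non)       # normal: independent-samples t-test
} else {
  wilcox.test(x_tpe, x_non)  # non-normal: Mann-Whitney U test
}

# Categorical variable: chi-square test on the contingency table
chisq.test(table(df_complete$sex, df_complete$group))
```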

Results

Clinical characteristics of TPE

We analyzed 73 clinical variables from 163 patients (109 TPE and 54 non-TPE) who underwent thoracoscopic biopsy (Fig. 1). A baseline table of clinical characteristics for the TPE and non-TPE groups was generated using the tableone package (Table 1). We classified non-TPE and TPE samples with PLS-DA, using clinical characteristics derived from routine blood tests and biochemical markers (Fig. 2A). The first two principal components (PC1 and PC2) explained 6.83% and 3.44% of the variance, respectively, and the two groups exhibited a clear separation on the score plot. This indicates that clinical features have strong discriminative power in distinguishing the non-TPE group from the TPE group. To further explore the expression patterns of these clinical features across samples, we used a heatmap to display the levels of various biochemical and hematological variables (Fig. 2B). The results indicated significant differences in these variables between the non-TPE and TPE groups. The cut-off values and areas under the curve (AUCs) of the clinical characteristics are shown in Supplementary Table 1. We illustrated the distribution of key biomarkers in the non-TPE and TPE groups (Fig. 2C). The findings revealed significant intergroup differences (P < 0.05) in adenosine deaminase (ADA) levels in pleural effusion (PE), total protein distribution, and the percentage of monocytes in the blood count, suggesting that these biomarkers may hold diagnostic value for TPE. Furthermore, we assessed the diagnostic performance of these biomarkers using ROC curve analysis (Fig. 2D). The AUC for ADA in PE was 0.8488 (95% CI: 0.7696–0.9280), the AUC for the percentage of monocytes in the blood count was 0.7645 (95% CI: 0.6903–0.8387), and the AUC for total protein in PE was 0.7202 (95% CI: 0.6332–0.8072). These results indicate that ADA levels in PE, total protein, and monocyte percentage have high diagnostic accuracy and can effectively differentiate between the TPE and non-TPE groups.

Table 1 Baseline clinical characteristics of TPE and non-TPE patients
Fig. 2

Clinical characteristic landscape of TPE. A PLS-DA score plot showing the separation between non-TPE and TPE samples based on blood routine and biochemical markers. The percentage of variance explained by each principal component (PC1 and PC2) is indicated in the figure. The distribution of the TPE group and non-TPE group was distinguishable along PC1 (16.76%) and PC2 (16.37%). B Heatmap displaying the expression levels of various biochemical and hematological variables in non-TPE and TPE samples. Samples are clustered into two major groups, with variable names listed on the right side. Red indicates values greater than the corresponding cutoff point (1), and green indicates values smaller than the corresponding cutoff point (0). The variables include: (1) Liver function: Alkaline phosphatase (ALP), Total bile acids (TBA); (2) Electrolytes: Osmolality, Sodium (Na); (3) Tumor markers: Carcinoembryonic antigen (CEA); (4) Hematological analysis: Red cell distribution width-CV (RDW-CV), Neutrophil count, Neutrophil percentage, Lymphocyte count, Monocyte percentage; (5) Hospitalization: Length of hospital stay after pleural biopsy; (6) Lipid profile: Triglycerides; (7) PE biochemistry: Adenosine deaminase (ADA), Albumin, Total protein; (8) Age. The figure illustrates the distribution differences of these variables between the TPE and non-TPE groups. C The bar chart displays differences in the levels of ADA, monocytes, and total protein. ADA (PE biochemistry): The proportion differences between the TPE and non-TPE groups when ADA values are either below or above the cutoff point (25.85). Monocyte proportion (blood cell analysis): The proportion differences between the TPE and non-TPE groups when the monocyte proportion is either below or above the cutoff point (6.95). Total protein (PE biochemistry): The proportion differences between the TPE and non-TPE groups when total protein values are either below or above the cutoff point (47.35). D ROC curves for distinguishing non-TPE and TPE using pleural fluid ADA levels, blood routine monocyte percentage, and total protein in pleural fluid as biomarkers. The Area Under the Curve (AUC) and 95% CI for each biomarker are provided. These were used to evaluate the diagnostic ability of ADA (PE biochemistry), monocyte proportion (blood cell analysis), and total protein (PE biochemistry) in distinguishing between the TPE and non-TPE groups

Machine learning models effectively diagnose TPE

We further refined the variable selection by excluding variables with AUC values below 0.6 and those with pairwise correlations above 0.5. We also assessed multicollinearity using the variance inflation factor (VIF) and retained variables with VIF values below 5 (Supplementary Table 2, Supplementary Fig. 1). This process yielded 16 variables (detailed in the Methods section). Subsequently, we performed diagnostic analysis for TPE using machine learning models. We visualized the coefficient paths of the biochemical and hematological variables in the LASSO regression model (Fig. 3A). As the regularization parameter λ increased, the coefficients gradually shrank toward zero, indicating that the influence of some variables on the model’s diagnosis was reduced during regularization. We then examined the model’s performance at different λ values using a cross-validation curve (Fig. 3B). The cross-validated binomial deviance initially decreased and then increased with λ, indicating an optimal λ value at which model complexity maintains high accuracy while avoiding overfitting. We selected 12 variables for further modeling. Using the H2O automated machine learning platform, we created 453 models with six algorithm families: “XGBoost”, “GBM”, “GLM”, “XRT”, “DeepLearning”, and “StackedEnsemble”. We ranked the top 93 models based on their average AUC values across the training and test sets (Fig. 3C, Supplementary Tables 3–6); most models performed excellently on these two metrics. Additionally, we used the Gain method to quantify the importance of each variable in the optimal XGBoost model (Fig. 3D). The variable importance ranking shows that biochemical markers in PE, such as albumin and ADA, possess the strongest diagnostic ability for distinguishing between the TPE and non-TPE groups. To further explain the diagnostic mechanism of the model, we used SHAP to analyze the contribution of each feature to the sample outcomes (Fig. 3E). The analysis showed that features such as PFB ADA and PFB albumin significantly influenced the model’s diagnosis, whereas features such as age and triglycerides contributed less. “Albumin = 0” in PE had the strongest negative impact on the model output (SHAP value of −1.34), while “ADA = 1” in PE had a significant positive impact (SHAP value of +1.29). Additionally, the SHAP force plot decomposed the contributions of multiple features to the diagnostic result for a specific sample (Fig. 3F). In this sample, the positive contributions of total protein and ADA substantially increased the diagnostic value, while albumin and neutrophil count had the greatest negative impact; as a result, the model classified the patient as non-TPE (f(x) = 0.0963, < 0.5). The confusion matrices indicated that the model exhibited high sensitivity and specificity on both the training set (sensitivity = 0.909, accuracy = 0.938) and the test set (sensitivity = 0.977, accuracy = 0.957) (Fig. 3G, H). The F1 score and Kappa statistic further demonstrated the model’s excellent diagnostic consistency on the training set (F1 score = 0.944, Kappa = 0.908) and the test set (F1 score = 0.87, Kappa = 0.829).

Fig. 3

Machine learning models effectively diagnose TPE. A The plot illustrates the path of coefficients for different biochemical and hematological variables in lasso regression as the regularization parameter λ (Log Lambda) varies. The x-axis represents Log Lambda, the logarithm of the regularization parameter, and the y-axis represents the regression coefficients for each variable. As λ increases, the coefficients gradually approach zero, indicating that lasso regression performs variable selection by shrinking the coefficients. Curves of different colors represent different variables, showing the changes in their coefficients during the model regularization process. B The plot displays the cross-validation process of lasso regression, where the y-axis represents Binomial Deviance, and the x-axis shows different values of Log(λ). The shaded gray area indicates the standard error range, and the red curve represents the mean of the binomial deviance. Through cross-validation, it is observed that as λ changes, the binomial deviance decreases, reaching a minimum, and the corresponding λ at this point represents the optimal regularization parameter. The optimal λ value, marked by the dashed line, is the best regularization parameter chosen by the model. C Heatmap comparing AUC and F1 scores of the top 93 models out of 453 machine learning models. This heatmap compares the AUC and F1 scores of various machine learning models, including those using XGBoost, GBM, DeepLearning, and other algorithms. The results include AUC and F1 scores for the training set, test set, and average values. Each model’s performance is ranked according to its AUC and F1 scores, with higher values indicating better performance. The color bar in the table represents the values of AUC and F1 scores, with the intensity of color reflecting the level of performance. AUC: AUC measures the model’s classification ability. An AUC value closer to 1 indicates better performance. In this heatmap, AUC values are presented for the training set, test set, and average, showing the performance variation of different models across different datasets. F1 Score: The F1 score is an indicator of a classification model’s accuracy, balancing precision and recall. A higher F1 score suggests better balance in the model’s performance across positive and negative classes. This heatmap displays the F1 scores of each model for the training and test sets and provides average values to facilitate comparisons of model performance at different stages. Model Name: Each row represents a different machine learning model, including various configurations of algorithms such as XGBoost, GBM, DeepLearning, etc. (e.g., model names like XGBoost_grid_1_model_108), and their corresponding AUC and F1 scores on the training and test sets. D The plot illustrates the importance of each variable in the best-performing XGBoost-based machine learning model on the test set. The importance of each variable is represented by the length of the corresponding bar, with longer bars indicating a greater contribution of that variable to the model’s diagnostic capability. The most important variables include ADA, alkaline phosphatase, PE biochemical markers (albumin, total protein), and hematological analysis variables (neutrophil count, monocyte percentage, etc.). E This plot displays the SHAP values for each feature in the best XGBoost-based machine learning model, representing the contribution of each feature to the model’s output. 
Features are listed on the y-axis, and the corresponding SHAP values are plotted along the x-axis. Each point represents a data point, and the color of the point indicates the value of the feature (ranging from low to high, with the color scale displayed on the right). Positive SHAP values (to the right of the vertical line) increase the model’s diagnostic value, while negative SHAP values (to the left of the vertical line) decrease it. The features with the greatest impact on the diagnosis are positioned at the top of the plot. F SHAP analysis showing the contribution of multiple features to the model’s diagnosis of specific samples (TPE vs. non-TPE). The x-axis represents the SHAP values, reflecting each feature’s contribution to the diagnosis. Movement to the right indicates an increase in the diagnostic value, while movement to the left indicates a decrease. The cumulative effect of the SHAP values determines the final model diagnosis. The difference between the final diagnostic value, f(x) = 0.0963, and the expected value, E(f(x)) = −0.626, is reflected by the SHAP values of each feature. G Training set confusion matrix. This matrix displays the model’s diagnostic results on the training set, where Control represents normal cases and Case represents diseased cases. The model’s correct classifications and misclassifications are as follows: true positives (TP): 68; false positives (FP): 1; true negatives (TN): 42; false negatives (FN): 4. The model’s performance metrics on the training set are: sensitivity = 0.977, specificity = 0.944, accuracy = 0.913, recall = 0.944, F1 score = 0.944, and Kappa value = 0.908. H Test set confusion matrix. This matrix shows the model’s diagnostic results on the test set: true positives (TP): 35; false positives (FP): 1; true negatives (TN): 10; false negatives (FN): 2. The model’s performance metrics on the test set are: sensitivity = 0.909, specificity = 0.946, accuracy = 0.833, recall = 0.909, F1 score = 0.87, and Kappa value = 0.829

Superior diagnostic performance of traditional logistic model for TPE

We established a traditional multivariate logistic regression model to diagnose TPE. We selected variables sequentially using forward stepwise logistic regression, resulting in 12 variables, and then comprehensively evaluated the model’s diagnostic performance. First, we presented the regression results for each biochemical and hematological variable in a forest plot (Fig. 4A). ADA in pleural fluid biochemistry (PFB), albumin in PFB, and alkaline phosphatase had odds ratios (OR) of 24.63 (95% CI: 5.22–169.75, P < 0.001), 10.75 (95% CI: 1.47–121.88, P = 0.03), and 3.62 (95% CI: 0.73–21.61, P = 0.13), respectively; PFB ADA and PFB albumin were significant contributors to TPE diagnosis, while alkaline phosphatase showed a positive but non-significant association. We displayed the logistic regression model’s scoring system through a nomogram (Fig. 4B). Each variable’s score was color-coded and mapped to the probability of TPE occurrence, with higher scores corresponding to a higher likelihood of TPE; this nomogram provides a convenient diagnostic tool for clinicians. ROC curves in the training and test sets demonstrated the diagnostic performance of the logistic regression model (Fig. 4C, D). The AUC in the training set was 0.96 (95% CI: 0.93–0.99), and in the test set it was 0.95 (95% CI: 0.89–1.00), indicating high diagnostic accuracy and robustness. Moreover, we assessed the clinical utility of the model at various treatment thresholds using DCA (Fig. 4E, F). Across a wide range of treatment thresholds, the logistic regression model offered a greater net benefit than the “treat all” and “treat none” strategies, supporting its potential value in clinical practice. In the training set, after 1000 bootstrap resamplings, the ROC curve further validated the model’s robustness, with the AUC consistently around 0.96 (Fig. 4G). Additionally, comparisons between forward, backward, and bidirectional selection methods showed minimal variation in AUC (range 0.95–0.96), confirming the model’s consistent diagnostic performance (Fig. 4H). Finally, we compared the performance of eight published models in the training and test sets (Fig. 4I, J). Our logistic regression model achieved higher AUC values (training AUC = 0.96, test AUC = 0.95) than the other models, such as the Wu model (training AUC = 0.87, test AUC = 0.78) and the Li model (training AUC = 0.74, test AUC = 0.81), suggesting superior performance in differentiating between TPE and non-TPE patients.

Fig. 4

Traditional Logistic Model Effectively Diagnoses TPE. A Forest plot of the TPE diagnostic model. This plot presents the odds ratios (OR) and p-values of the variables derived from the multivariate logistic regression model. The OR for each variable is displayed with horizontal error bars, where the length of the bars reflects the confidence interval of the OR, and the point represents the estimated OR value. The variables include age, various biochemical markers, and hematological parameters. PFB, pleural fluid biochemistry; BCA, blood cell analysis; Neu, Neutrophil; Mono, monocyte percentage; RDW, red cell distribution width; CV, coefficient of variation. B This nomogram illustrates the scores of various variables calculated by the TPE diagnostic model and their corresponding probabilities of TPE (Pr(GROUP)). Each variable in the model is assigned a score, with the variable’s value weighted according to its corresponding points, which influences the overall score and subsequently diagnoses the likelihood of TPE. The plot lists several clinical variables (e.g., Triglycerides, Total bile acids, Age, Albumin, etc.) and their corresponding scores. Each variable’s score is indicated by a red circle, while the blue box displays the score range for that variable. Some variables (e.g., ADA, Adenosine deaminase) have higher scores, indicating a larger contribution to the model. The lower part of the figure shows the total score, derived from the weighted sum of all variable scores, along with the diagnostic probability of TPE (Pr(GROUP)). As the total score increases, the diagnostic probability of TPE also rises significantly. C ROC curve for the logistic regression model used to identify TPE in the training set, showing the model’s performance at various specificity and sensitivity levels, including the AUC and 95% CI. D ROC curve for the logistic regression model in the test set, showing the diagnostic performance and statistical data. E This plot illustrates the net benefit of the TPE logistic regression model in the training set. The x-axis represents the treatment threshold probability, and the y-axis represents the net benefit. Different decision thresholds influence the effectiveness of the treatment strategy, with the model demonstrating a good net benefit in both low and high probability ranges. The red curve represents the net benefit of treating all patients, the green curve represents the net benefit of treating no patients, and the blue curve represents the model’s diagnostic effectiveness. F This plot shows the net benefit evaluation of the TPE logistic regression model in the test set. Similar to the training set, the model exhibits good diagnostic performance across varying thresholds. The red and green curves again represent the net benefit of treating all patients and treating no patients, respectively, while the blue curve represents the net benefit of the treatment strategy predicted by the logistic regression model. G This figure displays the ROC curve results from 1000 bootstrap samples, assessing the model’s classification performance in the training set. The blue curve represents the average ROC curve based on the training set data, while the gray shaded area indicates the variability of the results from the 1000 bootstrap samples, showing the changes in sensitivity and specificity across different samples. 
The figure demonstrates the stability and performance of the model assessed through bootstrap sampling on the training set, with the blue curve showing a high classification accuracy, indicating the model’s robust diagnostic ability. H This plot presents the ROC curves of the TPE logistic regression model and various stepwise regression methods, along with their corresponding AUC values and 95% confidence intervals. The TPE logistic regression model using forward stepwise regression (red curve) performed the best, with an AUC of 0.96. I, J ROC curve comparison of different models in the training set (I) and test set (J). This plot displays the ROC curve comparison between the TPE logistic regression model and other published models (including Wu, Li, Zhou, Liu, Li2, Ren, Lei, and Wang models) in the training set. The TPE logistic regression model (green curve) demonstrates the best AUC value (0.96) in the training set compared to the other models. The plot highlights the AUC values of the different models, emphasizing the superiority of the TPE model in diagnosing TPE

Effective diagnosis of TPE using a large language model (LLM)

We innovatively employed LLMs such as ChatGPT-4 and ChatGPT-4o to assess their performance in diagnosing TPE. First, ChatGPT-4 rated the importance of different biochemical and hematological indicators (Fig. 5A, Supplementary Table 7), revealing that biochemical markers in PE, particularly ADA and total protein, received the highest scores. The percentage of neutrophils and lymphocyte counts in blood cell analysis also demonstrated high importance, which aligns with prior research findings [27]. Next, we compared the performance of four models: the best machine learning model (MLbest)-XGBoost, ChatGPT-4, ChatGPT-4o, and logistic regression. We evaluated these models using various metrics, including AUC, specificity, sensitivity, accuracy, F1 score, negative predictive value (NPV), and positive predictive value (PPV) (Fig. 5B). The results showed that ChatGPT-4o and MLbest (XGBoost) outperformed the others across all metrics. Both achieved AUCs approaching 1.00, with high sensitivity and specificity, outperforming the traditional logistic regression model. This indicates that they hold significant application potential for diagnosing TPE. Although ChatGPT-4 slightly underperformed compared to ChatGPT-4o, it still demonstrated strong diagnostic capabilities across all metrics. We analyzed the diagnostic results of the best ML (XGBoost) model and the LLM-ChatGPT model in this study. We found that the diagnoses of the two models were consistent for 43 cases in the test set. However, there were discrepancies in 5 cases, including 4 cases of TPE and 1 case of non-TPE. This suggests that the two models showed a higher discrepancy rate for cases of TPE than for non-TPE cases (Supplementary Fig. 2). Furthermore, we developed the Python package “tpeai” (version 0.2.0) (https://pypi.org/project/tpeai/) (Fig. 5C). This package integrates ChatGPT-4 for distinguishing between TPE and non-TPE groups. By combining biochemical and hematological data, this tool effectively supports clinical diagnosis. We showed the specific output results of the “tpeai” package in diagnosing tuberculous PE, along with a detailed display of ChatGPT-4’s logical reasoning and thought process during the analysis (Fig. 5D). In summary, the LLM-based ChatGPT-4 model demonstrates excellent performance in diagnosing TPE. By integrating multiple biochemical and hematological indicators, it can effectively diagnose TPE and provide valuable support for clinical decision-making.

Fig. 5

LLM (ChatGPT-4, ChatGPT-4o) for Diagnosing TPE. A This plot presents the importance scores of various clinical variables computed by the ChatGPT model. The importance scores are represented by bar charts, with the length of each bar indicating the relative importance of the variable. The error bars on each bar represent the range of variability in the importance scores. B This plot presents the performance of each model (MLbest, ChatGPT-4, Logistic Regression, and ChatGPT-4o) across various metrics, including AUC, F1 score, accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The performance of each model is displayed using bar charts, with black error bars representing the standard error for each metric. C Information about the Python package “tpeai” (version 0.2.0) created in this study, which uses ChatGPT-4 to distinguish between TPE and non-TPE groups. D Output from using the “tpeai” package to diagnose TPE. The figure includes a discussion of how ChatGPT-4 interprets the data and model diagnosis, detailing how different biochemical and hematological markers help differentiate the TPE group from the non-TPE group, along with the logical reasoning and thought process in the diagnosis

Discussion

This study innovatively employed a large language model (LLM) to diagnose TPE and compared its performance with traditional machine learning and logistic regression models, with the aim of exploring the potential application of LLMs in TPE diagnosis. The results demonstrate that the LLM effectively integrates various clinical variables and distinguishes between TPE and non-TPE. It performed similarly to standard machine learning methods and outperformed the logistic regression model in sensitivity, specificity, and other evaluation metrics. These findings validate the potential value of LLMs for the early diagnosis of TPE.

Traditional diagnostic methods for TPE rely primarily on pleural effusion (PE) cultures and pleural biopsy. However, these methods carry a high risk of missed diagnoses, and PE culture results often take up to 8 weeks to become available [28]. Early and accurate identification of TPE therefore remains a pressing challenge. In this study, we successfully developed a diagnostic model based on an LLM framework using clinical data from TPE patients. This model enables faster and more effective diagnosis in clinical settings. Compared with traditional diagnostic workflows, it not only offers higher diagnostic efficiency but also demonstrates greater practical value as a non-surgical diagnostic tool.

From a clinical perspective, the LLM-based TPE diagnostic model offers a novel approach to addressing the limitations of existing diagnostic methods. Because TPE is a pleural manifestation of tuberculosis, accurate and efficient diagnosis is crucial for timely treatment, particularly in regions with a high incidence of tuberculosis. Traditional TPE diagnosis relies primarily on pleural biopsy, which is invasive and time-consuming. Previous studies have validated the role of biomarkers such as adenosine deaminase (ADA) and the lymphocyte ratio in TPE diagnosis [29], but these individual biochemical or cytological indicators are insufficient to capture the full complexity of TPE presentations. In this study, the LLM model integrates multiple indicators, including ADA, total protein, and monocyte percentage. This non-surgical, cost-effective approach helps clinicians diagnose TPE more quickly and accurately, supporting early diagnosis. Furthermore, ADA in PE, total protein, and monocyte percentage in the blood count were found to be strong diagnostic markers for distinguishing TPE from non-TPE; the key role of ADA levels in TPE diagnosis was further validated, consistent with previous findings [30].

The LLM model developed in this study demonstrates performance comparable to the machine learning models established in our article and superior diagnostic ability relative to previously published machine learning-based and traditional logistic regression models for TPE. Previous machine learning studies of TPE diagnosis have relied primarily on algorithms such as Random Forests (RF) and Support Vector Machines (SVM). For example, Li et al. [12] improved diagnostic accuracy with an SVM model (bGACO-SVM), and Zhou et al. [10] proposed a new algorithm, CFDE, combined with an SVM model for feature selection in TPE diagnosis. However, these models performed less well than the LLM model developed in this study. Logistic regression models are widely used in disease classification; for example, Li et al. [13] reported an area under the curve (AUC) of 0.92 for TPE diagnosis using logistic regression. However, logistic regression is limited to linear relationships and struggles to capture complex interactions among nonlinear features. In contrast, the LLM model is better suited to large datasets and nonlinear feature data; in this study, it outperformed the logistic regression model across multiple metrics, including AUC, F1 score, and sensitivity.

The LLM model’s excellent performance can be attributed to its ability to handle complex data and integrate multiple variables [16]. In contrast, while machine learning is advantageous in certain specific applications (e.g., small sample data scenarios and lightweight real-time applications), it generally suffers from poor interpretability and limited generalizability. Traditional logistic regression models, although advantageous in interpretability, are constrained by their assumption of linear relationships, limiting their applicability in complex clinical settings.

Despite its advantages, the LLM model has certain limitations. Small sample sizes may lead to overfitting, compromising the model’s generalizability. Its reliance on specific biomarkers further restricts its applicability across different regions. Additionally, the stochastic nature of feature selection and the model’s complexity can complicate result interpretation. While this study demonstrates the LLM model’s potential in diagnosing TPE, further validation in real-world settings is necessary. The limited sample size, lack of multi-center data, and absence of external validation may affect the model’s stability. Future research could integrate multimodal data, such as genomic and imaging information, with the current model to enhance diagnostic accuracy and address these limitations.

Conclusion

This study suggests that the LLM-based diagnostic model provides a novel approach for the early, non-surgical diagnosis of TPE. The ChatGPT-4-based Python package “tpeai” (https://pypi.org/project/tpeai/), developed in this study, provides a simple and user-friendly tool for clinicians, allowing the rapid generation of diagnostic recommendations from basic biochemical and hematological inputs. Further optimization of this tool will enhance its ability to support precise diagnosis and personalized treatment of TPE. As artificial intelligence technology advances, the application of LLMs is expected to expand; combined with multimodal data such as imaging and genomic data, they could significantly improve the diagnostic efficiency and accuracy of TPE. Future research should focus on validating the LLM model in larger, multi-center datasets to ensure its broad applicability and robustness.

Data Availability

No datasets were generated or analysed during the current study.

Abbreviations

TPE:

Tuberculous pleural effusion

LLM:

Large language model

AUC:

Area under the curve

TB:

Tuberculosis

PE:

Pleural effusion

ADA:

Adenosine deaminase

AI:

Artificial intelligence

MLA:

Machine learning algorithms

KNN:

K-nearest neighbors

RF:

Random forests

SVM:

Support vector machines

CT:

Computed tomography

LDH:

Lactate dehydrogenase

CRP:

C-reactive protein

ESR:

Erythrocyte sedimentation rate

WBC:

White blood cells

CEA:

Carcinoembryonic antigen

NSE:

Neuron-specific enolase

ROC:

Receiver operating characteristic

PLS-DA:

Partial least squares discriminant analysis

VIP:

Variable importance projection

VIF:

Variance inflation factor

LASSO:

Least absolute shrinkage and selection operator

CI:

Confidence intervals

DCA:

Decision curve analysis

OR:

Odds ratios

SD:

Standard deviation

PFB:

Pleural fluid biochemistry

MLbest:

The best machine learning model

NPV:

Negative predictive value

PPV:

Positive predictive value

References

  1. Shaw JA, Diacon AH, Koegelenberg CFN. Tuberculous pleural effusion. Respirology. 2019;24(10):962–71.

  2. Xu H-Y, Li C-Y, Su S-S, et al. Diagnosis of tuberculous pleurisy with combination of adenosine deaminase and interferon-γ immunospot assay in a tuberculosis-endemic population: a prospective cohort study. Medicine (Baltimore). 2017;96(47): e8412.

  3. Zhai K, Lu Y, Shi H-Z. Tuberculous pleural effusion. J Thorac Dis. 2016;8(7):E486–94.

  4. Choi H, Chon HR, Kim K, et al. Clinical and Laboratory differences between lymphocyte- and neutrophil-predominant pleural tuberculosis. PLoS ONE. 2016;11(10): e0165428.

    Article  PubMed  PubMed Central  Google Scholar 

  5. Lee SJ, Kim HS, Lee SH, et al. Factors influencing pleural adenosine deaminase level in patients with tuberculous pleurisy. Am J Med Sci. 2014;348(5):362–5.

    Article  PubMed  Google Scholar 

  6. Li D, Shen Y, Fu X, Li M, Wang T, Wen F. Combined detections of interleukin-33 and adenosine deaminase for diagnosis of tuberculous pleural effusion. Int J Clin Exp Pathol. 2015;8(1):888–93.

    PubMed  PubMed Central  Google Scholar 

  7. Kirsch CM, Kroe DM, Azzi RL, Jensen WA, Kagawa FT, Wehner JH. The optimal number of pleural biopsy specimens for a diagnosis of tuberculous pleurisy. Chest. 1997;112(3):702–6.

    Article  PubMed  CAS  Google Scholar 

  8. Gruson D, Helleputte T, Rousseau P, Gruson D. Data science, artificial intelligence, and machine learning: opportunities for laboratory medicine and the value of positive regulation. Clin Biochem. 2019;69:1–7.

    Article  PubMed  Google Scholar 

  9. Saberi-Karimian M, Khorasanchi Z, Ghazizadeh H, et al. Potential value and impact of data mining and machine learning in clinical diagnostics. Crit Rev Clin Lab Sci. 2021;58(4):275–96.

    Article  PubMed  Google Scholar 

  10. Zhou X, Chen Y, Gui W, et al. Enhanced differential evolution algorithm for feature selection in tuberculous pleural effusion clinical characteristics analysis. Artif Intell Med. 2024;153: 102886.

    Article  PubMed  Google Scholar 

  11. Ren Z, Hu Y, Xu L. Identifying tuberculous pleural effusion using artificial intelligence machine learning algorithms. Respir Res. 2019;20(1):220.

    Article  PubMed  PubMed Central  Google Scholar 

  12. Li C, Hou L, Pan J, Chen H, Cai X, Liang G. Tuberculous pleural effusion prediction using ant colony optimizer with grade-based search assisted support vector machine. Front Neuroinform. 2022;16:1078685.

    Article  PubMed  PubMed Central  Google Scholar 

  13. Li C, Hou L, Sharma BY, et al. Developing a new intelligent system for the diagnosis of tuberculous pleural effusion. Comput Methods Programs Biomed. 2018;153:211–25.

    Article  PubMed  Google Scholar 

  14. Kaczmarczyk R, Wilhelm TI, Martin R, Roos J. Evaluating multimodal AI in medical diagnostics. NPJ Digit Med. 2024;7(1):205.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Shao J, Ma J, Yu Y, et al. A multimodal integration pipeline for accurate diagnosis, pathogen identification, and prognosis prediction of pulmonary infections. Innovation (Camb). 2024;5(4): 100648.

    PubMed  CAS  Google Scholar 

  16. Hager P, Jungmann F, Holland R, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. 2024;2:642.

    Google Scholar 

  17. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40.

    Article  PubMed  CAS  Google Scholar 

  18. Blank IA. What are large language models supposed to model? Trends Cogn Sci. 2023;27(11):987–9.

    Article  PubMed  Google Scholar 

  19. Visibelli A, Roncaglia B, Spiga O, Santucci A. The impact of artificial intelligence in the odyssey of rare diseases. Biomedicines. 2023;11(3):887.

    Article  PubMed  PubMed Central  Google Scholar 

  20. Biswas S, Logan NS, Davies LN, Sheppard AL, Wolffsohn JS. Assessing the utility of ChatGPT as an artificial intelligence-based large language model for information to answer questions on myopia. Ophthalmic Physiol Opt. 2023;43(6):1562–70.

    Article  PubMed  Google Scholar 

  21. Lapidus D. Strengths and limitations of new artificial intelligence tool for rare disease epidemiology. J Transl Med. 2023;21(1):292.

    Article  PubMed  PubMed Central  Google Scholar 

  22. Abdullahi T, Singh R, Eickhoff C. Learning to make rare and complex diagnoses with generative AI assistance: qualitative study of popular large language models. JMIR Med Educ. 2024;10: e51391.

    Article  PubMed  PubMed Central  Google Scholar 

  23. Zheng Y, Sun X, Feng B, et al. Rare and complex diseases in focus: ChatGPT’s role in improving diagnosis and treatment. Front Artif Intell. 2024;7:1338433.

    Article  PubMed  PubMed Central  Google Scholar 

  24. Hu X, Ran AR, Nguyen TX, et al. What can GPT-4 do for diagnosing rare eye diseases? A pilot study. Ophthalmol Ther. 2023;12(6):3395–402.

    Article  PubMed  PubMed Central  Google Scholar 

  25. Clerici CA, Chopard S, Levi G. Rare disease in the age of artificial intelligence. Recenti Prog Med. 2024;115(2):67–75.

    PubMed  Google Scholar 

  26. Ferreiro L, Toubes ME, San José ME, Suárez-Antelo J, Golpe A, Valdés L. Advances in pleural effusion diagnostics. Expert Rev Respir Med. 2020;14(1):51–66.

    Article  PubMed  CAS  Google Scholar 

  27. Jeon DS, Kim S-H, Lee JH, Choi C-M, Park HJ. Conditional diagnostic accuracy according to inflammation status and age for diagnosing tuberculous effusion. BMC Pulm Med. 2023;23(1):400.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  28. Lo Cascio CM, Kaul V, Dhooria S, Agrawal A, Chaddha U. Diagnosis of tuberculous pleural effusions: a review. Respir Med. 2021;188: 106607.

    Article  PubMed  Google Scholar 

  29. Garcia-Zamalloa A, Taboada-Gomez J. Diagnostic accuracy of adenosine deaminase and lymphocyte proportion in pleural fluid for tuberculous pleurisy in different prevalence scenarios. PLoS ONE. 2012;7(6): e38729.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  30. Chan C, Chan KKP. Pleural fluid biomarkers: a narrative review. J Thorac Dis. 2024;16(7):4764–71.

    Article  PubMed  PubMed Central  Google Scholar 


Funding

This study was supported by the Natural Science Foundation of Jiangxi Province (20202BABL206116), the Scientific Research Startup Fund of Fujian Medical University (XJ2021018101), and the Key Research and Development Program of Ganzhou City, Jiangxi Province (2023LNS37008).

Author information


Contributions

JLT and XLY conceived the project. Data analysis was performed by CLW, WYL, and PFM. The interpretation of the data involved contributions from CLW, WYL, PFM, YYL, JC, LL, JW, XFL, MXW, YYC, MBH, QH, QH, XLY and JLT. All authors contributed to writing and approving the final manuscript.

Corresponding authors

Correspondence to Xiaoliang Yuan or Jianlin Tong.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the Ethics Committee of the Affiliated Hospital of Jiujiang University (Approval No.: jjumer-b-2024-0405) and was conducted in accordance with the Declaration of Helsinki. As retrospective data were used, the requirement for written informed consent was waived. All data were analyzed anonymously to protect participants’ privacy.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Wu, C., Liu, W., Mei, P. et al. The large language model diagnoses tuberculous pleural effusion in pleural effusion patients through clinical feature landscapes. Respir Res 26, 52 (2025). https://doi.org/10.1186/s12931-025-03130-y

