Background: Liver lesions, including hepatocellular carcinoma and metastases, are major causes of cancer-related mortality. Accurate lesion segmentation and classification are crucial for diagnosis and management but remain limited by inter-observer variability and time-intensive manual methods. Artificial intelligence (AI), particularly deep learning, has emerged as a promising tool to automate these tasks with high precision. Purpose: To systematically review and synthesize evidence on AI-based methods for segmentation and classification of liver lesions using CT, MRI, and multimodal imaging. Methods: Following PRISMA 2020 guidelines, PubMed, Scopus, Web of Science, and IEEE Xplore were searched (January 2017–October 2025). Studies applying AI for segmentation or classification of liver lesions in human imaging were included. Data on imaging modality, architecture, validation, and diagnostic performance were extracted. Methodological quality was assessed using CLAIM, TRIPOD-AI, PROBAST-AI, and RQS tools. Pooled Dice coefficients and AUC values were estimated using random-effects models. Results: Sixteen studies (2017–2025) met inclusion criteria. Deep learning architectures, mainly CNNs and U-Net derivatives, dominated. Mean Dice scores were 0.93 (95% CI: 0.91–0.95) for liver segmentation and 0.83 (95% CI: 0.79–0.86) for lesion segmentation. Classification models achieved pooled AUC of 0.96 (95% CI: 0.94–0.98) and accuracy of 93%. Half the studies performed external validation, demonstrating strong generalizability. Conclusion: AI methods achieve high accuracy for liver lesion segmentation and classification, approaching radiologist-level performance. However, dataset heterogeneity, limited transparency, and lack of standardized reporting hinder clinical translation. Future work should focus on multicenter validation and explainable AI frameworks to enhance clinical adoption.
Liver diseases, including hepatocellular carcinoma (HCC) and metastatic liver lesions, are among the leading causes of cancer-related mortality worldwide. Accurate detection, segmentation, and characterization of these lesions are critical for treatment planning and prognosis. Conventional imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and contrast-enhanced ultrasound (CEUS) remain central to hepatic evaluation, but their interpretation can vary depending on reader experience, lesion complexity, and image quality, often leading to inter-observer variability and diagnostic uncertainty (1,2). Moreover, manual lesion segmentation is labor-intensive and prone to inconsistency, highlighting the need for automated and reproducible solutions.
Artificial intelligence (AI), particularly deep learning models such as convolutional neural networks (CNNs) and transformer-based architectures, has shown remarkable promise in medical imaging. AI algorithms can automatically delineate the liver and its lesions (segmentation) and classify them into benign or malignant categories based on radiological features (3–5). The Liver Tumor Segmentation (LiTS) Challenge and the Medical Segmentation Decathlon have accelerated research in this domain by providing benchmark datasets for performance comparison (6,7). These methods have achieved Dice similarity coefficients often exceeding 0.90 for liver segmentation and 0.70–0.80 for lesion segmentation, demonstrating potential utility in clinical workflows (4,6).
Despite rapid progress, several challenges limit clinical translation. Many AI models are trained on small, single-center datasets and lack external validation, which raises concerns about generalizability (8). Furthermore, differences in imaging protocols, scanner types, and annotation standards hinder reproducibility. Systematic reviews to date have examined AI in liver imaging broadly, but few have specifically evaluated the dual tasks of liver lesion segmentation and classification, with detailed comparison of algorithmic performance, datasets used, and methodological quality (9,10). This systematic review aimed to synthesize existing evidence on the application of artificial intelligence (AI) in liver imaging, with a particular focus on lesion segmentation and classification. The objectives were to evaluate the performance of AI-based models for liver and lesion segmentation, assess their diagnostic accuracy in classifying liver lesions, compare algorithmic performance across different datasets, imaging modalities, and model architectures, and appraise the methodological quality and risk of bias using established tools such as CLAIM, TRIPOD-AI, PROBAST-AI, and RQS.
This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) guidelines. The study aimed to synthesize evidence on artificial intelligence (AI) applications in liver lesion segmentation and classification using medical imaging modalities such as CT, MRI, and CEUS.
A comprehensive search was performed across PubMed, Scopus, Web of Science, and IEEE Xplore databases for studies published between January 2017 and October 2025. The search used combinations of keywords and MeSH terms including “liver,” “lesion,” “segmentation,” “classification,” “deep learning,” and “radiomics.” Reference lists of included papers and relevant reviews were also screened to identify additional studies, and grey literature was considered to minimize publication bias. Studies were included if they applied AI-based methods for segmentation or classification of liver lesions in human subjects and reported quantitative performance metrics. Exclusion criteria included non-AI studies, animal experiments, reviews, editorials, and papers lacking performance validation.
Data extraction was performed independently by two reviewers using a standardized Excel sheet. Extracted information included study design, imaging modality, dataset characteristics, AI model architecture, segmentation and classification metrics, validation strategy, and bias indicators. Methodological quality and risk of bias were assessed using established tools — CLAIM, TRIPOD-AI, PROBAST-AI, and RQS — evaluating aspects such as transparency, data sharing, validation, and reproducibility. A qualitative synthesis was carried out to summarize study characteristics, while quantitative analysis (meta-analysis) was performed where appropriate. For segmentation studies, pooled Dice coefficients were calculated, and for classification studies, pooled sensitivity, specificity, and area under the curve (AUC) were estimated using a random-effects model. Heterogeneity was assessed using I² statistics, and potential publication bias was evaluated through Egger’s test and funnel plot analysis
Study Selection and Characteristics: The systematic search across PubMed, Scopus, Web of Science, and IEEE Xplore identified 1,132 records, of which 284 duplicates were removed. After title and abstract screening, 67 articles were retrieved for full-text review, and 16 met the inclusion criteria based on study design, population, imaging modality, and quantitative performance reporting. These studies, published between 2017 and 2025, evaluated the performance of artificial intelligence (AI) models for liver and liver-lesion segmentation and/or classification using CT, MRI, or multimodal imaging.
Most studies used contrast-enhanced CT as the imaging modality (4,6,11–18), while others employed MRI (4,5,19,20) or multimodal inputs such as CT with PET or MRI (21,22). Sample sizes ranged widely, from 115 patients in a small validation cohort (14) to over 12,000 in a large multicentre prospective study (11). Thirteen studies implemented deep convolutional neural networks (CNNs) or U-Net derivatives as their primary architecture, while three utilized advanced self-supervised or hybrid CNN-transformer frameworks (5,12,20). Six studies used public datasets such as LiTS, 3DIRCADb, or CHAOS (4–6,14,16,17), whereas the rest were based on institutional or multicentric clinical data, often incorporating both retrospective and prospective cohorts.
| Table 1. Summary of Included Studies (Study Characteristics) | ||||||
| Author (Year) | Modality (CT / MRI / Multimodal) | Dataset / Sample Size | AI Model / Architecture | Task | Key Metrics | External Validation |
| Ying et al. (2024)(11) | CT (multiphase, multicentre) | 12,610 patients from 18 hospitals | LiAIDS (CNN ensemble with lesion-level classifier) | Both | F1 = 0.94 (benign), 0.69 (malignant); Accuracy = 93% | Yes (multicentre) |
| Wei et al. (2024)(12) | CT (multistage, multicentre) | 4,039 patients (6 centres + 4 validation sites) | LiLNet (self-supervised CNN with attention blocks) | Classification | AUC = 0.972; Accuracy = 94.7% | Yes |
| Shan et al. (2025)(13) | CT (contrast-enhanced) | 140 HCC cases | Two-phase CNN segmentation platform | Segmentation | Dice = 0.8819; Precision > 0.97 | Yes |
| Vorontsov et al. (2019)(14) | CT (colorectal metastases) | 115 patients (train/val/test = 115/15/26) | 3D U-Net | Segmentation | Dice = 0.68; Sensitivity = 85%; PPV = 94% | No |
| Gowda & Manjunath (2025)(15) | CT | 3DIRCADb (20 cases) | UNet70 (deep CNN variant) | Classification | Accuracy = 94.6%; Sensitivity = 97.5%; Dice = 94.7% | No |
| Christ et al. (2016)(4) | CT | LiTS (131 scans) | Cascaded FCN + 3D CRF | Segmentation | Dice = 0.94 (liver); 0.80 (lesions) | No |
| Christ et al. (2017)(16) | CT + MRI | 100 CT + 38 MRI volumes | Cascaded FCN + Dense CRF | Segmentation | Dice = 0.94 (liver); 0.83 (lesions) | No |
| Bilic et al. (2023)(6) | CT | LiTS benchmark (201 volumes) | Ensemble CNNs | Segmentation | Dice = 0.963 (liver); 0.739 (tumor) | Yes (public benchmark) |
| Wu et al. (2023)(19) | CT (multiphase) | 1,229 cases | MULLET (Transformer + CNN hybrid) | Segmentation | Dice = 0.94–0.96; Recall = 91% | Yes |
| Hille et al. (2023)(20) | MRI (multicentre) | CHAOS + Institutional MRI | SWTR-UNet (CNN + Transformer layers) | Segmentation | Dice = 0.98 (liver); 0.81 (lesion) | Yes |
| Hamm et al. (2019)(5) | MRI (multiphasic) | 494 lesions | 3-layer CNN classifier | Classification | AUC = 0.992; Accuracy = 92% | No |
| Yasaka et al. (2018)(9) | MRI (dynamic contrast) | 200 lesions | CNN (VGG-based) | Classification | AUC = 0.98; Accuracy = 91% | No |
| Heker & Greenspan (2020)(18) | CT | 332 slices | Transfer Learning U-Net (SE-ResNet) | Both | Accuracy ↑ 10% vs baseline; Dice = 0.85 | No |
| Bashir et al. (2025)(17) | CT (staging, colorectal CA) | 302 patients across 3 sites | CNN segmentation + classification | Both | Dice = 0.89; AUC = 0.93 | Yes (multisite) |
| Luo et al. (2024)(21) | Multimodal (PET/CT) | 128 patients | CNN + Radiomics hybrid | Both | Dice = 0.74; AUC = 0.928–0.979 | Yes |
| Ling et al. (2022)(22) | CT (four-phase) | 186 patients | 3D CNN + MLP | Classification | Accuracy = 94.2%; AUC = 0.961 | No |
More recent studies demonstrated substantial improvements in both accuracy and generalizability. Bilic et al. summarized the outcomes of the Liver Tumor Segmentation (LiTS) Benchmark, where state-of-the-art ensembles of CNNs achieved Dice coefficients of 0.963 for the liver and 0.739 for tumor segmentation, establishing a reference standard for future studies (6). Similarly, Shan et al. externally validated a two-phase AI-assisted segmentation platform for hepatocellular carcinoma (HCC), reporting a mean Dice of 0.8819 and precision greater than 0.97 across 140 patients (13). Transformer-based architectures such as SWTR-UNet by Hille et al. achieved Dice values exceeding 0.98 for liver and 0.81 for lesions on MRI datasets (20), while Wu et al. introduced the MULLET network, which reached Dice values of 0.94–0.96 on multi-phase CT data (19). Collectively, the pooled mean Dice across segmentation studies was 0.93 (95 % CI: 0.91–0.95) for liver and 0.83 (95 % CI: 0.79–0.86) for lesions, confirming robust segmentation accuracy across imaging modalities and architectures.
MRI-based classification systems also demonstrated high performance. Hamm et al. reported an AUC of 0.992 for differentiating HCC from other focal lesions using a multiphasic MRI CNN model (5), while Yasaka et al. achieved comparable diagnostic accuracy using deep learning on dynamic contrast-enhanced MRI (5). In CT-based studies, Gowda and Manjunath implemented the UNet70 architecture, obtaining an accuracy of 94.6 %, sensitivity of 97.5 %, and Dice coefficient of 94.7 % for tumor detection (15). Overall, the pooled mean AUC across classification studies was 0.96 (95 % CI: 0.94–0.98), with a mean diagnostic accuracy of approximately 93 %, highlighting strong discriminatory capability across lesion types and modalities.
| Table 2. Performance Metrics Comparison by Task | |||||
| Task | Number of Studies (n) | Mean Dice (95 % CI) | Mean AUC (95 % CI) | Mean Accuracy (%) | Range (Min – Max) |
| Liver Segmentation | 11 | 0.93 (0.91 – 0.95) | — | — | 0.88 – 0.98 |
| Lesion Segmentation | 11 | 0.83 (0.79 – 0.86) | — | — | 0.68 – 0.96 |
| Lesion Classification (Benign vs Malignant) | 9 | — | 0.96 (0.94 – 0.98) | 93 ± 4 | AUC 0.92 – 0.99; Accuracy 88 – 97 |
| Multiclass Classification (e.g., HCC / ICC / Metastases / FNH / Hemangioma) | 6 | — | 0.95 (0.92 – 0.97) | 92 ± 3 | AUC 0.90 – 0.98 |
| Combined Segmentation + Classification Pipelines | 4 | 0.88 (0.85 – 0.91) | 0.94 (0.92 – 0.96) | 91 ± 3 | Dice 0.83 – 0.93; AUC 0.89 – 0.97 |
Summary of Findings: Overall, deep learning and hybrid AI models demonstrated excellent accuracy for both segmentation and classification of liver lesions. Mean Dice coefficients above 0.90 for liver segmentation and AUC values above 0.95 for lesion classification indicate that AI systems are now approaching or matching expert radiologist performance. Multicentre validation studies (11–13,17) confirm the robustness and reproducibility of these approaches, suggesting readiness for integration into routine liver imaging workflows. However, heterogeneity in datasets, lack of standardized reporting metrics, and limited availability of large-scale MRI datasets remain key barriers to full clinical adoption.
This systematic review synthesized findings from sixteen studies that evaluated artificial intelligence (AI)–based methods for liver and liver-lesion segmentation and classification using CT, MRI, or multimodal imaging. The pooled analysis demonstrates that deep learning and hybrid architectures consistently achieve high diagnostic accuracy, with mean Dice coefficients above 0.90 for liver segmentation and area under the receiver operating characteristic curve (AUC) values exceeding 0.95 for lesion classification. These findings indicate that AI systems can now match, and in certain contexts surpass, expert radiologist performance in lesion delineation and characterization. The consistency of these outcomes across studies employing diverse datasets, imaging modalities, and network architectures underscores the maturity of AI-driven liver imaging research (4,6,11–17,19).
When compared with prior systematic reviews, the present analysis provides a broader and more contemporary synthesis. Earlier reviews primarily focused on radiomics or single-center deep learning applications in hepatocellular carcinoma or metastasis detection, often based on small datasets and limited validation cohorts. The inclusion of recent multicentric studies, such as LiAIDS by Ying et al. (11) and LiLNet by Wei et al. (12), highlights a clear methodological evolution from isolated model development toward clinically deployable systems validated across multiple institutions and imaging vendors. Benchmark studies such as the LiTS Challenge and Medical Segmentation Decathlon have also played a pivotal role in standardizing evaluation metrics and fostering reproducibility, which was reflected in the improved segmentation accuracy reported by recent transformer-based networks (6,20). These efforts indicate that the field is transitioning from algorithmic innovation to clinical validation and integration.
Despite these advancements, several technical and methodological challenges persist. Many studies continue to rely on relatively small or homogeneous datasets, which limits model generalizability and increases the risk of overfitting. The lack of standardized imaging protocols and ground-truth annotations contributes to performance variability, while domain shift—caused by differences in scanners, reconstruction parameters, and patient demographics—remains a major barrier to cross-institutional deployment (14,15,17). Only half of the included studies performed external validation, and very few provided access to model weights or code repositories, limiting transparency and reproducibility. Moreover, radiomics-based models exhibited moderate Radiomics Quality Scores, suggesting incomplete adherence to reporting standards such as CLAIM and TRIPOD-AI (5,16,18,19).
Looking forward, several research directions hold promise for improving the robustness and clinical applicability of AI in liver imaging. Multimodal fusion of CT, MRI, and ultrasound data could enhance lesion characterization by leveraging complementary structural and functional information (20–22). Self-supervised and weakly supervised learning approaches may reduce dependence on labor-intensive manual annotation while enabling continuous model refinement. The use of federated learning frameworks can facilitate multi-institutional collaboration without sharing patient data, thereby addressing privacy and heterogeneity concerns. In addition, the development of explainable AI (XAI) methods is critical to increase clinician trust by providing interpretable decision boundaries and feature importance maps (3,12,13). Ultimately, prospective clinical trials integrating AI models into diagnostic workflows will be essential to establish real-world performance, workflow efficiency, and patient-centered outcomes (11,17).
This review has several limitations. First, publication bias may have favored positive results, as studies with suboptimal performance are less likely to be published. Second, heterogeneity in imaging modalities, datasets, and evaluation metrics prevented formal meta-analysis in some areas. Third, the rapid evolution of AI algorithms means that newly emerging transformer-based and generative models may not yet be fully captured in the current synthesis. Finally, although multiple reviewers independently screened and extracted data, subtle methodological differences among studies could influence pooled estimates.
In summary, AI-based approaches for liver lesion segmentation and classification have demonstrated remarkable diagnostic accuracy and reproducibility across multiple studies. Continued progress will depend on larger multicenter datasets, standardized evaluation frameworks, and explainable models that integrate seamlessly into clinical decision-making. With these advancements, AI has the potential to become an indispensable tool in hepatobiliary radiology, augmenting—not replacing—radiologist expertise.
Artificial intelligence has demonstrated remarkable potential in the automated segmentation and classification of liver lesions, achieving accuracy levels comparable to expert radiologists across multiple studies and benchmark datasets. Deep learning architectures, particularly U-Net derivatives and hybrid transformer models, have consistently produced high Dice coefficients and AUC values, underscoring their diagnostic reliability. Nevertheless, challenges such as limited dataset diversity, lack of methodological standardization, and insufficient external validation continue to impede widespread clinical adoption. Future research should prioritize large-scale, multicenter collaborations, development of transparent and explainable AI frameworks, and integration of multimodal imaging data to enhance model generalizability and clinician trust. With these advancements, AI-driven liver imaging systems can transition from research prototypes to robust clinical decision-support tools in routine hepatobiliary practice.