Edited by
Reviewed by
Figures
Tables
Background: Breast cancer remains a major global health concern. By analyzing the transcriptomic datasets GSE65194 and GSE42568, we identified numerous consistently dysregulated genes that may influence cancer progression and serve as potential therapeutic targets.
Methods: Using GEO2R for differential gene expression analysis, we identified significant genes (adjusted p < 0.05, absolute log2 fold change > 1), including both strongly upregulated and downregulated candidates. Key visualizations included volcano plots, MA plots, UMAP for dimensionality reduction, and box plots. A Venn diagram was used to identify differentially expressed genes that overlapped across datasets, confirming robust shared expression patterns.
Results: Analysis of the breast cancer gene expression datasets GSE65194 and GSE42568, both of which compare cancerous breast tissue with normal controls, identified 5,554 and 2,957 differentially expressed genes, respectively. Specifically, in GSE65194, 4,968 genes were upregulated and 586 were downregulated, whereas in GSE42568, 1,512 genes were upregulated and 1,445 were downregulated. Among the genes in GSE65194, COL11A1 showed the greatest expression change, with a log2 fold change of 7.69. UMAP analysis of both datasets clearly separated cancerous and normal samples, highlighting distinct gene expression profiles.
Conclusion: Several genes are consistently upregulated across breast cancer datasets, suggesting shared disease-related pathways and potential value as biomarkers or therapeutic targets, though clinical validation is still needed.
Breast cancer is the most commonly diagnosed cancer among women worldwide, and it remains a significant burden on public health and health care systems [1,2]. Its impact continues to rise: in addition to the new cases diagnosed, almost 790,000 deaths were reported worldwide in 2022 alone [3]. It is a complex disease, and its many molecular subtypes and varied clinical presentations make continued investigation of its mechanisms essential [4,5]. High-throughput sequencing techniques have advanced cancer research by enabling the generation of extensive genomic and transcriptomic data. Such datasets have given researchers improved insight into the biology of breast cancer [6].
A primary source is the Gene Expression Omnibus (GEO), a large public repository maintained by the National Center for Biotechnology Information (NCBI), which hosts many gene expression datasets relevant to breast cancer research [7]. Researchers use these datasets to discover genetic alterations and changes in gene expression associated with the initiation and development of breast cancer [8,9]. Comprehensive gene expression profiling has helped researchers define the major molecular subtypes of breast cancer [10,11]. These categories (luminal A, luminal B, HER2-enriched, and basal-like) have been important in guiding treatment decisions and improving patient outcomes [12]. Triple-negative breast cancer (TNBC), now regarded as a distinct and aggressive form of the disease, has also emerged as a significant field of study [13]. Combining GEO datasets is a convenient way to identify genes that are consistently differentially expressed across patient groups [14,15]. This approach helps overcome some of the limitations of single studies and reveals robust molecular signatures linked to breast cancer [16]. Meta-analyses of these datasets have also identified the most important biological pathways involved in cancer progression, principally those connected to the cell cycle, DNA repair, and the immune response [17,18].
Recent advances in in silico biology have contributed significantly to our knowledge of breast cancer. Machine learning has enabled the discovery of novel prognostic biomarkers and possible therapeutic targets [19,20], and network-based studies have illuminated the complex relationships among genes and their regulation [21,22]. Transcriptomic research has also shown that non-coding RNAs, including microRNAs and long non-coding RNAs, play a crucial role in tumor development [23] and have new applications in diagnosis and treatment [24]. Integrating transcriptomic data with other forms of molecular data, such as DNA methylation and protein-level data, has further increased our understanding of breast cancer complexity [25]. Recent studies have focused on discovering gene markers that predict patients' response to therapy and their prognosis, which can support personalized medicine and immunotherapy [26-28]. Building on this approach, our study integrates multiple GEO datasets and applies robust analytical tools to identify common patterns of gene expression and candidate biomarkers and therapeutic targets in breast cancer.
Data Retrieval & Acquisition
In this study, we compared two well-characterized breast cancer gene expression datasets from GEO (GSE65194 and GSE42568). These datasets were selected with care for their sound experimental design, comprehensive profiling, and detailed clinical information.
GSE65194, generated on the Affymetrix Human Genome U133 Plus 2.0 Array platform, consists of 130 breast cancer tissue samples and 11 normal samples. For the current analysis, differential gene expression was assessed by comparing the tumor samples with normal breast tissue. This dataset is associated with the research of Maubant et al. and Maire et al. [29-33] and is largely relevant to TNBC, with a strong emphasis on characterization and potential therapeutic targets. Tissue samples were of high quality, with RNA integrity numbers (RIN) greater than 7.0. The dataset also supports a comprehensive analysis of pathways and targets of interest, including Wnt3a signaling, TTK/hMPS1, and Polo-like kinase 1. Similarly, GSE42568, reported by Clarke et al. [34], contains 104 breast cancer samples and 17 normal samples that were processed on the Affymetrix Human Genome U133 Plus 2.0 Array platform. This dataset provides comprehensive clinical information, including estrogen receptor (ER), progesterone receptor (PR), and HER2 status, as well as tumor grade and survival outcomes. It covers the common subtypes of breast cancer, and its samples were processed in a uniform manner to minimize variation and ensure consistency, which makes it well suited to comparative analysis.
We chose GSE65194 and GSE42568 for their high quality, clinical annotation, and broad representation of important breast cancer subtypes. GSE65194 offers a focused view of TNBC, supported by high RNA quality and detailed molecular data. In contrast, GSE42568 provides a more heterogeneous cohort with clearly defined ER, PR, and HER2 status, allowing subtype-oriented analysis. Together, these datasets provide consistent and well-annotated data for an integrative study of breast cancer.
To ensure data quality and comparability, both datasets were carefully preprocessed before analysis. The preprocessing steps included background correction, normalization, and batch-effect adjustment. Dataset-appropriate methods were applied, including quantile normalization and robust multi-array average (RMA) normalization.
Data Preprocessing & Sample Grouping
Gene expression data from both datasets were carefully evaluated through multiple quality-control steps. In GSE65194, the samples were divided into breast cancer tissues (n = 130) and normal breast tissues (n = 11), whereas GSE42568 included 104 breast cancer samples and 17 normal breast tissue samples. To ensure data reliability, we assessed RNA degradation, reviewed intensity distribution plots, and carried out sample correlation analysis to detect possible technical problems or outlier samples.
Preprocessing was performed using methods appropriate for each dataset. For GSE65194, quantile normalization was applied. For GSE42568, generated on the Affymetrix Human Genome U133 Plus 2.0 platform, normalization was carried out using the Robust Multi-array Average (RMA) method. In both datasets, background correction and batch-effect removal were also performed to minimize non-biological variation. Low-quality probes were excluded, and potential outliers were identified through signal intensity assessment and principal component analysis. Together, these steps helped ensure that the gene expression data used in the study were accurate, consistent, and suitable for downstream analysis. Probe sets were then matched to the most recent gene annotations using platform-specific annotation packages.
After preprocessing, samples from GSE65194 and GSE42568 were classified into two main groups: breast cancer and healthy control samples. This grouping was based on the detailed metadata and sample information provided in the GEO database. In GSE42568, the samples were categorized into 104 breast cancer and 17 normal breast tissue samples according to the available clinical annotations. GSE65194 is likewise a breast cancer expression dataset generated on the Affymetrix Human Genome U133 Plus 2.0 Array platform [31-34]. This clear and reliable grouping provided a solid foundation for the subsequent differential gene expression analysis.
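As an illustration of the quantile normalization step mentioned above, the following is a minimal sketch in plain Python. It is not the pipeline used in the study: the function name `quantile_normalize`, the toy matrix, and the tie handling (ties broken by row position) are assumptions made for the example, and production tools handle ties and missing values more carefully.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a probes x samples matrix given as a list of rows.

    Every sample (column) is mapped onto one shared reference distribution:
    the mean of the k-th smallest value across all samples.
    Ties are broken by row position, which is a simplification.
    """
    n_probes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Reference distribution: average of the sorted columns, rank by rank.
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(col[k] for col in sorted_cols) / n_samples for k in range(n_probes)
    ]

    normalized = [[0.0] * n_samples for _ in range(n_probes)]
    for j, col in enumerate(columns):
        # Probe indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_probes), key=lambda i: col[i])
        for rank, probe_idx in enumerate(order):
            normalized[probe_idx][j] = reference[rank]
    return normalized


# Toy matrix: 4 hypothetical probes (rows) x 3 hypothetical samples (columns).
toy = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(toy)
```

After this step every sample shares the same set of values, so differences between samples reflect only the rank of each probe within its sample.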
Gene Expression Analysis Using GEO2R
Differential gene expression analysis was performed using GEO2R, a web-based tool provided by NCBI that is built on the limma package in R. For the GSE42568 dataset, gene expression was compared between breast cancer samples (n = 104) and normal samples (n = 17). For GSE65194, expression profiles were compared in the same way between breast cancer and normal breast tissue samples. To obtain reliable and statistically robust results, stringent filtering criteria were applied: an adjusted p-value of less than 0.05 using the Benjamini-Hochberg correction, an absolute log2 fold change greater than 1, and variance estimation based on empirical Bayes statistics. The Benjamini-Hochberg adjustment controls the false discovery rate (FDR) and thereby provides the multiple-testing correction. GEO2R runs its analysis within the limma (Linear Models for Microarray Data) framework, which is well suited to microarray data produced by the Illumina and Affymetrix systems. For GSE65194, quantile normalization was applied before differential expression testing. For GSE42568, RMA (Robust Multi-array Average) normalization of the Affymetrix U133 Plus 2.0 data was applied before statistical analysis. These preprocessing steps followed established methods in transcriptomic analysis [29,32], with the aim of optimizing the accuracy of differential expression detection, reducing platform-dependent technical error, and minimizing false-positive results.
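The filtering criteria above can be sketched in a few lines of Python. This is an illustration only, not GEO2R's implementation: GEO2R obtains its p-values from limma's moderated statistics, which are not reproduced here. The function names, gene labels, and all numeric values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def classify(log2_fc, adj_p, fc_cutoff=1.0, alpha=0.05):
    """Label a gene 'up', 'down', or 'ns' (not significant)."""
    if adj_p < alpha and log2_fc > fc_cutoff:
        return "up"
    if adj_p < alpha and log2_fc < -fc_cutoff:
        return "down"
    return "ns"


# Hypothetical genes with made-up statistics (not taken from either dataset).
genes = ["geneA", "geneB", "geneC", "geneD"]
log2_fc = [2.5, -1.8, 0.4, 1.2]
raw_p = [0.001, 0.01, 0.03, 0.5]

adj_p = benjamini_hochberg(raw_p)
labels = {g: classify(fc, p) for g, fc, p in zip(genes, log2_fc, adj_p)}
```

In this toy input, geneC passes the adjusted p-value cutoff but not the fold-change cutoff, and geneD passes the fold-change cutoff but not the p-value cutoff, so neither is called a DEG; both criteria must hold together.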
Visualization, Correlation Analysis, & Venn Diagram Assessment
The GEO2R differential expression results were further analyzed and plotted in R (version 4.1.0). The output files were exported in .tsv format and contained gene expression values, fold-change estimates, and measures of statistical significance. Volcano plots were created with the ggplot2 package, with the log2 fold change on the x-axis and the negative log10 of the adjusted p-value on the y-axis. These plots allowed clear identification of genes that met the predefined significance thresholds. To explore sample clustering based on global gene expression patterns, UMAP was performed using the umap package in R. This dimensionality-reduction procedure produced two-dimensional plots that revealed distinct clustering of cancer and normal samples. In addition, box plots were used to display the expression levels of selected genes and to compare their distribution between breast cancer and normal tissue samples.
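To make the volcano plot axes concrete, the sketch below computes the coordinates of a single point. It assumes the red/blue colouring for up- and downregulated genes described in the results; the grey colour for non-significant genes and the function name `volcano_point` are assumptions for illustration. The example input uses the COL11A1 statistics reported for GSE65194.

```python
import math


def volcano_point(log2_fc, adj_p, fc_cutoff=1.0, alpha=0.05):
    """Return (x, y, colour) for one gene on a volcano plot.

    x is the log2 fold change and y is -log10 of the adjusted p-value,
    so small p-values appear high on the plot.
    """
    x = log2_fc
    y = -math.log10(adj_p)
    if adj_p < alpha and log2_fc > fc_cutoff:
        colour = "red"      # significantly upregulated
    elif adj_p < alpha and log2_fc < -fc_cutoff:
        colour = "blue"     # significantly downregulated
    else:
        colour = "grey"     # assumed colour for non-significant genes
    return x, y, colour


# COL11A1 in GSE65194: log2FC = 7.69, adjusted p = 4.39e-19 (as reported).
x, y, colour = volcano_point(7.69, 4.39e-19)
```

With these inputs the point sits far to the right and high on the plot (y is roughly 18.4), which is why COL11A1 stands out visually.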
To identify genes that were consistently differentially expressed in both datasets, namely GSE65194 (130 breast cancer and 11 normal breast tissue samples) and GSE42568 (104 breast cancer and 17 normal breast tissue samples), Venny 2.1.0 was used to construct Venn diagrams. Overlapping genes were then examined for expression concordance, and their agreement across datasets was evaluated using Pearson correlation analysis. Together, these analyses provided a clearer view of the shared transcriptional alterations present in both breast cancer datasets.
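The overlap and concordance steps can be illustrated with a short Python sketch. Venny itself is a web tool, so the set intersection here only mimics what the Venn diagram shows. The gene labels and log2 fold changes are invented for the example and do not come from either dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical DEG lists: gene label -> log2 fold change (illustrative only).
degs_a = {"gene1": 3.0, "gene2": 2.0, "gene3": -2.0, "gene4": 1.5}
degs_b = {"gene2": 3.0, "gene3": -1.5, "gene4": 1.5, "gene5": 4.0}

# Genes in the intersection of the two Venn circles.
shared = sorted(set(degs_a) & set(degs_b))

# Concordant genes change in the same direction in both datasets.
concordant = [g for g in shared if (degs_a[g] > 0) == (degs_b[g] > 0)]

# Correlation of fold changes across the shared genes.
r = pearson([degs_a[g] for g in shared], [degs_b[g] for g in shared])
```

A correlation close to 1 across the shared genes indicates that the two datasets agree not only on which genes change but also on the relative size of the change.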
Data Generation, Grouping, & Parameterization
Gene expression profiles were analyzed using two independent breast cancer datasets obtained from the Gene Expression Omnibus (GEO) database. The GSE65194 dataset contains 130 breast cancer and 11 normal samples. Similarly, GSE42568 contains 104 breast cancer and 17 normal samples and was generated on the Affymetrix Human Genome U133 Plus 2.0 Array platform. Differential expression analysis was carried out in GEO2R with stringent statistical criteria. The Benjamini-Hochberg method was used to correct for multiple testing, and limma precision weights were applied to account for heteroscedasticity. Genes were considered significantly differentially expressed if the adjusted p-value was below 0.05 and the absolute log2 fold change exceeded 1. Significant genes were identified in both datasets, and their expression patterns were visualized through volcano plots and UMAP. To reduce technical variation and improve comparability, quantile normalization was used for GSE65194, while Robust Multi-array Average (RMA) normalization was applied to GSE42568. After normalization, clear transcriptional differences were detected between breast cancer and normal breast tissues.
Differential Gene Expression Analysis of the GSE65194 Breast Cancer Dataset
Analysis of the GSE65194 dataset revealed clear transcriptional differences between breast cancer and normal breast tissue samples after normalization of the expression data. This was demonstrated by volcano plots and UMAP visualization.
Volcano Plot Analysis
Volcano plot analysis of the GSE65194 dataset identified 5,554 differentially expressed genes (DEGs) using thresholds of adjusted p-value < 0.05 and |log2 fold change| > 1. Of these, 4,968 genes were upregulated (logFC > 1; shown in red), whereas 586 genes were downregulated (logFC < −1; shown in blue). Among the most strongly upregulated genes, COL11A1 showed the highest fold change (log2FC = 7.69, adjusted p = 4.39 × 10⁻¹⁹), followed by COL10A1 (log2FC = 7.33, adjusted p = 2.75 × 10⁻²⁶). These collagen-related genes have been widely associated with breast cancer progression (Table 1 and Figure 1A).
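Because fold changes are reported on a log2 scale, it can help to convert them back to linear fold changes. The short sketch below does this for the two reported values; the helper name `linear_fold_change` is an assumption for illustration.

```python
def linear_fold_change(log2_fc):
    """Convert a log2 fold change to a linear (ratio-scale) fold change."""
    return 2.0 ** log2_fc


# Reported log2 fold changes in GSE65194.
col11a1 = linear_fold_change(7.69)  # roughly 206-fold
col10a1 = linear_fold_change(7.33)  # roughly 161-fold
```

On this scale, the threshold of |log2 fold change| > 1 corresponds to at least a doubling or halving of expression.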
Mean-Difference (MA) Plot Analysis
The MA plot shows strong separation of differentially expressed genes from background gene expression levels. The symmetric data distribution pattern confirmed proper normalization and the absence of intensity bias. The plot displayed two distinct gene clusters: upregulated genes, marked in red and positioned above the zero-fold-change line, and downregulated genes, marked in blue and positioned below it, showing clear expression-level differences in the evaluated samples (Figure 1B).
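For readers unfamiliar with MA plots, the sketch below shows how the two coordinates can be derived for one gene. It is a simplified illustration: here A is taken as the average of the two group means, whereas limma-based tools typically plot the average log-expression across all samples. The function name and the expression values are hypothetical.

```python
def ma_values(tumor_log2, normal_log2):
    """Return (M, A) for one gene from log2 expression values per group.

    M is the difference of group means (the log2 fold change, y-axis).
    A is the average of the two group means (overall intensity, x-axis).
    """
    mean_tumor = sum(tumor_log2) / len(tumor_log2)
    mean_normal = sum(normal_log2) / len(normal_log2)
    m = mean_tumor - mean_normal
    a = (mean_tumor + mean_normal) / 2.0
    return m, a


# Hypothetical log2 expression values for a single gene.
m, a = ma_values([9.0, 10.0, 11.0], [6.0, 8.0])
```

A point with M above zero lies above the zero-fold-change line (higher in tumor), and a cloud of points that stays centred on M = 0 across the range of A is what indicates the absence of intensity bias.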
UMAP Analysis
The UMAP dimensionality reduction method generated a visualization showing two distinct sample groups with clear spatial separation. The analysis revealed non-overlapping transcriptional patterns, where cancer samples clustered separately from normal tissue samples. This clear segregation of sample clusters confirmed the presence of unique gene expression signatures between the two biological conditions (Figure 1C & D).
Box Plot Analysis
A box plot of the top 20 differentially expressed genes revealed how their expression varied across sample types. Key genes, including COL11A1, COL10A1, CXCL10, and RRM2, were more highly expressed in breast cancer samples than in normal breast tissue. The small overlap in expression between the cancer and normal groups suggests that these genes could be useful as disease biomarkers (Figure 1E).
Statistical Analysis
Strong statistical support was confirmed in GSE65194, where 78% of the differentially expressed genes had adjusted p-values of less than 1 × 10⁻¹ and 45% had log2 fold changes of 2 or more. These genes are associated with breast cancer progression, indicating their potential usefulness as diagnostic biomarkers and therapeutic targets.
Differential Gene Expression Analysis of GSE42568
Analysis of the GSE42568 dataset revealed clear differences in gene expression between breast cancer and normal breast tissue across several visualization techniques.
Volcano Plot Analysis
Volcano plot analysis identified 2,957 significant DEGs. Among these, 1,512 genes were upregulated and 1,445 were downregulated in breast cancer tissue compared with normal breast tissue. These findings indicate substantial transcriptional alterations associated with breast cancer development (Table 2 and Figure 2A).
Mean-Difference Plot (MA) Analysis
The MA plot exhibited a symmetric trumpet-like distribution, confirming proper data normalization and no intensity bias in the analysis. The plot clearly separated significant differentially expressed genes from the background expression levels. The higher number of upregulated genes (red points) in the upper region compared with downregulated genes (blue points) in the lower region showed a trend towards increased gene expression in the dataset (Figure 2B).
UMAP Analysis
The UMAP analysis of gene expression data demonstrated clear clustering patterns between breast cancer and normal tissue samples. The dimensionality reduction approach revealed two well-defined and distinct groups in the dataset. Breast cancer samples formed one tight cluster that was clearly separated from the cluster of normal tissue samples, indicating significant differences in their gene expression signatures. This clear separation in the UMAP plot confirms the substantial transcriptional reprogramming that occurs during breast cancer development (Figure 2C).
Box Plot Analysis
The box plot of the top 20 differentially expressed genes showed clear differences between breast cancer and normal samples. Several genes, such as KRT18, KRT19, EPPK1, COL11A1, EPCAM, DSP, and ESRP1, were expressed at much higher levels in the breast cancer group. The limited overlap in interquartile ranges suggests that these genes could serve as diagnostic biomarkers. Their consistent differential expression suggests a role in the progression of breast cancer (Figure 2E).
Statistical Analysis
Among the 2,957 DEGs, 1,512 were upregulated and 1,445 were downregulated, indicating broad transcriptional changes in breast cancer.
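The reported DEG counts for both datasets can be cross-checked with simple arithmetic. The sketch below sums the up- and downregulated counts and computes the upregulated share; the dictionary layout and variable names are assumptions for illustration, while the counts are those reported in the text.

```python
# DEG counts as reported for each dataset.
counts = {
    "GSE65194": {"up": 4968, "down": 586},
    "GSE42568": {"up": 1512, "down": 1445},
}

# Total DEGs per dataset: upregulated plus downregulated.
totals = {name: c["up"] + c["down"] for name, c in counts.items()}

# Fraction of DEGs that are upregulated in each dataset.
up_fraction = {name: c["up"] / totals[name] for name, c in counts.items()}
```

The totals reproduce the reported 5,554 and 2,957 DEGs. The upregulated share is about 89% in GSE65194 and about 51% in GSE42568, so the first dataset is strongly skewed towards upregulation while the second is nearly balanced.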
Upregulated Genes in GSE42568 & GSE65194: A Venn Analysis
Comparison of GSE42568 and GSE65194 showed shared and dataset-specific upregulated genes in breast cancer, with 4 genes common among the top 18 DEGs (Figure 3).
Four genes were consistently upregulated in both datasets: COL11A1, TOP2A, RRM2, and ESRP1. These genes are linked to vital processes in breast cancer, including extracellular matrix remodeling, cell division, DNA synthesis, and splicing regulation. Their consistent upregulation across independent cohorts underscores their biological relevance and their possible value as biomarkers and therapeutic targets. The differences between the datasets probably reflect differences in patient characteristics, tumor subtype, sample processing, and platform. Overall, integrating both datasets highlights shared and unique molecular characteristics of breast cancer.
Breast cancer is a global health concern owing to the complex molecular pathways that drive its development, progression, and resistance to therapy. In the present study, analysis of the GSE42568 and GSE65194 datasets identified four genes that were consistently overexpressed in tumor samples relative to normal breast tissue, namely COL11A1, TOP2A, RRM2, and ESRP1 [35,36]. Their expression is reproducible across distinct cohorts, which points to a key role in the biology of breast cancer. The most strongly upregulated of them was COL11A1 (log2 fold change = 7.69), which is closely associated with extracellular matrix remodeling, tumor invasion, and an inflammatory microenvironment, and therefore has potential as a biomarker and a treatment target [37,38].
TOP2A, RRM2, and ESRP1 were also strongly upregulated, suggesting that they contribute to the development of breast cancer. TOP2A is linked to cell proliferation and DNA replication and can help predict patients' response to anthracycline therapy. RRM2 is involved in DNA synthesis and cell cycle regulation, and high levels may indicate chemotherapy resistance. ESRP1 regulates alternative splicing and the maintenance of epithelial characteristics, which makes it an important factor in tumor growth and therapeutic outcomes. These findings suggest that all three genes may be of considerable clinical use in breast cancer [39-42].
The findings suggest that tumors with increased TOP2A and RRM2 expression may be more responsive to anthracycline-based neoadjuvant chemotherapy. On the other hand, COL11A1 may be a marker of chemotherapy resistance. Pathway analysis showed that these genes are implicated in DNA synthesis, the DNA damage response, tumor microenvironment changes, and splicing regulation, all of which can potentially affect tumor response to treatment. To make the results more reliable, data processing, batch-effect correction, normalization, multiple-testing correction, and validation with independent datasets were taken into consideration. The consistent core gene signature suggests that these genes might serve as diagnostic biomarkers and therapeutic targets, whereas the differences between datasets reflect the heterogeneity of breast cancer [43-48].
Although this study has several strengths, it also has some limitations. It is based on transcriptomic data, which can be affected by platform-specific bias and provide little information on post-transcriptional or protein-level changes. Multi-omics data, functional validation, and single-cell analysis should be applied in future research. Nevertheless, the gene signatures generated here may still be useful in selecting treatments, monitoring responses, and developing targeted therapies [49-54].
Future research should clarify the functions of these genes, their interactions within pathways, and their roles in treatment resistance. It should also support biomarker validation, targeted therapy, clinical trials, and improved data integration through advanced analytical approaches [55-60].
This research contributes to identifying extensive transcriptional differences and robust gene signatures associated with treatment response and survival. By combining several datasets, this analytical approach can overcome certain weaknesses of single-cohort studies and provide a more in-depth picture of breast cancer biology [61,62].
To identify molecular signatures related to disease progression, we analyzed two independent breast cancer datasets, GSE65194 and GSE42568. Our differential expression analysis identified 5,554 DEGs in GSE65194 and 2,957 in GSE42568. Strict statistical criteria were applied to all DEGs: an adjusted p-value of less than 0.05 and an absolute log2 fold change greater than 1. Four genes, COL11A1, TOP2A, RRM2, and ESRP1, were consistently upregulated in both datasets, which suggests that they play a central role in breast cancer. COL11A1 plays a complex role in extracellular matrix remodeling. TOP2A is essential for DNA replication. RRM2 plays a role in nucleotide metabolism. ESRP1 regulates alternative splicing, which shapes how gene transcripts are processed into mRNAs. The findings broaden our knowledge of the molecular pathogenesis of breast cancer and highlight promising candidates for biomarkers and novel therapies. Further study of these upregulated genes may lead to more individualized and durable therapies for breast cancer.
I would like to thank King Faisal University and the Department of Biological Sciences for their funding support of this work.
Generative AI Statement
The author declares that Generative AI tools were used to enhance the language clarity of this work. The author takes full responsibility for the accuracy and integrity of the content.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.