Case Study: Target Discovery in Triple-Negative Breast Cancer

Roman Frolov
Oct 9, 2025

Updated: Oct 10, 2025

MDA-MB-231 is a highly aggressive, invasive, and poorly differentiated triple-negative breast cancer (TNBC) cell line. TNBC has fewer treatment options because it lacks estrogen and progesterone receptors (ER and PR) and has no HER2 amplification.

While targeting surface proteins remains the most straightforward strategy for developing immunotherapies, mounting evidence suggests that TNBC lacks truly tumor-specific surface proteins. Hence, in this case study, we analyzed the HLA-presented immunopeptidome of MDA-MB-231 tumor cells to evaluate its therapeutic potential, employing recent advances in sequencing, splicing research, and deep learning.

Procedure

The two key pieces of data used in this study were tumor RNA-seq and the LC-MS immunopeptidome validation set. The former was used as the main input for the Eternal target discovery pipeline, while the latter was used to validate the pipeline's predictions.

As input, we used a bulk RNA-seq sample of MDA-MB-231 tumor cells downloaded from NIH SRA [1]. It provided raw FASTQ files with 100 million paired-end reads totaling 31 gigabases, sequenced on an Illumina NovaSeq 6000 with stranded library preparation.

For the validation set, we used the immunopeptidome from a study [2] in which 9,494 total peptides were identified on the surface of MDA-MB-231 cells using liquid chromatography-mass spectrometry (LC-MS).

To reduce variability, we limited our test to 9mer peptides, and filtered out duplicates from the validation set. This resulted in 3,601 validation peptides.
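The filtering step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `unique_9mers` and the example sequences are ours, and normalizing case and whitespace before comparing is an assumption rather than something the study describes.

```python
def unique_9mers(peptides):
    """Return the unique 9-residue peptides, preserving first-seen order."""
    seen = set()
    kept = []
    for raw in peptides:
        # Assumption: sequences are compared case-insensitively after trimming.
        peptide = raw.strip().upper()
        if len(peptide) != 9:
            continue  # keep 9mers only
        if peptide not in seen:
            seen.add(peptide)
            kept.append(peptide)
    return kept


# Toy input: one duplicate (different case) and one 10mer.
example = ["SLYNTVATL", "slyntvatl", "GILGFVFTL", "KLGGALQAK", "ELAGIGILTV"]
print(unique_9mers(example))
```

On the toy input, the duplicate and the 10mer are dropped and three peptides remain.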

To preprocess the raw FASTQ reads, we aligned them to the human reference genome GRCh38.p14 using the STAR aligner, run in two-pass mode with additional settings that increase sensitivity toward alternative splicing. This matters because a single gene can be transcribed into different mRNA isoforms, which translate into a diverse set of proteins [3]. The result was an aligned BAM file.
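For readers who want to reproduce the alignment step, here is a rough sketch of how a two-pass STAR command might be assembled from Python. The file names, index directory, and thread count are placeholders we made up, and the splicing-sensitivity settings mentioned above are not listed in this post, so they are deliberately left out.

```python
import shlex


def build_star_command(genome_dir, fastq_r1, fastq_r2, out_prefix, threads=8):
    """Build (but do not run) a two-pass STAR alignment command."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,             # pre-built GRCh38 STAR index (placeholder path)
        "--readFilesIn", fastq_r1, fastq_r2,   # paired-end reads
        "--readFilesCommand", "zcat",          # assumes gzip-compressed FASTQ
        "--twopassMode", "Basic",              # two-pass alignment
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


command = build_star_command(
    "star_index_grch38", "tumor_R1.fastq.gz", "tumor_R2.fastq.gz", "mda_mb_231_"
)
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where STAR and the index are available.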

The aligned BAM was then used as input into Eternal, where it was processed through a number of steps combining our proprietary splicing detection algorithm with molecular machine learning. The run yielded a table of 15 million peptide-MHC Class I complexes with their corresponding metadata.

Analysis

Here, we begin our analysis by comparing predictions to the validation set.

Validation set intersection with predictions (left), validation set intersection with a list of randomly generated 9mer peptides of the same size as the prediction set (right)

As the left pie chart above shows, Eternal predicted three quarters of the antigens presented on the surface of tumor cells according to the validation set. To rule out "coincidence matches", we estimated the chance of a random peptide appearing in our validation set. We focus on 9mer peptides, as they are the most likely to bind MHC Class I, and the number of possible 9mers is simply the number of standard amino acids to the power of peptide length: 20⁹ = 512,000,000,000. That is far larger than the number of predictions, so a single random 9mer has roughly a 3,601 in 512 billion chance of matching any validation peptide, and a random set of about 15 million peptides is expected to produce only about 0.1 matches. Nonetheless, we generated a set of random 9mer peptides of the same size as the prediction set and evaluated its intersection with the validation set, which turned out to be empty, as expected.
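The back-of-the-envelope estimate for the random control can be checked with a short script. It is a sketch under simple assumptions: peptides are drawn uniformly from the 20 standard amino acids, the random set has 15 million members, and the function names and the two toy validation peptides are ours.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PEPTIDE_LENGTH = 9


def expected_chance_matches(n_random, n_validation, length=PEPTIDE_LENGTH):
    """Expected number of uniformly random peptides that hit the validation set."""
    n_possible = len(AMINO_ACIDS) ** length
    return n_random * n_validation / n_possible


def random_peptides(n, length=PEPTIDE_LENGTH, seed=0):
    """Draw n random peptides; returned as a set, so duplicates collapse."""
    rng = random.Random(seed)
    return {"".join(rng.choices(AMINO_ACIDS, k=length)) for _ in range(n)}


print(len(AMINO_ACIDS) ** PEPTIDE_LENGTH)           # 512000000000
print(expected_chance_matches(15_000_000, 3_601))   # roughly 0.1

# Small-scale version of the control experiment with toy validation peptides.
validation = {"SLYNTVATL", "GILGFVFTL"}
control = random_peptides(100_000)
print(len(control & validation))
```

An expected count well below one is consistent with the empty intersection observed for the random set.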

Recalling 75% of the validation set with 15 million predictions isn't particularly useful in a clinical setting, because only a tiny fraction of those predictions are confirmed hits. Let's see how we can narrow down the search space to increase precision.
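To make the trade-off concrete, recall and precision against a validation set can be computed as below. This is a minimal sketch with made-up toy peptides; the function name is ours. Note that precision measured this way is only a lower bound, since LC-MS does not detect every peptide that is actually presented.

```python
def recall_and_precision(predicted, validated):
    """Recall and precision of a prediction set against a validation set."""
    predicted, validated = set(predicted), set(validated)
    matched = predicted & validated
    recall = len(matched) / len(validated) if validated else 0.0
    precision = len(matched) / len(predicted) if predicted else 0.0
    return recall, precision


# Toy data: 3 of 4 validated peptides are predicted, among 6 predictions.
validated = {"AAAAAAAAA", "CCCCCCCCC", "DDDDDDDDD", "EEEEEEEEE"}
predicted = {"AAAAAAAAA", "CCCCCCCCC", "DDDDDDDDD",
             "FFFFFFFFF", "GGGGGGGGG", "HHHHHHHHH"}
print(recall_and_precision(predicted, validated))  # (0.75, 0.5)
```

Applied to the numbers in this study, about three quarters of 3,601 validation peptides matched among 15 million predictions corresponds to a precision of roughly 0.018%.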

Processing distribution of missed predictions (gray) and matched predictions (black)

As you can see, the initial output contains a lot of noise. However, adding filters based on antigen processing and binding affinity can drastically alter the picture. By processing, we mean the likelihood of a peptide being cleaved from a protein by the proteasome and delivered to the endoplasmic reticulum by the transporter associated with antigen processing (TAP), where it hopefully binds to an MHC molecule and gets delivered to the cell surface in a secretory vesicle.

Going back to precision, we still haven't used another powerful parameter: antigen expression in tumor cells. Higher expression isn't necessarily better for identifying therapeutic targets, as such abundance can lead to immunoediting, but let's take a look at how the graph looks in three dimensions, with expression represented on the z axis.
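A filter of the kind described above could look like the following sketch. Everything here is hypothetical: the field names `processing_score` and `binding_score`, the convention that higher is better, and the threshold values are illustrative assumptions, not Eternal's actual output schema or cutoffs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    peptide: str
    processing_score: float  # hypothetical: higher = more likely to be processed
    binding_score: float     # hypothetical: higher = stronger predicted MHC binding


def filter_predictions(predictions, min_processing, min_binding):
    """Keep predictions that pass both the processing and the binding threshold."""
    return [
        p for p in predictions
        if p.processing_score >= min_processing and p.binding_score >= min_binding
    ]


# Toy predictions with made-up scores.
toy = [
    Prediction("AAAAAAAAA", 0.9, 0.8),
    Prediction("CCCCCCCCC", 0.9, 0.2),  # fails the binding threshold
    Prediction("DDDDDDDDD", 0.1, 0.9),  # fails the processing threshold
]
kept = filter_predictions(toy, min_processing=0.5, min_binding=0.5)
print([p.peptide for p in kept])  # ['AAAAAAAAA']
```

Requiring both thresholds at once shrinks the candidate list, which is what raises precision at some cost in recall.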

Processing distributions of matched predictions with an added expression axis

Identifying antigens present on the cell surface doesn't immediately translate into viable therapeutic targets. As our primary goal is to find what differentiates cancerous cells from normal ones, we introduce a normal control dataset built from expression data across 51 tissues from GTEx [4].

We then use this control dataset to compare expression of each antigen between normal and tumor cells on the transcriptomic level.

Normalized expression levels of antigens in normal (black) vs tumor (gray) cells
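One simple way to express the tumor-versus-normal comparison described above is a log2 fold change against the highest-expressing normal tissue. This is a sketch of a common approach, not necessarily the normalization used for the figure: the pseudocount, the choice of the maximum across tissues, and the toy values are all assumptions.

```python
import math


def log2_fold_change(tumor_expr, normal_expr_by_tissue, pseudocount=1.0):
    """log2 ratio of tumor expression to the highest normal-tissue expression."""
    # Assumption: the most conservative comparison is against the tissue
    # where the antigen's source transcript is most abundant.
    max_normal = max(normal_expr_by_tissue.values(), default=0.0)
    return math.log2((tumor_expr + pseudocount) / (max_normal + pseudocount))


# Toy values in arbitrary normalized expression units.
normal = {"breast": 1.0, "liver": 3.0}
print(log2_fold_change(63.0, normal))  # 4.0
```

Comparing against the maximum rather than the mean across tissues is stricter, because a single normal tissue with high expression is enough to make a target risky.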

The list contains some promising candidates with significant expression differences from normal cells. It'd be useful to know where those antigens come from for further analysis.

Another distinctive feature of Eternal is traceability. Not only do you get the information on which antigens are present on the cell surface and their expression, but you can also trace each antigen back to the sourcing gene, specific isoform, and even genomic positions that produced the coding DNA.

Top 100 most expressed tumor antigens mapped to their genomic origins

This information allows for further analysis of each pathway to confirm the therapeutic potential of selected targets.

Summary

In this case study, we demonstrated some of the features of the Eternal platform, which are aimed at advancing the development of personalized cancer immunotherapies. Whether the therapy is CAR T-cell or mRNA-based, the use cases include:

Choosing the optimal targeted therapy for each patient and adjusting the protocol based on their unique genetic makeup. The aim is to target antigens with high coverage in the tumor cells of the individual patient.

Safety evaluation for new therapies. For example, suppose a therapy is developed to eliminate cells carrying a particular antigen. If that antigen is present on both normal and tumor cells, an active component that interacts with several different tissues can damage not only the cancer but healthy tissues as well. By analyzing an antigen's similarity to those present in normal tissues, we can reduce the risk of dangerous off-tumor effects.

Development of personalized immunotherapies based on overexpressed tumor-associated antigens or de novo tumor-specific antigens.

Cancer is a highly heterogeneous disease that humanity has been fighting since the dawn of time. We believe that the next generation of advances in cancer care will come from a personalized approach that takes advantage of the very diversity of cancer genetics that makes it so hard to treat in the first place.

If you'd like to get early access to our platform, please fill out this form.

References: