
Sequence Retrieval and Annotation Process

Sequences (fasta)

DNA and protein sequences of CAR T cell products (CAR constructs and vector systems) were collected from literature and patents (see Sources of Sequences). All DNA sequences were extracted from those sources using a Jupyter Notebook (translate_sequences.ipynb). The extraction relies on a custom template matching approach implemented in the Python script template_matching.py. The method uses the structural similarity index (SSIM) to detect nucleotide templates (A, C, G, T) in .png images and reconstructs them into a nucleotide sequence string.
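The idea behind SSIM-based template matching can be illustrated with a minimal sketch. This is not the code in template_matching.py: the 3x3 toy bitmaps, the function names, and the single-window SSIM (computed over the whole patch rather than with a sliding window) are all assumptions made for illustration, and the sketch presumes the image has already been cut into one patch per character.

```python
from statistics import fmean

# Hypothetical 3x3 grayscale "glyphs"; real templates would be cropped from the .png images.
TEMPLATES = {
    "A": [[0, 255, 0], [255, 255, 255], [255, 0, 255]],
    "C": [[255, 255, 255], [255, 0, 0], [255, 255, 255]],
    "G": [[255, 255, 0], [255, 0, 255], [255, 255, 255]],
    "T": [[255, 255, 255], [0, 255, 0], [0, 255, 0]],
}


def ssim_global(a, b, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale patches."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    if len(xs) != len(ys):
        raise ValueError("patches must have the same size")
    c1 = (0.01 * data_range) ** 2  # stabilising constants from the SSIM definition
    c2 = (0.03 * data_range) ** 2
    mx, my = fmean(xs), fmean(ys)
    vx = fmean((x - mx) * (x - mx) for x in xs)
    vy = fmean((y - my) * (y - my) for y in ys)
    cov = fmean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2)
    )


def classify_patch(patch, templates=TEMPLATES):
    """Return the base whose template is most similar to the patch."""
    return max(templates, key=lambda base: ssim_global(patch, templates[base]))


def read_sequence(patches, templates=TEMPLATES):
    """Classify each character patch and join the calls into a sequence string."""
    return "".join(classify_patch(p, templates) for p in patches)


if __name__ == "__main__":
    patches = [TEMPLATES[base] for base in "GATTACA"]
    print(read_sequence(patches))
```

Choosing the best-scoring template for each patch keeps the approach tolerant of small rendering differences, but any sequence reconstructed this way should still be checked against the source image.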

All DNA sequences were translated into protein sequences using the Expasy Translate Tool and saved as *_protein_translated.fasta files. For some CAR products, protein sequences were provided directly in the corresponding patents. These original sequences were included whenever available.
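For readers who want to reproduce the translation step offline, the following sketch translates a DNA string in a single forward frame with the standard genetic code. It is an illustration only and not part of the repository: the function name and the `frame` parameter are assumptions, and the Expasy Translate Tool additionally reports all six reading frames.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA string in one forward frame; unknown codons become 'X'."""
    seq = dna.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))  # prints "MA*"
```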

Discrepancies Between Nucleotide and Protein Sequences

In several cases, we observed discrepancies between the nucleotide sequences and the corresponding protein sequences provided in the patents.
When translated with Expasy, the nucleotide sequences did not always align with the listed protein sequences.

For example, in WO2022116086A1, the original protein sequence contains an extra 'TS' that is not supported by the nucleotide sequence, which suggests an inconsistency in the source data.

Therefore, both translated and original protein sequences are provided to ensure transparency and traceability.
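Discrepancies of this kind can be located with a simple comparison of the two protein strings. The sketch below uses Python's difflib and made-up toy sequences (not the patent sequences); the function name is hypothetical, and a dedicated pairwise aligner may be preferable for long or repetitive sequences.

```python
from difflib import SequenceMatcher


def protein_differences(translated, original):
    """List the regions where the original protein differs from the translated one.

    Each entry is (operation, 0-based position in the translated sequence,
    translated segment, original segment).
    """
    matcher = SequenceMatcher(None, translated, original, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, i1, translated[i1:i2], original[j1:j2]))
    return differences


if __name__ == "__main__":
    # Toy sequences: the "original" carries two extra residues.
    print(protein_differences("MKVDEQ", "MKVTSDEQ"))
```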

The annotation process for CAR constructs and vector systems is described below in the Annotations section. To ensure that the sequences are correct, the CAR constructs were subsequently re-checked manually.


Sources of Sequences

| CAR T Cell Product | Original Source | CAR Sequence (DNA) | CAR Sequence (Protein) | Vector Sequence |
| --- | --- | --- | --- | --- |
| Ciltacel | Patent WO2022116086A1 | SEQ ID NO. 9–16 | SEQ ID NO: 17; translated from DNA | Not available |
| Ciltacel | Patent US20230270786A1 | SEQ ID NO. 9–16 | SEQ ID NO: 17; translated from DNA | Not available |
| Ciltacel (Oezdemirli et al.) | Supplementary Figure S1, Ozdemirli et al. | Highlighted DNA sequence of CAR construct | Translated from DNA | Vector sequence from 5'UTR to 3'UTR |
| Ciltacel (Braun et al.) | Braun et al. | Reverse engineered DNA sequence provided at GitHub | Translated from reverse engineered DNA | Full reverse engineered vector sequence provided at GitHub |
| Idecel | Patent WO2021091978A1 | Sequence No. 10 | Sequence No. 9; translated from DNA | Sequence No. 36 |
| Tisacel | Patent US 9,499,629 B2 | SEQ ID NO: 8 | SEQ ID NO: 12; translated from DNA | SEQ ID NO: 1 |
| Axicel | DrugBank: Roberts et al.; Kochenderfer et al. → GenBank HM852952 | GenBank ID HM852952 | Translated from DNA | Not available |
| Hu19-CD28Z | Brudno et al. → NCBI MN698642.1 | NCBI entry MN698642.1 | Translated from DNA | Not available |

Annotations (gtf)

CAR constructs

For annotation, known nucleotide sequences of CAR construct parts were retrieved from NCBI, following this schematic.

Domains retrieved:
- CD28, CD3ζ, 41BB, CD8, CSF2RA
- Accession numbers: "NM_001378516.1", "NM_171827.4", "NM_001561.6", "NM_001410981.1", "NR_027760.3"

Fetched Sequences:
fetched_sequences.fasta
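The retrieval step could look roughly like the sketch below, which downloads the listed accessions as FASTA through the NCBI E-utilities EFetch endpoint. How the repository actually fetched the sequences is not stated, so the function names and the choice of plain urllib are assumptions. NCBI also asks heavier users to identify themselves with `tool` and `email` parameters, which are omitted here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accession numbers as listed above.
ACCESSIONS = [
    "NM_001378516.1",
    "NM_171827.4",
    "NM_001561.6",
    "NM_001410981.1",
    "NR_027760.3",
]


def efetch_url(accessions):
    """Build an EFetch URL that returns the given nucleotide records as FASTA text."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def fetch_fasta(accessions, out_path="fetched_sequences.fasta"):
    """Download the records (requires network access) and write them to out_path."""
    with urlopen(efetch_url(accessions)) as response:
        fasta = response.read().decode("utf-8")
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(fasta)
    return fasta


if __name__ == "__main__":
    print(efetch_url(ACCESSIONS))
```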

Each CAR construct was aligned against the retrieved domain sequences using BLAST. When applicable, annotations were compared with the original source data (annotation_from_nucleotide_seq.json) and compiled into a .gtf file.
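As an illustration of how a BLAST hit can become a GTF record, the sketch below converts one line of BLAST tabular output (`-outfmt 6`) into a nine-column GTF line. It assumes the CAR construct was the query and the domain sequence the subject; the function name, the `misc_feature` feature type, the attribute layout, and the example hit values are all placeholders rather than the conventions used in the repository.

```python
def blast_hit_to_gtf(line, source="BLAST", feature="misc_feature"):
    """Convert one BLAST outfmt 6 line into a tab-separated GTF line.

    outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
                      qstart qend sstart send evalue bitscore
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated BLAST columns")
    qseqid, sseqid = fields[0], fields[1]
    qstart, qend = int(fields[6]), int(fields[7])
    sstart, send = int(fields[8]), int(fields[9])
    bitscore = fields[11]
    # A reversed subject range means the hit lies on the minus strand.
    strand = "+" if sstart <= send else "-"
    start, end = sorted((qstart, qend))
    attributes = f'gene_id "{sseqid}"; transcript_id "{sseqid}";'
    return "\t".join(
        [qseqid, source, feature, str(start), str(end), bitscore, strand, ".", attributes]
    )


if __name__ == "__main__":
    # Made-up example hit, not a real alignment result.
    hit = "example_CAR\tNM_001561.6\t100.00\t126\t0\t0\t901\t1026\t640\t765\t1e-60\t233"
    print(blast_hit_to_gtf(hit))
```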

All CAR sequences were additionally translated into protein sequences and screened for protein domains with SMART. Based on the protein positions of the predicted domains, the corresponding nucleotide sequences were extracted using find_nucleotide_from_protein.py and added to the annotation file annotation_from_nucleotide_seq.json.
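The coordinate conversion behind this step is simple arithmetic: residue i of the protein corresponds to nucleotides 3(i - 1) + 1 through 3i of the open reading frame. The sketch below shows that mapping with 1-based, inclusive coordinates. It is not the code in find_nucleotide_from_protein.py; the function names and the `orf_start` parameter are assumptions.

```python
def protein_span_to_nucleotide(aa_start, aa_end, orf_start=1):
    """Map a 1-based, inclusive protein span to 1-based, inclusive nucleotide coordinates.

    orf_start is the 1-based position of the first base of the start codon
    within the DNA sequence.
    """
    if aa_start < 1 or aa_end < aa_start:
        raise ValueError("expected 1 <= aa_start <= aa_end")
    nt_start = orf_start + (aa_start - 1) * 3
    nt_end = orf_start + aa_end * 3 - 1
    return nt_start, nt_end


def extract_domain(dna, aa_start, aa_end, orf_start=1):
    """Return the nucleotides that encode the given protein span."""
    nt_start, nt_end = protein_span_to_nucleotide(aa_start, aa_end, orf_start)
    return dna[nt_start - 1:nt_end]


if __name__ == "__main__":
    # Residues 2-3 of the ORF ATG GCC AAA TTT are encoded by GCCAAA.
    print(extract_domain("ATGGCCAAATTT", 2, 3))
```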

Vector systems

Vector systems were annotated using the Addgene online tool. Features in the gtf file were defined using "Feature Type" according to the Addgene Feature Options labels. Only the predicted ORF for the CAR construct was added. All annotation information is provided in GTF format.