Sequence Retrieval and Annotation Process

Sequences (fasta)

Information on the DNA and protein sequences of CAR T cell products (CAR constructs and vector systems) was collected from the literature and patents (see Sources of Sequences). All DNA sequences were extracted from those sources using a Jupyter notebook (translate_sequences.ipynb). The extraction relies on a custom template-matching approach implemented in the Python script template_matching.py: the structural similarity index (SSIM) is used to detect nucleotide templates (A, C, G, T) in .png images, and the detected letters are assembled into a nucleotide sequence string.
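To illustrate the idea behind SSIM-based template matching, here is a minimal sketch. It is not the code in template_matching.py. It assumes that each letter has already been cropped to a grayscale patch of the same size as the templates, and it computes one global SSIM value per patch rather than a windowed SSIM. The function names (`global_ssim`, `classify_glyph`, `read_sequence`) and the toy 3x3 templates are hypothetical.

```python
def global_ssim(x, y, data_range=255.0):
    """Global (single-window) SSIM between two equally sized grayscale patches.

    Patches are flat lists of pixel intensities.
    """
    if len(x) != len(y) or not x:
        raise ValueError("patches must be non-empty and of equal size")
    n = len(x)
    # Stabilising constants from the standard SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) * (a - mean_x) for a in x) / n
    var_y = sum((b - mean_y) * (b - mean_y) for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    luminance = (2 * mean_x * mean_y + c1) / (mean_x * mean_x + mean_y * mean_y + c1)
    structure = (2 * cov + c2) / (var_x + var_y + c2)
    return luminance * structure


def classify_glyph(patch, templates):
    """Return the template label with the highest SSIM to the patch."""
    return max(templates, key=lambda base: global_ssim(patch, templates[base]))


def read_sequence(patches, templates):
    """Classify each cropped patch and join the labels into a sequence string."""
    return "".join(classify_glyph(patch, templates) for patch in patches)


# Toy 3x3 "glyphs" (flattened row by row); real templates would be letter images.
TEMPLATES = {
    "A": [0, 255, 0, 255, 255, 255, 255, 0, 255],
    "C": [255, 255, 255, 255, 0, 0, 255, 255, 255],
    "G": [255, 255, 0, 255, 0, 255, 255, 255, 255],
    "T": [255, 255, 255, 0, 255, 0, 0, 255, 0],
}

# A slightly noisy rendering of the "T" template.
NOISY_T = [250, 255, 245, 10, 240, 0, 5, 255, 12]

sequence = read_sequence([TEMPLATES["A"], NOISY_T, TEMPLATES["G"]], TEMPLATES)
```

Picking the label with the highest score keeps the sketch simple. A real pipeline would likely also apply a minimum similarity threshold so that unreadable glyphs are flagged rather than silently assigned a base.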

All DNA sequences were translated into protein sequences using the Expasy Translate Tool and saved as *_protein_translated.fasta files. For some CAR products, protein sequences were provided directly in the corresponding patents; these original sequences were included whenever available.
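The translation step can also be reproduced locally for spot checks. The sketch below is an illustration only, not part of the described workflow (which used the Expasy Translate Tool). It applies the standard genetic code to the first forward reading frame; the names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, to_stop=True):
    """Translate reading frame 1 of a DNA string into a protein string.

    Whitespace is ignored, codons with unknown bases become 'X', and an
    incomplete trailing codon is dropped. With to_stop=True, translation
    ends before the first stop codon; otherwise stops are written as '*'.
    """
    seq = "".join(dna.split()).upper()
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


protein = translate("ATGGCCCTGTAA")
```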

Discrepancies Between Nucleotide and Protein Sequences

In several cases, we observed discrepancies between the nucleotide sequences and the corresponding protein sequences provided in the patents.
When translated with Expasy, the nucleotide sequences did not always align with the listed protein sequences.

For example, in WO2022116086A1, the original protein sequence contains an extra 'TS' that is not supported by the nucleotide sequence, which suggests an inconsistency in the source data.

Therefore, both translated and original protein sequences are provided to ensure transparency and traceability.
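A discrepancy of this kind can be located by comparing the translated sequence with the original one. The following is a minimal sketch under the assumption that the two sequences differ in a single contiguous region; the function name `locate_difference` is hypothetical, and the short sequences are invented toy data, not the patent sequences.

```python
def locate_difference(translated, original):
    """Locate a single contiguous difference between two protein strings.

    Returns (position, translated_segment, original_segment), where position
    is the 1-based index of the first differing residue, or None if the
    sequences are identical.
    """
    if translated == original:
        return None
    shortest = min(len(translated), len(original))
    # Length of the shared prefix.
    prefix = 0
    while prefix < shortest and translated[prefix] == original[prefix]:
        prefix += 1
    # Length of the shared suffix, capped so it cannot overlap the prefix.
    suffix = 0
    while (suffix < shortest - prefix
           and translated[-1 - suffix] == original[-1 - suffix]):
        suffix += 1
    return (
        prefix + 1,
        translated[prefix:len(translated) - suffix],
        original[prefix:len(original) - suffix],
    )


# Invented toy sequences: the "original" carries two extra residues.
difference = locate_difference("MALPVAALLLP", "MALPVTSAALLLP")
```

Here `difference` is `(6, "", "TS")`: nothing in the translated sequence corresponds to the two residues inserted at position 6 of the original. For sequences with several differences, a proper pairwise alignment would be the more suitable tool.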

The annotation process for CAR constructs and vector systems is described below in the Annotations section. To ensure correct sequences, the CAR constructs were subsequently re-checked manually.

Sources of Sequences

CAR T Cell Product	Original Source	CAR Sequence (DNA)	CAR Sequence (Protein)	Vector Sequence
Ciltacel	Patent WO2022116086A1	SEQ ID NO: 9–16	SEQ ID NO: 17; translated from DNA	Not available
Ciltacel	Patent US20230270786A1	SEQ ID NO: 9–16	SEQ ID NO: 17; translated from DNA	Not available
Ciltacel (Oezdemirli et al.)	Supplementary Figure S1, Ozdemirli et al.	Highlighted DNA sequence of CAR construct	Translated from DNA	Vector sequence from 5'UTR to 3'UTR
Ciltacel (Braun et al.)	Braun et al.	Reverse engineered DNA sequence provided at GitHub	Translated from reverse engineered DNA	Full reverse engineered vector sequence provided at GitHub
Idecel	Patent WO2021091978A1	Sequence No. 10	Sequence No. 9; translated from DNA	Sequence No. 36
Tisacel	Patent US 9,499,629 B2	SEQ ID NO: 8	SEQ ID NO: 12; translated from DNA	SEQ ID NO: 1
Axicel	DrugBank: Roberts et al. → Kochenderfer et al. → GenBank HM852952	GenBank ID HM852952	Translated from DNA	Not available
Hu19-CD28Z	Brudno et al. → NCBI MN698642.1	NCBI entry MN698642.1	Translated from DNA	Not available

Annotations (gtf)

CAR constructs

For annotation, known nucleotide sequences of CAR construct parts were retrieved from NCBI, following this schematic.

Domains retrieved:
- CD28, CD3ζ, 41BB, CD8, CSF2RA
- Accession numbers: "NM_001378516.1", "NM_171827.4", "NM_001561.6", "NM_001410981.1", "NR_027760.3"

Fetched Sequences:
fetched_sequences.fasta
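One way to retrieve these records is the NCBI E-utilities `efetch` endpoint. The sketch below is an assumption about how such a download could look, not a description of how fetched_sequences.fasta was actually produced. The helper names (`build_efetch_url`, `fetch_fasta`) are hypothetical, and the accession list is copied from above without pairing accessions to domain names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accession numbers as listed in this document.
ACCESSIONS = [
    "NM_001378516.1",
    "NM_171827.4",
    "NM_001561.6",
    "NM_001410981.1",
    "NR_027760.3",
]


def build_efetch_url(accessions):
    """Build an efetch URL that requests the records as FASTA text."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accessions, timeout=30):
    """Download the records (needs network access) and return FASTA text."""
    with urlopen(build_efetch_url(accessions), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = build_efetch_url(ACCESSIONS)

# Example use (commented out so that the sketch runs without a network):
# with open("fetched_sequences.fasta", "w") as handle:
#     handle.write(fetch_fasta(ACCESSIONS))
```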

Each CAR construct was aligned against the retrieved domain sequences using BLAST. When applicable, annotations were compared with the original source data (annotation_from_nucleotide_seq.json) and compiled into a .gtf file.
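To show how alignment hits could end up in a .gtf file, the sketch below formats hits as nine-column, tab-separated GTF lines. It is a simplified illustration, not the logic of get_annotations.py: the hit dictionaries, their coordinates, the `misc_feature` feature type, the `domain` attribute, and the function names are all assumptions made for this example.

```python
def gtf_line(seqname, source, feature, start, end, attributes,
             score=".", strand="+", frame="."):
    """Format one GTF record (9 tab-separated columns, 1-based inclusive)."""
    if start < 1 or end < start:
        raise ValueError("GTF coordinates must satisfy 1 <= start <= end")
    attribute_field = " ".join(
        f'{key} "{value}";' for key, value in attributes.items()
    )
    return "\t".join([
        seqname, source, feature, str(start), str(end),
        str(score), strand, frame, attribute_field,
    ])


def hits_to_gtf(seqname, hits, source="BLAST"):
    """Convert domain hits into GTF lines, sorted by start position."""
    lines = []
    for hit in sorted(hits, key=lambda h: h["start"]):
        attributes = {
            "gene_id": seqname,
            "transcript_id": seqname,
            "domain": hit["domain"],  # hypothetical extra attribute
        }
        lines.append(gtf_line(seqname, source, "misc_feature",
                              hit["start"], hit["end"], attributes))
    return lines


# Invented coordinates, for illustration only.
example_hits = [
    {"domain": "CD3z", "start": 400, "end": 735},
    {"domain": "CD28", "start": 100, "end": 399},
]
gtf_lines = hits_to_gtf("CAR_example", example_hits)
```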

All CAR sequences were additionally translated into protein sequences and screened for protein domains with SMART. Based on the protein positions of the predicted domains, the corresponding nucleotide sequences were extracted with find_nucleotide_from_protein.py and added to the annotation file annotation_from_nucleotide_seq.json.
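Mapping a domain from protein coordinates back to nucleotide coordinates is a matter of codon arithmetic. The sketch below shows the calculation. It is not the code in find_nucleotide_from_protein.py, and it assumes 1-based inclusive coordinates and an uninterrupted coding sequence. The function names and the `cds_start` parameter are hypothetical.

```python
def protein_to_nucleotide_range(aa_start, aa_end, cds_start=1):
    """Map a 1-based inclusive residue range to a 1-based inclusive
    nucleotide range.

    cds_start is the nucleotide position of the first base of the first
    codon (1 if the DNA sequence starts with the coding sequence).
    """
    if aa_start < 1 or aa_end < aa_start or cds_start < 1:
        raise ValueError("invalid coordinates")
    nt_start = cds_start + (aa_start - 1) * 3
    nt_end = cds_start + aa_end * 3 - 1
    return nt_start, nt_end


def extract_domain(dna, aa_start, aa_end, cds_start=1):
    """Return the nucleotides that encode residues aa_start..aa_end."""
    nt_start, nt_end = protein_to_nucleotide_range(aa_start, aa_end, cds_start)
    if nt_end > len(dna):
        raise ValueError("domain extends past the end of the DNA sequence")
    # Convert from 1-based inclusive to Python's 0-based half-open slicing.
    return dna[nt_start - 1:nt_end]


# Residues 2-3 of the peptide encoded by ATG GCC CTG TAA.
domain_dna = extract_domain("ATGGCCCTGTAA", 2, 3)
```

In this toy example `domain_dna` is `"GCCCTG"`, the two codons for residues 2 and 3.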

Full Annotation Script: get_annotations.py
Annotation File: annotation_from_nucleotide_seq.json

Vector systems

For vector systems, annotation was done using the Addgene online tool. Features within the GTF file were defined using the "Feature Type" according to the Addgene Feature Options labels. Only the predicted ORF for the CAR construct was added. All annotation information is provided in GTF format.