Reference Building

When is a reference built?

Depending on the input parameters, the pipeline either uses an existing gene expression reference or builds one on the fly. If a prebuilt reference is available and passed via the --gene_expression_reference parameter, the pipeline uses it and skips reference generation. Otherwise, it builds a new reference from the provided input.

How does the build process work?

The reference-building process is based on the pre-build scripts of the official 10x Genomics human references for 2024 and 2020. Both modified scripts can be found in the templates directory. They have been adapted for use in a Nextflow process but can still be reviewed manually.

Input

There are 7 parameters involved in building a custom reference:

  • gene_expression_source_fa: <path>
  • gene_expression_source_gtf: <path>
  • gene_expression_source_fa_url: <map>
  • gene_expression_source_gtf_url: <map>
  • gene_expression_reference_version: <'2020'/'2024'>
  • gene_expression_car_fa: <path>
  • gene_expression_car_gtf: <path>

To build a reference, at least a sequence file (FASTA / .fa) and an annotation file (GTF / .gtf) are needed. The source files can be defined using the gene_expression_source_fa and gene_expression_source_gtf parameters. If they are not defined, they are downloaded at runtime from the URLs defined with the gene_expression_source_fa_url and gene_expression_source_gtf_url parameters.
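The lookup order for a source file can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the function name resolve_source is hypothetical, each file is assumed to be resolved independently, the URL map is assumed to be keyed by reference version ('2020'/'2024'), and the URL in the usage lines is a placeholder.

```python
from typing import Mapping, Optional, Tuple


def resolve_source(
    local_path: Optional[str],
    url_map: Mapping[str, str],
    version: str = "2024",
) -> Tuple[str, str]:
    """Return ("local", path) or ("download", url) for one source file.

    Assumption: url_map is keyed by reference version, e.g. "2020" or "2024".
    """
    if local_path:
        # A locally defined file is used directly; no download is needed.
        return ("local", local_path)
    if version not in url_map:
        raise ValueError(f"no download URL configured for version {version!r}")
    # No local file: fall back to the URL registered for this version.
    return ("download", url_map[version])


# Placeholder URL, for illustration only.
fa_urls = {"2024": "https://example.org/genome_2024.fa.gz"}

print(resolve_source("/data/genome.fa", fa_urls))  # uses the local file
print(resolve_source(None, fa_urls))               # falls back to the URL
```

The same function would apply to the GTF file with gene_expression_source_gtf and gene_expression_source_gtf_url.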

Attention

The provided source files (.fa and .gtf) should match those used in the 10x 2020/2024 reference builds, as the scripts include version-specific filtering steps. Users who wish to use references other than those from 10x must build the reference themselves and provide it via the gene_expression_reference parameter.

To additionally concatenate a CAR construct, provide the CAR sequence file via the gene_expression_car_fa parameter and the CAR annotation file via the gene_expression_car_gtf parameter.

The gene_expression_reference_version parameter determines which URL and which build script version are used in the process. It can be either '2020' or '2024' and defaults to '2024'. This means that the template templates/build_reference_2024.sh is used by default.
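As a rough sketch, the version-to-template mapping might look like the snippet below. The function name select_template is hypothetical, and the 2020 file name is assumed by analogy with templates/build_reference_2024.sh; check the templates directory for the actual name.

```python
TEMPLATES = {
    "2024": "templates/build_reference_2024.sh",
    # Assumption: named by analogy with the 2024 template.
    "2020": "templates/build_reference_2020.sh",
}


def select_template(version: str = "2024") -> str:
    """Map gene_expression_reference_version to a build script template."""
    if version not in TEMPLATES:
        allowed = ", ".join(sorted(TEMPLATES))
        raise ValueError(f"unsupported version {version!r}; expected one of: {allowed}")
    return TEMPLATES[version]


print(select_template())        # default version is '2024'
print(select_template("2020"))
```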

Process

If either of the source files is not already provided, they are downloaded in the GET_GEX_SOURCE process. These files are then passed to BUILD_GEX_REFERENCE. If CAR files are defined, they are also included as inputs.
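The control flow described above can be modelled in a few lines. This is a sketch, not the workflow code: plan_processes and car_inputs are hypothetical helpers, params is a plain dictionary standing in for the pipeline parameters, and raising an error when only one CAR file is given is an assumption.

```python
from typing import Dict, List, Optional

Params = Dict[str, Optional[str]]


def plan_processes(params: Params) -> List[str]:
    """List the processes that would run for a given parameter set."""
    if params.get("gene_expression_reference"):
        # A prebuilt reference was supplied: nothing to download or build.
        return []
    steps: List[str] = []
    sources = ("gene_expression_source_fa", "gene_expression_source_gtf")
    if not all(params.get(name) for name in sources):
        # At least one source file is missing, so the download step runs.
        steps.append("GET_GEX_SOURCE")
    steps.append("BUILD_GEX_REFERENCE")
    return steps


def car_inputs(params: Params) -> List[str]:
    """Return the optional CAR inputs for BUILD_GEX_REFERENCE."""
    fa = params.get("gene_expression_car_fa")
    gtf = params.get("gene_expression_car_gtf")
    if fa and gtf:
        return [fa, gtf]
    if fa or gtf:
        # Assumption: a lone CAR file is treated as a configuration error.
        raise ValueError("CAR FASTA and CAR GTF must be provided together")
    return []


print(plan_processes({}))
print(plan_processes({"gene_expression_source_fa": "genome.fa",
                      "gene_expression_source_gtf": "genes.gtf"}))
```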

Within the build process, the appropriate script template is selected based on the gene_expression_reference_version parameter. The official 10x scripts are modified to additionally concatenate the CAR files, if provided, into the source files just before the reference is built using cellranger mkref.
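The concatenation step can be illustrated with the following sketch. It is not the template's actual code (the templates are shell scripts): append_file and mkref_command are hypothetical helpers, the genome name and file names are placeholders, and the command is only assembled, not executed. The --genome, --fasta and --genes options are the standard cellranger mkref arguments; the real templates may pass additional ones.

```python
import os
import shutil
from pathlib import Path
from typing import List


def ends_with_newline(path: Path) -> bool:
    """Check the last byte without reading the whole (possibly large) file."""
    with path.open("rb") as fh:
        fh.seek(0, os.SEEK_END)
        if fh.tell() == 0:
            return True  # empty file: no separator needed
        fh.seek(-1, os.SEEK_END)
        return fh.read(1) == b"\n"


def append_file(extra: Path, target: Path) -> None:
    """Append extra to target, keeping records on separate lines."""
    needs_separator = not ends_with_newline(target)
    with extra.open("rb") as src, target.open("ab") as dst:
        if needs_separator:
            dst.write(b"\n")
        shutil.copyfileobj(src, dst)  # streams in chunks


def mkref_command(genome: str, fasta: Path, gtf: Path) -> List[str]:
    """Assemble a cellranger mkref call for the concatenated files."""
    return [
        "cellranger", "mkref",
        f"--genome={genome}",
        f"--fasta={fasta}",
        f"--genes={gtf}",
    ]
```

A caller would run append_file once for the CAR FASTA and once for the CAR GTF, then pass the result of mkref_command to a process runner. Checking the trailing newline guards against a CAR header being glued onto the last line of the source file.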