Extending the pipeline

If your use case needs processes that the current implementation does not cover, and you want them to run alongside the existing ones, you can extend the existing workflows with your own. To do so, clone the pipeline repository from GitHub and locate the source files that define the process behavior.

Adding a custom process

Where to find your process input

If you only need to add a process that wraps a tool you rely on, first identify the inputs your tool needs. These can be anything from simple integer values to files produced by other processes. Once you know your inputs, you can decide where to place the process.

The pipeline behavior is defined in multiple Nextflow files, each defining a workflow with its own inputs and outputs. To avoid changing any of these workflow inputs or outputs, place your process in a script whose workflow already provides all the variables you need. If your dependencies are not available in a single workflow, you have to change the outputs emitted and the inputs received by the workflows involved, and apply the same changes where those workflows are called in the primary workflow defined in main.nf.

To help you find the right script, the lists below show the most important variables available in the workflow defined in each script.

handle_references.nf

Source                    Variable name        Description
Workflow input            carFasta, carGtf     CAR-Construct files
Process: BUILD_REFERENCE  BUILD_REFERENCE.out  Custom gene expression reference

quality_control.nf

Source                 Variable name      Description
Workflow input         samples            Input samples provided by the samplesheet
Workflow input         cellrangerReports  Web-summary reports produced by the CELLRANGER_MULTI process
Process: FASTQC        FASTQC.out         FastQC reports
Process: FASTQ_SCREEN  FASTQ_SCREEN.out   FastQ_Screen reports
Process: MULTIQC       MULTIQC.out        MultiQC report

secondary_analysis.nf

Source                     Variable name                                 Description
Workflow input             samples                                       Input samples provided by the samplesheet
Workflow input             gexReference, vdjReference, featureReference  References used by the CELLRANGER_MULTI process
Workflow input             carFasta, carGtf                              CAR-Construct files
Process: CELLRANGER_MULTI  CELLRANGER_MULTI.out.*                        Multiple relevant outputs produced by the CELLRANGER_MULTI process for each sample
Process: SEURAT_OBJECT     SEURAT_OBJECT.out                             Seurat object built based on the CELLRANGER_MULTI output
Process: CAR_METRICS       CAR_METRICS.out.*                             Multiple relevant outputs produced by the CAR_METRICS process
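
If the variables you need are spread across several workflows, changing the workflow outputs and inputs amounts to adding an emit entry to the workflow that produces the data and a take entry to the workflow that consumes it, then connecting the two in the calling workflow. The following is a minimal, self-contained sketch of that pattern, not the pipeline's actual code: the names PRODUCER, CONSUMER, MAKE_REPORT, USE_REPORT and report are hypothetical placeholders.

// Hypothetical process standing in for an existing pipeline process
process MAKE_REPORT {
    output:
    path 'report.txt'

    script:
    """
    echo "example report" > report.txt
    """
}

// Hypothetical process standing in for your custom process
process USE_REPORT {
    input:
    path report

    output:
    stdout

    script:
    """
    cat ${report}
    """
}

// The workflow that owns the data exposes it through a named emit entry
workflow PRODUCER {
    main:
    MAKE_REPORT()

    emit:
    report = MAKE_REPORT.out
}

// The workflow that needs the data declares it as an additional take entry
workflow CONSUMER {
    take:
    report

    main:
    USE_REPORT(report)
    USE_REPORT.out.view()
}

// The calling workflow (main.nf in the pipeline) passes the output along
workflow {
    PRODUCER()
    CONSUMER(PRODUCER.out.report)
}

In the pipeline itself, the equivalent edits would go into the two workflow scripts involved and into the place in main.nf where those workflows are called.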

Adding a new process to a workflow

Once you have found a workflow that provides all the data you need, you can integrate your process into its workflow definition. As an example, we are going to add a new process that takes the web summary produced by the CELLRANGER_MULTI process for each sample and modifies it in some way. For this example, it is reasonable to place the process definition in the workflows/secondary_analysis.nf script, where you can access the process output directly:

workflow SECONDARY_ANALYSIS {
    take:
    ...

    main:
    ...

    CELLRANGER_MULTI(...)
    CUSTOM_PROCESS( CELLRANGER_MULTI.out.webSummary )

    ...
}

The process definition itself is relatively straightforward:

process CUSTOM_PROCESS {
    // your directives

    input:
    path webSummary

    output:
    path 'modified_web_summary.html'

    script:
    """
    # add your code here
    """
}

The Nextflow documentation explains process definitions in detail, so consult it if you need help with the basic syntax.
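
While developing, it can help to run the new process on its own before connecting it to the SECONDARY_ANALYSIS workflow. The sketch below assumes a hypothetical standalone script and a hypothetical parameter params.webSummary that points to an existing web summary file. The shell commands in the script block are only a placeholder for your own tool.

// Hypothetical parameter: path to a web summary file used for testing
params.webSummary = 'web_summary.html'

process CUSTOM_PROCESS {
    input:
    path webSummary

    output:
    path 'modified_web_summary.html'

    script:
    """
    # Placeholder modification: copy the report and append an HTML comment
    cp ${webSummary} modified_web_summary.html
    echo '<!-- processed by CUSTOM_PROCESS -->' >> modified_web_summary.html
    """
}

workflow {
    // Feed the single test file into the process and print the resulting path
    CUSTOM_PROCESS( Channel.fromPath(params.webSummary, checkIfExists: true) )
    CUSTOM_PROCESS.out.view()
}

Assuming the script is saved as test_custom_process.nf (a hypothetical file name), it could be run like this:

nextflow run test_custom_process.nf --webSummary /path/to/web_summary.html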

Common directives

Important directives you might find used in other processes are the label, publishDir and fair directives. The label directive is typically used to move process configuration into the nextflow.config file. Most often, labels define the container a process runs in. If you want your process to run in a containerized environment, you can define a custom label in nextflow.config: under the singularity section, add an entry for your label to the process subsection like this:

singularity {
    singularity.enabled = true
    singularity.autoMounts = true

    process {
        withLabel: module_cellranger { ... }

        withLabel: module_customprocess {
            container = "/your/container.sif"
        }
    }
}

You can then use the label directive in your own process definition to run it in that container whenever the pipeline runs with the singularity profile.

The publishDir directive publishes the output produced by your process to a predefined directory once the process is done. This is very useful when you want to use the process output without having to search through the working directories.

Another directive commonly used in combination with the publishDir directive is the tag directive. If your process is supposed to run multiple times (for example, once for each sample), you can use the tag directive to identify each instance by a name. The tag itself only labels the task in the execution log. If you also use the same identifier in the publishDir path, each instance gets its own subdirectory in the defined location, which avoids collisions.

The fair directive, when enabled, guarantees that a process emits its outputs in the order in which the corresponding inputs were received.

You can find more information about directives and how to use them in the official Nextflow documentation.
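
Put together, a process using the label, tag and publishDir directives might look like the sketch below. It assumes that the input channel emits a tuple of a sample name and the web summary file, which may not match the actual structure of CELLRANGER_MULTI.out.webSummary. The names sampleName, params.outdir, params.webSummaries and the custom subdirectory are hypothetical and need to be adapted to the pipeline.

// Hypothetical parameters, used only to make the sketch self-contained
params.outdir = 'results'
params.webSummaries = 'summaries/*.html'

process CUSTOM_PROCESS {
    // Applies the configuration defined under withLabel: module_customprocess
    label 'module_customprocess'
    // Shows the sample name next to each task in the execution log
    tag "${sampleName}"
    // Copies the output into one subdirectory per sample
    publishDir "${params.outdir}/custom/${sampleName}", mode: 'copy'

    input:
    tuple val(sampleName), path(webSummary)

    output:
    path 'modified_web_summary.html'

    script:
    """
    # Placeholder for your own tool
    cp ${webSummary} modified_web_summary.html
    """
}

workflow {
    // Assumption: the file name without extension serves as the sample name
    summaries = Channel.fromPath(params.webSummaries)
        .map { f -> tuple(f.baseName, f) }

    CUSTOM_PROCESS(summaries)
}

Because the sample name appears in the publishDir path, two samples that both produce modified_web_summary.html do not overwrite each other.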