Extending the pipeline
If your use case needs additional processes that are not covered by the current implementation and that should run alongside the existing ones, you can extend the existing workflows with your own. To do so, clone the pipeline repository from GitHub and locate the source files that define the process behavior.
Adding a custom process
Where to find your process input
If you only need to add a process that wraps a tool you rely on, first identify the inputs the tool needs. Inputs can range from simple integer values to files produced by other processes. Once you know what you need, decide where to place the process. The pipeline behavior is defined in multiple Nextflow files, each defining a workflow with its own inputs and outputs. To avoid changing any of these workflow inputs or outputs, place your process in a script whose workflow already has access to all the variables you need. To help you find the right script, the tables below list the most important variables available in the workflow defined in each script. If your dependencies are not available in a single workflow, you will have to change the outputs emitted and the inputs received by the workflows involved, and apply the same changes to the primary workflow defined in main.nf.
handle_references.nf
| Source | Variable name | Description |
|---|---|---|
| Workflow input | carFasta, carGtf | CAR-Construct files |
| Process: BUILD_REFERENCE | BUILD_REFERENCE.out | Custom gene expression reference |
quality_control.nf
| Source | Variable name | Description |
|---|---|---|
| Workflow input | samples | Input samples provided by the samplesheet |
| Workflow input | cellrangerReports | Web-summary reports produced by the CELLRANGER_MULTI process |
| Process: FASTQC | FASTQC.out | FastQC reports |
| Process: FASTQ_SCREEN | FASTQ_SCREEN.out | FastQ_Screen reports |
| Process: MULTIQC | MULTIQC.out | MultiQC report |
secondary_analysis.nf
| Source | Variable name | Description |
|---|---|---|
| Workflow input | samples | Input samples provided by the samplesheet |
| Workflow input | gexReference, vdjReference, featureReference | References used by the CELLRANGER_MULTI process |
| Workflow input | carFasta, carGtf | CAR-Construct files |
| Process: CELLRANGER_MULTI | CELLRANGER_MULTI.out.* | Multiple relevant outputs produced by the CELLRANGER_MULTI process for each sample |
| Process: SEURAT_OBJECT | SEURAT_OBJECT.out | Seurat object built based on the CELLRANGER_MULTI output |
| Process: CAR_METRICS | CAR_METRICS.out.* | Multiple relevant outputs produced by the CAR_METRICS process |
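When the variables you need are spread across two workflows, the usual Nextflow pattern is to expose the missing channel through an emit block in one workflow, accept it through a take block in the other, and connect the two in the entry workflow. The following is a minimal, self-contained sketch of that pattern only. All process and workflow names in it (MAKE_REPORT, COUNT_LINES, PRODUCER, CONSUMER) are hypothetical stand-ins and do not exist in the pipeline; in the real code base the wiring would happen in main.nf between the existing workflows.

```nextflow
nextflow.enable.dsl = 2

// Hypothetical process standing in for one that lives in the first workflow
process MAKE_REPORT {
    input:
    val sampleId

    output:
    path "${sampleId}_report.txt"

    script:
    """
    echo "report for ${sampleId}" > ${sampleId}_report.txt
    """
}

// Hypothetical process standing in for your custom process in the second workflow
process COUNT_LINES {
    input:
    path report

    output:
    stdout

    script:
    """
    wc -l < ${report}
    """
}

// First workflow: exposes the process output under a named emit
workflow PRODUCER {
    take:
    sampleIds

    main:
    MAKE_REPORT(sampleIds)

    emit:
    reports = MAKE_REPORT.out
}

// Second workflow: receives the channel as an additional input
workflow CONSUMER {
    take:
    reports

    main:
    COUNT_LINES(reports)

    emit:
    counts = COUNT_LINES.out
}

// Entry workflow (the role main.nf plays): connects emit to take
workflow {
    PRODUCER(Channel.of('sampleA', 'sampleB'))
    CONSUMER(PRODUCER.out.reports)
    CONSUMER.out.counts.view()
}
```

The point of the sketch is that three places change together: the emit block of the workflow that owns the data, the take block of the workflow that needs it, and the call in the entry workflow that passes one to the other.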
Adding a new process to a workflow
Once you have found a workflow with all the data you need, you can integrate your process into its workflow definition. As an example, we are going to add a new process that takes the web summary produced by the CELLRANGER_MULTI process for each sample and modifies it in some way. For this example, it is reasonable to place the process definition in the workflows/secondary_analysis.nf script, where you can access the process output directly like this:
workflow SECONDARY_ANALYSIS {
    take:
    ...
    main:
    ...
    CELLRANGER_MULTI(...)
    CUSTOM_PROCESS( CELLRANGER_MULTI.out.webSummary )
    ...
}
The process definition itself is relatively straightforward:
process CUSTOM_PROCESS {
    // your directives
    input:
    path webSummary
    output:
    path 'modified_web_summary.html'
    script:
    """
    # add your code here
    """
}
Process definitions are explained in detail in the Nextflow documentation, so consult it if you need help with the basic syntax.
Common directives
Important directives you will find used in other processes are the label, publishDir and fair directives. The label directive is usually used to move process configuration into the nextflow.config file. Most of the time, labels are used to define the container a process runs in. If you want your process to run in a containerized environment, you can define a custom label in nextflow.config. Under the singularity section, you can add a new label by adding an entry to the process subsection like this:
singularity {
    singularity.enabled = true
    singularity.autoMounts = true
    process {
        withLabel: module_cellranger { ... }
        withLabel: module_customprocess {
            container = "/your/container.sif"
        }
    }
}
You can then use the label directive in your own process definition to run it in that container whenever the pipeline is executed with the singularity profile. The publishDir directive publishes the output produced by your process to a predefined directory once the process is done. This is very useful when you want to use the process output without having to search through the working directories. Another directive commonly used in combination with publishDir is the tag directive. If your process is supposed to run multiple times (for example, once for each sample), you can use the tag directive to identify each instance by a name. The tag itself only labels the task in the execution log; if you also include the same identifier in the publishDir path, each instance gets its own subdirectory in the defined location, which avoids collisions. You can find more information about directives and how to use them in the official Nextflow documentation.
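To show how these directives can fit together, here is a minimal sketch of the custom process with label, tag and publishDir applied. It rests on several assumptions that are not confirmed by the pipeline: the input is assumed to arrive as a (sample name, file) tuple, params.outdir and params.reports are hypothetical parameters, and the input directory layout in the entry workflow exists only to make the sketch self-contained.

```nextflow
nextflow.enable.dsl = 2

// Hypothetical parameters; the pipeline may use different names
params.outdir  = 'results'
params.reports = 'reports'

process CUSTOM_PROCESS {
    // Applies the settings configured under 'withLabel: module_customprocess'
    label 'module_customprocess'
    // Shows the sample name next to each task in the execution log
    tag "${sampleName}"
    // Copies the declared outputs into a per-sample subdirectory
    publishDir "${params.outdir}/custom_process/${sampleName}", mode: 'copy'

    input:
    // Assumption: the upstream channel emits (sample name, file) tuples
    tuple val(sampleName), path(webSummary)

    output:
    tuple val(sampleName), path('modified_web_summary.html')

    script:
    """
    # Placeholder step: copy the report unchanged
    cp ${webSummary} modified_web_summary.html
    """
}

workflow {
    // Assumed layout for this sketch: <params.reports>/<sample>/web_summary.html
    summaries = Channel
        .fromPath("${params.reports}/*/web_summary.html")
        .map { report -> tuple(report.parent.name, report) }

    CUSTOM_PROCESS(summaries)
}
```

Because every task writes a file with the same name, building the sample name into the publishDir path is what keeps the published results from overwriting each other.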