Overview

seqpipe is a collection of scripts that together form a pipeline for processing sequencing data, primarily ChIP-seq and ATAC-seq, with a couple of additional RNA-seq-specific scripts.

Where seqpipe excels is in separating processing into two distinct phases:

  • Per Sample Processing - each sample is treated independently, and processing can occur in parallel
  • Replicate-aware Processing - samples are combined into condition-specific folders so that interdependent samples can be processed together. This is particularly important for ChIP-seq, where samples are paired with cognate inputs, and for methods designed to consider replicates

Each script (see the Reference page) handles a specific file type or step of the process, and is labeled alphabetically in order of execution, designed to run from A through Z. See below for a detailed tutorial on the design and execution of the pipeline.

To download all of the relevant shell scripts, grab the latest pipeline collection from the Downloads page.


Per Sample Processing

Sample Acquisition

Samples are first assumed to be deposited in a core facility, organized such that samples from the same experiment are grouped together, i.e.:

  • Experiment_01
    • SampleA
    • SampleB
      • SampleB_read1.fastq.gz
      • SampleB_read2.fastq.gz
      • ...
      • SampleB_readN.fastq.gz
  • Experiment_02
  • ...

Create a file listing the path to each Experiment_* folder at the core facility; this file, along with a path to an output directory, is passed to the first script, A0_get_samples.sh (see the example invocation below). Thereafter, samples are organized as:

  • SampleA
  • SampleB
    • fastq
      • SampleB_read1.fastq.gz
      • SampleB_read2.fastq.gz
      • ...
      • SampleB_readN.fastq.gz

While this may seem to lose the sample-to-sample relationships (e.g., that SampleA and SampleB were from the same Experiment_01), this information is used at a later step, and is important if, say, SampleA is a ChIP-seq sample and SampleB is its cognate input.
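As a concrete sketch, the paths file might contain one core facility experiment folder per line, and the invocation would pass that file plus the output directory (the file name and exact argument order here are assumptions; check the Reference page for A0_get_samples.sh):

# experiment_paths.txt (name is illustrative): one Experiment_* folder per line
/path/to/core_facility/Experiment_01
/path/to/core_facility/Experiment_02

# hypothetical invocation; confirm the argument order on the Reference page
A0_get_samples.sh experiment_paths.txt /path/to/samples_container/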

Transforming and Cleaning the Data

Thereafter, depending on the sequencing type, processing proceeds through a series of transformations. Each script in the processing pipeline is designed to work on a sample-by-sample basis, i.e., the path to a specific sample is passed in.

To run the next series of steps, for example to process SampleA, an ATAC-seq sample, we would run the following (note that the second argument is the number of threads to use, an important parameter for allocating jobs on a cluster):


B0_PE_chip_atac_process.sh /path/to/SampleA 8
B1_atac_shift.sh /path/to/SampleA 8
B2_finalize_fix_bams.sh /path/to/SampleA 8
C0_make_bigwig.sh /path/to/SampleA 8
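
For comparison, the equivalent calls for a ChIP-seq sample such as SampleB simply skip the shift step (B1):

B0_PE_chip_atac_process.sh /path/to/SampleB 8
B2_finalize_fix_bams.sh /path/to/SampleB 8
C0_make_bigwig.sh /path/to/SampleB 8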

Please see the Reference page for more detail on each script; as shown above, B1_atac_shift.sh is not run for ChIP-seq (or RNA-seq) samples. In either case, what we end up with is the following:

  • SampleA
    • fastqc - output from script B0 to collect QC results
    • bowtie2 - output from script B0, contains all alignment results
    • bams - output from B2, contains the final cleaned-up bam files
      • SampleA.bam
    • bigwig - output from C0 for visualization of track files
      • SampleA.bw

Because this stage operates sample by sample, it is easy to add additional steps to the processing of each sample. The key outputs to track are the .bam and .bw files.
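For example, a hypothetical extra step slots in as just another call against the sample folder (B3_my_extra_step.sh below is a placeholder for your own script, not part of seqpipe):

B2_finalize_fix_bams.sh /path/to/SampleA 8
B3_my_extra_step.sh /path/to/SampleA 8
C0_make_bigwig.sh /path/to/SampleA 8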


Replicate-aware Processing

Combining Samples into Experimental Folders

Recall that SampleA and SampleB came from Experiment_01 and were both processed with our example scripts, as experimental and input samples, respectively. Now we can combine them for the next phase, for steps which require knowledge of their interdependency. We do this by crafting a file, table.txt, which records their relationship as follows:

CONDITION_ID  SAMPLE_ID  INPUT_ID
Experiment_01 SampleA    SampleB

Now we can combine samples by this condition, with the following:

D0_combine_by_condition.sh /path/to/samples_container/ table.txt /path/to/replicate_dir

Which will produce the following directory structure in /path/to/replicate_dir:

  • Experiment_01
    • samples
      • SampleA -> link to SampleA
    • inputs
      • SampleB -> link to SampleB

Thus, the replicate_dir is a lightweight container that organizes samples under condition IDs using a series of symbolic links.
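
You can sanity-check the links with ls -l (the output below is illustrative; the targets point back to the per-sample folders):

ls -l /path/to/replicate_dir/Experiment_01/samples/
# SampleA -> /path/to/samples_container/SampleA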

Processing with Sample-Interdependencies

At this point, we can run the next step in our processing pipeline, peak calling (with one or all of the following callers), using the following commands:

E0_call_peaks_macs2.sh /path/to/Experiment_01 table.txt
E1_call_peaks_music.sh /path/to/Experiment_01 table.txt
E2_call_peaks_bcp.sh /path/to/Experiment_01 table.txt

We still provide the table in case our experiment folder has multiple sample-to-input combinations that we need to test. Note that for samples without an input, an NA value in the INPUT_ID column will generate peak calls agnostic of input.
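
For example, a table covering a second, hypothetical condition with no input might look like:

CONDITION_ID  SAMPLE_ID  INPUT_ID
Experiment_01 SampleA    SampleB
Experiment_02 SampleC    NA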


Creating a UCSC Compatible Trackhub

One of the most important steps is being able to see your data (because seeing is believing when it comes to peak calls). Thus, we can generate a UCSC Track Hub compatible repository, which can then be hosted from your own server and viewed in the UCSC Genome Browser.

The best part is that we can combine samples from the same condition and stack them using the UCSC multiWig container convention, allowing us to visualize replicates on the same track line. This works by taking your replicate-aware directories and writing out the corresponding track hub configuration.

To do this, we point to the container holding our replicate-aware directories so that all of them are written out at once, as follows:

F0_create_trackhub.sh /path/to/experiment_container/ /path/to/hub

This creates the entire hub and required directory structure, and all that needs to happen thereafter are small tweaks to the files to customize the hub name and contact info.
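
For reference, the hub.txt at the top of a UCSC Track Hub has roughly this shape, and the hub name, labels, and email are the fields to tweak (the values below are placeholders):

hub seqpipe_hub
shortLabel My seqpipe hub
longLabel ChIP-seq and ATAC-seq tracks processed with seqpipe
genomesFile genomes.txt
email your_name@your_institution.edu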

For more details, see the UCSC Track Hub documentation.


Collecting your Data

Finally, we may want to stage our data for transfer to a local computer. Sometimes, however, the sample directories get clogged up with files from many intermediate processing steps, so the following script collects the most important parts of each sample into a new transfer repository. This staging directory contains symbolic links to those important parts, at which point an scp command pointing to it can handle the rest.

Z0_collect_data.sh /path/to/super_directory_of_samples /path/to/staging_directory
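
From there, pulling the staged data down to a local machine might look like the following (host and paths are placeholders); note that scp follows the symbolic links and copies the underlying files:

scp -r user@cluster.example.edu:/path/to/staging_directory ./seqpipe_results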

Executing Parallel Jobs

Each script is tailored to work at the sample-folder or experiment-folder level. This allows each sample to be processed as its own individual job when submitting to a cluster, which makes parallelization of the processing straightforward.

This is good because if a certain sample fails in processing, we can easily target it for reprocessing, or if we only want to (re)process a portion of samples, that becomes trivial as well. In other words, each individual sample is a task.

However, this comes with the caveat that we have to construct a wrapper if we want to submit a job for every individual sample. We can automate the creation of this tasklist (which works nicely with simpleQueue and other such task managers) with a tiny bit of bash:


SCRIPT=/path/to/your/script.sh
THREADS=8

# write one task line per sample directory in the container
for I in /path/to/samples_container/*/; do
  SAMPLE=$(readlink -f "$I")
  echo "$SCRIPT $SAMPLE $THREADS" >> /path/to/tasklist
done

Briefly, here we set the SCRIPT and THREADS variables, looped over every sample directory in our samples_container super-directory, resolved each one to a full SAMPLE path, and wrote one task line per sample. So we’d get an output such as:

/path/to/your/script.sh /path/to/SampleA 8
/path/to/your/script.sh /path/to/SampleB 8
...
/path/to/your/script.sh /path/to/SampleZ 8

Note that you may have to prepend each task with a call to set up your environment (e.g., source ~/.bashrc, or load specific modules as needed, depending on your cluster environment).
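
If you are not handing the tasklist to a cluster task manager, one simple way to run it (assuming GNU parallel is installed) is to let parallel execute each line as its own job:

parallel --jobs 4 < /path/to/tasklist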

We can save the above script as a wrapper for specific steps, or write each step out in sequence; the process is designed to be extremely flexible depending on your processing needs.
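
As a sketch, writing several steps in sequence for one sample is just a matter of chaining them in the task line (shown here for an ATAC-seq sample, inside the loop above; the chained line assumes your task runner executes each line through a shell):

echo "B0_PE_chip_atac_process.sh $SAMPLE $THREADS && \
  B1_atac_shift.sh $SAMPLE $THREADS && \
  B2_finalize_fix_bams.sh $SAMPLE $THREADS && \
  C0_make_bigwig.sh $SAMPLE $THREADS" >> /path/to/tasklist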


Caveats to Current Design

The biggest caveat of this repo is that, throughout the scripts, some parameters are hardcoded for my own usage. Further development of this pipeline would replace these hardcoded values with proper parameters; in the meantime, you can swap them out for your own variables when adapting the scripts to your needs. Keep this in mind if you use the pipeline for your own purposes!
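
As a sketch of that kind of adaptation, a hypothetical hardcoded value inside a script can be swapped for a variable you control (the name and path below are placeholders, not actual seqpipe parameters):

# before: hardcoded for the original author's environment
BOWTIE2_INDEX=/home/author/genomes/hg19/bowtie2_index

# after: taken from the environment, with a fallback you set yourself
BOWTIE2_INDEX=${BOWTIE2_INDEX:-/path/to/your/bowtie2_index}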


Download

To download all the seqpipe pipeline scripts, simply navigate to the Downloads page, sling them to your computer, and modify them to your heart's content! For running one-off steps, the Reference page for each individual script contains, in its code section, all the code needed to run that specific step.