seqpipeData Flow Tutorial
seqpipe is a collection of scripts which together form a pipeline for processing sequencing data, primarily ChIP-seq and ATAC-seq, with a couple of additional RNA-seq-specific scripts.
Where seqpipe excels is in separating processing into two distinct phases: a sample-by-sample phase, and a condition-level phase that combines related samples.
Each script (see the Reference page) handles a specific file type or step of the process, and the scripts are named so that alphabetical order matches execution order, running from A through Z. See below for a detailed tutorial on the design and execution of the pipeline.
To get all the relevant shell scripts, download the latest pipeline collection from the Downloads page.
Samples are first assumed to be deposited in a core facility, organized such that samples from the same experiment are grouped together, e.g., an Experiment_01 folder containing SampleA and SampleB.
After creating a file with the paths to each Experiment_* folder in the core facility, these locations are passed to the first script, A0_get_samples.sh, along with a path to an output directory; thereafter, each sample lives in its own directory under that output location. While this may seem to lose the sample-to-sample relationship (e.g., that SampleA and SampleB came from the same Experiment_01), this information is used at a later step, and it is important if, say, SampleA is a ChIP-seq sample and SampleB is its cognate input.
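To make that first step concrete, here is a minimal sketch using throwaway directories so the commands run anywhere; the A0_get_samples.sh invocation is left commented out because its argument order here is an assumption, so check the Reference page before running it for real.

```shell
# Mock a core facility layout in a temporary directory; replace CORE with
# your facility's real deposit location.
CORE=$(mktemp -d)
mkdir -p "$CORE"/Experiment_01/SampleA "$CORE"/Experiment_01/SampleB

# Build the file of Experiment_* paths, one per line.
ls -d "$CORE"/Experiment_* > experiment_paths.txt
cat experiment_paths.txt                 # one Experiment_* path per line

# Hand the paths file and an output directory to the first script
# (argument order assumed; see the Reference page):
# A0_get_samples.sh experiment_paths.txt /path/to/samples_container
```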
Thereafter, depending on the sequencing type, processing proceeds through a series of transformations. Each script in the processing pipeline is designed to work on a sample-by-sample basis, i.e., the path to a specific sample is passed in.
To run the next series of steps, for example to process SampleA (an ATAC-seq sample), we would run the following. Note that the second argument is the number of threads to use, an important parameter for allocating jobs on a cluster:
```shell
B0_PE_chip_atac_process.sh /path/to/SampleA 8
B1_atac_shift.sh /path/to/SampleA 8
B2_finalize_fix_bams.sh /path/to/SampleA 8
C0_make_bigwig.sh /path/to/SampleA 8
```
Please see the Reference page for more detail on each script, and note that B1_atac_shift.sh is not run for ChIP-seq (or RNA-seq) samples. What we end up with is the following:
- fastqc - output from script B0, collecting QC results
- bowtie2 - output from script B0, containing all alignment results
- bams - output from B2, containing the final cleaned-up bam files
- bigwig - output from C0, for visualization of track files
With this sample-by-sample processing stage, we can see it is easy to add additional steps for each sample. The key outputs to track are the bams and bigwig directories. Recall that SampleA and SampleB came from Experiment_01 and were both processed with our example script; they are the experimental and input samples, respectively. Now we can combine them in this next phase, for steps which require knowing their interdependency. We do this by crafting a file, table.txt, which gives their relationships as follows:
```
CONDITION_ID    SAMPLE_ID    INPUT_ID
Experiment_01   SampleA      SampleB
```
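As a convenience, table.txt can also be generated from the shell. Tab-separated columns are an assumption here, so verify the delimiter the downstream scripts actually expect on the Reference page.

```shell
# Write the header and one condition row; each row maps a sample to its
# cognate input under a condition ID.
printf 'CONDITION_ID\tSAMPLE_ID\tINPUT_ID\n' >  table.txt
printf 'Experiment_01\tSampleA\tSampleB\n'   >> table.txt
cat table.txt
```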
Now we can combine samples by this condition, with the following:
```shell
D0_combine_by_condition.sh /path/to/samples_container/ table.txt /path/to/replicate_dir
```
This will produce the following directory structure in replicate_dir:
```
Experiment_01/
├── SampleA -> link to SampleA
└── SampleB -> link to SampleB
```
replicate_dir is a lightweight container that organizes samples under condition IDs using a series of symbolic links.
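To make the symlink idea concrete, the following sketch builds an equivalent layout by hand, with mocked paths so it runs anywhere; the real D0 script derives all of this from table.txt instead.

```shell
# Mock a samples container and a replicate container in a temp directory.
ROOT=$(mktemp -d)
mkdir -p "$ROOT"/samples_container/SampleA "$ROOT"/samples_container/SampleB
mkdir -p "$ROOT"/replicate_dir/Experiment_01

# The condition directory just holds symlinks back to the per-sample dirs.
ln -s "$ROOT"/samples_container/SampleA "$ROOT"/replicate_dir/Experiment_01/SampleA
ln -s "$ROOT"/samples_container/SampleB "$ROOT"/replicate_dir/Experiment_01/SampleB

ls -l "$ROOT"/replicate_dir/Experiment_01    # two symlinks back to the samples
```

Because nothing is copied, the replicate container stays cheap to rebuild if the sample-to-input relationships change.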
At this point, we can run the next step in our processing pipeline, peak calling, with one or all of the following callers:
```shell
E0_call_peaks_macs2.sh /path/to/Experiment_01 table.txt
E1_call_peaks_music.sh /path/to/Experiment_01 table.txt
E2_call_peaks_bcp.sh /path/to/Experiment_01 table.txt
```
We still provide the table in case our experiment folder has multiple sample-to-input combinations that we need to test. Note that for samples without inputs, an NA value in the INPUT_ID column will generate peak calls agnostic of input.
One of the most important steps is being able to see your data (because seeing is believing when it comes to peak calls). Thus, we can generate a UCSC Track Hub compatible repository, which can then be hosted from your own server and served up at the UCSC Genome Browser.
The best part is that we can combine samples from the same condition, and stack them using the UCSC bigwig container convention, thus allowing us to visualize replicates on the same track line. This works by taking your replicate-aware directories and writing out the correct track hub documentation.
To do this, we point to the container for our replicate-aware directories to write all of them at once, as follows:
```shell
F0_create_trackhub.sh /path/to/experiment_container/ /path/to/hub
```
This creates the entire hub and required directory structure, and all that needs to happen thereafter are small tweaks to the files to customize the hub name and contact info.
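For reference, the hub-level file (hub.txt) that the UCSC Genome Browser reads has the following shape; the values below are placeholders to customize after F0 has generated the skeleton.

```shell
# Sketch of a minimal hub.txt; these five fields are what the UCSC Track
# Hub format requires at the hub level. All values here are placeholders.
cat > hub.txt <<'EOF'
hub myHub
shortLabel My seqpipe hub
longLabel ChIP-seq and ATAC-seq tracks processed with seqpipe
genomesFile genomes.txt
email you@example.edu
EOF
cat hub.txt
```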
For more details, see detailed documentation on UCSC Track Hubs.
Finally, we may want to stage our data for transfer to a local computer. Sometimes however, the sample directories get clogged up with many intermediate processing steps/files, and so the following script collects the most important parts of each sample into a new transfer repository. This staging contains symbolic links to the important parts, at which point an
scp command pointing to this directory can handle the rest.
```shell
Z0_collect_data.sh /path/to/super_directory_of_samples /path/to/staging_directory
```
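The staging idea can be sketched with throwaway paths: the staging directory is just symlinks, and a dereferencing copy (cp -rL here; scp -r similarly follows symlinks when transferring) pulls across the real files.

```shell
# Mock one sample output and a Z0-style staging directory of symlinks.
WORK=$(mktemp -d)
mkdir -p "$WORK"/SampleA "$WORK"/staging
printf 'reads\n' > "$WORK"/SampleA/final.bam          # stand-in for a real bam
ln -s "$WORK"/SampleA/final.bam "$WORK"/staging/SampleA.bam

# A dereferencing copy materializes the link targets at the destination.
cp -rL "$WORK"/staging "$WORK"/transfer
cat "$WORK"/transfer/SampleA.bam                      # prints: reads
```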
Each script is tailored such that it works at the sample folder or experiment folder level. This allows for the processing of each sample as its own individual job when submitting to a cluster, and thus allows for easy parallelization of processing.
This is good because if a certain sample fails in processing, we can easily target it for reprocessing, or if we only want to (re)process a portion of samples, that becomes trivial as well. In other words, each individual sample is a task.
However, this comes with the caveat that we have to construct a wrapper if we want to process each individual sample. We can thus automate creating this tasklist (which works nicely with simpleQueue and other such task managers) by doing a tiny bit of shell scripting:
```shell
SCRIPT=/path/to/your/script.sh
THREADS=8
for I in $(ls -d /path/to/samples_container/*); do
    SAMPLE=$(readlink -f $I)
    echo "$SCRIPT $SAMPLE $THREADS" >> /path/to/tasklist
done
```
Briefly, here we created the variables SCRIPT and THREADS, then iterated over each SAMPLE in our samples_container super-directory, writing out a task line for each sample. So we'd get output such as:
```
/path/to/your/script.sh /path/to/SampleA 8
/path/to/your/script.sh /path/to/SampleB 8
...
/path/to/your/script.sh /path/to/SampleZ 8
```
Note that you may have to prepend each task with a call to set up your environment (e.g., source ~/.bashrc) or load specific modules, depending on your cluster environment.
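If you go that route, one hypothetical way to retrofit an existing tasklist is a quick sed pass; the tasklist here is mocked in a temp file so the snippet runs as-is.

```shell
# Mock a one-line tasklist, then prefix every task with environment setup.
TASKS=$(mktemp)
echo "/path/to/your/script.sh /path/to/SampleA 8" > "$TASKS"

# Prepend "source ~/.bashrc && " to each line (& must be escaped in sed
# replacement text, where it otherwise means "the matched string").
sed -i 's|^|source ~/.bashrc \&\& |' "$TASKS"
cat "$TASKS"
```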
We can save the above script as a wrapper for specific steps, or write each script call out in sequence; the process is designed to be extremely flexible depending on your processing needs.
The biggest caveat with this repo is that, throughout the scripts, some parameters are hardcoded for my own usage. Further development of this pipeline would replace these hardcoded values with proper parameters; in the meantime, when adapting the pipeline to your own needs, replace the hardcoded values with your own. Keep this in mind if you adapt it for your own purposes!
To download all the seqpipe pipeline scripts, simply navigate to the Downloads page to pull them onto your computer, and modify them to your heart's content! For running one-off scripts, the Reference page for each individual script contains, in its code section, all the code necessary to run that specific step.