I recently finished my first semester of a Masters of Bioinformatics at UQ. BINF6000 exposed me to different types of bioinformatics workflows. In this post, I share one - an ChIP Seq pipeline - and discuss what each of the tools does.
I have a control dataset for Fruitfly (Drosophila). This means an ChIP Seq experiment was conducted without any enrichment or modification to the DNA.
data/control.fastq.gz
Let's peak inside! We can do that with zless data/control.fastq.gz
@K00242:156:HGLMKBBXX:3:1101:1692:1191 1:N:0:NAGTCA
NCCTCACTGTGGCGGTCACTTCAGCACTGAGAGAAAATGCGGATGAACCA
+
#AAFFJFFAJFJJFJJJJJJFJJJJJJF7JAJJJJFJJJJ-7FF<AJJJJ
This is what a fastq
file looks like. It is one of the most common file formats you find in bioinformatics.
The format is:
Breaking it down:
As a reminder, this is the header line:
@K00242:156:HGLMKBBXX:3:1101:1692:1191 1:N:0:NAGTCA
The format:
@<instrument>:<run ID>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>
It's the sequence. A C G T and N for ambiguous.
Does nothing. It's basically just white space.
ASCII representation of the Phred score. This is a reflection of the quality of the data.
Throughout any pipeline, you want to constantly perform quality control to ensure the data set is good, and that everything is proceeding as expected. One popular program for this is FastQC.
Let's see the quality of the data with fastqc
:
fastqc data/control.fastq.gz
It generates a HTML report. Here's one of the failed metrics, at least according to FastQC out of the box:
The blue line is the theoretically "perfect" distribution. This is not realistic - some organisms are far more GC than AT, for example. This one actually looks pretty good. Looking at the spike in the center - we see 1.8M reads are right around the 50% GC mean content mark.
The actual organism, Drosophila, has around 42% GC. There is a spike in that are in our graph. The major spike at 50% is kind of surprising - maybe there is some contamination, or perhaps the presence of the adapters is impacting it.
Adapters are additional sequences added by the sequencing machine - we generally want to remove those before doing any sort of analysis, since they are not part of the actual biological dataset. Another report from FastQC about adapters:
We can remove it with a program called
cutadapt
:
cutadapt -a AGATCGGAAGAGCACACGTCAGAACTCCAGTCACGAGTCAATCTCGTATG -o output.fastq data/control.fastq.gz
Let's check the quality again:
fastqc output.fastq
Weird, right? The issue is after removing the adapters, we are left with a bunch of empty space. See below:
![[Screenshot 2024-06-29 at 10.58.34 AM.png]]
I found this surprising, but it is the expected result. It's even in the documentation:
By default, empty reads are kept and will appear in the output. If you do not want this, use the
--minimum-length
/-m
filtering option.
We can filter them using -m
for minimum. $1
is your the input file.
adapter="AGATCGGAAGAGCACACGTCAGAACTCCAGTCACGAGTCAATCTCGTATG"
cutadapt -a $adapter -m 40 -o output.fastq $1
The empty, overrepresented sequences are gone. You can also see no reads with a length less than 40 are present.
I did the same thing for my treatment_rep1
file. After trimming some adapters:
cutadapt -a AGATCGGAAGAGCACACGTCAGAACTCCAGTCACGGAGAAATCTCGTATG -a AACACAACACAACACAACACAACACAACACAACACAACACAACACAACAC -o cleaned/treatment1_rep1.fastq.gz -m 40 data/treatment_rep1.fastq.gz
We are now happy with our data set. Next is read mapping! Right now, we have a bunch of reads, mostly around 50 bp (base pairs), as the graph shows. We need to figure out where those map on the reference genome. So, we need
bowtie2
, this one is popular and widely used.Looking at the dataset:
└── data
├── GCA_000001215.4
│ └── GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.fna
├── GCF_000001215.4
│ └── GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna
├── assembly_data_report.json
├── data_summary.tsv
└── dataset_catalog.json
The two main files are: - GCA_000001215.4 - GCF_000001215.4
Both are the same data, but in a different format.
The difference:
So if you want more data, GCA is for you. If you want to focus on quality, GCF.
Now we have a reference, we need to do the alignment. But what is an alignment?
An example alignment:
Read: GACTGGGCGATCTCGACTTCG
||||| |||||||||| |||
Reference: GACTG--CGATCTCGACATCG
The reference genome has some gaps aded (shown by -
). Genomes don't actually have gaps - what is more likely is some extra bases were inserted into the read, either via chance mutation, or perhaps the read is just wrong.
Here is how the reference looks:
>NC_004354.4 Drosophila melanogaster chromosome X
GAATTCGTCAGAAATGAgctaaacaaatttaaatcattaaatgcGAGCGGCGAATCCGGAAACAGCAACTTCAAACCAGT
It's really long:
wc -l genome/ncbi_dataset/data/GCF_000001215.4/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna
1799366
I am not sure what the difference between the lower and upper case characters are. I asked ChatGPT:
I suspect this is accurate - lower case means less confidence. I don't think it makes a difference for this experiment, anyway, but always good to scrutinize your data.
To use bowtie2, we also need to build an index. We can do it like this:
bowtie2-build <reference>.fa index/output
My exact command:
bowtie2-build genome/ncbi_dataset/data/GCF_000001215.4/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna index
This is what I got:
index.1.bt2 index.2.bt2 index.3.bt2 index.4.bt2 index.rev.1.bt2 index.rev.2.bt2
These are binary formats - not readable by humans. No problem, they are for bowtie, not for us!
Now, we do the alignment!
bowtie2 -x index/index. -U data/output.fastq > data/output.sam
Output:
9818 reads; of these:
9818 (100.00%) were unpaired; of these:
278 (2.83%) aligned 0 times
8350 (85.05%) aligned exactly 1 time
1190 (12.12%) aligned >1 times
97.17% overall alignment rate
Looking at the output, it is a SAM file. This stands for Sequence Alignment Map. First we have a bunch of lines that look like this:
@HD VN:1.5 SO:unsorted GO:query
@SQ SN:NC_004354.4 LN:23542271
@SQ SN:NW_001845990.1 LN:1625
The first line is the header. - VN is version. - SO is sorting order - ours is unsorted. - GO is the grouping order - ours is by query. I do not know what this means.
SQ is a Sequence Dictionary. - SN:... is a sequence name. This maps to a database, in our case, RefSeq. - LN is the length. In this case, 23542271 bp long. That is 23m pairs! This one points to the entire X chromosome.
After about 1500 SQ entries, we hit the data:
@SQ SN:NW_007931121.1 LN:76973
@SQ SN:NC_024511.2 LN:19524
@PG ID:bowtie2 PN:bowtie2 VN:2.5.3 CL:"/opt/homebrew/bin/../Cellar/bowtie2/2.5.3/bin/bowtie2-align-s --wrapper basic-0 -x index/index -U data/output.fastq"
K00242:156:HGLMKBBXX:3:1224:28787:26687 16 NT_033777.3 21051814 42 50M * 0 0 GTTTATGAACCGTTTTTGGTACGCTTTTCAATTAGAATTTAGTTATTGTG JJJJJJJ<-JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU
K00242:156:HGLMKBBXX:3:1221:22424:23048 0 NT_033778.4 20711542 42 50M * 0 0 GCCCTTCCGCGCTCAAAAAGGATTTATTCTCGTGCTTAACTTGGCTTAGC AAFFFJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU
K00242:156:HGLMKBBXX:3:1226:8775:48632 16 NT_033778.4 3199693 30 50M * 0 0 AGCGAACCAATTAGTGATTGAAAATGGGTTATATTTACCTTTGCACATTT JJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFFAAA AS:i:0 XS:i:-6 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU
PG is the program entry. It has information about what we used to generate this SAM file.
The rest of the file is the aligment records. This is what we came for. Each column, given
K00242:156:HGLMKBBXX:3:1226:8775:48632 16 NT_033778.4 3199693 30 50M * 0 0 AGCGAACCAATTAGTGATTGAAAATGGGTTATATTTACCTTTGCACATTT JJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFFAAA AS:i:0 XS:i:-6 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU
AGCGAACCAATTAGTGATTGAAAATGGGTTATATTTACCTTTGCACATTT
JJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFFAAA
SAM is human readable. It is a big file - many gigabytes, even for a simple fruitfly. Instead, we can use BAM - binary alignment mapping. We can pipe the stdout of bowtie2
into samtools
and write it straight to a BAM file.
bowtie2 -x reference/bowtie2_index/genome -U cleaned/treatment1_rep1.fastq.gz | samtools view -bS - > cleaned/treatment_rep1_aligned.bam
We are doing as ChiP Seq experiment. The goal is to identify which proteins TF (transcription factors) bind to. To do this, once we have generated our data, we use peek calling. If the initial experiment followed correct ChiP Seq protocols, there should be a lot more reads mapped to regions to which TFs bind. Peek calling identifies those peaks (a genomic region).
Here is an example of a peek:
We want to identify the "peeks" - a genomic location with more reads than the rest. There are a bunch of programs, the most popular is MACS.
MACS has a lot of different functions. We want callpeak
. The macs3 callpeak --help
gives us a good command:
1. Peak calling for regular TF ChIP-seq:
$ macs3 callpeak -t ChIP.bam -c Control.bam -f BAM -g hs -n test -B -q 0.01
2. Broad peak calling on Histone Mark ChIP-seq:
$ macs3 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1
3. Peak calling on ATAC-seq (paired-end mode):
$ macs3 callpeak -f BAMPE -t ATAC.bam -g hs -n test -B -q 0.01
4. Peak calling on ATAC-seq ( focusing on insertion sites, and using single-end mode):
$ macs3 callpeak -f BAM -t ATAC.bam -g hs -n test -B -q 0.01 --shift -50 --extension 100
We want the first one:
macs3 callpeak -c <control> -t <treatment> -f BAM -g dm -n <label> -B -q 0.01
My command is:
macs3 callpeak -c cleaned/control_aligned.bam -t cleaned/treatment_rep1_aligned.bam -f BAM -g dm -n macs_output/dm -B -q 0.01
I passed -n macs_output/dm
as the label, so that is where the output goes:
macs_output
├── dm_control_lambda.bdg
├── dm_model.r
├── dm_peaks.narrowPeak
├── dm_peaks.xls
├── dm_summits.bed
└── dm_treat_pileup.bdg
A bunch of files! dm_peaks.narrowPeak
is what we are interested in. It is basically a BED file.
chr2L 16685 17074 macs_output/dm_peak_1 62 . 1.64721 8.09327 6.28438 71
The format:
- chr2L - the chromosome which the peak maps to
- 16685 - start coordinate
- 17074 - end coordinate
- macs_output/dm_peak_1 - this is a label
- score - value depends on what the tool documentation specifies.
- -
. This is the "strand". I don't fully understand this.
- 1.64 is the fold enrichment. This describes how many more specific DNA sequences are bound by proteins in the ChIP sample than the control. A value of 1 implies no enrichment; great than 1 implies more DNA was bound in the treatment; biological significance.
- 8.09 - the p-value in -log(10). This means the peak is statistically significant. A value of 0 would represent not statistical significance.
- 6.28 - a q-value. Similar to a p-value. I am not sure about this or why it's useful.
We've got a bunch of peaks - genomic locations that proteins (TFs) bind to. The next thing is to see what genes are in those areas! We can do it programatically with some code, or using amino a genome viewer.
One popular viewer is the USCS viewer. I find it hard to use and confusing. I used the igv.org viewer, it's more simple, at least for me, as a beginner.
I loaded the dm6 genome and my narrowPeak file:
I searched for the gene that is in the center of the peak in the NCBI. Here it is.
This region is "lethal (2) giant larvae".
Enables several functions, including SNARE binding activity; myosin II binding activity; and protein kinase inhibitor activity.
This is outside my area of understanding. I looked up those terms:
This post demonstrates how the create a simple ChIP Seq pipeline. Tools and file formats include:
Some improvements would be: