In this article we analyze various Adenovirus F genomes using the ANI (Average Nucleotide Identity) and AF (Aligned Fraction). I wrote more about those here.
Now, I want to answer two questions:
Starting with question 1, we need some Adenovirus F genomes. We can get some from the NCBI Genome Assembly Database, using the datasets
CLI tool.
datasets summary genome taxon "human mastadenovirus F" --assembly-level complete
We get a bunch of datasets:
{
"accession" : "GCA_006437595.1",
"assembly_info" : {
"assembly_level" : "Complete Genome",
"assembly_method" : "Bowtie2 v. 2.1.0; Unipro UGENE v. 1.16.1",
"assembly_name" : "ASM643759v1",
"assembly_status" : "current",
"assembly_type" : "haploid",
"release_date" : "2016-03-21",
"sequencing_tech" : "Illumina",
"submitter" : "Institute of Molecular Life Sciences, University of Zurich"
},
"assembly_stats" : {
"contig_l50" : 1,
"contig_n50" : 34210,
"gc_count" : "17532",
"gc_percent" : 51,
"number_of_component_sequences" : 1,
"number_of_contigs" : 1,
"number_of_scaffolds" : 1,
"scaffold_l50" : 1,
"scaffold_n50" : 34210,
"total_number_of_chromosomes" : 1,
"total_sequence_length" : "34210",
"total_ungapped_length" : "34210"
},
"current_accession" : "GCA_006437595.1",
"organism" : {
"infraspecific_names" : {
"strain" : "HoviX"
},
"organism_name" : "Human adenovirus 40",
"tax_id" : 28284
},
"source_database" : "SOURCE_DATABASE_GENBANK"
}
Nice information, but we really need the nucleotide sequence if we are going to get the ANI and AF. In this case, we do not want datasets summary ...
but datasets download ...
. summary
is useful to see what we will get, and then once we are happy, download
will get the datasets.
Download the data:
#! /usr/bin/zsh
mkdir data
for g in {a,b,c,d,e,f,g}; do
datasets download genome taxon "human mastadenovirus $g" --assembly-level complete --filename "hadv_$g.zip"
mkdir "data/hadv_$g"
unzip "hadv_$g.zip" -d "data/hadv_$g"
done
rm *.zip
In addition, rename the fna
header to be a bit nicer:
#! /usr/bin/zsh
for dir in $(ls data); do
find "data/$dir" -type f -name "*.fna" | while read -r fna_file; do
echo "$fna_file -> $dir"
sed -i "1s/.*/>$dir/" "$fna_file"
done
done
Now we have some genomes, let's try it out. The default parameters are for larger genomes, so we need to tweak them:
Since v0.2.2, skani has the --small-genomes option equivalent to -c 30 -m 200 --faster-small.
-c
and -m
are as follows:
ALGORITHM PARAMETERS:
-c <c> Compression factor (k-mer subsampling rate). [default: 125]
--faster-small Filter genomes with < 20 marker k-mers more aggressively. Much faster
for many small genomes but may miss some comparisons.
-m <marker_c> Marker k-mer compression factor. Markers are used for filtering.
Consider decreasing to ~200-300 if working with small genomes (e.g.
plasmids or viruses). [default: 1000]
Imagine a genome like a book you are scanning. A higher -c
value would be like scanning for important words - good, fast, but you might miss details. A small -c
would be like reading every word - more accurate but also much more computationally expensive.
For large genomes (bacteria, eukaryotes) using a low -c
value is prohibitively expensive. For a virus, though, not so much!
Using the new -c
flag:
skani dist -c 20 data/hadv_a/ncbi_dataset/data/GCA_000846805.1/GCA_000846805.1_ViralProj14517_genomic.fna data/hadv_a/ncbi_dataset/data/GCA_006415435.1/GCA_006415435.1_ASM641543v1_genomic.fna | cut -f3-
We get:
Align_fraction_ref | Align_fraction_query | Ref_name | Query_name |
---|---|---|---|
98.73 | 99.71 | hadv_a | hadv_a |
This is what we'd expect from two organisms supposedly closely related (same species).
How about the same comparison, but for the other types of Adenovirus?
#! /usr/bin/zsh
function average_ani {
awk '{ sum1 += $3; sum2 += $4; count++ }
END {
if (count > 0)
print sum1 / count, sum2 / count
}'
}
for g in {a,b,c,d,e,f,g}; do
files=("data/hadv_$g"/**/*.fna)
# read ani1 ani2 < <(skani dist --small-genomes "${files[@]}" | average_ani)
read ani1 ani2 < <(skani dist -c 20 "${files[@]}" | average_ani)
echo "Adenovirus $g ANI / AF: $ani1, $ani2"
done
Mostly what you'd expect:
Adenovirus Type | ANI (%) | AF (%) |
---|---|---|
A | 88.79 | 92.06 |
B | 89.36 | 91.82 |
C | 96.89 | 94.78 |
D | 96.18 | 98.84 |
E | 94.11 | 95.38 |
F | 89.54 | 89.41 |
G | 90.39 | 90.03 |
Now we can do the same thing, but see how Adenovirus F compares to the rest, one by one, using -q
(query) and -r
(reference):
#!/usr/bin/zsh
function average_ani {
awk '{ sum1 += $3; sum2 += $4; count++ }
END {
if (count > 0)
print sum1 / count, sum2 / count
}'
}
target="f"
cousins=(a b c d e f g)
echo "Comparing Adenovirus $target to its cousins..."
for cousin in "${cousins[@]}"; do
files_target=("data/hadv_$target"/**/*.fna)
files_cousin=("data/hadv_$cousin"/**/*.fna)
# NOTE we use -c 20
read ani1 ani2 < <(skani dist -q "${files_target[@]}" -r "${files_cousin[@]}" -c 20 | average_ani)
echo "ANI between Adenovirus $target and $cousin: $ani1, $ani2"
done
We get:
Comparing Adenovirus f to its cousins...
Comparison | ANI (%) | Secondary Value |
---|---|---|
Adenovirus F vs A | 46.2533 | 10.14 |
Adenovirus F vs B | 68.8725 | 16.7455 |
Adenovirus F vs C | 0 | 0 |
Adenovirus F vs D | 76.6281 | 19.3789 |
Adenovirus F vs E | 74.2882 | 20.9326 |
Adenovirus F vs F | 98.3608 | 98.6093 |
Adenovirus F vs G | 76.4716 | 24.8242 |
There are some variants that are closer to Adenovirus F, such as D, E and G, in terms of ANI. The AF is very far apart. I'm interested in identifying in metagenomic data, though, so the ANI is probably the more important metric to focus on. Ideally, we want to find some unique loci that are only found in Adenovirus F, and no other Adenoviruses, and ideally, no other organisms.
Adenovirus F is quite distinct from the other adenovirus species. We should be able to find some way to identify it uniquely within some genomic data.