Fragment Length, Number of Mapped Reads Stats From Bam File
In this blog mail service, we will talk nearly what is the insert size in Illumina sequencing reads, and how to find it in SAM/BAM files, and how to filter BAM files with it.
Groundwork
In an Illumina sequencing run, either single-terminate sequencing (SE) or paired-stop sequencing (PE) can be used. In any case, DNA is chopped into small fragments, ligated with adapters at both ends, and sequenced either from one end (SE) or from both ends (PE). For PE reads, insert size refers to the fragment size (excluding the adapters). The proper noun insert, as pointed out past the great blog post Paired-terminate read confusion - library, fragment or insert size?, comes from the cloning era. Information technology refers to the piece of Deoxyribonucleic acid inserted between the adapters.
Therefore, the insert size includes frontward and reverse read as well as the unknown gap betwixt them. The unknown gap can be called the inner mate distance.
It can exist visually represented in the following style (discovered from biostars.org):
How to find out insert size from a BAM/SAM file or filter the file by it?
The 9th column of a SAM file, observed Template LENgth (TLEN
), can be used as an approximate of the fragment length. It is approximate because as documented in the SAM file specification, the verbal definition of mapping starts and ends are specific to implementations.
Below, we become through the procedure to collect insert sizes and to filter a BAM file with the insert size. We presume that there is a BAM file named file.bam
1.
Instance one: list insert sizes, ane line per read
Say we want to plot the distribution of insert sizes in a BAM file, considering but the first pair of the properly mapped pair (flag -f66
, see Decoding SAM flags).
samtools view -f66 file.bam | cut -f 9 > insert-sizes.txt
Why there are negative insert sizes?
Using the command above, negative insert sizes tin happen. The reason is that the first read is mapped to the reverse strand. In this example, the insert size of the second read is positive.
To study the absolute values, use the following command:
samtools view -f66 file.bam | cutting -f9 | awk '{impress sqrt($0^2)}' > insert-sizes.txt
The only additional trick is that the awk
command takes the absolute value.
Now, you lot can use your preferred tool, for instance R or Python or fifty-fifty Excel, to visualize the distribution of insert sizes, or to get summary statistics.
Example 2: filter by insert sizes
Say we want to only keep reads of insert sizes between 200 and 500 from a BAM file and write the reads that fulfil the status to another file.
samtools view -h file.bam | \ awk 'substr($0,ane,1)=="@" || ($ix>= 200 && $nine<=500) || ($9<=-200 && $9>=-500)' | \ samtools view -b > is-200-500.bam
Explanations:
-
-h
insamtools view
: export SAM file headers - In
awk
, thesubstr
office is used to proceed header lines, and the rest 2 status specify forward and reverse reads with the desired insert sizes, respectively - Last only not least,
samtools view -b
is chosen to write the filtered reads into new a BAM file.
Now, you can load the BAM file into genome browsers, or perform down-stream analysis, for instance feature counting, with it.
Limitation
The insert size estimation using this uncomplicated method has an limitation: if the RNA-seq reads were mapped against genomes (particularly eukaryotic genomes), instead of transcriptomes, the reported insert size can be huge, up to thousands or even more bases.
It happens because mapping reads to genomes requires that the read aligner splits the reads and map sub-reads to the genome, because in eukaryotic and peculiarly mammalian genomes, two exons can be separated by an intron of hundreds or even thousands nucleotides. This is why estimated insert sizes from genome-mapped SAM/BAM files can be misleading.
At that place are at least two solutions:
- Re-map the reads to the transcriptome and perform the analysis above. This volition avoid the problem of including introns in the insert sizes, because transcriptome sequences exercise not include introns. The reward is that it volition brand the estimates more accurate. The solution comes with a computational and time cost though to remap the data.
- Manually exclude very large inserts, say anything above 1000. The number needs to be determined by the protocol and the need.
An example of solution #2 would be:
samtools view -f66 file.bam | \ cutting -f9 | \ awk '{print sqrt($0^ii)}' | awk '$0<1000'
Conclusion
Insert size refers to the fragment length consisting of forrard and opposite reads and the un-sequenced gap between the paired reads. It is possible to utilise samtools
and command-line tools such as awk
and cut
to collect insert sizes or to filter BAM/SAM files.
colbournewenbestaide.blogspot.com
Source: http://accio.github.io/bioinformatics/2020/03/10/filter-bam-by-insert-size.html
0 Response to "Fragment Length, Number of Mapped Reads Stats From Bam File"
Post a Comment