In this blog mail service, we will talk nearly what is the insert size in Illumina sequencing reads, and how to find it in SAM/BAM files, and how to filter BAM files with it.

Groundwork

In an Illumina sequencing run, either single-terminate sequencing (SE) or paired-stop sequencing (PE) can be used. In any case, DNA is chopped into small fragments, ligated with adapters at both ends, and sequenced either from one end (SE) or from both ends (PE). For PE reads, insert size refers to the fragment size (excluding the adapters). The proper noun insert, as pointed out past the great blog post Paired-terminate read confusion - library, fragment or insert size?, comes from the cloning era. Information technology refers to the piece of Deoxyribonucleic acid inserted between the adapters.

Therefore, the insert size includes frontward and reverse read as well as the unknown gap betwixt them. The unknown gap can be called the inner mate distance.

It can exist visually represented in the following style (discovered from biostars.org):

fragment

How to find out insert size from a BAM/SAM file or filter the file by it?

The 9th column of a SAM file, observed Template LENgth (TLEN), can be used as an approximate of the fragment length. It is approximate because as documented in the SAM file specification, the verbal definition of mapping starts and ends are specific to implementations.

Below, we become through the procedure to collect insert sizes and to filter a BAM file with the insert size. We presume that there is a BAM file named file.bam 1.

Instance one: list insert sizes, ane line per read

Say we want to plot the distribution of insert sizes in a BAM file, considering but the first pair of the properly mapped pair (flag -f66, see Decoding SAM flags).

                samtools view                  -f66                  file.bam |                  cut                  -f                  9                  >                  insert-sizes.txt                              

Why there are negative insert sizes?

Using the command above, negative insert sizes tin happen. The reason is that the first read is mapped to the reverse strand. In this example, the insert size of the second read is positive.

To study the absolute values, use the following command:

                samtools view                  -f66                  file.bam  |                  cutting                  -f9                  |                  awk                  '{impress sqrt($0^2)}'                  >                  insert-sizes.txt                              

The only additional trick is that the awk command takes the absolute value.

Now, you lot can use your preferred tool, for instance R or Python or fifty-fifty Excel, to visualize the distribution of insert sizes, or to get summary statistics.

Example 2: filter by insert sizes

Say we want to only keep reads of insert sizes between 200 and 500 from a BAM file and write the reads that fulfil the status to another file.

                samtools view                  -h                  file.bam |                  \                  awk                  'substr($0,ane,1)=="@" || ($ix>= 200 && $nine<=500) || ($9<=-200 && $9>=-500)'                  |                  \                  samtools view                  -b                  >                  is-200-500.bam                              

Explanations:

  • -h in samtools view: export SAM file headers
  • In awk, the substr office is used to proceed header lines, and the rest 2 status specify forward and reverse reads with the desired insert sizes, respectively
  • Last only not least, samtools view -b is chosen to write the filtered reads into new a BAM file.

Now, you can load the BAM file into genome browsers, or perform down-stream analysis, for instance feature counting, with it.

Limitation

The insert size estimation using this uncomplicated method has an limitation: if the RNA-seq reads were mapped against genomes (particularly eukaryotic genomes), instead of transcriptomes, the reported insert size can be huge, up to thousands or even more bases.

It happens because mapping reads to genomes requires that the read aligner splits the reads and map sub-reads to the genome, because in eukaryotic and peculiarly mammalian genomes, two exons can be separated by an intron of hundreds or even thousands nucleotides. This is why estimated insert sizes from genome-mapped SAM/BAM files can be misleading.

At that place are at least two solutions:

  1. Re-map the reads to the transcriptome and perform the analysis above. This volition avoid the problem of including introns in the insert sizes, because transcriptome sequences exercise not include introns. The reward is that it volition brand the estimates more accurate. The solution comes with a computational and time cost though to remap the data.
  2. Manually exclude very large inserts, say anything above 1000. The number needs to be determined by the protocol and the need.

An example of solution #2 would be:

                samtools view                  -f66                  file.bam  |                  \                  cutting                  -f9                  |                  \                  awk                  '{print sqrt($0^ii)}'                  |                  awk                  '$0<1000'                              

Conclusion

Insert size refers to the fragment length consisting of forrard and opposite reads and the un-sequenced gap between the paired reads. It is possible to utilise samtools and command-line tools such as awk and cut to collect insert sizes or to filter BAM/SAM files.