PacBio BAM index file (`bam.pbi`) format

.. moduleauthor:: Derek Barnett, David Alexander, Marcus Kinsella, Yuan Li, James Drake

PacBio's previous alignment file format (cmp.h5) contained a data table called the alignment index that recorded auxiliary identifying information and precomputed summary statistics per aligned read. This table served several purposes:

it enabled fast random access to aligned reads satisfying fairly complex predicates, for example, reads from a specific list of ZMWs which had unambiguous mapping (MapQV==254), or a read with a given readname.

it allowed summary reports (readlength, mapped identity/accuracy, etc.) to be constructed by quick operations over the alignment index instead of loading all of the sequence reads for each analysis.

In order to provide backwards-compatibility with the APIs enabled for accessing the cmp.h5, we have devised a new BAM companion file, the PacBio BAM index, which supports the two use cases above.

Version

This is version 3.0.0 of the bam.pbi specification.

Changelog will go here in the future

Format

The format mimics the HDF5 column-based approach without requiring the additional library dependency. Most sections are laid out as a series of 1-dimensional arrays, a la HDF5 datasets. Calculating an average or maximum mapQV, for example, would simply involve a block read of the array and the relevant computation.

It enables the 2 major use cases listed above

Random-access queries, including:

by reference or genomic region

by read group

by query name

by ZMW

by barcode index

etc.

Obtain information without processing entire BAM file

Calculate summary statistics

Reverse-lookup - get information for a record, given its index

Like the associated BAM & BAI formats, PBI is compressed in the BGZF format. See SAM/BAM spec for details.

All multi-byte numbers in PBI are stored little-endian.

Layout - file sections follow each other immediately in the file and are described below.
- PBI Header
- Subread Data
- Mapped Data (optional)
- Coordinate-Sorted Data (optional)
- Barcode/Adapter Data (optional)

PBI Header

Field	Size	Definition	Value
magic	char[4]	PBI magic string	`PBI\1`
version	uint32_t	PBI format version (xx.yy.zz)	0x00xxyyzz
pbi_flags	uint16_t	bitflag describing file contents ¹
n_reads	uint32_t	number of reads in the BAM file
reserved	char[18]	reserved space for future expansion	fill(0x00)

¹ pbi_flags:

Flag Value Description

Basic 0x0000 PbiHeader & SubreadData only

Mapped 0x0001 MappedData section present

Coordinate Sorted 0x0002 CoordinateSortedData section present

Barcode/Adapter 0x0004 BarcodeAdapterData section present

(0x0008 - 0x8000) are available to mark future data modifiers, add'l sections, etc.

Subread Data

SubreadData

Field Size Definition

for 0..n_reads

rgId int32_t Integral value of @RG::ID ¹

for 0..n_reads

qStart

int32_t

Start of query in polymerase read

for 0..n_reads

qEnd

int32_t

End of query in polymerase read

for 0..n_reads

holeNumber

int32_t

The holenumber of the ZMW producing the read

for 0..n_reads

readQual

uint16_t

Expected accuracy ('rq' tag) [0-1000]

for 0..n_reads

fileOffset int64_t Virtual offset of record (bgzf_tell)

¹ Read group identifiers for PacBio data are calculated as follows:

RGID_STRING := md5(movieName + "//" + readType)) [:8]
RGID_INT    := int32.Parse(RGID_STRING)

RGID_STRING is used in the @RG header.
RGID_INT is used in the RG tag of BAM records and here in the PBI index.

Note that RGID_INT may be negative.

Mapped Data

MappedData

Field

Size

Definition

for 0..n_reads

tId

int32_t

BAM tid indication aligned reference

for 0..n_reads

tStart

uint32_t

(0-based) Start of alignment in reference

for 0..n_reads

tEnd

uint32_t

End of alignment in reference (endpos)

for 0..n_reads

aStart

uint32_t

Start of aligned query in polymerase read

for 0..n_reads

aEnd

uint32_t

End of aligned query in polymerase read

for 0..n_reads

revStrand

uint8_t

1 if reverse strand alignment, else 0

for 0..n_reads

nM

uint32_t

Number of base matches in alignment

for 0..n_reads

nMM

uint32_t

Number of base mismatches in alignment

for 0..n_reads

mapQV

uint8_t

The mapping quality [valid ranges 0-254]

Note

Note the absence of the nDel and nIns values in the index. These values are readily computed as:

nIns = aEnd - aStart - nM - nMM
nDel = tEnd - tStart - nM - nMM

Coordinate-Sorted Data

CoordinateSortedData

Field

Size

Definition

n_tids

uint32_t

Number of reference sequences

for 0..n_tids

tId	uint32_t	reference sequence ID ¹
beginRow	uint32_t	index of first record on tId ²
endRow	uint32_t	index of last record on tId ²

In a coordinate-sorted BAM file, the records mapped to each reference form a contiguous block of row numbers.

¹: This dataset should be sorted in ascending order of the uint32 cast of tId (thus a tId of -1 will follow all other tId values)
²: Data fields beginRow and endRow. If tId[i]==t, then [beginRow, endRow) represents range of reads (by 0-based ordinal position in the BAM file) mapped to the reference contig with tId of t. If no BAM records are aligned to t, then we should have beginRow, endRow = -1.

Barcode/Adapter Data

BarcodeAdapterData

Field

Size

Definition

for 0..n_reads

bc_left

uint16_t

B_L from 'bc' tag (index to barcode FASTA)

for 0..n_reads

bc_right

uint16_t

B_R from 'bc' tag (index to barcode FASTA)

for 0..n_reads

bc_qual

uint8_t

barcode call confidence ('bq' tag)

for 0..n_reads

ctxt_flag

uint8_t

Local context of subread ('cx' tag)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PacBioBamIndex.rst

PacBioBamIndex.rst

PacBio BAM index file (`bam.pbi`) format

Version

Format

PBI Header

Subread Data

Mapped Data

Coordinate-Sorted Data

Barcode/Adapter Data

Flag	Value	Description
Basic	0x0000	PbiHeader & SubreadData only
Mapped	0x0001	MappedData section present
Coordinate Sorted	0x0002	CoordinateSortedData section present
Barcode/Adapter	0x0004	BarcodeAdapterData section present

Files

PacBioBamIndex.rst

Latest commit

History

PacBioBamIndex.rst

File metadata and controls

PacBio BAM index file (bam.pbi) format

Version

Format

PBI Header

Subread Data

Mapped Data

Coordinate-Sorted Data

Barcode/Adapter Data

PacBio BAM index file (`bam.pbi`) format