Genomic Information Representation (MPEG-G) is a suite of specifications, developed jointly with TC 276 Biotechnology, that reduces the amount of information required to losslessly store and transmit DNA reads from high-speed sequencing machines.
An MPEG-G file can be created with the following sequence of operations:
- Put the reads from the input file (aligned or unaligned) into bins corresponding to segments of the reference genome
- Classify the reads in each bin into 6 classes: P (perfect match with the reference genome), M (reads with variants), etc.
- Convert the reads of each bin to a subset of 18 descriptors specific to the class: e.g., a class P descriptor is the start position of the read etc.
- Put the descriptors in the columns of a matrix
- Compress each descriptor column (MPEG-G uses the very efficient CABAC compressor already present in several video coding standards)
- Put the compressed descriptors of each class in a bin into an Access Unit (AU), for a maximum of 6 AUs per bin
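The sequence of operations above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the MPEG-G algorithm: the bin width `BIN_SIZE`, the function names and the descriptor names (`start`, `length`, `mismatch_offsets`, `mismatch_bases`) are hypothetical, only the P and M classes are modelled, reads are assumed to be already aligned and to lie entirely within the reference, and `zlib` stands in for CABAC.

```python
import zlib
from collections import defaultdict

BIN_SIZE = 8  # hypothetical bin width, in reference bases


def classify(start, seq, ref):
    """Return 'P' for a perfect match with the reference, else 'M'.

    The remaining classes are not modelled in this sketch.
    """
    return "P" if ref[start:start + len(seq)] == seq else "M"


def describe(start, seq, ref, cls):
    """Turn one read into class-specific descriptors (illustrative names)."""
    row = {"start": start, "length": len(seq)}
    if cls == "M":
        # Record where the read differs from the reference and with which base.
        diffs = [(i, base) for i, base in enumerate(seq) if ref[start + i] != base]
        row["mismatch_offsets"] = [i for i, _ in diffs]
        row["mismatch_bases"] = [base for _, base in diffs]
    return row


def build_access_units(reads, ref):
    """Bin, classify, describe and compress reads given as (start, sequence)."""
    # Steps 1-3: one group of descriptor rows per (bin, class) pair.
    groups = defaultdict(list)
    for start, seq in reads:
        cls = classify(start, seq, ref)
        groups[(start // BIN_SIZE, cls)].append(describe(start, seq, ref, cls))

    access_units = {}
    for key, rows in groups.items():
        # Step 4: arrange the descriptors in columns, one column per descriptor.
        columns = defaultdict(list)
        for row in rows:
            for name, value in row.items():
                columns[name].append(value)
        # Steps 5-6: compress each column separately (zlib here, not CABAC)
        # and keep the result as the access unit of this (bin, class) pair.
        access_units[key] = {
            name: zlib.compress(repr(values).encode("ascii"))
            for name, values in columns.items()
        }
    return access_units


ref = "ACGTACGTACGTACGT"
reads = [(0, "ACGT"), (2, "GTAC"), (4, "ACGA"), (9, "CGTA")]
aus = build_access_units(reads, ref)
for key in sorted(aus):
    print(key, sorted(aus[key]))
```

Keeping each descriptor in its own column means that values of the same kind are compressed together, and keying the access units by bin and class means that a reader can decode only the region and read class it needs.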
MPEG-G currently includes 6 parts:
- Part 1 – Transport and Storage of Genomic Information specifies the file and streaming formats
- Part 2 – Genomic Information Representation specifies the algorithm to compress DNA reads from high-speed sequencing machines
- Part 3 – Genomic information metadata and application programming interfaces (APIs) specifies the metadata and the APIs to access an MPEG-G file
- Part 4 – Reference Software and Part 5 – Conformance are the usual components of a standard
- Part 6 – Genomic Annotation Representation will specify how to compress annotations.