12.2 – Moving intelligence around



Artificial intelligence has caught the attention of the mass media, and the technologies supporting it – Neural Networks (NN) – are being deployed in several contexts affecting a growing number of end users, e.g. in their smartphones.

If a NN is used locally, it is possible to use existing digital representations of NNs (e.g., NNEF, ONNX). However, these formats lack features that are vital for distributing intelligence, such as compression, scalability and incremental updates.

To appreciate the need for compression, let’s consider the case of adjusting the automatic mode of a camera based on the recognition of a scene or object by a properly trained NN. As this area is being intensely investigated, there will very soon be new, better-trained versions of the NN or new NNs with additional features. However, as the process of creating the necessary “intelligence” usually takes time and labour (skilled and unskilled), in most cases the newly created intelligence must be moved from the centre to where the user handset is. With today’s NNs reaching a size of several hundred Mbytes and growing, a scenario where millions of users clog the network because they are all downloading the latest NN with great new features looks likely.
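
The scale of the problem can be estimated with a few lines of arithmetic. The sketch below is only a back-of-envelope calculation: the model size follows the "several hundred Mbytes" mentioned above, while the number of users and the compression ratio are purely illustrative assumptions.

```python
# Back-of-envelope estimate of the traffic caused by one NN update.
# All figures are illustrative assumptions, not measurements.
MODEL_SIZE_MB = 300          # "several hundred Mbytes", as in the text
USERS = 10_000_000           # assumed number of handsets fetching the update
COMPRESSION_RATIO = 10       # assumed gain of a hypothetical NN codec


def total_traffic_tb(model_size_mb: float, users: int, ratio: float = 1.0) -> float:
    """Return the total download volume in terabytes (1 TB = 1,000,000 MB)."""
    return model_size_mb / ratio * users / 1_000_000


uncompressed_tb = total_traffic_tb(MODEL_SIZE_MB, USERS)
compressed_tb = total_traffic_tb(MODEL_SIZE_MB, USERS, COMPRESSION_RATIO)

print(f"uncompressed: {uncompressed_tb:.0f} TB")
print(f"compressed:   {compressed_tb:.0f} TB")
```

Under these assumptions a single update generates 3,000 TB of traffic without compression and 300 TB with it, and the saving recurs with every new release of the NN.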

This article describes some elements of the MPEG work plan to develop one or more standards that enable compression of neural networks. Those wishing to know more can read the Use cases and Requirements document and the Call for Proposals.

About Neural Networks

An (artificial) Neural Network is a system composed of connected nodes each of which can

  • Receive input signals from other nodes
  • Process them
  • Transmit an output signal to other nodes.

Nodes are typically aggregated into layers, each performing different functions. Typically, the “first layers” are rather specific to the type of signal processed (audio, video, various forms of text information etc.). MPEG is addressing the compression of NNs trained with multimedia data for classification or analysis purposes.

Nodes can send signals to subsequent layers but, depending on the type of network, also to the preceding layers.
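
The three node functions and the grouping of nodes into layers can be sketched in a few lines. This is a minimal illustration that assumes the most common kind of node (a weighted sum of the inputs followed by a sigmoid activation) and a purely feed-forward network; the weights, biases and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """One node: receive inputs, process them (weighted sum + activation), emit an output."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(total)


Layer = List[Tuple[List[float], float]]  # one (weights, bias) pair per node


def layer_output(inputs: Sequence[float], layer: Layer) -> List[float]:
    """Feed the same inputs to every node of a layer and collect the outputs."""
    return [node_output(inputs, w, b) for w, b in layer]


# Two-layer feed-forward network with hypothetical, hand-picked weights.
hidden_layer: Layer = [([0.5, -0.5], 0.0), ([1.0, 1.0], -1.0)]
output_layer: Layer = [([1.0, -1.0], 0.0)]

signal = [1.0, 1.0]
hidden = layer_output(signal, hidden_layer)   # signals sent to the next layer
result = layer_output(hidden, output_layer)
print(hidden, result)
```

A network that also sends signals to preceding layers would feed part of `result` or `hidden` back as input at the next step; that case is not shown here.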

Figure 44: An example of artificial neural network

Training is the process of “teaching” a network to do a particular job, e.g. recognising a particular object or a particular word. This is done by presenting to the NN data from which it can “learn”. Inference is the process of presenting to a trained network new data to get a response about what the new data is.
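
The difference between the two phases can be shown with a toy example. The sketch below assumes a single node trained with the classic perceptron rule to reproduce the logical OR function; real NNs use far larger networks and different training algorithms, and all names here are illustrative.

```python
from typing import List, Tuple

Sample = Tuple[Tuple[int, int], int]

# Training data for the logical OR function (a toy stand-in for real media data).
TRAINING_SET: List[Sample] = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def infer(weights: List[float], bias: float, inputs: Tuple[int, int]) -> int:
    """Inference: present data to the node and read its answer (0 or 1)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train(samples: List[Sample], epochs: int = 20, rate: float = 1.0):
    """Training: adjust the weights whenever the answer differs from the target."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - infer(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


weights, bias = train(TRAINING_SET)
predictions = [infer(weights, bias, inputs) for inputs, _ in TRAINING_SET]
print(predictions)
```

After training, the learned `weights` and `bias` are the “intelligence” that has to be moved to the device; inference needs only those numbers and the `infer` function.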

When is NN compression useful?

Compression is useful whenever there is a need to distribute NNs to remotely located devices. Depending on the specific use case, compression should be accompanied by other features. Two major use cases are analysed below.

Public surveillance

In 2009 MPEG developed the Surveillance Application Format. This is a standard that specifies the package (file format) containing audio, video and metadata to be transmitted to a surveillance centre. Today, however, it is possible to ask the surveillance network to do more intelligent things by distributing intelligence even down to the level of visual and audio sensors.

For these more advanced scenarios MPEG is developing a suite of specifications under the title of Internet of Media Things (IoMT), where Media Things (MThings) are the media “versions” of IoT’s Things. Parts 2 (IoMT Discovery and Communication API) and 3 (IoMT Media Data Formats and API) of the IoMT standard (ISO/IEC 23093) reached FDIS level in March 2019. Part 1 (Architecture) will reach FDIS level in October 2019.

The IoMT reference model is represented in Figure 45.

Figure 45: IoT in MPEG stands for “media” – IoMT

IoMT standardises the following interfaces:

1 User commands (setup info.) between a system manager and an MThing
1’ User commands forwarded by an MThing to another MThing, possibly in a modified form (e.g., subset of 1)
2 Sensed data (raw or processed data, in the form of just compressed data or resulting from a semantic extraction) and actuation information
2’ Wrapped interface 2 (e.g. for transmission)
3 MThing characteristics, discovery

IoMT is neutral as to the type of semantic extraction or, more generally, to the nature of the intelligence actually present in the cameras. However, as NNs are demonstrating better and better results for visual pattern recognition, such as object detection, object tracking and action recognition, cameras can be equipped with NNs capable of processing the captured information to achieve a level of understanding and of transmitting that understanding through interface 2.

Therefore, one can imagine that re-trained or brand new NNs can be regularly uploaded to a server that distributes NNs to surveillance cameras. Distribution need not be uniform since different areas may need different NNs, depending on the tasks that given areas need to specifically carry out.

NN compression is a vitally important technology to make the described scenarios real because automatic surveillance systems may use many cameras (e.g. thousands or even millions of units) and because, as the technology to create NNs matures, the time between NN updates will progressively become shorter.

Distribution of NN-based apps to devices

There are many cases where compression is useful to efficiently distribute heavy NN-based apps to a large number of devices, in particular mobile ones. Three cases are considered here.

  • Visual apps. Updating a NN-based camera app in one’s mobile handset will soon become commonplace. Ditto for the many conceivable applications where the smartphone understands some of the objects in the world around it. Both will happen at an accelerated frequency.
  • Machine translation (speech-to-text, translation, text-to-speech). NN-based translation apps already exist and their number, efficiency, and language support can only increase.
  • Adaptive streaming. As AI-based methods can improve the QoE, the coded representation of NNs can initially be made available to clients prior to streaming while updates can be made during streaming to enable better adaptation decisions, i.e. better QoE.


MPEG has identified a number of requirements for compressing NNs. Even though not all applications need the support of all requirements, the NN compression algorithm must eventually be able to support all the identified requirements.

  1. Compression shall have a lossless mode, i.e. the performance of the decompressed NN is exactly the same as that of the uncompressed NN
  2. Compression shall have a lossy mode, i.e. the performance of the decompressed NN can be different from the performance of the uncompressed NN, in exchange for more compression
  3. Compression shall be scalable, i.e. even if only a subset of the compressed NN is used, there is still a level of performance
  4. Compression shall support incremental updates, i.e. as more data are received the performance of the NN improves
  5. Decompression shall be possible with limited resources, i.e. with limited processing performance and memory
  6. Compression shall be error resilient, i.e. if an error occurs during transmission, the file is not lost
  7. Compression shall be robust to interference, i.e. it is possible to detect that the compressed NN has been tampered with
  8. Compression shall be possible even if there is no access to the original training data
  9. Inference shall be possible using the compressed NN
  10. Compression shall support incremental updates from multiple providers to improve the performance of a NN
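
The difference between requirements 1 and 2 can be illustrated with generic tools. The sketch below is not the technology MPEG will standardise: it assumes `zlib` as a stand-in lossless coder and uniform 8-bit quantisation as a stand-in lossy coder, applied to a hypothetical list of weights.

```python
import struct
import zlib
from typing import List, Tuple

# Hypothetical weights (exactly representable as 32-bit floats) standing in for a trained NN.
weights = [0.5, -0.25, 0.125, 0.75, -1.0, 0.0, 0.5, 0.5]


def pack(values: List[float]) -> bytes:
    """Serialise weights as little-endian 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack(blob: bytes) -> List[float]:
    """Inverse of pack(): read back 32-bit floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Requirement 1 (lossless): the decoded weights are identical to the originals.
lossless = zlib.compress(pack(weights))
restored = unpack(zlib.decompress(lossless))


# Requirement 2 (lossy): map each weight to one of 256 levels (8 bits instead of 32).
def quantise(values: List[float]) -> Tuple[bytes, float, float]:
    low, high = min(values), max(values)
    step = (high - low) / 255 or 1.0   # avoid a zero step when all weights are equal
    codes = bytes(round((v - low) / step) for v in values)
    return codes, low, step


def dequantise(codes: bytes, low: float, step: float) -> List[float]:
    return [low + c * step for c in codes]


codes, low, step = quantise(weights)
approx = dequantise(codes, low, step)
max_error = max(abs(a - b) for a, b in zip(weights, approx))
print(restored == weights, len(pack(weights)), len(codes), max_error)
```

The lossless path returns exactly the original weights, so the NN performs identically. The lossy path stores one byte per weight instead of four (plus `low` and `step`), at the price of an error of up to about half a quantisation step per weight, which may change the performance of the NN.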


The standard “Compression of neural networks for multimedia content description and analysis” (part 17 of MPEG-7) that MPEG is currently developing will initially produce a base layer standard that will help the industry take its first steps in this exciting field, which will certainly shape the way intelligence is added to the things close to all of us.

