Merge branch 'analytics_design_doc' into 'main'

RFC: doc: Add analytics support design documentation

See merge request gstreamer/gstreamer!6139
Author: Daniel Morin, 2024-05-03 20:23:59 +00:00, commit 7671495e8c

# Analytics
Analytics refers to the process of extracting information from the content of
the media (or medias). The analysis can be spatial only (e.g. image analysis),
temporal only (e.g. sound detection), spatio-temporal (e.g. tracking or action
recognition), or multi-modal (e.g. image+sound to detect an environment or
behaviour). There are also scenarios where the results of one analysis are used
as the input of another, with or without an additional media. This design aims
at supporting both ML-based analytics and CV analytics and offers a way to
bridge both techniques.
## Vision
With this design we aim to allow GStreamer application developers to develop
analytics pipelines easily while taking full advantage of the acceleration
available on the platform where they deploy. The effort of moving the analytics
pipeline to a different platform will be minimal.
## Refinement Using Analytics Pipeline
Similarly to content-agnostic media processing (e.g. scaling, colour-space
conversion, serialization, ...), this design promotes re-usability and
simplicity by allowing complex analytics pipelines to be composed from simple,
dedicated analytics elements that complement each other.
## Example
A simple hypothetical example of an analytics pipeline:
```
+---------+ +------------+ +---------------+ +----------------+
| v4l2src | | video-     | | onnxinference | | tensor-decoder |
|         | | convert-   | |               | |                |
|      src-sink scale  src-sink1        src1-sink          src--+
|         | | (pre-proc) | |  (analysis)   | |  (post-proc)   | |
+---------+ +------------+ +---------------+ +----------------+ |
  +--------------------------------------------------------------+
  |
  |  +--------------+   +------+
  |  | analytic-    |   | sink |
  |  | overlay      |   |      |
  +--sink        src-sink      |
     | (analysis-   |   |      |
     |  results-    |   +------+
     |  consumer)   |
     +--------------+
```
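For illustration, here is a minimal sketch of how an application could build a
similar pipeline with `gst_parse_launch()`. `onnxinference` is the element from
the ONNX refactor MR referenced below; `yolov3tensordec` and `analyticsoverlay`
are hypothetical tensor-decoder and overlay element names used only for this
sketch, and the model file path is a placeholder.
```
#include <gst/gst.h>

int
main (int argc, char **argv)
{
  GstElement *pipeline;
  GError *error = NULL;

  gst_init (&argc, &argv);

  /* Pipeline mirroring the diagram above. "yolov3tensordec" and
   * "analyticsoverlay" are hypothetical element names used for illustration;
   * the model file path is a placeholder. */
  pipeline = gst_parse_launch (
      "v4l2src ! videoconvertscale ! "
      "onnxinference model-file=yolov3.onnx ! "
      "yolov3tensordec ! analyticsoverlay ! autovideosink", &error);

  if (pipeline == NULL) {
    g_printerr ("Failed to build pipeline: %s\n", error->message);
    g_clear_error (&error);
    return 1;
  }

  gst_element_set_state (pipeline, GST_STATE_PLAYING);
  /* ... run a main loop and watch the bus here ... */
  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (pipeline);
  return 0;
}
```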
## Supporting Neural Network Inference
There are multiple frameworks supporting neural network inference. These can be
described more generally as computing-graph frameworks, as they are generally
not limited to NN inference applications. An existing NN-inference or
computing-graph framework, like ONNX, is encapsulated into a GstElement/Filter.
The inference element loads a model, a description of the computing-graph,
based on a property. The model expects input(s) in a specific format and
produces output(s) in a specific format. Depending on the model format, the
input/output formats can be extracted from the model, as with ONNX, but this is
not always the case.
### Inference Element
Inference elements are encapsulations of an NN-inference framework. They are
therefore specific to a framework, like ONNX Runtime.
Other inference elements can be added.
### Inference Input(s)
The input format is defined by the model. Using the model input format, the
inference element can constrain its sinkpad(s) capabilities. Note that because
a tensor is very generic, the term also encompasses images/frames, and the term
input tensor is also used to describe inference input.
### Inference Output(s)
Output(s) of the inference are tensors and their formats are also dictated by
the model. Analysis results are generally encoded in the output tensor in a way
that is specific to the model. Even models that target the same type of
analysis encode analysis results in different ways.
### Model Formats Not Describing Input/Output Tensor Formats
With some model formats the input/output tensor formats are not described. In
this context it's the responsibility of the analytics pipeline to push input
tensors with the correct format into the inference process. The inference
element designer is then left with two choices: supporting a model manifest
where inputs/outputs are described, or leaving the constraining/fixing of
inputs/outputs to the analytics pipeline designer, who can use a caps filter to
constrain the inputs/outputs of the model.
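As a sketch of the second option, the analytics pipeline designer could
constrain the tensor flowing into the inference element with a `capsfilter`.
The `tensor/x-raw` fields used here follow the caps description proposed later
in this document and are not an existing media type; the shape and datatype
values are illustrative.
```
#include <gst/gst.h>

/* Constrain an inference input whose model format does not describe it.
 * The tensor/x-raw fields follow the caps proposed later in this document;
 * shape and datatype values are illustrative. */
static GstElement *
make_model_input_constraint (void)
{
  GstElement *filter;
  GstCaps *caps;

  caps = gst_caps_from_string (
      "tensor/x-raw, shape=(int)< 1, 3, 416, 416 >, datatype=(string)float32");

  filter = gst_element_factory_make ("capsfilter", "model-input-constraint");
  g_object_set (filter, "caps", caps, NULL);
  gst_caps_unref (caps);

  return filter;
}
```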
### Tensor-Decoders
In order to preserve the generality of the inference element, tensor decoding
is omitted from the inference element and left to specialized elements whose
specific task is decoding tensors from a specific model. Additionally,
tensor-decoding does not depend on a specific NN-framework or inference
element; this allows re-use of a tensor-decoder when the same model is used
with a different inference element. For example, a YOLOv3 tensor-decoder can be
used to decode tensors from inference using a YOLOv3 model with an element
encapsulating ONNX or TFLite. Note that a tensor-decoder can handle multiple
tensors that have similar encodings.
### Tensor
N-dimensional vector.
#### Tensor Type Identifier
This is an identifier, string or quark, that uniquely identifies a tensor type.
A tensor type describes the specific format used to encode analysis results in
memory. This identifier is used by tensor-decoders to know if they can handle
the decoding of a tensor. For this reason, from an implementation perspective,
the tensor-decoder is the ideal location to store the tensor-type-identifier,
as the code is already model specific. Since the tensor-decoder is by design
specific to a model, no generality is lost by storing the tensor-type-identifier
there.
#### Tensor Datatype
This is the primitive type used to store tensor data, like `int8`, `uint8`,
`float16`, `float32`, ...
#### Tensor Dimension Cardinality
Number of dimensions in the tensor.
#### Tensor Dimension
Tensor shape.
- [a], 1-dimensional vector
- [a x b], 2-dimensional vector
- [a x b x c], 3-dimensional vector
- [a x b x ... x n], N-dimensional vector
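A small sketch tying these attributes together: the element count and memory
footprint of a tensor follow directly from its cardinality, dimensions and
datatype (the function name is illustrative).
```
#include <glib.h>

/* Illustrative only: number of elements and byte size of a tensor computed
 * from its cardinality (number of dimensions), its dimensions and the size
 * of its datatype (e.g. 4 bytes for float32). */
static gsize
tensor_size_bytes (gsize cardinality, const gsize * dims, gsize datatype_size)
{
  gsize n_elements = 1;
  gsize i;

  for (i = 0; i < cardinality; i++)
    n_elements *= dims[i];        /* [a x b x ... x n] */

  return n_elements * datatype_size;
}
```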
### Tensor-Decoders Need to Recognize Tensor(s) They Can Handle
As mentioned before, tensor-decoders need to be able to recognize the tensor(s)
they can handle. It's important to keep in mind that multiple tensors can be
attached to a buffer when tensors are transported as analytics-meta. It could
be easy to believe that a tensor's (cardinality + dimension + datatype) is
sufficient to recognize a specific tensor format, but we need to remember that
analysis results are encoded into the tensor and retrieving analysis results
requires a decoding process specific to the model. In other words a tensor
A:{cardinality: 2, dimension: 100 x 5, datatype: int8} and a tensor
B:{cardinality: 2, dimension: 100 x 5, datatype: int8} can have completely
different meanings.
A could be: (object-detection where each candidate is encoded with (top-left)
coordinates, width, height and object location confidence level)
```
0 : [ x1, y1, w, h, location confidence]
1 : [ x1, y1, w, h, location confidence]
...
99: [ x1, y1, w, h, location confidence]
```
B could be: (object-detection where each candidate is encoded with (top-left)
coordinates, (bottom-right) coordinates and object class confidence level)
```
0 : [ x1, y1, x2, y2, class confidence]
1 : [ x1, y1, x2, y2, class confidence]
...
99: [ x1, y1, x2, y2, class confidence]
```
We can see that even if A and B have the same (cardinality, dimension,
datatype), a tensor-decoder expecting A and decoding B would be wrong.
In general, for high cardinality tensors the risk of having two tensors with
the same (cardinality + dimension + datatype) is low, but if we think of low
cardinality tensors typical of classification (1 x C), we can see that the risk
is much higher. For this reason we believe it's not sufficient for a
tensor-decoder to rely only on (cardinality + dimension + datatype) to identify
the tensors it can handle.
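To make the point concrete, here is a sketch of what decoding layout A versus
layout B could look like; the struct layouts and field names are illustrative.
The same 100 x 5 int8 buffer is read with completely different semantics, so
feeding a B tensor to the A decoder silently produces wrong boxes.
```
#include <glib.h>

/* The same 100 x 5 int8 tensor read with two different semantics.
 * Illustrative only; a real decoder would also rescale quantized values. */

typedef struct { gint8 x1, y1, w,  h,  loc_conf; } CandidateA;  /* layout A */
typedef struct { gint8 x1, y1, x2, y2, cls_conf; } CandidateB;  /* layout B */

static void
decode_a (const gint8 * data, gsize n_rows)
{
  const CandidateA *c = (const CandidateA *) data;
  gsize i;

  for (i = 0; i < n_rows; i++)
    g_print ("A: box (%d,%d) %dx%d conf=%d\n",
        c[i].x1, c[i].y1, c[i].w, c[i].h, c[i].loc_conf);
}

static void
decode_b (const gint8 * data, gsize n_rows)
{
  const CandidateB *c = (const CandidateB *) data;
  gsize i;

  for (i = 0; i < n_rows; i++)
    g_print ("B: box (%d,%d)-(%d,%d) conf=%d\n",
        c[i].x1, c[i].y1, c[i].x2, c[i].y2, c[i].cls_conf);
}
```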
#### Tensor-Decoder Second Job: Non-Maximum Suppression (NMS)
A tensor-decoder's main functionality is to extract analytics-results from a
tensor, but in addition to decoding the tensor, in general a second phase of
post-processing is handled by the tensor-decoder. This post-processing phase is
called non-maximum suppression (NMS). The simplest example of NMS is with
classification. For every input the classification model will produce a
probability for each potential class. In general we're mostly interested in the
most probable class or the few most probable classes, and there's little value
in transporting all class probabilities. In addition to keeping only the most
probable class (or classes), we generally want the probability to be above a
certain threshold, otherwise we're not interested in the result. Because a
significant portion of the analytics results out of the inference process don't
have much value, we want to filter them out as early as possible. Since
analytics results are only available after tensor decoding, the tensor-decoder
is tasked with this type of filtering (NMS). The same concept exists for
object-detection, where NMS generally involves calculating
intersection-over-union (IoU) in combination with location and class
probability. Because ML-based analytics are probabilistic analyses, they
generally need a form of NMS post-processing.
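A minimal sketch of the classification case described above, keeping only the
most probable class and dropping the result when it is below a threshold; the
names and the threshold policy are illustrative.
```
#include <glib.h>

/* Sketch of the simplest NMS-like filtering described above: keep only the
 * most probable class and drop the result entirely if it is below a
 * threshold. */
static gboolean
keep_best_class (const gfloat * class_probs, gsize n_classes,
    gfloat threshold, gsize * best_class, gfloat * best_prob)
{
  gsize i;

  *best_class = 0;
  *best_prob = 0.0f;

  for (i = 0; i < n_classes; i++) {
    if (class_probs[i] > *best_prob) {
      *best_prob = class_probs[i];
      *best_class = i;
    }
  }

  /* Results below the threshold carry little value; filter them out as
   * early as possible instead of transporting them downstream. */
  return *best_prob >= threshold;
}
```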
#### Handling Multiple Tensors Simultaneously In A Tensor-Decoder
Sometimes it is needed, or more efficient, to have a tensor-decoder handle
multiple tensors simultaneously. In some cases the tensors are complementary
and a tensor-decoder needs both tensors to decode the analytics result. In
other cases it's just more efficient to do it simultaneously because of the
tensor-decoder's second job, NMS. Let's consider YOLOv3, where 3 output tensors
are produced for each input. One tensor represents detection of small objects,
the second tensor medium-size objects and the third tensor large objects. In
this context it's beneficial to have the tensor-decoder decode the 3 tensors
simultaneously and perform the NMS on all the results, otherwise analytics
results with low value would remain in the system for longer. This has
implications for the negotiation of the tensor-decoder, which will be expanded
on in the section dedicated to tensor-decoder negotiation.
### Why Interpreting (decoding) Tensors
As described above, tensors contain information and are used to store analytics
results. The analytics results are encoded in a model-specific way into the
tensor, and unless their consumers (the processes making use of the analytics
results) are also model specific, the tensors need to be decoded. Deciding if
the analytics pipeline will have elements producing and consuming tensors
directly in their encoded form, or if a tensor-decoding process will be done
between tensor production and consumption, is a design decision that involves a
compromise between re-usability and performance. As an example, an
object-detection overlay element would need to be model specific to directly
consume tensors, and therefore would need to be rewritten for any
object-detection model using a different encoding scheme, but if the only goal
of the analytics-pipeline is to do this overlay it would probably be the most
efficient implementation. Another aspect in favour of interpreting tensors is
that there can be multiple consumers of the analytics results; if tensor
decoding is left to the consumers themselves, it means multiple consumers
decoding the same tensors. On the other hand, we can think of two models
specifically designed to work together where the outputs of one model become
the inputs of the downstream model. In this context the downstream model is not
re-usable without the upstream model, but they bypass the need for
tensor-decoding and are very efficient. Another variation is that multiple
models are merged into one model, removing the need for multi-level inference,
but again this is a design decision involving a compromise between
re-usability, performance and effort. We aim at providing support for all these
use-cases and allowing the analytics-pipeline designer to make the best design
decision based on their specific context.
#### Analytics-Meta
Analytics-meta (GstAnalyticsRelationMeta) is the foundation of re-usability of
analytics results. Its goal is to store analytics-results (GstAnalyticsMtd) in
an efficient way and to allow relations to be defined between them.
GstAnalyticsMtd is very primitive and meant to be expanded.
GstAnalyticsClassification (storage for classification results),
GstAnalyticsObjectDetection (storage for object-detection results) and
GstAnalyticsObjectTracking (storage for object-tracking) are specializations
and can be used as references to create other storage, based on
GstAnalyticsMtd, for other types of analytics results.
There are two major use-cases for the ability to define relations between
analytics results. The first one is defining a relation between analytics
results that were generated at different stages. A good example of this could
be a first analysis that detects cars in an image and a second-level analysis
where only the section of the image presenting a car is pushed, to extract the
brand/model of the car in that section of the image. This analytics result is
then appended to the original image with a relation to the object-detection
result that localized the car in the image. The other use-case for relations is
to create compositions by re-using existing GstAnalyticsMtd specializations.
The relation between different analytics-results is completely decoupled from
the analytics results themselves. All relation definitions are stored in
GstAnalyticsRelationMeta, which is a container of GstAnalyticsMtd and also
contains an adjacency-matrix storing the relations. One of the benefits is the
ability of a consumer of analytics-meta to explore the graph and follow
relations between analytics-results without having to understand every
analytics-result on the relation path. Another important aspect is that
analytics-meta is not specific to machine-learning techniques and can also be
used to store analysis results from computer-vision or other techniques. It can
be used as a bridge between different techniques.
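A minimal sketch of how the car example above could be expressed with the
GstAnalytics API from gst-plugins-bad; the quark names, coordinates and
confidence values are illustrative.
```
#include <gst/gst.h>
#include <gst/analytics/analytics.h>

/* Sketch: attach a car detection and a related brand classification to the
 * same buffer with GstAnalyticsRelationMeta. Quark names, coordinates and
 * confidence values are illustrative. */
static void
attach_car_results (GstBuffer * buffer)
{
  GstAnalyticsRelationMeta *rmeta;
  GstAnalyticsODMtd od_mtd;
  GstAnalyticsClsMtd cls_mtd;
  GQuark car_class = g_quark_from_string ("car");
  GQuark brand = g_quark_from_string ("some-brand");
  gfloat brand_conf = 0.87f;

  /* Container for all analytics results attached to this buffer */
  rmeta = gst_buffer_add_analytics_relation_meta (buffer);

  /* First-level result: the car localized in the image */
  gst_analytics_relation_meta_add_od_mtd (rmeta, car_class,
      120, 80, 320, 200, 0.92f, &od_mtd);

  /* Second-level result: brand classification done on the cropped car */
  gst_analytics_relation_meta_add_cls_mtd (rmeta, 1, &brand_conf,
      &brand, &cls_mtd);

  /* Relate the classification to the detection that produced the crop */
  gst_analytics_relation_meta_set_relation (rmeta,
      GST_ANALYTICS_REL_TYPE_RELATE_TO,
      gst_analytics_mtd_get_id (&cls_mtd),
      gst_analytics_mtd_get_id (&od_mtd));
}
```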
##### Storing Tensors Into Analytics-Meta
To describe analytics results more precisely in an analytics pipeline where the
first inference's output tensor is pushed directly, without tensor decoding,
into a second inference, it would be useful to store those tensors using
analytics-meta, because we could then express the relation between the tensor
of the first inference and the tensor of the second inference. With this
relation description, a tensor-decoder of the second inference would be able to
retrieve the associated tensor of the first inference and extract potentially
useful information that is not available in the tensor of the second inference.
### Semantically-Agnostic Tensor Processing
Not all tensor processing is model dependent. Sometimes the processing can be
done uniformly on all of a tensor's values. Normalization, range adjustment,
offset adjustment and quantization are examples of operations that do not
require knowledge of how the information is encoded in the tensor. Contrary to
tensor-decoders, elements implementing these types of processing don't need to
know how information is encoded in the tensor, but they do need to know general
information about the tensor like cardinality, dimension and datatype. Note
that GStreamer already does a lot of semantically-agnostic tensor processing
(remember images/frames are also a form of tensor), like scaling, cropping,
colourspace conversion, ...
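A sketch of such a semantically-agnostic operation on a float32 tensor buffer:
only the element count and datatype matter, not how analytics results are
encoded (the function and its parameters are illustrative).
```
#include <gst/gst.h>

/* Sketch of a semantically-agnostic operation: apply a scale and offset to
 * every value of a float32 tensor buffer. Only the element count and the
 * datatype are needed; how analytics results are encoded is irrelevant. */
static gboolean
normalize_tensor (GstBuffer * buffer, gsize n_elements, gfloat scale,
    gfloat offset)
{
  GstMapInfo map;
  gfloat *data;
  gsize i;

  if (!gst_buffer_map (buffer, &map, GST_MAP_READWRITE))
    return FALSE;

  if (map.size < n_elements * sizeof (gfloat)) {
    gst_buffer_unmap (buffer, &map);
    return FALSE;
  }

  data = (gfloat *) map.data;
  for (i = 0; i < n_elements; i++)
    data[i] = data[i] * scale + offset;

  gst_buffer_unmap (buffer, &map);
  return TRUE;
}
```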
#### Semantically-Agnostic Tensor Processing With Graph-Computing Framework
Graph-computing frameworks, like ONNX, can also be used for this type of
operation.
#### Tensor-Decoder Bin And Auto-Plugging
Since tensor-decoders are model specific, we expect that many will be created.
One way to simplify analytics pipeline creation and promote re-usability is to
provide a tensor-decoder bin able to identify the correct tensor-decoder for
the tensor to be decoded. It's possible to have multiple tensor-decoders able
to decode the exact same tensor type but making use of specific acceleration.
Rank can be used to identify the ideal tensor-decoder for a specific platform.
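A sketch of how such a bin could pick a decoder by rank, assuming
tensor-decoders are registered as decoder factories and advertise the
`tensor/x-raw` `type` field proposed in this document on their sink caps; both
assumptions are illustrative.
```
#include <gst/gst.h>

/* Sketch: among the factories that can sink a given tensor type, pick the
 * highest-ranked one (i.e. the one best suited to this platform). */
static GstElementFactory *
pick_tensor_decoder (const gchar * tensor_type)
{
  GList *factories, *l;
  GstElementFactory *best = NULL;
  GstCaps *wanted = gst_caps_new_simple ("tensor/x-raw",
      "type", G_TYPE_STRING, tensor_type, NULL);

  /* All decoder-like factories, sorted by rank (highest first) */
  factories = gst_element_factory_list_get_elements (
      GST_ELEMENT_FACTORY_TYPE_DECODER, GST_RANK_MARGINAL);
  factories = g_list_sort (factories,
      gst_plugin_feature_rank_compare_func);

  for (l = factories; l != NULL; l = l->next) {
    GstElementFactory *f = GST_ELEMENT_FACTORY (l->data);
    if (gst_element_factory_can_sink_any_caps (f, wanted)) {
      best = gst_object_ref (f);
      break;
    }
  }

  gst_plugin_feature_list_free (factories);
  gst_caps_unref (wanted);
  return best;
}
```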
### Tensor Transport Mode
Two transport modes are envisioned: as Meta or as Media. Both modes have pros
and cons, which justifies supporting both.
#### Tensor Transport As Meta
In this mode the tensor is attached to the buffer (the media) on which the
analysis was performed. The advantage of this mode is that the original media
is kept in a direct association with the analytics results. Further refinement
analysis or consumption (like overlay) of the analytics results is easier when
the media on which the analysis was performed is available and easily
identifiable. Another advantage is the ability to keep a relation description
between tensors in a refinement context. On the other hand, this mode of
transporting analytics results makes negotiation of the tensor-decoder in
particular difficult.
#### Tensor Transport As Media
In this mode the tensor is the media, potentially referring to the buffer (the
original media on which the analysis was performed) using a Meta (the idea
behind the OriginalBufferMeta MR). The advantage of this mode is that
tensor-decoder negotiation is simple, but association of the analytics results
with the original buffer on which the analysis was performed is more difficult.
*Also note this is the mode of transport used by NNStreamer.*
### Negotiation
Being able to negotiate the required analysis pre/post-processing and to
automatically inject the elements able to perform it would be very valuable: it
would minimize the effort of porting an analytics-pipeline between different
platforms while making use of the acceleration available. Tensor-decoder bins,
auto-plugging of pre-processing (considering the acceleration available),
auto-plugging of the inference element (optimized for the platform),
post-processing, and a tensor-decoder bin selecting the required
tensor-decoders, potentially from multiple functionally equivalent ones better
adapted to the platform, are all aspects to consider when designing the
negotiation involved in an analytics-pipeline.
#### Negotiating Tensor-Decoder
As described above, a tensor-decoder needs to know 4 attributes of a tensor to
know if it can handle it:
1. Tensor dimension cardinality (not explicitly required in some cases)
2. Tensor dimension
3. Tensor datatype
4. Tensor type (identifier of the analytics-result encoding semantic)
Note that 1, 2 and 3 could be encoded into 4, but this is not desirable because
1, 2 and 3 are useful for selecting semantically-agnostic tensor processors.
A tensor-decoder can handle multiple tensor types. This could be expressed in
the sinkpad(s) template by a list of arrays where each combination of tensor
types it can handle would be expressed, but this would make the sinkpad(s) caps
difficult to read. To avoid this problem, when a tensor-decoder handles
multiple tensors the tensor type is a category that encapsulates all the tensor
types it can handle. Referring again to YOLOv3's 3 tensors (small, medium,
large): all 3 would have the same tensor-type identifier, e.g. YOLOv3, and each
tensor itself would have a subtype field distinguishing them ('small',
'medium', 'large'). The same also applies to FastSAM's 2 tensors
('FastSAM-masks', 'FastSAM-logits'), where both would be represented by the
same tensor type ('FastSAM') at the pad capability level.
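As a sketch, the caps a hypothetical YOLOv3 tensor-decoder sinkpad could
advertise under this convention; the `tensor/x-raw` media type and the
`type`/`subtype` fields are the ones proposed in this document, not existing
GStreamer caps.
```
#include <gst/gst.h>

/* Caps sketch for a hypothetical YOLOv3 tensor-decoder sinkpad: one tensor
 * type identifier ("YOLOv3"), with a subtype distinguishing the 3 tensors. */
static GstCaps *
make_yolov3_tensor_caps (void)
{
  return gst_caps_from_string (
      "tensor/x-raw, datatype=(string)float32, type=(string)YOLOv3, "
      "subtype=(string){ small, medium, large }");
}
```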
When the tensor is stored as a meta, an allocation query needs to be used to
negotiate the tensor-decoder. TODO: expand how this would work.
##### Tensor-Decoder Sinkpad Caps Examples
Examples assuming object-detection on video frames
```
PadTemplates:
  SINK template: 'vsink' // Tensor attached to the buffer
    Availability: 'always'
    Capabilities:
      video/x-raw
        format: {...}
        width: {...}
        height: {...}
        framerate: {...}
  SINK template: 'tsink' // Tensor
    Availability: 'always'
    Capabilities:
      tensor/x-raw
        shape: {<a, b, ...z>} // This represents a x b x ... x z
        datatype: {(enum) "int8", "float32", ...}
        type: {(string)"YOLOv3", (string)"YOLOv4", (string)"SSD", ...}
```
##### Tensor-Decoder Srcpad(s)
Typically these will be the same as the sinkpad caps, but they could be
different. In general a tensor-decoder only attaches an analytics-meta to the
buffer; consuming the analytics-meta is left to other downstream elements. It's
also possible for a tensor-decoder to have very different caps on its srcpad.
This can be the case when the analytics result is difficult to represent in a
model-free way, like text-to-speech or super-resolution. In these cases the
tensor-decoder could produce a media directly: audio for TTS or an image for
super-resolution.
### Inference Sinkpad(s) Capabilities
Sinkpad capabilities, before being constrained based on the model, can be any
media type, including ```tensor```. Note that multiple sinkpads can be present.
#### Batch Inference
To support batch inference the ```tensor``` media type needs to be used.
Batching is a method used to spread the fixed time cost of scheduling work on
an accelerator (GPU) over multiple samples. Multiple samples (buffers) are
aggregated into a batch that is pushed to the inference. When a batch is used,
the output tensor will also contain analytics results in the form of a batch.
Un-batching requires information on how the batch was formed: buffer timestamp,
buffer source, media type, buffer caps. A batch can be formed from a single
source (time multiplexing), from multiple sources (source multiplexing), or
both (time and source multiplexing). Once multiplexed, the batch can be pushed
to the inference element as a tensor media.
TODO: Describe TensorBatchMeta
### Inference Srcpad(s) Capabilities
Srcpad capabilities will be identical to the sinkpad capabilities or a
```tensor```.
```
PadTemplates:
  SRC template: 'vsrc_%u' // Tensor attached to the buffer
    Availability: 'always'
    Capabilities:
      video/x-raw
        format: {...}
        width: {...}
        height: {...}
        framerate: {...}
  SRC template: 'asrc_%u' // Tensor attached to the buffer
    Availability: 'always'
    Capabilities:
      audio/x-raw
        format: {...}
        layout: {...}
        rate: [...]
        channels: [...]
  SRC template: 'tsrc_%u' // Tensor attached to the buffer
    Availability: 'always'
    Capabilities:
      text/x-raw
        format: {...}
  SRC template: 'src_%u' // Tensor
    Availability: 'always'
    Capabilities:
      tensor/x-raw
        shape: {<a, b, ...z>} // This represents a x b x ... x z
        datatype: {(enum) "int8", "float32", ...}
        type: {(string)"YOLOv3", (string)"YOLOv4", (string)"SSD", ...}
```
### New Video Format
TODO
- We need to add floating point video formats
### Batch Aggregator Element
TODO
# References
- [Onnx-Refactor-MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4916)
- [Analytics-Meta MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4962)