# Analytics

Analytics refers to the process of extracting information from the content of
one or more media streams. The analysis can be spatial only (e.g. image
analysis), temporal only (e.g. sound detection), spatio-temporal (e.g. tracking
or action recognition), or multi-modal (e.g. combining image and sound to
detect an environment or a behaviour). There are also scenarios where the
result of one analysis is used as the input of another, with or without
additional media. This design aims at supporting both ML-based analytics and
CV analytics, and offers a way to bridge the two techniques.

## Vision

With this design we aim to allow GStreamer application developers to build
analytics pipelines easily while taking full advantage of the acceleration
available on the platform where they deploy. The effort of moving an analytics
pipeline to a different platform should be minimal.

## Refinement Using Analytics Pipeline

Similarly to content-agnostic media processing (e.g. scaling, colour-space
conversion, serialization, ...), this design promotes re-usability and
simplicity by allowing complex analytics pipelines to be composed from simple,
dedicated analytics elements that complement each other.

## Example
A simple, hypothetical example of an analytics pipeline:

```
+---------+   +------------+   +---------------+   +----------------+
| v4l2src |   | video      |   | onnxinference |   | tensor-decoder |
|         |   | convert    |   |               |   |                |
|       src---sink scale src---sink1        src1---sink           src---+
|         |   |(pre-proc)  |   |  (analysis)   |   |  (post-proc)   |   |
+---------+   +------------+   +---------------+   +----------------+   |
   +---------------------------------------------------------------------+
   |  +-------------+   +------+
   |  | Analytic-   |   | sink |
   |  | overlay     |   |      |
   +--sink        src---sink   |
      | (analysis   |   |      |
      | -results    |   +------+
      | -consumer)  |
      +-------------+
```

## Supporting Neural Network Inference

There are multiple frameworks supporting neural network inference. These can be
described more generally as computing-graph frameworks, as they are generally
not limited to NN-inference applications. An existing NN-inference or
computing-graph framework, like ONNX, is encapsulated into a GstElement/Filter.
The inference element loads a model, a description of the computing-graph,
based on a property. The model expects input(s) in a specific format and
produces output(s) in a specific format. Depending on the model format,
input/output formats can be extracted from the model, as with ONNX, but this is
not always the case.

### Inference Element
Inference elements are an encapsulation of an NN-inference framework; they are
therefore specific to a framework, like ONNX Runtime.
Other inference elements can be added.

### Inference Input(s)
The input format is defined by the model. Using the model input format, the
inference element can constrain its sinkpad(s) capabilities. Note that because
a tensor is very generic, the term also encompasses images/frames, and the term
input tensor is also used to describe inference input.

### Inference Output(s)
Output(s) of the inference are tensors, and their format is also dictated by
the model. Analysis results are generally encoded in the output tensor in a way
that is specific to the model. Even models that target the same type of
analysis encode analysis results in different ways.

### Models Format Not Describing Inputs/Outputs Tensor Format
With some model formats, the input/output tensor formats are not described. In
this context, it is the responsibility of the analytics pipeline to push input
tensors with the correct format into the inference process, and the inference
element designer is left with two choices: supporting a model manifest where
inputs/outputs are described, or leaving the constraining/fixing of
inputs/outputs to the analytics pipeline designer, who can use a caps filter to
constrain the inputs/outputs of the model.
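
The manifest-based option can be pictured with a small plain-Python sketch.
The manifest layout, field names and shapes below are hypothetical, purely
for illustration; no such manifest format is defined by this design yet:

```python
# Hypothetical sketch: validating an input tensor against a model manifest.
# The manifest structure and its field names are illustrative, not a real
# format defined by GStreamer or ONNX.

MANIFEST = {
    "inputs": [
        {"name": "image", "shape": (3, 224, 224), "datatype": "float32"},
    ],
    "outputs": [
        {"name": "scores", "shape": (1, 1000), "datatype": "float32"},
    ],
}

def matches(manifest_entry, shape, datatype):
    """Return True if an incoming tensor matches the declared format."""
    return tuple(manifest_entry["shape"]) == tuple(shape) \
        and manifest_entry["datatype"] == datatype

print(matches(MANIFEST["inputs"][0], (3, 224, 224), "float32"))  # True
print(matches(MANIFEST["inputs"][0], (3, 320, 320), "float32"))  # False
```

With such a manifest, the inference element could constrain its pad
capabilities itself instead of relying on a caps filter placed by the
pipeline designer.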

### Tensor-Decoders
In order to preserve the generality of the inference element, tensor decoding
is omitted from it and left to specialized elements whose specific task is
decoding the tensors of a specific model. Additionally, tensor decoding does
not depend on a specific NN framework or inference element; this allows a
tensor-decoder to be re-used when the same model is used with a different
inference element. For example, a YOLOv3 tensor-decoder can be used to decode
tensors from inference with a YOLOv3 model in an element encapsulating ONNX or
TFLite. Note that a tensor-decoder can handle multiple tensors that have
similar encodings.

### Tensor
An N-dimensional array of values.

#### Tensor Type Identifier
This is an identifier, a string or quark, that uniquely identifies a tensor
type. A tensor type describes the specific format used to encode analysis
results in memory. This identifier is used by tensor-decoders to know if they
can handle the decoding of a tensor. For this reason, from an implementation
perspective, the tensor-decoder is the ideal location to store the
tensor-type-identifier, as the code is already model specific. Since the
tensor-decoder is by design specific to a model, no generality is lost by
storing the tensor-type-identifier there.

#### Tensor Datatype
This is the primitive type used to store the tensor data, like `int8`,
`uint8`, `float16`, `float32`, ...

#### Tensor Dimension Cardinality

The number of dimensions in the tensor.

#### Tensor Dimension

The tensor shape:

- [a], 1-dimensional vector
- [a x b], 2-dimensional vector
- [a x b x c], 3-dimensional vector
- [a x b x ... x n], N-dimensional vector
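
As a small plain-Python illustration (not a GStreamer API), both cardinality
and the number of stored values derive directly from the shape:

```python
from math import prod

def cardinality(shape):
    """Number of dimensions in the tensor (its cardinality)."""
    return len(shape)

def num_values(shape):
    """Number of primitive values a tensor of this shape holds."""
    return prod(shape)

shape = (100, 5)           # a 100 x 5 tensor
print(cardinality(shape))  # 2
print(num_values(shape))   # 500
```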
### Tensor-Decoders Need to Recognize Tensor(s) They Can Handle

As mentioned before, a tensor-decoder needs to be able to recognize the
tensor(s) it can handle. It's important to keep in mind that multiple tensors
can be attached to a buffer when tensors are transported as analytics-meta. It
would be easy to believe that a tensor's (cardinality + dimension + datatype)
is sufficient to recognize a specific tensor format, but we need to remember
that analysis results are encoded into the tensor, and retrieving them requires
a decoding process specific to the model. In other words, a tensor
A:{cardinality: 2, dimension: 100 x 5, datatype: int8} and a tensor
B:{cardinality: 2, dimension: 100 x 5, datatype: int8} can have completely
different meanings.

A could be object-detection where each candidate is encoded with (top-left)
coordinates, width, height and an object location confidence level:

```
0 : [ x1, y1, w, h, location confidence]
1 : [ x1, y1, w, h, location confidence]
...
99: [ x1, y1, w, h, location confidence]
```

B could be object-detection where each candidate is encoded with (top-left)
coordinates, (bottom-right) coordinates and an object class confidence level:

```
0 : [ x1, y1, x2, y2, class confidence]
1 : [ x1, y1, x2, y2, class confidence]
...
99: [ x1, y1, x2, y2, class confidence]
```

We can see that even if A and B have the same (cardinality, dimension,
datatype), a tensor-decoder expecting A but decoding B would produce wrong
results.

In general, for high-cardinality tensors the risk of having two tensors with
the same (cardinality + dimension + datatype) is low, but if we think of the
low-cardinality tensors typical of classification (1 x C), we can see that the
risk is much higher. For this reason we believe it's not sufficient for a
tensor-decoder to rely only on (cardinality + dimension + datatype) to identify
the tensors it can handle.
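
A minimal plain-Python sketch of the problem (the decoder names and decoding
strings are hypothetical, purely illustrative): two registered decoders share
the same shape and datatype, so only the tensor-type identifier can
discriminate between them:

```python
# Hypothetical decoder registry. Keying only on (shape, datatype) would make
# formats A and B collide; adding the tensor-type identifier disambiguates.

decoders = {}

def register(tensor_type, shape, datatype, decode_fn):
    decoders[(tensor_type, shape, datatype)] = decode_fn

register("detector-A", (100, 5), "int8", lambda t: "decode as x,y,w,h,conf")
register("detector-B", (100, 5), "int8", lambda t: "decode as x1,y1,x2,y2,conf")

def find_decoder(tensor_type, shape, datatype):
    return decoders.get((tensor_type, shape, datatype))

# Same shape and datatype, yet two different decoding processes:
print(find_decoder("detector-A", (100, 5), "int8")(None))
print(find_decoder("detector-B", (100, 5), "int8")(None))
```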
#### Tensor-Decoder Second Job: Non-Maximum Suppression (NMS)

A tensor-decoder's main functionality is to extract analytics results from
tensors, but in addition to decoding, a second phase of post-processing is
generally handled by the tensor-decoder. This post-processing phase is called
non-maximum suppression (NMS). The simplest example of NMS is with
classification. For every input, the classification model will produce a
probability for each potential class. In general, we're mostly interested in
the most probable class, or the few most probable classes, and there's little
value in transporting all class probabilities. In addition to keeping only the
most probable class (or classes), we generally want the probability to be above
a certain threshold, otherwise we're not interested in the result. Because a
significant portion of the analytics results out of the inference process don't
have much value, we want to filter them out as early as possible. Since
analytics results are only available after tensor decoding, the tensor-decoder
is tasked with this type of filtering (NMS). The same concept exists for
object-detection, where NMS generally involves calculating
intersection-over-union (IoU) in combination with location and class
probability. Because ML-based analytics are probabilistic, they generally need
a form of NMS post-processing.
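
As an illustration of the object-detection case, here is a minimal greedy NMS
sketch in plain Python (not GStreamer code): boxes are
(x1, y1, x2, y2, confidence), and a candidate that overlaps a
higher-confidence box beyond an IoU threshold is suppressed.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(candidates, conf_threshold=0.5, iou_threshold=0.5):
    """Greedy NMS over (x1, y1, x2, y2, confidence) candidates."""
    kept = []
    # Drop low-confidence results early, then scan best-first.
    for cand in sorted((c for c in candidates if c[4] >= conf_threshold),
                       key=lambda c: c[4], reverse=True):
        if all(iou(cand[:4], k[:4]) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept

candidates = [
    (10, 10, 50, 50, 0.9),      # kept (highest confidence)
    (12, 12, 52, 52, 0.8),      # suppressed: high IoU with the box above
    (100, 100, 140, 140, 0.7),  # kept: no overlap with kept boxes
    (0, 0, 30, 30, 0.3),        # dropped early: below confidence threshold
]
print(nms(candidates))
```

Doing this filtering in the tensor-decoder keeps low-value results from
propagating further down the pipeline.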
#### Handling Multiple Tensors Simultaneously In A Tensor-Decoder
Sometimes it is necessary, or more efficient, to have a tensor-decoder handle
multiple tensors simultaneously. In some cases the tensors are complementary,
and a tensor-decoder needs all of them to decode the analytics results. In
other cases it's just more efficient to do it simultaneously because of the
tensor-decoder's second job, NMS. Let's consider YOLOv3, where 3 output tensors
are produced for each input: one tensor represents the detection of small
objects, the second medium-size objects and the third large objects. In this
context it's beneficial to have the tensor-decoder decode the 3 tensors
simultaneously and perform NMS on all the results, otherwise analytics results
with low value would remain in the system for longer. This has implications for
the negotiation of tensor-decoders, which will be expanded in the section
dedicated to tensor-decoder negotiation.

### Why Interpreting (decoding) Tensors
As described above, tensors contain information and are used to store analytics
results. The analytics results are encoded into the tensor in a model-specific
way, and unless their consumers (the processes making use of the analytics
results) are also model specific, they need to be decoded. Deciding whether the
analytics pipeline will have elements producing and consuming tensors directly
in their encoded form, or whether a tensor-decoding process will be done
between tensor production and consumption, is a design decision that involves a
compromise between re-usability and performance. As an example, an
object-detection overlay element would need to be model specific to consume
tensors directly, and therefore would need to be re-written for any
object-detection model using a different encoding scheme; but if the only goal
of the analytics pipeline is to do this overlay, it would probably be the most
efficient implementation. Another aspect in favour of interpreting tensors is
that we can have multiple consumers of the analytics results, and leaving
tensor decoding to the consumers themselves means multiple consumers each
decoding tensors. On the other hand, we can think of two models specifically
designed to work together, where the outputs of one model become the inputs of
the downstream model. In this context the downstream model is not re-usable
without the upstream model, but they bypass the need for tensor decoding and
are very efficient. Another variation is multiple models merged into one model,
removing the need for multi-level inference; but again, this is a design
decision involving a compromise between re-usability, performance and effort.
We aim at providing support for all these use-cases and allowing the
analytics-pipeline designer to make the best design decision based on their
specific context.

#### Analytics-Meta
Analytics-meta (GstAnalyticsRelationMeta) is the foundation of the
re-usability of analytics results. Its goal is to store analytics results
(GstAnalyticsMtd) in an efficient way and to allow relations between them to be
defined. GstAnalyticsMtd is very primitive and meant to be expanded.
GstAnalyticsClassification (storage for classification results),
GstAnalyticsObjectDetection (storage for object-detection results) and
GstAnalyticsObjectTracking (storage for object-tracking) are specializations
and can be used as references to create other storage, based on
GstAnalyticsMtd, for other types of analytics results.
There are two major use-cases for the ability to define relations between
analytics results. The first one is to define a relation between analytics
results that were generated at different stages. A good example of this would
be a first analysis that detects cars in an image, and a second-level analysis
where only the section of the image presenting a car is pushed to a second
analysis to extract the brand/model of the car. This analytics result is then
appended to the original image with a relation to the object-detection result
that localized the car in the image. The other use-case for relations is to
create compositions by re-using existing GstAnalyticsMtd specializations. The
relations between different analytics results are completely decoupled from the
analytics results themselves. All relation definitions are stored in the
GstAnalyticsRelationMeta, which is a container of GstAnalyticsMtd and also
contains an adjacency matrix storing the relations. One of the benefits is the
ability of a consumer of analytics-meta to explore the graph and follow
relations between analytics results without having to understand every
analytics result on the relation path. Another important aspect is that
analytics-meta is not specific to machine-learning techniques and can also be
used to store analysis results from computer vision or other techniques. It can
thus be used as a bridge between the different techniques.
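
The relation storage can be pictured with a small plain-Python sketch. The
real API is GstAnalyticsRelationMeta; the structures and result strings below
are illustrative only. Results live in a flat list, relations in an adjacency
matrix that a consumer can walk without decoding each result:

```python
# Illustrative sketch of a relation meta: a flat list of results plus an
# adjacency matrix, mirroring the GstAnalyticsRelationMeta design.

mtds = ["od: car @ (40, 60, 200, 180)",  # object-detection result
        "cls: brand=acme (0.87)",        # classification from a 2nd analysis
        "trk: track-id 7"]               # tracking result

n = len(mtds)
related = [[False] * n for _ in range(n)]

def relate(src, dst):
    related[src][dst] = True

relate(1, 0)  # the brand classification relates to the detected car
relate(2, 0)  # the track relates to the same detection

def related_to(idx):
    """Follow the outgoing relations of one result without decoding it."""
    return [mtds[j] for j in range(n) if related[idx][j]]

print(related_to(1))  # the classification points back at the detection
```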
##### Storing Tensors Into Analytics-Meta
In an analytics pipeline where the output tensor of a first inference is pushed
directly, without tensor decoding, into a second inference, it would be useful
to store those tensors using analytics-meta, because we could then describe the
relation between the tensor of the first inference and the tensor of the second
inference. With the relation description, a tensor-decoder of the second
inference would be able to retrieve the associated tensor of the first
inference and extract potentially useful information that is not available in
the tensor of the second inference.

### Semantically-Agnostic Tensor Processing
Not all tensor processing is model dependent. Sometimes the processing can be
done uniformly on all of a tensor's values. Normalization, range adjustment,
offset adjustment and quantization are examples of operations that do not
require knowledge of how the information is encoded in the tensor. Contrary to
tensor-decoders, elements implementing these types of processing don't need to
know how information is encoded in the tensor, but they do need general
information about the tensor, like cardinality, dimension and datatype. Note
that GStreamer already does a lot of semantically-agnostic tensor processing
(remember that images/frames are also a form of tensor), like scaling,
cropping, colourspace conversion, ...
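
For example, a normalization pass only needs the datatype's value range, not
the encoding semantic. A plain-Python sketch (illustrative, not a GStreamer
API):

```python
def normalize(values, datatype_range):
    """Map raw tensor values into [0.0, 1.0] using only the datatype range.

    No knowledge of how analysis results are encoded is needed: the same
    operation applies uniformly to every value in the tensor.
    """
    lo, hi = datatype_range
    return [(v - lo) / (hi - lo) for v in values]

# uint8 tensor values, whatever they semantically encode:
print(normalize([0, 64, 128, 255], (0, 255)))
```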
#### Semantically-Agnostic Tensor Processing With Graph-Computing Framework
Graph-computing frameworks, like ONNX, can also be used for this type of
operation.

#### Tensor-Decoder Bin And Auto-Plugging
Since tensor-decoders are model specific, we expect that many will be created.
One way to simplify analytics pipeline creation and promote re-usability is to
provide a tensor-decoder bin able to identify the correct tensor-decoder for
the tensor to be decoded. It's possible to have multiple tensor-decoders able
to decode the exact same tensor but making use of specific acceleration; rank
can be used to identify the ideal tensor-decoder for a specific platform.

### Tensor Transport Mode
Two transport modes are envisioned: as Meta or as Media. Both modes have pros
and cons, which justifies supporting both.

#### Tensor Transport As Meta
In this mode the tensor is attached as a meta to the buffer (the media) on
which the analysis was performed. The advantage of this mode is that the
original media is kept in direct association with the analytics results.
Further refinement analysis, or consumption (like overlay) of the analytics
results, is easier when the media on which the analysis was performed is
available and easily identifiable. Another advantage is the ability to keep a
relation description between tensors in a refinement context. On the other
hand, this mode of transporting analytics results makes the negotiation of
tensor-decoders in particular difficult.

#### Tensor Transport As Media
In this mode the tensor is the media, potentially referring to the buffer (the
original media on which the analysis was performed) using a meta (the idea
behind the OriginalBufferMeta MR). The advantage of this mode is that
tensor-decoder negotiation is simple, but the association of the analytics
results with the original buffer on which the analysis was performed is more
difficult.

*Also note that this is the mode of transport used by NNStreamer.*

### Negotiation
Allowing the required analysis pre/post-processing to be negotiated, and
automatically injecting the elements able to perform it, would be very
valuable; it would minimize the effort of porting an analytics pipeline between
different platforms while making use of the available acceleration. A
tensor-decoder bin, auto-plugging of pre-processing (considering the available
acceleration), auto-plugging of the inference element (optimized for the
platform) and of post-processing, and a tensor-decoder bin selecting the
required tensor-decoders, potentially from multiple functionally equivalent
candidates based on which is best adapted to the platform, are all aspects to
consider when designing the negotiation involved in an analytics pipeline.

#### Negotiating Tensor-Decoder
As described above, a tensor-decoder needs to know 4 attributes of a tensor to
know if it can handle it:

1. Tensor dimension cardinality (not explicitly required in some cases)
2. Tensor dimension
3. Tensor datatype
4. Tensor type (identifier of the analytics-result encoding semantic)

Note that 1, 2 and 3 could be encoded into 4, but this is not desirable because
1, 2 and 3 are useful for selecting semantically-agnostic tensor processors.

A tensor-decoder can handle multiple tensor types. This could be expressed in
the sinkpad(s) template by a list of arrays where each combination of tensor
types it can handle would be expressed, but this would make the sinkpad(s) caps
difficult to read. To avoid this problem, when a tensor-decoder handles
multiple tensors the tensor type is a category that encapsulates all the tensor
types it can handle. Referring again to YOLOv3's 3 tensors (small, medium,
large): all 3 would have the same tensor-type identifier, e.g. YOLOv3, and each
tensor itself would have a subtype field distinguishing them ('small',
'medium', 'large'). The same also applies to FastSAM's 2 tensors
('FastSAM-masks', 'FastSAM-logits'), where both would be represented by the
same tensor type ('FastSAM') at the pad capability level.
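
A plain-Python sketch of this grouping (the type and subtype names follow the
YOLOv3 example; the field names and matching rule are illustrative, not a
defined GStreamer API):

```python
# A decoder advertises one tensor-type category and the subtypes it needs.
DECODER_TYPE = "YOLOv3"
DECODER_SUBTYPES = {"small", "medium", "large"}

def can_handle(tensors):
    """True if the attached tensors form exactly the set this decoder needs.

    `tensors` is a list of (type, subtype) pairs describing the tensors
    attached to one buffer.
    """
    ours = {sub for typ, sub in tensors if typ == DECODER_TYPE}
    return ours == DECODER_SUBTYPES

print(can_handle([("YOLOv3", "small"), ("YOLOv3", "medium"),
                  ("YOLOv3", "large")]))   # True
print(can_handle([("YOLOv3", "small")]))   # False: complementary tensors missing
```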

When the tensor is stored as a meta, the allocation query needs to be used to
negotiate the tensor-decoder. TODO: expand on how this would work.

##### Tensor-Decoder Sinkpad Caps Examples

Examples assuming object-detection on video frames:
```
PadTemplates:
  SINK template: 'vsink' // Tensor attached to the buffer as a meta
    Availability: 'always'
    Capabilities:
      video/x-raw
        format: {...}
        width: {...}
        height: {...}
        framerate: {...}

  SINK template: 'tsink' // Tensor as media
    Availability: 'always'
    Capabilities:
      tensor/x-raw
        shape: {<a, b, ..., z>} // This represents a x b x ... x z
        datatype: {(enum) "int8", "float32", ...}
        type: {(string)"YOLOv3", (string)"YOLOv4", (string)"SSD", ...}
```

##### Tensor-Decoder Srcpad(s)
Typically the srcpad caps will be the same as the sinkpad caps, but they can be
different. In general a tensor-decoder only attaches an analytics-meta to the
buffer; consuming the analytics-meta is left to other downstream elements. It's
also possible for a tensor-decoder to have very different caps on its srcpad.
This can be the case when a model-free representation of the analytics result
is difficult, like with text-to-speech or super-resolution. In these cases the
tensor-decoder could be producing a media directly: audio for TTS or an image
for super-resolution.

### Inference Sinkpad(s) Capabilities
Sinkpad capabilities, before being constrained based on the model, can be any
media type, including ```tensor```. Note that multiple sinkpads can be present.

#### Batch Inference
To support batch inference, the ```tensor``` media type needs to be used.
Batching is a method used to spread the fixed time cost of scheduling work on
an accelerator (GPU). Multiple samples (buffers) are aggregated into a batch
that is pushed to the inference. When a batch is used, the output tensor will
also contain analytics results in the form of a batch. Un-batching requires
information on how the batch was formed: buffer timestamp, buffer source, media
type, buffer caps. A batch can be formed from a single source (time
multiplexing), from multiple sources (source multiplexing), or from both (time
and source multiplexing). Once multiplexed, the batch can be pushed to the
inference element as tensor media.
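
The bookkeeping needed for un-batching can be sketched in plain Python. The
record fields (timestamp, source, caps) mirror the list in the paragraph
above; the structures themselves are illustrative only, not a proposed
GStreamer API:

```python
# Illustrative sketch: forming a batch from several sources while recording
# what is needed to un-batch the results afterwards.

def form_batch(samples):
    """samples: list of dicts with 'data', 'pts', 'source', 'caps'."""
    batch = [s["data"] for s in samples]
    layout = [{"pts": s["pts"], "source": s["source"], "caps": s["caps"]}
              for s in samples]
    return batch, layout

def unbatch(results, layout):
    """Re-associate per-sample results with their origin using the layout."""
    return [dict(entry, result=r) for entry, r in zip(layout, results)]

samples = [
    {"data": "frame-a", "pts": 0,  "source": "cam0", "caps": "video/x-raw"},
    {"data": "frame-b", "pts": 33, "source": "cam1", "caps": "video/x-raw"},
]
batch, layout = form_batch(samples)
out = unbatch(["2 cars", "1 person"], layout)
print(out[1]["source"], out[1]["result"])  # cam1 1 person
```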

TODO: Describe TensorBatchMeta

### Inference Srcpad(s) Capabilities

Srcpad capabilities will be identical to the sinkpad capabilities, or a
```tensor```.

```
PadTemplates:
  SRC template: 'vsrc_%u' // Tensor attached to the buffer as a meta
    Availability: 'always'
    Capabilities:
      video/x-raw
        format: {...}
        width: {...}
        height: {...}
        framerate: {...}

  SRC template: 'asrc_%u' // Tensor attached to the buffer as a meta
    Availability: 'always'
    Capabilities:
      audio/x-raw
        format: {...}
        layout: {...}
        rate: [...]
        channels: [...]

  SRC template: 'tsrc_%u' // Tensor attached to the buffer as a meta
    Availability: 'always'
    Capabilities:
      text/x-raw
        format: {...}

  SRC template: 'src_%u' // Tensor as media
    Availability: 'always'
    Capabilities:
      tensor/x-raw
        shape: {<a, b, ..., z>} // This represents a x b x ... x z
        datatype: {(enum) "int8", "float32", ...}
        type: {(string)"YOLOv3", (string)"YOLOv4", (string)"SSD", ...}
```

### New Video Format
TODO

- We need to add floating-point video formats

### Batch Aggregator Element
TODO

# Reference
- [Onnx-Refactor-MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4916)
- [Analytics-Meta MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4962)