Merge branch 'analytics_design_doc' into 'main'

RFC: doc: Add analytics support design documentation

See merge request gstreamer/gstreamer!6139
Author: Daniel Morin, 2024-05-03 20:23:59 +00:00, commit 7671495e8c

# Analytics
Analytics refers to the process of extracting information from the content of
the media (or medias). The analysis can be spatial only (e.g. image analysis),
temporal only (e.g. sound detection), spatio-temporal (e.g. tracking or action
recognition), or multi-modal (e.g. image+sound to detect an environment or
behaviour). There are also scenarios where the results of one analysis are used
as the input of another, with or without an additional media. This design aims
at supporting both ML-based analytics and CV analytics and offers a way to
bridge both techniques.
## Vision
With this design we aim to allow GStreamer application developers to develop
analytics pipelines easily while taking full advantage of the acceleration
available on the platform where they deploy. The effort of moving the analytics
pipeline to a different platform will be minimal.
## Refinement Using Analytics Pipeline
Similarly to content-agnostic media processing (e.g. scaling, colour-space
conversion, serialization, ...), this design promotes re-usability and
simplicity by allowing complex analytics pipelines to be composed from simple,
dedicated analytics elements that complement each other.
## Example
A simple hypothetical example of an analytics pipeline:
```
+---------+ +------------+ +---------------+ +----------------+
| v4l2src | | video-     | | onnxinference | | tensor-decoder |
|         | | convert-   | |               | |                |
|      src-sink scale  src-sink1        src1-sink          src--+
|         | | (pre-proc) | |  (analysis)   | |  (post-proc)   | |
+---------+ +------------+ +---------------+ +----------------+ |
  +--------------------------------------------------------------+
  |
  |  +--------------+   +------+
  |  | analytic-    |   | sink |
  |  | overlay      |   |      |
  +--sink        src-sink      |
     | (analysis-   |   |      |
     |  results-    |   +------+
     |  consumer)   |
     +--------------+
```
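For illustration, here is a minimal sketch of how an application could build a
similar pipeline with `gst_parse_launch()`. `onnxinference` is the element from
the ONNX refactor MR referenced below; `yolov3tensordec` and `analyticsoverlay`
are hypothetical tensor-decoder and overlay element names used only for this
sketch, and the model file path is a placeholder.
```
#include <gst/gst.h>

int
main (int argc, char **argv)
{
  GstElement *pipeline;
  GError *error = NULL;

  gst_init (&argc, &argv);

  /* Pipeline mirroring the diagram above. "yolov3tensordec" and
   * "analyticsoverlay" are hypothetical element names used for illustration;
   * the model file path is a placeholder. */
  pipeline = gst_parse_launch (
      "v4l2src ! videoconvertscale ! "
      "onnxinference model-file=yolov3.onnx ! "
      "yolov3tensordec ! analyticsoverlay ! autovideosink", &error);

  if (pipeline == NULL) {
    g_printerr ("Failed to build pipeline: %s\n", error->message);
    g_clear_error (&error);
    return 1;
  }

  gst_element_set_state (pipeline, GST_STATE_PLAYING);
  /* ... run a main loop and watch the bus here ... */
  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (pipeline);
  return 0;
}
```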
## Supporting Neural Network Inference
There are multiple frameworks supporting neural network inference. These can be
described more generally as computing-graph frameworks, as they are generally
not limited to NN inference applications. An existing NN-inference or
computing-graph framework, like ONNX, is encapsulated into a GstElement/Filter.
The inference element loads a model, a description of the computing-graph,
based on a property. The model expects input(s) in a specific format and
produces output(s) in a specific format. Depending on the model format, the
input/output formats can be extracted from the model, as with ONNX, but this is
not always the case.
### Inference Element
Inference elements are encapsulations of an NN-inference framework. They are
therefore specific to a framework, like ONNX Runtime.
Other inference elements can be added.
### Inference Input(s)
The input format is defined by the model. Using the model input format, the
inference element can constrain its sinkpad(s) capabilities. Note that because
a tensor is very generic, the term also encompasses images/frames, and the term
input tensor is also used to describe inference input.
### Inference Output(s)
Output(s) of the inference are tensors and their formats are also dictated by
the model. Analysis results are generally encoded in the output tensor in a way
that is specific to the model. Even models that target the same type of
analysis encode analysis results in different ways.
### Model Formats Not Describing Input/Output Tensor Formats
With some model formats the input/output tensor formats are not described. In
this context it's the responsibility of the analytics pipeline to push input
tensors with the correct format into the inference process. The inference
element designer is then left with two choices: supporting a model manifest
where inputs/outputs are described, or leaving the constraining/fixing of
inputs/outputs to the analytics pipeline designer, who can use a caps filter to
constrain the inputs/outputs of the model.
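As a sketch of the second option, the analytics pipeline designer could
constrain the tensor flowing into the inference element with a `capsfilter`.
The `tensor/x-raw` fields used here follow the caps description proposed later
in this document and are not an existing media type; the shape and datatype
values are illustrative.
```
#include <gst/gst.h>

/* Constrain an inference input whose model format does not describe it.
 * The tensor/x-raw fields follow the caps proposed later in this document;
 * shape and datatype values are illustrative. */
static GstElement *
make_model_input_constraint (void)
{
  GstElement *filter;
  GstCaps *caps;

  caps = gst_caps_from_string (
      "tensor/x-raw, shape=(int)< 1, 3, 416, 416 >, datatype=(string)float32");

  filter = gst_element_factory_make ("capsfilter", "model-input-constraint");
  g_object_set (filter, "caps", caps, NULL);
  gst_caps_unref (caps);

  return filter;
}
```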
### Tensor-Decoders
In order to preserve the generality of the inference element, tensor decoding
is omitted from the inference element and left to specialized elements whose
specific task is decoding tensors from a specific model. Additionally,
tensor-decoding does not depend on a specific NN-framework or inference
element; this allows re-use of a tensor-decoder when the same model is used
with a different inference element. For example, a YOLOv3 tensor-decoder can be
used to decode tensors from inference using a YOLOv3 model with an element
encapsulating ONNX or TFLite. Note that a tensor-decoder can handle multiple
tensors that have similar encodings.
### Tensor
N-dimensional vector.
#### Tensor Type Identifier
This is an identifier, string or quark, that uniquely identifies a tensor type.
A tensor type describes the specific format used to encode analysis results in
memory. This identifier is used by tensor-decoders to know if they can handle
the decoding of a tensor. For this reason, from an implementation perspective,
the tensor-decoder is the ideal location to store the tensor-type-identifier,
as the code is already model specific. Since the tensor-decoder is by design
specific to a model, no generality is lost by storing the tensor-type-identifier
there.
#### Tensor Datatype
This is the primitive type used to store tensor data, like `int8`, `uint8`,
`float16`, `float32`, ...
#### Tensor Dimension Cardinality
Number of dimensions in the tensor.
#### Tensor Dimension
Tensor shape.
- [a], 1-dimensional vector
- [a x b], 2-dimensional vector
- [a x b x c], 3-dimensional vector
- [a x b x ... x n], N-dimensional vector
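A small sketch tying these attributes together: the element count and memory
footprint of a tensor follow directly from its cardinality, dimensions and
datatype (the function name is illustrative).
```
#include <glib.h>

/* Illustrative only: number of elements and byte size of a tensor computed
 * from its cardinality (number of dimensions), its dimensions and the size
 * of its datatype (e.g. 4 bytes for float32). */
static gsize
tensor_size_bytes (gsize cardinality, const gsize * dims, gsize datatype_size)
{
  gsize n_elements = 1;
  gsize i;

  for (i = 0; i < cardinality; i++)
    n_elements *= dims[i];        /* [a x b x ... x n] */

  return n_elements * datatype_size;
}
```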
### Tensor-Decoders Need to Recognize Tensor(s) They Can Handle
As mentioned before, tensor-decoders need to be able to recognize the tensor(s)
they can handle. It's important to keep in mind that multiple tensors can be
attached to a buffer when tensors are transported as analytics-meta. It could
be easy to believe that a tensor's (cardinality + dimension + datatype) is
sufficient to recognize a specific tensor format, but we need to remember that
analysis results are encoded into the tensor and retrieving analysis results
requires a decoding process specific to the model. In other words a tensor
A:{cardinality: 2, dimension: 100 x 5, datatype: int8} and a tensor
B:{cardinality: 2, dimension: 100 x 5, datatype: int8} can have completely
different meanings.
A could be: (object-detection where each candidate is encoded with (top-left)
coordinates, width, height and object location confidence level)
```
0 : [ x1, y1, w, h, location confidence]
1 : [ x1, y1, w, h, location confidence]
...
99: [ x1, y1, w, h, location confidence]
```
B could be: (object-detection where each candidate is encoded with (top-left)
coordinates, (bottom-right) coordinates and object class confidence level)
```
0 : [ x1, y1, x2, y2, class confidence]
1 : [ x1, y1, x2, y2, class confidence]
...
99: [ x1, y1, x2, y2, class confidence]
```
We can see that even if A and B have the same (cardinality, dimension,
datatype), a tensor-decoder expecting A and decoding B would be wrong.
In general, for high cardinality tensors the risk of having two tensors with
the same (cardinality + dimension + datatype) is low, but if we think of low
cardinality tensors typical of classification (1 x C), we can see that the risk
is much higher. For this reason we believe it's not sufficient for a
tensor-decoder to rely only on (cardinality + dimension + datatype) to identify
the tensors it can handle.
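To make the point concrete, here is a sketch of what decoding layout A versus
layout B could look like; the struct layouts and field names are illustrative.
The same 100 x 5 int8 buffer is read with completely different semantics, so
feeding a B tensor to the A decoder silently produces wrong boxes.
```
#include <glib.h>

/* The same 100 x 5 int8 tensor read with two different semantics.
 * Illustrative only; a real decoder would also rescale quantized values. */

typedef struct { gint8 x1, y1, w,  h,  loc_conf; } CandidateA;  /* layout A */
typedef struct { gint8 x1, y1, x2, y2, cls_conf; } CandidateB;  /* layout B */

static void
decode_a (const gint8 * data, gsize n_rows)
{
  const CandidateA *c = (const CandidateA *) data;
  gsize i;

  for (i = 0; i < n_rows; i++)
    g_print ("A: box (%d,%d) %dx%d conf=%d\n",
        c[i].x1, c[i].y1, c[i].w, c[i].h, c[i].loc_conf);
}

static void
decode_b (const gint8 * data, gsize n_rows)
{
  const CandidateB *c = (const CandidateB *) data;
  gsize i;

  for (i = 0; i < n_rows; i++)
    g_print ("B: box (%d,%d)-(%d,%d) conf=%d\n",
        c[i].x1, c[i].y1, c[i].x2, c[i].y2, c[i].cls_conf);
}
```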
#### Tensor-Decoder Second Job: Non-Maximum Suppression (NMS)
A tensor-decoder's main functionality is to extract analytics-results from a
tensor, but in addition to decoding the tensor, in general a second phase of
post-processing is handled by the tensor-decoder. This post-processing phase is
called non-maximum suppression (NMS). The simplest example of NMS is with
classification. For every input the classification model will produce a
probability for each potential class. In general we're mostly interested in the
most probable class or the few most probable classes, and there's little value
in transporting all class probabilities. In addition to keeping only the most
probable class (or classes), we generally want the probability to be above a
certain threshold, otherwise we're not interested in the result. Because a
significant portion of the analytics results out of the inference process don't
have much value, we want to filter them out as early as possible. Since
analytics results are only available after tensor decoding, the tensor-decoder
is tasked with this type of filtering (NMS). The same concept exists for
object-detection, where NMS generally involves calculating
intersection-over-union (IoU) in combination with location and class
probability. Because ML-based analytics are probabilistic analyses, they
generally need a form of NMS post-processing.
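A minimal sketch of the classification case described above, keeping only the
most probable class and dropping the result when it is below a threshold; the
names and the threshold policy are illustrative.
```
#include <glib.h>

/* Sketch of the simplest NMS-like filtering described above: keep only the
 * most probable class and drop the result entirely if it is below a
 * threshold. */
static gboolean
keep_best_class (const gfloat * class_probs, gsize n_classes,
    gfloat threshold, gsize * best_class, gfloat * best_prob)
{
  gsize i;

  *best_class = 0;
  *best_prob = 0.0f;

  for (i = 0; i < n_classes; i++) {
    if (class_probs[i] > *best_prob) {
      *best_prob = class_probs[i];
      *best_class = i;
    }
  }

  /* Results below the threshold carry little value; filter them out as
   * early as possible instead of transporting them downstream. */
  return *best_prob >= threshold;
}
```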
#### Handling Multiple Tensors Simultaneously In A Tensor-Decoder
Sometimes it is needed, or more efficient, to have a tensor-decoder handle
multiple tensors simultaneously. In some cases the tensors are complementary
and a tensor-decoder needs both tensors to decode the analytics result. In
other cases it's just more efficient to do it simultaneously because of the
tensor-decoder's second job, NMS. Let's consider YOLOv3, where 3 output tensors
are produced for each input. One tensor represents detection of small objects,
the second tensor medium-size objects and the third tensor large objects. In
this context it's beneficial to have the tensor-decoder decode the 3 tensors
simultaneously and perform the NMS on all the results, otherwise analytics
results with low value would remain in the system for longer. This has
implications for the negotiation of the tensor-decoder, which will be expanded
on in the section dedicated to tensor-decoder negotiation.
### Why Interpreting (decoding) Tensors
As described above, tensors contain information and are used to store analytics
results. The analytics results are encoded in a model-specific way into the
tensor, and unless their consumers (the processes making use of the analytics
results) are also model specific, the tensors need to be decoded. Deciding if
the analytics pipeline will have elements producing and consuming tensors
directly in their encoded form, or if a tensor-decoding process will be done
between tensor production and consumption, is a design decision that involves a
compromise between re-usability and performance. As an example, an
object-detection overlay element would need to be model specific to directly
consume tensors, and therefore would need to be rewritten for any
object-detection model using a different encoding scheme, but if the only goal
of the analytics-pipeline is to do this overlay it would probably be the most
efficient implementation. Another aspect in favour of interpreting tensors is
that there can be multiple consumers of the analytics results; if tensor
decoding is left to the consumers themselves, it means multiple consumers
decoding the same tensors. On the other hand, we can think of two models
specifically designed to work together where the outputs of one model become
the inputs of the downstream model. In this context the downstream model is not
re-usable without the upstream model, but they bypass the need for
tensor-decoding and are very efficient. Another variation is that multiple
models are merged into one model, removing the need for multi-level inference,
but again this is a design decision involving a compromise between
re-usability, performance and effort. We aim at providing support for all these
use-cases and allowing the analytics-pipeline designer to make the best design
decision based on their specific context.
#### Analytics-Meta
Analytics-meta (GstAnalyticsRelationMeta) is the foundation of re-usability of
analytics results. Its goal is to store analytics-results (GstAnalyticsMtd) in
an efficient way and to allow relations to be defined between them.
GstAnalyticsMtd is very primitive and meant to be expanded.
GstAnalyticsClassification (storage for classification results),
GstAnalyticsObjectDetection (storage for object-detection results) and
GstAnalyticsObjectTracking (storage for object-tracking) are specializations
and can be used as references to create other storage, based on
GstAnalyticsMtd, for other types of analytics results.
There are two major use-cases for the ability to define relations between
analytics results. The first one is defining a relation between analytics
results that were generated at different stages. A good example of this could
be a first analysis that detects cars in an image and a second-level analysis
where only the section of the image presenting a car is pushed, to extract the
brand/model of the car in that section of the image. This analytics result is
then appended to the original image with a relation to the object-detection
result that localized the car in the image. The other use-case for relations is
to create compositions by re-using existing GstAnalyticsMtd specializations.
The relation between different analytics-results is completely decoupled from
the analytics results themselves. All relation definitions are stored in
GstAnalyticsRelationMeta, which is a container of GstAnalyticsMtd and also
contains an adjacency-matrix storing the relations. One of the benefits is the
ability of a consumer of analytics-meta to explore the graph and follow
relations between analytics-results without having to understand every
analytics-result on the relation path. Another important aspect is that
analytics-meta is not specific to machine-learning techniques and can also be
used to store analysis results from computer-vision or other techniques. It can
be used as a bridge between different techniques.
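A minimal sketch of how the car example above could be expressed with the
GstAnalytics API from gst-plugins-bad; the quark names, coordinates and
confidence values are illustrative.
```
#include <gst/gst.h>
#include <gst/analytics/analytics.h>

/* Sketch: attach a car detection and a related brand classification to the
 * same buffer with GstAnalyticsRelationMeta. Quark names, coordinates and
 * confidence values are illustrative. */
static void
attach_car_results (GstBuffer * buffer)
{
  GstAnalyticsRelationMeta *rmeta;
  GstAnalyticsODMtd od_mtd;
  GstAnalyticsClsMtd cls_mtd;
  GQuark car_class = g_quark_from_string ("car");
  GQuark brand = g_quark_from_string ("some-brand");
  gfloat brand_conf = 0.87f;

  /* Container for all analytics results attached to this buffer */
  rmeta = gst_buffer_add_analytics_relation_meta (buffer);

  /* First-level result: the car localized in the image */
  gst_analytics_relation_meta_add_od_mtd (rmeta, car_class,
      120, 80, 320, 200, 0.92f, &od_mtd);

  /* Second-level result: brand classification done on the cropped car */
  gst_analytics_relation_meta_add_cls_mtd (rmeta, 1, &brand_conf,
      &brand, &cls_mtd);

  /* Relate the classification to the detection that produced the crop */
  gst_analytics_relation_meta_set_relation (rmeta,
      GST_ANALYTICS_REL_TYPE_RELATE_TO,
      gst_analytics_mtd_get_id (&cls_mtd),
      gst_analytics_mtd_get_id (&od_mtd));
}
```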
##### Storing Tensors Into Analytics-Meta
To describe analytics results more precisely in an analytics pipeline where the
first inference's output tensor is pushed directly, without tensor decoding,
into a second inference, it would be useful to store those tensors using
analytics-meta, because we could then express the relation between the tensor
of the first inference and the tensor of the second inference. With this
relation description, a tensor-decoder of the second inference would be able to
retrieve the associated tensor of the first inference and extract potentially
useful information that is not available in the tensor of the second inference.
### Semantically-Agnostic Tensor Processing
Not all tensor processing is model dependent. Sometimes the processing can be
done uniformly on all of a tensor's values. Normalization, range adjustment,
offset adjustment and quantization are examples of operations that do not
require knowledge of how the information is encoded in the tensor. Contrary to
tensor-decoders, elements implementing these types of processing don't need to
know how information is encoded in the tensor, but they do need to know general
information about the tensor like cardinality, dimension and datatype. Note
that GStreamer already does a lot of semantically-agnostic tensor processing
(remember images/frames are also a form of tensor), like scaling, cropping,
colourspace conversion, ...
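A sketch of such a semantically-agnostic operation on a float32 tensor buffer:
only the element count and datatype matter, not how analytics results are
encoded (the function and its parameters are illustrative).
```
#include <gst/gst.h>

/* Sketch of a semantically-agnostic operation: apply a scale and offset to
 * every value of a float32 tensor buffer. Only the element count and the
 * datatype are needed; how analytics results are encoded is irrelevant. */
static gboolean
normalize_tensor (GstBuffer * buffer, gsize n_elements, gfloat scale,
    gfloat offset)
{
  GstMapInfo map;
  gfloat *data;
  gsize i;

  if (!gst_buffer_map (buffer, &map, GST_MAP_READWRITE))
    return FALSE;

  if (map.size < n_elements * sizeof (gfloat)) {
    gst_buffer_unmap (buffer, &map);
    return FALSE;
  }

  data = (gfloat *) map.data;
  for (i = 0; i < n_elements; i++)
    data[i] = data[i] * scale + offset;

  gst_buffer_unmap (buffer, &map);
  return TRUE;
}
```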
#### Semantically-Agnostic Tensor Processing With Graph-Computing Framework
Graph-computing frameworks, like ONNX, can also be used for this type of
operation.
#### Tensor-Decoder Bin And Auto-Plugging
Since tensor-decoders are model specific, we expect that many will be created.
One way to simplify analytics pipeline creation and promote re-usability is to
provide a tensor-decoder bin able to identify the correct tensor-decoder for
the tensor to be decoded. It's possible to have multiple tensor-decoders able
to decode the exact same tensor type but making use of specific acceleration.
Rank can be used to identify the ideal tensor-decoder for a specific platform.
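A sketch of how such a bin could pick a decoder by rank, assuming
tensor-decoders are registered as decoder factories and advertise the
`tensor/x-raw` `type` field proposed in this document on their sink caps; both
assumptions are illustrative.
```
#include <gst/gst.h>

/* Sketch: among the factories that can sink a given tensor type, pick the
 * highest-ranked one (i.e. the one best suited to this platform). */
static GstElementFactory *
pick_tensor_decoder (const gchar * tensor_type)
{
  GList *factories, *l;
  GstElementFactory *best = NULL;
  GstCaps *wanted = gst_caps_new_simple ("tensor/x-raw",
      "type", G_TYPE_STRING, tensor_type, NULL);

  /* All decoder-like factories, sorted by rank (highest first) */
  factories = gst_element_factory_list_get_elements (
      GST_ELEMENT_FACTORY_TYPE_DECODER, GST_RANK_MARGINAL);
  factories = g_list_sort (factories,
      gst_plugin_feature_rank_compare_func);

  for (l = factories; l != NULL; l = l->next) {
    GstElementFactory *f = GST_ELEMENT_FACTORY (l->data);
    if (gst_element_factory_can_sink_any_caps (f, wanted)) {
      best = gst_object_ref (f);
      break;
    }
  }

  gst_plugin_feature_list_free (factories);
  gst_caps_unref (wanted);
  return best;
}
```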
### Tensor Transport Mode
Two transport modes are envisioned: as Meta or as Media. Both modes have pros
and cons, which justifies supporting both.
#### Tensor Transport As Meta
In this mode the tensor is attached to the buffer (the media) on which the
analysis was performed. The advantage of this mode is that the original media
is kept in a direct association with the analytics results. Further refinement
analysis or consumption (like overlay) of the analytics results is easier when
the media on which the analysis was performed is available and easily
identifiable. Another advantage is the ability to keep a relation description
between tensors in a refinement context. On the other hand, this mode of
transporting analytics results makes negotiation of the tensor-decoder in
particular difficult.
#### Tensor Transport As Media
In this mode the tensor is the media, potentially referring to the buffer (the
original media on which the analysis was performed) using a Meta (the idea
behind the OriginalBufferMeta MR). The advantage of this mode is that
tensor-decoder negotiation is simple, but association of the analytics results
with the original buffer on which the analysis was performed is more difficult.
*Also note this is the mode of transport used by NNStreamer.*
### Negotiation
Being able to negotiate the required analysis pre/post-processing and to
automatically inject the elements able to perform it would be very valuable: it
would minimize the effort of porting an analytics-pipeline between different
platforms while making use of the acceleration available. Tensor-decoder bins,
auto-plugging of pre-processing (considering the acceleration available),
auto-plugging of the inference element (optimized for the platform),
post-processing, and a tensor-decoder bin selecting the required
tensor-decoders, potentially from multiple functionally equivalent ones better
adapted to the platform, are all aspects to consider when designing the
negotiation involved in an analytics-pipeline.
#### Negotiating Tensor-Decoder
As described above, a tensor-decoder needs to know 4 attributes of a tensor to
know if it can handle it:
1. Tensor dimension cardinality (not explicitly required in some cases)
2. Tensor dimension
3. Tensor datatype
4. Tensor type (identifier of the analytics-result encoding semantic)
Note that 1, 2 and 3 could be encoded into 4, but this is not desirable because
1, 2 and 3 are useful for selecting semantically-agnostic tensor processors.
A tensor-decoder can handle multiple tensor types. This could be expressed in
the sinkpad(s) template by a list of arrays where each combination of tensor
types it can handle would be expressed, but this would make the sinkpad(s) caps
difficult to read. To avoid this problem, when a tensor-decoder handles
multiple tensors the tensor type is a category that encapsulates all the tensor
types it can handle. Referring again to YOLOv3's 3 tensors (small, medium,
large): all 3 would have the same tensor-type identifier, e.g. YOLOv3, and each
tensor itself would have a subtype field distinguishing them ('small',
'medium', 'large'). The same also applies to FastSAM's 2 tensors
('FastSAM-masks', 'FastSAM-logits'), where both would be represented by the
same tensor type ('FastSAM') at the pad capability level.
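As a sketch, the caps a hypothetical YOLOv3 tensor-decoder sinkpad could
advertise under this convention; the `tensor/x-raw` media type and the
`type`/`subtype` fields are the ones proposed in this document, not existing
GStreamer caps.
```
#include <gst/gst.h>

/* Caps sketch for a hypothetical YOLOv3 tensor-decoder sinkpad: one tensor
 * type identifier ("YOLOv3"), with a subtype distinguishing the 3 tensors. */
static GstCaps *
make_yolov3_tensor_caps (void)
{
  return gst_caps_from_string (
      "tensor/x-raw, datatype=(string)float32, type=(string)YOLOv3, "
      "subtype=(string){ small, medium, large }");
}
```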
When the tensor is stored as a meta, an allocation query needs to be used to
negotiate the tensor-decoder. TODO: expand how this would work.
##### Tensor-Decoder Sinkpad Caps Examples
Examples assuming object-detection on video frames
```
PadTemplates:
  SINK template: 'vsink' // Tensor attached to the buffer
    Availability: 'always'
    Capabilities:
      video/x-raw
        format: {...}
        width: {...}
        height: {...}
        framerate: {...}
  SINK template: 'tsink' // Tensor
    Availability: 'always'
    Capabilities:
      tensor/x-raw
        shape: {<a, b, ...z>} // This represents a x b x ... x z
        datatype: {(enum) "int8", "float32", ...}
        type: {(string)"YOLOv3", (string)"YOLOv4", (string)"SSD", ...}
```
##### Tensor-Decoder Srcpad(s)
Typically these will be the same as the sinkpad caps, but they could be
different. In general a tensor-decoder only attaches an analytics-meta to the
buffer; consuming the analytics-meta is left to other downstream elements. It's
also possible for a tensor-decoder to have very different caps on its srcpad.
This can be the case when the analytics result is difficult to represent in a
model-free way, like text-to-speech or super-resolution. In these cases the
tensor-decoder could produce a media directly: audio for TTS or an image for
super-resolution.
### Inference Sinkpad(s) Capabilities
Sinkpad capabilities, before being constrained based on the model, can be any
media type, including ```tensor```. Note that multiple sinkpads can be present.
#### Batch Inference
To support batch inference the ```tensor``` media type needs to be used.
Batching is a method used to spread the fixed time cost of scheduling work on
an accelerator (GPU) over multiple samples. Multiple samples (buffers) are
aggregated into a batch that is pushed to the inference. When a batch is used,
the output tensor will also contain analytics results in the form of a batch.
Un-batching requires information on how the batch was formed: buffer timestamp,
buffer source, media type, buffer caps. A batch can be formed from a single
source (time multiplexing), from multiple sources (source multiplexing), or
both (time and source multiplexing). Once multiplexed, the batch can be pushed
to the inference element as a tensor media.
TODO: Describe TensorBatchMeta
### Inference Srcpad(s) Capabilities
Srcpad capabilities will be identical to the sinkpad capabilities or a
```tensor```.
```
PadTemplates:
  SRC template: 'vsrc_%u' // Tensor attached to the buffer
    Availability: 'always'
    Capabilities:
      video/x-raw
        format: {...}
        width: {...}
        height: {...}
        framerate: {...}
  SRC template: 'asrc_%u' // Tensor attached to the buffer
    Availability: 'always'
    Capabilities:
      audio/x-raw
        format: {...}
        layout: {...}
        rate: [...]
        channels: [...]
  SRC template: 'tsrc_%u' // Tensor attached to the buffer
    Availability: 'always'
    Capabilities:
      text/x-raw
        format: {...}
  SRC template: 'src_%u' // Tensor
    Availability: 'always'
    Capabilities:
      tensor/x-raw
        shape: {<a, b, ...z>} // This represents a x b x ... x z
        datatype: {(enum) "int8", "float32", ...}
        type: {(string)"YOLOv3", (string)"YOLOv4", (string)"SSD", ...}
```
### New Video Format
TODO
- We need to add floating point video formats
### Batch Aggregator Element
TODO
# References
- [Onnx-Refactor-MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4916)
- [Analytics-Meta MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4962)