Summary of Articles on situation recognition
Conceptual representations between video signals and natural language descriptions
Arens 2008
Overview
A redesign of the experimental implementation of the artificial cognitive vision system for incremental recognition of traffic situations from image sequences (M. Haag, H.-H. Nagel, 2000) is presented. Modifications are made to the knowledge representation formalisms utilized in the system, and an extension for the generation of natural language texts from videos is demonstrated.
Background and Related Work
In traffic situations, at least five levels of knowledge representation for a cognitive vision system can be identified:
- Representation of the geometry of space-time developments in the road scene: in the image plane (2D) and in the depicted scene (3D);
- Representation of driving maneuvers tied to specific traffic situations;
- Conceptual representation of all visible objects of interest: their attributes and elementary movements;
- Generic conceptual representations of the objects' configurations in space and time, as well as their expected configurations at later times;
- Some form of instantaneous natural language representation of developments.
The contributions of the work can be summarized as follows:
- Facilitation of the extension of the Situation Graph Tree (SGT) approach to recognition and representation from the exploratory phase presented in the previous work into practical application domains.
- Development of software tools to ease the initial formulation of SGTs and their subsequent evolution.
- A knowledge base for vehicle behavior in traffic situations, represented as an SGT, based on tracking results obtained from signal and geometry measurements by the overall system. Instantiating the SGT provides the input required to generate natural language descriptions of the developments in the video of interest.
There are various approaches to the representation and use of conceptual knowledge in machine vision systems. Uncertainty in vision systems can be attributed to the inherently noisy nature of video sequences, as with all sensor data streams, and to the vagueness of the concepts designed to communicate with the human user. To cope with this noise, it is typical to represent numerical data in a symbolic or conceptual form. One approach to representing this uncertainty is an explicit formalization based on a fuzzy extension of predicate logic. By linking the quantitative data to qualitative concepts based on background knowledge, a conceptual description of the video sequence can then be formulated.
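As a minimal illustration of this idea (the predicate names, degrees, and Python encoding below are hypothetical, not taken from the paper, which uses a fuzzy metric temporal logic), quantitative measurements can be asserted as fuzzy facts whose degrees of validity combine via minimum for conjunction:

```python
# Minimal sketch of fuzzy facts: each assertion carries a degree of
# validity in [0, 1]; conjunction takes the minimum of its literals.

facts = {
    ("has_speed", "agent_1", "fast"): 0.8,   # hypothetical degrees
    ("on_lane", "agent_1", "lane_3"): 1.0,
}

def degree(fact):
    """Degree of validity of a single fuzzy fact (0.0 if unknown)."""
    return facts.get(fact, 0.0)

def conj(*fs):
    """Fuzzy conjunction: a rule body holds to the degree of its weakest literal."""
    return min(degree(f) for f in fs)

# "agent_1 is driving fast on lane_3" holds with degree 0.8
print(conj(("has_speed", "agent_1", "fast"),
           ("on_lane", "agent_1", "lane_3")))
```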
Results and Evaluation
Conclusion
Representation of occurrences for road vehicle traffic
Gerber 2008
Overview
This work details a 3D-model-based computer vision procedure for converting vehicle trajectories extracted from image sequences of road traffic scenes (elementary vehicle actions) into conceptual representations based on Fuzzy Metric Temporal Horn Logic (FMTHL). The resulting verb phrases can serve as a basis for the higher layers of a situation recognition system.
Background and Related Work
Finding an accurate algorithmic transformation of video signals into natural language text is an important step towards machine vision. However, research on the derivation of conceptual representations and textual descriptions of agent behaviors from visual input is plagued by uncertainties in the geometric results estimated from video sequences and by the semantic gap between computer vision results and action descriptions. This contribution is concerned with the latter challenge and deals with isolatable agent activities, specifically those of rigid vehicles in videos recorded from road traffic.
Methodology
Video-based algorithmic text generation involves at least two disciplines, computer vision and computational linguistics, which greatly complicates its presentation and analysis.
In this work, three groups of processes can be identified for the transformation of video signals into their corresponding textual representation:
- The subsystem for video recording and the processing steps up to the extraction of a 3D geometric description of the scene. In the layered computer vision system presented in Arens 2008, this constitutes the Sensor-Actuator Level, the Image-Signal Level, the Picture-Domain Level, and the Scene-Domain Level. In this case, the result is the 3D vehicle status together with the 3D models of the vehicles that have been identified, initialized, and tracked.
- The conversion of this 3D spatio-temporal information into an elementary conceptual representation at the Conceptual-Primitives Level. These conceptual primitives are then aggregated to obtain information about, for example, the behavior of agents in the depicted scene at the Behavior-Representation Level.
- The Natural-Language Level, which comprises the natural language text-processing component and, possibly in the future, a natural language question-answering component.
From here onwards, the focus will be on the Conceptual-Primitives Level, the major contribution of this work.
Principal steps for Text Generation
The following steps can be identified for importing the geometric results of the core computer vision subsystem, together with the knowledge about the static part of the depicted scene, into the conceptual representation subsystem:
- Conversion of the input from a quantitative, numerical representation into a qualitative, conceptual one. First, the numerical input values are converted into discrete values compatible with predefined attribute schemes; then the attributes are combined to assert occurrences. The time scales relating to such conceptual representations of vehicle motion can extend from a fraction of a second to several seconds.
- Situation analysis: here, the most appropriate description of vehicular behavior is determined from the combination of primitive conceptual representations with knowledge about the conditions in the scene that may influence the switch between particular occurrences. The time scale here can range from several seconds to a minute or even more.
- A Perspectivated Conceptual Scene Description (PCSD) is then generated from this Conceptual Scene Description. The PCSD is obtained by exploiting the human reader's perspective on the temporal development of the recorded scene.
- The PCSD is then converted into a natural language description by first converting the image (sub)sequence into a Discourse Representation Structure, and then into the output text.
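A minimal sketch of how these four stages might chain together, assuming illustrative function names, thresholds, and vocabulary (the actual system is built around an FMTHL inference engine and a far richer attribute scheme):

```python
# Hypothetical end-to-end skeleton of the four text-generation steps.
# All names and thresholds are illustrative, not taken from the paper.

def quantize(velocity_mps):
    """Step 1: quantitative -> qualitative attribute value."""
    if velocity_mps < 0.3:
        return "standing"
    if velocity_mps < 8.0:
        return "slow"
    return "fast"

def situation_analysis(attributes):
    """Step 2: combine attribute values into an occurrence description."""
    if attributes[0] == "standing" and attributes[-1] != "standing":
        return "drives_off"
    return "moves_" + attributes[-1]

def perspectivate(occurrence, agent):
    """Step 3: order/filter occurrences for a human reader (PCSD)."""
    return {"agent": agent, "occurrence": occurrence}

def verbalize(pcsd):
    """Step 4: render a Discourse-Representation-like structure as text."""
    return f"{pcsd['agent']} {pcsd['occurrence'].replace('_', ' ')}."

velocities = [0.1, 0.2, 2.5, 6.0]          # one value per frame
attrs = [quantize(v) for v in velocities]
print(verbalize(perspectivate(situation_analysis(attrs), "the car")))
# -> "the car drives off."
```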
Results and a priori knowledge imported from the computer vision subsystem
The computer vision subsystem provides tracking results and geometric lane data to the conceptual representation subsystem. The geometric lane data is then converted into a conceptual representation of the lane structure in the recorded scene for use by the FMTHL inference engine.
- Importation of tracking results: tracking results constitute the geometric values of position, orientation, velocity, and steering angle for each agent and frame time point. These are converted into FMTHL facts, which are imported into the conceptual representation subsystem (see the sketch below).
- Lane geometry: the geometric lane model, which consists of points, lines, and lane segments, is used in the conceptual representation subsystem to associate agent positions with lanes and to derive conceptual descriptions of vehicle behavior. The lane data is converted into its conceptual representation in the form of facts and FMTHL rules, making additional elementary properties of the lane geometry available in conceptual rather than quantitative geometric terms.
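A minimal sketch of how per-frame tracking results could be serialized into logic facts; the predicate name has_status and the argument layout are assumptions for illustration, not the paper's actual fact schema:

```python
# Convert numerical tracking output into Prolog-style fact strings.
# Predicate name and argument order are hypothetical.

def tracking_to_facts(agent_id, frame, x, y, orientation, velocity, steering):
    return (f"has_status(agent_{agent_id}, t_{frame}, "
            f"pos({x:.2f}, {y:.2f}), orient({orientation:.2f}), "
            f"vel({velocity:.2f}), steer({steering:.2f})).")

print(tracking_to_facts(1, 42, 12.3, -4.5, 1.57, 8.2, 0.05))
# has_status(agent_1, t_42, pos(12.30, -4.50), orient(1.57), vel(8.20), steer(0.05)).
```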
Generation of primitive conceptual representations for time-dependent agent properties
The facts obtained from the data provided by the computer vision subsystem constitute the link between the computer vision and conceptual representation subsystems. However, since some of the facts obtained as part of the object status are quantitative numerical values (e.g., the velocity and position of agents), it is necessary to compute a degree of validity for the mapping of these geometric values to the discrete values that serve as inputs to the logical predicates used in the conceptual representation stage.
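Degrees of validity of this kind are commonly computed with overlapping trapezoidal membership functions. The sketch below is an illustration only; the concept names and breakpoints are assumptions, not the values used in the paper:

```python
def trapezoid(x, a, b, c, d):
    """Degree of validity of x for a trapezoidal concept (a <= b <= c <= d)."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Hypothetical speed concepts (m/s): the breakpoints overlap so that a
# value can partially satisfy two neighboring concepts.
SPEED_CONCEPTS = {
    "zero":   (-1.0, 0.0, 0.0, 0.5),
    "small":  (0.0, 0.5, 3.0, 5.0),
    "normal": (3.0, 5.0, 12.0, 16.0),
    "high":   (12.0, 16.0, 40.0, 45.0),
}

def speed_degrees(v):
    return {name: trapezoid(v, *p) for name, p in SPEED_CONCEPTS.items()}

print(speed_degrees(4.0))   # partially "small", partially "normal"
```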
Occurrences
Although short-term observations of a road vehicle can provide a fairly good prediction of its subsequent motion over the next few seconds, a more reliable prediction can be obtained by taking into account the behavior of other traffic participants. An occurrence can generally be described as a recognizable movement primitive and can, in the case of road vehicle motion, be categorized as:
- perpetuative, if they tend to retain the dominant aspect of a movement without major change;
- mutative, if they characterize the systematic change of some aspect;
- terminative, if they relate to the beginning or ending of a dominant movement characteristic.
Each occurrence can be characterized uniquely by a conjunction of predicates: a precondition, which has to be satisfied before the occurrence is considered valid; a monotonicity condition, indicating the type of admissible monotonous change which may take place while the occurrence represents a valid description; and a postcondition, which becomes true once the occurrence in question no longer constitutes an adequate description of the temporal development in which the agent is involved. The occurrences detailed in this work refer to (see the sketch after this list):
- only the agent;
- the agent and a location;
- the agent and an additional object;
- the agent and a lane.
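A minimal sketch of the precondition / monotonicity condition / postcondition structure described above, with an illustrative mutative occurrence; the class layout and predicates are assumptions, not the paper's FMTHL encoding:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Occurrence:
    name: str
    precondition: Callable[[float], bool]         # must hold before the occurrence starts
    monotonicity: Callable[[float, float], bool]  # admissible change while it holds
    postcondition: Callable[[float], bool]        # holds once the occurrence has ended

# Hypothetical mutative occurrence: the agent accelerates.
accelerating = Occurrence(
    name="accelerating",
    precondition=lambda v: v >= 0.0,
    monotonicity=lambda v_prev, v: v > v_prev,  # velocity strictly increases
    postcondition=lambda v: True,               # any state may follow
)

def holds_over(occ: Occurrence, velocities: Sequence[float]) -> bool:
    """Check the occurrence against a short velocity time series."""
    return (occ.precondition(velocities[0])
            and all(occ.monotonicity(a, b)
                    for a, b in zip(velocities, velocities[1:]))
            and occ.postcondition(velocities[-1]))

print(holds_over(accelerating, [1.0, 2.0, 3.5]))  # True
```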
Additionally, to take temporal dependencies into account, transductors (finite state acceptance automata) are applied for occurrence recognition. A transductor is designed for each occurrence type (perpetuative, mutative, terminative) to determine whether or not the required conditions are satisfied in the prescribed temporal order.
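Conceptually, such a transductor steps through its states as the precondition, the monotonicity condition, and the postcondition are observed in the prescribed order. A minimal sketch, with hypothetical state names and condition functions:

```python
# Tiny acceptance automaton: for one occurrence, the conditions must be
# observed in the order precondition -> monotonicity -> postcondition.
# State names and the example condition functions are assumptions.

def recognize(values, pre, mono, post):
    state = "WAIT_PRE"
    prev = None
    for v in values:
        if state == "WAIT_PRE":
            if pre(v):
                state = "IN_OCCURRENCE"
        elif state == "IN_OCCURRENCE":
            if not mono(prev, v):
                # Admissible change violated: the occurrence ends here.
                state = "ENDED" if post(v) else "REJECTED"
        prev = v
    # Accepted if the conditions were met in the prescribed order.
    return state in ("IN_OCCURRENCE", "ENDED")

# Example: "accelerating" ends when the velocity stops increasing.
vels = [0.5, 1.0, 2.0, 2.0]
print(recognize(vels,
                pre=lambda v: v > 0.0,
                mono=lambda p, v: v > p,
                post=lambda v: True))   # True: accepted, ends at the plateau
```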
Results
Conclusion
Rule-Based High-Level Situation Recognition from Incomplete Tracking Data
Münch 2012
Overview
This work presents two techniques for improving the robustness of the FMTL/SGT high-level situation recognition system, and also provides a knowledge base for recognizing vehicle-centered situations.
Background and Related Work
High-level situation recognition can be summarized as a two-step process:
- The video data is processed to extract descriptions of, and tracks for, objects of interest: people, vehicles, etc.
- The tracks are processed to detect occurrences of interesting situations.
Errors in high-level situation recognition arise from two unavoidable phenomena: model imperfections (some aspects of the domain remain unobserved) and noisy observations. In particular, noisy observations manifest as data gaps, which arise from object occlusion, object motion in areas with no sensor coverage, and technical problems in machine perception. It is therefore necessary to adopt or formulate techniques that enable high-level recognition to effectively handle these, and possibly other, forms of uncertainty.
Situation recognition can be performed directly on the acquired videos, or through a layered approach with each layer performing a specialized recognition task. The Situation Graph Tree approach presented here is one such hierarchical approach; it is based on Fuzzy Metric Temporal Logic (FMTL), which has been shown to allow for multi-hypothesis, real-time inference in a number of domains.
Methods
FMTL allows for the handling of both uncertainty and vagueness; examples have been demonstrated in modelling aspects of the traffic domain and of human behavior. The SGT-Editor is a powerful framework in which both the internal representation of SGTs and the inference algorithm are programmed in formal FMTL. This in turn allows for fast, accurate inference about complex information.
Handling Incomplete Data
- Interpolation of input data: here, interpolation is used to fill in sensor (input) data missing for a particular interval of time by using the data neighboring that time interval. Weight adjustments are made to reflect the level of influence of the data at the beginning of the interval versus that at its end. Special consideration has to be given, however, to data of a radial nature, such as angular values (see the sketch below).
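A minimal sketch of such gap filling, including the wrap-around handling needed for radial data such as orientation angles; the linear weighting scheme and function signature are assumptions for illustration:

```python
import math

def interpolate_gap(v_start, v_end, n_missing, radial=False):
    """Fill n_missing values between two known samples.

    Each filled value is a weighted mix of the gap's endpoints; the weight
    shifts linearly from the start value toward the end value.  Radial data
    (angles in radians) is interpolated along the shorter arc.
    """
    filled = []
    for i in range(1, n_missing + 1):
        w = i / (n_missing + 1)          # influence of the end value
        if radial:
            # Shortest signed angular difference, wrapped to (-pi, pi].
            diff = math.atan2(math.sin(v_end - v_start),
                              math.cos(v_end - v_start))
            filled.append((v_start + w * diff) % (2 * math.pi))
        else:
            filled.append((1 - w) * v_start + w * v_end)
    return filled

print(interpolate_gap(0.0, 10.0, 4))              # [2.0, 4.0, 6.0, 8.0]
print(interpolate_gap(6.1, 0.2, 3, radial=True))  # wraps through 2*pi
```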
- Hallucinating high-level evidence: here, noisy data is dealt with by allowing the instantiation of situation schemes in which most, but not all, of the preconditions are satisfied. The situation recognition system is extended to hallucinate the missing evidence so that the situation scheme can be instantiated, and the situation graph traversal can therefore proceed along that path of inference (see the sketch below).
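The idea can be sketched as relaxed rule matching: a situation scheme may fire even if a small number of its preconditions lack supporting evidence, with the missing facts asserted as hallucinated. The tolerance threshold and fact encoding below are illustrative assumptions:

```python
# Relaxed matching: instantiate a situation scheme even if a few of its
# preconditions lack supporting evidence, hallucinating the missing facts.
# The tolerance (max_missing) and fact encoding are hypothetical.

def try_instantiate(scheme_preconditions, observed_facts, max_missing=1):
    missing = [p for p in scheme_preconditions if p not in observed_facts]
    if len(missing) > max_missing:
        return None                      # too little evidence: do not fire
    return {"satisfied": [p for p in scheme_preconditions if p in observed_facts],
            "hallucinated": missing}     # traversal may proceed on this path

observed = {"agent_near_car", "agent_slowing_down"}
scheme = ["agent_near_car", "agent_slowing_down", "car_door_open"]
print(try_instantiate(scheme, observed))
# {'satisfied': ['agent_near_car', 'agent_slowing_down'],
#  'hallucinated': ['car_door_open']}
```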
- Knowledge base: the start situation scheme PatientCar instantiates the car as the patient for the current agent. The situation schemes CarFar and CarNear can then be reached through temporal edges. Similar descriptions can be drawn for the rest of the situation schemes in the SGT.
Results
The publicly available VIRAT video dataset is used to test the proposed methods. The dataset provides files in which people and objects of interest are annotated, as well as a file in which semantically interesting situations, and all participating agents in the environment of a car park, are annotated. Some examples are: getting out of a vehicle, closing a trunk, etc. Some results from the analysis of six videos are:
- The proposed method never missed an interesting situation in the test set; however, some false positive classifications led to degraded results.
- The proposed method is capable of handling incomplete data even if half of the data is missing.
- The false positive rate increases with larger amounts of missing data, but at a rate lower than that obtained without data interpolation and hallucination.
Conclusion
- A cognitive system that can deal with incomplete data for situation recognition is presented.
- Incomplete data in a rule-based expert system can be dealt with both by interpolating the input data and its uncertainty and by hallucinating high-level evidence.
- The work extended the SGT-Editor and the situation recognition inference algorithm to handle incomplete data.
- Finally, a knowledge base for recognizing vehicle-centered situations is provided, together with the first evaluation of the VIRAT video dataset at a high semantic level.