The University of Central Florida invention is a real-time online system and method that can detect multiple activities occurring in long, untrimmed security videos. The invention uses a deep learning approach to process videos online at the clip level, drastically reducing the computation time for detecting activities. Because the method processes one clip at a time in an online fashion, it remains robust to activities of varying length. The methodology was tested on the VIRAT and MEVA (Multiview Extended Video with Activities) datasets, comprising more than 250 hours of video, and demonstrated effective performance in both processing speed and activity detection. The invention can process high-resolution security videos at 100 frames per second.
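The clip-level online pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the invention's implementation: `clip_model` stands in for the deep-learning detector, and merging consecutive same-label clips into variable-length segments is a hypothetical simplification of how clip-wise processing handles activities of varying length.

```python
from typing import Callable, Iterable, List, Tuple

def detect_online(frames: Iterable, clip_len: int,
                  clip_model: Callable) -> List[Tuple[int, int, str]]:
    """Process a frame stream one clip at a time, merging contiguous
    detections of the same activity into variable-length segments."""
    segments = []          # (start_frame, end_frame, label)
    clip, start = [], 0
    for i, frame in enumerate(frames):
        clip.append(frame)
        if len(clip) == clip_len:
            # clip_model returns the activity labels active in this clip
            for label in clip_model(clip):
                if segments and segments[-1][2] == label and segments[-1][1] == start:
                    # Same activity continues: extend the previous segment
                    s, _, lbl = segments[-1]
                    segments[-1] = (s, start + clip_len, lbl)
                else:
                    segments.append((start, start + clip_len, label))
            clip, start = [], i + 1
    return segments
```

Because each clip is processed as soon as it arrives, memory and latency stay constant regardless of how long the video runs, which is what makes the online formulation suitable for continuous security footage.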
The University of Central Florida invention provides a low-cost system that significantly reduces the cost of annotating videos. Video activity detection normally requires annotations at every frame, which drastically increases labeling cost. The UCF invention instead achieves strong performance using only a small number of carefully chosen annotations. An example application is preparing large-scale video datasets for video analysis tasks such as tracking, detection, and segmentation.
Technical Details: The invention’s Active Sparse Labeling (ASL) algorithm estimates the usefulness of each frame of a video, then suggests the frames and videos that can most improve dense video understanding tasks such as activity detection. Alongside the selection algorithm, the invention uses a Spatio-Temporal Weighted loss (STeW loss) to train video models on datasets with sparsely annotated frames. The invention works in two stages: first, it trains a deep-learning video model using very sparse frame annotations; then it uses the trained model to select additional frames for annotation based on their utility value. When tested on the public benchmark datasets UCF-101 and J-HMDB, with more than 400,000 frames, the invention reduced annotation cost by 90 percent and learned action detection in videos using only 10 percent of annotated video frames.
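The two-stage loop above can be illustrated with a toy sketch. Both `select_frames` (the active-selection step) and the `stew_weights` helper are hypothetical simplifications: in the invention, utility scores come from the trained deep model, and the STeW loss weights supervision spatio-temporally rather than by the 1-D temporal distance used here.

```python
def select_frames(utility, annotated, budget):
    """Stage two of the loop: rank unannotated frames by model-estimated
    utility and pick the top `budget` for the next annotation round."""
    candidates = [(u, i) for i, u in enumerate(utility) if i not in annotated]
    candidates.sort(reverse=True)
    return [i for _, i in candidates[:budget]]

def stew_weights(num_frames, annotated):
    """Hypothetical per-frame loss weights: supervision far from any
    annotated frame is down-weighted (a crude stand-in for STeW loss)."""
    return [1.0 / (1 + min(abs(i - a) for a in annotated))
            for i in range(num_frames)]
```

Alternating the two steps (train with `stew_weights`, annotate the frames returned by `select_frames`, retrain) is the active-learning cycle the disclosure describes.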
Partnering Opportunity: The research team is seeking partners for licensing, research collaboration, or both.
Stage of Development: Prototype available.
Are All Frames Equal? Active Sparse Labeling for Video Action Detection, 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
The University of Central Florida invention introduces a novel approach to person identification based on daily activities, addressing the limitations of traditional methods such as face recognition and gait analysis. Face recognition techniques, while advanced, often fail in real-world scenarios where facial features are not visible due to factors like long distances, environmental disturbances, occlusions (e.g., mask-wearing), and uncooperative subjects. Gait recognition, which analyzes walking patterns, also has limitations as individuals are not always walking in real-world situations.
This invention focuses on identifying individuals based on a wide range of daily activities, such as sitting, taking off a jacket, drinking water, and more. These activities provide unique behavioral cues that can be instrumental in identifying individuals even when facial information is unavailable. The approach leverages video analysis to process both biometric features (e.g., body shape, gait) and non-biometric features (e.g., clothing, background) from video data.
Technical Details: The UCF invention first receives an RGB video input containing a mix of biometric features (such as body shape and gait) and non-biometric features (such as clothing and background). The video is passed through a video encoder that extracts spatio-temporal features, which are divided into two streams: an "activity head" that analyzes the specific activities depicted in the video, and an "actor head" that distinguishes the biometric features of the individual from non-biometric appearance features.
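The two-stream structure can be sketched as follows. This is purely structural: the encoder and both heads are caller-supplied stubs here, whereas in the invention they are deep-network modules.

```python
class TwoHeadModel:
    """Sketch of the two-stream design: a shared video encoder feeds
    an 'activity head' and an 'actor head' on the same features."""

    def __init__(self, encoder, activity_head, actor_head):
        self.encoder = encoder
        self.activity_head = activity_head
        self.actor_head = actor_head

    def forward(self, video):
        feats = self.encoder(video)  # spatio-temporal features
        # Both heads consume the same shared representation
        return self.activity_head(feats), self.actor_head(feats)
```

Sharing one encoder lets the activity and identity tasks regularize each other during the joint training described below.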
To enhance identification accuracy, the invention employs a bias-less distillation process, where a silhouette version of the video is processed through a teacher network to distill unbiased biometric features back into the main model, filtering out appearance-related biases. Additionally, a bias-learning technique distorts the original video to obscure biometric features while keeping non-biometric features intact, allowing the system to learn and compensate for appearance biases during training. The method concludes with a joint training process that refines both activity and actor features, enabling the system to use these enhanced features for accurate identification by comparing them against a pre-existing gallery of identities.
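The joint objective described above might be combined as in the sketch below. The mean-squared distillation term and the `distill_weight` hyperparameter are assumptions for illustration, not the invention's exact losses.

```python
def distill_loss(student_feats, teacher_feats):
    """Mean-squared gap between RGB-stream (student) features and
    silhouette-teacher features: a stand-in for the bias-less
    distillation term that filters out appearance biases."""
    n = len(student_feats)
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / n

def joint_loss(id_loss, student_feats, teacher_feats, distill_weight=0.5):
    """Hypothetical joint objective: identification loss plus a weighted
    distillation term pulling student features toward the teacher's."""
    return id_loss + distill_weight * distill_loss(student_feats, teacher_feats)
```

At inference time, only the distilled main model is needed; identities are then retrieved by comparing its features against the pre-existing gallery, as the disclosure describes.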