Research Terms
Center for Research in Computer Vision
Director | Mubarak Shah
Phone | 407.823.1119
Website | https://www.crcv.ucf.edu/
Mission | The common goal and purpose of the center is to promote basic research in computer vision and its applications in all related areas, including National Defense and Intelligence, Homeland Security, Environmental Monitoring, Life Sciences and Biotechnology, and Robotics.
Applications for facial recognition have diversified from security into the domain of entertainment videos. By applying this technology, users search for a particular actor and receive a list of videos in which the actor’s face appeared. In order to accomplish this, a fully automatic end-to-end system for video face recognition is required, which leverages information from still images both in the known dictionary and in the video itself.
In the past, applying the popular ℓ1-minimization Sparse Representation-based Classification (SRC) to video has been prohibitively expensive because it operates on a frame-by-frame basis. Instead of computing SRC on each frame, which takes approximately 45 minutes per track, the Mean Sequence SRC (MSSRC) method reduces a face track to a single feature vector for ℓ1-minimization (1.5 minutes per track) while maintaining recall at 90 percent precision. By using all of the available video data in a face track, this method provides greater accuracy in identifying faces and excels at rejecting unknown identities. When tested on the Movie Trailer Face Dataset, the method outperformed many existing state-of-the-art approaches in identifying actors in movie trailers.
Technical Details
This method developed by UCF researchers performs the difficult task of face tracking using high-performance SHORE face detection, which generates tracks using two metrics: spatial and appearance. While the spatial metric compares the current bounding box with the previous one, the appearance metric computes a histogram intersection of the local bounding box, which can handle abrupt changes in the scene and the face. Each new face is compared to existing tracks, and if the location and appearance metrics are similar, the face is added to the track. Using this method, 113 movie trailers have been processed to form the Movie Trailer Face Dataset, which consists of 3,585 face tracks. On these face tracks, the MSSRC algorithm performs video face recognition as a joint optimization, leveraging all of the available video data and the knowledge that the frames of a face track belong to the same individual. The MSSRC algorithm increases speed and cuts computational costs by applying ℓ1-minimization to the mean of the face track, which is five times as fast as a frame-by-frame application.
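The mean-then-sparse-code step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the researchers' implementation: the feature dictionary and track features are synthetic, and a simple ISTA solver stands in for whatever ℓ1-minimization routine the paper uses.

```python
import numpy as np

def soft_threshold(x, t):
    # Elementwise soft-thresholding, the proximal operator of the l1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def l1_sparse_code(D, y, lam=0.05, n_iter=500):
    # ISTA: minimize 0.5*||D x - y||^2 + lam*||x||_1.
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - (D.T @ (D @ x - y)) / L, lam / L)
    return x

def mssrc_classify(track_features, dictionary, labels, lam=0.05):
    """Classify a face track by sparse-coding the mean track feature.

    track_features: (n_frames, d) features for one face track.
    dictionary:     (d, n_atoms) columns are known-identity face features.
    labels:         length-n_atoms identity label per dictionary column.
    """
    y = track_features.mean(axis=0)        # collapse the track to one vector
    y = y / (np.linalg.norm(y) + 1e-12)
    x = l1_sparse_code(dictionary, y, lam)
    # Assign the identity whose atoms reconstruct y with the lowest residual.
    residuals = {}
    for c in set(labels):
        mask = np.array([l == c for l in labels])
        xc = np.where(mask, x, 0.0)
        residuals[c] = np.linalg.norm(y - dictionary @ xc)
    return min(residuals, key=residuals.get)
```

Because the whole track is pooled into one vector before the (expensive) ℓ1 solve, the solver runs once per track instead of once per frame, which is where the speed-up in the description comes from.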
Researchers at the University of Central Florida have developed a method for determining an accurate head count from video or still images, eliminating the need to count individuals in very dense crowds manually. Existing crowd-counting algorithms cannot distinguish individuals in crowds of hundreds or thousands, resulting in counting errors, and most existing algorithms for exact counting have been tested only on low- to medium-density crowds (3-53 people per frame). In contrast, the new approach produces accurate counts from still images or video containing an average of 1,280 people per frame.
In ticketed events such as concerts, marathons, religious ceremonies, and sports games, obtaining a count of participants is relatively easy. In events where participants are not registered, however, such as political speeches and public protests, measuring the number of constantly shifting attendees becomes crucial. The exact size of a crowd can matter to candidates, the media, and law enforcement, and relying on human estimation or inadequate algorithms can lead to errors. Thus, a method that accurately counts dense crowds in still images or video is needed.
Technical Details
The new method from UCF counts dense crowds of people by analyzing an image at multiple densities. Although the density of people varies across the image, adjacent patches tend to be similar, allowing an accurate estimate to be built from counts of individuals in small patches. In medium-density crowds, the process recognizes the periodic occurrence of heads (the harmonics), which it captures through Fourier analysis; in high-density crowds, the texture of the crowd is captured through the scale-invariant feature transform (SIFT). The algorithm then applies new constraints in a multi-scale Markov random field to fuse the patch estimates and infer a single count over the entire image.
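The Fourier step can be illustrated with a toy sketch: in a quasi-periodic patch profile, the dominant non-DC frequency equals the number of repetitions across the window. Everything below is an illustrative assumption, not the UCF implementation; in particular, a plain sum over patches stands in for the multi-scale Markov random field fusion, and the synthetic cosine "crowd" stands in for real image intensities.

```python
import numpy as np

def fourier_count_1d(signal):
    # Dominant non-DC frequency of a quasi-periodic 1-D profile: after
    # removing the mean (the DC term), the argmax frequency index equals
    # the number of repetitions ("heads") across the window.
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    return int(np.argmax(spectrum))

def count_image(image, patch=64):
    # Slide non-overlapping patches; estimate each patch's count from the
    # dominant frequency of its column-sum profile, then simply sum the
    # patch estimates (a stand-in for the MRF fusion described above).
    h, w = image.shape
    total = 0
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            profile = image[r:r + patch, c:c + patch].sum(axis=0)
            total += fourier_count_1d(profile)
    return total
```

For a synthetic image whose intensity repeats every 16 pixels, each 64-pixel patch contributes a count of 4, and the per-patch estimates add up across the image.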
The University of Central Florida invention is a real-time online system and method that can detect multiple activities occurring in long, untrimmed security videos. The invention uses a deep learning approach to process videos in an online fashion at a clip level—drastically reducing the computation time in detecting activities. The ability of the method to process one video clip at a time in an online fashion makes it robust against varying length activities. The methodology was tested on the VIRAT and MEVA (Multiview Extended Video with Activities) datasets with more than 250 hours of videos and demonstrated effective performance in terms of processing speed as well as activity detection. The invention can process high-resolution security videos at 100 frames per second.
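A minimal sketch of such clip-level online processing, under loud assumptions: per-frame activity scores and an averaged clip score stand in for the frames and the deep clip classifier, and merging contiguous positive clips stands in for the invention's handling of varying-length activities. None of this is the patented method's actual code.

```python
def detect_online(frame_scores, clip_len=16, threshold=0.5):
    # Toy online, clip-level detector. Frames are buffered into
    # fixed-length clips; each clip is scored as soon as it is full, so
    # the video never has to be held in memory or trimmed in advance.
    # Contiguous positive clips are merged into (start, end) frame
    # intervals, which is how arbitrarily long activities are covered
    # one clip at a time.
    intervals, buffer, start = [], [], None
    for i, s in enumerate(frame_scores):
        buffer.append(s)
        if len(buffer) < clip_len:
            continue
        clip_score = sum(buffer) / clip_len  # stand-in for a CNN clip score
        clip_start = i - clip_len + 1
        if clip_score >= threshold:
            if start is None:
                start = clip_start           # open a new activity interval
        elif start is not None:
            intervals.append((start, clip_start - 1))
            start = None
        buffer.clear()
    if start is not None:
        intervals.append((start, i))         # close an interval at video end
    return intervals
```

Because each clip is processed and discarded immediately, the loop's cost per frame is constant regardless of video length, which is the property that makes clip-level online processing suited to long, untrimmed security footage.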
The University of Central Florida invention is a capsule network approach to enhance Visual Question Answering (VQA) processes. The invention is a method that applies reasoning within a visual scene to achieve more accurate object, action, or relation recognition. Most approaches rely on input feature maps from object detection models pretrained with the relevant object classes, which makes it necessary to restrict the scope to known object classes or to annotate the regions of relevant objects. Such approaches also require pretraining an object detector, thus limiting the extension of such methods to datasets with object-level annotations. This work instead focuses on weakly supervised visual grounding based on VQA supervision alone.
Stage of Development
Prototype available.
Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules, arXiv:2105.04836. IEEE Conference on Computer Vision and Pattern Recognition, 2021.
The University of Central Florida invention is a privacy-preserving action recognition system. The novel training framework removes private information from input video in a self-supervised manner without requiring privacy labels. Leakage of visual private information is an emerging key issue for fast-growing video understanding applications such as activity recognition. Existing approaches for mitigating privacy leakage in action recognition require privacy labels along with the action labels from the video dataset; however, annotating the frames of video datasets with privacy labels is not feasible. Recent developments in self-supervised learning (SSL) have unleashed the untapped potential of unlabeled data.
Technical Details
The UCF training framework consists of three main components: an anonymization function, a self-supervised privacy removal branch, and an action recognition branch. Researchers trained the framework with a minimax optimization strategy that minimizes the action recognition cost function and maximizes the privacy cost function through a contrastive self-supervised loss. Under existing evaluation protocols with known action and privacy attributes, the framework achieves an action-privacy trade-off competitive with current state-of-the-art supervised methods. In addition, the invention introduces a new protocol to evaluate how well the anonymization function generalizes to novel action and privacy attributes; under this protocol, the self-supervised framework outperforms existing supervised methods. Code is available at https://github.com/DAVEISHAN/SPAct.
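The minimax training can be sketched at a purely structural level. Every element of this toy is a hypothetical stand-in: scalar "action" and "private" components replace video, linear weights replace the anonymization network and branches, and a squared-error recovery loss replaces the contrastive self-supervised loss. The sketch only shows the opposing update directions of the three components, not the full anonymization result.

```python
import numpy as np

u, p = 1.0, 1.0            # "action" and "private" components of one input
lam, lr = 0.5, 0.05        # privacy-loss weight and step size (arbitrary)
w = np.array([0.5, 1.0])   # anonymizer weights for the action / private parts
v = 0.0                    # privacy-branch weight (tries to recover p)

for _ in range(100):
    # Privacy branch: gradient DESCENT on its own recovery loss
    # (v * w[1] * p - p)^2, standing in for the contrastive SSL loss.
    v -= lr * 2 * (v * w[1] * p - p) * w[1] * p
    v = float(np.clip(v, -3.0, 3.0))
    # Anonymizer: DESCEND the action loss (w[0]*u - u)^2 so the action
    # branch still works, but ASCEND the privacy loss (the minimax step).
    w[0] -= lr * 2 * (w[0] * u - u) * u
    w[1] += lr * lam * 2 * (v * w[1] * p - p) * v * p
    w[1] = float(np.clip(w[1], -3.0, 3.0))

# The action signal is preserved (w[0] converges toward 1) while the
# anonymizer and privacy branch pull w[1] in opposite directions; the
# clipping is only a safeguard for this scalar toy.
```

The design point is the sign split: both branches descend their own losses, while the anonymization function descends the action loss and ascends the privacy loss, so it learns transformations the privacy branch cannot exploit.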
Partnering Opportunity
The research team is seeking partners for licensing, research collaboration, or both.
Stage of Development
Prototype available.
SPAct: Self-supervised Privacy Preservation for Action Recognition, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 18-24, 2022, New Orleans, LA, USA, 2022, pp. 20132-20141. DOI: 10.1109/CVPR52688.2022.01953.
The University of Central Florida TransGeo invention is a pure transformer-based approach that takes full advantage of the strengths of transformers related to global information modeling and explicit position information encoding. The dominant Convolutional Neural Network (CNN)-based methods for cross-view image geo-localization rely on polar transform and fail to model global correlation.
In contrast, the UCF TransGeo approach addresses these limitations from a different perspective. The invention leverages the flexibility of transformer input and offers an attention-guided, non-uniform cropping method so that uninformative image patches are removed with a negligible drop in performance to reduce computation cost. The saved computation can be reallocated to increase resolution only for informative patches, resulting in performance improvement with no additional computation cost. This "attend and zoom-in" strategy is highly similar to human behavior when observing images. Remarkably, TransGeo achieves state-of-the-art results on both urban and rural datasets, with significantly less computation cost than CNN-based methods. It does not rely on polar transform and infers faster than CNN-based methods. Code is available at https://github.com/Jeff-Zilence/TransGeo2022.
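The attention-guided, non-uniform cropping idea can be sketched as follows, assuming per-patch attention scores from the transformer are already available; the keep-ratio selection rule here is an illustrative assumption, not the paper's exact procedure.

```python
import numpy as np

def nonuniform_crop(patch_attention, keep_ratio=0.64):
    # Drop the lowest-attention patches ("attend"); the compute they would
    # have consumed can then be reallocated to re-encode the surviving
    # patches at higher resolution ("zoom-in"), keeping total cost flat.
    # Returns indices of the kept patches in their original order.
    n_keep = max(1, int(round(len(patch_attention) * keep_ratio)))
    order = np.argsort(patch_attention)[::-1]   # descending attention
    return np.sort(order[:n_keep])
```

Because transformers accept an arbitrary set of input tokens, removing patches requires no architectural change, which is the flexibility the invention exploits over CNNs with fixed spatial grids.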
Partnering Opportunity
The research team is seeking partners for licensing, research collaboration, or both.
Stage of Development
Prototype available.
TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1152-1161. DOI: 10.1109/CVPR52688.2022.00123.
Year | 2020
Link Address | https://youtu.be/H4dIcMyKijA
Duration | 00:30:11