Tutorials and Presentations by Suman Saha
In this work we propose a new approach
to the spatiotemporal localisation (detection)
and classification of multiple concurrent actions
within temporally untrimmed videos. Our
framework is composed of three stages. In stage
1, a cascade of deep region proposal and detection
networks is employed to classify regions
of each video frame potentially containing an
action of interest. In stage 2, appearance and
motion cues are combined by merging the detection
boxes and softmax classification scores
generated by the two cascades. In stage 3, sequences
of detection boxes most likely to be associated
with a single action instance, called action
tubes, are constructed by solving two optimisation
problems via dynamic programming.
Papers by Suman Saha

Proceedings of the International Conference on Computer Vision (ICCV), 2017
We present a deep-learning framework for real-time multiple spatio-temporal (S/T) action localisation and classification. Current state-of-the-art approaches work offline and are too slow to be useful in real-world settings. To overcome their limitations, we introduce two major developments. Firstly, we adopt real-time SSD (Single Shot MultiBox Detector) CNNs to regress and classify detection boxes in each video frame potentially containing an action of interest. Secondly, we design an original and efficient online algorithm to incrementally construct and label 'action tubes' from the SSD frame-level detections. As a result, our system is not only capable of performing S/T detection in real time, but can also perform early action prediction in an online fashion. We achieve new state-of-the-art results in both S/T action localisation and early action prediction on the challenging UCF101-24 and J-HMDB-21 benchmarks, even when compared to the top offline competitors. To the best of our knowledge, ours is the first real-time (up to 40 fps) system able to perform online S/T action localisation on the untrimmed videos of UCF101-24.
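
The incremental construction of action tubes described in this abstract can be illustrated with a simplified greedy sketch. It is not the authors' algorithm: the `Tube` class, the `update_tubes` function, and the `min_iou` and `max_missed` parameters are hypothetical names and values chosen for illustration, and the sketch assumes single-class detections given as `(box, score)` pairs with boxes in `(x1, y1, x2, y2)` form.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class Tube:
    """Hypothetical container for one action tube (illustrative only)."""

    def __init__(self, start_frame, box, score):
        self.start_frame = start_frame
        self.boxes = [box]
        self.scores = [score]
        self.missed = 0  # consecutive frames without a matching detection

    @property
    def mean_score(self):
        return sum(self.scores) / len(self.scores)


def update_tubes(tubes, frame_idx, detections, min_iou=0.3, max_missed=5):
    """Extend tubes with the detections of one new frame (greedy matching).

    detections: list of (box, score) pairs for a single action class.
    min_iou and max_missed are assumed values, not taken from the paper.
    """
    unused = list(range(len(detections)))
    # Higher-scoring tubes get first pick of the new detections.
    for tube in sorted(tubes, key=lambda t: t.mean_score, reverse=True):
        if tube.missed > max_missed:
            continue  # tube is considered terminated
        candidates = [i for i in unused
                      if iou(tube.boxes[-1], detections[i][0]) >= min_iou]
        if candidates:
            k = max(candidates, key=lambda i: detections[i][1])
            tube.boxes.append(detections[k][0])
            tube.scores.append(detections[k][1])
            tube.missed = 0
            unused.remove(k)
        else:
            tube.missed += 1
    # Any detection left unmatched starts a new tube.
    for i in unused:
        tubes.append(Tube(frame_idx, *detections[i]))
    return tubes


frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((1, 1, 11, 11), 0.7), ((51, 51, 61, 61), 0.6)],
    [((2, 2, 12, 12), 0.8)],
]
tubes = []
for t, dets in enumerate(frames):
    update_tubes(tubes, t, dets)
```

Because each call only looks at the current frame and the last box of every tube, the tubes are available after every frame, which is what makes early prediction on a partially observed video possible in this kind of scheme.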

In this work we propose a new approach to the spatiotemporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. Our framework is composed of three stages. In stage 1, a cascade of deep region proposal and detection networks is employed to classify regions of each video frame potentially containing an action of interest. In stage 2, appearance and motion cues are combined by merging the detection boxes and softmax classification scores generated by the two cascades. In stage 3, sequences of detection boxes most likely to be associated with a single action instance, called action tubes, are constructed by solving two optimisation problems via dynamic programming. In the first pass, action paths spanning the whole video are built by linking detection boxes over time using their class-specific scores and their spatial overlap; in the second pass, temporal trimming is performed by ensuring label consistency for all constituting detection boxes. We demonstrate the performance of our algorithm on the challenging UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results across the board and significantly lower detection latency at test time.
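
The first dynamic-programming pass mentioned in this abstract (linking boxes over time by class-specific score and spatial overlap) can be sketched as a Viterbi-style search. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function name `link_action_path`, the `overlap_weight` parameter, and the additive energy (sum of scores plus weighted IoU between consecutive boxes) are assumptions, every frame is assumed to contain at least one box, and the second pass (temporal trimming) is not covered.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_action_path(boxes, scores, overlap_weight=1.0):
    """Pick one box per frame maximising sum(scores) + overlap_weight * sum(IoU).

    boxes[t]  : list of boxes detected in frame t (assumed non-empty)
    scores[t] : class-specific score of each box in frame t
    Returns the index of the chosen box for every frame.
    """
    best = list(scores[0])  # best energy of any path ending at each box
    back = []               # back-pointers, one list per frame transition
    for t in range(1, len(boxes)):
        new_best, ptrs = [], []
        for j, cur_box in enumerate(boxes[t]):
            cands = [best[i] + overlap_weight * iou(prev_box, cur_box)
                     for i, prev_box in enumerate(boxes[t - 1])]
            i_star = max(range(len(cands)), key=cands.__getitem__)
            ptrs.append(i_star)
            new_best.append(scores[t][j] + cands[i_star])
        best = new_best
        back.append(ptrs)
    # Backtrack from the best final box to recover the whole path.
    path = [max(range(len(best)), key=best.__getitem__)]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]


boxes = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 11, 11), (50, 50, 60, 60)],
    [(2, 2, 12, 12), (51, 51, 61, 61)],
]
scores = [[0.9, 0.8], [0.6, 0.7], [0.9, 0.5]]
smooth_path = link_action_path(boxes, scores)
greedy_path = link_action_path(boxes, scores, overlap_weight=0.0)
```

With the overlap term switched off (`overlap_weight=0.0`) the search reduces to picking the top-scoring box in each frame independently, so the chosen path jumps between the two spatial locations; with the overlap term enabled it stays on the spatially consistent sequence of boxes.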
Drafts by Suman Saha

Current state-of-the-art action detection systems are tailored for offline batch-processing applications. However, for online applications like human-robot interaction, current systems fall short, either because they only detect one action per video, or because they assume that the entire video is available ahead of time. In this work, we introduce a real-time and online joint-labelling and association algorithm for action detection that can incrementally construct space-time action tubes on the most challenging action videos in which different action categories occur concurrently. In contrast to previous methods, we solve the detection-window association and action labelling problems jointly in a single pass. We demonstrate superior online association accuracy and speed (2.2 ms per frame) compared to the current state-of-the-art offline systems. We further demonstrate that the entire action detection pipeline can easily be made to work effectively in real time using our action tube construction algorithm.