Papers by Konrad Schindler

2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015
Tracking-by-detection has proven to be the most successful strategy to address the task of tracking multiple targets in unconstrained scenarios [e.g. 40, 53, 55]. Traditionally, a set of sparse detections, generated in a preprocessing step, serves as input to a high-level tracker whose goal is to correctly associate these "dots" over time. An obvious shortcoming of this approach is that most information available in image sequences is simply ignored by thresholding weak detection responses and applying non-maximum suppression. We propose a multi-target tracker that exploits low-level image information and associates every (super)-pixel to a specific target or classifies it as background. As a result, we obtain a video segmentation in addition to the classical bounding-box representation in unconstrained, real-world videos. Our method shows encouraging results on many standard benchmark sequences and significantly outperforms state-of-the-art tracking-by-detection approaches in crowded scenes with long-term partial occlusions.
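As a rough illustration of the preprocessing step that the abstract above criticizes, the following sketch reduces a set of scored detections to a sparse set by score thresholding followed by greedy non-maximum suppression. It is not the authors' code: the function names, the box format (x1, y1, x2, y2) and both threshold values are assumptions chosen for the example.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sparse_detections(
    boxes: List[Box],
    scores: List[float],
    score_thresh: float = 0.5,  # hypothetical value
    iou_thresh: float = 0.5,    # hypothetical value
) -> List[int]:
    """Return indices of the detections that survive both steps."""
    # Step 1: discard weak detection responses.
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    # Step 2: greedy non-maximum suppression, strongest detection first.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in candidates:
        # Keep a box only if it does not overlap strongly with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept
```

Everything that is thresholded or suppressed here never reaches the tracker, which is the loss of information the abstract refers to.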
Revisiting 3D geometric models for accurate object shape and pose
2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011
... M. Zeeshan Zia, Michael Stark, Bernt Schiele, and Konrad Schindler ... These approaches, however, either use purely qualitative geometry descriptions [20, 15] or resort to geometric representations of rather coarse granularity [9, 41], where reasoning is performed at most ...
Raw results and evaluation script

In the recent past, the computer vision community has developed centralized benchmarks for the performance evaluation of a variety of tasks, including generic object and pedestrian detection, 3D reconstruction, optical flow, single-object short-term tracking, and stereo estimation. Despite potential pitfalls of such benchmarks, they have proved to be extremely helpful to advance the state of the art in the respective area. Interestingly, there has been rather limited work on the standardization of quantitative benchmarks for multiple target tracking. One of the few exceptions is the well-known PETS dataset [20], targeted primarily at surveillance applications. Despite being widely used, it is often applied inconsistently, for example by using different subsets of the available data, different ways of training the models, or differing evaluation scripts. This paper describes our work toward a novel multiple object tracking benchmark aimed at addressing such issues. We discuss the...
Face Recognition from Video by Matching Image Sets
Digital Image Computing: Techniques and Applications (DICTA'05), 2005
Tat-Jun Chin, James U, Konrad Schindler, David Suter. Institute of Vision Systems Engineering, Monash University, Victoria, Australia. {tat.chin | james.u | konrad.schindler | d.suter}@eng.monash.edu.au Abstract ...

2009 IEEE 12th International Conference on Computer Vision, 2009
We present a method for tracking a hand while it is interacting with an object. This setting is arguably the one where hand-tracking has most practical relevance, but poses significant additional challenges: strong occlusions by the object as well as self-occlusions are the norm, and classical anatomical constraints need to be softened due to the external forces between hand and object. To achieve robustness to partial occlusions, we use an individual local tracker for each segment of the articulated structure. The segments are connected in a pairwise Markov random field, which enforces the anatomical hand structure through soft constraints on the joints between adjacent segments. The most likely hand configuration is found with belief propagation. Both range and color data are used as input. Experiments are presented for synthetic data with ground truth and for real data of people manipulating objects.

2007 IEEE 11th International Conference on Computer Vision, 2007
When trying to extract 3D scene information and camera motion from an image sequence alone, it is often necessary to cope with independently moving objects. Recent research has unveiled some of the mathematical foundations of the problem, but a general and practical algorithm, which can handle long, realistic sequences, is still missing. In this paper, we identify the necessary parts of such an algorithm, highlight both unexplored theoretical issues and practical challenges, and propose solutions. Theoretical issues include proper handling of different situations, in which the number of independent motions changes: objects can enter the scene, objects previously moving together can split and follow independent trajectories, or independently moving objects can merge into one common motion. We derive model scoring criteria to handle these changes in the number of segments. A further theoretical issue is the resolution of the relative scale ambiguity between such changes. Practical issues include robust 3D reconstruction of freely moving foreground objects, which often have few and short feature tracks. The proposed framework simultaneously tracks features, groups them into rigidly moving segments, and reconstructs all segments in 3D. Such an online approach, as opposed to batch processing techniques, which first track features, and then perform segmentation and reconstruction, is vital in order to handle small foreground objects.
Nowadays, different sensors and processing techniques provide Digital Elevation Models (DEMs) for the same site, which differ significantly with regard to their geometric characteristics and accuracy. Each DEM contains intrinsic errors due to the primary data acquisition technology, the processing chain, and the characteristics of the terrain. DEM fusion aims at overcoming the limitations of different DEMs by merging them in an intelligent way. In this paper we present a generic algorithmic approach for fusing two arbitrary DEMs, using the framework of sparse representations. We conduct extensive experiments with real DEMs from different earth observation satellites to validate the proposed approach. Our evaluation shows that, together with adequately chosen fusion weights, the proposed algorithm yields consistently better DEMs.

2007 IEEE 11th International Conference on Computer Vision, 2007
We present a novel approach for multi-object tracking which considers object detection and space-time trajectory estimation as a coupled optimization problem. It is formulated in a hypothesis selection framework and builds upon a state-of-the-art pedestrian detector. At each time instant, it searches for the globally optimal set of space-time trajectories which provides the best explanation for the current image and for all evidence collected so far, while satisfying the constraints that no two objects may occupy the same physical space, nor explain the same image pixels at any point in time. Successful trajectory hypotheses are fed back to guide object detection in future frames. The optimization procedure is kept efficient through incremental computation and conservative hypothesis pruning. The resulting approach can initialize automatically and track a large and varying number of persons over long periods and through complex scenes with clutter, occlusions, and large-scale background changes. Also, the global optimization framework allows our system to recover from mismatches and temporarily lost tracks. We demonstrate the feasibility of the proposed approach on several challenging video sequences.
Pattern Recognition, 2008
We present a method for object class detection in images based on global shape. A distance measure for elastic shape matching is derived, which is invariant to scale and rotation, and robust against nonparametric deformations. Starting from an over-segmentation of the image, the space of potential object boundaries is explored to find boundaries, which have high similarity with the shape template of the object class to be detected. An extensive experimental evaluation is presented. The approach achieves a remarkable detection rate of 83-91% at 0.2 false positives per image on three challenging data sets.
Journal of Vision, 2011
Given the presence of massive feedback loops in brain networks, it is difficult to disentangle the contributions of feedforward and feedback processing to the recognition of visual stimuli, in this case, of emotional body expressions. The aim of the present work is to shed light on how well feedforward processing explains rapid processing of this important class of stimuli.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000
Multibody structure from motion (SfM) is the extension of classical SfM to dynamic scenes with multiple rigidly moving objects. Recent research has unveiled some of the mathematical foundations of the problem, but a practical algorithm, which can handle realistic sequences, is still missing. In this paper, we discuss the requirements for such an algorithm, highlight theoretical issues and practical problems, and describe how a static structure-from-motion framework needs to be extended to handle real dynamic scenes. Theoretical issues include different situations, in which the number of independently moving scene objects changes: moving objects can enter or leave the field of view, merge into the static background (e.g. when a car is parked), or split off the background and start moving independently. Practical issues arise due to small freely moving foreground objects with few and short feature tracks. We argue that all these difficulties need to be handled online, as structure-from-motion estimation progresses, and present an exemplary solution using the framework of probabilistic model-scoring.

Detailed 3D Representations for Object Recognition and Modeling
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000
Geometric 3D reasoning at the level of objects has received renewed attention recently, in the context of visual scene understanding. The level of geometric detail, however, is typically limited to qualitative representations or coarse boxes. This is linked to the fact that today’s object class detectors are tuned towards robust 2D matching rather than accurate 3D geometry, encouraged by bounding-box based benchmarks such as Pascal VOC. In this paper, we revisit ideas from the early days of computer vision, namely, detailed, 3D geometric object class representations for recognition. These representations can recover geometrically far more accurate object hypotheses than just bounding boxes, including continuous estimates of object pose, and 3D wireframes with relative 3D positions of object parts. In combination with robust techniques for shape description and inference, we outperform state-of-the-art results in monocular 3D pose estimation. In a series of experiments, we analyze our approach in detail, and demonstrate novel applications enabled by such an object class representation, such as fine-grained categorization of cars and bicycles according to their 3D geometry, and ultra-wide baseline matching.
Providing Multimedia Tools for Recording, Reconstruction, Visualisation and Database Storage/Access of Archaeological Excavations
Abstract: Over the years archaeologists have been swift to embrace new advances in technology that allow them to more comprehensively document the results of their work. Today it is commonplace to find information technologies, in the form of MS Office-type tools with some CAD and GIS, deployed for primary data capture, analysis, presentation and publication. While these computing technologies can be used effectively to record and interpret archaeological sites, the radical developments in 3D recording, reconstruction ...

Are Cars Just 3D Boxes? Jointly Estimating the 3D Shape of Multiple Objects
2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014
Current systems for scene understanding typically represent objects as 2D or 3D bounding boxes. While these representations have proven robust in a variety of applications, they provide only coarse approximations to the true 2D and 3D extent of objects. As a result, object-object interactions, such as occlusions or ground-plane contact, can be represented only superficially. In this paper, we approach the problem of scene understanding from the perspective of 3D shape modeling, and design a 3D scene representation that reasons jointly about the 3D shape of multiple objects. This representation allows one to express 3D geometry and occlusion on the fine detail level of individual vertices of 3D wireframe models, and makes it possible to treat dependencies between objects, such as occlusion reasoning, in a deterministic way. In our experiments, we demonstrate the benefit of jointly estimating the 3D shape of multiple objects in a scene over working with coarse boxes, on the recently proposed KITTI dataset of realistic street scenes.

Towards Scene Understanding with Detailed 3D Object Representations
International Journal of Computer Vision, 2014
Current approaches to semantic image and scene understanding typically employ rather simple object representations such as 2D or 3D bounding boxes. While such coarse models are robust and allow for reliable object detection, they discard much of the information about objects' 3D shape and pose, and thus do not lend themselves well to higher-level reasoning. Here, we propose to base scene understanding on a high-resolution object representation. An object class - in our case cars - is modeled as a deformable 3D wireframe, which enables fine-grained modeling at the level of individual vertices and faces. We augment that model to explicitly include vertex-level occlusion, and embed all instances in a common coordinate frame, in order to infer and exploit object-object interactions. Specifically, from a single view we jointly estimate the shapes and poses of multiple objects in a common 3D frame. A ground plane in that frame is estimated by consensus among different objects, which significantly stabilizes monocular 3D pose estimation. The fine-grained model, in conjunction with the explicit 3D scene model, further allows one to infer part-level occlusions between the modeled objects, as well as occlusions by other, unmodeled scene elements. To demonstrate the benefits of such detailed object class models in the context of scene understanding we systematically evaluate our approach on the challenging KITTI street scene dataset. The experiments show that the model's ability to utilize image evidence at the level of individual parts improves monocular 3D pose estimation w.r.t. both location and (continuous) viewpoint.