
Hierarchical Implicit Shape Modeling

2014, Journal of Visual Communication and Image Representation

https://doi.org/10.1016/J.JVCIR.2013.12.020

Abstract

In this paper, a new hierarchical approach for object detection is proposed. Object detection methods based on the Implicit Shape Model (ISM) efficiently handle deformable objects, occlusion, and clutter. The structure of each object in ISM is defined by a spring-like graph. We introduce hierarchical ISM, in which the structure of each object is defined by a hierarchical star graph. Hierarchical ISM has two layers. In the first layer, a set of local ISMs is used to model object parts. In the second layer, the structure of parts with respect to the object center is modeled by a global ISM. In the proposed approach, the parts obtained for each object category have high discriminative ability; therefore, our approach does not require a verification stage. We applied the proposed approach to several datasets and compared its performance with that of comparable methods. The results show that our method has superior performance.

J. Vis. Commun. Image R. xxx (2014) xxx–xxx

Parvin Razzaghi a,*, Shadrokh Samavi a,b

a Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran
b Department of Electrical and Computer Engineering, McMaster University, Hamilton, Canada

* Corresponding author. Fax: +98 (0) 311 3912718. E-mail addresses: [email protected], [email protected] (P. Razzaghi).

Article history: Received 20 March 2013. Accepted 29 December 2013. Available online xxxx.

Keywords: Object recognition; Statistical part-based object recognition; Implicit shape model; Hierarchical Implicit Shape Model; Hierarchical star graph; Discriminative parts; Parts filter; Histogram of gradients

© 2014 Elsevier Inc. All rights reserved. 1047-3203/$ - see front matter. http://dx.doi.org/10.1016/j.jvcir.2013.12.020

Please cite this article in press as: P. Razzaghi, S. Samavi, Hierarchical Implicit Shape Modeling, J. Vis. Commun. (2014), http://dx.doi.org/10.1016/j.jvcir.2013.12.020

1. Introduction

Object recognition is one of the most important research areas in computer vision and has received much interest in recent decades. Many computer vision systems, such as scene understanding, video surveillance, and human-robot interaction, can benefit from an accurate object recognition stage. Within-class changes in object appearance, viewpoint, scale, and illumination pose major challenges to the object recognition task.

Many approaches have been introduced in the field of object recognition. Some of them are part-based models, in which it is assumed that each object consists of parts placed in a particular structure. In statistical part-based modeling, the structure of parts is represented by a graph. Statistical part-based modeling is divided into three categories based on the type of the structure [1].

The first category is the Bag of visual Words (BoW) model, in which all parts are assumed to be independent. Hence, learning and inference procedures based on BoW are easy to implement. In these approaches, interest points are first extracted and described. These features are then quantized into visual words using a clustering algorithm. Finally, each object is represented by a histogram of visual words [2]. The BoW models are extended to topic modeling [3]. In topic modeling, each object is modeled as a mixture of topics. Each topic is a probability distribution over visual words that frequently occur together. Hence, the final representation of a particular object is composed of the mixture of the histograms corresponding to each topic. Two important algorithms based on topic modeling are probabilistic Latent Semantic Analysis (pLSA) [4] and Latent Dirichlet Allocation (LDA) [5]. In LDA, the topic distribution has a Dirichlet prior, whereas in pLSA the topic distribution is uniform. Note that topic modeling is based on BoW; therefore, the spatial structure among visual words is ignored. Sivic et al. [6] introduced an approach based on pLSA for discovering objects and their locations in an image. They introduced 'doublets', which try to capture spatial structural information between visual words. Doublets are pairs of neighboring visual words that have a high occurrence probability in each topic. Niu et al. [7] also proposed a part-based object recognition method based on supervised LDA in which the spatial information of parts is also considered. It has been shown that BoW models perform poorly on localization [8].

The second category is the constellation model. In contrast to the BoW model, constellation models make no independence assumption between parts, and hence exact inference is intractable. To simplify the learning and inference procedures, the constellation model uses a full multivariate Gaussian distribution to model the spatial distribution of parts [9].

The third category is pictorial structures, in which the structure between parts is modeled by a tree. In [10], efficient algorithms for learning these models and matching the model to the image are presented. Methods belonging to this category are divided into three sub-categories based on the method of determining parts. In the first sub-category, parts are determined by a human subject, which is not useful for most applications [10]. In the second sub-category, parts are determined by low-level interest point detectors. Leibe et al. [11] represented an object as a set of low-level features that have a particular structure with respect to the center of the object. Their model is called the Implicit Shape Model (ISM). In [11], low-level features, such as Harris corner points [12], are first extracted. In the training step, the Generalized Hough Transform (GHT) is then used to learn the spatial occurrence distribution of the low-level features relative to the object centroid. In the inference step, each feature votes for a point according to the learned spatial occurrence distribution. The votes are then gathered, and points whose strengths are greater than a predetermined threshold are selected as objects. Ferrari et al. [13] introduce an approach for object detection that uses contour features instead of appearance features. First, they learn a prototypical shape of an object class. Then, the learned shape model is matched to the object boundaries using a Hough-style voting scheme [11,14,15]. The primary weakness of ISM is that each part votes independently, whereas there are mutual dependencies between parts. This problem has been studied in [16], where, in the test phase, groups of features jointly vote for the center of the object. In [16], the grouping, voting, and correspondence problems are considered jointly and optimized iteratively by a single objective function. A limitation of [16] is that the grouping of features is done in the test phase.

In most cases, the features used in ISM do not carry enough discriminative information for object classes. Hence, they often match the background and create false positives. There are two solutions to this difficulty. Some approaches use a verification stage. Leibe et al. [11] use an MDL formulation as a verification stage. Yarlagadda et al. [16] apply an SVM with a pyramid match kernel to verify voting hypotheses. Other approaches create visual features that have enough discriminative power. Opelt et al. [17] use a Boundary Fragment Model (BFM). They assign a score to each boundary fragment and retain some of them as candidates. Then, to attain discriminative features, they combine candidate boundary fragments. To do this, boundary fragments should fit well on the positive training samples, and their centroid estimates should concur and agree with the true object centroid. Note that, to keep the search tractable, the number of codebook entries should be restricted.

In the third sub-category, parts are treated as latent variables and are obtained during the learning procedure. These parts have more discriminative ability than the low-level features. Felzenszwalb et al. [18] present an approach in which each deformable object is represented by a root filter and a set of part filters. In their model, the structure between parts is represented by a spring-like graph in which there is a connection between each part and the root filter. Also, for each part, there is a deformation vector that models the deformation of the part with respect to the root filter. In [18], parts are considered latent variables and are obtained during the learning procedure. To obtain these latent variables (parts), latent SVM is introduced. Their approach needs an initialization procedure to obtain initial part filters, and it needs the number of parts to be specified a priori. The structure between parts is also limited to a spring-like graph. Zhu et al. [19] extend [18] to a hierarchical two-layer model. Mottaghi [20] integrates the detection result of [18] with Histogram of Oriented Gradient (HOG) bundles [21] to capture large deformations of objects. Crandall [1] introduces an approach in which parts are considered latent variables; however, the structure between parts is modeled by a k-fan graph. Similar to [18], Crandall's approach needs an initialization procedure to obtain the initial appearance of parts. Also, the number of parts should be specified a priori. A disadvantage of [1] is that there is no guarantee that the model (parts and their relative positions) gives a complete description of objects. This is because the characteristics extracted from the appearance and the structure between the parts are interrelated and are produced concurrently. Hence, shortcomings in modeling one affect the other.

In this paper, we propose a new part-based modeling approach for object detection. In our approach, for each object category, a set of parts is extracted to comprehensively represent the object and to have strong discriminative ability. The spatial structure of the parts of objects is modeled by ISM. In training images, only the bounding boxes of objects are provided and parts are unknown. We propose hierarchical ISM to extract a set of discriminative parts for each object category during the training phase. The proposed hierarchical ISM uses a hierarchical star graph to model the structure of each object. In this paper, hierarchical ISM has two layers. In the first layer, each part is modeled by an ISM in which the structure of each low-level feature is considered with respect to the part center. The ISMs used to model parts are called local ISMs. In the second layer, the structure of parts with respect to the object center is modeled by a global ISM. Using hierarchical ISM, parts for training images are extracted. To model the visual appearance of each part, a filter on HOG features is learned. In the test phase, each part filter is correlated with the test image and candidate positions of all parts are identified. Then, these candidate positions vote on the location of the object center using the global ISM. Each candidate object center is investigated to determine which parts have voted for that position. If an object center position receives votes from all parts, it is considered a predicted object center. Note that, due to the high discriminative ability of parts, the proposed approach does not require a verification stage.

Our approach differs from others in several important respects: (1) it introduces hierarchical ISM to extract a set of discriminative parts for each object category in the training phase; (2) it does not require a verification stage, due to the high discriminative ability of parts; (3) it requires that all parts vote for a candidate object center location for that candidate to be selected as the object center. This reduces the number of false positive cases. Such a requirement is not present in other ISM-based methods.

The rest of the paper is organized as follows. The training stage and inference stage of the proposed approach are explained in Sections 2 and 3, respectively. Section 4 shows the results of applying our proposed approach to some known object recognition datasets. Our conclusion is given in Section 5.

2. Proposed approach: training stage

In this section, hierarchical ISM is proposed to extract a set of parts and learn their spatial structure for each object category in the training phase. In training images, only the bounding boxes of objects are provided and parts are unknown. In our approach, a set of initial parts is first explicitly determined for each object category using one training image. Then, an initial model for the object is constructed using the hierarchical ISM. Next, based on the initial hierarchical ISM, the parts corresponding to the initial parts are extracted in the other training images and the hierarchical ISM is updated iteratively. Finally, to model the visual appearance of the parts extracted from the training images, a filter is defined using HOG features. The architecture of the proposed approach in the training phase is shown in Fig. 1. Briefly, in the training phase of the proposed approach, the following steps are done:

• Determine a set of initial parts for a given training image I.
• Define an initial hierarchical ISM for the object using the obtained initial parts.
• Extract the corresponding parts in the other training images using the learned hierarchical ISM, and then update it.
• Define a filter on HOG features for each part.

Extraction of the initial parts and definition of the initial object model are given in Sections 2.1 and 2.2, respectively. In Section 2.3, the learned hierarchical ISM is used to extract the corresponding parts in the other training images. Updating the object model is described in Section 2.4. The visual appearance of each part is learned in Section 2.5.

Fig. 1. The architecture of the proposed approach in the training phase. Let T denote the number of training images.

2.1. Determining initial parts

Given an image I, low-level features are first extracted and described. There are many ways to extract and describe these features. In this paper, SIFT [22] is used to detect and describe features; however, any detector-descriptor pair can be used.

Each object in the training samples is already labeled by a bounding box around it. To obtain the initial parts, the low-level features in image I that lie inside the bounding box are grouped. Features located in the same group should not be very sparse with respect to the object centroid.

The similarity matrix between features is constructed based on their appearance similarity and their locations. To cluster these local features, the Linkage algorithm is used.
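As an illustration of the grouping step, the following minimal sketch builds a pairwise distance from descriptor and location differences and merges features by average linkage until a given number of groups remains. It is not the authors' implementation: the blend weight `alpha`, the choice of average linkage, and the function names `combined_distance` and `group_features` are assumptions made for this example, and short tuples stand in for SIFT descriptors.

```python
import math


def combined_distance(desc_a, loc_a, desc_b, loc_b, alpha=0.5):
    """Blend appearance and location distance (alpha is a hypothetical weight)."""
    d_app = math.dist(desc_a, desc_b)   # Euclidean distance between descriptors
    d_loc = math.dist(loc_a, loc_b)     # Euclidean distance between image locations
    return alpha * d_app + (1.0 - alpha) * d_loc


def group_features(descriptors, locations, n_parts, alpha=0.5):
    """Average-linkage agglomerative grouping of features into n_parts groups.

    Returns a list of groups, each a sorted list of feature indices.
    The number of groups must be given in advance.
    """
    clusters = [[i] for i in range(len(descriptors))]

    def pair(i, j):
        return combined_distance(descriptors[i], locations[i],
                                 descriptors[j], locations[j], alpha)

    def cluster_distance(a, b):
        # Average of all pairwise distances between the two groups.
        return sum(pair(i, j) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_parts:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]   # merge the closest pair
        del clusters[q]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    descs = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 0.9)]
    locs = [(0, 0), (1, 0), (10, 10), (11, 10)]
    print(group_features(descs, locs, n_parts=2))
```

A larger `alpha` favours appearance over position; the paper does not state how the two cues are weighted, so this value would have to be tuned.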
It should be noted that the number of clusters must be specified in advance. The obtained clusters are denoted by $\{P_i\}_{i=1}^{n}$ and are called initial parts, where $n$ indicates the number of parts. Let $\{F_j^i\}_{j=1}^{n_i}$ denote the set of features placed in the ith initial part at locations $\{L_j^i\}_{j=1}^{n_i}$.

These initial parts should be able to provide a complete representation of the object. To satisfy this requirement, the image I should not contain any occlusion. In the proposed approach, the obtained initial parts are meaningful because low-level features are grouped explicitly. Two examples of the results of this step are shown in Fig. 2 for the motorbike and cow objects. As is clear, the results are meaningful.

Fig. 2. The obtained initial parts for "motorbike" and "cow" objects. The number of parts is specified as 7 and 6 for "motorbike" and "cow", respectively.

2.2. Initial model definition

In this step, an initial hierarchical ISM for the object category is defined. Hierarchical ISM is a two-layer model proposed to model the structure of an object in a hierarchical manner. In the first layer, a set of local ISMs is used to model object parts. In each local ISM, the structure of each low-level feature is considered with respect to the part center. In the second layer, the structure of the initial parts with respect to the object center is modeled by another ISM, which is called the global ISM.

To define an ISM, class-specific alphabets, feature descriptors, feature locations, and the object centroid are needed [11]. In the first layer, the local ISM for each initial part is defined using the set $\{\{c_k\}_{k=1}^{m}, \{F_j^i, L_j^i\}_{j=1}^{n_i}, C_i\}$. Local low-level features are extracted from all training images and are clustered into $m$ clusters using the k-means algorithm. Let $c_k$ denote the kth codebook entry of the quantized feature space. The parameter $C_i$ denotes the center of the ith initial part, which is calculated as follows:

$$C_i = \frac{1}{n_i}\sum_{j=1}^{n_i} L_j^i \tag{1}$$

Similarly, in the second layer, the spatial arrangement of the initial parts with respect to the object center is learned using another ISM (the global ISM). This process uses $\{\{P_i, C_i\}_{i=1}^{n}, C_o\}$, in which $C_o$ denotes the object center and $n$ denotes the number of parts.

2.3. Extracting parts in all training images

Up to now, initial parts for the object have been determined using one training image, and an initial hierarchical ISM to model the object has been learned. The main goal in this step is to use the learned hierarchical ISM to guide the extraction of the corresponding parts in the other training images. Suppose $I_t$ is a training image in which a bounding box around the object and its centroid are given. The goal is to obtain the centers of the parts in training image $I_t$. This is formulated in an energy minimization framework. For each location $x$ in image $I_t$ considered as the center of the ith part, the following three criteria should be satisfied:

• The low-level features should vote for the location x as the center of the ith part using the ith learned local ISM.
• Each part should represent a compact region of the object.
• The ith part should lie in the right position relative to the object center based on the learned global ISM.

Hence, the energy function used to extract the parts in training image $I_t$ is defined as follows:

$$\mathrm{Energy} = \sum_{i=1}^{n}\bigl[E_1(p_i, x) + E_2(p_i, x) + E_3(p_i, x)\bigr] \tag{2}$$

In the following, each term in Eq. (2) is explained in detail.

The first term, $E_1(p_i, x)$, considers the cost of location $x$ in image $I_t$ as the ith part's center. To define this term, each feature in image $I_t$ that lies inside the bounding box of the object votes for a possible part center according to the learned local ISM. Let $\{f_j^i\}_{j=1}^{n_i}$ denote the set of features that contribute at location $x$ as the center of the ith part in image $I_t$. The locations of these features are denoted by $\{l_j^i\}_{j=1}^{n_i}$. Let $S_L(p_i, x)$ denote the score of location $x$ as the ith part center of the object. The ith part is symbolically represented by $p_i$. As described in [11], the score of location $x$ as the center of the ith part is obtained as follows:

$$S_L(p_i, x) \propto \sum_{j=1}^{n_i}\sum_{k=1}^{m} p(c_k \mid f_j^i)\, p(x \mid p_i, c_k, l_j^i)\, p(p_i \mid c_k, l_j^i) \tag{3}$$

The first term in Eq. (3) is defined based on a Gibbs-like distribution:

$$p(c_k \mid f_j^i) = \begin{cases} \dfrac{1}{Z}\exp\bigl(-\gamma\, d(c_k, f_j^i)\bigr) & \text{if } d(c_k, f_j^i) \le t \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $Z$ denotes the normalization constant and the parameters $\gamma$ and $t$ are positive constants. The function $d(\cdot,\cdot)$ computes the dissimilarity between the matched codebook entry and the feature; the Euclidean distance is used as the dissimilarity measure. Also, $p(x \mid p_i, c_k, l_j^i)$ is estimated based on the learned spatial occurrence distribution of the ith local ISM. Note that $p(p_i \mid c_k, l_j^i)$ is set to a uniform distribution, so it can be ignored. Finally, the first term in Eq. (2) is defined as follows:

$$E_1(p_i, x) = \bigl(1 - S_L(p_i, x)\bigr)\cdot\lambda\bigl(p_i, BP(x)\bigr) \tag{5}$$

By back-projecting the contributing votes at each location $x$ of the image, the features of the back-projected hypothesis are retrieved; they are denoted by $BP(x)$. Let $\lambda(p_i, BP(x))$ denote a discriminative function that determines the cost of assigning the support features of the back-projected hypothesis to the ith part. Any discriminative function could be used here. In this paper, a non-linear SVM is used.

The second term in Eq. (2) guarantees that each part represents a compact region of the object, that is, that each part has a locality characteristic. Hence, the second term is defined as follows:

$$E_2(p_i, x) = \sum_{f_j^i \in BP(x)}\ \sum_{f_k^i \in BP(x)} \delta\bigl(\lVert l_j^i - l_k^i\rVert^2 - \lambda\bigr) \tag{6}$$

where, as mentioned, $\{f_j^i\}_{j=1}^{n_i}$ denotes the set of features that contribute at location $x$ as the center of the ith part in image $I_t$ (the output of $BP(\cdot)$). The locations of these features are denoted by $\{l_j^i\}_{j=1}^{n_i}$. The function $\delta(\cdot)$ maps to 1 at non-negative points and to 0 otherwise. The parameter $\lambda$ is a positive constant.

The third term, $E_3(p_i, x)$, guarantees that the ith part center $x$ lies in the right position relative to the object centroid. Hence, the distance between the true object center and the center predicted by the ith part is considered, and the third term is formulated as follows:

$$E_3(p_i, x) = \exp\bigl(\zeta\cdot\lVert C_o - C_{po}^i\rVert^2\bigr) \tag{7}$$

where $C_o$ and $C_{po}^i$ denote the true object center and the object center predicted by the ith part, respectively. Also, the parameter $\zeta$ is a positive constant.

By substituting Eqs. (5)–(7) into Eq. (2), the energy function is defined as follows:

$$\mathrm{Energy} = \sum_{i=1}^{n}\Bigl[\bigl(1 - S_L(p_i, C_p^i)\bigr)\cdot\lambda\bigl(p_i, BP(C_p^i)\bigr) + \sum_{f_j^i \in BP(C_p^i)}\ \sum_{f_k^i \in BP(C_p^i)} \delta\bigl(\lVert l_j^i - l_k^i\rVert^2 - \lambda\bigr) + \exp\bigl(\zeta\cdot\lVert C_o - C_{po}^i\rVert^2\bigr)\Bigr] \tag{8}$$

where $C_p^i$ indicates the ith part center. Hence, the corresponding part centers in image $I_t$ are obtained as:

$$\{(C_p^i)^*\}_{i=1}^{n} = \arg\min_{\{C_p^i\}_{i=1}^{n}} \sum_{i=1}^{n}\Bigl[\bigl(1 - S_L(p_i, C_p^i)\bigr)\cdot\lambda\bigl(p_i, BP(C_p^i)\bigr) + \sum_{f_j^i \in BP(C_p^i)}\ \sum_{f_k^i \in BP(C_p^i)} \delta\bigl(\lVert l_j^i - l_k^i\rVert^2 - \lambda\bigr) + \exp\bigl(\zeta\cdot\lVert C_o - C_{po}^i\rVert^2\bigr)\Bigr] \tag{9}$$

where $(C_p^i)^*$ indicates the predicted ith part center in image $I_t$. This energy function is neither convex nor differentiable. For this reason, a simple heuristic is used to optimize Eq. (9): each summand in Eq. (9) (summand over the part index) is optimized independently. In other words, the center of each part is determined independently. To optimize each summand, the Mean Shift mode estimation algorithm is used. To do this, three maps are formed as follows:

$$\mathrm{appearance\_map}(x, p_i) = \bigl(1 - S_L(p_i, x)\bigr)\cdot\lambda\bigl(p_i, BP(x)\bigr)$$

$$\mathrm{locality\_map}(x, p_i) = \sum_{f_j^i \in BP(x)}\ \sum_{f_k^i \in BP(x)} \delta\bigl(\lVert l_j^i - l_k^i\rVert^2 - \lambda\bigr)$$

$$\mathrm{global\_map}(x, p_i) = \exp\bigl(\zeta\cdot\lVert C_o - C_{po}^i\rVert^2\bigr)$$

where $x$ denotes the pixel coordinates in image $I_t$ considered as the part center. In the next step, these three maps are added and the ith final map is defined as follows:

$$\mathrm{final\_map}(x, p_i) = \mathrm{appearance\_map}(x, p_i) + \mathrm{locality\_map}(x, p_i) + \mathrm{global\_map}(x, p_i)$$

where each pixel value in the ith final map denotes the strength of that pixel as the ith part center. Finally, the center of the ith part is taken as the global minimum in the ith final map. To attain the local minimum, the Mean Shift algorithm is used. The above procedure is done for each part. Therefore, the centers of the parts for image $I_t$ are extracted.

The proposed heuristic to optimize Eq. (9) works well because: (1) the search space is small, since only low-level features placed inside the bounding box of the object are considered in forming the energy function; (2) the quantization error of assigning local features to codebook visual words is low, because the codebook is generated from the same training images; and (3) there are few false positive cases in the final map of each part, because significant and involved criteria are considered in the definition of the energy function to detect part centers.

Also, to show the effectiveness of the proposed simple heuristic, Eq. (9) is also optimized by a Genetic Algorithm (GA) in which there is no independence condition. In other words, the part centers are determined jointly. In this case, each chromosome is denoted by $X = \{x_1, x_2, \ldots, x_r\}$, where $r$ denotes the number of low-level features extracted in image $I_t$. The gene values lie between 1 and $n$, where $n$ denotes the number of parts that constitute the object. Hence, each part is denoted by $p_i = \{x_j \mid x_j = i,\ 1 \le j \le r\}$. Each candidate solution is evaluated based on Eq. (2). Experimental results show that the results obtained by GA are comparable to those obtained by the proposed heuristic. However, because of the slow convergence of GA, optimization by the proposed heuristic is selected.

2.4. Model updating

In this step, the parts extracted for image $I_t$ are used to update the learned hierarchical ISM. The global ISM (the second layer of the hierarchical ISM) encodes the structure between parts. The spatial occurrence distribution of the global ISM is updated using the part centers $\{(C_p^i)^*\}_{i=1}^{n}$ and the object center. The set of local ISMs (the first layer of the hierarchical ISM) considers the structure of low-level features in each part. To update the ith local ISM, the low-level features contributing to the ith mid-level part center $(C_p^i)^*$ are considered (see Fig. 1).

Steps 3 and 4 (Sections 2.3 and 2.4, respectively) are repeated for all training images. Therefore, the corresponding parts for each training image are extracted and then the hierarchical ISM of the object is updated. Some of the parts obtained for motorbike and airplane training images are shown in Fig. 3. Algorithm 1 summarizes the training procedure of the proposed approach.

Fig. 3. Extracted parts for training samples of "motorbike" and "airplane" objects.

Algorithm 1. Training procedure.

Input:
  $\{I_t\}_{t=1}^{T}$: training images
  $\{C_o^t\}_{t=1}^{T}$: object centroids
  $n$: number of parts for each object
  $m$: number of codebook entries
Train:
  // Extract interest points (IP) for all images and cluster them
  For t = 1: T
    $IP_t$ = Extract_low_level_features($I_t$);
  End
  $[\{c_k\}_{k=1}^{m}]$ = clustering($\{IP_t\}_{t=1}^{T}$, $m$);
  $[\{P_i, \{F_j^i\}_{j=1}^{n_i}, \{L_j^i\}_{j=1}^{n_i}\}_{i=1}^{n}]$ = Detect_initial_parts($IP_1$);
  // Learn an initial model
  For i = 1: n
    $C_i = \frac{1}{n_i}\sum_{j=1}^{n_i} L_j^i$
    $L(p_i)$ = Learn_ISM_model($\{\{c_k\}_{k=1}^{m}, \{F_j^i, L_j^i\}_{j=1}^{n_i}, C_i\}$); // local model
  End
  $G$ = Learn_ISM_model($\{\{P_i, C_i\}_{i=1}^{n}, C_o^1\}$); // global model for structure of object
  For t = 2: T
    For i = 1: n
      // each low-level feature in image $I_t$ that lies in the bounding box votes based on the given local ISM
      Vote_features($I_t$, $IP_t$, $\{c_k\}_{k=1}^{m}$, $L(p_i)$);
      For each pixel (x)
        $\{l_j^i\}_{j=1}^{n_i}$ = BP(x);
        appearance_map(x, $p_i$) = $(1 - S_L(p_i, x)) \cdot \lambda(p_i, BP(x))$
        locality_map(x, $p_i$) = $\sum_{f_j^i \in BP(x)} \sum_{f_k^i \in BP(x)} \delta(\lVert l_j^i - l_k^i\rVert^2 - \lambda)$
        global_map(x, $p_i$) = $\exp(\zeta \cdot \lVert C_o^t - C_{po}^i\rVert^2)$
        final_map(x, $p_i$) = appearance_map(x, $p_i$) + locality_map(x, $p_i$) + global_map(x, $p_i$)
      End
      $(C_p^i)^*$ = Mean_shift_mode_estimation(final_map(x, $p_i$))
      $MP_i^t$ = Minimal_Bounding_Box($BP((C_p^i)^*)$)
      Update_Local_ISM($L(p_i)$, $\{BP((C_p^i)^*), (C_p^i)^*\}$);
    End
    Update_Global_ISM($G$, $\{\{(C_p^i)^*\}_{i=1}^{n}, C_o^t\}$);
  End
  // Learn an appearance model for each part
  For i = 1: n
    $\{H_i^t\}_{t=1}^{T}$ = histogram_of_Gradient($\{MP_i^t\}_{t=1}^{T}$);
    $w_i$ = learn_appearance_model($\{H_i^t\}_{t=1}^{T}$);
  End

It should be noted that the proposed approach can determine the parts in deformable objects. To do so, the sequence of training images should be selected such that the appearance of the object varies gradually (Fig. 5).

2.5. Parts filter learning

In this subsection, to model the visual appearance of each part, a filter on HOG features is learned using a max-margin discriminative approach. In other words, the appearance model of each part is learned similarly to the HOG detectors of [23]. To simplify the test phase and reduce uncertainties, filters on HOG features are used instead of groups of features as the representation of parts. It is shown experimentally that part filters can efficiently encode the appearance of parts.

Up to now, each part is represented by a set of local features. Here, the minimal bounding box of each part is used as a training sample to learn the part's filter. Some of the training samples for two different parts of the motorbike object are shown in Fig. 4. As shown, most of them contain the desired information. However, there is no spatial alignment between the training samples for each part. To overcome this limitation, an approach is proposed in which the samples are first spatially aligned and then their appearance model is learned.

Fig. 4. Samples of two discovered parts for "motorbike" object.

It should be noted that the vertical and horizontal lengths of a part in the training samples are not necessarily the same. Hence, they should be converted to the same size. For this purpose, it is assumed that the vertical and horizontal lengths of the ith part in the training samples have a Gaussian distribution; hence, a Gaussian distribution is fitted to them. Then, each ith part in the training samples is resized to the mean value of the Gaussian distribution. Next, all training samples that correspond to the ith part are cross-correlated with the ith initial part. Then, they are all converted to the same size at the peak of the cross-correlation. In the next step, all resized samples that correspond to the ith part are described by HOG. These HOG maps are represented by $\{H_j^i\}_{j=1}^{T}$. Now, the samples are ready to be used for learning a filter on HOG features for each part. Here, learning the appearance model for the ith part is formulated via the max-margin framework:

$$\min_{w,\,b}\ \frac{1}{2} w_i^{T} w_i \quad \text{s.t.}\quad y_j^i\bigl(w_i^{T} H_j^i + b\bigr) \ge 1,\quad \forall j = 1, \ldots, M \tag{10}$$

where $y_j^i \in \{0, 1\}$, with $y_j^i = 1$ if the jth sample belongs to the ith part and 0 otherwise. $w_i$ denotes the appearance model for the ith part and is called the part filter. It should be noted that negative training samples are collected by Google image search. As is clear, Eq. (10) is similar to the optimization problem of the Support Vector Machine (SVM). A standard optimization package called CVX [24] is used for solving this problem.

3. Proposed approach: Inference stage

In this step, the detection of objects in a test image is demonstrated. In the inference step, only the second layer of the hierarchical ISM, the learned global ISM, is used. In the global ISM, the spatial arrangement of the parts relative to the object center is considered. Also, some characteristics of the proposed approach are discussed in Section 3.2.

3.1. Object recognition

To detect objects in a test image, candidate positions of parts are determined in the test image, and they then vote for a possible position of the object center using the learned global ISM. To find candidate positions of each part, the learned part filter is correlated with the test image. Each local maximum produced by the correlation process that has a score greater than a predefined threshold is selected as a candidate position for the part. SVM is used to learn these thresholds. Then, each candidate position of the parts votes for a possible object center point based on the learned spatial occurrence distribution of the global ISM. Finally, each local maximum that has collected votes from all parts and has a strength greater than a threshold is chosen as an object center. In our method, each candidate location is investigated to determine which parts have voted for that position. As stated before, this decreases false positive cases. It should be noted that previous ISM-based methods cannot guarantee that object centers have votes from a specified number of specific features, whereas our approach guarantees that an object center has received votes from all n different parts.

As stated before, ISM-based methods use a verification stage to decrease the false positive rate [11]. However, due to the high discriminative power of the parts, the proposed approach does not require a verification stage.

3.2. Discussion

In this subsection, the characteristics of the proposed method are discussed. In the proposed approach, in the training phase, the object model (i.e., hierarchical ISM) is learned incrementally. This is because an initial model for the object is learned first. Then, based on the learned model, parts for the other training images are extracted.

... the number of parts is low, each part models a large portion of objects. Hence, it would not be able to consider within-class variations. It should be noted that if the number of parts is set to one, our approach corresponds to the Dalal and Triggs [23] approach, which uses a single filter on HOG features to represent an object category.

In general, parts are responsible for improving the performance of the proposed approach for the following reasons:

– The parts extracted by our approach have high discriminative ability. This causes candidate positions of parts to coincide with their true positions in the test image. Also, they rarely match the background or parts of other classes.
– Parts can also handle within-class appearance variations of each object class.
– Our approach guarantees that each potential object center location receives votes from n different parts. This reduces false positive cases. There is no such guarantee in other ISM-based methods. However, to handle occlusion, it should be assumed that the object center receives votes from most of the parts.
It is clear, if pose of the object in the new training Also, another advantage of the proposed approach is that, it is sample is different from the learned model, the proposed approach simple in the inference step and this is an important property of fails. In Fig. 5, the proposed model is applied to the hand-waving an algorithm. Also, for semi-rigid objects, the number of training human action. As it is clear, the proposed model has detected the images is low. In the experimental results, this property is consid- parts of a non-rigid object in consecutive frames. It should be noted ered and discussed. that, if the pose variations of the object in consecutive training The extracted parts by the proposed approach can also be used samples are high, then the proposed approach fails. Therefore, in other methods as initialization process. For example, in [18], the the proposed approach can be applied on the standard semi-rigid appearance model of the extracted parts can be used as initial val- object recognition datasets, since in semi-rigid objects, pose varia- ues for parts appearance in the latent SVM algorithm. tions of objects of training images are low. However, to detect non- rigid objects by using the proposed approach, it is better to use a 4. Experimental results sequence of training samples (i.e. video) in which pose of an object changes slowly. In the future work, it is considered. In this section, to evaluate the proposed object recognition As it is mentioned, appearance and viewpoint variations pose method, it is applied to TUD (cow, motorbike, and side-view car), major challenges in the object recognition field. Most of the meth- Caltech 4 (airplane, face, motorbike, side-view car) and Caltech ods represent each object as a mixture models to overcome this car rear-view datasets. The obtained results are compared to the difficulty [1,18,21]. In the proposed method, each non-rigid object other state-of-the art methods. 
Also, it is shown that our method can be represented only by one model providing the pose of the ob- has a good performance in the presence of low number of training ject in the consecutive training samples varies gradually. This is samples. To verify it, effects of the size of the training samples are due to the fact that ISM is a non parametric model which can con- explored. It should be considered that all of results belong to the sider structural variations of the object in a simply manner. The state-of-the-art methods are simply reported. Also, in Section 4.1, appearance variation of each part in the training phase is consid- part localization is done to show the ability of our approach in ered by the learned non-linear discriminative function of Eq. (5). extracting effective parts for each object category. Also, mixture of appearance models in non-rigid objects could be TUD cow dataset: This dataset consists of 113 images. Each im- used to consider high appearance variations of each part. Also, age consists of a cow walking from right to left. One of the major the proposed method is invariant to scale. In the test stage, each challenges in this dataset is articulated body of cows (only cow image in different scales is investigated to detect object. feet). To show our method’s robustness to this challenge, we The number of initial parts should be selected so that it is close to evaluated it using this dataset. To do this, we selected 20 images the actual number of object parts. If the number of initial parts is se- as training samples and used other images as test samples. Similar lected too high, then their discriminative ability and the number of to [11], in the proposed approach, detection is true, if the overlap low-level features in each group decrease. Hence, initial ISM for each between the predicted bounding box and the true bounding box part would not be reliable and the corresponding extracted parts in is greater than 50%. 
We achieved 100% accuracy which means other training images cannot be relied on. Therefore, considering a our proposed approach can detect all of the objects and have no large number would have destructive impact on the final result. If false positive detections. Fig. 5. Extracted parts for a non-rigid object. The sequence of training samples varies gradually. Please cite this article in press as: P. Razzaghi, S. Samavi, Hierarchical Implicit Shape Modeling, J. Vis. Commun. (2014), http://dx.doi.org/10.1016/ j.jvcir.2013.12.020 8 P. Razzaghi, S. Samavi / J. Vis. Commun. Image R. xxx (2014) xxx–xxx TUD motorbike dataset: This dataset is a subset of PASCAL col- proves to 99.2% detection rate. It is shown that increasing the num- lection [25]. It consists of 115 images that contain 125 motorbike ber of training samples have no effect in the detection rate. Also, side views. TUD motorbike dataset is a challenging dataset which our approach reaches 86% recall rate at the 79% precision level. has scale variation, clutter and occlusion. The achieved results However, [11] reaches only 82% recall rate at the 39% precision le- are shown in Table 1. Results are reported based on Equal Error vel. We evaluate our method in the comparison of the other meth- Rate (EER) measure at which precision and recall have the equal ods, on the object presence/absence classification task. To do this, value. As it is shown in Table 1, our proposed approach has a supe- the evaluation scheme of [27] is followed. An image contains an rior performance compared to the most successful and recent object, if at least one occurrence can be found. Table 2 shows that existing algorithms. However, our proposed approach has some our approach outperforms other methods. advantages in comparison to the other methods. 
First, the number Caltech 4 dataset: This dataset contains four categories for air- of training images is low, while other approaches need many train- plane, face, motorbike, and side view car class objects. Negative ing images to achieve their best results. For example [26,11] use image sets contain 1600 images which do not contain any of these respectively 400 and 153 training samples from Caltech dataset objects. The number of images in each category and number of to train their models. However, in the proposed approach twenty used training samples in the proposed approach are shown in images are used as training samples. Moreover, our proposed ap- Table 3. This dataset has two main challenges: within class object proach does not require a verification stage. appearance variations and scale variations. Table 4 summarizes the Our approach is a modified version of ISM in which parts of ob- obtained results for the proposed approach together with the per- ject, rather than low-level features, vote to the object center. In an- formance of other methods. The obtained performance on side- other words, our approach without using parts is a simple ISM, view car category has a lower value with respect to the other cat- without a verification stage, in which low-level features vote to egories. It is due to the fact that the training images in this category the object center. have low resolutions. It causes extraction of a low number of SIFT To show the effectiveness of introducing parts, we compared features in these images. Therefore, parts definition degrades, since our method with a simple ISM without a verification stage. Simple parts are groups of low-level features which are placed in a special ISM on TUD motorbike dataset with 153 training samples reach structure. Some training images in the side-view car dataset are 76% recall rate at the precision rate of 48.2%. Training images are shown in Fig. 7. chosen from Caltech motorbike dataset. 
Simple ISM results are The proposed approach is compared with [1]. In [1] an object based on our implementation. To have a fair comparison, in the recognition approach based on the part-based modeling is intro- simple ISM similar to our approach, only SIFT interest point detec- duced. As it is shown in Table 4, our proposed approach in two cat- tors and descriptors are used. Also, to generate codebook, k-means algorithm is used. ISM based approaches, such as [11], due to some deficiencies of k-mean clustering algorithms, use agglomerative clustering to generate codebook. Simple ISM produce many false positive cases, hence, ISM based approaches to produce acceptable performance need to use verifica- tion stage. Leibe et al. [11] is an ISM based method which use MDL as a verification stage, reaches 92.7% EER measure. Our approach is 2.65% better than their approach while there is no verification stage in our approach due to the high discriminative ability of parts. TUD side view car dataset: This dataset contains 100 images with one side view of a car in each image. There are some chal- lenges in this dataset containing within class object appearance variation, some occlusion and background variations. We have achieved 100% accuracy on this dataset. Caltech rear view car dataset: we also test our approach on car rear view dataset. This dataset contains 128 training samples and Fig. 6. Accuracy versus number of training samples for Caltech car rear view 1115 test samples. It contains road scenes with significant scale dataset. variation which is taken from the inside of the moving vehicles. There are also changes in within class object appearance in this Table 2 dataset. As it is mentioned, our approach needs small training sam- The success rate of others methods on Caltech car rear view dataset (%). ples. To verify it, the effect of the training set size on detection per- Methods Caltech cars rear (%) formance is explored. 
To do this, we have applied the proposed approach for different training set sizes from 5 to 20. The obtained [27] 90.3 [29] 98.3 results (detection rate) are shown in Fig. 6. As can be seen from the [30] 98.9 plot, the proposed approach reaches 77% detection rate with 5 [11] (Patch) 93.9 training samples. When more training samples are added, it im- [11] (SC) 96.7 [31] 94.6 Proposed approach 99.2 Table 1 The EER measure of category detection rate on TUD motorbike side view dataset (%). Methods TUD motorbike (%) [28] 89.3 Table 3 [26] 89 Number of images and number of used training samples in the proposed approach in [14] 87 Caltech 4 dataset. [11] ISM + MDL (153 training samples) 92.7 Object class Airplane Face Motorbike Side-view car Simple ISM (153 training sample) 76* Proposed approach 95.35 # Images 800 435 800 123 * # Training samples 25 25 25 25 Only recall rate is reported. Please cite this article in press as: P. Razzaghi, S. Samavi, Hierarchical Implicit Shape Modeling, J. Vis. Commun. (2014), http://dx.doi.org/10.1016/ j.jvcir.2013.12.020 P. Razzaghi, S. Samavi / J. Vis. Commun. Image R. xxx (2014) xxx–xxx 9 Table 4 Classification performance on Caltech 4 dataset. The performance is reported based on EER measure. Category Method Weakly supervised [1] Supervised [31] Proposed approach (%) 0-Fan (%) 1-Fan (%) 2-Fan (%) 0-Fan (%) 1-Fan (%) 2-Fan (%) Airplanes 90.5 94.5 95.8 90.5 91.3 93.3 97.8 Faces 86.0 98.4 98.4 98.2 98.2 98.2 95.8 Motorbike 96.7 98.8 98.8 96.5 97 97 97.6 Cars (side) – – – – – – 81.2 egories has better performance and in the other category, it achieves comparable performances with [1]. In [1], the structure between parts is modeled by a k-fan graph and similar to our ap- proach, parts are achieved during the training stage. The obtained model for motorbike object by [1] and the proposed approach are shown in Fig. 8. 
As it is shown, obtained parts by the proposed ap- proach not only have higher visual discriminative as compared to that of [1] but also give a complete description of the object. In [1,10], parts appearance are initialized randomly. Therefore, there is no guarantee that the model (parts and their relative position) would give a complete description of objects. Consequently, it af- fects in the final result. However, in our approach, the obtained parts give a complete description of objects due to explicit and efficient way of extracting of initial parts. Fig. 9 summarizes preci- sion–recall curve for each category in Caltech 4 dataset. 4.1. Part localization In this subsection, the discriminative ability of the extracted parts in improving performance is experimentally explored. To do so, part localization is done in which positions of parts in the test image are determined. If the localization error is low, it de- notes that the employed part-filters of our approach have enough Fig. 8. (top) Obtained model by the proposed approach (bottom) obtained parts by discriminative ability to detect correct positions of the parts. To do [1]. The obtained parts by the proposed approach have high discriminative ability this experiment, it is needed that the true location of each part in (best view in electronic version). every test image is specified. It should be noted that in our ap- proach, the extracted parts are meaningful. Hence a human could annotate the true extracted part location. When annotating the ex- iment for localization of parts is also done. However, they claimed act position of each part by human it is normal to see some errors that to perform this experiment, it was done in full supervised set- of 0–5 pixels. To do an evaluation, we use two categories of Caltech ting. In other words, they manually labeled parts in the training 4 dataset: Motorbikes and airplanes. These datasets are large, so samples, and then learned their approach. 
Comparison of our their results are reliable. The test images were annotated with approach with that of [1] is not fair due to the fact that they use the true location of parts. Some samples of part annotation for different set of parts and different settings for learning of object ‘‘Motorbike’’ and ‘‘airplane’’ are shown in Fig. 10. Each part is num- model. However, in Table 5 we compare two approaches based bered by a digit. For test images where bounding box of object is on mean localization errors for all parts. It is shown that our determined correctly, votes are back projected and the predicted approach localizes parts more accurately. position of parts are then determined. To show the results, trimmed mean of Euclidean distance between estimated position 5. Conclusions and the ground truth position of parts are computed. The obtained results are shown in Table 5. In this paper, we proposed a new multi-layer approach for part- As it is shown, localization errors of parts are low. This denotes based object recognition. In the first layer, low-level features (i.e. the effectiveness of the proposed approach in extraction of parts. It SIFT) vote to parts. Parts were considered as latent variables since is noted that the localization of less distinctive parts like part #1 in they were not represented in the training samples. In the second motorbike object is done by using spatial model. In our approach, layer, the obtained parts vote to object centroid. To represent the global ISM is used as a spatial model in the test phase. In [1], exper- structure of low-level features, in each part and the structure of Fig. 7. Some training images in the car side-view dataset in Caltech 4 dataset. Please cite this article in press as: P. Razzaghi, S. Samavi, Hierarchical Implicit Shape Modeling, J. Vis. Commun. (2014), http://dx.doi.org/10.1016/ j.jvcir.2013.12.020 10 P. Razzaghi, S. Samavi / J. Vis. Commun. Image R. xxx (2014) xxx–xxx Fig. 9. 
Precision–recall curves for Caltech 4 dataset for (top left) face (top right) airplane (bottom left) motorbike (bottom right) car side-view categories. Fig. 10. Examples of the annotation for ‘‘motorbike’’ and ‘‘airplanes’’. Table 5 Localization error for 5 parts on ‘‘motorbike’’ and ‘‘airplane’’ from Caltech 4 dataset. Results are reported based on 90% and 75% of trimmed means in pixels. Part 1 2 3 4 5 Mean [11], Mean 1-fan Motorbike Localization error 90% 12.57 7.6 3.39 2.04 1.83 5.49 8.82 75% 12.61 7.82 3.37 1.60 1.52 5.38 5.7 Airplane Localization error 90% 7.24 8.56 4.06 4.98 12.59 7.49 20.75 75% 7.35 7.25 3.37 3.19 11.19 6.47 14.88 Bold values emphasize that proposed method provides better performance compared to method [11]. parts with respect to the object centroid, ISM was used. The reason was shown that the number of required training samples for the was that ISM could efficiently handle deformable objects, occlu- proposed approach was low. sions and clutters. In the inference step, candidate positions of parts were determined and then they voted to the centroid of the object. Our proposed method unlike most of the methods based References on Hough voting schema does not require a verification stage. It is due to the fact that parts have high discriminative ability and [1] D.J. Crandall, Part-based statistical models for visual object class recognition (Doctor of Philosophy dissertation), Cornell University, 2008. rarely match with the background. In the experimental results, it [2] L. Fei-Fei, R. Fergus, A. Torralba, Recognizing and Learning Object Categories, CVPR 7 short course, 2007. Please cite this article in press as: P. Razzaghi, S. Samavi, Hierarchical Implicit Shape Modeling, J. Vis. Commun. (2014), http://dx.doi.org/10.1016/ j.jvcir.2013.12.020 P. Razzaghi, S. Samavi / J. Vis. Commun. Image R. xxx (2014) xxx–xxx 11 [3] T.K. Landauer, S.T. Dumais, A solution to Plato’s problem: the latent semantic [19] L. Zhu, Y. Chen, A. Yuille, W. 
Freeman, Latent hierarchical structural learning analysis theory of acquisition, induction, and representation of knowledge, for object detection, in: Computer Vision and Pattern Recognition, 2010, pp. Psychol. Rev. 104 (1997) 211–240. 1062–1069. [4] T. Hofmann, Probabilistic latent semantic indexing, in: Twenty-Second Annual [20] R. Mottaghi, Augmenting deformable part models with irregular-shaped object International SIGIR Conference, 1999, pp. 50–57. patches, in: Computer Vision and Pattern Recognition, 2012, pp. 3116–3123. [5] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 [21] R. Mottaghi, A. Ranganathan, A. Yuille, A compositional approach to learning (2003) 993–1022. part-based models of objects, in: International Computer Vision Conference, [6] J. Sivic, B.C. Russell, A.A. Efros, A. Zisserman, W.T. Freeman, Discovering objects 2011, pp. 561–568. and their location in images, in: Computer Vision and Pattern Recognition, [22] D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. 2005, pp. 370–377. Comput. Vision 60 (2004) 91–110. [7] Z. Niu, G. Hua, X.G.Q. Tian, Spatial-DiscLDA for visual recognition, in: Computer [23] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Vision and Pattern Recognition, 2011. Computer Vision and Pattern Recognition, 2005, pp. 886–893. [8] M. Everingham, L.V. Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL [24] M. Grant, S. Boyd, cvx Users’ Guide, 2012. Visual Object Classes, Challenge 2007 (VOC2007) Results. [25] M. Everingham, L. Gool, C.K. Williams, J. Winn, A. Zisserman, The pascal visual [9] R. Fergus, P. Perona, A. Zisserman, Weakly supervised scale-invariant learning object classes (VOC) challenge, Int. J. Comput. Vision Arch. 88 (2010) 303–338. of models for visual recognition, Int. J. Comput. Vision 71 (2005) 273–303. [26] K. Mikolajczyk, B. Leibe, B. Schiele, Multiple object class detection with a [10] P.F. 
Felzenszwalb, D.P. Huttenlocher, Pictorial structures for object recognition, generative model, in: Computer Vision and Pattern Recognition, 2006, pp. 26– Int. J. Comput. Vision 61 (2005) 55–79. 36. [11] B. Leibe, A. Leonardis, B. Schiele, Robust object detection with interleaved [27] R. Fergus, A. Zisserman, P. Perona, Object class recognition by unsupervised categorization and segmentation, Int. J. Comput. Vision 77 (2008) 259–289. scale-invariant learning, in: Computer Vision and Pattern Recognition, 2003, [12] C. Harris, M. Stephens, A combined corner and edge detector, in: Alvey Vision pp. 264–271. Conference, 1988, pp. 147–151. [28] M. Villamizar, F. Moreno-Noguer, J. Andrade-Cetto, A. Sanfeliu, Efficient [13] V. Ferrari, F. Jurie, C. Schmid, From images to shape models for object rotation invariant object detection using boosted random ferns, in: detection, Int. J. Comput. Vision 87 (2010) 284–303. Computer Vision and Pattern Recognition, 2010, pp. 1038–1045. [14] B. Leibe, K. Mikolajczyk, B. Schiele, Segmentation based multi-cue integration [29] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local features and kernels for for object detection, in: British Machine Vision Conference, 2006. classification of texture and object categories: a comprehensive study, Int. J. [15] J. Shotton, A. Blake, R. Cipolla, Contour-based learning for object detection, in: Comput. Vision 73 (2007) 213–238. International Conference on Computer Vision, 2005. [30] T. Deselaers, D. Keysers, H. Ney, Improving a discriminative approach to object [16] P. Yarlagadda, A. Monroy, B.O. Ommer, Voting by grouping dependent parts, recognition using image patches, in: DAGM Conference on Pattern in: European Conferance on Computer Vision, 2010, pp. 197–210. Recognition, 2005, pp. 326–333. [17] A. Opelt, A. Pinz, A. Zisserman, Learning an alphabet of shape and appearance [31] D.J. Crandall, P.F. Felzenszwalb, D.P. 
Huttenlocher, Spatial priors for part-based for multi-class object detection, Int. J. Comput. Vision 80 (2008) 16–44. recognition using statistical models, in: Computer Vision and Pattern [18] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection Recognition, 2005. with discriminatively trained part based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 1627–1645. Please cite this article in press as: P. Razzaghi, S. Samavi, Hierarchical Implicit Shape Modeling, J. Vis. Commun. (2014), http://dx.doi.org/10.1016/ j.jvcir.2013.12.020
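As an illustration of the inference step of Section 3.1, the following is a minimal Python sketch of part-to-center voting; it is not the authors' implementation. It assumes that part-filter candidates are already available as (x, y, score) triples. The function name `detect_centers`, the `offsets` dictionary standing in for the learned spatial occurrence distribution of the global ISM, and the coarse grid accumulator with `bin_size` are all hypothetical simplifications; the paper selects local maxima of the vote distribution rather than grid cells.

```python
from collections import defaultdict


def detect_centers(candidates, offsets, bin_size=8, min_strength=0.0, min_parts=None):
    """Accumulate part votes on a coarse grid and keep well-supported cells.

    candidates: {part_id: [(x, y, score), ...]}    part-filter responses above threshold
    offsets:    {part_id: [(dx, dy, weight), ...]}  hypothetical stand-in for the global ISM
    """
    if min_parts is None:
        min_parts = len(offsets)  # default: every one of the n parts must vote
    strength = defaultdict(float)  # accumulated vote weight per grid cell
    voters = defaultdict(set)      # which distinct parts voted for each cell
    for part_id, positions in candidates.items():
        for x, y, score in positions:
            for dx, dy, weight in offsets.get(part_id, []):
                cell = (int((x + dx) // bin_size), int((y + dy) // bin_size))
                strength[cell] += score * weight
                voters[cell].add(part_id)
    detections = []
    for cell, total in strength.items():
        if total >= min_strength and len(voters[cell]) >= min_parts:
            # Report the center of the grid cell as the object center estimate.
            cx = (cell[0] + 0.5) * bin_size
            cy = (cell[1] + 0.5) * bin_size
            detections.append((cx, cy, total))
    return sorted(detections, key=lambda d: -d[2])


# Toy example: three parts agree on one center; part 0 also fires on background.
offsets = {0: [(20, 0, 1.0)], 1: [(-20, 0, 1.0)], 2: [(0, -15, 1.0)]}
candidates = {
    0: [(80, 100, 0.9), (300, 40, 0.8)],
    1: [(120, 100, 0.7)],
    2: [(101, 116, 0.6)],
}
strict = detect_centers(candidates, offsets)                # all parts required
relaxed = detect_centers(candidates, offsets, min_parts=1)  # any single part suffices
```

With the default setting, the isolated background response of part 0 is rejected because only one part supports it, which mirrors the argument that requiring votes from all n parts reduces false positives. Lowering `min_parts` corresponds to the relaxation suggested for occluded objects.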

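The two evaluation measures used in Section 4 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the evaluation code behind the reported numbers. The paper does not specify how bounding box overlap is measured, so intersection-over-union is assumed here; likewise, the trimmed mean is assumed to average the smallest `keep` fraction of the localization errors. The names `overlap_ratio`, `is_correct_detection` and `trimmed_mean_error` are hypothetical.

```python
import math


def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(predicted_box, true_box, threshold=0.5):
    """A detection counts as correct if the overlap exceeds the threshold (50%)."""
    return overlap_ratio(predicted_box, true_box) > threshold


def trimmed_mean_error(predicted, ground_truth, keep=0.9):
    """Mean Euclidean distance over the smallest `keep` fraction of errors (assumed)."""
    dists = sorted(
        math.hypot(p[0] - g[0], p[1] - g[1]) for p, g in zip(predicted, ground_truth)
    )
    k = max(1, int(len(dists) * keep))  # number of errors retained after trimming
    return sum(dists[:k]) / k


# Toy example: four predicted part positions against annotations at the origin.
predicted = [(0, 0), (3, 4), (10, 0), (0, 100)]
ground_truth = [(0, 0)] * 4
err_75 = trimmed_mean_error(predicted, ground_truth, keep=0.75)
err_all = trimmed_mean_error(predicted, ground_truth, keep=1.0)
```

In the toy example, trimming discards the single gross outlier, which is why the trimmed means in Table 5 are less sensitive to occasional failures than a plain mean would be.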
References (31)

  1. D.J. Crandall, Part-based statistical models for visual object class recognition (Doctor of Philosophy dissertation), Cornell University, 2008.
  2. L. Fei-Fei, R. Fergus, A. Torralba, Recognizing and Learning Object Categories, CVPR 7 short course, 2007.
  3. T.K. Landauer, S.T. Dumais, A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychol. Rev. 104 (1997) 211-240.
  4. T. Hofmann, Probabilistic latent semantic indexing, in: Twenty-Second Annual International SIGIR Conference, 1999, pp. 50-57.
  5. D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993-1022.
  6. J. Sivic, B.C. Russell, A.A. Efros, A. Zisserman, W.T. Freeman, Discovering objects and their location in images, in: Computer Vision and Pattern Recognition, 2005, pp. 370-377.
  7. Z. Niu, G. Hua, X. Gao, Q. Tian, Spatial-DiscLDA for visual recognition, in: Computer Vision and Pattern Recognition, 2011.
  8. M. Everingham, L.V. Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.
  9. R. Fergus, P. Perona, A. Zisserman, Weakly supervised scale-invariant learning of models for visual recognition, Int. J. Comput. Vision 71 (2005) 273-303.
  10. P.F. Felzenszwalb, D.P. Huttenlocher, Pictorial structures for object recognition, Int. J. Comput. Vision 61 (2005) 55-79.
  11. B. Leibe, A. Leonardis, B. Schiele, Robust object detection with interleaved categorization and segmentation, Int. J. Comput. Vision 77 (2008) 259-289.
  12. C. Harris, M. Stephens, A combined corner and edge detector, in: Alvey Vision Conference, 1988, pp. 147-151.
  13. V. Ferrari, F. Jurie, C. Schmid, From images to shape models for object detection, Int. J. Comput. Vision 87 (2010) 284-303.
  14. B. Leibe, K. Mikolajczyk, B. Schiele, Segmentation based multi-cue integration for object detection, in: British Machine Vision Conference, 2006.
  15. J. Shotton, A. Blake, R. Cipolla, Contour-based learning for object detection, in: International Conference on Computer Vision, 2005.
  16. P. Yarlagadda, A. Monroy, B.O. Ommer, Voting by grouping dependent parts, in: European Conference on Computer Vision, 2010, pp. 197-210.
  17. A. Opelt, A. Pinz, A. Zisserman, Learning an alphabet of shape and appearance for multi-class object detection, Int. J. Comput. Vision 80 (2008) 16-44.
  18. P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 1627-1645.
  19. L. Zhu, Y. Chen, A. Yuille, W. Freeman, Latent hierarchical structural learning for object detection, in: Computer Vision and Pattern Recognition, 2010, pp. 1062-1069.
  20. R. Mottaghi, Augmenting deformable part models with irregular-shaped object patches, in: Computer Vision and Pattern Recognition, 2012, pp. 3116-3123.
  21. R. Mottaghi, A. Ranganathan, A. Yuille, A compositional approach to learning part-based models of objects, in: International Computer Vision Conference, 2011, pp. 561-568.
  22. D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2004) 91-110.
  23. N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition, 2005, pp. 886-893.
  24. M. Grant, S. Boyd, cvx Users' Guide, 2012.
  25. M. Everingham, L.V. Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vision 88 (2010) 303-338.
  26. K. Mikolajczyk, B. Leibe, B. Schiele, Multiple object class detection with a generative model, in: Computer Vision and Pattern Recognition, 2006, pp. 26-36.
  27. R. Fergus, A. Zisserman, P. Perona, Object class recognition by unsupervised scale-invariant learning, in: Computer Vision and Pattern Recognition, 2003, pp. 264-271.
  28. M. Villamizar, F. Moreno-Noguer, J. Andrade-Cetto, A. Sanfeliu, Efficient rotation invariant object detection using boosted random ferns, in: Computer Vision and Pattern Recognition, 2010, pp. 1038-1045.
  29. J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local features and kernels for classification of texture and object categories: a comprehensive study, Int. J. Comput. Vision 73 (2007) 213-238.
  30. T. Deselaers, D. Keysers, H. Ney, Improving a discriminative approach to object recognition using image patches, in: DAGM Conference on Pattern Recognition, 2005, pp. 326-333.
  31. D.J. Crandall, P.F. Felzenszwalb, D.P. Huttenlocher, Spatial priors for part-based recognition using statistical models, in: Computer Vision and Pattern Recognition, 2005.