Mining Representative Unsubstituted Graph Patterns Using Prior Similarity Matrix ∗ †

Rabie Saidi

doi:10.1016/J.IS.2017.05.006

Outline

Mining Representative Unsubstituted Graph Patterns Using Prior Similarity Matrix ∗ †

Rabie Saidi

2016

https://doi.org/10.1016/J.IS.2017.05.006

visibility

…

description

9 pages

Abstract

One of the most powerful techniques to study protein struc-tures is to look for recurrent fragments (also called substruc-tures or spatial motifs), then use them as patterns to charac-terize the proteins under study. An emergent trend consists in parsing proteins three-dimensional (3D) structures into graphs of amino acids. Hence, the search of recurrent spa-tial motifs is formulated as a process of frequent subgraph discovery where each subgraph represents a spatial motif. In this scope, several efficient approaches for frequent subgraph discovery have been proposed in the literature. However, the set of discovered frequent subgraphs is too large to be effi-ciently analyzed and explored in any further process. In this paper, we propose a novel pattern selection approach that shrinks the large number of discovered frequent subgraphs by selecting the representative ones. Existing pattern selec-tion approaches do not exploit the domain knowledge. Yet, in our approach we incorporate th...

Mining Representative Unsubstituted Graph Patterns Using Prior Similarity Matrix∗ † Wajdi Dhifli Rabie Saidi Engelbert Mephu Nguifo Clermont University, Blaise European Bioinformatics Clermont University, Blaise Pascal University, LIMOS, BP Institute Pascal University, LIMOS, BP 10448, F-63000 Hinxton, Cambridge, CB10 10448, F-63000 Clermont-Ferrand, France 1SD, United Kingdom Clermont-Ferrand, France CNRS, UMR 6158, LIMOS, [email protected] CNRS, UMR 6158, LIMOS, F-63173 Aubière, France F-63173 Aubière, France [email protected] [email protected] arXiv:1303.2054v1 [cs.CE] 8 Mar 2013 ABSTRACT General Terms One of the most powerful techniques to study protein struc- Algorithms, Experimentation tures is to look for recurrent fragments (also called substruc- tures or spatial motifs), then use them as patterns to charac- terize the proteins under study. An emergent trend consists Keywords in parsing proteins three-dimensional (3D) structures into Feature selection, graph mining, representative unsubsti- graphs of amino acids. Hence, the search of recurrent spa- tuted subgraphs, spatial motifs, protein structures tial motifs is formulated as a process of frequent subgraph discovery where each subgraph represents a spatial motif. In 1. INTRODUCTION this scope, several efficient approaches for frequent subgraph discovery have been proposed in the literature. However, the Studying protein structures can reveal relevant structural set of discovered frequent subgraphs is too large to be effi- and functional information which may not be derived from ciently analyzed and explored in any further process. In this protein sequences alone. During recent years, various meth- paper, we propose a novel pattern selection approach that ods that study protein structures have been elaborated based shrinks the large number of discovered frequent subgraphs on diverse types of descriptor such as profiles [25], spatial by selecting the representative ones. Existing pattern selec- motifs [17, 21] and others. Yet, the exponential growth of tion approaches do not exploit the domain knowledge. Yet, online databases such as the Protein Data Bank (PDB) [4], in our approach we incorporate the evolutionary information CATH [7], SCOP [2] and others, arises an urgent need for of amino acids defined in the substitution matrices in order more accurate methods that will help to better understand to select the representative subgraphs. We show the effec- the studied phenomenons such as protein evolution, func- tiveness of our approach on a number of real datasets. The tions, etc. results issued from our experiments show that our approach In this scope, proteins have recently been interpreted as is able to considerably decrease the number of motifs while graphs of amino acids and studied based on graph theory enhancing their interestingness. concepts [14]. This representation enables the use of graph mining techniques to study protein structures in a graph perspective. In fact, in graph mining, any problem or object Categories and Subject Descriptors under consideration is represented in the form of nodes and E.4 [Coding and information theory]: [Database Appli- edges and studied based on graph theory concepts [3, 12, 16, cations - Data Mining] 6]. One of the powerful and current trends in graph mining is ∗An implementation of the proposed approach and frequent subgraph discovery. It aims to discover subgraphs that frequently occur in a graph dataset and use them as the datasets used in the experiments are freely patterns to describe the data. These patterns are lately available on the first authors personal web page: http://fc.isima.fr/∼dhifli/unsubpatt/ or by email request analyzed by domain experts to reveal interesting information to any one of the authors hidden in the original graphs, such as discovering pathways †This paper is the full version of a preliminary work accepted in metabolic networks [9], identifying residues that play the as abstract in MLCB’12 (NIPS workshop) role of hubs in the protein and stabilize its structure [24], etc. The graph isomorphism test is one of the main bottle- necks of frequent subgraph mining. Yet, many efficient and scalable algorithms have been proposed in the literature and Permission to make digital or hard copies of all or part of this work for made it feasible for instance FFSM [15], gSpan [29], GAS- personal or classroom use is granted without fee provided that copies are TON [18], etc. Unfortunately, the exponential number of not made or distributed for profit or commercial advantage and that copies discovered frequent subgraphs is another serious issue that bear this notice and the full citation on the first page. To copy otherwise, to still needs more attention [27], since it may hinder or even republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. make any further analysis unfeasible due to time, resources, Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. and computational limitations. For example, in an AIDS an- tiviral screen dataset composed of 422 chemical compounds, gPLS uses the mathematical concept of partial least squares there are more than 1 million frequent substructures when regression to create latent variables allowing a better pre- the minimum support is 5%. This problem becomes even diction. COM [16] is another subgraph selection approach more serious with graphs of higher density such as those which follows a process of pattern mining and classifier learn- representing protein structures. In fact, the issues raised ing. It mines co-occurrence rules. Then, based on the co- from the huge number of frequent subgraphs are mainly due occurrence information it assembles weak features in order to two factors, namely redundancy and significance [22]. Re- to generate strong ones. In [22], authors proposed a fea- dundancy in a frequent subgraph set is caused by structural ture selection approach termed CORK. To find frequent sub- and/or semantic similarity, since most discovered subgraphs graphs, it uses the state-of-the-art approach gSpan. Then differ slightly in structure and may infer similar or even the using a submodular quality function, it selects among them same meaning. Moreover, the significance of the discovered the subset of subgraphs that are most discriminative for frequent subgraphs is only related to frequency. This yields classification. In [10], authors designed LPGBCMP, a gen- an urgent need for efficient approaches allowing to select rel- eral model which selects clustered features by considering evant patterns among the large set of frequent subgraphs. the structure relationship between subgraph patterns in the In this paper, we propose a novel selection approach which functional space. The selected subgraphs are used as weak selects a subset of representative patterns from a set of la- classifiers (base learners) to obtain high quality classifica- beled subgraphs, we term them unsubstituted patterns. In tion models. To the best of our knowledge, in all existing order to select these unsubstituted patterns and to shrink subgraph selection approaches [11], the selection is usually the large size of the initial set of frequent subgraphs, we based on structural similarity [5] and/or statistical measures exploit a specific domain knowledge, which is the substitu- (e.g. frequency and coverage (closed [30], maximal [23]), dis- tion between amino acids represented as nodes. Though, crimination [22], ...). Yet, the prior information and knowl- the main contribution of this work is to define a new ap- edge about the domain are often ignored. However, these proach for mining a representative summary of the set of prior knowledge may help building dedicated approaches frequent subgraphs by incorporating a specific background that best fit the studied data. domain knowledge which is the ability of substitution be- tween nodes labels in the graph. In this work, we apply the proposed approach on protein structures because of the 3. MINING REPRESENTATIVE UNSUBSTI- availability of substitution matrices in the literature, how- TUTED PATTERNS ever, it can be considered as general framework for other applications whenever it is possible to define a matrix quan- 3.1 Background tifying the possible substitution between the labels. Our Statistical pattern selection methods have been widely approach can also be used on any type of subgraph struc- used to resolve the dimensionality problem when the number ture such as cliques, trees and paths (sequences). In addi- of discovered patterns is too large. However, these methods tion, it can be easily coupled with other pattern selection are too generic and do not consider the specificity of the methods such as discrimination or orthogonality based ap- domain and the used data. We believe that in many con- proaches. Moreover, this approach is unsupervised and can texts, it would be important to incorporate the background help in various mining tasks, unlike other approaches that knowledge about the domain in order to create approaches are supervised and dedicated to a specific task such as clas- that best fit the considered data. In proteomics, a protein sification. structure is composed by the folding of a set of amino acids. The remainder of the paper is organized as follows. Sec- During evolution, amino acids can substitute each other. tion 2 discusses the recent related works in the area of pat- The scores of substitution between pairs of amino acids were tern selection for subgraphs. In Section 3, we present the quantified by biologists in the literature in the form of sub- background of our work and we define the preliminary con- stitution matrices such as Blosum62 [8]. Our approach uses cepts as well as the main algorithm of our approach. Then, the substitution information given in the substitution ma- Section 4 describes the characteristics of the used data and trices in order to select a subset of unsubstituted patterns the experimental settings. Section 5 presents the obtained that summarizes the whole set of frequent subgraphs. We experimental results and the discussion. It is worth noting consider the selected patterns as the representative ones. that in the rest of the paper, we use the following terms The main idea of our approach is based on node substi- interchangeably : spatial motifs, patterns, subgraphs. tution. Since the nodes of a protein graph represent amino acids, though, using a substitution matrix, it would be pos- 2. RELATED WORKS sible to quantify the substitution between two given sub- Recently, several approaches have been proposed for pat- graphs. Starting from this idea, we define a similarity func- tern selection in subgraph mining. In [5], authors proposed tion that measures the distance between a given pair of sub- ORIGAMI, an approach for both subgraph discovery and graphs. Then, we preserve only one subgraph from each pair selection. First they randomly mine a sample of maximal having a similarity score greater or equal to a user specified frequent subgraphs, then straightforwardly they select an threshold such that the preserved subgraphs represent the α-orthogonal (non-redundant), β-representative subgraphs set of representative unsubstituted patterns. An overview of from the mined set. The LEAP algorithm proposed in [28] the proposed approach is illustrated in Figure 1 and a more tries to locate patterns that individually have high discrim- detailed description is given in the following sections. ination power, using an objective function score that mea- The substitution between amino acids was also used in sures each pattern’s significance. Another approach termed the literature but for sequential feature extraction from pro- gPLS was proposed in [20]. It attempts to select a set of tein sequences in [19], where the authors proposed a novel informative subgraphs in order to rapidly build a classifier. feature extraction approach termed DDSM for protein se- In the case of protein’s substitution matrices, both posi- tive and negative values represents possible substitutions. However, positive scores are given to the more likely sub- stitutions while negative scores are given to the less likely substitutions. Though, in order to give more magnitude to higher values of x, ∀ l and l0 ∈ L: 0 M(l, l0 ) = eM(l,l ) (2) Figure 1: Unsubstituted pattern selection frame- Definition 2. (Structural isomorphism) Two patterns P = work. (VP , EP , L) and P 0 = (VP 0 , EP 0 , L) are said to be struc- turally isomorphic (having the same shape), we note shape(P, P 0 ). quence classification. Their approach is restricted to protein shape(P, P 0 ) = true , iff: sequences and generates every subsequence substituting an- - P and P 0 have the same order, i.e., |VP | = |VP 0 |, other one. In other words, DDSM eliminates any pattern substituted by another one and which itself does not sub- - P and P 0 have the same size, i.e., |EP | = |EP 0 |, stitute any other one. We believe that their approach does - ∃ a bijective function f : VP → VP 0 | ∀u, v ∈ VP if not guarantee an optimal summarization since its output (u, v) ∈ EP then (f (u), f (v)) ∈ EP 0 and inversely. may still contain patterns that substitute each other. In addition, they do consider negative substitution scores as It is worth mentioning that in the graph isomorphism impossible substitutions which is biologically not true since problem we test whether two graphs are exactly the same by negative scores are only expressing the less likely substitu- considering both structures and labels. But in this defini- tions, which obviously does not mean that they are impos- tion, we only test whether two given graphs are structurally sible. Moreover, DDSM is limited to protein sequences and the same, in terms of nodes and edges, without considering does not handle more complex structures such as the protein the labels. tertiary structure. Our approach overcomes these shortcom- Definition 3. (Elementary mutation probability) Given a ings, since it can handle both protein sequences and struc- node v of a label l ∈ L, the elementary mutation probability, tures (since a sequence can be seen as a line graph). In Mel (v), measures the possibility that v stay itself and does addition, it consider both the positive and negative scores not mutate to any other node depending on its label l. of the matrix. Moreover, our approach generates a set of representative unsubstituted patterns ensuring an optimal  ⊥ 0, if M(l, l) = e  summarization of the initial set. Besides, it is unsupervised > Mel (v) = 1, if M(l, l) = e (3) and can be exploited in classification as well as other anal- M(l,l) , otherwise   P|L| ysis and knowledge discovery contexts unlike DDSM which i=1 M(l,li ) is dedicated to classification. Obviously, if the substitution score in M between l and 3.2 Preliminaries itself is ⊥ then it is certain that l will mutate to another label l0 and the probability value that v does not mutate In this section we present the fundamental definitions and should be 0. Respectively, if this substitution score is > the formal problem statement. Let G be a dataset of graphs. then it is certain that v will stay itself and conserve its label Each graph G = (V, E, L) of G is given as a collection of l so the probability value must be equal to 1. Otherwise, we nodes V and a collection of edges E. The nodes of V are divide the score that l mutates to itself by the sum of all the labeled within an alphabet L. We denote by |V | the number possible mutations. of nodes (also called the graph order) and by |E| the number of edges (also called the graph size). Let also Ω be the set Definition 4. (Pattern mutation probability) Given a pat- of frequent subgraphs extracted from G, also referred here tern P = (VP , EP , L) ∈ Ω, the pattern mutation probability, as patterns. Mpatt (P ), measures the possibility that P mutates to any Definition 1. (Substitution matrix) Given an alphabet L, other pattern having the same order. a substitution matrix M over L is the function defined as: |VP | Y M: 2 L −→ [⊥, >] ⊂ R Mpatt (P ) = 1 − Mel (P [i]) (4) (1) (l, l0 ) 7−→ x i=1 where |V Q P| The higher the value of x is, the likely is the substitution i=1 Mel (P [i]) represents the probability that the of l0 by l. If x = ⊥ then the substitution is impossible, and pattern P does not mutate to any other pattern i.e. P stays if x = > then the substitution is certain. The values ⊥ and itself. > are optional and user-specified. They may appear or not Definition 5. (Elementary substitution probability) Given in M. The scores in M should respect the following two two nodes v and v 0 having correspondingly the labels l, l0 ∈ properties: L, the elementary substitution probability, Sel (v, v 0 ), mea- 1. ∀ l ∈ L, ∃ l’ ∈ L | M(l, l0 ) 6= ⊥, sures the possibility that v substitutes v 0 . 2. ∀ l ∈ L, if ∃ l’ ∈ L | M(l, l0 ) = > then ∀ l” ∈ L\{l, l’ }, M(l, l0 ) Sel (v, v 0 ) = (5) M(l, l”) = ⊥ and M(l0 , l”) = ⊥. M(l, l) In many real world applications, the substitution matrices It is worth noting that Sel is not bijective i.e. Sel (v, v 0 ) 6= may contain at the same time positive and negative scores. Sel (v 0 , v). Definition 6. (Pattern substitution score) Given two pat- Algorithm 1: UnSubPatt terns P = (VP , EP , L) and P 0 = (VP 0 , EP 0 , L) having the Data: Ω, M, τ , (⊥,>) [Optional] same shape, we denote by Spatt (P, P 0 ) the substitution score Result: Ω∗ : {unsubstituted patterns} of P 0 by P . In other words, it measures the possibility that P begin mutates to P 0 by computing the sum of the elementary sub- Ω ← {Ωk ← {P ∈ Ω | ∀(P 0 , P ”) ∈ Ωk , |VP 0 | = stitution probabilities then normalize it by the total number |VP ” | and |EP 0 | = |EP ” |}}; of nodes of P . Formally: foreach Ωk ∈ Ω do Ωk ← sort(Ωk by Mpatt ); P|VP | 0 i=1 Sel (P [i], P [i]) Spatt (P, P 0 ) = (6) | VP | foreach P ∈ Ωk do if Mpatt (P ) > 0 then Definition 7. (Pattern substitution) A pattern P substi- foreach P 0 ∈ Ωk \P | tutes a pattern P 0 , we note subst(P, P 0 , τ ) = true, iff: Mpatt (P 0 ) < Mpatt (P ) do if Mpatt (P 0 ) > 0 then 1. P and P 0 are structurally isomorphic (shape(P, P 0 ) = if shape(P, P 0 ) and true), subst(P, P 0 , τ ) then merge support(P, P 0 ); 2. Spatt (P, P 0 ) ≥ τ , where τ is a user-specified threshold remove(P 0 , Ωk ); such that 0% ≤ τ ≤ 100%. Definition 8. (Unsubstituted pattern) Given a threshold Ω∗ ← Ω∗ ∪ Ωk ; τ and Ω∗ ∈ Ω, a pattern P ∗ ∈ Ω∗ is said to be unsubstituted iff @P ∈ Ω∗ | Mpatt (P ) > Mpatt (P ∗ ) and subst(P, P ∗ , τ ) = true. Theorem 1. Let Ω be a set of patterns and Ω∗ its subset Proposition 1 (Null Mpatt case). Given a pattern P = of unsubstituted patterns based on a substitution matrix M (Vp , Ep , L) ∈ Ω, if Mpatt (P ) = 0 then P is an unsubstituted and a threshold τ , i.e., UnSubPatt (Ω, M, τ, (⊥, >)) = Ω∗ . pattern. Then : Proof. The proof can be simply deduced from Defini- UnSubPatt(Ω∗ , M, τ, (⊥, >)) = Ω∗ (8) tions 3 and 4. Proof. The proof can be deduced simply from Definition 8. Given a threshold τ , Ω∗ can not be summarized by its Definition 9. (Merge support) Given two patterns P and subsets unless itself. Formally: P 0 , if P substitutes P 0 then P will represent P 0 in the list ∀P ∈ Ω∗ , @P 0 ∈ Ω∗ |Mpatt (P ) > Mpatt (P 0 ) and subst(P, P 0 , τ ) of graphs where P 0 occurs. Formally: (9) ∀(P, P 0 ) | subst(P, P 0 , τ ) = true then DP = DP ∪ DP 0 (7) where DP and DP 0 are correspondingly the occurrence set 3.4 Complexity of P and that of P 0 . Suppose Ω contains n patterns. Ω is divided into g groups, each containing patterns of order k. This is done in O(n). 3.3 Algorithm Each group Ωk is sorted in O(|Ωk | ∗ log|Ωk |). Searching Given a set of patterns Ω and a substitution matrix M, we for unsubstituted patterns requires browsing Ωk (O(|Ωk |)) propose UnSubPatt(see Algorithm 1), a pattern selection and for each pattern, browsing in the worst case all remain- algorithm which enables detecting the set of unsubstituted ing patterns (O(|Ωk |)) to check the shape (O(k)) and the patterns Ω∗ within Ω. Based on our similarity concept, all substitution (O(k)). This means that searching for unsub- the patterns in Ω∗ are dissimilar, since it does not contain stituted patterns in a group Ωk can be done in O(|Ωk |2 ∗ k2 ). any pair of patterns that are substitutable. This represents Hence, in the worst case, the complexity of our algorithm is a reliable summarization of Ω. O(g ∗ m2max ∗ kmax 2 ), where kmax is the maximum pattern or- The general process of the algorithm is described as fol- der and mmax is the number of patterns of the largest group lows: first, Ω is divided into subsets of patterns having the Ωk . same number of nodes and edges. Then, each subset is sorted in a descending order by the pattern mutation probability 4. EXPERIMENTS Mpatt . Each subset is browsed starting from the pattern having the highest Mpatt . For each pattern, we remove all 4.1 Datasets the patterns it substitutes and we merge their supports such In order to experimentally evaluate our approach, we use that the preserved pattern will represent all the removed four graph datasets of protein structures, which also have ones wherever they occurs. The remaining patterns repre- been used in [28] then [10]. Each dataset consists of two sent the unsubstituted pattern set. Though, Ω∗ can not be classes: positive and negative. Positive samples are proteins summarized by a subset of it but itself. Our algorithm uses selected from a considered protein family whereas negative Proposition 1 to avoid unnecessary computation related to samples are proteins randomly gathered from the Protein patterns with Mpatt = 0. They are directly considered as Data Bank [4]. Each protein is parsed into a graph of amino unsubstituted patterns, since they can not mutate to any acids. Each node represents an amino acid residue and is la- other pattern. beled with its amino acid type. Two nodes u and v are linked by an edge e(u, v) = 1 if the euclidean distance between their two Cα atoms ∆(Cα (u), Cα (v)) is below a threshold distance Table 2: Number of frequent subgraphs (Ω), repre- δ. Formally: sentative unsubstituted subgraphs (Ω∗ ) and the se- lection rate | Ω∗ | Selection rate (%) ( 1, if ∆(Cα (u), Cα (v)) ≤ δ Dataset |Ω| e(u, v) = (10) DS1 799094 7291 0.91 0, otherwise DS2 258371 15898 6.15 In the literature, many methods use this definition with DS3 114792 14713 12.82 usually δ ≥ 7Å on the argument that Cα atoms define the DS4 1073393 9958 0.93 overall shape of the protein conformation [13]. In our exper- iments, we use δ = 7Å. Table 1 summarizes the characteristics of each dataset. 5.1 Empirical Results SCOP ID, Avg.|V|, Avg.|E|, Max.|V| and Max.|E| corre- spond respectively to the id of the protein family in SCOP Here, we show the results of our experiments obtained database [2], the average number of nodes, the average num- in terms of number of motifs and classification results. As ber of edges, the maximal number of nodes and the maximal mentioned before, we use gSpan to extract the frequent sub- number of edges in each dataset. graphs from each dataset with frequency ≥ 30%. Then, we use UnSubPatt to select the unsubstituted patterns among 4.2 Protocol and Settings them with a substitution threshold τ =30% and using Blo- Generally, in a pattern selection approach two aspects are sum62 as substitution matrix. At last, we perform a 5-fold emphasized, namely the number of selected patterns and cross-validation classification (5 runs) to evaluate the inter- their interestingness. In order to evaluate our approach, estingness of each subset using the NB classifier. The ob- we first use the state-of-the-art method of frequent sub- tained average results are reported in Table 3. graph discovery gSpan [29] to find the frequent subgraphs in The high number of discovered frequent subgraphs is due each dataset with a minimum frequency threshold of 30%. to their combinatorial nature (this was discussed in the in- Then, we use UnSubPatt to select the unsubstituted pat- troductory section). The results reported in Table 2 show terns among them with a minimum substitution threshold that our approach decreases considerably the number of sub- τ =30%. For our approach, we use Blosum62 [8] as the sub- graphs. The selection rate shows that the number of unsub- stitution matrix since it turns out that it performs well on stituted patterns | Ω∗ | does not exceed 13% of the initial set detecting the majority of weak protein similarities, and it is of frequent subgraphs | Ω | with DS3 and even reaches less used as the default matrix by most biological applications than 1% with DS1 and DS4. This proves that exploiting the such as BLAST [1]. It is worth mentioning that the choice domain knowledge, which in this case consists in the sub- of 30% as minimum frequency threshold for the frequent stitution matrix, enables emphasizing information that can subgraph extraction is to make the experimental evaluation possibly be ignored when using exiting subgraph selection feasible due to time and computational limitations. approaches. In order to evaluate the number of selected subgraphs, we The classification results reported in Table 3 help to evalu- define the selection rate as the rate of the number of unsub- ate the interestingness of the selected patterns. Indeed, this stituted subgraphs from the initial set of frequent subgraphs. will demonstrate if the unsubstituted patterns were arbitrar- Formally : ily selected or they are really representative. Table 3 shows that the classification accuracy significantly increases with |Ω∗ | ∗ 100 Selection rate = (11) all datasets. We notice a huge leap in accuracy especially |Ω| with DS1 and DS4 with a gain of more than 17% and reach- To evaluate the interestingness of the set of selected pat- ing almost full accuracy with DS4. To better understand the terns, we use them as features for classification. We per- accuracy results, we also reports the average precision, re- form a 5-fold cross-validation classification (5 runs) on each call, F-measure and AUC values for all cases. We notice an protein-structure dataset. We encode each protein into a enhancement of performance with all the mentioned qual- binary vector, denoting by ”1” or ”0” the presence or the ab- ity metrics. This supports the reliability of our selection sence of the feature in the considered protein. To judge the approach. interestingness of the selected subgraphs, we use one of the most known classifier, namely the naı̈ve bayes (NB) classi- 5.2 Results Using Other Substitution Matri- fier, due to its simplicity and fast prediction and that its ces classification technique is based on a global and conditional Besides Blosum62, biologists also defined other substitu- evaluation of the input features. NB is used with the default tion matrices describing the likelihood that two amino acid parameters from the workbench Weka [26]. types would mutate to each other in evolutionary time. We want to study the effect of using other substitution matrices 5. RESULTS AND DISCUSSION on the experimental results. Though, we perform the same In this section, we conduct experiments to examine the experiments following the same protocol and settings but us- effectiveness and efficiency of UnSubPatt in finding the ing two other substitution matrices, namely Blosum80 and representative unsubstituted subgraphs. We test the effect P am250. We compare the obtained results in terms of num- of changing the substitution matrix and the substitution ber of subgraphs and classification accuracy with those ob- threshold on the results. Moreover, we study the size-based tained using the whole set of frequent subgraphs and those distribution of patterns and we compare the results of our using subgraphs selected by UnSubPatt with Blosum62. approach with those of other subgraph selection methods The results are reported in Table 4. A high selection rate from the literature. accompanied with a clear enhancement of the classification Table 1: Experimental data Dataset SCOP ID Family name Pos. Neg. Avg.|V| Avg.|E| Max.|V| Max.|E| DS1 52592 G proteins 33 33 246 971 897 3 544 DS2 48942 C1 set domains 38 38 238 928 768 2 962 DS3 56437 C-type lectin domains 38 38 185 719 755 3 016 DS4 88854 Kinases, catalytic subunit 41 41 275 1077 775 3 016 Table 3: Accuracy, precision, recall (sensitivity), F-score and AUC of the classification of each dataset using NB coupled with frequent subgraphs (FSg) then representative unsubstituted subgraphs (UnSubPatt) Accuracy Precision Recall F-score AUC Dataset FSg UnSubPatt FSg UnSubPatt FSg UnSubPatt FSg UnSubPatt FSg UnSubPatt DS1 0.62 0.78 0.61 0.69 0.70 0.90 0.64 0.78 0.64 0.78 DS2 0.80 0.90 0.86 0.94 0.74 0.86 0.79 0.89 0.79 0.89 DS3 0.86 0.94 0.89 1.00 0.86 0.89 0.86 0.94 0.86 0.94 DS4 0.79 0.98 0.86 0.92 0.70 0.98 0.76 0.94 0.76 0.94 accuracy is noticed using UnSubPatt with all the substitu- tion matrices compared to the results using the whole set of frequent subgraphs. It is clearly noticed that even using dif- ferent substitution matrices, UnSubPatt shows relatively similar behavior and is able to select a small yet relevant subset of patterns. It is also worth mentioning that for all the datasets, the best classification accuracy is obtained us- ing Blosum62 and the best selection rate is achieved using Pam250. This is simply due to how distant proteins within the same dataset are, since each substitution matrix was constructed to implicitly express a particular theory of evo- lution. Though, choosing the appropriate substitution ma- trix can influence the outcome of the analysis. Figure 2: Rate of unsubstituted patterns from Ω 5.3 Impact of Substitution Threshold depending on the substitution threshold (τ ). In our experiments, we used a substitution threshold (of 30%) to select the unsubstituted patterns from the set of discovered frequent subgraphs. In this section, we study the This fact is illustrated in Figures 3, 4 and 5 which show impact of variation of the substitution threshold on both the that the unsubstituted patterns allow better classification number of selected subgraphs and the classification results. performance compared to the original set of frequent sub- To do so, we perform the same experiments while varying graphs. UnSubPatt scores very well with the three used the substitution threshold from 0% to 90% with a step-size classifiers and even reaches full accuracy in some cases. This of 10. In order to check if the enhancements of the obtained confirms our assumptions and shows that our selection is re- results are due to our selected features or to the classifier, liable and contributes to the enhancement of the accuracy. we use two other well-known classifiers namely the support However, we believe that NB allows the most reliable eval- vector machine (SVM) and decision tree (C4.5) besides the uation because it performs a classification based on a global naı̈ve bayes (NB) classifier. We use the same protocol and and conditional evaluation of features, unlike SVM which settings of the previous experiments. Figure 2 presents the performs itself another attribute selection to select the sup- selection rate for different substitution thresholds and Fig- port vectors and unlike C4.5 which performs an attribute by ures 3, 4 and 5 illustrate the classification accuracy obtained attribute evaluation. respectively using NB, SVM and C4.5 with each dataset. In the case of proteins, a substitution threshold of 0% The classification accuracy of the initial set of frequent sub- enables selecting subgraphs based only on their structure. graphs (gSpan, the line in red) is considered as a standard Precisely, UnSubPatt will select only one pattern from each value for comparison. Thus, the accuracy values of UnSub- group of subgraphs that are structurally isomorphic. Based Patt (in blue) that are above the line of the standard value on the experimental results, we believe that using these pat- are considered as gains, and those under the standard value terns is enough for a structural classification task since it are considered as losses. allows a fast selection, selects a very small number of sub- In Figure 2, we notice that UnSubPatt reduces consid- graphs and performs very well on classification. erably the number of subgraphs especially with lower sub- stitution thresholds. In fact, the number of unsubstituted 5.4 Size-based Distribution of Patterns patterns does not exceed 30% for all substitution thresholds In this section, we study the distribution of patterns based below 70% and even reaches less then 1% in some cases. on their size (number of edges). We try to check which sizes This important reduction in the number of patterns comes of patterns are more concerned by the selection. The Figures with a notable enhancement of the classification accuracies. 6 and 7 draw the distribution of patterns for the original set Table 4: Number of subgraphs (#SG) and accuracy (ACC) of the classification of each dataset using NB coupled with frequent subgraphs (FSg) then representative unsubstituted subgraphs using Blosum80 (Un- SubPatt Blosum80) and Pam250 (UnSubPatt Pam250) FSg UnSubPatt Blosum62 UnSubPatt Blosum80 UnSubPatt Pam250 Dataset #SG Accuracy #SG Accuracy #SG Accuracy #SG Accuracy DS1 799094 0.62 7291 0.78 7328 0.67 6137 0.68 DS2 258371 0.80 15898 0.90 15930 0.87 15293 0.87 DS3 114793 0.86 14713 0.94 14792 0.91 14363 0.93 DS4 1073393 0.79 9958 0.98 10417 0.90 9148 0.90 Figure 3: Classification accuracy by NB. Figure 6: Distribution of patterns of DS1 for all the frequent subgraphs and for the representative un- substituted ones with all the substitution thresholds Figure 4: Classification accuracy by SVM. Figure 7: Distribution of patterns of DS2 for all the frequent subgraphs and for the representative un- substituted ones with all the substitution thresholds of frequent subgraphs and for the final set of representative unsubstituted subgraphs with all the substitution thresholds using Blosum62. The downward tendency of UnSubPatt using lower substitution thresholds and with respect to the original set of frequent subgraph is very clear. In fact, Un- SubPatt leans towards cutting off the peaks and flattening the curves with lower substitution thresholds. Another in- teresting observation is that the curves are flattened in the regions of small patterns as well as in those of big and dense patterns. This demonstrates the effectiveness of UnSub- Patt with both small and big patterns. Figure 5: Classification accuracy by C4.5. 5.5 Comparison with other approaches To objectively evaluate our approach, we compare it with current trends in subgraph selection. In Figure 8, we report Table 5: Runtime analysis of UnSubPatt with dif- ferent substitution thresholds Number of Substitution thresholds patterns τ = 10% τ = 30% τ = 50% 10000 4s 4s 4s 20000 8s 8s 10s 30000 13s 13s 17s 40000 18s 18s 25s 50000 23s 23s 33s 60000 28s 28s 41s Figure 8: Classification accuracy comparison with 70000 35s 35s 52s other pattern selection approaches. 80000 40s 42s 66s 90000 46s 49s 80s the classification accuracy using the representative unsub- 100000 53s 57s 136s stituted patterns of UnSubPatt besides those using pat- terns of other new subgraph selection approaches from the literature namely LEAP[28], gPLS[20], COM[16] and LPG- of subgraphs having the same size and order. Hence, these BCMP[10] (reported and explained in the introductory sec- groups can be distributed and treated separately in different tion). processes. For UnSubPatt, we report the results obtained using the substitution matrix Blosum62, a minimum substitution 6. CONCLUSION threshold τ = 30% and SVM for classification. For LEAP+SVM, In this paper, we proposed a novel selection approach for LEAP is used iteratively to discover a set of discrimina- mining a representative summary of the set of frequent sub- tive subgraphs with a leap length=0.1. The discovered sub- graphs. Unlike existent methods that are based on the re- graphs are consider as features to train SVM with a 5-fold lations between patterns in the transaction space, our ap- cross validation. COM is used with tp = 30% and tn = 0%. proach considers the distance between patterns in the pat- For gPLS, the frequency threshold is set to 30% and the best tern space. The proposed approach exploits a specific do- accuracies are reported for all the datasets among all the pa- main knowledge, in the form of a substitution matrix, to se- rameters combinations from m = 2, 4, 8, 16 and k = 2, 4, 8, lect a subset of representative unsubstituted patterns from 16, where m is the number of iterations and k is the number a given set of frequent subgraphs. It also allows to reduce of patterns obtained per search. For LPGBCMP, threshold considerably the size of the initial set of subgraphs to ob- values of maxv ar = 1 and δ = 0.25 were respectively used tain an interesting and representative one enabling easier for feature consistency map building and for overlapping. and more efficient further explorations. It is also worth men- The obtained results are reported in the Figure 8. tioning that our approach can be used on graphs as well as The classification results displayed in Figure 8 show that on sequences and is not limited to classification tasks, but UnSubPatt allows a better classification than all the other can help in other subgraph-based analysis such as indexing, pattern selection methods in the four cases. Considering clustering, visual inspection, etc. only these results does not allow to confirm that UnSub- Since the proposed approach is a filter approach, a promis- Patt would always outperform the considered methods. How- ing future direction could be to find a way to integrate the ever, this proves that UnSubPatt represents a very compet- selection within the extraction process in order to directly itive and promising approach. It is also worth noting that mine the representative patterns from data. Moreover, we these approaches are supervised and dedicated to classifica- intend to use our approach in other classification contexts tion unlike UnSubPatt which is an unsupervised approach. and in other mining applications. This allows it to be used in classification as well as in other mining tasks such as clustering and indexing. 7. ACKNOWLEDGEMENT 5.6 Runtime Analysis This work is supported by a PhD grant from the French To study the variation of UnSubPatt’s runtime with larger Ministry of Higher Education to the first author. amounts of data, we use different sets of frequent patterns from 10000 to 100000 with step-size of 10000. In Table 5, we 8. REFERENCES report the runtime results for the pattern sets using three substitution thresholds. [1] S. Altschul, W. Gish, W. Miller, E. Myers, and Even though the complexity of the problem due to the D. Lipman. Basic local alignment search tool. Journal combinatorial test of substitution between subgraphs, our of Molecular Biology, 215:403–410, 1990. algorithm is scalable with higher amounts of data. With in- [2] A. Andreeva, D. Howorth, J.-M. Chandonia, S. E. creasing number of patterns, the runtime is still reasonable. Brenner, T. J. P. Hubbard, C. Chothia, and A. G. The use of different substitution thresholds slightly affected Murzin. Data growth and its impact on the scop the runtime of UnSubPatt, since the number of selected database: new developments. Nucleic Acids Research, patterns is comparable for all thresholds. 36(1):D419–D425, 2008. A possible way to make UnSubPatt runs faster is par- [3] L. Bartoli, P. Fariselli, and R. Casadio. The effect of allelization. In fact, UnSubPatt can be easily parallelized, backbone on the small-world properties of protein since it tests separately the substitution among each group contact maps. Physical Biology, 4(4):L1+, 2007. [4] H. M. Berman, J. D. Westbrook, Z. Feng, G. Gilliland, [20] H. Saigo, N. Krämer, and K. Tsuda. Partial least T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. squares regression for graph mining. In ACM Bourne. The protein data bank. Nucleic Acids knowledge discovery and data mining conference Research, 28(1):235–242, 2000. (KDD), pages 578–586, 2008. [5] V. Chaoji, M. A. Hasan, S. Salem, J. Besson, and [21] H. Sun, A. Sacan, H. Ferhatosmanoglu, and Y. Wang. M. J. Zaki. Origami: A novel and effective approach Smolign: A spatial motifs-based protein multiple for mining representative orthogonal graph patterns. structural alignment method. IEEE/ACM Statistical Analysis and Data Mining, 1(2):67–84, Transactions on Computational Biology and 2008. Bioinformatics, 9(1):249–261, 2012. [6] H. Cheng, X. Yan, and J. Han. Mining graph patterns. [22] M. Thoma, H. Cheng, A. Gretton, J. Han, H.-P. In Managing and Mining Graph Data, pages 365–392. Kriegel, A. Smola, L. Song, P. S. Yu, X. Yan, and Springer, 2010. K. M. Borgwardt. Discriminative frequent subgraph [7] A. L. Cuff, I. Sillitoe, T. Lewis, A. B. Clegg, mining with optimality guarantees. Statistical Analysis R. Rentzsch, N. Furnham, M. Pellegrini-Calace, D. T. and Data Mining, 3(5):302–318, 2010. Jones, J. M. Thornton, and C. A. Orengo. Extending [23] L. T. Thomas, S. R. Valluri, and K. Karlapalem. cath: increasing coverage of the protein structure Margin: Maximal frequent subgraph mining. ACM universe and linking structure with function. Nucleic Transactions on Knowledge Discovery from Data Acids Research, 39:420–426, 2011. (TKDD), 4(3):10:1–10:42, 2010. [8] S. R. Eddy. Where did the blosum62 alignment score [24] R. R. Vallabhajosyula, D. Chakravarti, S. Lutfeali, matrix come from? Nature Biotechnology, pages A. Ray, and A. Raval. Identifying hubs in protein 1035–1036, 2004. interaction networks. PLoS ONE, 4(4):e5344, 2009. [9] K. Faust, P. Dupont, J. Callut, and J. van Helden. [25] N. von Öhsen, I. Sommer, R. Zimmer, and Pathway discovery in metabolic networks by subgraph T. Lengauer. Arby: automatic protein structure extraction. Bioinformatics, 26(9):1211–1218, 2010. prediction using profile-profile alignment and [10] H. Fei and J. Huan. Boosting with structure confidence measures. Bioinformatics, information in the functional space: an application to 20(14):2228–2235, 2004. graph classification. In ACM knowledge discovery and [26] I. H. Witten and E. Frank. Data Mining: Practical data mining conference (KDD), pages 643–652, 2010. Machine Learning Tools and Techniques. Morgan [11] M. A. Hasan. Pattern summarization in pattern Kaufmann, 2005. mining. Encyclopedia of Data Warehousing and [27] A. Woznica, P. Nguyen, and A. Kalousis. Model Mining, (2nd Ed), 2008. mining for robust feature selection. In ACM knowledge [12] M. A. Hasan and M. J. Zaki. Output space sampling discovery and data mining conference (KDD), pages for graph patterns. PVLDB, 2(1):730–741, 2009. 913–921, 2012. [13] J. Huan, D. Bandyopadhyay, W. Wang, J. Snoeyink, [28] X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining J. Prins, and A. Tropsha. Comparing graph significant graph patterns by leap search. ACM representations of protein structure for mining SIGMOD international conference on Management of family-specific residue-based packing motifs. Journal data, pages 433–444, 2008. of Computational Biology, 12(6):657–671, 2005. [29] X. Yan and J. Han. gspan: Graph-based substructure [14] J. Huan, W. Wang, D. B, J. Snoeyink, J. Prins, and pattern mining. Order A Journal On The Theory Of A. Tropsha. Mining spatial motifs from protein Ordered Sets And Its Applications, 02:721–724, 2002. structure graphs. In International Conference on [30] X. Yan and J. Han. Closegraph: mining closed Research in Computational Molecular Biology frequent graph patterns. In ACM knowledge discovery (RECOMB), pages 308–315, 2004. and data mining conference (KDD), pages 286–295, [15] J. Huan, W. Wang, and J. Prins. Efficient mining of 2003. frequent subgraphs in the presence of isomorphism. In IEEE International Conference on Data Mining (ICDM), pages 549–552, 2003. [16] N. Jin, C. Young, and W. Wang. Graph classification based on pattern co-occurrence. In ACM International Conference on Information and Knowledge Management, pages 573–582, 2009. [17] G. J. Kleywegt. Recognition of spatial motifs in protein structures. Journal of molecular biology, 285(4):1887–1897, 1999. [18] S. Nijssen and J. N. Kok. A quickstart in frequent structure mining can make a difference. In ACM knowledge discovery and data mining conference (KDD), pages 647–652, 2004. [19] R. Saidi, M. Maddouri, and E. Mephu Nguifo. Protein sequences classification by means of feature extraction with substitution matrices. BMC Bioinformatics, 11(1):175+, 2010.

References (31)

REFERENCES
S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403-410, 1990.
A. Andreeva, D. Howorth, J.-M. Chandonia, S. E. Brenner, T. J. P. Hubbard, C. Chothia, and A. G. Murzin. Data growth and its impact on the scop database: new developments. Nucleic Acids Research, 36(1):D419-D425, 2008.
L. Bartoli, P. Fariselli, and R. Casadio. The effect of backbone on the small-world properties of protein contact maps. Physical Biology, 4(4):L1+, 2007.
H. M. Berman, J. D. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Research, 28(1):235-242, 2000.
V. Chaoji, M. A. Hasan, S. Salem, J. Besson, and M. J. Zaki. Origami: A novel and effective approach for mining representative orthogonal graph patterns. Statistical Analysis and Data Mining, 1(2):67-84, 2008.
H. Cheng, X. Yan, and J. Han. Mining graph patterns. In Managing and Mining Graph Data, pages 365-392. Springer, 2010.
A. L. Cuff, I. Sillitoe, T. Lewis, A. B. Clegg, R. Rentzsch, N. Furnham, M. Pellegrini-Calace, D. T. Jones, J. M. Thornton, and C. A. Orengo. Extending cath: increasing coverage of the protein structure universe and linking structure with function. Nucleic Acids Research, 39:420-426, 2011.
S. R. Eddy. Where did the blosum62 alignment score matrix come from? Nature Biotechnology, pages 1035-1036, 2004.
K. Faust, P. Dupont, J. Callut, and J. van Helden. Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics, 26(9):1211-1218, 2010.
H. Fei and J. Huan. Boosting with structure information in the functional space: an application to graph classification. In ACM knowledge discovery and data mining conference (KDD), pages 643-652, 2010.
M. A. Hasan. Pattern summarization in pattern of Data Warehousing and Mining, (2nd Ed), 2008.
M. A. Hasan and M. J. Zaki. Output space sampling for graph patterns. PVLDB, 2(1):730-741, 2009.
J. Huan, D. Bandyopadhyay, W. Wang, J. Snoeyink, J. Prins, and A. Tropsha. Comparing graph representations of protein structure for mining family-specific residue-based packing motifs. Journal of Computational Biology, 12(6):657-671, 2005.
J. Huan, W. Wang, D. B, J. Snoeyink, J. Prins, and A. Tropsha. Mining spatial motifs from protein structure graphs. In International Conference on Research in Computational Molecular Biology (RECOMB), pages 308-315, 2004.
J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In IEEE International Conference on Data Mining (ICDM), pages 549-552, 2003.
N. Jin, C. Young, and W. Wang. Graph classification based on pattern co-occurrence. In ACM International Conference on Information and Knowledge Management, pages 573-582, 2009.
G. J. Kleywegt. Recognition of spatial motifs in protein structures. Journal of molecular biology, 285(4):1887-1897, 1999.
S. Nijssen and J. N. Kok. A quickstart in frequent structure mining can make a difference. In ACM knowledge discovery and data mining conference (KDD), pages 647-652, 2004.
R. Saidi, M. Maddouri, and E. Mephu Nguifo. Protein sequences classification by means of feature extraction with substitution matrices. BMC Bioinformatics, 11(1):175+, 2010.
H. Saigo, N. Krämer, and K. Tsuda. Partial least squares regression for graph mining. In ACM knowledge discovery and data mining conference (KDD), pages 578-586, 2008.
H. Sun, A. Sacan, H. Ferhatosmanoglu, and Y. Wang. Smolign: A spatial motifs-based protein multiple structural alignment method. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(1):249-261, 2012.
M. Thoma, H. Cheng, A. Gretton, J. Han, H.-P. Kriegel, A. Smola, L. Song, P. S. Yu, X. Yan, and K. M. Borgwardt. Discriminative frequent subgraph mining with optimality guarantees. Statistical Analysis and Data Mining, 3(5):302-318, 2010.
L. T. Thomas, S. R. Valluri, and K. Karlapalem. Margin: Maximal frequent subgraph mining. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(3):10:1-10:42, 2010.
R. R. Vallabhajosyula, D. Chakravarti, S. Lutfeali, A. Ray, and A. Raval. Identifying hubs in protein interaction networks. PLoS ONE, 4(4):e5344, 2009.
N. von Öhsen, I. Sommer, R. Zimmer, and T. Lengauer. Arby: automatic protein structure prediction using profile-profile alignment and confidence measures. Bioinformatics, 20(14):2228-2235, 2004.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
A. Woznica, P. Nguyen, and A. Kalousis. Model mining for robust feature selection. In ACM knowledge discovery and data mining conference (KDD), pages 913-921, 2012.
X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining significant graph patterns by leap search. ACM SIGMOD international conference on Management of data, pages 433-444, 2008.
X. Yan and J. Han. gspan: Graph-based substructure pattern mining. Order A Journal On The Theory Of Ordered Sets And Its Applications, 02:721-724, 2002.
X. Yan and J. Han. Closegraph: mining closed frequent graph patterns. In ACM knowledge discovery and data mining conference (KDD), pages 286-295, 2003.

About the author

Rabie Saidi

Blaise Pascal University, Department Member

Papers

Followers

View all papers from Rabie Saidiarrow_forward

Over the past era, subgraph mining from a large collection of graph database is a crucial problem. Existing works on subgraph mining is based on the threshold value, which returns similar graphs for a query graph. However, the number of graphs in the answer set for the same threshold can vary in terms of structure and context. In addition, scalability is another big problem due to insufficient storage.Further, it can suffer from security issues. In a distributed environment, since more queries come from different users, it is possible for the attackers to access the graph. These three problems highly exist in the current subgraph mining. To address this downside, our proposed work introduces aBlockchain-based Triune Layered Architecture for authenticated query search in the large scale dynamic graphs. The two fold process is handled in BTLA-LSDG: graph indexing and authenticated query search (query processing), which are implemented in triune layers (Data Generation Layer, Data Storage Layer, and Service Layer). Initially, data owners are authenticated to blockchain using Four-Q-Curve algorithm. The graph index is constructed by data owners and merged graph index is constructed by service providers. Based on the uploaded graph index, hash index is constructed using SHA-3. The hash index is fed into blockchain and form Dendrimer -Fractal Index. On the other hand, data user submits query with authentication. For every authenticated query, the four fold process is handled. Firstly, Multi-Constraint based Belief Entropy function is used for feature sets computation for a given query. Then Dual Similarity based MapReduce helps inmapping and reducing the relevant subgraphs with the use of optimal feature sets. Thirdly, Recurrent Neural Network (RNN)is used for subgraph isomorphic testing. Finally, graph index refinement process is undertaken to improve the query results. For this, query error is computed for each user's query at the end of retrieval. With respect to query error, fuzzy logic is used to refine the index of graph dynamically. This experiment is implemented using Hadoop environment and the results show better efficiency in terms of Scalability, Security and Storage. Further, it is tested with Precision, Recall, F-measure, Accuracy, Error Rate, Query Response Time and Positive Results.

Mining Representative Unsubstituted Graph Patterns Using Prior Similarity Matrix ∗ †

Sign up for access to the world's latest research

Abstract

Related papers

References (31)

Related papers

Cited by