Mining Representative Unsubstituted Graph Patterns
Using Prior Similarity Matrix∗ †
Wajdi Dhifli Rabie Saidi Engelbert Mephu Nguifo
Clermont University, Blaise European Bioinformatics Clermont University, Blaise
Pascal University, LIMOS, BP Institute Pascal University, LIMOS, BP
10448, F-63000 Hinxton, Cambridge, CB10 10448, F-63000
Clermont-Ferrand, France 1SD, United Kingdom Clermont-Ferrand, France
CNRS, UMR 6158, LIMOS,
[email protected] CNRS, UMR 6158, LIMOS,
F-63173 Aubière, France F-63173 Aubière, France
[email protected] [email protected]
arXiv:1303.2054v1 [cs.CE] 8 Mar 2013
ABSTRACT General Terms
One of the most powerful techniques to study protein struc- Algorithms, Experimentation
tures is to look for recurrent fragments (also called substruc-
tures or spatial motifs), then use them as patterns to charac-
terize the proteins under study. An emergent trend consists
Keywords
in parsing proteins three-dimensional (3D) structures into Feature selection, graph mining, representative unsubsti-
graphs of amino acids. Hence, the search of recurrent spa- tuted subgraphs, spatial motifs, protein structures
tial motifs is formulated as a process of frequent subgraph
discovery where each subgraph represents a spatial motif. In 1. INTRODUCTION
this scope, several efficient approaches for frequent subgraph
discovery have been proposed in the literature. However, the Studying protein structures can reveal relevant structural
set of discovered frequent subgraphs is too large to be effi- and functional information which may not be derived from
ciently analyzed and explored in any further process. In this protein sequences alone. During recent years, various meth-
paper, we propose a novel pattern selection approach that ods that study protein structures have been elaborated based
shrinks the large number of discovered frequent subgraphs on diverse types of descriptor such as profiles [25], spatial
by selecting the representative ones. Existing pattern selec- motifs [17, 21] and others. Yet, the exponential growth of
tion approaches do not exploit the domain knowledge. Yet, online databases such as the Protein Data Bank (PDB) [4],
in our approach we incorporate the evolutionary information CATH [7], SCOP [2] and others, arises an urgent need for
of amino acids defined in the substitution matrices in order more accurate methods that will help to better understand
to select the representative subgraphs. We show the effec- the studied phenomenons such as protein evolution, func-
tiveness of our approach on a number of real datasets. The tions, etc.
results issued from our experiments show that our approach In this scope, proteins have recently been interpreted as
is able to considerably decrease the number of motifs while graphs of amino acids and studied based on graph theory
enhancing their interestingness. concepts [14]. This representation enables the use of graph
mining techniques to study protein structures in a graph
perspective. In fact, in graph mining, any problem or object
Categories and Subject Descriptors under consideration is represented in the form of nodes and
E.4 [Coding and information theory]: [Database Appli- edges and studied based on graph theory concepts [3, 12, 16,
cations - Data Mining] 6]. One of the powerful and current trends in graph mining is
∗An implementation of the proposed approach and frequent subgraph discovery. It aims to discover subgraphs
that frequently occur in a graph dataset and use them as
the datasets used in the experiments are freely patterns to describe the data. These patterns are lately
available on the first authors personal web page:
http://fc.isima.fr/∼dhifli/unsubpatt/ or by email request analyzed by domain experts to reveal interesting information
to any one of the authors hidden in the original graphs, such as discovering pathways
†This paper is the full version of a preliminary work accepted in metabolic networks [9], identifying residues that play the
as abstract in MLCB’12 (NIPS workshop) role of hubs in the protein and stabilize its structure [24],
etc.
The graph isomorphism test is one of the main bottle-
necks of frequent subgraph mining. Yet, many efficient and
scalable algorithms have been proposed in the literature and
Permission to make digital or hard copies of all or part of this work for made it feasible for instance FFSM [15], gSpan [29], GAS-
personal or classroom use is granted without fee provided that copies are TON [18], etc. Unfortunately, the exponential number of
not made or distributed for profit or commercial advantage and that copies discovered frequent subgraphs is another serious issue that
bear this notice and the full citation on the first page. To copy otherwise, to still needs more attention [27], since it may hinder or even
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. make any further analysis unfeasible due to time, resources,
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. and computational limitations. For example, in an AIDS an-
tiviral screen dataset composed of 422 chemical compounds, gPLS uses the mathematical concept of partial least squares
there are more than 1 million frequent substructures when regression to create latent variables allowing a better pre-
the minimum support is 5%. This problem becomes even diction. COM [16] is another subgraph selection approach
more serious with graphs of higher density such as those which follows a process of pattern mining and classifier learn-
representing protein structures. In fact, the issues raised ing. It mines co-occurrence rules. Then, based on the co-
from the huge number of frequent subgraphs are mainly due occurrence information it assembles weak features in order
to two factors, namely redundancy and significance [22]. Re- to generate strong ones. In [22], authors proposed a fea-
dundancy in a frequent subgraph set is caused by structural ture selection approach termed CORK. To find frequent sub-
and/or semantic similarity, since most discovered subgraphs graphs, it uses the state-of-the-art approach gSpan. Then
differ slightly in structure and may infer similar or even the using a submodular quality function, it selects among them
same meaning. Moreover, the significance of the discovered the subset of subgraphs that are most discriminative for
frequent subgraphs is only related to frequency. This yields classification. In [10], authors designed LPGBCMP, a gen-
an urgent need for efficient approaches allowing to select rel- eral model which selects clustered features by considering
evant patterns among the large set of frequent subgraphs. the structure relationship between subgraph patterns in the
In this paper, we propose a novel selection approach which functional space. The selected subgraphs are used as weak
selects a subset of representative patterns from a set of la- classifiers (base learners) to obtain high quality classifica-
beled subgraphs, we term them unsubstituted patterns. In tion models. To the best of our knowledge, in all existing
order to select these unsubstituted patterns and to shrink subgraph selection approaches [11], the selection is usually
the large size of the initial set of frequent subgraphs, we based on structural similarity [5] and/or statistical measures
exploit a specific domain knowledge, which is the substitu- (e.g. frequency and coverage (closed [30], maximal [23]), dis-
tion between amino acids represented as nodes. Though, crimination [22], ...). Yet, the prior information and knowl-
the main contribution of this work is to define a new ap- edge about the domain are often ignored. However, these
proach for mining a representative summary of the set of prior knowledge may help building dedicated approaches
frequent subgraphs by incorporating a specific background that best fit the studied data.
domain knowledge which is the ability of substitution be-
tween nodes labels in the graph. In this work, we apply
the proposed approach on protein structures because of the 3. MINING REPRESENTATIVE UNSUBSTI-
availability of substitution matrices in the literature, how- TUTED PATTERNS
ever, it can be considered as general framework for other
applications whenever it is possible to define a matrix quan- 3.1 Background
tifying the possible substitution between the labels. Our
Statistical pattern selection methods have been widely
approach can also be used on any type of subgraph struc-
used to resolve the dimensionality problem when the number
ture such as cliques, trees and paths (sequences). In addi-
of discovered patterns is too large. However, these methods
tion, it can be easily coupled with other pattern selection
are too generic and do not consider the specificity of the
methods such as discrimination or orthogonality based ap-
domain and the used data. We believe that in many con-
proaches. Moreover, this approach is unsupervised and can
texts, it would be important to incorporate the background
help in various mining tasks, unlike other approaches that
knowledge about the domain in order to create approaches
are supervised and dedicated to a specific task such as clas-
that best fit the considered data. In proteomics, a protein
sification.
structure is composed by the folding of a set of amino acids.
The remainder of the paper is organized as follows. Sec-
During evolution, amino acids can substitute each other.
tion 2 discusses the recent related works in the area of pat-
The scores of substitution between pairs of amino acids were
tern selection for subgraphs. In Section 3, we present the
quantified by biologists in the literature in the form of sub-
background of our work and we define the preliminary con-
stitution matrices such as Blosum62 [8]. Our approach uses
cepts as well as the main algorithm of our approach. Then,
the substitution information given in the substitution ma-
Section 4 describes the characteristics of the used data and
trices in order to select a subset of unsubstituted patterns
the experimental settings. Section 5 presents the obtained
that summarizes the whole set of frequent subgraphs. We
experimental results and the discussion. It is worth noting
consider the selected patterns as the representative ones.
that in the rest of the paper, we use the following terms
The main idea of our approach is based on node substi-
interchangeably : spatial motifs, patterns, subgraphs.
tution. Since the nodes of a protein graph represent amino
acids, though, using a substitution matrix, it would be pos-
2. RELATED WORKS sible to quantify the substitution between two given sub-
Recently, several approaches have been proposed for pat- graphs. Starting from this idea, we define a similarity func-
tern selection in subgraph mining. In [5], authors proposed tion that measures the distance between a given pair of sub-
ORIGAMI, an approach for both subgraph discovery and graphs. Then, we preserve only one subgraph from each pair
selection. First they randomly mine a sample of maximal having a similarity score greater or equal to a user specified
frequent subgraphs, then straightforwardly they select an threshold such that the preserved subgraphs represent the
α-orthogonal (non-redundant), β-representative subgraphs set of representative unsubstituted patterns. An overview of
from the mined set. The LEAP algorithm proposed in [28] the proposed approach is illustrated in Figure 1 and a more
tries to locate patterns that individually have high discrim- detailed description is given in the following sections.
ination power, using an objective function score that mea- The substitution between amino acids was also used in
sures each pattern’s significance. Another approach termed the literature but for sequential feature extraction from pro-
gPLS was proposed in [20]. It attempts to select a set of tein sequences in [19], where the authors proposed a novel
informative subgraphs in order to rapidly build a classifier. feature extraction approach termed DDSM for protein se-
In the case of protein’s substitution matrices, both posi-
tive and negative values represents possible substitutions.
However, positive scores are given to the more likely sub-
stitutions while negative scores are given to the less likely
substitutions. Though, in order to give more magnitude to
higher values of x, ∀ l and l0 ∈ L:
0
M(l, l0 ) = eM(l,l )
(2)
Figure 1: Unsubstituted pattern selection frame- Definition 2. (Structural isomorphism) Two patterns P =
work. (VP , EP , L) and P 0 = (VP 0 , EP 0 , L) are said to be struc-
turally isomorphic (having the same shape), we note shape(P, P 0 ).
quence classification. Their approach is restricted to protein shape(P, P 0 ) = true , iff:
sequences and generates every subsequence substituting an- - P and P 0 have the same order, i.e., |VP | = |VP 0 |,
other one. In other words, DDSM eliminates any pattern
substituted by another one and which itself does not sub- - P and P 0 have the same size, i.e., |EP | = |EP 0 |,
stitute any other one. We believe that their approach does - ∃ a bijective function f : VP → VP 0 | ∀u, v ∈ VP if
not guarantee an optimal summarization since its output (u, v) ∈ EP then (f (u), f (v)) ∈ EP 0 and inversely.
may still contain patterns that substitute each other. In
addition, they do consider negative substitution scores as It is worth mentioning that in the graph isomorphism
impossible substitutions which is biologically not true since problem we test whether two graphs are exactly the same by
negative scores are only expressing the less likely substitu- considering both structures and labels. But in this defini-
tions, which obviously does not mean that they are impos- tion, we only test whether two given graphs are structurally
sible. Moreover, DDSM is limited to protein sequences and the same, in terms of nodes and edges, without considering
does not handle more complex structures such as the protein the labels.
tertiary structure. Our approach overcomes these shortcom-
Definition 3. (Elementary mutation probability) Given a
ings, since it can handle both protein sequences and struc-
node v of a label l ∈ L, the elementary mutation probability,
tures (since a sequence can be seen as a line graph). In
Mel (v), measures the possibility that v stay itself and does
addition, it consider both the positive and negative scores
not mutate to any other node depending on its label l.
of the matrix. Moreover, our approach generates a set of
representative unsubstituted patterns ensuring an optimal
⊥
0, if M(l, l) = e
summarization of the initial set. Besides, it is unsupervised >
Mel (v) = 1, if M(l, l) = e (3)
and can be exploited in classification as well as other anal- M(l,l)
, otherwise
P|L|
ysis and knowledge discovery contexts unlike DDSM which i=1 M(l,li )
is dedicated to classification.
Obviously, if the substitution score in M between l and
3.2 Preliminaries itself is ⊥ then it is certain that l will mutate to another
label l0 and the probability value that v does not mutate
In this section we present the fundamental definitions and
should be 0. Respectively, if this substitution score is >
the formal problem statement. Let G be a dataset of graphs.
then it is certain that v will stay itself and conserve its label
Each graph G = (V, E, L) of G is given as a collection of
l so the probability value must be equal to 1. Otherwise, we
nodes V and a collection of edges E. The nodes of V are
divide the score that l mutates to itself by the sum of all the
labeled within an alphabet L. We denote by |V | the number
possible mutations.
of nodes (also called the graph order) and by |E| the number
of edges (also called the graph size). Let also Ω be the set Definition 4. (Pattern mutation probability) Given a pat-
of frequent subgraphs extracted from G, also referred here tern P = (VP , EP , L) ∈ Ω, the pattern mutation probability,
as patterns. Mpatt (P ), measures the possibility that P mutates to any
Definition 1. (Substitution matrix) Given an alphabet L, other pattern having the same order.
a substitution matrix M over L is the function defined as: |VP |
Y
M: 2
L −→ [⊥, >] ⊂ R Mpatt (P ) = 1 − Mel (P [i]) (4)
(1)
(l, l0 ) 7−→ x i=1
where |V
Q P|
The higher the value of x is, the likely is the substitution i=1 Mel (P [i]) represents the probability that the
of l0 by l. If x = ⊥ then the substitution is impossible, and pattern P does not mutate to any other pattern i.e. P stays
if x = > then the substitution is certain. The values ⊥ and itself.
> are optional and user-specified. They may appear or not Definition 5. (Elementary substitution probability) Given
in M. The scores in M should respect the following two two nodes v and v 0 having correspondingly the labels l, l0 ∈
properties: L, the elementary substitution probability, Sel (v, v 0 ), mea-
1. ∀ l ∈ L, ∃ l’ ∈ L | M(l, l0 ) 6= ⊥, sures the possibility that v substitutes v 0 .
2. ∀ l ∈ L, if ∃ l’ ∈ L | M(l, l0 ) = > then ∀ l” ∈ L\{l, l’ }, M(l, l0 )
Sel (v, v 0 ) = (5)
M(l, l”) = ⊥ and M(l0 , l”) = ⊥. M(l, l)
In many real world applications, the substitution matrices It is worth noting that Sel is not bijective i.e. Sel (v, v 0 ) 6=
may contain at the same time positive and negative scores. Sel (v 0 , v).
Definition 6. (Pattern substitution score) Given two pat- Algorithm 1: UnSubPatt
terns P = (VP , EP , L) and P 0 = (VP 0 , EP 0 , L) having the
Data: Ω, M, τ , (⊥,>) [Optional]
same shape, we denote by Spatt (P, P 0 ) the substitution score
Result: Ω∗ : {unsubstituted patterns}
of P 0 by P . In other words, it measures the possibility that P
begin
mutates to P 0 by computing the sum of the elementary sub-
Ω ← {Ωk ← {P ∈ Ω | ∀(P 0 , P ”) ∈ Ωk , |VP 0 | =
stitution probabilities then normalize it by the total number
|VP ” | and |EP 0 | = |EP ” |}};
of nodes of P . Formally:
foreach Ωk ∈ Ω do
Ωk ← sort(Ωk by Mpatt );
P|VP | 0
i=1 Sel (P [i], P [i])
Spatt (P, P 0 ) = (6)
| VP | foreach P ∈ Ωk do
if Mpatt (P ) > 0 then
Definition 7. (Pattern substitution) A pattern P substi- foreach P 0 ∈ Ωk \P |
tutes a pattern P 0 , we note subst(P, P 0 , τ ) = true, iff: Mpatt (P 0 ) < Mpatt (P ) do
if Mpatt (P 0 ) > 0 then
1. P and P 0 are structurally isomorphic (shape(P, P 0 ) = if shape(P, P 0 ) and
true), subst(P, P 0 , τ ) then
merge support(P, P 0 );
2. Spatt (P, P 0 ) ≥ τ , where τ is a user-specified threshold remove(P 0 , Ωk );
such that 0% ≤ τ ≤ 100%.
Definition 8. (Unsubstituted pattern) Given a threshold Ω∗ ← Ω∗ ∪ Ωk ;
τ and Ω∗ ∈ Ω, a pattern P ∗ ∈ Ω∗ is said to be unsubstituted
iff @P ∈ Ω∗ | Mpatt (P ) > Mpatt (P ∗ ) and subst(P, P ∗ , τ ) =
true.
Theorem 1. Let Ω be a set of patterns and Ω∗ its subset
Proposition 1 (Null Mpatt case). Given a pattern P = of unsubstituted patterns based on a substitution matrix M
(Vp , Ep , L) ∈ Ω, if Mpatt (P ) = 0 then P is an unsubstituted and a threshold τ , i.e., UnSubPatt (Ω, M, τ, (⊥, >)) = Ω∗ .
pattern. Then :
Proof. The proof can be simply deduced from Defini- UnSubPatt(Ω∗ , M, τ, (⊥, >)) = Ω∗ (8)
tions 3 and 4. Proof. The proof can be deduced simply from Definition
8. Given a threshold τ , Ω∗ can not be summarized by its
Definition 9. (Merge support) Given two patterns P and subsets unless itself. Formally:
P 0 , if P substitutes P 0 then P will represent P 0 in the list
∀P ∈ Ω∗ , @P 0 ∈ Ω∗ |Mpatt (P ) > Mpatt (P 0 ) and subst(P, P 0 , τ )
of graphs where P 0 occurs. Formally:
(9)
∀(P, P 0 ) | subst(P, P 0 , τ ) = true then DP = DP ∪ DP 0 (7)
where DP and DP 0 are correspondingly the occurrence set 3.4 Complexity
of P and that of P 0 . Suppose Ω contains n patterns. Ω is divided into g groups,
each containing patterns of order k. This is done in O(n).
3.3 Algorithm Each group Ωk is sorted in O(|Ωk | ∗ log|Ωk |). Searching
Given a set of patterns Ω and a substitution matrix M, we for unsubstituted patterns requires browsing Ωk (O(|Ωk |))
propose UnSubPatt(see Algorithm 1), a pattern selection and for each pattern, browsing in the worst case all remain-
algorithm which enables detecting the set of unsubstituted ing patterns (O(|Ωk |)) to check the shape (O(k)) and the
patterns Ω∗ within Ω. Based on our similarity concept, all substitution (O(k)). This means that searching for unsub-
the patterns in Ω∗ are dissimilar, since it does not contain stituted patterns in a group Ωk can be done in O(|Ωk |2 ∗ k2 ).
any pair of patterns that are substitutable. This represents Hence, in the worst case, the complexity of our algorithm is
a reliable summarization of Ω. O(g ∗ m2max ∗ kmax
2
), where kmax is the maximum pattern or-
The general process of the algorithm is described as fol- der and mmax is the number of patterns of the largest group
lows: first, Ω is divided into subsets of patterns having the Ωk .
same number of nodes and edges. Then, each subset is sorted
in a descending order by the pattern mutation probability 4. EXPERIMENTS
Mpatt . Each subset is browsed starting from the pattern
having the highest Mpatt . For each pattern, we remove all 4.1 Datasets
the patterns it substitutes and we merge their supports such In order to experimentally evaluate our approach, we use
that the preserved pattern will represent all the removed four graph datasets of protein structures, which also have
ones wherever they occurs. The remaining patterns repre- been used in [28] then [10]. Each dataset consists of two
sent the unsubstituted pattern set. Though, Ω∗ can not be classes: positive and negative. Positive samples are proteins
summarized by a subset of it but itself. Our algorithm uses selected from a considered protein family whereas negative
Proposition 1 to avoid unnecessary computation related to samples are proteins randomly gathered from the Protein
patterns with Mpatt = 0. They are directly considered as Data Bank [4]. Each protein is parsed into a graph of amino
unsubstituted patterns, since they can not mutate to any acids. Each node represents an amino acid residue and is la-
other pattern. beled with its amino acid type. Two nodes u and v are linked
by an edge e(u, v) = 1 if the euclidean distance between their
two Cα atoms ∆(Cα (u), Cα (v)) is below a threshold distance Table 2: Number of frequent subgraphs (Ω), repre-
δ. Formally: sentative unsubstituted subgraphs (Ω∗ ) and the se-
lection rate
| Ω∗ | Selection rate (%)
(
1, if ∆(Cα (u), Cα (v)) ≤ δ Dataset |Ω|
e(u, v) = (10) DS1 799094 7291 0.91
0, otherwise
DS2 258371 15898 6.15
In the literature, many methods use this definition with DS3 114792 14713 12.82
usually δ ≥ 7Å on the argument that Cα atoms define the DS4 1073393 9958 0.93
overall shape of the protein conformation [13]. In our exper-
iments, we use δ = 7Å.
Table 1 summarizes the characteristics of each dataset. 5.1 Empirical Results
SCOP ID, Avg.|V|, Avg.|E|, Max.|V| and Max.|E| corre-
spond respectively to the id of the protein family in SCOP Here, we show the results of our experiments obtained
database [2], the average number of nodes, the average num- in terms of number of motifs and classification results. As
ber of edges, the maximal number of nodes and the maximal mentioned before, we use gSpan to extract the frequent sub-
number of edges in each dataset. graphs from each dataset with frequency ≥ 30%. Then, we
use UnSubPatt to select the unsubstituted patterns among
4.2 Protocol and Settings them with a substitution threshold τ =30% and using Blo-
Generally, in a pattern selection approach two aspects are sum62 as substitution matrix. At last, we perform a 5-fold
emphasized, namely the number of selected patterns and cross-validation classification (5 runs) to evaluate the inter-
their interestingness. In order to evaluate our approach, estingness of each subset using the NB classifier. The ob-
we first use the state-of-the-art method of frequent sub- tained average results are reported in Table 3.
graph discovery gSpan [29] to find the frequent subgraphs in The high number of discovered frequent subgraphs is due
each dataset with a minimum frequency threshold of 30%. to their combinatorial nature (this was discussed in the in-
Then, we use UnSubPatt to select the unsubstituted pat- troductory section). The results reported in Table 2 show
terns among them with a minimum substitution threshold that our approach decreases considerably the number of sub-
τ =30%. For our approach, we use Blosum62 [8] as the sub- graphs. The selection rate shows that the number of unsub-
stitution matrix since it turns out that it performs well on stituted patterns | Ω∗ | does not exceed 13% of the initial set
detecting the majority of weak protein similarities, and it is of frequent subgraphs | Ω | with DS3 and even reaches less
used as the default matrix by most biological applications than 1% with DS1 and DS4. This proves that exploiting the
such as BLAST [1]. It is worth mentioning that the choice domain knowledge, which in this case consists in the sub-
of 30% as minimum frequency threshold for the frequent stitution matrix, enables emphasizing information that can
subgraph extraction is to make the experimental evaluation possibly be ignored when using exiting subgraph selection
feasible due to time and computational limitations. approaches.
In order to evaluate the number of selected subgraphs, we The classification results reported in Table 3 help to evalu-
define the selection rate as the rate of the number of unsub- ate the interestingness of the selected patterns. Indeed, this
stituted subgraphs from the initial set of frequent subgraphs. will demonstrate if the unsubstituted patterns were arbitrar-
Formally : ily selected or they are really representative. Table 3 shows
that the classification accuracy significantly increases with
|Ω∗ | ∗ 100
Selection rate = (11) all datasets. We notice a huge leap in accuracy especially
|Ω| with DS1 and DS4 with a gain of more than 17% and reach-
To evaluate the interestingness of the set of selected pat- ing almost full accuracy with DS4. To better understand the
terns, we use them as features for classification. We per- accuracy results, we also reports the average precision, re-
form a 5-fold cross-validation classification (5 runs) on each call, F-measure and AUC values for all cases. We notice an
protein-structure dataset. We encode each protein into a enhancement of performance with all the mentioned qual-
binary vector, denoting by ”1” or ”0” the presence or the ab- ity metrics. This supports the reliability of our selection
sence of the feature in the considered protein. To judge the approach.
interestingness of the selected subgraphs, we use one of the
most known classifier, namely the naı̈ve bayes (NB) classi- 5.2 Results Using Other Substitution Matri-
fier, due to its simplicity and fast prediction and that its ces
classification technique is based on a global and conditional Besides Blosum62, biologists also defined other substitu-
evaluation of the input features. NB is used with the default tion matrices describing the likelihood that two amino acid
parameters from the workbench Weka [26]. types would mutate to each other in evolutionary time. We
want to study the effect of using other substitution matrices
5. RESULTS AND DISCUSSION on the experimental results. Though, we perform the same
In this section, we conduct experiments to examine the experiments following the same protocol and settings but us-
effectiveness and efficiency of UnSubPatt in finding the ing two other substitution matrices, namely Blosum80 and
representative unsubstituted subgraphs. We test the effect P am250. We compare the obtained results in terms of num-
of changing the substitution matrix and the substitution ber of subgraphs and classification accuracy with those ob-
threshold on the results. Moreover, we study the size-based tained using the whole set of frequent subgraphs and those
distribution of patterns and we compare the results of our using subgraphs selected by UnSubPatt with Blosum62.
approach with those of other subgraph selection methods The results are reported in Table 4. A high selection rate
from the literature. accompanied with a clear enhancement of the classification
Table 1: Experimental data
Dataset SCOP ID Family name Pos. Neg. Avg.|V| Avg.|E| Max.|V| Max.|E|
DS1 52592 G proteins 33 33 246 971 897 3 544
DS2 48942 C1 set domains 38 38 238 928 768 2 962
DS3 56437 C-type lectin domains 38 38 185 719 755 3 016
DS4 88854 Kinases, catalytic subunit 41 41 275 1077 775 3 016
Table 3: Accuracy, precision, recall (sensitivity), F-score and AUC of the classification of each dataset using
NB coupled with frequent subgraphs (FSg) then representative unsubstituted subgraphs (UnSubPatt)
Accuracy Precision Recall F-score AUC
Dataset
FSg UnSubPatt FSg UnSubPatt FSg UnSubPatt FSg UnSubPatt FSg UnSubPatt
DS1 0.62 0.78 0.61 0.69 0.70 0.90 0.64 0.78 0.64 0.78
DS2 0.80 0.90 0.86 0.94 0.74 0.86 0.79 0.89 0.79 0.89
DS3 0.86 0.94 0.89 1.00 0.86 0.89 0.86 0.94 0.86 0.94
DS4 0.79 0.98 0.86 0.92 0.70 0.98 0.76 0.94 0.76 0.94
accuracy is noticed using UnSubPatt with all the substitu-
tion matrices compared to the results using the whole set of
frequent subgraphs. It is clearly noticed that even using dif-
ferent substitution matrices, UnSubPatt shows relatively
similar behavior and is able to select a small yet relevant
subset of patterns. It is also worth mentioning that for all
the datasets, the best classification accuracy is obtained us-
ing Blosum62 and the best selection rate is achieved using
Pam250. This is simply due to how distant proteins within
the same dataset are, since each substitution matrix was
constructed to implicitly express a particular theory of evo-
lution. Though, choosing the appropriate substitution ma-
trix can influence the outcome of the analysis.
Figure 2: Rate of unsubstituted patterns from Ω
5.3 Impact of Substitution Threshold depending on the substitution threshold (τ ).
In our experiments, we used a substitution threshold (of
30%) to select the unsubstituted patterns from the set of
discovered frequent subgraphs. In this section, we study the This fact is illustrated in Figures 3, 4 and 5 which show
impact of variation of the substitution threshold on both the that the unsubstituted patterns allow better classification
number of selected subgraphs and the classification results. performance compared to the original set of frequent sub-
To do so, we perform the same experiments while varying graphs. UnSubPatt scores very well with the three used
the substitution threshold from 0% to 90% with a step-size classifiers and even reaches full accuracy in some cases. This
of 10. In order to check if the enhancements of the obtained confirms our assumptions and shows that our selection is re-
results are due to our selected features or to the classifier, liable and contributes to the enhancement of the accuracy.
we use two other well-known classifiers namely the support However, we believe that NB allows the most reliable eval-
vector machine (SVM) and decision tree (C4.5) besides the uation because it performs a classification based on a global
naı̈ve bayes (NB) classifier. We use the same protocol and and conditional evaluation of features, unlike SVM which
settings of the previous experiments. Figure 2 presents the performs itself another attribute selection to select the sup-
selection rate for different substitution thresholds and Fig- port vectors and unlike C4.5 which performs an attribute by
ures 3, 4 and 5 illustrate the classification accuracy obtained attribute evaluation.
respectively using NB, SVM and C4.5 with each dataset. In the case of proteins, a substitution threshold of 0%
The classification accuracy of the initial set of frequent sub- enables selecting subgraphs based only on their structure.
graphs (gSpan, the line in red) is considered as a standard Precisely, UnSubPatt will select only one pattern from each
value for comparison. Thus, the accuracy values of UnSub- group of subgraphs that are structurally isomorphic. Based
Patt (in blue) that are above the line of the standard value on the experimental results, we believe that using these pat-
are considered as gains, and those under the standard value terns is enough for a structural classification task since it
are considered as losses. allows a fast selection, selects a very small number of sub-
In Figure 2, we notice that UnSubPatt reduces consid- graphs and performs very well on classification.
erably the number of subgraphs especially with lower sub-
stitution thresholds. In fact, the number of unsubstituted 5.4 Size-based Distribution of Patterns
patterns does not exceed 30% for all substitution thresholds In this section, we study the distribution of patterns based
below 70% and even reaches less then 1% in some cases. on their size (number of edges). We try to check which sizes
This important reduction in the number of patterns comes of patterns are more concerned by the selection. The Figures
with a notable enhancement of the classification accuracies. 6 and 7 draw the distribution of patterns for the original set
Table 4: Number of subgraphs (#SG) and accuracy (ACC) of the classification of each dataset using NB
coupled with frequent subgraphs (FSg) then representative unsubstituted subgraphs using Blosum80 (Un-
SubPatt Blosum80) and Pam250 (UnSubPatt Pam250)
FSg UnSubPatt Blosum62 UnSubPatt Blosum80 UnSubPatt Pam250
Dataset
#SG Accuracy #SG Accuracy #SG Accuracy #SG Accuracy
DS1 799094 0.62 7291 0.78 7328 0.67 6137 0.68
DS2 258371 0.80 15898 0.90 15930 0.87 15293 0.87
DS3 114793 0.86 14713 0.94 14792 0.91 14363 0.93
DS4 1073393 0.79 9958 0.98 10417 0.90 9148 0.90
Figure 3: Classification accuracy by NB. Figure 6: Distribution of patterns of DS1 for all the
frequent subgraphs and for the representative un-
substituted ones with all the substitution thresholds
Figure 4: Classification accuracy by SVM.
Figure 7: Distribution of patterns of DS2 for all the
frequent subgraphs and for the representative un-
substituted ones with all the substitution thresholds
of frequent subgraphs and for the final set of representative
unsubstituted subgraphs with all the substitution thresholds
using Blosum62. The downward tendency of UnSubPatt
using lower substitution thresholds and with respect to the
original set of frequent subgraph is very clear. In fact, Un-
SubPatt leans towards cutting off the peaks and flattening
the curves with lower substitution thresholds. Another in-
teresting observation is that the curves are flattened in the
regions of small patterns as well as in those of big and dense
patterns. This demonstrates the effectiveness of UnSub-
Patt with both small and big patterns.
Figure 5: Classification accuracy by C4.5. 5.5 Comparison with other approaches
To objectively evaluate our approach, we compare it with
current trends in subgraph selection. In Figure 8, we report
Table 5: Runtime analysis of UnSubPatt with dif-
ferent substitution thresholds
Number of Substitution thresholds
patterns τ = 10% τ = 30% τ = 50%
10000 4s 4s 4s
20000 8s 8s 10s
30000 13s 13s 17s
40000 18s 18s 25s
50000 23s 23s 33s
60000 28s 28s 41s
Figure 8: Classification accuracy comparison with 70000 35s 35s 52s
other pattern selection approaches.
80000 40s 42s 66s
90000 46s 49s 80s
the classification accuracy using the representative unsub- 100000 53s 57s 136s
stituted patterns of UnSubPatt besides those using pat-
terns of other new subgraph selection approaches from the
literature namely LEAP[28], gPLS[20], COM[16] and LPG- of subgraphs having the same size and order. Hence, these
BCMP[10] (reported and explained in the introductory sec- groups can be distributed and treated separately in different
tion). processes.
For UnSubPatt, we report the results obtained using
the substitution matrix Blosum62, a minimum substitution 6. CONCLUSION
threshold τ = 30% and SVM for classification. For LEAP+SVM, In this paper, we proposed a novel selection approach for
LEAP is used iteratively to discover a set of discrimina- mining a representative summary of the set of frequent sub-
tive subgraphs with a leap length=0.1. The discovered sub- graphs. Unlike existent methods that are based on the re-
graphs are consider as features to train SVM with a 5-fold lations between patterns in the transaction space, our ap-
cross validation. COM is used with tp = 30% and tn = 0%. proach considers the distance between patterns in the pat-
For gPLS, the frequency threshold is set to 30% and the best tern space. The proposed approach exploits a specific do-
accuracies are reported for all the datasets among all the pa- main knowledge, in the form of a substitution matrix, to se-
rameters combinations from m = 2, 4, 8, 16 and k = 2, 4, 8, lect a subset of representative unsubstituted patterns from
16, where m is the number of iterations and k is the number a given set of frequent subgraphs. It also allows to reduce
of patterns obtained per search. For LPGBCMP, threshold considerably the size of the initial set of subgraphs to ob-
values of maxv ar = 1 and δ = 0.25 were respectively used tain an interesting and representative one enabling easier
for feature consistency map building and for overlapping. and more efficient further explorations. It is also worth men-
The obtained results are reported in the Figure 8. tioning that our approach can be used on graphs as well as
The classification results displayed in Figure 8 show that on sequences and is not limited to classification tasks, but
UnSubPatt allows a better classification than all the other can help in other subgraph-based analysis such as indexing,
pattern selection methods in the four cases. Considering clustering, visual inspection, etc.
only these results does not allow to confirm that UnSub- Since the proposed approach is a filter approach, a promis-
Patt would always outperform the considered methods. How- ing future direction could be to find a way to integrate the
ever, this proves that UnSubPatt represents a very compet- selection within the extraction process in order to directly
itive and promising approach. It is also worth noting that mine the representative patterns from data. Moreover, we
these approaches are supervised and dedicated to classifica- intend to use our approach in other classification contexts
tion unlike UnSubPatt which is an unsupervised approach. and in other mining applications.
This allows it to be used in classification as well as in other
mining tasks such as clustering and indexing.
7. ACKNOWLEDGEMENT
5.6 Runtime Analysis This work is supported by a PhD grant from the French
To study the variation of UnSubPatt’s runtime with larger Ministry of Higher Education to the first author.
amounts of data, we use different sets of frequent patterns
from 10000 to 100000 with step-size of 10000. In Table 5, we 8. REFERENCES
report the runtime results for the pattern sets using three
substitution thresholds. [1] S. Altschul, W. Gish, W. Miller, E. Myers, and
Even though the complexity of the problem due to the D. Lipman. Basic local alignment search tool. Journal
combinatorial test of substitution between subgraphs, our of Molecular Biology, 215:403–410, 1990.
algorithm is scalable with higher amounts of data. With in- [2] A. Andreeva, D. Howorth, J.-M. Chandonia, S. E.
creasing number of patterns, the runtime is still reasonable. Brenner, T. J. P. Hubbard, C. Chothia, and A. G.
The use of different substitution thresholds slightly affected Murzin. Data growth and its impact on the scop
the runtime of UnSubPatt, since the number of selected database: new developments. Nucleic Acids Research,
patterns is comparable for all thresholds. 36(1):D419–D425, 2008.
A possible way to make UnSubPatt runs faster is par- [3] L. Bartoli, P. Fariselli, and R. Casadio. The effect of
allelization. In fact, UnSubPatt can be easily parallelized, backbone on the small-world properties of protein
since it tests separately the substitution among each group contact maps. Physical Biology, 4(4):L1+, 2007.
[4] H. M. Berman, J. D. Westbrook, Z. Feng, G. Gilliland, [20] H. Saigo, N. Krämer, and K. Tsuda. Partial least
T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. squares regression for graph mining. In ACM
Bourne. The protein data bank. Nucleic Acids knowledge discovery and data mining conference
Research, 28(1):235–242, 2000. (KDD), pages 578–586, 2008.
[5] V. Chaoji, M. A. Hasan, S. Salem, J. Besson, and [21] H. Sun, A. Sacan, H. Ferhatosmanoglu, and Y. Wang.
M. J. Zaki. Origami: A novel and effective approach Smolign: A spatial motifs-based protein multiple
for mining representative orthogonal graph patterns. structural alignment method. IEEE/ACM
Statistical Analysis and Data Mining, 1(2):67–84, Transactions on Computational Biology and
2008. Bioinformatics, 9(1):249–261, 2012.
[6] H. Cheng, X. Yan, and J. Han. Mining graph patterns. [22] M. Thoma, H. Cheng, A. Gretton, J. Han, H.-P.
In Managing and Mining Graph Data, pages 365–392. Kriegel, A. Smola, L. Song, P. S. Yu, X. Yan, and
Springer, 2010. K. M. Borgwardt. Discriminative frequent subgraph
[7] A. L. Cuff, I. Sillitoe, T. Lewis, A. B. Clegg, mining with optimality guarantees. Statistical Analysis
R. Rentzsch, N. Furnham, M. Pellegrini-Calace, D. T. and Data Mining, 3(5):302–318, 2010.
Jones, J. M. Thornton, and C. A. Orengo. Extending [23] L. T. Thomas, S. R. Valluri, and K. Karlapalem.
cath: increasing coverage of the protein structure Margin: Maximal frequent subgraph mining. ACM
universe and linking structure with function. Nucleic Transactions on Knowledge Discovery from Data
Acids Research, 39:420–426, 2011. (TKDD), 4(3):10:1–10:42, 2010.
[8] S. R. Eddy. Where did the blosum62 alignment score [24] R. R. Vallabhajosyula, D. Chakravarti, S. Lutfeali,
matrix come from? Nature Biotechnology, pages A. Ray, and A. Raval. Identifying hubs in protein
1035–1036, 2004. interaction networks. PLoS ONE, 4(4):e5344, 2009.
[9] K. Faust, P. Dupont, J. Callut, and J. van Helden. [25] N. von Öhsen, I. Sommer, R. Zimmer, and
Pathway discovery in metabolic networks by subgraph T. Lengauer. Arby: automatic protein structure
extraction. Bioinformatics, 26(9):1211–1218, 2010. prediction using profile-profile alignment and
[10] H. Fei and J. Huan. Boosting with structure confidence measures. Bioinformatics,
information in the functional space: an application to 20(14):2228–2235, 2004.
graph classification. In ACM knowledge discovery and [26] I. H. Witten and E. Frank. Data Mining: Practical
data mining conference (KDD), pages 643–652, 2010. Machine Learning Tools and Techniques. Morgan
[11] M. A. Hasan. Pattern summarization in pattern Kaufmann, 2005.
mining. Encyclopedia of Data Warehousing and [27] A. Woznica, P. Nguyen, and A. Kalousis. Model
Mining, (2nd Ed), 2008. mining for robust feature selection. In ACM knowledge
[12] M. A. Hasan and M. J. Zaki. Output space sampling discovery and data mining conference (KDD), pages
for graph patterns. PVLDB, 2(1):730–741, 2009. 913–921, 2012.
[13] J. Huan, D. Bandyopadhyay, W. Wang, J. Snoeyink, [28] X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining
J. Prins, and A. Tropsha. Comparing graph significant graph patterns by leap search. ACM
representations of protein structure for mining SIGMOD international conference on Management of
family-specific residue-based packing motifs. Journal data, pages 433–444, 2008.
of Computational Biology, 12(6):657–671, 2005. [29] X. Yan and J. Han. gspan: Graph-based substructure
[14] J. Huan, W. Wang, D. B, J. Snoeyink, J. Prins, and pattern mining. Order A Journal On The Theory Of
A. Tropsha. Mining spatial motifs from protein Ordered Sets And Its Applications, 02:721–724, 2002.
structure graphs. In International Conference on [30] X. Yan and J. Han. Closegraph: mining closed
Research in Computational Molecular Biology frequent graph patterns. In ACM knowledge discovery
(RECOMB), pages 308–315, 2004. and data mining conference (KDD), pages 286–295,
[15] J. Huan, W. Wang, and J. Prins. Efficient mining of 2003.
frequent subgraphs in the presence of isomorphism. In
IEEE International Conference on Data Mining
(ICDM), pages 549–552, 2003.
[16] N. Jin, C. Young, and W. Wang. Graph classification
based on pattern co-occurrence. In ACM International
Conference on Information and Knowledge
Management, pages 573–582, 2009.
[17] G. J. Kleywegt. Recognition of spatial motifs in
protein structures. Journal of molecular biology,
285(4):1887–1897, 1999.
[18] S. Nijssen and J. N. Kok. A quickstart in frequent
structure mining can make a difference. In ACM
knowledge discovery and data mining conference
(KDD), pages 647–652, 2004.
[19] R. Saidi, M. Maddouri, and E. Mephu Nguifo. Protein
sequences classification by means of feature extraction
with substitution matrices. BMC Bioinformatics,
11(1):175+, 2010.