M2 SP: Mining Sequential Patterns among Several
Dimensions
M. Plantevit1 , Y.W. Choong2,3, A. Laurent1 , D. Laurent2 and M. Teisseire1
1LIRMM, Université Montpellier 2, CNRS, 161 rue Ada, 34392 Montpellier, France
2 LICP, Université de Cergy Pontoise, 2 av. Chauvin, 95302 Cergy-Pontoise, France
3 HELP University College, BZ-2 Pusat Bandar Damansara, 50490 Kuala Lumpur, Malaysia
Abstract. Mining sequential patterns aims at discovering correlations between
events through time, like A customer who bought a TV and a DVD player at
the same time later bought a recorder. Even if many works have dealt with se-
quential pattern mining, there is no work leading to rules like A customer who
bought a surfboard together with a bag in NY later bought a wetsuit in SF. In this
paper, we propose M 2 SP, a complete approach to mine such multidimensional
sequential patterns. The main originality of our proposition is that we obtain not
only intra-pattern sequences but also inter-pattern sequences. Moreover, we con-
sider generalized multidimensional sequential patterns, called jokerized patterns,
in which some of the dimension values may not be instanciated. Experiments on
synthetic data are reported and show the scalability of our approach.
Keywords: Data Mining, Sequential Patterns, Multidimensional Rules.
1 Introduction
Mining sequential patterns aim at discovering correlations between events through time.
For instance, rules that can be built are A customer who bought a TV and a DVD player
at the same time later bought a recorder. Work dealing with this issue in the literature
have proposed scalable methods and algorithms to mine such rules [9]. As for asso-
ciation rules, the efficient discovery is based on the support which indicates to which
extend data from the database contains the patterns.
However, these methods only consider one dimension to appear in the patterns,
which is usually called the product dimension. This dimension may also represent
web pages for web usage mining, but there is normally a single dimension. Although
some works from various studies claim to combine several dimensions, we argue here
that they do not provide a complete framework for multidimensional sequential pattern
mining [4,8,11]. The way we consider multidimensionality is indeed generalized in the
sense that patterns must contain several dimensions combined over time. For instance
we aim at building rules like A customer who bought a sur f board and a bag in NY later
bought a wetsuit in SF. This rule not only combines two dimensions (City and Product)
but it also combines them over time (NY appears before SF, surfboard appears before
wetsuit). As far as we know, no method has been proposed to mine such rules.
In this paper, we present existing methods and their limits from several studies in
recent years. We define then the basic concepts associated to our proposition M 2 SP and
the algorithms to build such rules. Experiments are performed on synthetic data to as-
sess our proposition.
In our approach, sequential patterns are mined from a relational table, that can be
seen as a fact table in a multidimensional database. This is why, contrary to the stan-
dard terminology of the relational model, the attributes over which a relational table is
defined are called dimensions.
In order to mine such frequent sequences, we extend our approach so as to take
into account partially instanciated tuples in sequences. More precisely, our algorithms
are designed in order to mine frequent jokerized multidimensional sequences contain-
ing as few ∗ as possible, i.e., replacing an occurrence of ∗ with any value from the
corresponding domain cannot give a frequent sequence.
This paper is organized as follows. Section 2 introduces a motivating example illus-
trating the goal of our work. Section 3 presents existing works concerning sequential
patterns, multidimensional approaches and multidimensional databases. Section 4 in-
troduces our contribution, M2 SP, as an approach to discover multidimensional sequen-
tial patterns. Section 5 introduces jokerized patterns. Section 6 presents the algorithms.
Section 7 presents results of experiments performed on synthetic data.
2 Motivating Example
In order to illustrate our approach, we consider the following example that will be used
throughout the paper as a running example.
Let us consider a relational table T in which transactions issued by customers are
stored. More precisely, we assume that T is defined over six dimensions (or attributes)
as shown in Fig. 1, and where: D is the date of transactions (considering three dates,
denoted by 1, 2 and 3), CG is the category of customers (considering two categories,
denoted by Educ and Ret, standing for educational and retired customers, respectively),
A is the age of customers (considering three discretized values, denoted by Y (young),
M (middle) and O (old)), C is the city where transactions have been issued (considering
three cities, denoted by NY (New York), LA (Los Angeles) and SF (San Francisco)), P
is the product of the transactions (considering four products, denoted by c, m, p and r), Q
stands for the quantity of products in the transactions (considering nine such quantities).
For instance, the first tuple of T (see Fig. 1) means that, at date 1, educational young
customers bought 50 products c in New York. Let us now assume that we want to ex-
tract all multidimensional sequences that deal with the age of customers, the products
they bought and the corresponding quantities, and that are frequent with respect to the
groups of customers and the cities where transactions have been issued. To this end, we
consider three sets of dimensions as follows: (i) the dimension D, representing the date,
(ii) the three dimensions A, P and Q that we call analysis dimensions (tuples over anal-
ysis dimensions are those that appear in the items that constitute the sequential patterns
to be mined), (iii) the two dimensions CG and C, that we call reference dimensions (the
table is partitioned according to tuple values over reference dimensions and the support
of a given multidimensional sequence is the ratio of the number of blocks supporting
the sequence over the total number of blocks. Fig. 5 displays the corresponding blocks
in our example).
In this framework, h{(Y, c, 50), (M, p, 2)}, {M, r, 10)}i is a multidimensional sequence
having support 13 , since the partition according to the reference dimensions contains
3 blocks, among which one supports the sequence. This is so because (Y, c, 50) and
(M, p, 2) both appear at the same date (namely date 1), and (M, r, 10) appears later on
(namely at date 2) in the first block shown in Figure 4.
It is important to note that, in our approach, more general patterns, called joker-
ized sequences, can be mined. The reason for this generalization is that considering
partially instanciated tuples in sequences implies that more frequent sequences are
mined. To see this, considering a support threshold of 32 , no sequence of the form
h{(Y, c, µ)}, {(M, r, µ0 )}i is frequent. On the other hand, in the first two blocks of Fig.
5, Y associated with c and M associated with r appear one after the other, according
to the date of transactions. Thus, we consider that the jokerized sequence, denoted by
h{(Y, c, ∗)}, {(M, r, ∗)}i, is frequent since its support is equal to 32 .
D CG C A P Q
(Date) (Customer-Group) (City) (Age) (Product) (Quantity)
1 Educ NY Y c 50
1 Educ NY M p 2
1 Educ LA Y c 30
1 Ret. SF O c 20
1 Ret. SF O m 2
2 Educ NY M p 3
2 Educ NY M r 10
2 Educ LA Y c 20
3 Educ LA M r 15
Fig. 1. Table T
3 Related Work
3.1 Sequential Patterns
An early example of research in the discovering of patterns from sequences of events
can be found in [5]. In this work, the idea is the discovery of rules underlying the gen-
eration of a given sequence in order to predict a plausible sequence continuation. This
idea is then extended to the discovery of finding interesting patterns (or rules) embedded
in a database of sequences of sets of events (items). A more formal approach in solving
the problem of mining sequential patterns is the AprioriAll algorithm as presented in
[6]. Given a database of sequences, where each sequence is a list of transactions ordered
by transaction time, and each transaction is a set of items, the goal is to discover all se-
quential patterns with a user-specified minimum support, where the support of a pattern
is the number of data-sequences that contain the pattern. In [1], the authors introduce
the problem of mining sequential patterns over large databases of customer transactions
where each transaction consists of customer-id, transaction time, and the items bought
in the transaction. Formally, given a set of sequences, where each sequence consists of
a list of elements and each element consists of a set of items, and given a user-specified
min support threshold, sequential pattern mining is to find all of the frequent subse-
quences, i.e., the subsequences whose occurrence frequency in the set of sequences is
no less than min support. Sequential pattern mining discovers frequent patterns ordered
by time. An example of this type of pattern is A customer who bought a new television 3
months ago, is likely to buy a DVD player now. Subsequently, many studies have intro-
duced various methods in mining sequential patterns (mainly in time-related data) but
almost all proposed methods are Apriori-like, i.e., based on the Apriori property which
states the fact that any super-pattern of a nonfrequent pattern cannot be frequent. An
example using this approach is the GSP algorithm [9].
3.2 Data to be Mined
We assume that the reader is familiar with the relational model of databases [10], and we
briefly recall the basic ingredients of this model used in this paper. Let U = {D 1 , . . . Dn }
be a set of attributes, which we call dimensions in our approach. Each dimension D i is
associated with a (possibly infinite) domain of values, denoted by dom(D i ). A relational
table T over universe U is a finite set of tuples t = {d1 , . . . , dn } such that, for every
i = 1, . . . , n, di ∈ dom(Di ). Moreover, given a table T over U, for every i = 1, . . . , n,
we denote by DomT (Di ) (or simply Dom(Di ) if T is clear from the context) the active
domain of Di in T , i.e., the set of all values of dom(Di ) that occur in T .
Since we are interested in sequential patterns, we assume that U contains at least
one dimension whose domain is totally ordered, corresponding to the time dimension.
In our running example, we consider U = {D,CG, A, P, Q}, where D is the time dimen-
sion. Moreover, considering the table T of Fig. 1, we have for instance Dom(D) = {1, 2, 3} and
Dom(CG) = {Educ, Ret}.
3.3 Multidimensional Sequential Patterns
As far as we know, three propositions have been studied in order to deal with several
dimensions when building sequential patterns.
Pinto et al. [8] is the first paper dealing with several dimensions in the framework of
sequential patterns. For instance, purchases are not only described by considering the
customer ID and the products, but also by considering the age, type of the customer
(Cust-Grp) and the city where he lives, as shown in Fig. 1. Multidimensional sequen-
tial patterns are defined over the schema A1 , ..., Am , S where the set of Ai stands for the
dimensions describing the data and S stands for the sequence of items purchased by
the customers ordered over time. A multidimensional sequential pattern is defined as
(id1 ,(a1 , ..., am ),s) where ai ∈ Ai ∪ {∗}. id1 ,(a1 , ..., am ) is said to be a multidimensional
pattern. For instance, the authors consider the sequence ((∗, NY, ∗),hb f i) meaning that
customers from NY have all bought a product b and then a product f . Sequential pat-
terns are mined from such multidimensional databases either (i) by mining all frequent
sequential patterns over the product dimension and then regrouping them into multidi-
mensional patterns, (ii) or by mining all frequent multidimensional patterns and then
mining frequent product sequences over these patterns. Note that the sequences found
by this approach do not contain several dimensions since the dimension time only con-
cerns products. Dimension product is the only dimension that can be combined over
time, meaning that it is not possible to have a rule indicating that when b is bought in
Boston then c is bought in NY .
Yu et Chen. [11] considers sequential pattern mining in the framework of Web Usage
Mining. Even if they consider three dimensions (pages, sessions, days), these dimen-
sions are very particular since they belong to a single hierarchized dimension. More-
over, the sequences found describe correlations between objects over time by consider-
ing only one dimension, which corresponds to the web pages.
de Amo et al. [4] is based on first order temporal logic. This proposition is close to
our approach, but there are some limits since (i) groups used to compute the support
are predefined whereas we consider the fact that the user should be able to define them
(see reference dimensions below), (ii) several attributes cannot appear in the sequences.
The authors claim that they aim at considering several dimensions but they have only
shown one dimension for the sake of simplicity. However, the paper does not provide
any complete proposition to extend this point to real multidimensional patterns.
4 M2 SP: Mining Multidimensional Sequential Patterns
As shown above, no work has been done in order to mine sequential patterns over
several dimensions. The work described in [8] is said to be intra-pattern since sequences
are mined within the framework of a single description (the so-called pattern). In this
paper, we propose to generalize this work to inter-pattern multidimensional sequences.
4.1 Dimension Partition
D CG C A P Q D CG C A P Q D CG C A P Q
1 Educ NY Y c 50 1 Educ LA Y c 30 1 Ret. SF O c 20
1 Educ NY M p 2 2 Educ LA Y c 20 1 Ret. SF O m 2
2 Educ NY M p 3 3 Educ LA M r 15
2 Educ NY M r 10
Fig. 2. block (Educ, NY ) Fig. 3. block (Educ, LA) Fig. 4. block (Ret., SF)
Fig. 5. blocks defined on T over dimensions CG and C
For each table defined on the set of dimensions D, we consider a partition of D into
four sets: Dt for the temporal dimension, DA for the analysis dimensions, DR for the
reference dimensions, DF for the ignored dimensions.
Each tuple c = (d1 , . . . , dn ) can thus be denoted c = ( f , r, a,t) with f the restriction on
DF of c, r the restriction on DR , a the restriction on DA , t the restriction on Dt .
Given a table T , the set of all tuples in T having the same value r on D R is said to
be a block and we denote by BT,DR the set of blocks from table T . Fig. 5 shows blocks
built from table T . Each block B in BT,DR is denoted by the tuple r that defines it.
/ DR = {CG,C}, DA = {A, P}, Dt = {D}.
In our running example, we consider F = 0,
When mining multidimensional sequential patterns, the set DR identifies the blocks
of the database to be considered when computing the support. For this reason, this set
is called reference. The support of a sequence is the proportion of blocks embedding it.
Note that usual sequential patterns, and sequential patterns from [8] and [4], this set is
reduced to one dimension (cid in [8] or IdG in [4]). The set DA describes the analy-
sis dimensions, meaning that these dimensions will be found in the multidimensional
sequential patterns. Note that usual sequential patterns only consider one analysis di-
mension corresponding to the products purchased or the web pages visited. The set F
describes the ignored dimensions, which are used neither to define the date, nor the
blocks, and which are not present within the patterns mined.
4.2 Multidimensional Item, Itemset and Sequential Pattern
Definition 1 (Multidimensional Item) A multidimensional item e defined on DA =
{Di1 , . . . , Dim } is a tuple e = (di1 , . . . , dim ) such that ∀k ∈ [1, m], dik ∈ Dom(Dik ).
Definition 2 (Multidimensional Itemset) A multidimensional itemset i defined on DA =
{Di1 ,. . . ,Dim } is a non empty set of items i = {e1 , . . . , e p } where ∀ j ∈ [1, p], e j is a mul-
tidimensional item defined on DA and ∀ j, k ∈ [1, p], e j 6= ek .
Note that all items from an itemset are defined using the same dimensions (D A ). Also
note that all pairs of multidimensional items from an itemset are different.
Definition 3 (Multidimensional Sequence) A multidimensional sequence ς defined on
DA = {Di1 ,. . . ,Dim } is an ordered non empty list of itemsets ς = hi1 , . . . , il i where ∀ j ∈
[1, l], i j is a multidimensional itemset defined on DA .
In our running example, (Y, c, 50), (M, p, 2), (M, r, 10) are three multidimensional items on
DA = {A, P, Q}. {(Y, c, 50), (M, p, 2)} and {(M, r, 10)} are multidimensional itemsets on DA .
h{(Y, c, 50), (M, p, 2)}, {(M, r, 10)}i is a multidimensional sequence on DA .
Definition 4 (Inclusion of sequence) A multidimensional sequence ς = ha1 , . . . , al i is
said to be a subsequence of a sequence ς0 = hb1 , . . . , bl 0 i if there exist integers 1 ≤ j1 ≤
j2 ≤ . . . ≤ jl ≤ l 0 such that a1 ⊆ b j1 , a2 ⊆ b j2 , . . . , al ⊆ b jl .
Let ς = h{(Y, c, 50)}, {(M, r, 10)}i and ς0 = h{(Y, c, 50), (M, p, 2)}, {(M, r, 10)}i be two multi-
dimensional sequences. ς is a subsequence of ς0 .
4.3 Support
Computing the support of a sequence amounts to counting the number of blocks that
contain the sequence. A block supports a sequence if it is possible to find a set of tuples
which satisfy it. For each itemset from the sequence, we must thus find a date from
Dom(Dt ) such that all items from the itemset appear at this date. All itemsets must be
retrieved at different dates from Dom(Dt ) such that the order of the itemsets from the
sequence is satisfied.
Definition 5 A table T supports a sequence hi1 , . . . , il i if ∀ j = 1 . . . l, ∃d j ∈ Dom(Dt ),
for every item e in i j , ∃t = ( f , r, e, d j ) ∈ T with d1 < d2 < . . . < dl .
The block (Educ, NY ) from Fig. 2 supports h{(Y, c, 50), (M, p, 2)}, {(M, r, 10)}i since
{(Y, c, 50), (M, p, 2)} appears at date = 1 and {(M, r, 10)} appears at date = 2.
The support of a sequence in a table T is the proportion of blocks of T that support it.
Definition 6 (Sequence Support) Let DR be the reference dimensions and T be a ta-
ble partitioned into the set of blocks BT,DR . The support of a sequence ς is:
|{B∈BT,DR s.t. B supports ς}|
support(ς) = |BT,DR |
Definition 7 (Frequent Sequence) Let minsup ∈ [0, 1] be the minimum user-defined
support value. A sequence ς is said to be frequent if support(ς) ≥ minsup.
Note that an item e is said to be frequent if so is the sequence h{e}i.
Let us consider DR = {CG,C}, DA = {A, P}, minsup = 15 , ς = h{(Y, c, 50), (M, p, 2)}, {(M, r, 10)}i.
The three blocks of the partition of T from Fig. 5 must be scanned to compute support(ς).
1. block (Educ, NY) (Fig. 2). Two dates can be found in this block. At date 1, we have (Y, c, 50)
and (M, p, 2) and at date 2, we have (M, r, 10). Thus this block supports ς.
2. block (Educ, LA) (Fig. 3). This block cannot be counted since it does not contain (M, p, 2).
3. block (Ret., SF) (Fig. 4). This block contains only one date. Therefore, it is not possible ς to
be embedded in it.
We have thus support(ς) = 13 ≥ minsup.
5 Jokerized Sequential Patterns
Considering the definitions above, an item can only be retrieved if there exists a fre-
quent tuple of values from domains of DA containing it. For instance, it can occur that
neither (Y, r) nor (M, r) nor (O, r) is frequent whereas the value r is frequent. Thus, we
introduce the joker value ∗. In this case, we consider (∗, r)which is said to be jokerized.
Definition 8 (Jokerized Item) Let e = (d1 , . . . , dm ) a multidimensional item. We de-
note by e[di /δ] the replacement in e of di by δ. e is said to be a jokerized multidimen-
sional item if: (i) ∀i ∈ [1, m], di ∈ Dom(Di ) ∪ {∗}, and (ii) ∃i ∈ [1, m] such that di 6= ∗,
and (iii) ∀di = ∗, @δ ∈ Dom(Di ) such that e[di /δ] is frequent.
A jokerized item contains at least one specified analysis dimension. It contains a ∗
only if no specific value from the domain can be set. A jokerized sequence is a sequence
containing at least one jokerized item. A block is said to support a sequence if a set of
tuples containing the itemsets satisfying the temporal constraints can be found.
Definition 9 (Support of a Jokerized Sequence) A table T supports a jokerized se-
quence ς = his1 , . . . , isl i if ∀ j ∈ [1, l], ∃δ j ∈ Dom(Dt ), ∀e = (di1 , . . . , dim ) ∈ i j , ∃t =
( f , r, (xi1 , . . . , xim ), δ j ) ∈ T with di = xi or di = ∗ and δ1 < δ2 < . . . < δl .
|{B∈BT,DR s.t. B supports ς}|
The support of ς is the proportion of blocks containing ς. support(ς) = |BT,DR |
6 Algorithms
6.1 Mining 1-frequent items
1-frequent items are items built on the analysis dimensions. When considering no joker
value, a single scan of the database results in the multidimensional items being frequent.
When considering jokerized items, a levelwise algorithm is used in order to build the
frequent multidimensional items having the smallest possible number of joker values.
To this end, we consider a lattice which lower bound is the (∗, . . . , ∗) multidimensional
item. This lattice is partially built from (∗, . . . , ∗) up to the frequent items containing as
few ∗ as possible. At level i, i values are specified. Then items at level i are combined
to build a set of candidates at level i + 1. Two frequent items are combined to build a
candidate if they are on-compatible.
Definition 10 (o n-compatibility) Let e1 = (d1 , . . . , dn ) and e2 = (d10 , . . . , dn0 ) be two dis-
tinct multidimensional items where di and di0 ∈ dom(Di ) ∪ {∗}. e1 and e2 are said to be
n-compatible if ∃∆ = {Di1 , . . . , Din−2 } ⊂ {D1 , . . . , Dn } such that for every j ∈ [1, n − 2],
o
di j = di0 j 6= ∗ with din−1 = ∗ and di0n−1 6= ∗ and din 6= ∗ and di0n = ∗.
Definition 11 (Join) Let e1 = (d1 , . . . , dn ) and e2 = (d10 , . . . , dn0 ) be two on-compatible
multidimensional items. We define e1 o n e2 = (v1 , . . . , vn ) where vi = di if di = di0 , vi = di
if di0 = ∗ and vi = di0 if di = ∗.
Let E and E 0 be two sets of multidimensional items of size n, we define
Eon E 0 = {e o
n e0 s.t. (e, e0 ) ∈ E × E 0 ∧ e and e’ are o n-compatible}
Two multidimensional items defined on n dimensions are o n-compatible if they share n-
2 values. For instance, (NY,Y, ∗) and (∗,Y, r) are on-compatible. We have (NY,Y, ∗) o n
(∗,Y, r) = (NY,Y, r). On the contrary, (NY, M, ∗) and (NY,Y, ∗) are not o n-compatible.
Note that this method is close to the one used for iceberg cubes in [2,3].
F1i stands for the set of 1-frequent items having i dimensions which are specified
(different from ∗). Frequent items of size 1 are obtained from the candidate items of
size 1. More generally, the algorithm to build candidate items of size i are obtained by
considering the frequent items of size i − 1: F11 = { f ∈ Cand11, support( f ) ≥ minsup},
Cand1i = F1i−1 on F1i−1 .
6.2 Mining Jokerized Multidimensional Sequences
Once 1-frequent items are mined, the candidate sequences of size k (k ≥ 2) are generated
and validated to keep the frequent items. This computation is based on usual algorithms
such as PSP [7] by introducing the treatment of joker values.
The support of a sequence ς in a table T according to the reference dimensions D R
is given by Algorithm 1. The reference dimensions are used to compute the partition to
be considered. This algorithm checks whether each block of the partition supports the
sequence by calling the function supportTable (Algorithm 2). supportTable attempts to
find a tuple from the block that matches the first item of the first itemset of the sequence
in order to anchor the sequence. This operation is repeated recursively until all itemsets
from the sequence are found (return true) or until there is no way to go on further (return
false). Several possible anchors may be tested if the first ones do not fit.
Function supportcount
Data : ς, T, DR , counting
Result : support of ς
Integer support ←− 0 ; BooleanseqSupported;
BT,DR ←− {blocks o f T identi f ied over DR };
foreach B ∈ BT,DR do
seqSupported ←− supportTable(ς, B, counting) ;
if seqSupported then support ←− support + 1;
support
return |B |
T,DR
Algorithm 1: Support of a sequence (supportcount)
Function supportTable
Data : ς, T, counting //counting indicates whether joker values are considered or not
Result : Boolean
ItemSetFound ←− f alse ; seq ←− ς ; itset ←− sequence. f irst() ; it ←− itset. f irst()
if ς = 0/ then return (true) // End of Recursivity
while t ←− T.next 6= 0/ do
if supports(t, it, counting) then
if (NextItem ←− itset.second()) = 0/ then ItemSetFound ←− true
// Look for all the items from the itemset
else
// Anchoring on the item (date)
T 0 ←− σdate=t.date (T )
while t 0 ←− T 0 .next() 6= 0/ ∧ ItemSetFound = f alse do
if supports(t 0 , NextItem, counting) then NextItem ←− itset.next()
if NextItem = 0/ then ItemSetFound ←− true
if ItemSetFound = true then
// Look now for the other itemsets
return (supportTable(seq.tail(), σdate>t.date (T ), counting))
else
itset ←− seq. f irst()
C ←− σdate>t.date (T ) // Skip to next dates
return(false) // Not found
Algorithm 2: supportTable (Checks if a sequence ς is supported by a table T )
7 Experiments
In this section, we report experiments performed on synthetic data. These experiments
aim at showing the interest and scalability of our approach, especially in the jokerized
approach. As many databases from the real world include quantitative information, we
have distinguished a quantitative dimension. In order to highlight the particular role of
this quantitative dimension, we consider four ways of computing frequent sequential
patterns: (i) no joker (M 2 SP), (ii) jokers on all dimensions but the quantitative one
(M 2 SP-al pha), (iii) jokers only on the quantitative dimension (M 2 SP-mu), (iv) jokers
on all dimensions (M 2 SP-al pha-mu). Note that case (iv) corresponds to the jokerized
approach presented in Section 5. Our experiments can thus be seen as being conducted
in the context of a fact table of a multidimensional database, where the quantitative
dimension is the measure. In Figures 5-12, minsup is the minimum support taken into
account, nb dim is the number of analysis dimensions being considered, DB size is
the number of tuples, avg card is the average number of values in the domains of the
analysis dimensions.
Fig. 6 and 7 compare the behavior of the four approaches described above when
the support changes. M 2 SP-al pha and M 2 SP-al pha-mu have a similar behavior, the
difference being due to the verification of quantities in the case of M 2 SP-al pha. Note
that these experiments are not led on the same minimum support values since no fre-
quent items are found for M 2 SP and M 2 SP-mu if the support is too high. Fig. 8 shows
the scalability of our approach since runtime grows almost linearly when the database
size increases (from 1, 000 tuples up to 26, 000 tuples). Fig. 9 shows how runtime be-
haves when the average cardinality of the domains analysis dimensions changes. When
this average is very low, numerous frequent items are mined among few candidates.
On the contrary, when this average is high, numerous candidates have to be considered
from which few frequent items are mined. Between these two extrema, the runtime de-
creases. Fig. 10 and 11 show the behavior of our approach when the number of analysis
dimensions changes. The number of frequent items increases as the number of analy-
sis dimensions grows, leading to an increase of the number of frequent sequences. Fig.
12 and 13 show the differential between the number of frequent sequences mined by
our approach compared to the number of frequent sequences mined by the approach
described in [8], highlighting the interest of our proposition.
8 Conclusion
In this paper, we propose a novel definition for multidimensional sequential patterns.
Contrary to the propositions from [4,8,11], several analysis dimensions can be found
in the sequence. This allows the discovery of rules like A customer who bought a surf-
board together with a bag in NY later bought a wetsuit in LA. We also define jokerized
sequential patterns by introducing the joker value ∗ on analysis dimensions. Moreover,
reference dimensions are defined in order to generalize the way clients are defined.
The algorithms implementing our approach are given and evaluated with experiments
performed on synthetic data.
This work can be extended following several directions. For example, we can take
into account approximate values on quantitative dimensions. In this case, we allow
50 10000
M2SP-mu M2SP-alpha-mu
M2SP M2SP-alpha
45
40 8000
35
30 6000
Runtime (s)
Runtime (s)
25
20 4000
15
10 2000
5
0 0
0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Support Support
Fig. 6. Runtime over Support Fig. 7. Runtime over Support
(DB size=12000, nb dim=5, avg card=20) (DB size=12000, nb dim=5, avg card=20)
2000 10000
M2SP-alpha-mu M2SP-alpha-mu
M2SP-alpha M2SP-alpha
1500 1000
Runtime (s)
Runtime (s)
1000 100
500 10
0 1
0 5 10 15 20 25 5 10 15 20 25 30 35 40 45
Database Size (K-tuples) Average cardinality of analysis dimensions
Fig. 8. Runtime over database size Fig. 9. Runtime over Average Cardinal-
(minsup=0.5, nb dim=15, avg card = ity of Analysis Dimensions (minsup=0.8,
20) DB size=12000, nb dim=15)
1000
M2SP-alpha-mu
600 M2SP-alpha
M2SP-alpha-mu
M2SP-alpha
800
500
Number of Frequent Patterns
400 600
Runtime (s)
300
400
200
200
100
0
0 2 4 6 8 10 12 14
2 4 6 8 10 12 14 Number of Analysis Dimensions
Number of Analysis Dimensions
Fig. 11. Number of Frequent patterns
Fig. 10. Runtime over Number of Analysis
over number of analysis dimensions
Dimensions (minsup=0.5, DB size=12000,
(minsup=0.5, DB size=12000, nb dim=15,
nb dim=15, avg card=20)
avg card=20)
1000
M2SP-alpha
2000 Han’s approach
M2SP-alpha
Han’s approach
800
Number of Frequent Patterns
1500
Number of frequent Patterns
600
1000
400
500 200
0
2 4 6 8 10 12 14
0
5 10 15 20 25 Number of Analysis Dimensions
Database Size (K-tuples)
Fig. 13. Number of Frequent Sequences
Fig. 12. Number of Frequent Sequences
over Number of Analysis Dimen-
over Database Size (minsup=0.5,
sions (minsup=0.5, DB size=12000,
nb dim=15, avg card=20)
avg card=20)
the consideration of values that are not fully jokerized while remaining frequent. This
proposition is important when considering data from the real world where the high
number of quantitative values prevent each of them to be frequent. Rules to be built will
then be like The customer who bought a DVD player on the web is likely to buy almost
3 DVDs in a supermarket later. Hierarchies can also be considered in order to mine
multidimensional sequential patterns at different levels of granularity in the framework
of multidimensional databases.
References
1. R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data
Engineering (ICDE’95), pages 3–14, 1995.
2. K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cube. In Proc.
of ACM SIGMOD Int. Conf. on Management of Data, pages 359–370, 1999.
3. A. Casali, R. Cicchetti, and L. Lakhal. Cube lattices: A framework for multidimensional
data mining. In Proc. of 3rd SIAM Int. Conf. on Data Mining, 2003.
4. S. de Amo, D. A. Furtado, A. Giacometti, and D. Laurent. An apriori-based approach for
first-order temporal pattern mining. In Simpósio Brasileiro de Bancos de Dados, 2004.
5. T.G. Dietterich and R.S. Michalski. Discovering patterns in sequences of events. Artificial
Intelligence, 25(2):187–232, 1985.
6. H. Mannila, H. Toivonen, and A.I. Verkamo. Discovering frequent episodes in sequences. In
Proc. of Int. Conf. on Knowledge Discovery and Data Mining, pages 210–215, 1995.
7. F. Masseglia, F. Cathala, and P. Poncelet. The PSP Approach for Mining Sequential Patterns.
In Proc. of PKDD, volume 1510 of LNCS, pages 176–184, 1998.
8. H. Pinto, J. Han, J. Pei, K. Wang, Q. Chen, and U. Dayal. Multi-dimensional sequential
pattern mining. In ACM CIKM, pages 81–88, 2001.
9. R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance
Improvements. In Proc. of EDBT, pages 3–17, 1996.
10. J.D. Ullman. Principles of Database and Knowledge-Base Systems, volume I. Computer
Science Press, 1988.
11. C.-C. Yu and Y.-L. Chen. Mining sequential patterns from multidimensional sequence data.
IEEE Transactions on Knowledge and Data Engineering, 17(1):136–140, 2005.