Mlg07.dvi

DIGDAG, a first algorithm to mine closed frequent embedded sub-DAGs
Systems Pharmacology Research Institute, GNI Ltd Abstract
gorithms, which discover induced patterns based only onparent-descendant relationship, are not adapted to analyze Although tree and graph mining have attracted a lot of this data. Algorithms discovering embedded patterns, based attention, there are nearly no algorithms devoted to DAG- on the ancestor-descendant relationships, are needed for this mining, whereas many applications are in dire need of such algorithms. We present in this paper DIGDAG, the first al- In this paper, we present DIGDAG, the first algorithm gorithm capable of mining closed frequent embedded sub- capable of mining closed frequent embedded sub-DAGs DAGs. This algorithm combines efficient closed frequent (c.f.e.-DAGs) in DAG data. The input data are DAGs where itemset algorithms with novel techniques in order to scale all the labels must be distinct. This assumption of distinct labels has long lead to think that the solution to this problemwas trivial. However, we will show that naive approachesare unable to provide results on real data and this motivates 1. Introduction
the more elaborate method used in DIGDAG. Experimentswill show that DIGDAG significantly improves a naive ap-proach both in term of memory consumption and in term of A growing percentage of the data available today is semi- structured data, i.e. data that can be represented as a la- The paper is organized as follows: Section 2 gives the beled graph. This kind of data can be found in various do- necessary definitions, Section 3 presents the DIGDAG algo- mains such as bioinformatics, chemistry or computer net- rithm, Section 4 shows some preliminary experiments, and works. To analyze this data, there is an important need Section 5 concludes the paper and gives perspectives for fu- of specific data mining algorithms, especially for the dis- covery of frequent patterns. In the recent years, the re-search community has produced numerous algorithms tofind frequent patterns in general graphs [5] and in trees 2. Definitions and data-mining problem
[2, 7, 10]. However, there are nearly no algorithms for find-ing frequent patterns in data structured as DAGs. This is A labeled graph is a tuple G = (V, E, ϕ), where V is
a problem to be explored, as a lot of interesting data ex- the set of vertices, E ⊆ V × V is the set of edges, and ϕ : hibit this structure. An important body of this data is the V → L is a labeling function with L a finite set of labels.
result of Bayesian networks based algorithms. This is for For an edge (u, v) ∈ E, u is the parent of v and v is the
example the case of most current studies about gene in- child of u. If there is a set of vertices {u1, ., un} ⊆ V such
teraction networks [4]. Such data exhibit large structural that (u1, u2) ∈ E,.,(un−1, un) ∈ E, {u1, ., un} is called variations, meaning that ancestor-descendant relationships a path, u1 is an ancestor of un and un is a descendant of
are more often preserved than parent-child relationships.
u1. There is a cycle in the graph if a path can be found from
Hence current DAG mining [1] and graph mining [5, 9] al- a vertex to itself. An edge (u, v) ∈ E of the graph is said

to be a transitive edge if besides the edge (u, v), there also
tices must be introduced as input data. For each DAG D we exists another path from u to v in G. A labeled DAG is a
already have its set of edges ED. We define the saturated set E+ = {(u, v) | u is an ancestor of v in D} containing Let P1 = (V1, E1, ϕ1) and P2 = (V2, E2, ϕ2) be two all the ancestor-descendant edges of a DAG D. Each input DAGs. P1 is an embedded sub-DAG of P2, written P1 ⊑
DAG D ∈ D will be represented by its E+ set. The set P2, if there exists an injective homomorphism µ : P1 → P2 D+ = {E+ | D ∈ D} regroups all the saturated versions such as: 1) for two vertices u ∈ V1 and v ∈ V2 such that v = µ(u), ϕ2(v) = ϕ1(u) holds and 2) for (u, v) ∈ E1, Now the question is how to correctly use a closed fre- µ(u) is an ancestor of µ(v) in P2. In short, the homomor- quent itemset mining algorithm to discover the sets of edges phism must preserve the labels and the ancestor relation- of the c.f.e.-DAGs. Briefly stated, it can be summed up as a ship. If in 2), the homomorphism only preserves the parent relationship (i.e. the condition becomes: for (u, v) ∈ E1,µ(u) is a parent of µ(v) in P 1. Using D+ as input, apply a closed frequent itemset 2 ), then P1 is an induced sub-
mining algorithm to discover closed frequent ancestor- 2. Note that an induced sub-DAG is also an em- bedded sub-DAG, but not the opposite.
Let D = {D1, ., Dn} be a set of labeled DAGs and 2. Process the closed frequent ancestor-descendant edge ε ≥ 0 be an absolute frequency threshold. A DAG P is a frequent embedded (induced) sub-DAG of D if it is em-
bedded (induced) in at least ε DAGs of D. The set of DAGs
Both of these two steps seem quite straightforward and of D in which P is embedded (induced) is called the tidlist
very easy to implement, i.e., step 1 should be the simple of P , denoted tidlist(P ). The support of P is defined by
application of any well known closed frequent itemset min- ing algorithm, and step 2 should only be connecting the A frequent embedded (induced) sub-DAG P of D is edges together to discover the c.f.e.-DAGs. However this closed if it is maximal for its support, i.e. if there is no fre-
simplicity is only apparent, and in practice numerous prob- quent embedded (induced) sub-DAG P ′ of D such that P is lems arise, which prevent any “naive” approach to success.
embedded (induced) into P ′ and tidlist(P ) = tidlist(P ′).
The main of these problems is that step 1 doesn’t scale up The exact data mining problem at hand is to find closed to complex data. We have tried to analyze real data made of frequent embedded sub-DAGs (c.f.e.-DAGs) in a collection 100 DAGs, each having 300 vertices, coming from a bioin- of DAG data, where for each input DAG all its labels are formatics problem. The closed frequent itemset algorithm was the FIMI 2003 winner, LCM2 [8]. This algorithm usesa lot of techniques to reduce the size of the dataset in mem- 3. Algorithm
ory. But even on a machine with 8 GB of RAM, the algo-rithm could not complete due to memory saturation. This is due to two reasons. First, the item space is the space of allthe ancestor-descendant edges, which is huge. This is diffi- cult for any closed frequent itemset mining algorithm. But rithm for discovering c.f.e.-DAGs in DAG data where for the main reason lies in the very nature of the closed frequent each input DAG, all the labels are distinct. The fact that ancestor-descendant edge sets researched. In the ideal case, all the labels are distinct in the input DAGs is convenient ancestor-descendant edges are regrouped together because because it means that there are no ambiguities, i.e., if two they belong to the same c.f.e.-DAG. But two c.f.e.-DAGs edges (A, B) and (B, C) are found to appear frequently to- can also appear frequently together in any D ∈ D, in this case all their ancestor-descendant edges will be regrouped pattern A → B → C is necessarily frequent and appears in in the same closed frequent ancestor-descendant edge set.
So in facts, the closed frequent itemset algorithm does not 1, ., Dp}. This can be further extended, and any c.f.e.- DAG P is uniquely defined by its set of edges E look for the c.f.e.-DAGs, but it looks for all the closed fre- This means that instead of mining directly c.f.e.-DAGs, quent sets of c.f.e.-DAGs. This problem is much more diffi- the c.f.e.-DAGs are mined by deriving closed frequent sets cult than the original problem, and in complex real data the of edges. Though there are no algorithms to mine c.f.e.- sheer number of possible c.f.e.-DAGs combinations gives DAGs, a lot of closed frequent itemset mining algorithms, too many results for an efficient handling in memory.
capable of mining efficiently sets of any kind of items [6, 8], In the following, we will explain the methods we devel- can then be used to mine the c.f.e.-DAGs.
oped to avoid these problems in the DIGDAG algorithm1.
As we are interested in finding closed frequent embedded 1NB: We will always refer to the vertices of the DAGs simply by their sub-DAGs, the ancestor-descendant relations among ver-

3.2. Item space reduction using the tiles Our solution to this combinatorial problem is to make computations where the item space is reduced to small sub- Experiments show that the successful use of the sets of the whole tile set T . The ideas are that each c.f.e.- ancestor-descendant edges as items for a closed frequent DAG is likely to contain only a few tiles of T , and that two itemset algorithm is not easy on real data, because the item disconnected c.f.e.-DAGs that appear frequently in the same space they make is too big for efficient handling. Our so- input DAGs will necessarily contain different tiles (thanks lution to reduce the item space size is to use instead spe- to the distinct labels hypothesis). A further observation is cial groups of ancestor-descendant edges named tiles as that given D a c.f.e.-DAG and l a vertex of D, let Tl be the items. The tiles are the closed frequent sets of ancestor- biggest tile included in D whose root is l. Then the leaves descendant edges that share the same initial label, like of Tl represent all the descendants of l in D. Let l′ be one {(a, b1), ., (a, bk)} (a, b1, ., bk ∈ L) for example. They of these descendants, then the set of all the vertices of Tl′ , can be represented as depth 1 trees. Lets T denote the set the biggest tile of root l′ included in D, will be included in of all tiles. Figure 1 shows a simple example of tile discov- ery. The input DAGs are D1 and D2. For each label a ma- We have used this monotonic property to create an sim- trix is constructed, whose lines are the vertices of the given ple heuristic method grouping the tiles based on their labels.
label and whose columns indicate the labels of the descen- The result is a set of tile groups T G, satisfying the following dants of these vertices. For example for label A there are two lines corresponding to the vertices A1 and A6. A1 has{C, D, E} as descendants, whereas A Property 1 For any c.f.e.-DAG D, there exists a tile group
as descendants. Applying a closed frequent itemset mining T G ∈ T G such that T G contains all the tiles contained in algorithm on this matrix gives the tiles whose root has the given label, as shown on the right of the figure. Note that in Property 2 For all T G ∈ T G, T G is unlikely to contain
the figure, matrices for labels A, B and D only are shown, the tiles of two c.f.e.-DAGs D1 and D2 that appear fre- other labels, having no descendants, do not produce tiles.
quently in the same DAGs, even if this can happen in rare For each tile group T G ∈ T G we perform a closed fre- quent itemset algorithm on DT i , i.e. DT i reduced to the tiles of T G. A closed frequent tileset obtained here corre- spond to the tiles of an ancestor-descendant relationship sat- urated c.f.e.-DAG. Even if a closed frequent tileset contains multiple c.f.e.-DAGs, a simple safety check is performed to further decompose it into each of the c.f.e.-DAGs it con- 3.4. Discovering the c.f.e.-DAGs from their Having determined each tileset that corresponds exactly Figure 1: Example of tile discovery, frequency threshold to the tiles of an ancestor-descendant relationship saturated ε = 2. Vertex identifiers are subscripts of vertex labels.
c.f.e.-DAG D+, we know that for each c.f.e.-DAG D wehave all its ancestor-descendant edges, i.e. we have E+.
3.3. Further item space reduction by tiles ing to find the non-saturated c.f.e.-DAGs, which are the ex-pected final result. For this we distinguish the edges of a The input DAGs can be represented by the tiles they con- c.f.e.-DAG into two categories, i.e., the direct edges and the tain. When, for each DAG D ∈ D, T iD is the set of tiles in- transitive edges. An edge (u, v) is a direct edge if the only cluded in D, the new input data is DT i = {T iD | D ∈ D}.
way from u to v in the c.f.e.-DAG is the edge (u, v). And Applying a closed frequent itemset mining algorithm on edge (u, v) is a transitive edge if it is not a direct edge, i.e.
DT i gives closed frequent tilesets, from which the c.f.e.- if from u to v there are several paths. These are shown in DAGs can be deduced. However in this case also the results are closed frequent sets of c.f.e.-DAGs, which are too nu-merous for successful computation in complex cases.

We have presented in Section 3 a naive method for ex- tracting closed frequent embedded DAGs, and claimed that this method was not efficient enough for real data analy-sis. We have implemented this method in a way very simi- Figure 2: Simple example of directs and transitive edges lar to DIGDAG, and show the performance difference withDIGDAG in Figures 3 and 4.
The direct edges of the c.f.e.-DAGs are easily found from the saturated c.f.e.-DAGs. The transitive edges are the tricky part. For every transitive edge of each saturated c.f.e.- DAG D+, we check if the transitive edge is actually con- tained in all the input DAGs containing D+. This check is efficiently done by applying a closed frequent itemset min- ing algorithm to the edge sets of all the input DAGs contain-ing D+. An edge of each closed frequent edges set found here represents a transitive edge of a non saturated c.f.e.- DAG in the case where the edge is not a direct edge of D+.
We have proved the following property.
Figure 3: Naive vs DIGDAG, computation time Property 3 The set of DAGs found by DIGDAG is the sound
and complete set of the c.f.e.-DAGs.
4. Experiments
This section is divided into two parts. First, we show that the techniques used in the DIGDAG algorithm drastically reduce computation time and memory usage over the naive method of the beginning of Section 3. Then, we give an application example with the analysis of the action of the The dataset used is a real world dataset of gene network DAGs, as outputted by the algorithm of [4] when analyzing microarray data from [3]. This algorithm models gene net-works with Bayesian networks. From the input microarray Figure 3 shows the difference in computation time.
data, a greedy search is performed, whose results are local DIGDAG is at least two orders of magnitude faster than the optima, the candidate gene networks. They are of mixed naive method: the improvement made is significant. As a quality: they contain both correct and incorrect parts. The consequence, DIGDAG can find results quickly for low sup- goal of our data mining algorithm is to extract closed fre- port values, while the naive method’s computation time ex- quent gene networks patterns from the candidate gene net- ceeds the allocated 2 hours for support values below 30%.
works, which should be more credible and more easy to an- The memory consumption difference is shown on Fig- alyze than the whole candidate networks.
ure 4. For each program we give two values, as reported To keep reasonable runtimes for all algorithms, we used by the Linux memusage command: the heap total, which a simplified dataset reduced to 100 candidate gene networks corresponds to the total amount of memory allocated, and of 50 genes. The DAGs of this dataset have in average the heap peak, which corresponds to the maximum amount 284.5 ancestor-descendant edges, with a maximum of 352 of system memory used by the program during its execu- tion. Due to problems with the memusage command, this The machine used is a Xeon 2.8 GHz with 8 Gb of RAM experiment was made with a machine different from the pre- running under Linux. The implementations of DIGDAG and vious one, a dual-core Xeon at 2.2 GHz with 4 Gb of RAM of the naive algorithm are our own C++ implementation, us- running under Linux. The naive method saturates the avail- ing inside the LCM2 closed frequent itemset mining algo- able memory for support values under 35%. For the lowest support value, the figure shows that the naive method con- 3http://research.nii.ac.jp/∼uno/codes.htm sumes much more memory than DIGDAG: for support value 35%, the heap peak of DIGDAG is three orders of magni- 5. Conclusion and future works
tude lower than the heap peak of the naive method. Since asupport value of 65%, the heap peak of the naive method is We have presented in this paper the DIGDAG algorithm, even higher than the heap total of DIGDAG. These results the first algorithm capable of mining c.f.e.-DAGs. Our first shows that DIGDAG is much less prone to saturating the experiments on real data have shown that this algorithm ex- system’s memory during computation. This is very impor- hibited much better mining performances than a naive ap- tant; as if the memory is saturated the program will not give proach for the task at hand, and can allow to find interesting any results, and computation time will have been wasted.
From these experiments, we can see that the improve- As a topic for future research, we would like to be able ments made by DIGDAG over the naive method are indeed to handle even more complex data (more vertices and more efficient at reducing the memory consumption of the pro- input DAGs) through the use of parallelism. We would also gram. These improvements also allow an important gain in like to research what modifications would be needed for DIGDAG to handle general graphs instead of DAGs.
Acknowledgements: This research was, in part, sup-
ported by Function and Induction project of ROIS/TRIC.
In the previous dataset, the 50 genes have not been cho- References
sen at random, they have been carefully selected as beingthe most probably affected by the terbinafine drug, in order [1] Y.-L. Chen, H.-P. Kao, and M.-T. Ko. Mining dag patterns from dag databases. In Web-Age Information Management(WAIM), Dalian, China, 2004.
[2] Y. Chi, Y. Yang, Y. Xia, and R. R. Muntz.
iner: Mining both closed and maximal frequent subtrees. In ERG6S-adenosyl-methionine delta-24-sterol-c-methyltransferase The Eighth Pacific-Asia Conference on Knowledge Discov-ery and Data Mining (PAKDD’04), 2004.
[3] T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M. R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H.
Friend. Functional discovery via a compendium of expres- sion profiles. Cell, 102(1):109–126, July 2000.
[4] S. Imoto, T. Goto, and S. Miyano. Estimation of genetic networks and functional structures between genes by using bayesian network and nonparametric regression. In Pacific Symposium on Biocomputing, pages 175–186, 2002.
[5] A. Inokuchi, T. Washio, and H. Motoda. Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, 50(3):321–354, 2003.
[6] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal.
3-hydroxy-3-methylglutaryl- coenzyme A reductase 1 covering frequent closed itemsets for association rules. InDatabase Theory - ICDT ’99, 7th International Conference, Figure 5: A significative c.f.e.-DAG.
[7] A. Termier, M. Rousset, and M. Sebag. Dryade : a new approach for discovering closed frequent trees in heteroge- One of the c.f.e.-DAGs found is shown on Figure 5. This neous tree databases. In International Conference on Data gene network clearly shows relations of the gene ERG1 (the Mining ICDM’04, Brighton, England, pages 543–546, 2004.
[8] T. Uno, M. Kiyomi, and H. Arimura. Lcm v.2: Efficient real target of terbinafine) with the genes ERG6, 11, 13, 25 mining algorithms for frequent/closed/maximal itemsets. In and 26 from the steroid metabolism, and HMG1 and 2 from 2nd Workshop on Frequent Itemset Mining Implementations the lipid metabolism. This suggests that terbinafine works on steroid metabolism or lipid metabolism related genes, which is consistent with existent knowledge about this drug.
graph patterns. In KDD, pages 286–295, 2003.
Our c.f.e.-DAG proposes a precise interaction structure be- Fundamenta Informaticae, 65(1-2):33–52, tween these genes, which could serve as a basis to design

Source: http://mlg07.dsi.unifi.it/pdf/01_Termier.pdf

ybds.com.hk

GENERAL OVERVIEW OF THE GROUP HISTORY AND DEVELOPMENT Mr. Lok, the founder of the Group, commenced his career in the development of system softwarein the 1980s and was subsequently engaged in the development and sale of customized systemsoftware in Hong Kong. In 1986, recognising the huge market potential for enterprise applicationsoftware, he started to concentrate his efforts on the deve

Microsoft word - damien-program notes.doc

Damien – Program Notes Father Damien Born Joseph de Veusters in 1840 in a small hamlet called Tremeloo in Belgium, the man whowas to serve the Settlement on Moloka’i chose the religious name of Damien upon entering thepriesthood in the missionary order of the Society of the Sacred Hearts of Jesus and Mary. Taking the place of his brother, a priest who had suffered a bout of typhus,