» Tutorials ECML-PKDD 2012

Tutorials

Monday 24th September
Morning Session	Advanced Topics on Data Stream Mining	Advanced Topics in Ensemble Learning
Afternoon Session	Decomposing Binary Matrices: Where Linear Algebra Meets Combinatorial Data Mining	Mining Deep Web Repositories
Friday 28th September
Morning Session	PAC-Bayesian Analysis in Supervised, Unsupervised, and Reinforcement Learning	Random Projections for Machine Learning and Data Mining: Theory and Applications
Afternoon Session	Probabilistic Modeling of Ranking	Understanding and Managing Cascades on Large Graphs

We would like to inform the tutorial attendees that tutorial presenters may deviate from the default time slots for the coffee and lunch breaks as indicated in the programme booklet; please follow the time indications they will provide.

For example, the organizers of “Advanced Topics on Data Stream Mining” have already set the following revised schedule:

1st PART: 9:00 – 10:30
2nd PART: 10:45 – 12:15

Understanding and Managing Cascades on Large Graphs

B. Aditya Prakash, Christos Faloutsos

http://www.cs.vt.edu/~badityap/TALKS/12-pkdd-tutorial/

How do contagions spread in population networks? Which group should we market to, for maximizing product penetration? Will a given YouTube video go viral? Who are the best people to vaccinate? What happens when two products compete? The objective of this tutorial is to provide an intuitive and concise overview of most important theoretical results and algorithms to help us understand and manipulate such propagation-style processes on large networks.

The tutorial will contain three parts:

(a) Theoretical results on the behavior of fundamental models;
(b) Scalable Algorithms for changing the behavior of these processes e.g., for immunization, marketing etc.;
(c) Empirical Studies of diffusion on blogs and on-line websites like Twitter.

We finally conclude with future research directions. The problems we focus on are central in surprisingly diverse areas: from computer science and engineering, epidemiology and public health, product marketing to information dissemination. Our emphasis is on intuition behind each topic, and guidelines for the practitioner.

Advanced Topics on Data Stream Mining

Albert Bifet, Joao Gama, Richard Gavalda, Georg Krempl, Mykola Pechenizkiy, Bernhard Pfahringer, Myra Spiliopoulou, Indrė Žliobaitė

Nowadays, the quantity of data that is created every two days is estimated to be 5 exabytes. This amount of data is similar to the amount of data created from the dawn of time up until 2003. Moreover, it was estimated that 2007 was the first year in which it was not possible to store all the data that we are producing. This massive amount of real time streaming data opens new challenging discovery tasks. Some of them are already addressed with mature algorithms, while new challenges emerge, including learning on not one but multiple streams. This tutorial has two parts. The first part gives an introduction to recent advances in algorithmic techniques and tools to cope with challenges on stream mining. The second part discusses state of the art research on mining multiple streams – distributed streams and interdependent relational streams.
Concept drift plays a central role in this tutorial. In the first part, we address it in the context of conventional one-stream mining to set the scene. In the second part, we recapitulate on it after introducing multiple-stream mining, and we also consider machine learning methods that are appropriate for incremental data and slow streams.

The first part (9:00 – 10:30), ‘Mining One Stream’, will be presented by Albert Bifet, Ricardo Gavalda, Mykola Pechenizkiy, Bernhard Pfahringer, and Indre Zliobaite.

The second part (10:45 – 12:15), ‘Mining Multiple Streams’ will be presented by Joao Gama, Myra Spiliopoulou, and Georg Krempl.

Mining Deep Web Repositories

N. Zhang, G. das

http://home.gwu.edu/~nzhang10/deepwebmining/

With the proliferation of online repositories (e.g., databases or document corpora) hidden behind proprietary web interfaces, e.g., keyword-/form-based search and hierarchical/graph-based browsing interfaces, efficient ways of enabling machine learning and data mining tasks over contents in such hidden repositories are of increasing importance. There are two key challenges: one on the proper understanding of interfaces, and the other on learning/mining over a properly understood interface. There are three ways to enable efficient machine learning and data mining over deep web data – (1) crawling the deep web repository before applying conventional mining techniques, (2) sampling the deep web repository before learning/mining the retrieved samples, at the expense of additional error introduces by sampling, and (3) estimating aggregates over slices of data in the deep web repository, and then using the estimated aggregates to support machine learning or data mining tasks. In this tutorial, we focus on the fundamental developments in the field, including web interface understanding, crawling, sampling, and aggregate estimation over web repositories with various types of interfaces and containing structured or unstructured data. We also discuss the potential changes required for machine learning and data mining algorithms should one choose to use the second and third methods described above. Our goal is to encourage the audience to initiate their own research in these exciting areas.

PAC-Bayesian Analysis and Its Applications

Yevgeny Seldin, François Laviolette, John Shawe-Taylor

PAC-Bayesian analysis is a basic and very general tool for data-dependent analysis in machine learning. By now, it has been applied in such diverse areas as supervised learning, unsupervised learning, and reinforcement learning, leading to state-of-the-art algorithms and accompanying generalization bounds. PAC-Bayesian analysis, in a sense, takes the best out of Bayesian methods and PAC learning and puts it together: (1) it provides an easy way to exploit prior knowledge (like Bayesian methods); (2) it provides strict and explicit generalization guarantees (like VC theory); and (3) it is data-dependent and provides an easy and strict way of exploiting benign conditions (like Rademacher complexities). In addition, PAC-Bayesian bounds directly lead to efficient learning algorithms. Thus, it is a key and basic subject for machine learning. While the first papers on PAC-Bayesian analysis were not easy to read, subsequent simplifications made it possible to explain it literally in three slides. We will start with a general introduction to PAC-Bayesian analysis, which should be accessible to an average student, who is familiar with machine learning at the basic level. Then, we will survey multiple forms of PAC-Bayesian bounds and their numerous applications in different fields, including supervised and unsupervised learning, finite and continuous domains, and the very recent extension to martingales and reinforcement learning. Some of these applications will be explained in more details, while others will be surveyed at a high level. We will also describe the relations and distinctions between PAC-Bayesian analysis, Bayesian learning, VC theory, and Rademacher complexities. We will discuss the role, value, and shortcomings of frequentist bounds that are inspired by Bayesian analysis.

Random Projections for Machine Learning and Data Mining: Theory and Applications

Ata Kaban and Bob Durrant

Random projections have been used as a dimensionality reduction technique for large data since the appearance of Arriaga and Vempala’s seminal FOCS 1999 paper, and they continue to find applications both within the field of Machine Learning and elsewhere. Starting with some motivating examples from Machine Learning and Data Mining, this tutorial will review some key theoretical properties of random projections, and the practical applications this theory has inspired. In particular, we will cover the Johnson-Lindenstrauss lemma which gives conditions under which random projections approximately preserve geometric properties of data and give related applications, discuss the field of Compressed Sensing from which one can derive guarantees for techniques working with sparse data which we illustrate in the setting of SVM, discuss more recent theory giving guarantees for linear classifiers and regressors working with non-sparse data, and finally we take a look at random projections as regularizers.

Decomposing Binary Matrices: Where Linear Algebra Meets Combinatorial Data Mining

Pauli Miettinen

http://www.mpi-inf.mpg.de/~pmiettin/bmf_tutorial/

Tiling databases and k-centroid clustering on a binary data are two data mining techniques that, on the first sight, do not seem to share many features (except that they both work on binary data). They both can, however, be considered as matrix factorization problems. Matrix factorizations are common tools in data mining and machine learning alike, but they are usually taken as inherently continuous methods as opposed to more combinatorial nature of clustering or tiling. And in order to model these combinatorial problems with matrix factorizations, we need to restrict the factor matrices into similar combinatorial objects; in this case, binary matrices. The idea of discrete factorizations of matrices, while not entirely new, has received growing interest in recent years among data mining and machine learning communities.

This tutorial discusses the recent developments and open problems on using binary matrix factorizations for data mining and machine learning. The main work in this area is done under two different algebras: the normal (where 1+1=2) and the Boolean (where 1+1=1), and much of the tutorial will concentrate on these two algebras with less emphasis on the modulo-2 algebra (where 1+1=0). The first part of the tutorial will cover theoretical aspects regarding the decompositions. This includes discussion about the concept of matrix rank and how it behaves under different algebras, and results about computational complexity of the many problems related to these decompositions. Open problems will be highlighted throughout the tutorial.

The second part will concentrate on algorithms for finding the decompositions and, in particular, the ideas and intuitions behind these algorithms. We will also briefly discuss about the model order (i.e. rank) selection problem and extensions to tensors.

The tutorial will reside firmly in the intersection of continuous and discrete methods aimed for participants with some background in either combinatorial data mining methods or in continuous linear algebra methods. The focus will be on intuition rather than on technical content with the hope that the different perspectives provided during the tutorial will give the participants new ways of thinking.

Advanced Topics in Ensemble Learning

Ioannis Partalas, Alberto Suarez, Daniel Hernandez Lobato, Grigorios Tsoumakas, Gonzalo Martinez

Ensemble methods are widely used by the machine learning community because they lead to improvements in accuracy and robustness in many prediction problems of practical interest. Furthermore, they offer appealing solutions to several interesting learning problems, such as dealing with small or large datasets, performing data fusion and modeling concept drift. Notwithstanding, using ensemble methods also poses some challenges. Specifically, in many problems very large ensembles need to be generated to achieve good generalization performance. Similarly, the ensemble prediction is typically generated by querying all the classifiers contained in the ensemble, a process that can be computationally expensive. In this tutorial we give a brief introduction to ensemble methods for machine learning and describe ensemble pruning techniques to deal with the shortcomings described. We also introduce advanced research topics of current interest, such as the problem of determining the optimal ensemble size or techniques to make ensembles scale to large learning problems

Probabilistic Modeling of Ranking

Jose A. Lozano and Ekhiñe Irurozki

http://www.sc.ehu.es/ccwbayes/members/ekhine/tutorial_ranking/info.html

Rankings and permutations have become, nowadays, ubiquitous. They appear in numerous areas of computer systems: information retrieval, recommender systems, identity tracking or chemical compound classification, etc. Dealing with rankings, and particularly with rankings of many objects is a complex computational task as the number of permutations of $n$ objects scales factorially in $n$. Recently a number of approaches have come to the machine learning arena to address this kind of data. Most of these approaches are based on the building of a probability distribution over the space of rankings. However, given the complexity of storing, learning and making inference on this kind of models, different simplifying assumptions have been considered: the use of parametric models, models based on low-order statistics, models based on kernels and the definition and use of notions of independence and conditional independence in the space of permutations. In this tutorial we will review the literature on the topic, explaining the different approaches in detail that have emerged in the literature, putting them in relation with other non-probabilistic ranking models and giving a collection of open problems in the area. In addition we will present the most relevant applications in the field as well as the most common benchmark datasets and software.

ECML-PKDD 2012