KDD 2006, August 20 - 23, 2006, Philadelphia, USA

The Twelfth Annual SIGKDD International Conference on
Knowledge Discovery and Data Mining

Clustering Under Constraints: Theory and Practice
Sugato Basu and Ian Davidson
Clustering is traditionally an unsupervised task that is used and researched extensively by the data mining community. The incorporation of background or prior knowledge into clustering in the form of constraints is a recent innovation known as clustering with constraints, semi-supervised clustering, or clustering with side-information. This new area has generated much interest among practitioners and industry, since background information in the form of, say, must-link (two instances must be in the same cluster) or cannot-link (two instances must not be in the same cluster) constraints is readily available and has been conclusively shown to improve the quality of clustering solutions. Similarly, the data mining research community has begun to make significant inroads into the computational and algorithmic aspects of using constraints, with the best paper awards at SIGKDD 2004, ICDM 2004 and SIAM DM 2005 all going to papers that explore clustering with constraints. This tutorial aims to provide the practitioner, researcher and student with an introduction to this rapidly developing area. We begin by introducing the various types of constraints that have been studied and how algorithms have used them. We shall then focus on several real-world problems discussed in the literature to illustrate how constraints can be generated, after which we shall illustrate the benefits of constraints as reported in the literature.
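As a small illustration of the two constraint types named above, the following Python sketch checks a given cluster assignment against must-link and cannot-link pairs. It is not taken from the tutorial; the function name `violated_constraints`, the instance names and the example labels are all hypothetical, and a real constrained clustering algorithm would use such a check (or a penalty derived from it) inside its assignment step.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the pairwise constraints that a cluster assignment breaks.

    labels      -- dict mapping instance name to cluster id (hypothetical data)
    must_link   -- pairs of instances that must share a cluster
    cannot_link -- pairs of instances that must not share a cluster
    """
    broken = []
    for a, b in must_link:
        # A must-link pair is violated when the two instances are separated.
        if labels[a] != labels[b]:
            broken.append(("must-link", a, b))
    for a, b in cannot_link:
        # A cannot-link pair is violated when the two instances are together.
        if labels[a] == labels[b]:
            broken.append(("cannot-link", a, b))
    return broken


# Illustrative assignment of four instances to two clusters.
labels = {"x1": 0, "x2": 0, "x3": 1, "x4": 1}
must_link = [("x1", "x2"), ("x2", "x3")]
cannot_link = [("x1", "x4"), ("x3", "x4")]

for kind, a, b in violated_constraints(labels, must_link, cannot_link):
    print(f"{kind} violated: {a}, {b}")
```

Here the assignment satisfies the first must-link and the first cannot-link pair and breaks the other two, so a constraint-aware method would either reject it or count it as carrying two violations.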

Data Analytics for Marketing Decision Support
Saharon Rosset and Naoki Abe
In this tutorial, we will give an overview of the issues in applying data mining and analytics tools to marketing decision support, primarily for marketing and sales optimization purposes. In particular, we will:
- review some challenges involved in designing analytics tools for real-life marketing decision support;
- discuss different analytical approaches to addressing these problems in a practically feasible manner;
- present several case studies from our own experiences in customer lifetime value modeling, customer wallet estimation, and more.
This tutorial is intended for data miners, whether from industry or academia, who have an interest in moving beyond generic tools and methods to design real, practically useful solutions to decision-making problems in marketing and related areas.

Data Mining for Software Engineering
Tao Xie and Jian Pei
Since the late 1990s, various data mining techniques have been applied to analyze software engineering data and have achieved many notable successes. The substantial experience, development, and lessons of data mining for software engineering pose interesting challenges and opportunities for new research and development. In this tutorial, we shall present a survey of the research problems, the latest progress, the challenges, and the potential of data mining practice in software engineering. The tutorial will focus on the inherent challenges of mining software engineering data, offer a shortcut to the current research and development frontier, and illustrate a few case studies. The tutorial will answer questions such as which software engineering tasks can be helped by data mining, what kinds of software engineering data are available for mining, and how data mining techniques can be used in software engineering. The tutors, Drs. Tao Xie and Jian Pei, are active and prolific researchers in software engineering and data mining, respectively.

Mining and Searching Graphs and Structures
Jiawei Han, Xifeng Yan and Philip S. Yu
Scalable methods for mining, indexing, and similarity search in graphs and other complex structures, such as sequences, trees, and networks, have become increasingly important in data mining and database management, with broad applications in social science, the Web, computer vision, software engineering, chem-informatics, bio-informatics, etc. Graph mining tasks, such as mining network motifs, structural patterns with constraints and contrast graph patterns, graph clustering, and graph classification, have been studied extensively in recent years. The applications built on these algorithms, such as graph indexing and similarity search, are evolving into new components of data management systems for handling complex structured data. This tutorial will present a comprehensive overview of this growing area and discuss its potential research and application topics.

Mining High-Throughput Biological Data
David Page
Biology has become a major application area for data mining. In the last decade, biotechnology has made it possible to perform the equivalent of thousands of biological experiments in the time it used to take to perform a single experiment, and new such technologies continue to be developed. Examples of high-throughput technologies include gene expression microarrays, mass spectrometry for proteomics and metabonomics, single-nucleotide polymorphism arrays for genotyping, and robotic high-throughput screening for potential drug compounds. Biologists are eager for computational tools to mine this data, to help in the discovery of biological pathways, in the prediction of disease or patient response to therapy, or in the design of new pharmaceuticals. This tutorial will describe a variety of data mining tasks arising from high-throughput biological data. It will discuss the technologies in enough detail to clarify the challenges raised by each data type. The tutorial will present case studies from a variety of such data mining applications, and it will speculate about novel data mining tasks likely to arise in this area in the near future.

Models and Methods for Privacy-Preserving Data Mining and Data Publishing
Johannes Gehrke
The digitization of our daily lives has led to an explosion in the collection of data by organizations and individuals. Protecting the confidentiality of this data is of utmost importance. However, knowledge of the statistical properties of this private data can have significant societal benefit, for example, in decisions about the allocation of public funds based on Census data, or in the analysis of medical data from different hospitals to understand the interaction of drugs. This tutorial will survey recent research that builds bridges between the two seemingly conflicting goals of sharing data and preserving data privacy and confidentiality. The tutorial will cover definitions of privacy and disclosure, and the associated methods for enforcing them.

Scalable Information Extraction and Integration
Eugene Agichtein and Sunita Sarawagi
Data mining applications over text require efficient methods for extracting and integrating the information "buried" in millions, or billions, of documents. This tutorial reviews the state of the art in scaling up the extraction, mining, and integration of information in large amounts of unstructured text. We review key approaches for scaling up information extraction, including the use of general-purpose as well as specialized indexing techniques. We also survey scalable techniques for integrating and cleaning the extracted information. We highlight research opportunities in applications of scalable information extraction and integration methods, as well as the fundamental challenges that remain.

Sensor Mining at work: Principles and a Water Quality Case-Study
Christos Faloutsos (SCS, CMU) and Jeanne VanBriesen (CEE, CMU)
How can we find patterns in a collection of measurements, say, from water quality sensors? What do these patterns tell us? Is the water safe to drink? Are we under biological attack? How many sensors do we need to place, and where, to answer these questions in real time? The instructors have been collaborating on exactly these problems for the past several years, and the tutorial will report on these experiences. Specifically, the tutorial surveys the related areas and has two goals: (a) to review the main principles and main data mining tools for sensor data analysis (correlation discovery, SVD, ICA, Fourier, Wavelets), and (b) to showcase them on a real, important application, namely monitoring the quality of drinking water. The tutorial ends with a list of future directions for data mining research, motivated by the water quality monitoring application.

Any questions regarding tutorials can be sent to the Tutorials Chair: Ronen Feldman (ronenf[at]gmail.com).

Webmaster: Teresa Mah