Michael Hahsler


Talks

[1] Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering, February 2017. Invited talk, IE Department Seminar, Department of Industrial, Manufacturing, & Systems Engineering, University of Texas at Arlington. [ slides (pdf) ]
Cluster analysis tries to uncover structure in data by assigning each object in the data set to a group (called a cluster) so that objects from the same cluster are more similar to each other than to objects from other clusters. Exploring the cluster structure and assessing the quality of the cluster solution have been a research topic since the invention of cluster analysis. This is especially important since all popular cluster algorithms produce a clustering even for data without a "cluster" structure. Many visualization techniques for judging the quality of a clustering and for exploring the cluster structure have been developed, but they all suffer from certain restrictions. For example, dendrograms cannot be used for non-hierarchical partitions, silhouette plots provide only a diagnostic tool without the ability to explore structure, data dimensionality may render projection-based methods less useful, and graph-based representations hide the internal structure of clusters. In this talk we introduce a new visualization technique called dissimilarity plots which is based on solving the combinatorial optimization problem of seriation for (near) optimal cluster and object placement in matrix shading. Dissimilarity plots are not affected by data dimensionality, allow the user to directly judge cluster quality by visually analyzing the micro-structure within clusters, and make misspecification of the number of clusters instantly apparent. Dissimilarity plots are implemented in the R extension package seriation.
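The talk's method is implemented in R (package seriation). Purely to illustrate the core idea of reordering objects within each cluster so that a shaded dissimilarity matrix reveals micro-structure, here is a minimal Python sketch (function names are my own) using a greedy nearest-neighbor seriation heuristic:

```python
import math
from collections import defaultdict

def dissimilarity(a, b):
    # Euclidean distance between two points.
    return math.dist(a, b)

def seriate_cluster(points, idx):
    # Greedy nearest-neighbor seriation: start from the first object and
    # repeatedly append the closest not-yet-placed object.
    order, remaining = [idx[0]], set(idx[1:])
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: dissimilarity(points[last], points[j]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def dissimilarity_plot_order(points, labels):
    # Order objects cluster by cluster, seriating within each cluster, so
    # that the shaded dissimilarity matrix shows block structure.
    clusters = defaultdict(list)
    for i, lab in enumerate(labels):
        clusters[lab].append(i)
    order = []
    for lab in sorted(clusters):
        order.extend(seriate_cluster(points, clusters[lab]))
    return order

def path_length(points, order):
    # A simple merit function: total dissimilarity between adjacent objects.
    return sum(dissimilarity(points[a], points[b])
               for a, b in zip(order, order[1:]))
```

A real dissimilarity plot additionally reorders the clusters themselves and renders the permuted matrix as gray shades; the dissplot() function in seriation does all of this.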

[2] Recommender systems: Harnessing the power of personalization, February 2017. Curricular Recommender System Working Group, SMU, February 17, 2017. [ slides (pdf) ]
This talk gives a short overview of recommender systems with examples.

[3] Predictive models for making patient screening decisions, January 2017. 2017 INFORMS Computing Society Conference, Austin, TX, January 16, 2017. [ slides (pdf) ]
A critical dilemma that clinicians face is when and how often to screen patients who may suffer from a disease. We investigate the use of predictive modeling to develop optimal screening conditions and assist with clinical decision-making with large amounts of missing information.

[4] Grouping association rules using lift, November 2016. 11th INFORMS Workshop on Data Mining and Decision Analytics, November 12, 2016. [ slides (pdf) ]
Association rule mining is a well-established and popular data mining method for finding local dependencies between items in large transaction databases. However, a practical drawback of mining and efficiently using association rules is that the set of rules returned by the mining algorithm is typically too large to be used directly. Clustering association rules into a small number of meaningful groups would be valuable for experts who need to manually inspect the rules, for visualization, and as the input for other applications. Interestingly, clustering is not widely used as a standard method to summarize large sets of associations. In fact, it performs poorly due to high dimensionality, the inherent extreme data sparseness, and the dominant frequent itemset structure reflected in sets of association rules. In this paper, we review association rule clustering and its shortcomings. We then propose a simple approach based on grouping columns in a lift matrix and give an example to illustrate its usefulness.
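As a rough sketch of the proposed grouping (not the paper's actual algorithm), the following Python code builds a pairwise lift matrix from transactions and greedily groups items whose lift columns are similar:

```python
from itertools import combinations

def lift_matrix(transactions, items):
    """Pairwise lift of items; the diagonal is set to 1 as a simplification
    (an item is trivially associated with itself)."""
    n = len(transactions)
    supp = {i: sum(i in t for t in transactions) / n for i in items}
    lift = {(a, b): 1.0 for a in items for b in items}
    for a, b in combinations(items, 2):
        s_ab = sum(a in t and b in t for t in transactions) / n
        l = s_ab / (supp[a] * supp[b]) if supp[a] and supp[b] else 0.0
        lift[(a, b)] = lift[(b, a)] = l
    return lift

def group_items(lift, items, threshold=2.0):
    # Greedy grouping: an item joins the first existing group whose
    # representative has a nearby lift column, else it starts a new group.
    groups = []
    for item in items:
        col = [lift[(item, j)] for j in items]
        for g in groups:
            ref = [lift[(g[0], j)] for j in items]
            if sum((x - y) ** 2 for x, y in zip(col, ref)) ** 0.5 < threshold:
                g.append(item)
                break
        else:
            groups.append([item])
    return groups
```

On toy data with two disjoint purchase patterns, this recovers the two item groups, which is the kind of summary the talk argues is useful for manual inspection and visualization.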

[5] Predictive models for making patient screening decisions, November 2016. 2016 INFORMS Annual Meeting, November 2016. [ slides (pdf) ]
A critical dilemma that clinicians face is when and how often to screen patients who may suffer from a disease. The stakes are heightened in the case of chronic diseases that impose a substantial cost burden. We investigate the use of predictive modeling to develop optimal screening conditions and assist with clinical decision-making. We use electronic health data from a major U.S. hospital and apply our models in the context of screening patients for type-2 diabetes, one of the most prevalent diseases in the U.S. and worldwide.

[6] Sequential aggregation-disaggregation optimization methods for data stream mining, November 2016. 2016 INFORMS Annual Meeting, November 2016. [ slides (pdf) ]
Clustering-based iterative algorithms to solve certain large optimization problems have been proposed in the past. The algorithms start by aggregating the original data, solving the problem on aggregated data, and then in subsequent steps gradually disaggregate the aggregated data. In this contribution, we investigate the application of aggregation-disaggregation on data streams, where the disaggregation steps cannot be performed explicitly on past data but have to be performed sequentially on new data.

[7] Data mining tutorial: Methods and tools, November 2016. DCII - Operations Research and Statistics Towards Integrated Analytics Research Cluster, SMU, November 30, 2016. [ slides (pdf) ]
Data mining includes a set of methods and tools heavily used in descriptive and predictive analytics. This tutorial will cover an introduction to data mining including the different types of problems addressed by data mining. We will introduce the data mining process and the core methods used by data mining professionals and highlight how they are related to the fields of statistics, optimization and machine learning. The tutorial will also review available tools and conclude with several examples using a set of representative tools.

[8] Probabilistic approach to association rule mining, May 2016. Seminar talk, IESEG School of Management, Lille, France, May 2016. [ slides (pdf) ]
Mining association rules is an important and widely applied technique for discovering patterns in transaction databases. However, the support-confidence framework has some often overlooked weaknesses. This talk will introduce a simple stochastic model and show how to apply it to simulate data for analyzing the behavior of confidence and other measures of interestingness (e.g., lift), to develop a new model-driven approach to mine rules based on the notion of NB-frequent itemsets, and to define a measure of interestingness which controls for spurious rules and has a strong foundation in statistical testing theory.

[9] Ordering objects: What heuristic should we use?, November 2015. 2015 INFORMS Annual Meeting, Philadelphia, PA, November 1-4, 2015. [ slides (pdf) ]
Seriation, i.e., finding a suitable linear order for a set of objects given data and a merit function, is a basic combinatorial optimization problem with applications in modern data analysis. Due to the combinatorial nature of the problem, most practical problems require heuristics. We have implemented over 20 different heuristics and more than 10 merit functions. We will discuss the different methods in this presentation and compare them empirically using datasets from several problem domains.

[10] Recommender systems: Harnessing the power of personalization, November 2015. Invited talk at the Southwest EDGe Analyst Community Meeting, Dallas, TX, November 18, 2015. [ slides (pdf) ]
This talk gives a short overview of recommender systems with examples.

[11] arules: Association rule mining with R, February 2015. Tutorial at R User Group Dallas Meeting, Dallas, TX, February 2015.
Mining association rules, a form of frequent pattern mining, is a widely used data mining technique. The idea is to discover interesting relations between variables in large databases. This talk will introduce the basic concepts behind association rule mining and how these concepts are implemented in the R package arules. We will talk about preparing data, mining rules and analyzing the results, including visualization. I will use example code throughout the presentation which can be easily adapted to data mining problems you might have in mind. As a case study, I will show how to mine a real data set containing data from the Stop-Question-and-Frisk program in New York City.
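The tutorial itself uses the R package arules (e.g., its apriori() function). As a language-neutral illustration of the underlying support-confidence computation, here is a tiny pure-Python miner for single-antecedent rules (all names here are illustrative, not the arules API):

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Enumerate 2-item association rules {a} -> {b} that meet the support
    and confidence thresholds -- the core computation behind apriori-style
    rule mining, without the candidate-pruning machinery."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    supp = lambda itemset: sum(itemset <= t for t in transactions) / n
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in (({a}, {b}), ({b}, {a})):
            s = supp(lhs | rhs)           # support of the whole rule
            if s >= min_support:
                conf = s / supp(lhs)      # confidence = s(X u Y) / s(X)
                if conf >= min_confidence:
                    rules.append((lhs, rhs, s, conf))
    return rules
```

Real miners handle longer itemsets and much larger data; the point is only to make the two thresholds concrete.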

[12] Cooperative data analysis in supply chains using selective information disclosure, January 2015. 2015 INFORMS Computing Society Conference, Richmond, VA, January 11-13, 2015. [ slides (pdf) ]
Many modern products (e.g., consumer electronics) consist of hundreds of complex parts sourced from a large number of suppliers. In such a setting, finding the source of certain properties, e.g., the source of defects in the final product, becomes increasingly difficult. Data analysis methods can be used on information shared in modern supply chains. However, some of this information may be confidential since it touches proprietary production processes or trade secrets. In this talk we explore the idea of selective information disclosure to address this issue.

[13] Cooperative data analysis in supply chains using selective information disclosure, November 2014. 2014 INFORMS Annual Meeting, San Francisco, CA, November 9-12, 2014. [ slides (pdf) ]
Many modern products (e.g., consumer electronics) consist of hundreds of complex parts sourced from a large number of suppliers. In such a setting, finding the source of certain properties, e.g., the source of defects in the final product, becomes increasingly difficult. Data analysis methods can be used on information shared in modern supply chains. However, some of this information may be confidential since it touches proprietary production processes or trade secrets. In this talk we explore the idea of selective information disclosure to address this issue.

[14] Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering, November 2013. Invited talk, Graduate Seminar, School of Industrial and Systems Engineering, University of Oklahoma. [ slides (pdf) ]
Cluster analysis tries to uncover structure in data by assigning each object in the data set to a group (called a cluster) so that objects from the same cluster are more similar to each other than to objects from other clusters. Exploring the cluster structure and assessing the quality of the cluster solution have been a research topic since the invention of cluster analysis. This is especially important since all popular cluster algorithms produce a clustering even for data without a "cluster" structure. Many visualization techniques for judging the quality of a clustering and for exploring the cluster structure have been developed, but they all suffer from certain restrictions. For example, dendrograms cannot be used for non-hierarchical partitions, silhouette plots provide only a diagnostic tool without the ability to explore structure, data dimensionality may render projection-based methods less useful, and graph-based representations hide the internal structure of clusters. In this talk we introduce a new visualization technique called dissimilarity plots which is based on solving the combinatorial optimization problem of seriation for (near) optimal cluster and object placement in matrix shading. Dissimilarity plots are not affected by data dimensionality, allow the user to directly judge cluster quality by visually analyzing the micro-structure within clusters, and make misspecification of the number of clusters instantly apparent. Dissimilarity plots are implemented in the R extension package seriation.

[15] A study of the efficiency and accuracy of data stream clustering for large data sets, October 2013. 2013 INFORMS Annual Meeting, Minneapolis Convention Center, October 6-9, 2013. [ slides (pdf) ]
Clustering large data sets is important for many data mining applications. Conventional clustering approaches (k-means, hierarchical clustering, etc.) typically do not scale well for very large data sets. We will investigate the use of data stream clustering algorithms as light-weight alternatives to conventional algorithms on large non-streaming data using several synthetic and real-world data sets.

[16] Data stream clustering of large non-streaming data sets, August 2012. 36th Annual Conference of the German Classification Society (GfKl 2012), Hildesheim, Germany, August 1--3, 2012. [ slides (pdf) ]
Identifying groups in large data sets is important for many machine learning and knowledge discovery applications. In recent years, data stream clustering algorithms have been proposed which can deal efficiently with potentially unbounded streams of data. Obviously, these algorithms can also be used for large non-streaming data sets and, as such, present light-weight alternatives to conventional algorithms. In this presentation we will compare the accuracy of several data stream mining algorithms (CluStream, DenStream, ClusTree) with BIRCH and k-means. For the comparison we will use the R extension package stream, which we are currently developing.
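The algorithms compared in the talk are far more elaborate, but the micro-cluster idea they share can be sketched in a few lines of Python: a single online pass that absorbs each point into a nearby micro-cluster summary or opens a new one (class and function names are illustrative):

```python
import math

class MicroCluster:
    # Summary statistics of the points absorbed so far: a count and a
    # linear sum, enough to compute the center cheaply.
    def __init__(self, point):
        self.n = 1
        self.lsum = list(point)

    def center(self):
        return [s / self.n for s in self.lsum]

    def absorb(self, point):
        self.n += 1
        self.lsum = [s + x for s, x in zip(self.lsum, point)]

def cluster_stream(points, radius=1.0):
    """Single-pass micro-clustering: absorb each point into the nearest
    micro-cluster if it lies within `radius`, otherwise open a new one."""
    mcs = []
    for p in points:
        best = min(mcs, key=lambda m: math.dist(m.center(), p), default=None)
        if best is not None and math.dist(best.center(), p) <= radius:
            best.absorb(p)
        else:
            mcs.append(MicroCluster(p))
    return mcs
```

Algorithms like CluStream add timestamps and fading to these summaries and recluster the micro-clusters offline (e.g., with k-means) to produce the final partition.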

[17] SOStream: self organizing density-based clustering over data stream, July 2012. International Conference on Machine Learning and Data Mining (MLDM'2012), Berlin, Germany, July 13--20, 2012. [ slides (pdf) ]
In this talk we discuss a data stream clustering algorithm called Self Organizing density-based clustering over data Stream (SOStream). This algorithm has several novel features. Instead of using a fixed, user-defined similarity threshold or a static grid, SOStream detects structure within fast evolving data streams by automatically adapting the threshold for density-based clustering. It also employs a novel cluster updating strategy which is inspired by competitive learning techniques developed for Self Organizing Maps (SOMs). In addition, SOStream has built-in online functionality to support advanced stream clustering operations including merging and fading, making SOStream completely online with no separate offline component. Experiments performed on the KDD Cup'99 and artificial data sets indicate that SOStream creates clusters of higher purity while having lower space and time requirements compared to previous stream clustering algorithms.

[18] Recommender systems: User-facing decision support systems, February 2012. Invited talk for EMIS 7357-Decision Support Systems, Southern Methodist University, Dallas, Texas, February 22, 2012. [ slides (pdf) ]
The world-wide web presents us with a multitude of choices through online shopping and many other online services like online radio and online TV. To be able to select appropriate products and services, users need user-facing decision support systems: recommender systems, which apply statistical and knowledge discovery techniques to facilitate the decision process. In this talk we will discuss recommendation strategies starting with content analysis. We discuss collaborative filtering and the idea behind the current state-of-the-art approaches based on latent factor analysis. The talk concludes with an overview of currently available open-source software and a short example.

[19] Introduction to the predictive model markup language, January 2012. Orange County R User Group, Webinar together with Ray DiGiacomo, Alex Guazzelli and Rajarshi Guha, January 24, 2012.
This one-hour webinar will provide an introduction to PMML (Predictive Model Markup Language). PMML is basically "XML for predictive models" and helps people deploy predictive models into a Business Intelligence environment without having to buy additional analytics software licenses. This webinar will be sponsored by The Orange County R User Group. The webinar's panelists will be Dr. Alex Guazzelli of Zementis, Dr. Rajarshi Guha of the National Institutes of Health, Dr. Michael Hahsler of Southern Methodist University and Mr. Ray DiGiacomo, Jr. of Lion Data Systems. In addition to providing an introduction to PMML, the webinar will also discuss the features of PMML's brand-new version (4.1) and each panelist's contribution to the PMML standard since its inception.

[20] Data stream modeling: Hurricane intensity prediction, December 2011. IBM Research Presentation, School of Engineering, Southern Methodist University, Dallas, Texas, December 6, 2011. [ slides (pdf) ]
This presentation highlights our research in the area of hurricane intensity prediction using a new data stream modeling technique called TRACDS.

[21] Recommender systems: From content to latent factor analysis, September 2011. CSE Colloquium, Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas, September 7, 2011. [ slides (pdf) ]
Digitization of content and digital content delivery have revolutionized the way we consume "information goods" (books, music, video, news, computer games, etc.). Online stores can efficiently offer millions of digital products and deliver them seamlessly into your home or to a mobile device. The key is to present the products to the consumer in a way which makes it easy for the consumer to find suitable products within such a large selection. In a brick-and-mortar store this is the job of the sales person. In online stores this is done by recommender systems which apply statistical and knowledge discovery techniques. In this talk we will discuss recommendation strategies starting with content analysis. We discuss collaborative filtering and the idea behind the current state-of-the-art approaches based on latent factor analysis. The talk concludes with an overview of currently available open-source software and a short example.
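As a sketch of the latent factor idea mentioned above (not any particular production system), the following Python code fits user and item factors with plain stochastic gradient descent, so that the dot product of a user factor and an item factor approximates the observed rating:

```python
import random

def factorize(ratings, n_users, n_items, k=2, steps=500, lr=0.05, reg=0.02):
    """Plain SGD matrix factorization: approximate rating r_ui by the dot
    product of user factor P[u] and item factor Q[i]."""
    rng = random.Random(42)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))
```

State-of-the-art systems add bias terms, implicit feedback, and temporal effects on top of this basic model.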

[22] Dissimilarity plots: A visual exploration tool for partitional clustering, June 2011. Invited talk, Best of the Journal of Computational and Graphical Statistics (JCGS) session, 42nd Symposium on the Interface, Cary, NC, June 1--3, 2011. [ slides (pdf) ]
For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this paper we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well-known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the microstructure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes mis-specification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality.

[23] Visualizing association rules in hierarchical groups, June 2011. 42nd Symposium on the Interface, Cary, NC, June 1--3, 2011. [ slides (pdf) ]
Association rule mining is one of the most popular data mining methods. However, mining association rules often results in a very large number of found rules, leaving the analyst with the task of going through all the rules and discovering interesting ones. Sifting manually through large sets of rules is time-consuming and strenuous. Visualization has a long history of making large amounts of data better accessible using techniques like selecting and zooming. However, most association rule visualization techniques still fall short when it comes to a large number of rules. In this talk we present a new interactive visualization technique which lets the user navigate through a hierarchy of groups of association rules. We demonstrate how this new visualization technique can be used to analyze large sets of association rules with examples from our implementation in the R package arulesViz.

[24] Analyzing incomplete biological pathways using network motifs, May 2011. Division of Biomedical Informatics Retreat, UT Southwestern Medical Center, Dallas, TX, May 6 and 12, 2011. [ slides (pdf) ]
Uncovering missing proteins in an incomplete biological pathway guides targeted therapy and drug design and discovery. A biological pathway is a sub-network of proteins interacting with each other to generate a certain biological effect. The pathway is incomplete if it has missing proteins, incorrect connections, or both. We define the pathway completion problem as the problem of retrieving a set of candidate proteins from a probabilistic protein-protein interaction (PPI) network and predicting their locations in a given incomplete pathway. In such a PPI network, nodes represent proteins, edges represent interactions among proteins, and weights are a measure of interaction quality. We propose the use of network motifs to solve the pathway completion problem. This problem is different from the known complex/pathway membership problem and our approach is innovative because it uses motifs. Using motifs, the pathway completion problem can be tackled, and the approach also has the potential to improve solutions to the pathway membership problem.

[25] Temporal structure learning for clustering massive data streams in real-time, April 2011. SIAM Conference on Data Mining (SDM11), Phoenix, AZ, April 28--30, 2011. [ slides (pdf) ]
In this talk we describe one of the first attempts to model the temporal structure of massive data streams in real-time using data stream clustering. Recently, many data stream clustering algorithms have been developed which efficiently find a partition of the data points in a data stream. However, these algorithms disregard the information represented by the temporal order of the data points in the stream, which for many applications is an important part of the data stream. We propose a new framework called Temporal Relationships Among Clusters for Data Streams (TRACDS) which makes it possible to learn the temporal structure while clustering a data stream. We identify, organize and describe the clustering operations which are used by state-of-the-art data stream clustering algorithms. Then we show that by defining a set of new operations to transform Markov Chains with states representing clusters dynamically, we can efficiently capture temporal ordering information. This framework allows us to preserve temporal relationships among clusters for any state-of-the-art data stream clustering algorithm with only minimal overhead. To investigate the usefulness of TRACDS, we evaluate the improvement of TRACDS over pure data stream clustering for anomaly detection using several synthetic and real-world data sets. The experiments show that TRACDS is able to considerably improve the results even if we introduce a high rate of incorrect time stamps, which is typical for real-world data streams.
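The essence of TRACDS, attaching a Markov chain whose states are the clusters, can be sketched as follows (a simplified illustration with hypothetical names; the real framework also updates the chain when clusters are created, merged, or faded by the clustering algorithm):

```python
from collections import defaultdict

class TransitionSketch:
    """Maintain a Markov chain over cluster labels: count transitions
    between the clusters of consecutive stream points."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.prev = None

    def observe(self, cluster):
        # Called with the cluster assignment of each arriving point.
        if self.prev is not None:
            self.counts[self.prev][cluster] += 1
        self.prev = cluster

    def transition_prob(self, a, b):
        # Estimated probability of moving from cluster a to cluster b.
        total = sum(self.counts[a].values())
        return self.counts[a][b] / total if total else 0.0
```

For anomaly detection, a point whose cluster transition has very low estimated probability can be flagged as unusual, which is the kind of use case evaluated in the talk.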

[26] Dissimilarity plots: A visual exploration tool for partitional clustering, April 2009. CSE Colloquium, Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas, April 3, 2009. [ slides (pdf) ]
Cluster analysis tries to uncover structure in data by assigning each object in the data set to a group (called a cluster) so that objects from the same cluster are more similar to each other than to objects from other clusters. Exploring the cluster structure and assessing the quality of the cluster solution have been a research topic since the invention of cluster analysis. This is especially important since all popular cluster algorithms produce a clustering even for data without a "cluster" structure. Many visualization techniques for judging the quality of a clustering and for exploring the cluster structure have been developed, but they all suffer from certain restrictions. For example, dendrograms cannot be used for non-hierarchical partitions, silhouette plots provide only a diagnostic tool without the ability to explore structure, data dimensionality may render projection-based methods less useful, and graph-based representations hide the internal structure of clusters. In this talk we introduce a new visualization technique called dissimilarity plots which uses dissimilarity matrix shading and seriation for (near) optimal cluster and object placement. Dissimilarity plots are not affected by data dimensionality, allow for directly judging cluster quality and visual analysis of the micro-structure within clusters, and make misspecification of the number of clusters instantly apparent. Dissimilarity plots are implemented in the R extension package seriation.

[27] A probabilistic approach to association rule mining, October 2008. CSE Colloquium, Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas, October 10, 2008. [ slides (pdf) ]
Mining association rules is an important and widely applied technique for discovering patterns in transaction databases. However, the support-confidence framework has some often overlooked weaknesses. For example, the thresholds are typically chosen arbitrarily by the analyst just to keep the set of found rules at a manageable size, confidence and other measures of interestingness are systematically influenced by support, and the risk of using `spurious rules' in an application is generally unknown and ignored. This talk will introduce a simple stochastic model and show how to apply it to simulate data for analyzing the behavior of confidence and other measures of interestingness (e.g., lift), to develop a new model-driven approach to mine rules based on the notion of NB-frequent itemsets, and to define a measure of interestingness which controls for spurious rules and has a strong foundation in statistical testing theory.
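The stochastic model in the talk is more refined, but the basic exercise of simulating association-free data and observing how confidence and lift behave can be reproduced with a few lines of Python (function names are illustrative):

```python
import random

def simulate_independent(n_trans, item_probs, seed=7):
    """Simulate transactions where each item occurs independently with its
    own probability -- i.e., data containing no real associations."""
    rng = random.Random(seed)
    return [{i for i, p in item_probs.items() if rng.random() < p}
            for _ in range(n_trans)]

def confidence_and_lift(transactions, lhs, rhs):
    # Confidence and lift for the single-item rule {lhs} -> {rhs}.
    n = len(transactions)
    s_l = sum(lhs in t for t in transactions) / n
    s_r = sum(rhs in t for t in transactions) / n
    s_lr = sum(lhs in t and rhs in t for t in transactions) / n
    conf = s_lr / s_l if s_l else 0.0
    return conf, (conf / s_r if s_r else 0.0)
```

On such data, confidence converges to the support of the consequent and lift fluctuates around 1, so any rule that clears an arbitrary threshold here is by construction spurious.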

[28] Generating top-N recommendations from binary profile data, July 2008. Faculty appointment talk (Berufungsvortrag), Wirtschaftsinformatik, WU Wien, July 16, 2008. [ slides (pdf) ]
Research on collaborative filtering recommender systems for binary data is limited. In this presentation we motivate why concentrating on binary data is important and we review how well known collaborative filtering methods perform using different binary data sets.

[29] Two applications of the TSP for data analysis, March 2007. 31st Annual Conference of the German Classification Society (GfKl 2007), Freiburg, March 7-9, 2007. [ slides (pdf) ]
The traveling salesperson problem is a well known and important combinatorial optimization problem. The goal is to find the shortest tour that visits each city in a given list exactly once. Even though simple to state, the TSP belongs to the class of NP-complete problems. The importance of the TSP arises from the variety of its applications. In addition to obvious ones for vehicle routing and computer wiring, there exist applications in data analysis which make it important for state-of-the-art data analysis environments to provide capabilities for solving TSPs. In this talk, we introduce the recently developed package TSP which provides a basic infrastructure for handling and solving the traveling salesperson problem within R, an open source environment for statistical computing and graphics. We demonstrate how the new package can be used for two data analysis applications, clustering and the reorganization of data matrices for visualization.

[30] TSP -- An R package for the traveling salesperson problem, December 2006. Research Seminar Statistical Computation, Wirtschaftsuniversität Wien, December 1, 2006. [ slides (pdf) ]
The traveling salesperson or salesman problem (TSP) is a well known and important combinatorial optimization problem. The goal is to find the shortest tour that visits each city in a given list exactly once and then returns to the starting city. Despite this simple problem statement, the TSP belongs to the class of NP-complete problems. The importance of the TSP arises from the variety of its applications. Apart from the obvious application for vehicle routing, many other applications, e.g., computer wiring, cutting wallpaper, job sequencing or several data visualization techniques, require the solution of a TSP. In this talk we introduce the R package TSP which provides a basic infrastructure for handling and solving the traveling salesperson problem. The package features informal S3 classes for specifying a TSP and its (possibly optimal) solution as well as several heuristics to find good solutions. In addition, it provides an interface to Concorde, one of the best exact TSP solvers currently available.
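For flavor, the nearest-neighbor construction heuristic, one simple heuristic of the kind the package provides, looks like this in Python (the package itself is written in R and also interfaces Concorde for exact solutions):

```python
import math

def nearest_neighbor_tour(cities, start=0):
    """Nearest-neighbor TSP heuristic: from the current city, always travel
    to the closest unvisited city; the tour closes back to the start."""
    unvisited = set(range(len(cities))) - {start}
    tour = [start]
    while unvisited:
        nxt = min(unvisited,
                  key=lambda j: math.dist(cities[tour[-1]], cities[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(cities, tour):
    # Closed tour length: include the edge back to the starting city.
    return sum(math.dist(cities[tour[k]], cities[tour[(k + 1) % len(tour)]])
               for k in range(len(tour)))
```

Construction heuristics like this give a quick feasible tour that improvement heuristics (2-opt, Or-opt, etc.) can then refine.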

[31] Market basket analysis with the statistical software R, October 2006. WU Competence Day, Wirtschaftsuniversität Wien, October 19, 2006. [ slides (pdf) ]
Market basket (or cross-category affinity) analysis refers to a set of methods for examining which products or categories from a retail assortment are purchased together in a single shopping trip. This contribution takes a closer look at exploratory market basket analysis, which aims at condensing and compactly representing the affinity relationships found in (usually very large) retail transaction data. With an enormous number of available extension packages, the freely available statistical software R is an ideal basis for carrying out such market basket analyses. The infrastructure for transaction data provided by the extension package arules offers a flexible foundation for market basket analysis. It supports the efficient representation, manipulation, and analysis of market basket data together with arbitrary additional information on products (for example, the assortment hierarchy) and on transactions (for example, revenue or contribution margin). The package is seamlessly integrated into R and thus allows the direct application of existing state-of-the-art methods for sampling, clustering, and visualization of market basket data. In addition, arules provides common algorithms for finding association rules and the data structures needed to analyze patterns. A selection of the most important functions is demonstrated using a real transaction data set from a grocery retailer.

[32] Probabilistic approaches in association analysis, May 2006. Habilitation lecture (Habilitationsvortrag), Wirtschaftsuniversität Wien, May 19, 2006. [ slides (pdf) ]
This talk briefly presents the foundations of association analysis with association rules (the support-confidence framework). After presenting the probabilistic interpretation of the support-confidence framework, known weaknesses and extensions (lift, the chi-squared test) are discussed. Finally, a stochastic independence model is introduced and three of its applications are discussed.

[33] An association rule mining infrastructure for the R data analysis toolbox, March 2006. 30th Annual Conference of the German Classification Society (GfKl 2006), Berlin, March 8-10, 2006. [ slides (pdf) ]
The free and extensible statistical computing environment R with its enormous number of extension packages already provides many state-of-the-art techniques for data analysis (advanced clustering, sampling and visualization). However, support for analyzing transaction data, e.g., mining association rules, a popular exploratory method which can be used, among other purposes, for uncovering cross-selling opportunities in market baskets, has not been available for R thus far. In this presentation we introduce the R extension package arules, an infrastructure for handling and mining transaction data. After an introduction to the infrastructure provided by the package, we use a market basket data set from a typical retailer to demonstrate the advantages of the seamless integration of arules into the R environment.

[34] Optimizing web sites for customer retention, November 2005. 2005 International Workshop on Customer Relationship Management: Data Mining Meets Marketing November 18th & 19th, 2005, New York City, USA. [ slides (pdf) ]
With customer relationship management (CRM), companies move away from a mainly product-centered view to a customer-centered view. As a result of this change, effectively managing contact with customers across different channels is one of the key success factors in today's business world. In many industries, company Web sites have evolved into an extremely important channel through which customers can be attracted and retained. To analyze and optimize this channel, accurate models of how customers browse through the Web site and what information within the site they repeatedly view are crucial. Typically, data mining techniques are used for this purpose. However, numerous models developed in marketing research for traditional channels already exist which could also prove valuable for understanding this new channel. In this paper we propose applying an extension of the Logarithmic Series Distribution (LSD) model to repeat-usage of Web-based information and thus to analyze and optimize a Web site's capability to support one goal of CRM: retaining customers. As an example, we use the university's blended learning web portal with over a thousand learning resources to demonstrate how the model can be used to evaluate and improve the Web site's effectiveness.

[35] Implications of probabilistic data modeling for rule mining, March 2005. 29th Annual Conference of the German Classification Society (GfKl 2005), March 9-11, 2005, Magdeburg, Germany. [ slides (pdf) ]
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine associations are discussed in great detail. In this talk we investigate properties of transaction data sets from a probabilistic point of view. We present a simple probabilistic framework for transaction data and its implementation using the R statistical computing environment. The framework can be used to simulate transaction data when no associations are present. We use such data to explore the ability of confidence and lift, two popular interest measures used for rule mining, to filter noise. Based on the framework we develop the measure hyperlift, and we compare this new measure to lift using simulated data and a real-world grocery database.

[36] Discussion of a large-scale open source data collection methodology, January 2005. 38th Hawaii International Conference on System Sciences (HICSS-38), January 3-6, 2005, Hilton Waikoloa Village, Big Island, Hawaii. [ slides (pdf) ]
In this talk we discuss in detail a possible methodology for collecting repository data on a large number of open source software projects from a single project hosting and community site. The process of data retrieval is described along with the metrics that can be computed and used for further analyses. Example research areas to be addressed with the available data and first results are given. Then, both advantages and disadvantages of the proposed methodology are discussed together with implications for future approaches.

[37] ePubWU: Experiences with a full-text platform at the Wirtschaftsuniversität Wien, September 2004. 28th Austrian Librarians' Day (Österreichischer Bibliothekartag 2004), Linz, Austria. [ slides (pdf) ]
ePubWU is an electronic platform for scholarly publications of the Wirtschaftsuniversität Wien (WU), where research-related publications of the WU are made accessible in full text via the WWW. ePubWU is operated as a joint project of the university library of the Wirtschaftsuniversität Wien and the Department of Information Business. Currently, two types of publications are collected in ePubWU: working papers and doctoral theses. This talk presents experiences from the project's more than two years of operation.

© Michael Hahsler

This file has been generated by bibtex2html