[1] What is AI? How did we get here? Where will it lead us?, April 2023. Invited talk: CBRE – Texas Multifamily Exclusive Client Forum, April 19, 2023. [ slides (pdf) ]
AI will have a profound impact on the way we live and work. This talk will introduce what AI is, how it is currently used and what we can expect in the near future.
[2] Data science research software for experiential learning, February 2019. EMIS Seminar, February 5, 2019. [ slides (pdf) ]
This talk will cover several examples of research software for data science and machine learning that was co-developed with students as part of their learning experience. I have been developing widely used R software packages for more than 15 years as part of my research and teaching. These packages cover areas including association rule mining, combinatorial optimization, data stream mining, recommender systems, and reinforcement learning. All these areas are highly connected and often just use different terms to describe very similar concepts and algorithms. In this talk, I will focus on two examples. I recently started to work on combining reinforcement learning with predictive modeling and will introduce the concepts used and point to some interesting open questions. A second example comes from association rule learning, where combinatorial optimization in the form of clustering and ordering is used to organize large sets of associations. All developed software is available on GitHub and distributed via CRAN. A complete list can be found at https://michael.hahsler.net/#Software
[3] Electronic health record analytics: The case of optimal diabetes screening, December 2018. EMIS Industry Advisory Board and Outreach Meeting, December 3, 2018. [ slides (pdf) ]
The prevalence of diabetes has reached epidemic proportions in the US, with costs surpassing $300 billion annually. We will discuss a new framework that integrates a disease progression model (a hidden Markov model), a predictive model built on electronic health record data, and a partially observable Markov decision process (POMDP) to derive an optimal diabetes screening strategy. The models are developed using data from the Parkland Health & Hospital System. A comparison with current screening guidelines shows that the new approach provides similar benefits while reducing the number of required screenings.
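As a small, hedged illustration of the POMDP building block only (a toy problem, not the diabetes model from the talk), the R package pomdp can solve small models such as the classic Tiger example shipped with the package:

    library(pomdp)
    data("Tiger")                 # classic toy POMDP shipped with the package
    sol <- solve_POMDP(Tiger)     # compute an (approximately) optimal policy
    policy(sol)                   # inspect the resulting policy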
[4] The impact of information sharing on organ transplantation: A simulation model, November 2018. 2018 INFORMS Annual Meeting, Phoenix, AZ, November 3, 2018. [ slides (pdf) ]
Recently, several changes to the kidney allocation rules were approved, and modern IT solutions are being developed to improve kidney assignment. We present a flexible simulation framework that considers the effects of several important factors in organ transplantation. We use the model to investigate how optimal individual-level center listing and kidney acceptance decisions are affected by differences in supply due to different allocation rules. At the macro level, we evaluate the potential effect of information sharing on social welfare and organ discard rates.
[5] Probabilistic approaches to mine association rules, October 2018. Department Seminar, Department of Statistics and Actuarial Sciences, University of Waterloo, Canada, October 2018. [ slides (pdf) ]
Mining association rules is an important and widely applied data mining technique for discovering patterns in large datasets. However, the support-confidence framework it relies on has some often-overlooked weaknesses. This talk introduces a simple stochastic model and shows how it can be used in association rule mining. We apply the model to simulate data for analyzing the behavior and shortcomings of confidence and other measures of interestingness (e.g., lift). Based on these findings, we develop a new model-driven approach to mine rules based on the notion of NB-frequent itemsets, and we define a measure of interestingness which controls for spurious rules and has a strong foundation in statistical testing theory.
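A minimal sketch of mining NB-frequent itemsets, assuming the companion R package arulesNBMiner and its bundled example data (parameter names follow that package's documentation):

    library(arules)
    library(arulesNBMiner)
    data("Agrawal")    # synthetic transaction data shipped with arulesNBMiner
    # estimate the NB model parameters from the data, then mine itemsets
    param <- NBMinerParameters(Agrawal$db, pi = 0.99, theta = 0.5, maxlen = 5)
    itemsets <- NBMiner(Agrawal$db, parameter = param)
    inspect(head(itemsets))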
[6] Electronic health record analytics: The case of optimal diabetes screening, April 2018. Artificial Intelligence in Medicine Seminar Series, Division of Medical Physics and Engineering, UT Southwestern, April 13, 2018. [ slides (pdf) ]
This talk discusses how to learn an optimal diabetes screening strategy derived from electronic health record data in a setting with resource constraints.
[7] Recommender systems: Harnessing the power of personalization, February 2017. Curricular Recommender System Working Group, SMU, February 17, 2017.
This talk gives a short overview of recommender systems with examples.
[8] Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering, February 2017. Invited talk, IE Department Seminar, Department of Industrial, Manufacturing, & Systems Engineering, University of Texas at Arlington. [ slides (pdf) ]
Cluster analysis tries to uncover structure in data by assigning each object in the data set to a group (called a cluster) so that objects from the same cluster are more similar to each other than to objects from other clusters. Exploring the cluster structure and assessing the quality of a cluster solution have been research topics since the invention of cluster analysis. This is especially important since all popular cluster algorithms produce a clustering even for data without any "cluster" structure. Many visualization techniques for judging the quality of a clustering and for exploring the cluster structure have been developed, but they all suffer from certain restrictions. For example, dendrograms cannot be used for non-hierarchical partitions, silhouette plots provide only a diagnostic tool without the ability to explore structure, data dimensionality may render projection-based methods less useful, and graph-based representations hide the internal structure of clusters. In this talk we introduce a new visualization technique called dissimilarity plots, which is based on solving the combinatorial optimization problem of seriation for (near) optimal cluster and object placement in matrix shading. Dissimilarity plots are not affected by data dimensionality, allow the user to directly judge cluster quality by visually analyzing the micro-structure within clusters, and make misspecification of the number of clusters instantly apparent. Dissimilarity plots are implemented in the R extension package seriation.
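A brief, hedged illustration of dissimilarity plots with the seriation package (the ruspini data from the cluster package is used here purely as an example):

    library(seriation)
    data("ruspini", package = "cluster")   # small 2-D data with 4 clear clusters
    d  <- dist(ruspini)
    cl <- kmeans(ruspini, centers = 4)$cluster
    dissplot(d, labels = cl)               # shaded, seriated dissimilarity matrix
    # misspecifying the number of clusters shows up directly in the block structure
    dissplot(d, labels = kmeans(ruspini, centers = 7)$cluster)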
[9] Predictive models for making patient screening decisions, January 2017. 2017 INFORMS Computing Society Conference, Austin, TX, January 16, 2017.
A critical dilemma that clinicians face is when and how often to screen patients who may suffer from a disease. We investigate the use of predictive modeling to develop optimal screening conditions and to assist with clinical decision-making when large amounts of information are missing.
[10] Data mining tutorial: Methods and tools, November 2016. DCII - Operations Research and Statistics Towards Integrated Analytics Research Cluster, SMU, November 30, 2016. [ slides (pdf) ]
Data mining includes a set of methods and tools heavily used in descriptive and predictive analytics. This tutorial will give an introduction to data mining, including the different types of problems it addresses. We will introduce the data mining process and the core methods used by data mining professionals and highlight how they are related to the fields of statistics, optimization, and machine learning. The tutorial will also review available tools and conclude with several examples using a set of representative tools.
[11] Sequential aggregation-disaggregation optimization methods for data stream mining, November 2016. 2016 INFORMS Annual Meeting, November 2016.
Clustering-based iterative algorithms to solve certain large optimization problems have been proposed in the past. These algorithms start by aggregating the original data, solve the problem on the aggregated data, and then in subsequent steps gradually disaggregate it. In this contribution, we investigate the application of aggregation-disaggregation to data streams, where the disaggregation steps cannot be explicitly performed on past data but have to be performed sequentially on new data.
[12] Predictive models for making patient screening decisions, November 2016. 2016 INFORMS Annual Meeting, November 2016.
A critical dilemma that clinicians face is when and how often to screen patients who may suffer from a disease. The stakes are heightened in the case of chronic diseases that impose a substantial cost burden. We investigate the use of predictive modeling to develop optimal screening conditions and assist with clinical decision-making. We use electronic health data from a major U.S. hospital and apply our models in the context of screening patients for type-2 diabetes, one of the most prevalent diseases in the U.S. and worldwide.
[13] Grouping association rules using lift, November 2016. 11th INFORMS Workshop on Data Mining and Decision Analytics, November 12, 2016. [ slides (pdf) ]
Association rule mining is a well-established and popular data mining method for finding local dependencies between items in large transaction databases. However, a practical drawback of mining and efficiently using association rules is that the set of rules returned by the mining algorithm is typically too large to be used directly. Clustering association rules into a small number of meaningful groups would be valuable for experts who need to manually inspect the rules, for visualization, and as the input for other applications. Interestingly, clustering is not widely used as a standard method to summarize large sets of associations. In fact, it performs poorly due to high dimensionality, the inherent extreme data sparseness, and the dominant frequent itemset structure reflected in sets of association rules. In this paper, we review association rule clustering and its shortcomings. We then propose a simple approach based on grouping columns in a lift matrix and give an example to illustrate its usefulness.
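A hedged sketch of the underlying idea, applied here to items rather than full rules, using the Groceries data shipped with arules:

    library(arules)
    data("Groceries")
    # pairwise lift between the most frequent items
    lmat <- crossTable(Groceries, measure = "lift", sort = TRUE)[1:30, 1:30]
    lmat[is.na(lmat)] <- 1                 # treat undefined pairs as independent
    # group items by clustering the columns of the lift matrix
    hc <- hclust(dist(t(lmat)))
    split(colnames(lmat), cutree(hc, k = 5))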
[14] Probabilistic approach to association rule mining, May 2016. Seminar talk, IESEG School of Management, Lille, France, May 2016. [ slides (pdf) ]
Mining association rules is an important and widely applied technique for discovering patterns in transaction databases. However, the support-confidence framework has some often overlooked weaknesses. This talk will introduce a simple stochastic model and show how to apply it to simulate data for analyzing the behavior of confidence and other measures of interestingness (e.g., lift), develop a new model-driven approach to mine rules based on the notion of NB-frequent itemsets, and define a measure of interestingness which controls for spurious rules and has a strong foundation in statistical testing theory.
[15] Recommender systems: Harnessing the power of personalization, November 2015. Invited talk at the Southwest EDGe Analyst Community Meeting, Dallas, TX, November 18, 2015. [ slides (pdf) ]
This talk gives a short overview of recommender systems with examples.
[16] Ordering objects: What heuristic should we use?, November 2015. 2015 INFORMS Annual Meeting, Philadelphia, PA, November 1-4, 2015. [ slides (pdf) ]
Seriation, i.e., finding a suitable linear order for a set of objects given data and a merit function, is a basic combinatorial optimization problem with applications in modern data analysis. Due to the combinatorial nature of the problem, most practical instances require heuristics. We have implemented over 20 different heuristics and more than 10 merit functions. We will discuss the different methods in this presentation and compare them empirically using datasets from several problem domains.
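A small, hedged example with the seriation package, listing the implemented heuristics and evaluating one order under several merit functions:

    library(seriation)
    data("iris")
    d <- dist(as.matrix(iris[, 1:4]))
    list_seriation_methods("dist")      # the available seriation heuristics
    o <- seriate(d, method = "TSP")     # apply one heuristic
    criterion(d, o)                     # merit functions for the found order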
[17] arules: Association rule mining with R, February 2015. Tutorial at R User Group Dallas Meeting, Dallas, TX, February 2015.
Mining association rules, a form of frequent pattern mining, is a widely used data mining technique. The idea is to discover interesting relations between variables in large databases. This talk will introduce the basic concepts behind association rule mining and how these concepts are implemented in the R package arules. We will talk about preparing data, mining rules and analyzing the results, including visualization. I will use example code throughout the presentation which can be easily adapted to data mining problems you might have in mind. As a case study, I will show how to mine a real data set containing data from the Stop-Question-and-Frisk program in New York City.
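A minimal, hedged example of the workflow (shown here on the Groceries data shipped with arules rather than the Stop-Question-and-Frisk data from the talk):

    library(arules)
    data("Groceries")                           # example transaction data
    summary(Groceries)                          # inspect the prepared data
    rules <- apriori(Groceries,
                     parameter = list(support = 0.001, confidence = 0.5))
    inspect(head(rules, n = 5, by = "lift"))    # top 5 rules by lift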
[18] Cooperative data analysis in supply chains using selective information disclosure, January 2015. 2015 INFORMS Computing Society Conference, Richmond, VA, January 11-13, 2015.
Many modern products (e.g., consumer electronics) consist of hundreds of complex parts sourced from a large number of suppliers. In such a setting, finding the source of certain properties, e.g., the source of defects in the final product, becomes increasingly difficult. Data analysis methods can be used on information shared in modern supply chains. However, some information may be confidential since it touches proprietary production processes or trade secrets. In this talk we explore the idea of selective information disclosure to address this issue.
[19] Cooperative data analysis in supply chains using selective information disclosure, November 2014. 2014 INFORMS Annual Meeting, San Francisco, CA, November 9-12, 2014.
Many modern products (e.g., consumer electronics) consist of hundreds of complex parts sourced from a large number of suppliers. In such a setting, finding the source of certain properties, e.g., the source of defects in the final product, becomes increasingly difficult. Data analysis methods can be used on information shared in modern supply chains. However, some information may be confidential since it touches proprietary production processes or trade secrets. In this talk we explore the idea of selective information disclosure to address this issue.
[20] Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering, November 2013. Invited talk, Graduate Seminar, School of Industrial and Systems Engineering, University of Oklahoma.
Cluster analysis tries to uncover structure in data by assigning each object in the data set to a group (called a cluster) so that objects from the same cluster are more similar to each other than to objects from other clusters. Exploring the cluster structure and assessing the quality of a cluster solution have been research topics since the invention of cluster analysis. This is especially important since all popular cluster algorithms produce a clustering even for data without any "cluster" structure. Many visualization techniques for judging the quality of a clustering and for exploring the cluster structure have been developed, but they all suffer from certain restrictions. For example, dendrograms cannot be used for non-hierarchical partitions, silhouette plots provide only a diagnostic tool without the ability to explore structure, data dimensionality may render projection-based methods less useful, and graph-based representations hide the internal structure of clusters. In this talk we introduce a new visualization technique called dissimilarity plots, which is based on solving the combinatorial optimization problem of seriation for (near) optimal cluster and object placement in matrix shading. Dissimilarity plots are not affected by data dimensionality, allow the user to directly judge cluster quality by visually analyzing the micro-structure within clusters, and make misspecification of the number of clusters instantly apparent. Dissimilarity plots are implemented in the R extension package seriation.
[21] A study of the efficiency and accuracy of data stream clustering for large data sets, October 2013. 2013 INFORMS Annual Meeting, Minneapolis Convention Center, October 6-9, 2013.
Clustering large data sets is important for many data mining applications. Conventional clustering approaches (k-means, hierarchical clustering, etc.) typically do not scale well to very large data sets. We will investigate the use of data stream clustering algorithms as lightweight alternatives to conventional algorithms on large non-streaming data using several synthetic and real-world data sets.
[22] Data stream clustering of large non-streaming data sets, August 2012. 36th Annual Conference of the German Classification Society (GfKl 2012), Hildesheim, Germany, August 1--3, 2012.
Identifying groups in large data sets is important for many machine learning and knowledge discovery applications. In recent years, data stream clustering algorithms have been proposed which can deal efficiently with potentially unbounded streams of data. Obviously, these algorithms can also be used for large non-streaming data sets and, as such, present lightweight alternatives to conventional algorithms. In this presentation we will compare the accuracy of several data stream mining algorithms (CluStream, DenStream, ClusTree) with BIRCH and k-means. For the comparison we will use the R extension package stream, which we are currently developing.
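A short, hedged illustration of the stream package's interface (D-Stream is used here as a stand-in clusterer; function names such as evaluate_static() may differ between package versions):

    library(stream)
    dsd <- DSD_Gaussians(k = 3, d = 2)     # simulated stream with 3 clusters
    dsc <- DSC_DStream(gridsize = 0.1)     # a data stream clustering algorithm
    update(dsc, dsd, n = 1000)             # cluster 1000 points from the stream
    # older package versions use evaluate() instead of evaluate_static()
    evaluate_static(dsc, dsd, measure = "purity", n = 500)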
[23] SOStream: self organizing density-based clustering over data stream, July 2012. International Conference on Machine Learning and Data Mining (MLDM'2012), Berlin, Germany, July 13--20, 2012.
In this talk we discuss a data stream clustering algorithm called Self Organizing density-based clustering over data Stream (SOStream). This algorithm has several novel features. Instead of using a fixed, user-defined similarity threshold or a static grid, SOStream detects structure within fast evolving data streams by automatically adapting the threshold for density-based clustering. It also employs a novel cluster updating strategy which is inspired by competitive learning techniques developed for Self Organizing Maps (SOMs). In addition, SOStream has built-in online functionality to support advanced stream clustering operations including merging and fading. This makes SOStream completely online with no separate offline component. Experiments performed on the KDD Cup'99 and artificial datasets indicate that SOStream is effective at creating clusters of higher purity while having lower space and time requirements than previous stream clustering algorithms.
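A highly simplified, hedged sketch of the winner-update idea only (an illustration, not the published algorithm; the name sostream_step and the threshold adaptation rule are made up here):

    # micro-clusters are rows of `centers`, with per-cluster adaptive thresholds
    sostream_step <- function(centers, radius, x, alpha = 0.3, r0 = 1) {
      if (nrow(centers) == 0)
        return(list(centers = rbind(centers, x), radius = c(radius, r0)))
      d <- sqrt(rowSums(sweep(centers, 2, x)^2))    # distances to all clusters
      win <- which.min(d)
      if (d[win] <= radius[win]) {
        # SOM-inspired update: move the winner toward the new point
        centers[win, ] <- centers[win, ] + alpha * (x - centers[win, ])
        # adapt the threshold using the distance to the nearest other cluster
        if (length(d) > 1) radius[win] <- min(d[-win]) / 2
      } else {                                      # start a new micro-cluster
        centers <- rbind(centers, x); radius <- c(radius, r0)
      }
      list(centers = centers, radius = radius)
    }

    st <- list(centers = matrix(numeric(0), ncol = 2), radius = numeric(0))
    for (i in 1:200) st <- sostream_step(st$centers, st$radius, rnorm(2))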
[24] Recommender systems: User-facing decision support systems, February 2012. Invited talk for EMIS 7357-Decision Support Systems, Southern Methodist University, Dallas, Texas, February 22, 2012.
The World Wide Web presents us with a multitude of choices for online shopping and many other online services like online radio and online TV. To be able to select appropriate products and services, users need user-facing decision support systems: recommender systems, which apply statistical and knowledge discovery techniques to facilitate the decision process. In this talk we will discuss recommendation strategies, starting with content analysis. We then discuss collaborative filtering and the idea behind the current state-of-the-art approaches based on latent factor analysis. The talk concludes with an overview of currently available open-source software and a short example.
[25] Introduction to the predictive model markup language, January 2012. Orange County R User Group, Webinar together with Ray DiGiacomo, Alex Guazzelli and Rajarshi Guha, January 24, 2012.
This one-hour webinar will provide an introduction to PMML (Predictive Model Markup Language). PMML is basically "XML for predictive models" and helps people deploy predictive models into a business intelligence environment without having to buy additional analytics software licenses. The webinar is sponsored by the Orange County R User Group. The webinar's panelists will be Dr. Alex Guazzelli of Zementis, Dr. Rajarshi Guha of the National Institutes of Health, Dr. Michael Hahsler of Southern Methodist University, and Mr. Ray DiGiacomo, Jr. of Lion Data Systems. In addition to providing an introduction to PMML, the webinar will also discuss the features of PMML's brand-new version (4.1) and each panelist's contributions to the PMML standard since its inception.
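A minimal, hedged sketch of exporting a fitted R model as PMML with the pmml package (the model and file name are illustrative):

    library(pmml)     # uses the XML package internally
    library(XML)
    fit <- lm(dist ~ speed, data = cars)       # any model type supported by pmml
    saveXML(pmml(fit), file = "lm_model.xml")  # a deployable PMML document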
[26] Data stream modeling: Hurricane intensity prediction, December 2011. IBM Research Presentation, School of Engineering, Southern Methodist University, Dallas, Texas, December 6, 2011. [ slides (pdf) ]
This presentation highlights our research in the area of hurricane intensity prediction using a new data stream modeling technique called TRACDS.
[27] Recommender systems: From content to latent factor analysis, September 2011. CSE Colloquium, Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas, September 7, 2011.
Digitization of content and digital content delivery have revolutionized the way we consume "information goods" (books, music, video, news, computer games, etc.). Online stores can efficiently offer millions of digital products and deliver them seamlessly into your home or to a mobile device. The key is to present the products to the consumer in a way which makes it easy to find suitable products within such a large selection. In a brick-and-mortar store this is the job of the salesperson. In online stores it is done by recommender systems, which apply statistical and knowledge discovery techniques. In this talk we will discuss recommendation strategies, starting with content analysis. We then discuss collaborative filtering and the idea behind the current state-of-the-art approaches based on latent factor analysis. The talk concludes with an overview of currently available open-source software and a short example.
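For the software part, a short hedged example using the recommenderlab package and its bundled MovieLense ratings:

    library(recommenderlab)
    data("MovieLense")                                      # user-item ratings
    rec <- Recommender(MovieLense[1:500], method = "UBCF")  # user-based CF
    pre <- predict(rec, MovieLense[501:502], n = 5)         # top-5 lists
    as(pre, "list")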
[28] Visualizing association rules in hierarchical groups, June 2011. 42nd Symposium on the Interface, Cary, NC, June 1--3, 2011.
Association rule mining is one of the most popular data mining methods. However, mining association rules often results in a very large number of found rules, leaving the analyst with the task of going through all the rules to discover interesting ones. Sifting manually through large sets of rules is time-consuming and strenuous. Visualization has a long history of making large amounts of data better accessible using techniques like selecting and zooming. However, most association rule visualization techniques still fall short when it comes to a large number of rules. In this talk we present a new interactive visualization technique which lets the user navigate through a hierarchy of groups of association rules. We demonstrate how this new visualization technique can be used to analyze large sets of association rules with examples from our implementation in the R package arulesViz.
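A hedged sketch with arulesViz (the interactive interface differs between package versions; newer versions select it via an engine argument):

    library(arulesViz)
    data("Groceries")
    rules <- apriori(Groceries,
                     parameter = list(support = 0.001, confidence = 0.5))
    plot(rules, method = "grouped")   # grouped matrix of antecedent groups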
[29] Dissimilarity plots: A visual exploration tool for partitional clustering, June 2011. Invited talk, Best of Journal of Computational and Graphical Statistics (JCGS) session, 42nd Symposium on the Interface, Cary, NC, June 1--3, 2011. [ slides (pdf) ]
For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this paper we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well-known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the microstructure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes mis-specification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality.
[30] Analyzing incomplete biological pathways using network motifs, May 2011. Division of Biomedical Informatics Retreat, UT Southwestern Medical Center, Dallas, TX, May 6 and 12, 2011.
Uncovering missing proteins in an incomplete biological pathway guides targeted therapy as well as drug design and discovery. A biological pathway is a sub-network of proteins interacting with each other to generate a certain biological effect. A pathway is incomplete if it has missing proteins, incorrect connections, or both. We define the pathway completion problem as the problem of retrieving a set of candidate proteins from a probabilistic protein-protein interaction (PPI) network and predicting their locations in a given incomplete pathway. In such a PPI network, nodes represent proteins, edges represent interactions among proteins, and weights are a measure of interaction quality. We propose the use of network motifs to solve the pathway completion problem. This problem is different from the known complex/pathway membership problem, and our approach is innovative because it uses motifs. Motifs not only allow us to tackle the completion problem but also have the potential to improve solutions to the membership problem.
[31] Temporal structure learning for clustering massive data streams in real-time, April 2011. SIAM Conference on Data Mining (SDM11), Phoenix, AZ, April 28--30, 2011.
In this talk we describe one of the first attempts to model the temporal structure of massive data streams in real time using data stream clustering. Recently, many data stream clustering algorithms have been developed which efficiently find a partition of the data points in a data stream. However, these algorithms disregard the information represented by the temporal order of the data points in the stream, which for many applications is an important part of the data stream. We propose a new framework called Temporal Relationships Among Clusters for Data Streams (TRACDS) which allows learning the temporal structure while clustering a data stream. We identify, organize and describe the clustering operations which are used by state-of-the-art data stream clustering algorithms. Then we show that by defining a set of new operations to dynamically transform Markov chains with states representing clusters, we can efficiently capture temporal ordering information. This framework allows us to preserve temporal relationships among clusters for any state-of-the-art data stream clustering algorithm with only minimal overhead. To investigate the usefulness of TRACDS, we evaluate the improvement of TRACDS over pure data stream clustering for anomaly detection using several synthetic and real-world data sets. The experiments show that TRACDS is able to considerably improve the results even if we introduce a high rate of incorrect timestamps, which is typical for real-world data streams.
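A minimal, hedged sketch of the bookkeeping idea only (not the full framework; related ideas are implemented in the R package rEMM): alongside clustering, maintain a Markov chain whose states are clusters by counting transitions between consecutive cluster assignments.

    update_transitions <- function(counts, prev, cur) {
      counts[prev, cur] <- counts[prev, cur] + 1   # one observed transition
      counts
    }

    k <- 3                                        # clusters (fixed for the sketch)
    counts <- matrix(0, k, k)
    assign <- sample(1:k, 100, replace = TRUE)    # stand-in for streamed assignments
    for (i in 2:length(assign))
      counts <- update_transitions(counts, assign[i - 1], assign[i])
    prob <- counts / pmax(rowSums(counts), 1)     # transition probability estimates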
[32] Dissimilarity plots: A visual exploration tool for partitional clustering, April 2009. CSE Colloquium, Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas, April 3, 2009.
Cluster analysis tries to uncover structure in data by assigning each object in the data set to a group (called a cluster) so that objects from the same cluster are more similar to each other than to objects from other clusters. Exploring the cluster structure and assessing the quality of a cluster solution have been research topics since the invention of cluster analysis. This is especially important since all popular cluster algorithms produce a clustering even for data without any "cluster" structure. Many visualization techniques for judging the quality of a clustering and for exploring the cluster structure have been developed, but they all suffer from certain restrictions. For example, dendrograms cannot be used for non-hierarchical partitions, silhouette plots provide only a diagnostic tool without the ability to explore structure, data dimensionality may render projection-based methods less useful, and graph-based representations hide the internal structure of clusters. In this talk we introduce a new visualization technique called dissimilarity plots which uses dissimilarity matrix shading and seriation for (near) optimal cluster and object placement. Dissimilarity plots are not affected by data dimensionality, allow for directly judging cluster quality and visually analyzing the micro-structure within clusters, and make misspecification of the number of clusters instantly apparent. Dissimilarity plots are implemented in the R extension package seriation.
[33] A probabilistic approach to association rule mining, October 2008. CSE Colloquium, Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas, October 10, 2008.
Mining association rules is an important and widely applied technique for discovering patterns in transaction databases. However, the support-confidence framework has some often-overlooked weaknesses. For example, the thresholds are typically chosen arbitrarily by the analyst just to keep the set of found rules at a manageable size, confidence and other measures of interestingness are systematically influenced by support, and the risk of using 'spurious rules' in an application is generally unknown and ignored. This talk will introduce a simple stochastic model and show how to apply it to simulate data for analyzing the behavior of confidence and other measures of interestingness (e.g., lift), to develop a new model-driven approach to mine rules based on the notion of NB-frequent itemsets, and to define a measure of interestingness which controls for spurious rules and has a strong foundation in statistical testing theory.
[34] Generating top-N recommendations from binary profile data, July 2008. Faculty job talk (Berufungsvortrag), Wirtschaftsinformatik, WU Wien, July 16, 2008. [ slides (pdf) ]
Research on collaborative filtering recommender systems for binary data is limited. In this presentation we motivate why concentrating on binary data is important, and we review how well-known collaborative filtering methods perform on different binary data sets.
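A hedged toy sketch of handling binary profile data in recommenderlab (the random data is illustrative only):

    library(recommenderlab)
    set.seed(1)
    m <- matrix(sample(c(0, 1), 200, replace = TRUE, prob = c(0.8, 0.2)),
                nrow = 20, dimnames = list(NULL, paste0("item", 1:10)))
    b <- as(m, "binaryRatingMatrix")            # binary user profiles
    rec <- Recommender(b, method = "IBCF")      # item-based CF on binary data
    as(predict(rec, b[1:2], n = 3), "list")     # top-3 recommendations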
[35] Two applications of the TSP for data analysis, March 2007. 31st Annual Conference of the German Classification Society (GfKl 2007), Freiburg, March 7-9, 2007.
The traveling salesperson problem (TSP) is a well-known and important combinatorial optimization problem. The goal is to find the shortest tour that visits each city in a given list exactly once. Even though it is simple to state, the TSP belongs to the class of NP-complete problems. The importance of the TSP arises from the variety of its applications. In addition to obvious ones such as vehicle routing and computer wiring, there exist applications in data analysis which make it important for state-of-the-art data analysis environments to provide capabilities for solving TSPs. In this talk, we introduce the recently developed package TSP which provides a basic infrastructure for handling and solving the traveling salesperson problem within R, an open-source environment for statistical computing and graphics. We demonstrate how the new package can be used for two data analysis applications: clustering and the reorganization of data matrices for visualization.
[36] TSP -- An R package for the traveling salesperson problem, December 2006. Research Seminar Statistical Computation, Wirtschaftsuniversität Wien, December 1, 2006. [ slides (pdf) ]
The traveling salesperson or salesman problem (TSP) is a well-known and important combinatorial optimization problem. The goal is to find the shortest tour that visits each city in a given list exactly once and then returns to the starting city. Despite this simple problem statement, the TSP belongs to the class of NP-complete problems. The importance of the TSP arises from the variety of its applications. Apart from the obvious application in vehicle routing, many other applications, e.g., computer wiring, cutting wallpaper, job sequencing or several data visualization techniques, require the solution of a TSP. In this talk we introduce the R package TSP which provides a basic infrastructure for handling and solving the traveling salesperson problem. The package features informal S3 classes for specifying a TSP and its (possibly optimal) solution as well as several heuristics to find good solutions. In addition, it provides an interface to Concorde, one of the best exact TSP solvers currently available.
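A brief, hedged example with the TSP package and its bundled USCA50 distance data:

    library(TSP)
    data("USCA50")                    # distances between 50 US/Canadian cities
    tour <- solve_TSP(USCA50, method = "nearest_insertion")   # a heuristic
    tour_length(tour)
    ## with Concorde installed, an exact solution:
    ## tour <- solve_TSP(USCA50, method = "concorde")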
[37] Warenkorbanalyse mit Hilfe der Statistiksoftware R, October 2006. WU Competence Day, Wirtschaftsuniversität Wien, October 19, 2006.
Market basket analysis refers to a family of methods for studying the products or categories from a retail assortment that are purchased together in a single shopping trip. This contribution takes a closer look at exploratory market basket analysis, which aims to condense and compactly represent the co-purchase relationships found in the (usually very large) transaction data of retailers. With an enormous number of available extension packages, the freely available statistics software R is an ideal basis for performing such market basket analyses. The infrastructure for transaction data provided by the extension package arules offers a flexible foundation for market basket analysis. It supports the efficient representation, manipulation, and analysis of market basket data together with arbitrary additional information on products (for example, the assortment hierarchy) and on transactions (for example, revenue or contribution margin). The package is seamlessly integrated into R and thus allows the direct application of existing state-of-the-art methods for sampling, clustering, and visualization of market basket data. In addition, arules includes standard algorithms for finding association rules and the data structures needed for analyzing patterns. A selection of the most important functions is demonstrated using a real transaction data set from grocery retail.
[38] Probabilistische Ansätze in der Assoziationsanalyse, May 2006. Habilitation lecture, Wirtschaftsuniversität Wien, May 19, 2006. [ slides (pdf) ]
This talk briefly presents the foundations of association analysis with association rules (the support-confidence framework). After introducing the probabilistic interpretation of the support-confidence framework, known weaknesses and extensions (lift, chi-squared test) are discussed. Finally, a stochastic independence model is introduced, and three applications of the model are discussed.
[39] An association rule mining infrastructure for the R data analysis toolbox, March 2006. 30th Annual Conference of the German Classification Society (GfKl 2006), Berlin, March 8-10, 2006.
The free and extensible statistical computing environment R with its enormous number of extension packages already provides many state-of-the-art techniques for data analysis (advanced clustering, sampling, and visualization). However, support for analyzing transaction data, e.g., mining association rules, a popular exploratory method which can be used, among other purposes, for uncovering cross-selling opportunities in market baskets, has not been available in R thus far. In this presentation we introduce the R extension package arules, an infrastructure for handling and mining transaction data. After an introduction to the infrastructure provided by the package, we use a market basket data set from a typical retailer to demonstrate the advantages of the seamless integration of arules into the R environment.
[40] Optimizing web sites for customer retention, November 2005. 2005 International Workshop on Customer Relationship Management: Data Mining Meets Marketing, November 18-19, 2005, New York City, USA.
With customer relationship management (CRM), companies move away from a mainly product-centered view to a customer-centered view. As a result of this change, effectively managing contact with customers throughout different channels is one of the key success factors in today's business world. In many industries, company Web sites have evolved into an extremely important channel through which customers can be attracted and retained. To analyze and optimize this channel, accurate models of how customers browse through the Web site and what information within the site they repeatedly view are crucial. Typically, data mining techniques are used for this purpose. However, there already exist numerous models developed in marketing research for traditional channels which could also prove valuable for understanding this new channel. In this paper we propose the application of an extension of the Logarithmic Series Distribution (LSD) model to repeat usage of Web-based information, and thus to analyze and optimize a Web site's capability to support one goal of CRM: retaining customers. As an example, we use the university's blended learning web portal with over a thousand learning resources to demonstrate how the model can be used to evaluate and improve the Web site's effectiveness.
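For reference, the logarithmic series distribution underlying the model assigns to x = 1, 2, 3, ... uses the probability (standard form, in LaTeX notation; the extension discussed in the talk is not reproduced here):

    P(X = x) = \frac{-1}{\ln(1 - q)} \cdot \frac{q^x}{x}, \qquad x = 1, 2, 3, \ldots, \quad 0 < q < 1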
[41] Implications of probabilistic data modeling for rule mining, March 2005. 29th Annual Conference of the German Classification Society (GfKl 2005), March 9-11, 2005, Magdeburg, Germany. [ slides (pdf) ]
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine associations are discussed in great detail. In this talk we investigate properties of transaction data sets from a probabilistic point of view. We present a simple probabilistic framework for transaction data and its implementation using the R statistical computing environment. The framework can be used to simulate transaction data when no associations are present. We use such data to explore the ability of confidence and lift, two popular interest measures used for rule mining, to filter noise. Based on the framework we develop the measure hyperlift, and we compare this new measure to lift using simulated data and a real-world grocery database.
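A hedged sketch of this kind of experiment using arules (the thresholds are illustrative and may need adjusting):

    library(arules)
    set.seed(42)
    # transactions simulated under independence: no true associations exist
    trans <- random.transactions(nItems = 200, nTrans = 10000)
    rules <- apriori(trans, parameter = list(support = 0.001, confidence = 0.1))
    # compare plain lift with hyperlift (as implemented in arules) on null data
    quality(rules)$hyperLift <- interestMeasure(rules, "hyperLift", trans)
    summary(quality(rules)[, c("lift", "hyperLift")])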
[42] Discussion of a large-scale open source data collection methodology, January 2005. 38th Hawaii International Conference on System Sciences (HICSS-38), January 3-6, 2005, Hilton Waikoloa Village, Big Island, Hawaii.
In this talk we discuss in detail a possible methodology for collecting repository data on a large number of open source software projects from a single project hosting and community site. The process of data retrieval is described along with the metrics that can be computed and used for further analyses. Example research areas that can be addressed with the available data are given together with first results. Then, advantages and disadvantages of the proposed methodology are discussed together with implications for future approaches.
[43] ePubWU - Erfahrungen mit einer Volltextplattform an der Wirtschaftsuniversität Wien, September 2004. 28. Österreichischer Bibliothekartag 2004, Linz, Austria.
ePubWU is an electronic platform for scholarly publications of the Wirtschaftsuniversität Wien, where research-related publications of the WU are made accessible in full text via the WWW. ePubWU is operated as a joint project of the university library of the Wirtschaftsuniversität Wien and the Department of Information Business. Currently, two types of publications are collected in ePubWU: working papers and dissertations. This talk presents experiences from over two years of running the project, including in the areas of acquisition, workflows, subject indexing, and dissemination.
[44] Association rules and the negative binomial model, April 2004. Research Seminar Statistical Learning, Wirtschaftsuniversität Wien, Austria. [ slides (pdf) ]
In this talk we apply a simple stochastic model, the negative binomial model, to mining rules that can be used for applications like recommender systems. We show how the model can be fitted to several publicly available datasets and how it can be extended to mine rules. Finally, we present a mining algorithm and discuss evaluation and open research questions.
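A small, hedged illustration of the fitting step (simulated counts stand in for the real per-item usage counts):

    library(MASS)
    set.seed(3)
    counts <- rnbinom(500, size = 0.3, mu = 2)       # stand-in for usage counts
    fitdistr(counts, densfun = "negative binomial")  # estimate size and mu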
[45] Generating synthetic transaction data for tuning usage mining algorithms, March 2003. 27th Annual GfKl-Conference, Cottbus, Germany.
The Internet was rapidly adopted as a channel for advertising and selling products. The Internet is especially useful for selling information-based goods (documents, software, music, ...) which can be delivered instantly via the net. Competition is fierce, and sellers have to provide additional services to keep their customers and attract new ones. A common approach is to improve the user interface by adding recommender services, as known from the book recommender used by the successful Internet retailer Amazon.com. Market basket analysis and association rule algorithms for transaction data are frequently used to generate recommendations. However, tuning the performance and testing different parameters of the algorithms is crucial for the success of such approaches. Unfortunately, the great amount of high-quality historical data needed is often not available. Generating synthetic data with characteristics similar to the real-world data is often the only solution to this problem. In this talk we analyze the Quest synthetic data generation code for associations (see http://www.almaden.ibm.com/cs/quest). The transaction data generated by this program is used in several papers to evaluate performance increases of association rule algorithms. However, the characteristics of the generated data seem not to be in line with real-world data, especially data concerning the Web and information goods. As an alternative, we present the first version of a generator based on Ehrenberg's repeat-buying theory. The repeat-buying theory has a solid empirical basis and models the micro-structure of the purchase processes. We conclude with a comparison of data generated by the two generators with real-world data. We believe that an objective evaluation and tuning of algorithms should always use real data as well as a combination of synthetic data sets from generators based on different models.
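Today, both an independence model and a Quest/Agrawal-style generator are available in the arules package; a hedged sketch (the repeat-buying generator from the talk is not part of arules):

    library(arules)
    set.seed(7)
    quest <- random.transactions(nItems = 500, nTrans = 5000, method = "agrawal")
    indep <- random.transactions(nItems = 500, nTrans = 5000, method = "independent")
    itemFrequencyPlot(quest, topN = 25)   # compare item frequency distributions
    itemFrequencyPlot(indep, topN = 25)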
[46] Software reuse with analysis patterns, August 2002. AMCIS 2002, August 9-11, Dallas, Texas.
The purpose of this talk is to promote the reuse of domain knowledge by introducing patterns already in the analysis phase of the software life cycle. We propose an outline template for analysis patterns that strongly supports the whole analysis process, from the requirements analysis to the analysis model and further on to its transformation into a flexible and reusable design and implementation. As an example, we develop a family of analysis patterns that deal with a series of pressing problems in cooperative work, collaborative information filtering and sharing, and knowledge management. We evaluate the reuse potential of these patterns by analyzing several components of an information system that was developed for the Virtual University project of the Vienna University of Economics and Business Administration. The findings of this analysis suggest that using patterns in the analysis phase has the potential to reduce development time significantly by introducing reuse already at the analysis stage and by improving the interface between the analysis and design phases.
[47] Evaluation of recommender algorithms for an internet information broker based on simple association rules and on the repeat-buying theory, July 2002. WEBKDD 2002, Edmonton, Alberta, Canada.
Association rules are a widely used technique to generate recommendations in commercial and research recommender systems. Since more and more Web sites, especially those of retailers, offer automatic recommender services using Web usage mining, the evaluation of recommender algorithms becomes increasingly important. In this talk we first present a framework for evaluating different aspects of recommender systems based on the knowledge discovery in databases process of Fayyad et al., and then we focus on comparing the performance of two recommender algorithms based on frequent itemsets. The first algorithm uses association rules, and the other is based on the repeat-buying theory known from marketing research. For the evaluation we concentrate on how well the patterns extracted from usage data match the users' concept of useful recommendations. We use six months of usage data from an educational Internet information broker and compare useful recommendations identified by users from the target group of the broker with the results of the recommender algorithms. The results of the evaluation presented in this talk suggest that frequent itemsets from purchase histories match the concept of useful recommendations expressed by users with satisfactory accuracy (higher than 70%) and precision (between 60% and 90%). The evaluation also suggests that both algorithms perform similarly on real-world data if they are tuned properly.
[48] Data warehouses and data mining, December 2001. Guest Lecture at Webster University, Vienna.
This presentation gives a very short introduction to the concepts of Data Warehouses and Data Mining with some examples from the information system called the WU Virtual University.
[49] Patterns im Softwareentwicklungsprozess, September 2001. ADV Arbeitsgemeinschaft für Datenverarbeitung, Wien. [ slides (pdf) ]
This presentation addresses the integration of patterns into the software development process. First, the OO life cycle is presented, and then it is shown where and how design and analysis patterns can be used, pointing out both benefits and problems. Finally, examples of the use of analysis patterns are presented in connection with a project to develop an information system for research and teaching at a university.
[50] A customer purchase incidence model applied to recommender services, August 2001. WEBKDD 2001, San Francisco, CA.
In this presentation we transfer a customer purchase incidence model for consumer products, which is based on Ehrenberg's repeat-buying theory, to Web-based information products. Ehrenberg's repeat-buying theory successfully describes regularities in a large number of consumer product markets. We show that these regularities also exist in electronic markets for information goods, and that purchase incidence models provide a solid theoretical foundation for recommender and alert systems. The presentation consists of three parts. First, we present the architecture of an information market and its instrumentation for collecting data on customer behavior. In the second part, Ehrenberg's repeat-buying theory and its assumptions are reviewed and adapted for Web-based information markets. Finally, we present the empirical validation of the model based on data collected from the information market of the Virtual University of the Vienna University of Economics and Business Administration at http://vu.wu-wien.ac.at
[51] User-centered navigation re-design for web-based information systems, August 2000. AMCIS 2000, Long Beach, CA.
[52] Living Lectures - WU Virtual Library: Ein Lernportal, March 2000. In the lecture series "Lernen per Internet", Technische Universität Wien.
[53] Vorstellung der Virtuellen Universität im Rahmen des Jungassistententrainings, March 2000. Wirtschaftsuniversität Wien.
[54] Das Living Lectures - Virtual University Projekt: Informationstechnologie im universitären Bildungsbereich, June 1999. Global Village 99.
[55] Das Living Lectures - Virtual University Projekt: Projektbericht, June 1999. Institut für Betriebswirtschaftslehre, Universität Wien.
[56] The living lectures - virtual university project, June 1999. Global Bangemann Challenge, Stockholm, Sweden.
[57] Automatic labelling of references for internet information systems, March 1999. 23rd Annual GfKl-Conference, Bielefeld, Germany.
[58] Workshop Living Lectures - Virtual University, September 1998. Wirtschaftsuniversität Wien.