[1] Rao M. Kotamarti, Michael Hahsler, Douglas Raiford, Monnie McGee, and Margaret H. Dunham. Analyzing Taxonomic Classification Using Extensible Markov Models. Bioinformatics, 2010. Advance Access published July 12, 2010.
Motivation: As next generation sequencing is rapidly adding new genomes, their correct placement in the taxonomy needs verification. However, the current methods for confirming classification of a taxon or suggesting revision for a potential misplacement rely on computationally intense multi-sequence alignment followed by an iterative adjustment of the distance matrix. Due to intra-heterogeneity issues with the 16S rRNA marker, no classifier is available at the sub-genus level that could readily suggest a classification for a novel 16S rRNA sequence. Metagenomics further complicates the issue by generating fragmented 16S rRNA sequences. This paper proposes a novel alignment-free method for representing the microbial profiles using Extensible Markov Models (EMM) with an extended Karlin-Altschul statistical framework similar to the classic alignment paradigm. We propose a Log Odds (LOD) score classifier based on the Gumbel difference distribution that confirms correct classifications with statistical significance qualifications and suggests revisions where necessary. Results: We tested our method by generating a sub-genus level classifier with which we re-evaluated classifications of 676 microbial organisms using the NCBI FTP database for the 16S rRNA. The results confirm current classification for all genera while ascertaining significance at 95%. Furthermore, this novel classifier isolates heterogeneity issues to a mere 12 strains while confirming classifications with significance qualification for the remaining 98%. The models require less memory than that needed by multi-sequence alignments and have better time complexity than the current methods. The classifier operates at the sub-genus level and thus outperforms the naive Bayes classifier of the RNA Database Project where much of the taxonomic analysis is available online. Finally, using information redundancy in model building, we show that the method applies to metagenomic fragment classification of 19 E. coli strains.

[2] Michael Hahsler and Margaret H. Dunham. rEMM: Extensible Markov model for data stream clustering in R. Journal of Statistical Software, 35(5):1-31, 2010.
Clustering streams of continuously arriving data has become an important application of data mining in recent years and efficient algorithms have been proposed by several researchers. However, clustering alone neglects the fact that data in a data stream is not only characterized by the proximity of data points which is used by clustering, but also by a temporal component. The Extensible Markov Model (EMM) adds the temporal component to data stream clustering by superimposing a dynamically adapting Markov Chain. In this paper we introduce the implementation of the R extension package rEMM which implements EMM and we discuss some examples and applications.
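
The idea of superimposing a Markov chain on stream clusters can be illustrated with a minimal Python sketch. It is not the rEMM implementation: the class name `TinyEMM`, the fixed distance threshold, and the choice to keep each cluster center at its first observation are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class TinyEMM:
    """Toy illustration: threshold clustering plus transition counts (hypothetical, not rEMM)."""

    def __init__(self, threshold):
        self.threshold = threshold            # maximum distance to join an existing cluster
        self.centers = []                     # one fixed center per cluster (state)
        self.transitions = defaultdict(int)   # (from_state, to_state) -> count
        self.current = None                   # state of the most recent observation

    def _assign(self, point):
        # Find the nearest existing center, or open a new cluster if none is close enough.
        best, best_d = None, math.inf
        for i, center in enumerate(self.centers):
            d = math.dist(center, point)
            if d < best_d:
                best, best_d = i, d
        if best is None or best_d > self.threshold:
            self.centers.append(tuple(point))
            return len(self.centers) - 1
        return best

    def update(self, point):
        # Assign the point to a state and record the transition from the previous state.
        state = self._assign(point)
        if self.current is not None:
            self.transitions[(self.current, state)] += 1
        self.current = state
        return state

    def transition_probability(self, src, dst):
        # Relative frequency of src -> dst among all observed transitions leaving src.
        total = sum(c for (s, _), c in self.transitions.items() if s == src)
        return self.transitions.get((src, dst), 0) / total if total else 0.0


stream = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 0.1), (5, 5.1)]
emm = TinyEMM(threshold=1.0)
states = [emm.update(p) for p in stream]
```

Here `states` records which cluster each arriving point joined, and `emm.transition_probability(0, 1)` estimates how often the stream moved from the first cluster to the second.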

[3] Michael Hahsler, Christian Buchta, and Kurt Hornik. Selective association rule generation. Computational Statistics, 23(2):303-315, April 2008.
Mining association rules is a popular and well researched method for discovering interesting relations between variables in large databases. A practical problem is that at medium to low support values a large number of frequent itemsets, and an even larger number of association rules, are often found in a database. A widely used approach is to gradually increase minimum support and minimum confidence or to filter the found rules using increasingly strict constraints on additional measures of interestingness until the set of rules found is reduced to a manageable size. In this paper we describe a different approach, which is based on the idea of first defining a set of “interesting” itemsets (e.g., by a mixture of mining and expert knowledge) and then, in a second step, selectively generating rules for only these itemsets. The main advantage of this approach over increasing thresholds or filtering rules is that the number of rules found is significantly reduced, while at the same time it is not necessary to increase the support and confidence thresholds, which might lead to missing important information in the database.
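
The two-step idea can be sketched in Python as follows. This is a hedged illustration only: the function names, the toy transactions, and the restriction to rules with a single item on the right-hand side are assumptions, not the procedure from the paper.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t)


def rules_for_itemsets(transactions, itemsets, min_confidence):
    """Generate rules lhs -> rhs only for the given itemsets (hypothetical helper)."""
    rules = []
    for itemset in itemsets:
        items = sorted(itemset)
        count_all = count(transactions, items)
        if count_all == 0:
            continue  # the itemset never occurs, so no rule can be formed
        for rhs in items:
            lhs = [i for i in items if i != rhs]
            if not lhs:
                continue  # single-item sets yield no rule
            confidence = count_all / count(transactions, lhs)
            if confidence >= min_confidence:
                rules.append((tuple(lhs), rhs, confidence))
    return rules


transactions = [
    frozenset({"beer", "chips"}),
    frozenset({"beer", "chips", "salsa"}),
    frozenset({"chips", "salsa"}),
    frozenset({"beer"}),
    frozenset({"chips"}),
]
# Step 1: the "interesting" itemsets are supplied from outside, e.g. by an expert.
interesting = [{"chips", "salsa"}]
# Step 2: rules are generated for these itemsets only.
rules = rules_for_itemsets(transactions, interesting, min_confidence=0.8)
```

Because rules are built only from the supplied itemsets, the rule set stays small without raising any global support threshold.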

[4] Michael Hahsler, Kurt Hornik, and Christian Buchta. Getting things in order: An introduction to the R package seriation. Journal of Statistical Software, 25(3):1-34, March 2008.
Seriation, i.e., finding a linear order for a set of objects given data and a loss or merit function, is a basic problem in data analysis. Because of the problem's combinatorial nature, it is hard to solve for all but very small sets. Nevertheless, both exact solution methods and heuristics are available. In this paper we present the package seriation which provides the infrastructure for seriation with R. The infrastructure comprises data structures to represent linear orders as permutation vectors, a wide array of seriation methods using a consistent interface, a method to calculate the value of various loss and merit functions, and several visualization techniques which build on seriation. To illustrate how easily the package can be applied for a variety of applications, a comprehensive collection of examples is presented.
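
As a small illustration of the problem statement, the following Python sketch represents a linear order as a permutation of object indices and finds the order that minimizes one possible loss function by exhaustive search. The loss chosen here (the sum of dissimilarities between neighbors in the order) and all function names are assumptions for illustration and are not taken from the seriation package.

```python
from itertools import permutations


def path_length(dissim, order):
    """Loss of an order: sum of dissimilarities between adjacent objects."""
    return sum(dissim[a][b] for a, b in zip(order, order[1:]))


def best_order(dissim):
    """Exhaustive search over all n! permutations; feasible only for very small n."""
    n = len(dissim)
    # Ties (e.g. an order and its reverse) are broken by the permutation tuple itself.
    return min(permutations(range(n)), key=lambda p: (path_length(dissim, p), p))


# Four objects described by a single value; dissimilarity is the absolute difference.
values = [3.0, 1.0, 4.0, 2.0]
dissim = [[abs(a - b) for b in values] for a in values]
order = best_order(dissim)
loss = path_length(dissim, order)
```

The factorial growth of the search space in `best_order` is the combinatorial difficulty the abstract refers to, which is why heuristics are needed for larger sets.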

[5] Michael Hahsler and Kurt Hornik. TSP - Infrastructure for the traveling salesperson problem. Journal of Statistical Software, 23(2):1-21, December 2007.
The traveling salesperson (or salesman) problem (TSP) is a well known and important combinatorial optimization problem. The goal is to find the shortest tour that visits each city in a given list exactly once and then returns to the starting city. Despite this simple problem statement, solving the TSP is difficult since it belongs to the class of NP-complete problems. The importance of the TSP arises not only from its theoretical appeal but also from the variety of its applications. Typical applications in operations research include vehicle routing, computer wiring, cutting wallpaper and job sequencing. The main application in statistics is combinatorial data analysis, e.g., reordering rows and columns of data matrices or identifying clusters. In this paper we introduce the R package TSP which provides a basic infrastructure for handling and solving the traveling salesperson problem. The package features S3 classes for specifying a TSP and its (possibly optimal) solution as well as several heuristics to find good solutions. In addition, it provides an interface to Concorde, one of the best exact TSP solvers currently available.
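
To make the problem statement concrete, here is a minimal Python sketch of one well-known tour construction heuristic, nearest neighbor. It is an illustration under assumptions: the function names and the toy city coordinates are invented here and do not reflect the interface of the TSP package.

```python
import math


def tour_length(dist, tour):
    """Length of the closed tour, including the return to the starting city."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def nearest_neighbor_tour(dist, start=0):
    """Greedy heuristic: always move to the closest city not yet visited."""
    unvisited = set(range(len(dist))) - {start}
    tour = [start]
    while unvisited:
        current = tour[-1]
        # Ties are broken by the smaller city index to keep the result deterministic.
        nxt = min(unvisited, key=lambda j: (dist[current][j], j))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


# Four cities on the corners of a 2 x 1 rectangle.
cities = [(0, 0), (0, 1), (2, 1), (2, 0)]
dist = [[math.dist(a, b) for b in cities] for a in cities]
tour = nearest_neighbor_tour(dist)
length = tour_length(dist, tour)
```

Nearest neighbor gives no optimality guarantee in general; it only produces a valid tour quickly, which is why exact solvers and improvement heuristics exist alongside it.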

[6] Michael Hahsler and Kurt Hornik. New probabilistic interest measures for association rules. Intelligent Data Analysis, 11(5):437-455, 2007.
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we start by presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left hand side of rules and that lift performs poorly at filtering random noise in transaction data. Based on the probabilistic framework we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic.
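
For readers unfamiliar with the two established measures examined here, the following Python sketch computes support, confidence, and lift using their standard definitions. The toy transactions and function names are assumptions for illustration; the sketch does not implement hyper-lift or hyper-confidence.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate of P(rhs | lhs): support of the whole rule divided by support of lhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Confidence relative to the baseline frequency of rhs; 1 suggests independence."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]
conf_bread_milk = confidence(transactions, {"bread"}, {"milk"})
lift_bread_milk = lift(transactions, {"bread"}, {"milk"})
```

In this toy data the rule bread -> milk has a fairly high confidence while its lift is below 1, a small reminder that confidence alone can look strong simply because the items involved are frequent.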

[7] Thomas Reutterer, Michael Hahsler, and Kurt Hornik. Data Mining und Marketing am Beispiel der explorativen Warenkorbanalyse. Marketing ZFP, 29(3):165-181, 2007.
Data mining techniques are an increasingly important addition to the conventional arsenal of methods in marketing research and practice. Such primarily data-driven analysis tools are used to extract marketing-relevant information “intelligently” from large databases (so-called data warehouses) and to prepare it in a suitable form for further decision support. This article discusses points of contact between data mining and marketing and demonstrates the concrete use of selected data mining methods, taking exploratory market basket analysis (cross-category assortment analysis) of a transaction data set from food retailing as an example. The techniques applied come from classical affinity analysis, a k-medoid method of cluster analysis, and tools for generating and subsequently evaluating association rules between product categories in the assortment. The procedure is illustrated using the extension package arules, which is freely available for the statistical software R.

[8] Michael Hahsler. A model-based frequency constraint for mining associations from transaction data. Data Mining and Knowledge Discovery, 13(2):137-166, September 2006.
Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the association's significance. A single user-specified support threshold is used to decide if associations should be further investigated. Support has some known problems with rare items, favors shorter itemsets and sometimes produces misleading associations. In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) which allows for transaction data's typically highly skewed item frequency distribution. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier to set and interpret by the user.

[9] Christoph Breidert, Michael Hahsler, and Thomas Reutterer. A review of methods for measuring willingness-to-pay. Innovative Marketing, 2(4):8-32, 2006.
Knowledge about (potential) customers' willingness-to-pay for a product plays a crucial role in many areas of marketing management, such as pricing decisions or new product development. Numerous approaches to measuring willingness-to-pay, with different conceptual foundations and methodological implications, have been presented in the relevant literature so far. This article provides the reader with a systematic overview of the relevant literature on these competing approaches and associated schools of thought, recognizes their respective merits and discusses obstacles and issues regarding their adoption for measuring willingness-to-pay. Because of their practical relevance, special focus is put on indirect surveying techniques and, in particular, conjoint-based applications are discussed in more detail. The strengths and limitations of the individual approaches are discussed and evaluated from a managerial point of view.

[10] Michael Hahsler, Bettina Grün, and Kurt Hornik. arules - A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15):1-25, October 2005.
Mining frequent itemsets and association rules is a popular and well researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.

[11] Michael Hahsler. Integrating digital document acquisition into a university library: A case study of social and organizational challenges. Journal of Digital Information Management, 1(4):162-171, December 2003.
In this article we report on the effort of the university library of the Vienna University of Economics and Business Administration to integrate a digital library component for research documents authored at the university into the existing library infrastructure. Setting up a digital library has become a relatively easy task using current database technology and the components and tools freely available. However, integrating such a digital library into existing library systems and adapting existing document acquisition workflows in the organization are non-trivial tasks. We use a research framework to identify the key players in this change process and to analyze their incentive structures. Then we describe the light-weight integration approach employed by our university and show how it provides incentives to the key players and at the same time requires only minimal adaptation of the organization in terms of changing existing workflows. Our experience suggests that this light-weight integration offers a cost-efficient and low-risk intermediate step towards switching to exclusive digital document acquisition.

[12] Wolfgang Gaul, Andreas Geyer-Schulz, Michael Hahsler, and Lars Schmidt-Thieme. eMarketing mittels Recommendersystemen. Marketing ZFP, 24:47-55, 2002.
Recommender systems make an important contribution to the design of eMarketing activities. Starting from a discussion of input/output characteristics for describing such systems, which already allow a suitable distinction between practically relevant forms, we motivate why such a characterization must be enriched by including methodological aspects from marketing research. A recommender system based on the theory of repeat-buying behavior and a system that generates recommendations by analyzing the navigation behavior of site visitors are presented. The Amazon site is used as an example to illustrate the marketing possibilities of recommender systems. To round off the discussion, further literature related to recommender systems is reviewed. An outlook indicates the directions in which further developments are planned.

[13] Andreas Geyer-Schulz, Michael Hahsler, and Maximillian Jahn. Educational and scientific recommender systems: Designing the information channels of the virtual university. International Journal of Engineering Education, 17(2):153-163, 2001.
In this article we investigate the role of recommender systems and their potential in the educational and scientific environment of a Virtual University. The key idea is to use the information aggregation capabilities of a recommender system to improve the tutoring and consulting services of a Virtual University in an automated way and thus scale tutoring and consulting in a personalized way to a mass audience. We describe the recommender services of myVU, the collection of personalized services of the Virtual University (VU) of the Vienna University of Economics and Business Administration. These services are based on observed user behavior and self-assignment of experience and are currently being field-tested. We show how the usual mechanism design problems inherent to recommender systems are addressed in this prototype.

[14] Andreas Geyer-Schulz, Michael Hahsler, and Georg Schneider. The virtual university and its embedded agents. ÖGAI Journal, 18(1):14-19, 1999.
In this article we present the current state of usage of (intelligent) Internet agents in the Virtual University (VU) of the Vienna University of Economics and Business Administration. We discuss opportunities and challenges for the development of several classes of agents and their sensor systems. More specifically, agents of the following classes embedded in the virtual university system will be presented: (1) robots which support navigation services and (2) robots which support communication and collaboration.

[15] Peter Bruhn, Andreas Geyer-Schulz, Michael Hahsler, and Markus Mottel. Genetic machine learning and intelligent internet agents. ÖGAI Journal, 17(1):18-25, 1998.
In this paper we report on the status quo of the current machine learning research projects at the Department of Applied Computer Science of the Institute of Information Processing and Information Economics of the Vienna University of Economics and Business Administration. The current research activities can be categorized as follows: (1) Development of a theoretical framework of genetic programming. (2) Application of genetic programming for managerial and economic decision-making and for breeding agents' strategies in organizational learning. (3) Development, adaptation, and integration of (intelligent) Internet agents to support virtual organizations. (4) Development of an infrastructure for intelligent Internet agents in the “Living Lectures - Virtual University” project. (5) Cost-benefit analysis of agents, analysis of tactical and strategic consequences of agents and the analysis of their economic applications.
