The problem of mining association rules (see association rule learning at Wikipedia was introduced in Agrawal et al 1993.

The aim of association rule mining is to find interesting and useful patterns in a transaction database. Each transaction in the database contains a set of items and a transaction identifier (e.g., a market basket). Association rules are rules of the form X → Y where X and Y are two disjoint subsets of all available items. X is called the antecedent or LHS (left-hand side), and Y is called the consequent or RHS (right-hand side). Association rules satisfy constraints on minimum support (a measure of rule significance) and minimum confidence (a measure of rule strength).

Research on association rules focuses on efficient algorithms, measures of rule interestingness, sequence mining, and using association rules for classification.

Our Implementations (in R)

Other Implementations

Data Sets

Events

Journals

Our Publications

[1] Michael Hahsler. ARULESPY: Exploring association rules and frequent itemsets in Python. arXiv:2305.15263 [cs.DB], May 2023. [ DOI | at the publisher ]
The R arules package implements a comprehensive infrastructure for representing, manipulating, and analyzing transaction data and patterns using frequent itemsets and association rules. The package also provides a wide range of interest measures and mining algorithms, including the code of Christian Borgelt's popular and efficient C implementations of the association mining algorithms Apriori and Eclat, and optimized C/C++ code for mining and manipulating association rules using sparse matrix representation. This document describes the new Python package arulespy, which makes this infrastructure available for Python users.
[2] M. Ledmi, C. Kara-Mohamed, M.E.H. Souidi, M. Hahsler, and Ledmi. Mining association rules for classification using frequent generator itemsets in arules package. International Journal of Data Mining, Modelling and Management, 15(2):203--221, 2023. [ DOI ]
Mining frequent itemsets is an attractive research activity in data mining whose main aim is to provide useful relationships among data. Consequently, several open-source development platforms are continuously developed to facilitate the users’ exploitation of new data mining tasks. Among these platforms, the R language is one of the most popular tools. In this paper, we propose an extension of arules package by adding the option of mining frequent generator itemsets. We discuss in detail how generators can be used for a classification task through an application example in relation with COVID-19.
[3] Michael Hahsler, Ian Johnson, Tomas Kliegr, and Jaroslav Kuchar. Associative classification in R: arc, arulesCBA, and rCBA. R Journal, 11(2):254--267, 2019. [ DOI | preprint (PDF) | at the publisher ]
Several methods for creating classifiers based on rules discovered via association rule mining have been proposed in the literature. These classifiers are called associative classifiers and the best known algorithm is Classification Based on Associations (CBA). Interestingly, only very few implementations are available and, until recently, no implementation was available for R. Now, three packages provide CBA. This paper introduces associative classification, the CBA algorithm, and how it can be used in R. A comparison of the three packages is provided to give the potential user an idea about the advantages of each of the implementations. We also show how the packages are related to the existing infrastructure for association rule mining already available in R.
[4] Michael Hahsler and Anurag Nagar. Discovering patterns in gene ontology using association rule mining. Biostatistics and Biometrics Open Access Journal, 6(3):1--3, April 2018. [ DOI | preprint (PDF) ]
Gene Ontology (GO) is one of the largest interdisciplinary bioinformatics projects that aims to provide a uniform and consistent representation of genes and gene products across different species. It has fast become a vast repository of data consisting of biological terms arranged in the form of three different ontologies, and annotation files that represent how these terms are linked to genes across different organisms. Further, this dataset is ever growing due to the various genomic projects underway. While this growth in data is a very welcome development, there is a critical need to develop data mining tools and algorithms that can extract summaries, and discover useful knowledge in an automated way. This paper presents a review of the efforts in this area, focusing on information discovery in the form of association rule mining.
[5] Michael Hahsler. arulesViz: Interactive visualization of association rules with R. R Journal, 9(2):163--175, December 2017. [ DOI ]
Association rule mining is a popular data mining method to discover interesting relationships between variables in large databases. An extensive toolbox is available in the R-extension package arules. However, mining association rules often results in a vast number of found rules, leaving the analyst with the task to go through a large set of rules to identify interesting ones. Sifting manually through extensive sets of rules is time-consuming and strenuous. Visualization and especially interactive visualization has a long history of making large amounts of data better accessible. The R-extension package arulesViz provides most popular visualization techniques for association rules. In this paper, we discuss recently added interactive visualizations to explore association rules and demonstrate how easily they can be used in arulesViz via a unified interface. With examples, we help to guide the user in selecting appropriate visualizations and interpreting the results.
[6] Michael Hahsler. Grouping association rules using lift. In C. Iyigun, R. Moghaddess, and A. Oztekin, editors, 11th INFORMS Workshop on Data Mining and Decision Analytics (DM-DA 2016), November 2016. [ preprint (PDF) ]
Association rule mining is a well established and popular data mining method for finding local dependencies between items in large transaction databases. However, a practical drawback of mining and efficiently using association rules is that the set of rules returned by the mining algorithm is typically too large to be directly used. Clustering association rules into a small number of meaningful groups would be valuable for experts who need to manually inspect the rules, for visualization and as the input for other applications. Interestingly, clustering is not widely used as a standard method to summarize large sets of associations. In fact it performs poorly due to high dimensionality, the inherent extreme data sparseness and the dominant frequent itemset structure reflected in sets of association rules. In this paper, we review association rule clustering and their shortcomings. We then propose a simple approach based on grouping columns in a lift matrix and give an example to illustrate its usefulness.
[7] Michael Hahsler and Radoslaw Karpienko. Visualizing association rules in hierarchical groups. Journal of Business Economics, 87(3):317--335, May 2016. [ DOI | at the publisher ]
Association rule mining is one of the most popular data mining methods. However, mining association rules often results in a very large number of found rules, leaving the analyst with the task to go through all the rules and discover interesting ones. Sifting manually through large sets of rules is time consuming and strenuous. Visualization has a long history of making large amounts of data better accessible using techniques like selecting and zooming. However, most association rule visualization techniques are still falling short when it comes to a large number of rules. In this paper we present an integrated framework for post-processing and visualization of association rules, which allows to intuitively explore and interpret highly complex scenarios. We demonstrate how this framework can be used to analyze large sets of association rules using the R software for statistical computing, and provide examples from the implementation in the R-package arulesViz.
[8] Anurag Nagar, Michael Hahsler, and Hisham Al-Mubaid. Association rule mining of gene ontology annotation terms for SGD. In 2015 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). IEEE, August 2015. [ DOI | preprint (PDF) ]
Gene Ontology is one of the largest bioinformatics project that seeks to consolidate knowledge about genes through annotation of terms to three ontologies. In this work, we present a technique to find association relationships in the annotation terms for the Saccharomyces cerevisiae (SGD) genome. We first present a normalization algorithm to ensure that the annotation terms have a similar level of specificity. Association rule mining algorithms are used to find significant and non-trivial association rules in these normalized datasets. Metrics such as support, confidence, and lift can be used to evaluate the strength of found rules. We conducted experiments on the entire SGD annotation dataset and here we present the top 10 strongest rules for each of the three ontologies. We verify the found rules using evidence from the biomedical literature. The presented method has a number of advantages - it relies only on the structure of the gene ontology, has minimal memory and storage requirements, and can be easily scaled for large genomes, such as the human genome. There are many applications of this technique, such as predicting the GO annotations for new genes or those that have not been studied extensively.
[9] Jörg Lässig and Michael Hahsler. Cooperative data analysis in supply chains using selective information disclosure. In Brian Borchers, J. Paul Brooks, and Laura McLay, editors, Operations Research and Computing: Algorithms and Software for Analytics, 14th INFORMS Computing Society Conference (ICS2015). INFORMS, January 2015. [ at the publisher ]
Many modern products (e. g., consumer electronics) consist of hundreds of complex parts sourced from a large number of suppliers. In such a setting, finding the source of certain properties, e. g., the source of defects in the final product, becomes increasingly difficult. Data analysis methods can be used on information shared in modern supply chains. However, some information may be confidential since they touch proprietary production processes or trade secrets. Important principles of confidentiality are data minimization and that each participant has control over how much information is communicated with others, both of which makes data analysis more difficult. In this work, we investigate the effectiveness of strategies for selective information disclosure in order to perform cooperative data analysis in a supply chain. The goal is to minimize information exchange by only exchanging information which is needed for the analysis tasks at hand. The work is motivated by the growing demand for cross company data analysis, while simultaneously addressing confidentiality concerns. As an example, we apply a newly developed protocol with association mining techniques in an empirical simulation study to compare its effectiveness with complete information disclosure. The results show that the predictive performance is comparable while the amount of exchanged information is reduced significantly.
[10] Michael Hahsler and Sudheer Chelluboina. Visualizing association rules in hierarchical groups. Unpublished. Presented at the 42nd Symposium on the Interface: Statistical, Machine Learning, and Visualization Algorithms (Interface 2011), June 2011. [ preprint (PDF) ]
Association rule mining is one of the most popular data mining methods. However, mining association rules often results in a very large number of found rules, leaving the analyst with the task to go through all the rules and discover interesting ones. Sifting manually through large sets of rules is time consuming and strenuous. Visualization has a long history of making large amounts of data better accessible using techniques like selecting and zooming. However, most association rule visualization techniques are still falling short when it comes to a large number of rules. In this paper we present a new interactive visualization technique which lets the user navigate through a hierarchy of groups of association rules. We demonstrate how this new visualization techniques can be used to analyze a large sets of association rules with examples from our implementation in the R-package arulesViz.
[11] Michael Hahsler, Sudheer Chelluboina, Kurt Hornik, and Christian Buchta. The arules R-package ecosystem: Analyzing interesting patterns from large transaction datasets. Journal of Machine Learning Research, 12:1977--1981, 2011. [ at the publisher ]
This paper describes the ecosystem of R add-on packages developed around the infrastructure provided by the package arules. The packages provide comprehensive functionality for analyzing interesting patterns including frequent itemsets, association rules, frequent sequences and for building applications like associative classification. After discussing the ecosystem's design we illustrate the ease of mining and visualizing rules with a short example.
[12] Michael Hahsler, Christian Buchta, and Kurt Hornik. Selective association rule generation. Computational Statistics, 23(2):303--315, April 2008. [ DOI | preprint (PDF) ]
Mining association rules is a popular and well researched method for discovering interesting relations between variables in large databases. A practical problem is that at medium to low support values often a large number of frequent itemsets and an even larger number of association rules are found in a database. A widely used approach is to gradually increase minimum support and minimum confidence or to filter the found rules using increasingly strict constraints on additional measures of interestingness until the set of rules found is reduced to a manageable size. In this paper we describe a different approach which is based on the idea to first define a set of “interesting” itemsets (e.g., by a mixture of mining and expert knowledge) and then, in a second step to selectively generate rules for only these itemsets. The main advantage of this approach over increasing thresholds or filtering rules is that the number of rules found is significantly reduced while at the same time it is not necessary to increase the support and confidence thresholds which might lead to missing important information in the database.
[13] Thomas Reutterer, Michael Hahsler, and Kurt Hornik. Data Mining und Marketing am Beispiel der explorativen Warenkorbanalyse. Marketing ZFP, 29(3):165--181, 2007. [ at the publisher ]
Techniken des Data Mining stellen für die Marketingforschung und -praxis eine zunehmend bedeutsamere Bereicherung des herkömmlichen Methodenarsenals dar. Mit dem Einsatz solcher primär datengetriebener Analysewerkzeuge wird das Ziel verfolgt, marketingrelevante Informationen ”intelligent” aus großen Datenbanken (sog. Data Warehouses) zu extrahieren und für die weitere Entscheidungsvorbereitung in geeigneter Form aufzubereiten. Im vorliegenden Beitrag werden Berührungspunkte zwischen Data Mining und Marketing diskutiert und der konkrete Einsatz ausgewählter Data-Mining-Methoden am Beispiel der explorativen Warenkorb- bzw. Sortimentsverbundanalyse für einen Transaktionsdatensatz aus dem Lebensmitteleinzelhandel demonstriert. Zur Anwendung gelangen dabei Techniken aus dem Bereich der klassischen Affinitätsanalyse, ein K-Medoid-Verfahren der Clusteranalyse sowie Werkzeuge zur Generierung und anschließenden Beurteilung von Assoziationsregeln zwischen im Sortiment enthaltenen Warengruppen. Die Vorgehensweise wird dabei anhand des mit der Statistik-Software R frei verfügbaren Erweiterungspakets arules illustriert.
[14] Michael Hahsler and Kurt Hornik. New probabilistic interest measures for association rules. Intelligent Data Analysis, 11(5):437--455, 2007. [ DOI | preprint (PDF) | at the publisher ]
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we start with presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left hand side of rules and that lift performs poorly to filter random noise in transaction data. Based on the probabilistic framework we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic.
[15] Michael Hahsler and Kurt Hornik. Building on the arules infrastructure for analyzing transaction data with R. In R. Decker and H.-J. Lenz, editors, Advances in Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, pages 449--456. Springer-Verlag, 2007. [ DOI | preprint (PDF) ]
The free and extensible statistical computing environment R with its enormous number of extension packages already provides many state-of-the-art techniques for data analysis. Support for association rule mining, a popular exploratory method which can be used, among other purposes, for uncovering cross-selling opportunities in market baskets, has become available recently with the R extension package arules. After a brief introduction to transaction data and association rules, we present the formal framework implemented in arules and demonstrate how clustering and association rule mining can be applied together using a market basket data set from a typical retailer. This paper shows that implementing a basic infrastructure with formal classes in R provides an extensible basis which can very efficiently be employed for developing new applications (such as clustering transactions) in addition to association rule mining.
[16] Michael Hahsler. A model-based frequency constraint for mining associations from transaction data. Data Mining and Knowledge Discovery, 13(2):137--166, September 2006. [ DOI | preprint (PDF) ]
Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the associations's significance. A single user-specified support threshold is used to decided if associations should be further investigated. Support has some known problems with rare items, favors shorter itemsets and sometimes produces misleading associations. In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) which allows for transaction data's typically highly skewed item frequency distribution. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier to set and interpret by the user.
[17] Michael Hahsler, Kurt Hornik, and Thomas Reutterer. Implications of probabilistic data modeling for mining association rules. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge Engineering, Studies in Classification, Data Analysis, and Knowledge Organization, pages 598--605. Springer-Verlag, 2006. [ preprint (PDF) | at the publisher ]
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine association rules are discussed in great detail. We present a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world grocery database to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left-hand-side of rules and that lift performs poorly to filter random noise in transaction data. The probabilistic data modeling approach presented in this paper not only is a valuable framework to analyze interest measures but also provides a starting point for further research to develop new interest measures which are based on statistical tests and geared towards the specific properties of transaction data.
[18] Michael Hahsler, Kurt Hornik, and Thomas Reutterer. Warenkorbanalyse mit Hilfe der Statistik-Software R. In Peter Schnedlitz, Renate Buber, Thomas Reutterer, Arnold Schuh, and Christoph Teller, editors, Innovationen in Marketing, pages 144--163. Linde-Verlag, 2006. [ preprint (PDF) ]
Die Warenkorb- oder Sortimentsverbundanalyse bezeichnet eine Reihe von Methoden zur Untersuchung der bei einem Einkauf gemeinsam nachgefragten Produkte oder Kategorien aus einem Handelssortiment. In diesem Beitrag wird die explorative Warenkorbanalyse näher beleuchtet, welche eine Verdichtung und kompakte Darstellung der in (zumeist sehr umfangreichen) Transaktionsdaten des Einzelhandels auffindbaren Verbundbeziehungen beabsichtigt. Mit einer enormen Anzahl an verfügbaren Erweiterungspaketen bietet sich die frei verfügbare Statistik-Software R als ideale Basis für die Durchführung solcher Warenkorbanalysen an. Die im Erweiterungspaket arules vorhandene Infrastruktur für Transaktionsdaten stellt eine flexible Basis für die Warenkorbanalyse bereit. Unterstützt wird die effiziente Darstellung, Bearbeitung und Analyse von Warenkorbdaten mitsamt beliebigen Zusatzinformationen zu Produkten (zum Beispiel Sortimentshierarchie) und zu Transaktionen (zum Beispiel Umsatz oder Deckungsbeitrag). Das Paket ist nahtlos in R integriert und ermöglicht dadurch die direkte Anwendung von bereits vorhandenen modernsten Verfahren für Sampling, Clusterbildung und Visualisierung von Warenkorbdaten. Zusätzlich sind in arules gängige Algorithmen zum Auffinden von Assoziationsregeln und die notwendigen Datenstrukturen zur Analyse von Mustern vorhanden. Eine Auswahl der wichtigsten Funktionen wird anhand eines realen Transaktionsdatensatzes aus dem Lebensmitteleinzelhandel demonstriert.
[19] Michael Hahsler, Bettina Grün, and Kurt Hornik. arules -- A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15):1--25, October 2005. [ DOI ]
Mining frequent itemsets and association rules is a popular and well researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
[20] Andreas Geyer-Schulz and Michael Hahsler. Comparing two recommender algorithms with the help of recommendations by peers. In O.R. Zaiane, J. Srivastava, M. Spiliopoulou, and B. Masand, editors, WEBKDD 2002 - Mining Web Data for Discovering Usage Patterns and Profiles 4th International Workshop, Edmonton, Canada, July 2002, Revised Papers, Lecture Notes in Computer Science LNAI 2703, pages 137--158. Springer-Verlag, 2003. (Revised version of the WEBKDD 2002 paper “Evaluation of Recommender Algorithms for an Internet Information Broker based on Simple Association Rules and on the Repeat-Buying Theory”). [ preprint (PDF) | at the publisher ]
Since more and more Web sites, especially sites of retailers, offer automatic recommendation services using Web usage mining, evaluation of recommender algorithms has become increasingly important. In this paper we present a framework for the evaluation of different aspects of recommender systems based on the process of discovering knowledge in databases introduced by Fayyad et al. and we summarize research already done in this area. One aspect identified in the presented evaluation framework is widely neglected when dealing with recommender algorithms. This aspect is to evaluate how useful patterns extracted by recommender algorithms are to support the social process of recommending products to others, a process normally driven by recommendations by peers or experts. To fill this gap for recommender algorithms based on frequent itemsets extracted from usage data we evaluate the usefulness of two algorithms. The first recommender algorithm uses association rules, and the other algorithm is based on the repeat-buying theory known from marketing research. We use 6 months of usage data from an educational Internet information broker and compare useful recommendations identified by users from the target group of the broker (peers) with the recommendations produced by the algorithms. The results of the evaluation presented in this paper suggest that frequent itemsets from usage histories match the concept of useful recommendations expressed by peers with satisfactory accuracy (higher than 70%) and precision (between 60% and 90%). Also the evaluation suggests that both algorithms studied in the paper perform similar on real-world data if they are tuned properly.
[21] Andreas Geyer-Schulz and Michael Hahsler. Evaluation of recommender algorithms for an internet information broker based on simple association rules and on the repeat-buying theory. In Brij Masand, Myra Spiliopoulou, Jaideep Srivastava, and Osmar R. Zaiane, editors, Fourth WEBKDD Workshop: Web Mining for Usage Patterns & User Profiles, pages 100--114, Edmonton, Canada, July 2002. [ preprint (PDF) ]
Association rules are a widely used technique to generate recommendations in commercial and research recommender systems. Since more and more Web sites, especially of retailers, offer automatic recommender services using Web usage mining, evaluation of recommender algorithms becomes increasingly important. In this paper we first present a framework for the evaluation of different aspects of recommender systems based on the process of discovering knowledge in databases of Fayyad et al. and then we focus on the comparison of the performance of two recommender algorithms based on frequent itemsets. The first recommender algorithm uses association rules, and the other recommender algorithm is based on the repeat-buying theory known from marketing research. For the evaluation we concentrated on how well the patterns extracted from usage data match the concept of useful recommendations of users. We use 6 month of usage data from an educational Internet information broker and compare useful recommendations identified by users from the target group of the broker with the results of the recommender algorithms. The results of the evaluation presented in this paper suggest that frequent itemsets from purchase histories match the concept of useful recommendations expressed by users with satisfactory accuracy (higher than 70%) and precision (between 60% and 90%). Also the evaluation suggests that both algorithms studied in the paper perform similar on real-world data if they are tuned properly.