|
[1]
|
Michael Hahsler, Christian Buchta, and Kurt Hornik.
Selective association rule generation.
Computational Statistics, 23(2):303-315, April 2008.
[ bib |
DOI |
at the publisher |
.pdf ]
Mining association rules is a popular and well researched
method for discovering interesting relations between variables in
large databases. A practical problem is that at medium to low support
values often a large number of frequent itemsets and an even larger
number of association rules are found in a database. A widely used
approach is to gradually increase minimum support and minimum
confidence or to filter the found rules using increasingly strict
constraints on additional measures of interestingness until the set of
rules found is reduced to a manageable size. In this paper we describe
a different approach which is based on the idea to first define a set
of “interesting” itemsets (e.g., by a mixture of mining and expert
knowledge) and then, in a second step, to selectively generate rules
for only these itemsets. The main advantage of this approach over
increasing thresholds or filtering rules is that the number of rules
found is significantly reduced while at the same time it is not
necessary to increase the support and confidence thresholds which
might lead to missing important information in the database.
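
The two-step idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy transactions, and the choice to emit every split of an itemset into a non-empty left-hand and right-hand side are assumptions made for illustration only.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules_for_itemsets(interesting, transactions, min_confidence=0.0):
    """Generate rules X => Y only from the given 'interesting' itemsets.

    Each rule is a tuple (lhs, rhs, support, confidence), where lhs and rhs
    partition one of the interesting itemsets (an assumption of this sketch).
    """
    rules = []
    for itemset in map(frozenset, interesting):
        supp_xy = support(itemset, transactions)
        if supp_xy == 0:
            continue  # the itemset never occurs, so no rule can be formed
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                conf = supp_xy / support(lhs, transactions)
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, supp_xy, conf))
    return rules

# Hypothetical toy data: four market baskets.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

# Step 1 (here done by hand): pick the interesting itemsets.
# Step 2: generate rules for these itemsets only.
rules = rules_for_itemsets([{"bread", "milk"}], transactions)
for lhs, rhs, supp, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), round(supp, 2), round(conf, 2))
```

Because rules are built only from the supplied itemsets, the size of the rule set is controlled by the itemset selection rather than by raising the support or confidence thresholds.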
|
|
[2]
|
Michael Hahsler and Kurt Hornik.
Building on the arules infrastructure for analyzing transaction data
with R.
In R. Decker and H.-J. Lenz, editors, Advances in Data Analysis,
Proceedings of the 30th Annual Conference of the Gesellschaft für
Klassifikation e.V., Freie Universität Berlin, March 8-10, 2006, Studies
in Classification, Data Analysis, and Knowledge Organization, pages 449-456.
Springer-Verlag, 2007.
[ bib |
at the publisher |
.pdf ]
The free and extensible statistical computing environment R with its
enormous number of extension packages already provides many state-of-the-art
techniques for data analysis. Support for association rule mining,
a popular exploratory method which can be used, among other purposes,
for uncovering cross-selling opportunities in market baskets,
has become available recently with the R extension package arules.
After a brief introduction to transaction data and association rules,
we present the formal framework implemented in arules and demonstrate
how clustering and association rule mining can be applied together
using a market basket data set from a typical retailer. This paper
shows that implementing a basic infrastructure with formal classes
in R provides an extensible basis which can very efficiently be employed
for developing new applications (such as clustering transactions)
in addition to association rule mining.
|
|
[3]
|
Michael Hahsler and Kurt Hornik.
New probabilistic interest measures for association rules.
Intelligent Data Analysis, 11(5):437-455, 2007.
[ bib |
at the publisher |
.pdf ]
Mining association rules is an important technique for discovering
meaningful patterns in transaction databases. Many different measures
of interestingness have been proposed for association rules. However,
these measures fail to take the probabilistic properties of the mined
data into account. In this paper, we start by presenting a simple
probabilistic framework for transaction data which can be used to
simulate transaction data when no associations are present. We use
such data and a real-world database from a grocery outlet to explore
the behavior of confidence and lift, two popular interest measures
used for rule mining. The results show that confidence is systematically
influenced by the frequency of the items in the left hand side of
rules and that lift performs poorly at filtering random noise in transaction
data. Based on the probabilistic framework we develop two new interest
measures, hyper-lift and hyper-confidence, which can be used to filter
or order mined association rules. The new measures show significantly
better performance than lift for applications where spurious rules
are problematic.
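
The abstract does not spell out the formulas, so the following Python sketch rests on assumptions: that the co-occurrence count of X and Y under independence is modeled as hypergeometric, that hyper-lift divides the observed count by the delta-quantile of that distribution (delta = 0.99 here), and that hyper-confidence is the probability of observing a smaller count than the actual one. The function names and the example counts are hypothetical.

```python
from math import comb

def hypergeom_pmf(r, m, c_x, c_y):
    """P(C_XY = r) if X occurs in c_x and Y in c_y of m transactions, independently."""
    return comb(c_y, r) * comb(m - c_y, c_x - r) / comb(m, c_x)

def hypergeom_quantile(delta, m, c_x, c_y):
    """Smallest count r with P(C_XY <= r) >= delta."""
    lo, hi = max(0, c_x + c_y - m), min(c_x, c_y)
    cumulative = 0.0
    for r in range(lo, hi + 1):
        cumulative += hypergeom_pmf(r, m, c_x, c_y)
        if cumulative >= delta:
            return r
    return hi  # guards against rounding error in the cumulative sum

def interest_measures(m, c_x, c_y, c_xy, delta=0.99):
    """Confidence, lift, and the assumed forms of hyper-lift and hyper-confidence."""
    q = hypergeom_quantile(delta, m, c_x, c_y)
    lo = max(0, c_x + c_y - m)
    return {
        "confidence": c_xy / c_x,
        "lift": c_xy * m / (c_x * c_y),
        "hyper_lift": c_xy / q if q > 0 else float("inf"),
        # P(C_XY < c_xy): probability mass strictly below the observed count
        "hyper_confidence": sum(hypergeom_pmf(r, m, c_x, c_y) for r in range(lo, c_xy)),
    }

# Hypothetical counts: 1000 transactions, X and Y in 100 each, together in 30.
measures = interest_measures(m=1000, c_x=100, c_y=100, c_xy=30)
print(measures)
```

In this sketch hyper-lift compares the observed count with a high quantile rather than with the expected count, so it is more conservative than lift for the same rule.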
|
|
[4]
|
Michael Hahsler.
A model-based frequency constraint for mining associations from
transaction data.
Data Mining and Knowledge Discovery, 13(2):137-166, September
2006.
[ bib |
DOI |
at the publisher |
.pdf ]
Mining frequent itemsets is a popular method for finding associated
items in databases. For this method, support, the co-occurrence frequency
of the items which form an association, is used as the primary indicator
of the association's significance. A single user-specified support
threshold is used to decide if associations should be further investigated.
Support has some known problems with rare items, favors shorter itemsets
and sometimes produces misleading associations. In this paper we
develop a novel model-based frequency constraint as an alternative
to a single, user-specified minimum support. The constraint utilizes
knowledge of the process generating transaction data by applying
a simple stochastic mixture model (the NB model) which allows for
transaction data's typically highly skewed item frequency distribution.
A user-specified precision threshold is used together with the model
to find local frequency thresholds for groups of itemsets. Based
on the constraint we develop the notion of NB-frequent itemsets and
adapt a mining algorithm to find all NB-frequent itemsets in a database.
In experiments with publicly available transaction databases we show
that the new constraint provides improvements over a single minimum
support threshold and that the precision threshold is more robust
and easier to set and interpret by the user.
|
|
[5]
|
Michael Hahsler and Kurt Hornik.
New probabilistic interest measures for association rules.
Report 38, Research Report Series, Department of Statistics and
Mathematics, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien,
Austria, August 2006.
[ bib |
at the publisher ]
Mining association rules is an important technique for discovering
meaningful patterns in transaction databases. Many different measures
of interestingness have been proposed for association rules. However,
these measures fail to take the probabilistic properties of the mined
data into account. In this paper, we start by presenting a simple
probabilistic framework for transaction data which can be used to
simulate transaction data when no associations are present. We use
such data and a real-world database from a grocery outlet to explore
the behavior of confidence and lift, two popular interest measures
used for rule mining. The results show that confidence is systematically
influenced by the frequency of the items in the left hand side of
rules and that lift performs poorly at filtering random noise in transaction
data. Based on the probabilistic framework we develop two new interest
measures, hyper-lift and hyper-confidence, which can be used to filter
or order mined association rules. The new measures show significantly
better performance than lift for applications where spurious rules
are problematic.
|
|
[6]
|
Michael Hahsler, Kurt Hornik, and Thomas Reutterer.
Implications of probabilistic data modeling for mining association
rules.
In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, and
W. Gaul, editors, From Data and Information Analysis to Knowledge
Engineering, Proceedings of the 29th Annual Conference of the Gesellschaft
für Klassifikation e.V., University of Magdeburg, March 9-11, 2005,
Studies in Classification, Data Analysis, and Knowledge Organization, pages
598-605. Springer-Verlag, 2006.
[ bib |
at the publisher |
.pdf ]
Mining association rules is an important technique for discovering
meaningful patterns in transaction databases. In the current literature,
the properties of algorithms to mine association rules are discussed
in great detail. We present a simple probabilistic framework for
transaction data which can be used to simulate transaction data when
no associations are present. We use such data and a real-world grocery
database to explore the behavior of confidence and lift, two popular
interest measures used for rule mining. The results show that confidence
is systematically influenced by the frequency of the items in the
left-hand side of rules and that lift performs poorly at filtering random
noise in transaction data. The probabilistic data modeling approach
presented in this paper not only is a valuable framework to analyze
interest measures but also provides a starting point for further
research to develop new interest measures which are based on statistical
tests and geared towards the specific properties of transaction data.
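
Simulating transaction data without associations, as this abstract describes, can be sketched in Python as follows. This is a simplification and not the paper's framework: it fixes the number of transactions and lets each item occur independently with its own probability. The item probabilities, function names, and seed are hypothetical.

```python
import random

def simulate_independent(n_transactions, item_probs, seed=0):
    """Draw transactions in which every item occurs independently of the others."""
    rng = random.Random(seed)
    return [
        frozenset(item for item, p in item_probs.items() if rng.random() < p)
        for _ in range(n_transactions)
    ]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions)

def confidence_and_lift(lhs, rhs, transactions):
    """Return (confidence, lift) for lhs => rhs, or None if either side never occurs."""
    m = len(transactions)
    c_x, c_y = count(lhs, transactions), count(rhs, transactions)
    c_xy = count(set(lhs) | set(rhs), transactions)
    if c_x == 0 or c_y == 0:
        return None
    return c_xy / c_x, c_xy * m / (c_x * c_y)

# Hypothetical item frequencies: one common item and two rare ones.
item_probs = {"a": 0.5, "b": 0.05, "c": 0.01}
data = simulate_independent(10_000, item_probs)

for lhs, rhs in [("a", "b"), ("b", "a"), ("b", "c")]:
    print(lhs, "=>", rhs, confidence_and_lift({lhs}, {rhs}, data))
```

Since the items are independent by construction, any rule that scores well on such data is noise, which makes the simulated data a baseline for judging how an interest measure behaves.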
|
|
[7]
|
Michael Hahsler, Bettina Grün, and Kurt Hornik.
arules - A computational environment for mining association rules
and frequent item sets.
Journal of Statistical Software, 14(15):1-25, October 2005.
[ bib |
at the publisher |
.pdf ]
Mining frequent itemsets and association rules is a popular and well
researched approach for discovering interesting relationships between
variables in large databases. The R package arules presented in this
paper provides a basic infrastructure for creating and manipulating
input data sets and for analyzing the resulting itemsets and rules.
The package also includes interfaces to two fast mining algorithms,
the popular C implementations of Apriori and Eclat by Christian Borgelt.
These algorithms can be used to mine frequent itemsets, maximal frequent
itemsets, closed frequent itemsets and association rules.
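
The difference between frequent, closed frequent, and maximal frequent itemsets can be shown with a small Python sketch. It uses naive enumeration of all candidates, not Apriori or Eclat, and it does not use the arules interface; the function names and toy transactions are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Map each frequent itemset to its support (naive enumeration, small data only)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in map(frozenset, combinations(items, k)):
            supp = sum(candidate <= t for t in transactions) / n
            if supp >= min_support:
                result[candidate] = supp
    return result

def closed_itemsets(frequent):
    """Frequent itemsets that have no proper superset with the same support."""
    return {
        s for s, supp in frequent.items()
        if not any(s < t and frequent[t] == supp for t in frequent)
    }

def maximal_itemsets(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < t for t in frequent)}

# Hypothetical toy data.
transactions = [frozenset(t) for t in ({"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"})]
freq = frequent_itemsets(transactions, min_support=0.5)
closed = closed_itemsets(freq)
maximal = maximal_itemsets(freq)
print(len(freq), len(closed), len(maximal))
```

Every maximal frequent itemset is also closed, so the three sets are nested; the closed and maximal sets are compact summaries of the full set of frequent itemsets.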
|
|
[8]
|
Michael Hahsler, Bettina Grün, and Kurt Hornik.
A computational environment for mining association rules and frequent
item sets.
Report 15, Research Report Series, Department of Statistics and
Mathematics, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien,
Austria, April 2005.
[ bib |
at the publisher ]
Mining frequent itemsets and association rules is a popular and well
researched approach to discovering interesting relationships between
variables in large databases. The R package arules presented in this
paper provides a basic infrastructure for creating and manipulating
input data sets and for analyzing the resulting itemsets and rules.
The package also includes interfaces to two fast mining algorithms,
the popular C implementations of Apriori and Eclat by Christian Borgelt.
These algorithms can be used to mine frequent itemsets, maximal frequent
itemsets, closed frequent itemsets and association rules.
|
|
[9]
|
Michael Hahsler, Kurt Hornik, and Thomas Reutterer.
Implications of probabilistic data modeling for rule mining.
Report 14, Research Report Series, Department of Statistics and
Mathematics, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien,
Austria, March 2005.
[ bib |
at the publisher ]
Mining association rules is an important technique for discovering
meaningful patterns in transaction databases. In the current literature,
the properties of algorithms to mine associations are discussed in
great detail. In this paper we investigate properties of transaction
data sets from a probabilistic point of view. We present a simple
probabilistic framework for transaction data and its implementation
using the R statistical computing environment. The framework can
be used to simulate transaction data when no associations are present.
We use such data to explore how well confidence and lift, two popular
interest measures used for rule mining, filter noise. Based
on the framework we develop the measure hyperlift and we compare
this new measure to lift using simulated data and a real-world grocery
database.
|
|
[10]
|
Michael Hahsler.
A model-based frequency constraint for mining associations from
transaction data.
Working Paper 07/2004, Working Papers on Information Processing and
Information Management, Institut für Informationsverarbeitung und
-wirtschaft, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien,
Austria, November 2004.
[ bib |
at the publisher ]
In this paper we develop an alternative to minimum support which
utilizes knowledge of the process which generates transaction data
and allows for highly skewed frequency distributions. We apply a
simple stochastic model (the NB model), which is known for its usefulness
to describe item occurrences in transaction data, to develop a frequency
constraint. This model-based frequency constraint is used together
with a precision threshold to find individual support thresholds
for groups of associations. We develop the notion of NB-frequent
itemsets and present two mining algorithms which find all NB-frequent
itemsets in a database. In experiments with publicly available transaction
databases we show that the new constraint can provide significant
improvements over a single minimum support threshold and that the
precision threshold is easier to use.
|
|
[11]
|
Andreas Geyer-Schulz, Michael Hahsler, and Anke Thede.
Comparing association-rules and repeat-buying based recommender
systems in a B2B environment.
In M. Schader, W. Gaul, and M. Vichi, editors, Between Data
Science and Applied Data Analysis, Proceedings of the 26th Annual Conference
of the Gesellschaft für Klassifikation e.V., University of Mannheim, July
22-24, 2002, Studies in Classification, Data Analysis, and Knowledge
Organization, pages 421-429. Springer-Verlag, July 2003.
[ bib |
at the publisher ]
In this contribution we present a systematic evaluation and comparison
of recommender systems based on simple association rules and on repeat-buying
theory. Both recommender services are based on the customer purchase
histories of a medium-sized B2B-merchant for computer accessories.
With the help of product managers an evaluation set for recommendations
was generated. With regard to this evaluation set, recommendations
produced by both methods are evaluated and several error measures
are computed. This provides an empirical test whether frequent item
sets or outliers of a stochastic purchase incidence model are suitable
concepts for automatically generating recommendations. Furthermore,
the loss functions (performance measures) of the two models are compared
and the sensitivity with regard to a misspecification of the model
parameters is discussed.
|
|
[12]
|
Andreas Geyer-Schulz and Michael Hahsler.
Comparing two recommender algorithms with the help of recommendations
by peers.
In O.R. Zaiane, J. Srivastava, M. Spiliopoulou, and B. Masand,
editors, WEBKDD 2002 - Mining Web Data for Discovering Usage Patterns
and Profiles 4th International Workshop, Edmonton, Canada, July 2002, Revised
Papers, Lecture Notes in Computer Science LNAI 2703, pages 137-158.
Springer-Verlag, 2003.
(Revised version of the WEBKDD 2002 paper “Evaluation of Recommender
Algorithms for an Internet Information Broker based on Simple Association
Rules and on the Repeat-Buying Theory”).
[ bib |
at the publisher |
.pdf ]
Since more and more Web sites, especially sites of retailers, offer
automatic recommendation services using Web usage mining, evaluation
of recommender algorithms has become increasingly important. In this
paper we present a framework for the evaluation of different aspects
of recommender systems based on the process of discovering knowledge
in databases introduced by Fayyad et al. and we summarize research
already done in this area. One aspect identified in the presented
evaluation framework is widely neglected when dealing with recommender
algorithms. This aspect is to evaluate how useful patterns extracted
by recommender algorithms are to support the social process of recommending
products to others, a process normally driven by recommendations
by peers or experts. To fill this gap for recommender algorithms
based on frequent itemsets extracted from usage data we evaluate
the usefulness of two algorithms. The first recommender algorithm
uses association rules, and the other algorithm is based on the repeat-buying
theory known from marketing research. We use 6 months of usage data
from an educational Internet information broker and compare useful
recommendations identified by users from the target group of the
broker (peers) with the recommendations produced by the algorithms.
The results of the evaluation presented in this paper suggest that
frequent itemsets from usage histories match the concept of useful
recommendations expressed by peers with satisfactory accuracy (higher
than 70%) and precision (between 60% and 90%). Also the evaluation
suggests that both algorithms studied in the paper perform similarly
on real-world data if they are tuned properly.
|
|
[13]
|
Andreas Geyer-Schulz and Michael Hahsler.
Evaluation of recommender algorithms for an internet information
broker based on simple association rules and on the repeat-buying theory.
In Brij Masand, Myra Spiliopoulou, Jaideep Srivastava, and Osmar R.
Zaiane, editors, Fourth WEBKDD Workshop: Web Mining for Usage Patterns
& User Profiles, pages 100-114, Edmonton, Canada, July 2002.
[ bib |
.pdf ]
Association rules are a widely used technique to generate recommendations
in commercial and research recommender systems. Since more and more
Web sites, especially of retailers, offer automatic recommender services
using Web usage mining, evaluation of recommender algorithms becomes
increasingly important. In this paper we first present a framework
for the evaluation of different aspects of recommender systems based
on the process of discovering knowledge in databases of Fayyad et
al. and then we focus on the comparison of the performance of two
recommender algorithms based on frequent itemsets. The first recommender
algorithm uses association rules, and the other recommender algorithm
is based on the repeat-buying theory known from marketing research.
For the evaluation we concentrated on how well the patterns extracted
from usage data match the concept of useful recommendations of users.
We use 6 months of usage data from an educational Internet information
broker and compare useful recommendations identified by users from
the target group of the broker with the results of the recommender
algorithms. The results of the evaluation presented in this paper
suggest that frequent itemsets from purchase histories match the
concept of useful recommendations expressed by users with satisfactory
accuracy (higher than 70%) and precision (between 60% and 90%).
Also the evaluation suggests that both algorithms studied in the
paper perform similarly on real-world data if they are tuned properly.
|