Introduces the measure LEVERAGE, the simplest function that satisfies the author's three principles for rule-interest functions: it is 0 if the variables are statistically independent, monotonically increasing as the variables occur together more often, and monotonically decreasing as one of the variables alone occurs more often.
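As a minimal sketch, leverage is commonly written as the observed joint support minus the support expected under independence, supp(X ∪ Y) − supp(X)·supp(Y). The function name and the representation of transactions as Python sets are assumptions made for illustration only.

```python
def leverage(transactions, x, y):
    """Leverage of the rule x -> y over a list of transactions (sets of items)."""
    x, y = set(x), set(y)
    n = len(transactions)
    # Relative frequencies (supports) of x, y, and x and y together.
    supp_x = sum(x <= t for t in transactions) / n
    supp_y = sum(y <= t for t in transactions) / n
    supp_xy = sum((x | y) <= t for t in transactions) / n
    # Observed co-occurrence minus co-occurrence expected under independence.
    return supp_xy - supp_x * supp_y
```

With this definition, statistically independent items give a leverage of 0, and items that always occur together give a positive value.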
Introduces CONVICTION (as an improvement to confidence based on implication rules) and INTEREST (later called LIFT).
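The two measures can be sketched from supports alone, using the commonly stated definitions lift = supp(X ∪ Y) / (supp(X)·supp(Y)) and conviction = (1 − supp(Y)) / (1 − conf(X → Y)). The function names and the choice to pass supports as plain floats are illustrative assumptions.

```python
import math

def lift(supp_x, supp_y, supp_xy):
    """Ratio of observed joint support to the support expected under independence."""
    return supp_xy / (supp_x * supp_y)

def conviction(supp_x, supp_y, supp_xy):
    """Conviction of x -> y; infinite when the rule is never violated."""
    confidence = supp_xy / supp_x
    if confidence >= 1.0:
        # The rule always holds, so the denominator (1 - confidence) is zero.
        return math.inf
    return (1.0 - supp_y) / (1.0 - confidence)
```

Both measures equal 1 under independence. Unlike lift, conviction is not symmetric in X and Y, which is why it can be read as a measure of implication.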
Points out weaknesses of the large frequent itemset method using support (spuriousness, dense datasets) and shows that lift only gives values close to one for very frequent items, even if they are perfectly positively correlated. COLLECTIVE STRENGTH is introduced. It uses the violation rate of an itemset, which is the fraction of transactions that contain some, but not all, items of the itemset. The violation rate is compared to the expected violation rate under independence. Collective strength is downward closed.
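The comparison of observed and expected violation rates can be sketched as follows. The formula used here, C(I) = ((1 − v) / (1 − E[v])) · (E[v] / v), is the form in which collective strength is usually stated and is an assumption as far as this annotation is concerned, as are the function name and the set-based transaction representation.

```python
import math

def collective_strength(transactions, itemset):
    """Collective strength of an itemset over a list of transactions (sets)."""
    itemset = set(itemset)
    n = len(transactions)
    # A violation is a transaction with some, but not all, items of the itemset.
    violations = sum(0 < len(itemset & t) < len(itemset) for t in transactions)
    v = violations / n
    # Under independence a transaction is no violation if it contains
    # all items (p_all) or none of them (p_none).
    p_all, p_none = 1.0, 1.0
    for item in itemset:
        p = sum(item in t for t in transactions) / n
        p_all *= p
        p_none *= 1.0 - p
    expected_v = 1.0 - p_all - p_none
    if v == 0:
        # No violations observed; treated as infinite strength in this sketch.
        return math.inf
    return ((1.0 - v) / (1.0 - expected_v)) * (expected_v / v)
```

A value of 1 indicates independence, and values above 1 indicate that the items occur together (or are absent together) more often than expected.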
Removes insignificant rules using the chi-square test to test for correlation between the antecedent and the consequent of a rule. DIRECTION SETTING (DS) RULES are also introduced. A DS rule has a positively correlated antecedent and consequent and is not built from a rule with a shorter antecedent that is itself a DS rule. Normally, only a small and concise fraction of rules are DS rules.
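A rough sketch of the pruning step: build the 2x2 contingency table for antecedent and consequent, compute the Pearson chi-square statistic, and keep a rule only if the statistic exceeds the critical value and the correlation is positive. The function names, the count-based interface, and the 5% significance level (critical value 3.841 for one degree of freedom) are assumptions chosen for illustration, not necessarily the settings used in the paper.

```python
CRITICAL_VALUE_5_PERCENT = 3.841  # chi-square, 1 degree of freedom

def chi_square_2x2(n_xy, n_x, n_y, n):
    """Pearson chi-square statistic for antecedent x and consequent y.

    n_xy: transactions with x and y, n_x: with x, n_y: with y, n: all.
    Assumes no marginal count is 0 or n (otherwise an expected count is 0).
    """
    observed = [[n_xy, n_x - n_xy],
                [n_y - n_xy, n - n_x - n_y + n_xy]]
    rows = [n_x, n - n_x]
    cols = [n_y, n - n_y]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

def is_positively_correlated(n_xy, n_x, n_y, n):
    """True if the rule is significant and x and y co-occur more than expected."""
    significant = chi_square_2x2(n_xy, n_x, n_y, n) > CRITICAL_VALUE_5_PERCENT
    return significant and n_xy > n_x * n_y / n
```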
The authors construct several statistical tests to evaluate the significance of discovered associations.
ITEMSET SHARE is the fraction of some measure (e.g., sales, profit) contributed by the items in the set. An itemset is share frequent if its share exceeds a threshold. Share frequency is not downward closed! The article presents several algorithms and heuristics to mine share frequent itemsets.
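A small sketch of why share frequency is not downward closed. It assumes that the share of an itemset counts the measure values of its items only in transactions that contain the whole itemset, divided by the total over all items and transactions; this reading, the function name, and the toy data are assumptions made for illustration.

```python
def itemset_share(transactions, itemset):
    """Share of an itemset; each transaction maps item -> measure value."""
    itemset = set(itemset)
    total = sum(sum(t.values()) for t in transactions)
    contributed = sum(
        sum(t[item] for item in itemset)
        for t in transactions
        if itemset.issubset(t)  # only transactions containing every item
    )
    return contributed / total

# Hypothetical quantities sold per item in two transactions.
sales = [{"a": 1, "b": 10}, {"a": 1}]
threshold = 0.5
print(itemset_share(sales, {"a", "b"}) > threshold)  # superset {a, b}
print(itemset_share(sales, {"a"}) > threshold)       # its subset {a}
```

In this toy data the itemset {a, b} passes the threshold while its subset {a} does not, so subsets of share frequent itemsets cannot be assumed to be share frequent.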
Compares the properties of 21 objective measures (of interest). In general, the measures do not agree with each other. However, the authors show that if support-based pruning or table standardization (of the contingency tables) is used, the measures become highly correlated.
Introduces predictive accuracy, which is the expected value of the confidence of a rule with respect to the process underlying the database. The author shows how predictive accuracy can be calculated from confidence and support measured on a data set using a Bayesian frequency correction (very simplified: confidence is discounted for rules with low support). Also presents an algorithm which finds the top n most predictive association rules (redundant rules with a 0 predictive accuracy improvement are removed) and shows how to estimate the prior distribution needed for the correction.
Presents a statistical test for the deviation from the equilibrium of a rule. The equilibrium for rule a -> b is defined as: the number of transactions which contain a and b together is equal to the number of transactions which contain a and not b.
An optimal rule set (with respect to a metric of interestingness) contains all rules except those with no greater interestingness than one of its more general rules. An optimal rule set is a subset of a nonredundant rule set. The authors present an algorithm called ORD to find an optimal rule set. Classifiers built on optimal class association rules are at least as accurate as those built from CBA and C4.5 rules.
Presents a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. Uses such data and a real-world grocery database to explore the behavior of confidence and lift, two popular interest measures used for rule mining. Also introduces the new probabilistic measures hyper-lift and hyper-confidence.
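The annotation does not spell out the two new measures. The sketch below assumes the definitions as they are usually given: under independence the co-occurrence count of X and Y follows a hypergeometric distribution, hyper-lift divides the observed count by a high quantile (here 0.99) of that distribution, and hyper-confidence is the probability of observing a smaller count. All function and parameter names are illustrative.

```python
from math import comb, inf

def _pmf(r, c_x, c_y, m):
    """P(C_xy = r) when x (count c_x) and y (count c_y) are independent
    in a database of m transactions (hypergeometric distribution)."""
    return comb(c_y, r) * comb(m - c_y, c_x - r) / comb(m, c_x)

def _possible_counts(c_x, c_y, m):
    """Range of co-occurrence counts that are possible given the margins."""
    return range(max(0, c_x + c_y - m), min(c_x, c_y) + 1)

def hyper_confidence(c_xy, c_x, c_y, m):
    """Probability of a co-occurrence count below c_xy under independence."""
    return sum(_pmf(r, c_x, c_y, m)
               for r in _possible_counts(c_x, c_y, m) if r < c_xy)

def hyper_lift(c_xy, c_x, c_y, m, delta=0.99):
    """Observed count divided by the delta-quantile of the null distribution."""
    cdf = 0.0
    quantile = min(c_x, c_y)  # fallback if rounding keeps the CDF below delta
    for r in _possible_counts(c_x, c_y, m):
        cdf += _pmf(r, c_x, c_y, m)
        if cdf >= delta:
            quantile = r
            break
    # A zero quantile is treated as infinite hyper-lift in this sketch.
    return c_xy / quantile if quantile > 0 else inf
```

Dividing by a quantile rather than by the expected count, as lift does, is meant to make the measure less sensitive to random fluctuations for itemsets with low counts.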