Proposes using the chi-square test to assess correlation. For an itemset of length l, the test is carried out on an l-dimensional contingency table. Problems are cells with low counts and multiple testing.
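As a minimal sketch of the simplest case (an itemset of two items, hence a 2x2 table), the Pearson chi-square statistic can be computed from transaction counts as shown here. The function name `chi_square_2x2` and the representation of transactions as Python sets are illustrative assumptions, not taken from the paper.

```python
def chi_square_2x2(transactions, a, b):
    """Pearson chi-square statistic for the 2x2 presence/absence table of items a and b.

    Hypothetical helper for illustration; transactions is a list of sets.
    """
    n = len(transactions)
    # observed[i][j]: i = 1 if a is present, j = 1 if b is present
    observed = [[0, 0], [0, 0]]
    for t in transactions:
        observed[int(a in t)][int(b in t)] += 1

    row_totals = [sum(observed[i]) for i in range(2)]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # expected count under independence of a and b
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat
```

With one degree of freedom, the statistic is usually compared against the critical value 3.84 (5% significance level). Cells with small expected counts make this approximation unreliable, which is the low-count problem noted above.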
Journal version of Brin et al. (1997).
Adapts APRIORI to work with different minimum support thresholds assigned to different items (minimum item supports, MIS). To preserve the downward closure property of support, items are sorted by their MIS values.
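The idea can be sketched as follows, assuming that the threshold applied to an itemset is the lowest MIS value among its items. The helper names `mis_order` and `is_frequent_mis` and the toy data are hypothetical, and the sketch only checks a single itemset; it is not the full level-wise algorithm.

```python
def mis_order(items, mis):
    """Sort items by ascending MIS value (ties broken by item name)."""
    return sorted(items, key=lambda item: (mis[item], item))


def is_frequent_mis(itemset, transactions, mis):
    """Check an itemset against the lowest MIS value of its items (assumed rule)."""
    threshold = min(mis[item] for item in itemset)
    support = sum(1 for t in transactions if itemset <= t) / len(transactions)
    return support >= threshold


# Toy data: a rare item gets a lower minimum item support.
transactions = [{"bread", "milk"}, {"bread"}, {"bread", "caviar"}, {"milk"}]
mis = {"bread": 0.5, "milk": 0.5, "caviar": 0.2}
```

In this toy data, `{"bread", "caviar"}` passes because the rare item lowers the threshold, while `{"bread", "milk"}` with the same support does not.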
This paper used JUMPING EMERGING PATTERNS to mine a border for top rules (rules with 100% confidence) for a given consequent. The drawbacks are that only one consequent is mined at a time and that finding rules with other than 100% confidence is difficult.
A reply to Brin et al. (1997). The authors state that the chi-square test tests the whole contingency table, but for larger than 2x2 tables we want to test dependence for single cells.
The paper shows that for data with categorical attributes a UNIVERSAL-EXISTENTIAL UPWARD CLOSURE exists for confidence. With this property, algorithms with confidence-based pruning are possible that use level-wise (from k to k-1) candidate generation. The paper also discusses a disk-based implementation.
To find longer frequent itemsets, the minimal support requirement decreases as a function of the itemset length. An algorithm based on the FP-tree is presented and a property called small valid extension (SVE) is introduced which makes mining efficient in the absence of downward closure.
Searches for unusually frequent itemsets using statistical methods. First, the authors propose stratification of the data to avoid finding spurious associations within strata. Then the deviation of the observed frequency from a baseline frequency (based on independence) is used. Since the deviation is unreliable for low counts, an empirical Bayes model (its 95% confidence limit) is used to produce a posterior distribution of the true ratio of actual to baseline frequencies. The Bayes model gives ratios close to the observed ratios for large samples and reduces (shrinks) the ratio if the sample size gets small (to smooth away noise). For multi-item associations, log-linear models are proposed to find higher-order associations which cannot be explained by pairwise associations.
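The raw quantity that the Bayes model shrinks can be sketched as the ratio of the observed count to the count expected if all items occurred independently. This is only the unshrunk ratio, without stratification or the empirical Bayes step; the function name `observed_to_baseline_ratio` is a hypothetical choice, and the sketch assumes every item occurs in at least one transaction.

```python
def observed_to_baseline_ratio(itemset, transactions):
    """Observed itemset count divided by the count expected under independence."""
    n = len(transactions)
    observed = sum(1 for t in transactions if itemset <= t)

    # Baseline: n times the product of the individual item frequencies.
    baseline = float(n)
    for item in itemset:
        baseline *= sum(1 for t in transactions if item in t) / n

    return observed / baseline
```

A ratio near 1 means the itemset occurs about as often as independence predicts. For small counts the ratio fluctuates strongly, which is what the shrinkage described above is meant to dampen.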
Uses similarity measures between hashed values of rows in a transaction database. The approach in the paper was only shown for associations between two items.
Uses attributes of the items (e.g., price, page dwelling time) to WEIGHT SUPPORT. A support and significance framework is presented which possesses a weighted downward closure property important for pruning the search space.
Omiecinski introduced several alternatives to support. The first measure, ANY-CONFIDENCE, is defined as the confidence of the rule with the largest confidence which can be generated from an itemset. The author states that although finding all itemsets with a set any-confidence would enable us to find all rules with a given minimum confidence, any-confidence cannot be used efficiently as a measure of interestingness since confidence is not downward closed. The second measure is ALL-CONFIDENCE. It is defined as the smallest confidence of all rules which can be produced from an itemset, i.e., all rules produced from an itemset will have a confidence greater than or equal to its all-confidence value. BOND, the last measure, is defined as the ratio of the number of transactions which contain all items of an itemset to the number of transactions which contain at least one of these items. Omiecinski showed that bond and all-confidence are downward closed and, therefore, can be used for efficient mining algorithms.
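The three definitions can be sketched directly by enumerating every rule that can be formed from an itemset. This brute-force version is meant to mirror the definitions, not to be efficient; the function names are illustrative, transactions are assumed to be Python sets, and the itemset is assumed to have at least two items that each occur in the data.

```python
from itertools import combinations


def count(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rule_confidences(itemset, transactions):
    """Confidence of every rule A -> (itemset minus A), A a non-empty proper subset."""
    items = sorted(itemset)
    full = count(frozenset(items), transactions)
    confidences = []
    for k in range(1, len(items)):
        for antecedent in combinations(items, k):
            antecedent_count = count(frozenset(antecedent), transactions)
            if antecedent_count > 0:
                confidences.append(full / antecedent_count)
    return confidences


def any_confidence(itemset, transactions):
    """Largest confidence of any rule generated from the itemset."""
    return max(rule_confidences(itemset, transactions))


def all_confidence(itemset, transactions):
    """Smallest confidence of any rule generated from the itemset."""
    return min(rule_confidences(itemset, transactions))


def bond(itemset, transactions):
    """Transactions containing all items divided by those containing at least one."""
    with_any = sum(1 for t in transactions if itemset & t)
    return count(itemset, transactions) / with_any
```

For the toy data `[{"a", "b"}, {"a", "b"}, {"a"}, {"a"}, {"c"}]` and the itemset `{"a", "b"}`, the rule b -> a has confidence 1.0 and a -> b has confidence 0.5, so any-confidence and all-confidence differ.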
Support-based pruning strategies are not effective for data sets with skewed support distributions. The authors propose the concept of hyperclique patterns, which use an objective measure called h-confidence (equal to all-confidence by Omiecinski, 2003) to identify strong affinity patterns. The generation of so-called cross-support patterns (patterns containing items with substantially different support) is avoided by h-confidence's cross-support property.
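One way to see why h-confidence filters out cross-support patterns is the following sketch. It assumes h-confidence is computed as the itemset count divided by the largest single-item count, and uses the fact that an itemset can never be more frequent than its rarest item, so the ratio of the smallest to the largest item count is an upper bound. The function names are hypothetical, and each item is assumed to occur at least once.

```python
def h_confidence(itemset, transactions):
    """Itemset count divided by the count of its most frequent item (assumed form)."""
    full = sum(1 for t in transactions if itemset <= t)
    largest = max(sum(1 for t in transactions if item in t) for item in itemset)
    return full / largest


def cross_support_bound(itemset, transactions):
    """Upper bound on h-confidence: rarest item count over most frequent item count."""
    counts = [sum(1 for t in transactions if item in t) for item in itemset]
    return min(counts) / max(counts)
```

If the bound is already below the chosen h-confidence threshold, the itemset mixes items of very different support and can be discarded without counting its own support.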
See Seno and Karypis 2001.
Develops a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) and uses a user-specified precision threshold to find local frequency thresholds for groups of itemsets (NB-frequent itemsets). The paper reports that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier for the user to set and interpret.