Deakin University
Browse

File(s) under permanent embargo

Dual incremental fuzzy schemes for frequent itemsets discovery in streaming numeric data

journal contribution
posted on 2020-04-01, 00:00 authored by H Zheng, P Li, Q Liu, J Chen, Belinda Huang, J Wu, Y Xue, J He
Discovering frequent itemsets is essential for finding association rules, yet too computational expensive using existing algorithms. It is even more challenging to find frequent itemsets upon streaming numeric data. The streaming characteristic leads to a challenge that streaming numeric data cannot be scanned repetitively. The numeric characteristic requires that streaming numeric data should be pre-processed into itemsets, e.g., fuzzy-set methods can transform numeric data into itemsets with non-integer membership values. This leads to a challenge that the frequency of itemsets are usually not integer. To overcome such challenges, fast methods and stream processing methods have been applied. However, the existing algorithms usually either still need to re-visit some previous data multiple times, or cannot count non-integer frequencies. Those existing algorithms re-visiting some previous data have to sacrifice large memory spaces to cache those previous data to avoid repetitive scanning. When dealing with big streaming data nowadays, such large-memory requirement often goes beyond the capacity of many computers. Those existing algorithms unable to count non-integer frequencies would be very inaccurate in estimating the non-integer frequencies of frequent itemsets if used with integer approximation of frequency-counting.

To solve the aforementioned issues, in this paper we propose two incremental schemes for frequent itemsets discovery that are capable to work efficiently with streaming numeric data. In particular, they are able to count non-integer frequency without re-visiting any previous data. The key of our schemes to the benefits in efficiency is to extract essential statistics that would occupy much less memory than the raw data do for the ongoing streaming data. This grants the advantages of our schemes 1) allowing non-integer counting and thus natural integration with a fuzzy-set discretization method to boost robustness and anti-noise capability for numeric data, 2) enabling the design of a decay ratio for different data distributions, which can be adapted for three general stream models: landmark, damped and sliding windows, and 3) achieving highly-accurate fuzzy-item-sets discovery with efficient stream-processing.

Experimental studies demonstrate the efficiency and effectiveness of our dual schemes with both synthetic and real-world datasets.

History

Journal

Information Sciences

Volume

514

Pagination

15 - 43

Publisher

Elsevier

Location

Philadelphia, Pa.

ISSN

0020-0255

Language

eng

Publication classification

C1.1 Refereed article in a scholarly journal

Copyright notice

2019, Elsevier

Usage metrics

    Research Publications

    Categories

    No categories selected

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC