Process of Data Cleansing and handling of Memory in
Data Streams
Journal:
GRENZE International Journal of Engineering and Technology
Authors:
Kavitha N, Y Kalpana, Kumar V
Volume:
10
Issue:
2
Grenze ID:
01.GIJET.10.2.389
Pages:
5171-5177
Abstract
Real-time mining [1][2][3] and streaming of data have become increasingly popular in the
data field because they give access to the fastest and latest data. Real-time data mining seeks
to develop a real-time framework that enhances resource efficiency while minimizing
environmental impact. In real-time analysis the data changes at a high rate, so it has to be
processed and updated frequently. Data mining is an interdisciplinary subject comprising
machine learning, statistics, database technology, and artificial intelligence. Data mining
primarily aims to decipher the past and anticipate the future through data exploration and
analysis, a process known as Knowledge Discovery in Databases (KDD). Data mining typically
stores its data in a local data set hosted on local computers connected to a network. In the real
world, data has become abundant with the advent of data streams, which invariably raises the
question of data storage. Moreover, data received as a stream is not clean. Unclean data
should not be stored in the database and is not effective for data analysis. A database is meant
to hold an organized repository of related information; hence, data must be cleaned
before storage. The cleaning is tailored to the analysis the data is intended for. Once the data is cleaned,
the question of the memory needed to store it arises. Storing real-time data is a non-trivial
process, because no data from the streams may be lost. Data cleaning and memory
management for real-time data are always challenging. This research proposes a novel methodology
for data cleaning and memory management to overcome these issues. A scheduler executes an
algorithm at a specific interval to separate the test data from the trial data. Trial data is
retained for further analysis, and test data is discarded at each interval. The test data is
derived from the trial data.
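
The scheduled separation described above could be sketched roughly as follows. This is an illustrative outline only, not the authors' algorithm: the abstract does not say how records are cleaned or how test data is told apart from trial data, so the `clean` rule (drop records with missing fields), the `is_trial` predicate, and the `IntervalBuffer` class are all hypothetical names and assumptions. The scheduler is assumed to call `tick()` once per interval.

```python
from typing import Callable, Dict, List, Optional

# A stream record: field name -> value, where None marks a missing value.
Record = Dict[str, Optional[float]]


def clean(records: List[Record]) -> List[Record]:
    """Assumed cleaning rule: drop any record that has a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]


class IntervalBuffer:
    """Hypothetical buffer that a scheduler flushes at a fixed interval."""

    def __init__(self, is_trial: Callable[[Record], bool]) -> None:
        # is_trial is an assumed predicate; the abstract does not define it.
        self.is_trial = is_trial
        self.pending: List[Record] = []  # raw records since the last tick
        self.trial: List[Record] = []    # cleaned records kept for analysis

    def receive(self, record: Record) -> None:
        """Append one incoming stream record to the pending buffer."""
        self.pending.append(record)

    def tick(self) -> int:
        """Run once per interval; return how many test records were discarded."""
        cleaned = clean(self.pending)
        kept = [r for r in cleaned if self.is_trial(r)]
        self.trial.extend(kept)
        self.pending.clear()  # frees the memory held by raw and test records
        return len(cleaned) - len(kept)


# Example: one interval with a valid, an incomplete, and a rejected record.
buf = IntervalBuffer(is_trial=lambda r: r["value"] >= 0)
for rec in [{"value": 1.0}, {"value": None}, {"value": -2.0}]:
    buf.receive(rec)
discarded = buf.tick()
```

Clearing the pending buffer on every tick is what bounds memory use in this sketch: only the trial records survive an interval, regardless of how many raw records arrived.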