Summary: | The development of the digital age has produced a considerable increase in data volumes. These large volumes of data are called big data because they exceed the processing capacity of conventional database systems. Many sectors see opportunities and applications for anomaly detection in big data problems. Data mining techniques are particularly useful for this type of analysis because they allow patterns and relationships to be extracted from large amounts of data. Processing and analyzing these data volumes requires tools capable of handling them, such as Apache Spark and Hadoop; however, these tools do not provide specific algorithms for anomaly detection. The general objective of this work is to develop a new algorithm for neighborhood-based anomaly detection in big data problems. Based on a comparative study, the KNNW algorithm was selected for its results as the basis for designing a big data variant. The big data algorithm was implemented in Apache Spark using the MapReduce parallel programming paradigm. Different experiments were then performed to analyze the behavior of the algorithm under different configurations. In these experiments, the execution times and the quality of the results of the sequential variant and the big data variant were compared. The big data variant obtained better results, with a statistically significant difference, showing that the big data variant, KNNW-BigData, can process large volumes of data.
Keywords: big data; data mining; anomaly detection; MapReduce.
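To illustrate the general idea of a neighborhood-based anomaly score computed in MapReduce style on Spark, the following is a minimal PySpark sketch. It is not the authors' KNNW-BigData implementation: it uses a naive O(n^2) cartesian product rather than the partitioning optimizations a real big data variant would need, and the sample points, the parameter k, and all names are illustrative assumptions. It only shows the common KNN-weight scoring scheme, in which each point's anomaly score is the sum of the distances to its k nearest neighbors.

```python
# Illustrative sketch only: a naive MapReduce-style KNN-weight anomaly score
# in PySpark. This is NOT the KNNW-BigData algorithm from the paper; the
# partition-based optimizations of the real variant are omitted.
from math import dist
from pyspark import SparkContext

sc = SparkContext(appName="knnw-sketch")

k = 2  # number of neighbors (assumed parameter)
points = sc.parallelize([
    (0, (1.0, 1.0)), (1, (1.2, 0.9)), (2, (0.9, 1.1)),
    (3, (1.1, 1.0)), (4, (9.0, 9.0)),  # the last point is an obvious outlier
])

# Map phase: emit every pairwise distance, keyed by point id (naive O(n^2)).
pairs = (points.cartesian(points)
         .filter(lambda p: p[0][0] != p[1][0])
         .map(lambda p: (p[0][0], dist(p[0][1], p[1][1]))))

# Reduce phase: keep the k smallest distances per point and sum them;
# this sum (the "weight" of the k nearest neighbors) is the anomaly score.
scores = (pairs.groupByKey()
          .mapValues(lambda ds: sum(sorted(ds)[:k])))

# Points with the largest scores are reported as anomalies.
for point_id, score in scores.top(2, key=lambda kv: kv[1]):
    print(point_id, score)

sc.stop()
```

Run against the toy data above, point 4 receives by far the largest score, since its nearest neighbors are all distant; a distributed variant replaces the cartesian product with a partition-wise neighbor search to keep the computation scalable.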