Anomalies detection for big data
The development of the digital age has resulted in a considerable increase in data volumes. These large volumes of data have been called big data since they exceed the processing capacity of conventional database systems. Several sectors consider various opportunities and applications in the detecti...
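Main Authors: | Torres-Domínguez, Omar; Sabater-Fernández, Samuel; Bravo-Ilisatigui, Lisandra; Martin-Rodríguez, Diana; García-Borroto, Milton |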
---|---|
Format: | Online |
Language: | spa |
Published: | Universidad Pedagógica y Tecnológica de Colombia, 2019 |
Subjects: | big data; data mining; detecting anomalies; MapReduce |
Online Access: | https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793 |
_version_ | 1801706095287205888 |
---|---|
author | Torres-Domínguez, Omar; Sabater-Fernández, Samuel; Bravo-Ilisatigui, Lisandra; Martin-Rodríguez, Diana; García-Borroto, Milton |
author_facet | Torres-Domínguez, Omar; Sabater-Fernández, Samuel; Bravo-Ilisatigui, Lisandra; Martin-Rodríguez, Diana; García-Borroto, Milton |
author_sort | Torres-Domínguez, Omar |
collection | OJS |
description | The development of the digital age has resulted in a considerable increase in data volumes. These large volumes of data are called big data because they exceed the processing capacity of conventional database systems. Several sectors see opportunities and applications for anomaly detection in big data problems. Data mining techniques can be very useful for this type of analysis because they extract patterns and relationships from large amounts of data. Processing and analyzing these data volumes requires tools capable of handling them, such as Apache Spark and Hadoop. These tools do not have specific algorithms for detecting anomalies. The general objective of the work is to develop a new neighborhood-based algorithm for anomaly detection in big data problems. Based on a comparative study, the KNNW algorithm was selected for its results as the basis for a big data variant. The big data algorithm was implemented in Apache Spark using the MapReduce parallel programming paradigm. Different experiments were then performed to analyze the behavior of the algorithm under different configurations. The experiments compared execution times and the quality of the results between the sequential variant and the big data variant. The big data variant obtained better results, with a significant difference. As a result, the big data variant, KNNW-BigData, can process large volumes of data.
Keywords: big data; data mining; detecting anomalies; MapReduce. |
format | Online |
id | oai:oai.revistas.uptc.edu.co:article-8793 |
institution | Revista Facultad de Ingeniería |
language | spa |
publishDate | 2019 |
publisher | Universidad Pedagógica y Tecnológica de Colombia |
record_format | ojs |
spelling | oai:oai.revistas.uptc.edu.co:article-8793; 2021-07-13T02:26:17Z; Anomalies detection for big data; Detección de anomalías en grandes volúmenes de datos; Torres-Domínguez, Omar; Sabater-Fernández, Samuel; Bravo-Ilisatigui, Lisandra; Martin-Rodríguez, Diana; García-Borroto, Milton; big data; data mining; detecting anomalies; MapReduce; Universidad Pedagógica y Tecnológica de Colombia; 2019-01-10; info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion; research; application/pdf; application/xml; https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793; 10.19053/01211129.v28.n50.2019.8793; Revista Facultad de Ingeniería; Vol. 28 No. 50 (2019); 62-76; 2357-5328; 0121-1129; spa; https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793/7288; https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793/7504; https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793/7533 |
title | Anomalies detection for big data |
title_alt | Detección de anomalías en grandes volúmenes de datos |
topic | big data; data mining; detecting anomalies; MapReduce |
topic_facet | big data; data mining; detecting anomalies; MapReduce |
url | https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793 |
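
The abstract describes a neighborhood-based anomaly detector (KNNW) adapted to a MapReduce setting, but this record does not include the algorithm itself. The following is only a minimal sketch of the general idea, under stated assumptions: it assumes the anomaly score of a point is the sum of the distances to its k nearest neighbors, and it simulates the map and reduce steps in plain Python rather than in Apache Spark. The function names (`map_partition`, `reduce_candidates`, `knn_weight_scores`) and the partitioning scheme are hypothetical and are not taken from the article.

```python
import heapq
import math
from collections import defaultdict


def map_partition(points, partition, k):
    """Map step: for every point, emit its k smallest distances
    to the points held in one partition (skipping the point itself)."""
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in partition if j != i]
        yield i, heapq.nsmallest(k, dists)


def reduce_candidates(a, b, k):
    """Reduce step: merge two candidate lists, keeping the k smallest."""
    return heapq.nsmallest(k, a + b)


def knn_weight_scores(points, k, n_partitions):
    """Assumed score: sum of distances to the k nearest neighbors.
    Larger scores suggest more isolated (anomalous) points."""
    indexed = list(enumerate(points))
    # Hypothetical round-robin split standing in for distributed partitions.
    partitions = [indexed[s::n_partitions] for s in range(n_partitions)]
    merged = defaultdict(list)
    for part in partitions:
        for i, candidates in map_partition(points, part, k):
            merged[i] = reduce_candidates(merged[i], candidates, k)
    return [sum(merged[i]) for i in range(len(points))]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    for point, score in zip(data, knn_weight_scores(data, k=2, n_partitions=2)):
        print(point, round(score, 3))
```

Because each partition contributes its own k best candidates and the merge keeps the k smallest overall, the scores do not depend on how the data is split. In this sketch every partition still sees all query points, so it illustrates the merge logic only and says nothing about the performance reported in the article.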