Anomalies detection for big data

The development of the digital age has resulted in a considerable increase in data volumes. These large volumes of data have been called big data since they exceed the processing capacity of conventional database systems. Several sectors consider various opportunities and applications in the detecti...

Bibliographic details
Main authors:	Torres-Domínguez, Omar; Sabater-Fernández, Samuel; Bravo-Ilisatigui, Lisandra; Martin-Rodríguez, Diana; García-Borroto, Milton
Format:	Online
Language:	spa
Published:	Universidad Pedagógica y Tecnológica de Colombia 2019
Subjects:	big data; data mining; detecting anomalies; MapReduce; detección de anomalías; minería de datos
Online access:	https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793

author	Torres-Domínguez, Omar; Sabater-Fernández, Samuel; Bravo-Ilisatigui, Lisandra; Martin-Rodríguez, Diana; García-Borroto, Milton
author_sort	Torres-Domínguez, Omar
collection	OJS
description	The development of the digital age has resulted in a considerable increase in data volumes. These large volumes of data are called big data because they exceed the processing capacity of conventional database systems. Several sectors see opportunities and applications for anomaly detection in big data problems. Data mining techniques can be very useful for this type of analysis because they allow patterns and relationships to be extracted from large amounts of data. Processing and analyzing these data volumes requires tools capable of handling them, such as Apache Spark and Hadoop. These tools do not have specific algorithms for detecting anomalies. The general objective of this work is to develop a new neighborhood-based algorithm for anomaly detection in big data problems. Based on a comparative study, the KNNW algorithm was selected for its results as the basis for a big data variant. The big data algorithm was implemented in Apache Spark using the MapReduce parallel programming paradigm. Different experiments were then performed to analyze the behavior of the algorithm under different configurations. In the experiments, the execution times and the quality of the results were compared between the sequential variant and the big data variant. The big data variant obtained better results, with a significant difference. As a result, the big data variant, KNNW-BigData, can process large volumes of data. Keywords: big data; data mining; detecting anomalies; MapReduce.
format	Online
id	oai:oai.revistas.uptc.edu.co:article-8793
institution	Revista Facultad de Ingeniería
language	spa
publishDate	2019
publisher	Universidad Pedagógica y Tecnológica de Colombia
record_format	ojs
spelling	oai:oai.revistas.uptc.edu.co:article-8793 2021-07-13T02:26:17Z Anomalies detection for big data Detección de anomalías en grandes volúmenes de datos Torres-Domínguez, Omar; Sabater-Fernández, Samuel; Bravo-Ilisatigui, Lisandra; Martin-Rodríguez, Diana; García-Borroto, Milton Universidad Pedagógica y Tecnológica de Colombia 2019-01-10 info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion research application/pdf application/xml https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793 10.19053/01211129.v28.n50.2019.8793 Revista Facultad de Ingeniería; Vol. 28 No. 50 (2019); 62-76 2357-5328 0121-1129 spa https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793/7288 https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793/7504 https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793/7533 N.A. N.A.
title	Anomalies detection for big data
title_alt	Detección de anomalías en grandes volúmenes de datos
topic	big data; data mining; detecting anomalies; MapReduce; detección de anomalías; minería de datos
url	https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793