Anomalies detection for big data

The development of the digital age has resulted in a considerable increase in data volumes. These large volumes of data have been called big data since they exceed the processing capacity of conventional database systems. Several sectors consider various opportunities and applications in the detecti...

Bibliographic Details
Main Authors: Torres-Domínguez, Omar; Sabater-Fernández, Samuel; Bravo-Ilisatigui, Lisandra; Martin-Rodríguez, Diana; García-Borroto, Milton
Format: Online
Language: Spanish (spa)
Published: Universidad Pedagógica y Tecnológica de Colombia 2019
Subjects: big data; data mining; detecting anomalies; MapReduce
Online Access: https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793
author Torres-Domínguez, Omar
Sabater-Fernández, Samuel
Bravo-Ilisatigui, Lisandra
Martin-Rodríguez, Diana
García-Borroto, Milton
collection OJS
description The development of the digital age has resulted in a considerable increase in data volumes. These large volumes of data are called big data because they exceed the processing capacity of conventional database systems. Several sectors see opportunities and applications for anomaly detection in big data problems. Data mining techniques can be very useful for this type of analysis because they allow patterns and relationships to be extracted from large amounts of data. Processing and analyzing these data volumes requires tools capable of handling them, such as Apache Spark and Hadoop. These tools do not have specific algorithms for detecting anomalies. The general objective of this work is to develop a new neighborhood-based algorithm for anomaly detection in big data problems. Based on a comparative study, the KNNW algorithm was selected for its results as the basis for a big data variant. The big data algorithm was implemented in Apache Spark using the MapReduce parallel programming paradigm. Different experiments were then performed to analyze the behavior of the algorithm under different configurations. In the experiments, execution times and the quality of the results were compared between the sequential variant and the big data variant. The big data variant obtained better results, with a significant difference. As a result, the big data variant, KNNW-BigData, can process large volumes of data. Keywords: big data; data mining; detecting anomalies; MapReduce.
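The record does not define KNNW or show the Spark implementation, so the following is only a minimal sketch under stated assumptions. It assumes KNNW refers to the k-nearest-neighbor weight score, where a point's anomaly score is the sum of the distances to its k nearest neighbors, and it imitates the MapReduce pattern in plain Python rather than in Apache Spark. The function names, the list-based "partitions", and the parameter n_parts are all hypothetical and are not taken from the article.

```python
import heapq
import math
from functools import reduce


def partition(points, n_parts):
    """Split indexed points into n_parts chunks (a stand-in for Spark partitions)."""
    indexed = list(enumerate(points))
    return [indexed[i::n_parts] for i in range(n_parts)]


def map_partial_knn(q_idx, query, chunk, k):
    """Map step: the k smallest distances from the query to the points of one chunk."""
    dists = [math.dist(query, p) for idx, p in chunk if idx != q_idx]
    return heapq.nsmallest(k, dists)


def reduce_knn(a, b, k):
    """Reduce step: merge two partial candidate lists and keep the k smallest."""
    return heapq.nsmallest(k, a + b)


def knn_weight_scores(points, k, n_parts=2):
    """Assumed KNNW score: sum of the distances to the k nearest neighbors."""
    chunks = partition(points, n_parts)
    scores = []
    for q_idx, query in enumerate(points):
        # Each chunk proposes its own k best candidates independently.
        partials = [map_partial_knn(q_idx, query, c, k) for c in chunks]
        # Merging the partial lists yields the global k nearest distances.
        nearest = reduce(lambda a, b: reduce_knn(a, b, k), partials)
        scores.append(sum(nearest))
    return scores


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
    for point, score in zip(data, knn_weight_scores(data, k=2)):
        print(point, round(score, 3))
```

In this toy data set the isolated point (10.0, 10.0) receives the largest score. Because every chunk returns its own k best candidates before the merge, the result does not depend on how the data is split, which is the property a partitioned variant relies on.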
format Online
id oai:oai.revistas.uptc.edu.co:article-8793
institution Revista Facultad de Ingeniería
language spa
publishDate 2019
publisher Universidad Pedagógica y Tecnológica de Colombia
record_format ojs
spelling oai:oai.revistas.uptc.edu.co:article-8793
2021-07-13T02:26:17Z
Anomalies detection for big data
Detección de anomalías en grandes volúmenes de datos
Torres-Domínguez, Omar
Sabater-Fernández, Samuel
Bravo-Ilisatigui, Lisandra
Martin-Rodríguez, Diana
García-Borroto, Milton
big data
data mining
detecting anomalies
MapReduce
Universidad Pedagógica y Tecnológica de Colombia
2019-01-10
info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
research
application/pdf
application/xml
https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793
10.19053/01211129.v28.n50.2019.8793
Revista Facultad de Ingeniería; Vol. 28 No. 50 (2019); 62-76
2357-5328
0121-1129
spa
https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793/7288
https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793/7504
https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793/7533
title Anomalies detection for big data
title_alt Detección de anomalías en grandes volúmenes de datos
topic big data
data mining
detecting anomalies
MapReduce
url https://revistas.uptc.edu.co/index.php/ingenieria/article/view/8793