Measuring Representativeness Using Covering Array Principles

Representativeness is an important data quality characteristic in data science processes; a data sample is said to be representative when it reflects a larger group as accurately as possible. Having low representativeness indices in the data can lead to the generation of biased models. Hence, this s...


Bibliographic details
Main authors:	Castro-Romero, Alexander; Cobos-Lozada, Carlos-Alberto
Format:	Online
Language:	eng
Published:	Universidad Pedagógica y Tecnológica de Colombia 2023
Subjects:	classification algorithms; coverage arrays; data quality; data sets; data representativeness
Online access:	https://revistas.uptc.edu.co/index.php/ingenieria/article/view/15314

_version_	1801706102461562880
author	Castro-Romero, Alexander; Cobos-Lozada, Carlos-Alberto
author_sort	Castro-Romero, Alexander
collection	OJS
description	Representativeness is an important data quality characteristic in data science processes; a data sample is said to be representative when it reflects a larger group as accurately as possible. Low representativeness indices in the data can lead to biased models. This study therefore presents the elements of a new model for measuring representativeness using a testing element of covering arrays (a mathematical object) called the "P Matrix". To test the model, an experiment was proposed in which a data set is divided into training and test subsets using two sampling strategies, Random and Stratified, and the resulting representativeness values are compared. If the data division is adequate, the two sampling strategies should yield similar representativeness indices. The model was implemented in a software prototype using Python (for data processing) and Vue (for data visualization); this version of the model can only analyze binary data sets (for now). To test the model, the "Wines" dataset (UC Irvine Machine Learning Repository) was adapted. The conclusion is that both sampling strategies produce similar representativeness results for this dataset. Although this result is predictable, it is clear that adequate representativeness of the data is important when generating the training and test subsets. As future work, we plan to extend the model to categorical data and explore more complex datasets.
format	Online
id	oai:oai.revistas.uptc.edu.co:article-15314
institution	Revista Facultad de Ingeniería
language	eng
publishDate	2023
publisher	Universidad Pedagógica y Tecnológica de Colombia
record_format	ojs
spelling	oai:oai.revistas.uptc.edu.co:article-15314 2024-01-17T01:28:02Z Universidad Pedagógica y Tecnológica de Colombia 2023-09-30 info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion application/pdf text/xml https://revistas.uptc.edu.co/index.php/ingenieria/article/view/15314 Revista Facultad de Ingeniería; Vol. 32 No. 65 (2023): July-September 2023 (Continuous Publication); e15314 2357-5328 0121-1129 eng https://revistas.uptc.edu.co/index.php/ingenieria/article/view/15314/13578 https://revistas.uptc.edu.co/index.php/ingenieria/article/view/15314/13816 Copyright (c) 2023 Alexander Castro-Romero, Carlos-Alberto Cobos-Lozada http://creativecommons.org/licenses/by/4.0
title	Measuring Representativeness Using Covering Array Principles
title_alt	Medición de la representatividad utilizando principios de la matriz de cobertura
topic	classification algorithms; coverage arrays; data quality; data sets; data representativeness
url	https://revistas.uptc.edu.co/index.php/ingenieria/article/view/15314
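
The experiment summarized in the description field (split a binary data set with Random and Stratified sampling, then compare representativeness) can be illustrated with a small Python sketch. This is not the authors' "P Matrix" model, which the record does not specify. As a stand-in, it assumes a simple covering-array-style measure: the fraction of pairwise value combinations in the full data set that also appear in a subset. All function names are hypothetical, and the data is synthetic rather than the "Wines" dataset.

```python
import random
from itertools import combinations


def tuples_covered(rows, strength=2):
    """Return the set of (column indices, values) combinations present in rows."""
    covered = set()
    for row in rows:
        for cols in combinations(range(len(row)), strength):
            covered.add((cols, tuple(row[c] for c in cols)))
    return covered


def coverage_ratio(sample, population, strength=2):
    """Fraction of the population's value combinations that also occur in sample."""
    target = tuples_covered(population, strength)
    if not target:
        return 1.0
    return len(tuples_covered(sample, strength) & target) / len(target)


def random_split(rows, test_fraction, rng):
    """Shuffle a copy of rows and cut it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def stratified_split(rows, labels, test_fraction, rng):
    """Split each label group separately so both parts keep the label mix."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    train, test = [], []
    for label in sorted(groups):
        part_train, part_test = random_split(groups[label], test_fraction, rng)
        train.extend(part_train)
        test.extend(part_test)
    return train, test


# Synthetic binary data (an assumption for illustration): 200 rows, 5 features.
# The first feature doubles as the class label used for stratification.
rng = random.Random(0)
data = [[rng.randint(0, 1) for _ in range(5)] for _ in range(200)]
labels = [row[0] for row in data]

splits = {
    "random": random_split(data, 0.3, rng),
    "stratified": stratified_split(data, labels, 0.3, rng),
}
for name, (train, test) in splits.items():
    # Print the pairwise coverage of each subset relative to the full data set.
    print(name, coverage_ratio(train, data), coverage_ratio(test, data))
```

Under this stand-in measure, similar coverage ratios for the two strategies would correspond to the "similar representativeness indices" outcome that the description reports; a large gap would suggest that one split misses value combinations present in the full data.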