Data Quality Assessment
The Data Quality Assessment stage of the Data Analytics Life Cycle evaluates the accuracy, consistency, and reliability of data before it is used for analysis. This phase identifies and resolves issues that would otherwise undermine the integrity of analytical outputs. Data quality problems typically arise from differences between source systems, human input errors, or structural inconsistencies in how data is stored and represented.
Mismatched Data Types
Data collected from multiple sources may vary in type, creating compatibility issues during integration.
• Numerical values could be stored as text in one system, while another uses a dedicated numeric type.
• Dates, monetary values, or categorical fields might follow different formats across platforms.
• In cases involving different database architectures, variations in schema design, field constraints, and indexing approaches can further complicate merging processes.
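As a rough illustration of reconciling mismatched types before integration, the sketch below coerces numbers stored as text and dates in several layouts into single Python types. The helper names (to_decimal, to_date), the list of date layouts, and the treatment of commas as thousands separators are assumptions made for this example, not part of any particular source system.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation

# Hypothetical date layouts assumed to be used by the source systems.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def to_decimal(value):
    """Coerce a number stored as text (or as a number) to Decimal.

    Returns None when the value cannot be parsed.
    Assumption: commas are thousands separators, not decimal marks.
    """
    if value is None:
        return None
    text = str(value).strip().replace(",", "")
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


def to_date(value):
    """Coerce a date given as text in any known layout to datetime.date.

    Returns None when no layout in DATE_FORMATS matches.
    """
    if value is None:
        return None
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next layout
    return None
```

Returning None for unparseable values, rather than raising, keeps the bad entries visible so they can be handled later as missing data.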
Mixed Data Values and Descriptors
When datasets originate from diverse systems, they may use inconsistent descriptors for the same feature.
• Gender, for example, may be recorded as "man" in one dataset, "male" in another, or even "M" in a third.
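• Numeric fields can also differ in measurement units (for example, weight recorded in kilograms in one source and in pounds in another), so a merged column ends up holding mixed values.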
• These inconsistencies require standardization to ensure analytical accuracy.
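One possible way to standardize descriptors and units is a lookup table per feature, sketched below. The mapping contents, the "unknown" fallback, and the function names are hypothetical choices for this example; a real project would derive them from the actual source systems.

```python
# Hypothetical mapping from raw labels to one canonical label per category.
GENDER_LABELS = {
    "m": "male", "man": "male", "male": "male",
    "f": "female", "woman": "female", "female": "female",
}

# Hypothetical conversion factors from each source unit to kilograms.
TO_KILOGRAMS = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def standardize_gender(raw):
    """Map a raw descriptor such as "M" or "man" to a canonical label.

    Unrecognized or empty labels become "unknown" so they can be reviewed.
    """
    if raw is None:
        return "unknown"
    return GENDER_LABELS.get(str(raw).strip().lower(), "unknown")


def to_kilograms(value, unit):
    """Convert a weight to kilograms; raises KeyError for an unrecognized unit."""
    factor = TO_KILOGRAMS[unit.strip().lower()]
    return value * factor
```

Unrecognized units raise an error here instead of passing through silently, since a wrong unit would corrupt every downstream calculation.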
Data Outliers and Abnormal Values
Outliers are data points that deviate significantly from the expected range.
• They can result from legitimate anomalies, such as rare events, or from recording errors.
• Abnormal values, such as negative dimensions for a physical product, must be identified and then either corrected or removed, depending on the nature of the analysis.
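The two checks above can be sketched as follows. The interquartile-range rule with a multiplier of 1.5 is one common convention for flagging outliers, chosen here as an assumption; the function names are hypothetical. The code only flags values, because deciding whether to correct or remove them depends on the analysis.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Assumption: the IQR rule is used; k=1.5 is a conventional default.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def negative_dimensions(values):
    """Return physically impossible measurements (negative lengths)."""
    return [v for v in values if v < 0]
```

For example, iqr_outliers([10, 12, 11, 13, 12, 11, 10, 95]) flags only the value 95, which could be either a rare event or a recording error.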
Human Error
Manual processes often lead to incomplete or incorrect entries.
• Typographical mistakes, skipped mandatory fields, or misinterpretation of data input requirements contribute to inaccuracies.
• Automated validation rules, training for data entry personnel, and user-friendly interface designs help reduce such errors.
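A minimal sketch of automated validation rules at the point of entry is shown below. The field names (customer_id, email, age), the simplified email pattern, and the 0 to 120 age range are illustrative assumptions, not requirements from any specific system.

```python
import re

# Hypothetical mandatory fields for a data-entry form.
REQUIRED_FIELDS = ("customer_id", "email", "age")

# Deliberately simple pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_entry(record):
    """Return a list of human-readable problems found in one record."""
    # Normalize every value to stripped text so blanks are treated as missing.
    cleaned = {
        key: "" if value is None else str(value).strip()
        for key, value in record.items()
    }
    errors = []

    # Rule 1: mandatory fields must not be skipped.
    for field in REQUIRED_FIELDS:
        if not cleaned.get(field):
            errors.append(f"{field}: required field is missing")

    # Rule 2: the email, if present, must look like an address.
    email = cleaned.get("email", "")
    if email and not EMAIL_PATTERN.match(email):
        errors.append("email: not a valid address format")

    # Rule 3: the age, if present, must be a plausible whole number.
    age = cleaned.get("age", "")
    if age:
        try:
            age_value = int(age)
        except ValueError:
            errors.append("age: must be a whole number")
        else:
            if not 0 <= age_value <= 120:
                errors.append("age: outside the plausible range 0-120")

    return errors
```

An empty list means the record passed every rule; returning all problems at once lets an entry form show them together instead of one at a time.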
Missing Data
Data gaps can occur through missing values, blank spaces, or unanswered questions in surveys and records.
• Missing data reduces the statistical power of a model and can introduce bias.
• Strategies such as imputation, removal, or flagging help mitigate the impact of incomplete datasets.
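The three strategies named above can be sketched on a single numeric column as follows. This example assumes missing entries are represented as None (blank strings would need converting first) and uses the median as the imputed value, which is only one of several reasonable choices; the function names are hypothetical.

```python
import statistics


def flag_missing(values):
    """Flagging: return a parallel list marking which entries are missing."""
    return [v is None for v in values]


def drop_missing(values):
    """Removal: keep only the observed entries."""
    return [v for v in values if v is not None]


def impute_median(values):
    """Imputation: replace each missing entry with the median of the observed ones."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]
```

Keeping the flags alongside imputed values lets later analysis distinguish observed data from filled-in data.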
In essence, this stage safeguards the integrity of the analytical process by ensuring that the input data is coherent, trustworthy, and fit for purpose, and by resolving discrepancies that could otherwise lead to flawed insights.