The term “Data Cleaning” refers to the process of cleaning up, correcting and normalising data that is incorrectly written, corrupted, incomplete, duplicated or otherwise erroneous within a reference set or database.
This process is particularly useful because, very often, data becomes unusable due to user input errors or incompleteness.
Data Cleaning is closely related to Data Activation, as it can be considered the starting point for effective data-driven marketing activities.
In fact, problems with the correctness and completeness of collected data are quite common today, especially when integrating data and information gathered from multiple channels and sources.
However, a number of techniques make it possible to automatically detect the “noise” present in the data in order to normalise it.
These techniques can be grouped into two macro-workstreams:
- the schema-level approach, which attempts to create a correspondence between different file or database structures by exploiting their structural similarities (e.g., merging two tables, as in the schema-alignment sketch at the end of this section);
- the instance-level approach, which uses similarity metrics to assess how close individual records are to one another, in order to create subgroups whose items all refer to the same entity (imagine that, for Mr. Francesco Rossi, we also have incorrect entries such as “Franci Rossi” or “Francghesco Rossi”); a sketch of this idea follows the list.
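As a purely illustrative sketch of the instance-level approach, the Python snippet below uses the standard-library difflib to group near-duplicate name strings that plausibly refer to the same person; the sample records and the 0.8 similarity threshold are assumptions chosen for the example, not a prescribed configuration.

```python
from difflib import SequenceMatcher

# Illustrative records: several variants that likely refer to the same person
records = ["Francesco Rossi", "Franci Rossi", "Francghesco Rossi", "Maria Bianchi"]

def similarity(a, b):
    """Return a similarity score between 0 and 1 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Greedy clustering: put each record into the first group whose representative
# is similar enough; otherwise start a new group. The threshold is an assumption.
THRESHOLD = 0.8
groups = []
for record in records:
    for group in groups:
        if similarity(record, group[0]) >= THRESHOLD:
            group.append(record)
            break
    else:
        groups.append([record])

print(groups)
# Expected grouping: the three "Rossi" variants together, "Maria Bianchi" alone
```

In practice, dedicated tools rely on more robust metrics (edit distance, phonetic encodings, token-based comparison), but the grouping logic follows the same principle.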
Data cleaning software often uses a combination of the two approaches and normalises data as thoroughly as possible, so that it can also be used for effective marketing automation strategies.
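To illustrate the schema-level side mentioned above, here is a minimal, hypothetical sketch in which two tables from different sources are aligned by mapping their column names onto a common schema before being merged; the column names and the use of pandas are assumptions made for the example.

```python
import pandas as pd

# Two illustrative sources that describe the same kind of entity
# with different column names (assumed data)
crm = pd.DataFrame({
    "full_name": ["Francesco Rossi"],
    "e_mail": ["f.rossi@example.com"],
})
newsletter = pd.DataFrame({
    "name": ["Maria Bianchi"],
    "email": ["m.bianchi@example.com"],
})

# Schema-level step: map heterogeneous column names onto a shared schema
crm = crm.rename(columns={"full_name": "name", "e_mail": "email"})

# Merge the aligned sources into a single reference table; an instance-level
# pass (like the one sketched above) would then deduplicate the rows
unified = pd.concat([crm, newsletter], ignore_index=True)
print(unified)
```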