Saturday, November 5, 2016

Is reading a newspaper “data mining”?

Data mining is hyped. As a result, everything gets called data mining. I suppose some people would even call reading a newspaper to find interesting information “data mining”.


However, there is one problem: not everything IS data mining.

To clear up this mess a bit, in what follows I list and explain several activities that are sometimes (mistakenly) called “data mining”.

Data extraction

“the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing” (wikipedia)

“Data extraction software can enable agencies to collect data on the race, gender, and ethnicity for the person(s) owning the majority of rights, equity, or interest in a business.” (Mozenda)

My definition is simple: you get the data from somewhere with some data extraction program. What you do with that data afterwards is not relevant.
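To make that definition concrete, here is a minimal sketch in Python. The source text, the field names and the regular expression are all invented for illustration; the point is only that the program pulls records out of a poorly structured source and does nothing else with them.

```python
import re

# Hypothetical, poorly structured source text (invented for illustration).
RAW_TEXT = """
Order 1001 shipped to Alice on 2016-10-03, total 25.50 EUR
order 1002 / Bob / 2016-10-04 / total 8.00 EUR
Note: customer Carol called about order 1003 (2016-10-07), total 112.90 EUR
"""

# One pattern that pulls the order number, date and amount out of a line.
ORDER_PATTERN = re.compile(
    r"order\s+(?P<order_id>\d+).*?"
    r"(?P<date>\d{4}-\d{2}-\d{2}).*?"
    r"total\s+(?P<total>\d+\.\d{2})",
    re.IGNORECASE,
)


def extract_orders(text):
    """Return one dict per order found in the text; no analysis is done."""
    records = []
    for line in text.splitlines():
        match = ORDER_PATTERN.search(line)
        if match:
            records.append({
                "order_id": int(match.group("order_id")),
                "date": match.group("date"),
                "total": float(match.group("total")),
            })
    return records


orders = extract_orders(RAW_TEXT)
for order in orders:
    print(order)
```

Nothing is summarized, interpreted or discovered here. The output is just the same data in a more usable shape, which is why I would not call this data mining.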


Reporting

Reporting is making a report: “A Report is a piece of information describing, or an account of certain events given or presented to someone”. (wikipedia)

“Reporting is just a genre of writing, alongside essays and stories, and bloggers most certainly fall into that genre. Imho, when they talk about reporting on a show like Frontline, they mean the process a reporter goes through.” (

This seems a bit more complicated than data extraction. I would say: “extracting, from whatever sources of data or information, those pieces of information that are sufficiently important, and structuring and presenting them so they can be communicated to your audience, customers, boss or whatever other party”.

My definition: reporting is not showing raw data, but giving a communicable description of it. This can be in the form of tables, charts, structured drawings, or simply words.
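As a rough illustration of the difference with extraction, the sketch below turns raw records into a small text table. The records, the region names and the `build_report` helper are hypothetical; only the idea of condensing raw data into something presentable comes from the definition above.

```python
# Hypothetical raw records, e.g. the output of a data extraction step.
SALES = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 60.0},
    {"region": "East", "amount": 40.0},
]


def build_report(records):
    """Condense raw records into a small, readable text table."""
    totals = {}
    for record in records:
        region = record["region"]
        totals[region] = totals.get(region, 0.0) + record["amount"]

    lines = ["Sales per region", "-" * 20]
    # Largest region first, so the reader sees what matters most.
    for region, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
        lines.append(f"{region:<10}{total:>10.2f}")
    lines.append("-" * 20)
    lines.append(f"{'Total':<10}{sum(totals.values()):>10.2f}")
    return "\n".join(lines)


report = build_report(SALES)
print(report)
```

The reader gets totals and an ordering instead of four raw rows. That is a description of the data, but it still contains no interpretation and no newly discovered pattern.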


Statistics

“statistics is … a distinct mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.” (wikipedia)

“methods to collect, analyze and interpret data” (Nebraska university)

“collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting and then drawing conclusions” (Akila)

This is a very broad definition, and it obviously has a lot to do with data.

For me, apart from “data”, the words that are most important here are “science”, “methods” and “interpretation”. Statistics is not just extracting data or reporting; here we have to do better.

Hence my definition: we use mathematical methods to extract the right data, to interpret the data, to draw conclusions based on mathematics, and to present these results and conclusions.
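A small sketch of what “conclusions based on mathematics” can look like: instead of only reporting an average, we also state how uncertain that average is. The sample values are invented, and the interval uses a normal approximation, which is an assumption made to keep the example short (for a sample this small a t-distribution would normally be preferred).

```python
import math
import statistics

# Hypothetical sample: delivery times in days (invented numbers).
DELIVERY_DAYS = [2.0, 3.0, 4.0, 3.0, 5.0, 2.0, 4.0, 3.0, 3.0, 4.0]


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))
    # Two-sided critical value of the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_error, mean + z * std_error


mean, low, high = mean_confidence_interval(DELIVERY_DAYS)
print(f"mean = {mean:.2f} days, approx. 95% interval = [{low:.2f}, {high:.2f}]")
```

The interval is the interpretation step: it says how far the true average could plausibly be from the number we computed, which a plain report would not tell you.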

Data mining

This is the most difficult one, and the most misunderstood.

Some definitions:

“the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.” (wikipedia)

“the process of analyzing data from different perspectives and summarizing it into useful information” (UCLA Anderson)

“Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data items.” (

“Data mining is the discovery of hidden knowledge, unexpected patterns and new rules in large databases.” (E.Thomas)

The most important words or expressions here are: “extracting patterns”, “analyzing data”, “uncover relationships”, “discovery of knowledge”.

So my definition is: searching in data collections (databases, the internet) for information that was not put there deliberately, but nevertheless can be derived.
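A toy example of information that nobody put there deliberately: which items tend to be bought together. The baskets below are invented, and counting frequent pairs is only one very simple pattern search chosen for illustration, not a full data mining method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (invented for illustration).
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "butter", "beer"},
    {"milk", "chips"},
]


def frequent_pairs(baskets, min_support=0.5):
    """Return item pairs that occur together in at least min_support of the baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort first so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    threshold = min_support * len(baskets)
    return {pair: count for pair, count in pair_counts.items() if count >= threshold}


patterns = frequent_pairs(BASKETS)
print(patterns)
```

No single basket says “bread goes with butter”. That relationship only appears when the program searches across all baskets, which is the kind of derived information the definition is about.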


