What is the difference between data mining, statistics, machine learning and AI?
Would it be accurate to say that they are 4 fields attempting to solve very similar problems but with different approaches? What exactly do they have in common and where do they differ? If there is some kind of hierarchy between them, what would it be?
Similar questions have been asked previously, but I still don't get it.
There is considerable overlap among these, but some distinctions can be made. Of necessity, I will have to over-simplify some things or give short-shrift to others, but I will do my best to give some sense of these areas.
Firstly, Artificial Intelligence is fairly distinct from the rest. AI is the study of how to create intelligent agents. In practice, it is how to program a computer to behave and perform a task as an intelligent agent (say, a person) would. This does not have to involve learning or induction at all, it can just be a way to 'build a better mousetrap'. For example, AI applications have included programs to monitor and control ongoing processes (e.g., increase aspect A if it seems too low). Notice that AI can include darn-near anything that a machine does, so long as it doesn't do it 'stupidly'.
In practice, however, most tasks that require intelligence require an ability to induce new knowledge from experiences. Thus, a large area within AI is machine learning. A computer program is said to learn some task from experience if its performance at the task improves with experience, according to some performance measure. Machine learning involves the study of algorithms that can extract information automatically (i.e., without on-line human guidance). It is certainly the case that some of these procedures include ideas derived directly from, or inspired by, classical statistics, but they don't have to be. Similarly to AI, machine learning is very broad and can include almost everything, so long as there is some inductive component to it. An example of a machine learning algorithm might be a Kalman filter.
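To make the Kalman filter example concrete, here is a minimal one-dimensional sketch (the function name, noise parameters, and data are illustrative, not from any particular library): it repeatedly refines an estimate from noisy measurements, improving with each observation in exactly the "learn from experience" sense described above.

```python
def kalman_1d(measurements, process_var=1e-5, meas_var=0.1):
    """Minimal 1-D Kalman filter tracking a (roughly constant) hidden value."""
    estimate, error = 0.0, 1.0        # initial state estimate and its variance
    history = []
    for z in measurements:
        error += process_var           # predict step: uncertainty grows slightly
        gain = error / (error + meas_var)   # Kalman gain: trust in the new data
        estimate += gain * (z - estimate)   # update step: blend in measurement z
        error *= (1 - gain)            # uncertainty shrinks after the update
        history.append(estimate)
    return history

# noisy measurements of a true value near 1.0; the estimate converges toward it
estimates = kalman_1d([0.9, 1.1, 1.0, 0.95, 1.05])
```

The point is the inductive component: each new measurement improves the estimate according to a performance measure (estimation error), which is what qualifies it as learning.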
Data mining is an area that has taken much of its inspiration and techniques from machine learning (and some, also, from statistics), but is put to different ends. Data mining is carried out by a person, in a specific situation, on a particular data set, with a goal in mind. Typically, this person wants to leverage the power of the various pattern recognition techniques that have been developed in machine learning. Quite often, the data set is massive, complicated, and/or may have special problems (such as there are more variables than observations). Usually, the goal is either to discover / generate some preliminary insights in an area where there really was little knowledge beforehand, or to be able to predict future observations accurately. Moreover, data mining procedures could be either 'unsupervised' (we don't know the answer--discovery) or 'supervised' (we know the answer--prediction). Note that the goal is generally not to develop a more sophisticated understanding of the underlying data generating process. Common data mining techniques would include cluster analyses, classification and regression trees, and neural networks.
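As an illustration of the cluster-analysis side (an 'unsupervised'/discovery technique from the list above), here is a bare-bones k-means sketch in Python; the function name and sample data are hypothetical, and a real analysis would use a library implementation. It alternates between assigning points to the nearest centroid and recomputing centroids.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means on 2-D points: alternate assignment and centroid update."""
    random.seed(seed)
    centroids = random.sample(points, k)   # start from k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to the nearest centroid (squared distance)
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for c, pts in enumerate(clusters):
            if pts:   # move each centroid to the mean of its cluster
                centroids[c] = (sum(p[0] for p in pts) / len(pts),
                                sum(p[1] for p in pts) / len(pts))
    return centroids

# two obvious groups; the algorithm discovers them without any labels
data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers = sorted(kmeans(data, 2))
```

Note there is no "right answer" supplied anywhere: the structure is discovered from the data, which is the unsupervised/discovery flavor of data mining described above.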
I suppose I needn't say much to explain what statistics is on this site, but perhaps I can say a few things. Classical statistics (here I mean both frequentist and Bayesian) is a sub-topic within mathematics. I think of it as largely the intersection of what we know about probability and what we know about optimization. Although mathematical statistics can be studied as simply a Platonic object of inquiry, it is mostly understood as more practical and applied in character than other, more rarefied areas of mathematics. As such (and notably in contrast to data mining above), it is mostly employed towards better understanding some particular data generating process. Thus, it usually starts with a formally specified model, and from this are derived procedures to accurately extract that model from noisy instances (i.e., estimation--by optimizing some loss function) and to be able to distinguish it from other possibilities (i.e., inferences based on known properties of sampling distributions). The prototypical statistical technique is regression.
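The prototypical regression example can be written out directly: a minimal closed-form least-squares fit of y = a + b*x (names and data are illustrative), showing the "estimation by optimizing a loss function" step described above.

```python
def ols_fit(xs, ys):
    """Simple linear regression y = a + b*x via the closed-form OLS solution,
    which minimizes the squared-error loss."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope: sample covariance over sample variance of x
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx   # intercept from the means
    return a, b

# data generated roughly as y = 2*x with a little noise
a, b = ols_fit([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1])
```

Unlike the data mining setting, the emphasis here is on recovering the parameters of an assumed data-generating model, not merely on predicting new points.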
I agree with most of the post, but I would say AI most of the time does not try to create intelligent agents (what is intelligence, anyway?), but rational agents. By rational it is meant "optimal given the available knowledge about the world". Although, admittedly, the ultimate goal is something like a general problem solver.
Sorry, I still don't quite get the difference between data mining and machine learning. From what I see, data mining = machine learning's unsupervised learning. Isn't unsupervised machine learning about discovering new insights?
An anonymous user suggested this blogpost for a table breaking down the differences between data mining and machine learning on a parameter basis.
`Common data mining techniques would include cluster analyses, classification and regression trees, and neural networks.` Is it safe to say that a neural network **is an example of a machine learning tool** used in data mining, in comparison to a cluster analysis which **is an algorithm not designed for machine learning** used for data mining?
In reality it's all pretty fuzzy, @TomGranot-Scalosub. I would say neural networks are definitely ML, & certainly cluster analysis & CART are studied by ML researchers. I try to make the ideas somewhat clearer & distinct, but there isn't really a bright line between these categories.
Many of the other answers have covered the main points, but you asked for a hierarchy if one exists, and the way I see it, although they are each disciplines in their own right, there is a hierarchy that no one seems to have mentioned yet, since each builds upon the previous one.
Statistics is just about the numbers, and quantifying the data. There are many tools for finding relevant properties of the data but this is pretty close to pure mathematics.
Data Mining is about using Statistics as well as other programming methods to find patterns hidden in the data so that you can explain some phenomenon. Data Mining builds intuition about what is really happening in some data, and is still a little more toward math than programming, but uses both.
Machine Learning uses Data Mining techniques and other learning algorithms to build models of what is happening behind some data so that it can predict future outcomes. Math is the basis for many of the algorithms, but this is more towards programming.
Artificial Intelligence uses models built by Machine Learning and other ways to reason about the world and give rise to intelligent behavior whether this is playing a game or driving a robot/car. Artificial Intelligence has some goal to achieve by predicting how actions will affect the model of the world and chooses the actions that will best achieve that goal. Very programming based.
- Statistics quantifies numbers
- Data Mining explains patterns
- Machine Learning predicts with models
- Artificial Intelligence behaves and reasons
Now this being said, there will be some AI problems which fall only into AI and similarly for the other fields but most of the interesting problems today (self driving cars for example) could easily and correctly be called all of these. Hope this clears up the relationship between them you asked about.
Have you ever used WEKA or RapidMiner? For instance, EM is within data mining and it applies a model. Apart from that, check out the definition given by mariana soffer and compare it with your answer. It is a couple of years since I read Bishop and Russell/Norvig, but as far as I remember the definition by mariana soffer is more suitable. BTW, data mining is ("only") the major step prior to knowledge discovery. Data mining is only grabbing for data (and subsequently for information) when using an algorithm with adequate parameters. Data mining cannot explain patterns.
No, @mnemonic, this definition of AI is much more in line with Russell and Norvig than mariana's, which is quite dated.
I think the description of statistics is poor; quantifying numbers is the statistics that a national department of statistics reports, but this is not the same as the statistical science that creates models for the data, estimates their parameters, and makes inferences. Also, the relationship between data mining and machine learning is upside down; data mining uses machine learning techniques, not the other way around. See the answer by Ken van Haren as well.
- Statistics is concerned with probabilistic models, specifically inference on these models using data.
- Machine Learning is concerned with predicting a particular outcome given some data. Almost any reasonable machine learning method can be formulated as a formal probabilistic model, so in this sense machine learning is very much the same as statistics, but it differs in that it generally doesn't care about parameter estimates (just prediction) and it focuses on computational efficiency and large datasets.
- Data Mining is (as I understand it) applied machine learning. It focuses more on the practical aspects of deploying machine learning algorithms on large datasets. It is very much similar to machine learning.
- Artificial Intelligence is anything that is concerned with (some arbitrary definition of) intelligence in computers. So, it includes a lot of things.
In general, probabilistic models (and thus statistics) have proven to be the most effective way to formally structure knowledge and understanding in a machine, to such an extent that all three of the others (AI, ML and DM) are today mostly subfields of statistics. Not the first discipline to become a shadow arm of statistics... (Economics, psychology, bioinformatics, etc.)
@Ken - It would be inaccurate to describe economics, psychology, or AI as shadow arms of statistics, even if statistics is used heavily within each to analyze many of the problems these fields are interested in. You would not want to suggest medicine is a shadow arm of statistics even if most medical conclusions rely heavily on data analysis.
@Ken - This is a great response but you could describe more fully what the other things that AI consists of. For example, historically AI has also included large amounts of the analysis of non-probabilistic models (e.g. production systems, cellular automata etc., e.g. see Newell & Simon 1972). Of course all such models are limiting cases of some probabilistic model, but they were not analyzed in such a vein until much later.
Data mining goes beyond machine learning in that it also involves how the data is stored and indexed to make the algorithms much faster. It can be characterized as taking methods mostly from AI, ML, and statistics and combining them with efficient and clever data management and data layout techniques. When it doesn't involve data management, you can often just call it "machine learning". There are some tasks, however, in particular "unsupervised" ones, where there is no "learning" involved but also no data management; these are still called "data mining" (clustering, outlier detection).
We can say that they are all related, but they are all different things, although they have some things in common, such as the use of clustering methods in both statistics and data mining.
Let me try to briefly define each:
Statistics is a very old discipline, mainly based on classical mathematical methods, which can be used for some of the same purposes that data mining sometimes serves: classifying and grouping things.
Data mining consists of building models in order to detect the patterns that allow us to classify or predict situations given an amount of facts or factors.
Artificial intelligence (check Marvin Minsky) is the discipline that tries to emulate how the brain works with programming methods, for example by building a program that plays chess.
Machine learning is the task of building knowledge and storing it in some form in the computer; that form can be of mathematical models, algorithms, etc... Anything that can help detect patterns.
No, most of modern AI does not follow that early "emulate the brain" approach. It focuses on creating "rational agents" which act in an environment to maximize utility, and is more closely related to machine learning. See Russell and Norvig's book.
I'm most familiar with the machine-learning - data mining axis - so I'll concentrate on that:
Machine learning tends to be interested in inference in non-standard situations, for instance non-i.i.d. data, active learning, semi-supervised learning, learning with structured data (for instance strings or graphs). ML also tends to be interested in theoretical bounds on what is learnable, which often forms the basis for the algorithms used (e.g. the support vector machine). ML tends to be of a Bayesian nature.
Data mining is interested in finding patterns in data that you don't already know about. I'm not sure that is significantly different from exploratory data analysis in statistics, whereas in machine learning there is generally a more well-defined problem to solve.
ML tends to be more interested in small datasets where over-fitting is the problem and data mining tends to be interested in large-scale datasets where the problem is dealing with the quantities of data.
Statistics and machine learning provide many of the basic tools used by data miners.
Here is my take at it. Let's start with the two very broad categories:
- anything that even just pretends to be smart is artificial intelligence (including ML and DM).
- anything that summarizes data is statistics, although you usually only apply this to methods that pay attention to the validity of the results (often used in ML and DM)
Both ML and DM are usually both AI and statistics, as they usually involve basic methods from both. Here are some of the differences:
- in machine learning, you have a well-defined objective (usually prediction)
- in data mining, you essentially have the objective "something I did not know before"
Additionally, data mining usually involves much more data management, i.e. how to organize the data in efficient index structures and databases.
Unfortunately, they are not that easy to separate. For example, there is "unsupervised learning", which is often more closely related to DM than to ML, as it cannot optimize towards the goal. On the other hand, DM methods are hard to evaluate (how do you rate something you do not know?) and often evaluated on the same tasks as machine learning, by leaving out some information. This, however, will usually make them appear to work worse than machine learning methods that can optimize towards the actual evaluation goal.
Furthermore, they are often used in combinations. For example, a data mining method (say, clustering, or unsupervised outlier detection) is used to preprocess the data, then the machine learning method is applied on the preprocessed data to train better classifiers.
Machine learning is usually much easier to evaluate: there is a goal such as score or class prediction, and you can compute precision and recall. In data mining, most evaluation is done by leaving out some information (such as class labels) and then testing whether your method discovered the same structure. This is naive in the sense that you assume the class labels encode the structure of the data completely; you actually punish data mining algorithms that discover something new in your data. Another way of evaluating it, indirectly, is how the discovered structure improves the performance of the actual ML algorithm (e.g. when partitioning data or removing outliers). Still, this evaluation is based on reproducing existing results, which is not really the data mining objective...
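For reference, the precision and recall mentioned above take only a few lines to compute; this is a generic sketch with illustrative names, counting true positives, false positives, and false negatives over paired label lists.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN) for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 3 true positives in the data; the model finds 2 of them plus 1 false alarm
p, r = precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

This kind of direct scoring is exactly what is missing in the data mining setting, where there is no held-out "correct answer" to compare against.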
I'd add some observations to what's been said...
AI is a very broad term for anything that has to do with machines doing reasoning-like or sentient-appearing activities, ranging from planning a task or cooperating with other entities, to learning to operate limbs to walk. A pithy definition is that AI is anything computer-related that we don't know how to do well yet. (Once we know how to do it well, it generally gets its own name and is no longer "AI".)
It's my impression, contrary to Wikipedia, that Pattern Recognition and Machine Learning are the same field, but the former is practiced by computer-science folks while the latter is practiced by statisticians and engineers. (Many technical fields are discovered over and over by different subgroups, who often bring their own lingo and mindset to the table.)
Data Mining, in my mind anyhow, takes Machine Learning/Pattern Recognition (the techniques that work with the data) and wraps them in database, infrastructure, and data validation/cleaning techniques.
Machine learning and pattern recognition are not the same thing; machine learning is also interested in things like regression and causal inference. Pattern recognition is only one of the problems of interest in machine learning. Most of the machine learning people I know are in computer science departments.
@Dikran Agree, but ML and PR are often aliased and presented under similar topics of data analysis. My preferred book is indeed *Pattern Recognition and Machine Learning*, by Christopher M. Bishop. Here is a review by John Maindonald in the JSS, http://j.mp/etg3w1.
I also feel that the word "machine learning" is much more common than "pattern recognition" in the CS world.
Sadly, the difference between these areas lies largely in where they're taught: statistics is based in maths departments; AI and machine learning are based in computer science departments; and data mining is more applied (used by business or marketing departments, developed by software companies).
Firstly, AI (although it could mean any intelligent system) has traditionally meant logic-based approaches (e.g. expert systems) rather than statistical estimation.

Statistics, based in maths departments, has had a very good theoretical understanding, together with strong applied experience in the experimental sciences, where there is a clear scientific model and statistics is needed to deal with the limited experimental data available. The focus has often been on squeezing the maximum information from very small data sets. Furthermore, there is a bias towards mathematical proofs: you will not get published unless you can prove things about your approach. This has tended to mean that statistics has lagged in the use of computers to automate analysis. Again, the lack of programming knowledge has prevented statisticians from working on large-scale problems where computational issues become important (consider GPUs and distributed systems such as Hadoop). I believe that areas such as bioinformatics have now moved statistics more in this direction. Finally, I would say that statisticians are a more sceptical bunch: they do not claim that you discover knowledge with statistics; rather, a scientist comes up with a hypothesis, and the statistician's job is to check that the hypothesis is supported by the data.

Machine learning is taught in CS departments, which unfortunately do not teach the appropriate mathematics: multivariable calculus, probability, statistics, and optimisation are not commonplace. Instead, one has vague 'glamorous' concepts such as learning from examples rather than boring statistical estimation (cf. e.g. The Elements of Statistical Learning, page 30). This tends to mean that there is very little theoretical understanding and an explosion of algorithms, as researchers can always find some dataset on which their algorithm performs better. So there are huge phases of hype as ML researchers chase the next big thing: neural networks, deep learning, etc.
Unfortunately there is a lot more money in CS departments (think google, Microsoft, together with the more marketable 'learning') so the more sceptical statisticians are ignored. Finally, there is an empiricist bent: basically there is an underlying belief that if you throw enough data at the algorithm it will 'learn' the correct predictions. Whilst I am biased against ML, there is a fundamental insight in ML which statisticians have ignored: that computers can revolutionise the application of statistics.
There are two ways:

- automating the application of standard tests and models, e.g. running a battery of models (linear regression, random forests, etc.), trying different combinations of inputs, parameter settings, and so on. This hasn't really happened, though I suspect that competitors on Kaggle develop their own automation techniques.
- applying standard statistical models to huge data: think of, e.g., Google Translate or recommender systems (no one is claiming that people translate or recommend like that, but it's a useful tool). The underlying statistical models are straightforward, but there are enormous computational issues in applying these methods to billions of data points.
Data mining is the culmination of this philosophy: developing automated ways of extracting knowledge from data. However, it has a more practical approach: essentially it is applied to behavioural data, where there is no overarching scientific theory (marketing, fraud detection, spam, etc.) and the aim is to automate the analysis of large volumes of data. No doubt a team of statisticians could produce better analyses given enough time, but it is more cost-effective to use a computer. Furthermore, as D. Hand explains, it is the analysis of secondary data, i.e. data that is logged anyway rather than data that has been explicitly collected to answer a scientific question in a solid experimental design (D. Hand, "Data Mining: Statistics and More?").
So I would summarise that traditional AI is logic based rather than statistical, machine learning is statistics without theory and statistics is 'statistics without computers', and data mining is the development of automated tools for statistical analysis with minimal user intervention.
Data mining is about discovering hidden patterns or unknown knowledge, which can be used for decision making by people.
Machine learning is about learning a model to classify new objects.
Is machine learning *only* about classification? Can't machine learning be used to serve other goals?
@gung Absolutely not. Reinforcement learning is, IMHO, the most characterizing sub-field of ML and I wouldn't say that it's based on classification but on achieving goals.
With all due respect to former answers, I believe that a huge part of the answer is still missing and it is in front of our eyes. Let me try to have a go at it:
In data mining, just like the name sounds, you mine data. Mining means extracting knowledge from it, but in general it usually means you are calculating some measures or statistics on the data, such as the Jaccard index.
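As a quick sketch of the Jaccard index mentioned above (function name and sample data are illustrative): it is the size of the intersection of two sets divided by the size of their union, a typical "measure computed on the data" in the mining sense.

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B| for two sets (1.0 for two empty sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# two shopping baskets sharing 2 of 4 distinct items
sim = jaccard({"milk", "bread", "eggs"}, {"milk", "eggs", "beer"})
```

Computing such a similarity is pure extraction: nothing here generalizes to unseen data, which is exactly the contrast with learning drawn in the next paragraph.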
In Machine Learning, you do not only mine or extract, you learn. Learning theory has its roots in statistics but takes it further than that. In learning, you have a task that learns based on finite sample data and can generalize to unseen data. Your Facebook image recognition can still tag you in your photo even though every image has a new background, new textures, and so forth. You cannot use any data mining approach on this problem.
In Artificial Intelligence, you definitely learn from data as in Machine Learning, but then you need to perform other, higher-level tasks as well, like planning. You need to find policies based on what you have learned, and take it further than that. You cannot play a game of chess or Go just by learning good moves; you need to find policies, like what is a good initial position that will lead to more chances of winning, even though every game is a new game and no two games will follow the same sequence of moves.