Data mining is now a mature science and, as with all things that have been practised for long enough, it can be broken down into a methodology. A project goes through distinct stages, so much so that a standard methodology has been put together based on how a project is expected to play out. This is the Cross Industry Standard Process for Data Mining (CRISP-DM).
CRISP-DM breaks the project down into distinct stages, so you know what to expect as the project progresses. Some stages are prescriptive and require the application of information technology, whereas others require creativity and business understanding.
Looking at the process holistically gives the project structure and makes the analysis systematic. With the work broken down into tasks, you will be better able to organise your future IT or business requirements.
The stages in the CRISP-DM methodology are meant to form a cycle, but the methodology makes allowances for going back to previous stages. This is implied in the outer circle. A deployed solution could spawn a new data mining project, or a better understanding of the problem could send the project back to revise the output of a prior stage.
CRISP-DM defines six distinct stages that normally follow one another in a natural progression, even though it allows a return to earlier stages.
Starting out with a good understanding of the business problem is the most crucial factor in the success of the project. Business projects seldom come packaged as clear, unambiguous data mining problems. High-level knowledge of the fundamental problems will help organise and direct the work done from this point. The outcome of this phase should be as clear a definition of the problem as possible, along with as much information as possible about the business environment in which the problem sits. The better the knowledge of the problem to be solved and the environment within which to solve it, the more creative the analyst can be in formulating the solution.
The data provides the raw material to solve the business problem. Starting out, the data will be limited and is often a shadow of the perfect dataset that is required to solve the problem. It is important for the analyst to understand the strengths and limitations of the data and the compromises made because of what data is available. The cost of getting the data also needs to be considered. Does the data even exist? If so, is it feasible to get it? If not, what are the alternatives? Data can be free, come at a cost, or be impossible to acquire.
Depending on the scope of the project, multiple datasets could be required. At this point you can see what it is possible to obtain and how that will affect the overall goal of the project.
The output of the data understanding phase is a well-documented understanding of the data available and how it supports the business understanding and the goals of the project.
The raw, unprocessed and unstructured data should be put into a format that is usable. This is a technical part of the process, and any investment in proper data management techniques made here could pay dividends later. Proper preparation and documentation at this stage are helpful. As the project becomes more complex, it becomes harder to keep track of sources and the transformations that were carried out, so good documentation goes a long way.
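One way to keep track of the transformations is to record each preparation step as it is applied. The following is a minimal sketch in plain Python, not a prescribed CRISP-DM technique; the helper `apply_step`, the field names and the sample rows are all hypothetical and exist only for illustration.

```python
def apply_step(rows, description, transform, log):
    """Apply one preparation step and record what it did.

    `transform` takes the full list of rows and returns a new list.
    `log` is a plain list that collects one entry per step.
    """
    result = transform(rows)
    log.append({
        "step": description,
        "rows_in": len(rows),
        "rows_out": len(result),
    })
    return result


# Hypothetical raw data: text fields with stray spaces and a missing value.
raw = [
    {"name": " Alice ", "spend": "120.50"},
    {"name": "Bob", "spend": ""},
    {"name": "Carol", "spend": "80"},
]

log = []

# Step 1: remove rows that have no spend value.
rows = apply_step(
    raw,
    "drop rows with missing spend",
    lambda rs: [r for r in rs if r["spend"] != ""],
    log,
)

# Step 2: tidy the name and convert spend from text to a number.
rows = apply_step(
    rows,
    "strip names, convert spend to float",
    lambda rs: [{"name": r["name"].strip(), "spend": float(r["spend"])} for r in rs],
    log,
)

for entry in log:
    print(entry)
```

The log gives a simple audit trail of what was done to the data and how many rows each step kept, which is the kind of documentation that becomes valuable as the number of sources and steps grows.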
The prepared dataset, along with the data mining algorithms, makes up the model. The dataset used by the algorithms should follow proper design principles: each column is an attribute of the subject of the dataset and holds a single data type, and each row is an instance of that subject. Mining algorithms want well-structured data and give bad results if the data is not organised properly.
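The design principles above can be checked mechanically before the data reaches an algorithm. The sketch below assumes the dataset is held as a list of dictionaries, one per row; the function `find_structure_problems` and the customer fields are hypothetical names chosen for the example.

```python
def find_structure_problems(rows):
    """Return a list of readable problems found in a list of row dicts.

    Row 0 is treated as the reference for column names and types.
    """
    problems = []
    if not rows:
        return problems

    expected_columns = set(rows[0])
    expected_types = {name: type(value) for name, value in rows[0].items()}

    for index, row in enumerate(rows[1:], start=1):
        # Every row should describe the same attributes.
        if set(row) != expected_columns:
            problems.append(f"row {index}: columns differ from row 0")
            continue
        # Every column should hold a single data type.
        for name, value in row.items():
            if type(value) is not expected_types[name]:
                problems.append(
                    f"row {index}: column '{name}' is {type(value).__name__}, "
                    f"expected {expected_types[name].__name__}"
                )
    return problems


# Hypothetical customer data with two deliberate faults.
customers = [
    {"customer_id": 1, "age": 34, "region": "north"},
    {"customer_id": 2, "age": "41", "region": "south"},  # age stored as text
    {"customer_id": 3, "region": "east"},                # age column missing
]

problems = find_structure_problems(customers)
for problem in problems:
    print(problem)
```

A check like this only covers structure. Whether the values themselves are sensible still depends on the understanding of the data built up in the earlier stages.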
There are different types of data mining algorithms. The dataset will be created in the format required by the algorithm, and the choice of algorithm follows from the business problem set out in the business understanding phase.
The evaluation phase assesses the results of the model to gain confidence that it is valid and reliable before moving on. There are several ways to gauge the success and reliability of a model, and the right one is determined by the problem domain. Models that are considered successful in one domain may not be in others. For instance, a model that is 99% accurate but outputs too many false positives won't be feasible if processing false positives is expensive.
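The false positive example can be made concrete with a little arithmetic. The sketch below uses invented counts and an invented cost per false positive, purely to show how a model can be 99% accurate while most of its positive predictions are wrong; the function `summarise` is a hypothetical helper.

```python
def summarise(tp, fp, fn, tn, cost_per_false_positive):
    """Summarise a binary classifier from its confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives.
    """
    total = tp + fp + fn + tn
    predicted_positive = tp + fp
    return {
        # Share of all cases the model got right.
        "accuracy": (tp + tn) / total,
        # Share of positive predictions that were actually positive.
        "precision": tp / predicted_positive if predicted_positive else 0.0,
        # What it costs the business to process the false alarms.
        "false_positive_cost": fp * cost_per_false_positive,
    }


# Invented figures: 10,000 cases, of which only 50 are truly positive.
summary = summarise(tp=40, fp=90, fn=10, tn=9860, cost_per_false_positive=50)

print(f"accuracy:  {summary['accuracy']:.2%}")
print(f"precision: {summary['precision']:.2%}")
print(f"cost of false positives: {summary['false_positive_cost']}")
```

With these figures the model is right 99% of the time, yet fewer than a third of the cases it flags are genuine, and each of the 90 false alarms carries a processing cost. Whether that is acceptable is a question for the problem domain, not for the accuracy figure alone.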
Deployment means integrating the findings of the model into the business, which involves implementing processes within the enterprise to support the model. This is the highest-stakes part of the cycle because until this point everything was theoretical.
The deployment phase could then lead to another cycle of the project or, with additional business understanding, to a refinement of the process to get a better outcome. Large data mining projects could call for specific teams to deal with different stages in the process. An appreciation of how a stage fits into the larger picture will lead to more relevant output.