Step 1: Obtaining a dataset
The first step is to find a domain-specific dataset for your data mining project. The
dataset should be complex enough that its patterns cannot be found with simple
calculations alone (that is, not without preprocessing and data mining approaches). There is no limit
on the size of the dataset, but a good size for mining is typically around 100k-100M. It could
have thousands or millions of rows (or columns, or sometimes both). A good dataset
typically contains various types of data (numerical, nominal, ordinal, Boolean, etc.) with some
errors (missing or dirty values, etc.). The dataset could be text data, tabular
data, georeferenced data, etc. It could be your own data or could be
obtained from public sites. In short, the data could be:
– Your own data (obtained or created by yourself, but don’t spend too much time on it);
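To check whether a candidate dataset has the mix of data types and errors described above, a quick profile can help. The following is a minimal sketch, not a required method: the sample data, the column names, the list of missing-value markers, and the helper names `infer_type` and `profile` are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample; in practice, read your own CSV file instead.
SAMPLE_CSV = """age,suburb,member,spend
23,Toowong,yes,41.50
,Stones Corner,no,12.00
35,Toowong,,n/a
"""

# Assumed markers for missing values; adjust to your dataset.
MISSING = {"", "na", "n/a", "null", "?"}


def infer_type(values):
    """Guess a coarse type for a column from its non-missing values."""
    if not values:
        return "unknown"
    if all(v.lower() in {"yes", "no", "true", "false"} for v in values):
        return "boolean"
    try:
        for v in values:
            float(v)
        return "numerical"
    except ValueError:
        return "nominal"


def profile(csv_text):
    """Return the row count and, per column, missing count and guessed type."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for col in rows[0].keys():
        raw = [(r[col] or "").strip() for r in rows]
        present = [v for v in raw if v.lower() not in MISSING]
        report[col] = {
            "missing": len(raw) - len(present),
            "type": infer_type(present),
        }
    return len(rows), report


n_rows, report = profile(SAMPLE_CSV)
print(n_rows, "rows")
for col, info in report.items():
    print(col, info)
```

A profile like this only guesses types from the values it sees; ordinal columns, for example, still have to be identified by reading the data description.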
Step 2: Setting up a business scenario
Once you have obtained a dataset, set up a realistic business scenario that states what kinds of patterns you
want to find in the dataset.
For instance, if you have chosen a set of crime incidents in a town, you might be interested
in finding which crimes occur together, which particular crime occurs frequently
near pubs after midnight, which crimes tend to follow a certain crime,
etc. Your business scenario as a police officer might be to find crime hot spots,
sequential crimes occurring one after another, or periodic crime occurrences.
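As one possible illustration of the crime hot spot scenario, incidents can be counted per grid cell and the busiest cells reported. This is only a sketch of one simple approach; the incident coordinates, the 1 km cell size, and the function name `hot_spots` are assumptions, and a real project may call for a proper spatial clustering technique.

```python
from collections import Counter

# Hypothetical incidents: (x_km, y_km, crime_type).
incidents = [
    (0.2, 0.3, "assault"),
    (0.4, 0.1, "theft"),
    (0.7, 0.9, "assault"),
    (2.1, 3.5, "theft"),
    (0.9, 0.8, "vandalism"),
]


def hot_spots(points, cell_size=1.0, top=3):
    """Count incidents per square grid cell and return the busiest cells."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y, _ in points
    )
    return counts.most_common(top)


for cell, count in hot_spots(incidents):
    print(cell, count)
```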
As another example, if you have chosen a retail sales dataset, then you, as a sales
manager, would be interested in finding associative patterns between age group and certain
items (e.g. young students tend to buy jeans), or correlative patterns between
geographical location and certain items; for instance, more hats are sold in Stones Corner while
more shoes are sold in Toowong. Your business scenario in this case might be that of a sales
manager who would like to find associative or correlative patterns that could be used to boost
sales or to optimise stock management. Depending on the business scenario
(goal), you can focus on certain types of patterns to achieve your business goal.
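An associative pattern such as "students tend to buy jeans" is commonly judged by its support, confidence, and lift. The sketch below computes these three measures for a single rule; the transactions, the label format (`age:student`), and the function names are hypothetical, and the code assumes that both sides of the rule occur at least once in the data.

```python
# Hypothetical transactions: each is a set of customer attributes and items.
transactions = [
    {"age:student", "jeans"},
    {"age:student", "jeans", "hat"},
    {"age:student", "shoes"},
    {"age:adult", "hat"},
    {"age:adult", "jeans"},
]


def support(itemset, data):
    """Fraction of transactions that contain every element of itemset."""
    return sum(1 for t in data if itemset <= t) / len(data)


def rule_metrics(antecedent, consequent, data):
    """Return (support, confidence, lift) for antecedent -> consequent.

    Assumes the antecedent and the consequent each occur at least once.
    """
    s_both = support(antecedent | consequent, data)
    confidence = s_both / support(antecedent, data)
    lift = confidence / support(consequent, data)
    return s_both, confidence, lift


s, c, l = rule_metrics({"age:student"}, {"jeans"}, transactions)
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

A lift above 1 suggests the two sides occur together more often than they would if independent; on a toy sample this small, the numbers only illustrate the calculation.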
Step 3: Planning data mining
Further explore (browse) the dataset to decide what patterns you would like to focus on, what
data mining algorithms (techniques) you have to use, and what preprocessing is required to use
those data mining algorithms. This is a core part of data mining: you have to use the right
data mining technique to find the right pattern, and you have to apply proper preprocessing
before you use that technique.
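As an example of preprocessing chosen to suit a technique, association mining over age groups first needs missing ages handled and numeric ages turned into categories. The sketch below uses median imputation and fixed cut-offs; the sample values, the cut-offs (25 and 60), and the function names are assumptions for illustration, and other strategies may suit your data better.

```python
from statistics import median

# Hypothetical ages with missing values recorded as None.
ages = [23, None, 35, 17, None, 61]


def impute_median(values):
    """Replace missing values (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    m = median(known)
    return [m if v is None else v for v in values]


def to_age_group(age):
    """Discretise a numeric age into a nominal group (assumed cut-offs)."""
    if age < 25:
        return "young"
    if age < 60:
        return "adult"
    return "senior"


clean_ages = impute_median(ages)
groups = [to_age_group(a) for a in clean_ages]
print(clean_ages)
print(groups)
```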
Note: The order of the above three steps can be changed. For example,
you may find an interesting business scenario first and then find a suitable dataset that fits
the analysis of the chosen scenario. Or you could decide on the pattern first (for instance,
crime hot spots), and then find a dataset with which to set up a reasonable business scenario.