Thursday, 9 May 2013

My suggested strategy for building a “good” predictive model


Ian Morton worked in credit risk for big banks for a number of years. He learnt about how to (and how not to) build “good” statistical models in the form of scorecards using the SAS Language.

George E. P. Box: “Essentially, all models are wrong, but some are useful.”

Initial investigations

1.     Look at the data dictionary to see which data are available.
2.     What is the outcome? Is it yes / no? Is it continuous?
3.     Decide upon the model required (logistic regression for a yes / no outcome).
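As a rough illustration of steps 2 and 3, the Python sketch below inspects the values of an outcome column and suggests a model family. The function name `suggest_model` and the rule of thumb it encodes are assumptions made for this example; in practice the decision also depends on the business question.

```python
def suggest_model(outcome_values):
    """Suggest a model family from the observed outcome values.

    Illustrative rule of thumb only: two distinct values suggest a
    yes / no outcome, all-numeric values suggest a continuous outcome.
    """
    distinct = {v for v in outcome_values if v is not None}
    if len(distinct) < 2:
        return "unclear: the outcome does not vary"
    if len(distinct) == 2:
        return "logistic regression"
    if all(isinstance(v, (int, float)) for v in distinct):
        return "linear regression (or another model for continuous outcomes)"
    return "unclear: inspect the outcome coding"


# Hypothetical outcome columns
print(suggest_model(["yes", "no", "yes", "no"]))
print(suggest_model([1250.0, 310.5, 99.9, 780.0]))
```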

Getting the data ready

4.     Run cross tabulations on categorical variables to understand the coding and volumes.
5.     Run summary statistics to understand the distribution of the continuous variables.
6.     Ask questions about data quality. Where the data for a variable is poor:
  • remove the variable from any potential models? or,
  • think about imputation? or,
  • obtain accurate data?
7.     Convert continuous variables into categorical variables (for example, by grouping values into bands).
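Steps 4, 5 and 7 can be sketched in a few lines of Python using only the standard library. The records, the variable names and the age cut points below are all made up for illustration; real bands would be chosen from the data and agreed with the business.

```python
from collections import Counter
import statistics

# Hypothetical records: (employment_status, age, defaulted)
records = [
    ("employed", 34, "no"),
    ("employed", 51, "no"),
    ("self-employed", 45, "yes"),
    ("unemployed", 23, "yes"),
    ("employed", 29, "yes"),
    ("unemployed", 62, "no"),
]

# Step 4: cross tabulation of a categorical variable against the outcome.
crosstab = Counter((status, outcome) for status, _, outcome in records)

# Step 5: summary statistics for a continuous variable.
ages = [age for _, age, _ in records]
summary = {
    "n": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
}


# Step 7: convert the continuous variable into bands.
def age_band(age, cut_points=(30, 50)):
    """Return a band label for an age; the cut points are assumptions."""
    lower = None
    for cut in cut_points:
        if age < cut:
            return f"<{cut}" if lower is None else f"{lower}-{cut - 1}"
        lower = cut
    return f"{lower}+"


band_counts = Counter(age_band(age) for age in ages)

for key, count in sorted(crosstab.items()):
    print(key, count)
print(summary)
print(dict(band_counts))
```

Looking at the band counts alongside the cross tabulation helps spot bands with too few records to model reliably.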

Modelling

8.     Check for multicollinearity / correlation between variables (variance inflation factors or correlation tests).
9.     Check for interactions.
10.  Choose the variable selection approach for the logistic regression (e.g. forward, backward, stepwise).
11.  Choose the baseline attribute for each categorical variable.
12.  Create a random variable. It mustn’t step into the model; something is wrong if it does.
13.  Split the dataset into two parts (ratio 80%/20%)
  • using random selection without replacement
  • the larger sample is the build dataset
  • the smaller sample is the test dataset

14.  Put all variables from the build dataset (including interactions and the random variable) into the model and run it
  • Check the odds ratios: do they make sense? and,
  • Check the coefficients: do they make sense?
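The random variable from step 12 and the 80%/20% split from step 13 might look like the following Python sketch. The helper names, the column name `random_noise` and the seed are assumptions for this example. The last helper relies on the standard result that, in a logistic regression, a variable's odds ratio is the exponential of its coefficient, which is why the two checks in step 14 go together.

```python
import math
import random


def add_random_variable(rows, rng):
    """Step 12: attach a pure-noise column that should never enter the model."""
    return [dict(row, random_noise=rng.random()) for row in rows]


def split_build_test(rows, build_fraction=0.8, rng=None):
    """Step 13: random split without replacement into build and test sets."""
    if rng is None:
        rng = random.Random()
    # Sampling every row exactly once gives a shuffled copy (no replacement).
    shuffled = rng.sample(rows, k=len(rows))
    n_build = round(len(rows) * build_fraction)
    return shuffled[:n_build], shuffled[n_build:]


def odds_ratio(coefficient):
    """Step 14: odds ratio implied by a logistic regression coefficient."""
    return math.exp(coefficient)


rng = random.Random(2013)  # fixed seed so the split is reproducible
rows = [{"id": i} for i in range(100)]  # hypothetical dataset
rows = add_random_variable(rows, rng)
build, test = split_build_test(rows, rng=rng)

print(len(build), len(test))
print(odds_ratio(0.0))  # a coefficient of zero means no effect on the odds
```

Fixing the seed is a design choice: it lets the same build and test datasets be recreated on each pass through the process.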

Check the model

15.  Do diagnostic checks and plots of the fit (e.g. Somers' D, residuals, etc.).
16.  Put all variables from the test dataset (including interactions and the random variable) into a new model and run it
  • Are the coefficients the same as in the model fitted on the build dataset? and,
  • Are the odds ratios the same as in the model fitted on the build dataset?
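Two of these checks can be sketched in plain Python. For a yes / no outcome, Somers' D is the number of concordant (event, non-event) pairs minus the number of discordant pairs, divided by the total number of such pairs. The coefficient comparison is one possible way to make "the same" concrete; its name, its tolerance of 0.25 and all the numbers in the example are assumptions, not recommended values.

```python
def somers_d(outcomes, scores):
    """Somers' D for a binary outcome (1 = event, 0 = non-event)."""
    events = [s for y, s in zip(outcomes, scores) if y == 1]
    non_events = [s for y, s in zip(outcomes, scores) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = discordant = 0
    for event_score in events:
        for non_event_score in non_events:
            if event_score > non_event_score:
                concordant += 1
            elif event_score < non_event_score:
                discordant += 1
            # tied pairs count towards neither total
    return (concordant - discordant) / (len(events) * len(non_events))


def unstable_coefficients(build_coefs, test_coefs, tolerance=0.25):
    """Names of shared variables whose coefficient changes sign or moves
    by more than `tolerance` between the build and test models."""
    flagged = []
    for name in sorted(build_coefs.keys() & test_coefs.keys()):
        b, t = build_coefs[name], test_coefs[name]
        if b * t < 0 or abs(b - t) > tolerance:
            flagged.append(name)
    return flagged


# Hypothetical outcomes and predicted scores
outcomes = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.5, 0.2]
print(somers_d(outcomes, scores))

# Hypothetical coefficients from the build and test models
build_coefs = {"age_30_49": -0.40, "unemployed": 0.95}
test_coefs = {"age_30_49": -0.35, "unemployed": 0.30}
print(unstable_coefficients(build_coefs, test_coefs))
```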

Start again

17.  Go back to the start: fine-tune the grouping of the data, and put variables in or take variables out.

The bottom line: It’s an iterative process and it might take some time to get a model that’s acceptable in terms of fit, and acceptable to business users. Always, always, always, at each stage, consult with the business to check on ethical issues, the applicability of the model, and whether the model can be implemented.

1 comment:

  1. I think 6 should come first. You don't want to make tables and graphs with bad data or data that's coded wrong; then you just have to do it all over again.