Ian Morton worked in credit risk for big banks for a
number of years. He learnt about how to (and how not to) build “good” statistical
models in the form of scorecards using the SAS Language.
George E. P. Box: “Essentially, all models are wrong, but some are useful.”
Initial investigations
1. Look at the data dictionary to see which data is available
2. What is the outcome? Is it yes / no? Is it continuous?
3. Decide upon the model required (logistic for a yes / no outcome)
Getting the data ready
4. Cross tabulations on categorical variables to understand the coding and volumes
5. Summary statistics to understand the distribution of the continuous variables
6. Ask questions about data quality:
- remove these variables from any potential models? or,
- think about imputation? or,
- obtain accurate data?
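Steps 4 to 6 can be sketched in a few lines. The post works in SAS; the Python below is only an illustrative stand-in using the standard library. The records, the field names (`status`, `default`, `income`) and the helper names `cross_tab` and `summarise` are all made up for the example.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical applicant records; None marks a missing value.
records = [
    {"status": "owner", "default": 0, "income": 42000},
    {"status": "renter", "default": 1, "income": 18000},
    {"status": "owner", "default": 0, "income": 55000},
    {"status": "renter", "default": 0, "income": None},
    {"status": "OWNER", "default": 1, "income": 31000},  # inconsistent coding
    {"status": "renter", "default": 1, "income": 22000},
]

def cross_tab(rows, row_key, col_key):
    """Count rows for each (row_key value, col_key value) pair."""
    return Counter((r[row_key], r[col_key]) for r in rows)

def summarise(rows, key):
    """Summary statistics for a continuous variable, counting missing values."""
    values = [r[key] for r in rows if r[key] is not None]
    return {
        "n": len(values),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }

table = cross_tab(records, "status", "default")
summary = summarise(records, "income")
for (status, default), count in sorted(table.items()):
    print(status, default, count)
print(summary)
```

The cross tabulation exposes the stray `OWNER` code next to `owner`, and the summary reports one missing income. These are the kinds of findings that prompt the data quality questions in step 6.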
Modelling
8. Check for multicollinearity / correlation between variables (variance inflation factors or correlation tests)
9. Check for interactions
10. Choose the type of logistic approach (e.g. forward, backward, stepwise)
11. Choose the baseline attribute for each categorical variable
12. Create a random variable. It mustn’t step into the model; something is wrong if it does
13. Split the dataset into two parts (ratio 80%/20%)
- using random selection without replacement
- the larger sample is the build dataset
- the smaller sample is the test dataset
14. Put all variables from the build dataset (including interactions and the random variable) into the model and run it
- Check odds ratios: do they make sense? and
- Check the coefficients: do they make sense?
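A minimal sketch of steps 12 and 13, written in Python as a stand-in for the SAS the post has in mind. The helper names `add_random_variable` and `split_build_test`, the `random_var` column, the seeds and the toy data are assumptions for illustration only.

```python
import random

def add_random_variable(rows, rng):
    """Attach a pure-noise column; it should never be selected by the model."""
    return [dict(row, random_var=rng.random()) for row in rows]

def split_build_test(rows, build_fraction=0.8, seed=2024):
    """Randomly split rows into build and test sets without replacement."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)  # each row appears exactly once in the shuffled order
    cut = round(len(rows) * build_fraction)
    build = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return build, test

rng = random.Random(1)
data = add_random_variable([{"id": i} for i in range(100)], rng)
build, test = split_build_test(data)
print(len(build), len(test))
```

Shuffling the row indices once and cutting the list guarantees that no row lands in both datasets, which is what selection without replacement requires. Fixing the seed makes the split reproducible when the process is repeated.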
Check the model
15. Do diagnostic checks and plots of the fit (e.g. Somers' D, residuals, etc.)
16. Put all variables from the test dataset (including interactions and the random variable) into a new model and run it
- Are the coefficients close to those of the model fitted on the build dataset? and
- Are the odds ratios close to those of the model fitted on the build dataset?
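Two of the checks above can be illustrated without fitting a model. The sketch below assumes predicted scores and fitted coefficients are already available. It computes Somers' D for a binary outcome as (concordant pairs minus discordant pairs) divided by all event / non-event pairs, and converts coefficients to odds ratios by exponentiating them. The function names, scores and coefficient values are hypothetical.

```python
import math

def somers_d(outcomes, scores):
    """Somers' D for a binary outcome: (concordant - discordant) / pairs."""
    events = [s for y, s in zip(outcomes, scores) if y == 1]
    non_events = [s for y, s in zip(outcomes, scores) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = discordant = 0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1
            elif e < n:
                discordant += 1
            # tied scores count in the denominator only
    return (concordant - discordant) / (len(events) * len(non_events))

def odds_ratios(coefficients):
    """Convert logistic coefficients (log-odds) to odds ratios."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}

# Hypothetical outcomes and predicted scores from a build model.
outcomes = [1, 1, 1, 0, 0, 0]
build_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
d = somers_d(outcomes, build_scores)

# Hypothetical coefficients from the build and test models.
build_or = odds_ratios({"renter": 0.70, "random_var": 0.01})
test_or = odds_ratios({"renter": 0.65, "random_var": -0.02})
for name in build_or:
    print(name, round(build_or[name], 2), round(test_or[name], 2))
print(round(d, 3))
```

A Somers' D near 1 means events are almost always scored above non-events, and a value near 0 means the scores barely discriminate. Odds ratios that stay on the same side of 1 and have similar size in both models suggest the relationship is stable across the build and test datasets.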
Start again
17. Back to the start: fine-tune the grouping of the data, put variables in or take variables out.
The bottom line: It’s an iterative process and it might take some time to get a model
that’s acceptable in terms of fit, and acceptable to business users. Always,
always, always consult with the business at each stage to check on ethical
issues, the applicability of the model, and whether the model can be implemented.
I think 6 should come first. You don't want to make tables and graphs with bad data or data that's coded wrong; then you just have to do it all over again.