Wednesday, 15 May 2013

What could propensity score matching do for you? (with examples from justice, medicine, education and finance)

What is Propensity Score Matching (PSM)?

PSM is a statistical matching technique that attempts to estimate the effect of a treatment, policy or other intervention by accounting for the covariates that predict receiving the treatment. See for example Rosenbaum and Rubin (1983) – the pioneers of PSM.
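As a concrete (and deliberately tiny) illustration of the idea, here is a sketch in Python rather than SAS: units are scored with an assumed logistic model and each treated unit is paired with the nearest unused control. The coefficients, covariates and ids are all invented for the example, not taken from any of the papers cited here.

```python
import math

def propensity(age, smoker, b0=-3.0, b_age=0.05, b_smoker=1.2):
    """P(treated | covariates) under an assumed logistic model.
    The coefficients are made up purely for illustration."""
    z = b0 + b_age * age + b_smoker * smoker
    return 1.0 / (1.0 + math.exp(-z))

def nearest_neighbour_match(treated, controls):
    """Match each treated unit to the unused control with the closest score."""
    available = dict(controls)  # id -> score; without replacement
    matches = {}
    for t_id, t_score in treated.items():
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        matches[t_id] = c_id
        del available[c_id]  # each control used at most once
    return matches

treated = {1: propensity(50, 1), 2: propensity(30, 0)}
controls = {10: propensity(49, 1), 11: propensity(31, 0), 12: propensity(70, 1)}
print(nearest_neighbour_match(treated, controls))  # {1: 10, 2: 11}
```

Each treated unit ends up paired with the control most like it on the score, which is what lets us read the matched controls' outcomes as an estimate of the counterfactual.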

Why use PSM?

It helps to reduce bias due to confounding, and it can be used to estimate the counterfactual outcome.

Many of you will have been to a particular university or school and achieved a particular result. Have you ever wondered what your result would have been if you had attended somewhere else? To determine this you would need to account for the covariates, using information on people like you who studied the same course; you could then estimate this counterfactual outcome using PSM.

What do you need so that you can do PSM?

·         A rich data source (i.e. lots of information that predicts the outcome of interest);
·         An outcome of interest;
·         A “good” predictive model (based on the rich data source and outcome) – see an earlier blog, “My suggested strategy for building a ‘good’ predictive model”;
·         A program (see the appendix for this); and
·         The necessary software platform e.g. the SAS Language.

Some examples of PSM being used (and where it could be used)

1.     In justice, policy makers wanted to look at the success of mediation, and the analysts needed to remove confounding – they used PSM (Ministry of Justice, 2010).

2.     In medicine, an analyst carrying out a case-control study needs an accurate estimate of dose response. Cases are matched to controls on factors such as age, gender and smoking status; these confounding variables need to be controlled to reduce bias – PSM was used for this (Foster, 2003).

3.     In education, the performance of institutions such as universities and colleges is of interest to stakeholders, who require accurate estimates of outcomes such as retention rates. Morton et al (2010) controlled for differences in student background characteristics (the covariates) and performed PSM on a Scottish cohort of students to estimate the counterfactual outcome.

4.      In finance, the consumer magazine Which? reported “High-street banks failing on customer satisfaction”, and I subsequently wrote that they were comparing apples and pears (“Big traditional banks worst on customer satisfaction” or should that be “Comparing apples and pears”?). In particular, I said that “Any particular bank in the league table will offer different products, and have different customers, to any other bank in the list – after all, they need to do this to suit their customers and to gain competitive advantage through niche products. But I believe this product offering, to different customers, has the impact that the age and gender make-up of customers at some banks will be different to the age and gender make-up of customers at other banks.” You can see where this is going, can’t you? They would get a better estimate of customer satisfaction per se after controlling for the make-up of their customers and the different products that they offer. A researcher could use PSM, say by looking at just “the leaders” (First Direct) and “the laggers” (Santander). But practically, I appreciate that the data may not be captured, and even if it were, it would probably have restricted availability (to maintain competitive advantage, etc.).

What issues have I heard about when using PSM?

·         PSM doesn’t consider the multilevel nature of the data. [My answer to this] You would require a considerable amount of qualitative and quantitative data to take account of the multilevel nature, but that could be an avenue for future work!

·         You can get the same propensity scores for different combinations of predictor variables. [My answer to this] Yes, I accept that this could be the case. But I still maintain that it provides a method of determining the counterfactual outcome, and is better than nothing at all.

·         The hot deck procedure (see Penny et al, 2007) is more robust than PSM. [My answer to this] It’s more complicated to apply, and takes up more resources. But see the references below, (and maybe I will write a future blog on this).
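The second of these issues can be seen with a tiny numeric example. The sketch below (in Python, with invented coefficients) shows two different covariate profiles producing exactly the same propensity score, which is why matched pairs need not be alike covariate-by-covariate:

```python
import math

def score(x1, x2, b1=0.5, b2=1.0):
    """Propensity score from an assumed logistic model with made-up
    coefficients; only the linear predictor z matters for the comparison."""
    z = b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-z))

a = score(2, 1)  # 0.5*2 + 1.0*1 = 2.0
b = score(0, 2)  # 0.5*0 + 1.0*2 = 2.0
print(a == b)    # True: same score, different covariate mix
```

Different combinations of predictors can land on the same point of the score scale, so the score summarises, but does not preserve, the covariate profile.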


References

Coca-Perraillon, M. (2007) Local and Global Optimal Propensity Score Matching. SAS Global Forum 2007, Orlando, Florida, 16-19 April 2007.

Foster, E. M. (2003) Propensity Score Matching: An Illustrative Analysis of Dose Response. Medical Care 41(10), 1183-1192.

Ministry of Justice (2010) Evaluating the use of judicial mediation in Employment Tribunals. Ministry of Justice Research Series 7/10.

Morton, I. D. (2009) The Use of Hot Decking and Propensity Score Matching in Comparing Student Outcomes. MSc Dissertation, Edinburgh Napier University.

Morton, I., Penny, K.I., Ashraf, M.Z., & Duffy J.C. (2010). An analysis of student retention rates using propensity score matching, SAES Working Paper Series, Edinburgh Napier University.

Penny, K. I., Ashraf, M. Z., and Duffy, J. C. (2007) The Use of Hot Deck Imputation to Compare Performance of Further Education Colleges. Journal of Computing and Information Technology 15(4), 313-318.

Rosenbaum, P. R. and Rubin, D. B. (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41-56.


Ian Morton has built propensity scoring models for the financial services sector, for a utility company, and for the public sector. He has given a number of presentations on the technique of propensity score matching. For example:

1.     The Royal Statistical Society 2009 Conference.

2.     He has also co-authored a forthcoming peer-reviewed journal article. It’s not published yet (as of May 2013), but for a flavour of its contents you could look at either of the two references above with my name on them.

(This is my personal blog, views are my own and not those of my present or past employers)

Appendix - Here is the complete SAS Program

Note: I have subsequently converted the complete program (as shown below) into a SAS Enterprise Guide project - with discrete program blocks, and parameterized code (that chooses the matching method, etc.). It is this latter project that is described in the slide presentation.

/* here is the Coca-Perraillon (2007) matching macro */
/* see the references to that paper                 */
/* it does nearest neighbour and caliper matching   */

%macro PSMatching(datatreatment=, datacontrol=, method=, numberofcontrols=, caliper=, replacement=);

/* Create copies of the treated units if N > 1 */
data _Treatment0(drop= i);
set &datatreatment;
do i= 1 to &numberofcontrols;
RandomNumber= ranuni(12345);
output;
end;
run;

/* Randomly sort both datasets */
proc sort data= _Treatment0 out= _Treatment(drop= RandomNumber);
by RandomNumber;
run;

data _Control0;
set &datacontrol;
RandomNumber= ranuni(45678);
run;

proc sort data= _Control0 out= _Control(drop= RandomNumber);
by RandomNumber;
run;

data Matched(keep = IdSelectedControl MatchedToTreatID);
length pscoreC 8;
length idC 8;
/* Load the Control dataset into the hash object */
if _N_= 1 then do;
declare hash h(dataset: "_Control", ordered: 'no');
declare hiter iter('h');
h.defineKey('idC');
h.defineData('pscoreC', 'idC');
h.defineDone();
call missing(idC, pscoreC);
end;
/* Open the treatment dataset */
set _Treatment;
%if %upcase(&method) ~= RADIUS %then %do;
retain BestDistance 99;
%end;
/* Iterate over the hash */
rc= iter.first();
if (rc=0) then BestDistance= 99;
do while (rc = 0);
/* Caliper */
%if %upcase(&method) = CALIPER %then %do;
if (pscoreT - &caliper) <= pscoreC <= (pscoreT + &caliper) then do;
ScoreDistance = abs(pscoreT - pscoreC);
if ScoreDistance < BestDistance then do;
BestDistance = ScoreDistance;
IdSelectedControl = idC;
MatchedToTreatID = idT;
end;
end;
%end;
/* NN */
%if %upcase(&method) = NN %then %do;
ScoreDistance = abs(pscoreT - pscoreC);
if ScoreDistance < BestDistance then do;
BestDistance = ScoreDistance;
IdSelectedControl = idC;
MatchedToTreatID = idT;
end;
%end;
%if %upcase(&method) = NN or %upcase(&method) = CALIPER %then %do;
rc = iter.next();
/* When the hash is exhausted, output the best control and remove it */
if (rc ~= 0) and BestDistance ~= 99 then do;
output;
%if %upcase(&replacement) = NO %then %do;
rc1 = h.remove(key: IdSelectedControl);
%end;
end;
%end;
/* Radius: output every control within the caliper */
%if %upcase(&method) = RADIUS %then %do;
if (pscoreT - &caliper) <= pscoreC <= (pscoreT + &caliper) then do;
IdSelectedControl = idC;
MatchedToTreatID = idT;
output;
end;
rc = iter.next();
%end;
end;
run;

/* Delete temporary tables. Comment out for debugging */
proc datasets nolist;
delete _:(gennum=all);
quit;

%mend PSMatching;
/* that’s the end of the matching macro */


/* this part builds the propensity score model */
PROC LOGISTIC DATA=<dataset> DESCENDING;
      class <class variables> / param=ref ref=first;
      model <outcome> = <independent variables>;
      output out=Propen prob=prob;
run;

proc sort data=propen; by <outcome>; run;

/* set up the data for the matching macro in two separate data sets */
data treatment(rename = (prob=pscoreT));
      set propen;
      if <outcome> = "Treatment" then output treatment;
run;

data control(rename = (prob=pscoreC));
      set propen;
      if <outcome> = "Control" then output control;
run;

/* this part does the actual matching                */
/* I have shown CALIPER METHOD of doing the matching */
/* caliper of 0.0001 gets n treatments out of m      */

%PSMatching(datatreatment=treatment, datacontrol=control, method=CALIPER,
numberofcontrols=1, caliper=0.0001, replacement=no);

proc sort data=matched;by idselectedcontrol;run;

/* need to rename to allow merging */
data caliper1(rename=(idselectedcontrol=idC)); set matched; run;

/* merge the original file with the matched file          */
/* (control must also be sorted by idC before the merge)  */
proc sort data=control; by idC; run;

data merged(keep=result result2 idC matchedtotreatid);
      merge control(in=a) caliper1(in=b);
      by idC;
      if a and b;
run;


/* now go and summarise the results                              */
/* this part produces the estimate of the counterfactual outcome */
proc summary data=merged nway missing;
      class matchedtotreatid;
      var result2;
      output out=caliper2(drop=_type_) sum=;
run;

proc summary data=caliper2 nway missing;
      var result2;
      output out=caliper3(drop=_type_) sum=;
run;

data caliper4;
      set caliper3;
      answer=(_freq_-result2)/_freq_;
run;

proc print data=caliper4;
      title "The counterfactual outcome using caliper=0.0001";
run;
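To show the shape of the computation outside SAS, here is a plain-Python sketch of the caliper matching and the final counterfactual summary. It is a sketch of the same logic, not a translation of the macro, and all ids, scores and outcomes below are invented; it assumes result2 is a 0/1 indicator, so the caliper4 step's answer = (_FREQ_ - sum) / _FREQ_ is the proportion of matched controls with result2 = 0.

```python
def caliper_match(treated, controls, caliper, replacement=False):
    """For each treated unit, pick the closest control whose propensity
    score lies within +/- caliper of the treated score; by default
    without replacement, as with replacement=no in the macro."""
    available = dict(controls)
    matches = {}
    for t_id, t_score in treated.items():
        best_id, best_dist = None, float("inf")
        for c_id, c_score in available.items():
            dist = abs(t_score - c_score)
            if dist <= caliper and dist < best_dist:
                best_id, best_dist = c_id, dist
        if best_id is not None:
            matches[t_id] = best_id
            if not replacement:
                del available[best_id]
    return matches

treated = {"T1": 0.31, "T2": 0.72}
controls = {"C1": 0.30, "C2": 0.74, "C3": 0.90}
matches = caliper_match(treated, controls, caliper=0.05)
print(matches)  # {'T1': 'C1', 'T2': 'C2'}

# Counterfactual estimate, as in the caliper4 step: the proportion of
# matched controls with result2 = 0 (outcomes invented for the example).
matched_result2 = {"C1": 0, "C2": 1}
freq = len(matched_result2)
answer = (freq - sum(matched_result2.values())) / freq
print(answer)  # 0.5
```

Note that C3 (score 0.90) never matches: no treated unit falls within the caliper of it, which is exactly how the caliper discards poor-quality matches.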

Thursday, 9 May 2013

My suggested strategy for building a “good” predictive model

Ian Morton worked in credit risk for big banks for a number of years. He learnt how to (and how not to) build “good” statistical models, in the form of scorecards, using the SAS Language.

George E. P. Box: “Essentially, all models are wrong, but some are useful.”

Initial investigations

1.     Look at the data dictionary to see which data are available
2.     What is the outcome? Is it binary (yes/no)? Is it continuous?
3.     Decide upon the model required (logistic regression for a yes/no outcome!)

Getting the data ready

4.     Cross-tabulations on categorical variables to understand the coding and volumes
5.     Summary statistics to understand the distribution of the continuous variables
6.     Ask questions about data quality. For poor-quality variables, should you:
  • remove them from any potential models? or,
  • think about imputation? or,
  • obtain accurate data?
7.     Convert continuous variables into categorical variables
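Step 7 (converting a continuous variable into a categorical one) can be sketched as follows; the variable and cut-points are purely illustrative, and in practice you would choose the bands from the observed distribution and business knowledge:

```python
def age_band(age):
    """Convert a continuous age into a coarse categorical band.
    The cut-points here are invented for illustration only."""
    if age < 25:
        return "under 25"
    elif age < 45:
        return "25-44"
    elif age < 65:
        return "45-64"
    return "65+"

bands = [age_band(a) for a in [19, 30, 50, 70]]
print(bands)  # ['under 25', '25-44', '45-64', '65+']
```

Banding like this makes the variable's effect easier to read in the scorecard, at the cost of throwing away some within-band information.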

Building the model

8.     Check for multicollinearity / correlation between variables (variance inflation factors, or correlation tests)
9.     Check for interactions
10.  Choose type of logistic approach (e.g. forward, backward, stepwise)
11.  Choose the baseline attribute for each categorical variable
12.  Create a random variable – it must not step into the model; something is wrong if it does
13.  Split the dataset into two parts (ratio 80%/20%)
  • using random selection without replacement
  • the larger sample is the build dataset
  • the smaller sample is the test dataset

14.  Put all variables from the build dataset (including interactions and the random variable) into the model and run it
  • Check the odds ratios – do they make sense? and
  • Check the coefficients – do they make sense?
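Step 13, the random 80%/20% split without replacement, can be sketched like this (the seed and ids are arbitrary):

```python
import random

def split_80_20(ids, seed=12345):
    """Random 80/20 split without replacement into a build (training)
    set and a test set; every id lands in exactly one of the two."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

build, holdout = split_80_20(range(100))
print(len(build), len(holdout))  # 80 20
```

Because the split is without replacement, the two sets are disjoint and together cover the whole dataset, which is what makes the test-set check in step 16 an honest one.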

Check the model

15.  Do diagnostic checks and plots of the fit (e.g. Somers’ D, residuals, etc.)
16.  Put all variables from the test dataset (including interactions and the random variable) into a new model and run it
  • Are the coefficients the same as in the model it was built on? and
  • Are the odds ratios the same as in the model it was built on?

Start again

17.  Back to the start, fine tune the grouping of the data, put variables in or take variables out.

The bottom line: It’s an iterative process and it might take some time to get a model that’s acceptable in terms of fit, and acceptable to business users. Always, always, always - at each stage consult with the business to check on ethical issues, applicability of the model, and that the model can be implemented.