Thursday, 8 May 2014

Statistics, Facebook, friends' birthdays and coincidences

I received one of those Facebook emails the other day telling me that three of my friends had birthdays on the same day. Now this might not have been too much of a surprise to me if I had about five hundred friends, but I don’t.

Should I have been surprised given that I only have about fifty friends on Facebook?

This is the sort of question that is all about chance: something that statisticians learn about at school or college and subsequently use in their work. But even seemingly simple questions like this can present a statistician with some difficult analysis. One difficulty is that, to move forward, we have to assume that all birthdays are equally likely, and clearly they are not [see Box 1].

You will often see assumptions stated by statisticians; they help to describe the accuracy of conclusions drawn from statistical analysis (http://en.wikipedia.org/wiki/Statistical_assumption).



Box 1: Assumption that all birthdays are equally likely.

It turns out that there isn’t that much difference in the UK from month to month. Births peak in July (about 70,000 in July 2011) (http://data.un.org/Data.aspx?d=POP&f=tableCode:55) and trough in February (about 61,000 in February 2011).

As an aside, the peak was in August in the US in 2010 (http://www.statisticbrain.com/birth-month-statistics/).



Before trying to answer this question, let’s change the situation completely and consider other coincidences and chance findings we might experience socially. As a family, we were on holiday the other week and we met someone I knew. “Well, fancy seeing you here,” I said.

Should I have been surprised [see Box 2]?

Box 2: Assumption that all coincidences are equally likely.

In this situation I hadn’t said beforehand to my family that I would meet a particular person, at a specific event, on a certain day, at a point in time. It is an imprecise research question. As statisticians we can’t put numbers on coincidences like this, and perhaps I shouldn’t have been at all surprised given the wide net that this question casts.



And now to the answer to my first question, i.e. the Facebook birthdays. This is a more precise research question, and it is one that statisticians can answer. I am specifically talking about friends (not work colleagues, doctors, dentists or whoever else I know), birthdays (not random events), and a specific day (not any unforeseen day).

Usually in statistics we look up a table of critical values (e.g. z, t, χ², etc.) to draw a conclusion about our findings. In this case we use an online calculator instead. This tells me that it takes 88 people before the chance of at least three sharing a birthday passes one half, so I should be surprised if three people shared the same birthday among fewer than 88 friends (88 is my critical value).

And in true statistical fashion we use something called a test statistic to see how my observation in practice compares with the critical value. My test statistic is three people sharing the same birthday among just 50 Facebook friends. Since 50 is less than 88, I was right to be surprised that three people share the same birthday among so few friends.
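
If you would rather check a number like this for yourself, a short simulation will do it. Below is a minimal sketch in SAS (my own illustration, not the online calculator I used): it repeatedly draws 50 birthdays at random, under the equally-likely assumption of Box 1, and counts how often at least three land on the same day. The estimate comes out well below one half; re-run it with 88 friends and it tips just past one half, which is exactly where the critical value comes from.

/* Simulation sketch (my own illustration): estimate the chance that */
/* at least three of 50 friends share a birthday, assuming all 365   */
/* days are equally likely (leap years ignored).                     */
data _null_;
  call streaminit(2014);                  /* fix the seed so the run repeats */
  array count{365} _temporary_;
  hits = 0;
  do rep = 1 to 100000;
    do d = 1 to 365; count{d} = 0; end;   /* clear the day counts */
    triple = 0;
    do friend = 1 to 50;
      day = ceil(365 * rand("uniform"));  /* a random day from 1 to 365 */
      count{day} = count{day} + 1;
      if count{day} >= 3 then triple = 1; /* three (or more) on one day */
    end;
    hits = hits + triple;
  end;
  p = hits / 100000;
  put "Estimated chance of a three-way shared birthday among 50 friends: " p 6.4;
run;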

But then again, I did say earlier that we have to be aware of the assumptions!

And no comments please about my lack of friends.




References

Byron Jones and Robb Muirhead (2012) What a coincidence! It’s not as unlikely as you think. Significance 9 (1), pp. 40-42.

Mario Cortina Borja (2013) The strong birthday problem. Significance 10 (6), pp. 18-20.



Visualisation:
In an earlier blog I discussed a presentation about being a statistician that I gave to a group of schoolchildren at a school in Edinburgh. You can find it here:

I found this interactive graphic from the Office for National Statistics (ONS), which provides a picture of babies’ names in England & Wales. It’s not directly comparable with my presentation, but it is an interesting way of showing the results (http://www.ons.gov.uk/ons/interactive/top-100-baby-names-in-england-and-wales---dvc11/index.html)




If you like what I talk about then:

Follow me on Twitter: https://twitter.com/IDMorton001

Connect with me on LinkedIn: www.linkedin.com/in/idmorton

See my Presentations on SlideShare: http://www.slideshare.net/IDMorton001
Here is an example of a presentation I recently gave entitled “Process Improvement & Design of Experiments – Lessons Learnt from a European Statistics Conference”:


Saturday, 11 January 2014

Primary school children learn about being a Statistician

It’s not every day that you get asked to explain what you do in your day-to-day job to someone outside your work. But I am always up for a challenge, and because I like being a statistician I generally think it shouldn't be too difficult.

Well, those were my long-held thoughts until a local headmaster approached me to give a talk to a group of primary school children. The request came about because the school was due to hold a work focus week, and as part of this exercise several parents were asked if they could give presentations about the content of their work.

I tried to think about how I could describe my job in simple, easily understandable terms. I had previously read about the Up-Goer Five Text Editor, which challenges you to explain hard ideas using only the 'ten hundred' most common words, and I thought I would try to describe my job using only words from that list. This is what I came up with, and it helped me focus on what I should deliver in a presentation to the children:

I work with numbers and I have always worked with numbers. I write things on a computer that lets me get these numbers and allows me to handle them. I speak to people about them, and then I put them out using a paper which groups them all together.  My work can be hard to understand and some do not accept it without a get together before they are taken seriously. Sometimes people don't like the numbers and that causes some problems, they may be against what they say. Other times, maybe when the numbers go down, they are liked by everyone. In the end, I like to think that my work helps to give all people a better life.

So I started with a blank sheet of paper, and for a while I was stumped; then it occurred to me that I should give a talk about the names of children born in Scotland. That way I could engage the children in a discussion about data collection, analysis and interpretation, and finally dissemination. Of course I didn’t use those exact words, but they are the specialist skills that we need to develop as professional statisticians.

Armed with my presentation I headed to the school’s classroom. Forty children aged between 8 and 10 years old were in attendance, accompanied by their teachers. There wasn’t much space available, so some children sat on the floor while others sat at their tables.

As an introduction I told the children why I liked numbers, and then I gave them some idea of the kinds of numbers we can encounter in our everyday working lives. To ensure there was some interaction I asked them if they knew about numbers that include a decimal point, and others that start with £ and % – they were very forthcoming. I then covered over some of my other examples and asked the group if they could think of any that remained. Someone enthusiastically put their hand up, and after my acknowledgement they replied “Yes, euros”.

Given it was work focus week, it was of course appropriate that I went on to tell them how I came into my career as a statistician and about the training I had undertaken. I then discussed my specific example of statistics, i.e. the names of children born in Scotland. Here are two slides that I used to provide an example of these statistics. I didn't just want to give them a presentation and end up with 40 blank faces, so I continued with the interactive approach…



A boy called Ethan put his hand up and said that his name was at the top – he knew several Ethans at schools in Edinburgh – and then Sophie put her hand up. After hearing from quite a few children I revealed the answer. Ethan was at number 5, and I gave them several reasons why the figures might have gone against the expectations of the Ethan in the class, e.g. Edinburgh may not reflect the national figures, and it is now at least 8 years since they were born and times change.



I hadn't originally appreciated how much interest my statistics example would generate, and I have only paraphrased a small part of the discussion here. At the time it made me realise how passionate I am about statistics. It also let me flow into a final discussion about why we need to work, and I gave several examples of this.

In summary, the presentation was received with great enthusiasm, and nearly every one of the children asked at least one question. I have transcribed the comments of one of the children after the event:

“Thank you for coming in I liked all the new facts you told us about. I never knew that people made charts of names and other similar stuff. I enjoyed doing the wee sums you asked us. Your job looks exciting, I might want to do it when I'm older. I thought it was very interesting when you told us about the numbers you used and the charts. It looked complicated. I’d like a job a little like yours because I like doing maths. Thank you for coming in to talk to us and for giving up your time.”

Ian Morton has worked as a statistician for a number of years. In 2013, he entered his presentation into The Greenfield Challenge (a competition run by the European Network of Business and Industrial Statisticians (ENBIS)), and he won an award of book tokens for his contribution.

Note: The views expressed are his own and do not represent the views of any organisation he has worked for in the past or at present.

Wednesday, 15 May 2013

What could propensity score matching do for you? (with examples from justice, medicine, education and finance)

What is Propensity Score Matching (PSM)?

PSM is a statistical matching technique that attempts to estimate the effect of a treatment, policy or other intervention by accounting for the covariates that predict receiving the treatment. See for example Rosenbaum and Rubin (1983) – the pioneers of PSM.
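
Formally, the propensity score is the probability of receiving the treatment given the observed covariates, e(x) = Pr(treated | x). Rosenbaum and Rubin showed that, under their assumptions, matching on this single score is enough to balance the observed covariates between the treated and control groups.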

Why use PSM?

It helps to reduce bias due to confounding. It can be used to estimate the counterfactual outcome.

Many of you will have been to a particular university or school and achieved a particular result. Have you ever wondered what your result would have been had you attended somewhere else? To determine this you would need to account for the covariates, using information on people like you who studied the same course, and then you could estimate this counterfactual outcome using PSM.
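
To put a little more structure on that (my phrasing, not a quote from the references below): PSM estimates the average effect of the treatment on the treated, i.e. the difference between the outcome the treated group actually achieved and the counterfactual outcome it would have achieved without the treatment, with the matched controls standing in for that missing second term.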

What do you need so that you can do PSM?

·         A rich data source (i.e. lots of information that predicts the outcome of interest);
·         An outcome of interest;
·         A “good” predictive model (based on the rich data source and outcome) – see an earlier blog “My suggested strategy for building a “good” predictive model”;
·         A program (see the appendix for this); and
·         The necessary software platform e.g. the SAS Language.

Some examples of PSM being used (and where it could be used)

1.     In justice, policy makers wanted to look at the success of mediation, and the analysts needed to remove confounding – they used PSM (Ministry of Justice, 2010).

2.     In medicine, an analyst carrying out a case-control study needs an accurate estimate of dose response. Cases are matched to controls on factors such as age, gender and smoking status – confounding variables whose effects must be removed to reduce bias – and PSM was used to do this (Foster, 2003).

3.     In education, the performance of institutions such as universities and colleges is of interest to stakeholders, who require accurate estimates of outcomes such as retention rates. Morton et al (2010) controlled for differences in student background characteristics (the covariates) and performed PSM on the Scottish cohort of students to estimate their counterfactual outcome.

4.      In finance, the consumer magazine Which? reported “High-street banks failing on customer satisfaction”, and I subsequently wrote that they were comparing apples and pears (“Big traditional banks worst on customer satisfaction” or should that be “Comparing apples and pears”?). In particular, I said that “Any particular bank in the league table will offer different products, and have different customers, to any other bank in the list – after all, they need to do this to suit their customers and to gain competitive advantage through niche products. But I believe this product offering, to different customers, has the impact that the age and gender make-up of customers at some banks will be different to the age and gender make-up of customers at other banks.” You can see where this is going, can’t you? They would get a better estimate of customer satisfaction per se after controlling for the make-up of their customers and the different products that they offer. A researcher could use PSM, say by looking at just “the leaders” (First Direct) and “the laggers” (Santander). Practically, though, I appreciate that the data may not be captured, and even if it were, it would probably have restricted availability (to maintain competitive advantage, etc.).

What issues have I heard about when using PSM?

·         PSM doesn’t consider the multilevel nature of the data. [My answer to this] You would require a considerable amount of qualitative and quantitative data to take account of the multilevel structure, but that could be an avenue for future work!

·         You can get the same propensity scores for different combinations of predictor variables. [My answer to this] Yes, I accept that this could be the case. But I still maintain that it provides a method of determining the counterfactual outcome, and is better than nothing at all.

·         The hot deck procedure (see Penny et al, 2007) is more robust than PSM. [My answer to this] It’s more complicated to apply and takes up more resources. But see the references below (and maybe I will write a future blog on this).

References

Coca-Perraillon, M. (2007) Local and Global Optimal Propensity Score Matching. SAS Global Forum 2007, Orlando, Florida, April 16th-19th 2007.

Foster, E. M. (2003) Propensity Score Matching: An Illustrative Analysis of Dose Response. Medical Care 41 (10), pp. 1183-1192.

Ministry of Justice (2010) Evaluating the use of judicial mediation in Employment Tribunals. Ministry of Justice Research Series 7/10.

Morton, I. D. (2009) The Use of Hot Decking and Propensity Score Matching in Comparing Student Outcomes. MSc Dissertation, Edinburgh Napier University.

Morton, I., Penny, K. I., Ashraf, M. Z., and Duffy, J. C. (2010) An analysis of student retention rates using propensity score matching. SAES Working Paper Series, Edinburgh Napier University.

Penny, K. I., Ashraf, M. Z., and Duffy, J. C. (2007) The Use of Hot Deck Imputation to Compare Performance of Further Education Colleges. Journal of Computing and Information Technology 15 (4), pp. 313-318.

Rosenbaum, P. R. and Rubin, D. B. (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1), pp. 41-56.

Ian Morton has built propensity scoring models for the financial services sector, for a utility company, and for the public sector. He has given a number of presentations on the technique of propensity score matching. For example:

1.     The Royal Statistical Society 2009 Conference (http://www.slideshare.net/IDMorton001/presentation-to-rss-edinburgh-2009)


2.     He has also co-authored a forthcoming peer-reviewed journal article. It’s not published yet (as of May 2013), but for a flavour of its contents you could look at either of the two references above with my name in them.

(This is my personal blog, views are my own and not those of my present or past employers)

Appendix - Here is the complete SAS Program

Note: I have subsequently converted the complete program (as shown below) into a SAS Enterprise Guide project - with discrete program blocks, and parameterised code (that chooses the matching method etc., etc.). It is this latter project that is described in the slide presentation found here (http://www.slideshare.net/IDMorton001/presentation-to-n-sug1-2010-with-notes)

/****************************************************/
/* here is the Coca-Perraillon (2007) matching macro    */
/* see the references to that paper                     */
/* it does nearest neighbour, caliper and radius        */
/* matching                                             */
/****************************************************/

%macro PSMatching(datatreatment=, datacontrol=, method=, numberofcontrols=, caliper=,replacement=);
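
/* Parameters (documented here for convenience):                        */
/*   datatreatment    = the dataset of treated units                    */
/*   datacontrol      = the dataset of control units                    */
/*   method           = NN, CALIPER or RADIUS                           */
/*   numberofcontrols = how many controls to seek per treated unit      */
/*   caliper          = maximum allowed difference in propensity scores */
/*                      (CALIPER and RADIUS methods only)               */
/*   replacement      = NO removes each matched control from the pool   */
/* Note: as written, datatreatment and datacontrol are not referenced   */
/* below - the macro reads datasets named Treatment and Control, which  */
/* is how the driver program further down sets things up.               */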

/* Create copies of the treated units if N > 1 */
data _Treatment0(drop= i);
set Treatment;
do i= 1 to &numberofcontrols;
RandomNumber= ranuni(12345);
output;
end;
run;

/* Randomly sort both datasets */
proc sort data= _Treatment0 out= _Treatment(drop= RandomNumber);
by RandomNumber;
run;

data _Control0;
set Control;
RandomNumber= ranuni(45678);
run;

proc sort data= _Control0 out= _Control(drop= RandomNumber);
by RandomNumber;
run;

data Matched(keep = IdSelectedControl MatchedToTreatID);
length pscoreC 8;
length idC 8;
/* Load Control dataset into the hash object */
if _N_= 1 then do;
declare hash h(dataset: "_Control", ordered: 'no');
declare hiter iter('h');
h.defineKey('idC');
h.defineData('pscoreC', 'idC');
h.defineDone();
call missing(idC, pscoreC);
end;
/* Open the treatment */
set _Treatment;
%if %upcase(&method) ~= RADIUS %then %do;
retain BestDistance 99;
%end;
/* Iterate over the hash */
rc= iter.first();
if (rc=0) then BestDistance= 99;
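/* for the NN and CALIPER methods, BestDistance tracks the closest */
/* control seen so far for the current treated unit (reset above)  */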
do while (rc = 0);
/* Caliper */
%if %upcase(&method) = CALIPER %then %do;
if (pscoreT - &caliper) <= pscoreC <= (pscoreT + &caliper) then do;
ScoreDistance = abs(pscoreT - pscoreC);
if ScoreDistance < BestDistance then do;
BestDistance = ScoreDistance;
IdSelectedControl = idC;
MatchedToTreatID = idT;
end;
end;
%end;
/* NN */
%if %upcase(&method) = NN %then %do;
ScoreDistance = abs(pscoreT - pscoreC);
if ScoreDistance < BestDistance then do;
BestDistance = ScoreDistance;
IdSelectedControl = idC;
MatchedToTreatID = idT;
end;
%end;
%if %upcase(&method) = NN or %upcase(&method) = CALIPER %then %do;
rc = iter.next();
/* Output the best control and remove it */
if (rc ~= 0) and BestDistance ~=99 then do;
output;
%if %upcase(&replacement) = NO %then %do;
rc1 = h.remove(key: IdSelectedControl);
%end;
end;
%end;
/* Radius */
%if %upcase(&method) = RADIUS %then %do;
if (pscoreT - &caliper) <= pscoreC <= (pscoreT + &caliper) then do;
IdSelectedControl = idC;
MatchedToTreatID = idT;
output;
end;
rc = iter.next();
%end;
end;
run;

/* Delete temporary tables. Comment this step out for debugging */
proc datasets;
delete _:(gennum=all);
run;
quit;
%mend PSMatching;
/****************************************/
/* that’s the end of the matching macro */
/****************************************/

/***********************************************/
/* this part builds the propensity score model */
/***********************************************/
PROC LOGISTIC DATA=<dataset> Descend;
      class <class variables>/param=ref ref=first ;
      model <outcome> = <independent variables>
      /SELECTION = STEPWISE RISKLIMITS LACKFIT RSQUARE PARMLABEL;
      OUTPUT OUT= Propen prob=prob ;
RUN;
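
/* For illustration only - hypothetical names (mydata, group, age, sex), */
/* none of which come from the original post - the template above might  */
/* be filled in like this:                                               */
/*
PROC LOGISTIC DATA=mydata Descend;
      class sex /param=ref ref=first ;
      model group = age sex
      /SELECTION = STEPWISE RISKLIMITS LACKFIT RSQUARE PARMLABEL;
      OUTPUT OUT= Propen prob=prob ;
RUN;
*/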

proc sort data=propen;by <outcome>;run;

/* set up the data for the matching macro in two separate data sets */
data treatment(rename = (prob=pscoreT));
      set propen;
      idT=_n_;
      if <outcome> ="Treatment" then output treatment;
run;

data control(rename = (prob=pscoreC));
      set propen;
      idC=_n_;
      if <outcome> ="Control" then output control;
run;


/*****************************************************/
/* this part does the actual matching                */
/* I have shown the CALIPER method of doing the matching */
/* caliper of 0.0001 gets n treatments out of m      */
/*****************************************************/

%PSMatching(datatreatment=treatment,datacontrol=control,method=caliper,
numberofcontrols=1, caliper=0.0001, replacement=no);

proc sort data=matched;by idselectedcontrol;run;

/* need to rename to allow merging */
data caliper1(rename=(idselectedcontrol=idC));set matched;run;
/* merge the original file with the matched file (result and     */
/* result2 are variables carried through from the user's data)   */
data merged(keep=result result2 idC matchedtotreatid);
      merge control(in=a) caliper1(in=b);
      by idC;
      if a and b;
run;

/*****************************************************************/
/* now go and summarise the results                              */
/* this part produces the estimate of the counterfactual outcome */
/*****************************************************************/
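/* the first PROC SUMMARY totals result2 within each matched treated    */
/* unit, and the second totals those figures across all treated units.  */
/* Assuming (my reading of the code) that result2 is a 0/1 outcome and  */
/* one control is matched per treated unit, 'answer' below is the       */
/* proportion of matched controls without the event - the estimated     */
/* counterfactual rate.                                                 */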
proc summary data=merged nway missing;
      class matchedtotreatid;
      var result2;output out=caliper2(drop=_type_) sum=;
run;

proc summary data= caliper2 nway missing;
      var result2;output out= caliper3(drop=_type_) sum=;
run;

data caliper4;
      set caliper3;answer=(_freq_-result2)/_freq_;run;

proc print;title "The counterfactual outcome using caliper=0.0001";run;