I’m starting to slog through the thick pile of papers that look interesting from CHI 2012. I wanted to start with the heavier stuff while I have the energy, so I began by looking at a paper discussing statistical methods for HCI. It helped that the first author was Maurits Kaptein, who is a great guy I met while studying abroad at TU/e in the Netherlands.
The basic gist of the paper is that HCI researchers frequently get 3 things wrong:
- We wrongly interpret the p-value as the probability of the null hypothesis being true
- Our studies often lack statistical power to begin with, making it impossible to draw meaningful conclusions when we fail to reject the null hypothesis
- We confuse statistical significance (e.g., p-values) with practical significance (i.e., a difference that actually matters in the real world)
Good stuff, and I’m sure I’ve frequently been guilty of both #1 and #2 (I am usually fairly careful with #3). The authors don’t just point out the problem; they also give 7 suggestions for addressing it. The main point of this post is to critique these suggestions and perhaps point to a few resources:
- Make bolder predictions of direction and magnitude of effects AND
- Predict the size of the effect likely to be found — both of these are easy to say, but the problem is that HCI frequently ventures into areas uncharted by previous studies. We often just don’t know what the effects might be. Piloting is always useful, but it frequently yields results that differ from the actual deployment, since most pilots are run with confederates (e.g., lab mates) to keep costs down. Even small differences in how a study is set up or positioned can lead to HUGE changes in a field deployment (for examples, see the “Into the Wild” paper from last year)
- Calculate the number of participants required ahead of time — every time I have done this, the number has WAY exceeded what I actually had the resources for (see the power-analysis sketch after this list), but Maurits predicted this objection and suggests…
- Team up with other researchers to do multi-site experiments and pool results — I agree with this suggestion, though I wonder how to structure such collaborations in a community like CHI, which (in my humble opinion) values novelty over rigor. Maurits also suggests that we use valid and appropriate measurement instruments so that we can build on each other’s work. I agree with this SO hard that I’ve actually gone through the process of validating a questionnaire for evaluating the emotional aspects of communication technologies. It’s called the ABCCT and it is freely available (the final publication for it is still under review, but I can provide it upon request).
- Use Bayesian analysis if you need to calculate the probability of the hypothesis given the data — this is great and it’s definitely new to me! To help others who are trying to learn this new way of doing stats, here are a few resources I’ve found online: (1) the section on Bayesian methods in statspages (a great resource in its own right), (2) a Bayesian t-test tutorial for R, and (3) an online calculator for Bayes factors (there is also a small Bayes-factor sketch after this list). I still need to figure out how to put all this stuff together for the actual work that I do… What do I do with non-parametric data, for example? If somebody would write a step-by-step online tutorial for HCI researchers, I would give major kudos!
- Encourage researchers, reviewers, etc. to raise the standard of reporting statistical results — my translation is “reject papers that get it wrong,” which is depressing. I think this would be a lot easier to do in the new CSCW-ish model of reviewing, where you have a revise cycle. That way you can actually encourage people to learn it rather than just drive them to take their (otherwise interesting) work elsewhere.
- Interpret the non-standardized sizes of the estimated effect — with this I agree unequivocally, and I’d actually like to add one more point to this idea of “considering whether saving 1 minute is actually significant to anybody.” As HCI researchers, we are usually the ones designing the intervention, so we have a fairly good idea of how difficult it would be to incorporate into existing practices. For example, fiddling slightly with the rankings produced by a search algorithm, changing the layout of a site, or adding a new widget to an existing system is fairly low effort, so even if the effect size of the produced outcome is small, it may be worth adopting. Switching your company to a new email system, changing the workflow of an existing organization, or getting new hardware adopted is really high effort, so it’s only worth considering if the effect size of the produced outcome is quite large.
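Since the power-analysis suggestion is the easiest to make concrete, here’s a minimal sketch of what calculating the number of participants ahead of time actually looks like. This is my own example, not from the paper, and it uses Python’s statsmodels (any stats package with a power module would do): it asks how many participants per group a simple between-subjects comparison needs to detect a medium effect (Cohen’s d = 0.5) with 80% power.

```python
# A priori power analysis for a two-sample t-test:
# participants per group needed to detect a medium effect
# (Cohen's d = 0.5) with 80% power at alpha = .05
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                    alternative='two-sided')
print(round(n_per_group))  # about 64 per group, so ~128 participants total
```

Swap in a “small” effect (d = 0.2) and the same calculation asks for nearly 400 participants per group, which is exactly why these numbers always blow past my resources.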
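And to make the Bayesian suggestion a little less abstract, here is a rough sketch (again mine, not the paper’s) of the JZS Bayes factor for a one-sample or paired t-test, following Rouder et al. (2009); as far as I can tell, this is the quantity that the online calculator reports. The function name and the default Cauchy prior scale r = 0.707 are my own choices, so double-check the output against an established calculator before trusting it.

```python
# Rough sketch of the JZS Bayes factor for a one-sample (or paired) t-test,
# following Rouder et al. (2009). BF10 > 1 favors the alternative hypothesis;
# BF10 < 1 favors the null.
import numpy as np
from scipy import integrate

def jzs_bayes_factor(t, n, r=0.707):
    """BF10 from a t statistic, sample size n, and Cauchy prior scale r."""
    nu = n - 1
    # Likelihood of the data under H0 (effect size = 0), up to a constant
    # shared with the H1 term below
    m0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    # Under H1, the Cauchy(0, r) prior on the effect size is written as a
    # normal with variance g, mixed over an inverse-gamma prior on g.
    def integrand(g):
        return ((1 + n * g) ** -0.5
                * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * r / np.sqrt(2 * np.pi) * g ** -1.5
                * np.exp(-r**2 / (2 * g)))

    m1, _ = integrate.quad(integrand, 0, np.inf)
    return m1 / m0

print(jzs_bayes_factor(t=2.5, n=30))  # e.g., the BF10 for t(29) = 2.5
```

The usual rule of thumb is that a BF10 above about 3 counts as reasonable evidence for the alternative, below about 1/3 as reasonable evidence for the null, and anything in between as not much evidence either way.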
All in all, I really like this paper and its suggestions, but just to cause some intrigue I would like to point to a slightly different discussion issue. The main goal of this paper seems to be to lead HCI researchers toward doing better science. But is that really what we do? Do all HCI researchers consider themselves scientists? I know that for me it is not the most important part of my identity. I run studies as a designer. The goal for me is not to convince others that A is better than B (A and B so frequently shift and evolve that this is usually a meaningless comparison 2 years after the study is run). Rather, I run studies to understand what aspects of A may make it better than B in what situations, and what future directions may be promising (and unpromising) for design. To me, the study is just an expensive design method. The consequence of “getting it wrong,” in the worst case, is spending time exploring a design direction that in the end turns out to be less interesting. There’s rarely an actual optimal design to be found. It’s all just me poking at single points in a large 3D space of possibilities. Should you reject my paper because I didn’t get the number of participants right (which I never will), even if it can inspire others to move toward promising new designs? Just because I didn’t prove it doesn’t mean that there isn’t something interesting there anyway. Maybe a large proportion of HCI studies are meant to be sketches rather than masterpieces.