I’m starting to slog through the thick pile of papers that look interesting from CHI 2012. I wanted to start with the heavier stuff while I have the energy, so I began by looking at a paper discussing statistical methods for HCI. It helped that the first author was Maurits Kaptein, who is a great guy I met while studying abroad at TU/e in the Netherlands.

The basic gist of the paper is that HCI researchers frequently get 3 things wrong:

- We wrongly interpret the p-value as the probability of the null hypothesis being true
- Our studies often lack statistical power to begin with, making it impossible to draw meaningful conclusions from a failure to reject the null hypothesis
- We confuse statistical significance (e.g., p-values) with practical significance (i.e., actual difference meaningful to the real world)

Good stuff, and I’m sure I’ve been guilty of both #1 and #2 frequently (I am usually fairly careful with #3). The authors don’t just point out the problems, but also give 7 suggestions for addressing them. The main point of this post is to critique these suggestions and perhaps give a few resources:

- Make bolder predictions of direction and magnitude of effects AND
- Predict the size of the effect likely to be found — both of these are easily said, but the problem is that HCI frequently bridges into areas uncharted by previous studies. We frequently just don’t know what the effects might be. Piloting is always useful, but it frequently produces results that differ from the actual deployment, since most pilots are done with confederates (e.g., lab mates) to keep costs down. Even small differences in how a study is set up or positioned can lead to HUGE changes in a field deployment (for an example, see the “Into the Wild” paper from last year)
- Calculate the number of participants required ahead of time — every time I have done this, the number has WAY exceeded what I actually had resources to do (the little power-analysis sketch after this list gives a sense of why), but Maurits predicted this objection and suggests…
- Team up with other researchers to do multi-site experiments and pool results — I agree with this suggestion, though I wonder how to structure such collaborations in a community like CHI, which values novelty over rigor (in my humble opinion). Maurits also suggests that we use valid and appropriate measurement instruments so that we can build on each other’s work. I agree with this SO hard that I’ve actually gone through the process of validating a questionnaire to use in evaluating the emotional aspects of communication technologies. It’s called the ABCCT and it is freely available (the final publication for this is still under review, but I can provide it upon request).
- Use Bayesian analysis if you need to calculate the probability of the hypothesis given the data — this is great and it’s definitely new to me! To help others who are trying to learn this new way of doing stats, here are a few resources I’ve found online: (1) the section on Bayesian methods in statspages (a great resource in its own right), (2) a Bayesian t-test tutorial for R, and (3) an online calculator for Bayes Factor. I still need to figure out how to put all this stuff together for the actual work that I do… What do I do with non-parametric data, for example? (The rough Bayes-factor sketch after this list is one starting point.) If somebody would write a step-by-step online tutorial for HCI researchers, I would give major kudos!
- Encourage researchers, reviewers, etc. to raise the standard of reporting statistical results — my translation is “reject papers that get it wrong,” which is depressing. I think this would be a lot easier to do in the new CSCW-ish model of reviewing, where you have a revise cycle. That way you can actually encourage people to learn it rather than just take their (otherwise interesting) work elsewhere
- Interpret the non-standardized sizes of the estimated effect — with this I agree unequivocally, and I’d actually like to add one more point to this idea of “considering if saving 1 minute is *actually* significant to anybody.” As HCI researchers, we are usually the ones designing the intervention, so we have a fairly good idea of how difficult it would be to incorporate into existing practices. For example, fiddling slightly with the rankings produced by a search algorithm, changing the layout of a site, or adding a new widget to an existing system is all fairly low effort, so even if the effect size of the produced outcome is small, it may be worth adopting. Moving your company to a new email system, changing the workflow of an existing organization, or getting new hardware adopted is really high effort, so it’s only worth considering if the effect size of the produced outcome is quite large.
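
To make the sample-size point concrete, here is a minimal sketch of an a-priori power calculation for a two-sided, two-sample t-test, done by hand in Python with scipy. The Cohen’s d values are placeholder assumptions rather than anything from the paper; the takeaway is just how fast the required n grows as the expected effect shrinks.

```python
# Rough a-priori power analysis for a two-sided, two-sample t-test.
# The effect sizes (Cohen's d) below are placeholder assumptions.
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test for standardized effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2.0)      # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
    # Probability of landing beyond the critical value under the alternative
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

def required_n(d, target_power=0.8, alpha=0.05):
    """Smallest per-group sample size that reaches the target power."""
    n = 2
    while two_sample_power(d, n, alpha) < target_power:
        n += 1
    return n

print(required_n(0.5))   # a "medium" effect: roughly 64 participants per group
print(required_n(0.2))   # a "small" effect: closer to 400 per group
```

Even for a medium effect, that is more participants per condition than I usually have resources for, which is exactly the objection above.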
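
And for the Bayesian suggestion: the sketch below is not the JZS Bayes factor that the linked online calculator computes; it is the rougher BIC-based approximation (Wagenmakers, 2007), run on simulated placeholder data. Still, it shows the flavor of quantifying evidence for or against the null instead of only producing a p-value.

```python
# Back-of-the-envelope Bayes factor for a two-group comparison via the
# BIC approximation: BF01 ~= exp((BIC_alt - BIC_null) / 2).
# BF01 > 1 favors "no difference"; BF01 < 1 favors a real difference.
# The data below are simulated placeholders.
import numpy as np

def bic_bayes_factor_01(group_a, group_b):
    """Approximate Bayes factor in favor of the null (equal means)."""
    data = np.concatenate([group_a, group_b])
    n = len(data)
    # Null model: one shared mean for everybody
    sse_null = np.sum((data - data.mean()) ** 2)
    # Alternative model: each group gets its own mean (one extra parameter)
    sse_alt = (np.sum((group_a - group_a.mean()) ** 2) +
               np.sum((group_b - group_b.mean()) ** 2))
    delta_bic = n * np.log(sse_alt / sse_null) + np.log(n)
    return np.exp(delta_bic / 2.0)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=30)   # hypothetical condition A scores
b = rng.normal(0.5, 1.0, size=30)   # hypothetical condition B scores
print(bic_bayes_factor_01(a, b))    # below 1 favors a difference, above 1 favors the null
```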

All in all, I really like this paper and its suggestions, but just to cause some intrigue I would like to point to a slightly different discussion issue. The main goal of this paper seems to be to lead HCI researchers toward doing better *science*. But is that really what we do? Do all HCI researchers consider themselves scientists? I know that, for me, it is not the most important part of my identity. I run studies as a designer. The goal for me is not to convince others that A is better than B (A and B so frequently shift and evolve that this is usually a meaningless comparison two years after the study is run). Rather, I run studies to understand *what aspects* of A may make it better than B, in what situations, and what future directions may be promising (and unpromising) for design. To me, the study is just an expensive design method. The consequence of “getting it wrong,” in the worst case, is spending time exploring a design direction that in the end turns out to be less interesting. There’s rarely an actual optimal design to be found. It’s all just me poking at single points in a large multidimensional space of possibilities. Should you reject my paper because I didn’t get the number of participants right (which I never will) even if it can inspire others to move toward promising new designs? Just because I didn’t prove it doesn’t mean that there isn’t something interesting there anyway. Maybe a large proportion of HCI studies are meant to be *sketches* rather than masterpieces.

This is pretty great, Lana! I like your take on this a lot!

But as you know, doing decent statistics work requires much more background than one might hope. Until that background is incorporated into HCI programs, I think there is potential for a lot of abuse. (I’m quite surprised that people commit the p-value fallacy, actually; that’s kind of obvious, as there are no probabilities in play there.)

Random aside about statistical power — the following is quite an interesting take on it:

http://www.sciencedirect.com/science/article/pii/S1053811912003990

With the following rebuttal:

http://www.talyarkoni.org/blog/2012/04/25/sixteen-is-not-magic-comment-on-friston-2012/

Thanks for your comment, Ying! I’m trying to understand what you mean by “there are no probabilities in play.” The correct way of interpreting p-values is “the probability of the observed data given the null hypothesis.” But, according to the paper, many researchers think it means “the probability of the null hypothesis given the observed data.” Which isn’t right, right?

Yes, your interpretation is pretty much correct — it’s the probability (under the null hypothesis H_0) of observing data at least as extreme as the data actually observed. This “probability that H_0 is true” idea is just wrong.

No, in a technical sense: from a frequentist (“natural”) view of statistics, you can’t really attach a probability to a hypothesis being true. The problem is that p-values were Frankensteined onto modern statistics and aren’t really part of a coherent framework, and this makes interpretation really, really hard.

Point 5 is interesting, but there are a lot of traps in Bayesian statistics, and you have to buy into its model, which is pretty unnatural at first.

I see these “we need better stats” papers from time to time. Typically, I just disregard them for the same reasons you mentioned in the last section of the post. Design can’t be treated as rigorously as the hard sciences because we deal with so much subjective uncertainty that the scientific method becomes irrelevant. Our tests do not always yield the same results even when the study is recreated and, frankly, recreating a study is hard enough. Most of the time we are running tests in order to validate a hunch we have and want to pursue. I run into this problem all the time when I’m building analytic tools for game designers. A set of collected player data and a good analytic tool will only get you so far. Design needs to happen before and after data is collected and analyzed, no matter how significant it is. Those who want tighter regulation of study results probably believe that papers claiming statistically significant findings make a stronger argument. But how many times have you run a user study where one participant did something off the wall and it snowballed into something awesome for your design? Those are not significant events in the stats sense, but they are in a design sense. It just comes down to the fact that you can’t regulate design or art using scientific procedures. Those procedures are just another tool we can wield.

Also, speaking from a humanities perspective, I assume that the people at CHI who worry about better stats get sick of humanities people mucking up their conference with their flaky papers about design 🙂


Lana:

I think that your last point is exactly right and can also be summed up as: you don’t do “hypothesis-driven research” – nor do most people in the CHI community. This is also related to the point you made about there being a premium on novelty – science is often about iteratively changing experimental variables, which means that existing systems are “tweaked,” not reinvented/redesigned.

I do appreciate the 7 points that the authors make and would echo two more points that my stats instructor (Robert Rosenthal: http://psych.ucr.edu/faculty/rosenthal/index.html) tattooed (figuratively, of course) into our psyches before we left grad school:

#8: If you calculate a p value, then you must also calculate the effect size.

see: http://effectsizefaq.com/essential-guide-to-effect-sizes/

Here are three succinct reasons why you should report an effect size:

http://effectsizefaq.com/2010/05/31/can-you-give-me-three-reasons-for-reporting-effect-sizes/

1. Your estimate of the effect size constitutes your study’s evidence. A p value might tell you the direction of an effect, but only the estimate will tell you how big it is.

2. Reporting the effect size facilitates the interpretation of the substantive significance of a result. Without an estimate of the effect size, no meaningful interpretation can take place.

3. Effect sizes can be used to quantitatively compare the results of studies done in different settings.
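
To make #8 concrete, here is a minimal sketch (in Python, with simulated task times standing in for real data) of reporting Cohen’s d with a pooled standard deviation right next to the p-value:

```python
# Report an effect size (Cohen's d, pooled SD) alongside the p-value, so the
# result answers both "is there a difference" and "how big is it".
# The task-time data are simulated placeholders.
import numpy as np
from scipy import stats

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (((n_a - 1) * group_a.var(ddof=1) +
                   (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2))
    return (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
old_ui = rng.normal(40.0, 10.0, size=25)   # e.g., task time in seconds
new_ui = rng.normal(35.0, 10.0, size=25)

t, p = stats.ttest_ind(old_ui, new_ui)
print(f"t = {t:.2f}, p = {p:.3f}, d = {cohens_d(old_ui, new_ui):.2f}")
```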

#9: If you are doing post-hoc analyses, you must do a Bonferroni adjustment – which gives a more conservative threshold for statistical significance.

You basically divide your alpha level (e.g., .05) by the number of post-hoc analyses you conduct.

Here’s more: http://en.wikipedia.org/wiki/Bonferroni_correction

And finally, here’s a plug for the stats book that I used in grad school and that my prof co-authored:

http://www.amazon.com/Essentials-Behavioral-Research-Methods-Analysis/dp/0070539294/ref=la_B001IQX9F8_1_1?ie=UTF8&qid=1337886128&sr=1-1