The Grumpy Economist: Secret Data Encore

Tuesday, January 5, 2016

Secret Data Encore

My post "Secret data" on replication provoked a lot of comments and emails, more reflection, and some additional links.

This isn't about rules

Many of my correspondents missed my main point -- I am not advocating more and tighter rules by journals! This is not about what you are "allowed to do," how to "get published" and so forth.

In fact, this extra rumination points me even more strongly to the view that rules and censorship by themselves will not work. How to make research transparent, replicable, extendable, and so forth varies by the kind of work and the kind of data, and is subject, like everything else, to creativity and technical improvement. Most of all, it will not work if nobody cares: if nobody takes the kind of actions in the bullet points of my last post, and it remains just an issue about rules at journals. Already (more below), rules are not that well followed.

This isn't just about "replication."

"Replication" is much too narrow a word. Yes, many papers have not documented transparently what they actually did, so that even armed with the data it's hard to produce the same numbers. Other papers are based on secret data, the problem with which I started.

But in the end, most important results are not simply due to outright errors in data or coding. (I hope!)

The important issue is whether small changes in instruments, controls, data sample, measurement error handling, and so forth produce different results, whether results hold out of sample, or whether collecting or recoding data produces the same conclusions. "Robustness" is a better overall descriptor for the problem that many of us suspect pervades empirical economic research.
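A robustness check of this kind can be sketched in a few lines. What follows is only an illustration under stated assumptions, not anyone's actual procedure: the data are synthetic, and the names `specification_sweep`, `z`, and `w` are hypothetical. It re-estimates the coefficient of interest under every subset of a list of candidate controls, so one can see how far the estimate moves when the specification changes.

```python
import itertools
import numpy as np

def ols_coef(y, X):
    """Return OLS coefficients of y on X (X must already contain a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def specification_sweep(y, x, controls):
    """Estimate the coefficient on x under every subset of the controls.

    controls: dict mapping a control name to a 1-D array.
    Returns a dict mapping the tuple of included control names to the
    estimated coefficient on x.
    """
    n = len(y)
    names = sorted(controls)
    results = {}
    for k in range(len(names) + 1):
        for subset in itertools.combinations(names, k):
            cols = [np.ones(n), x] + [controls[c] for c in subset]
            results[subset] = ols_coef(y, np.column_stack(cols))[1]
    return results

# Synthetic data (an assumption for illustration): x has NO true effect on y,
# but both depend on the confounder z. w is an irrelevant control.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
w = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 0.0 * x + 1.0 * z + rng.normal(size=n)

sweep = specification_sweep(y, x, {"z": z, "w": w})
for spec, coef in sweep.items():
    print(f"controls={spec}: coefficient on x = {coef:.3f}")
```

In this made-up example the coefficient on `x` looks sizable when `z` is left out and shrinks toward zero once `z` is included. That is the sort of sensitivity a reader can only check with the data and code in hand.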

You need replicability in order to evaluate robustness -- if you get a different result than the original authors', it's essential to be able to track down how the original authors got their result. But the real issue is that much larger one.

The excellent replication wiki (many good links) quotes Daniel Hamermesh on this difference between "narrow" and "wide" replication:

Narrow, or pure, replication means first checking the submitted data against the primary sources (when applicable) for consistency and accuracy. Second the tables and charts are replicated using the procedures described in the empirical article. The aim is to confirm the accuracy of published results given the data and analytical procedures that the authors write to have used.

Replication in a wide sense is to consider the empirical finding of the original paper by using either new data from other time periods or regions, or by using new methods, e.g., other specifications. Studies with major extensions, new data or new empirical methods are often called reproductions.

But the more important robustness question is more controversial. The original authors can complain that they don't like the replicator's choice of instruments or procedures. So "replication," which sounds straightforward, quickly turns into controversy.

Michael Clemens writes about the issue in a blog post here, noting

...Again and again, the original authors have protested that the critique of their work got different results by construction, not because anything was objectively incorrect about the original work. (See Berkeley’s Ted Miguel et al. here; Oxford’s Stefan Dercon et al. here and Princeton’s Angus Deaton here among many others. Chris Blattman at Columbia and Berk Özler at the World Bank have weighed in on some of these controversies.)

In a good paper, published as "The meaning of failed replications" in the Journal of Economic Surveys, he argues for an expanded vocabulary, including "verification," "robustness," "reanalysis" and "extension."

"Failed replication" is a damning criticism. It implies error, malfeasance, deliberately hiding data, and so forth. What most "replication" studies really mean is "robustness," either to method or natural fishing biases, which is a more common problem (in my view). But as Michael points out, you really can't use the emotionally charged language of failed or "discrepant" replication for that situation.

This isn't about people or past work

I did not anticipate, but should have, that the secret data post would be read as criticism of people who do large-data work, proprietary-data work, or work with government agencies that cannot currently be shared. The internet is pretty snarky, so it's worth stating explicitly that this is not my intent or my view.

Quite the opposite. I am a huge fan of the pioneering work exploiting new data sets. If these pioneers had not found dramatic results and possibilities with new data, it would not matter whether we can replicate, check or extend those results.

It is only now that the pioneers have shown the way, and we know how important the work can be, that it becomes vital to rethink how we do this kind of work going forward.

The special problems of confidential government data

The government has a lot of great data -- IRS and Census for microeconomics; SEC, CFTC, Fed, and financial product safety commission in finance. And there are obvious reasons why so far it has not been easily shared.

Journal policies allow exceptions for such data. So only a fundamental demand from the rest of us for transparency can bring about changes. And it has begun to do so.

In addition to the suggestions in the last post, more and more people are going through the vetting to use the data. That leaves open the possibility that a full replication machine could be stored on site, ready for a replicator with proper access to push a button. Commercial data vendors could allow similar "free" replication, controlling directly how replicators use the data.

Technological solutions are on the way too. "Differential privacy" is an example of a technology that allows results to be replicated without compromising the privacy of the data. Leapyear.io is one example of a company selling this kind of technology. We are not alone, as there is strong commercial demand for this kind of data (medical data, for example).
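To give a flavor of how differential privacy works, here is a minimal sketch of one standard building block, the Laplace mechanism, applied to a mean. It is an illustration under assumptions, not a description of any vendor's product: the function name `private_mean`, the clipping bounds, and the synthetic "income" data are all hypothetical, and the sensitivity formula assumes neighboring data sets differ by replacing one record, with the sample size treated as public.

```python
import numpy as np

def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of `values` with epsilon-differential privacy.

    Each value is clipped to [lower, upper], so replacing one record can
    move the mean by at most (upper - lower) / n. Laplace noise with scale
    sensitivity / epsilon then masks any single record's contribution.
    """
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Synthetic "confidential" data, made up for illustration.
rng = np.random.default_rng(42)
incomes = rng.lognormal(mean=10.5, sigma=0.6, size=10_000)

true_mean = np.clip(incomes, 0.0, 200_000.0).mean()
released = private_mean(incomes, 0.0, 200_000.0, epsilon=0.5, rng=rng)
print(f"clipped mean: {true_mean:.1f}, privately released mean: {released:.1f}")
```

With a large sample the added noise is small relative to the statistic, so a replicator querying through such a mechanism can check a summary result without ever seeing the individual records. A smaller `epsilon` means stronger privacy and more noise.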

Other institutions: Journals, replication journals, websites

There is some debate whether checking "replication" should count as new research, and I argued that if we want replication we need to value it. The larger robustness question certainly is "new" research. "X's result does not hold out of sample, is sensitive to the precise choice of instruments and controls, and so forth" is genuine, publishable, follow-on research.

I originally opined that replications should be published by the original journal to give the best incentives. That means an AER replication "counts" as an AER publication.

But with the idea that robustness is the wider issue, I am less inclined to this view. This broader robustness or reexamination is genuine new research, and there is a continuum between replication and the normal business of examining the basic idea of a model with new data and also some new methods. Each paper on the permanent income hypothesis is not a "replication" of Friedman! We don't want to only value as "new" research that which uses novel methods -- then we become dry methodologists, not fact-oriented economists. And once a paper goes beyond pointing out simple mistakes, to questioning specification, a question which itself can be rebutted, it's beyond the responsibility of the original journal.

Ivo Welch argues that a third of each journal should be devoted to replication and critique. The Critical Finance Review, which he edits, asks for replication papers. The Journal of Applied Econometrics has a replication section, and now invites replications of papers in many other journals. Where journals fear to tread, other institutions step in. The replication network is one interesting new resource.

Faculties

A correspondent suggests an important additional bullet point for the "what can we do" list:

Encourage your faculty to adopt a replicability policy as part of its standards of conduct, and as part of its standards for internal and outside promotions.

The precise wording of such standards should be fairly loose. The important thing is to send a message. Faculty are expected to make their research transparent and replicable, to provide data and programs, even when journals do not require it. Faculty up for promotion should expect that the committee reviewing them will look to see if they are behaving reasonably. Failure will likely lead to a little chat from your department chair or dean. And the policy should state that replication and robustness work is valued.

Another correspondent wrote that he/she advises junior faculty not to post programs and data, so that they do not become a "target" for replicators. To say we disagree on this is an understatement. A clear voice on this issue is an excellent outcome of crafting a written policy.

From Michael Kiley's excellent comment below:

Assign replication exercises to your students. Assign robustness checks to your more advanced students. Advanced undergraduate and PhD students are a natural reservoir of replicators. Seeing the nuts and bolts of how good, transparent, replicable work is done will benefit them. Seeing that not everything published is replicable or right might benefit them even more.

Two good surveys of replications (as well as journals)

Maren Duvendack, Richard Palmer-Jones, and Bob Reed have an excellent survey article, "Replications in Economics: A Progress Report":

...a survey of replication policies at all 333 economics journals listed in Web of Science. Further, we analyse a collection of 162 replication studies published in peer-reviewed economics journals.

The latter is especially good, starting at p. 175. You can see here that "replication" goes beyond just can-we-get-the-author's-numbers, and maddeningly often does not even ask that question

a little less than two-thirds of all published replication studies attempt to exactly reproduce the original findings....A frequent reason for not attempting to exactly reproduce an original study’s findings is that a replicator attempts to confirm an original study’s findings by using a different data set

"Robustness," not "replication."

Original Results?, tells whether the replication study re-reports the original results in a way that facilitates comparison with the original study. A large portion of replication studies do not offer easy comparisons, perhaps because of limited journal space. Sometimes the lack of direct comparison is more than a minor inconvenience, as when a replication study refers to results from an original study without identifying the table or regression number from which the results come.

Replicators need to be replicable and transparent too!

Across all categories of journals and studies, 127 of 162 (78%) replication studies disconfirm a major finding from the original study.

But rather than just the usual alarmist headline, they have a good insight. Replication studies can suffer the same significance bias as original work:

Interpretation of this number is difficult. One cannot assume that the studies treated to replication are a random sample. Also, researchers who confirm the results of original studies may face difficulty in getting their results published since they have nothing ‘new’ to report. On the other hand, journal editors are loath to offend influential researchers or editors at other journals. The Journal of Economic & Social Measurement and Econ Journal Watch have sometimes allowed replicating authors to report on their (prior) difficulties in getting disconfirming results published. Such firsthand accounts detail the reticence of some journal editors to publish disconfirming replication studies (see, e.g., Davis 2007; Jong-A-Pin and de Haan 2008, 57).

Summarizing

...nearly 80 percent of replication studies have found major flaws in the original research

Sven Vlaeminck and Lisa-Kristina Herrmann surveyed journals and report that many journals with data policies are not enforcing them.

The results we obtained suggest that data availability and replicable research are not among the top priorities of many of the journals surveyed. For instance, we found 10 journals (i.e. 20.4% of all journals with such policies) where not a single article was equipped with the underlying research data. But even beyond these journals, many editorial offices do not really enforce data availability: There was only a single journal (American Economic Journal: Applied Economics) which has data and code available for every article in the four issues.

Again, this observation reinforces my point that rules will not substitute for people caring about it. (They also discuss technological aspects of replication, and the impermanence and obscurity of zip files posted on journal websites.)

Numerical Analysis

Ken Judd wrote to me,

"Your advocacy of authors giving away their code is not the rule in numerical analysis. I point to the “market test”: the numerical analysis community has done an excellent job in advancing computational methods despite the lack of any requirement to share the code....

Would you require Tom Doan to give out the code for RATS? If not, then why do you advocate journals forcing me to freely distribute my code?...

The issue is not replication, which just means that my code gives the same answer on your computer as it does on mine. The issue is verification, which is the use of tests to verify the accuracy of the answers. That I am willing to provide."

Ken is, I think, reading "rules and censorship" rather than "social norms" into my views. And I think his note reinforces my preference for the latter over the former. Among other things, rules designed for one purpose (extensive statistical analysis of large data sets) are poorly adapted to other situations (extensive numerical analysis).

Rules can be taken to extremes. Nobody is talking about "requiring" package customers to distribute the (proprietary) package source code. We all understand that step is not needed.

For heavy numerical analysis papers, using author-designed software that the author wants to market, the verification suggestion seems a sensible social norm to me. If I'm refereeing a paper with a heavy numerical component, I would be happy to see the extensive verification, and happier still if I could use the program on a few test cases of my own. Seeing the source code would not be necessary or even that useful. Perhaps in extremis, if a verification failed, I would want the right to contact the author and understand why his/her code produces a different result.
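A verification suite in this spirit might look like the following sketch. Everything here is a hypothetical stand-in: `black_box_integrate` plays the role of an author's program whose source a referee never needs to read, and the test cases are simple integrals with known closed-form answers. The point is only that accuracy can be checked from the outside.

```python
import math

def black_box_integrate(f, a, b, n=10_000):
    """Stand-in for an author's routine (here, a composite trapezoid rule)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# Each case: (label, integrand, lower limit, upper limit, exact answer).
TEST_CASES = [
    ("sin on [0, pi]", math.sin, 0.0, math.pi, 2.0),
    ("exp on [0, 1]", math.exp, 0.0, 1.0, math.e - 1.0),
    ("x^2 on [0, 3]", lambda x: x * x, 0.0, 3.0, 9.0),
]

def verify(integrate, cases, tol=1e-6):
    """Run `integrate` on cases with known answers and report absolute errors."""
    report = {}
    for label, f, a, b, exact in cases:
        report[label] = abs(integrate(f, a, b) - exact)
    passed = all(err <= tol for err in report.values())
    return report, passed

report, passed = verify(black_box_integrate, TEST_CASES)
for label, err in report.items():
    print(f"{label}: absolute error {err:.2e}")
print("all cases within tolerance:", passed)
```

A referee could append test cases of his or her own to `TEST_CASES` without seeing how the routine works inside.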

Some other examples of "replication" (really robustness) controversies:

Andrew Gelman covers a replication controversy, in which Douglas Campbell and Ju Hyun Pyun dissect Enrico Spolaore and Romain Wacziarg's "The Diffusion of Development" in the QJE. There is no charge that the computer programs were wrong, or that one cannot produce the published numbers. The controversy is entirely over specification: whether the result is sensitive to specification and controls.

Yakov Amihud and Stoyan Stoyanov, in "Do Staggered Boards Harm Shareholders?", reexamine Alma Cohen and Charles Wang's Journal of Financial Economics paper. They come to the opposite conclusion, but could only reexamine the issue because Cohen and Wang shared their data. Again, the issues, as far as I can tell, are not a charge that programs or data are wrong.

Update: Yakov corrects me:

We do not come to "the opposite conclusion". We just cannot reject the null that staggered board is harmless to firm value, using Cohen-Wang's experiment.
Our result is also obtained using the publicly-available ISS database (formerly RiskMetrics).
Why is the difference between the results? We used CRSP data and did not include a few delisted (penny) stocks that are in Cohen-Wang's sample. Our paper states which stocks were omitted and why. We are re-writing the paper now with more detailed analysis.

I think the point that replication slides into robustness, which is more important and more contentious, remains clear.

Asset pricing is especially vulnerable to results that do not hold out of sample, in particular the ability to forecast returns. Campbell Harvey has a number of good papers on this topic. Here, the issue is again not that the numbers are wrong, but that many good in-sample return-forecasting tricks stop working out of sample. To know, you have to have the data.
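The in-sample versus out-of-sample gap is easy to illustrate with simulated data. The sketch below is not drawn from any of the papers mentioned: returns and candidate signals are pure noise by construction, and the out-of-sample R² is measured against the in-sample historical mean, which is one common convention and an assumption here. The code picks the best of many useless signals in sample, then checks how that signal forecasts in the held-out period.

```python
import numpy as np

def fit_forecast(x, r):
    """OLS of returns r on a constant and signal x; returns (intercept, slope)."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, r, rcond=None)
    return coef

def r_squared(r, pred, benchmark):
    """1 - SSE(forecast) / SSE(constant benchmark forecast)."""
    return 1.0 - np.sum((r - pred) ** 2) / np.sum((r - benchmark) ** 2)

# Synthetic data: no signal has any true forecasting power.
# signals[t] is treated as known before returns[t] is realized.
rng = np.random.default_rng(1)
n_in, n_out, n_candidates = 100, 1000, 50
returns = rng.normal(size=n_in + n_out)
signals = rng.normal(size=(n_in + n_out, n_candidates))

r_in, r_out = returns[:n_in], returns[n_in:]
mean_in = r_in.mean()

# "Fishing": keep whichever candidate signal fits best in sample.
best = None
for j in range(n_candidates):
    a, b = fit_forecast(signals[:n_in, j], r_in)
    r2 = r_squared(r_in, a + b * signals[:n_in, j], mean_in)
    if best is None or r2 > best[0]:
        best = (r2, j, a, b)

r2_in, j_best, a, b = best
r2_out = r_squared(r_out, a + b * signals[n_in:, j_best], mean_in)
print(f"best signal: {j_best}, in-sample R2: {r2_in:.3f}, out-of-sample R2: {r2_out:.3f}")
```

The selected signal shows a positive in-sample R² purely from searching over candidates, while its out-of-sample R² does worse. Running that second step on a real forecasting result requires access to the data.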

6 comments:

Michael Kiley, January 6, 2016 at 8:47 AM
An excellent set of posts. A small suggestion that may enhance the state of replicability and robustness in the profession: I suspect there is some room in graduate training to require students to replicate (and, in more advanced courses) examine the robustness of research results. Such an approach would introduce researchers to the value of replication in shaping their own future research projects.

Tom Brown, January 7, 2016 at 6:09 PM
John, I just don't see how any of these high-minded goals can be accomplished, when it's well known that learning economics results in antisocial behavior. ;D

Doug Campbell, January 11, 2016 at 5:36 PM
This post and the last one are both excellent. I'm surprised they haven't gotten more play, actually. I think changing norms via forums such as this blog actually has a reasonable chance of success. At some level, I think everyone "knows" they should post their code and data. It's up to referees and editors to see to it that this norm gets followed.

Aaron, January 14, 2016 at 8:15 AM
Re: the differential privacy connection -- it actually goes a bit deeper. As you say, it can be viewed as a technology for sharing sensitive data, which helps reproducibility. But its use also directly implies that the original analysis was robust, and avoids the implicit multiple-hypothesis-testing problem that comes from performing exploratory data analysis and confirmatory data analysis on the same dataset. See here: http://science.sciencemag.org/content/349/6248/636

Youssef Essassani, January 30, 2016 at 10:24 AM
If literature reviews are considered "new research" then robustness and even replication no doubt are.
