I’ve been following the #EHILive updates today on Twitter about the future of digital health. There seems to be a lot of attention on the issue of open data, or data sharing, and its potential to generate new insights and get the most out of all the health data that is collected. From a research point of view, this chimes with debates around whether trial data sets should be ‘publicly’ available and accessible. (Note that this is slightly different from calls by e.g. #AllTrials, which argue that all findings should be made public, but not necessarily that the original data sets have to be shared – it’s discussed here under point 4.) The benefit of using original data is that the analyses we can do are much more robust, and we can interrogate the data sets in greater detail than we can when using summary results. For an overview, I described here the benefits of what are known as ‘individual patient data meta-analyses’ for answering important questions in health research. But let’s talk about the data itself…
Currently, trial data sets are the property of the Chief Investigators (CIs), or someone else attached to the trial, such as an academic Trials Unit, rather than being stored in an open repository. This means that if you want to use the data, you have to contact the individual trial teams and essentially try to pry it out of them, and some can be more helpful than others. It typically requires a merry-go-round of emails with the original authors, trying to work out what was measured where and how. A study by John Ioannidis and team in 2002 estimated that putting together a combined data set from multiple original studies involved 2,088 hours of data management and approximately 1,000 emails between the original authors and the study team – very costly and very inefficient. With shared data, the idea is that by asking trial teams to deposit their completed data set in a central system, researchers will be able to access all this data much more quickly to answer those important questions, rather than the data sitting in isolated silos where it can’t be reused, or taking up vast amounts of time just to collate in the first place.
This argument definitely has its critics, however. First there is a pragmatic argument – that organising data sets for open use is an additional drain on time and money. Personally, I don’t think this is a sufficient reason to abandon the initiative, and although there would be initial costs, I think it should be possible in time to set up an efficient and easy-to-use system for submitting, storing and retrieving data.
As someone who has worked on secondary analyses, I can see further benefits in encouraging people to make their data sets accessible to outsiders – for example, making variable names clear and obvious, as trying to navigate someone else’s data set is often akin to rustling through their wardrobe: simultaneously invasive and frustrating (“Why did they put that there? What even IS that?”). I think having clear templates for how data should be stored and labelled would be much more efficient and helpful to anybody (including the original authors) who might want to revisit the data at a later time. On a different note, I can imagine how such templates could actually help standardise analyses and reporting. It would be possible, for example, for anybody to check the original data, to see how the data was handled (was missing data accounted for? Did they check whether the assumptions for the statistical analysis were met?), and for reviewers to check whether data sets have indeed been organised according to the plan made in the study protocol.
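As a rough illustration of what such a template might look like in practice, here is a minimal, hypothetical sketch (all variable names invented for this example) of checking a deposited trial data set against a shared ‘data dictionary’ that declares each variable’s name, type and label:

```python
import csv
import io

# A minimal, hypothetical data dictionary: variable name -> expected
# type and a human-readable label. These names are invented purely
# for illustration; a real repository template would be far richer.
DATA_DICTIONARY = {
    "participant_id": {"type": str, "label": "Anonymised participant ID"},
    "age_years": {"type": int, "label": "Age at baseline (years)"},
    "outcome_score": {"type": float, "label": "Primary outcome score"},
}

def check_against_template(csv_text, dictionary):
    """Return a list of problems: columns not in the dictionary,
    dictionary variables missing from the file, and values that
    cannot be parsed as the declared type (including blanks)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    columns = set(reader.fieldnames or [])
    # Columns the template doesn't know about, and vice versa.
    problems += [f"unknown column: {c}" for c in columns - dictionary.keys()]
    problems += [f"missing column: {c}" for c in dictionary.keys() - columns]
    # Check each value parses as its declared type (row 1 is the header).
    for row_num, row in enumerate(reader, start=2):
        for name in columns & dictionary.keys():
            value = row[name]
            if value == "":
                problems.append(f"row {row_num}: {name} is missing")
                continue
            try:
                dictionary[name]["type"](value)
            except ValueError:
                problems.append(
                    f"row {row_num}: {name}={value!r} is not "
                    f"{dictionary[name]['type'].__name__}"
                )
    return problems

sample = "participant_id,age_years,outcome_score\np001,34,12.5\np002,,11.0\n"
print(check_against_template(sample, DATA_DICTIONARY))
# → ['row 3: age_years is missing']
```

A repository could run a check like this on submission, flagging renamed or absent variables and unparseable values before any secondary analyst ever opens the file – which is exactly the kind of thing a shared template makes mechanically checkable.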
There is another reason some people balk at the idea of data sharing, though – it’s another level of scrutiny, with potential for criticism and embarrassment, and I think the idea provokes a certain kind of academic paranoia (“How dare they think I’m wrong! OHMIGOD WHAT IF I’M WRONG”). Dorothy Bishop nails these issues in her excellent post here, which acknowledges that data sharing can be scary, but points out that in the end this can surely benefit research and researchers by encouraging us to be more stringent about our analyses and fastidious about how data sets are compiled.
In chats over coffees/pints with colleagues, though, it doesn’t seem to be just the extra work or the extra scrutiny that puts people off. Some refer to issues of ‘ownership’, the argument being that if they put all the legwork into getting a trial funded, setting it up, collecting the data, etc., then it isn’t fair to suggest they hand that data over to a public repository where Johnny Nicks-A-Lot can access it, do some analyses, and publish them for his own glory. It’s a “Bloody secondary analysts, comin’ over ‘ere, stealin’ our data” kind of argument.
I really don’t buy it. For one thing, those original trial researchers will probably have already published their own analyses, and I think it would be perfectly fair to say that data aren’t made open access until the study team has finished with them, so no-one can swoop in and steal credit prematurely. This might include the team doing multiple analyses themselves, which is fine, and would just emphasise the need to state in the protocol that they’ll do them (to avoid fishing expeditions). They’re not being robbed, then, of any opportunity they could have had. I also think it could become a reward in itself – those who do submit their data sets to a repository should be acknowledged and applauded for doing so. Twenty years from now, I can imagine your CV, alongside the bit about impact factors or the amount of grant money won, carrying a note of how many data sets you have contributed to public repositories, and perhaps even a quantitative figure (like the impact factor) demonstrating how much additional work your data set has supported.
Finally, I have a more personal issue with the “nicks-a-lot” argument, which I boil down to: IT’S THE PUBLIC’S DATA, NOT YOURS. These trials are often publicly funded, worked on by publicly funded research staff, and it is members of the public who provide all that data. The idea that at the end of all that a lone researcher can say “S’mine!” and potentially stop further important studies being conducted makes me cringe.
The last issue is really key for me – the fact that isolating data sets robs us of opportunities to get the most out of them, to do big, important studies using hundreds and hundreds of patients, studies that could make a real impact on health care. Hence the title of this post – sharing is caring, if we care about the impact we make on health care rather than our personal ownership rights, and if we care about making sure all data sets contribute everything they have to offer, not just to papers for our own CVs.
I remember reading once that the drive to better structure how we collect, store and access data is about building ‘an information architecture’. As a data geek, I find this a beautiful metaphor. Architects, after all, aim for efficiency and elegance in their designs, both things that typically appeal to scientists. I think data sharing could be a fantastic opportunity to develop this architecture, building something that is both accessible and usable. What my conversations with colleagues show, however, is that the challenge may lie not in designing the building itself, but in the fact that some researchers still want to lock all the doors.