

Posted

(Please note that I'm starting this discussion a month before I begin an undergraduate degree in the sciences. I don't purport to be an expert in anything I'm talking about here.)

 

(people with no attention span should skip to that one-line paragraph down below)

 

Recently I received access, via the University of Texas library system, to hundreds of online journals. I can look up journal articles in pretty much any subject I please, trying to find good information and answers to my questions. I think this is fantastic: online journals make accessing the information far easier and let ordinary undergrads like me learn about research done by pretty much anybody.

 

That's not to say it's optimal, though. There's always the problem of issues that are not yet available online, and if I weren't a university student I'd have to fork over a whole lot of money to get this kind of access.

 

Along those lines, there have been attempts to create open-access online journals that aim to consolidate a lot of research and provide easy search systems to find articles of interest. There's the arXiv, PLoS (all seven of them), indexes like PubMed, and so on. But their openness doesn't solve everything: the conclusions of the papers may now be viewable by the entire world, but what about the rest? The data? Many of the journals I looked at merely provided a few charts, leaving me unable to consider conclusions the authors didn't cover.

 

So, a thought that has been brewing in my mind for a while now:

 

What if there were an open-access online journal that not only provided a medium for authors to publish their work, but also a way for them to publish their raw data?

 

Think about it: it's very easy to provide online access to heaps of data. Researchers could upload not just their write-up but also the dataset from a particular experiment, giving anybody, anywhere the ability to comb through the data for missed conclusions and new patterns.

 

Combine this with a really clever cross-referencing system and you could make it easy for anyone to find, say, every experiment involving neutrino production, allowing a researcher seeking to answer a question about neutrinos to just look up some datasets and spend a few days analyzing them rather than spending months setting up another experiment.
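To make that concrete, here's a toy sketch (in Python) of the sort of cross-referencing I have in mind; every name and tag below is made up, and a real system would obviously need much richer metadata:

# hypothetical index: each dataset carries a handful of tags
datasets = [
    {"title": "Neutrino production study A", "tags": {"neutrino", "accelerator"}},
    {"title": "Cosmic ray shower survey", "tags": {"cosmic-ray", "muon"}},
    {"title": "Neutrino oscillation run B", "tags": {"neutrino", "oscillation"}},
]

def find_datasets(keyword):
    # return every dataset whose tags mention the keyword
    return [d for d in datasets if keyword in d["tags"]]

for d in find_datasets("neutrino"):
    print(d["title"])

A real cross-reference would of course search far more than hand-entered tags, but even this much would beat combing journals by hand.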

 

The journals I have seen generally offer descriptions of the experiments, a few graphs or charts of the data, and some conclusions. (Maybe I'm looking at crappy journals.) But the Internet provides an easy medium to post data as well. What would happen if such an online tool existed? Would it revolutionize the publishing world, or am I too stupid to have noticed that such a tool already exists or would never work in practice?

Posted

I sometimes wonder whether or not the raw data is considered proprietary. It truly is an interesting question you've posed.

 

 

(btw... I did my journal pillaging most successfully in the UGL/SMF library... in the building with the tower right beside the main bldg... PCL was somewhat limited... Have fun with it, whatever you access.)

Posted
I sometimes wonder whether or not the raw data is considered proprietary. It truly is an interesting question you've posed.

Sometimes I see news articles saying that someone re-analyzed data from another experiment to produce a new conclusion, suggesting the authors of the first study shared their data. But I presume that usually takes conversation with the experimenters, not a search on a website.

 

 

(btw... I did my journal pillaging most successfully in the UGL/SMF library... in the building with the tower right beside the main bldg... PCL was somewhat limited... Have fun with it, whatever you access.)

The UT Libraries website gives you online access to most of the major publishers' databases. I'll keep your advice in mind when I go searching for paper copies, though :cool:

Posted
The UT Libraries website gives you online access to most of the major publishers' databases. I'll keep your advice in mind when I go searching for paper copies, though

 

Thanks for making a guy feel ancient. :-(

 

 

 

Per your topic, though... Isn't this exactly what they do with climate science? There is such HUGE non-expert interest that the datasets themselves are published online, for example by NOAA or GISS. It seems to me that you are suggesting something incredibly similar, just well beyond the climate (instrumental record) sciences.

Posted
Thanks for making a guy feel ancient. :-(

Hey, I did spend some time at PCL during orientation... in the computer lab.

 

 

Per your topic, though... Isn't this exactly what they do with climate science? There is such HUGE non-expert interest that the datasets themselves are published online, for example by NOAA or GISS. It seems to me that you are suggesting something incredibly similar, just well beyond the climate (instrumental record) sciences.

Yeah, basically. A central location for all of that kind of data. Perhaps it could even provide ways to combine data across experiments, or list similar experiments so their datasets can be compared. If you want to make it really easy on amateurs, the site could provide basic graphing and statistical features.
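To show how little it would take, here's a rough Python sketch of the "basic graphing and statistics" part; the filename and column names are placeholders for whatever a dataset on the site would actually look like:

import csv
import statistics
import matplotlib.pyplot as plt

xs, ys = [], []
with open("experiment_123.csv", newline="") as f:  # made-up filename
    for row in csv.DictReader(f):
        xs.append(float(row["x"]))
        ys.append(float(row["y"]))

print("mean of y:", statistics.mean(ys))
print("std dev of y:", statistics.stdev(ys))

plt.scatter(xs, ys, s=5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Quick look at a published dataset")
plt.show()

The site could run something like this server-side so visitors never have to install anything.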

Posted

Depending on the field, such mechanisms are already in place. Of course, the raw data is usually connected to a given publication; otherwise it would be just a huge data dump with little information. For instance, in most proteomics journals you have to submit the MS/MS spectra of protein analyses, which are then freely available as supplementary material. Likewise, published sequence data has to be deposited in a database (e.g. GenBank), and so on. The problem is that data is often not normalized: it differs from machine to machine, sometimes from run to run. Without a lot of metainformation (as provided in a manuscript), much of the data cannot be analyzed even by an expert, much less by a layman. Of course there are exceptions, such as genome databases. The better ones are manually curated, though, which takes quite a bit of time and effort.

The goal of a manuscript is to put the raw data into a narrative that can be understood by a peer in the field. Interpretation of the raw data itself can be terribly complex.

 

To summarize: in many cases, and when it is of importance, raw data is available either as supplementary information or in one of the specialized databases. An uncurated data dump, however, will in most cases be of limited value to experts and laymen alike.

Posted

So you're saying that in certain fields, data is indeed freely available and published alongside the articles?

 

You have a good point about data curation; I can see there being a need to reformat and normalize data in my imaginary online database, even if it's just to use consistent units for measurements.
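Even the units part could start out as something this simple (the conversion table and values here are invented, just to illustrate):

# convert lengths submitted in assorted units to metres
TO_METRES = {"m": 1.0, "cm": 0.01, "mm": 0.001, "in": 0.0254}

def to_metres(value, unit):
    return value * TO_METRES[unit]

submitted = [(5.0, "cm"), (12.0, "mm"), (0.3, "m")]
print([to_metres(v, u) for v, u in submitted])  # everything now in metres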

 

What fields may not have freely available data? I know the psychology journals I saw did a pretty poor job in general, although they were the more obscure journals.

Posted

Yes, the data is either provided as supplementary material or deposited in some kind of repository. However, it is usually not totally raw. For instance, in a microarray experiment you essentially measure fluorescence in one or more channels for a few thousand spots. The raw data is essentially just the fluorescence value for each channel, and that list would yield little information to anyone but the investigator. What is published is, e.g., normalized values or, in the case of comparative analyses, the relative differences.
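To give a rough, made-up illustration of the kind of processing I mean (a simple log-ratio with median centering; real pipelines are considerably more involved):

import math
import statistics

# raw two-channel fluorescence per spot (invented numbers)
raw = [(1200.0, 950.0), (340.0, 700.0), (5100.0, 4800.0), (80.0, 75.0)]

# log2 ratio of the two channels for each spot
log_ratios = [math.log2(c1 / c2) for c1, c2 in raw]

# crude global normalization: center everything on the median ratio
median = statistics.median(log_ratios)
normalized = [r - median for r in log_ratios]

print(normalized)  # relative differences, roughly comparable between runs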

 

As a rule of thumb, most journals want the authors to provide, at least upon request, the raw data that led to the immediate conclusions of a manuscript. The authors may opt not to release data that is not relevant to those conclusions (but was generated in the same experiments, for instance) in order to publish it later.

This is quite common in epidemiological studies, for instance. A big epidemiological study yields a vast amount of data, and the researchers can harvest it for several years. They are, of course, hesitant to release the data before they have squeezed all the possible publications out of it. The data relevant to each manuscript will, of course, be included in the respective paper.

Simply put, each paper should include all the information necessary to enable a peer in the given field to come to the same conclusions.

Posted

Have an astronomy data archive:

 

http://archive.stsci.edu/

 

There are a few others, but mostly raw data is available by request. Raw data tables show little to no information at a glance, which is why we use graphs... It's often tough to convince undergrads this is so.

 

It is also nearly impossible to really interpret the data without a good, detailed understanding of the experimental setup. Looking at my data right now, it appears there is something very interesting going on. There isn't; there is just an issue with the alignment of the kit.


Merged post follows:


There is also the issue of the amount of data.

 

For my master's project, each data set was 20,000 data points long, and each point was itself an average of 100 readings. To draw any conclusions from the data you needed 4 sets. That's 20,000 × 100 × 4 = 8,000,000 raw numbers that are pretty much meaningless on their own.

Posted

Ha, and that too. On a single run (around two days) I generate around a gig's worth of numbers. Making all of them accessible (even those that gave no results) would be quite challenging, especially if, say, a hundred groups were doing the same.

Posted

In my field (organismal biology), it's actually very uncommon to publish raw data in any form, mostly because it's useless. Nobody would look at it, much less process it and try to use it for something else or to replicate a result. The sheer amount of manual labor in biological data processing is intense and monotonous (my ratio of experiment time to data-processing time is easily 1:100, probably closer to 1:200), the inherent organismal variability prevents anyone from discerning anything without statistics anyway, and if you desperately want the data, all it takes is a quick email.

 

Seriously, if someone ever tried to re-create my work from my stored data, they'd hang themselves by the third month. Assuming that carpal tunnel from manual digitizing hadn't rendered them unable to tie a knot.

 

Another point worth making is that, at least in my field, the experiment tends to be very strongly tailored to the hypotheses, and extraneous information is rarely collected, so the chances of finding something new in the data are pretty slim. For instance, there are many studies in animal locomotion using a 3-axis force plate that never even bother to videotape the animal, since that data would just be wasted disk space.

 

In short, I suspect you think raw data may be more useful than it really is in some fields. Mostly it just sits in the file cabinet and rots.

Posted

 

It is also nearly impossible to really interpret the data without a good, detailed understanding of the experimental setup. Looking at my data right now, it appears there is something very interesting going on. There isn't; there is just an issue with the alignment of the kit.

 

I second this. The effort one would have to expend to explain how to use the data could be a huge drain on the experimenters' time. And as Klaynos indicates, there are a lot of data that are bad for technical reasons, and shouldn't be in any kind of database.
