Posted

 

Secret Source Codes Threaten Modern Science

By Jeremy Hsu | LiveScience.com – Thu, Apr 12, 2012

 

 

This article claims secret source codes shouldn't be secret, because scientists must be able to share information and check each other's work. I think this is nuts! Science has done just fine with no computers at all, and I feel sure the old-fashioned way of sharing information is good enough. What do you say?

 

 

 

Posted

In many cases, the entire result of a paper is based on "We did a model on a computer and got result x". Or "I wrote a program which implements this algorithm, which I only sort of vaguely explain, and used it to analyze this data." In those sorts of papers, seeing the source code is the only way to evaluate the science.

 

I recall reading a paper that analyzed the statistical analyses performed in several scientific papers, finding that many scientists made errors in the spreadsheets they used to analyze their results. If the scientists had not disclosed their spreadsheets, these errors could not have been discovered. Refusing to disclose source code would be worse.

Posted

Dr. Vannevar Bush, one of the pioneers and fathers of information science, addressed this problem back in 1945. His idea of the memex, a machine analogous to the human mind, paved the way for hypertext documents and predicted the compression of information. He outlined a series of views on what scientists should aim to achieve in the future, and one of his main tenets was that science should be used more to increase the power of human minds, by sharing fast, reliable and accessible knowledge, than to increase our physical power by producing weapons of mass destruction.

 

Here is his interesting essay: "As We May Think" (The Atlantic)

 

I just cannot imagine how much of modern science could be done without computers. The huge amount of data being gathered in particle accelerator experiments at CERN, and the genomic data accumulated by molecular biologists worldwide, needs to be pre-processed in order to test different hypotheses and theories in parallel under software simulation and to interpret those data. Software simulations are what is used to evaluate the null hypothesis.
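
To make that last point concrete, here is a minimal sketch of a permutation test, one common way of using simulation to evaluate a null hypothesis. The measurements and the function name `permutation_p_value` are made up for illustration and are not taken from any of the experiments mentioned above.

```python
import random

def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated exactly
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are interchangeable
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero
    return (extreme + 1) / (n_resamples + 1)

control = [4.1, 3.9, 4.3, 4.0, 4.2]  # hypothetical measurements
treated = [5.0, 5.2, 4.9, 5.1, 5.3]  # hypothetical measurements
p = permutation_p_value(control, treated)
print(f"estimated p-value: {p:.4f}")
```

Without the code (and the seed), a reader of the paper would only see the final p-value and would have to trust that the shuffling and counting were done correctly.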

 

Software designers normally follow a set of software design patterns when they code. It is not necessary for all programmers to follow the same pattern and code things in the same way; different programmers arrive at solutions using different techniques, and their code will indeed differ even when it gives the same output. One advantage of using a particular pattern throughout a project is the reusability and readability of the code. If a different group of scientists wants to repeat your experimental methods and verify your work, it is much easier for them if you provide your source code along with the report, so that it can be reused rather than everyone wasting time writing new code, which hinders the progress of research. A consistent pattern also makes the code readable, so that anyone with basic programming knowledge can understand how your model works and how you got your results. That makes scientific review much easier.

Posted

Surely the source code is the equivalent of the experimental apparatus. A knowledge of the experimental apparatus is central to proper evaluation of any experiment.

Posted

I can understand the point of view though....

 

You are quite correct: the source code is analogous to a detailed methodology and knowledge of the experimental procedure. However, it is also the experimental apparatus itself. The creator could have worked for many years, only to put all the fruits of her labour on the open market priced at zero.

 

Or is there some protocol to stop that happening?

 

The source code clearly needs to be published (independent checking of bugs, hidden artifacts, proper implementation of algorithm etc.) but I do feel for the creator. Is there a form of open-source licence that could deal with the need for openness but credit the creator in the scientific realm?

Posted

It's not good science to say "I have discovered X, but I'm not telling anyone how I arrived at X; it's a secret."

 

No, you're dead right. But it is very poor career management to place your last few years' work at the disposal of any of the world's scientists. The present method needs you to put results and methodology out there for inspection, but this would entail putting the means to compete out there as well.

 

You will often see papers with the proviso that these are initial results that we hope to refine, etc. With old-fashioned physical lab setups, the publisher has the best chance of moving the area forward after that initial publication, as they already have the physical and logistical methodology running (often two or three teams publishing and building on each other's work in turn). I cannot help thinking that by publishing the essence of your work you are leaving yourself open to gazumping.

Posted
The source code clearly needs to be published (independent checking of bugs, hidden artifacts, proper implementation of algorithm etc.) but I do feel for the creator. Is there a form of open-source licence that could deal with the need for openness but credit the creator in the scientific realm?

Most of the open source licenses require those adapting or redistributing the code to retain a notice indicating that original copyright belongs to the original author of the code.

 

There's also the CRAPL:

 

http://matt.might.net/articles/crapl/

Posted

I have published using the phrase "We used a custom perl script available from the authors upon request".

 

The reason is twofold. a) I want to know what you want it for. If it's an academic application, go nuts. If it's a commercial application, well, we have to deal with the funding agency, my PI and my university to work out the intellectual property details, and so on. b) I'm not a programmer. The script did what we wanted it to on our data on our machine. It's probably buggy and not very nice to read. If you want to use it (with the above-mentioned caveat) that's cool, but I don't want to get "It's not working" emails from people unaware/unable to troubleshoot the scripts themselves.

 

If I wrote something amazingly useful and people were broadly excited by it, I might involve a non-computard to tidy it up and then host it for download to crank the citation index of our undoubtedly awesome paper, but as it stands my crappy little scripts are available if you read my paper and email me and tell me what you want it for. :)

Posted

Do you think it would be beneficial if there were a license to release academic code under, with provisions prohibiting commercial use without permission but allowing academic use? I don't know of any software licenses designed with this use in mind.

Posted

I've yet to meet anyone even remotely interested in my source codes. Same goes for data analysis: no one wants to spend their time analyzing the data of other people (that's what you hire PhD students for in the first place :P)

Posted (edited)

Good point. Also, in most papers the authors would at least describe their algorithm, so it is not a complete black box. In fact, I wonder what kind of article would get away with that. Of course, the actual code could (either by design or by error) implement things differently than described. Take BLOSUM62 for instance, which is a standard substitution matrix for sequence alignments and database searches (e.g. with the famous BLAST). Interestingly, it is actually based on erroneous calculations. And in this case the source code for calculating it was open. Yet it took more than a decade for people to realize that. One of the reasons may be that the erroneous matrix actually performed better than the correct one.

 

It will depend a lot on the respective fields, but especially in bioinformatics I assume that mandating open source or not is not going to have much of an impact: much of the code is already open, but few will bother to give it closer scrutiny. Also, some are commercializing their respective algorithms, which could make things more tricky. In the end, the performance is evaluated using test runs, and if the programs do not perform, they vanish.

Edited by CharonY
Posted

I think that as a rule, it is not necessary to share the source code, but it should be considered by the authors. The reasoning behind this is that I feel saying "We fed our data into a program we created and it said this" is insufficient regardless of whether or not you included source code. Instead, it should only be required to explain the actual processes the program performed, not how the program performed them. This way, the experiment is still repeatable under the same test processes; however, the person repeating the experiment will need to develop or find their own medium (source code, calculator, etc.) to carry out those processes. It also means that errors in the data caused by coding bugs/errors are more likely to be found, because the same code (with the same errors) is NOT used.

 

For example, I can perform a t-test using Excel; I don't need to show the source code that tells Excel how to do this. Likewise, I can repeat the experiment and calculate the t-test with a calculator by hand and still get the same "results". As long as the methodology and mechanics of it (equations/principles and assumptions) are accurately described, I feel the source code is fairly irrelevant.
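
As a rough sketch of what "repeat it from the described method" looks like, the function below recomputes the pooled-variance (Student's) two-sample t statistic straight from the textbook formula, with no reference to how Excel implements it. The sample values and the name `student_t` are made up for illustration.

```python
import math
import statistics

def student_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (divisor n - 1)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, dof

group_1 = [1, 2, 3, 4, 5]  # hypothetical data
group_2 = [2, 3, 4, 5, 6]  # hypothetical data
t_stat, dof = student_t(group_1, group_2)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

Anyone who knows which variant of the test was used (pooled variance here, an assumption of this sketch) can check the published number this way. The description has to be that precise for the approach to work.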

Posted (edited)

You would be correct were the computer programs that lurk behind many scientific papers as simple as your t-test example.

 

They aren't.

 

The problems here are software quality and software verification and validation. In my years of experience with it, academic software is amongst the lowest quality software on the planet. It's bad stuff. Really. Really. Bad. Cyclomatic complexity in the triple digits, functions that are thousands of lines long, functions that access variables before they are assigned values, memory leaks galore, and the only comments are of three forms:

  • "Well. Tell me something I don't already know!" (i = i + 1; // Increment i),
  • "Well! Tell me why you did that instead of making a silly joke." ( i = i + 42; // Douglas Adams to the rescue!), or
  • "Well!! I don't want to know that!" (// John Smith [1998]: Note to self: The following appears to violate the laws of physics).

 

Even in high quality environments, simply replicating the effects of the code without replicating the code itself can be problematic. For a long time, one of the standard approaches to achieve robustness and correctness in software for safety critical systems was to have two different organizations independently develop the safety critical chunks of the software. This approach sometimes failed. Independent implementers sometimes made the same programming mistakes, sometimes made the same erroneous assumptions. Sometimes they did everything right and the code was still "wrong" because the requirements/algorithms were incorrect. Nowadays, that dual implementation scheme is being discarded in favor of independent verification and validation. Give the requirements, the code, the test procedures, and the test results to some independent organization and let them figure out whether the system is correct.
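
A toy sketch of how dual implementation can fail, assuming a hypothetical requirement that asks for the sample variance (divisor n - 1). Both "teams" below are invented for illustration; they use different algorithms but share the same mistaken assumption, so comparing them against each other finds nothing, while checking against the requirement does.

```python
import math
import statistics

def variance_team_a(xs):
    """Two-pass implementation. Mistaken assumption: divides by n, not n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def variance_team_b(xs):
    """One-pass implementation, written independently, with the same mistaken divisor."""
    total = total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x
    n = len(xs)
    return total_sq / n - (total / n) ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical test data
a, b = variance_team_a(data), variance_team_b(data)
reference = statistics.variance(data)  # sample variance, divisor n - 1

implementations_agree = math.isclose(a, b)        # cross-check between the two teams
matches_requirement = math.isclose(a, reference)  # validation against the requirement
print(implementations_agree, matches_requirement)
```

The cross-check passes and the validation fails, which is the argument for handing requirements, code and tests to an independent reviewer instead of relying on a second implementation.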

 

One final point: There's a hidden flaw in your argument. You are implicitly assuming that the people who wrote the paper can describe what the software does. In many cases, good luck with that. Academic software is handed down from one grad student to another, then modified and extended, over and over again. Documentation is verbal. There might well be chunks of code written in Fortran IV that nobody understands and everyone is afraid to touch.

Edited by D H
Posted

Do you think it would be beneficial if there were a license to release academic code under, with provisions prohibiting commercial use without permission but allowing academic use? I don't know of any software licenses designed with this use in mind.

 

There are creative commons licenses that cover this: http://creativecommons.org/licenses/
