Wednesday, March 26, 2008

March 26 is Document Freedom Day!

Today marks the first observance of Document Freedom Day, which from here on out will be an annual celebration held on the last Wednesday of March.

From the official website:

Document Freedom Day (DFD) is a global day for document liberation. It will be a day of grassroots effort to educate the public about the importance of Free Document Formats and Open Standards in general.

Complementary to Software Freedom Day, we aim to have local teams all over the world organise events on the last Wednesday of March. 2008 is the first year that Document Freedom Day is being called for, and we are looking for people around the world who are willing to join the effort.

DFD's main goals are:

  • promotion and adoption of free document formats
  • forming a global network
  • coordination of activities that happen on 26th of March, Document Freedom Day

Once a year, we will celebrate Document Freedom Day as a global community. Between those days, DFD will be focused on facilitating community action and building awareness for issues of Document Freedom and Open Standards.

Given the work on open data standards, structured data, and open repositories being done by Cameron, Peter MR, and others in the open science community, this is definitely cause for celebration! Unfortunately, the United States seems to be lagging behind other countries in its observance of this holiday (but maybe we'll give it another year).

Thanks to Alain Laederach for the tip!

Monday, March 24, 2008

PSB Open Science workshop - call for participation

The call for participation for the Open Science workshop at PSB 2009 is now up! We welcome anyone with an interest in open science to submit proposals for talks. Note that although space is limited for talks and demos, anyone who registers for the conference can present a poster, so we also encourage poster submissions!

Tuesday, March 18, 2008

Gregory Petsko on "the right to be wrong"

Gregory Petsko expounds eloquently on the "climate of fear" in science in a recent commentary in Genome Biology, titled "The right to be wrong." Drawing a provocative parallel to US politics, he describes how honest, intelligent people willing to admit their (almost always) understandable mistakes are turned on and burned at the stake by their opponents, accused of lacking integrity and being "flip-floppers". In science, the attacks are much less direct, but the attitude is still entrenched, and the vast majority of scientists are opting for "safe", incremental, "data gathering/discovery" based research as opposed to bold, hypothesis-driven science. The sentiment is echoed by funding agencies who do not want to risk funding anything that might "fail."

The commentary, although ominous at first, should inspire us all to behave as true scientists should - boldly but carefully, objectively and rationally.

Saturday, March 15, 2008

AMIA Summit on Translational Bioinformatics

Hundreds of clinical scientists, biologists, bioinformaticians, and policy gurus descended on the swanky Intercontinental Mark Hopkins hotel for the first AMIA-sponsored Summit on Translational Bioinformatics last week. Stanford's Atul Butte rallied impressive troops for this inaugural meeting, including the leaders of all of the National Centers for Biomedical Computation (NCBCs, 7 or so total). Since translational bioinformatics is not simply about research, but about translating research into tangible benefits (clinical diagnostics, therapeutics, and standard of care), this meant a many-faceted conversation involving basic researchers, large-scale integrative projects (e.g. caBIG, the NCBCs), clinical scientists, informaticians, and government agencies. This was reflected by the structure of the meeting, which consisted of tutorials; policy, technology, and organization panels; primary paper sessions; and posters covering topics ranging from how to establish collaborative projects to ontologies and phenomics.

Given the breadth of the audience, I'm sure the highlights of the conference vary from person to person. Below are some of mine:

Eitan Rubin from Ben-Gurion University, Israel (talk highlight). "Reverse translational bioinformatics: a bioinformatics assay of age, gender and clinical biomarker." A self-proclaimed biologist, Eitan presented some intriguing work in what he called "reverse translational bioinformatics" - using clinical/medical data to make useful discoveries about biology. As an additional aim, he strove to show that existing bioinformatics tools could be applied to clinical data with little modification. To do this, he took an immense data set - thousands of variables collected for tens of thousands of individuals (part of a nutrition and lifestyle survey that was epidemiological in nature), including laboratory tests, questionnaire answers, and medication data - and essentially turned it into a microarray after binning by age. Note that this was a proxy for clinical data since no such data is currently publicly available. He then subjected this array to the same kinds of analyses one would perform on an array of molecular biological data: normalization, calculation of median values, clustering by age and variable. The results encompassed both the expected and the surprising. For example, when he clustered by age, he found distinct boundaries between somewhat intuitive ages - at 12 yrs and 16 yrs for both sexes, at 40 yrs for women and again around 49, and around 45 for men; these could point to interesting biological changes going on at these age boundaries. He also plotted the median values for variables like serum lead level vs age and found distinct patterns. At this point, he has only begun to analyze the enormous amounts of data, and more interesting patterns are sure to emerge. In the meantime, it helps drive home the potential behind open data and data (and methods!) re-use.
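To make the "survey as microarray" idea a bit more concrete, here is a rough Python sketch on made-up records. It is only an illustration, not the pipeline from the talk: the function names, the 10-year bin width, the z-score normalization, and the use of distances between neighbouring age bins as a crude stand-in for clustering are all my own assumptions.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def age_binned_medians(records, variables, bin_width=10):
    """Build a variables x age-bins matrix of medians.

    Assumes every age bin has at least one value for every variable.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for rec in records:
        bin_start = (rec["age"] // bin_width) * bin_width
        for var in variables:
            grouped[bin_start][var].append(rec[var])
    bins = sorted(grouped)
    matrix = {var: [median(grouped[b][var]) for b in bins] for var in variables}
    return bins, matrix


def zscore_rows(matrix):
    """Normalize each variable (row) to zero mean and unit variance across bins."""
    normalized = {}
    for var, values in matrix.items():
        mu, sigma = mean(values), pstdev(values)
        normalized[var] = [(v - mu) / sigma if sigma else 0.0 for v in values]
    return normalized


def adjacent_bin_distances(bins, normalized):
    """Euclidean distance between each pair of neighbouring age-bin columns."""
    distances = []
    for i in range(len(bins) - 1):
        squared = sum((row[i + 1] - row[i]) ** 2 for row in normalized.values())
        distances.append((bins[i + 1], squared ** 0.5))
    return distances


# Synthetic records with hypothetical variables "a" and "b" and a built-in shift at age 20.
records = [
    {"age": age, "a": 1.0 if age < 20 else 5.0, "b": 2.0 if age < 20 else 8.0}
    for age in range(40)
]
bins, medians = age_binned_medians(records, ["a", "b"])
normalized = zscore_rows(medians)
distances = adjacent_bin_distances(bins, normalized)

# The largest jump between neighbouring bins marks a candidate age boundary.
boundary_age, boundary_size = max(distances, key=lambda pair: pair[1])
print("age bins:", bins)
print("largest change begins at age bin:", boundary_age)
```

A real analysis would presumably use proper hierarchical clustering and handle missing values; the point here is just that once the survey is reshaped into a variables-by-age matrix, array-style tools apply directly.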

Yael Garten from Stanford University (talk highlight). "Pharmspresso: a text analysis tool for linking pharmacogenomic concepts." [Disclaimer: Yael and I are colleagues in the same lab and I helped to critique her presentation.] Yael's work on a semantic, scoped search engine for pharmacogenomics is worth mentioning because of its immediate and potential utility. Pharmspresso allows a user to query a corpus of documents (currently about a thousand pharmacogenomic-related articles previously curated by the PharmGKB team) for keywords, genes, drugs, and/or polymorphisms occurring in the same sentences. Based on the Textpresso ontology created for mining the C. elegans literature, Pharmspresso includes semantic support for human genes, drugs, and genetic polymorphisms and additionally improves upon more general search engines such as Google and PubMed by limiting the scope of the hits to the sentence level and returning hits color-coded within each sentence for easy evaluation of search results. Pharmspresso has already helped the PharmGKB curators and in the future will be incorporated into an automatic curation pipeline.
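For readers who have not seen sentence-scoped search before, the core idea can be sketched in Python as below. This is not how Pharmspresso or Textpresso is actually implemented (there is no ontology here, and the sentence splitter is deliberately naive); the function names and the sample text are made up for illustration, and square brackets stand in for color-coding.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Abbreviations such as "C. elegans" would be split incorrectly.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurrence_hits(text, terms):
    """Return sentences containing ALL terms, with each matched term bracketed."""
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE) for term in terms
    ]
    hits = []
    for sentence in split_sentences(text):
        if all(p.search(sentence) for p in patterns):
            marked = sentence
            for p in patterns:
                marked = p.sub(lambda m: "[" + m.group(0) + "]", marked)
            hits.append(marked)
    return hits


# Made-up sample text; only the first sentence mentions both query terms.
sample = (
    "CYP2C9 variants alter warfarin dose requirements. "
    "Warfarin is an anticoagulant. "
    "CYP2C9 is a liver enzyme."
)
hits = cooccurrence_hits(sample, ["warfarin", "CYP2C9"])
for hit in hits:
    print(hit)
```

Restricting matches to a single sentence is what cuts down the false positives you get from document-level search, where a gene and a drug can both appear in a paper without ever being discussed together.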

Selected papers to be published in BMC Bioinformatics. At the close of the conference, the surprise announcement was made that 15 of the 27 presented papers had been selected to be published in a summer issue of BMC Bioinformatics as a joint agreement between the Open Access journal and AMIA, who would foot the bill. The papers would need to be expanded and updated for submission but the peer review process had happened for the conference and so they were already considered accepted for the journal. A couple of big conferences already do something similar - ISMB/ECCB and RECOMB - but it would be great if every major conference had some kind of arrangement like this with a journal. It seems like it would be a win-win for everyone - peer-review already taken care of, an increased audience for that issue of the journal, and a nice CV boost for the authors (and no more hard decisions between presenting at a conference vs publishing in a journal). Given the fact that this was the very first meeting for this conference, it was a very nice surprise indeed.

Thoughtful A/V setup. This is simply a logistical highlight. We've all sat through our share of technical difficulties, but this conference (at least in the main room) was astonishingly free of them. A large part of this was due to the presence of dedicated A/V staff who knew just when to dim and raise the lights, cue mood music, and put up the "transition screen" - a screen blank except for the AMIA logo. This screen went up whenever a presenter's slides were NOT up, and prevented those awkward moments when the audience could see the desktop of the presenter's laptop or the view of the PowerPoint application. It was also nice not to have to see the blue or black screens when video input was changed. All in all, it imparted a much-appreciated professional touch to the conference which other meetings would do well to emulate.

In summary, there were some informative panels on various policies and the NCBCs, interesting research, and nice extras that made this first Summit on Translational Bioinformatics a big success!

Thursday, March 13, 2008

Help for protein misfolding in foreign vectors?

A friend of mine is getting ready to do some experiments involving purified human proteins expressed in E. coli, and she asked me if I knew anything about protein misfolding - apparently, proteins sometimes misfold when expressed in foreign vectors such as E. coli. Unfortunately, I didn't, but a Google search brought up an explanation that's really not that surprising when you think about it: many proteins fold correctly only with the help of chaperone proteins or cofactors. Obviously, this can be a big problem for an experimentalist who wants to get usable amounts of a specific, correctly folded protein.

Does anyone know where to find good information about this problem or have suggestions for how to get around it (with or without changing vectors - I'm not sure if E. coli is a crucial part of the study or not)? The document I linked has some solutions but I'm wondering if there are any resources or "easy" tips out there I can forward along.

Online collaborative manuscript annotation

While at the inaugural AMIA Summit on Translational Bioinformatics the first half of this week (stay tuned for another post summarizing that), I started thinking about some ideas for tools that could help make discussion of papers easier and more productive.

Currently, it seems that there are a few avenues for discussing a paper: 1) have an informal conversation in person, 2) hold a journal club where one person presents the paper and discussion ensues, or 3) blog about it and hope others comment. (You could argue that another avenue exists through some journals - especially open access ones - allowing comments on published articles, but this hasn't caught on as far as I can tell.) There are several disadvantages of the current systems. In-person conversations or journal clubs can be stimulating as they happen, but are transient and usually go unrecorded, resulting in little tangible benefit to others (or often even the participants); they also usually preclude remote participation without some sort of audio-visual setup. Going the blog route allows anyone to participate, but it's difficult to connect the comments back to the paper and the discussion may be less productive than hoped.

A group of students in my human-computer interaction class a few years ago developed an idea called Collaboread for their final project. In essence, it allowed multiple online users to mark up a document, enabling collaborative annotation. I'm sure there are several products out there that allow either online markup of documents (Adobe, for one) or collaborative editing (Google Docs), but I haven't seen anything that resembles exactly what I envision.

Suppose you are viewing a document on the screen - maybe a full-text article at BioMed Central, or a PDF. Clearly, things like web URLs and references should be hyperlinked already. But suppose you could create additional hyperlinks, such as to Wikipedia pages, other papers that were not referenced but are relevant, blog posts, etc. You could also start individual discussion threads attached to particular results, claims, or points made in the paper, or to tables or figures. Mousing over or clicking on the icon indicating such a thread would bring up a summarized view of the thread overlaid on the screen, which you could browse more deeply or hide if you decide you're not interested. The idea is to make a richly annotated document that is easy to read, but that also makes it easy to see what other people thought or were confused by, and to respond if so inclined without too much disruption. When I envision this, I see a Google Maps-like navigation and manipulation style with lots of linked text, little colored balloons at the POIs - the discussion threads, and liberal use of tags to help with filtering and searching of the document and annotations.
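As a first stab at what the underlying data might look like, here is a minimal Python sketch. Everything in it is hypothetical (the class names, the anchor strings like "figure-2", the example URL); it just shows that added links and discussion threads can both hang off anchors in the document, carry tags for filtering, and produce the short summary you would see on mouse-over.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    author: str
    text: str


@dataclass
class Annotation:
    anchor: str  # where it attaches, e.g. "figure-2" (anchor format is made up)
    kind: str  # "link" or "thread"
    tags: List[str] = field(default_factory=list)
    url: str = ""
    comments: List[Comment] = field(default_factory=list)

    def summary(self, max_chars: int = 40) -> str:
        """Short text for the mouse-over balloon."""
        if self.kind == "link":
            return f"link -> {self.url}"
        if not self.comments:
            return "empty thread"
        first = self.comments[0]
        snippet = first.text[:max_chars]
        return f"{len(self.comments)} comment(s), started by {first.author}: {snippet}"


@dataclass
class AnnotatedDocument:
    title: str
    annotations: List[Annotation] = field(default_factory=list)

    def with_tag(self, tag: str) -> List[Annotation]:
        """Filter annotations by tag."""
        return [a for a in self.annotations if tag in a.tags]


doc = AnnotatedDocument(
    title="Some paper under discussion",
    annotations=[
        Annotation(
            anchor="figure-2",
            kind="thread",
            tags=["methods"],
            comments=[
                Comment("alice", "Why was this normalization chosen?"),
                Comment("bob", "Possibly to match the earlier paper."),
            ],
        ),
        Annotation(
            anchor="para-5",
            kind="link",
            tags=["background"],
            url="https://example.org/related-post",
        ),
    ],
)

for annotation in doc.with_tag("methods"):
    print(annotation.anchor, "|", annotation.summary())
```

The hard parts (anchoring to text that may be reflowed or to regions of a PDF, permissions, moderation) are not addressed here at all.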

A tool like this would be useful not just for journal club-style discussion of papers, but also as a teaching and editing tool. Authors could collaboratively comment on a paper, or learn from others' comments after it is published and made the focus of such a discussion. Readers and students would benefit from the additional linked resources and learn from the discussions how to critique a paper. And the annotated document would be available to everyone long after discussion has tapered off.

Of course, there are potentially many technical, legal, social, *al issues surrounding this, but I think some kind of tool along this vein would be useful and interesting. Does anyone know of any tools that do these things already? If not, I am already looking into what it would take to develop it, and would appreciate tips, suggestions, warnings...

Tuesday, March 11, 2008

PSB proposal accepted for a workshop

The proposal we submitted for a session on Open Science at the Pacific Symposium on Biocomputing was accepted! They notified us yesterday that it would be included as a 3-hour workshop on the first day of the conference. Many thanks to all those who sent encouragement or letters of support - I'm sure it went a long way towards convincing the conference organizers that we are serious!

A call for participation will come out shortly, but I just wanted to get the good news out!

Monday, March 3, 2008

Anatomy of a Ph.D. thesis

Let's face it: life is complicated. But thanks to the ever-flourishing DIY industry (for example, WikiHow), a lot of endeavors that used to seem complicated are made much less so through step-by-step instructions. In science, experimental protocols already do this, at least in theory, but what about other aspects of science, like writing papers, keeping up with literature, making presentations, or networking at conferences? My advisor has given informal talks for his students on a number of these topics, the latest of which was a set of general guidelines for writing a Ph.D. thesis.

In order of their appearance in the final document...
  • Chapter 1 - Introduction. This is essentially an executive summary. You should briefly describe all contributions your thesis makes to your field, provide at least one "gee-whiz" result, and lay out a roadmap for the rest of the thesis ("In Chapter 2, I present the background... In Chapter 3, I discuss my work on X...."). It is acceptable to make claims without proof, since you will be defending these later on.
  • Chapter 2 - Background. This is essentially a literature review, and demonstrates your understanding of the field and the context surrounding your work. For bioinformatics theses, this covers both the biomedical domain and the area of informatics or computation your work involves. You should present an intellectual framework in which your work fits - what has been done, the advantages and limitations of this previous work, the potential avenues for improvement, and where you come in. Ideally, this chapter could be published as a review article with very little modification.
The next couple chapters are the meat of the thesis, and can take at least two forms depending on what kind of work you did during your Ph.D. If you worked on several somewhat disjoint projects and published 2 or 3 papers on them, you can write one chapter for each paper (but no more than 3). If you worked on just one problem, you are probably better off writing a chapter for the methods and a chapter for the results and discussion (if you developed two approaches for the same problem, you can repeat this for the second approach). So:
  • GENERAL THEME: Several projects
    Chapter 3 - Methods, Results, and Discussion from paper 1
    Chapter 4 - Methods, Results, and Discussion from paper 2
    (Chapter 5 - Methods, Results, and Discussion from paper 3, if applicable)
  • FOCUSED THEME: Single project
    Chapter 3 - Methods
    Chapter 4 - Results/Discussion
    (Chapters 5 and 6 - Methods and Results/Discussion for approach 2, if applicable)

In general, you do not want to reuse text from your published papers verbatim, despite how tempting this can be. Papers are very strict and limit what you can express, so you should see your thesis as an opportunity to pontificate and give voice to your ideas. You should also form your thesis into a detailed guide of everything you tried, even some of the things that didn't work, so that it can be a reference to future generations of grad students who may pursue extensions of your research.
  • Chapter 6 or 7, depending on type of thesis - Summary chapter. Describe overall contributions to the relevant domains. (For biomedical informatics theses, describe the overall contributions to biology or medicine, and the overall contributions to informatics or engineering. If applicable, you may also describe core contributions to computer science.) Here is where you also discuss the limitations of the work, the unsolved problems, and your best ideas for how to solve them.

  • Appendices - supplementary material. Almost anything goes, but you should definitely include all key data and datasets (information needed to recreate the major results from your thesis). Ideally, all data relevant to your thesis (and other related work, if possible) will be stored and/or made available either on the web or as a physical copy, though this is mostly for the advisor as a reference for future students. If you have any proofs or supplementary material, these should be in an appendix. You can also include additional work or papers published unrelated to your thesis.

So that explains what each chapter of the thesis should be about; what about actually writing the thesis? My advisor's recommendation is to start with the meat chapters (Ch. 3 - 5/6) since you should already have pretty much all the necessary material, then write Chapter 2, then write the first and last chapters.

More specific advice on how to actually write each chapter was not covered and probably warrants its own post. Note that this is my advisor's take on the Ph.D. thesis; I'm sure there are some other interpretations, which would be interesting to hear! How much does the thesis vary by field?