Saturday, December 16, 2017

The National Plan Open Science Estafette: my own first Open Science steps

Noot: als je liever Nederlands leest, lees dan het origineel.
Earlier this year Delft hosted a meeting for Dutch scholars aimed at hearing and learning about, and giving feedback on, the National Plan Open Science (doi:10.2777/061652). I'm very happy I have been able to contribute to this effort, because more and better access to knowledge is very dear to me. During lunch everyone could demonstrate their own Open Science. From this the idea evolved to have a relay race ("estafette"). In each leg of the relay someone tells their Open Science story. This post is the start: every next runner tells their story on what role Open Science has in their research. And it does not matter if the focus is on Open Data, Open Access, or Open Source, because the diversity in the Dutch Open Science community is just very high.

My Open Science story goes back to the time that I was studying chemistry at what is now called the Radboud Universiteit. Chemistry students could get access to the internet in 1994, and this opened a world of Open knowledge to me! Our library was well stocked, but I still had to visit research departments to read certain journals. It was always uncomfortable, as a young student, to walk into a coffee room full of senior researchers.

I learned HTML and later Java. Java, with its applets, brought the internet to life. It could visualize 3D models of chemical structures. A paper journal cannot do that. Twenty years later journals still don't have this functionality, but that's not the point. In those three, four years I got introduced to three projects, each Open Source: Jmol (now JSmol), JChemPaint, and the file format "Chemical Markup Language" (CML). The first visualized 3D structures on the internet and the second visualized 2D chemical diagrams. CML was a format in which I could store both 2D and 3D coordinates. But the problem was that neither Jmol nor JChemPaint could read CML.

But that's where Open Science comes in. After all, I could download the Jmol and JChemPaint source code, change it, and share that with others. That was brilliant! And I dived in. Of course, I could have just used my changes myself, but because I realized they could benefit others too, I sent my changes ("patches") to the authors of Jmol and JChemPaint. I was extremely happy and proud when the two researchers, from Germany and the U.S.A., included those patches in their versions!

And in the end it was not in vain. In the final year of my chemistry study I submitted an abstract to an international conference. It got accepted! But now I had to go to Washington (Georgetown, to be precise) to talk about my work. On top of that, we agreed to meet the authors of Jmol and JChemPaint in South Bend, where we laid the foundation of a new Open Science project, the Chemistry Development Kit (CDK). An expensive trip, but fortunately I got a bursary from a Dutch company. A peculiar trip it was. We took an Amtrak sleeper train and had dinner with a soldier who served during D-Day. In New York I stepped off the sidewalk onto the street to evade a group of scary tough guys (who turned out to be a popular boy band), and we stood in the WTC (a year before 9/11) to hear two tourists ask at the musical ticket desk what Broadway was.

I am proud that I have been able to contribute to these Open Science projects and that I co-founded the CDK. The Open nature of these projects has had a significant impact and, after twenty years, still does. Sure, it's not the same as discovering a new protein or metabolite, but these projects definitely benefited more than just my own research. Of course, also with a huge thanks to Hens Borkent, Dick Wife, Dan Gezelter, Christoph Steinbeck, and Peter Murray-Rust.

BTW, thinking about this relay race, Open Science itself is also a relay race: you take the token of the people before you, adopt the token, and pass on the token to the next scientist. And every day the token gets brighter!

This Nationaal Plan Open Science Estafette also continues. I am delighted to pass my token to Rosanne Hertzberger. Read her story here or in Dutch.

Friday, December 15, 2017

Suggestions for ScholarlyHub

Mock Dashboard of ScholarlyHub.
(I'll assume CC-BY for this image.) 
ScholarlyHub is an open scholar profile project. I have no idea yet where this platform is going, but its planned open source nature makes me want to explore it nevertheless. The project is currently running a crowdfunding campaign and developing its plans. They asked for feedback, so here goes:

Feature requests:
  • researchers care about research, more than profiles: make things from their research ("topics") part of their profile; let them tell everyone what they are interested in
  • the website should have an API (good looks alone are not enough). Have you done a persona analysis? User friendly is only defined once you have defined your users.
  • make the resource FAIR: use RDF or RDFa
  • show innovation in new scholarly activities: provide peer review functionality, etc. (similar to Publons, PubPeer, PubMed Commons, etc.)
  • support data and software citations
  • use identifiers (DOI, ORCID, project IDs (CORDIS, etc), etc)
  • integration of I4OC
  • freely provide #altmetrics
  • release soon, release often
  • use RSS for any bit of information on the site (one form of API, in fact)
  • integrate my social feeds into my profile (Twitter, blog, LinkedIn, etc)
You can browse my blog for other features I have recommended to websites in the past. You can also check Scholia for ideas.
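One of the wished-for features above, RSS feeds for every bit of information, is cheap to provide. Here is a minimal sketch in Python using only the standard library; all names, URLs, and event fields are hypothetical, not an actual ScholarlyHub API:

```python
import xml.etree.ElementTree as ET

def profile_feed(author, events):
    """Build a minimal RSS 2.0 feed for profile events.

    'events' is a list of (title, link, pubdate) tuples; everything here
    is made up for illustration, not a real ScholarlyHub data model.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Profile updates for " + author
    for title, link, pubdate in events:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "pubDate").text = pubdate
    return ET.tostring(rss, encoding="unicode")

feed = profile_feed("E. Willighagen", [
    ("New preprint posted", "https://example.org/item/1",
     "Fri, 15 Dec 2017 10:00:00 GMT"),
])
print(feed)
```

A site that exposes feeds like this for profiles, topics, and reviews instantly gets "one form of API" for free.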

New paper: "Integration among databases and data sets to support productive nanotechnology: Challenges and recommendations"

Figure 1 from the NanoImpact article. CC-BY.
The U.S. and European nanosafety communities have a longstanding history of collaboration. On both sides there are working groups, the NanoWG and WG-F (previously called WG4) of the NanoSafety Cluster. I chaired WG4 for about three years and am still active in the group, though less so in the past half year, without dedicated funding. That is already changing again with the imminent start of the NanoCommons project.

One of these collaborations resulted in a series of papers around data curation (see doi:10.1039/C5NR08944A and doi:10.3762/bjnano.6.189). Part of this effort was also a survey about the state of databases. A good number of databases responded to the call. Analysing the results and writing up a report with recommendations turned out to be non-trivial. The first version was submitted and rejected; with fresh leadership, the paper underwent a significant restructuring by John Rumble, was resubmitted to Elsevier's NanoImpact, and is now online (doi:10.1016/j.impact.2017.11.002).

The paper gives an overview of challenges and recommendations to the community on how to proceed. That is, basically, how projects like eNanoMapper, caNanoLab, and the Nanomaterial Registry should evolve, and what the European Union Observatory for Nanomaterials (EUON) might look like. BTW, a similar paper by Tropsha et al., with a focus on the U.S. database ecosystem, was published the other week (doi:10.1038/nnano.2017.233).

Have fun reading it, and if you are working in a related field, please join either of the two aforementioned working groups! And a huge thanks to everyone involved, in particular Sandra, John, and Christine.

Saturday, December 09, 2017

every house a library

Two weeks ago someone made me aware of a new social profile page that allows you to digitize a list of your books. That in itself is already very useful. In the past I have made some efforts to create overviews of the books I own, if only to get some count, with respect to insurance, etc. But a very neat feature is the ability to find other people in your neighborhood with whom you may exchange books: for each book you can indicate who can see that you own it (no one, by default; if I'm not mistaken, inheriting the user/group/others approach of UNIX file permissions), and if others can borrow or buy the book.

This map shows users with public profiles in The Netherlands; the uptake has not been massive yet, but I'm hoping this post changes that :)

But the killer feature I am waiting for is a map like this, but for books. Worldcat has a feature that lists the nearest library holding a copy of the book you are looking for:

And such a feature (map or list) would be brilliant, for then every house would potentially be a library. That idea appeals to me.

Oh, and it's Wikidata-based but that should hardly be a surprise.

Sunday, December 03, 2017

De Nationaal Plan Open Science Estafette: mijn eerste Open Science stappen

Note: For my English-only followers, I translated this to English.
Eerder dit jaar werd er in Delft een meeting voor Nederlandse onderzoekers georganiseerd om te horen over en feedback te geven op het Nationaal Plan Open Science (doi:10.2777/061652). Ik prijs me rijk dat ik hieraan heb kunnen en mogen meedenken, want meer toegankelijkheid tot wetenschappelijke kennis ligt me aan het hart. Tijdens de lunch konden mensen hun Open Science laten zien. Hieruit vloeide het idee voort om een estafette op te zetten om zo veel mogelijk te laten zien hoe veel Open Science er eigenlijk al in Nederland gedaan wordt. Hierbij de start: elke volgende estafetteloper vertelt iets over hun Open Science onderzoek. En of de focus nu op Open Data, Open Access, of op Open Source is maakt niet zo veel uit. Want de diversiteit in het Nederlandse Open Science-gemeenschap is nu eenmaal groot.

Mijn Open Science-verhaal gaat terug naar de tijd dat ik student scheikunde was aan wat nu de Radboud Universiteit heet. Wij kregen daar in 1994 toegang tot het internet, en dit opende voor mij een wereld van Open kennis! Onze bibliotheek was goed gevuld, maar soms moest ik naar afdelingen om daar bepaalde tijdschriften in te kunnen kijken. Altijd ongemakkelijk om een koffiekamer met senior onderzoekers binnen te lopen als student.

Ik leerde HTML en later Java. Java, met hun applets, brachten het internet tot leven. Het kon 3D modellen van chemische structuren laten zien. Dat kan een tijdschrift niet. Twintig jaar later kunnen tijdschriften dat nog steeds niet, maar dat terzijde. In de drie, vier jaren daarna leerde ik drie projecten kennen, elk Open Source: Jmol (nu JSmol), JChemPaint, en het bestandsformaat "Chemical Markup Language" (CML). De eerste was om 3D structuren te laten zien op het internet en het tweede visualiseerde twee-dimensionale (2D) chemische structuren. CML was een formaat waarin ik zowel 2D- en 3D-coördinaten kon opslaan. Maar het ding was dat Jmol en JChemPaint helemaal geen CML konden lezen.

Maar daar kwam Open Science om de hoek kijken. Immers, ik kon van Jmol en JChemPaint de broncode downloaden, aanpassen, en delen met anderen. Ik was overtuigd! En ging aan de slag. Natuurlijk had ik mijn aanpassingen gewoon zelf kunnen gebruiken, maar omdat ik dacht dat het voor anderen misschien ook wel bruikbaar zou kunnen zijn, stuurde ik mijn aanpassingen ("patches") naar de auteurs van Jmol en JChemPaint. Dolgelukkig en trots was ik toen de wetenschappers in Duitsland en de Verenigde Staten het in hun versie opnamen!

En het heeft me allemaal geen windeieren gelegd. In mijn laatste jaar van mijn studie heb ik een abstract naar een internationale conferentie ingestuurd. Die werd geaccepteerd en dus moest ik naar Washington (Georgetown, om precies te zijn), om wat over mijn werk te vertellen. Maar bovendien had ik met de auteurs van Jmol en JChemPaint afgesproken in South Bend, waar we de basis gelegd hebben voor een nieuw Open Science project, de Chemistry Development Kit (CDK). Duur, maar gelukkig kreeg ik een beurs van een Nederlands bedrijf. Een wonderlijke reis was het. In de slaaptrein avondeten met een soldaat die tijdens D-Day in actie is geweest, in New York van de stoep stappen omdat er zware jongens aankomen (die een bekende boys band blijken te zijn), en in het WTC staan (een jaar voor 9/11) en horen hoe toeristen bij de musicalticketverkoop vragen "What is broadway?".

Ik ben trots dat ik aan deze Open Science projecten heb kunnen bijdragen en dat ik medeontwerper ben van de CDK. Door hun Open karakter hebben deze projecten een flinke impact gehad, en die hebben ze na twintig jaar nog steeds. Natuurlijk, het is niet het ontdekken van een nieuw eiwit of metaboliet, maar deze projecten hebben zeker niet alleen mijn onderzoek geholpen. Met dank aan anderen, natuurlijk: Hens Borkent, Dick Wife, Dan Gezelter, Christoph Steinbeck, en Peter Murray-Rust.

Trouwens, over estafettes gesproken, Open Science is op zichzelf ook een estafette: je neemt het stokje over van de mensen voor je, geeft er je eigen draai aan, en geeft daarna het stokje weer door aan de volgende. En het stokje wordt elke dag mooier!

Deze Nationaal Plan Open Science Estafette gaat ook verder. Ik mag mijn stokje doorgeven aan Rosanne Hertzberger. Ik ben superbenieuwd naar haar Open Science verhaal! En natuurlijk van alle lopers die daarna aan de estafette deelnemen!

Sunday, November 26, 2017

Winter solstice challenge: what is your Open Knowledge score?

Source: Wikimedia, CC-BY 2.0.
Hi all, welcome to this winter solstice challenge! Umm, so as not to put our southern hemisphere colleagues at a disadvantage, as their winter solstice has already passed: you're up for a summer solstice challenge!

So, you know ImpactStory and (if not, browse my blog); these are wonderful tools to see what people are doing with your work. I hope you already know about OpenCitations, a collaboration of publishers, CrossRef, and many others, to make all citation data available. They just passed the 50% milestone, congratulations on that amazing achievement! For the younger scientists it may be worth realizing that for the past 20 years, at least, this data was copyrighted and not to be used unless you paid. Elsevier is, BTW, the major culprit still claiming IP on this, but RT this if you are surprised.

So, the reason I introduce both ImpactStory and OpenCitations is the following. Scientific articles are data and knowledge dense documents, and they would be even denser if we did not redirect the reader to other literature. Those cited works may give a more complete sketch of the context, describe a measurement protocol, describe how certain knowledge was derived, etc. Therefore, just having your article Open Access is not enough: the articles you cite should be Open Access too. That's the next phase of really making an effort to have all of humanity benefit from the fruits of science.

I know it is hard already to calculate an "Open Access" score, though ImpactStory does a great job at that! Calculating this for your papers and for the papers those papers cite is even harder. You may need to brush up your algorithm and programming skills.

Anyone is allowed to participate. Submission of your entry is done online, e.g. in your blog, in a public write up, or even an open notebook! However, you need at least one citable research object. That is, it needs a DOI. Otherwise, I cannot give you the prize (see below). The score should be based on all your products. Bonus points for those who include software and data citations. Excluding citable objects to boost your score (for example, I would have to exclude my book chapters) is seen as cheating the system.

Your article B may cite three articles (C, D, J), but article D also cites articles (F, I).
So, your Open Knowledge score is recursive.
Source: Wikipedia, CC-BY-SA 4.0
Calculating your Open Knowledge score can be done at multiple levels. After all, your article cites articles and your software depends on libraries, but those cited articles and software dependencies recursively also cite articles and/or software. The complexity is non-trivial, making it a perfect solstice challenge indeed!
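For the ambitious, here is one possible (entirely made-up) scoring rule, sketched in Python: average a work's own openness with the mean score of the works it cites, down to a fixed depth. The citation graph and openness flags below are toy data; in practice they could come from OpenCitations and an OA-status service:

```python
# Toy citation graph matching the figure above (B cites C, D, J; D cites F, I)
# and invented openness flags. Real inputs would come from OpenCitations
# (citations) and an Open Access lookup service (openness).
CITES = {
    "B": ["C", "D", "J"],
    "D": ["F", "I"],
}
IS_OPEN = {"B": True, "C": True, "D": False, "J": True, "F": True, "I": False}

def open_knowledge_score(doi, depth=3, _seen=None):
    """One possible recursive Open Knowledge score: the average of this
    work's own openness and the mean score of the works it cites."""
    if _seen is None:
        _seen = set()
    if depth == 0 or doi in _seen:
        return 1.0 if IS_OPEN.get(doi, False) else 0.0
    _seen = _seen | {doi}
    own = 1.0 if IS_OPEN.get(doi, False) else 0.0
    cited = CITES.get(doi, [])
    if not cited:
        return own
    cited_score = sum(open_knowledge_score(c, depth - 1, _seen)
                      for c in cited) / len(cited)
    return (own + cited_score) / 2

print(round(open_knowledge_score("B"), 3))  # → 0.875
```

With this rule, B's score (0.875) is dragged down by the closed article D, which itself is penalized for citing the closed article I. Any other sensible weighting is fair game for the challenge, of course.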

The prize I have to offer is my continued commitment to Open Science, but that you already get for free, so it may not be enough of a boon. So, instead, soon after the winter/summer solstice at the end of this year, I will blog about your research, boosting your #altmetrics scores. Yes, I will actually read it and try to understand it!

And because there are the results and the method, neither of which exists yet, there are two categories! I just doubled your chance of winning! That's because humanity is worth it! One prize for the best tool to calculate your Open Knowledge score, and one prize for the researcher with the highest score.

Audience Prize
If someone feels a need to organize an audience prize, this is very much encouraged! (Assuming Open approaches, of course :)

Wednesday, November 22, 2017

Monitoring changes to Wikidata pages of your interest

Source: User:Cmglee, Wikipedia, CC-BY-SA 3.0
Wikidata is awesome! In just 5 years they have bootstrapped one of the most promising platforms for the future of science. Whether you like the tools more, or the CCZero licensing, there is coolness for everyone. I'm proud to have been able to contribute my small 1x1 LEGO bricks to this masterpiece and hope to continue this for many years to come. There are many people doing awesome stuff, and many have way more time, better skills, etc. Yes, I'm thinking here of Finn, Magnus, Andra, the whole Su team, and many, many more.

The point of this post is to highlight something that matters, something that comes up over and over again, and for which solutions exist, like the one implemented by Wikidata: provenance. We're talking a lot about FAIR data. Most of FAIR data is not technological, it's social. And most of the technical work going on now is basically there to overcome those social barriers.

We teach our students to cite the primary literature and only that. There is a clear reason for that: the primary literature has the arguments (experiments, reasoning, etc.) that back a conclusion. Not just any citation is good enough: it has to be exactly the right shape (think of that LEGO brick). This track record of our experiments is a wonderful and essential idea. It removes the need for faith and even trust. Faith is for the religious, trust is for the lazy. Now, without being lazy, it is hard to make progress. But as I have said before (Trust has no place in science #2), every scholar should realize that "trust" is just a social equivalent of saying you are lazy. There is nothing wrong with being lazy: a side effect of it is innovation.

Ideally, we do not have to trust any data source. If we must, we just check where that source got its data from. That works for scholarly literature, and works for other sources too. Sadly, scholarly literature has a horrible track record here: we only cite stuff we find more trustworthy. For example, we prefer to cite articles from journals with high impact factors. Second, we don't cite data. Nor software. As a scholarly community, we don't care much about that (this is where lazy is evil, btw!).

Wikidata made the effort to make a rich provenance model. It has a rich system of referring to information sources. It has version control. And it keeps track of who made the changes.

Of all the awesomeness of Wikidata, Magnus is one of the people who know how to use that awesomeness. He developed many tools that make doing the right thing a lot easier. I'm a big fan of his SourceMD, QuickStatements, and two newer tools, ORCIDator and SPARQL-RC. The latter tool leverages SPARQL (and thus the Wikidata RDF) and the version control system. By passing it a query, it will list all changes in a given time period. I am still looking for a tool that can show me all changes for items I originally created, but this already is a great tool to monitor the quality of crowdsourcing for data in Wikidata I care about. No trust, but the ability to verify.
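The idea behind such monitoring can be sketched offline: take revision metadata, as the MediaWiki API returns it for `action=query&prop=revisions`, and keep only the revisions inside a time window (SPARQL-RC does this server-side for every item in a SPARQL result). A minimal sketch with made-up sample revisions:

```python
from datetime import datetime

def changes_in_period(revisions, start, end):
    """Return the revisions whose timestamp falls within [start, end].

    'revisions' mimics the shape of the MediaWiki API output for
    action=query&prop=revisions (timestamp, user, comment); the sample
    data below is invented for illustration.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    s, e = datetime.strptime(start, fmt), datetime.strptime(end, fmt)
    return [r for r in revisions
            if s <= datetime.strptime(r["timestamp"], fmt) <= e]

revs = [
    {"timestamp": "2017-11-20T09:00:00Z", "user": "SomeBot",
     "comment": "added a reference"},
    {"timestamp": "2017-11-23T12:00:00Z", "user": "Anonymous",
     "comment": "changed label"},
]
hits = changes_in_period(revs, "2017-11-21T00:00:00Z", "2017-11-25T00:00:00Z")
print(len(hits))  # → 1: only the label change falls in the window
```

The point is the same as SPARQL-RC's: no trust needed, just filter the recorded history and look at what actually changed.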

Here's a screenshot of the changes to (some of) the scientific output I am an author of:

Sunday, November 12, 2017

New paper: "WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research"

Focus on metabolic pathways increases the
number of annotated metabolites, further improving
the usability in metabolomics. Image: CC-BY.
TL;DR: the WikiPathways project (many developers in the USA and Europe, contributors from around the world, many people curating content, etc.) has published a new paper (doi:10.1093/nar/gkx1064), with a slight focus on metabolism.

Full story
Almost six years ago my family and I moved back to The Netherlands for personal reasons. Workwise, I had a great time in Stockholm and Uppsala (two wonderful universities; thanks to Ola Spjuth, Bengt Fadeel, and Roland Grafström), but being an immigrant in another country is not easy, not even for a western immigrant in a western country. ("There is evil among us.")

We had decided to return to our home country, The Netherlands. By sheer coincidence, I spoke with Chris Evelo in the week directly following that weekend. I had visited his group in March that year, while attending a COST action about NanoQSAR in Maastricht. I had never been to Maastricht University before, and this group, with their Open Source and Open Data projects, particularly WikiPathways, would give us enough to talk about. Chris had a position open on the Open PHACTS project. I was interested, applied, and ended up in the European WikiPathways group led by Martina Kutmon (the USA node is the group of Alex Pico).

Fast forward to now. It was clear to me that biological text book knowledge was unusable for any kind of computation or machine learning. It was hidden, wrongly represented, and horribly badly annotated. In fact, it still is a total mess. WikiPathways offered machine readable text book knowledge. Just what I needed to link the chemical and biological worlds. The more accurate biological annotation we put in these pathways, or semantically link to these pathways, the more precise our knowledge becomes and the better computational approaches can find and learn patterns not obvious to the human eye (it goes both ways, of course! Just read my PhD thesis.)

Over the past 5-6 years I got more and more involved in the project. Our Open PHACTS tasks did involve WikiPathways RDF (doi:10.1371/journal.pcbi.1004989), but Andra Waagmeester (now Micelio) was the lead on that. I focused on the Identifier Mapping Service, based on BridgeDb (together with great work from Carole Goble's lab, e.g. Alasdair and Christian). I focused on metabolomics.

Indeed, there was plenty to be done in terms of metabolic pathways in WikiPathways. The database at the time had a strong focus on the gene and protein aspects of the pathways. In fact, many metabolites were not datanodes and therefore did not have identifiers. And without identifiers, we cannot map metabolomics data to these pathways. I started working on improving these pathways, and we did some projects using it for metabolomics data (e.g. a DTL Hotel Call project led by Lars Eijssen).

The point of this long introduction is that I am standing on the shoulders of giants. The top right figure shows, besides WikiPathways itself and the people I just mentioned, more giants. This includes Wikidata, which we previously envisioned as a hub of metabolite information (see our Enabling Open Science: Wikidata for Research (Wiki4R) proposal). Wikidata allows me to solve the problem that CAS registry numbers are hard to link to chemical structures (SMILES): it has some 70 thousand CAS numbers.

SPARQL query that lists all CAS registry numbers in Wikidata, along with the matching
SMILES (canonical and isomeric), database entry, and name of the compound. Try it.
A lot more about CAS registry numbers is found in my blog.
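The query behind that figure looks roughly like the sketch below (P231 is Wikidata's CAS Registry Number property, P233 the canonical SMILES, P2017 the isomeric SMILES). To keep the example self-contained, it parses a made-up sample response in the standard SPARQL JSON result shape rather than making a live call to query.wikidata.org:

```python
# SPARQL query along the lines of the one in the post; POST it to
# https://query.wikidata.org/sparql with
# Accept: application/sparql-results+json to run it for real.
QUERY = """
SELECT ?compound ?compoundLabel ?cas ?smiles ?isoSmiles WHERE {
  ?compound wdt:P231 ?cas ;
            wdt:P233 ?smiles .
  OPTIONAL { ?compound wdt:P2017 ?isoSmiles . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

# Sample response (invented, not live data) in the standard
# SPARQL 1.1 JSON results format, to show how the bindings parse:
sample = {"results": {"bindings": [
    {"cas": {"value": "50-99-7"},
     "smiles": {"value": "OCC1OC(O)C(O)C(O)C1O"},
     "compoundLabel": {"value": "glucose"}},
]}}
rows = [(b["cas"]["value"], b["compoundLabel"]["value"])
        for b in sample["results"]["bindings"]]
print(rows)  # → [('50-99-7', 'glucose')]
```

With a mapping like this in hand, a CAS number from an old data set resolves to a SMILES and on to the full Wikidata item.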
Finally, but certainly not least, is Denise Slenter, who started this spring in our group. She picked up things I and others were doing very quickly (for example this great work from Maastricht Science Programme students), gave those her own twist, and is now leading the practical work in taking this to the next level. This new WikiPathways paper shows the fruits of her work.

Of course, there are plenty of other pathway databases. KEGG is still the gold standard for many. And there is the great work of Reactome, RECON, and many others (see the references in the NAR article). Not to mention the important resources that integrate pathway resources. To me, unique strengths of WikiPathways include the community approach, the very liberal licence (CCZero), the many collaborations (do we have a slide on that?), and, importantly, its expressiveness. The latter allows our group to do the systems biology work that we do: analyzing microRNA/RNASeq data, studying diseases at a molecular interaction level, seeing the effects of personal genetics (SNPs, GWAS), and visually integrating and summarizing the combination of experimental data and text book knowledge.

OK, this post is already long enough. And from its length, you can see how impressed I am with WikiPathways and where it is going. Clearly, there is still a lot left to do. And I am just another person contributing to the project, honored that we could give this WikiPathways paper a metabolomics spin. HT to Alex, Tina, and Chris for that!

Slenter, D. N., Kutmon, M., Hanspers, K., Riutta, A., Windsor, J., Nunes, N., Mélius, J., Cirillo, E., Coort, S. L., Digles, D., Ehrhart, F., Giesbertz, P., Kalafati, M., Martens, M., Miller, R., Nishida, K., Rieswijk, L., Waagmeester, A., Eijssen, L. M. T., Evelo, C. T., Pico, A. R., Willighagen, E. L., Nov. 2017. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Research.

Sunday, October 29, 2017

Happy Birthday, Wikidata!

Wikidata celebrates its 5th birthday with a great WikidataCon in Berlin. Sadly, I could not join in person, so I am assuming it is a great meeting, following the #WikidataCon hashtag and occasionally the live stream.

Happy Birthday, Wikidata!

My first encounter was soon after they started; I was particularly impressed by the presentation by Lydia Pintscher at the Dutch Wikimedia Conferentie 2012. I had played with DBPedia occasionally, but was always disappointed by the number of issues with extracting chemistry from the ChemBox infobox; that's of course the general problem with data that has been mangled into something that looks nice. We know that problem from text mining from PDFs too. Of course, if you start with something machine readable in the first place, your odds of success are much higher.

Yesterday, Lydia showed the State of Wikidata, and I think they delivered on their promise.

I did not create my Wikidata account until a year later and did not use the account much in the first two years. But the Wikidata team did a lot of great work in their first three years, and somewhere in 2015 I wrote my first blog post about Wikidata. That year Daniel Mietchen also asked me to join the writing of a project proposal (later published in RIO Journal). The reasons for more actively adopting Wikidata and joining Daniel's writing team were the CCZero license and the fact that chemical identifiers had really picked up. Indeed, free CAS numbers were an important boon. Since then, I have been using Wikidata as a data source for our BridgeDb project and for WikiPathways (together with Denise Slenter). I also have to mention that the work by Andra Waagmeester and the rest of the Andrew Su team gave me extra support to push Wikidata in our local research agenda around FAIR data.

The Wikidata RDF export and SPARQL end point were an important tipping point. They make reuse of Wikidata so much easier. Integrating slices of data with curl is trivial and easy to integrate into other projects, as I do for BridgeDb. Someone in the education breakout session mentioned that you can use the interactive SPARQL end point even with people who have zero programming experience. I wholeheartedly agree. That is exactly what I did last Thursday at the Surf Verder bouwen aan Open Science seminar. The learning curve with all the example queries is so shallow that it is generally accessible.

And then there is Scholia. What do I need to say? Impressive project by Finn Nielsen to which I am happy to contribute. Check out his WikidataCon talk. Here I am contributing to the biology corner and working on RSS feeds. It makes a marvelous tool to systematically analyze literature, e.g. for the Rett Syndrome as disease or as topic.

Wikidata has evolved into a tremendously useful resource in my biology research and I cannot imagine where we will be next year, at the sixth Wikidata birthday. But it will be huge!

Sunday, October 15, 2017

Two conference proceedings: nanopublications and Scholia

The nanopublication conference article in
It takes effort to move scholarly publishing forward. And the traditional publishers have not all shown themselves to be good at that: we're still basically stuck with machine-broken channels like PDFs and ReadCubes. They all seem to love text mining, but only if they can do it themselves.

Fortunately, there are plenty of people who do like to make a difference and like to innovate. I find this important, because if we do not do it, who will? Two researchers who make that effort recently published their work as conference proceedings: Tobias Kuhn and Finn Nielsen. And I am happy to have been able to contribute to both efforts.

Tobias works on nanopublications, which innovate how we make knowledge machine readable. And I have stressed in my blog for years how important this is. Nanopublications describe how knowledge is captured, make it FAIR, and, importantly, link the knowledge to the research that led to it. His recent conference proceedings paper details how nanopublications can be used to establish incremental knowledge. That is, given two sets of nanopublications, it determines which have been removed, added, and changed. The paper continues by outlining how that can be used to reduce, for example, download sizes and how it can help establish an efficient change history.
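The core of that incremental comparison can be sketched as a set operation. A toy version, assuming each nanopublication has a stable identifier plus a content hash (a simplification of the content-addressed identifiers the actual approach builds on):

```python
def diff_nanopub_sets(old, new):
    """Classify nanopublications as added, removed, or changed between
    two releases. 'old' and 'new' map a stable nanopub identifier to a
    content hash; the identifiers and hashes below are invented."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, changed

old = {"np1": "aaa", "np2": "bbb", "np3": "ccc"}
new = {"np2": "bbb", "np3": "ddd", "np4": "eee"}
print(diff_nanopub_sets(old, new))  # → (['np4'], ['np1'], ['np3'])
```

A consumer that already has the old release then only needs to download the added and changed nanopublications, which is exactly the kind of saving the paper quantifies.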

And Finn developed Scholia, an interface not unlike Web-of-Science, but based on Wikidata and therefore fully on CCZero data. And, with a community actively adding the full history of scholarly literature and the citations between papers, courtesy of the Initiative for Open Citations. This is opening up a lot of possibilities: from keeping track of articles citing your work, to getting alerts of articles publishing new data on your favorite gene or metabolite.

Kuhn T, Willighagen E, Evelo C, Queralt-Rosinach N, Centeno E, Furlong L. Reliable Granular References to Changing Linked Data. In: d'Amato C, Fernandez M, Tamma V, Lecue F, Cudré-Mauroux P, Sequeda J, et al., editors. The Semantic Web – ISWC 2017. vol. 10587 of Lecture Notes in Computer Science. Springer International Publishing; 2017. p. 436-451. doi:10.1007/978-3-319-68288-4_26

Nielsen FÅ, Mietchen D, Willighagen E. Scholia and scientometrics with Wikidata.; 2017. Available from:

Sunday, October 08, 2017

CDK used in SIRIUS 3: metabolomics tools from Germany

Screenshot from the SIRIUS 3 Documentation.
License: unknown.
It has been ages since I blogged about work I heard about and think should receive more attention. So, I'll try to pick up that habit again.

After my PhD research (about machine learning (chemometrics, mostly), crystallography, and QSAR) I first went into the field of metabolomics, because it combines core chemistry with the complexity of biology. My first position was with Chris Steinbeck, in Cologne, within the bioinformatics institute led by Prof. Schomburg (of the BRENDA database). During that year, I worked in a group that worked on NMR data (NMRShiftDb, dr. Stefan Kuhn), Bioclipse (a collaboration with Ola Spjuth), and, of course, the Chemistry Development Kit (see our new paper).

This new paper actually introduces functionality that was developed in that year, for example, work started by Miquel Rojas-Cheró. This includes the work on atom types, which we needed to handle radicals, lone pairs, etc., for delocalisation. It also includes work around handling molecular formulas and calculating molecular formulas from (accurate) molecular masses. For the latter, more recent work improved even further on the earlier work.
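That formula-from-mass problem is fun to sketch: brute-force over element counts and keep the formulas whose monoisotopic mass falls within a tolerance of the target. The toy version below is restricted to C, H, N, and O and is nothing like the CDK's actual (much smarter) implementation:

```python
# Monoisotopic masses of the most abundant isotopes.
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def _fmt(c, h, n, o):
    return "".join(f"{el}{ct}" for el, ct in
                   (("C", c), ("H", h), ("N", n), ("O", o)) if ct > 0)

def formulas_for_mass(target, tol=0.001, max_c=40):
    """Brute-force CHNO formulas matching an accurate mass.

    For each C/N/O count, the hydrogen count is the nearest integer
    that fits the remaining mass; the candidate is kept if the total
    mass is within 'tol' of the target.
    """
    hits = []
    for c in range(max_c):
        for n in range(6):
            for o in range(15):
                rest = target - c * MASS["C"] - n * MASS["N"] - o * MASS["O"]
                h = round(rest / MASS["H"])
                if h < 0:
                    continue
                mass = (c * MASS["C"] + h * MASS["H"]
                        + n * MASS["N"] + o * MASS["O"])
                if abs(mass - target) <= tol:
                    hits.append(_fmt(c, h, n, o))
    return hits

# Monoisotopic mass of glucose, C6H12O6:
print(formulas_for_mass(180.063388))
```

Real implementations prune this search with chemical rules (valence, ring-plus-double-bond counts, isotope patterns), which is exactly where the recent improvements come in.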

So, whenever metabolomics work that uses the CDK is published, I realize that what the CDK does has impact. This week Google Scholar alerted me to a user guidance document for SIRIUS 3 (see the screenshot). It seems really nice (great) work from Sebastian Böcker et al.!

It also makes me happy, as our Faculty of Health, Medicine, and Life Sciences (FHML) is now part of the Netherlands Metabolomics Center, and we recently published an article with our vision of a stronger, more FAIR European metabolomics community.

Wednesday, October 04, 2017

new paper: "The future of metabolomics in ELIXIR"

CC-BY from F1000 article.
This spring I attended a meeting organized by researchers from the European metabolomics community, including people from PhenoMeNal, to talk about proposing a use case to ELIXIR. Doing research in metabolomics and being part of ELIXIR, I was happy that meeting happened. During the meeting I presented the work from our BiGCaT group (e.g. WikiPathways, see doi:10.1093/nar/gkv1024).

During the meeting various metabolomics topics were discussed, and I pushed for interoperability of chemical (metabolic) structures, which requires structure normalization, equivalence testing, etc. You know, the kind of work that partners in Open PHACTS did, and that we're now trying to bootstrap with ChemStructMaps. It did not make it, but the ideas are included in the selected topic.

All this you can read in this meeting write up, peer-reviewed in F1000Research (doi:10.12688/f1000research.12342.1). I am happy to have been given the opportunity to contribute to this work. The work in our group (e.g. from our PhD student Denise) can surely contribute to this community effort.

Van Rijswijk M, Beirnaert C, Caron C, Cascante M, Dominguez V, Dunn WB, et al. The future of metabolomics in ELIXIR. F1000Research. 2017 Sep;6:1649+. doi:10.12688/f1000research.12342.1.

Saturday, September 09, 2017

New paper: "RDFIO: extending Semantic MediaWiki for interoperable biomedical data management"

Figure 10 from the article showing what the DrugMet wiki
with the pKa data looked like. CC-BY.
When I was still doing research at Uppsala University, I had an internship student, Samuel Lampa, who did wonderful work on knowledge representation and logic (check his thesis). In that same period he started RDFIO, a Semantic MediaWiki extension that provides a SPARQL endpoint and some clever features to import and export RDF. As I was already using RDF in my research, and wikis are a great way to explore how to model domain data, particularly when extracted from diverse literature, I was quite interested. Together we worked on capturing pKa data, and Samuel put DrugMet online. Extracting pKa values from primary literature is laborious work and crowdsourcing did not pick up. This data was migrated to Wikidata about a year ago.

I also used the RDFIO extension when I started capturing nanosafety data from literature when I worked at Karolinska Institutet. I will soon write up this work, as the NanoWiki (check out these FigShare data releases) was a seminal data set in eNanoMapper, during which I continued adding data to test new AMBIT features.

Earlier this week Samuel's write-up of his RDFIO project was published, to which I contributed the pKa use case (doi:10.1186/s13326-017-0136-y). There are various ways to install the software, as described on the RDFIO project site. The DrugMet data, as well as the OrphaNet data from the other example use case, can also be downloaded from that site.

Lampa, S., Willighagen, E., Kohonen, P., King, A., Vrandečić, D., Grafström, R., & Spjuth, O. (2017). RDFIO: extending semantic MediaWiki for interoperable biomedical data management. Journal of Biomedical Semantics, 8 (1).

Sunday, August 27, 2017

DataCite: the PubMed for data and software

We have services like PubMed, Europe PMC, and Google Scholar to list literature. Scholia/Wikidata and ORCID are upcoming services, but for data and software there are fewer options. One notable exception is DataCite (two past blogs where I mentioned it). Plenty of caution is needed in interpreting the results, given versioning and the fact that preprints, posters, etc. are also hosted by the supported repositories (e.g. Figshare, Zenodo), but the faceted browsing based on metadata really seems to be improving.

This is what my recent "DataCite" history looks like:

And it gets even more exciting when you realize that DataCite integrates with ORCID, so that you can have it all listed on your ORCID profile.

Saturday, August 26, 2017

Updated HMDB identifier scheme

I have not found further details about it yet, but I noticed half an hour ago that the Human Metabolome Database (doi:10.1093/nar/gks1065) seems to have changed all their identifiers: they added extra zeros. The screenshot for D-fructose on the right shows how the old identifiers are now secondary identifiers. We will face a period of a few years where some resources (archives, supplementary information, other databases, etc.) still use the old identifiers.

This change has huge implications, including that mere string matching of identifiers becomes really difficult: we need to know whether an identifier uses the old scheme or the new scheme. Of course, we can see this simply from the identifier length, but our software likely needs a bit of logic ("artificial intelligence") to handle both.

I ran into the change just now, because I was working on the next BridgeDb metabolite identifier mapping database. The release of this weekend will not have the new identifiers for sure: I first need more info, more detail.

For now, if you use HMDB identifiers in your database, get prepared! Using old identifiers to link to the HMDB website seems to work fine, as they have a redirect working at the server level. But starting to think about internally updating your identifiers (by adding two zeros) is likely something to put on the agenda.
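That internal update can be as simple as the sketch below (my own minimal Python, not official HMDB or BridgeDb code), using the identifier length to tell the two schemes apart:

```python
# Old-scheme HMDB identifiers are "HMDB" plus 5 digits, new-scheme ones are
# "HMDB" plus 7 digits, so the string length distinguishes the two schemes.
def normalize_hmdb(identifier):
    if identifier.startswith("HMDB") and len(identifier) == 9:  # old scheme
        return "HMDB00" + identifier[4:]                        # insert two zeros
    return identifier                                           # assume new scheme (or not HMDB)

print(normalize_hmdb("HMDB00001"))    # prints: HMDB0000001
print(normalize_hmdb("HMDB0000001"))  # already new scheme: unchanged
```

Of course, a production mapping database would also validate that the suffix is numeric, but the length check is the essential trick.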

What about postprint servers?

Various article version types, including pre and post.
Now that preprint servers are picking up speed, let's talk about postprint servers. Sure, we have plenty of places to post and find discussions about the content of articles (e.g. PubPeer, PubMed Commons, ...), and sure we have retractions and corrections.

But what if we could just make revisions of articles?

And I'm not only talking about typo fixes, but also clarifications that show up during post-publication peer review. Not about full revisions; if a paper is wrong, this is not the method of choice. Revisions should not happen frequently either, but sometimes it is just convenient. Maybe to fix broken website URLs?

One point is that ResearchGate, Academia, Mendeley, and the like allow you to host versions, but we need to track the fixes and versioned DOIs. That metadata is essential: it is the FAIRness of the post-publication lifetime of a publication.

Thursday, August 17, 2017

Text mining literature that mention JRC representative nanomaterials

The week before a short holiday in France (nature, cycling, hiking, a touristic CERN visit; thanks to Philippe for the ViaRhona tip!), I did some further work on content-mining literature that mentions the JRC representative nanomaterials. One important reason was that I could play with the tools developed by Lars in his fellowship with The ContentMine.

I had about one day, as there is always work left over to finish in your first week of holiday, and I had several OS upgrades to do too (happily running the latest 64-bit Debian!). But, as good practice, I kept an Open Notebook, and the initial run of the workflow turned out quite satisfactory:

What we see here is content mined from literature searched with "titanium dioxide" with the getpapers tool. AMI then extracted the nanomaterials and species information. Tools developed by Lars aggregated all information into a single JSON, which I converted into input for cytoscape.js with a simple Groovy script. Yeah, click on the image, and you get the live network.
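The conversion step looked roughly like the sketch below. The real script was written in Groovy, and the JSON layout produced by Lars's aggregation tools is an assumption here (a list of paper records with "nanomaterials" and "species" fields); only the cytoscape.js "elements" structure is what the live network expects.

```python
# Hedged Python sketch (the actual script was Groovy): turn aggregated paper
# records into a cytoscape.js "elements" object linking nanomaterials to the
# species they were studied in. The input record layout is an assumption.
import json

def to_cytoscape(records):
    nodes, edges = {}, []
    for record in records:
        for material in record.get("nanomaterials", []):
            nodes[material] = {"data": {"id": material, "type": "nanomaterial"}}
            for species in record.get("species", []):
                nodes[species] = {"data": {"id": species, "type": "species"}}
                edges.append({"data": {"source": material, "target": species}})
    return {"elements": {"nodes": list(nodes.values()), "edges": edges}}

records = [{"nanomaterials": ["titanium dioxide"], "species": ["Danio rerio"]}]
print(json.dumps(to_cytoscape(records), indent=2))
```

The resulting JSON can be fed straight into a cytoscape.js instance as its `elements` option.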

So, if I find a bit of time before I get back to work, I'll also convert this output to eNanoMapper RDF for loading into the eNanoMapper database. Of course, then I will run this on other EuropePMC searches too, for other nanomaterials.

Sunday, July 30, 2017

Wikidata visualizes SMILES strings with John Mayfield's CDK Depict

SVG depiction of D-ribulose.
Wikidata is building up a curated collection of information about chemicals. A lot of the data originates from Wikipedia, but active users are augmenting this information. Of particular interest, in this respect, is Sebastian's PubChem ID curation work (he can use a few helping hands!). Followers of my blog know that I am using Wikidata as a source of compound ID mapping data for BridgeDb.

Each chemical can have one or two associated SMILES strings: a canonical SMILES, which excludes any chirality, and an isomeric SMILES, which does include chirality. Because statement values can be linked to a formatter URL, Wikidata often shows values with an associated link. For example, for the EPA CompTox Dashboard identifiers it links to that database. Kopiersperre used this approach to link to John Mayfield's CDK Depict.

Until two weeks ago, the formatter URL for both the canonical and the isomeric SMILES was the same. I changed that, so that when an isomeric SMILES is depicted, it also shows the perceived R,S (CIP) annotation. That should help further curation of Wikidata and Wikipedia content.
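Such a formatter URL might look like the sketch below. The CDK Depict base URL and the `annotate` parameter are my assumptions based on the public demo site, not copied from the Wikidata property configuration itself.

```python
# Sketch: build a CDK Depict link for a SMILES string; the isomeric variant
# adds CIP (R/S) annotation. Base URL and parameter names are assumptions.
from urllib.parse import quote

def depict_url(smiles, isomeric=False):
    url = "https://www.simolecule.com/cdkdepict/depict/bow/svg?smi=" + quote(smiles, safe="")
    if isomeric:
        url += "&annotate=cip"  # show perceived R,S annotations
    return url

# L-alanine as an isomeric SMILES
print(depict_url("C[C@H](N)C(=O)O", isomeric=True))
```

The key detail is percent-encoding the SMILES, since characters like `[`, `@`, and `#` are not URL-safe.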

Wednesday, July 05, 2017

new paper: "A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury"

Figure from the article. CC-BY.
One of the projects I worked on at Karolinska Institutet with Prof. Grafström was the idea of combining transcriptomics data with dose-response data, because we wanted to know if there was a relation between the structures of chemicals (drugs, toxicants, etc.) and how biological systems react to them. Basically, testing the whole idea behind quantitative structure-activity relationship (QSAR) modeling.

Using data from the Connectivity Map (Cmap, doi:10.1126/science.1132939) and NCI60, we set out to do just that. My role in this work was to explore the actual structure-activity relationship. The Chemistry Development Kit (doi:10.1186/s13321-017-0220-4) was used to calculate molecular descriptors, and we used various machine learning approaches to explore possible regression models. The bottom line was that it is not possible to correlate the chemical structures with the biological activities. We explored the reason and ascribe this to the high diversity of the chemical structures in the Cmap data set. In fact, the chemicals in that study were selected based on chemical diversity. All the details can be found in this new paper.
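As a stylized illustration of what "no structure-activity relationship" looks like (toy random data, not the Cmap set; the real work used CDK descriptors and several machine-learning methods), a descriptor that carries no signal about activity shows a near-zero correlation:

```python
# Toy sketch: Pearson correlation between a "descriptor" and an "activity"
# drawn independently at random, mimicking a chemically too-diverse set where
# descriptors carry no information about the activities.
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(42)
descriptor = [random.gauss(0, 1) for _ in range(1000)]
activity = [random.gauss(0, 1) for _ in range(1000)]
print(abs(pearson(descriptor, activity)))  # close to zero: no relationship
```

The actual analysis was, of course, multivariate and used proper cross-validated regression models, but the conclusion was the same: no exploitable signal.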

It's important to note that these findings do not invalidate the QSAR concept; the compounds were just very unfortunately selected, making exploration of this idea impossible, by design.

However, using the transcriptomics data and a method developed by Juuso Parkkinen, it is possible to find multivariate patterns. In fact, what we saw is more than is presented in this paper, as we have not yet been able to back further findings with supporting evidence. This paper, however, presents experimental confirmation that predictions based on this component model, coined the Predictive Toxicogenomics Gene Space, actually make sense. Biological interpretation is presented using a variety of bioinformatics analyses, but a full mechanistic description of the components is yet to be developed. My expectation is that we will be able to link these components to key events in biological responses to toxicant exposure.

 Kohonen, P., Parkkinen, J. A., Willighagen, E. L., Ceder, R., Wennerberg, K., Kaski, S., Grafström, R. C., Jul. 2017. A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury. Nature Communications 8.

Saturday, June 24, 2017

The Elsevier-SciHub story

I blogged earlier today about why I try to publish all my work gold Open Access. My ImpactStory profile shows I score 93%; note that only 10% of scientists in general score in that range. But then again, some publishers do make it hard for us to publish gold Open Access. And when the STM industry spreads FUD for their own and only their own good ("Sci-Hub does not add any value to the scholarly community.", doi:10.1038/nature.2017.22196), I get annoyed. Particularly as the system makes young scientists believe that transferring copyright to a publisher (for free, in most cases) is a normal thing to do.

As said, I have no doubt that under current copyright law it was to be expected that Sci-Hub would be judged to violate that law. I also blogged previously that I believe copyright is not doing our society a favor (mind you, all my literature is copyrighted, and much of it I license to readers, allowing them to read my work, copy it (e.g. share it with colleagues and students), and even modify it, e.g. allowing journals to change their website layout without having to ask me). About copyright, I still highly recommend Free Culture by Prof. Lessig (who unfortunately did not run for the presidency).

To get a better understanding of Sci-Hub and its popularity (I believe gold Open Access is the real solution), I looked at what literature was in Wikidata, using Scholia (wonderful work by Finn Nielsen, see arXiv). I added a few papers and annotated papers with their main subjects. I guess there must be more literature about Sci-Hub, but this is the "co-occurring topics graph" provided by Scholia at the time of writing:

It's a growing story.

As a PhD student, I was often confronted with Closed Access.

It sounds like a problem not so common in western Europe, but it was when I was a fresh student (around 1994). The Radboud University library certainly did not have all journals, and for one journal I had to go to a research department and sit in their coffee room. Not a problem at all. Big package deals improved access, but created a vendor lock-in. And we're paying big time for these deals now, with insane year-over-year inflation of the prices.

But even then, I was repeatedly confronted with not having access to literature I wanted to read. Not just me, btw; for PhD students this was very common too. In fact, they regularly visited other universities just to make some copies there. An article basically cost a PhD student a train trip and a euro or two in copying costs (besides the package deal costs for the visited university, of course). Not much has changed, despite the fact that in this electronic age the costs should have gone down significantly, instead of up.

That Elsevier sued Sci-Hub (about Sci-Hub, see this and this), I can understand. It's good to have a court decide what is more important: Elsevier's profit or the human right of access to literature (doi:10.1038/nature.2017.22196). This is extremely important for how our society wants to continue: do we want a fact-based society, where dissemination of knowledge is essential, or do we want a society where power and money decide who benefits from knowledge?

But the STM industry claiming that Sci-Hub does not contribute to the scholarly community is plain outright FUD. In fact, it's an outright lie. That Nature does not call out that lie in their write-up is very disappointing indeed.

I do not know if it is the ultimate solution, but I strongly believe in a knowledge dissemination system where knowledge can be freely read, modified, and redistributed. Whether Open Science, or gold Open Access.

Therefore, I am proud to be one of the 10 Open Access proponents at Maastricht University. And a huge thank you to our library to keep pushing Open Access in Maastricht.

Sunday, June 11, 2017

You are what you do, or how people got to see me as an engineer

Source, Wikicommons, CC-BY-SA.
Over the past 20 years I have had endless discussions about what the research is that I do. Many see my work as engineering, but I vigorously disagree. Still, some days it's just too easy to give up on explaining things yet again. The question came up several times again over the past few months, and it was suggested that I make a choice. That's modern academia for you: you have to excel in something tiny, and a complex, hard-to-explain ambition loses out in a system based on funding, buzzwords, "impact", and such. So, again, I am trying to build up my defense as to why my research is not engineering. You know what is ironic? It's all the fault of Open Science! Darn Open Science.

In case you missed it (no worries, many of the people I talk with in depth about these things do, IMHO), my research is of a theoretical nature (I tried bench chemistry, but my back is not strong enough for that): I am interested in how to digitally represent chemical knowledge. I get excited about Shannon entropy and books by Hofstadter. I do not get excited about "deep learning" (boring! In fact, the only fun I get out of that is pointing you to this). So, arguably, I am in the wrong field of science. One could argue I am not a biologist or chemist, but a computer scientist, or maybe a philosopher (mind you, I have a degree in philosophy).

And that's actually where it starts getting annoying. Because I do stuff on a computer, people associate me with software. And software is generally seen as something that Microsoft does... hello, engineering. The fact that I publish papers on software (think CDK, Bioclipse, Jmol) does not help, of course.

That's where that darn Open Science comes in. Because I have a varied set of skills, I actually know how to instruct a computer to do something for me. It's like writing English, just to a different person, um, thingy. Because of Open Science, I can build the machines that I need to do my science.

But a true scientist does not make their own tools; they buy them (of course, that's an exaggeration, but just realize how well we value data and software citations at this time). They get loads of money to do so, just so that they don't have to make machines. And just because I don't ask for loads of money, or ask for a bit of money to actually make the tools I need, you get tagged as an engineer. And I, I got tricked by Open Science into fixing things, adding things. What was I thinking??

Does this resonate with experience from others? Also upset about it? What can we do about this?

(So, one of my next blog posts will be about the new scientific knowledge I have discovered. I have to say, not as much as I wanted, mostly because we did not have the right tools yet, which I have to build first, but that's what this post is about...)

Saturday, June 10, 2017

New paper: "The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching"

This paper was long overdue. But software papers are not easy to write, particularly not follow-up papers. That actually seems a lot easier for databases. Moreover, we already publish too much. However, the scholarly community does not track software citations (nor data citations, though there seems to be a bit more momentum there; a larger user group?). So, we need these kinds of papers, and just a versioned, archived software release (e.g. on Zenodo) is not enough. But, there it is, the third CDK paper (doi:10.1186/s13321-017-0220-4). Fifth, if you include the two papers about ring finding, which also describe CDK functionality.

Of course, I could detail what this paper has to offer, but let's not spoil the article. It's CC-BY, so everyone can readily access it. You don't even need Sci-Hub (though you are allowed to use it for this paper; it's legal!).

A huge thanks to all co-authors, to John's work as release manager and his great performance improvements, code improvements, code clean-up, etc., and to all developers who are not co-authors on this paper but contributed bigger or smaller patches over time (please check the full AUTHOR list!). That list does not include the companies that have been supporting the project in kind, though. Also huge thanks to all the users, particularly those who have used the CDK in downstream projects, many of which are listed in the introduction of the paper.

And, make sure to follow John Mayfield's blog with tons of good stuff.

Saturday, May 20, 2017

May 29, Delft, The Netherlands: "Open Science: the National Plan and you"

In less than ten days, a first national meeting will be held in Delft, The Netherlands, where researchers can meet researchers to talk about Open Science. Mind you, "researcher" is very broad: it is anyone doing research, at home (e.g. citizen science, or as a hobby), at work (a company or research institute), or in an educational setting (university, HBOs, ...). After all, everyone benefits from Open Science (at least from that of others! "Standing on the shoulders of Open Science, ...")

The meeting is part of the National Plan Open Science (see also Open Science is already a thing in The Netherlands), which is a direct result of the Open Science meeting in Amsterdam during the Dutch presidency which resulted in the Amsterdam Call for action on Open Science.

The program for the #npos2017 meeting is very interactive. It starts with the obligatory introductions, explaining how Open Science fits into the national future research landscape, but quickly moves on to practical experiences from researchers, a Knowledge Commons session where everyone can show and discuss their Open Science work (with a free lunch: yes, #OpenScience and free lunches are compatible), a number of breakout sessions where the "but how" can be discussed and answered (topics in the image below), a panel to wrap up the breakout sessions, and a free drink afterwards.

During the Knowledge Commons I will join Andra Waagmeester (Micelio) and Yaroslav Blanter (Delft University) to show Wikidata, and how I have been using this for data interoperability for the WikiPathways metabolism pathways (via BridgeDb).

The meeting is free and you can sign up here. Looking forward to meeting you there!

Sunday, April 16, 2017

GenX spill, national coverage, but where is the data

First (I have never blogged much about risk and hazard): I am not a toxicology expert nor a regulator. I have the deepest respect for both, as these studies are among the most complex ones I am aware of. They make rocket science look dull. However, I have quite some experience in the relation between chemical structure and properties, and with knowledge integration, which is a prerequisite for understanding that relation. Nothing I do says what the right course of action is. Any new piece of knowledge (or technology) has pros and cons. It is science that provides the evidence to support finding the right balance. It is science I focus on.

The case
The AD national newspaper reported the spilling of a compound with the name GenX into the environment, reaching drinking water. This was picked up by other newspapers, like de VK. The chemistry news outlet C2W commented on the latter on Twitter:

Translated, the tweet reports that we do not know if the compound is dangerous. Now, to me, there are two things: first, any spilling should not happen (I know this is controversial, as people are more than happy to repeatedly pollute the environment, just out of self-interest and/or laziness); second, what do we know about the compound? In fact, what is GenX even? It certainly won't be "generation X", though we don't actually know the hazard of that either. (We have IUPAC names, but just like with the ACS disclosures, companies like to make up cryptic names.)

But having worked on predictive toxicology and data integration projects around toxicology, and just out of chemical interest, I started searching for what we know about this compound.

Of course, I need an open notebook for my science, but I tend to be sloppy and mix up blog posts like this with source code repositories and public repositories. For new chemicals, as you could read earlier this weekend, Wikidata is one of my favorites (see also doi:10.3897/rio.1.e7573). Using the same approach as for the disclosures, I checked if Wikidata had entries for the ammonium salt and the "active" ingredient FRD-903 (to be fair, chemically they are different, and so may be their hazard and risk profiles). Neither existed, so I added them using Bioclipse and QuickStatements (a wonderful tool by Magnus Manske): GenX and FRD-903. So, a seed of knowledge was planted.
    A side topic... if you have not looked at yet, please do. It allows you to annotate (yes, there are more tools that allow that, but I like this one), which I have done for the VK article:

I had a look around the web for information, and there is not a lot. A Wikidata page with further identifiers then helps in tracking your steps. Antony Williams, previously of ChemSpider fame, now working on the EPA CompTox Dashboard, added the DTX substance IDs, but the entries in the dashboard will not show up for another bit of time. For FRD-903 I found growth inhibition data in ChEMBL.

But Nina Jeliazkova pointed me to her LRI AMBIT database (poster abstract doi:10.1016/j.toxlet.2016.06.1469, PDF links), which makes (public) data from ECHA REACH dossiers available in a machine-readable way (see this ECHA press release), using their AMBIT software (doi:10.1186/1758-2946-3-18). (BTW, this makes the legal hassle Hartung had last year even more interesting, see doi:10.1038/nature.2016.19365.) After creating a free login, you can find a full (public) dossier with information about the toxicology of the compound (toxicity, ecotoxicity, environmental fate, and more):

I reported this slide, as the worry seems to be about drinking water, so oral toxicity seems appropriate (note, this is only acute toxicity). The LD50 is the median lethal dose, but it is only measured for mouse and rat (these are models for human toxicity, but only models, as humans are just not rats; well, not literally, anyway). Also, >1 gram per kilogram body weight ("kg bw"; assumption) seems pretty high. In my naive understanding, the rat may be the canary in the coal mine. But let me refrain from drawing any conclusions. I leave that to the experts on risk management!

Experts like those from the Dutch RIVM, which wrote up this report. One piece of information they say is missing is that of biodistribution: "waar het zich ophoopt", or in English, where the compound accumulates.