The unbearable effectiveness of data

Researchers in artificial intelligence (AI) of the 1980s, librarians, and aficionados of the Semantic Web share a faith: the unique value of human-designed knowledge structures, whether taxonomies, ontologies, or metadata. These knowledge representations are seen as providing important leverage in information retrieval, knowledge discovery, and decision support. In this context, I was recently reminded by Alal Eran of an article by researchers at Google about the value of big data. These researchers (one of whom wrote Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp, a book on Common Lisp widely appreciated by the AI community that includes applications for expert systems) describe how statistical methods applied to trillion-word corpora can automatically support the aforementioned information tasks without requiring human annotation or categorization. It may be that human-derived annotations (whether crowd-sourced from the web or carefully curated in the monasteries of the ivory tower) can be used synergistically with purely statistical learning methods, but that has yet to be convincingly demonstrated. Until then, those of us working in genomic research will see how far we can get with data alone, particularly data obtained in the course of healthcare.
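To make the contrast concrete, here is a toy sketch of the annotation-free statistical approach: term relatedness estimated purely from co-occurrence counts in raw text, with no taxonomy, ontology, or metadata involved. The four-sentence corpus and the sentence-level co-occurrence window are invented for illustration; the methods the Google authors describe operate on trillion-word corpora.

```python
from collections import Counter
from math import sqrt

# Invented toy corpus; real systems use web-scale text.
corpus = [
    "the gene encodes a protein kinase",
    "the kinase phosphorylates the protein target",
    "the library catalog lists every book",
    "every book in the catalog has metadata",
]

def cooccurrence_vectors(sentences):
    """Map each word to a Counter of the words it co-occurs with in a sentence."""
    vectors = {}
    for sentence in sentences:
        words = sentence.split()
        for w in words:
            vec = vectors.setdefault(w, Counter())
            for other in words:
                if other != w:
                    vec[other] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vectors = cooccurrence_vectors(corpus)
# "kinase" and "protein" share contexts; "kinase" and "book" barely do,
# so the first similarity comes out higher than the second.
print(cosine(vectors["kinase"], vectors["protein"]))
print(cosine(vectors["kinase"], vectors["book"]))
```

No one told the program that kinases and proteins belong together; the grouping falls out of the counts, which is the statistical-learning bet in miniature.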

For those of us who work in libraries, and those of us who are librarians, there is now an active and unresolved debate over what value human annotations and metadata provide. If there is value, at what cost? And if they are cost-effective, how do we demonstrate that efficacy? Our universities' leaders will be interested in the answers, and so will our colleagues at Google.