1 Motivation and Goals

To undertake meaningful research, IR researchers need to consider real users (as opposed to simulated users or professional assessors) in their natural environments. Another key ingredient, one that has taken search technology to the next level in many domains, is the availability of large-scale historical usage and interaction data. So far, this type of data has only been available to those working within organizations that operate a search engine with a sufficiently large user base.

Open Search is a new evaluation paradigm for IR. The experimentation platform is an existing search engine. Researchers have the opportunity to replace components of this search engine and evaluate these components using interactions with real, unsuspecting users of this search engine.

Unlike most other tracks at TREC, we see our initiative as a proposal for an evaluation paradigm rather than a proposal for a new IR task. Nevertheless it is vital, for this to be a meaningful exercise, that our experimental platform provides an interesting real-world IR task. The first edition of the TREC Open Search track focuses on academic search as the task domain.

2 Task

The track focuses on ad-hoc scientific literature search as the main task: given a keyword query, return a ranked list of documents (scientific articles).

This task is evaluated on three different academic search engines: CiteSeerX, Microsoft Academic Search and SSOAR. From now on, these will simply be referred to as sites. Participants can choose which site(s) they want to experiment with but it is mandatory to submit a run for CiteSeerX. It is possible to use the same retrieval algorithm on all sites or to develop site-specific approaches.

In all cases, sites make available a set of queries and, for each query, a set of candidate items (documents, authors, conferences, etc.). Participants need to generate a (re)ranking of the candidate items for each query. The generated rankings are then interleaved with the site’s production ranking and presented to users. This is orchestrated by an API, as detailed in the next section.

3 Procedure

The track operates as follows. A set Q of queries is taken from the search logs of a site. These queries are chosen such that they appear frequently enough, making it likely that they will be issued again in the near future by users of this site. Additionally, for each query q in Q, the site prepares a set of candidate documents Dq and some historical interaction data Id for each document d in Dq.

TREC Open Search operates an infrastructure, called the Living Labs API, that allows the site to share Q, D, and I with the participants. Once the site uploads the data to the API, it can be downloaded by participants. This way, participants are provided with what is essentially a typical TREC-style collection, consisting of queries and documents, with historical interactions on top. Queries are strings, and documents are represented as JSON documents with the fields common in literature search (e.g., author, title, abstract); see Section 4 for details.

Participants are expected to produce their runs, as they normally would, and upload them through the API. When an unsuspecting, real user then issues a query q in Q against the site’s search engine, the search engine asks the API for a run for that query. The API selects uniformly at random from among the runs that have already been uploaded by participants and returns the selected run to the search engine. The search engine then interleaves this run with its production ranking and shows the result to the user.
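A run is essentially an ordered list of document ids for one query. The sketch below assembles such an upload body; the runid/doclist field names mirror the feedback messages shown in Section 4.3, but the exact upload schema and endpoint are defined by the Living Labs API documentation, not here.

```python
def build_run_payload(runid, ranked_docids):
    """Assemble the JSON body for uploading a run for one query:
    a run identifier plus an ordered doclist of document ids."""
    return {
        "runid": runid,
        "doclist": [{"docid": docid} for docid in ranked_docids],
    }

# Example: a two-document run for one CiteSeerX query. The body would
# then be uploaded (e.g., via an HTTP PUT) to the API's run endpoint.
payload = build_run_payload("baseline", ["citeseerx-d4", "citeseerx-d5"])
```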

The user may or may not interact with this ranking. When there is an interaction, the search engine sends it back to the API, and the API in turn makes aggregated feedback information available to the participant. The participant can then choose to update their ranking.

3.1 Train and Test Queries and Evaluation Rounds

The procedure described above holds for train queries. Next to the train queries, there is also a set of test queries, which are treated differently. During dedicated evaluation rounds, the runs that are submitted for test queries cannot be changed (but they can be changed any number of times before the test period starts). The reason for this freeze is that it makes it possible to compare systems in a fair way: over the same period, without the impact of a sudden update by only one of the participants. Outside of evaluation rounds, test queries act just like train queries, except that it is never possible to obtain individual feedback for them; for test queries, only aggregated feedback is available.

We will organize several evaluation rounds (see timeline), the first will simply be a “mock round,” only meant to test the whole procedure.

3.2 The Living Labs API

API endpoint: (port 80)

4 Data and Sites

4.1 Sites

This year’s TREC Open Search is the academic search edition. Three academic search engines are participating in the track:

CiteSeerX is an evolving scientific literature digital library and search engine that has focused primarily on the literature in computer and information science. CiteSeerX aims to improve the dissemination of scientific literature and to provide improvements in functionality, usability, availability, cost, comprehensiveness, efficiency, and timeliness in the access of scientific and scholarly knowledge.
SSOAR is the Social Science Open Access Repository, a steadily growing collection of full-text social-science documents that can be accessed freely.
Microsoft Academic Search is an experimental research service developed by Microsoft Research to explore how scholars, scientists, students, and practitioners find academic content, researchers, institutions, and activities. Microsoft Academic Search indexes not only millions of academic publications, it also displays the key relationships between and among subjects, content, and authors, highlighting the critical links that help define scientific research. As is true of many research projects at Microsoft, this service is not intended to be a production Web site, and it will likely be taken offline when appropriate given the research goals of the project.

4.2 Queries and Documents

Both queries and documents are assumed to be in English.

4.2.1 Queries

The query set consists of a selection of frequent (head or torso) queries.

For each query the raw query text and aggregated historical click statistics are available. Click statistics are normalized counts, i.e., a distribution over clicked results (possibly after some noise filtering), computed over some (unknown) time period.
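As an illustration, turning raw per-document click counts into such a distribution might look as follows. This is only a sketch: the noise filtering the sites actually apply is unspecified, so a simple minimum-count threshold stands in for it.

```python
def normalize_clicks(click_counts, min_count=0):
    """Turn raw per-document click counts into a distribution over
    clicked results. Documents with at most `min_count` clicks are
    dropped first, as a stand-in for (unspecified) noise filtering."""
    kept = {doc: c for doc, c in click_counts.items() if c > min_count}
    total = sum(kept.values())
    return {doc: c / total for doc, c in kept.items()} if total else {}
```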

The endpoint for obtaining queries is documented here.

An example of queries obtained through our API looks as follows.

  {
    "queries": [
      {
        "creation_time": "Mon, 07 Mar 2016 22:02:26 -0000",
        "qid": "citeseerx-q35",
        "qstr": "asif ekbal rejwanul haque amitava das venkateswarlu poka and sivaji bandyopadhyay language independent named entity recognition in indian languages",
        "type": "train"
      },
      {
        "creation_time": "Mon, 07 Mar 2016 22:02:26 -0000",
        "qid": "citeseerx-q74",
        "qstr": "\"a scalable approach to attack graph generation\"",
        "type": "train"
      }
    ]
  }
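Once parsed, a response of this shape is easy to work with; for example, grouping query ids by their type field. This sketch operates on an already-parsed Python dict rather than on a live API response.

```python
def split_queries(response):
    """Group query ids from a queries response by their "type" field."""
    by_type = {}
    for q in response["queries"]:
        by_type.setdefault(q["type"], []).append(q["qid"])
    return by_type

# Usage on a trimmed-down version of the example above:
splits = split_queries({"queries": [
    {"qid": "citeseerx-q35", "type": "train"},
    {"qid": "citeseerx-q74", "type": "train"},
]})
```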

4.2.2 Documents

One would first obtain the list of documents available for a query through the doclist endpoint, documented here.
Then, each document is requested through the document endpoint, documented here. The content field of a document is structured differently for each site, as explained below.
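This two-step fetch can be sketched as follows. Here `get_json` stands for any helper that issues a GET request and returns the parsed JSON body, and the endpoint paths are placeholders rather than the documented ones.

```python
def fetch_documents(qid, api_key, get_json):
    """Fetch the candidate documents for a query: the doclist first,
    then one request per document. `get_json(path)` is any callable
    that issues a GET for `path` and returns the parsed JSON body.
    The endpoint paths below are placeholders, not the documented ones."""
    doclist = get_json(f"/api/participant/doclist/{api_key}/{qid}")
    return [
        get_json(f"/api/participant/doc/{api_key}/{entry['docid']}")
        for entry in doclist["doclist"]
    ]

# Usage with an in-memory stub instead of real HTTP:
stub = {
    "/api/participant/doclist/KEY/citeseerx-q35": {
        "doclist": [{"docid": "citeseerx-d4"}]
    },
    "/api/participant/doc/KEY/citeseerx-d4": {"docid": "citeseerx-d4"},
}
docs = fetch_documents("citeseerx-q35", "KEY", stub.get)
```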

CiteSeerX

There is a single field, text, with the full document text.

  {
    "content": {
      "text": "Computing Semantic Relatedness …  \nNetwork of Excellence.\nReferences\n"
    },
    "creation_time": "Mon, 07 Mar 2016 22:02:41 -0000",
    "docid": "citeseerx-d4",
    "site_id": "citeseerx",
    "title": "Computing semantic relatedness using Wikipedia-based explicit semantic analysis"
  }


SSOAR

  • abstract – abstract text
  • author – name(s) of author(s)
  • description – meta info
  • identifier – URI
  • issued – year of issue
  • language – language code (e.g., es, en)
  • publisher – publisher
  • subject – topic of the document
  • type – document type (journal article, for instance)

  {
    "content": {
      "abstract": "This paper … encountered.",
      "author": "Gimeno, Ana",
      "description": "Published Version",
      "identifier": "urn:nbn:de:0168-ssoar-243622",
      "issued": "2011",
      "language": "es",
      "publisher": "Misc. Country",
      "subject": "Internet",
      "type": "journal article"
    },
    "creation_time": "Tue, 09 Feb 2016 03:41:31 -0000",
    "docid": "ssoar-d3125",
    "site_id": "ssoar",
    "title": "Experimenting the …  Examination"
  }

Microsoft Academic Search

  • abstract – abstract text
  • site_docid – Entity Id in MS Academic Search Knowledge Graph
  • url – Example web location of the published document

  {
    "content": {
      "abstract": "Duodenal duplications in ... advantage.",
      "url": ""
    },
    "creation_time": "Tue, 01 Oct 2016 04:42:12 -0000",
    "docid": "msacademic-d3125",
    "site_id": "msacademic",
    "title": "PubMed - ...",
    "site_docid": "2202987617"
  }

Additional content can be extracted through the Academic Knowledge API using the site_docid as the entity Id:{site_docid}&correlationId=undefined

4.3 Feedback

Interleaving outcomes are aggregated for each participant and query and comprise the following: #impressions and #wins/#losses/#ties w.r.t. the production system.

  • For the train queries, (aggregated) outcomes are made available as soon as possible, ranging from several times an hour to once a day.
  • For the test queries, (aggregated) outcomes are made available only at the end of the test period.
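Given such aggregates, a natural single-number summary (a sketch, not necessarily the official track measure) is the fraction of decided comparisons that the participant's run won:

```python
def outcome(wins, losses, ties=0):
    """Fraction of decided interleaving comparisons won by the
    participant's run. Ties carry no preference signal and are
    excluded; above 0.5 means beating the production ranking."""
    decided = wins + losses
    return wins / decided if decided else 0.5  # no evidence either way
```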

Feedback is obtained through an endpoint documented here.

    {
      "feedback": [
        {
          "qid": "citeseerx-q35",
          "runid": "baseline",
          "modified_time": "Sun, 13 Mar 2016 13:46:00 -0000",
          "doclist": [
            {"docid": "citeseerx-d4", "clicked": true},
            {"docid": "citeseerx-d5"}
          ]
        }
      ]
    }

4.4 Evaluation metrics

CiteSeerX and SSOAR use interleaved comparisons on the live sites. Interleaving is a highly sensitive online evaluation method that is often used at large-scale commercial (web) search engines. Interleaving combines the two rankings that are to be compared into a single ranking. This interleaved ranking is shown to a user, and the ranker that contributed more of the clicked documents is preferred.
Specifically, Team Draft Interleaving (TDI) is implemented and performed by the sites themselves. Participants’ rankings are interleaved with the current production ranking. This is done so that sites can simply ignore the participants’ rankings and respond quickly to the user’s request in case our API responds too slowly.
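The mechanism can be sketched as follows. The sites run their own implementations, so this is only meant to illustrate how TDI assigns documents to teams and credits clicks; participants do not need to implement it.

```python
import random

def team_draft_interleave(run_a, run_b, length=10, rng=None):
    """Combine two rankings into one result list with Team Draft
    Interleaving. Returns the interleaved list plus, per position,
    which run ("a" or "b") contributed that document."""
    rng = rng or random.Random()
    interleaved, teams = [], []
    while len(interleaved) < length:
        na, nb = teams.count("a"), teams.count("b")
        # The team with fewer contributions so far picks next;
        # ties are broken by a coin flip.
        if na < nb:
            order = ["a", "b"]
        elif nb < na:
            order = ["b", "a"]
        else:
            order = rng.sample(["a", "b"], 2)
        placed = False
        for team in order:
            run = run_a if team == "a" else run_b
            # Each team contributes its highest-ranked unshown document.
            doc = next((d for d in run if d not in interleaved), None)
            if doc is not None:
                interleaved.append(doc)
                teams.append(team)
                placed = True
                break
        if not placed:  # both runs exhausted
            break
    return interleaved, teams

def tdi_winner(teams, clicked_positions):
    """Credit each clicked position to the team that filled it and
    declare the team with more credited clicks the winner."""
    credit = {"a": 0, "b": 0}
    for pos in clicked_positions:
        credit[teams[pos]] += 1
    if credit["a"] > credit["b"]:
        return "a"
    if credit["b"] > credit["a"]:
        return "b"
    return "tie"
```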

Rankings for Microsoft Academic Search will be evaluated using both online implicit feedback (clicks) and offline human judgments. Sharing the evaluation details will depend on the number of submissions received and the statistical power of the comparisons. Note that the timeline is also different for this site.

5 Timeline

March 1: release train queries
March 9: finalize guidelines
March 15: clicks start flowing
May 15: release test queries
Jun 1: test period #1 begins
July 15: test period #1 ends
Aug 1: test period #2 begins
Sep 15: test period #2 ends
Oct 1: test period #3 begins
Nov 1: MS Academic test period closes
Nov 15: test period #3 ends


6 FAQ

  • How many runs can we submit?
    A single run in test period #1. It will be possible to submit multiple runs from test period #2 onwards.
  • Can we use external resources?
    Yes. From test period #2 onwards, however, once it is possible to submit multiple runs per team, we will ask each team to also generate a baseline run that does not use any external resources.