This page describes the general philosophy behind TREC OpenSearch and the specific details of the current edition. To get an idea of what is involved in implementing a client, please refer to the participant guide.

1 Motivation and Goals

To undertake meaningful research, IR researchers need to consider real users (as opposed to simulated users or professional assessors) in their natural environments. Another key ingredient, one that has allowed search technology to be taken to the next level in many domains, is the availability of large-scale historical usage and interaction data. So far, this type of data has been available only to those working within organizations that operate a search engine with a sufficiently large user base.

OpenSearch is a new evaluation paradigm for IR. The experimentation platform is an existing search engine. Researchers have the opportunity to replace components of this search engine and to evaluate these components through interactions with real, unsuspecting users of the search engine.

Unlike most other tracks at TREC, we see our initiative as a proposal for an evaluation paradigm rather than a proposal for a new IR task. Nevertheless, for this to be a meaningful exercise, it is vital that our experimental platform provides an interesting real-world IR task. The 2016 and 2017 editions of the TREC OpenSearch track focus on academic search as the task domain.

2 Task

The track focuses on ad-hoc scientific literature search as the main task: given a keyword query, return a ranked list of documents (scientific articles).

TREC OpenSearch supports submitting queries to different sites. Participants can choose which site(s) they want to experiment with. It is possible to use the same retrieval algorithm on all sites or to develop site-specific approaches. In the 2017 edition, there is one site available, the SSOAR academic search engine.

In all cases, sites make available a set of queries and a set of candidate items (documents, authors, conferences, etc.) for each query. Participants need to generate a (re)ranking of a set of candidate items for each query. The generated rankings are then interleaved with the site’s production ranking, then presented to users. This is orchestrated by an API, as detailed in the next section.

3 Procedure

The track operates as follows. A set Q of queries is taken from the search logs of a site. These queries are chosen such that they appear frequently enough, making it likely that they will be issued again in the near future by users of this site. Additionally, for each query q in Q, the site prepares a set of candidate documents Dq and some historical interaction data Id for each document d in Dq.

TREC OpenSearch operates an infrastructure, called the Living Labs API, that allows the site to share Q, D, and I with the participants. Once the site uploads the data to the API, it can be downloaded by participants. In this way, participants are provided with much the same as a typical TREC-style collection, consisting of queries and documents, plus, in addition, historical interactions. Queries are strings, and documents are represented as JSON documents with all the fields common in literature search (e.g., author, title, abstract, etc.); see Section 4 for details.

Participants are expected to produce their runs, as they normally would, and upload these through the API. When an unsuspecting, real user then issues a query q in Q against the site's search engine, the search engine asks the API to provide a run for that query. The API then selects uniformly at random from among the runs that have already been uploaded by participants. This run is returned to the search engine, which interleaves it with its production ranking and shows the result to the user.

The user may or may not interact with this ranking. When there is an interaction, the search engine sends it back to the API, and the API then makes aggregated feedback information available to the participant. The participant can then choose to update their ranking.

3.1 Train and Test Queries and Evaluation Rounds

The procedure described above holds for train queries. Next to train queries, there is also a set of test queries. Test queries are treated differently. During dedicated evaluation rounds, the runs that are submitted for test queries cannot be changed (but they can be changed any number of times before the test period starts). The reason for this freeze is that it makes it possible to compare systems in a fair way: during the same period, and without the impact of a sudden update by only one of the participants. Outside of evaluation rounds, test queries act just like train queries, except that it is never possible to obtain individual feedback for them; for test queries, only aggregated feedback is available.
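The freeze rule can be summarized in a few lines of Python (a sketch for illustration only; the API enforces this server-side, and the function name is our own):

```python
def can_update_run(query_type, in_evaluation_round):
    """Return True if a participant may replace their run for a query.

    Train-query runs can be updated at any time; test-query runs are
    frozen for the duration of an evaluation round.
    """
    return not (query_type == "test" and in_evaluation_round)
```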

We will organize several evaluation rounds (see the timeline); the first will simply be a "mock round," meant only to test the whole procedure.

3.2 The Living Labs API

API endpoint: (port 80)
Participant guide:

4 Data and Sites

4.1 Sites

This year’s TREC OpenSearch is an academic search edition. There is one participating site:

SSOAR is the Social Science Open Access Repository, a steadily growing, freely accessible collection of full-text social science documents.

4.2 Queries and Documents

Both queries and documents are assumed to be in English.

4.2.1 Queries

The query set consists of a selection of frequent (head or torso) queries.

For each query the raw query text and aggregated historical click statistics are available. Click statistics are normalized counts, i.e., a distribution over clicked results (possibly after some noise filtering), computed over some (unknown) time period.
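For illustration, these click statistics can serve as a naive baseline ranker that orders a query's candidate documents by their historical click probability (a sketch; the data layout shown is an assumption, not the API's exact format):

```python
def rank_by_clicks(doc_ids, click_dist):
    """Order candidate documents by normalized historical click count.

    doc_ids: the candidate document ids for a query.
    click_dist: docid -> normalized click count; documents without
    historical clicks default to 0.0 and therefore sort last.
    """
    return sorted(doc_ids, key=lambda d: click_dist.get(d, 0.0), reverse=True)

# Example with made-up ids and counts:
rank_by_clicks(["d1", "d2", "d3"], {"d2": 0.6, "d1": 0.3})
# -> ["d2", "d1", "d3"]
```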

The endpoint for obtaining queries is documented here.

An example of queries obtained through our API looks as follows.

  "queries": [
      "creation_time": "Mon, 07 Mar 2016 22:02:26 -0000",
      "qid": "citeseerx-q35",
      "qstr": "asif ekbal rejwanul haque amitava das venkateswarlu poka and sivaji bandyopadhyay language independent named entity recognition in indian languages",
      "type": "train"
      "creation_time": "Mon, 07 Mar 2016 22:02:26 -0000",
      "qid": "citeseerx-q74",
      "qstr": "\"a scalable approach to attack graph generation\"",
      "type": "train"

4.2.2 Documents

One would first obtain the list of documents available for a query through the doclist endpoint, documented here.
Then, each document can be requested through the document endpoint, documented here. The document's content field is structured differently for each site; for SSOAR it contains the following fields:

  • abstract – abstract text
  • author – name(s) of author(s)
  • description – meta info
  • identifier – uri
  • issued – year of issue
  • language – es, en
  • publisher – publisher
  • subject – topic of the document
  • type – document type (journal, for instance)
  "content": {
    "abstract": "This paper … encountered.",
    "author": "Gimeno, Ana",
    "description": "Published Version",
    "identifier": "urn:nbn:de:0168-ssoar-243622",
    "issued": "2011",
    "language": "es",
    "publisher": "Misc. Country",
    "subject": "Internet",
    "type": "journal article"
  "creation_time": "Tue, 09 Feb 2016 03:41:31 -0000",
  "docid": "ssoar-d3125",
  "site_id": "ssoar",
  "title": "Experimenting the …  Examination"

4.3 Feedback

Interleaving outcomes are aggregated for each participant and query and comprise the following: #impressions and #wins/#losses/#ties w.r.t. the production system.

  • For the train queries, (aggregated) outcomes are made available as soon as possible, ranging from several times an hour to, at worst, once a day.
  • For the test queries, (aggregated) outcomes are made available only at the end of the test period.
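One common way to summarize such counts (not necessarily the track's official aggregate) is the expected outcome over non-tied comparisons, where 0.5 means the experimental ranker is on par with the production system:

```python
def expected_outcome(wins, losses):
    """Fraction of non-tied interleaving comparisons won by the
    experimental ranker; 0.5 when there is no evidence either way."""
    decided = wins + losses
    return wins / decided if decided else 0.5

expected_outcome(6, 4)   # -> 0.6 (ties do not enter the ratio)
```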

Feedback is obtained through an endpoint documented here.

    "feedback": [
        {"qid": "citeseerx-q35",
         "runid": "baseline",
         "modified_time": "Sun, 13 March 2016 13:46:00 -0000",
         "doclist": [
             {"docid": "citeseerx-d4"
              "clicked": True},
             {"docid": "citeseerx-d5"},

4.4 Evaluation metrics

SSOAR uses interleaved comparisons on the live site. Interleaving is a highly sensitive online evaluation method that is often used at large-scale commercial (web) search engines. Interleaving combines the two rankings that are to be compared into a single ranking. This interleaved ranking is shown to a user, and the ranker that contributed more of the clicked documents is preferred.
Specifically, Team Draft Interleaving (TDI) is implemented and performed by the sites themselves. Participants' rankings are interleaved with the current production ranking. This is done so that, in case our API responds too slowly, sites can simply ignore the participants' rankings and respond quickly to the user's request.
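A minimal sketch of TDI, and of deciding the winner from clicks, could look as follows (for illustration only; the sites run their own implementations):

```python
import random

def team_draft_interleave(run_a, run_b, rng=random):
    """Team Draft Interleaving: teams alternate picking their
    highest-ranked document not yet in the combined list; a coin
    flip breaks ties over who picks next."""
    total = len(set(run_a) | set(run_b))
    interleaved, assignment = [], []
    count_a = count_b = 0
    seen = set()
    while len(interleaved) < total:
        pick_a = count_a < count_b or (
            count_a == count_b and rng.random() < 0.5)
        primary, fallback = (run_a, run_b) if pick_a else (run_b, run_a)
        team = "a" if pick_a else "b"
        doc = next((d for d in primary if d not in seen), None)
        if doc is None:  # this team is exhausted; take from the other
            doc = next((d for d in fallback if d not in seen), None)
            team = "b" if pick_a else "a"
        seen.add(doc)
        interleaved.append(doc)
        assignment.append(team)
        count_a += team == "a"
        count_b += team == "b"
    return interleaved, assignment

def tdi_winner(assignment, clicked_positions):
    """The team that contributed more clicked documents wins."""
    a = sum(assignment[i] == "a" for i in clicked_positions)
    b = sum(assignment[i] == "b" for i in clicked_positions)
    return "a" if a > b else "b" if b > a else "tie"
```

Here team "a" would be a participant's run and team "b" the production ranking; the outcome of many such comparisons is what the aggregated #wins/#losses/#ties feedback reports.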

5 Timeline

  • Trial round: July 1 – July 15, 2017
  • Real round: August 1 – August 31, 2017

6 Get started

Want to get started implementing a client that produces rankings for these queries? Read the participant guide in the documentation.


  • How many runs can we submit?
    A single run in evaluation round #1. From round #2 onwards, it will be possible to submit multiple runs.
  • Can we use external resources?
    Yes. From round #2 onwards, though, once it is possible to submit multiple runs per team, we will ask each team to generate a baseline run that does not use any external resources.