Intranet Transactional Search


This page distributes intranet data for use in transactional search experiments. Available here is a collection of intranet data sets, search tasks, and queries for the evaluation of transactional search. The data sets were introduced in the following paper (joint work of the DB group at UMich and the Avatar project team at IBM Almaden).

  • Yunyao Li, Rajasekar Krishnamurthy, Shivakumar Vaithyanathan, and H.V. Jagadish. Getting Work Done on the Web: Supporting Transactional Queries. To appear in Proceedings of SIGIR 2006, Seattle, WA, August 2006 (pdf) (Bibtex)

An actively maintained bibliography on transactional search and related topics is also included.

If you have results to report on these corpora, please send email to Yunyao Li (yunyaol a_t umich d_o_t edu). Thanks!

Transactional Search Data Sets:

Data sets introduced in the above paper:

(1) S-DOC (1.11 GB): Pool of 434,203 unprocessed HTML files

Collected with the GNU Wget crawler in November 2005: given a single start point, the software recursively collected textual documents with a small set of MIME types (e.g., html, php) within a single domain.

This is the raw dataset from which the following transactional datasets (2) S-TDC and (3) S-ANN-NE were derived.
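
The scope rule described for the S-DOC crawl (stay within one domain, keep only a few textual page types) can be illustrated with a minimal sketch. This is not the procedure used to build the data set, which relied on GNU Wget; the domain `example.edu` is a placeholder because the actual start point and domain are not given here, and file extensions are used as a rough stand-in for MIME types.

```python
import posixpath
from urllib.parse import urlparse

# Placeholder values: the real start point and domain are not stated on this page.
ALLOWED_DOMAIN = "example.edu"
# Extensions stand in for the MIME types mentioned above (html, php).
ALLOWED_EXTENSIONS = {".html", ".php"}


def in_crawl_scope(url, domain=ALLOWED_DOMAIN, extensions=ALLOWED_EXTENSIONS):
    """Return True if url is on the domain (or a subdomain) and looks like a textual page."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Accept the domain itself and any subdomain, but not look-alike hosts.
    if host != domain and not host.endswith("." + domain):
        return False
    ext = posixpath.splitext(parsed.path)[1].lower()
    # Paths without an extension (e.g. directory URLs) are treated as pages.
    return ext == "" or ext in extensions
```

A crawler would apply such a check to every extracted link before following it, so that the collection never leaves the chosen domain.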

(2) S-TDC (93 MB): Transactional page dataset

A subset of S-DOC consisting of web pages that contain transactional features, including form-entry pages and software download pages.

(3) S-ANN-NE (4 MB): Transactional feature dataset

Each file contains all the transactional features from a web page in S-TDC. The identifier of each file corresponds to the original document.
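
Because each feature file shares its identifier with the original document, the two collections can be joined on that identifier. The sketch below assumes the identifier is the file name without its extension; the directory and file names in the usage example are hypothetical, since the actual layout of the archives is not described here.

```python
from pathlib import PurePosixPath


def pair_by_identifier(feature_files, page_files):
    """Map each feature file to the page file that has the same identifier.

    Assumption: the identifier is the file stem (name without extension).
    Feature files with no matching page are skipped.
    """
    pages_by_id = {PurePosixPath(p).stem: p for p in page_files}
    pairs = {}
    for feature_file in feature_files:
        ident = PurePosixPath(feature_file).stem
        if ident in pages_by_id:
            pairs[feature_file] = pages_by_id[ident]
    return pairs


# Hypothetical file names, for illustration only.
pairs = pair_by_identifier(
    ["ann/000123.txt", "ann/000999.txt"],
    ["pages/000123.html", "pages/000456.html"],
)
```

Here only `ann/000123.txt` is paired, because it is the only feature file whose identifier also appears among the page files.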

(4) 15 transactional search tasks:

Collected through an informal survey of administrative staff, graduate students, and recent graduates at the University of Michigan and IBM Almaden Research Center.

(5) 394 unique user queries:

Collected from 26 subjects at the University of Michigan and IBM Almaden Research Center for the search tasks in (4), with redundant queries removed.

Data Sets not used in the above paper, but potentially useful:

Coming soon...

If you have any questions or comments regarding this site, please send email to Yunyao Li (yunyaol a_t umich d_o_t edu).