RuAmbiEnt (Russian Ambiguous Entries) is a dataset designed for evaluating semantic clustering of search engine results pages (SERPs) in Russian.
The dataset consists of 96 single-word queries, each with a set of subtopics/senses and a list of 30 top-ranking documents relevant to these queries (URLs, titles and snippets).
The package contains four files:

Ambiguous queries for the dataset were taken from the Analyzethis homonymous queries analyzer. Analyzethis is an independent search engine evaluation initiative for Russian, offering various search performance analyzers, including one for ambiguous or homonymous queries. We crawled Mail.ru search results for these queries, collecting titles and snippets (30 for each query, 2880 results in total).
SERPs were annotated by two independent human experts, who manually mapped each result to the corresponding query sense. The sense inventory for the queries was produced beforehand by a third independent human annotator. The Adjusted Rand Index (used here as a measure of inter-rater agreement) between the two annotations is 0.945, which indicates that humans mostly agree on the clustering of web search results.
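
For readers who want to compute the same kind of agreement score on their own annotations or on system output, the following is a minimal sketch of the Adjusted Rand Index in pure Python. It is not the script used to obtain the 0.945 figure: the function name `adjusted_rand_index` and the toy sense labels are assumptions made only for illustration, and the sketch assumes each result is mapped to exactly one sense.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items.

    Hypothetical helper for illustration; labels can be any hashable
    sense identifiers, one per search result.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        # Fewer than two items: the labelings trivially agree.
        return 1.0

    # Contingency counts: how many items share each (sense_a, sense_b) pair.
    joint = Counter(zip(labels_a, labels_b))
    index = sum(comb(count, 2) for count in joint.values())

    # Pairs grouped together within each labeling separately.
    pairs_a = sum(comb(count, 2) for count in Counter(labels_a).values())
    pairs_b = sum(comb(count, 2) for count in Counter(labels_b).values())

    expected = pairs_a * pairs_b / total_pairs
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (index - expected) / (maximum - expected)


# Toy example (invented labels): six results for one query, two annotators.
annotator_1 = [1, 1, 1, 2, 2, 3]
annotator_2 = [1, 1, 2, 2, 2, 3]
score = adjusted_rand_index(annotator_1, annotator_2)
print(f"ARI = {score:.3f}")
```

The score is 1.0 when the two labelings induce identical clusters, regardless of how the senses are numbered, and is close to 0 when the agreement is no better than chance.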
RuAmbiEnt is designed as a Russian-language complement to the AMBIENT and MORESQUE datasets, and follows the format established in: Roberto Navigli and Giuseppe Crisafulli. 2010. Inducing word senses to improve web search result clustering. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 116–126. Association for Computational Linguistics.
This dataset accompanies our LREC-2016 paper "Neural Embedding Language Models in Semantic Clustering of Web Search Results".

RuAmbiEnt: Russian Ambiguous Entries by Elizaveta Kuzmenko and Andrey Kutuzov
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.