==========================================================
   WIKI-44 Dataset
  
   http://search.fub.it/wiki-44/
  
==========================================================



SUMMARY
==========================================================
WIKI-44 is a benchmark dataset for topic identification,
in particular for the task of labeling an input text
with Wikipedia articles that are relevant to its topic(s).


It consists of 44 Web page texts (in English),
each with a set of relevant Wikipedia articles,
labeled as either "strongly-relevant" or "weakly-relevant".
The articles were collected from the English Wikipedia dataset in May 2016.


   
STATISTICS
=========================================================

The main statistics about the WIKI-44 dataset are shown in the following table.

                                        Min     Max     Avg
Size of input text (KB)                 1.5    58.6     7.8
Number of strongly relevant articles      1       6     2.3
Number of weakly relevant articles        0      15     6.0
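
As a rough illustration of how the first row of the table might be
recomputed, the following Python sketch measures the size of every file
in a local copy of the "web_page_texts" folder and reports the minimum,
maximum, and average. The function names text_sizes_kb and summarize are
hypothetical, and the sketch assumes that sizes are read from the file
system with 1 KB = 1024 bytes. How the published figures were actually
measured is not stated here, so small differences are to be expected.

```python
from pathlib import Path


def text_sizes_kb(folder):
    """Size of every regular file in `folder`, in KB (assumed 1 KB = 1024 bytes)."""
    return [
        path.stat().st_size / 1024
        for path in sorted(Path(folder).iterdir())
        if path.is_file()
    ]


def summarize(values):
    """Return (min, max, average) of a non-empty list of numbers."""
    return min(values), max(values), sum(values) / len(values)
```

Calling summarize(text_sizes_kb("web_page_texts")) on an extracted copy of
the dataset yields the three values in the same order as the table columns.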


CONTENT
==========================================================

The WIKI-44 dataset contains:

- the "web_page_texts" folder with 44 files, one for each of the
  44 input texts;
- the URLs.html file containing the URLs of the web pages from which
  the texts were extracted (in April 2016);
- the strongly-relevant.txt file, where each line consists of a text id
  (from 1 to 44) followed by the title of a strongly relevant article;
- the weakly-relevant.txt file, where each line consists of a text id
  (from 1 to 44) followed by the title of a weakly relevant article.


USAGE LICENSE
==========================================================

Copyright(c) 2016 Fondazione Ugo Bordoni and Sapienza University of Rome
All rights reserved.

Authors: Claudio Carpineto, Roberto Navigli, and Giovanni Romano

The copyright holder and authors can not guarantee the correctness of
the data, its suitability for any particular purpose, or the validity of
results based on the use of the data set, and they assume no
responsibility for the content, legality, reliability, and accuracy of
the data.

The data set may be used for any research purposes.

The web pages from which the input texts were extracted may be governed
by local, national, and/or international laws and regulations, and your
use of such content is solely at your own risk. You agree to abide by
all applicable laws and regulations, including intellectual property
laws, in connection with your use of the dataset. In particular, you
certify that your use of any part of the WIKI-44 collection will be
limited to noninfringing or fair use under copyright law.

If the author or publisher of some web page does not want his or her
work in our dataset, please contact us at the address provided below and
we will remove the page.

Please acknowledge use of this data set when used in your work:

   Carpineto C., Navigli R., and Romano G. (2016).
   WIKI-44 dataset, http://search.fub.it/wiki-44/


DOWNLOAD
==========================================================

To download the WIKI-44 dataset click here.
To download the May 2016 Wikipedia dump click here.


MORE INFO
==========================================================

If you have any further questions or comments, please contact
Claudio Carpineto or Roberto Navigli.
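

EXAMPLE: READING THE RELEVANCE FILES
==========================================================

The line format described in the CONTENT section (a text id followed by
an article title) can be read with a few lines of Python. The sketch
below is only an illustration: the function names load_relevance and
counts_per_text are hypothetical, and it assumes that the id and the
title are separated by whitespace and that the files are UTF-8 encoded,
neither of which is stated above.

```python
from collections import defaultdict


def load_relevance(path, encoding="utf-8"):
    """Map each text id to the list of article titles listed for it."""
    labels = defaultdict(list)
    with open(path, encoding=encoding) as handle:
        for line in handle:
            # Assumption: id and title are separated by whitespace (space or tab).
            parts = line.strip().split(maxsplit=1)
            if len(parts) != 2:
                continue  # skip blank or incomplete lines
            text_id, title = parts
            labels[int(text_id)].append(title)
    return dict(labels)


def counts_per_text(labels, n_texts=44):
    """Number of listed articles for every text id, including ids with none."""
    return [len(labels.get(text_id, [])) for text_id in range(1, n_texts + 1)]
```

For example, counts_per_text(load_relevance("weakly-relevant.txt")) returns
one count per text, with 0 for any text id that never appears in the file.
Counting over all 44 ids rather than only the ids present in the file
matters when computing per-text figures such as those in the STATISTICS
table.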