Topically Diverse Query Focused Summarization (TD-QFS)

This site contains a new dataset of document clusters, queries and corresponding manual summaries that answer the queries.
The dataset presented on this site is an expansion of the QCFS dataset.
All manual summaries created for the dataset and source code for summarization algorithms are available on this site.
Download Dataset
Download Code

Information

Background information about the TD-QFS dataset can be found in (Baumel, Cohen and Elhadad).

QFS algorithms must combine query relevance assessment, central content identification, and redundancy avoidance. Frustratingly, state-of-the-art algorithms designed for QFS do not significantly improve upon generic summarization methods, which ignore query relevance, when evaluated on traditional QFS datasets. We hypothesize that this lack of success stems from the nature of the datasets.

Manual Summaries

The user scenario of this study is Query Focused Summarization (QFS): given an input document cluster and a query, generate a brief, well-organized, fluent answer to the query. We prepared a dataset of document clusters in the field of consumer health, and asked expert users (medical students) to write summaries of the document sets for various queries. All the manual summaries can be found here.

Document Clusters

We reused the document clusters from the QCFS dataset. Medical experts gathered each cluster for a general topic (e.g., "asthma") from reliable sources such as NIH, Asthma NZ, the U.S. National Library of Medicine, Wikipedia, Mayo Clinic, and the Alzheimer's Association.

The dataset contains four document clusters: Asthma, Alzheimer's Disease, Lung Cancer and Obesity.

Topic Concentration

Topic concentration is an abstract property of a query-focused multi-document summarization dataset. It measures the extent to which the documents in a document cluster cover the same input query. The TD-QFS dataset was constructed in order to obtain lower topic concentration than is found in existing QFS datasets such as DUC 2005-2007.

We obtain this lower topic concentration by constructing each document cluster around a main query and several secondary queries. For example, one cluster in our dataset includes documents that are all relevant to "Alzheimer's disease" (the main query); in addition, some of the documents specifically cover cognitive impairment, while others are about Alzheimer's symptoms or semantic dementia (these are secondary queries within the same cluster).

The full list of queries used for this study can be found here.

Comparison to DUC

We compared the TD-QFS dataset topic concentration with DUC 2005-2007. We used a two-stage QFS scheme in order to quantify topic concentration in the datasets:

Retrieve information relevant to the query using several retrieval methods.
Use the generic KL-SUM summarization method on the subset of the input document cluster found relevant by the retrieval method.
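The two stages above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code released on this site: the word-overlap retrieval in `retrieve` is an assumed stand-in for the retrieval methods used in the study, the greedy `kl_sum` follows the usual formulation of KL-Sum (add the sentence that minimizes the KL divergence between the source and summary unigram distributions), and all function names, the smoothing constant `eps`, and the `keep` and `max_words` parameters are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def retrieve(sentences, query, keep):
    """Stage 1 (assumed retrieval method): rank sentences by the number of
    distinct query words they contain and keep the top `keep`."""
    query_words = set(tokenize(query))
    ranked = sorted(
        sentences,
        key=lambda s: len(query_words & set(tokenize(s))),
        reverse=True,
    )
    return ranked[:keep]


def kl_divergence(p_counts, q_counts, vocab, eps=0.01):
    """KL(P || Q) over `vocab`, with additive smoothing applied to Q."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values()) + eps * len(vocab)
    kl = 0.0
    for w in vocab:
        p = p_counts[w] / p_total
        if p > 0:
            q = (q_counts[w] + eps) / q_total
            kl += p * math.log(p / q)
    return kl


def kl_sum(sentences, max_words):
    """Stage 2: greedy KL-Sum. Repeatedly add the sentence that minimizes
    KL(source distribution || summary distribution) within the word budget."""
    source = Counter(w for s in sentences for w in tokenize(s))
    vocab = list(source)
    summary, summary_counts, length = [], Counter(), 0
    candidates = list(sentences)
    while candidates:
        best, best_kl = None, float("inf")
        for s in candidates:
            toks = tokenize(s)
            if not toks or length + len(toks) > max_words:
                continue  # skip empty sentences and ones that exceed the budget
            kl = kl_divergence(source, summary_counts + Counter(toks), vocab)
            if kl < best_kl:
                best, best_kl = s, kl
        if best is None:
            break
        summary.append(best)
        summary_counts.update(tokenize(best))
        length += len(tokenize(best))
        candidates.remove(best)
    return summary


def two_stage_summary(sentences, query, keep, max_words):
    """Run retrieval, then generic KL-Sum on the retrieved subset only."""
    return kl_sum(retrieve(sentences, query, keep), max_words)
```

Varying `keep` while holding `max_words` fixed corresponds to changing how much of the cluster reaches the second stage, which is the quantity the comparison varies.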

One can observe here that while DUC datasets maintain a flat curve regardless of retrieval size, the TD-QFS curves decrease sharply as less relevant content is kept in the second stage of the QFS scheme.

Relevance-based QFS Models

Several relevance-based QFS models were presented in the paper:

KLThreshold
RelSum
Two-stage Method

Code for all the methods is available here.

This site is part of my (Tal Baumel) PhD research under the guidance of Prof. Michael Elhadad at the Ben Gurion Natural Language Processing Lab, Feb 2016.

	# Documents	# Sentences	# Tokens/Unique
Asthma	125	1,924	19,662/2,284
Lung Cancer	135	1,450	17,842/2,228
Obesity	289	1,615	21,561/2,907
Alzheimer's disease	191	1,163	14,813/2,508

	# Documents	# Tokens/Unique
Asthma	9	21 / 14
Lung Cancer	11	47 / 23
Obesity	12	36 / 24
Alzheimer's disease	8	19 / 18

	# Documents	# Tokens/Unique
Asthma	27	3,415 / 643
Lung Cancer	33	3,905 / 660
Obesity	36	3,912 / 899
Alzheimer's disease	24	2,866 / 680