Query Chain Focused Summarization (QCFS)


Background information about QCFS can be found in my thesis, presentation or paper.
A query chain is a sequence of queries submitted to a search engine that articulates the exploratory behavior of a person accessing an unfamiliar document collection. We extracted query chains by mining the search logs of the PubMed document collection. We identified sequences of queries that were most likely issued by non-professional users and that center on a given topic in multiple variations. The Asthma, Lung Cancer, Alzheimer's Disease and Obesity query chains used for the creation of our dataset can be found here.
We asked expert users (medical students) to generate summaries of the document sets given the query chains, under slightly different conditions. All the manual summaries can be found here.

We gathered clusters of documents from which answers to the queries in our chains can be extracted. We asked medical experts to gather such documents given the general topic (e.g., "asthma") from reliable sources such as NIH, Asthma NZ, U.S. National Library of Medicine, Wikipedia, Mayo Clinic, Alzheimer's Association, etc.

We then ran experiments in which expert users answered query chains by finding answers within the relevant document sets. The answers generated by the expert users are similar to "query focused summaries" of these document sets. The Asthma, Lung Cancer, Alzheimer's Disease and Obesity document sets used for the creation of our dataset can be found here.

To help human summarizers produce answers to the queries that are based on the content of the documents in the clusters, we provided a Web application. In this application, summarizers could search the documents with any query (starting from the queries in the chains, but it was possible to type any other query). Matching documents were ranked using standard TF*IDF relevance ranking. The summarizers could then select relevant sentences out of the matching documents. Finally, the summarizers could pick the best sentences from all candidates, order them, and manually edit the resulting answer to make it cohesive, coherent and appropriate to the query, within the maximum length of 250 words.
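The TF*IDF ranking step can be illustrated with a minimal sketch. This is not the code of the summarization aid site, which relies on a Solr server for search; the function names `tokenize` and `rank_documents`, the tokenization rule, and the exact weighting (length-normalized term frequency times a smoothed log inverse document frequency) are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (illustrative rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query, documents):
    """Return (document index, score) pairs sorted by descending TF*IDF score.

    Hypothetical helper: the weighting below is one common TF*IDF variant,
    not the scoring formula used by Solr.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for idx, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in term_freq:
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += (term_freq[term] / len(tokens)) * idf
        scores.append((idx, score))

    # sorted() is stable, so documents with equal scores keep their input order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

For example, calling `rank_documents("asthma symptoms", docs)` on a small list of document strings places documents that mention "asthma" or "symptoms" ahead of documents that mention neither, which receive a score of 0.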

This screenshot illustrates how the application works.

All the source code for the manual summarization aid site can be found here. To run the site you will also need a Solr server and the Flask Python library installed.

Source code and an IPython notebook for all the methods described in the thesis can be found here.
Baselines:
  • KLSum
  • LexRank
QCFS specialized algorithms:
  • KLSum-Update - An adaptation of KLSum that makes it sensitive to queries
  • LexRank-Focused - An adaptation of LexRank that makes it sensitive to queries
  • ChainSum - An LDA variant that classifies words as "relevant to the last query in a chain", followed by retrieval filtering and KLSum summarization
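To give a concrete idea of the KLSum baseline listed above, here is a minimal sketch of the general technique: greedily adding the sentence that keeps the summary's unigram distribution closest, in KL divergence, to that of the whole document set. It is not the implementation from the thesis or the notebook; the names `klsum` and `kl_divergence`, the tokenization, the additive smoothing constant, and the stopping rule are all assumptions made for illustration. The 250-word default mirrors the length limit given to the human summarizers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (illustrative rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def kl_divergence(doc_counts, summary_counts, vocab, smoothing=0.01):
    """KL(P_doc || P_summary) over unigrams, with additive smoothing on the summary."""
    doc_total = sum(doc_counts.values())
    summary_total = sum(summary_counts.values()) + smoothing * len(vocab)
    kl = 0.0
    for word in vocab:
        p = doc_counts[word] / doc_total
        q = (summary_counts[word] + smoothing) / summary_total
        kl += p * math.log(p / q)
    return kl


def klsum(sentences, max_words=250):
    """Greedy KLSum-style selection (hypothetical helper, assumed stopping rule).

    At each step, add the sentence that yields the lowest KL divergence
    between the document distribution and the summary distribution, and
    stop when no remaining sentence fits in the word budget.
    """
    tokenized = [tokenize(s) for s in sentences]
    doc_counts = Counter(word for tokens in tokenized for word in tokens)
    vocab = set(doc_counts)

    selected = []
    summary_counts = Counter()
    length = 0
    remaining = set(range(len(sentences)))

    while remaining:
        best, best_kl = None, None
        for i in sorted(remaining):
            # Skip empty sentences and sentences that would exceed the budget.
            if not tokenized[i] or length + len(tokenized[i]) > max_words:
                continue
            candidate = summary_counts + Counter(tokenized[i])
            kl = kl_divergence(doc_counts, candidate, vocab)
            if best_kl is None or kl < best_kl:
                best, best_kl = i, kl
        if best is None:
            break
        selected.append(best)
        summary_counts.update(tokenized[best])
        length += len(tokenized[best])
        remaining.remove(best)

    return [sentences[i] for i in selected]
```

The sentences are returned in the order they were selected, not in document order. A query-sensitive variant such as the ones listed above would need an additional relevance signal, which this sketch does not attempt to model.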
This site is part of my (Tal Baumel) M.Sc. thesis, written under the guidance of Prof. Michael Elhadad @ Ben Gurion Natural Language Processing Lab, Oct 2013.