This is a university project for a course on automatic text summarization. In the end, a text is processed with lexical chains to determine its low- to high-level topics; the sentences covered by the highest-rated chains are then extracted to form the summary.

Currently, I split the text into sentences and tokenize them. Then a part-of-speech tag is estimated for each token (noun, plural noun, verb, etc.). After that, the lemma of each token is produced, so that a plural noun, for example, yields the correct singular form for easier and more robust processing later on (e.g. trees -> tree, geese -> goose). In the screenshot below, a word printed in bold is identical to its lemma; otherwise its lemma is attached after the POS tag in small italics.
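
The post doesn't say which toolkit I use, so as a minimal sketch, here is how the same pipeline could look with NLTK and its WordNet lemmatizer (the resource names, the tag-mapping helper, and the sample sentence are my own assumptions, not the actual project code):

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (uncomment on first run):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')

text = "The geese rested under the old trees."

lemmatizer = WordNetLemmatizer()

for sentence in nltk.sent_tokenize(text):    # 1. sentence splitting
    tokens = nltk.word_tokenize(sentence)    # 2. tokenization
    for token, tag in nltk.pos_tag(tokens):  # 3. POS tagging (Penn Treebank tags)
        # Map the Treebank tag to the coarse POS the WordNet lemmatizer expects.
        if tag.startswith('VB'):
            pos = 'v'
        elif tag.startswith('JJ'):
            pos = 'a'
        elif tag.startswith('RB'):
            pos = 'r'
        else:
            pos = 'n'  # default: treat everything else as a noun
        lemma = lemmatizer.lemmatize(token.lower(), pos=pos)  # 4. lemmatization
        print(token, tag, lemma)  # e.g. geese NNS goose, trees NNS tree
```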

[Screenshot: the POS-tagged and lemmatized sample text, formatted as described above]

In the screenshot I filtered for nouns only, because nouns are the starting points for building the lexical chains.
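
Continuing the hypothetical NLTK sketch above, that noun filter could look like the following; the only assumption is that the Penn Treebank noun tags (NN, NNS, NNP, NNPS) are selected via their shared 'NN' prefix:

```python
# Keep only noun tokens and lemmatize them, yielding the
# candidate members for the lexical chains.
nouns = [lemmatizer.lemmatize(token.lower(), pos='n')
         for token, tag in nltk.pos_tag(nltk.word_tokenize(text))
         if tag.startswith('NN')]
print(nouns)  # e.g. ['goose', 'tree']
```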
