Project Readability Assessment

MORE ABOUT THE DISCOURSE FEATURES OF A TEXT, FROM THE THESIS OF LIJUN FENG:

>>Pitler and Nenkova (2008) analyzed discourse relations when addressing a readability-related problem: determining how well a text is written.
>>In her thesis, Feng deployed sophisticated NLP techniques to automatically extract four subsets of features from various linguistic levels, and studied their effectiveness for the readability prediction task.

>>In earlier work (Feng et al., 2009), the authors published results on entity-density features and lexical-chain features for readers with intellectual disabilities.

>>Pitler and Nenkova (2008) used the entity grid features to evaluate how well a text is written.

>>Feng's implementation of the entity-density features:

1/ Entities -> the union of named entities and the remaining general nouns (nouns and proper nouns) contained in a text.

2/ Feng used the open-source OpenNLP name-finding tool to extract named entities, such as names of persons, locations and organizations.

3/ Nouns were extracted by examining the leaf nodes of the output of Charniak's parser, where each leaf node is a pair of a word and its part-of-speech tag.

An example of how Charniak's parser parses a sentence can be seen on this site:

http://www.cse.iitb.ac.in/~jagadish/parser/evaluation.html
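
To make the leaf-node idea in step 3 concrete, here is a minimal sketch in Python. It assumes the parser output is a Penn Treebank-style bracketed string; the example parse is written by hand for illustration and is not real parser output, and the function names `leaves` and `general_nouns` are hypothetical.

```python
import re

# A leaf of a bracketed parse looks like "(TAG word)": no nested parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def leaves(parse):
    """Return (word, tag) pairs for every leaf of a bracketed parse string."""
    return [(word, tag) for tag, word in LEAF.findall(parse)]

def general_nouns(parse):
    """Keep the words whose tag starts with NN (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in leaves(parse) if tag.startswith("NN")]

# Hand-written example parse (an assumption, for illustration only).
example = "(S1 (S (NP (NNP John)) (VP (VBD saw) (NP (DT the) (NN dog))) (. .)))"
print(general_nouns(example))  # ['John', 'dog']
```

Filtering on the `NN` prefix is one simple way to cover both common and proper nouns; the thesis may have selected tags differently.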

Feng's method was as follows:
1/ First, extract general nouns based on their POS tags.

2/ Extract the named entities from the output of OpenNLP's name finder.

3/ Remove those general nouns that appear in the named entities.

4/ The remaining nouns are then joined with the named entities to form the complete set of entities.

5/ Based on the collected set of entities, Feng implemented 16 features, as described below:

Entity density features-----

total number of entities per document
total number of unique entities per document
percentage of entities per document
percentage of unique entities per document
average number of entities per sentence
average number of unique entities per sentence
percentage of named entities per document
average number of named entities per sentence
percentage of named entities in total entities
percentage of general nouns in total entities
percentage of general nouns per document
average number of general nouns per sentence
percentage of remaining nouns per document
average number of remaining nouns per sentence
percentage of overlapping nouns per document
average number of overlapping nouns per sentence

Explanation of the terms:
1// “overlapping nouns” refers to general nouns that appear in named entities;
2// “remaining nouns” refers to the set of general nouns with overlapping nouns removed.
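
The entity-set construction and a few of the density features above can be sketched as follows. This is an illustration under stated assumptions, not Feng's code: the input is already tagged and the named entities are already extracted, a noun counts as "overlapping" if it matches any word of a named entity, and percentages are taken over the total token count of the document (the notes do not state the denominator). The function name `entity_density` and the feature keys are hypothetical.

```python
def entity_density(sentences, named_entities):
    """Compute a few entity-density features.

    sentences: list of sentences, each a list of (word, tag) pairs.
    named_entities: list of named-entity mentions (strings).
    Assumes a non-empty document with at least one entity.
    """
    tokens = [pair for sent in sentences for pair in sent]
    nouns = [w for w, t in tokens if t.startswith("NN")]

    # Assumption: a general noun overlaps if it is a word of some named entity.
    ne_words = {w for ne in named_entities for w in ne.split()}
    overlapping = [n for n in nouns if n in ne_words]
    remaining = [n for n in nouns if n not in ne_words]

    # Entities = named entities joined with the remaining nouns.
    entities = list(named_entities) + remaining
    n_tok, n_sent = len(tokens), len(sentences)
    return {
        "total_entities": len(entities),
        "unique_entities": len(set(entities)),
        # Assumption: percentages are relative to the number of tokens.
        "pct_entities": 100.0 * len(entities) / n_tok,
        "avg_entities_per_sentence": len(entities) / n_sent,
        "pct_named_in_entities": 100.0 * len(named_entities) / len(entities),
        "overlapping_nouns": len(overlapping),
        "remaining_nouns": len(remaining),
    }

# Hand-made two-sentence document, for illustration only.
doc = [
    [("Barack", "NNP"), ("Obama", "NNP"), ("visited", "VBD"),
     ("a", "DT"), ("school", "NN"), (".", ".")],
    [("The", "DT"), ("school", "NN"), ("welcomed", "VBD"),
     ("Obama", "NNP"), (".", ".")],
]
features = entity_density(doc, ["Barack Obama", "Obama"])
print(features["total_entities"], features["unique_entities"])  # 4 3
```

The other features in the list follow the same pattern: a count of one of the four groups (entities, named entities, general nouns, remaining or overlapping nouns) divided by a per-document or per-sentence total.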

>>In the earlier work (Feng et al., 2009), only four entity-density features were implemented:

1/total number of entities per document
2/total number of unique entities per document
3/percentage of entities per document
4/percentage of unique entities per document.
.....................................................................................................................................

Lexical Chain Features:--

1> To better measure the working-memory burden that a text places on people with intellectual disabilities (ID), from the perspective of the semantic association of words (particularly nouns) during reading comprehension, Feng used the output of the lexical chaining tool “LexChainer” (Galley and McKeown, 2003) to build a set of lexical chain features.

2> LexChainer produces chains of words connected by six semantic relations: synonymy, hypernymy, hyponymy, meronymy, holonymy, and coordinate terms (siblings) (Galley and McKeown, 2003).

3> Figure: An example of a lexical chain.

34 transplant
35 operation
46 transplant
92 medication
98 operation
217 therapy

There are six semantically related words captured in this chain.

The numbers on the left indicate the token index of the corresponding word in the document.

Based on this example, the length of this chain is 6 (the number of words in it), and the span of the chain is 217 − 34 + 1 = 184 (the distance from the first to the last token index, inclusive).
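
The length and span of the example chain can be checked with a short sketch. The chain is represented here as a list of (token index, word) pairs, which is an assumed representation and not LexChainer's actual output format; the function names are hypothetical.

```python
# The example chain from the figure: (token index, word) pairs.
chain = [
    (34, "transplant"),
    (35, "operation"),
    (46, "transplant"),
    (92, "medication"),
    (98, "operation"),
    (217, "therapy"),
]

def chain_length(chain):
    """Number of words captured in the chain."""
    return len(chain)

def chain_span(chain):
    """Distance from the first to the last token index, inclusive."""
    indices = [index for index, _ in chain]
    return max(indices) - min(indices) + 1

print(chain_length(chain), chain_span(chain))  # 6 184
```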

...................................................................................................................................
Entity Grid Features:

>>Barzilay and Lapata (2008) reported that distributional properties of local entities generated by their grid models are useful in distinguishing original texts from their simplified versions when combined with well-studied lexical and syntactic features.

>>Feng implemented these entity grid features and studied their effectiveness in automatic readability assessment.
>>Barzilay and Lapata’s entity-grid model is based on the assumption that the distribution of entities in locally coherent texts exhibits certain regularities.

>>The entity grid is a two-dimensional array, with one dimension corresponding to the salient entities extracted from the text, and the other corresponding to each sentence of the text.
>>Each grid cell corresponds to the grammatical role of the specified entity in the specified sentence: whether it is a subject (S), object (O), neither of the two (X), or absent from the sentence (-).

>>Feng used the Brown Coherence Toolkit (v0.2) (Elsner et al., 2007), which was built based on the work of Lapata and Barzilay (2005), to generate entity grid representations for syntactically parsed texts.
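
The grid layout described above can be illustrated by hand, without the toolkit. The sketch below assumes the grammatical roles have already been assigned for each sentence; the roles, entities, and the function name `build_entity_grid` are made up for illustration and do not reflect the Brown Coherence Toolkit's interface or output format.

```python
def build_entity_grid(sentence_roles):
    """Build an entity grid from per-sentence role assignments.

    sentence_roles: one dict per sentence, mapping entity -> 'S', 'O' or 'X'.
    Returns a dict mapping each entity to its list of roles, one per
    sentence, with '-' where the entity is absent from the sentence.
    """
    entities = sorted({e for roles in sentence_roles for e in roles})
    return {e: [roles.get(e, "-") for roles in sentence_roles] for e in entities}

# Hand-assigned roles for a three-sentence text (illustrative assumption).
roles = [
    {"doctor": "S", "transplant": "O"},
    {"patient": "S", "medication": "O"},
    {"doctor": "S", "patient": "O", "hospital": "X"},
]

grid = build_entity_grid(roles)
for entity, cells in grid.items():
    print(f"{entity:<12}" + " ".join(cells))
```

Each printed row is one entity and each column is one sentence, so a row such as `S - S` for "doctor" reads: subject in sentence 1, absent in sentence 2, subject in sentence 3.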

...........................................................................

Thursday, March 3, 2011