Saturday, March 5, 2011

Predicting the fluency of text with shallow structural features


The paper described below is "Predicting the fluency of text with shallow structural features" by Jieun Chae and Ani Nenkova.


Sentence fluency is an important component of overall text readability, but few studies in natural language processing have sought to understand the factors that define it. Numerous natural language applications involve the task of producing fluent text. Considerations of sentence fluency are also key in sentence simplification, sentence compression, text regeneration and headline generation. Despite its importance, much more attention has been devoted to discourse-level constraints on adjacent sentences, indicative of coherence and good text flow. Perceived sentence fluency is influenced by many factors. The way the sentence fits in the context of surrounding sentences is one obvious factor. Another well-known factor is vocabulary use: the presence of uncommon, difficult words is known to pose problems to readers and to render text less readable. But these discourse- and vocabulary-level features measure properties at granularities different from the sentence level. Hence, several surface-level syntactic features were considered.

Charniak's parser was used to parse each sentence, and the following surface-level syntactic features were calculated from the parse:

1. Sentence length: in general, one would expect that shorter sentences are easier to read and thus are perceived as more fluent.

2. Parse tree depth: generally, longer sentences are syntactically more complex, which can slow processing and lead to lower perceived fluency of the sentence.

3. Number of fragment tags in the sentence parse, which indicate the presence of ungrammaticality in the sentence. Fragments occur in headlines (e.g. “Cheney willing to hold bilateral talks if Arafat observes U.S. cease-fire arrangement”).

4. Phrase type proportion was computed for prepositional phrases (PP), noun phrases (NP) and verb phrases (VP). The length in number of words of each phrase type was counted, then divided by sentence length.

Example: the longer the noun phrases, the less fluent the sentence is. Long noun phrases take longer to interpret and reduce sentence fluency/readability.

• [The dog] jumped over the fence and fetched the ball.

• [The big dog in the corner] fetched the ball.

Similarly, the length of verb phrases signals potential fluency problems.

- Most of the US allies in Europe publicly [object to invading Iraq]VP .

- But this [is dealing against some recent remarks of Japanese financial minister, Masajuro Shiokawa]VP.

VP distance (the average number of words separating two verb phrases) is also negatively correlated with sentence fluency.

Consider the following two sentences:

• In his state of the Union address, Putin also talked about the national development plan for this fiscal year and the domestic and foreign policies.

• Inside the courtyard of the television station, a reception team of 25 people was formed to attend to those who came to make donations in person.

5. Average phrase length is the number of words comprising a given type of phrase, divided by the number of phrases of this type.

6. Phrase type rate was also computed for PPs, VPs and NPs and is equal to the number of phrases of the given type that appeared in the sentence, divided by the sentence length.

7. Phrase length: the number of words in a PP, NP or VP, without any normalization; it is computed only for the largest phrases.

8. Length of NPs/PPs contained in a VP: the average number of words that constitute an NP or PP within a verb phrase, divided by the length of the verb phrase.
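
The features listed above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: it assumes the parse is available as a Penn-Treebank-style bracketed string (the kind of output Charniak's parser produces), and the function names (parse_bracketed, shallow_features) are made up for this example. It also assumes that nested phrases of the same type are each counted, which the paper summary does not spell out.

```python
from collections import defaultdict


def parse_bracketed(text):
    """Turn a bracketed parse string into nested (label, children) tuples.

    Leaves (the words) are plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()


def words(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def depth(node):
    """Number of labelled nodes on the longest path from this node to a word."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(child) for child in node[1])


def collect_phrase_lengths(node, lengths):
    """Record the length in words of every NP, VP and PP node in the tree."""
    if isinstance(node, str):
        return
    label, children = node
    if label in ("NP", "VP", "PP"):
        lengths[label].append(len(words(node)))
    for child in children:
        collect_phrase_lengths(child, lengths)


def shallow_features(parse):
    """Compute sentence length, tree depth and per-phrase-type statistics."""
    tree = parse_bracketed(parse)
    n = len(words(tree))
    lengths = defaultdict(list)
    collect_phrase_lengths(tree, lengths)
    feats = {"sentence_length": n, "tree_depth": depth(tree)}
    for label in ("NP", "VP", "PP"):
        phrase_lens = lengths[label]
        # Assumption: nested phrases of the same type are all counted.
        feats[label + "_proportion"] = sum(phrase_lens) / n
        feats[label + "_rate"] = len(phrase_lens) / n
        feats[label + "_avg_length"] = (
            sum(phrase_lens) / len(phrase_lens) if phrase_lens else 0.0
        )
    return feats


# Hand-written parse of a short sentence, used only as an example input.
parse = "(S (NP (DT The) (NN dog)) (VP (VBD fetched) (NP (DT the) (NN ball))))"
features = shallow_features(parse)
print(features)
```

For this toy sentence the two noun phrases cover 4 of the 5 words, so the NP proportion is 0.8 and the NP rate is 0.4.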


For all experiments they used four classifiers in Weka, evaluated quantitatively by classification accuracy: decision tree (J48), logistic regression, support vector machines (SMO), and multilayer perceptron.

Overall the best classifier was the multilayer perceptron. On the task of distinguishing machine from human translations, using all available data, the multilayer perceptron reached a classification accuracy of 86.99%. Hence the surface structural statistics can distinguish very well between fluent and non-fluent sentences when the examples come from human and machine-produced text respectively.

In pairwise comparison of sentences with different fluency, accuracy of predicting which of the two is better is 90% for the multi-layer perceptron classifier.

But the features correlated with fluency levels in machine-produced text (worst versus best machine translations) are not the same as those that distinguish between human and machine translations. Such results raise the need for caution when using assessments for machine-produced text to build a general model of fluency.

The discourse aspects (inferences, references, recall of prior knowledge) and language model features (vocabulary) proved to be much more important than text fluency in predicting overall text quality.

For future research it will be beneficial to build a dedicated corpus in which human-produced sentences are assessed for fluency.

Thursday, March 3, 2011

MORE ABOUT THE DISCOURSE FEATURES OF A TEXT FROM THE THESIS OF LIJUN FENG:

>>Pitler and Nenkova (2008) attempted to analyze discourse relations when addressing a readability-related problem, which is to determine how well a text is written.
>>In her thesis, Feng deployed sophisticated NLP techniques to extract four subsets of features automatically from various linguistic levels and studied their effectiveness for the readability prediction task.

>>In earlier work (Feng et al., 2009), results were published on entity-density features and lexical-chain features for readers with intellectual disabilities.

>>Pitler and Nenkova (2008) used the entity grid features to evaluate how well a text is written.

>>Entity-density features as implemented by Feng:

1/Entities -> the union of named entities and the remaining general nouns (nouns and proper nouns) contained in a text.

2/She used the open-source OpenNLP name-finding tool to extract named entities, such as names of persons, locations and organizations.

3/Nouns were extracted by examining the leaf nodes of the output of Charniak’s parser, where each leaf node consists of a pair of a word and its part-of-speech tag.

An example of how Charniak's parser parses a sentence can be seen at this site:

http://www.cse.iitb.ac.in/~jagadish/parser/evaluation.html

Feng's method was as follows:
1/First, extract general nouns based on their POS tags.

2/Extract the named entities from the output of OpenNLP’s name finder.

3/Remove those general nouns that appear in the named entities.

4/The remaining nouns are then joined with the named entities to form the complete set of entities.

5/Based on the collected set of entities, she implemented 16 features as described below:

Entity density features-----
  • total number of entities per document
  • total number of unique entities per document
  • percentage of entities per document
  • percentage of unique entities per document
  • average number of entities per sentence
  • average number of unique entities per sentence
  • percentage of named entities per document
  • average number of named entities per sentence
  • percentage of named entities in total entities
  • percentage of general nouns in total entities
  • percentage of general nouns per document
  • average number of general nouns per sentence
  • percentage of remaining nouns per document
  • average number of remaining nouns per sentence
  • percentage of overlapping nouns per document
  • average number of overlapping nouns per sentence

Explanation of the terms:
1//“overlapping nouns” refer to general nouns that appear in named entities;
2//“remaining nouns” refer to the set of general nouns with overlapping nouns removed.
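
The extraction steps and a handful of the features above can be sketched as follows. This is a rough illustration under stated assumptions, not Feng's implementation: the POS-tagged sentences and the named-entity mentions are typed in by hand here (in the thesis they come from Charniak's parser and OpenNLP), the function name entity_density is invented, and "percentage of entities per document" is assumed to mean entities divided by the number of words.

```python
def entity_density(tagged_sentences, named_entities):
    """Compute a few entity-density features for one document.

    tagged_sentences: list of sentences, each a list of (word, POS tag) pairs.
    named_entities: list of named-entity mentions (strings).
    Both inputs are assumed to be non-empty and produced elsewhere.
    """
    ne_words = {w.lower() for mention in named_entities for w in mention.split()}

    # Step 1: general nouns are tokens whose POS tag starts with "NN".
    general = [w for sent in tagged_sentences for (w, tag) in sent
               if tag.startswith("NN")]
    # Step 3: general nouns that already appear inside a named entity ...
    overlapping = [w for w in general if w.lower() in ne_words]
    # ... are removed, which leaves the remaining nouns.
    remaining = [w for w in general if w.lower() not in ne_words]
    # Step 4: entities = named entities joined with the remaining nouns.
    entities = list(named_entities) + remaining

    n_sentences = len(tagged_sentences)
    n_words = sum(len(sent) for sent in tagged_sentences)
    unique = {e.lower() for e in entities}
    return {
        "total_entities": len(entities),
        "unique_entities": len(unique),
        # Assumption: "percentage per document" = count / number of words.
        "pct_entities": len(entities) / n_words,
        "entities_per_sentence": len(entities) / n_sentences,
        "pct_named_in_entities": len(named_entities) / len(entities),
        "overlapping_nouns": len(overlapping),
        "remaining_nouns": len(remaining),
    }


# Toy input, hand-tagged for illustration only.
doc = [
    [("Mary", "NNP"), ("Smith", "NNP"), ("visited", "VBD"),
     ("the", "DT"), ("hospital", "NN")],
    [("The", "DT"), ("doctor", "NN"), ("greeted", "VBD"), ("Mary", "NNP")],
]
mentions = ["Mary Smith", "Mary"]
density = entity_density(doc, mentions)
print(density)
```

Here "Mary", "Smith" and the second "Mary" are overlapping nouns, "hospital" and "doctor" are remaining nouns, and the document has four entities in total.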

>>In her early work (Feng et al., 2009), only four entity-density features were implemented:

1/total number of entities per document
2/total number of unique entities per document
3/percentage of entities per document
4/percentage of unique entities per document.
.....................................................................................................................................

Lexical Chain Features:--

1>To better measure the working memory burden of a text for people with intellectual disabilities (ID) from the perspective of semantic association of words, particularly nouns, during reading comprehension, Feng used the output of a lexical chaining tool, “LexChainer” (Galley and McKeown, 2003), to build a set of lexical chain features.

2>LexChainer produces chains of words connected by six semantic relations: synonym, hypernym, hyponym, meronym, holonym and coordinate terms (siblings) (Galley and McKeown, 2003).


3>>Figure::An example of a lexical chain.

34   transplant
35   operation
46   transplant
92   medication
98   operation
217  therapy

  • There are six semantically related words captured in this chain.
  • The numbers on the left indicate the token index of the corresponding word in the document.

  • Based on this example, the length of this chain is 6, and the span of the chain is 217 − 34 + 1 = 184.
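
The length and span arithmetic for the example chain can be checked with a short sketch. The representation of a chain as a list of (token index, word) pairs is an assumption made for this example; it is not LexChainer's actual output format.

```python
# The example chain from the figure, as (token index, word) pairs.
chain = [
    (34, "transplant"),
    (35, "operation"),
    (46, "transplant"),
    (92, "medication"),
    (98, "operation"),
    (217, "therapy"),
]


def chain_length(chain):
    """Number of word occurrences captured in the chain."""
    return len(chain)


def chain_span(chain):
    """Distance in tokens from the first to the last chain member, inclusive."""
    indices = [index for index, _ in chain]
    return max(indices) - min(indices) + 1


print(chain_length(chain), chain_span(chain))
```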
...................................................................................................................................
Entity Grid Features:

>>Barzilay and Lapata (2008) have reported that distributional properties of local entities generated by their grid models are useful in distinguishing original texts from their simplified versions when combined with well-studied lexical and syntactic features.

>>Feng implemented these entity grid features and studied their effectiveness in automatic readability assessment.
>>Barzilay and Lapata’s entity-grid model is based on the assumption that the distribution of entities in locally coherent texts exhibits certain regularities.

>>The entity grid is a two-dimensional array, with one dimension corresponding to the salient entities extracted from the text, and the other corresponding to each sentence of the text.
>>Each grid cell corresponds to the grammatical role of the specified entity in the specified sentence: whether it is a subject (S), object (O), neither of the two (X), or absent from the sentence (-).
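
A small sketch of the entity grid as a data structure may help. It assumes the grammatical role of each entity in each sentence has already been determined (in the thesis this comes from a syntactic parse); the input format and the function name build_entity_grid are invented for this example.

```python
def build_entity_grid(sentence_roles):
    """Build an entity grid from per-sentence role assignments.

    sentence_roles: one dict per sentence, mapping entity -> "S", "O" or "X".
    Returns a dict mapping each entity to its list of roles across sentences,
    with "-" wherever the entity is absent from a sentence.
    """
    entities = sorted({e for roles in sentence_roles for e in roles})
    return {e: [roles.get(e, "-") for roles in sentence_roles] for e in entities}


# Hand-assigned roles for a made-up three-sentence text.
sentence_roles = [
    {"dog": "S", "ball": "O"},
    {"dog": "S", "fence": "X"},
    {"ball": "S"},
]
grid = build_entity_grid(sentence_roles)
for entity, roles in grid.items():
    print(f"{entity:6s} {' '.join(roles)}")
```

Each printed row is one entity and each position in the row is one sentence, so "dog" reads S S - because it is the subject of the first two sentences and missing from the third.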

>>Feng used the Brown Coherence Toolkit (v0.2) (Elsner et al., 2007), which was built based on the work of Lapata and Barzilay (2005), to generate the entity grid representation for syntactically parsed texts.
...........................................................................

Tuesday, March 1, 2011

Towards more understanding of Readability...

The day after the presentation, I realized that we need to take a step forward from the traditional readability metrics. The day started with a yummy, "home-made", heavy breakfast from Anusha's, Apoorva's and Bhuvan's lunch boxes :) followed by a project review, which was well presented by Akshatha, with me adding points here and there during the presentation. We then moved on to more reading of the thesis, but understanding the thesis really requires understanding the base papers. One supporting paper that I found helpful towards this understanding is

Paper: Revisiting Readability: A Unified Framework for Predicting Text Quality 2008

Authors: Emily Pitler, Ani Nenkova

The definition of what one might consider to be a well-written and readable text:


>>>depends heavily on the intended audience. Obviously, even a superbly written scientific paper will not be perceived as very readable by a lay person. Most previous work assumes that more common words are easier, so some metrics measured text readability by the percentage of words that were not among the N most frequent in the language. Later work used word length to approximate readability. This was improved upon by language models: for any given text, it is easy to compute its likelihood under a given language model, i.e. one for text meant for children, or for text meant for adults, or for a given grade level.

>>>depends on text coherence. Text coherence is defined as the ease with which a person understands a text. In many applications such as text generation and summarization, systems need to decide the order in which selected sentences or generated clauses should be presented to the user. Form of reference is also important in well-written text, and appropriate choices of form of reference lead to improved readability. Use of pronouns for reference is more desirable than the use of definite noun phrases.
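
The language-model idea mentioned above (score a text under a model for children's text and under a model for adult text, then compare) can be sketched with unigram models. This is a toy illustration, not the method of any of the cited papers: the two "corpora" are single made-up sentences, add-one smoothing is an assumption chosen to avoid zero probabilities, and the function names are invented.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count word frequencies; returns (counts, total number of tokens)."""
    counts = Counter(tokens)
    return counts, sum(counts.values())


def log_likelihood(tokens, model, vocab_size):
    """Log-likelihood of a token sequence under an add-one-smoothed unigram model."""
    counts, total = model
    return sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )


# Made-up training text standing in for two reading levels.
child_corpus = "the dog ran and the cat ran".split()
adult_corpus = "the committee evaluated the fiscal policy".split()
vocab_size = len(set(child_corpus) | set(adult_corpus))

child_model = train_unigram(child_corpus)
adult_model = train_unigram(adult_corpus)

text = "the dog ran".split()
ll_child = log_likelihood(text, child_model, vocab_size)
ll_adult = log_likelihood(text, adult_model, vocab_size)
print(ll_child, ll_adult)
```

The text is more likely under the model trained on the simpler corpus, which is the signal such readability measures rely on.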

The authors explain what goes into readability apart from surface linguistic features such as the average number of words, characters and syllables.

The paper discusses combining all 3 kinds of features, and the relative importance of each of the features below in determining text quality:

1. The use of rare words or technical terminology can make text difficult to read for certain audience types at the lexical level.
2. Syntactic complexity (parse tree height or the number of passive sentences) is associated with delayed processing time in understanding and is another factor that can decrease readability.
3. Text organization (discourse structure), relating more entities (entity coherence) and the form of referring expressions also determine readability.

In identifying which of the lexical, syntactic and discourse features have the greater impact in estimating readability, the following results were found.

>>>They tested the average number of characters per word, average number of words per sentence, maximum number of words per sentence, and article length, and found that longer articles are perceived as less well-written and harder to read than shorter ones. They used unigram language models (a unigram model only calculates the probability of each word in isolation, without considering any influence from the words before or after it) to predict the probability of an article; when the unigram language model was combined with the length of an article, it had higher predictive power.
>>>They examined the four syntactic features used in (Schwarm and Ostendorf, 2005): average parse tree height (F1), average number of noun phrases per sentence (F2), average number of verb phrases per sentence (F3), and average number of subordinate clauses per sentence (SBARs).
Having multiple noun phrases (entities) in each sentence requires the reader to remember more items, but may make the article more interesting.
Example of a noun phrase: A dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slave owners will be able to sit down together at a table of brotherhood
Including more verb phrases in each sentence increases the sentence complexity.
Example of verb phrases: As she was walking to the mall, she came across some old letters lying on the road.

>>>Discourse relations
A discourse relation is a description of how two segments of discourse (a continuous piece of spoken or written language) are logically connected to one another.
All of the discourse relations were annotated using the PDTB (Penn Discourse Treebank).
The Penn Discourse Treebank is the largest resource annotated with discourse relations.
In short, the PDTB annotates all the explicit relations, implicit relations and absences of a relation in a text. The log likelihood of the discourse relations and the log likelihood of the number of discourse relations in the text under a multinomial model were very highly and significantly correlated with readability ratings, especially after text length was taken into account.


>>>The entity grid was computed using the Brown Coherence Toolkit. Each text is represented by an entity grid, a two-dimensional array that captures the distribution of discourse entities across text sentences. The rows of the grid correspond to sentences, while the columns correspond to discourse entities. The discourse entities were the noun phrases. For each occurrence of a discourse entity in the text, the corresponding grid cell contains information about its grammatical role in the given sentence. Each grid column thus corresponds to a string over a set of categories reflecting the entity’s presence or absence in a sequence of sentences. The set of categories consists of four symbols: S (subject), O (object), X (neither subject nor object) and - (a gap, which signals the entity’s absence from a given sentence). The probability of each type of transition between these categories in the grid is calculated.
None of the individual entity grid features were significantly correlated with the readability ratings, but the combination of entity grid features had a greater impact on readability.
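
How transition probabilities can be read off an entity grid is sketched below. It is a simplified illustration, not the Brown Coherence Toolkit's code: the grid is written out by hand, only transitions of length 2 are counted (which gives 4 × 4 = 16 possible features), and the function name transition_probabilities is invented.

```python
from collections import Counter
from itertools import product

ROLES = "SOX-"  # subject, object, other, absent


def transition_probabilities(grid, n=2):
    """Probability of each length-n role transition across all grid columns.

    grid: dict mapping each entity to its list of roles, one per sentence.
    """
    counts = Counter()
    for column in grid.values():
        for i in range(len(column) - n + 1):
            counts[tuple(column[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {t: 0.0 for t in product(ROLES, repeat=n)}
    # Counter returns 0 for transitions that never occur.
    return {t: counts[t] / total for t in product(ROLES, repeat=n)}


# Hand-written grid for a made-up three-sentence text.
example_grid = {
    "dog":   ["S", "S", "-"],
    "ball":  ["O", "-", "S"],
    "fence": ["-", "X", "-"],
}
probs = transition_probabilities(example_grid)
print(probs[("S", "S")], probs[("S", "-")], probs[("O", "O")])
```

Each of the three entities contributes two transitions, so every transition that occurs once has probability 1/6, and the 16 values together form the feature vector for the text.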



Finally, the authors combine all the above features using linear regression (an approach to modelling the relationship between a dependent variable and one or more explanatory variables) to find the combination of features that best predicts readability. They found that vocabulary (language models), discourse features, sentence length and a combination of entity grid features together give the best regression result.

I hope the explanation of the above paper will take us towards a better understanding of readability assessment.


More ideas and more work to come along!!!!