Project Readability Assessment: Towards more understanding of Readability...

The day after the presentation, I realized that we need to take a step beyond the traditional readability metrics. The day started with a yummy, "home-made", heavy breakfast from Anusha's, Apoorva's and Bhuvan's lunch boxes :) followed by a project review, which was well presented by Akshatha, with me adding points here and there during the presentation. After that I went back to reading the thesis, but understanding the thesis really requires understanding its base papers. One supporting paper that I found helpful is:

Paper: Revisiting Readability: A Unified Framework for Predicting Text Quality (2008)

Authors: Emily Pitler, Ani Nenkova

What might one consider to be a well-written and readable text? The definition:

>>>heavily depends on the intended audience. Obviously, even a superbly written scientific paper will not be perceived as very readable by a lay person. The majority of previous work assumes that more common words are easier, so some metrics measured text readability by the percentage of words that were not among the N most frequent in the language. Word length was then also used to approximate readability. This was further improved by the use of language models: for any given text, it is easy to compute its likelihood under a given language model, e.g. one for text meant for children, one for text meant for adults, or one for a given grade level.
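To make the language model idea concrete, here is a minimal sketch of scoring one text under two unigram models. The two tiny "corpora", the add-one smoothing and the function names are my own assumptions for illustration; the paper trains on real corpora and does not prescribe this code.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens):
    """Return a function giving add-one smoothed unigram probabilities."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # assumption: one extra slot shared by unseen words

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size)

    return prob

def log_likelihood(text_tokens, prob):
    """Sum of log probabilities of every word in the text."""
    return sum(math.log(prob(w)) for w in text_tokens)

# Invented toy corpora standing in for "children" and "adult" training text.
children = "the cat sat on the mat the dog ran to the cat".split()
adults = "the committee deferred the vote on the fiscal policy amendment".split()

models = {"children": train_unigram(children), "adults": train_unigram(adults)}
text = "the dog sat on the mat".split()

scores = {name: log_likelihood(text, p) for name, p in models.items()}
best = max(scores, key=scores.get)  # the model under which the text is most likely
print(scores, best)
```

The text is assigned to whichever audience model gives it the higher (less negative) log likelihood.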

>>>depends on text coherence. Text coherence is defined as the ease with which a person understands a text. In many applications such as text generation and summarization, systems need to decide the order in which selected sentences or generated clauses should be presented to the user. The form of reference is also important in well-written text, and appropriate choices of the form of reference lead to improved readability. Using pronouns for reference is more desirable than using definite noun phrases.

The authors also explain what goes into readability beyond surface linguistic features such as the average number of words, characters and syllables.

The paper talks about combining all 3 kinds of features and about the relative importance of each of the features below in determining text quality:

1. The use of rare words or technical terminology can make text difficult to read for certain audience types at the lexical level.

2. Syntactic complexity (parse tree height or the number of passive sentences), which is associated with delayed processing time in understanding, is another factor that can decrease readability.

3. Text organization (discourse structure), how entities are related across sentences (entity coherence) and the form of referring expressions also determine readability.

To identify which of the lexical, syntactic and discourse features have the greatest impact on estimating readability, the authors tested each group of features. The following results were found.

>>>They tested the average number of characters per word, the average number of words per sentence, the maximum number of words per sentence and the article length, and found that longer articles are perceived as less well-written and harder to read than shorter ones. They used unigram language models (a unigram model only estimates the probability of each word in isolation, without considering any influence from the words before or after it) to compute the probability of an article. When the unigram language model was combined with the length of the article, it had higher predictive power.
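The four surface features listed above are easy to compute. Below is a rough sketch; the regex-based sentence and word splitting, and measuring article length in words, are my own simplifying assumptions and not necessarily what the authors did.

```python
import re

def baseline_features(article):
    """Compute the four surface features for a non-empty article string."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", article.strip()) if s]
    words_per_sentence = [re.findall(r"[A-Za-z0-9']+", s) for s in sentences]
    all_words = [w for ws in words_per_sentence for w in ws]
    return {
        "avg_chars_per_word": sum(len(w) for w in all_words) / len(all_words),
        "avg_words_per_sentence": len(all_words) / len(sentences),
        "max_words_per_sentence": max(len(ws) for ws in words_per_sentence),
        "article_length": len(all_words),  # assumption: length counted in words
    }

features = baseline_features("The cat sat. The dog ran away quickly!")
print(features)
```

A real pipeline would use a proper tokenizer and sentence splitter, since abbreviations like "e.g." break this naive split.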

>>>They examined the four syntactic features used in (Schwarm and Ostendorf, 2005): average parse tree height (F1), average number of noun phrases per sentence (F2), average number of verb phrases per sentence (F3), and average number of subordinate clauses (SBARs) per sentence.

Having multiple noun phrases (entities) in each sentence requires the reader to remember more items, but may make the article more interesting.

Example of a noun phrase: A dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slave owners will be able to sit down together at a table of brotherhood

Including more verb phrases in each sentence, on the other hand, increases the sentence complexity.

Example of verb phrases: As she was walking to the mall, she came across some old letters lying on the road.
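As a rough illustration of the syntactic features, the sketch below reads one Penn-Treebank-style bracketed parse and reports its height and its NP, VP and SBAR counts. The hand-written parse, the tiny parser and the height convention (a bare word has height 0) are my assumptions; in practice the parses come from a parser and the values are averaged over all sentences of an article.

```python
def parse_bracketed(s):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip ")"
            return (label, children)
        word = tokens[pos]                # leaf: a plain word
        pos += 1
        return word

    return read()

def height(tree):
    """Number of labelled nodes on the longest path from root to a word."""
    if isinstance(tree, str):
        return 0
    _, children = tree
    return 1 + max((height(c) for c in children), default=0)

def count_label(tree, target):
    """Count nodes whose label equals target (e.g. NP, VP, SBAR)."""
    if isinstance(tree, str):
        return 0
    label, children = tree
    return int(label == target) + sum(count_label(c, target) for c in children)

# Hand-written parse of "the cat sat on the mat" (illustrative only).
tree = parse_bracketed(
    "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)
syntactic = {
    "height": height(tree),
    "NP": count_label(tree, "NP"),
    "VP": count_label(tree, "VP"),
    "SBAR": count_label(tree, "SBAR"),
}
print(syntactic)
```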

>>>Discourse relations

A discourse relation is a description of how two segments of discourse (a continuous piece of spoken or written language) are logically connected to one another.

All of the discourse relations were annotated in the PDTB (Penn Discourse Treebank).

The Penn Discourse Treebank is the largest annotated resource for discourse relations.

In short, the PDTB annotates all the explicit relations, implicit relations and no-relation cases in a text. The log likelihood of the discourse relations and the log likelihood of the number of discourse relations in the text, under a multinomial model, were very highly and significantly correlated with readability ratings, especially after text length was taken into account.
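To get a feel for a multinomial log likelihood over relation counts, here is a minimal sketch. The three relation categories and their probabilities are invented for illustration, and the paper's exact formulation may include further terms (for example for the number of relations), so please treat this only as the core multinomial part.

```python
import math

def multinomial_log_likelihood(counts, probs):
    """log [ n! / (x_1! ... x_k!) * p_1^x_1 ... p_k^x_k ] for relation counts x_i.

    Assumes every probability in probs is strictly positive.
    """
    n = sum(counts.values())
    ll = math.lgamma(n + 1)               # log n!
    for relation, x in counts.items():
        ll -= math.lgamma(x + 1)          # minus log x_i!
        ll += x * math.log(probs[relation])
    return ll

# Invented probabilities; real ones would be estimated from the annotated corpus.
relation_probs = {"explicit": 0.5, "implicit": 0.4, "no_relation": 0.1}
article_counts = {"explicit": 2, "implicit": 1, "no_relation": 0}

ll = multinomial_log_likelihood(article_counts, relation_probs)
print(ll)
```

An article whose mix of relations is typical of the corpus gets a higher log likelihood than one with an unusual mix.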

>>>The entity grid was computed using the Brown Coherence Toolkit. Each text is represented by an entity grid, a two-dimensional array that captures the distribution of discourse entities across the sentences of the text. The rows of the grid correspond to sentences, while the columns correspond to discourse entities. The discourse entities are the noun phrases. For each occurrence of a discourse entity in the text, the corresponding grid cell contains information about its grammatical role in the given sentence. Each grid column thus corresponds to a string over a set of categories (S, O, X, -) reflecting the entity's presence or absence in a sequence of sentences. The set of categories consists of four symbols: S (subject), O (object), X (neither subject nor object) and - (a gap, which signals the entity's absence from a given sentence). The probability of each entity grid transition is then calculated.
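The grid described above can be sketched in a few lines. The three hand-annotated sentences are invented, and using transitions of length two as the features is my assumption about how the grid probabilities are computed; the toolkit itself extracts entities and roles from parsed text.

```python
from collections import Counter

# Each sentence maps an entity to its role: S (subject), O (object), X (other).
# Invented annotations for a three-sentence text.
sentences = [
    {"Microsoft": "S", "market": "X"},
    {"Microsoft": "O", "evidence": "S"},
    {"evidence": "X"},
]

def build_grid(sentences):
    """Rows are sentences, columns are entities, '-' marks an absent entity."""
    entities = sorted({e for s in sentences for e in s})
    grid = [[s.get(e, "-") for e in entities] for s in sentences]
    return entities, grid

def transition_probs(grid, length=2):
    """Relative frequency of each role transition read down the columns."""
    counts = Counter()
    for col in range(len(grid[0])):
        column = [row[col] for row in grid]
        for i in range(len(column) - length + 1):
            counts["".join(column[i:i + length])] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

entities, grid = build_grid(sentences)
probs = transition_probs(grid)
print(entities)
print(grid)
print(probs)
```

Each transition probability (for example "SO", subject in one sentence and object in the next) then serves as one feature of the text.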

However, none of the individual entity grid features was significantly correlated with the readability ratings. Only the combination of entity grid features had a greater impact on readability.

Finally, the authors combine all of the above features using linear regression (an approach to modelling the relationship between known input variables and an unknown output variable) to find the combination of features that predicts readability as accurately as possible, and found that:

vocabulary (language models), discourse features, sentence length and a combination of entity grid features together give the best result for regression.
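For anyone who wants to see what the regression step looks like, here is a small pure-Python sketch of ordinary least squares. The two feature columns and the ratings are invented numbers, and solving the normal equations by Gaussian elimination is just my choice to keep the example free of libraries; it is not the authors' setup.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns [intercept, w_1, ..., w_k]. Assumes X^T X is not singular.
    """
    rows = [[1.0] + list(r) for r in X]   # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(weights, features):
    """Predicted rating for one article's feature values."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

# Invented data: two feature values per article and a made-up readability rating.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]                       # generated as 1 + 2*x1 + 3*x2

weights = fit_linear_regression(X, y)
print(weights, predict(weights, [2, 2]))
```

With real data, each row of X would hold one article's features (language model score, discourse score and so on) and y the human readability ratings.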

I hope the explanation of the above paper will take us towards a better understanding of readability assessment.

More ideas and more work to come along!!!!

Tuesday, March 1, 2011
