Paper: Revisiting Readability: A Unified Framework for Predicting Text Quality (2008)
Authors: Emily Pitler, Ani Nenkova
The definition of what one might consider to be a well-written and readable text:
>>>depends heavily on the intended audience. Even a superbly written scientific paper will not be perceived as very readable by a lay person. Most previous work assumes that more common words are easier, so some metrics measured text readability by the percentage of words that were not among the N most frequent in the language. Later metrics used word length to approximate readability. This line of work was then improved by using language models: for any given text, it is easy to compute its likelihood under a given language model, e.g. one for text meant for children, one for text meant for adults, or one for a given grade level.
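The language-model idea can be sketched roughly as follows. This is only an illustration, assuming add-one smoothing and two tiny made-up corpora; the names `train_unigram` and `log_likelihood` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Return (word counts, total token count) for a training corpus."""
    counts = Counter(tokens)
    return counts, sum(counts.values())

def log_likelihood(tokens, model, vocab_size):
    """Log-likelihood of tokens under a unigram model.

    Add-one smoothing is an assumption made here so that unseen
    words do not get zero probability.
    """
    counts, total = model
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in tokens)

# Toy corpora, invented purely for illustration.
child_corpus = "the cat sat on the mat the dog ran".split()
adult_corpus = "the committee approved the quarterly fiscal report".split()

vocab = set(child_corpus) | set(adult_corpus)
child_lm = train_unigram(child_corpus)
adult_lm = train_unigram(adult_corpus)

text = "the dog sat on the mat".split()
ll_child = log_likelihood(text, child_lm, len(vocab))
ll_adult = log_likelihood(text, adult_lm, len(vocab))

# The model under which the text is more likely suggests its audience.
best = "child" if ll_child > ll_adult else "adult"
print(best)
```

In a real setting each model would be trained on a large corpus written for the audience in question, and the smoothing scheme would need more care.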
>>>depends on text coherence. Text coherence is defined as the ease with which a person understands a text. In many applications, such as text generation and summarization, systems need to decide the order in which selected sentences or generated clauses are presented to the user. The form of reference also matters in well-written text: appropriate choices of referring expression improve readability. For example, using pronouns for reference is perceived as more desirable than using definite noun phrases.
The authors explain what contributes to readability beyond surface linguistic features such as the average number of words, characters, and syllables.
The paper looks at combining the three kinds of features below and at the relative importance of each in determining text quality:
1. At the lexical level, the use of rare words or technical terminology can make text difficult to read for certain audiences.
2. Syntactic complexity (e.g. parse tree height or the number of passive sentences) is associated with delayed processing time in understanding, and is another factor that can decrease readability.
3. Text organization (discourse structure), how entities are carried from sentence to sentence (entity coherence), and the form of referring expressions also determine readability.
The following results were found on which of the lexical, syntactic, and discourse features have the greatest impact on estimated readability.
>>>Among the baseline measures tested (average number of characters per word, average number of words per sentence, maximum number of words per sentence, and article length), they found that longer articles are perceived as less well-written and harder to read than shorter ones. For vocabulary, they used unigram language models to compute the probability of an article. (A unigram model assigns each word a probability on its own, without considering the words before or after it.) Combining the unigram language model with article length gave higher predictive power.
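The four baseline measures can be computed with a few lines of code. This is a minimal sketch, assuming a naive regex-based sentence and word splitter and assuming article length is counted in words; the function name `surface_features` is hypothetical.

```python
import re

def surface_features(article):
    """Compute the four baseline surface features for one article."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", article.strip()) if s]
    words_per_sentence = [re.findall(r"[A-Za-z']+", s) for s in sentences]
    words = [w for ws in words_per_sentence for w in ws]
    return {
        "avg_chars_per_word": sum(len(w) for w in words) / len(words),
        "avg_words_per_sentence": len(words) / len(sentences),
        "max_words_per_sentence": max(len(ws) for ws in words_per_sentence),
        "article_length": len(words),  # length measured in words here
    }

features = surface_features("The cat sat. The dog ran away quickly.")
print(features)
```

A proper tokenizer and sentence splitter would give more reliable counts on real news text.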
>>>They examined the four syntactic features used in (Schwarm and Ostendorf, 2005): average parse tree height (F1), average number of noun phrases per sentence (F2), average number of verb phrases per sentence (F3), and average number of subordinate clauses (SBARs) per sentence.
Having multiple noun phrases (entities) in each sentence requires the reader to remember more items, but may make the article more interesting.
Example of a noun phrase: A dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slave owners will be able to sit down together at a table of brotherhood
Including more verb phrases in each sentence, on the other hand, increases the sentence complexity.
Example of verb phrases: As she was walking to the mall, she came across some old letters lying on the road.
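To make the syntactic features concrete, here is a rough sketch that reads one Penn-Treebank-style bracketed parse and counts the quantities involved. The parse string is hand-written for illustration, the helper names are hypothetical, and the height convention (a bare word has height 0) is an assumption; parsers and toolkits differ on this.

```python
import re

def parse_tree(bracketed):
    """Parse a bracketed parse string into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            node = [tokens[pos]]      # constituent label
            pos += 1
            while tokens[pos] != ")":
                node.append(read())
            pos += 1                  # skip the closing bracket
            return node
        word = tokens[pos]            # a leaf word
        pos += 1
        return word

    return read()

def height(node):
    """Height of the tree; a bare word counts as 0 (assumed convention)."""
    if isinstance(node, str):
        return 0
    return 1 + max((height(c) for c in node[1:]), default=0)

def count_label(node, label):
    """Count constituents whose label matches exactly (NP-SBJ would not match NP)."""
    if isinstance(node, str):
        return 0
    own = 1 if node[0] == label else 0
    return own + sum(count_label(c, label) for c in node[1:])

parse = ("(S (NP (DT The) (NN dog)) (VP (VBD said) "
         "(SBAR (IN that) (S (NP (PRP it)) (VP (VBD slept))))))")
tree = parse_tree(parse)
features = {
    "tree_height": height(tree),
    "noun_phrases": count_label(tree, "NP"),
    "verb_phrases": count_label(tree, "VP"),
    "sbars": count_label(tree, "SBAR"),
}
print(features)
```

Averaging these per-sentence values over all sentences of an article gives the four article-level features.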
>>>Discourse relations
A discourse relation describes how two segments of discourse (a continuous piece of spoken or written language) are logically connected to one another.
All of the discourse relations were taken from the annotations of the Penn Discourse Treebank (PDTB), the largest resource annotated with discourse relations.
In short, the PDTB annotates explicit relations, implicit relations, and the absence of a relation in a text. Under a multinomial model, the log likelihood of a text's discourse relations and the log likelihood of its number of discourse relations were highly and significantly correlated with readability ratings, especially after text length was taken into account.
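A multinomial log likelihood over relation types can be sketched as below. The relation probabilities are made-up values for illustration (in practice they would be estimated from annotated data), the function name is hypothetical, and only the relation-type term is shown, not the term for the number of relations.

```python
import math
from collections import Counter

def multinomial_log_likelihood(relations, probs):
    """Log of n!/(x1!...xk!) * p1^x1 * ... * pk^xk for the observed relations."""
    counts = Counter(relations)
    n = len(relations)
    ll = math.lgamma(n + 1)                       # log n!
    for rel, x in counts.items():
        ll += x * math.log(probs[rel])            # log p_i^x_i
        ll -= math.lgamma(x + 1)                  # minus log x_i!
    return ll

# Hypothetical relation probabilities, not values from the paper.
probs = {"Expansion": 0.5, "Comparison": 0.25, "Contingency": 0.25}
relations = ["Expansion", "Comparison"]

ll = multinomial_log_likelihood(relations, probs)
print(ll)
```

Longer texts contain more relations and so get lower likelihoods almost by construction, which is one way to see why accounting for text length matters here.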
>>>The entity grid was computed using the Brown Coherence Toolkit. Each text is represented by an entity grid, a two-dimensional array that captures the distribution of discourse entities across the sentences of the text. The rows of the grid correspond to sentences, while the columns correspond to discourse entities; here the discourse entities are the noun phrases. For each occurrence of a discourse entity in the text, the corresponding grid cell records its grammatical role in the given sentence. Each grid column is thus a string over a set of four categories reflecting the entity’s presence or absence in a sequence of sentences: S (subject), O (object), X (neither subject nor object) and – (a gap, which signals the entity’s absence from a given sentence). The probability of each type of transition between categories in the grid is then calculated.
None of the individual entity grid features was significantly correlated with the readability ratings, but the entity grid features in combination had a greater impact on readability.
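As an illustration of entity grid features, the sketch below computes the probability of each length-two transition (for example S followed by O) from a tiny hand-made grid. This is an assumed simplification, not the toolkit's actual implementation; the function name is hypothetical and a plain ASCII "-" stands in for the gap symbol.

```python
from collections import Counter
from itertools import product

def transition_probabilities(grid):
    """grid: one string per sentence (row), one symbol per entity (column).

    Returns the probability of each of the 16 two-symbol transitions
    read down the columns of the grid.
    """
    counts = Counter()
    for column in zip(*grid):                     # one column per entity
        for a, b in zip(column, column[1:]):      # adjacent sentences
            counts[a + b] += 1
    total = sum(counts.values())
    return {a + b: counts[a + b] / total for a, b in product("SOX-", repeat=2)}

# Three sentences, three entities; invented for illustration.
grid = [
    "SO-",
    "S-X",
    "-OX",
]
probs = transition_probabilities(grid)
print(probs["SS"], probs["OO"])
```

Each of the 16 probabilities can then serve as one feature of the text.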
Finally, the authors combine all of the above features using linear regression (an approach to modelling the relationship between a dependent variable and one or more explanatory variables) to find the combination of features that best predicts readability. They found that
vocabulary (language models), discourse features, sentence length, and a combination of entity grid features together give the best regression result.
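For readers who have not used linear regression before, here is a small self-contained sketch of fitting one by ordinary least squares. The solver and the numbers are only there to exercise the code; they are made up and are not data or results from the paper, and `fit_linear_regression` is a hypothetical name.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations.

    X: list of feature tuples, y: list of target values.
    Returns [intercept, w1, w2, ...].
    """
    rows = [[1.0] + list(r) for r in X]           # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(k):
            if r != c:
                factor = M[r][c]
                M[r] = [a - factor * b for a, b in zip(M[r], M[c])]
    return [M[i][k] for i in range(k)]

# Made-up feature values (two features per text) and ratings,
# constructed so that rating = 1 + 2*f1 - 3*f2 exactly.
X = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [3, -2, 0, 2, -3]

coef = fit_linear_regression(X, y)
print(coef)
```

With real data, each row would hold one article's feature values and `y` would hold its readability rating; comparing fits with different feature subsets shows which combination predicts the ratings best.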
I hope this explanation of the paper gives us a better understanding of readability assessment.
More ideas and more work to come!