Saturday, March 5, 2011

Predicting the fluency of text with shallow structural features


The paper described below is "Predicting the fluency of text with shallow structural features" by Jieun Chae and Ani Nenkova.


Sentence fluency is an important component of overall text readability, but few studies in natural language processing have sought to understand the factors that define it. Numerous natural language applications involve the task of producing fluent text. Consideration of sentence fluency is also key in sentence simplification, sentence compression, text regeneration and headline regeneration. Despite its importance, much more attention has been devoted to discourse-level constraints on adjacent sentences, indicative of coherence and good text flow. Perceived sentence fluency is influenced by many factors. The way the sentence fits in the context of surrounding sentences is one obvious factor. Another well-known factor is vocabulary use: the presence of uncommon, difficult words is known to pose problems to readers and to render text less readable. But these discourse- and vocabulary-level features measure properties at granularities different from the sentence level; hence several syntactic surface-level features were considered.

The Charniak parser was used to parse each sentence, and several syntactic surface-level features were computed from the parses, as given below:

1. Sentence length: in general one would expect shorter sentences to be easier to read and thus perceived as more fluent.

2. Parse tree depth: longer sentences are generally syntactically more complex, which can slow processing and lead to lower perceived fluency of the sentence.

3. Number of fragment tags in the sentence parse, indicating the presence of ungrammaticality in the sentence. Fragments occur in headlines (e.g. “Cheney willing to hold bilateral talks if Arafat observes U.S. cease-fire arrangement”).

4. Phrase type proportion: computed for prepositional phrases (PP), noun phrases (NP) and verb phrases (VP). The length in words of each phrase type is counted, then divided by sentence length.

Example: the longer the noun phrases, the less fluent the sentence. Long noun phrases take longer to interpret and reduce sentence fluency/readability.

• [The dog] jumped over the fence and fetched the ball.

• [The big dog in the corner] fetched the ball.

Similarly, the length of verb phrases signals potential fluency problems.

- Most of the US allies in Europe publicly [object to invading Iraq]VP .

- But this [is dealing against some recent remarks of Japanese financial minister, Masajuro Shiokawa]VP.

VP distance (the average number of words separating two verb phrases) is also negatively correlated with sentence fluency.

Consider the following two sentences:

• In his state of the Union address, Putin also talked about the national development plan for this fiscal year and the domestic and foreign policies.

• Inside the courtyard of the television station, a reception team of 25 people was formed to attend to those who came to make donations in person.

5. Average phrase length: the number of words comprising a given type of phrase, divided by the number of phrases of this type.

6. Phrase type rate: also computed for PPs, VPs and NPs; equal to the number of phrases of the given type appearing in the sentence, divided by the sentence length. Related features are phrase length, i.e. the number of words in a PP, NP or VP without any normalization, computed only for the largest phrases; and length of NPs/PPs contained in a VP, i.e. the average number of words constituting an NP or PP within a verb phrase, divided by the length of the verb phrase.
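Several of the shallow features above can be sketched in code. This is a minimal illustration only, assuming parse trees represented as nested Python lists (`[label, child, ...]`, with leaves as `[POS, word]`); the paper itself obtained parses from the Charniak parser.

```python
def leaves(tree):
    """Return the words at the leaves of a (sub)tree."""
    if len(tree) == 2 and isinstance(tree[1], str):   # leaf: [POS, word]
        return [tree[1]]
    return [w for child in tree[1:] for w in leaves(child)]

def depth(tree):
    """Depth of the parse tree, counting a leaf as depth 1."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return 1
    return 1 + max(depth(child) for child in tree[1:])

def phrase_proportion(tree, label):
    """Words covered by the largest phrases of `label`, over sentence length."""
    def covered(t):
        if len(t) == 2 and isinstance(t[1], str):
            return 0
        if t[0] == label:
            return len(leaves(t))   # largest phrase: stop recursing here
        return sum(covered(c) for c in t[1:])
    return covered(tree) / len(leaves(tree))

# hand-made parse of "The dog jumped over the fence"
parse = ["S",
         ["NP", ["DT", "The"], ["NN", "dog"]],
         ["VP", ["VBD", "jumped"],
                ["PP", ["IN", "over"],
                       ["NP", ["DT", "the"], ["NN", "fence"]]]]]

print(len(leaves(parse)))              # sentence length: 6
print(depth(parse))                    # parse tree depth: 5
print(phrase_proportion(parse, "NP"))  # NP words / sentence length: 4/6
```

The "largest phrase" convention matches the feature description above: an NP nested inside another NP is not counted twice.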


For all experiments they used four classifiers from Weka (evaluation was quantitative, i.e. measuring accuracy)—decision tree (J48), logistic regression, support vector machines (SMO), and multilayer perceptron.

Overall the best classifier was the multilayer perceptron. Using all available data, it distinguished machine from human translations with 86.99% accuracy. Hence surface structural statistics can distinguish very well between fluent and non-fluent sentences when the examples come from human- and machine-produced text respectively.

In pairwise comparison of sentences with different fluency, the accuracy of predicting which of the two is better is 90% for the multilayer perceptron classifier.

But the features correlated with fluency levels in machine-produced text (distinguishing the worst from the best machine translations) are not the same as those that distinguish between human and machine translations. Such results call for caution when using assessments of machine-produced text to build a general model of fluency.

The discourse aspects (inferences, references, recall of prior knowledge) and language-model features (vocabulary) proved to be much more important than text fluency in predicting overall text quality.

For future research it will be beneficial to build a dedicated corpus in which human-produced sentences are assessed for fluency.

Thursday, March 3, 2011

MORE ABOUT THE DISCOURSE FEATURES OF A TEXT FROM THE THESIS OF LIJUN FENG:

>>Pitler and Nenkova (2008) attempted to analyze discourse relations when addressing a readability-related problem: determining how well a text is written.
>>In her thesis, Feng deployed sophisticated NLP techniques to extract four subsets of features automatically from various linguistic levels and studied their effectiveness for the readability prediction task.

>>In her early work (Feng et al., 2009), she published results on entity-density features and lexical-chain features for readers with intellectual disabilities.

>>Pitler and Nenkova (2008) used the entity grid features to evaluate how well a text is written.

>>Entity density features, as implemented by Lijun Feng:

1/ Entities are the union of the named entities and the rest of the general nouns (nouns and proper nouns) contained in a text.

2/ They used OpenNLP's open-source name-finding tool to extract named entities, such as names of persons, locations and organizations.

3/ Nouns were extracted by examining the leaf nodes in the output of the Charniak parser, where each leaf node consists of a pair of a word and its part-of-speech tag.

An example of how the Charniak parser parses a sentence can be seen at this site:

http://www.cse.iitb.ac.in/~jagadish/parser/evaluation.html

Feng's method was as follows:
1/ First extract general nouns based on their POS tags.

2/ Extract the named entities from the output of OpenNLP's name finder.

3/ Remove those general nouns that appear in the named entities.

4/ The remaining nouns are then joined with the named entities to form the complete set of entities.

5/ Based on the collected set of entities, they implemented 16 features as described below:

Entity density features-----
  • total number of entities per document
  • total number of unique entities per document
  • percentage of entities per document
  • percentage of unique entities per document
  • average number of entities per sentence
  • average number of unique entities per sentence
  • percentage of named entities per document
  • average number of named entities per sentence
  • percentage of named entities in total entities
  • percentage of general nouns in total entities
  • percentage of general nouns per document
  • average number of general nouns per sentence
  • percentage of remaining nouns per document
  • average number of remaining nouns per sentence
  • percentage of overlapping nouns per document
  • average number of overlapping nouns per sentence

Explanation of the terms:
1/ “overlapping nouns” refers to general nouns that appear in named entities;
2/ “remaining nouns” refers to the set of general nouns with overlapping nouns removed.
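As a rough sketch of how such counts could be computed (not Feng's actual implementation), assuming POS-tagged sentences and a pre-extracted named-entity list; the tagged sentences here are hand-made:

```python
def entity_features(tagged_sents, named_entities):
    """tagged_sents: list of sentences, each a list of (word, POS) pairs."""
    ne_tokens = {tok for ne in named_entities for tok in ne.split()}
    # general nouns: anything tagged NN, NNS, NNP, NNPS
    general = [w for sent in tagged_sents for (w, pos) in sent
               if pos.startswith("NN")]
    remaining = [w for w in general if w not in ne_tokens]   # overlap removed
    entities = list(named_entities) + remaining              # final entity set
    n_sents = len(tagged_sents)
    return {
        "total entities per document": len(entities),
        "unique entities per document": len(set(entities)),
        "avg entities per sentence": len(entities) / n_sents,
        "pct named entities in total": len(named_entities) / len(entities),
    }

sents = [[("Obama", "NNP"), ("visited", "VBD"), ("the", "DT"), ("city", "NN")],
         [("The", "DT"), ("president", "NN"), ("spoke", "VBD")]]
feats = entity_features(sents, ["Obama"])
print(feats["total entities per document"])  # Obama + city + president = 3
print(feats["avg entities per sentence"])    # 3 / 2 sentences = 1.5
```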

>>In her early work (Feng et al., 2009), she implemented only four entity-density features:

1/total number of entities per document
2/total number of unique entities per document
3/percentage of entities per document
4/percentage of unique entities per document.
.....................................................................................................................................

Lexical Chain Features:--

1> To better measure the working-memory burden of a text for people with ID from the perspective of the semantic association of words, particularly nouns, during reading comprehension, they used the output of the lexical chaining tool “LexChainer” (Galley and McKeown, 2003) to build a set of lexical-chain features.

2> LexChainer produces chains of words connected by six semantic relations: synonymy, hypernymy, hyponymy, meronymy, holonymy and coordinate terms (siblings) (Galley and McKeown, 2003).


3>> Figure: an example of a lexical chain.

34  transplant
35  operation
46  transplant
92  medication
98  operation
217 therapy

  • There are six semantically related words captured in this chain.
  • The numbers on the left indicate the token index of the corresponding word in the document.

  • Based on this example, the length of this chain is 6 and the span of the chain is 217 − 34 + 1 = 184.
...................................................................................................................................
Entity Grid Features:

>>Barzilay and Lapata (2008) have reported that distributional properties of local entities generated by their grid models are useful in detecting original texts from their simplified versions when combined with well studied lexical and syntactic features.

>>Feng implemented these entity-grid features and studied their effectiveness in automatic readability assessment.
>>Barzilay and Lapata’s entity-grid model is based on the assumption that the distribution of entities in locally coherent texts exhibits certain regularities.

>>The entity grid is a two-dimensional array, with one dimension corresponding to the salient entities extracted from the text, and the other corresponding to each sentence of the text.
>>Each grid cell records the grammatical role of the specified entity in the specified sentence: whether it is a subject (S), object (O), neither of the two (X), or absent from the sentence (-).

>>Feng used the Brown Coherence Toolkit (v0.2) (Elsner et al., 2007), which was built based on the work of Lapata and Barzilay (2005), to generate entity-grid representations for syntactically parsed texts.
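A toy version of grid construction might look as follows. The grammatical roles are supplied by hand here (the thesis obtains them from the Brown Coherence Toolkit), and the entity names are purely illustrative:

```python
def build_grid(sent_roles, entities):
    """Build an entity grid.

    sent_roles: one dict per sentence mapping entity -> 'S'|'O'|'X'.
    Returns a dict mapping each entity to its role sequence across
    sentences, with '-' marking absence.
    """
    return {e: [roles.get(e, "-") for roles in sent_roles] for e in entities}

# hypothetical three-sentence text with two tracked entities
sents = [{"Microsoft": "S", "market": "O"},
         {"Microsoft": "O"},
         {"market": "S"}]
grid = build_grid(sents, ["Microsoft", "market"])
print(grid["Microsoft"])  # ['S', 'O', '-']
print(grid["market"])     # ['O', '-', 'S']
```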
...........................................................................

Tuesday, March 1, 2011

Towards more understanding of Readability...

The day after the presentation, I realized that we need to take a step forward from the traditional readability metrics. The day started with a yummy, "home-made", heavy breakfast from Anusha, Apoorva and Bhuvan's lunch boxes :) followed by a project review which was well presented by Akshatha, with myself adding points here and there during the presentation, and moved on again with more reading of the thesis. But actually, understanding the thesis needs an understanding of the base papers. One of the supporting papers which I found would be helpful towards that understanding is

Paper: Revisiting Readability: A Unified Framework for Predicting Text Quality 2008

Authors: Emily Pitler, Ani Nenkova

What defines a well-written and readable text?


>>>It heavily depends on the intended audience. Obviously, even a superbly written scientific paper will not be perceived as very readable by a lay person. The majority of previous work holds that more common words are easier, so some metrics measured text readability by the percentage of words that were not among the N most frequent in the language. This was followed by considering word length to approximate readability. The work was improved by the use of language models: for any given text, it is easy to compute its likelihood under a given language model, e.g. one for text meant for children, for text meant for adults, or for a given grade level.

>>>Readability also depends on text coherence. Text coherence is defined as the ease with which a person understands a text. In many applications such as text generation and summarization, systems need to decide the order in which selected sentences or generated clauses should be presented to the user. Form of reference is also important in well-written text, and appropriate choices of form of reference lead to improved readability: use of pronouns for reference is more desirable than the use of definite noun phrases.

And what goes into readability, apart from surface linguistic features such as the average number of words, characters and syllables, is explained by the authors.

The paper speaks about combining all three feature types and the relative importance of each of the features below in determining text quality:

1.The use of rare words or technical terminology can make text difficult to read for certain audience types at the lexical level.
2.Syntactic complexity (parse tree height or the number of passive sentences), which is associated with delayed processing time in understanding, is another factor that can decrease readability.
3. Text organization (discourse structure), relating more entities (entity coherence) and the form of referring expressions also determine readability.

In identifying which of the lexical, syntactic and discourse features has the greater impact in estimating readability, the following results were found.

>>>They tested the average number of characters per word, average number of words per sentence, maximum number of words per sentence, and article length, and found that longer articles are perceived as less well-written and harder to read than shorter ones. They used unigram language models (unigram models only calculate the probability of an isolated word, without considering any influence from the words before or after the target word) to predict the probability of an article; when this unigram language model was combined with the length of an article, it had higher predictive power.
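A hedged sketch of the unigram language-model idea: scoring a text by its log-likelihood under unigram probabilities estimated from a background corpus. The corpus here is tiny and invented, with add-one smoothing for unseen words; the paper's models were trained on real article collections.

```python
import math
from collections import Counter

def unigram_logprob(article_tokens, corpus_tokens):
    """Log-likelihood of a token sequence under an add-one-smoothed unigram model."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    vocab = len(counts) + 1   # +1 slot for unseen words
    return sum(math.log((counts[w] + 1) / (total + vocab))
               for w in article_tokens)

corpus = "the cat sat on the mat".split()   # invented background corpus
easy = "the cat".split()
hard = "quantum chromodynamics".split()
# familiar words score higher (less negative) than unseen technical ones
print(unigram_logprob(easy, corpus) > unigram_logprob(hard, corpus))  # True
```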
>>>They examined the four syntactic features used in (Schwarm and Ostendorf, 2005): average parse tree height (F1), average number of noun phrases per sentence (F2), average number of verb phrases per sentence (F3), and average number of subordinate clauses per sentence (SBARs).
Having multiple noun phrases (entities) in each sentence requires the reader to remember more items, but may make the article more interesting.
Example of a noun phrase: A dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slave owners will be able to sit down together at a table of brotherhood.
Including more verb phrases in each sentence, on the other hand, increases sentence complexity.
Example of verb phrases: As she was walking to the mall, she came across some old letters lying on the road.

>>>Discourse relations
A discourse relation is a description of how two segments of discourse (a continuous piece of spoken or written language) are logically connected to one another.
All of the discourse relations were annotated using the Penn Discourse Treebank (PDTB), the largest annotated resource of its kind.
In short, the PDTB annotates all the explicit relations, implicit relations and no-relations in a text. The log likelihood of discourse relations and the log likelihood of the number of discourse relations in the text under a multinomial model were very highly and significantly correlated with readability ratings, especially after text length was taken into account.


>>>The entity grid was computed using the Brown Coherence Toolkit. Each text is represented by an entity grid, a two-dimensional array that captures the distribution of discourse entities across text sentences. The rows of the grid correspond to sentences, while the columns correspond to discourse entities. The discourse entities were the noun phrases. For each occurrence of a discourse entity in the text, the corresponding grid cell contains information about its grammatical role in the given sentence. Each grid column thus corresponds to a string over a set of categories reflecting the entity’s presence or absence in a sequence of sentences. The set of categories consists of four symbols: S (subject), O (object), X (neither subject nor object) and – (gap, which signals the entity’s absence from a given sentence). The probability of each entity-transition pattern in the grid is then calculated.
None of the individual entity-grid features was significantly correlated with the readability ratings, but the combination of entity-grid features had a greater impact on readability.
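The transition probabilities over grid columns can be sketched as follows, pooling length-2 role transitions (e.g. S followed by O) over all entity columns. The grid here is hand-made and purely illustrative:

```python
from collections import Counter

def transition_probs(grid):
    """Distribution of length-2 role transitions pooled over entity columns."""
    trans = Counter()
    for roles in grid.values():
        for a, b in zip(roles, roles[1:]):
            trans[(a, b)] += 1
    total = sum(trans.values())
    return {t: c / total for t, c in trans.items()}

# hypothetical grid: each entity's role sequence across three sentences
grid = {"Microsoft": ["S", "O", "-"],
        "market":    ["O", "-", "S"]}
probs = transition_probs(grid)
print(probs[("S", "O")])  # 1 of 4 transitions = 0.25
print(probs[("O", "-")])  # 2 of 4 transitions = 0.5
```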



At last the authors combine all the above features using linear regression (an approach to modelling the relationship between known and unknown variables) to find the best possible combination of features for predicting readability, and found that vocabulary (language models), discourse features, sentence length and the combination of entity-grid features together give the best result for regression.

I hope the explanation of the above paper will lead us to a better understanding of readability assessment.


More ideas and more work to come along!!!!

Monday, February 28, 2011

Readability in a nutshell...

A Monday morning and I was feeling the Monday morning blues. To top it, I had to give a project presentation. Well, now I’m glad I had the presentation as I realized I need to gather my scattered thoughts on my project topic and yes, market the product in a better and more impressive way, even when bombarded with questions!! :P

Working in this direction, I decided to blog about the ideas and concepts about Readability Assessment by putting it in a nutshell…

So... Why are we looking at readability?

The use of readability tests has been a controversial topic. There is a lot of apprehension associated with them and quite a few questions are raised about this topic. Firstly, let’s be clear about “what readability actually is”. Readability describes the ease with which a document can be read and made sense of. To help us assess how “readable” a text is, we use readability tests.

These tests were developed with the intention of helping librarians and educators, who were otherwise relying on recommendations to make decisions, select their choice of books.

Though Webster’s defines “readable” as something that is fit to read, interesting, attractive in style and enjoyable, the readability formulas obviously cannot measure the latter three factors. Comprehensibility, i.e. how well the user understands the text, cannot be measured using these formulas either.

Historical Overview

Readability formulas were first developed in the 1920s in the United States. Right from the time of conception till today, readability tests have been designed as mathematical equations which take into account elements of writing such as- the number of personal pronouns in the text, the average number of syllables in words or number of words in sentences in the text.

Factors like these are usually described as "semantic" if they concern the words used and "syntactic" if they concern the length or structure of sentences. Both semantic and syntactic elements are surface-level features of the text, and do not take into account the nature of the topic or the characteristics of the readers.

The earliest investigations of readability were conducted by asking students, librarians, and teachers what seemed to make texts readable.

How Do They Work?

Readability formulas measure certain features of text which can be subjected to mathematical equations and calculations. These mathematical equations cannot measure comprehension directly, and not all features can be measured mathematically. Readers can be questioned or tested on the material they have read, and the material itself can be tested with formulas. The reader's success in understanding the material can be correlated with the readability score of the text itself. This is one method of validating the formulas.

The most important features that contribute to determining reading ease are word and sentence length.

So readability formulas are considered to be predictions of reading ease but they do not help us evaluate how well the reader will understand the ideas in the text.

Today’s readability formulas are usually based on one semantic factor (difficulty of words) and one syntactic factor (difficulty of sentences). Inclusion of other factors just complicates the process and doesn’t make the formula any more predictive! Words are either measured against a frequency list or are measured according to their length in characters or syllables. Sentences are measured for the average length in characters or words.

The best readability methods and tests are elaborated on below:

· Fog Index

This is computed as follows:

1. The total number of words is divided by the total number of sentences which gives the average number of words per sentence.

2. The number of words with three or more syllables is divided by the total number of words to give the percentage of difficult words.

3. The sum of these two figures (1 and 2) is multiplied by 0.4. This is the Fog Index in years of education.
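The three steps above can be put into a small function. The syllable counter here is a crude vowel-group heuristic (real syllabification is harder), so treat the scores as indicative only:

```python
import re

def syllables(word):
    # crude heuristic: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    avg_sentence_len = len(words) / len(sentences)                       # step 1
    pct_hard = 100 * sum(syllables(w) >= 3 for w in words) / len(words)  # step 2
    return 0.4 * (avg_sentence_len + pct_hard)                           # step 3

print(fog_index("The cat sat on the mat."))  # 0.4 * (6 + 0) = 2.4
```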

· The Flesch Reading Ease Scale is the most widely used formula outside educational circles. It scores reading ease from 100 (easy to read) down to 0 (very difficult to read). A score of zero indicates that the text averages more than 37 words per sentence and that the average word is more than 2 syllables. In response to demand, Flesch also provided an interpretation table to convert the scale to estimated reading grade and estimated school grade completed.
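For comparison, the Flesch formula itself (206.835 - 1.015 * words/sentence - 84.6 * syllables/word) can be sketched with the same kind of heuristic vowel-group syllable counter, so scores are approximate:

```python
import re

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syls = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syls / len(words))

score = flesch_reading_ease("The cat sat on the mat.")
print(round(score, 3))  # short, monosyllabic text scores at the very easy end
```

Note that very simple text can score above 100; the scale is only calibrated between 0 and 100.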

· Fry published a readability graph which was easier than manual computations. A hand-held calculator was developed to do the Fry test, and now it is incorporated in computer programs.

· The “cloze” procedure

The cloze procedure for testing writing is often treated as a readability test because a formula exists for translating the data from "cloze tests" into numerical results. The name "cloze" comes from the word "closure". In this procedure, words are deleted from the text and readers are asked to fill in the blanks. By constructing the meaning from the available words and completing the text, the reader achieves “closure”. It became a popular method for measuring the suitability of text for a particular audience. It was popular because its scoring was objective; it was easy to use and analyze; it used the text itself for analysis; and it yielded high correlations with other formulas.

It tells you whether a particular audience group can comprehend the writing well enough to complete the cloze test; the reader is asked to fill in the appropriate or a similar word in the blanks. Usually every fifth word is deleted. Cloze is thought to offer a better index of comprehensibility than the statistical formulas. The ability to identify the missing word, or to insert a satisfactory substitute for the original word, indicates that the reader comprehends the content of the text.
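A minimal cloze-test generator following the every-fifth-word convention described above:

```python
def cloze(text, n=5):
    """Blank out every n-th word; return the gapped text and the answer key."""
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])
        words[i] = "_____"
    return " ".join(words), answers

passage = "The quick brown fox jumps over the lazy dog near the riverbank"
gapped, answers = cloze(passage)
print(gapped)   # every fifth word replaced with a blank
print(answers)  # ['jumps', 'near']
```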

In recent years, researchers have emphasized that readability tests can only measure the surface characteristics of text. Qualitative factors like vocabulary difficulty, composition, sentence structure, concreteness and abstractness, obscurity and incoherence cannot be measured mathematically. They have pointed out that material which receives a low grade-level score may still be incomprehensible to the target audience. As an example, they suggest that you consider what happens if you scramble the words in a sentence or, on a larger scale, randomly rearrange the sentences in a whole text. The readability score could be low, but comprehension would be lacking.

Example: Fall Humpty had Dumpty great a.

Things readability formulas can do

1. Their primary advantage is that they can serve as an early-warning system to let the writer know that the writing is too dense. They can give a quick, on-the-spot assessment. They have been described as "screening devices" to eliminate dense drafts and give rise to revisions or substitutions.

2. In some organizational settings, readability tests are considered useful to show measurable improvement in written documents.

Things they cannot tell you

1. How complex the ideas are

2. Whether or not the content is in logical order

3. Whether the vocabulary is appropriate for the audience

4. Whether there is a gender, class or cultural bias

5. Whether the design is attractive and helps or hinders the reader

6. Whether the material appears in a form and type style that is easy or hard to read

Readability tests cannot tell you whether the information in the text is written in a way that interests the reader, nor can they tell you whether the reader has sufficient background information to appreciate the new information provided in the text.

Hope this blog helped in providing a broad overview about readability :) And assessors, hope this provides a clearer picture and we can think and work further with new ideas :)


Saturday, February 19, 2011

INTRODUCTION TO TEXT READABILITY:
Reading as a means of education has helped individuals learn more about the outside world. If materials are easy to read and contain clear ideas, they will increase enthusiasm for reading. Reading is considered a medium of language acquisition and communication, and leads to the sharing of ideas and information. It depends on three main factors: the reader, the text and the situation. Our job is to focus on the readability of text.

There has been a lot of research in the field of text readability since the 1920s. This research has led to many popular readability formulas and effective readability tools with useful applications in English, Spanish, and French.

Text readability is defined as “the ease of understanding or comprehension due to the style of writing”. Readability is concerned with matching the reader and the text; it helps us to measure the appropriateness of texts for particular readers.


Every author should transmit his/her message to the intended readers and motivate them by avoiding long sentences and unnecessarily complex words, because poor readers will soon become discouraged and overwhelmed by the huge number of new words and complex structures.


Text readability measurement has many potential benefits in the following fields: education, medicine, web applications, and information retrieval systems.



Readability Factors:
Readability factors are those that affect the level of proper reading and understanding of a text. These factors can be divided into two types: reader factors and text factors.
a) Reader Factors
These are the factors related to the reader's age and his/her reading ability. A lot of research has found that, in addition to vocabulary and sentence structure, the prior knowledge and experience, interest, and motivation of the reader affect the text's readability in one way or another. The reader's inclinations also encourage him/her to read and comprehend the text.
b) Text Factors
There are many factors related to the text itself that affect text readability. Among these factors are the following:
• Certain aspects of words have a huge impact on text readability, such as word length, word frequency, vocabulary load, and the use of unusual or abstract words, because short words and well-known words are easy to comprehend, and most readers recognize frequent words faster than infrequent ones.
• Average sentence length is an important feature that affects the readability of a text.
• The clarity of an idea mentioned in the text affects its readability, as does the number of parenthetical clauses.
• Topology, metaphor, and simile usually affect the readability.
• A lot of research has found that aspects of grammatical structure complexity affect text readability. These aspects include deletion of one of the main sentence parts, spacing between the main sentence parts (such as between the subject and the verb), separating pronouns from the words that they refer to, and using the passive voice more than the active voice.


Most research has focused on combinations of these factors to estimate text readability. For example, Larsson proposed a so-called "nominal quotient" (NQ) feature to model the different readability levels. It is calculated by counting the number of nouns, prepositions, and participles, then dividing by the number of pronouns, adverbs, and verbs per document:

NQ = (nouns + prepositions + participles) / (pronouns + adverbs + verbs)

The equation estimates the volume of information in a given text. The amount of information in a text affects its readability, i.e., a less informative text is more readable than a highly informative one.
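The quotient can be computed from POS-tagged text. The mapping from Penn Treebank tags to Larsson's word classes below is my own rough assumption (VBG/VBN treated as participles, finite VB* forms and modals as verbs), and the tagged sentence is hand-made:

```python
def nominal_quotient(tagged):
    """NQ = (nouns + prepositions + participles) / (pronouns + adverbs + verbs)."""
    nom = sum(1 for _, p in tagged
              if p.startswith("NN") or p == "IN" or p in ("VBG", "VBN"))
    verbal = sum(1 for _, p in tagged
                 if p.startswith(("PRP", "RB", "WP"))
                 or p in ("VB", "VBD", "VBP", "VBZ", "MD"))
    return nom / verbal if verbal else float("inf")

tagged = [("The", "DT"), ("analysis", "NN"), ("of", "IN"),
          ("results", "NNS"), ("was", "VBD"), ("completed", "VBN"),
          ("quickly", "RB")]
# (2 nouns + 1 preposition + 1 participle) / (1 verb + 1 adverb) = 4 / 2
print(nominal_quotient(tagged))  # 2.0
```

A higher NQ suggests a denser, more information-heavy (and thus less readable) text.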

PREVIOUS WORK:
Many traditional readability metrics are linear models with a few (often two or three) predictor variables based on superficial properties of words, sentences, and documents. These shallow features include the average number of syllables per word, the number of words per sentence, or binned word frequency.
--->>The Flesch-Kincaid Grade Level formula uses the average number of words per sentence and the average number of syllables per word to predict the grade level (Flesch, 1979).


--->>The Gunning FOG index (Gunning, 1952) uses average sentence length and the percentage of words with at least three syllables.

---->>The Automated Readability Index (Senter and Smith, 1967) instead counts the number of characters per word to determine word difficulty.

---->>The Dale-Chall formula uses the percentage of difficult words (words that do not appear on the Dale-Chall word list) and average sentence length to predict the grade level of a text.

--->>Stenner et al. (1983) analyzed more than 50 lexical variables and ran extensive correlation tests, finding that word frequency and sentence length have the most predictive power in ranking the reading difficulty of the texts in their experimental data.
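Two of the formulas listed above, Flesch-Kincaid Grade Level and the Automated Readability Index, can be sketched together. The coefficients are the published ones, but the syllable counter is a rough vowel-group heuristic, so outputs are approximate:

```python
import re

def _counts(text):
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syls = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    chars = sum(len(w) for w in words)
    return len(sents), len(words), syls, chars

def flesch_kincaid_grade(text):
    n_sents, n_words, syls, _ = _counts(text)
    return 0.39 * n_words / n_sents + 11.8 * syls / n_words - 15.59

def automated_readability_index(text):
    n_sents, n_words, _, chars = _counts(text)
    return 4.71 * chars / n_words + 0.5 * n_words / n_sents - 21.43

simple = "The cat sat. The dog ran."
print(flesch_kincaid_grade(simple))         # well below grade 1 for such text
print(automated_readability_index(simple))  # likewise far below grade 1
```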

DISADVANTAGES OF THE ABOVE METRICS:
These traditional metrics are easy to compute and use, but they are not reliable, as demonstrated by several recent studies in the field (Si and Callan, 2001; Petersen and Ostendorf, 2006; Feng et al., 2009).

RECENT WORK:
With the advancement of natural language processing (NLP) tools, a wide range of more complex text properties have been explored at various linguistic levels.

----->>Si and Callan (2001) used unigram language models to capture content information from scientific web pages.

------>>Collins-Thompson and Callan (2004) adopted a similar approach and used a smoothed unigram model to predict the grade levels of short passages and web documents.

------>>Heilman et al. (2007) continued using language modeling to predict readability for first- and second-language texts. Furthermore, they experimented with various statistical models to test their effectiveness at predicting reading difficulty (Heilman et al., 2008).

------>>Schwarm/Petersen and Ostendorf (Schwarm and Ostendorf, 2005; Petersen and Ostendorf, 2006) used support vector machines to combine features from traditional reading-level measures, statistical language models and automatic parsers to assess reading levels.

In addition to lexical and syntactic features, several researchers have started to explore DISCOURSE-LEVEL features and examine their usefulness in predicting text readability.

There are four subsets of discourse features: entity-density features, lexical-chain features, coreference-inference features and entity-grid features.

  • The coreference inference features are novel and have not been studied before.
  • Entity-density features and lexical-chain features have been studied for readers with intellectual disabilities (Feng et al., 2009).
  • Entity-grid features have been studied by Barzilay and Lapata (2008) in a stylistic classification task.

Pitler and Nenkova (2008) used the Penn Discourse Treebank (Prasad et al., 2008) to examine discourse relations.

Entity density features include:
  • percentage of named entities per document
  • percentage of named entities per sentence
  • percentage of overlapping nouns removed
  • average number of remaining nouns per sentence
  • percentage of named entities in total entities
  • percentage of remaining nouns in total entities
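These entity-density counts are simple ratios once entities have been extracted. Below is a minimal sketch, assuming the named entities and general nouns have already been identified by an NER and POS tagger; the function and feature names are illustrative, not taken from the thesis:

```python
# Hypothetical sketch of entity-density feature computation.
# Assumes named entities and general nouns were extracted beforehand
# (e.g., by off-the-shelf NER and POS taggers).

def entity_density_features(sentences, named_entities, nouns):
    """sentences: list of token lists; named_entities, nouns: lists of strings."""
    total_words = sum(len(s) for s in sentences)
    total_entities = len(named_entities) + len(nouns)
    # Treat nouns that duplicate a named entity as overlapping and remove them.
    remaining_nouns = [n for n in nouns if n not in set(named_entities)]
    return {
        "pct_named_entities_per_word": len(named_entities) / total_words,
        "avg_named_entities_per_sentence": len(named_entities) / len(sentences),
        "avg_remaining_nouns_per_sentence": len(remaining_nouns) / len(sentences),
        "pct_named_entities_in_entities": len(named_entities) / total_entities,
        "pct_remaining_nouns_in_entities": len(remaining_nouns) / total_entities,
    }
```

Per-document versus per-sentence variants differ only in the denominator, so one small function covers most of the list above.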

Lexical Chain Features include:
  • total number of lexical chains per document
  • avg. lexical chain length
  • avg. lexical chain span
  • num. of lex. chains with span ≥ half doc. length
  • num. of active chains per word
  • num. of active chains per entity

Coreference Inference Features:
  • total number of coreference chains per document
  • avg. num. of coreferences per chain
  • avg. chain span
  • num. of coref. chains with span ≥ half doc. length
  • avg. inference distance per chain
  • num. of active coreference chains per word
  • num. of active coreference chains per entity
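Both the lexical-chain and coreference-chain feature sets reduce to simple statistics over chain spans. A hedged sketch, assuming each chain is represented simply as a list of token positions in the document (a simplification of the actual chain structures):

```python
# Sketch of span-based chain statistics, shared by lexical chains and
# coreference chains. A chain is assumed to be a list of token positions
# (indices into the document's token sequence) where its members occur.

def chain_features(chains, doc_length):
    spans = [max(c) - min(c) for c in chains]
    # A chain is "active" at position i if i falls within its span.
    active_counts = [
        sum(1 for c in chains if min(c) <= i <= max(c))
        for i in range(doc_length)
    ]
    return {
        "num_chains": len(chains),
        "avg_chain_span": sum(spans) / len(chains),
        "num_long_chains": sum(1 for s in spans if s >= doc_length / 2),
        "avg_active_chains_per_word": sum(active_counts) / doc_length,
    }
```

The "active chains per word" feature is meant to approximate how many threads a reader must hold in working memory at each point in the text.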


In our project "AUTOMATIC READABILITY ASSESSMENT FOR TEXT SIMPLIFICATION" we focus mainly on the discourse features of a text and how they can be used to assess or grade its readability.

The rest of the discourse features will be addressed in my next blog post :) So long till then :)

Tuesday, February 15, 2011

Previous work on readability assessment, applications of readability assessment and research work done by Lijun Feng towards readability assessment



Are you fed up with reading text that is not of your choice or level?



For this, we need a readability assessment tool to select text of our choice :) :)


Let me start with what readability is and what we need to consider while comprehending or understanding a given text. Let me explain:

Readability is defined as a measure of ease with which a written text can be understood.

Now a second question arises: what makes a text easy or difficult to understand? To answer this, let us go through the previous work done on readability assessment.

Relevant Literature & Previous Work on Readability Assessment

Let us first see the characteristics and limitations of traditional readability metrics and recent statistical development in the field of readability.

Traditional readability metrics are given below:


1. The Flesch Reading Ease and Flesch-Kincaid grade level formulas (Flesch, 1979) use average sentence length and average syllables per word to calculate the grade level of a text.

2. Gunning FOG (Gunning, 1952) and the SMOG index (McLaughlin, 1969) use average sentence length and the percentage of words with at least three syllables as parameters.

3. The Automated Readability Index (Senter and Smith, 1967) instead counts the number of characters per word to determine word difficulty.

4. The Dale-Chall formula uses the percentage of difficult words (words that do not appear in its list of common words familiar to 4th-grade students) and average sentence length to predict the grade level of a text.

5. Stenner et al. (1983) analyzed more than 50 lexical variables and ran extensive correlation tests, finding that word frequency and sentence length have the most predictive power in ranking the reading difficulty of the texts in their experimental data.

The advantages of traditional readability metrics are as follows:

  • These traditional metrics are widely used, especially in educational settings, because they are simple and easy to calculate.

  • Grade levels calculated by the above methods indicate the number of years of education generally required to understand the text; reading difficulty is generally understood to increase with grade level. They are a commonly accepted index of reading difficulty, especially in educational settings, because the grade-level scale makes it easier for teachers, parents, librarians, and others to judge the readability of various books and texts. Another reason to look at grade levels is that they have been widely used in previous research.

Drawbacks of traditional readability metrics:

  • They ignore syntactic constituents, the structure of the text, and local and global discourse coherence across the text (the bases for coherent discourse: familiarity of the discourse topic to the reader, the reader’s prior knowledge, and motivation to read).

  • The traditional metrics cannot capture content information and often misjudge the reading difficulty of scientific web documents.

Statistical approaches towards readability metrics

Si and Callan (2001) used unigram language models to capture content information from scientific web pages. A linear model was built combining language models with sentence length.

Collins-Thompson and Callan (2004) adopted a smoothed unigram model to capture vocabulary variation across all grade levels contained in the corpus. Their smoothed unigram model is purely vocabulary-based and does not contain any syntactic features: although vocabulary-based unigram language models help capture important content information and variation in word usage, they do not capture syntactic information.

Schwarm and Ostendorf (2005)

used Charniak’s parser (Charniak, 2000) and higher-order n-gram (n = 3) models over a combination of word and part-of-speech (POS) sequences to capture syntactic and semantic features. But their work was limited to the study of lexical and syntactic features with regard to text comprehensibility.

Heilman et al. (2007)

The readability measurement was motivated by pedagogical differences in first language (L1) and second language (L2) learning. They argue that grammatical features play a more important role in L2 texts than in L1 texts because, unlike L1 learners who learn grammar through natural interaction, L2 learners learn grammatical patterns explicitly from L2 textbooks.

But this work, too, was limited to the study of lexical and syntactic features with regard to text comprehensibility.

Barzilay and Lapata (2008) did the first work on discourse relations; they designed and implemented an entity-grid model to capture the distribution of entity-transition patterns from sentence to sentence.

Cognitive science reveals that the most important processes during reading comprehension lie in discourse comprehension, which entails making appropriate inferences from concepts and propositions and connecting and/or integrating related information to construct a coherent memory representation.

Their work was not motivated by text readability, but rather by other NLP tasks related to text generation, such as text ordering and summary coherence rating.

Pitler and Nenkova (2008) for the first time looked at readability factors at all three linguistic levels: lexical, syntactic and discourse. In the PDTB (Penn Discourse Treebank), all discourse connectives and the relations between two adjacent sentences of a text were manually annotated. Among all the individual factors analyzed at the three linguistic levels, the likelihood of discourse relations, with text length taken into account, shows the strongest correlation with human readability ratings (r = .4835). Their work is novel and inspiring because it touched the core of text comprehension and showed a new, long-overdue direction in readability study.

Limitations of Pitler and Nenkova's work

1. It cannot be applied to corpora other than the PDTB.

2. They mainly focused on text style rather than text readability, i.e., how well a text is written rather than how difficult or easy it is to read.

3. Their experiment was conducted on only 30 articles, and because they relied only on limited subjective human ratings, their study lacks any objective measure.

After reading all the previous work on readability, let me conclude in a simple way: readability cannot be judged solely by

1. Lexical tokenisation (which looks at three factors: the number of syllables a word contains, the number of characters a word contains, and word frequency)

2. Syntactic representation (the complexity of sentences judged solely by their average length in words)

3. Sentence processing

But readability also depends on:

  • Discourse relations, which build the reader's coherent memory representation of the text. Taken together, they determine the amount of prior knowledge the reader needs to apply, the inferences the reader needs to make to relate the text to accumulated knowledge, the references that must be resolved to understand the text, and the searching and retrieval of relevant information for comprehension.

  • Working memory capacity: if the text is not related to the main topic of discussion, i.e., it is not present in current working memory, then the reader has to search long-term memory to understand it.

Now coming to the major contributions towards readability assessment made by Lijun Feng:

1. The thesis approaches readability from a text comprehension point of view; in particular, it pays special attention to discourse processes that are crucial for constructing and maintaining local and global memory coherence of a text (roughly, short-term and long-term memory), which is key to successful text comprehension.

2. It examines the processes that occur in discourse comprehension, which include resolving entities, inferring meaning from words and phrases, assessing and evaluating semantic relations among concepts and propositions and making connections among them, using background knowledge to generate appropriate inferences to fill in gaps, and integrating new information into the existing semantic structure to achieve and maintain a coherent memory representation of a text.

3. The thesis proposes to apply advanced NLP techniques to implement three classes of novel discourse features that have not been studied in previous research: density of entities, lexical chains, and coreferential inference features.

4. It focuses not only on intrinsic text properties but views text comprehensibility as the result of the interaction between the text and the reader’s prose-processing ability; the characteristics of a given reader are brought into the readability study by addressing the constraints that working memory capacity places on the reader’s comprehension effort. Working-memory constraints are highlighted here because individuals with ID (intellectual disability) do not have the same memory capacity as those without ID.

5. Working memory is taken into consideration while extracting the discourse features. Working memory has a great impact on various language comprehension activities because it provides temporary storage and simultaneous manipulation of information and coordinates the resources necessary for comprehension processes during reading. Since individuals with ID do not have the same working-memory capacity as individuals without ID, this accounts for variation in comprehension performance.

6. The thesis proposes the development of an automatic readability assessment tool whose construction consists of four major parts: data collection; feature extraction and implementation; building and evaluating the tool on labeled corpora; and testing and evaluating the tool on unlabeled texts from different domains.

7. The study does not rely on a single measure of readability: it combines various proxies, such as paired original/simplified corpora, grade levels, subjective ratings by experts and users, and objective observations in user studies, to get at the underlying text properties associated with reading difficulty.

Let me conclude with the applications of building the automatic readability assessment tool.

1. In educational settings, school children, second-language learners, and adults with low literacy can use the tool to select reading material that is of interest to them and tailored to their varying reading proficiency.

2. Language instructors can use the tool to effectively select teaching material at an appropriate level of reading difficulty for target readers.

3. It can be used to rank documents by reading difficulty for automated systems such as text simplification, text summarization, machine translation and other text generation systems; for example, the tool can select documents on a similar topic that are at an appropriate level of reading difficulty for the target system to begin with.

4. It can reliably assess the reduction in reading difficulty achieved by a simplification process, by comparing the text before and after simplification.

5. We can use the tool to check the quality of text generated by systems such as text summarization, machine translation and text ordering systems: comparing reading difficulty before and after the change is one way to check coherence (coherent texts are easier to read).

I hope the above information will help us before delving into a deeper understanding of the thesis. Finally, I can say that reading the thesis three times brought me to the same level of understanding :)

Summary on "Automatic Readability Assessment" thesis

When I was told that I needed to read a 200-page thesis, “Automatic Readability Assessment” by Lijun Feng, the first thought that occurred to me was “I will definitely doze off 20 pages into the paper!!” But this thesis is pretty extensive in its coverage of the topic that forms the backbone of our project: what exactly goes into understanding-the-understandability of text.

I provide a summary of the paper in this blog.

Readability is commonly defined as a measure of what makes a text easy or hard to read and has been the central topic of readability research for the past 80 years. Many traditional metrics exist for text readability which bank upon a limited set of textual features, such as sentence length, number of syllables per word, word frequency, etc. Though these metrics are easy to compute, they have been proven to be highly unreliable.

Language models and parsers that use NLP technology have been used to explore complex lexical features and syntactic constructs in readability study. But readability research has not made much progress beyond lexical and syntactic analysis, as these features are easier to define and measure with existing techniques, while factors such as discourse topic and discourse coherence require much more complex semantic analysis and hence remain challenging problems.

The thesis focuses on developing an automatic text readability assessment tool at various discourse levels while taking user characteristics into account. The primary goal of the thesis is to quantify and understand what makes a text easy or difficult to read, particularly for readers with mild intellectual disabilities (MID).

To assess how well the readability assessment tool generalizes, corpora were created consisting of original and simplified texts. The tool’s ability to differentiate between original and simplified text was evaluated, and the tool’s predictions were compared against independent measures of text difficulty rated by experts and by adult participants with mild intellectual disabilities.

A reader processes sentences as he or she reads, organizes the memory units extracted from word and sentence processing, and places these units in memory in an organized, structured manner. A coherent memory representation is constructed and maintained by the reader’s ability to process the text and resolve references by making suitable inferences. Low working-memory capacity has been shown to be related to a reduction in the speed and accuracy with which sentences can be processed.

The following elements underlie Feng’s approach to readability:

  • Text readability is not determined by intrinsic text properties alone. Rather, reading ease or difficulty results from the interaction of the reader and the text.
  • The goal of reading is to construct a coherent memory representation of a text. Word identification and sentence parsing are part of basic comprehension processes that occur at the low level of text comprehension. Much of reading difficulties arise from higher level of discourse comprehension, which involves mostly evaluating and identifying relations among conceptual information, solving references to establish entities in a text and making various types of inferences to fill in missing information.
  • Working memory has great impact on various language comprehension activities, because it provides temporary storage and simultaneous manipulation of information and coordinates resources that are necessary for comprehension processes during reading.
  • Working memory capacity constantly places constraints on readers’ attempt to understand a text. Individual differences in working memory capacity account for some of the variation in comprehension performance.
  • Text comprehensibility can be well predicted by an analysis of the demands it makes of readers’ working memory

The thesis especially targets helping people with MID (mild intellectual disability).

Such a situation indicates that the person has problems reading, comprehending, analyzing and joining the dots to make inferences while reading each line. The limitation in their cognitive functioning is due to varying degrees of impairment, which affect reading comprehension directly. The ability to actively and strategically apply one’s semantic knowledge to facilitate comprehension activities is considered crucial in understanding differences in individual comprehension performance. In many empirical studies, individuals with ID were observed to show deficits in various aspects of semantic processing.

It is difficult to find reading materials for individuals with MID that are

(1) of interest to them and

(2) at the right reading level.

Reading materials at lower reading levels are typically written for children, and texts written for adults without disabilities often require a high level of linguistic skills and sufficient real world knowledge, which these individuals often lack. The lack of appropriate reading materials may also discourage adults with ID from practicing reading, thus diminishing their already low literacy skills.

Transformation rules are applied that change complex constructs into shorter or plainer sentences, which are thought to be easier for people with MID to understand. However, synonym replacement and syntax-tree simplification alone are not enough: in addition to the challenges that come from lexical and syntactic factors, these readers have other difficulties processing written information. Moreover, text simplification increases the length of the simplified document, because long, complex sentences are often split into multiple shorter ones. The increased length of the whole document can pose another challenge to the already limited working-memory capacity of readers with MID, because it requires processing and storing more information. Therefore a system needs to be designed in which the most relevant information is retained and less relevant information is simplified or left out entirely.

There are two major research questions at the center of the design and implementation of such a text simplification system:

(1) How do we identify which portions of a text will pose difficulty for our users?

(2) When there are several possible simplification choices, how do we decide which is the optimal one to choose for our users?

Ideally, a reliable automatic readability assessment tool would help solve both questions and aid automatic text simplification in many ways. It can be used to rank documents by reading difficulty for automated systems such as text simplification, text summarization, machine translation and other text generation systems. For example, as a preprocessing step, such a tool can be used to select documents on a similar topic that are at an appropriate level of reading difficulty for the target system to begin with. More importantly, such a tool can provide an efficient evaluation measure for systems’ performance.

One of many important aspects to consider when evaluating the quality of text generated by automated systems is coherence: coherent texts are easier to read. One way to check the coherence of resulting texts is to compare their reading difficulty before and after the change. Feng’s automatic readability assessment tool is well suited to this task.

To make it easier for people to judge the reading difficulty of a text, grade levels or number of years of education required to completely understand a text are commonly used as index for reading difficulty.

Many traditional readability metrics use simple linear functions of two or three shallow language features to model the readability of a given text. For example, the widely used Flesch Reading Ease and Flesch-Kincaid grade level formulas use average sentence length and average syllables per word to calculate the grade level of a text. Similarly, the Gunning FOG and SMOG indices use average sentence length and the percentage of words with at least three syllables as parameters. The Automated Readability Index instead counts the number of characters per word to determine word difficulty. Departing from the syllabic approach, the Dale-Chall formula advanced the measurement of lexical difficulty by introducing a list of common words familiar to 4th-grade students; it uses the percentage of difficult words (words that do not appear in the list) and average sentence length to predict the grade level of a text.

Flesch Reading Ease: based on a 0-100 scale; a higher score indicates that a text is easier to read.

206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)

Flesch Kincaid Grade Level:

0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59

Gunning FOG score:

0.4 * ((words/sentences) + 100 * (complex words/words))

SMOG Index:

1.0430 * sqrt ( 30 * complex words/sentences) + 3.1291

ARI :

4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43
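Transcribed directly into code, the five formulas above look as follows; the raw counts (syllables, complex words, characters) are assumed to be computed beforehand, with complex words being those of three or more syllables:

```python
import math

# Direct transcription of the five traditional readability formulas.
# All arguments are raw counts over the whole text.

def flesch_reading_ease(words, sentences, syllables):
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def smog_index(sentences, complex_words):
    return 1.0430 * math.sqrt(30 * complex_words / sentences) + 3.1291

def automated_readability_index(characters, words, sentences):
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
```

For a 100-word, 5-sentence text with 150 syllables, the Flesch Reading Ease works out to 206.835 - 1.015·20 - 84.6·1.5 = 59.635, squarely in the "plain English" band of the 0-100 scale.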

These traditional metrics are widely used, especially in educational settings, partly because they are simple and easy to calculate. However, their limitations are obvious. They overweight the impact of word frequency and sentence length on text comprehensibility and systematically ignore many other factors crucial to reading, such as syntactic constituents, the structure of the text, local and global discourse coherence across the text, familiarity of the discourse topic to the reader, the reader’s prior knowledge and motivation to read, etc.

Moreover, the number of syllables per word, which acts as a reliable proxy for word frequency, and sentence length do not always capture the reading complexity of a text accurately. Hence traditional metrics have been proved to be unreliable.

The most recent work in this direction includes:

  • Detailed analysis of syntactic complexity based on parse trees has been combined with language models and traditional measures in readability research (Heilman et al., 2007; Pitler and Nenkova, 2008; Schwarm and Ostendorf, 2005).
  • Besides three traditional measures (average sentence length, average number of syllables per word and Flesch-Kincaid score), Schwarm and Ostendorf (2005) used Charniak’s parser (Charniak, 2000) and higher-order n-gram (n = 3) models over a combination of word and part-of-speech (POS) sequences to capture syntactic and semantic features. The four parse features include average parse tree height, average number of noun phrases, average number of verb phrases, and average number of “SBAR”s (relative clauses).
  • Pitler and Nenkova (2008) for the first time looked at readability factors at all three linguistic levels: lexical, syntactic and discourse. They analyzed six classes of features: traditional readability factors such as average number of characters per word, average sentence length, maximum number of words per sentence and document length; vocabulary-based unigram features; the four parse features of Schwarm and Ostendorf mentioned above; elements of text cohesion; and discourse relations.

The thesis approaches readability from a text comprehension point of view, with special attention to discourse processes that are crucial for constructing and maintaining local and global memory coherence of a text, which is key to successful text comprehension. These discourse processes reflect the reader’s comprehension task and can be useful in predicting the complexity of a text. Advanced NLP techniques have been applied to implement three classes of novel discourse features that have not been studied by any of the previous research.

The study does not rely on a single measure of readability and combines various proxies, such as paired original/simplified corpora, grade levels, subjective ratings by experts and users, and objective observations in our user studies, to get at those underlying text properties that are associated with reading difficulties.

The methods employed in the paper consist of four major parts:

  • Data collection
  • Feature extraction and implementation
  • Building and evaluating the tool on labeled corpora
  • Testing and evaluating the tool on unlabeled texts from different domains.

The main corpus for the study consists of texts with reading difficulty annotated by elementary grade levels ranging from Grade 2 to Grade 5. The corpus is used to build and evaluate the automatic text readability assessment tool.

Two ways are given to assess how well the readability assessment tool generalizes to texts from different domains:

  • First, two corpora were manually created consisting of original and simplified texts adapted specifically for adults with mild intellectual disabilities. The readability assessment tool assigns grade levels to predict the reading difficulty of the original and simplified texts contained in these two corpora.
  • Second, the correlations between grade level predictions by our tool, expert ratings, and inferred text difficulty for adult participants with mild intellectual disabilities have been compared.

Hence the general methodology relies on the following five proxies:

  • Grade levels: Grade levels indicate the number of years of education generally required to understand the text. It is generally understood that reading difficulty increases with grade level.
  • Paired original/simplified texts: A common assumption is that simplified texts should be easier to read.
  • Subjective ratings by experts: Experts who have linguistic expertise or specialize in working with adults with ID were asked to rate text difficulty.
  • Objective observations in user studies: Target users are given texts at a variety of difficulty levels and their reading times are recorded. Subjects answer simple comprehension questions afterwards, and the accuracy of their answers is analyzed. This gives the most direct clues about the difficulties faced by the target user group, even though per-subject and other effects need to be accounted for.
  • Subjective (introspective) ratings by users: This will probably be especially problematic in the study, as the users’ subjective judgment may not be fully reliable because of their cognitive impairments.

Research Hypothesis

The thesis proposes to design and implement four classes of novel discourse features that will best reflect working memory burden posed on the reader’s attempt to understand a text: density of entities, lexical chains, coreferential inference features and local entity coherence features.

  • Density of Entities: Conceptual information is often introduced in a text by entities, which consist of general nouns and named entities, such as people’s names, locations, organizations, etc. The more entities introduced into a text, the more demands they make of the reader’s working memory capacity; for individuals with ID, who suffer from impoverished working memory, the increasing demands of entity processing can become especially overwhelming.
  • Lexical Chains: Using existing NLP technology, various semantic relations among entities – such as synonymy, hypernymy, hyponymy, coordinate terms (siblings), etc. – can be automatically annotated. Based on these annotations, entities that are connected by certain semantic relations can be chained up through the text to form a lexical chain.
  • Coreferential Inferences: Readers are required to actively apply acquired prior background knowledge to disambiguate and make appropriate inferences. The inference processes involve searching and retrieving relevant information from various long- and short-term memory systems.

Corpora

Six corpora were collected for the readability study:

  • Labeled Corpus: from WeeklyReader, LocalNews2007, LocalNews2008 and NewYorkTimes100
  • Unlabeled Paired Corpora: from Britannica and LiteracyNet

Feature Extraction

Various features were used for the automatic text readability assessment tool; the techniques deployed to extract and implement them are described below.

The following 5 feature subsets were proposed, many of which result from refinement and improvement of previously studied features.

  • Discourse Features
  • Language-Modeling-based Perplexity Features
  • Parsed Syntactic Features
  • Part-Of-Speech-based (POS) Features
  • Shallow Features

Discourse Features

Four subsets of discourse features were given: entity-density features, lexical-chain features, coreference inference features and entity grid features.

The first three subsets of features are novel and have not been studied by other researchers before.

  • Entity-Density Features: The entities are defined as a union of named entities and the rest of general nouns (nouns and proper nouns) contained in a text.
  • Lexical Chain Features: LexChainer produces chains of words connected by six semantic relations: synonymy, hypernymy, hyponymy, meronymy, holonymy and coordinate terms (siblings) (Galley and McKeown, 2003). The hypothesis is that important conceptual and topical information recurring throughout a text is likely to be captured by these lexical chains. In order to construct a coherent semantic representation of a text, a reader must keep semantically related discourse units in his/her working memory throughout the whole reading comprehension process.

  • Coreferential Inference Features: Relations among concepts and propositions are often not stated explicitly in a text. The constructive nature of building a coherent semantic representation of a text requires a reader to actively retrieve and assess previously processed information to generate appropriate inferences when conceptual information is not stated explicitly.

  • Entity Grid Features: Features extracted from entity grid models are studied for their effectiveness in automatic readability assessment.

Parsed Syntactic Features

Recent approaches to readability have utilized natural language processing techniques such as probabilistic parsers to analyze syntactic features of texts and reported their positive contributions. Schwarm and Ostendorf studied four parse tree features (average parse tree height and average numbers of SBARs, noun phrases, and verb phrases per sentence). The thesis implements these and additional features using the Charniak parser (Charniak, 2000). The parsed syntactic features focus on clauses (SBAR), noun phrases (NP), verb phrases (VP) and prepositional phrases (PP). For each phrase type, four features are implemented: total number of phrases per document, average number of phrases per sentence, and average phrase length measured in words and in characters respectively.
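As an illustration of the phrase-counting step, the sketch below works over bracketed (Penn-style) parse strings. The tiny tree reader included here is only a self-contained stand-in for real parser output such as Charniak's:

```python
# Sketch of parse-based phrase feature counting over bracketed parse strings.

def read_tree(s):
    """Read '(LABEL child ...)' into (label, [children]) tuples; leaves are strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def parse(i):
        assert tokens[i] == "("
        label, i = tokens[i + 1], i + 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = parse(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1
    tree, _ = parse(0)
    return tree

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [w for c in node[1] for w in leaves(c)]

def collect_lengths(node, label, acc):
    """Append the word-length of every phrase with the given label."""
    if isinstance(node, str):
        return
    if node[0] == label:
        acc.append(len(leaves(node)))
    for c in node[1]:
        collect_lengths(c, label, acc)

def phrase_features(parses, labels=("SBAR", "NP", "VP", "PP")):
    trees = [read_tree(p) for p in parses]
    feats = {}
    for label in labels:
        lengths = []
        for t in trees:
            collect_lengths(t, label, lengths)
        feats[label] = {
            "total": len(lengths),
            "per_sentence": len(lengths) / len(trees),
            "avg_words": sum(lengths) / len(lengths) if lengths else 0.0,
        }
    return feats
```

For "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))" this yields two NPs of two words each, one four-word VP, one PP and no SBARs; average phrase length in characters could be added the same way.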

POS Features

The paper focuses on five classes of words (nouns, verbs, adjectives, adverbs, and prepositions) and two broad categories (content words, function words).

Nouns include general nouns and proper nouns. Verbs include past tenses, present participles, past participles and modals in addition to infinitives, present 3rd person singular forms and all forms of auxiliary verbs. Content words include nouns, verbs, numerals, adjectives, and adverbs; the remaining types are function words.
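A small sketch of how these class proportions might be computed from Penn Treebank POS tags (my own toy version; the exact tag groupings in the paper may differ, e.g. in how auxiliaries and numerals are handled):

```python
def pos_class_proportions(tagged_tokens):
    """Proportion of each word class, plus a content/function split.

    tagged_tokens: list of (word, POS) pairs with Penn Treebank tags.
    The tag groupings below are my assumption, following the text:
    content words = nouns, verbs, numerals, adjectives, adverbs.
    """
    classes = {
        "noun": {"NN", "NNS", "NNP", "NNPS"},
        "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"},
        "adjective": {"JJ", "JJR", "JJS"},
        "adverb": {"RB", "RBR", "RBS"},
        "preposition": {"IN"},
    }
    content_tags = (classes["noun"] | classes["verb"] | classes["adjective"]
                    | classes["adverb"] | {"CD"})  # CD = numerals
    n = len(tagged_tokens) or 1
    feats = {name: sum(1 for _, t in tagged_tokens if t in tags) / n
             for name, tags in classes.items()}
    feats["content"] = sum(1 for _, t in tagged_tokens if t in content_tags) / n
    feats["function"] = 1 - feats["content"]
    return feats
```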

Shallow Features

Shallow features refer to those used by traditional readability metrics, such as Flesch-Kincaid Grade Level (Flesch, 1979), SMOG (McLaughlin, 1969), Gunning FOG (Gunning, 1952), etc. Although recent readability studies have strived to take advantage of NLP techniques, little has been revealed about the predictive power of shallow features.

Some of the shallow features are:

1) average number of syllables per word

2) percentage of poly-syllable words per document

3) average number of poly-syllable words per sentence

4) average number of characters per word

5) Chall-Dale difficult words rate per document

6) average number of words per sentence

7) average number of characters per sentence

8) Flesch-Kincaid score

9) total number of words per document
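Several of these reduce to simple word, sentence, and syllable counts. Here is a minimal sketch of the Flesch-Kincaid Grade Level (the published formula), using a crude vowel-group syllable counter, which is my simplifying assumption; real implementations use dictionaries or better heuristics:

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count vowel groups (assumption: rough sketch)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_sents = len(sentences) or 1
    n_words = len(words) or 1
    return 0.39 * (n_words / n_sents) + 11.8 * (syllables / n_words) - 15.59
```

The same word/sentence/syllable counts also yield features 1), 4), 6), 7) and 9) above directly.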


Automatic Readability Assessment

The paper then studies the effectiveness of each feature set in predicting reading difficulty, indexed by grade level.

To summarize, within the four subsets of discourse features, the following key observations were made:

  • Among all four subsets of features, entity-density features exhibit the most significant discriminative power in modeling text reading difficulty.
  • Combining all discourse features together leads to overall improvement. However, the best performance is achieved by combining entity density features and entity grid features together.
  • Analysis at grade level reveals that entity-density features generate the highest accuracy for Grade 2 (57.41%) and 4 (50.09%); combining all features produces the best performance for Grade 3 (57.09%); and entity grid features generate the highest accuracy for Grade 5 (80.96%).

This is the farthest I could read and comprehend in the thesis. The actual implementation details in the paper required text simplification for me ( :P ). I’ll definitely have to bury my head deeper into the thesis!


So long till then!