Project Readability Assessment: February 2011

Monday, February 28, 2011

Readability in a nutshell...

It was a Monday morning and I was feeling the Monday morning blues. To top it off, I had to give a project presentation. Well, now I’m glad I had the presentation, as I realized I need to gather my scattered thoughts on my project topic and, yes, market the product in a better and more impressive way, even when bombarded with questions!! :P

Working in this direction, I decided to blog about the ideas and concepts behind Readability Assessment by putting them in a nutshell…

So... Why are we looking at readability?

The use of readability tests has been a controversial topic. There is a lot of apprehension associated with it, and quite a few questions are raised about it. First, let’s be clear about what readability actually is. Readability describes the ease with which a document can be read and made sense of. To help us assess how “readable” a text is, we use readability tests.

These tests were developed to help librarians and educators, who were otherwise relying on recommendations, choose books.

Though Webster’s defines “readable” as something that is fit to read, interesting, attractive in style and enjoyable, readability formulas obviously cannot measure the latter three factors. Nor can these formulas measure comprehensibility, that is, how well the reader understands the text.

Historical Overview

Readability formulas were first developed in the 1920s in the United States. From their conception until today, readability tests have been designed as mathematical equations that take into account elements of writing such as the number of personal pronouns in the text, the average number of syllables per word, or the number of words per sentence.

Factors like these are usually described as "semantic" if they concern the words used and "syntactic" if they concern the length or structure of sentences. Both semantic and syntactic elements are surface-level features of the text and do not take into account the nature of the topic or the characteristics of the readers.

The earliest investigations of readability were conducted by asking students, librarians, and teachers what seemed to make texts readable.

How Do They Work?

Readability formulas measure certain features of text that can be subjected to mathematical calculation. These equations cannot measure comprehension directly, and not all features can be measured mathematically. Readers can be questioned or tested on the material they have read, and the material itself can be tested with formulas. The reader’s success in understanding the material can then be correlated with the readability score of the text itself. This is one method of validating the formulas.
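To make the validation idea concrete, here is a minimal sketch of correlating formula scores with reader comprehension. The numbers are made up for illustration, and the choice of the Pearson coefficient is my assumption about what "correlated" means here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: grade-level scores a formula gave to five texts, and the
# share of comprehension questions readers answered correctly on each text.
formula_grade = [4.0, 6.0, 8.0, 10.0, 12.0]
comprehension = [0.9, 0.8, 0.7, 0.5, 0.4]

r = pearson_r(formula_grade, comprehension)
print(f"correlation between formula score and comprehension: {r:.3f}")
```

A strongly negative value would mean that texts the formula rates as harder are indeed understood less well, which is the kind of evidence used to validate a formula.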

The most important features that contribute to determining reading ease are word and sentence length.

So readability formulas are considered to be predictions of reading ease but they do not help us evaluate how well the reader will understand the ideas in the text.

Today’s readability formulas are usually based on one semantic factor (difficulty of words) and one syntactic factor (difficulty of sentences). Including other factors just complicates the process and doesn’t make the formula any more predictive! Words are either measured against a frequency list or measured by their length in characters or syllables. Sentences are measured by their average length in characters or words.

The best-known readability methods and tests are described below:
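As a rough illustration of the two ways of measuring word difficulty mentioned above, here is a small sketch. The tiny `EASY_WORDS` set is only a stand-in I made up; real formulas use lists with thousands of familiar words.

```python
import re

# Tiny stand-in for a real familiar-word list (an assumption for this sketch).
EASY_WORDS = {"the", "cat", "sat", "on", "a", "mat", "and", "was", "it"}

def words_of(text):
    """Lower-cased alphabetic tokens of the text."""
    return re.findall(r"[A-Za-z']+", text.lower())

def percent_unfamiliar(text, easy_words=EASY_WORDS):
    """Word difficulty measured against a list: % of words not on the list."""
    words = words_of(text)
    if not words:
        return 0.0
    hard = [w for w in words if w not in easy_words]
    return 100.0 * len(hard) / len(words)

def average_word_length(text):
    """Word difficulty measured by length: mean characters per word."""
    words = words_of(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0

print(percent_unfamiliar("The feline reclined on the mat."))
print(average_word_length("The feline reclined on the mat."))
```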

· Fog Index

This is computed as follows:

1. The total number of words is divided by the total number of sentences which gives the average number of words per sentence.

2. The number of words with three or more syllables is divided by the total number of words and multiplied by 100, which gives the percentage of difficult words.

3. The sum of these two figures (1 and 2) is multiplied by 0.4. The result is the Fog Index in years of education.
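The three steps above can be sketched in a few lines. Note that the syllable counter here is a crude vowel-group heuristic of my own, not part of the Fog Index itself, so treat the result as approximate.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Heuristic: a trailing 'e' (but not 'le') is usually silent, e.g. "made".
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def gunning_fog(text):
    """Fog Index = 0.4 * (words per sentence + percentage of hard words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and word")
    avg_sentence_length = len(words) / len(sentences)           # step 1
    hard_words = sum(1 for w in words if count_syllables(w) >= 3)
    percent_hard = 100.0 * hard_words / len(words)              # step 2
    return 0.4 * (avg_sentence_length + percent_hard)           # step 3

print(gunning_fog("The cat sat on the mat. The dog ran."))
```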

· The Flesch Reading Ease Scale is the most widely used formula outside of educational circles. It rates reading ease from 100 (easy to read) to 0 (very difficult to read). A score of zero indicates that the text averages more than 37 words per sentence and more than 2 syllables per word. In response to demand, Flesch also provided an interpretation table to convert the scale to estimated reading grade and estimated school grade completed.
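For the curious, here is a small sketch of the Flesch Reading Ease calculation. The constants 206.835, 1.015 and 84.6 are the commonly published Flesch coefficients; they are not given in this post, and the function takes pre-counted totals so that no particular syllable counter is assumed.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease from raw counts.

    score = 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word)
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# A text with short sentences and short words scores high (easy to read).
print(flesch_reading_ease(total_words=80, total_sentences=10, total_syllables=96))

# 37 words per sentence and 2 syllables per word lands close to zero.
print(flesch_reading_ease(total_words=37, total_sentences=1, total_syllables=74))
```

The second call matches the description of a zero score above: the result comes out very close to 0.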

· Fry published a readability graph which was easier than manual computations. A hand-held calculator was developed to do the Fry test, and now it is incorporated in computer programs.

· The “cloze” procedure

The cloze procedure for testing writing is often treated as a readability test because a formula exists for translating the data from "cloze tests" into numerical results. The name "cloze" comes from the word "closure". In this procedure, words are deleted from the text and readers are asked to fill in the blanks. By constructing the meaning from the available words and completing the text, the reader achieves “closure”. It became a popular method for measuring the suitability of a text for a particular audience. It was popular because its scoring was objective, it was easy to use and analyze, it used the text itself for analysis, and it yielded high correlations with other formulas.

It tells you whether a particular audience can comprehend the writing well enough to complete the cloze test, in which the reader is asked to fill each blank with the original word or a similar one. Usually every fifth word is deleted. Cloze is thought to offer a better index of comprehensibility than the statistical formulas. The ability to identify the missing word, or to insert a satisfactory substitute for it, indicates that the reader comprehends the content of the text.

In recent years, researchers have emphasized that readability tests can only measure the surface characteristics of text. Qualitative factors like vocabulary difficulty, composition, sentence structure, concreteness and abstractness, obscurity and incoherence cannot be measured mathematically. They have pointed out that material which receives a low grade-level score may still be incomprehensible to the target audience. As an example, they suggest that you consider what happens if you scramble the words in a sentence or, on a larger scale, randomly rearrange the sentences in a whole text. The readability score could be low, but comprehension would be lacking.
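Here is a minimal sketch of how a cloze test could be built and scored. The function names are mine, and the scoring only accepts the exact original word; judging whether a "similar word" is acceptable would need a human or a thesaurus, which this sketch leaves out.

```python
def make_cloze(text, gap=5, blank="_____"):
    """Replace every `gap`-th word with a blank; return (cloze text, answers)."""
    words = text.split()
    shown, answers = [], []
    for position, word in enumerate(words, start=1):
        if position % gap == 0:
            answers.append(word)
            shown.append(blank)
        else:
            shown.append(word)
    return " ".join(shown), answers

def score_cloze(answers, responses):
    """Fraction of blanks filled with exactly the original word."""
    if len(answers) != len(responses):
        raise ValueError("one response is needed per blank")
    if not answers:
        return 0.0

    def normalize(word):
        return word.strip(".,;:!?\"'").lower()

    correct = sum(normalize(a) == normalize(r) for a, r in zip(answers, responses))
    return correct / len(answers)

cloze_text, answers = make_cloze("Humpty Dumpty sat on a wall and had a great fall")
print(cloze_text)
print(score_cloze(answers, ["a", "big"]))
```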

example: Fall Humpty had Dumpty great a.
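A quick sketch shows why the scrambled sentence fools a formula: the surface statistics that formulas rely on are identical for both versions. The `surface_stats` helper is just my illustration, not any particular published formula.

```python
import re

def surface_stats(sentence):
    """Word count and average word length, the kind of input formulas use."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return {
        "words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

original = "Humpty Dumpty had a great fall."
scrambled = "Fall Humpty had Dumpty great a."

print(surface_stats(original))
print(surface_stats(scrambled))
print(surface_stats(original) == surface_stats(scrambled))
```

Any formula built only on these counts gives both sentences the same score, even though only one of them makes sense.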

Things readability formulas can do

1. Their primary advantage is they can serve as an early warning system to let the writer know that the writing is too dense. They can give a quick, on-the-spot assessment. They have been described as "screening devices" to eliminate dense drafts and give rise to revisions or substitutions.

2. In some organizational settings, readability tests are considered useful to show measurable improvement in written documents.

Things they cannot tell you

1. How complex the ideas are

2. Whether or not the content is in logical order

3. Whether the vocabulary is appropriate for the audience

4. Whether there is a gender, class or cultural bias

5. Whether the design is attractive and helps or hinders the reader

6. Whether the material appears in a form and type style that is easy or hard to read

Readability tests cannot tell you whether the information in the text is written in a way that interests the reader, nor can they tell you whether the reader has sufficient background information to appreciate the new information provided in the text.

Hope this blog helped in providing a broad overview about readability :) And assessors, hope this provides a clearer picture and we can think and work further with new ideas :)

Saturday, February 19, 2011

INTRODUCTION TO TEXT READABILITY:
Reading as a means of education has helped individuals learn more about the outside world. If the materials are easy to read and contain clear ideas, they will increase the enthusiasm for reading. Reading is considered a medium of language acquisition and communication, and leads to the sharing of ideas and information. It depends on three main factors: the reader, the text and the situation. Our job is to focus on the readability of text.

There has been a lot of research in the field of text readability since the 1920s. This research has led to many popular readability formulas and effective readability tools with useful applications in English, Spanish, and French.

Text readability is defined as “the ease of understanding or comprehension due to the style of writing”. Readability is concerned with matching the reader and the text; it helps us to measure the appropriateness of texts for particular readers.

Every author should get his/her message across to the intended readers and keep them motivated by avoiding long sentences and unnecessarily complex words, because poor readers will soon be discouraged and overwhelmed by a huge number of new words and complex structures.

Text readability measurement has many potential benefits in the following fields: education, medicine, web applications, and information retrieval systems.

Readability Factors:
Readability factors are those that affect how well a text is read and understood. These factors can be divided into two types: reader factors and text factors.
a) Reader Factors
These are the factors related to the reader’s age and reading ability. A lot of research has found that, in addition to vocabulary and sentence structure, the reader’s prior knowledge, experience, interest, and motivation affect a text's readability in one way or another. The reader’s own inclinations also encourage him/her to read and comprehend the text.
b) Text Factors
There are many factors related to the text itself that affect text readability. Among these factors are the following:
• Certain aspects of words have a huge impact on text readability, such as word length, word frequency, vocabulary load, and the use of unusual or abstract words. Short and well-known words are easy to comprehend, and most readers recognize frequent words faster than infrequent ones.
• Average sentence length is an important feature that affects the readability of a text.
• The clarity of an idea mentioned in the text affects its readability, as does the number of parenthetical clauses.
• Topology, metaphor, and simile usually affect readability.
• A lot of research has found that the complexity of grammatical structure affects text readability. Relevant aspects include omission of one of the main sentence parts, distance between the main sentence parts (such as between the subject and the verb), separation of pronouns from the words they refer to, and using the passive voice more than the active voice.

Most research has focused on combinations of these factors to estimate text readability. For example, Larsson proposed a so-called "nominal quotient" (NQ) feature to model the different readability levels. It is calculated by counting the number of nouns, prepositions, and participles in a document, then dividing by the number of pronouns, adverbs, and verbs.
NQ = (nouns + prepositions + participles) / (pronouns + adverbs + verbs)

The equation estimates the amount of information in a given text. That amount affects readability: a less informative text is more readable than a highly informative one.
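Here is a small sketch of the NQ calculation. It assumes the text has already been run through some part-of-speech tagger; the tag names ("noun", "verb" and so on) are hypothetical labels I chose for the example, not the tag set of any particular tool.

```python
from collections import Counter

# Hypothetical tag names; a real tagger's tag set would need mapping to these.
NUMERATOR_TAGS = ("noun", "preposition", "participle")
DENOMINATOR_TAGS = ("pronoun", "adverb", "verb")

def nominal_quotient(pos_tags):
    """NQ = (nouns + prepositions + participles) / (pronouns + adverbs + verbs)."""
    counts = Counter(pos_tags)
    numerator = sum(counts[tag] for tag in NUMERATOR_TAGS)
    denominator = sum(counts[tag] for tag in DENOMINATOR_TAGS)
    if denominator == 0:
        raise ValueError("NQ is undefined without pronouns, adverbs or verbs")
    return numerator / denominator

tags = ["noun", "verb", "preposition", "noun",
        "pronoun", "adverb", "participle", "determiner"]
print(nominal_quotient(tags))
```

A higher NQ means a more noun-heavy, information-dense text, which by the argument above would be harder to read.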

PREVIOUS WORK:
Many traditional readability metrics are linear models with a few (often two or three) predictor variables based on superficial properties of words, sentences, and documents. These shallow features include the average number of syllables per word, the number of words per sentence, or binned word frequency.
--->>The Flesch-Kincaid Grade Level formula uses the average number of words per sentence and the average number of syllables per word to predict the grade level (Flesch, 1979).

--->>The Gunning FOG index (Gunning, 1952) uses average sentence length and the percentage of words with at least three syllables.

--->>The Automated Readability Index (Senter and Smith, 1967) counts the number of characters per word instead to determine word difficulty.

--->>The Dale-Chall formula uses the percentage of difficult words (words that do not appear in the Dale-Chall list of familiar words) and average sentence length to predict the grade level of a text.

--->>Stenner et al. (1983) analyzed more than 50 lexical variables and ran extensive correlation tests, finding that word frequency and sentence length have the most predictive power in ranking the reading difficulty of the texts in their experimental data.
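To show how simple these linear models are, here is a sketch of two of them working from raw counts. The coefficients are the commonly published ones for the Flesch-Kincaid Grade Level and the Automated Readability Index; they are not given in this post, so double-check them against the original papers before relying on them.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from total word, sentence and syllable counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("counts must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def automated_readability_index(characters, words, sentences):
    """ARI uses characters per word instead of syllables per word."""
    if words <= 0 or sentences <= 0:
        raise ValueError("counts must be positive")
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# Hypothetical counts for a 100-word passage made of 5 sentences.
print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
print(automated_readability_index(characters=450, words=100, sentences=5))
```

Both are just a weighted sum of one word-level and one sentence-level average, which is why they are so easy to compute.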

DISADVANTAGES OF THE ABOVE METRICS:
These traditional metrics are easy to compute and use, but they are not reliable, as demonstrated by several recent studies in the field (Si and Callan, 2001; Petersen and Ostendorf, 2006; Feng et al., 2009).

RECENT WORK:
With the advancement of natural language processing (NLP) tools, a wide range of more complex text properties have been explored at various linguistic levels.

--->>Si and Callan (2001) used unigram language models to capture content information from scientific web pages.

--->>Collins-Thompson and Callan (2004) adopted a similar approach and used a smoothed unigram model to predict the grade levels of short passages and web documents.

--->>Heilman et al. (2007) continued using language modeling to predict readability for first and second language texts. Furthermore, they experimented with various statistical models to test their effectiveness at predicting reading difficulty (Heilman et al., 2008).

--->>Schwarm and Ostendorf (2005) and Petersen and Ostendorf (2006) used support vector machines to combine features from traditional reading level measures, statistical language models and automatic parsers to assess reading levels.
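To give a feel for the language-modeling idea, here is a toy sketch: one unigram model per grade, and a new text is assigned to the grade whose model makes it most likely. The add-one smoothing and the two tiny training sets are my own simplifications, not the smoothing or data used in the papers cited above.

```python
import math
from collections import Counter

def train_unigram_models(labeled_texts):
    """labeled_texts maps a grade label to a list of training texts."""
    models, vocabulary = {}, set()
    for grade, texts in labeled_texts.items():
        counts = Counter(w for text in texts for w in text.lower().split())
        models[grade] = counts
        vocabulary.update(counts)
    return models, vocabulary

def log_likelihood(text, counts, vocabulary_size):
    """Log-probability of the text under an add-one smoothed unigram model."""
    total = sum(counts.values())
    score = 0.0
    for word in text.lower().split():
        # The extra +1 in the denominator reserves probability for unseen words.
        score += math.log((counts[word] + 1) / (total + vocabulary_size + 1))
    return score

def predict_grade(text, models, vocabulary):
    """Pick the grade whose model assigns the text the highest likelihood."""
    return max(models, key=lambda g: log_likelihood(text, models[g], len(vocabulary)))

# Made-up training data, far too small for real use.
training = {
    "grade2": ["the cat sat on the mat", "the dog ran to the cat"],
    "grade10": ["the committee evaluated the empirical hypothesis",
                "statistical analysis confirmed the hypothesis"],
}
models, vocabulary = train_unigram_models(training)
print(predict_grade("the dog sat on the mat", models, vocabulary))
print(predict_grade("the empirical analysis confirmed the hypothesis", models, vocabulary))
```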

In addition to lexical and syntactic features, several researchers started to explore DISCOURSE LEVEL features and examine their usefulness in predicting text readability.

The discourse features fall into four subsets: entity-density features, lexical-chain features, coreference inference features and entity-grid features.

---->>The coreference inference features are novel and have not been studied before.
----->>Entity-density features and lexical chain features have been studied for readers with intellectual disabilities (Feng et al., 2009).
------>>Entity-grid features have been studied by Barzilay and Lapata (2008) in a stylistic classification task.

Pitler and Nenkova (2008) used the Penn Discourse Treebank (Prasad et al., 2008) to examine discourse relations.

Entity density features include:

percentage of named entities per document

percentage of named entities per sentence

percentage of overlapping nouns removed

average number of remaining nouns per sentence

percentage of named entities in total entities

percentage of remaining nouns in total entities

Lexical Chain Features include:

total number of lexical chains per document
avg. lexical chain length
avg. lexical chain span
num. of lex. chains with span ≥ half doc. length
num. of active chains per word
num. of active chains per entity
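Here is a rough sketch of how a few of these chain features could be computed once the lexical chains are known. Finding the chains is the hard part and is not shown. I am assuming each chain is given as a list of word positions, that "length" means the number of members and that "span" means the distance between the first and last member; the thesis may define these differently.

```python
def chain_features(chains, doc_length):
    """Summarise lexical chains given as lists of 0-based word positions."""
    if not chains or doc_length <= 0:
        raise ValueError("need at least one chain and a positive document length")
    lengths = [len(chain) for chain in chains]            # members per chain
    spans = [max(chain) - min(chain) for chain in chains]  # first-to-last distance
    return {
        "num_chains": len(chains),
        "avg_chain_length": sum(lengths) / len(chains),
        "avg_chain_span": sum(spans) / len(chains),
        "num_long_span_chains": sum(1 for s in spans if s >= doc_length / 2),
    }

# Hypothetical chains in a 100-word document.
example_chains = [[0, 10, 40], [5, 8], [20, 90]]
features = chain_features(example_chains, doc_length=100)
print(features)
```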

Coreference Inference Features:

total number of coreference chains per document

avg. num. of coreferences per chain

avg. chain span

num. of coref. chains with span ≥ half doc. length

avg. inference distance per chain

num. of active coreference chains per word

num. of active coreference chains per entity

In our project, "AUTOMATIC READABILITY ASSESSMENT FOR TEXT SIMPLIFICATION", we are mainly focused on the discourse features of a text and how they can be used to assess or grade the readability of the text.

The rest of the discourse features will be addressed in my next blog :) So long till then :)

Tuesday, February 15, 2011

Previous work on readability assessment, applications of readability assessment and research work done by Lijun Feng towards readability assessment

Are you fed up with reading text that is not of your choice or level????

For this we need a readability assessment tool to select text of our choice :) :)

Let me start with what readability is and what we need to consider while comprehending a given text. Let me explain:

Readability is defined as a measure of the ease with which a written text can be understood.

Now the second question arises: what makes a text easy or difficult to understand? For this, let us go through the previous work done on readability assessment.

Relevant Literature & Previous Work on Readability Assessment

Let us first see the characteristics and limitations of traditional readability metrics and recent statistical developments in the field of readability.

Traditional readability metrics are given below:

1. The Flesch Reading Ease and Flesch-Kincaid Grade Level formulas (Flesch, 1979) use average sentence length and average syllables per word to calculate the reading ease score or grade level of a text.

2. The Gunning FOG (Gunning, 1952) and SMOG (McLaughlin, 1969) indices use average sentence length and the percentage of words with at least three syllables as parameters.

3. The Automated Readability Index (Senter and Smith, 1967) counts the number of characters per word instead to determine word difficulty.

4. The Dale-Chall formula uses the percentage of difficult words (words that do not appear in the Dale-Chall list of familiar words) and average sentence length to predict the grade level of a text.

5. Stenner et al. (1983) analyzed more than 50 lexical variables and ran extensive correlation tests, finding that word frequency and sentence length have the most predictive power in ranking the reading difficulty of the texts in their experimental data.

The advantages of traditional readability metrics are as follows:

>>These traditional metrics are widely used, especially in educational settings, because they are simple and easy to calculate.

>>Grade levels calculated by the above methods indicate the number of years of education generally required to understand the text. It is generally understood that reading difficulty increases with grade level. Grade levels are a commonly accepted index of the reading difficulty of a text, especially in educational settings, because the scale makes it easier for teachers, parents, librarians, and others to judge the readability level of various books and texts. Another reason to look at grade levels is that they have been widely used in previous research.

Drawbacks of traditional readability metrics:

>>>They ignore syntactic constituents, the structure of the text, and local and global discourse coherence across the text (that is, the basis for coherent discourse: familiarity of the discourse topic to the reader, and the reader’s prior knowledge and motivation to read).

>>>The traditional metrics cannot capture content information and often misjudge the reading difficulty of scientific web documents.

Statistical approaches towards readability metrics

Si and Callan (2001) used unigram language models to capture content information from scientific web pages. A linear model was built combining language models with sentence length.

Collins-Thompson and Callan (2004) adopted a Smoothed Unigram model to capture vocabulary variation across all grade levels contained in the corpus. Their Smoothed Unigram model is purely vocabulary-based and does not contain any syntactic features. Although vocabulary-based unigram language models help capture important content information and variation in word usage, they do not capture syntactic information.

Schwarm and Ostendorf (2005)

They used Charniak’s parser (Charniak, 2000) and higher-order n-gram (n = 3) models over a combination of word and part-of-speech (POS) sequences to capture syntactic and semantic features. However, their work was limited to the study of lexical and syntactic features with regard to text comprehensibility.

Heilman et al. (2007)

Their readability measurement was motivated by pedagogical differences in first language (L1) and second language (L2) learning. They argue that grammatical features play a more important role in L2 texts than in L1 texts because, unlike L1 learners who learn grammar through natural interaction, L2 learners learn grammatical patterns explicitly from L2 textbooks.

However, this work too was limited to the study of lexical and syntactic features with regard to text comprehensibility.

Barzilay and Lapata (2008)

The first work on discourse relations was done by Barzilay and Lapata, who designed and implemented an entity-grid model to capture the distribution of entity transition patterns at the sentence-to-sentence level.

Cognitive science suggests that the most important processes in reading comprehension lie in discourse comprehension, which entails making appropriate inferences from concepts and propositions, and connecting and/or integrating related information to construct a coherent memory representation.

Their work was not motivated by text readability, but rather by other NLP tasks related to text generation, such as text ordering and summary coherence rating.
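To picture what an entity grid looks like, here is a toy sketch. Each sentence is given as a hand-made mapping from entity to grammatical role (S = subject, O = object, X = other, and "-" when the entity is absent); a real system would get these from a parser and coreference resolver. The feature computed is the share of each role-to-role transition between adjacent sentences. This is a simplified illustration, not Barzilay and Lapata's implementation.

```python
from collections import Counter
from itertools import product

ROLES = ("S", "O", "X", "-")

def build_grid(sentences):
    """One column per entity, one row per sentence; '-' marks absence."""
    entities = sorted({entity for sentence in sentences for entity in sentence})
    return {e: [sentence.get(e, "-") for sentence in sentences] for e in entities}

def transition_probabilities(grid):
    """Relative frequency of each role pair across adjacent sentences."""
    counts = Counter()
    for column in grid.values():
        for current_role, next_role in zip(column, column[1:]):
            counts[(current_role, next_role)] += 1
    total = sum(counts.values())
    return {pair: (counts[pair] / total if total else 0.0)
            for pair in product(ROLES, repeat=2)}

# Hand-annotated toy text of three sentences.
sentences = [
    {"Humpty": "S", "wall": "X"},
    {"Humpty": "S", "fall": "O"},
    {"king": "S", "Humpty": "O"},
]
grid = build_grid(sentences)
probabilities = transition_probabilities(grid)
print(grid)
print(probabilities[("S", "S")])
```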

Pitler and Nenkova (2008) for the first time looked at readability factors at all three linguistic levels: lexical, syntactic and discourse. In the PDTB (Penn Discourse Treebank), all discourse connectives and the relations between two adjacent sentences of a text were manually annotated. Among all individual factors analyzed at the three linguistic levels, the likelihood of discourse relations with text length taken into account shows the strongest correlation with human readability ratings (r = .4835). Their work is novel and inspiring because it touched the core of text comprehension and showed a new direction in readability study that has been long overdue.

Limitations of Pitler and Nenkova’s work

1. It cannot be adopted for any corpus other than the PDTB.

2. They mainly focused on text style rather than text readability, i.e. how well a text is written rather than how difficult or easy a text is to read.

3. Their experiment was conducted on only 30 articles, and because they relied only on limited subjective human ratings, their study lacks any objective measure.

After reading all the previous work done on readability, let me conclude in a simple way that readability cannot be judged solely by

1. >> lexical tokenisation (which looks at three factors: the number of syllables a word contains, the number of characters a word contains, and word frequency)

2. >> syntactic representation (the complexity of sentences is judged solely by their average length in words)

3. >> sentence processing

But also on .....

>>> the discourse relations that let the reader build a coherent memory representation of the text. Discourse relations here cover the amount of prior knowledge the reader needs to apply, the inferences the reader needs to make, relating the text to accumulated knowledge, the references required to understand the text, and searching for and retrieving the relevant information for comprehension.

>>> It also depends on working memory capacity. If the text is not related to the main topic of discussion, that is, it is not held in current working memory, then the reader has to search long-term memory to understand it.

Now coming to the major contributions towards readability assessment made by Lijun Feng:

1. The thesis studies readability from a text comprehension point of view; in particular, it pays special attention to discourse processes that are crucial for constructing and maintaining local and global coherence of the memory representation of a text (we can think of these as short-term and long-term memory), which is key to successful text comprehension.

2. It examines the processes that occur in discourse comprehension, which include activities such as resolving entities, inferring meaning from words and phrases, assessing and evaluating semantic relations among concepts and propositions and making connections among them, using background knowledge to generate appropriate inferences to fill in gaps, and integrating new information into the existing semantic structure to achieve and maintain a coherent memory representation of a text.

3. The thesis proposes to apply advanced NLP techniques to implement three classes of novel discourse features that have not been studied in previous research, i.e. density of entities, lexical chains and coreferential inference features.

4. It focuses not only on intrinsic text properties but views text comprehensibility as the result of the interaction between the text and the reader’s prose processing ability. The characteristics of a given reader are brought into the readability study by addressing the constraints that working memory capacity places on the reader’s comprehension effort. The constraints on working memory are highlighted here because individuals with ID (intellectual disabilities) do not have the same memory capacity as those without ID.

5. Working memory was taken into consideration while extracting the discourse features. Working memory has a great impact on various language comprehension activities, because it provides temporary storage and simultaneous manipulation of information and coordinates the resources that are necessary for comprehension processes during reading. Individuals with ID do not have the same working memory capacity as individuals without ID, which accounts for variation in comprehension performance.

6. The thesis proposes the development of an automatic readability assessment tool, which consists of four major parts: data collection, feature extraction and implementation, building and evaluating the tool on labeled corpora, and testing and evaluating the tool on unlabeled texts from a different domain.

7. The study of readability combines various proxies, such as paired original/simplified corpora, grade levels, subjective ratings by experts and users, and objective observations in user studies, to capture the underlying text properties that are associated with reading difficulties.

Let me conclude by listing the applications of building the automatic readability assessment tool.

1. >>> In educational settings, school children, second language learners and adults with low literacy can use the tool to select reading material that is of interest to them and tailored to their varying reading proficiency.

2. >>> Language instructors can use this tool to select teaching material that is at an appropriate level of reading difficulty for the target readers.

3. >>> It can be used to rank documents by reading difficulty for automated systems such as text simplification, text summarization, machine translation and other text generation systems. For example, the tool can be used to select documents at an appropriate level of reading difficulty, among those on a similar topic, for the target system to begin with.

4. >>> The tool can accurately assess the reduction in reading difficulty before and after the simplification process.

5. >>> We can use the tool to check the quality of text generated by systems such as text summarization, machine translation and text ordering systems. Comparing the reading difficulty before and after simplification is required to check the coherence (as coherent texts are easier to read).

I hope the above information will help us before we delve deeper into the thesis. Finally, I can say that reading the thesis three times has brought me up to the same level of understanding :)

Summary on "Automatic Readability Assessment" thesis

When I was told that I needed to read a 200 page thesis: “Automatic Readability Assessment” by Lijun Feng, the first thought that occurred to me was “I will definitely doze off 20 pages into the paper!!” But this thesis is pretty extensive in its coverage on the topic that forms the backbone of our project: what exactly goes into understanding-the-understandability of the text.

I provide a summary of the paper in this blog.

Readability is commonly defined as a measure of what makes a text easy or hard to read and has been the central topic of readability research for the past 80 years. Many traditional metrics exist for text readability which bank upon a limited set of textual features, such as sentence length, number of syllables per word, word frequency, etc. Though these metrics are easy to compute, several studies have shown them to be unreliable.

Language models and parsers that use the NLP technology have been used to explore complex lexical features and syntactic constructs in aiding readability study. But readability research has not made much progress beyond lexical and syntactic analysis as these features are easier to define and measure with existing techniques, while factors such as discourse topic and discourse coherence require much more complex semantic analysis, and hence remain as challenging problems.

The thesis focuses on developing an automatic text readability assessment tool at various discourse levels while taking user characteristics into account. The primary goal of the thesis is to quantify and understand what makes a text easy or difficult to read, particularly for readers with mild intellectual disabilities (MID).

In order to assess how well the readability assessment tool works, corpora were created consisting of original and simplified texts. The tool’s ability to differentiate between original and simplified text was evaluated, and the tool’s predictions were correlated with independent measures of text difficulty rated by experts and by adult participants with mild intellectual disabilities.

A reader processes sentences as he/she reads them, organizes the memory units extracted from word and sentence processing, and places these units in memory in an organized and structured manner. The coherent memory representation is constructed and maintained through the reader’s ability to process the text and resolve references by making suitable inferences. Low working memory capacity has been shown to be related to a reduction in the speed and accuracy with which sentences can be processed.

The following elements underlie Feng’s approach to readability:
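One simple way to picture the "differentiate between original and simplified text" evaluation is pairwise accuracy: for each original/simplified pair, check whether the tool rates the original as harder. The sketch below is my own illustration, with average sentence length standing in for the real tool's difficulty score and two made-up text pairs.

```python
import re

def avg_sentence_length(text):
    """Stand-in difficulty score: average number of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(text.split()) / len(sentences) if sentences else 0.0

def pairwise_accuracy(pairs, score):
    """Share of (original, simplified) pairs where the original scores harder."""
    if not pairs:
        raise ValueError("need at least one text pair")
    correct = sum(1 for original, simplified in pairs
                  if score(original) > score(simplified))
    return correct / len(pairs)

# Made-up original/simplified pairs.
pairs = [
    ("The committee, which met on Monday, approved the proposal after a long debate.",
     "The committee met on Monday. It approved the proposal."),
    ("Although it was raining heavily, the children decided to play outside in the garden.",
     "It was raining heavily. The children played outside."),
]
accuracy = pairwise_accuracy(pairs, avg_sentence_length)
print(accuracy)
```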

Text readability is not determined by intrinsic text properties alone. Rather, reading ease or difficulty results from the interaction of the reader and the text.

The goal of reading is to construct a coherent memory representation of a text. Word identification and sentence parsing are part of the basic comprehension processes that occur at the low level of text comprehension. Many reading difficulties arise from the higher level of discourse comprehension, which mostly involves evaluating and identifying relations among conceptual information, resolving references to establish entities in a text, and making various types of inferences to fill in missing information.

Working memory has great impact on various language comprehension activities, because it provides temporary storage and simultaneous manipulation of information and coordinates resources that are necessary for comprehension processes during reading.

Working memory capacity constantly places constraints on readers’ attempt to understand a text. Individual differences in working memory capacity account for some of the variation in comprehension performance.

Text comprehensibility can be well predicted by an analysis of the demands it makes of readers’ working memory.

The thesis is especially targeted at helping people with MID (Mild Intellectual Disability).

So a situation like this:

[image]

indicates that the person has a problem reading, comprehending, analyzing and joining the dots to make inferences while reading each line. The limitation in their cognitive functioning is due to varying degrees of impairment, which affect their reading comprehension directly. The ability to actively and strategically apply one’s semantic knowledge to facilitate comprehension activities is considered crucial in understanding differences in individual comprehension performance. In many empirical studies, individuals with ID were observed to show deficits in various aspects of semantic processing.

It is difficult to find reading materials for individuals with MID that are

(1) of interest to them and

(2) at the right reading level.

Reading materials at lower reading levels are typically written for children, and texts written for adults without disabilities often require a high level of linguistic skills and sufficient real world knowledge, which these individuals often lack. The lack of appropriate reading materials may also discourage adults with ID from practicing reading, thus diminishing their already low literacy skills.

Text simplification applies transformation rules that turn complex constructs into shorter or plainer sentences, which are thought to be easier for people with MID to understand. However, synonym replacement and syntax-tree simplification alone are not enough: in addition to lexical and syntactic challenges, these readers have other difficulties with processing written information. Moreover, text simplification often increases the length of the document, because long and complex sentences are split into multiple shorter sentences. The increased length can pose another challenge to the already limited working memory capacity of readers with MID, because it requires processing and storing more information. A system therefore needs to be designed in which the most relevant information is retained and less relevant information is simplified or left out completely.

There are two major research questions at the center of the design and implementation of such a text simplification system:

(1) How do we identify which portions of a text will pose difficulty for our users?

(2) When there are several possible simplification choices, how do we decide which is the optimal one to choose for our users?

Ideally, a reliable automatic readability assessment tool would help answer both questions and aid automatic text simplification in many ways. It can be used to rank documents by reading difficulty for automated systems such as text simplification, text summarization, machine translation and other text generation systems. For example, as a preprocessing step, such a tool can select documents at an appropriate level of reading difficulty from among those on a similar topic for the target system to begin with. More importantly, such a tool can provide an efficient evaluation measure for system performance.

One of many important aspects to look at when evaluating the quality of text generated by automated systems is coherence. Coherent texts are easier to read. One of many ways to check the coherence of the resulting texts is to compare their reading difficulty before and after the change. Feng’s automatic readability assessment tool is well suited for this task.

To make it easier for people to judge the reading difficulty of a text, grade levels or the number of years of education required to completely understand a text are commonly used as an index of reading difficulty.

Many traditional readability metrics use simple linear functions with two or three shallow language features to model the readability of a given text. For example, the widely used Flesch Reading Ease and Flesch-Kincaid Grade Level formulas use average sentence length and average syllables per word to calculate the grade level of a text. Similarly, the Gunning FOG and SMOG indexes use average sentence length and the percentage of words with at least three syllables as parameters. The Automated Readability Index counts the number of characters per word instead to determine word difficulty. Departing from the syllabic approach, the Dale-Chall formula made an advance in measuring lexical difficulty by introducing a list of common words familiar to 4th-grade students. It uses the percentage of difficult words (words that do not appear in the list) and average sentence length to predict the grade level of a text.

Flesch Reading Ease: is based on a 0-100 scale. A high score indicates that a text is easier to read.

206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)

Flesch Kincaid Grade Level:

0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59

Gunning FOG score:

0.4 * ((words/sentences) + 100 * (complex words/words))

SMOG Index:

1.0430 * sqrt(30 * complex words/sentences) + 3.1291

ARI:

4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43
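To make the five formulas concrete, here is a small Python sketch that computes each score from raw counts. The function and parameter names are my own, and the counts (words, sentences, syllables, complex words, characters) are assumed to be already available; counting syllables reliably is a separate problem. ARI is written in its standard published form, with the sentence-length term added.

```python
import math

def flesch_reading_ease(words, sentences, syllables):
    # Higher score means easier text (roughly a 0-100 scale).
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Result is an approximate US school grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    # complex_words: words with three or more syllables.
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def smog_index(sentences, complex_words):
    # Polysyllable count normalized to a 30-sentence sample.
    return 1.0430 * math.sqrt(30 * complex_words / sentences) + 3.1291

def automated_readability_index(characters, words, sentences):
    # Uses characters per word instead of syllables per word.
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
```

For a made-up text with 100 words, 5 sentences, 150 syllables, 10 complex words and 450 characters, these give a Reading Ease of about 59.6 and grade-level estimates between roughly 9.8 and 12.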

These traditional metrics are widely used, especially in educational settings, partly because they are simple and easy to calculate. However, their limitations are obvious. They overweight the impact of word frequency and sentence length on text comprehensibility and systematically ignore many other factors that are crucial to reading, such as syntactic constituents, the structure of the text, local and global discourse coherence across the text, familiarity of the discourse topic to the reader, and readers’ prior knowledge and motivation to read.

Moreover, the number of syllables per word, which is used as a proxy for word frequency, and sentence length do not always capture the reading complexity of a text accurately. Hence traditional metrics have been shown to be unreliable.

The most recent work in this direction includes:

Detailed analysis of syntactic complexity based on parse trees has been combined with language models and traditional measures in readability research (Heilman et al., 2007; Pitler and Nenkova, 2008; Schwarm and Ostendorf, 2005).

Besides three traditional measures (average sentence length, average number of syllables per word and Flesch-Kincaid score), Schwarm and Ostendorf (2005) used Charniak’s parser (Charniak, 2000) and higher order n-gram (n = 3) models over a combination of word and part-of-speech (POS) sequences to capture syntactic and semantic features. The four parse features are average parse tree height, average number of noun phrases, average number of verb phrases, and average number of “SBAR”s (relative clauses).

Pitler and Nenkova (2008) for the first time looked at readability factors at all three linguistic levels: lexical, syntactic and discourse. They analyzed six classes of features: traditional readability factors such as average number of characters per word, average sentence length, maximum number of words per sentence, document length, vocabulary-based unigram features, the four parse features of Schwarm and Ostendorf mentioned above, elements of text cohesion and discourse relations.

The thesis approaches readability from a text comprehension point of view, with special attention to the discourse processes that are crucial for constructing and maintaining local and global memory coherence of a text, which is key to successful text comprehension. These discourse processes reflect the reader’s comprehension task and can be useful in predicting the complexity of a text. Advanced NLP techniques have been applied to implement three classes of novel discourse features that have not been studied in any of the previous research.
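To get a feel for how a language model turns into a readability feature, here is a minimal perplexity sketch. It is deliberately simpler than what Schwarm and Ostendorf describe: it uses a bigram model with add-one smoothing over plain words rather than trigrams over word/POS sequences, and all names are my own. The idea is that a text scoring low perplexity under a model trained on easy texts "looks like" easy text.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """sentences: list of token lists. Returns (context counts, bigram counts, vocabulary)."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        contexts.update(toks[:-1])            # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
    vocab = {t for s in sentences for t in s} | {"</s>"}
    return contexts, bigrams, vocab

def perplexity(model, sentences):
    """Perplexity of the sentences under an add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1                        # one extra slot for unseen words
    log_prob, n = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            p = (bigrams[(a, b)] + 1) / (contexts[a] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)
```

In a setup like this, one model would be trained per grade level, and the perplexity of a new text under each model becomes one feature for the classifier.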

The study does not rely on a single measure of readability and combines various proxies, such as paired original/simplified corpora, grade levels, subjective ratings by experts and users, and objective observations in our user studies, to get at those underlying text properties that are associated with reading difficulties.

The methods employed in the paper consist of four major parts:

Data collection
Feature extraction and implementation
Building and evaluating the tool on labeled corpora
Testing and evaluating the tool on unlabeled texts from different domains.

The main corpus for the study consists of texts whose reading difficulty is annotated by elementary grade level, ranging from Grade 2 to 5. The corpus is used to build and evaluate the automatic text readability assessment tool.

There are two ways of assessing how well the readability assessment tool generalizes to texts from different domains:

First, two corpora are manually created consisting of original and simplified texts adapted specifically for adults with mild intellectual disabilities. The automated readability assessment tool predicts grade levels for the original and simplified texts contained in these two corpora.

Second, the correlations between grade level predictions by the tool, expert ratings, and inferred text difficulty for adult participants with mild intellectual disabilities are compared.
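As a rough illustration of the second check, the sketch below computes a correlation between two lists of scores, for example the tool's predicted grade levels and expert ratings for the same texts. I have not found which correlation coefficient the thesis uses, so Pearson's r is an assumption here, and the function name is my own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either list is constant.
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 would mean the tool ranks texts much as the experts do; a value near 0 would mean the predictions say little about the ratings.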

Hence the general methodology relies on the following five proxies:

Grade levels: Grade levels indicate the number of years of education generally required to understand the text. It is generally understood that reading difficulty increases with grade level.
Paired original/simplified texts: A common assumption is that simplified texts should be easier to read.
Subjective ratings by experts: Experts who have linguistic expertise or specialize in working with adults with ID were asked to rate text difficulty.
Objective observations in user studies: Target users are given texts at a variety of difficulty levels and their reading times are recorded. Subjects answer simple comprehension questions afterwards, and the accuracy of their answers is analyzed. This gives the most direct clues about the difficulties faced by the target user group, even though per-subject and other effects need to be accounted for.
Subjective (introspective) ratings by users: This will probably be especially problematic in the study, as the users’ subjective judgment may not be fully reliable because of their cognitive impairments.

Research Hypothesis

The thesis proposes to design and implement four classes of novel discourse features that will best reflect working memory burden posed on the reader’s attempt to understand a text: density of entities, lexical chains, coreferential inference features and local entity coherence features.

Density of Entities: Conceptual information is often introduced in a text by entities, which consist of general nouns and named entities, such as people’s names, locations, organizations, etc. The more entities are introduced into a text, the more demands they make on the reader’s working memory capacity; for individuals with ID who suffer from impoverished working memory, the increasing demands of entity processing would become especially overwhelming.

Lexical Chains: Using existing NLP technology, various semantic relations among entities (such as synonym, hypernym, hyponym and coordinate terms, i.e. siblings) can be automatically annotated. Based on these annotations, entities that are connected by certain semantic relations can be chained up through the text to form a lexical chain.

Coreferential Inferences: Readers are required to actively apply acquired prior background knowledge to disambiguate and make appropriate inferences. The inference processes involve searching and retrieving relevant information from various long- and short-term memory systems.

Corpora

The six corpora collected for the readability study were:

Labeled Corpus: from WeeklyReader, LocalNews2007, LocalNews2008 and NewYorkTimes100

Unlabeled Paired Corpora: from Britannica and LiteracyNet

Feature Extraction

This part covers the features used in the automatic text readability assessment tool and the techniques deployed to extract and implement them.

The following 5 feature subsets were proposed, many of which result from refinement and improvement of previously studied features.

Discourse Features
Language-Modeling-based Perplexity Features
Parsed Syntactic Features
Part-Of-Speech-based (POS) Features
Shallow Features

Discourse Features

Four subsets of discourse features were given: entity-density features, lexical-chain features, coreference inference features and entity grid features.

The first three subsets of features are novel and have not been studied by other researchers before.

Entity-Density Features: The entities are defined as a union of named entities and the rest of general nouns (nouns and proper nouns) contained in a text.
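A minimal sketch of what entity-density features could look like is shown below. It assumes the text has already been POS-tagged with Penn Treebank tags and, as a simplification, treats every noun or proper-noun token as an entity mention; a real system would also merge in the output of a named entity recognizer. The feature names are my own, not the thesis's.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun and proper-noun tags

def entity_density(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (token, tag) pairs."""
    mentions = [tok.lower()
                for sent in tagged_sentences
                for tok, tag in sent
                if tag in NOUN_TAGS]
    n_sentences = len(tagged_sentences)
    n_words = sum(len(sent) for sent in tagged_sentences)
    unique = set(mentions)
    return {
        "total_entities": len(mentions),
        "unique_entities": len(unique),
        "entities_per_sentence": len(mentions) / n_sentences,
        "unique_entities_per_sentence": len(unique) / n_sentences,
        "entity_ratio": len(mentions) / n_words,  # share of words that are entity mentions
    }
```

The more of these entities a reader has to keep track of, the heavier the presumed load on working memory.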

Lexical Chain Features: LexChainer produces chains of words connected by six semantic relations: synonymy, hypernymy, hyponymy, meronymy, holonymy and coordinate terms (siblings) (Galley and McKeown, 2003). The hypothesis is that important conceptual and topical information recurring throughout a text is likely to be captured by these lexical chains. In order to construct a coherent semantic representation of a text, a reader has to keep semantically related discourse units in his/her working memory throughout the whole reading comprehension process.
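The chaining idea can be sketched in a few lines. This is not the LexChainer algorithm (which also does word sense disambiguation); it is a toy version that links repeated words and any word pairs listed in a hand-supplied `related_pairs` set, a stand-in for relations that would normally come from a lexical resource such as WordNet. All names are hypothetical.

```python
def build_chains(tokens, related_pairs):
    """tokens: words in document order.
    related_pairs: set of frozenset({a, b}) for semantically related words.
    Returns chains as lists of token positions (only chains with 2+ members)."""
    parent = list(range(len(tokens)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            a, b = tokens[i], tokens[j]
            if a == b or frozenset((a, b)) in related_pairs:
                parent[find(j)] = find(i)

    chains = {}
    for i in range(len(tokens)):
        chains.setdefault(find(i), []).append(i)
    return [idxs for idxs in chains.values() if len(idxs) > 1]

def chain_features(chains):
    """Number of chains, average chain length and average span in tokens."""
    if not chains:
        return {"num_chains": 0, "avg_length": 0.0, "avg_span": 0.0}
    n = len(chains)
    return {
        "num_chains": n,
        "avg_length": sum(len(c) for c in chains) / n,
        "avg_span": sum(c[-1] - c[0] for c in chains) / n,
    }
```

A long span means the reader has to hold a concept in memory across a large stretch of text, which is the kind of burden these features are meant to capture.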

Coreferential Inference Features: Relations among concepts and propositions are often not stated explicitly in a text. The constructive nature of building a coherent semantic representation of a text requires a reader to actively retrieve and assess previously processed information to generate appropriate inferences when conceptual information is not stated explicitly.

Entity Grid Features: Features extracted from entity grid models are studied for their effectiveness in automatic readability assessment.
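For readers who have not met entity grids before: in the usual formulation (Barzilay and Lapata), each entity gets a column recording its grammatical role in every sentence (S for subject, O for object, X for other, - for absent), and the features are the relative frequencies of role transitions between adjacent sentences. The sketch below assumes the roles have already been extracted by a parser; the input format and function names are my own.

```python
from collections import Counter
from itertools import product

ROLES = ("S", "O", "X", "-")  # subject, object, other, absent

def entity_grid(sentences):
    """sentences: list of dicts mapping entity -> role in that sentence.
    Returns a dict mapping each entity to its column of roles."""
    entities = sorted({e for sent in sentences for e in sent})
    return {e: [sent.get(e, "-") for sent in sentences] for e in entities}

def transition_probabilities(grid):
    """Relative frequency of each length-2 role transition, e.g. ('S', 'O')."""
    counts = Counter()
    for column in grid.values():
        for a, b in zip(column, column[1:]):
            counts[(a, b)] += 1
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 0.0)
            for t in product(ROLES, repeat=2)}
```

This yields 16 features per document. A text whose entities keep reappearing in prominent roles (many S-to-S transitions) is generally taken to be more locally coherent.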

Parsed Syntactic Features

Recent approaches to readability have used natural language processing techniques such as probabilistic parsers to analyze syntactic features of texts and have reported their positive contributions. Schwarm and Ostendorf studied four parse tree features (average parse tree height and average number of SBARs, noun phrases, and verb phrases per sentence). The paper implemented these and additional features, using the Charniak parser (Charniak, 2000). Our parsed syntactic features focus on clauses (SBAR), noun phrases (NP), verb phrases (VP) and prepositional phrases (PP). For each phrase type, four features are implemented: total number of phrases per document, average number of phrases per sentence, and average phrase length measured by number of words and by number of characters.
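Here is a small sketch of how the four per-phrase features, plus tree height, could be computed once a parser has produced bracketed trees. It works on Penn Treebank style strings so that it stays self-contained; the tiny tree reader and the feature names are my own, and the input is assumed to be well formed.

```python
def parse_tree(s):
    """Read a bracketed tree such as '(S (NP (DT The) (NN cat)) ...)'.
    Nodes are (label, children) tuples; leaves are plain strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

def height(node):
    # Words have height 0; each bracket level adds 1.
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node[1])

def phrases(node, label):
    if isinstance(node, str):
        return []
    found = [node] if node[0] == label else []
    for child in node[1]:
        found.extend(phrases(child, label))
    return found

def phrase_features(trees, label):
    """The four features for one phrase type (e.g. 'NP') over a document."""
    spans = [leaves(p) for tree in trees for p in phrases(tree, label)]
    n = len(spans)
    return {
        "total": n,
        "per_sentence": n / len(trees),
        "avg_words": sum(len(sp) for sp in spans) / n if n else 0.0,
        "avg_chars": sum(len(w) for sp in spans for w in sp) / n if n else 0.0,
    }
```

Calling `phrase_features(trees, label)` for each of SBAR, NP, VP and PP gives the sixteen phrase features; averaging `height` over the trees gives the parse tree height feature.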

POS Features

The paper focuses on five classes of words (nouns, verbs, adjectives, adverbs, and prepositions) and two broad categories (content words, function words).

Nouns include general nouns and proper nouns. Verbs include past tenses, present participles, past participles and modals in addition to infinitives, present 3rd person singular forms and all forms of auxiliary verbs. Content words include nouns, verbs, numerals, adjectives, and adverbs; the remaining types are function words.
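The POS features boil down to counting tags. The sketch below assumes Penn Treebank tags and punctuation-free input; the mapping from tags to the five word classes is my own reading of the description above (for instance, the IN tag also covers subordinating conjunctions, so "preposition" is approximate), not the thesis's exact definition.

```python
TAG_CLASSES = {
    "noun": {"NN", "NNS", "NNP", "NNPS"},
    "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"},
    "adjective": {"JJ", "JJR", "JJS"},
    "adverb": {"RB", "RBR", "RBS"},
    "preposition": {"IN"},
}
# Content words: nouns, verbs, numerals (CD), adjectives and adverbs.
CONTENT_TAGS = (TAG_CLASSES["noun"] | TAG_CLASSES["verb"] |
                TAG_CLASSES["adjective"] | TAG_CLASSES["adverb"] | {"CD"})

def pos_features(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (token, tag) pairs."""
    tags = [tag for sent in tagged_sentences for _, tag in sent]
    n_words, n_sentences = len(tags), len(tagged_sentences)
    feats = {}
    for name, tagset in TAG_CLASSES.items():
        count = sum(1 for tag in tags if tag in tagset)
        feats[name + "_ratio"] = count / n_words
        feats[name + "_per_sentence"] = count / n_sentences
    content = sum(1 for tag in tags if tag in CONTENT_TAGS)
    feats["content_ratio"] = content / n_words
    feats["function_ratio"] = (n_words - content) / n_words
    return feats
```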

Shallow Features

Shallow features refer to those used by traditional readability metrics, such as Flesch-Kincaid Grade Level (Flesch, 1979), SMOG (McLaughlin, 1969), Gunning FOG (Gunning, 1952), etc. Although recent readability studies have strived to take advantage of NLP techniques, little has been revealed about the predictive power of shallow features.

Some of the shallow features are:

1) average number of syllables per word

2) percentage of polysyllabic words per document

3) average number of polysyllabic words per sentence

4) average number of characters per word

5) Dale-Chall difficult words rate per document

6) average number of words per sentence

7) average number of characters per sentence

8) Flesch-Kincaid score

9) total number of words per document
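Most of these shallow features can be computed with nothing more than regular expressions. The sketch below is a rough approximation: sentences are split on ., ! and ?, words are runs of letters, and syllables are estimated by counting vowel groups, which is a crude heuristic of my own choosing (it miscounts silent e's, for example). The Dale-Chall rate is left out because it needs the Dale-Chall word list.

```python
import re

def count_syllables(word):
    """Crude estimate: number of vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def shallow_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    n_sent, n_words = len(sentences), len(words)
    n_syll = sum(syllables)
    n_poly = sum(1 for s in syllables if s >= 3)   # polysyllabic: 3+ syllables
    n_chars = sum(len(w) for w in words)
    return {
        "syllables_per_word": n_syll / n_words,
        "pct_polysyllabic_words": 100 * n_poly / n_words,
        "polysyllabic_words_per_sentence": n_poly / n_sent,
        "chars_per_word": n_chars / n_words,
        "words_per_sentence": n_words / n_sent,
        "chars_per_sentence": n_chars / n_sent,
        "flesch_kincaid_grade": 0.39 * (n_words / n_sent)
                                + 11.8 * (n_syll / n_words) - 15.59,
        "total_words": n_words,
    }
```

For example, `shallow_features("The cat sat. The elephant is enormous.")` finds 2 sentences, 7 words and 2 polysyllabic words.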

Automatic Readability Assessment

The effectiveness of the features is studied in terms of their impact on predicting reading difficulty, indexed by grade level.

To summarize, within the four subsets of discourse features, the following key observations were made:

Among all four subsets of features, entity-density features exhibit the most significant discriminative power in modeling text reading difficulty.
Combining all discourse features together leads to overall improvement. However, the best performance is achieved by combining entity density features and entity grid features together.
Analysis at grade level reveals that entity-density features generate the highest accuracy for Grade 2 (57.41%) and 4 (50.09%); combining all features produces the best performance for Grade 3 (57.09%); and entity grid features generate the highest accuracy for Grade 5 (80.96%).
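Per-grade numbers like these come from breaking classification accuracy down by the true grade of each text. The sketch below shows that bookkeeping under the assumption that accuracy means exact-match accuracy; the function name is my own and the thesis's evaluation setup may differ in its details.

```python
from collections import defaultdict

def per_grade_accuracy(gold, predicted):
    """Fraction of texts at each gold grade whose predicted grade matches exactly."""
    correct, total = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {g: correct[g] / total[g] for g in sorted(total)}
```

Running this once per feature subset and comparing the dictionaries grade by grade gives the kind of comparison summarized above.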

This is as far as I could read and comprehend in the thesis. The actual implementation details in the thesis would need some text simplification for me ( :P ) I’ll definitely have to bury my head deeper into the thesis!

So long till then!