When I was told that I needed to read a 200-page thesis, “Automatic Readability Assessment” by Lijun Feng, the first thought that occurred to me was “I will definitely doze off 20 pages into the paper!!” But this thesis is pretty extensive in its coverage of the topic that forms the backbone of our project: what exactly goes into understanding the understandability of a text.
I provide a summary of the paper in this blog.
Readability is commonly defined as a measure of what makes a text easy or hard to read, and it has been the central topic of readability research for the past 80 years. Many traditional metrics for text readability rely on a limited set of textual features, such as sentence length, number of syllables per word, and word frequency. Though these metrics are easy to compute, they have proven to be highly unreliable.
Language models and parsers that use NLP technology have been used to explore complex lexical features and syntactic constructs to aid readability study. But readability research has not made much progress beyond lexical and syntactic analysis: these features are easier to define and measure with existing techniques, while factors such as discourse topic and discourse coherence require much more complex semantic analysis and hence remain challenging problems.
The thesis focuses on developing an automatic text readability assessment tool at various discourse levels while taking user characteristics into account. The primary goal of the thesis is to quantify and understand what makes a text easy or difficult to read, particularly for readers with mild intellectual disabilities (MID).
To assess how well the readability assessment tool performs, corpora were created consisting of original and simplified texts. The tool's ability to differentiate between original and simplified text was evaluated, and its predictions were compared against independent measures of text difficulty: ratings by experts and by adult participants with mild intellectual disabilities.
A reader processes sentences as he or she reads, and the memory units extracted from word and sentence processing are placed in memory in an organized, structured manner. A coherent memory representation is constructed and maintained by the reader's ability to process the text and resolve references by making suitable inferences. Low working memory capacity has been shown to reduce the speed and accuracy with which sentences can be processed.
The following elements underpin Feng's approach to readability:
- Text readability is not determined by intrinsic text properties alone. Rather, reading ease or difficulty results from the interaction of the reader and the text.
- The goal of reading is to construct a coherent memory representation of a text. Word identification and sentence parsing are part of basic comprehension processes that occur at the low level of text comprehension. Most reading difficulties arise at the higher level of discourse comprehension, which involves evaluating and identifying relations among conceptual information, resolving references to establish entities in a text, and making various types of inferences to fill in missing information.
- Working memory has great impact on various language comprehension activities, because it provides temporary storage and simultaneous manipulation of information and coordinates resources that are necessary for comprehension processes during reading.
- Working memory capacity constantly places constraints on readers’ attempt to understand a text. Individual differences in working memory capacity account for some of the variation in comprehension performance.
- Text comprehensibility can be well predicted by an analysis of the demands it makes of readers’ working memory.
The thesis is especially targeted at helping people with MID (mild intellectual disabilities).
Difficulty in such a situation indicates that the person has trouble reading, comprehending, analyzing and joining the dots to make inferences while reading each line. The limitation in their cognitive functioning is due to varying degrees of impairment, which affect their reading comprehension directly. The ability to actively and strategically apply one’s semantic knowledge to facilitate comprehension activities is considered crucial in understanding differences in individual comprehension performance. In many empirical studies, individuals with ID were observed to show deficits in various aspects of semantic processing.
It is difficult to find reading materials for individuals with MID that are
(1) of interest to them and
(2) at the right reading level.
Reading materials at lower reading levels are typically written for children, and texts written for adults without disabilities often require a high level of linguistic skills and sufficient real world knowledge, which these individuals often lack. The lack of appropriate reading materials may also discourage adults with ID from practicing reading, thus diminishing their already low literacy skills.
Transformation rules are applied that change complex constructs into shorter or plainer sentences, which are thought to be easier for people with MID to understand. However, synonym replacement and syntax-tree simplification alone are not enough: in addition to challenges from lexical and syntactic factors, these readers have other difficulties processing written information. Moreover, text simplification increases the length of the simplified document, because long and complex sentences are often split into multiple shorter ones. The increased length of the whole document can pose another challenge to the already limited working memory capacity of readers with MID, because it requires processing and storing more information. Therefore a system needs to be designed in which the most relevant information is retained and less relevant information is simplified or left out entirely.
There are two major research questions at the center of the design and implementation of such a text simplification system:
(1) How do we identify which portions of a text will pose difficulty for our users?
(2) When there are several possible simplification choices, how do we decide which is the optimal one to choose for our users?
Ideally, a reliable automatic readability assessment tool would help solve both questions and aid automatic text simplification in many ways. It can be used to rank documents by reading difficulty for automated systems such as text simplification, text summarization, machine translation and other text generation systems. For example, as a preprocessing step, such a tool can be used to select documents that are at an appropriate level of reading difficulty among those on a similar topic for the target system to begin with. More importantly, such a tool can provide an efficient evaluation measure of a system’s performance.
One of many important aspects to look at when evaluating the quality of text generated by automated systems is coherence. Coherent texts are easier to read. One way to check the coherence of resultant texts is to compare their reading difficulty before and after the change. Feng’s automatic readability assessment tool is well suited for this task.
To make it easier for people to judge the reading difficulty of a text, grade levels, or the number of years of education required to completely understand a text, are commonly used as an index of reading difficulty.
Many traditional readability metrics use simple linear functions of two or three shallow language features to model the readability of a given text. For example, the widely used Flesch Reading Ease and Flesch-Kincaid grade level formulas use average sentence length and average syllables per word to calculate the grade level of a text. Similarly, the Gunning FOG and SMOG indices use average sentence length and the percentage of words with at least three syllables as parameters. The Automated Readability Index instead counts the number of characters per word to determine word difficulty. Departing from the syllabic approach, the Dale-Chall formula made an advance in measuring lexical difficulty by introducing a list of common words familiar to 4th-grade students. It uses the percentage of difficult words (words that do not appear in the list) and average sentence length to predict the grade level of a text.
Flesch Reading Ease: based on a 0-100 scale. A higher score indicates that a text is easier to read.
206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
Flesch Kincaid Grade Level:
0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
Gunning FOG score:
0.4 * ((words/sentences) + 100 * (complex words/words))
SMOG Index:
1.0430 * sqrt(30 * complex words/sentences) + 3.1291
ARI :
4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43
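These formulas are straightforward to implement. Below is a rough Python sketch of all five; note that the syllable counter is a crude vowel-run heuristic of my own, not the dictionary-based counting real implementations use:

```python
import math
import re

def count_syllables(word):
    # Crude heuristic: count maximal runs of vowels (including 'y').
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def shallow_stats(text):
    # Naive tokenization into words and sentences.
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    characters = sum(len(w) for w in words)
    return len(words), len(sentences), syllables, complex_words, characters

def flesch_reading_ease(text):
    w, s, syl, _, _ = shallow_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)

def flesch_kincaid_grade(text):
    w, s, syl, _, _ = shallow_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59

def gunning_fog(text):
    w, s, _, cx, _ = shallow_stats(text)
    return 0.4 * ((w / s) + 100 * (cx / w))

def smog_index(text):
    w, s, _, cx, _ = shallow_stats(text)
    return 1.0430 * math.sqrt(30 * cx / s) + 3.1291

def automated_readability_index(text):
    w, s, _, _, c = shallow_stats(text)
    return 4.71 * (c / w) + 0.5 * (w / s) - 21.43
```

The crude tokenization and syllable counting here are exactly the kind of shallow approximation these metrics rest on, which already hints at why they can be unreliable.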
These traditional metrics are widely used, especially in educational settings, partly because they are simple and easy to calculate. However, their limitations are obvious. They overweight the impact of word frequency and sentence length on text comprehensibility and systematically ignore many other factors crucial to reading, such as syntactic constituents, the structure of the text, local and global discourse coherence across the text, the reader’s familiarity with the discourse topic, and the reader’s prior knowledge and motivation to read.
Moreover, the number of syllables per word, which acts as a proxy for word frequency, and sentence length do not always capture the reading complexity of a text accurately. Hence traditional metrics have proven to be unreliable.
Recent work in this direction includes:
- Detailed analysis of syntactic complexity based on parse trees has been combined with language models and traditional measures in readability research (Heilman et al., 2007; Pitler and Nenkova, 2008; Schwarm and Ostendorf, 2005).
- Besides three traditional measures (average sentence length, average number of syllables per word and Flesch-Kincaid score), Schwarm and Ostendorf (2005) used Charniak’s parser (Charniak, 2000) and higher-order n-gram (n = 3) models over a combination of word and part-of-speech (POS) sequences to capture syntactic and semantic features. The four parse features include average parse tree height, average number of noun phrases, average number of verb phrases, and average number of “SBAR”s (relative clauses).
- Pitler and Nenkova (2008) for the first time looked at readability factors at all three linguistic levels: lexical, syntactic and discourse. They analyzed six classes of features: traditional readability factors such as average number of characters per word, average sentence length, maximum number of words per sentence and document length; vocabulary-based unigram features; the four parse features of Schwarm and Ostendorf mentioned above; elements of text cohesion; and discourse relations.
The thesis approaches readability from a text comprehension point of view, with special attention to discourse processes that are crucial for constructing and maintaining local and global memory coherence of a text, which is key to successful text comprehension. These discourse processes reflect the reader’s comprehension task and can be useful in predicting the complexity of a text. Advanced NLP techniques have been applied to implement three classes of novel discourse features that have not been studied by any of the previous research.
The study does not rely on a single measure of readability; it combines various proxies, such as paired original/simplified corpora, grade levels, subjective ratings by experts and users, and objective observations in user studies, to get at the underlying text properties associated with reading difficulty.
The methods employed in the paper consist of four major parts:
- Data collection
- Feature extraction and implementation
- Building and evaluating the tool on labeled corpora
- Testing and evaluating the tool on unlabeled texts from different domains.
The main corpus for the study consists of texts with reading difficulty annotated by elementary grade levels ranging from Grade 2 to 5. This corpus is used to build and evaluate the automatic text readability assessment tool.
Two ways are given to assess how well the readability assessment tool generalizes to texts from different domains:
- First, two corpora are manually created consisting of original and simplified texts adapted specifically for adults with mild intellectual disabilities. The automated readability assessment tool assigns grade levels to predict the reading difficulty of the original and simplified texts contained in these two corpora.
- Second, the correlations between grade-level predictions by the tool, expert ratings, and inferred text difficulty for adult participants with mild intellectual disabilities are compared.
Hence the general methodology relies on the following five proxies:
- Grade levels: Grade levels indicate the number of years of education generally required to understand the text. It is generally understood that reading difficulty increases with grade level.
- Paired original/simplified texts: A common assumption is that simplified texts should be easier to read.
- Subjective ratings by experts: Experts who have linguistic expertise or specialize in working with adults with ID were asked to rate text difficulty.
- Objective observations in user studies: Target users are given texts at a variety of difficulty levels and their reading times are recorded. Subjects answer simple comprehension questions afterwards, and the accuracy of their answers is analyzed. This gives the most direct clues about the difficulties faced by the target user group, even though per-subject and other effects need to be accounted for.
- Subjective (introspective) ratings by users: This will probably be especially problematic in the study, as the users’ subjective judgment may not be fully reliable because of their cognitive impairments.
Research Hypothesis
The thesis proposes to design and implement four classes of novel discourse features that will best reflect working memory burden posed on the reader’s attempt to understand a text: density of entities, lexical chains, coreferential inference features and local entity coherence features.
- Density of Entities: Conceptual information is often introduced in a text by entities, which consist of general nouns and named entities, such as people’s names, locations, organizations, etc. The more entities are introduced into a text, the more demands they make on the reader’s working memory capacity; for individuals with ID, who suffer from impoverished working memory, the increasing demands of entity processing can be especially overwhelming.
- Lexical Chains: Using existing NLP technology, various semantic relations among entities – such as synonymy, hypernymy, hyponymy, coordinate terms (siblings), etc. – can be automatically annotated. Based on these annotations, entities connected by certain semantic relations can be chained up through the text to form a lexical chain.
- Coreferential Inferences: Readers are required to actively apply acquired prior background knowledge to disambiguate and make appropriate inferences. The inference processes involve searching and retrieving relevant information from various long- and short-term memory systems.
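To make the density-of-entities idea concrete, here is a minimal Python sketch that approximates entities using noun POS tags; the thesis additionally folds in the output of a named-entity recognizer, and the sample document and feature names below are my own illustrations:

```python
def entity_density_features(tagged_sentences):
    """Entity-density features from POS-tagged sentences.

    tagged_sentences: one list of (word, tag) pairs per sentence, with
    Penn Treebank tags. Entities are approximated here as the union of
    common and proper nouns (NN, NNS, NNP, NNPS).
    """
    noun_tags = {"NN", "NNS", "NNP", "NNPS"}
    total_words = sum(len(sent) for sent in tagged_sentences)
    entities = [w for sent in tagged_sentences for w, t in sent if t in noun_tags]
    unique = {w.lower() for w in entities}
    return {
        "entities_per_doc": len(entities),
        "unique_entities_per_doc": len(unique),
        "entities_per_sentence": len(entities) / len(tagged_sentences),
        "entity_word_ratio": len(entities) / total_words,
    }

# Hypothetical pre-tagged mini-document for illustration.
sample = [
    [("John", "NNP"), ("saw", "VBD"), ("a", "DT"), ("dog", "NN")],
    [("The", "DT"), ("dog", "NN"), ("barked", "VBD")],
]
feats = entity_density_features(sample)
```

Each feature value grows with the number of distinct concepts a reader must hold in working memory, which is exactly the burden these features are meant to proxy.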
Corpora
Six corpora were collected for the readability study:
- Labeled Corpus: from WeeklyReader, LocalNews2007, LocalNews2008 and NewYorkTimes100
- Unlabeled Paired Corpora: from Britannica and LiteracyNet
Feature Extraction
Various features were used for the automatic text readability assessment tool, along with techniques deployed to extract and implement them.
The following 5 feature subsets were proposed, many of which result from refinement and improvement of previously studied features.
- Discourse Features
- Language-Modeling-based Perplexity Features
- Parsed Syntactic Features
- Part-Of-Speech-based (POS) Features
- Shallow Features
Discourse Features
Four subsets of discourse features were given: entity-density features, lexical-chain features, coreference inference features and entity grid features.
The first three subsets of features are novel and have not been studied by other researchers before.
- Entity-Density Features: The entities are defined as a union of named entities and the rest of general nouns (nouns and proper nouns) contained in a text.
- Lexical Chain Features: LexChainer produces chains of words connected by six semantic relations: synonymy, hypernymy, hyponymy, meronymy, holonymy and coordinate terms (siblings) (Galley and McKeown, 2003). The hypothesis is that important conceptual and topical information recurring throughout a text is likely to be captured by these lexical chains. To construct a coherent semantic representation of a text, a reader must keep semantically related discourse units in his/her working memory throughout the whole reading comprehension process.
- Coreferential Inference Features: Relations among concepts and propositions are often not stated explicitly in a text. The constructive nature of building a coherent semantic representation of a text requires a reader to actively retrieve and assess previously processed information to generate appropriate inferences when conceptual information is not stated explicitly.
- Entity Grid Features: Features extracted from entity grid models are studied for their effectiveness in automatic readability assessment.
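As a toy illustration of how lexical chains can be built, the sketch below greedily chains nouns using a pluggable relatedness function. A real system like LexChainer consults WordNet relations; the pair table here is hypothetical data of my own:

```python
def build_lexical_chains(nouns, related):
    """Greedily group nouns into lexical chains.

    nouns:   nouns in order of appearance in the text
    related: function (a, b) -> True if the two nouns are semantically
             related (synonym, hypernym, sibling, ...).
    """
    chains = []
    for noun in nouns:
        for chain in chains:
            # Attach to the first chain containing a related member.
            if any(related(noun, member) for member in chain):
                chain.append(noun)
                break
        else:
            chains.append([noun])  # start a new chain
    return chains

# Toy relation table standing in for WordNet lookups (hypothetical).
PAIRS = {frozenset(p) for p in [("dog", "animal"), ("cat", "animal"), ("dog", "cat")]}

def toy_related(a, b):
    return a == b or frozenset((a, b)) in PAIRS

chains = build_lexical_chains(["dog", "tree", "cat", "animal", "dog"], toy_related)
```

Features such as the number of chains, average chain length, or the span of the longest chain can then be read directly off the result.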
Parsed Syntactic Features
Recent approaches to readability have utilized natural language processing techniques such as probabilistic parsers to analyze syntactic features of texts and reported their positive contributions. Schwarm and Ostendorf studied four parse tree features (average parse tree height and average number of SBARs, noun phrases and verb phrases per sentence). The thesis implemented these and additional features, using the Charniak parser (Charniak, 2000). The parsed syntactic features focus on clauses (SBAR), noun phrases (NP), verb phrases (VP) and prepositional phrases (PP). For each phrase type, four features are implemented: total number of phrases per document, average number of phrases per sentence, and average phrase length measured in words and in characters respectively.
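A minimal sketch of counting these phrase types in Python, working directly on bracketed Penn-Treebank-style parser output (the example parse is hand-written for illustration, not actual Charniak output):

```python
import re

def phrase_counts(parse_trees):
    """Count SBAR, NP, VP and PP nodes in bracketed parse strings.

    parse_trees: one bracketed parse per sentence.
    Returns, per label, (total per document, average per sentence).
    """
    labels = ("SBAR", "NP", "VP", "PP")
    totals = {lab: 0 for lab in labels}
    for tree in parse_trees:
        for lab in labels:
            # An opening bracket followed by exactly this label.
            totals[lab] += len(re.findall(r"\(" + lab + r"\b", tree))
    n = len(parse_trees)
    return {lab: (totals[lab], totals[lab] / n) for lab in labels}

# A single hand-written parse for illustration.
tree = ("(S (NP (NNP John)) (VP (VBD saw) (NP (DT the) (NN dog)) "
        "(PP (IN in) (NP (DT the) (NN park)))))")
counts = phrase_counts([tree])
```

The word-boundary anchor in the regex keeps `(NP` from also matching part-of-speech tags like `(NNP`.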
POS Features
The paper focuses on five classes of words (nouns, verbs, adjectives, adverbs, and prepositions) and two broad categories (content words, function words).
Nouns include general nouns and proper nouns. Verbs include past tense forms, present participles, past participles and modals, in addition to infinitives, present third-person singular forms and all forms of auxiliary verbs. Content words include nouns, verbs, numerals, adjectives and adverbs; the remaining types are function words.
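A rough Python sketch of the content/function split over Penn Treebank tags; the tag-prefix groupings are my approximation of the categories above, and the tagged sentence is a hypothetical example:

```python
def pos_ratios(tagged_words):
    """Content-word vs. function-word ratios from Penn Treebank POS tags.

    Content words: nouns (NN*), verbs (VB*) plus modals (MD),
    adjectives (JJ*), adverbs (RB*) and numerals (CD); everything
    else is counted as a function word.
    """
    content_prefixes = ("NN", "VB", "MD", "JJ", "RB", "CD")
    content = sum(1 for _, tag in tagged_words if tag.startswith(content_prefixes))
    total = len(tagged_words)
    return {
        "content_ratio": content / total,
        "function_ratio": (total - content) / total,
    }

# Hypothetical tagged sentence for illustration.
tagged = [("John", "NNP"), ("can", "MD"), ("quickly", "RB"),
          ("eat", "VB"), ("the", "DT"), ("apple", "NN")]
ratios = pos_ratios(tagged)
```

Prefix matching covers the inflected tag variants (e.g. `VBD`, `VBG`, `JJR`) without enumerating them all.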
Shallow Features
Shallow features refer to those used by traditional readability metrics, such as Flesch-Kincaid Grade Level (Flesch, 1979), SMOG (McLaughlin, 1969), Gunning FOG (Gunning, 1952), etc. Although recent readability studies have strived to take advantage of NLP techniques, little has been revealed about the predictive power of shallow features.
Some of the shallow features are:
1) average number of syllables per word
2) percentage of polysyllabic words per document
3) average number of polysyllabic words per sentence
4) average number of characters per word
5) Dale-Chall difficult-word rate per document
6) average number of words per sentence
7) average number of characters per sentence
8) Flesch-Kincaid score
9) total number of words per document
Automatic Readability Assessment
The effectiveness of features in terms of their impact on predicting reading difficulty indexed by grade levels is studied.
To summarize, within the four subsets of discourse features, the following key observations were made:
- Among all four subsets of features, entity-density features exhibit the most significant discriminative power in modeling text reading difficulty.
- Combining all discourse features together leads to overall improvement. However, the best performance is achieved by combining entity density features and entity grid features together.
- Analysis at grade level reveals that entity-density features generate the highest accuracy for Grade 2 (57.41%) and 4 (50.09%); combining all features produces the best performance for Grade 3 (57.09%); and entity grid features generate the highest accuracy for Grade 5 (80.96%).
This is as far as I could read and comprehend in the thesis. The actual implementation details in the paper require some text simplification for me ( :P ) I’ll definitely have to bury my head deeper into the thesis!

So long till then!