Saturday, February 19, 2011

INTRODUCTION TO TEXT READABILITY:
Reading as a means of education has helped individuals learn more about the outside world. Materials that are easy to read and contain clear ideas increase readers' enthusiasm for reading. Reading is considered a medium of language acquisition and communication, and it leads to the sharing of ideas and information. It depends on three main factors: the reader, the text, and the situation. Our job is to focus on the readability of the text.

There has been a lot of research in the field of text readability since the 1920s. This research has led to many popular readability formulas and effective readability tools with useful applications in English, Spanish, and French.

Text readability is defined as “the ease of understanding or comprehension due to the style of writing”. Readability is concerned with matching the reader and the text; it helps us measure how appropriate a text is for particular readers.


Every author should convey his/her message to the intended readers and keep them motivated by avoiding long sentences and unnecessarily complex words, because poor readers will soon be discouraged and overwhelmed by a large number of new words and complex structures.


Text readability measurement has many potential benefits in the following fields: education, medicine, web applications, and information retrieval systems.



READABILITY FACTORS:
Readability factors are those that affect how well a text can be read and understood. These factors can be divided into two types: reader factors and text factors.
a) Reader Factors
These factors relate to the reader's age and reading ability. A lot of research has found that, in addition to vocabulary and sentence structure, the reader's prior knowledge and experience, interest, and motivation affect a text's readability in one way or another. The reader's own inclinations also encourage him/her to read and comprehend the text.
b) Text Factors
There are many factors related to the text itself that affect text readability. Among these factors are the following:
• Certain aspects of words have a huge impact on text readability, such as word length, word frequency, vocabulary load, and the use of unusual or abstract words. Short and well-known words are easy to comprehend, and most readers recognize frequent words faster than infrequent ones.
• Average sentence length is an important feature that affects the readability of a text.
• The clarity of an idea mentioned in the text affects its readability, as does the number of parenthetical clauses.
• Topology, metaphor, and simile usually affect readability.
• A lot of research has found that the complexity of grammatical structure affects text readability. The relevant aspects include omitting one of the main sentence parts, putting distance between the main sentence parts (such as between the subject and the verb), separating pronouns from the words they refer to, and using the passive voice more than the active voice.


Most research has focused on combinations of these factors to estimate text readability. For example, Larsson proposed a so-called "nominal quotient" (NQ) feature to model the different readability levels. It is calculated per document by counting the nouns, prepositions, and participles, then dividing their total by the total number of pronouns, adverbs, and verbs:
NQ = (nouns + prepositions + participles) / (pronouns + adverbs + verbs)

The equation estimates the volume of information in a given text. The amount of information in a text affects its readability, i.e., a less informative text (lower NQ) is more readable than a highly informative one (higher NQ).
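As a rough illustration of the NQ formula above, here is a minimal Python sketch. It assumes the part-of-speech counts have already been produced by some tagger; the function name `nominal_quotient`, the dictionary keys, and the sample counts are all made up for this example.

```python
def nominal_quotient(pos_counts):
    """Return NQ = (nouns + prepositions + participles) / (pronouns + adverbs + verbs).

    pos_counts is a hypothetical dict of per-document part-of-speech counts;
    the key names are an assumption of this sketch, not a standard tag set.
    """
    numerator = (pos_counts.get("noun", 0)
                 + pos_counts.get("preposition", 0)
                 + pos_counts.get("participle", 0))
    denominator = (pos_counts.get("pronoun", 0)
                   + pos_counts.get("adverb", 0)
                   + pos_counts.get("verb", 0))
    if denominator == 0:
        raise ValueError("NQ is undefined without pronouns, adverbs, or verbs")
    return numerator / denominator


# Made-up counts for one document: (30 + 15 + 5) / (10 + 8 + 22) = 50 / 40
counts = {"noun": 30, "preposition": 15, "participle": 5,
          "pronoun": 10, "adverb": 8, "verb": 22}
nq = nominal_quotient(counts)
print(nq)  # 1.25
```

A higher value means a more noun-heavy, information-dense text, which by the argument above would be expected to be harder to read.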

PREVIOUS WORK:
Many traditional readability metrics are linear models with a few (often two or three) predictor variables based on superficial properties of words, sentences, and documents. These shallow features include the average number of syllables per word, the number of words per sentence, or binned word frequency.
--->> The Flesch-Kincaid Grade Level formula uses the average number of words per sentence and the average number of syllables per word to predict the grade level (Flesch, 1979).


--->> The Gunning FOG index (Gunning, 1952) uses average sentence length and the percentage of words with at least three syllables.

--->> The Automated Readability Index (Senter and Smith, 1967) uses the number of characters per word, instead of syllables, to determine word difficulty.

--->> The Dale-Chall formula uses the percentage of difficult words (words that do not appear in the Dale-Chall list of familiar words) and average sentence length to predict the grade level of a text.

--->> Stenner et al. (1983) analyzed more than 50 lexical variables and ran extensive correlation tests, finding that word frequency and sentence length had the most predictive power in ranking the reading difficulty of the texts in their experimental data.
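To make the "linear model with two predictors" idea concrete, here is a small Python sketch of three of the formulas listed above. The coefficients are the commonly published ones and are not given in this post, so treat them as an assumption of the sketch. The functions take raw counts as input, because counting syllables and splitting sentences automatically is a separate (and error-prone) step; the sample counts are invented.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Grade level from average sentence length and syllables per word."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """complex_words = number of words with at least three syllables."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


def automated_readability_index(characters, words, sentences):
    """Uses characters per word instead of syllables per word."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Invented counts for a 100-word passage made of 5 sentences.
fk = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
fog = gunning_fog(words=100, sentences=5, complex_words=10)
ari = automated_readability_index(characters=450, words=100, sentences=5)
print(round(fk, 2), round(fog, 2), round(ari, 2))
```

All three are weighted sums of one word-level measure and one sentence-level measure, which is why they are so easy to compute.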

DISADVANTAGES OF THE ABOVE METRICS:
These traditional metrics are easy to compute and use, but they are not reliable, as demonstrated by several recent studies in the field (Si and Callan, 2001; Petersen and Ostendorf, 2006; Feng et al., 2009).

RECENT WORK:
With the advancement of natural language processing (NLP) tools, a wide range of more complex text properties has been explored at various linguistic levels.

--->> Si and Callan (2001) used unigram language models to capture content information from scientific web pages.

--->> Collins-Thompson and Callan (2004) adopted a similar approach and used a smoothed unigram model to predict the grade levels of short passages and web documents.

--->> Heilman et al. (2007) continued using language modeling to predict readability for first and second language texts. Furthermore, they experimented with various statistical models to test their effectiveness at predicting reading difficulty (Heilman et al., 2008).

--->> Schwarm and Ostendorf (2005) and Petersen and Ostendorf (2006) used support vector machines to combine features from traditional reading level measures, statistical language models and automatic parsers to assess reading levels.
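The unigram language model idea mentioned above can be sketched in a few lines of Python: train one word-frequency model per reading level, then assign a new text to the level whose model gives it the highest probability. This is only a toy sketch. The add-one (Laplace) smoothing is chosen here for simplicity and is not claimed to be the smoothing used in the cited papers, and the two "levels" and their training sentences are invented.

```python
import math
from collections import Counter


def train_unigram(texts):
    """Count word occurrences over all training texts of one level."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return counts, sum(counts.values())


def log_likelihood(text, model, vocab_size):
    """Log-probability of the text under an add-one smoothed unigram model."""
    counts, total = model
    return sum(math.log((counts[word] + 1) / (total + vocab_size))
               for word in text.lower().split())


def predict_level(text, models, vocab_size):
    """Pick the level whose model assigns the text the highest likelihood."""
    return max(models, key=lambda level: log_likelihood(text, models[level], vocab_size))


# Tiny invented training set with two hypothetical levels.
training = {
    "easy": ["the cat sat on the mat", "the dog ran to the cat"],
    "hard": ["the committee evaluated the hypothesis",
             "statistical models estimate the probability"],
}
models = {level: train_unigram(texts) for level, texts in training.items()}
vocab = {w for texts in training.values() for t in texts for w in t.lower().split()}
vocab_size = len(vocab) + 1  # one extra slot for unseen words

level = predict_level("the dog sat on the mat", models, vocab_size)
print(level)  # easy
```

With real data there would be one model per grade level, trained on many documents labeled with that grade.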

In addition to lexical and syntactic features, several researchers started to explore DISCOURSE LEVEL features and examine their usefulness in predicting text readability.

Discourse features fall into four subsets: entity-density features, lexical-chain features, coreference inference features and entity-grid features.

--->> The coreference inference features are novel and have not been studied before.
--->> Entity-density features and lexical-chain features have been studied for readers with intellectual disabilities (Feng et al., 2009).
--->> Entity-grid features have been studied by Barzilay and Lapata (2008) in a stylistic classification task.

Pitler and Nenkova (2008) used the Penn Discourse Treebank (Prasad et al., 2008) to examine discourse relations.

Entity density features include:
  • percentage of named entities per document
  • percentage of named entities per sentence
  • percentage of overlapping nouns removed
  • average number of remaining nouns per sentence
  • percentage of named entities in total entities
  • percentage of remaining nouns in total entities

Lexical Chain Features include:
  • total number of lexical chains per document
  • avg. lexical chain length
  • avg. lexical chain span
  • num. of lex. chains with span ≥ half doc. length
  • num. of active chains per word
  • num. of active chains per entity

Coreference Inference Features:
  • total number of coreference chains per document
  • avg. num. of coreferences per chain
  • avg. chain span
  • num. of coref. chains with span ≥ half doc. length
  • avg. inference distance per chain
  • num. of active coreference chains per word
  • num. of active coreference chains per entity
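
Several of the chain-based features in the two lists above (number of chains, average length, average span, and chains with span ≥ half the document length) can be computed the same way for lexical chains and coreference chains once the chains have been extracted. The Python sketch below assumes each chain is given as a list of token positions and that "span" means the distance in tokens between a chain's first and last mention; both are assumptions of this sketch, and the chain extraction itself is not shown. The sample chains are invented.

```python
def chain_features(chains, doc_length):
    """Summary statistics for a list of chains.

    chains: list of chains, each a non-empty list of token positions
            (hypothetical input format).
    doc_length: document length in tokens.
    """
    if not chains:
        return {"num_chains": 0, "avg_chain_length": 0.0,
                "avg_chain_span": 0.0, "num_long_span_chains": 0}
    lengths = [len(chain) for chain in chains]             # mentions per chain
    spans = [max(chain) - min(chain) for chain in chains]  # first-to-last distance
    return {
        "num_chains": len(chains),
        "avg_chain_length": sum(lengths) / len(chains),
        "avg_chain_span": sum(spans) / len(chains),
        "num_long_span_chains": sum(1 for span in spans if span >= doc_length / 2),
    }


# Three invented chains in a 100-token document.
chains = [[0, 12, 40, 95], [3, 7], [20, 70, 71]]
features = chain_features(chains, doc_length=100)
print(features)
```

The "active chains" and "inference distance" features are left out because their exact definitions are not given here.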


In our project "AUTOMATIC READABILITY ASSESSMENT FOR TEXT SIMPLIFICATION" we are mainly focused on the discourse features of a text and how they can be used to assess or grade the readability of the text.

The rest of the discourse features will be addressed in my next blog post :) So long till then :)
