The Middlebury Campus

Science Spotlight: Natural Language Processing

By Will Henriques

“I disdain green eggs and green ham.” Easy enough to simplify this sentence down to the classic Dr. Seuss: “I do not like green eggs and ham.” The leap from complexity to simplicity is easy for the human brain. But it’s an entirely different story for a computer, which has no “first language.”

To grasp the scope of the computer’s problem, imagine this scenario: you only speak English, you’re handed a complex sentence in Chinese and you’re told to simplify it to the level of the average Chinese kindergarten student. Not so easy.

This is the problem that Assistant Professor of Computer Science David Kauchak has been working on for the past two years. He works in the field of natural language processing on a problem called “text simplification.” Patrick Adelstein ’14, who worked with Kauchak on his research this past summer, explained that natural language processing is basically “equipping computers to understand language spoken by humans.”

“It’s computer science with linguistics really,” said Kauchak. “The basic premise is this: you give me a document and I’ll try to create a program that automatically simplifies it, in the sense of reducing the grammatical complexity and reducing the complexity of the vocabulary.”

Written material is generally composed at the reading level of the author and not the reader. Communication breaks down when a document is simply too complex for the reader to understand. This can create real challenges in the realms of politics, medicine and education.

“There’s a lot of applications for this,” Kauchak said. “One of the most interesting is in the medical domain. There’s a lot of information out there on diseases, treatments, diagnoses. But there have been a number of studies that show that a lot of people don’t have the reading ability to comprehend most of the literature that’s available. There’s a study that said 89 million people in the US — that’s a little over a quarter of the population — don’t have sufficient reading skills to be able to read the documentation that is given to them.”

One solution to this problem would be to simply have the authors of the literature write the material at a lower reading level. But that would require a massive systemic overhaul of medical writing standards.

“That’s a non-trivial overhaul, and that’s just not going to happen in the near future,” said Kauchak. “The goal of this project is to be able to do that sort of simplification automatically. For example, one could take a medical pamphlet about cancer or diabetes, and be able to produce similar content but in a way that’s easier to understand for somebody who doesn’t have that higher reading level.”

According to the 2010 United States Census, a little under 40 million residents of the United States (over 12 percent of the total population) are foreign-born, and of those, 30.1 percent speak English “Not well” or “Not at all.” Text simplification technology has incredible potential to facilitate communication for U.S. residents who speak English as a second language by creating simplified documents. Automatically generating content (official paperwork, online news articles, information pamphlets) for these audiences would be undeniably beneficial.

But going back to the earlier scenario: how does one train a computer to simplify a document in a language of which it has no fundamental understanding? It all comes down to statistics, probability and creative programming.

Kauchak used a language translation analogy — easier to understand than the text simplification explanation — to explain the process. If there is a list of sentence pairs — one sentence in Chinese and its English equivalent — you would tell the computer to use the list to establish a probability that one word will translate to another.

“I see this word in Chinese in, let’s say, 100 sentences, and in 70 of the English sentences I see exactly this other word,” said Kauchak. “And in the other 30 English cases, there’s a different word. Based on these ratios, I can establish the probabilities of how a word will be translated. Then you try and ‘teach’ your model these different types of probabilities. So there’s two steps: the training, where you try and learn these probabilities from your data that’s aligned, and the translation step, where you take a new sentence, and based on what you’ve seen before, ask: what are all the possible ways I could put the words and phrases in this sentence together, and which of those is the most likely? It’s all about probability, establishing numerical relations. You can’t do this manually. Because from a decent-sized data set you’re going to end up with a few hundred thousand, maybe a million words. And so for each of those words, you’re going to end up with maybe 10 or 20 (it depends on the word) possible translations. It’s not something you’re doing manually. It’s a lot of data.”
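
For readers who want to see the counting idea in miniature, here is a rough sketch in Python. It is not Kauchak's program. It only illustrates the two steps he describes, training and then picking the most likely translation, using simple co-occurrence counts. The function names, the source word "mao" and the one-word "sentences" are all made up for illustration; real systems rely on more elaborate alignment models and far more data.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Estimate how often each source word is paired with each target word.

    For every source word, count the sentence pairs it appears in and, for
    every target word, how many of those pairs contain it on the other side.
    The ratio of the two counts is a rough translation probability.
    """
    source_counts = Counter()
    cooccurrence = defaultdict(Counter)
    for source_sentence, target_sentence in pairs:
        source_words = set(source_sentence.split())
        target_words = set(target_sentence.split())
        for s in source_words:
            source_counts[s] += 1
            for t in target_words:
                cooccurrence[s][t] += 1
    return {
        s: {t: n / source_counts[s] for t, n in targets.items()}
        for s, targets in cooccurrence.items()
    }


def most_likely(table, source_word):
    """Translation step: pick the target word with the highest probability."""
    candidates = table[source_word]
    return max(candidates, key=candidates.get)


# Hypothetical toy data mirroring the 70/30 split in the quote above.
pairs = [("mao", "cat")] * 70 + [("mao", "kitten")] * 30

table = train(pairs)
best = most_likely(table, "mao")
print(table["mao"], best)
```

With this toy data, "cat" gets a probability of 70 out of 100 and "kitten" gets 30 out of 100, so the translation step picks "cat."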

That’s how to train a computer to translate from one language to another. Training it to simplify a complex sentence is a little trickier, and it involves analyzing the grammatical structure of the sentence, and determining what is superfluous and what isn’t.

Kauchak was attracted to natural language processing because of its intuitive nature.

“It’s easy to get excited about,” he said. “People do it on a day-to-day basis. It’s very tangible, very human. And being able to look at a problem and understand the input and the output. It’s very satisfying.”

