Deep Learning Language Models and Exact Procedures

As previously discussed in this post, deep learning models designed to predict and create natural language have become quite powerful, and are now able to do things like write news articles or compose poetry at a level similar to (or even better than) the average human. This is a significant achievement which will likely have ramifications across numerous fields, especially as these models continue to increase in size (performance has scaled with size so far, though this trend may have a limit). However, certain types of problems within the domain of language have yielded to these methods far more slowly, and may suggest a limitation to how far deep learning language models can take us, at least as currently structured. In this post we’ll take a closer look at these types of problems and why they don’t fit cleanly into the deep learning architecture.

Interestingly, one area where deep learning language models struggle is one in which computers usually reign supreme – the domain of arithmetic. Though a simple calculator can beat a human at essentially any arithmetic problem involving more than a few digits, GPT-3 (a prominent deep learning language model), with its 175 billion parameters, runs into difficulties performing even basic mathematical operations. While at first glance this may be surprising, on closer examination it makes sense; GPT-3 is trained to predict the next word in a sequence of words, a task far different from performing mathematical operations. However, as mentioned in the article linked above, even models designed more explicitly to handle mathematical problems (given in words) perform far below human level. In a way, the difficulties of these models in the domain of mathematics make their approach seem more human. We also struggle with math, with our intricate circuitry of ~100 billion neurons unable to perform even basic mathematical operations on large numbers without pen and paper. A calculator can best both humans and deep learning models, but that is because it is explicitly structured to solve mathematical problems and cannot do much else. Both humans and deep learning language models have a far greater degree of flexibility, with mathematics representing only a small subset of their total range of capabilities.
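To make concrete what “explicitly structured to solve mathematical problems” means, here is a minimal sketch (in Python, my own illustration rather than how any actual calculator is implemented) of long addition as an exact procedure: digit by digit, with a carry, the way one would do it with pen and paper. Every step is deterministic, and nothing about it depends on having seen the particular numbers before.

```python
def add_digit_strings(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal strings,
    one digit at a time with an explicit carry -- an exact
    procedure that works the same for 3 digits or 300."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(add_digit_strings("175000000000", "999999999999"))  # 1174999999999
```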

With that being said, we’re still left with the question of why deep learning language models are so far from achieving human-level performance on these types of problems (especially when designed explicitly with these problems in mind). At a high level, the issue seems to lie with the exactness required by mathematics. When writing a news article or composing poetry, it’s enough for these systems to have only a statistical grasp of the relations between words: “am” commonly comes after “I”, “are” commonly comes after “you”, “to” commonly comes after “want”, and so on. Neural networks (especially large ones) are able to “understand” these correlations and leverage them to create text which reads naturally. Only a fuzzy “understanding” of the actual concepts is required, as enough information lies in the correlations to create passable output. Mathematics, on the other hand, cannot be treated in the same statistical manner – fuzzy concepts aren’t sufficient. It’s not enough for the system to “know” that “4” commonly comes after “2+2=”; while that may allow it to answer that particular question correctly, it won’t help when it is asked “3+3=”. Mathematics is exact and rigorous: a given expression generally has exactly one correct answer.
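A toy contrast may make the point clearer. The sketch below is purely illustrative (a lookup table is nothing like how GPT-3 actually represents text), but it captures the difference in kind between a completer that has only memorized which string tends to follow which, and a procedure that actually computes the answer.

```python
# A toy "correlational" completer: it can only echo continuations it has
# memorized. (Purely illustrative -- real language models generalize far
# beyond a lookup table, but they are still trained to predict text.)
memorized_continuations = {"2+2=": "4", "I ": "am", "you ": "are", "want ": "to"}

def complete_from_correlations(prompt: str) -> str:
    return memorized_continuations.get(prompt, "<no idea>")

# An exact procedure: parse the two operands and actually add them.
def complete_by_computing(prompt: str) -> str:
    left, right = prompt.rstrip("=").split("+")
    return str(int(left) + int(right))

for prompt in ["2+2=", "3+3="]:
    print(prompt, complete_from_correlations(prompt), complete_by_computing(prompt))
# 2+2= 4 4
# 3+3= <no idea> 6
```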

Humans, on the other hand, seem better able to build exact concepts around numbers and their relations. When we see “2+2=”, we know that it’s followed by “4”, but we also know why it’s followed by “4”; we have actual concepts of “2” and the operation “+” which exist below the level of language. As referenced here, these concepts allow us to genuinely understand the tendencies and regularities of the world, rather than simply recognizing the tendencies and regularities of text, as language models do. This concreteness of our concepts allows us to play out multi-step logical scenarios in our minds in a way that continues to elude language models. It seems this gap will persist as long as the input domain of these models is limited to strings of words (vs. all the rich sensory information we extract from the world), though the patterns of the raw world are also far more difficult to make sense of (especially without the benefit of billions of years of evolutionary sculpting).

By itself, weakness in the domain of mathematics is not too severe a deficiency for deep learning language models; after all, we can always rely on a calculator (or even adapt the model code to hand mathematical sub-problems off to a calculator). However, the limitations in this domain may carry over to others, as “correlational intelligence” appears likely to break down in any situation requiring a more exact understanding of the relevant concepts. Mathematics best illustrates this requirement (as it is inherently exact), but many everyday situations seem to require the same type of thinking. Below, I’ve laid out some sample questions which I think will remain difficult for deep learning language models due to the exactness of thought required to answer them (a short sketch after the list shows the kind of explicit bookkeeping a couple of them call for) – please feel free to share any other ideas in the comments!

  1. On average, an apple tree grows 3 apples. Steve has 4 apples, eats them, then takes 2 seeds from each and plants them in the ground. When the apple trees are finished growing, how many apples will Steve have?
  2. Sam has four blocks, each two inches high. He places three down on the table in a row, then puts the fourth on top of the middle block. How tall is Sam’s construction?
  3. Henry was facing straight ahead. He turned to the left, then to the left, then to the left, then to the left. In which direction is Henry now facing?
  4. Anna has a balance scale. She puts a mouse on the left side and a dog on the other. What happens to the scale?
  5. Anna has a balance scale. She puts an apple on the left side and an apple tree on the right side. What happens to the scale?
  6. In the next sentence, the word “right” will mean “left”. James looked to the right. In which direction did James look?
  7. In the sentence after this one, “angry” will mean “happy”. Jill was denied a pass to the show. Jill was very angry. What emotion is Jill feeling?
  8. The word in this sentence with a letter shaped like a circle is the answer. What is the answer?
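
As promised above, here is a minimal sketch (my own illustration, not a claim about how any existing model or benchmark handles these questions) of the kind of explicit bookkeeping questions 2 and 3 call for: the answer falls out of tracking exact quantities step by step rather than completing text from correlations.

```python
# Question 3: Henry turns left four times. Track his heading exactly.
directions = ["ahead", "left", "behind", "right"]  # successive 90-degree left turns
heading = 0                        # starts facing straight ahead
for _ in range(4):                 # four left turns
    heading = (heading + 1) % 4
print(directions[heading])         # "ahead" -- he is back where he started

# Question 2: Sam stacks his blocks. Track the height of each column exactly.
block_height = 2                   # inches
columns = [block_height] * 3       # three blocks placed in a row on the table
columns[1] += block_height         # the fourth block goes on top of the middle one
print(max(columns))                # 4 -- the construction is four inches tall
```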

All of these questions can be answered by a human in a few seconds, while current deep learning language models would likely run into trouble (and even if these specific questions could be answered correctly, the general techniques could easily be extended to create questions which would fool the model). As with the mathematical questions, the issue lies in the exactness required, both in the answer itself and in the process used to arrive at it. While these sample questions are fairly abstract, the requirement of exactness frequently arises in normal communication (especially in economically productive communication), suggesting that deep learning language models still have a long way to go before their impact extends beyond today’s relatively limited domains (e.g. chatbots, search enhancement). It will be interesting to see the degree to which improvement on these types of problems is possible via simply scaling up model size; my hypothesis is that architectural changes will be required to make significant gains.

Comment from Karan Arora (2 years ago):

To what extent do you think deep learning models are being built in a way that would allow them to merge together later on to improve efficacy?

A generally accepted tenet is that more (relevant) data is going to make a model better. Do you think there is an opportunity for AI-based models to actually learn from other AI-based models to be able to tackle more complex problems in an open source world? Or are they all being built too disparately and tackling too narrow of a scope?