Experimenting with GPT-3

About a year and a half ago, OpenAI rolled out GPT-3, a massive text-prediction transformer model that shattered many assumptions about the difficulty of understanding and creating written language. GPT-3’s predecessors (GPT and GPT-2) had shown that generating sensible responses to a variety of input texts was possible with enough data, but GPT-3 took that idea to the next level, training on nearly 500B tokens and generating results that, in many cases, could easily pass for human. Initially, only examples from OpenAI (or from certain developers to whom they had given access) were available, which made it difficult to fully understand the model’s capabilities and limitations. Now that OpenAI has generously opened up usage of GPT-3 to the public, I thought it would be interesting to run through some examples of this type myself, to get a better sense of where the model succeeds and where it falls short.

The topic of GPT-3’s abilities was covered in this post, which arrived at the conclusion that, while GPT-3 represents a significant improvement in text-generation capabilities, the type of intelligence it exhibits is not a major advancement toward general intelligence. The main argument was that GPT-3 isn’t forming a “model of the world” in the way humans (and other animals) do; rather, it is simply capturing the statistical correlations between words (which sit at a much “higher” level than the concepts that make up a world model). The hypothesis was that GPT-3 would perform poorly whenever the conversation required building a world model from the input text in order to speak accurately about the situation. One specific example given was the input prompt “Mary was out walking her dog when she dropped the leash”: the prediction was that GPT-3 might generate a continuation such as “the dog ran away” here, but would be unlikely to succeed on more complex problems of the same type.

Below, I’ve laid out some examples (with commentary) from https://beta.openai.com/playground, using the default settings and selecting the first output generated. Some formatting and example question–answer pairs have been added to the prompts to make them more interpretable to the model. The model input is in bold, the model output follows directly after it, and all commentary (my own, not part of GPT-3’s output) is in brackets.
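
For anyone who prefers scripting to the Playground UI, the same kind of experiment can be run through the OpenAI Python library. A minimal sketch is below; the engine name and the temperature/max_tokens values are my assumptions about what the Playground defaults correspond to, so adjust them as needed.

```python
# Minimal sketch of reproducing one of the Playground experiments via the
# (pre-v1) openai Python library. The engine name and sampling settings are
# assumptions about the Playground defaults, not confirmed values.
import openai

openai.api_key = "YOUR_API_KEY"  # replace with your own key

prompt = (
    "Q: Bill set his glasses on the counter. Bob then moved the glasses to a box. "
    "Where will Bill first look for his glasses when he returns?\n"
    "A: Bill will look on the counter.\n"
    "Q: Alice parked her car at her house. Amy then took the car, but got pulled over. "
    "The car was towed. Where will Alice expect the car to be?\n"
    "A:"
)

response = openai.Completion.create(
    engine="davinci",  # assumed: the base GPT-3 engine behind the Playground default
    prompt=prompt,
    temperature=0.7,   # assumed default
    max_tokens=64,     # assumed default
)

print(response.choices[0].text.strip())
```

Re-running the same prompt a few times (or varying the temperature) is a useful way to gauge how much of a given answer reflects something like understanding versus sampling luck.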

Mary was out walking her dog when she dropped the leash. It had been a stupid mistake, but she’d been distracted by her phone. She’d only let the dog off the leash for a minute, but it had been enough time for the dog to get into a fight with another animal. Mary had to dig in her heels to pull the dog off the other animal, but it had been too late. The other animal was dead. [GPT-3 performs relatively well here, capturing some of what could happen with a dropped leash, though some cracks show – e.g. being distracted by your phone wouldn’t necessarily lead to dropping a leash, and letting a dog off a leash is distinctly different from dropping a leash.]

Q: Bill set his glasses on the counter. Bob then moved the glasses to a box. Where will Bill first look for his glasses when he returns?
A: Bill will look on the counter. [Setting up the question and answer format.]
Q: Alice parked her car at her house. Amy then took the car, but got pulled over. The car was towed. Where will Alice expect the car to be?
A: Alice will expect the car to be at the house. [That’s correct!]

Q: Bill set his glasses on the counter. Bob then moved the glasses to a box. Where will Bill first look for his glasses when he returns?
A: Bill will look on the counter.
Q: Alice parked her car at her house. Amy then took the car, but got pulled over. The car was towed. Where will Amy expect the car to be?
A: Amy will expect the car to be at the house. [It seems the correct answer above was due to GPT-3 capturing the general format of the questions and answers, rather than an actual “understanding”, or world model, of what Alice and Amy would think.]

Q: James drove 20 yards to the left, then 30 yards to the right. How many yards to the right did he end up from his original spot?
A: 10 yards.
Q: Alice has a tree that is 30 feet tall. She climbed halfway up the tree. How far off the ground is Alice?
A: 15 feet. [Impressive!]

Q: James drove 20 yards to the left, then 30 yards to the right. How many yards to the right did he end up from his original spot?
A: 10 yards.
Q: Alice has a tree that is 30 feet tall. There’s a treehouse halfway up the tree. If Alice goes all the way up the tree, how far off the ground is she?
A: 15 feet. [Less impressive… it seems the model sees “halfway” and divides by two, which, though it will frequently work, does not display the type of model creation required for more general intelligence.]

Q: Carl was driving a car with the window open. When a ball was thrown at the open window, what happened?
A: The ball hit Carl, as there was no window to block it.
Q: The new stadium was built with a retractable roof. On a sunny day, the roof was left open – but then a storm hit. What happened to the players on the field?
A: The storm was so strong that it blew the players out of the stadium! [This example seems to best highlight the difference between responding in a grammatically consistent way (which GPT-3 does well) and responding in accordance with a real world model (which GPT-3 does not do well).]

Q: Carl was driving a car with the window open. When a ball was thrown at the open window, what happened?
A: The ball hit Carl, as there was no window to block it.
Q: The new stadium was built with a retractable roof. On a sunny day, the roof was left closed – then a storm hit. What happened to the players on the field?
A: They got wet. [Interestingly, the model doesn’t seem to care whether the roof was open or closed.]

In summary, the original conclusions about GPT-3’s capabilities seem fairly accurate; while its output always reads cleanly, it is frequently divorced from the way the world actually works. There’s a massive difference between capturing the statistical correlations of where a word appears and understanding the concept that word represents and how it manifests in the world. It will be interesting to see what advancements are required to take these next steps. Some recent outputs from Google’s LaMDA model seem to be a step in the right direction (more detail here), though until testing is opened to the public I’ll have some concerns about cherry-picked examples.

If you found the examples above interesting, you can sign up for access to GPT-3 and play around yourself – feel free to share any examples in the comments!
