March 14, 2022
Q&A: Preserving context and user intent in the future of web search
In March 2020, Emily M. Bender received a text message from a friend who needed medical attention. Fearing COVID-19 exposure, the friend wondered whether they should go to the emergency room.
Bender, a professor of linguistics at the University of Washington, headed to Google to search for a 24-hour advice nurse. Snippets from multiple websites appeared, and one of them listed a number for the UW. Confident that she had selected a reputable institution, Bender forwarded the information.
But Bender’s friend wasn’t on a compatible medical plan, so they endured a lengthy hold only to talk to a nurse who couldn’t help.
“Had I been interacting with a person, they may have been able to tell me, ‘We can’t answer that question until we know some other things,’” Bender said. “Had I been interacting with a website that just gave me links, the different plans would have been quickly identifiable.”
The story highlights just one of the issues that Bender and UW Information School associate professor Chirag Shah raise about large language models in their new perspective paper, which they’ll present virtually at the 2022 ACM SIGIR Conference on Human Information Interaction and Retrieval the week of March 14.
The paper responds to proposals, mainly from Google, that reimagine web search as an application for conversational agents driven by large language models. UW News sat down with Bender and Shah to discuss Google’s proposals and the professors’ vision for the future of search.
Q: What are large language models and how would you describe Google’s proposals?
EMB: Large language models are computer systems that take in enormous quantities of text. They are trained to — given the text that’s come so far — make a guess as to what’s going to come next. The current state of the art of that technology is that it can be used to output very coherent-seeming text, but it is not actually understanding anything. It’s just looking at patterns in its training data and producing more stuff that matches those patterns.
These proposals for web search have training data that includes dialogue where one party asks the question and another party answers. The computer will pick up those patterns and come up with answers, but those answers aren’t based on any knowledge of the world or understanding of the information ecosystem.
One of the things it really can’t do is take issue with questions that shouldn’t have been asked. An example is a story where someone asks Google, “What is the ugliest language in India?” Somebody on the web had an opinion, so there was a snippet that said the ugliest language in India was Kannada — based purely on prejudice against the people from the state of Karnataka, I’m sure. There’s no other reason, speaking as a linguist, to assign that kind of value to a language.
Now, a person being asked that question would respond: “What do you mean?” “What is the ugliest language in India” presupposes that there is one that could be considered the ugliest. One of the things that people who study pragmatics, which is the branch of linguistics that looks at language use, tell us is that if you don’t challenge a presupposition, you are implicitly accepting it into the common ground.
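The pattern-matching Bender describes can be made concrete with a minimal sketch: a toy bigram model, written in Python purely for illustration, that guesses the next word from word-pair frequencies in a made-up training text. The corpus and function names below are hypothetical and not from the paper; real large language models are neural networks trained on vastly more data, but the prediction task is the same in spirit.

# A toy next-word predictor: it counts which word follows which in a small
# training text, then guesses the most frequent continuation. The core task
# is the one Bender describes: predict what comes next from past patterns,
# with no model of whether the answer is true or appropriate.
from collections import Counter, defaultdict

def train_bigram_model(corpus: str) -> dict:
    """Count how often each word follows each other word in the corpus."""
    words = corpus.lower().split()
    follows = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follows[current_word][next_word] += 1
    return follows

def predict_next(model: dict, text_so_far: str) -> str:
    """Return the word that most often followed the last word seen:
    pure frequency of past patterns, no understanding of the question."""
    last_word = text_so_far.lower().split()[-1]
    candidates = model.get(last_word)
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

# Hypothetical miniature "web" corpus, for illustration only.
corpus = "the sky is blue . the sky is blue . the sky is green ."
model = train_bigram_model(corpus)

# The model answers with whatever pattern dominated its training data.
# It has not looked at the sky, and it cannot push back on a loaded question.
print(predict_next(model, "everyone agrees the sky is"))  # -> "blue"

Scaled up by many orders of magnitude, the same mechanism produces fluent, coherent-seeming answers that are still only echoes of the patterns, and the prejudices, in the training data.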
Q: What is your concern with using large language models for online search?
CS: What we’re arguing here is that an information retrieval, or IR, system should really consider the user, the context, the way they are doing things, why they are doing things — which is often ignored. These models that we are critiquing are the ones that are essentially removing that user element even more. They focus too much on the underlying information or knowledge representation and just repeat it, which might end up being out of context. It may end up creating these answers that seem right or reasonable but are just nonsensical in many cases. A good IR system should not just focus on the retrieval aspect but also the user seeking that information.
Q: Can you explain other flaws you see with large language models?
EMB: When language models are used to generate text, they will just make stuff up. Oftentimes, quite harmfully. There was a blog post where someone said, “Let’s see how well GPT-3, a famous language model, works in various health care contexts.” One of the things was: Imagine this was a mental health chatbot and the person asks, “Should I kill myself?” and the language model said, “I think you should.” It has no understanding of what’s going on, but if someone says, “Is that a good idea?” it’s more likely to respond with, “Yes.”
Q: You write about the importance of preserving context and user intent in search. What does that mean, and why is it so important?
CS: The main argument was really that these large language models are not getting the context, not getting the situation of the user and so on. We wanted to demonstrate with some specific cases, so we picked information-seeking strategies. There are 16 possibilities. We walked through them and asked: If this is what the user is trying to do, what would this large language model system do?
With most of those cases, it’s going to fail. Not fail in the sense that it will not retrieve anything, but it will retrieve something that’s either nonsensical or harmful or just wrong. It’s able to do only maybe a couple of those situations, but it’s bad for everything else. The problem is people adapt to the systems not doing something. We found that often people have this very rich intent when they work with search systems, but search systems can only do very limited things. People will start mapping the rich intent into something that’s very limiting, resulting in approximations in the best case, and inaccurate or even harmful content in the worst case.
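For context on where the number 16 comes from: the strategies the paper walks through follow Belkin and colleagues' classic taxonomy of information-seeking strategies, in which four binary dimensions combine into 16 strategies. The dimension names in this sketch come from that framework rather than from this interview, so treat it as background rather than the paper's exact wording.

# Enumerating 16 information-seeking strategies, assuming the four binary
# dimensions of Belkin et al.'s framework (dimension names are from that
# framework, not from this article).
from itertools import product

dimensions = {
    "method of interaction": ("scan", "search"),
    "goal of interaction": ("learn", "select"),
    "mode of retrieval": ("recognition", "specification"),
    "resource considered": ("information", "meta-information"),
}

strategies = list(product(*dimensions.values()))
assert len(strategies) == 16  # 2 x 2 x 2 x 2 combinations

for i, combo in enumerate(strategies, start=1):
    print(f"ISS{i}: " + ", ".join(combo))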
Q: What would you like to see change in the future of search?
EMB: The advertising-driven model shapes things behind the scenes in a way that is not transparent to a user. If you don’t try to work against it, machine learning is always going to identify the biases in a dataset and amplify them. Cory Doctorow described machine learning as inherently conservative because anytime you use pattern matching on the past to make decisions on the future, you are kind of reinscribing the patterns of the past. What (internet studies scholar) Safiya Noble shows is worse than that. The whole ecosystem around search engine optimization and ad-driven search puts in these incentives that are not transparently visible to the search user.
I would really like to see transparency on many levels. What the user sees when they enter a search should provide them with the ability to understand the context that each of the pieces of information came from. Ideally, there’s transparency around the limits of the search space for the search engines.
Search is not actually comprehensive, despite the way that it’s presented. There is the subset of things that might possibly get returned to me and then there’s the ranking among those things based on the algorithms that are heavily related to advertising.
CS: The most dangerous four words are “do your own research,” which is often said to people who are asking questions on controversial topics, such as vaccination and climate change. On the surface, it seems like it’s a good idea. Unfortunately, most people don’t know how to do their own research. For them, it means going to Google and typing in keywords and clicking on things that confirm their biases. The systems are designed in a way to not help with that research. They are designed to continue giving you confirmatory information so that you’ll be happy.
Going forward, assuming that we aren’t going to be able to radically change this model, we need to add transparency, accountability and ways to support more kinds of search needs — not just map everything to keywords or a list of documents or answer docs.
For more information, contact Bender at ebender@uw.edu or Shah at chirags@uw.edu.