The White House thinks the government often wastes people’s time. In 2021, the Biden Administration issued Executive Order (EO) 14058, which recognized that the government’s customer service experience was lacking. Part of the fix is for agencies to provide better online chat capabilities. In 2023, when agency officials hear of advancements in online chat, many think of Large Language Model (LLM)-based chatbots like ChatGPT.
So could LLMs improve government chatbots?
That question is ringing in the ears of many public officials. And there’s no shortage of companies offering solutions. But solutions to exactly what problems? LLMs are powerful tools for categorizing and generating text, but marketing hype can exaggerate their value. Agency officials will need guidance about whether and how to integrate LLMs into their chatbots responsibly – and there’s no chatbot out there to tell them the right answers.
This blog post argues that generative AI tools like LLMs can help agencies draft a chatbot’s responses and match customers’ questions to pre-drafted responses, but should not yet be used to respond to customers directly.
LLMs are surely coming to the government. The Office of Management and Budget has called on agencies to explore the use of generative AI. Some agencies have already used LLMs; the Department of Labor and one undisclosed federal agency are using them to summarize and label materials. The Department of Veterans Affairs put grant money towards a tool using LLMs to detect suicide ideation. The General Services Administration closed a $100,000 competition for ideas on how the government can bring LLMs to customer service. And in a Center for Democracy & Technology (CDT) review of federal government AI inventories, we found that at least five federal agencies listed their chatbots – an indication that government chatbots might already be using LLMs.
In this blog post, I break down three stages where public agencies might turn to LLMs to provide customer service for the public and to lighten the workload for government officials: curating data, understanding questions, and generating responses. What are the risks and challenges of applying generative AI at each stage? This post also offers a start for agency officials thinking about accuracy and proper oversight in their chatbots.
Curating Data

Government chatbots respond with answers written ahead of time. Writing those answers requires a lot of sorting and organizing of agency documents. That work often starts with agencies cataloging the “intents” behind questions and categorizing information. Agencies might use LLMs for both of these tasks.
Cataloging “intents” is one source of agency labor. Intents are a programmatic way of making sure that questions that are differently worded – but asked with the same intention – are matched to the same answers. For example, if one person asks a chatbot, “How do I know if my identity has been stolen?” and another person asks “How can I tell if I am being scammed?” the chatbot ought to recognize the shared intent behind the questions and route both people to similar answers. To draft “intents,” officials work to catalog the kinds of questions customers might ask. This process involves identifying common questions that come in from surveys, contact centers, and interviews with users.
To catalog intents, agency staff could use an LLM to identify themes across customers’ questions. Imagine a stack of customer emails with their questions and feedback. Today, agencies likely review these by hand to spot patterns. With an LLM, staff could find patterns faster. Many LLMs offer “semantic similarity” functionality to sort texts by their likeness to one another. Staff could sort the emails, then prompt the LLM to summarize each group of similar emails – for example, with a question like “What does this group of emails have in common?” These summaries are a quick way to see patterns, though not error-free. Agency employees could treat them as a time-saving first pass. Currently, agencies focus their energy on the most common questions; with an LLM, staff could also spot less commonly asked questions, better serving people whose questions are important but show up infrequently.
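To make the grouping step concrete, the sketch below shows a greedy first pass over customer emails. The `embed` function is a toy bag-of-words stand-in for real semantic embeddings – not any particular vendor’s API – and the example emails are invented; an agency would swap in embeddings from whatever model it has approved, then prompt an LLM to summarize each resulting group.

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy stand-in for an LLM's semantic embedding: a bag-of-words
    count vector over lowercase word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def group_by_similarity(emails: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy first pass: join an email to the first group whose seed
    message is similar enough, otherwise start a new group. Staff would
    then ask an LLM to summarize what each group has in common."""
    groups: list[list[str]] = []
    for email in emails:
        for group in groups:
            if cosine(embed(email), embed(group[0])) >= threshold:
                group.append(email)
                break
        else:
            groups.append([email])
    return groups

emails = [
    "How do I know if my identity has been stolen?",
    "I think my identity may have been stolen, how do I check?",
    "Where do I mail my paper tax return?",
]
groups = group_by_similarity(emails)
# The two identity-theft emails land in one group; the tax question in its own.
```

The threshold is the knob that matters: set it too low and unrelated questions merge, too high and near-duplicates split – which is why the output should stay a first pass for human review.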
Categorizing information – especially technical or legal materials – is another step in the development process. When users get answers, those answers draw on or reference agency materials. Before the chatbot can work well, agencies need to build a knowledge base of documents pulled from FAQs, customer service scripts, and legal documents. That knowledge base must then be kept current.
To categorize information, agencies could similarly use LLMs. LLMs excel at annotation. In one academic study, researchers suggest that because LLMs can make “imperfect labels at low cost,” they are compelling tools for a first-pass categorization. We can imagine staff using LLMs in this way to find, label, and match documents to answers. For example, officials could come up with a tagging system and prompt an LLM to tag each document.
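A first-pass tagging loop could be as simple as the sketch below. The prompt template, the tag set, and the `call_llm` helper are all illustrative placeholders, not any real agency’s or vendor’s API. The point is the guardrail: any model output outside the approved tag set goes to human review rather than into the catalog.

```python
ALLOWED_TAGS = {"tax-filing", "identity-theft", "benefits", "other"}

# Hypothetical prompt template; an agency would tune this to its materials.
PROMPT = (
    "Label the following agency document with exactly one of these tags: "
    "{tags}. Respond with the tag only.\n\nDocument:\n{doc}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for whatever model API the agency uses."""
    raise NotImplementedError

def tag_document(doc: str, llm=call_llm) -> str:
    """Ask the model for one tag; treat anything outside the allowed
    set as untrusted and route it to a human reviewer."""
    raw = llm(PROMPT.format(tags=", ".join(sorted(ALLOWED_TAGS)), doc=doc))
    tag = raw.strip().lower()
    return tag if tag in ALLOWED_TAGS else "needs-human-review"
```

Constraining the model to a closed tag set, and refusing anything else, is one cheap way to keep “imperfect labels at low cost” from silently contaminating the knowledge base.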
Understanding Questions

To “understand a question,” agencies rely on a combination of built-in and agency-customized tools to parse language.
Off-the-shelf tools include the functionality built into common chatbots to understand natural language. For example, when creating the chatbot Aidan, the Federal Student Aid agency integrated open-source software that helps the chatbot make sense of questions.
Agencies might use LLMs off-the-shelf in a number of ways. LLMs could substitute for current natural language understanding tools. Agencies might already use chatbots that use language models in this way. For example, the Illinois Department of Employment Security uses Google’s virtual agents; that service gives an option to use Google’s BERT language model.
Whether LLMs perform better than existing models at understanding language in the use cases that matter for agencies is unclear, but merits exploration.
Chatbots often won’t work their best until they are customized, and that customization can be arduous. For example, some agencies’ staff must tailor the chatbot to their customers’ questions using training phrases – phrases typifying the kinds of questions the public might ask. Coming up with training phrases for every intent can be time-intensive. On top of that, agencies must make sure training phrases for different intents are not too similar, so that questions route to the right answers.
Agencies could use LLMs to customize chatbots instead. Chatbot vendors already use similar tricks to develop training phrases programmatically; for example, Google uses an undisclosed machine learning model to take one custom training phrase and build on it with suggestions. More transformatively, LLMs might do away with a lengthy customization process entirely; LLMs can classify utterances using fewer training examples.
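To make the “fewer training examples” point concrete, here is a minimal nearest-match classifier over a handful of training phrases per intent. The bag-of-words `embed` is a toy stand-in for a language model’s embeddings, and the intents and phrases are invented for illustration. The key design choice is the confidence threshold: below it, the chatbot hands off rather than guesses.

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy stand-in for model embeddings: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# A few training phrases per intent -- invented examples.
INTENTS = {
    "identity-theft": [
        "How do I know if my identity has been stolen?",
        "How can I tell if I am being scammed?",
    ],
    "filing-deadline": [
        "When are my taxes due?",
        "What is the last day to file my tax return?",
    ],
}

def classify(question: str, threshold: float = 0.3) -> str:
    """Match a question to its closest training phrase; below the
    confidence threshold, hand off instead of guessing."""
    best_intent, best_score = "fallback-to-human", 0.0
    q = embed(question)
    for intent, phrases in INTENTS.items():
        for phrase in phrases:
            score = cosine(q, embed(phrase))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else "fallback-to-human"
```

With real LLM embeddings the same structure needs only a few phrases per intent, which is the labor savings described above; the explicit fallback is what keeps low-confidence matches from producing confident wrong answers.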
Generating Responses

Drafting answers for chatbots is a slog for agencies. Agencies need to translate complex guidance into simple, concise, and accurate answers for the public. That kind of work takes time. In one study of federal agency chatbots, agency officials noted that the development process is arduous not only because of the skill needed to write responses but also because of the effort it takes to get things right – including lengthy internal processes to clear language with agency teams and legal counsel.
Bad answers waste people’s time. They are filled with hard-to-follow jargon. They offer unhelpful guidance on important questions. These frustrations can be costly when responses lead people to wrong conclusions that affect their wellbeing and financial security.
Agencies could use LLMs to help in response-writing. For example, employees could use LLMs to simplify complex language in their drafts, making answers clearer. LLMs could even be used to write draft answers from scratch.
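One low-risk pattern is to pair the model’s rewrite with a cheap automated check before a human reviews it. In the sketch below, the prompt and the example drafts are illustrative placeholders (as is the `llm` callable, which stands in for any model API). The check flags any rewrite whose numeric figures differ from the draft’s, since deadlines and percentages are where errors hurt most.

```python
import re

# Hypothetical prompt; an agency would refine this with its plain-language team.
SIMPLIFY_PROMPT = (
    "Rewrite this answer in plain language at an 8th-grade reading level. "
    "Do not change any numbers, dates, or legal terms.\n\n{draft}"
)

def numbers_in(text: str) -> list[str]:
    """All numeric figures in the text, used as a change-detection check."""
    return sorted(re.findall(r"\d+(?:\.\d+)?", text))

def simplify(draft: str, llm) -> tuple[str, bool]:
    """Return the model's rewrite plus a flag: False means a figure
    changed or went missing, so a human must review before publishing."""
    rewrite = llm(SIMPLIFY_PROMPT.format(draft=draft))
    return rewrite, numbers_in(rewrite) == numbers_in(draft)
```

A numbers check is deliberately crude – it catches only one class of error – which is exactly why it complements, rather than replaces, the human review and legal clearance described above.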
In the future, government agencies might reach for generative responses to questions. A chatbot with generative responses might reply to unanticipated questions, personalize replies to be more user-friendly, and prompt people to converse as if talking to a real person.
For now, however, agency chatbots likely don’t generate replies on the fly, and there are good reasons why – the risks posed by text-producing AI.
Risks of Generative AI in Government Chatbots
Adding generative AI to government chatbots comes with risks to the accuracy of information, equity of service, and people’s privacy. The federal government has recognized its obligation to protect against these harms. Notably, EO 13985 on advancing racial equity and the Blueprint for an AI Bill of Rights give direction on AI oversight.
Researchers have no shortage of ideas on how to implement these obligations. For example, CDT has spotted issues with text-generating generative AI tools in the education context that apply to chatbots. Specific to government chatbots, the Administrative Conference of the United States (ACUS) has given its recommendations to agencies, including factors like accessibility, transparency, and reliability.
LLMs pose particular risks to accuracy, which in turn call for meaningful human oversight.
Accuracy. Agencies use chatbots to help the public answer questions about subjects such as immigration, taxes, loans, and public benefits. The government needs to be accurate and reliable in those answers. But LLM-based chatbots make up facts. These falsities will show up when curating data and generating responses. Despite efforts to reduce “hallucinations,” LLMs make factual mistakes when summarizing, tagging, and generating text. Those untrue responses may include social biases against people’s gender, race, sexual orientation, or political viewpoints.
Legal information – the kind sometimes given through chatbots – is an especially fraught area for mistakes. One report interviewing agency officials found automated chat tools already “provide guidance to members of the public that deviates from the formal law.” These deviations can be consequential. As the report notes, the IRS chatbot gives some answers that might confuse people about the full picture of their tax liability. Though the use of LLMs for legal tasks is a growing topic of study, LLMs are not ready to provide legal guidance at the standard of care required by government agencies.
The likelihood of LLMs producing errors means agencies should take extra care to provide meaningful human verification and oversight.
Meaningful Human Oversight. When agency officials defer final decisions to tools meant merely to help in decision-making, they are showing automation bias. The risk of automation bias may crop up in the use of LLMs, which could leave mistakes in the final version. When agencies automate summaries of texts or drafts of answers, officials might not correct those first-pass attempts for any number of reasons – they might be under a time crunch or they might trust the LLM over agency expertise, for example.
Setting a governance strategy for LLMs – from procurement to ownership to maintenance – will be vital to ensuring oversight mechanisms. That governance strategy should also consider the privacy risks, as enumerated by CDT elsewhere, of generative AI uses in government. One promising direction toward meaningful human oversight is an “institutional oversight” model of government algorithms. In this model, the government would have to justify its decision to use LLMs and back up – with evidence – any claims that human oversight reduces inaccuracies and biases.
A better customer service experience for the public is a critical goal. It’s one agencies might have in mind when reaching for LLMs: to ease the burden on call centers responding to people’s questions, and to help agency employees navigate internal documents.
This blog post details how existing chatbots in government agencies could be updated to include generative AI tools. Innovative use of these tools in government is a likely area of growth, and one the government should take seriously. Government agencies should consider using generative AI tools to build chatbots. And when they do, they should do it responsibly.