OpenAI’s attempts to watermark AI text hit limits

It's proving tough to rein in systems like ChatGPT

Did a human write that, or ChatGPT? It can be hard to tell — perhaps too hard, its creator OpenAI thinks, which is why it is working on a way to “watermark” AI-generated content.

In a lecture at the University of Texas at Austin, computer science professor Scott Aaronson, currently a guest researcher at OpenAI, revealed that OpenAI is developing a tool for “statistically watermarking the outputs of a text [AI system].” Whenever a system — say, ChatGPT — generates text, the tool would embed an “unnoticeable secret signal” indicating where the text came from.

OpenAI engineer Hendrik Kirchner built a working prototype, Aaronson says, and the hope is to build it into future OpenAI-developed systems.

“We want it to be much harder to take [an AI system’s] output and pass it off as if it came from a human,” Aaronson said in his remarks. “This could be helpful for preventing academic plagiarism, obviously, but also, for example, mass generation of propaganda — you know, spamming every blog with seemingly on-topic comments supporting Russia’s invasion of Ukraine without even a building full of trolls in Moscow. Or impersonating someone’s writing style in order to incriminate them.”

Exploiting randomness

Why the need for a watermark? ChatGPT is a strong example. The chatbot developed by OpenAI has taken the internet by storm, showing an aptitude not only for answering challenging questions but also for writing poetry, solving programming puzzles and waxing poetic on any number of philosophical topics.

While ChatGPT is highly amusing — and genuinely useful — the system raises obvious ethical concerns. Like many of the text-generating systems before it, ChatGPT could be used to write high-quality phishing emails and harmful malware, or cheat at school assignments. And as a question-answering tool, it’s factually inconsistent — a shortcoming that led programming Q&A site Stack Overflow to ban answers originating from ChatGPT until further notice.

To grasp the technical underpinnings of OpenAI’s watermarking tool, it’s helpful to know why systems like ChatGPT work as well as they do. These systems understand input and output text as strings of “tokens,” which can be words but also punctuation marks and parts of words. At their cores, the systems are constantly generating a mathematical function called a probability distribution to decide the next token (e.g. word) to output, taking into account all previously output tokens.
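
As a toy illustration (made-up vocabulary and scores, not OpenAI's code), turning raw model scores into such a distribution looks like this:

```python
import math

def softmax(logits):
    """Turn raw model scores (logits) into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up vocabulary and scores for the next token, given some prefix.
vocab = [" the", " a", " cat", " sat", "."]
logits = [2.1, 1.3, 0.2, -0.5, -1.0]
for token, p in zip(vocab, softmax(logits)):
    print(f"{token!r}: {p:.2f}")
```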

In the case of OpenAI-hosted systems like ChatGPT, after the distribution is generated, OpenAI’s server does the job of sampling tokens according to the distribution. There’s some randomness in this selection; that’s why the same text prompt can yield a different response.
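
A minimal sketch of that sampling step, again with made-up numbers, shows why repeated runs can differ:

```python
import random

# Made-up distribution over candidate next tokens.
vocab = [" the", " a", " cat", " sat", "."]
probs = [0.45, 0.25, 0.15, 0.10, 0.05]

# Ordinary sampling: each draw picks a token at random, weighted by its
# probability, so the same prompt can yield different completions.
print([random.choices(vocab, weights=probs, k=1)[0] for _ in range(5)])
```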

OpenAI’s watermarking tool acts like a “wrapper” over existing text-generating systems, Aaronson said during the lecture, leveraging a cryptographic function running at the server level to “pseudorandomly” select the next token. In theory, text generated by the system would still look random to you or me, but anyone possessing the “key” to the cryptographic function would be able to uncover a watermark.
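
Aaronson has only described the approach at a high level, so the sketch below is a hypothetical reconstruction of keyed pseudorandom token selection; the pseudorandom function (HMAC-SHA256), the four-token context window and the demo key are all assumptions made for illustration. One way to do this is to pick the candidate token that maximizes r ** (1 / p), where p is the token's probability and r is a keyed pseudorandom value: the choice matches ordinary sampling when r looks uniform, yet is reproducible by anyone who holds the key.

```python
import hashlib
import hmac

SECRET_KEY = b"assumed-demo-key"  # hypothetical; a real key would live on the server

def pseudorandom_score(key, context_tokens, candidate, window=4):
    """Map (recent context, candidate token) to a deterministic value in (0, 1)
    using a keyed hash. HMAC-SHA256 and the 4-token window are assumptions."""
    msg = ("|".join(context_tokens[-window:]) + "||" + candidate).encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2 ** 64 + 2)

def watermarked_pick(context_tokens, vocab, probs, key=SECRET_KEY):
    """Choose the candidate maximizing r ** (1 / p). With uniform-looking r values
    this reproduces the model's distribution, yet anyone holding the key can
    recompute which tokens the scheme favored."""
    best_token, best_score = None, -1.0
    for token, p in zip(vocab, probs):
        if p <= 0:
            continue
        r = pseudorandom_score(key, context_tokens, token)
        score = r ** (1.0 / p)
        if score > best_score:
            best_token, best_score = token, score
    return best_token

# Deterministic given the key and the recent context.
print(watermarked_pick(["The", " cat", " sat", " on"], [" the", " a", " mat"], [0.6, 0.3, 0.1]))
```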

“Empirically, a few hundred tokens seem to be enough to get a reasonable signal that yes, this text came from [an AI system]. In principle, you could even take a long text and isolate which parts probably came from [the system] and which parts probably didn’t,” Aaronson said. “[The tool] can do the watermarking using a secret key and it can check for the watermark using the same key.”
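
In this toy setup, detection can reuse the same keyed function from the sketch above (pseudorandom_score and SECRET_KEY): score each token of a suspect text and compare the average against what unwatermarked text would produce. The statistic and threshold here are illustrative choices, not a description of OpenAI's detector.

```python
import math

def watermark_score(tokens, key=SECRET_KEY, window=4):
    """Average of -ln(1 - r) over each token's keyed pseudorandom value.
    For ordinary text each term averages about 1; watermarked text, which
    favored high-r tokens at generation time, scores noticeably higher."""
    total = 0.0
    for i, token in enumerate(tokens):
        r = pseudorandom_score(key, tokens[:i], token, window=window)
        total += -math.log(1.0 - r)
    return total / max(len(tokens), 1)

def looks_watermarked(tokens, threshold=1.3):
    # The threshold is invented for illustration; a real detector would be
    # calibrated against a false-positive target over a few hundred tokens.
    return watermark_score(tokens) > threshold
```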


Key limitations

Watermarking AI-generated text isn’t a new idea. Previous attempts, most of them rules-based, have relied on techniques like synonym substitutions and syntax-specific word changes. But outside of theoretical research published by the German institute CISPA last March, OpenAI’s appears to be one of the first cryptography-based approaches to the problem.
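
For contrast with the cryptographic approach, here is a toy illustration, not modeled on any specific published system, of the older rules-based idea: hide a bit pattern by choosing between interchangeable synonyms.

```python
# Toy rules-based watermark: encode bits by picking between synonym pairs.
SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start")]

def embed_bits(words, bits):
    """Replace each recognized synonym with the variant encoding the next bit."""
    bit_iter = iter(bits)
    out = []
    for w in words:
        for a, b in SYNONYM_PAIRS:
            if w in (a, b):
                try:
                    w = b if next(bit_iter) else a
                except StopIteration:
                    pass
                break
        out.append(w)
    return out

print(" ".join(embed_bits("the quick dog saw a big cat".split(), [1, 0])))
# -> "the fast dog saw a big cat"
```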

When contacted for comment, Aaronson declined to reveal more about the watermarking prototype, save that he expects to co-author a research paper in the coming months. OpenAI also declined, saying only that watermarking is among several “provenance techniques” it’s exploring to detect outputs generated by AI.

Unaffiliated academics and industry experts, however, shared mixed opinions. They note that the tool is server-side, meaning it wouldn’t necessarily work with all text-generating systems. And they argue that it’d be trivial for adversaries to work around.

“I think it would be fairly easy to get around it by rewording, using synonyms, etc.,” Srini Devadas, a computer science professor at MIT, told TechCrunch via email. “This is a bit of a tug of war.”

Jack Hessel, a research scientist at the Allen Institute for AI, noted that fingerprinting AI-generated text imperceptibly is tricky because each token is a discrete choice. Too obvious a fingerprint could result in odd word choices that degrade fluency, while too subtle a fingerprint would leave room for doubt when someone goes looking for it.

Yoav Shoham, the co-founder and co-CEO of AI21 Labs, an OpenAI rival, doesn’t think statistical watermarking alone will be enough to help identify the source of AI-generated text. He calls for a “more comprehensive” approach that includes differential watermarking, in which different parts of text are watermarked differently, and AI systems that more accurately cite the sources of text.

This particular watermarking approach also requires placing a great deal of trust — and power — in OpenAI, experts noted.

An ideal scheme, Hessel said via email, would be imperceptible to human readers while still enabling highly confident detection. “Depending on how it’s set up, it could be that OpenAI themselves might be the only party able to confidently provide that detection because of how the ‘signing’ process works.”

In his lecture, Aaronson acknowledged the scheme would only work in a world where companies like OpenAI stay ahead in scaling up state-of-the-art systems and all agree to be responsible players. Even if OpenAI were to share the watermarking tool with other text-generating system providers, like Cohere and AI21 Labs, nothing would stop those providers from choosing not to use it.

“If [it] becomes a free-for-all, then a lot of the safety measures do become harder, and might even be impossible, at least without government regulation. In a world where anyone could build their own text model that was just as good as [ChatGPT] … what would you do there?”

That’s largely how things have played out with text-to-image systems. While OpenAI’s DALL-E 2 image-generating system is only available through an API, Stability AI’s text-to-image Stable Diffusion was released as open source. DALL-E 2 applies a number of filters at the API level to prevent problematic images from being generated (plus watermarks on the images it produces); the open source Stable Diffusion does not. Bad actors have used it to create deepfaked pornography, among other harms.

Still, Aaronson is optimistic. In the lecture, he expressed the belief that, if OpenAI can demonstrate that watermarking works and doesn’t affect the quality of the generated text, it has the potential to become an industry standard.

Not everyone agrees. As Devadas points out, the tool needs a key, meaning it can’t be fully open source, which potentially limits its adoption to companies willing to partner with OpenAI. And if the key were ever made public, anyone could work out the pattern behind the watermarks, defeating their purpose in the first place.

Then again, it might not be so far-fetched. A spokesperson for Quora said the company would be interested in using such a system, and it likely wouldn’t be the only one.

According to Aaronson, “You could worry that all this stuff about trying to be safe and responsible when scaling AI … as soon as it seriously hurts the bottom lines of Google and Meta and Alibaba and the other major players, a lot of it will go out the window. On the other hand, we’ve seen over the past 30 years that the big internet companies can agree on certain minimal standards, whether because of fear of getting sued, desire to be seen as a responsible player, or whatever else.”