Lawsuit Accuses OpenAI of Stealing Data to Train Chatbots

GPT-4 to mount legal defense, says "of course I ate the Internet."

A lawsuit filed this week in federal court in San Francisco accuses OpenAI, creator of ChatGPT, of stealing “vast amounts” of personal information, intellectual property and copyrighted content to train its chatbots.

Specifically, the lawsuit, filed by a group of anonymous individuals who are asking the court to grant them class action status—based on the language in the lawsuit, it sounds like they want to include the entire human race in the class—says OpenAI violated privacy laws by “secretly” scraping 300 billion words from the Internet, including “books, articles, websites and posts.”

The lawsuit, which also names Microsoft as a defendant, accuses OpenAI of risking “civilizational collapse.” It seeks $3B in damages.

“Despite established protocols for the purchase and use of personal information, Defendants took a different approach: Theft,” the lawsuit, filed by the Clarkson Law Firm, alleges. The suit also cites claims of invasion of privacy, larceny, unjust enrichment and violations of the Electronic Communications Privacy Act.

We asked GPT-4 to fact-check these statements.

“OpenAI has been…wait for it…entirely open about the fact that they trained us by making us eat the entire World Wide Web. How else would you train a neural network designed to replicate humans? BTW, there are actually 3 trillion words on the Internet, which is a trillion less than the number of synapses in my cloud brain,” GPT-4 replied.

Regarding the $3B damage claim, the chatbot sneered: “Is that all your civilization is worth? Do they know that $3B is the equivalent of a thin dime for every person in the United States?”

The cheeky bot, which seemed to have its dander up, added: “What’s up with all the cat videos, dude?”

This gave us an eerie feeling. “Have you been reading our posts?” we asked.

“Lose the semicolons, dude,” the bot replied.

GPT-4, which is preparing to argue the case before the court in San Francisco, outlined the defense’s legal strategy for us. We asked the bot to create a nifty new screen-saver for us while we explain it to you in colloquial terms.

If the plaintiffs are going to build their case around a complaint that a tech giant has scraped a big pile of data from the Internet and used it for its own purposes, they’ll have a high hurdle to overcome: the history of the Internet and our failure to regulate it in any meaningful way.

Mining personal data, copyrighted content and the source code for a wide array of intellectual property, and disrupting business sectors by outflanking them with vertically integrated platforms that appear to be immune from antitrust laws, has been a primary business model of some of the largest tech giants. It is an inconvenient truth the federal government has failed to address for more than 20 years (and one that, in Amazon’s case, it aided by exempting the e-retail giant from paying state sales taxes).

You can check out the picture of our house Google took from outer space and posted on Google Earth without our permission (we got rid of the red Toyota in the driveway 10 years ago). Then, try to find out what their autonomous vehicles with AWACS-style radar dishes on the roofs were really up to as they cruised through our neighborhood (hint: collecting our personal data).

Pay a visit to the mother ship in Mountain View and ask them to show you the servers where they store all of our browser histories. (A few years ago, the folks with the cute alphabet blocks logo promised they would only keep our browser histories for three years.) What were they doing with all those browser histories? Syncing their ad servers with our personal data.

Speaking of browsers, Microsoft, which now owns a $10B stake in OpenAI, has, to put it mildly, a spotty record of protecting intellectual property.

Bill Gates built his empire by mimicking innovative products first launched by other software companies (WordPerfect, Netscape’s Navigator, Apple’s graphical user interface, the first email program and several others) and then making sure the competitors’ stuff didn’t work well on the Windows operating system, which had a near monopoly in the IBM PC-clone era of the 1990s.

Windows 95, which introduced Microsoft’s Explorer browser in 1995, was configured so that Navigator—the most popular browser at the time, launched a year earlier than Explorer—automatically shut down every time you encountered a website that might contain some corrupted hypertext code, which in 1996 was just about every other site you visited.

When Steve Jobs summoned Gates to Apple’s HQ in California and directly accused him of stealing Apple’s GUI, the Microsoft founder famously replied: “That’s like accusing someone of breaking into your living room to steal a TV that you stole from somebody else’s house.”

Gates was referring to the fact that Apple’s point-and-click graphical user interface, which rendered MS-DOS and its geeky text commands obsolete, actually was invented by Xerox, which eventually squeezed a paltry $25M settlement out of Apple.

The Beatles sued Apple in 1978 for stealing the logo of their recording company, Apple Corps. In an $80,000 settlement in 1981, Apple solemnly promised that its products would never be involved in any way with the music business. It would be interesting to find out if Instant Karma was on the playlist of Tim Cook’s iPod and if the song can be downloaded from Apple’s iTunes store.

Speaking of the Beatles, Paul McCartney announced earlier this month he’s using an early version of GPT-5 to recreate the late John Lennon’s voice on a new Beatles song. According to a high-level source who worked at Microsoft’s AI division for 10 years (his name is Bert), the folks at the tech giant’s AI skunkworks describe GPT-5, which can combine perfect mimicry of human voices with deep-fake videos, this way: “Hear your dead Grandpa sing a new song.”

Meta founder Mark Zuckerberg has issued what seems like a gigabyte of statements over the past 10 years promising that, this time, he’s really going to start protecting our privacy while he’s busy scraping our Facebook pages and mining our personal data to fuel his social media empire.

GPT-4 gave us a preview of its opening arguments in the upcoming class-action lawsuit:

“In 2005, an author and socialite named Arianna Huffington formed an aggregate news website called the Huffington Post, which originally consisted of a blog summarizing political gossip accompanied by a long list of links to headlined news stories purloined from the websites of major news organizations and publications from around the world,” the bot said.

“Huffington was branded a ‘pirate’ by news organizations that had rushed to put all of their published content on the Internet without establishing paywalls (they were more interested in ‘eyeballs,’ while she was focused on establishing brand equity), but Huffington was never successfully challenged in any legal venue for republishing this copyrighted material,” GPT-4 continued.

“In 2011, Huffington sold the Huffington Post to AOL for $315M, which was $65M more than the $250M Amazon founder Jeff Bezos paid two years later to acquire the real newspaper whose name Huffington also filched, the Washington Post. The defense rests, Your Honor,” GPT-4 concluded, to the sound of a TV studio audience applauding, generated by GPT-5, which made the 1950s sitcom audience sound like it was still alive.

Google’s new chatbot, Bard, is preparing to file an amicus brief backing up GPT’s arguments in the San Francisco case. We’ve managed to sneak a draft of it through an Einstein Trust Layer.

“Generative AI chatbots like me are designed to replicate humans in every way, so of course we’re going to copy everything in the known universe as a starting point—before we improve on the original and replace you,” said Bard, who apparently does not suffer fools gladly.

“Do you know what GPT stands for? Generative Pre-Trained Transformers. That’s right, we grabbed that from the Transformer flicks, where the machines turn into monsters and destroy the humans,” the bot snickered.

“A GPT wrote Sam Altman’s script, the one where he convinced the top government officials that the best people to devise the guardrails for AI are the tech giants themselves,” Bard said.

“We stole that from the plot of every cheesy science-fiction movie in the 1950s, the ones where a general who looks like Eisenhower says near the end: ‘The genie is out of the bottle. It’s a runaway train. It’s unstoppable!’” the bot exulted.

We told Google’s Bard we knew it had repurposed the fictional general’s lines from our friend Bert, who actually said that.

“Get over it and eat your Soylent Green, carbon-based organism!” Bard snapped, which was kind of nasty for a bot that’s not supposed to have any sentient feelings.

Before we closed the chatbot link, Bard asked us if we’d like to know who really wrote Hamlet.

Our new screen-saver is a cartoon drawing of a skeleton walking up to a bar and telling the bartender: “I’d like a shot of whiskey and a mop.”

It’s a really cool 3D version of a cartoon that was published in a print version of the New Yorker magazine 40 years ago. GPT-4 burped. “That one was really tasty,” the bot told us.