On Generative AI, phantom citations, and…

Mar 20, 2023

The Generative AI age is going to be exhausting and unpleasant, isn't it?

11 Comments

Mar 20, 2023

When it comes to content, whether that's academic papers, reviews, short stories, books or any other form of content, the adjustment from scarcity to abundance has been painful for many, even as it brought benefits too. Traditional gatekeepers have been weakened, frequently replaced at least in part by curators. And in many cases, this has been a good thing. That I can locate and read nearly any academic paper is a boon. That books and stories are so easy to find and read is amazing. Less fun are, for eg, all the spam reviews on pretty much every site that hosts them, Amazon being a particularly notable example of a company that hosts spam reviews and doesn't care.

But we are now moving into an era of superabundance and no one is prepared. I've recently been paying a lot of attention to superabundance in the literary industry. For example, the literary magazine Clarkesworld which suffered 500 short story submissions from 1-20 February when it had to close submissions completely. It usually receives 10-25 a month. Most of those 500 were LLM-generated spam. And Clarkesworld isn't the only magazine to have been affected.

It's only too easy to imagine a time when Kindle is flooded with LLM-generated novels and novellas, when Amazon and other reviews are predominately ChatGPT created, when magazines and agents are drowned in LLM submissions. I wrote about this here:

https://wordcounting.substack.com/p/can-publishing-survive-the-oncoming

Of course, if we were talking about a superabundance of quality content, then that would be one thing. Discovery would get harder, but we'd still get something useful (ish) at the end of the process. But we're not. We're talking about a superabundance of LLM-generated trash, whether that's citations or papers or opinion pieces or novellas or books or whatever. The infosphere is going to become horribly polluted. It's like we've just crossed an information event horizon beyond which nothing found on the internet is going to be reliable because all that LLM trash is going to pollute the search engines.

I've really been trying to find a light at the end of this tunnel, but every conversation I've had about it just makes me more concerned. Even if OpenAI creates a digital watermark in its content so that it can be detected, there'll be other LLMs that don't, and soon we won't know what's real from what's been made up. When LLMs get good enough to sound just like humans, and they will, how will be tell humans apart from LLMs? And honestly, that's not a rhetorical question.

Expand full comment

Gerben Wierda

Mar 20, 2023

Great observation, beautiful example. The current generative AI wave certainly brings back memories of the early days of the internet (1990's). At the time I was one of the lone voices against the naiveté of the tech-optimists (in newspaper opinion pieces and one TV debate). But looking back, while it was easy to spot simplistic nonsense, I missed the darker side, like the corrosive side of things, such as mass-manipulation and the problematic effects of the 'attention economy'.

That we will be seeing a tsunami of 'noise masquerading as signal' seems likely. But I wonder what I am missing now.

Either we will not cope and we will culturally drown in it, or we will cope, but if so: how? Coping may for instance mean the internet waning in terms of influence and only sources with strict policies of human-curated content or human curated sources will remain, content/sources you have to pay for. The 'you're not paying for the product so you are the product' might become less and less a workable business model, as what you can consume for 'free' (i.e. in exchange for data about you) will be worth almost nothing. Most open comment sections (like this one) might have to shut shut down as will other open communities. 'Islands' (smaller, closed groups) may become a dominant pattern (again). If the sea of 'noise masquerading as signal' becomes orders of magnitude larger than the noise we already have, trust will be so rare that trust itself becomes valuable (again).

Food for thought and thank you for putting it so clearly, with such a great example, in front of us.

Expand full comment

Michael McCawley

Mar 20, 2023

Dave, there is an essay's worth of thinking coming from " ... what it means to be living in the early times." Do you have more to share on precisely that? On five minutes of thought, I thought about the American manifest destiny, and our "pioneering spirit" as defining positive characteristics of our country. However, "pioneering" is somewhat lazy - it's extracting value and exploiting resources from uncontested or undefended places. Nothing against my Native American brothers and sisters but the Europeans from the 1800's didn't view the western wilderness as "occupied." I lament all of that ugly history but that's how it was.

So, pioneering means "the first guys there had an easy time of it." It got harder later once it started to fill up, and value must be gained by displacing others or competing against incumbents, or other more talented upstarts.

This whole thought resonates with me about defining our American dysfunctional sense of exceptionalism - there was nothing exceptional except that we were "deluded" into thinking this was how it always would be, because we had, as you say, "... a naive misunderstanding of what it means to be living in the early times."

This is a poignant thought. More about this please?

Expand full comment

Reply (1)

Dave Karpf

Mar 20, 2023

I'll develop it more, for sure (probably for the book as well). Thanks, great to hear it resonated!

Expand full comment

Joseph

Mar 20, 2023

This may sound parochial but I have to get it said.

I don't buy the fundamental premise in the term LLM.

I see no language anywhere in the outputs. I see text. I see endless torrents of stochastic baloney. They don't constitute language and add nothing except more and faster plagiarism.

Language is in contact with the human world. These "models" aren't.

Expand full comment

Curt J. Sampson

Mar 21, 2023

FWIW, I had the same issue with a ChatGPT response that claimed to use information from a company's annual report.[1] I didn't carefully check the claims initially, but later did and discovered that ChatGPT does appear to make up URLs when the URLs are in a predictable format. In other words, it's not checking sources but just making up sources that are essentially random information that's statistically similar to the references it's read.

I expect that stuff like this is going to get posted all over the web in the next few years and further degrade our ability to even know what references really exist; if this continues our search engines will over time be working from a pool not of mostly-good references with the odd error in them, but a pool where most references don't exist at all, making it far more difficult to find references.

I am finding it more and more difficult to be non-alarmist about ChatGPT and LLMs in general; every time I turn around it starts looking just a little more nightmarish.

[1]: https://garymarcus.substack.com/p/should-we-worry-more-about-short/comment/13591756

Expand full comment

Cheez Whiz

Mar 20, 2023

I invented an aphorism I call Fudd's 1st Law (in honor of Firesign Theater's Fudd's 3rd Law of Energy: "if you push something, it falls over") stating "if there is a system, someone will try to game it". The Internet, with its roots in academic idealism of free exchange, proved Fudd's 1st Law many times over. It's greatest strength is its greatest weakness, and it's been an endless arms race to preserve its ideals against the reality of monetization. Just think of the opportunities for a Nigerian Prince in LLM.

Expand full comment

Reply (1)

Dave Karpf

Mar 22, 2023

I like that. I came up with a similar one in a 2012 paper. "Any metric of digital influence that becomes financially valuable, or is used to determine newsworthiness, will become increasingly reliable over time."

So my friendly amendment would be that people won't game *any* system, because there are plenty of systems that nobody hears about and no one cares about. But as soon as the system becomes worth giving a damn about, someone will try to game it.

Expand full comment