Mechanical mockingbirds
Earlier this week, The Bookseller, a UK-based trade publication for the book industry, published an op-ed by tech journalist Mark Piesing that garnered a lot of attention: "AI narration is inevitable."
In it, he points to the strides made in the field of computer science and artificial intelligence when it comes to machine-generated audio. You've probably heard some of it: news sites that have widgets that provide a passable narration of their articles, text-to-voice apps, and others. He explained that he's listened to the technology steadily improve, and while this technology has made waves around social media and tech journalism, the publishing industry seems reluctant to adopt it. He feels that there's an inevitability coming: that at some point, AI will be able to replace narrators by streamlining the publishing process and bringing down costs.
There's been a vocal contingent of authors and narrators who have pushed back on this: Hummingbird Salamander and Annihilation author Jeff VanderMeer was unequivocal on Twitter: "No audiobook of my work will be created by AI. Period," while The Mountain in the Sea author Ray Nayler concurred, noting that "this is acting, not just 'reading'."
Eunice Wong, who narrated Nayler's The Mountain in the Sea and my own book, Cosplay: A History, put it in blunter terms to me: "I find it terrifying. When AI begins to encroach on human creativity and art, it's very hard for me not to feel a sense of doom."
And yet.
Earlier this week, Apple launched a suite of production tools for authors and a catalog of books featuring machine-generated narration. Apple explains that many books don't have an audio narration, and that audiobooks are largely out of reach for independent authors and small publishers because of the complexities involved in the production. (I reached out to Apple for this piece, but didn't hear back.) Indeed, in his op-ed, Piesing outlines that complexity: it involves a voice actor (sometimes many), editors and producers, with dozens of hours involved in the production and thousands of dollars.
Apple touts its advances in machine learning to produce high-quality artificial voices, and the technology required to generate such audio from a digital text. The company notes on its page that this particular product offering is designed to "complement" the human narrators, and seems to be actively positioning this product in equitable terms: it allows authors and publishers who don't have the resources to bring their books to audio for the first time.
The rationale behind this is pretty clear: the demand for audiobooks continues to grow. In June 2022, Publishers Weekly reported a double-digit increase in sales, and that the volume of audiobooks has been steadily growing, thanks in part to devices like smartphones, platforms like Apple's bookstore, and audiobook retailers like Audible and Libro.fm.
That industrial growth means increased revenue for creators, but also pipeline problems on the production end. One author I interviewed for a separate piece a while back explained to me that because of the popularity of the audiobook adaptations of their own work, they had to essentially work around their long-time narrator's already busy schedule. In another case, Frontlines author Marko Kloos noted in 2020 that the audiobook editions of his novels were delayed because of his narrator's schedule and he had to jump to a new narrator (who then unexpectedly died and had to be replaced again).
On one hand, this seems like a place where artificial intelligence can be put to good use: tackling the backlog and allowing authors who don't have the resources to take advantage of this surge in demand to catch up with their peers.
But on the other hand, artificial intelligence's push into the arts is an existential threat to the livelihoods of narrators and artists who've made their living voicing books for readers. An audiobook is not a recitation of an author's printed text: it's a performance in and of itself, and while you can replicate a voice, it's difficult – if not impossible – to replicate the performance.
"I feel," Wong told me in an email, when I asked her how she saw the differences between her work as a human narrator and that of a machine-generated one. "I have organic, unpredictable, unexpected human emotions that bypass the intellect. When I tell a story or share an exciting piece of nonfiction with my listener, my entire life experience informs those words. I have intuition, [which is] vital to a great performance."
Wong went on to explain how intuition is a part of how she approaches her work. "You don't plan out your performance. You don't decide in advance what to emphasize, what line to get angry or sad or silly on."
There's a reason for that approach: it helps breathe life into the performance, bringing spontaneity and nuance and real emotion to the words. "That planned, canned performance will shrivel next to the one in which the actor is simply open and alive in the moment, living that situation, allowing their emotions to react to what's happening," she explained. "In the greatest performances, audiobooks included, emotional nuances appear that are not planned by the actor or writer, and cannot be directed or even replicated."
That translates into a more visceral performance for the reader. "I've had many moments while recording a book that I am surprised – even shocked – and overcome by an emotional reaction that I didn't have while I was reading the script for preparation," Wong said. "I couldn't have planned those moments, and I couldn't have made them happen. When they do happen, they're little miracles, a confluence of the story, my life, that particular day and the events of my life. I've also seen them happen in others' performances, and they are magical. They're transcendent."
I experienced an example of this last year, when Amazon allowed Alexa users to ask the smart assistant to read them J.R.R. Tolkien's The Fellowship of the Ring. I ended up listening as the smart assistant did just that: recite the author's words.
It was a strange experience. As I noted at the time, "the words are there, but it's delivered in that flat, halting Alexa voice ... It's particularly noticeable when you come across a section of poetry or a song; what Alexa isn't able to replicate is the proper cadence of Tolkien's written words." It was easy to cue up while doing chores, but it wasn't a performance that was keeping me engaged with the text.
When I switched over to Andy Serkis's narration, the book came alive: it was a far more engaging and enjoyable experience.
This isn't a perfect example, but it's an illustrative one: Alexa is an all-purpose voice, while Apple appears to have put a lot more effort into refining their product, releasing voices that are better suited for various genres.
On its site, Apple provides some sample voices. They're good, much better than Alexa, but even listening to those (admittedly brief) examples, it's clear these aren't performances but recitations. Undoubtedly, the technology will continue to improve. Wong said she felt that companies like Apple should label these types of products as such: "narrated by Artificial Intelligence," she suggests. "These AI narrators should not be given a name, as though they were actually human."
Apple does seem to be taking a measured approach here: as Bayern Agenda, Aleph Extraction, and Nova Incident author and journalist Dan Moren points out in Six Colors, the company has limited its program to specific genres, appears to be aiming this toolset specifically at indie authors and publishers, and its rights don't preclude authors from commissioning other audio editions. It also doesn't appear that this is a program where you can pop in a text file and have it spit out an audiobook instantly: Apple says that it'll take one to two months to perform quality checks.
But the cork is out of the bottle, and one company's restraint lasts only as long as it deems it useful: this technology (as well as a whole host of other apps) is out in the world and people are finding uses for it.
In his op-ed, Piesing spoke with DeepZen co-founder and CEO Taylan Kamis, who broke the backlog into hard numbers: there's an enormous gulf between the volumes of print and audiobooks. It's a compelling argument up to the point where you realize that the reason that readers go to books and stories is because of the humanity wrapped up in the words and paper and glue, and that those economic arguments won't remain constant. This isn't a technology that is designed to fill a backlog of books: it's one designed to cut the human experience out of the arts.
The fight over the role that artificial intelligence plays in the arts highlights a longstanding argument exacerbated by the development of these tools: artistic expression against the need to fill an online shopping cart with a product. These AI products are designed to replicate that human expression, creating an endless feedback loop of empty words. The arguments made by the most fervent tech advocates ignore the complexities, the training, and the experience of artists and creators, not recognizing that it's that journey that helps bring us to the stories that stay with us the longest. Audiobook narrators are additive to that art: they provide the performance, the nuance, and direction for those characters in ways that have even surprised their authors.
Those nuances, charms, experiences, and depths are all things that artificial intelligence will have a difficult time replicating, if it can be done at all. If we lose the humanity at the core of the arts, what is it that we are consuming, if it's just content?
Publishing is a low-margin business that's spent decades working to make writing and publishing a book into an efficient process, and major publishers have sought to merge into even larger entities to try and bring more efficiencies to it. The Department of Justice brought Penguin Random House and Simon & Schuster's proposed merger to a stop because it didn't buy the publishers' arguments that such efficiencies would be good for authors, all while workers at HarperCollins have been on strike for weeks over the low pay within the industry.
A technical process that threatens to cut out the role of a trained and talented performer and the technical support that comes with an audiobook production is an efficiency that publishers will not ignore. The drive for efficiency makes me think that in the future, we'll have a publishing industry where human narrators will be an expense reserved for only the top tier of talent, the biggest and surest sellers, while the rest of the midlist will be relegated to a robotic narration.
If the cork is out of the bottle, where do we go from here? Wong notes that AI won't go away, and that there are valid points when it comes to AI being used as a tool to help with equity and accessibility. "But I think it might be the death of the human spirit when we have robots telling us stories," she said.
AI should remain a tool, she explains: "The number of times I've been prepping a script to narrate and thought, 'Why can't I just listen to the audiobook of this!' Duh. But seriously, if I could multitask while listening to a script I need to perform, that could be useful. I could see that also being useful for audiobook producers."
Publishers wield an enormous amount of influence over the direction that this industry will take, she explains. Audible has a "humans-only" policy for narrators, something that Wong says she hopes it will maintain. "Readers and publishers can insist on human narrators. Readers, contact the publishers of audiobooks that you listen to, and let them know that you support human narrators. It's always good to get reinforcement. This society revolves around money, and it's always good to know that there are customers out there for a quality product, even if it does cost more."
Humanity is at the heart of our culture of storytelling and performance, and centering artificial intelligence into that tradition feels antithetical to its entire purpose. We're drawn to stories to be entertained, to learn, and to feel. We don't come to audiobooks to hear a cheap copy of a performance: we want to hear the stories come to life.