With demand for spoken-word audio on the rise, 80% of media leaders are investing more in digital audio this year.¹ If you're making the move into spoken-word audio, one of the first decisions you'll face is whether to use human or AI narration.
While publishers like Zetland, The Economist, and Harvard Business Review have seen success with human voice over², publishers including Berlingske, The Japan Times, and Media24 are engaging audiences with synthetic speech. Some, like The Washington Post, use a mixture of both.³
Human voices can be more engaging, but they come at a cost that is often prohibitive. With the quality of synthetic speech catching up to, and in some ways surpassing, human voice over, many find the balance has tipped in AI's favor, especially for audio articles and newsletters.
Human-read audio is generally considered more personable and engaging than AI audio, because narrators can bring more emphasis and emotion to their speech. They can also manually check pronunciations and make thoughtful decisions about delivery.
“With audio, it’s even more personal than text, and we see more opportunities there because it’s a more intimate way of consuming journalism.” — Ernst-Jan Pfauth, The Correspondent CEO⁴
However, if you haven't had training in narration or voice acting, delivering clear and engaging speech yourself can be difficult. The sound can also be compromised by your recording environment and equipment.
Hiring a voice actor and a professional recording studio will give the highest-quality results, but this can be time-consuming and expensive. It can also be hard to maintain a consistent brand voice, since you'll be relying on the availability of your voice actor(s).
Another drawback with human-read audio is a lack of flexibility. Switching between multiple languages or voices means hiring and managing multiple speakers. This compromises your ability to choose the best voice for each piece of content you're producing. It's also impractical to edit human-read audio after publishing.
“You can’t have somebody producing a new audio version of one article every time it’s updated. But with [...] synthetic language, there’s hardly any additional cost to production at all.” — Andy Webb, head of product for the voice and artificial intelligence team at the BBC⁵
AI audio offers more consistency and reliability, as well as flexibility. With BeyondWords, you can easily update what's being said and switch between 500+ voices across 130+ language locales.
There's even the option to create custom voices. This means you can clone your own voice, or the voice of a person on your team, to give audio a more personal touch. Or, you can work with a voice actor to create a unique and engaging brand voice.
“The synthetic voice we developed with BeyondWords handles local names better than anything we’ve heard before. It’s much more engaging to listen to a voice that sounds like our brand.” — Kelly Anderson, Deputy Site Editor at News24
While AI audio is not yet as personable and emotive as human speech, progress is being made. And in certain cases, it's hard to tell the difference. Just listen to this example, which compares voice actor Joe Coen's natural voice with his AI voice clone, 'Joe':
The quality of your text-to-speech audio will depend largely on the AI voice itself. As well as having the option to create a custom voice, our users get access to an AI voice library featuring voices from Amazon Polly, Yandex, Microsoft Azure, and Google Cloud. Subscribers can also use exclusive voices like 'Joe', which are created in collaboration with voice actors.
“[AI voices] are of such good quality that it’s kind of hard to distinguish [them] from human voices. Particularly for news articles, they are a really good solution to the audio problem.” — Paddy Logue, digital editor at The Irish Times.⁶
But the voice isn't the only thing that matters. AI voices sound better on BeyondWords because we use natural language processing algorithms to convert your text into Speech Synthesis Markup Language (SSML). This reduces the risk of pronunciation errors and allows for custom text-to-speech rules.
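To make the idea concrete, the conversion step can be sketched as plain text being wrapped in SSML, with pronunciation rules applied along the way. The rules and the simple string substitution below are illustrative assumptions for this sketch, not BeyondWords' actual pipeline (a real system would tokenize the text rather than do naive replacement):

```python
import html

# Hypothetical pronunciation rules for this sketch: each tricky term is
# mapped to an alias the speech engine should read aloud instead.
RULES = {
    "AI": "A.I.",
    "News24": "News twenty-four",
}

def to_ssml(text: str) -> str:
    """Convert plain text to minimal SSML, wrapping known terms in <sub>
    tags so the voice reads the alias rather than the literal spelling."""
    escaped = html.escape(text)
    for term, alias in RULES.items():
        escaped = escaped.replace(
            html.escape(term),
            f'<sub alias="{html.escape(alias)}">{html.escape(term)}</sub>',
        )
    return f"<speak><p>{escaped}</p></speak>"

print(to_ssml("News24 uses AI narration."))
# <speak><p><sub alias="News twenty-four">News24</sub> uses
# <sub alias="A.I.">AI</sub> narration.</p></speak>
```

SSML also supports `<phoneme>`, `<break>`, and emphasis tags, which is where per-publisher text-to-speech rules like the ones mentioned above would hook in.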
Human-read audio is traditionally expensive to produce. Of course, the cost will vary significantly depending on scale, how much work you do yourself, and how much you want to invest in quality.
Technically, you can use your own voice, your phone or computer, and free software to create audio for nothing. You'll just need to account for your time and perhaps pick up some new skills. However, podcast producer Jeff Large explains that individual podcasters can spend $2,000+ on equipment, $799+ on software, and $99+ on hosting alone.
Businesses that want to hire a podcast production team are looking at $1,000 to $15,000 per episode.⁷ To give some further context, if you are hiring a voice actor to read a 750-word article, Voices.com recommends budgeting $749.⁸
AI audio is relatively inexpensive. It requires significantly less time and expertise to produce, meaning staffing costs are far lower. And there's no need for specialist equipment — you just need to budget for the text-to-speech software.
At BeyondWords, we have a pricing plan for every publisher. Free users get access to all our core features and can convert up to 30,000 characters (approximately 6,000 words) into audio every month. If you use our Pro plan to the max, it works out at around $0.23 per 750 words.
Human-read audio is typically time-consuming to produce. To give you an idea, every hour of audiobook audio requires around 3.5 hours of recording and editing.⁹ Even if you're not going down the DIY route, you're likely to spend a significant amount of time hiring, briefing, and managing your team.
As well as increasing costs, this slows down your production turnaround times, making it extremely difficult to produce time-sensitive audio, such as news article narration, at scale.
For example, The New York Times publishes a 20-minute news podcast five times a week, called The Daily. It takes a team of fifteen to produce.¹⁰ While this includes time for researching and writing the stories, it demonstrates the huge amount of resources required for timely human-read audio.
“If all of a sudden the audio appears in some articles, but not all of them, it can cause frustration. We want to give our readers the ability to play every article.” — Karl Oskar Teien, director of product at Aftenposten¹¹
AI audio production, on the other hand, can be fully automated. If you connect your CMS to BeyondWords, audio versions of your articles are created automatically and almost instantly. This means you can make your content listenable without adding a single step to your or your team's publishing workflow.
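At its core, this kind of automation is just a CMS "publish" event being mapped onto a text-to-speech request. The endpoint, project ID, and field names in this sketch are hypothetical, invented for illustration, and are not BeyondWords' documented API:

```python
import json

# Hypothetical TTS endpoint for this sketch (not a real BeyondWords URL).
TTS_ENDPOINT = "https://api.example.com/projects/123/audio"

def build_audio_request(article: dict) -> dict:
    """Map a CMS 'article published' webhook event onto a
    text-to-speech creation request (illustrative field names)."""
    return {
        "title": article["title"],
        "body": article["body_html"],
        "language": article.get("language", "en"),
    }

# A CMS would send an event like this when an article goes live.
event = {"title": "Market update", "body_html": "<p>Shares rose today.</p>"}
payload = json.dumps(build_audio_request(event))
# An automation service would now POST `payload` to TTS_ENDPOINT,
# and the returned audio would be attached to the published article.
```

The point of the sketch is that once this mapping exists, every published article gets an audio version with no manual step in the editorial workflow.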
Otherwise, it's just a case of pasting your script into our Text-to-Speech Editor and hitting 'Publish'. You can spend a little extra time reviewing and editing your text-to-speech audio to achieve the best possible quality, but this should still take a fraction of the time required for human-read audio.
Many believe that the human vs AI audio debate comes down to a trade-off between quality and cost, but this may not be the case anymore.
Advancements in synthetic speech mean it's more engaging than ever — particularly when you create a custom voice or use a voice actor's AI voice. Virtual voice over also gives you more consistency, reliability, and flexibility.
And AI voices are getting better all the time. This, combined with sophisticated and affordable tooling, is making audio publishing not just viable, but unignorable. Publishers who don't cater to listening needs and preferences will soon get left behind.