Human vs AI audio: Quality, cost, and time comparison

Rachel Handley 29.Mar.2022

With demand for spoken-word audio on the up, 80% of media leaders are investing more in digital audio this year. If you're making the move into spoken-word audio, one of the main things you'll need to consider is human vs AI audio.

While publishers like Zetland, The Economist, and Harvard Business Review have seen success with human voice-over, publishers including Berlingske, The Japan Times, and Media24 are engaging audiences with synthetic speech. Some, like The Washington Post, use a mixture of both.

Human voices can be more engaging, but this comes at a huge cost, one that's often unviable. With the quality of synthetic speech catching up to, and in some ways surpassing, human voice-over, many find that the balance has tipped in AI's favor, especially when it comes to audio articles.

In this article, I'm going to compare human vs text-to-speech audio production in terms of quality, cost, and time, to help you make an informed decision.

Quality

Human-read audio is generally considered more personable and engaging than AI audio because people can add more emphasis and emotion to their speech. They can also manually check pronunciations and make thoughtful decisions on delivery.

“With audio, it’s even more personal than text, and we see more opportunities there because it’s a more intimate way of consuming journalism.”—Ernst-Jan Pfauth, The Correspondent CEO

However, if you haven't had training in narration or voice acting, delivering clear and engaging speech yourself can be difficult. The sound can also be compromised by your recording environment and equipment.

Hiring a voice actor and professional recording studio will give the highest-quality results, but this can be time-consuming and expensive. You may also have issues with achieving a consistent brand voice, because you will be relying on the availability of the voice actor.

Another drawback with human-read audio is a lack of flexibility. Switching between multiple languages or voices means hiring and managing multiple speakers. This compromises your ability to choose the best voice for each piece of content you're producing. It's also impractical to edit human-read audio after publishing.

“You can’t have somebody producing a new audio version of one article every time it’s updated. But with [...] synthetic language, there’s hardly any additional cost to production at all.”—Andy Webb, head of product for the voice and artificial intelligence team at the BBC

AI audio offers more consistency and reliability, as well as flexibility. With BeyondWords, you can easily update what's being said and switch between 500+ voices across 130+ language locales.

There's even the option to create custom AI voices. This means you can clone your own voice, or the voice of a person on your team, to give audio a more personal touch. Or, you can work with a voice actor to create a unique and engaging brand voice.

“The synthetic voice we developed with BeyondWords handles local names better than anything we’ve heard before. It’s much more engaging to listen to a voice that sounds like our brand.” — Kelly Anderson, Deputy Site Editor at News24

The quality of your text-to-speech audio will depend largely on the AI voice itself. As well as having the option to create a custom voice, our users get access to hyper-realistic premade voices.

“[AI voices] are of such good quality that it’s kind of hard to distinguish [them] from human voices. Particularly for news articles, they are a really good solution to the audio problem.”—Paddy Logue, digital editor at The Irish Times

But the voice isn't the only thing that matters. AI voices sound better on BeyondWords because we use natural language processing algorithms to convert your text into Speech Synthesis Markup Language (SSML). This reduces the risk of pronunciation errors and allows for custom text-to-speech rules.
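
To make the SSML step concrete, here is a minimal Python sketch of one kind of custom text-to-speech rule: replacing a written term with a spoken alias using the standard SSML `<sub>` element. It is an illustration only, not BeyondWords's actual pipeline; the `RULES` table and the `to_ssml` function are hypothetical names.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pronunciation rules: written form -> spoken alias.
RULES = {
    "SSML": "S S M L",
    "CMS": "content management system",
}


def to_ssml(text: str, rules: dict[str, str]) -> str:
    """Wrap plain text in <speak> and mark rule matches with <sub alias="...">."""
    if not rules:
        return f"<speak>{escape(text)}</speak>"

    # Match whole words only, trying longer terms first.
    terms = sorted(rules, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")

    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text so characters like & stay valid XML.
        parts.append(escape(text[last:match.start()]))
        word = match.group(1)
        parts.append(f"<sub alias={quoteattr(rules[word])}>{escape(word)}</sub>")
        last = match.end()
    parts.append(escape(text[last:]))

    return "<speak>" + "".join(parts) + "</speak>"
```

For example, `to_ssml("Connect your CMS", RULES)` wraps "CMS" in a `<sub>` tag so that a speech engine reads it as "content management system" while the written text stays unchanged.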

Cost

Human-read audio is traditionally expensive to produce. Of course, the cost will vary significantly depending on scale, how much work you do yourself, and how much you want to invest in quality.

Technically, you can use your own voice, your phone or computer, and free software to create audio for nothing. You'll just need to account for your time and perhaps pick up some new skills. However, podcast producer Jeff Large explains that individual podcasters can spend $2,000+ on equipment, $799+ on software, and $99+ on hosting alone.

Businesses that want to hire a podcast production team are looking at $1,000 to $15,000 per episode. To give some further context, if you are hiring a voice actor to read a 750-word article, Voices.com recommends budgeting $749.
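
As a rough way to scale these figures, the sketch below turns the Voices.com estimate into a per-word rate and applies it to a publishing schedule. It assumes cost grows linearly with word count, which real voice-actor pricing may not; the volume of 20 articles a month is a hypothetical example.

```python
VOICE_ACTOR_FEE_USD = 749  # Voices.com budget for one article, quoted above
REFERENCE_WORDS = 750      # length of that article in words


def human_narration_cost(words: int) -> float:
    """Estimated voice-actor cost in USD, assuming linear per-word pricing."""
    return words * VOICE_ACTOR_FEE_USD / REFERENCE_WORDS


def monthly_cost(articles_per_month: int, words_per_article: int) -> float:
    """Estimated monthly voice-actor spend for a fixed publishing schedule."""
    return articles_per_month * human_narration_cost(words_per_article)


if __name__ == "__main__":
    # Hypothetical schedule: 20 articles of 750 words each per month.
    print(f"${monthly_cost(20, 750):,.0f} per month")
```

Under these assumptions, narration works out at just under $1 per word, so 20 articles of 750 words would cost $14,980 a month.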

AI audio is relatively inexpensive. It requires significantly less time and expertise to produce, meaning staffing costs are far lower. And there's no need for specialist equipment—you just need to budget for the text-to-speech software.

Time

Human-read audio is typically time-consuming to produce. To give you an idea, every hour of audiobook audio requires around 3.5 hours of recording and editing. Even if you're not going down the DIY route, you're likely to spend a significant amount of time hiring, briefing, and managing your team.
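
To put that ratio in terms of a single article, here is a small Python sketch. It assumes the 3.5:1 audiobook ratio also applies to article narration and that narration runs at about 150 words per minute; both are assumptions for illustration, not figures from a specific production.

```python
PRODUCTION_RATIO = 3.5   # hours of recording and editing per finished hour (from above)
WORDS_PER_MINUTE = 150   # assumed narration pace


def production_minutes(words: int,
                       wpm: float = WORDS_PER_MINUTE,
                       ratio: float = PRODUCTION_RATIO) -> float:
    """Estimated recording-plus-editing time, in minutes, for a script."""
    finished_minutes = words / wpm  # length of the finished audio
    return finished_minutes * ratio


if __name__ == "__main__":
    print(production_minutes(750))
```

With these numbers, a 750-word article yields about 5 minutes of audio and takes roughly 17.5 minutes to record and edit, before any time spent on hiring, briefing, or review.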

As well as increasing costs, this slows down your production turnaround times. So, it's extremely difficult to produce time-sensitive audio, such as news article narration — especially at scale.

"If all of a sudden the audio appears in some articles, but not all of them it can cause frustration. We want to give our readers the ability to play every article."—Karl Oskar Teien, director of product at Aftenposten

For example, The New York Times publishes a 20-minute news podcast called The Daily five times a week. It takes a team of fifteen to produce. While this includes time for researching and writing the stories, it demonstrates the huge amount of resources required for timely human-read audio.

AI audio production, on the other hand, can be fully automated. If you connect your CMS to BeyondWords, audio versions of your articles are created automatically and almost instantly. This means you can make your content listenable without adding a single step to your or your team's publishing workflow.

Otherwise, it's just a case of pasting your script into our Text-to-Speech Editor and hitting "Publish". You can spend a little extra time reviewing and editing your text-to-speech audio to achieve the best possible quality, but this should still take a fraction of the time required for human-read audio.

Summary

	Human-read audio	AI audio
Pros	More personable and emotive; less chance of mispronunciations	Typically cheaper to produce; easy to edit; huge choice of voices; consistent and reliable; fast production speeds; less time-consuming
Cons	Can be expensive; difficult to edit; lack of flexibility with voices; reliance on speakers' availability; slow production speeds; more time-consuming	Less personable and emotive; more chance of mispronunciations

Many believe that the human vs AI audio debate comes down to a trade-off between quality and cost, but this may not be the case anymore.

Advancements in synthetic speech mean it's more engaging than ever, particularly when you create a custom voice or use a voice actor's AI voice. Virtual voice-over also gives you more consistency, reliability, and flexibility.

And AI voices are getting better all the time. This, combined with sophisticated and affordable tooling, is making audio publishing not just viable, but unignorable. Publishers who don't cater to listening needs and preferences will soon get left behind.

Ready to make the move into AI audio? Contact our team to arrange a demo.
