Kokoro 82M – A Free Alternative to Elevenlabs’ Text-to-Speech (TTS)
Introduction
Looking for a powerful, open-source alternative to Elevenlabs for generating human-like audio from text? Meet Kokoro 82M, a text-to-speech (TTS) model based on StyleTTS-2 that is quickly becoming the next big thing in the TTS world.
Unlike Elevenlabs—a paid, API-based service—Kokoro is a free, open-source model that you can run locally on a computer or self-host in the cloud. It has made waves in the TTS arena since its release, showing potential to replace Elevenlabs for many users who need cost-effective, locally hosted solutions.
In this article, we compare Kokoro and Elevenlabs in depth so you can decide whether this open-source solution is right for you. Let’s get started!
Update (1st Feb 2025):
Before this article was even indexed, hexgrad released a newer version of Kokoro 82M: v1.0.
v1.0 supports 8 languages (counting American and British English as one) and 54 voices.
You can now run Kokoro v1.0 locally with hexgrad's kokoro Python package, which you can install with pip install kokoro. Under the hood, kokoro uses misaki, a G2P (grapheme-to-phoneme) library. If misaki cannot convert a word to phonemes, it falls back to espeak-ng.
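To make the "dictionary first, fallback second" idea concrete, here is a minimal sketch of that pattern. It does not use misaki's actual API: the LEXICON entries, the espeak_stub function and the g2p helper are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical pronunciation lexicon (word -> phonemes). Real G2P libraries
# such as misaki ship much larger dictionaries plus rules; these are illustrative.
LEXICON: Dict[str, str] = {
    "hello": "həˈloʊ",
    "world": "wɜːld",
}


def espeak_stub(word: str) -> str:
    """Hypothetical stand-in for an espeak-ng call.

    A real fallback would invoke espeak-ng; here the word is only tagged
    so you can see when the fallback path was taken.
    """
    return f"<espeak:{word}>"


def g2p(text: str, fallback: Callable[[str], str] = espeak_stub) -> List[str]:
    """Convert text to phonemes word by word, using the fallback on lexicon misses."""
    phonemes: List[str] = []
    for raw in text.lower().split():
        word = raw.strip(".,!?;:")  # drop surrounding punctuation
        if not word:
            continue
        # Primary lookup first; only call the fallback when the word is unknown.
        phonemes.append(LEXICON.get(word) or fallback(word))
    return phonemes


print(g2p("Hello, Kokoro world!"))
```

Here "Kokoro" is not in the toy lexicon, so it is the only word routed to the fallback.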
Languages Supported by v1.0
🇺🇸 American English: 11F 9M
🇬🇧 British English: 4F 4M
🇯🇵 Japanese: 4F 1M
🇨🇳 Mandarin Chinese: 4F 4M
🇪🇸 Spanish: 1F 2M
🇫🇷 French: 1F
🇮🇳 Hindi: 2F 2M
🇮🇹 Italian: 1F 1M
🇧🇷 Brazilian Portuguese: 1F 2M
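As a quick sanity check, the counts in the list above add up to the headline numbers. The only assumption is that "8 languages" treats American and British English as a single language.

```python
# Voice counts per language from the list above: (female, male).
VOICES = {
    "American English": (11, 9),
    "British English": (4, 4),
    "Japanese": (4, 1),
    "Mandarin Chinese": (4, 4),
    "Spanish": (1, 2),
    "French": (1, 0),
    "Hindi": (2, 2),
    "Italian": (1, 1),
    "Brazilian Portuguese": (1, 2),
}

total_voices = sum(female + male for female, male in VOICES.values())

# Merge the two English variants into one language.
languages = {"English" if name.endswith("English") else name for name in VOICES}

print(f"{len(languages)} languages, {total_voices} voices")
```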
What is Kokoro? The Basics
Kokoro is an 82-million-parameter TTS model (~350 MB in size) designed to convert text into audio. Its architecture is based on StyleTTS-2, and it was trained for roughly 500 GPU hours on an A100 80GB machine, at a cost of about $400, using fewer than 100 hours of audio data.
Despite its short training time and small size, Kokoro quickly rose to become one of the best open-source TTS models, taking the #1 spot in open-source TTS rankings during its first week of release. In rankings that also include proprietary models, Kokoro placed #4, while Elevenlabs held #1.
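The size and cost figures quoted for Kokoro are easy to sanity-check with a little arithmetic. This sketch assumes the weights are stored as 32-bit floats (4 bytes per parameter); that storage format is our assumption, not something stated in the figures.

```python
params = 82_000_000
bytes_per_param = 4  # assumption: fp32 weights
size_mb = params * bytes_per_param / 1e6  # megabytes (10^6 bytes)

gpu_hours = 500
total_cost_usd = 400
cost_per_gpu_hour = total_cost_usd / gpu_hours  # implied A100 rental rate

print(f"~{size_mb:.0f} MB of fp32 weights")
print(f"~${cost_per_gpu_hour:.2f} per GPU hour")
```

Roughly 328 MB of raw fp32 weights is in the same ballpark as the ~350 MB quoted, and the training budget implies about $0.80 per A100 hour.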
Elevenlabs vs. Kokoro: Key Differences
Let’s break down how Kokoro compares to Elevenlabs in major areas:
- Pricing
  - Elevenlabs: A subscription-based model that costs $11/month for the Creator plan (~100 minutes of audio) and $99/month for the Pro plan (~600,000 characters, enough for audiobooks). It also offers voice cloning.
  - Kokoro: Free and open-source. You can run it locally on a MacBook M2 Pro or other hardware and convert entire books into audio at no extra cost. For example, converting a 600,000-character audiobook takes about 2 hours locally.
- Language Support
  - Elevenlabs: Supports 32+ languages, making it one of the most versatile TTS services out there.
  - Kokoro: Only 6 languages at the time of writing: American English, British English, French, Korean, Japanese and Mandarin. This is a clear limitation, although v1.0 (see the update above) now supports 8 languages.
- Voice Customization
  - Elevenlabs: Lets users create new, unique voice clones with high realism.
  - Kokoro: Voice cloning isn’t supported yet, but given its hackable nature and active community, it could be added in later updates.
- Ease of Use
  - Elevenlabs: Its hosted API makes it easy to integrate TTS into projects, even for less technical users.
  - Kokoro: You need to set it up locally or on a cloud platform, which makes it better suited to developers who want flexibility or a low-cost solution.
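To put the pricing figures side by side, here is the arithmetic using only the numbers quoted in the comparison. It assumes the full Pro quota is actually used and ignores hardware and electricity costs on the Kokoro side.

```python
# Figures quoted in the comparison above.
elevenlabs_pro_usd = 99          # Pro plan, per month
elevenlabs_pro_chars = 600_000   # characters included in the Pro plan
kokoro_local_hours = 2           # ~2 hours for a 600,000-character book locally

# Elevenlabs: price per 1,000 characters if the whole quota is used.
usd_per_1k_chars = elevenlabs_pro_usd / (elevenlabs_pro_chars / 1000)

# Kokoro: implied local throughput for the same book.
chars_per_second = elevenlabs_pro_chars / (kokoro_local_hours * 3600)

print(f"Elevenlabs Pro: ${usd_per_1k_chars:.3f} per 1,000 characters")
print(f"Kokoro (local): ~{chars_per_second:.0f} characters per second")
```

In other words, one month of the Pro plan buys roughly what Kokoro produces locally in about two hours, at about $0.165 per 1,000 characters.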
Running Kokoro Locally (Code Example)
Curious about how Kokoro works in practice? Below is an example of how to run Kokoro inference locally with Python, using the v0.19 weights.
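```python
# Assumes the v0.19 files from hexgrad's Kokoro-82M repository
# (models.py, kokoro.py, kokoro-v0_19.pth and the voices/ folder)
# are in the current working directory.
import torch
import sounddevice as sd

from models import build_model  # helper from the v0.19 repo
from kokoro import generate     # local kokoro.py from the v0.19 repo, not the v1.0 pip package

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model weights
MODEL = build_model("kokoro-v0_19.pth", device)

# Voice names: first letter = accent (a = American, b = British),
# second letter = gender (f = female, m = male)
VOICE_NAME = [
    "af",  # Default voice is a 50-50 mix of Bella & Sarah
    "af_bella",
    "af_sarah",
    "am_adam",
    "am_michael",
    "bf_emma",
    "bf_isabella",
    "bm_george",
    "bm_lewis",
    "af_nicole",
    "af_sky",
][0]

# Load the voice pack (style tensor) for the chosen voice
VOICEPACK = torch.load(f"voices/{VOICE_NAME}.pt", weights_only=True).to(device)
print(f"Loaded voice: {VOICE_NAME}")

text = "What is the meaning of life?"

# lang is the accent code ("a" or "b") taken from the voice name;
# out_ps holds the phonemes that were synthesized
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])

# Kokoro produces 24 kHz audio
sd.play(audio, samplerate=24000)
sd.wait()  # Wait until the audio is finished playing
```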
Run this script on your local machine (or a free Google Colab GPU) to try Kokoro out.
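If you would rather save the result than play it, here is a minimal sketch that writes the audio to a 16-bit WAV file using only Python's standard library. It assumes audio is a 1-D sequence of floats roughly in the range [-1.0, 1.0] at 24 kHz, the sample rate passed to sd.play above; the save_wav helper is hypothetical and not part of Kokoro.

```python
import array
import sys
import wave
from typing import Iterable

SAMPLE_RATE = 24_000  # Kokoro's output sample rate, as used with sd.play


def save_wav(path: str, samples: Iterable[float], sample_rate: int = SAMPLE_RATE) -> int:
    """Write mono float samples to a 16-bit PCM WAV file; return frames written."""
    pcm = array.array("h")  # signed 16-bit integers
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip so the int conversion cannot overflow
        pcm.append(int(s * 32767))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return len(pcm)
```

With the script above, save_wav("output.wav", audio) would replace the two sounddevice lines.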
Real-World Use Cases for Kokoro
Personalized Audiobooks
- Kokoro’s small size and hackable nature let the community build practical applications on top of it. One such tool, Audiblez, lets users generate audiobooks from e-books locally.
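Audiobook tools have to feed a long book to the model in pieces. The sketch below shows one generic way to do that, grouping whole sentences into chunks under a character budget. It is not Audiblez's actual code; the chunk_text helper and the 400-character default are hypothetical choices for illustration.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 400) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own chunk.
    """
    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            current = sentence  # start a new chunk with this sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio concatenated in order.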
Interactive Chatbots
- With the right pipeline, Kokoro can be part of a real-time conversational system. For example:
  - User audio is transcribed into text.
  - A large language model (LLM) generates a response.
  - Kokoro converts the generated response into audio.
- Tools like Weebo demonstrate this by combining Whisper, Llama, and Kokoro into a real-time speech-to-speech chatbot.
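The transcribe, respond and synthesize steps can be sketched as a simple sequential pipeline. This is not how Weebo is implemented: the VoiceChatPipeline class and the fake_* functions are hypothetical placeholders marking where Whisper, an LLM and Kokoro would plug in. A real-time system would also stream audio rather than process whole turns at once.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Audio = Sequence[float]


@dataclass
class VoiceChatPipeline:
    transcribe: Callable[[Audio], str]   # speech -> text (e.g. Whisper)
    respond: Callable[[str], str]        # text -> reply text (e.g. an LLM)
    synthesize: Callable[[str], Audio]   # reply text -> speech (e.g. Kokoro)

    def turn(self, user_audio: Audio) -> Audio:
        """Run one conversational turn: audio in, audio out."""
        user_text = self.transcribe(user_audio)
        reply_text = self.respond(user_text)
        return self.synthesize(reply_text)


# Hypothetical stubs so the pipeline runs without any models.
def fake_transcribe(audio: Audio) -> str:
    return "what is the meaning of life"


def fake_respond(text: str) -> str:
    return f"You asked: {text}"


def fake_synthesize(text: str) -> List[float]:
    return [0.0] * len(text)  # one silent sample per character, for illustration


pipeline = VoiceChatPipeline(fake_transcribe, fake_respond, fake_synthesize)
reply_audio = pipeline.turn([0.0] * 16_000)
print(f"Generated {len(reply_audio)} samples")
```

Passing the stages in as plain functions keeps each component swappable, so replacing one model does not touch the rest of the loop.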
Strengths and Limitations
Why Kokoro Stands Out
- Open-source freedom: Completely free, self-hostable, and customizable to your needs.
- Cost-effective: Ideal for users who want to generate a lot of audio without paying subscription fees.
- Active community: Users are already building tools like Audiblez and Weebo, extending its capabilities.
Current Limitations
- Smaller language pool (6 languages vs 32+)
- No voice cloning yet, which could be a dealbreaker for those relying on custom voices.
- Requires moderate technical setup, which may be harder for beginners than a plug-and-play API like Elevenlabs.
The Future of Kokoro
The Kokoro community is already preparing for the next wave of updates. At the time of writing, hexgrad had only released the weights for Kokoro v0.19. A newer version, v0.23, was in training; its weights are not open, but you can try that model through hexgrad's Hugging Face Space.
Inspired by its sci-fi namesake (the AI from Terminator Zero), Kokoro is an AI designed to help humans. It’s just getting started, and with community help it could become the foundation of many TTS projects to come.
Conclusion – Should You Use Kokoro?
If you want a cheap, hackable text-to-speech solution you can run locally, Kokoro is a great alternative to Elevenlabs. It’s missing some features (like voice cloning and broader language support), but because it’s open source, you can experiment with it and adapt it to your needs right out of the box.
For hobbyists, developers, or even audiobook creators, Kokoro offers a powerful introduction to the TTS world—without breaking the bank.
Want to try Kokoro right now?
- Run it on a free Google Colab GPU [Link Here].
- Visit our AI company’s site for more AI-related posts and updates!