I've been using this to make audiobooks out of various philosophy books I've been wanting to read, for accessibility reasons, and I ran into a critical problem: if the input text fed to Kokoro is too long, it starts skipping words in the middle or at the end, or fading out at the end. abogen chunks the text it feeds to Kokoro by sentence, so sentences of arbitrary length are fed to Kokoro without any guarding, which produces unusable audiobooks for me. I'm now "vibe coding" my own Kokoro-based tkinter personal GUI app for the same purpose that uses nltk and some regex magic for better splitting.
I use "kokoro-tts" CLI, which has better chunking/splitting.
https://github.com/nazdridoy/kokoro-tts
It generates a directory of audio files, along with a metadata file for ebook chapters.
You have to use m4b-tool to stitch the audio files together into an audiobook and include the chapter metadata, but it works great:
https://github.com/sandreas/m4b-tool
I've been meaning to write a post on this workflow because it's incredibly useful
I'll look into this! But I have to say I'm a bit attached to the little app I've ended up having AI make for myself lol. It's so cute, and it's mine!
Hey, can you share an example book or text so I can test it?
Regarding "abogen chunks the text it feeds to Kokoro by sentence", that's not quite correct; it actually splits the subtitles by sentence, not the chunks sent to Kokoro.
This might be happening because the "Replace single newlines with spaces" option isn’t enabled. Some books require that setting to work correctly. Could you try enabling it and see if it fixes the issue?
> Hey, can you share an example book or text so I can test it?
I was running into issues with this one: https://theanarchistlibrary.org/library/kevin-carson-studies... and this one: https://files.libcom.org/files/Accelerate%20-%20Robin%20Mack... (converted to plain text using MinerU; I double-checked to make sure the text was clean).
> Regarding "abogen chunks the text it feeds to Kokoro by sentence", that's not quite correct, it actually splits subtitles by sentence, not the chunks sent to Kokoro.
Ah, that's odd. So I don't know why abogen'd be doing the weird fading out and skipping words thing then when my tool (https://github.com/alexispurslane/kokoro-audiobook-reliable/) isn't.
> This might be happening because the "Replace single newlines with spaces" option isn’t enabled. Some books require that setting to work correctly. Could you try enabling it and see if it fixes the issue?
I tried that, as well as doing it myself, and it didn't seem to help.
I just can't stand how non-deterministic many deep learning TTSes are. At least the classical ones have predictable pronunciation which can be worked around if needed.
You could try implementing a character count limit per chunk instead of sentence-based splitting. A hybrid approach that breaks at sentence boundaries but enforces a maximum chunk size of ~150-200 characters would likely solve the word-skipping issue while maintaining natural speech flow.
That's precisely what I'm doing. I'm splitting by sentences, and then for each sentence that's still too long, I split them by natural breakpoints like colons, semicolons, commas, dashes, and conjunctions, and if any of /those/ are still too long, I then break by greedy-filling words. Then I do some fun manipulation on the raw audio tensors to maintain flow.
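For anyone curious, the cascade described above can be sketched roughly like this. This is just an illustrative sketch, not the actual implementation: the 200-character limit is a made-up number, the breakpoint regex is simplified, and a plain regex stands in for nltk's sent_tokenize.

```python
import re

MAX_CHARS = 200  # illustrative limit; tune for whatever your TTS tolerates

# Secondary "natural breakpoint" pattern: after colons, semicolons,
# commas, or around dashes (a simplified stand-in for the real rules)
BREAKS = re.compile(r'(?<=[:;,])\s+|\s+[-\u2013\u2014]\s+')

def split_long(text, max_chars=MAX_CHARS):
    """Split one overlong sentence at natural breakpoints,
    falling back to greedy word-filling for anything still too long."""
    if len(text) <= max_chars:
        return [text]
    chunks = []
    for part in BREAKS.split(text):
        if len(part) <= max_chars:
            chunks.append(part)
        else:
            # Greedy fill: pack words into a chunk until the limit is hit
            cur = ""
            for w in part.split():
                if cur and len(cur) + 1 + len(w) > max_chars:
                    chunks.append(cur)
                    cur = w
                else:
                    cur = f"{cur} {w}" if cur else w
            if cur:
                chunks.append(cur)
    return chunks

def chunk_text(text, max_chars=MAX_CHARS):
    """Sentence-split first (regex stand-in for nltk.sent_tokenize),
    then subdivide any sentence that exceeds the limit."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    out = []
    for s in sentences:
        out.extend(split_long(s, max_chars))
    return [c for c in out if c]
```

Each resulting chunk is short enough to synthesize reliably, and the audio-tensor manipulation mentioned above would then happen when concatenating the per-chunk outputs.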