I’ve spent the last seven months building a tool I wish I’d had in my previous roles. MimicScribe is a macOS menu bar app that fits the "AI notetaker" category. It has accurate on-device speaker identification (a first possibly?), real-time meeting talking points for discovery calls, and a fully keyboard- and voice-driven interface.
I believe the accuracy of the speaker ID system is its biggest strength. I used fluid audio’s port of (https://github.com/fluidInference/FluidAudio) Pyannote's community-1 as a base. To improve accuracy, the system uses grammar structure cues from the Parakeet STT to mask by sentence. By taking a second set of samples within that mask for cluster assignment, it leverages the fact that most people don’t finish each other's… sandwiches in business meetings. It tends to slightly oversegment, as I’ve found it much easier to merge segments or reassign a speaker than it is to untangle an incorrect merge. https://github.com/MimicScribe/benchmarks/blob/main/diarizat...
The app provides in-meeting talking points using a prompt tuned for discovery type calls. It can suggest probing questions to help you extract more detail or helps you refocus on the big picture with “magic wand” type questions (e.g. “how would your ideal system work”). Getting low latency models to provide novel, relevant, and totally not hallucinated information is a bit of a reach and it tends to restate the transcript frequently but little gems do come from it sometimes so it’s best to think of it as a source of inspiration and be a vigilant gatekeeper.
It’s set up so recording can be started and ended via holding a keyboard shortcut instead of connecting to your calendar service. I prefer this for privacy and to keep transcript history from getting cluttered. Tapping the shortcut shows and hides an always-on-top overlay on your active screen regardless of whether you have other apps full-screen or not. Beyond simple navigation, you can also use voice commands to make post-meeting corrections or additions, for instance, you can simply say "merge this speaker with that speaker" to clean up the transcript.
It also has push-to-talk/dictate functionality with LLM cleanup - what the app started as but that tool was developer catnip, soo many of them.
A developer friend who’s worked in finance reviewed the site and said he’d bounce because the privacy story wasn’t strong enough so I added a completely on-device mode and a bring-your-own-key option. Using cloud models does add a lot to the experience, including context aware speaker merging and fragment cleanup, summary items during meetings, action items attributed, etc. On-device mode is completely free and the speaker identification is still very useful.
The privacy story is my biggest worry with the app, particularly since its target audience is more technical people. I’d love to get people's thoughts on it and any feedback would be super helpful.
I think the privacy story is super important, I like the use of local models as the fix here and understand that the output wouldn't be as good as frontier models. Perhaps look into venice for more private inference.
Hey Marshall! Cool to see this coming together, kudos for buildimg the tool you wish you had, thats the right reason to do things!
it seems like these “realtime meeting assistant / transcriber” services have taken a huge leap closer to being what I too have have often found myself wishing for. (Recently I gave Hedy AI a shot, very much in the same neighborhood functionally feels like)
out of curiosity, for Mimic’s Local Mode, whatre the tech specs required for a reasonable level of performance?
Matt, Thanks!
I just tried Hedy, same concept, also a great tool. It's todos are nice.
MimicScribe works well with any Apple silicon Mac so it'll feel snappy on an M1 with 8GB of RAM even. It uses Apple's on-device ML accelerator, the ANE. https://mimicscribe.app/docs/performance
Looks great! Feature suggestion: would be great to plug in ollama for the AI parts. Not as great as a BYOK, but worth it to those who want to keep everything local
Hey thanks! Good call. I had limited success getting Qwen 3.5 9B working for some of the longer prompts that require lots of json output. I feel like completely on-device is so close to being usable for this stuff though. I should revisit this, actually.
running locally is great, but I would wonder about the system requirements (I love my mac air but ollama on 8GB with an Intel i5 probably would be a little too much for the little thing). But having a toggle option would be great
Right? I feel like local model support is something that has strong ideological appeal and might influence someones feeling towards the app, but when it comes down to it most people will probably just use a cloud model for larger tasks unless they have beastly hardware. It's like how I have an Android in part because I might one day flash the ROM.
Ollama just exposes models via an OpenAI compatible endpoint though (I'm pretty sure), so adding that standard is probably a good idea. The prompts are a bit tuned for Gemini. I'd have to test how much that matters.
Somehow I expected from the headline it identifies loudspeakers by their sound signature and got really curious :)
lol that'd be a trick. I'd have it purposely misidentify to cheaper brands to mess with people.