Very cool!
Are you willing to share more technical details?
- Which data sources do you ingest?
- How do you transform and enrich the data? How does your pipeline look?
- What are your key challenges?
- Which tools do you use? What is your 'stack'? (Stanza, wordfreq, Whisper, wn, ...)
Background: I am currently building a multi-lang vocabulary hub for language learning. The goal is to match core words/lemmas to their senses/concepts, and then be able to generate multi-language flash cards.
I am still stuck on the sense alignment and fingerprinting (example: should 'to shop', 'einkaufen', 'alışveriş yapmak' and 'go shopping' point to the same concept of 'shop'?), but at a later stage I want to allow user submissions and data enrichment for IPA, pictograms [1] and audio.
[1] https://arasaac.org/pictograms/search
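To make the alignment question concrete, here is a minimal sketch in plain Python, not a real wordnet integration. The concept IDs (`c_shop_buy`, `c_shop_place`), the `Concept` class and the helper names are all made up for illustration; a real system would more likely key concepts on an existing synset or interlingual ID. It only shows the data shape: many (language, lemma) pairs pointing at one concept, with a lemma allowed to point at several concepts.

```python
from dataclasses import dataclass, field


def normalize(lemma: str) -> str:
    """Case-fold and collapse whitespace so ' alışveriş yapmak' matches 'alışveriş yapmak'."""
    return " ".join(lemma.casefold().split())


@dataclass
class Concept:
    concept_id: str  # hypothetical ID, invented for this example
    gloss: str
    lemmas: dict[str, set[str]] = field(default_factory=dict)  # language -> lemmas

    def add(self, lang: str, lemma: str) -> None:
        self.lemmas.setdefault(lang, set()).add(normalize(lemma))


def build_index(concepts: list[Concept]) -> dict[tuple[str, str], set[str]]:
    """Map (language, lemma) to the set of concept IDs it can express."""
    index: dict[tuple[str, str], set[str]] = {}
    for concept in concepts:
        for lang, lemmas in concept.lemmas.items():
            for lemma in lemmas:
                index.setdefault((lang, lemma), set()).add(concept.concept_id)
    return index


def same_concept(index, a: tuple[str, str], b: tuple[str, str]) -> bool:
    """True if the two (language, lemma) pairs share at least one concept."""
    ids_a = index.get((a[0], normalize(a[1])), set())
    ids_b = index.get((b[0], normalize(b[1])), set())
    return bool(ids_a & ids_b)


# Toy data: the verb sense and the noun sense are kept as separate concepts.
shop_buy = Concept("c_shop_buy", "to buy goods in stores")
shop_buy.add("en", "to shop")
shop_buy.add("en", "go shopping")
shop_buy.add("de", "einkaufen")
shop_buy.add("tr", " alışveriş yapmak")

shop_place = Concept("c_shop_place", "a place where goods are sold")
shop_place.add("en", "shop")
shop_place.add("de", "Laden")

index = build_index([shop_buy, shop_place])
print(same_concept(index, ("en", "go shopping"), ("de", "einkaufen")))
print(same_concept(index, ("en", "shop"), ("de", "einkaufen")))
```

Whether 'to shop' and the noun 'shop' should share a concept is exactly the open design question; the sketch keeps them apart, but merging them would only change the toy data, not the index.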
Use-case (the dream): I come back from language class, I input new vocab and I output new Anki cards that work across all my fluent languages.
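A rough sketch of the card-generation end of that use-case, assuming the translations for one concept are already known. Anki can import tab-separated text files, so the sketch just emits one front/back row per ordered language pair. The function names and the `[lang]` prefix on each side are illustrative choices, not part of any existing tool.

```python
import csv
import io
from itertools import permutations


def make_cards(translations: dict[str, str]) -> list[tuple[str, str]]:
    """Build one (front, back) card for every ordered pair of languages."""
    return [
        (f"[{src}] {translations[src]}", f"[{dst}] {translations[dst]}")
        for src, dst in permutations(translations, 2)
    ]


def to_tsv(cards: list[tuple[str, str]]) -> str:
    """Serialize cards as tab-separated text, one card per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(cards)
    return buffer.getvalue()


# Toy input: one concept in three languages gives 3 * 2 = 6 directed cards.
cards = make_cards({"en": "to shop", "de": "einkaufen", "tr": "alışveriş yapmak"})
tsv = to_tsv(cards)
print(tsv)
```

The number of cards grows quadratically with the number of languages, so for many languages it may be worth generating only the directions the learner actually wants to drill.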
Currently, I mostly find myself knee-deep in problems of linguistics, NLP, Python and getting an LLM to do exactly what I want. At the same time it is a super fun project, and really makes me feel the joy of programming again. LLMs are magic, time just flies by, and all the random projects I always wanted to do suddenly materialize.
For coding, I mostly use free Gemini and some deepseek-v4-flash via openrouter to keep tight oversight and understand the problem space. Maybe this slows me down, but agentic coding just does not suit me. Overall, I haven't spent more than 2 € in total.
So far, surprisingly, the biggest problem is the lack of high-quality, free input data (example: English has the Oxford 5000 words as core vocabulary, but it is difficult to find the same for e.g. Turkish).
2nd place is the lack of high-quality synsets/wordnets (cross-language coverage is mostly incomplete), and 3rd place is getting LLMs to reliably play to their strengths (on paper, an LLM is the perfect tool to provide multi-lang sense equivalents).
I plan to do a full writeup sometime, but first I need it to work :)
Thanks! As far as I understand, your idea is to start from the word and pull examples from some huge data source. My approach is the other way round: I start from a source (the audio that you want to learn from), and the tool extracts only the words that appear in it, with their meaning in that context. I think that hugely simplifies the implementation, and it is more useful for learners: they learn the meaning in a particular context.
As for the stack: STT with Soniox (word-level timestamps), then spaCy for segmentation, POS and lemmas, then AI enrichment, which corrects the lemma when spaCy is wrong. Some languages have no spaCy model at all and others are unreliable, so for those I am trying to do the spaCy part with an LLM instead. Plus some extra magic for Japanese and Chinese.
Awesome, and yes, totally makes sense -- you are more learner-centric that way.
Having the full sentence context is actually one of the things I have been thinking about a lot -- it helps both the learner and the POS detection in Stanza. I always decided against it, because I wanted to build context-agnostic flash cards.
However, since your approach allows on-the-fly generation of flash cards, you always stay close to the learner's progress. I could, for example, pick some Gutenberg fairy tales, let the learner read them in their target language, and provide bi- and omni-directional translations across all languages. Creating flash cards from the source material keeps the learner in context, lets them learn new words step by step (discovery), and provides a fun learning experience with measurable progress. Similarly, instead of fairy tales, we could use a series in combination with its subtitles, which would allow progress through video. Awesome x2!
Sidenote: The awesome part about HN is that I get to chat with like-minded people and directly pick up some new inspiration. I should probably visit some in-person hacker spaces :)