I think maybe you're misunderstanding the issue here. I have loads of data, but I'm unwilling to send it to 3rd parties, so that leaves me with gathering/generating the training data locally, but none of the models are good/strong enough for that today.
I'd love to "send them to go looking for stuff for you", but local models aren't great at this today, even with beefy hardware, and since that's about my only option, that leaves me unable to get sessions to use for the fine-tuning in the first place.
Right, that's exactly the situation I'm in too and "send them to go looking for stuff for you" without it going off the rails is the problem we've been working on.
Basically you need a squad of specialized models to do this in a mostly-structured way that ends up looking kind of like a crawling or scraping/search operation. I can share a stack of about 5-6 that are working for us directly if you want, I want to keep the exact stack on the DL for now but you can check my company's recent github activity to get an idea of it. It's basically a "browser agent" where gemma or qwen guide the general navigation/summarization but mostly focus on information extraction and normalization.
The other thing I've done, which obviously not everybody is going to want to do, is create emails and browser profiles for the browser agent (since they basically work when I'm not on the computer, but need identity to navigate the web) and run them on devices that don't have the keys to the kingdom. I also give them my phone number and their own (via an endpoint they can only call me from). That way if they run into something they have a way to escalate it, and I can do limited steering out of the loop. Obviously this is way more work than is reasonable for most people right now though so I'm hoping to show people a proper batteries-included setup for it soon.
Edit: Based on your other comment, I think maybe what you're really looking for most are "personal traces". Right now that's something we're working on with https://github.com/accretional/chromerpc (which uses the lower level Chrome Devtools Protocol rather than Puppetteer to basically fully automate web navigation, either through an LLM or prescriptive workflows). It would be very simple to set up automation to take a screenshot and save it locally every Xm or in response to certain events and generate traces for yourself that way, if you want. That alone provides a pretty strong base for a personal dataset.
> that ends up looking kind of like a crawling or scraping/search operation
Sure, but what I'm talking about is that the current SOTA models are terrible even for specialized small use cases like what you describe, so you can't just throw a local modal on that task and get useful sessions out of them that you can use for fine-tuning. If you want distilled data or similar, you (obviously) need to use a better model, but currently there is none that provides the privacy-guarantees I need, as described earlier.
All of those things come once you have something suitable for the individual pieces, but I'm trying to say that none of the current local models come close to solving the individual pieces, so all that other stuff is just distraction before you have that in place.
Understood. I guess I'm saying "soon" but definitely agreed its not "now" yet. I will say though, with 96GB, in a couple months you're going to be able to hold tons of Gemma 4 LoRa "specialists" in-memory at the same time and I really think it will feel like a whole new world once these are all getting trained and shared and adapted en-masse. And also, you could set up personal traces now if you want. Nobody can make you, but in its laziest form it can be literally just taking screenshots of your screen periodically as you work, and that'll have applications soon
> And also, you could set up personal traces now if you want. Nobody can make you, but in its laziest form it can be literally just
But again, you're missing my point :) I cannot, since the models I could generate useful traces from are run by platforms I'm not willing to hand over very private data to, and local models that I could use I cannot get useful traces from.
And I'm not holding out hope for agent orchestration, people haven't even figured out how to reliably get high quality results from agent yet, even less so with a fleet of them. Better to realistically tamper your expectations a bit :)