Does their training corpus respect copyrights or do you have to follow their opt out procedure to keep them from consuming your data? Assuming it’s the latter, it’s open-er but still not quite there.
Does their training corpus respect copyrights or do you have to follow their opt out procedure to keep them from consuming your data? Assuming it’s the latter, it’s open-er but still not quite there.
Your question is addressed in opening abstract: https://github.com/swiss-ai/apertus-tech-report/raw/refs/hea...
> Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for copyrighted, non-permissive, toxic, and personally identifiable content.
Afaik they respect robots.txt on crawl and later when using the data they re-check the robots.txt and will exclude the data if the new robots.txt was updated to deny access. They have further data filtering bit for that you better check the technical report.