One of my first jobs was as the programmer/IT/graphics guy at a newspaper. Everybody there was required to use em-dashes properly and regularly, and followed other esoteric rules from the Associated Press Stylebook that also regularly appear in LLM output.
This highlights just how much unlicensed copyrighted material is in LLM training sets (whether you consider that fair use or not).
> This highlights just how much unlicensed copyrighted material is in LLM training sets (whether you consider that fair use or not).
Is there any licensed copyrighted material in their original training sets? AFAIK, they just scraped it all regardless of the license.