When is it okay to train an LLM on scraped data? Robin Sloan:

If an AI application delivers some profound public good, or even if it might, it’s probably okay that its value is rooted in this unprecedented operationalization of the commons.

If an AI application simply replicates Everything, it’s probably not okay. […]

I think the case of code is especially clear, and, for me, basically settled. That’s both (1) because of where code sits in the creative process, as an intermediate product, the thing that makes the thing, and (2) because the commons of open-source code has carried the expectation of rich and surprising reuse for decades. I think this application has, in fact, already passed the threshold of “profound public good”: opening up programming to whole new groups of people.

Robin sets aside two common stances on this issue: the comparison with human learning (unconvincing, since machine learning happens at a completely different speed and scale), and the call for even stronger copyright protections (doubling down on a flawed and ineffective tool for promoting the production of culture).