When is it okay to train an LLM on scraped data? Robin Sloan:
If an AI application delivers some profound public good, or even if it might, it’s probably okay that its value is rooted in this unprecedented operationalization of the commons.
If an AI application simply replicates Everything, it’s probably not okay. […]
I think the case of code is especially clear, and, for me, basically settled. That’s both (1) because of where code sits in the creative process, as an intermediate product, the thing that makes the thing, and (2) because the commons of open-source code has carried the expectation of rich and surprising reuse for decades. I think this application has, in fact, already passed the threshold of “profound public good”: opening up programming to whole new groups of people.
Robin sets aside two common stances on this issue: the comparison with human learning (unreasonable, since machine learning happens at a completely different speed and scale), and the call for even stronger copyright protections (doubling down on a flawed and ineffective tool for promoting the production of culture).