GPT-3 (let’s define it as the autocomplete tool by OpenAI, trained on a huge amount of unlabeled text from the internet) is quite impressive, comparable to what happened to AI image recognition from 2012 onward. We can safely ignore the hype: it’s probably a dead end on the road to artificial general intelligence (see its performance on the Turing Test), and I doubt it’s going to replace developers. But as an autocomplete, at guessing common-sense or trivia questions, it’s a leap forward. Here I am asking Alexa for the third time to lower the volume, and this thing can almost handle a conversation.

Anyway, because of how it works (you give it some text, a prompt, and it guesses what comes next), in recent weeks the internet has been inundated with things people have made GPT-3 do. There’s even a course in creative writing taught by GPT-3 (which is probably as valid as most creative writing courses).
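To make the mechanism concrete: here’s roughly what prompting GPT-3 looks like through OpenAI’s beta API. A minimal sketch in Python, where the API key is a placeholder and the prompt and parameters are illustrative, not taken from any of the examples making the rounds:

```python
import openai  # OpenAI's Python client, as it worked during the GPT-3 beta

openai.api_key = "YOUR_API_KEY"  # placeholder

# You give it some text (the prompt) and it guesses what comes next.
response = openai.Completion.create(
    engine="davinci",  # the largest GPT-3 engine available at the time
    prompt="Q: How many legs does a spider have?\nA:",
    max_tokens=16,
    temperature=0.7,   # higher values make the continuations more inventive
)
print(response.choices[0].text.strip())
```

All the stunts mentioned above are variations on that single call, with a cleverer prompt.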

Like all good AI, GPT-3 never admits to not knowing an answer; it would rather make stuff up. Weird stuff, sometimes: nonsense, but nicely written nonsense. It might not make sense, but at least it’s syntactically correct. It’s an idea machine, and quite a funny one. Here’s one of its replies, when Arram Sabeti prompted it to write an essay on human intelligence:

I propose that intelligence is the ability to do things humans do. […] The brain is a very bad computer, consciousness is a very bad idea.

Facial-recognition algorithms require computing power that today’s smartphones, however capable, don’t have. Many companies (Google among them) upload photos online and analyze them remotely, with deep-learning algorithms running in the cloud.

On its Machine Learning Journal, Apple’s Computer Vision team recounts the obstacles it had to overcome to run face analysis on the device. iCloud encrypts photos locally before uploading them to its servers, so there is nowhere they can be analyzed except on the device:

We faced several challenges. The deep-learning models need to be shipped as part of the operating system, taking up valuable NAND storage space. They also need to be loaded into RAM and require significant computational time on the GPU and/or CPU. Unlike cloud-based services, whose resources can be dedicated solely to a vision problem, on-device computation must take place while sharing these system resources with other running applications. Finally, the computation must be efficient enough to process a large Photos library in a reasonably short amount of time, but without significant power usage or thermal increase.
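One standard answer to the storage problem the quote opens with is model compression. This is not Apple’s pipeline, just a generic sketch of the idea: post-training quantization with TensorFlow Lite, using MobileNetV2 as a stand-in vision model:

```python
import tensorflow as tf

# A stand-in vision network; random weights, since only its size matters here.
model = tf.keras.applications.MobileNetV2(weights=None)

# Post-training quantization stores weights in 8 bits instead of 32,
# cutting the on-disk footprint roughly 4x for on-device deployment.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Quantized model: {len(tflite_model) / 1e6:.1f} MB on disk")
```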

Google describes how it’s trying to teach computers to describe images: to get its search engine to recognize an image’s subject and setting, and to summarize them in words:

We’ve developed a machine-learning system that can automatically produce captions to accurately describe images the first time it sees them. […] A picture may be worth a thousand words, but sometimes it’s the words that are most useful — so it’s important we figure out ways to translate from images to words automatically and accurately. As the datasets suited to learning image descriptions grow and mature, so will the performance of end-to-end approaches like this.
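The system described is an encoder-decoder: a convolutional network condenses the image into a feature vector, and a recurrent network decodes that vector into a sentence. A minimal Keras sketch of that idea, with made-up sizes and a wiring of my own choosing (Google’s published model differs in the details, for instance in how the image features enter the decoder):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative sizes, not Google's configuration.
VOCAB_SIZE = 10_000  # caption vocabulary
EMBED_DIM = 256      # word / image embedding size
UNITS = 512          # LSTM hidden size
MAX_LEN = 20         # maximum caption length

# Encoder: a CNN turns the image into a fixed-length feature vector.
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg",
                                        weights=None)  # random weights: a sketch
image_in = layers.Input(shape=(299, 299, 3))
features = layers.Dense(EMBED_DIM, activation="relu")(cnn(image_in))

# Decoder: an LSTM generates the caption word by word, conditioned on the
# image by using the projected features as its initial state.
caption_in = layers.Input(shape=(MAX_LEN,), dtype="int32")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_in)
h0 = layers.Dense(UNITS)(features)
c0 = layers.Dense(UNITS)(features)
hidden = layers.LSTM(UNITS, return_sequences=True)(embedded,
                                                   initial_state=[h0, c0])
logits = layers.Dense(VOCAB_SIZE)(hidden)  # a word distribution at each step

model = Model([image_in, caption_in], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```

Trained on image-caption pairs, the decoder learns to predict each next word of the caption given the image; generating a caption then means feeding words back in one at a time.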