Today’s language models are trained on the entirety of the Internet.
Future models will have to train on exponentially more content (both higher quality and more diverse) to be exponentially better.
Getting an order-of-magnitude improvement will require groundbreaking innovations in how models are architected and trained. They will also need to incorporate causal reasoning and logic, moving beyond statistical correlation to understanding what information actually means.
For the more basic transformer models, synthetic data is going to play a part. Whoever can do this well stands to be in a lucrative position, albeit temporarily, due to commoditisation.
In the next few months, if it doesn’t already exist, a data marketplace will emerge, building a moat through economies of scale. Publishers are attracted because it’s essentially free money, revenue that was never factored into any of their models; they now understand the value of their users’ content. YouTube stands in a unique place with multimodal video, audio, and text.