Is bigger better?

Recent work in Natural Language Processing (NLP) indicates that the bigger your model is, the better performance you will get. Kaplan et al. show that loss scales as a power law with model size, dataset size, and the amount of compute used for training.

Kaplan, Jared, et al. “Scaling laws for neural language models.” arXiv preprint arXiv:2001.08361 (2020).
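
To make that claim concrete, here is a minimal sketch of the power-law form the paper describes. The constants are roughly the values Kaplan et al. report for non-embedding parameter count, but they are used here purely for illustration, not as an authoritative reimplementation of their fits.

```python
# Sketch of the power-law scaling form from Kaplan et al. (2020):
# loss is predicted as L(x) = (x_c / x)**alpha, where x is model size,
# dataset size, or compute. The constants below are approximately the
# values reported for non-embedding parameters; treat them as illustrative.

def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """Predicted loss for a resource budget x (parameters, tokens, or FLOPs)."""
    return (x_c / x) ** alpha

# Under this form, doubling the model size shrinks the predicted loss by a
# fixed multiplicative factor of 2**(-alpha), no matter where you start.
for n_params in (1e8, 1e9, 1e10):
    loss = power_law_loss(n_params, x_c=8.8e13, alpha=0.076)
    print(f"N={n_params:.0e}  predicted loss ~ {loss:.2f}")
```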

Other factors, such as model architecture or hyperparameter choice, appear to only offset these power-law curves by a small amount without changing their slope. Even the revolutionary change from LSTMs to Transformers is shown to shift the power-law trend only by a constant term, as the toy example below illustrates.
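
Here is a toy sketch of what a "constant offset" means (the prefactors are invented, not taken from the paper): two power laws that share an exponent differ by a fixed amount in log space, so on a log-log plot they are parallel lines.

```python
import math

# Toy illustration: two hypothetical architectures following power laws with
# the same exponent but different prefactors. In log-log space they are
# parallel lines -- a constant vertical offset, not a change in slope.
alpha = 0.076          # shared exponent, i.e. the slope in log-log space
c_transformer = 1.0    # hypothetical prefactor for a Transformer
c_lstm = 1.3           # hypothetical prefactor for an LSTM (higher loss)

for n in (1e7, 1e8, 1e9):
    gap = math.log(c_lstm * n ** -alpha) - math.log(c_transformer * n ** -alpha)
    print(f"N={n:.0e}  log-loss gap = {gap:.3f}")  # same gap at every N
```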

A more recent paper shows that these trends hold for a wide variety of other deep learning tasks, such as image and video generation.

Henighan, Tom, et al. “Scaling laws for autoregressive generative modeling.” arXiv preprint arXiv:2010.14701 (2020).

For fields with virtually infinite amounts of data (like text or images), these results mean that the key to success will be computing power. This does not bode well for the future of small deep learning groups trying to achieve state-of-the-art performance in these fields.

But what does it mean for deep learning applied to bioscience? I think that in most cases in bioscience the main bottleneck is the availability of data. In my opinion, this translates into two things: it reiterates the importance of collecting lots of high-quality data, and, if the amount of data is fixed, it suggests that improving the methods may be the best we can do for now. If you have made it this far, I hope this was not a complete waste of your time.
