There are thousands of deep learning papers; where to start? Here is a curated list of the greatest hits.
Year | Author/Title | Notes |
---|---|---|
1943 | Warren S. McCulloch and Walter Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity | Is a neural network a computing machine? McCulloch and Pitts are the first to model neural networks as an abstract computational system. They find that under various assumptions, networks of neurons are as powerful as propositional logic, sparking widespread interest in neural models of computation. |
1958 | Frank Rosenblatt, The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain | Can an artificial neural network learn? Rosenblatt proposes the Perceptron Algorithm, a method for iteratively adjusting variable weight connections between neurons to learn to solve a problem (a toy implementation of the update rule is sketched after the table). He raises funds from the U.S. Navy to build a physical Perceptron machine. In press coverage, Rosenblatt anticipates walking, talking, self-conscious machines. |
1959 | Jerome Lettvin, Humberto Maturana, Warren McCulloch and Walter Pitts, What the Frog's Eye Tells the Frog's Brain | Do nerves transmit ideas? Lettvin provocatively proposes that the frog optic nerve signals the presence of meaningful patterns rather than just brightness, demonstrating that the eye is doing part of the computational work of vision. Lettvin is also known for the thought experiment suggesting that your brain might contain a Grandmother Neuron that you use to conceptualize your grandmother. |
1959 | David H. Hubel and Torsten N. Wiesel, Receptive Fields of Single Neurones in the Cat's Striate Cortex | How does biological vision work? This paper and its 1962 extension kick off a 25-year collaboration in which Hubel and Wiesel methodically analyze the processing of signals through mammalian visual systems, developing many specific insights about the operation of the Visual Cortex that later inspire and inform the design of convolutional neural networks. They win the Nobel Prize in 1981. |
1969 | Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry | What cannot be learned by a perceptron? During the early 1960s, while Rosenblatt argues that his neural networks can do almost anything, Minsky counters that they can do very little. This influential book lays out the negative argument, showing that many simple problems such as maze-solving or even XOR cannot be solved by a single-layer perceptron network. The sharp critique leads to one of the first AI Winter periods, during which many researchers abandon neural networks. |
1972 | Teuvo Kohonen, Correlation Matrix Memories | Can a neural network store memories? Kohonen (and simultaneously Anderson) observes that a single-layer network can act as a matrix Associative Memory if keys and data are seen as vectors of neural activations, and if keys are linearly independent (a small numerical example appears after the table). Associative memory would become a major focus of neural network research in coming decades. |
1981 | Geoffrey E. Hinton, Implementing Semantic Networks in Parallel Hardware | How are concepts represented? Writing in a book on associative memory edited with Anderson, Hinton proposes that concepts should not be represented as single units, but as vectors of activations, and he demonstrates a scheme that encodes complex relationships in a distributed fashion. Distributed representation becomes a core tenet of the Parallel Distributed Processing (PDP) framework, advanced in a subsequent book by Rumelhart, McClelland, and Hinton (1986), and a central dogma in the understanding of large neural networks. |
1986 | David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, Learning Representations by Back-Propagating Errors | How can a deep network learn? Learning in multilayer networks was not widely understood until this paper's explanation of the Backpropagation method, which updates weights by efficiently computing gradients (a hand-derived example for a tiny network appears after the table). While Griewank (2012) notes that reverse-mode auto-differentiation was discovered independently several times, notably by Seppo Linnainmaa (1970) and by Paul Werbos (1981), Rumelhart's letter to Nature demonstrating its power to learn nontrivial representations gains widespread attention and unleashes a new wave of innovation in neural networks. |
1988 | Sara Solla, Esther Levin and Michael Fleisher, Accelerated Learning in Layered Neural Networks | What should deep networks learn? In three concurrent papers, Solla et al., John Hopfield (1987), and Eric Baum and Frank Wilczek (1988) describe the insight that neural networks should often compute log probabilities rather than arbitrary numerical scores, and that the Cross Entropy Objective is frequently more natural and more effective than squared error minimization (a comparison of the two losses is sketched after the table). (Exactly how much more effective remains an open area of research: see Hui 2021 and Golik 2013.) |
1989 | Yann Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel, Handwritten Digit Recognition with a Back-Propagation Network | Can a deep network learn to see? In a technical tour-de-force, Le Cun devises the Convolutional Neural Network (CNN) (inspired and informed by Hubel and Wiesel's biological studies), and demonstrates that backpropagation can train a CNN to accurately read handwritten zip code digits on U.S. mail. The work demonstrates the value of a good network architecture, and proves that deep networks can solve real-world problems. Also see Fukushima 1980 for an early variant of this idea. |
1989 | George Cybenko, Approximation by Superpositions of a Sigmoidal Function | What functions can a deep network compute? This paper proves that any continuous function on a finite domain can be approximated by a neural network to arbitrarily small error. Cybenko's reasoning is specific to the sigmoid nonlinearities popular at the time, but Hornik (1991) shows that the result can be generalized to essentially any ordinary nonlinearity, and that two layers are enough. Cybenko and Hornik's results show that networks with multiple layers are Universal Approximators, far more expressive than the single-layer perceptrons proposed in the 50s and 60s. |
1990 | Jeffrey L. Elman, Finding Structure in Time | Can a deep network learn language? Adopting a three-layer Recurrent Neural Network (RNN) architecture devised by Michael Jordan (1986), Elman trains an RNN to model natural language text, starting from letters. Strikingly, he finds that the network learns to represent the structure of words, grammar, and elements of semantics. |
1990 | Léon Bottou and Patrick Gallinari, A Framework for the Cooperation of Learning Algorithms | What is the right notation for neural network architecture? Bottou observes that the backpropagation algorithm allows an elegant graphical notation: instead of a graph of neurons, the network is written as a graph of computation modules that encapsulate vectorized forward and backward gradient computations. Bottou's modular idea is the basis for deep learning libraries such as Torch (Collobert 2002), Theano (Bergstra 2010), Caffe (Jia 2014), TensorFlow (Abadi 2016) and PyTorch (Paszke 2019). |
1991 | Léon Bottou, Stochastic Gradient Learning in Neural Networks | What optimization algorithm should be used? In his PhD thesis, Bottou observes that previously proposed learning algorithms such as the perceptron correspond to Stochastic Gradient Descent (SGD), and he argues that SGD scales better than more complex higher-order optimization methods (a minimal SGD loop is sketched after the table). Over the decades, Bottou is proved right, and variants of the simple SGD algorithm become the standard workhorse learning algorithm for neural networks. See Bottou (1998) and Bottou (2010) for newer discussions of SGD from Bottou, and also see Zinkevich (2003) for an elegant generalizable proof of convergence. |
1991 | Anders Krogh and John A. Hertz, A Simple Weight Decay Can Improve Generalization | How can overfitting be avoided? This paper analyzes and advocates Weight Decay, a simple regularizer originally proposed as Ridge Regression (Hoerl, 1970) that imposes a penalty on the square of the weights of a model (the SGD sketch after the table includes a weight-decay term). Krogh analyzes this trick in neural networks, demonstrating generalization gains in single-layer and multilayer networks. |
1997 | Sepp Hochreiter and Jürgen Schmidhuber, Long Short-Term Memory | How can long recurrences be stabilized? Iterating an RNN many times will invariably lead to vanishing or exploding gradients without special measures. This paper proposes the Long Short-Term Memory (LSTM) architecture, a gated but differentiable neural memory structure that can retain state over very long sequences while keeping the gradient stable. The LSTM architecture also inspires the Gated Recurrent Unit (GRU), a simpler alternative devised by Cho (2014). |
2003 | Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin, A Neural Probabilistic Language Model | Can a neural network model language at scale? This paper scales a nonrecurrent neural language model to a 15-million word training set, beating state-of-the-art traditional language modeling methods by a large margin. Rather than using a fully recurrent network, Bengio processes a fixed window of n words and devotes a network layer to learning a position-independent Word Embedding. |
2005 | Rodrigo Quian Quiroga, Leila Reddy, Gabriel Kreiman, Christof Koch and Itzhak Fried, Invariant Visual Representation by Single Neurons in the Human Brain | What do individual biological neurons do? In a series of remarkable experiments probing single neurons of human epilepsy patients, several Multimodal Neurons are found: individual neurons that are selectively responsive to very different stimuli that evoke the same concept. For example, one neuron responds to a written name, sketch, photo, or costumed figure of Halle Berry while not responding to other people, suggesting a simple physical encoding for high-level concepts in the brain. |
2005 | Geoffrey Hinton, What Kind of Graphical Model is the Brain? | Can networks be deepened like a spin glass? In the early 2000s, neural network research is focused on the problem of scaling networks deeper than three layers. A breakthrough comes from bidirectional-link models of neural networks inspired by spin-glass physics, like Hopfield Networks (Hopfield, 1982) and Restricted Boltzmann Machines (RBM) (Hinton, 1983). In 2005, Hinton shows that a stack of RBMs, trained greedily layer by layer as a Deep Belief Network, can learn many layers efficiently, and in 2006, Hinton and Salakhutdinov show that layers of autoencoders can be stacked if initialized by RBMs. |
2010 | Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio and Pierre-Antoine Manzagol, Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion | Can networks be deepened with unsupervised training? The search for simpler deep network initialization methods continues, and in 2010, Vincent finds an alternative to initialization by Boltzmann machines: train each layer as a Denoising Autoencoder that must learn to remove noise added to training data. That group also devises the Contractive Autoencoder (Rifai, 2011), in which a gradient penalty is incorporated into the loss. |
2010 | Xavier Glorot and Yoshua Bengio, Understanding the Difficulty of Training Deep Feedforward Neural Networks | Can networks be deepened with simple changes? Glorot analyzes the problems with ordinary feed-forward training and proposes Xavier Initialization, a simple random initialization that is scaled to avoid vanishing or exploding gradients. In a second important development, Nair (2010) and Glorot (2011) experimentally find that Rectified Linear Units (ReLU) work much better than the sigmoid nonlinearities that had previously been ubiquitous. These simple-to-apply innovations eliminate the need for complex pretraining, so that deep feedforward networks can be trained directly, end-to-end, from scratch, using backpropagation. |
2011 | Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu and Pavel Kuksa, Natural Language Processing (Almost) from Scratch | Can a neural network solve language problems? Previous work in natural language processing treats the problems of chunking, part-of-speech tagging, named entity recognition, and semantic role labeling separately. Collobert claims that a single neural network can do it all at once, using a Multi-Task Objective to learn a unified representation of language for all the tasks. They find that their network learns a satisfying word embedding that groups together meaningfully related words, but the performance claims are initially met with skepticism. |
2012 | Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks | Can a neural network do state-of-the-art computer vision? Krizhevsky shocks the computer vision community with a deep convolutional network that wins the annual ImageNet classification challenge (Deng, 2009) by a large margin. Krizhevsky's AlexNet is a deep eight-layer 60-million parameter convolutional network that combines the latest tricks such as ReLU and Dropout (Srivastava, 2014 and Hinton, 2012), and it is run on a pair of consumer Graphics Processing Units (GPUs). The superior performance on the high-profile large-scale benchmark sparks a sudden change in perspective towards neural networks in the ML community and an explosive resurgence of interest in deep network applications. |
2013 | Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean, Distributed Representations of Words and Phrases and their Compositionality | Does massive data beat a complex network? While excitement grows over the power of neural networks, Google researcher Mikolov finds that his simple (non-deep) skip-gram model (Mikolov, 2013a) can learn a good word embedding that outperforms other (deep) embeddings by a large margin if trained on a massive 30-billion word data set. This Word2Vec model exhibits Semantic Vector Composition for the first time. Google also trains an unsupervised model on YouTube image data (Le, 2011) using a Topographic Independent Component Analysis loss (Hyvärinen 2009), and observes the emergence of individual neurons for human faces and cats. |
2013 | Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller, Playing Atari with Deep Reinforcement Learning | Can a network learn to play a game from raw input? DeepMind proposes Deep Reinforcement Learning (DRL), applying neural networks directly to the Q-learning algorithm, and demonstrates that its Deep Q-Network (DQN) architecture, which predicts action values directly from raw screen observations, can learn joystick control well enough to play several Atari games better than humans. The work inspires many other DRL methods such as Deep Deterministic Policy Gradient (DDPG) (Lillicrap 2016) and Proximal Policy Optimization (PPO) (Schulman 2017), and touches off the development of Atari-capable RL testing environments like OpenAI Gym. |
2013 | Diederik P. Kingma and Max Welling, Auto-Encoding Variational Bayes | What should an autoencoder reconstruct? The Variational Autoencoder (VAE) casts the autoencoder as a variational inference problem, matching distributions rather than instances: it maximizes the Evidence Lower Bound (ELBO) of the likelihood of the data while minimizing the information in the stochastic latent, and uses a Reparameterization Trick to train a sampling process at the bottleneck (see the Doersch tutorial). VAEs take their inspiration from Hinton's 1995 Wake-Sleep algorithm, which attacks the same problem of learning a continuous latent variable model. Descendants such as Beta-VAE (Higgins 2017) can learn disentangled representations, and VQ-VAE (van den Oord 2017) can do state-of-the-art image generation. |
2013 | Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow and Rob Fergus, Intriguing Properties of Neural Networks | Do artificial neural networks have bugs? Using a simple optimization, Szegedy finds that it is easy to construct Adversarial Examples: inputs that differ imperceptibly from a natural input yet fool a deep network into misclassifying the image. The observation touches off discoveries of further attacks (e.g., Papernot 2017), defenses (Madry 2018) and evaluations (Carlini 2017). |
2014 | Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation | Can a CNN locate an object in a scene? Computer vision is concerned with not just classifying, but locating and understanding the arrangements of objects in a scene. By exploiting the spatial arrangement of CNN features, Girshick's R-CNN (and Faster R-CNN, Ren 2015) can identify not only the class of an object but also its location in a scene, via both bounding-box estimation and semantic segmentation. |
2014 | Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio, Generative Adversarial Nets | Can an adversarial objective be learned? A Generative Adversarial Network (GAN) is trained to imitate a data set by learning to synthesize examples that fool a second adversarial model simultaneously trained to distinguish real from generated data. The elegant method sparks a wave of new theoretical work as well as a new category of highly realistic image generation methods such as DCGAN (Radford 2016), Wasserstein GAN (Arjovsky 2017), BigGAN (Brock 2019), and StyleGAN (Karras 2019). |
2014 | Jason Yosinski, Jeff Clune, Yoshua Bengio and Hod Lipson, How Transferable are Features in Deep Neural Networks? | Can network parameters be reused in another network? Transfer Learning takes layers of a pretrained network to initialize a network that is trained to solve a different problem. Yosinski shows that such Fine-Tuning will outperform training a new network from scratch, and practitioners quickly recognize that initialization with a large Pretrained Model (PTM) is a way to get a high-performance network using only a small amount of training data. |
2014 | Matthew D. Zeiler and Rob Fergus, Visualizing and Understanding Convolutional Networks | Can people understand deep networks? One of the critiques of deep learning is that its huge models are opaque to humans. Zeiler tackles this problem by reviewing and introducing several methods for Deep Feature Visualization, which depicts individual signals within a network, and Salience Mapping, which summarizes the parts of the input that most influence the outcome of the complex computation. Zeiler's goal of Explainable AI (XAI) is further developed in feature optimization methods (Olah 2017), feature dissection (Bau 2017), and salience methods such as Grad-CAM (Selvaraju 2016) and Integrated Gradients (Sundararajan 2017). |
2014 | Ilya Sutskever, Oriol Vinyals and Quoc V. Le, Sequence to Sequence Learning with Neural Networks | Can a neural network translate human languages? Sutskever applies the LSTM architecture to English-to-French translation, combining an encoder phase with an autoregressive decoder phase. This demonstration of Neural Machine Translation does not beat the state-of-the-art machine translation methods of the time, but its competitive performance establishes the feasibility of the neural approach to translation, one of the classical grand challenges of AI. |
2015 | Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate | Can a network learn its own attention? While CNNs compare adjacent pixels and RNNs examine adjacent words, sometimes the most important data dependencies are not between adjacent elements. Bahdanau proposes a learned Attention model that can estimate which parts of the input are relevant to each part of the output (a sketch of the dot-product variant of attention appears after the table). This innovation dramatically improves the performance of neural machine translation, and the idea of using learnable attention proves effective for many kinds of data including graphs (Veličković 2018) and images (Zhang 2019). |
2015 | Diederik P. Kingma and Jimmy Lei Ba, Adam: A Method for Stochastic Optimization | What learning rate should be used? The Adam Optimizer adaptively chooses the step size for each parameter, taking smaller steps where the gradient varies more (the update rule is sketched after the table). Combining ideas from Momentum (Polyak 1964), Second-order optimization (Becker 1989), Adagrad (Duchi 2011), Adadelta (Zeiler 2012), and RMSProp (Tieleman 2012), the Adam optimizer proves very effective in practice, enabling optimization of huge models with little or no manual tuning. |
2015 | Sergey Ioffe and Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift | How can training gradients be stabilized? Even with clever initialization, signals in very deep ReLU networks eventually become very large or very small. Batch Normalization solves this problem by normalizing each neuron to have zero mean and unit variance within every training batch (a sketch of the forward pass appears after the table). This practical step yields huge benefits, improving training speed, network performance and stability, and enabling very large models to be trained. |
2015 | Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, Deep Residual Learning for Image Recognition | Can backpropagation succeed if there are a huge number of network layers? Analyzing the propagation of gradients, He proposes the Residual Network (ResNet) architecture, in which layers compute a vector to add to the signal rather than transforming the signal at each layer (a sketch of a residual block appears after the table). He also proposes Kaiming Initialization, a variant of Xavier initialization that takes nonlinearities into account. Together with batchnorm, these methods solve the depth problem, allowing networks to achieve state-of-the-art results with more than 100 layers. |
2015 | Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan, Show and Tell: A Neural Image Caption Generator | Can language and vision be related? This paper demonstrates that, despite the apparent disparities between modalities, neural representations for images and text can be directly connected. By simply attaching a vision network (a CNN) to a language network (an RNN), Vinyals demonstrates a system that can perform Image Captioning, generating accurate captions for a wide range of subjects after training on the MSCOCO dataset. |
2015 | Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan and Surya Ganguli, Deep Unsupervised Learning using Nonequilibrium Thermodynamics | Can a network learn by reversing the physics of diffusion? Inspired by Kingma's VAEs and Hinton's wake-sleep method as well as the dynamics of diffusion, Sohl-Dickstein proposes Diffusion Models, a latent variable framework that transforms Gaussian noise into a meaningful distribution by learning to reverse a diffusion process in many small steps. This method is later extended by Jonathan Ho (2020) to synthesize remarkably high quality images, superior to GANs, and that demonstration kicks off a wave of interest in using diffusion for image synthesis. See the tutorial paper from Calvin Luo (2022) for a detailed discussion. |
2016 | David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel and Demis Hassabis, Mastering the Game of Go with Deep Neural Networks and Tree Search | Can a neural network master the game of Go? Game-playing is one of the original domains used to demonstrate artificial intelligence capabilities. Yet while chess is conquered using traditional search methods in 1997, the game of Go is considered a far more subtle game, intuitive and impenetrable to brute-force computation. In this work by DeepMind, the AlphaGo system combines a CNN with traditional search methods to add the needed intuition, through a powerful board evaluation function learned through self-play. The system achieves master-level play and bests the champion Go player Lee Sedol in a five-game match. |
2017 | Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin, Attention is All You Need | Can attention replace recurrence? While applying Bahdanau's attention ideas to achieve state-of-the-art machine translation results, Vaswani discovers that the recurrent machinery itself is unnecessary and can be replaced by attention alone. The resulting architecture, the Transformer Network, proves to be a scalable and versatile way of dealing with sequence data, leading to popular architectures such as BERT, GPT, and T5. |
2017 | Phillip Isola, Jun-Yan Zhu, Tinghui Zhou and Alexei A. Efros, Image-to-Image Translation with Conditional Adversarial Networks | Can one network learn many image transformations? A wide class of image processing methods can be framed as a transformation from one image to another. In this work, Isola demonstrates that a single Pix2Pix architecture can be used across the problems of segmentation, image restyling, and sketch-guided image generation, by applying an adversarial GAN objective to train the generative network to create realistic images that match the target domain. While Pix2Pix relies on a paired dataset of before-and-after images, it inspires CycleGAN (Zhu, 2017), which learns to transform images from data that is not explicitly paired. |
2019 | Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever, Language Models are Unsupervised Multitask Learners | Can a network learn to write simply by reading? While the original transformer architecture required paired language translation text in order to train, Radford (2018) discards the encoder portion of the network to obtain a simple autoregressive language model, GPT, that can be trained on the simple task of predicting the next word in text. This paper scales the approach up on massive amounts of text as GPT-2, and GPT-2 and its successor GPT-3 exhibit emergent behavior such as the ability to solve a variety of tasks simply by Prompting the model with a natural-language request to answer a particular kind of question. |
2019 | Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Is there a universal encoding for language? While the traditional approach is to design a custom network for each particular language problem, this paper proposes the BERT architecture, which learns to encode text in a universal way. BERT is trained on a denoising task, learning to fill in missing words in text, and also learning to distinguish adjacent sentence pairs from unrelated sentence pairs. This unsupervised training scheme allows BERT to be scaled up and trained on a huge amount of text. BERT makes it straightforward to create high-quality language processing models for specialized tasks with only a small amount of data, by starting with a pretrained BERT and fine-tuning it for the task. |
2020 | Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi and Ren Ng, NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis | Can a neural network model the physics of light transport? While most neural models are inspired by functions of the human brain, neural networks can be applied to learn functions in other domains. In this work, Mildenhall demonstrates Neural Radiance Fields (NeRF), a use of neural networks to learn to compute the full light transport within a 3d scene, by following physical rules while learning to match the light observed in a handful of photographs. By modeling the amount of light at every location and direction in a volume, a NeRF model is able to solve difficult rendering problems such as depicting a photographed scene from a new viewpoint, or showing a scene with a new object added. |
2020 | Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu and Demis Hassabis, Improved protein structure prediction using potentials from deep learning | Can a neural network model the physics of protein structure? One of the grand challenges of computational chemistry is to predict the 3d structure of a protein from its amino acid sequence, because that structure is critical for understanding a protein's function. By training a convolutional neural network to predict residue distances on the 150,000 known protein conformations in the public Protein Data Bank, this team from DeepMind dramatically improves upon the state of the art in protein structure prediction; the neural approach is combined with other chemistry algorithms to create full 3d predictions. The team later applies its methods to all 200 million proteins in the UniProt database, contributing high-confidence predictions for essentially every protein known to biologists across a range of organisms, transforming the field of molecular biology. |
2021 | Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger and Ilya Sutskever, Learning Transferable Visual Models From Natural Language Supervision | Will the best image training data always be manually labeled? While huge text models such as BERT and GPT are trained without manual labels, the best training data in vision is still laboriously labeled by hand. This work changes the situation, demonstrating an image representation supervised by open-text image captions automatically scraped from the internet. CLIP applies Contrastive Learning on a massive 400 million captioned-image data set to jointly learn aligned image and text encodings, approaching state-of-the-art classification on a zero-shot test without any fine-tuning. CLIP establishes a new state-of-the-art image representation and is also an essential part of OpenAI's DALL-E text-to-image synthesis system. |
2022 | Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu and Mark Chen, Hierarchical Text-Conditional Image Generation with CLIP Latents | Can a neural net learn to draw? By applying diffusion models together with text supervision using CLIP representations, the DALL-E 2 method demonstrates a remarkable new state-of-the-art in Text-to-Image synthesis. The model demonstrates an uncanny ability to create images from a text description that are obviously novel yet also reasonable compositions of real ideas. The dramatic success of this approach inspires rapid development of several commercial as well as open-source projects for text-guided image synthesis that are practical for widespread deployment, such as Latent Diffusion Models (Rombach 2022), and these systems' capabilities begin to raise societal questions about the relationship and role of humans and AI in creative problem-solving. |
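
A few of the ideas above are compact enough to sketch in a handful of lines of NumPy. The sketches below are illustrative toys on made-up data, not reproductions of the original experiments; all variable names, dimensions, and hyperparameters are invented for clarity.

First, Rosenblatt's Perceptron Algorithm (1958): a minimal sketch of the update rule, assuming a single threshold unit with a bias, trained on a small linearly separable problem.

```python
# The perceptron learning rule on a toy linearly separable problem.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: label +1 above the line x0 + x1 = 0, label -1 below it.
X = rng.uniform(-1, 1, size=(500, 2))
X = X[np.abs(X.sum(axis=1)) > 0.3]          # keep a margin so convergence is quick
y = np.where(X.sum(axis=1) > 0, 1, -1)

w = np.zeros(2)                              # weights
b = 0.0                                      # bias

for epoch in range(100):
    errors = 0
    for xi, target in zip(X, y):
        prediction = 1 if xi @ w + b > 0 else -1
        if prediction != target:
            w += target * xi                 # Rosenblatt's update: move toward the
            b += target                      # misclassified example's correct side
            errors += 1
    if errors == 0:                          # stop once every example is classified
        break

preds = np.where(X @ w + b > 0, 1, -1)
print(f"stopped after {epoch + 1} epochs; training accuracy = {(preds == y).mean():.2f}")
```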
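
Kohonen's Correlation Matrix Memory (1972): a sketch assuming orthonormal (hence linearly independent) keys, under which a single weight matrix built from outer products recalls each stored value exactly.

```python
# A linear associative memory: store key-value pairs as a sum of outer products.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                            # dimensionality of keys and values

# Five orthonormal keys (rows) and five random values to associate with them.
keys = np.linalg.qr(rng.standard_normal((d, 5)))[0].T
values = rng.standard_normal((5, d))

# The entire memory is one weight matrix: M = sum_i outer(value_i, key_i).
M = values.T @ keys

# Recall: multiplying a stored key by M retrieves its associated value.
recalled = M @ keys[2]
print("recall error:", np.linalg.norm(recalled - values[2]))   # ~0 for orthonormal keys
```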
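
Backpropagation (Rumelhart, Hinton and Williams, 1986): a hand-derived sketch for a tiny two-layer sigmoid network learning XOR under a squared-error loss; the hidden width, learning rate, and step count are arbitrary illustrative choices.

```python
# Backpropagation written out by hand for a two-layer sigmoid network learning XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.standard_normal((2, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)), np.zeros(1)
lr = 1.0

for step in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: apply the chain rule layer by layer, from output to input.
    d_out = (out - y) * out * (1 - out)       # gradient at the output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)        # gradient at the hidden pre-activation

    # Gradient descent on every weight and bias.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(3).ravel())   # typically approaches [0, 1, 1, 0]
```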
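
The Cross Entropy Objective (Solla, Levin and Fleisher, 1988): a sketch comparing cross entropy with squared error on one made-up vector of logits, illustrating why the cross-entropy gradient stays informative when the network is confidently wrong.

```python
# Softmax cross-entropy versus squared error for a single 3-class prediction.
import numpy as np

logits = np.array([2.0, -1.0, 0.5])          # raw network outputs for 3 classes
target = np.array([1.0, 0.0, 0.0])           # one-hot label: the true class is 0

probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax turns logits into probabilities

cross_entropy = -np.sum(target * np.log(probs))        # -log p(true class)
squared_error = 0.5 * np.sum((probs - target) ** 2)

# For softmax + cross entropy, the gradient w.r.t. the logits is simply (probs - target),
# which stays large when the model is confidently wrong; the squared-error gradient is
# further multiplied by softmax derivatives and can become vanishingly small there.
print(cross_entropy, squared_error, probs - target)
```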
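
Stochastic Gradient Descent (Bottou, 1991) with Weight Decay (Krogh and Hertz, 1991): a sketch fitting a noisy linear model with minibatch SGD plus an L2 penalty; the data, batch size, and learning rate are invented for illustration.

```python
# Minibatch SGD with an L2 weight-decay penalty, fitting a noisy linear model.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 0.5])
X = rng.standard_normal((1000, 3))
y = X @ true_w + 0.1 * rng.standard_normal(1000)

w = np.zeros(3)
lr, weight_decay, batch_size = 0.05, 1e-3, 32

for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)    # sample a random minibatch
    xb, yb = X[idx], y[idx]
    grad = xb.T @ (xb @ w - yb) / batch_size          # gradient of the mean squared error
    grad += weight_decay * w                          # weight decay = gradient of the L2 penalty
    w -= lr * grad                                    # the SGD update itself

print(w.round(2))   # close to the true weights [2.0, -3.0, 0.5]
```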
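
Learned attention (Bahdanau, 2015): a sketch of the scaled dot-product form later popularized by the Transformer (Vaswani, 2017) rather than Bahdanau's original additive scoring network, applied to random queries, keys, and values.

```python
# Scaled dot-product attention over a short sequence (a single head).
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 16
queries = rng.standard_normal((seq_len, d))
keys = rng.standard_normal((seq_len, d))
values = rng.standard_normal((seq_len, d))

scores = queries @ keys.T / np.sqrt(d)                  # relevance of each position to each query
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)          # softmax: each row sums to 1
output = weights @ values                               # each output mixes the values it attends to

print(weights.round(2))   # the soft "alignment" between positions
print(output.shape)       # (5, 16)
```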
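
The Adam Optimizer (Kingma and Ba, 2015): the update rule from the paper, applied here to a toy quadratic objective with the commonly used default beta values.

```python
# The Adam update rule, minimizing the toy objective f(w) = sum((w - 3)^2).
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)            # gradient of the toy objective

w = np.zeros(4)
m = np.zeros_like(w)                  # first moment: running mean of gradients
v = np.zeros_like(w)                  # second moment: running mean of squared gradients
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)      # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step

print(w.round(3))   # converges to [3, 3, 3, 3]
```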
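
Batch Normalization (Ioffe and Szegedy, 2015): a sketch of the training-time forward pass for one layer; at test time the paper replaces the per-batch statistics with running averages, which is omitted here.

```python
# The batch-normalization forward pass for one layer's activations.
import numpy as np

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((32, 10))   # a training batch: 32 examples, 10 units

gamma = np.ones(10)     # learned scale
beta = np.zeros(10)     # learned shift
eps = 1e-5

mean = x.mean(axis=0)                       # per-unit statistics over the batch
var = x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)     # zero mean, unit variance within the batch
y = gamma * x_hat + beta                    # rescale so the layer keeps its expressive power

print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per unit
```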
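
A residual block (He et al., 2015): a sketch showing the skip connection, with weights scaled in the Kaiming style; real ResNet blocks use convolutions and batch normalization rather than the plain fully connected layers assumed here.

```python
# A residual block: the layers learn a correction F(x) that is added to the input.
import numpy as np

rng = np.random.default_rng(0)
d = 32
W1 = rng.standard_normal((d, d)) * np.sqrt(2.0 / d)   # Kaiming-style scaling for ReLU
W2 = rng.standard_normal((d, d)) * np.sqrt(2.0 / d)

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x):
    # The identity path carries x through unchanged, so gradients reach earlier
    # layers directly even when many blocks are stacked.
    return x + relu(x @ W1) @ W2

x = rng.standard_normal(d)
print(residual_block(x).shape)   # same shape as the input, by construction
```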