Machine Learning Timeline

The history of machine learning (ML) is not merely a record of increasing parameter counts but a fundamental reorganization of how computational systems represent, process, and generate information. From the early days of hand-engineered features to the contemporary era of self-evolving reasoning agents, the field has undergone several paradigm shifts. Each shift was precipitated by a seminal research paper that addressed a specific bottleneck—be it computational efficiency, gradient stability, data scarcity, or alignment with human intent. This report provides an exhaustive analysis of twenty-four critical papers that define the current landscape of artificial intelligence, tracing the lineage of thought from the revival of convolutional neural networks to the emergence of conditional memory and autonomous reasoning systems.

The Architectures of Perception: The Renaissance of Neural Networks

The period between 2012 and 2015 marked the transition from classical statistical learning to deep learning. During this era, researchers demonstrated that the constraints of the "AI Winter"—namely the vanishing gradient problem and lack of sufficient compute—could be overcome by combining architectural innovations with high-performance hardware.

1. ImageNet Classification with Deep Convolutional Neural Networks (2012)

The publication of what is commonly referred to as the AlexNet paper serves as the foundational milestone of the modern era. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton demonstrated that a deep convolutional neural network (CNN) could achieve an unprecedented reduction in error rates on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).1 The model's significance lay in the decisive margin by which it surpassed the previous state of the art, which relied on hand-crafted feature extractors such as SIFT or HOG.2

The key insights of the AlexNet architecture were threefold. First, it popularized the use of the Rectified Linear Unit (ReLU) activation function, f(x) = max(0, x), which allowed for non-saturating neurons and significantly faster training compared to traditional tanh or sigmoid units.2 Second, it utilized a novel regularization technique called "dropout," where neurons were randomly deactivated during training to prevent complex co-adaptations and reduce overfitting.1 Third, the implementation was specifically designed for efficient GPU training, utilizing two GTX 580 GPUs to handle the 60 million parameters across five convolutional and three fully connected layers.1 This established the "compute-first" paradigm that remains dominant today.
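
To make the first two ingredients concrete, here is a minimal NumPy sketch of ReLU and inverted dropout (the scaling convention used by modern frameworks; the original paper instead scaled activations at test time):

```python
import numpy as np

def relu(x):
    # ReLU: f(x) = max(0, x); non-saturating for positive inputs, so
    # gradients do not vanish the way they do for tanh/sigmoid units.
    return np.maximum(0.0, x)

def dropout(x, p=0.5, rng=np.random.default_rng(0)):
    # Inverted dropout: zero each activation with probability p during
    # training, rescaling survivors so the expected activation is unchanged.
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

h = relu(np.array([-1.2, 0.3, 2.5]))
print(h)                  # [0.  0.3 2.5]
print(dropout(h, p=0.5))  # roughly half the activations zeroed, rest doubled
```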

Paper Link: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks (Note: Published at NIPS 2012; the paper is not available on arXiv)

2. Efficient Estimation of Word Representations in Vector Space (2013)

While vision research was being revolutionized by AlexNet, natural language processing (NLP) underwent a similar transformation with the introduction of Word2Vec. This work, led by Tomas Mikolov at Google, introduced efficient methods for learning distributed representations of words, where semantic relationships were captured as geometric distances in a high-dimensional vector space.4

The paper introduced two primary architectures: the Continuous Bag-of-Words (CBOW) model, which predicts a target word from its context, and the Skip-gram model, which predicts the context words given a target word.4 The profound insight was the observation that these vectors capture linguistic regularities; for instance, the vector relationship between "King" and "Man" is nearly identical to that between "Queen" and "Woman".4 This work demonstrated that semantic meaning could be extracted unsupervised from massive corpora, providing a foundational tool for all subsequent NLP research.6
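
A toy illustration of the analogy arithmetic, using hypothetical 4-dimensional vectors in place of real Word2Vec embeddings (actual models learn hundreds of dimensions from large corpora):

```python
import numpy as np

# Hypothetical embeddings standing in for learned Word2Vec vectors.
vecs = {
    "king":  np.array([0.8, 0.6, 0.1, 0.9]),
    "man":   np.array([0.7, 0.1, 0.0, 0.8]),
    "woman": np.array([0.7, 0.1, 0.9, 0.8]),
    "queen": np.array([0.8, 0.6, 1.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The celebrated analogy: king - man + woman should land near queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```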

ArXiv Link: https://arxiv.org/abs/1301.3781

3. Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)

Following the success of AlexNet, the VGGNet paper investigated the impact of network depth on classification performance. Simonyan and Zisserman proposed that instead of using large convolutional filters (like the 11x11 filters in AlexNet), it was more effective to stack multiple layers of very small 3x3 filters.8

This approach provided several benefits. A stack of three 3x3 convolutional layers has an effective receptive field of 7x7 but incorporates three non-linear rectification layers instead of one, making the decision function more discriminative.9 Furthermore, it reduces the number of parameters compared to a single larger filter. The resulting VGG-16 and VGG-19 models demonstrated that pushing depth to 16–19 layers led to significant improvements, securing first place in the localization track and second place in the classification track of ILSVRC 2014.8
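
The parameter saving is easy to verify. A quick sketch of the count for C input and output channels, with bias terms omitted:

```python
# One 7x7 layer vs. a stack of three 3x3 layers with the same receptive field.
C = 512
single_7x7 = 7 * 7 * C * C           # 49 * C^2 parameters
stacked_3x3 = 3 * (3 * 3 * C * C)    # 27 * C^2 parameters
print(single_7x7, stacked_3x3)       # 12845056 vs 7077888 (~45% fewer)
```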

ArXiv Link: https://arxiv.org/abs/1409.1556

4. Generative Adversarial Nets (2014)

Generative modeling took a radical turn with the introduction of Generative Adversarial Networks (GANs). Ian Goodfellow and his colleagues proposed a framework where two models are trained simultaneously: a generator G that captures the data distribution, and a discriminator D that estimates the probability that a sample came from the training data rather than from G.

The significance of GANs was the transition from likelihood-based modeling to a game-theoretic approach. The training objective is a minimax game in which the discriminator is trained to maximize the probability of assigning the correct label to both real and generated samples, while the generator is trained to minimize log(1 − D(G(z))). This adversarial process forced the generator to produce highly realistic samples, leading to a decade of breakthroughs in synthetic image generation, style transfer, and super-resolution.13
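
A minimal PyTorch sketch of one adversarial step, assuming G, D, and their optimizers are hypothetical modules with D outputting a probability in (0, 1). Note that, as the paper itself suggests, practitioners usually train G with the non-saturating objective (maximize log D(G(z))) rather than minimizing log(1 − D(G(z))), which saturates early in training:

```python
import torch

bce = torch.nn.BCELoss()

def gan_step(G, D, real, opt_g, opt_d, z_dim=64):
    n = real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)
    fake = G(torch.randn(n, z_dim))

    # Discriminator ascends log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    loss_d = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    loss_d.backward()
    opt_d.step()

    # Generator: non-saturating variant, maximize log D(G(z)).
    opt_g.zero_grad()
    loss_g = bce(D(fake), ones)
    loss_g.backward()
    opt_g.step()
```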

ArXiv Link: https://arxiv.org/abs/1406.2661

5. Adam: A Method for Stochastic Optimization (2014)

As models grew in complexity, the methods used to train them became a critical bottleneck. Diederik Kingma and Jimmy Ba introduced Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions.16

Adam’s key insight was the combination of the advantages of two prior methods: AdaGrad, which works well with sparse gradients, and RMSProp, which works well in non-stationary settings.16 By maintaining adaptive estimates of lower-order moments (the mean and uncentered variance of the gradients), Adam allowed for efficient, parameter-wise learning rate adjustment.16 Its robustness to hyperparameter settings and computational efficiency quickly made it the standard optimizer for deep learning research, a position it largely maintains a decade later.17

ArXiv Link: https://arxiv.org/abs/1412.6980

Optimizer Feature Description Reference
Adaptive Learning Rates Computes individual adaptive learning rates for different parameters from estimates of first and second moments. 16
Momentum Incorporates a running average of the gradient to accelerate training in relevant directions. 18
Bias Correction Includes mechanisms to correct the initialization bias of the moment estimates near the start of training. 17
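
A minimal sketch of a single Adam update, mirroring the three features in the table above (in a real implementation, the state m and v is tracked separately for every parameter tensor):

```python
import numpy as np

def adam_update(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient (first moment)
    # and the squared gradient (second, uncentered moment).
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    # Bias correction: m and v are initialized at zero, so early
    # estimates are too small; dividing by (1 - b^t) corrects this.
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    # Per-parameter step size adapted by the second-moment estimate.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```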

The Era of Normalization and Residuals: Stability at Scale

By 2015, the community faced a "degradation problem": simply adding more layers to a deep network led to higher training error, not because of overfitting, but because the networks became increasingly difficult to optimize.

6. Batch Normalization: Accelerating Deep Network Training (2015)

Sergey Ioffe and Christian Szegedy identified "internal covariate shift"—the change in the distribution of layer inputs during training—as a major impediment to training deep networks.21 They introduced Batch Normalization (BN) to stabilize these distributions by normalizing the inputs to each layer for every mini-batch.21

The key insight was that making normalization a part of the model architecture would allow the use of much higher learning rates and make the model less sensitive to initialization.21 BN also acted as a regularizer, often eliminating the need for dropout.21 By ensuring that activations throughout the network maintained a stable mean and variance, BN allowed for the training of models that were previously thought to be impossible to converge.22
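
A minimal NumPy sketch of the training-time forward pass (at inference, running averages of the batch statistics are used instead):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize each feature over the mini-batch,
    # then restore representational power with a learned scale and shift.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # y = gamma * x_hat + beta
```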

ArXiv Link: https://arxiv.org/abs/1502.03167

7. Deep Residual Learning for Image Recognition (2015)

The ResNet paper by Kaiming He and his team at Microsoft Research is perhaps the most cited work in modern computer vision. It addressed the optimization collapse in very deep networks by introducing "residual learning".26

The researchers hypothesized that it is easier for a stack of layers to learn a residual mapping F(x) := H(x) − x than to learn the original, unreferenced mapping H(x).26 By adding "shortcut connections" that skip one or more layers, the network could effectively pass the identity through the stack.26 This allowed for the training of networks with 152 layers—eight times deeper than VGG—while maintaining lower complexity.26 ResNet won first place in all five major tracks of the ILSVRC and COCO 2015 competitions and remains a standard backbone for vision tasks.26
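
A simplified PyTorch sketch of a basic residual block (the paper's deeper variants use a three-layer "bottleneck" design, and the shortcut is projected when dimensions change):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # The stacked layers learn F(x); the shortcut adds x back,
    # so the block outputs F(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut
```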

ArXiv Link: https://arxiv.org/abs/1512.03385

8. You Only Look Once: Unified, Real-Time Object Detection (2015)

Object detection had traditionally been a multi-stage process involving region proposals and classification. The YOLO paper by Joseph Redmon et al. reframed this as a single regression problem, where a single neural network predicts bounding boxes and class probabilities directly from full images.32

The core insight was the division of the image into an S×S grid, with each cell responsible for predicting B bounding boxes and associated confidence scores.32 Because the entire detection pipeline is a single network, it can be optimized end-to-end. This resulted in extreme speed—YOLO could process images at 45 frames per second, making it suitable for video and autonomous systems.32 While it made more localization errors than state-of-the-art systems like Faster R-CNN, it was significantly less likely to predict false positives in the background.34
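
The unified output is a single tensor. With the paper's settings (S=7, B=2, and C=20 classes for PASCAL VOC), the shape works out as follows:

```python
# Each grid cell predicts B boxes (x, y, w, h, confidence)
# plus C class probabilities.
S, B, C = 7, 2, 20                 # values used in the original paper
output_shape = (S, S, B * 5 + C)
print(output_shape)                # (7, 7, 30)
```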

ArXiv Link: https://arxiv.org/abs/1506.02640

9. WaveNet: A Generative Model for Raw Audio (2016)

WaveNet, developed by researchers at DeepMind, introduced a deep generative model for raw audio waveforms. Unlike previous speech synthesis systems that used concatenative or parametric methods, WaveNet generated audio sample-by-sample.35

The key technical breakthrough was the use of "dilated causal convolutions," which allowed the model to have a very large receptive field without requiring an impractical number of layers.35 This was crucial for handling the high temporal resolution of audio (e.g., 16,000 samples per second). WaveNet was fully probabilistic and autoregressive, and when conditioned on text, it produced speech that human listeners rated as significantly more natural than any previous system.35 It also demonstrated an ability to capture the nuance of musical instruments and speaker identities.35
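
The receptive-field arithmetic is simple to check: with kernel size 2, each layer of dilation d extends the receptive field by d samples. A sketch assuming three stacks of dilations 1, 2, ..., 512 (the paper repeats such stacks several times):

```python
# Receptive field of stacked dilated causal convolutions, kernel size 2.
dilations = [2**i for i in range(10)] * 3   # three stacks of 1, 2, ..., 512
receptive_field = 1 + sum(dilations)        # each layer adds (k-1)*d = d
print(receptive_field)                      # 3070 samples (~0.19 s at 16 kHz)
```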

ArXiv Link: https://arxiv.org/abs/1609.03499

The Transformer Revolution: Attention and Scaling

The second major shift in ML history occurred in 2017 with the abandonment of recurrence in favor of attention. This enabled massive parallelization and the era of the "Large Language Model."

10. Attention Is All You Need (2017)

The Transformer paper is arguably the most influential work in the history of machine learning. Vaswani et al. proposed a model architecture that eschewed recurrent and convolutional layers entirely, relying solely on a mechanism called "self-attention".39

The fundamental insight was that the sequential nature of RNNs (Recurrent Neural Networks) was a bottleneck for both training speed and long-range dependency modeling. Self-attention allowed the model to process all tokens in a sequence simultaneously, calculating the "importance" of every other token relative to the current one.40 By using "Multi-Head Attention," the model could attend to different representation subspaces at different positions.40 This architecture allowed for significantly more parallelization during training, enabling the creation of models with billions—and eventually trillions—of parameters.39
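
A minimal NumPy sketch of the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, for a single head:

```python
import numpy as np

def attention(Q, K, V):
    # Scores between every pair of positions, scaled by sqrt(d_k)
    # to keep the softmax in a well-conditioned regime.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Every token attends to every other token in one matrix multiply,
# which is what makes training parallelizable.
Q = K = V = np.random.default_rng(0).normal(size=(5, 64))  # 5 tokens, d_k=64
print(attention(Q, K, V).shape)  # (5, 64)
```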

ArXiv Link: https://arxiv.org/abs/1706.03762

11. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)

BERT (Bidirectional Encoder Representations from Transformers) demonstrated the power of the "pre-train and fine-tune" paradigm. Unlike previous models that were unidirectional, BERT was designed to pre-train deep bidirectional representations from unlabeled text.41

The key insight was the "Masked Language Model" (MLM) objective, where a random 15% of the input tokens are masked, and the model is trained to predict them based on both their left and right context.41 This resulted in a model that deeply understood linguistic context. BERT achieved new state-of-the-art results on eleven NLP tasks, including the SQuAD question-answering benchmark and the GLUE suite, proving that a single pre-trained model could be adapted to almost any language task with minimal task-specific modifications.41
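
A sketch of the masking procedure. In the paper, 15% of positions are selected for prediction, and of those 80% become [MASK], 10% a random token, and 10% are left unchanged:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=("the", "cat", "sat")):
    # BERT-style masking: record the original token at each selected
    # position; the model is trained to reconstruct it from both sides.
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = random.choice(vocab)
            # else: leave the token unchanged
    return out, targets
```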

ArXiv Link: https://arxiv.org/abs/1810.04805

12. Language Models are Unsupervised Multitask Learners (2019)

OpenAI's GPT-2 paper introduced the concept of "zero-shot learning" at scale. The researchers demonstrated that language models begin to learn tasks like translation, question answering, and summarization without any explicit supervision when trained on a sufficiently large and diverse dataset called WebText.46

The key insight was that the capacity of the model is essential to the success of zero-shot task transfer. By increasing the parameter count to 1.5 billion, GPT-2 achieved state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting.46 This shifted the research focus from fine-tuning toward the "prompting" paradigm, where the model's behavior is guided by the input text rather than parameter updates.47

Paper Link: https://openai.com/blog/better-language-models/ (Note: GPT-2 was released via an OpenAI technical report and blog post rather than arXiv)

13. Scaling Laws for Neural Language Models (2020)

As models grew, the need for a principled understanding of resource allocation became paramount. Jared Kaplan et al. at OpenAI conducted a systematic study of how language model performance scales with model size, dataset size, and compute budget.50

The researchers found that the loss scales as a power law with all three factors, with trends spanning over seven orders of magnitude.50 A critical insight was that architectural details (like network width or depth) had minimal effect within a wide range; the most important factors were simply the scale of N (parameters), D (data), and C (compute).50 They also observed that larger models are significantly more "sample-efficient," meaning they reach a given level of performance with fewer training steps than smaller models.50 This paper provided the mathematical foundation for the massive investments in AI compute that followed.52

ArXiv Link: https://arxiv.org/abs/2001.08361

Scaling Factor Relationship to Loss (L) Insight Ref
Model Size (N) L(N) = (Nc/N)^αN Performance improves predictably with parameter count. 51
Dataset Size (D) L(D) = (Dc/D)^αD Data must scale in tandem with parameters to avoid overfitting. 50
Compute (C) L(C) = (Cc/C)^αC Optimal training requires balancing N and D for a given compute budget C. 53
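
Plugging in the paper's reported constants for the model-size law (Nc ≈ 8.8e13 non-embedding parameters, αN ≈ 0.076) shows how smoothly the predicted loss falls:

```python
# Evaluating L(N) = (Nc / N)^alpha_N with the constants reported in the paper.
Nc, alpha_N = 8.8e13, 0.076
for N in (1e8, 1e9, 1e10, 1e11):
    print(f"N={N:.0e}  predicted loss {(Nc / N) ** alpha_N:.2f}")
# Every 10x increase in parameters lowers the loss by the same
# multiplicative factor of 10**-0.076, i.e. about 16%.
```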

The Diffusion and Multimodal Pivot

The period around 2020-2021 saw the expansion of the Transformer architecture into vision and the emergence of diffusion models as the new standard for generative AI.

14. Denoising Diffusion Probabilistic Models (2020)

Denoising Diffusion Probabilistic Models (DDPM) introduced a class of generative models inspired by non-equilibrium thermodynamics. Jonathan Ho and his colleagues showed that high-quality images could be synthesized by learning to reverse a process that gradually adds noise to an image.55

The insight was the connection between diffusion models and "denoising score matching." The model learns a reverse diffusion process where it predicts the noise added at each step, effectively "sculpting" an image out of random Gaussian noise.56 This provided a more stable and scalable training objective than GANs, eventually leading to the creation of systems like DALL-E and Midjourney.57
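
A minimal PyTorch sketch of the training objective, assuming `model` is a hypothetical noise-prediction network ε(x_t, t) and `x0` is a batch of flattened images:

```python
import torch

def ddpm_training_step(model, x0, T=1000):
    # Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps,
    # using the linear beta schedule from the paper (1e-4 to 0.02).
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.size(0),))
    abar = alphas_bar[t].view(-1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
    # "Simple" objective: predict the noise that was injected.
    return ((model(x_t, t) - eps) ** 2).mean()
```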

ArXiv Link: https://arxiv.org/abs/2006.11239

15. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)

The Vision Transformer (ViT) paper by Dosovitskiy et al. demonstrated that the Transformer architecture could be applied to images with minimal modifications. By splitting an image into patches and treating them as "words," the researchers achieved results comparable to state-of-the-art CNNs.59

The key insight was that while CNNs have an inherent "inductive bias" (like translation invariance and locality), a Transformer trained on massive datasets (like Google's JFT-300M) can learn these properties from scratch and eventually outperform CNNs.59 ViT proved that the Transformer was a truly general-purpose architecture capable of handling diverse modalities.59
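
The "image as words" step is essentially a reshape. A NumPy sketch for a 224×224 RGB image with 16×16 patches:

```python
import numpy as np

def patchify(image, patch=16):
    # Split an (H, W, C) image into flattened non-overlapping patches,
    # yielding the "words" the Transformer consumes as a sequence.
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches  # each row is then linearly projected to the model dim

img = np.zeros((224, 224, 3))
print(patchify(img).shape)  # (196, 768): 14x14 patches of 16*16*3 values
```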

ArXiv Link: https://arxiv.org/abs/2010.11929

16. Language Models are Few-Shot Learners (2020)

Commonly known as the GPT-3 paper, this work marked the definitive arrival of the "In-Context Learning" era. With 175 billion parameters, GPT-3 demonstrated that a model could perform a new task at inference time simply by being given a few demonstrations in its prompt.63

The insight was that "scaling up" drastically improves task-agnostic, few-shot performance, sometimes reaching competitiveness with models that had been specifically fine-tuned for those tasks.63 GPT-3 achieved strong performance on translation, question-answering, and arithmetic, while also revealing the emergent ability to generate coherent, long-form human-like text.64

ArXiv Link: https://arxiv.org/abs/2005.14165

17. Learning Transferable Visual Models From Natural Language Supervision (2021)

CLIP (Contrastive Language-Image Pre-training) bridged the gap between vision and language. Alec Radford and his team showed that training an image encoder and a text encoder to recognize which caption goes with which image resulted in highly robust and transferable visual representations.67

The insight was the use of a "contrastive" objective: for a batch of image-text pairs, the model is trained to maximize the cosine similarity of the correct pairs while minimizing it for the incorrect ones.68 This allowed CLIP to perform "zero-shot" classification on any visual task simply by being given the names of the categories in text.67 CLIP became a fundamental component for almost all subsequent multimodal systems.68
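
A minimal PyTorch sketch of the symmetric contrastive loss over a batch of N aligned pairs (CLIP actually learns the temperature during training; the value here is illustrative):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # True pairs sit on the diagonal of the N x N cosine-similarity matrix;
    # the loss is cross-entropy in both directions, averaged.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```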

ArXiv Link: https://arxiv.org/abs/2103.00020

The Alignment and Efficiency Frontier

As models reached human-level fluency, the focus shifted from pure performance to "alignment"—ensuring models follow instructions and are helpful—and "efficiency"—training smaller, smarter models.

18. Training Language Models to Follow Instructions with Human Feedback (2022)

OpenAI's InstructGPT paper established Reinforcement Learning from Human Feedback (RLHF) as the standard way to align language models with user intent. They showed that fine-tuning a model on human-written demonstrations and then further refining it using a reward model based on human preferences resulted in a much more useful system.71

The key insight was that "bigger is not inherently better" at following instructions. The 1.3B parameter InstructGPT model was preferred by human evaluators over the 175B parameter GPT-3, despite being 100 times smaller.71 This work provided the blueprint for ChatGPT and the modern era of conversational AI.71
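
A sketch of the pairwise loss used to train the reward model on human preference comparisons; the resulting scalar reward is then maximized with PPO in the final stage:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # The reward model should score the response labelers preferred
    # (r_chosen) above the one they rejected (r_rejected):
    # loss = -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```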

ArXiv Link: https://arxiv.org/abs/2203.02155

19. Training Compute-Optimal Large Language Models (2022)

The Chinchilla paper from DeepMind revisited the scaling laws from 2020 and found a major flaw: most large models were significantly undertrained.72 They found that for compute-optimal training, the model size and the number of training tokens should be scaled equally.72

The researchers tested this by training "Chinchilla," a 70B parameter model, on 1.4 trillion tokens (compared to GPT-3's 300 billion). Chinchilla significantly outperformed larger models like Gopher (280B) and GPT-3 (175B) on almost all benchmarks.72 This insight led to an industry-wide shift toward training smaller, denser models on much larger datasets.72
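
The rule of thumb that falls out of the paper is roughly 20 training tokens per parameter. A sketch using the common approximation C ≈ 6ND for training FLOPs:

```python
# Compute-optimal allocation: from C = 6*N*D with D = 20*N,
# it follows that N = sqrt(C / 120).
def compute_optimal(C):
    N = (C / 120) ** 0.5
    return N, 20 * N

for C in (1e21, 1e23, 1e25):
    N, D = compute_optimal(C)
    print(f"C={C:.0e}: N~{N:.1e} params, D~{D:.1e} tokens")
# At ~5.8e23 FLOPs this recovers Chinchilla itself: ~70B params, ~1.4T tokens.
```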

ArXiv Link: https://arxiv.org/abs/2203.15556

20. DeepSeek-V3 (2024)

DeepSeek-V3 represents the pinnacle of modern "hardware-aware" model design. This Mixture-of-Experts (MoE) model features 671 billion parameters, but only 37 billion are activated per token, allowing for high performance with manageable compute costs.73

The technical significance lies in its adoption of Multi-head Latent Attention (MLA), which reduces the KV cache size, and its pioneering auxiliary-loss-free strategy for expert load balancing.73 It also introduced a "Multi-Token Prediction" (MTP) training objective to improve the model's forward-looking planning capabilities.75 DeepSeek-V3 demonstrated that open-source models could achieve performance comparable to leading closed-source models while being remarkably stable during training.74

ArXiv Link: https://arxiv.org/abs/2412.19437

Model Total Parameters Activated Parameters Training Tokens Ref
DeepSeek-V3 671B 37B 14.8 Trillion 74
Gopher 280B 280B 300 Billion 72
GPT-3 175B 175B 300 Billion 63
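
A sketch of the generic top-k routing idea behind sparse MoE layers like those in the table above (illustrative sizes; DeepSeek-V3's actual gating function, shared experts, and load balancing differ in detail):

```python
import torch
import torch.nn.functional as F

def moe_route(x, gate_weights, k=8):
    # Score all experts per token, keep only the k best, and renormalize
    # their weights. Compute per token stays roughly flat while total
    # parameters scale with the number of experts.
    scores = F.softmax(x @ gate_weights, dim=-1)      # (tokens, n_experts)
    topk_scores, topk_idx = scores.topk(k, dim=-1)
    return topk_idx, topk_scores / topk_scores.sum(-1, keepdim=True)

x = torch.randn(4, 1024)        # 4 tokens, hidden size 1024 (illustrative)
gate = torch.randn(1024, 256)   # 256 experts (illustrative)
idx, w = moe_route(x, gate)
print(idx.shape, w.shape)       # each token activates 8 of 256 experts
```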

The Reasoning Era: Chain of Thought and Memory

The most recent research has moved beyond static prediction to "slow thinking," where models reflect, plan, and verify their answers.

21. Show Your Working: Eliciting Reasoning in LLMs (2024)

This paper addresses the performance gap between frontier models (like OpenAI's O1) and others by hypothesizing that it stems from the limited availability of high-quality "reasoning process" data.77

The researchers propose the "Reasoning Enhancement Loop" (REL), a critic-generator pipeline that autonomously produces high-quality "worked solutions"—including brainstorming, hypothesis testing, and solution refinement.77 By first fine-tuning on expert demonstrations that explicitly showcase problem exploration and then iteratively enhancing these with the REL, they elicited fundamental reasoning behaviors from existing models.77 This "show your working" approach mirrors human cognition and achieves the benefits of test-time compute scaling without the computational overhead of generating multiple solutions.77

ArXiv Link: https://arxiv.org/abs/2412.04645

22. DeepSeek-R1: Reasoning via Pure RL (2025)

DeepSeek-R1 achieved a breakthrough by demonstrating that reasoning capabilities can emerge through "pure" reinforcement learning (RL) without the need for human-annotated reasoning trajectories.79

The paper introduced DeepSeek-R1-Zero, which used rule-based rewards (for correctness and formatting) rather than human feedback.80 Notably, after several thousand RL steps, the model exhibited an "Aha Moment," where it spontaneously learned to self-correct, reflect, and verify its own reasoning steps.80 While R1-Zero had readability issues, the finalized R1 model used a multi-stage pipeline (Cold-start SFT → Reasoning RL → Rejection Sampling → General RL) to achieve performance comparable to OpenAI's O1-1217 on verifiable tasks like mathematics and coding.79
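
A sketch of what such rule-based rewards can look like; the delimiter tags and reward weights here are illustrative, not the paper's exact specification:

```python
import re

def rule_based_reward(response, reference_answer):
    # R1-Zero-style reward: no learned reward model, just programmatic
    # checks for formatting and verifiable correctness.
    reward = 0.0
    if re.search(r"<think>.*</think>", response, re.S):
        reward += 0.5                 # format reward: reasoning is delimited
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if m and m.group(1).strip() == reference_answer:
        reward += 1.0                 # accuracy reward: exact, checkable match
    return reward
```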

ArXiv Link: https://arxiv.org/abs/2501.12948

23. DeepSeek-OCR 2: Visual Causal Flow (2026)

DeepSeek-OCR 2 addresses the fundamental mismatch between the 2D nature of images and the 1D sequential training of LLMs. It introduces "DeepEncoder V2," an encoder designed to intelligently reorder visual tokens based on image semantics before they are interpreted.84

The insight is that human vision follows semantically coherent scanning patterns driven by logical structures.84 DeepEncoder V2 replaces the standard CLIP component with a compact LLM architecture that models a "visual causal flow," enabling parallelized processing through learnable queries.84 This framework allows for "genuine 2D reasoning" through two-cascaded 1D causal structures, yielding significant performance gains on images with complex layouts like documents and diagrams.84

ArXiv Link: https://arxiv.org/abs/2601.20552

24. Conditional Memory via Scalable Lookup (2026)

The Engram paper, authored by Wenfeng Liang and his team at DeepSeek, introduces a novel memory-compute architecture designed for DeepSeek V4.86 It addresses the efficiency bottleneck of processing very long contexts in LLMs.

The key innovation is the "Engram" module, which provides a "second axis of sparsity" complementary to Mixture-of-Experts (MoE).86 While MoE provides dynamic reasoning capacity, Engram serves as a conditional memory system for static knowledge lookup.86 By reallocating 20-25% of sparse parameters to this scalable lookup system, the model achieves massive gains in general reasoning (BBH +5.0) and long-context retrieval, where Multi-Query Needles-in-a-Haystack (NIAH) performance jumped from 84.2% to 97.0%.86 This architecture represents a move toward models that can "remember" and "reason" across millions of tokens with hardware-aware efficiency.86

ArXiv Link: https://arxiv.org/abs/2601.07372

Synthesis: The Trajectory of Machine Intelligence

Tracing these twenty-four papers reveals a coherent narrative of technical evolution. We have moved from the "Static Perception" of AlexNet and VGG, where models were fixed classifiers, to "Dynamic Understanding," where Transformers like BERT and GPT-3 could adapt to new contexts. We are now entering the "Cognitive Reasoning" era, defined by DeepSeek-R1 and the "Show Your Working" methodologies, where models no longer just guess the next token but engage in iterative refinement and self-correction.

A recurring theme is the "Efficiency Paradox": as models get larger, we find increasingly clever ways to use only a fraction of their capacity at any given moment (MoE, Engram). Similarly, the "Supervision Paradox" shows that the most powerful reasoning behaviors do not come from human instruction (SFT) but from internal self-evolution through reinforcement learning (R1).

Synopsis of Prerequisites for Technical Mastery

The papers listed above vary significantly in their difficulty. For those seeking to master this literature, the following hierarchical prerequisites are recommended:

Level 1: Foundational Literacy (AlexNet, Word2Vec, YOLO, Adam)

Requires a strong grasp of undergraduate mathematics.

Level 2: Architectural Proficiency (ResNet, Transformer, BERT, WaveNet)

Requires knowledge of system-level AI design.

Level 3: Frontier Research (Scaling Laws, Diffusion, DeepSeek-R1, Engram)

Requires graduate-level specialization and cross-disciplinary intuition.

Works cited

  1. ImageNet Classification with Deep Convolutional Neural Networks - NIPS - NeurIPS, accessed on February 2, 2026, https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
  2. (PDF) ImageNet Classification with Deep Convolutional Neural Networks - ResearchGate, accessed on February 2, 2026, https://www.researchgate.net/publication/319770183_Imagenet_classification_with_deep_convolutional_neural_networks
  3. Ravoxsg/A_chronology_of_deep_learning: Tracing back and exposing in chronological order the main ideas in the field of deep learning, to help everyone better understand the current intense research in AI. - GitHub, accessed on February 2, 2026, https://github.com/Ravoxsg/A_chronology_of_deep_learning
  4. Efficient Estimation of Word Representations in Vector Space | BibSonomy, accessed on February 2, 2026, https://www.bibsonomy.org/bibtex/28b132b4b7e82cfb538fd462887ba98b8/ven7u
  5. [1301.3781] Efficient Estimation of Word Representations in Vector Space - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1301.3781
  6. [PDF] Efficient Estimation of Word Representations in Vector Space - Semantic Scholar, accessed on February 2, 2026, https://www.semanticscholar.org/paper/Efficient-Estimation-of-Word-Representations-in-Mikolov-Chen/f6b51c8753a871dc94ff32152c00c01e94f90f09
  7. Natural Language Understanding - Google Research, accessed on February 2, 2026, https://research.google.com/teams/brain/natural-language/
  8. Very Deep Convolutional Networks for Large-Scale Image Recognition - ResearchGate, accessed on February 2, 2026, https://www.researchgate.net/publication/265385906_Very_Deep_Convolutional_Networks_for_Large-Scale_Image_Recognition
  9. [1409.1556] Very Deep Convolutional Networks for Large-Scale Image Recognition - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1409.1556
  10. vgg16 - (Not recommended) VGG-16 convolutional neural network - MATLAB - MathWorks, accessed on February 2, 2026, https://ch.mathworks.com/help/deeplearning/ref/vgg16.html
  11. Conditional Generative Adversarial Nets - BibSonomy, accessed on February 2, 2026, https://www.bibsonomy.org/bibtex/2a4426d639ebb30270839ad347bcfb999/achakraborty
  12. [1411.1784] Conditional Generative Adversarial Nets - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1411.1784
  13. [1708.02556] Multi-Generator Generative Adversarial Nets - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1708.02556
  14. [1803.04469] An Introduction to Image Synthesis with Generative Adversarial Nets - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1803.04469
  15. [1710.10772] Tensorizing Generative Adversarial Nets - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1710.10772
  16. Adam: A Method for Stochastic Optimization - R Discovery, accessed on February 2, 2026, https://discovery.researcher.life/article/adam-a-method-for-stochastic-optimization/b14df890f8a03f6baa06f300a3ed9c86
  17. [1412.6980] Adam: A Method for Stochastic Optimization - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1412.6980
  18. [1412.6980v8] Adam: A Method for Stochastic Optimization - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1412.6980v8?hl:ja
  19. [PDF] Adam: A Method for Stochastic Optimization - Semantic Scholar, accessed on February 2, 2026, https://www.semanticscholar.org/paper/Adam%3A-A-Method-for-Stochastic-Optimization-Kingma-Ba/a6cb366736791bcccc5c8639de5a8f9636bf87e8
  20. Adam: A Method for Stochastic Optimization - BibBase, accessed on February 2, 2026, https://bibbase.org/network/publication/kingma-ba-adamamethodforstochasticoptimization-2015
  21. [1502.03167] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1502.03167
  22. Batch Normalization - Convolutional Neural Networks for Image and Video Processing, accessed on February 2, 2026, https://collab.dvb.bayern/spaces/TUMlfdv/pages/69119920/Batch+Normalization
  23. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. | BibSonomy, accessed on February 2, 2026, https://www.bibsonomy.org/bibtex/bf2b461f54850dbae02a295b9f5e799b
  24. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift - ResearchGate, accessed on February 2, 2026, https://www.researchgate.net/publication/272194743_Batch_Normalization_Accelerating_Deep_Network_Training_by_Reducing_Internal_Covariate_Shift
  25. Normalization Layers - TFLearn, accessed on February 2, 2026, http://tflearn.org/layers/normalization/
  26. Deep Residual Learning for Image Recognition - ResearchGate, accessed on February 2, 2026, https://www.researchgate.net/publication/286512696_Deep_Residual_Learning_for_Image_Recognition
  27. [1512.03385] Deep Residual Learning for Image Recognition - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1512.03385
  28. Paper page - Deep Residual Learning for Image Recognition - Hugging Face, accessed on February 2, 2026, https://huggingface.co/papers/1512.03385
  29. KaimingHe/deep-residual-networks: Deep Residual Learning for Image Recognition, accessed on February 2, 2026, https://github.com/KaimingHe/deep-residual-networks
  30. Top 10 Must Read Machine Learning Research Papers - Analytics Vidhya, accessed on February 2, 2026, https://www.analyticsvidhya.com/blog/2024/07/machine-learning-research-papers/
  31. [PDF] Deep Residual Learning for Image Recognition - Semantic Scholar, accessed on February 2, 2026, https://www.semanticscholar.org/paper/Deep-Residual-Learning-for-Image-Recognition-He-Zhang/2c03df8b48bf3fa39054345bafabfeff15bfd11d
  32. You Only Look Once - Wikipedia, accessed on February 2, 2026, https://en.wikipedia.org/wiki/You_Only_Look_Once
  33. You Only Look Once: Unified, Real-Time Object Detection | Request PDF - ResearchGate, accessed on February 2, 2026, https://www.researchgate.net/publication/278049038_You_Only_Look_Once_Unified_Real-Time_Object_Detection
  34. [1506.02640] You Only Look Once: Unified, Real-Time Object Detection - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1506.02640
  35. WaveNet: A Generative Model for Raw Audio - BibSonomy, accessed on February 2, 2026, https://www.bibsonomy.org/bibtex/2481a258916504a335fb1dcdb76b475d4/kirk86
  36. [1609.03499] WaveNet: A Generative Model for Raw Audio - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1609.03499
  37. WaveNet: A Generative Model for Raw Audio - ISCA Archive, accessed on February 2, 2026, https://www.isca-archive.org/ssw_2016/vandenoord16_ssw.html
  38. [1609.03499] WaveNet: A Generative Model for Raw Audio - ar5iv - arXiv, accessed on February 2, 2026, https://ar5iv.labs.arxiv.org/html/1609.03499
  39. [2501.09166] Attention is All You Need Until You Need Retention - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/2501.09166
  40. Attention Is All You Need - BibSonomy, accessed on February 2, 2026, https://www.bibsonomy.org/bibtex/1b23d83da70543e00f9240cc009f1fcfa/annakrause
  41. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, accessed on February 2, 2026, https://www.bibsonomy.org/bibtex/210c860e3f390c6fbfd78a3b91ab9b0af/albinzehe
  42. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, accessed on February 2, 2026, https://huggingface.co/papers/1810.04805
  43. [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/1810.04805
  44. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, accessed on February 2, 2026, https://blog.paperspace.com/bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding/
  45. BERT (Language Model) - Devopedia, accessed on February 2, 2026, https://devopedia.org/bert-language-model
  46. Language Models are Unsupervised Multitask Learners | OpenAI, accessed on February 2, 2026, https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  47. Annotated Bibliography Item: Language Models are Unsupervised Multitask Learners, accessed on February 2, 2026, https://service-science.info/archives/5137
  48. Understanding GPT-2 | Paper Summary: Language Models are Unsupervised Multitask Learners - BioErrorLog Tech Blog, accessed on February 2, 2026, https://en.bioerrorlog.work/entry/gpt-2-paper
  49. openai-community/gpt2 - Hugging Face, accessed on February 2, 2026, https://huggingface.co/openai-community/gpt2
  50. [2001.08361] Scaling Laws for Neural Language Models - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/2001.08361
  51. Scaling Laws for Neural Language Models - arXiv, accessed on February 2, 2026, https://arxiv.org/pdf/2001.08361
  52. RELATIVE-BASED SCALING LAW FOR NEURAL LAN- GUAGE MODELS - OpenReview, accessed on February 2, 2026, https://openreview.net/pdf?id=anAHXnrTVW
  53. Paper page - Scaling Laws for Neural Language Models - Hugging Face, accessed on February 2, 2026, https://huggingface.co/papers/2001.08361
  54. Scaling Instruction-Finetuned Language Models (Flan-PaLM) - Samuel Albanie, accessed on February 2, 2026, https://samuelalbanie.com/digests/2022-10-scaling-instruction-finetuned-language-models/
  55. [2410.18784] Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/2410.18784
  56. [2006.11239] Denoising Diffusion Probabilistic Models - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/2006.11239
  57. [2102.09672] Improved Denoising Diffusion Probabilistic Models - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/2102.09672
  58. Denoising Diffusion Probabilistic Model: practical experiments for landscape generation from segmentation masks using EDEN — Multimodal Synthetic Dataset of Enclosed Garden Scenes | by Olga Mindlina | Medium, accessed on February 2, 2026, https://medium.com/@olga.mindlina/denoising-diffusion-probabilistic-model-practical-experiments-for-landscape-generation-from-688ae87f3b73
  59. [PDF] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | Semantic Scholar, accessed on February 2, 2026, https://www.semanticscholar.org/paper/An-Image-is-Worth-16x16-Words%3A-Transformers-for-at-Dosovitskiy-Beyer/268d347e8a55b5eb82fb5e7d2f800e33c75ab18a
  60. Dosovitskiy et al (2021) An Image is Worth 16x16 Words - ntegrabℓε ∂ifferentiαℓs, accessed on February 2, 2026, https://www.adrian.idv.hk/2025-03-26-dbkwzudmhguh21-vit/
  61. Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More - arXiv, accessed on February 2, 2026, https://arxiv.org/html/2502.03738v1
  62. [PDF] An Image is Worth 16x16 Words, What is a Video Worth? - Semantic Scholar, accessed on February 2, 2026, https://www.semanticscholar.org/paper/An-Image-is-Worth-16x16-Words%2C-What-is-a-Video-Sharir-Noy/63e838bb935f5ebe3498107e753f07f08a8b5689
  63. Language Models are Few-Shot Learners - NIPS, accessed on February 2, 2026, https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
  64. Language Models are Few-Shot Learners - NIPS, accessed on February 2, 2026, https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
  65. Language Models are Few-Shot Learners - BibSonomy, accessed on February 2, 2026, https://www.bibsonomy.org/bibtex/295850dc02fa8a64a08252dc580322d39/louissf
  66. Language Models are Few-shot Learners (GPT-3) - Samuel Albanie, accessed on February 2, 2026, https://samuelalbanie.com/digests/2022-07-gpt-3/
  67. Learning Transferable Visual Models From Natural Language Supervision - arXiv, accessed on February 2, 2026, https://arxiv.org/abs/2103.00020
  68. Learning Transferable Visual Models From Natural Language Supervision, accessed on February 2, 2026, https://proceedings.mlr.press/v139/radford21a/radford21a.pdf
  69. [PDF] Learning Transferable Visual Models From Natural Language Supervision, accessed on February 2, 2026, https://www.semanticscholar.org/paper/Learning-Transferable-Visual-Models-From-Natural-Radford-Kim/6f870f7f02a8c59c3e23f407f3ef00dd1dcf8fc4
  70. (PDF) Learning Transferable Visual Models From Natural Language Supervision (2021) | Alec Radford | 3672 Citations - SciSpace, accessed on February 2, 2026, https://scispace.com/papers/learning-transferable-visual-models-from-natural-language-1msnnp1spo
  71. Training language models to follow instructions with human feedback - OpenReview, accessed on February 2, 2026, https://openreview.net/references/pdf?id=imrVtN2Mrv
  72. Training Compute-Optimal Large Language Models - ResearchGate, accessed on February 2, 2026, https://www.researchgate.net/publication/359576828_Training_Compute-Optimal_Large_Language_Models
  73. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures - arXiv, accessed on February 2, 2026, https://arxiv.org/html/2505.09343v1
  74. DeepSeek-V3 Technical Report - arXiv, accessed on February 2, 2026, https://arxiv.org/pdf/2412.19437
  75. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models - arXiv, accessed on February 2, 2026, https://arxiv.org/html/2512.02556v1
  76. Comparative analysis of the performance of the large language models DeepSeek-V3, DeepSeek-R1, open AI-O3 mini and open AI-O3 mini high in urology - PubMed Central, accessed on February 2, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12234633/
  77. arXiv:2412.04645v1 [cs.AI] 5 Dec 2024, accessed on February 2, 2026, https://www.arxiv.org/pdf/2412.04645
  78. FunSearch: Making new discoveries in mathematical sciences using Large Language Models - Google DeepMind, accessed on February 2, 2026, https://deepmind.google/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/
  79. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, accessed on February 2, 2026, https://arxiv.org/html/2501.12948v1
  80. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning - arXiv, accessed on February 2, 2026, https://arxiv.org/pdf/2501.12948
  81. deepseek-ai/DeepSeek-R1 - Hugging Face, accessed on February 2, 2026, https://huggingface.co/deepseek-ai/DeepSeek-R1
  82. DeepSeek-R1's paper was updated 2 days ago, expanding from 22 pages to 86 pages and adding a substantial amount of detail. : r/LocalLLaMA - Reddit, accessed on February 2, 2026, https://www.reddit.com/r/LocalLLaMA/comments/1q6c9wc/deepseekr1s_paper_was_updated_2_days_ago/
  83. Brief analysis of DeepSeek R1 and its implications for Generative AI - arXiv, accessed on February 2, 2026, https://arxiv.org/pdf/2502.02523
  84. DeepSeek-OCR 2: Visual Causal Flow - arXiv, accessed on February 2, 2026, https://arxiv.org/pdf/2601.20552
  85. deepseek-ai/DeepSeek-OCR-2 - Hugging Face, accessed on February 2, 2026, https://huggingface.co/deepseek-ai/DeepSeek-OCR-2
  86. DeepSeek's MODEL1 Leak Reveals V4's Architectural Blueprint | by Tao An | Jan, 2026, accessed on February 2, 2026, https://tao-hpu.medium.com/deepseeks-model1-leak-reveals-v4-s-architectural-blueprint-28e2bdcc7f37