What intuition should you take away?
Convolutional neural networks (CNNs) are a central building block behind deep learning breakthroughs in computer vision. ResNet addresses a very practical question: when a network is already deep, what happens to optimization and representation if we keep going deeper? Following Christopher Olah’s modular perspective, we treat a conv layer as “the same neuron reused at many input locations,” then connect He et al.’s original story of degradation and identity shortcuts to why residual blocks work. Finally, we discuss the genuine inductive biases and limitations of CNNs and ResNets, and why Transformers became the community default in NLP and for some vision backbones: not because convolutions were “retired,” but because the default problem framing and training recipe shifted.
If you have read our Understanding Transformer article, treat this page as a companion from “stacked local filters on a grid” to “global composable queries over a set of tokens.”
1. Modularity: what is a CNN reusing?
In Conv Nets: A Modular Perspective, Olah explains that, in the simplest view, a CNN uses many copies of the same neuron (weight sharing), so it can express a large computational graph while keeping the number of learnable parameters relatively small. This is the same abstraction as “define a function once, call it everywhere” in programming: the network learns a feature detector once and applies it at every location, instead of relearning it separately each time.
For 1D signals, a bank of neurons \(A\) computes the same features in every local time window; for 2D images, \(A\) slides over spatial patches. Statistics we care about often repeat across locations (e.g., “is there an edge somewhere?”), so scanning the image with one shared weight set is a reasonable inductive bias.
Weight sharing: learn once, convolve everywhere (schematic)
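The “same neuron reused at many input locations” picture can be sketched in a few lines. Below is a minimal numpy illustration (the 3-tap filter values are an illustrative choice, not anything from the original article): one shared weight vector slides over every window of a 1D signal.

```python
import numpy as np

def conv1d_shared(x, w, b):
    """Slide one shared weight vector w over every length-len(w) window of x."""
    k = len(w)
    # Every output position reuses the *same* parameters (w, b): weight sharing.
    return np.array([np.dot(x[i:i + k], w) + b for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])   # a simple edge-like filter (illustrative)
y = conv1d_shared(x, w, b=0.0)
print(y)  # [-2. -2. -2.]: the same detector responds identically at every window
```

Three windows, one set of weights: the parameter count is `len(w) + 1` regardless of how long the input is.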
2. Why convolutions + pooling? Locality, hierarchy, mild translation invariance
Local connectivity: long-range pixel dependencies are often composed gradually across layers; the first layer only needs a small neighborhood to encode edges and textures.
Hierarchical composition: the output of one conv layer feeds the next, building a local-to-global feature hierarchy in depth.
Pooling (e.g., max-pooling): Olah highlights how it “zooms out” and increases effective receptive fields while being insensitive to small shifts—we often care whether a feature appears in a region, not at a single exact pixel coordinate.
Typical stack: conv extracts local patterns; pooling coarsens spatial resolution
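The pooling claim above — we care whether a feature appears in a region, not at an exact pixel — can be demonstrated directly. A minimal numpy sketch of non-overlapping 2×2 max-pooling (the feature-map values are made up for illustration):

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max-pooling: keep the strongest activation per size×size cell."""
    h, w = x.shape
    return x[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.array([[0., 0., 0., 0.],
                 [9., 0., 0., 0.],
                 [0., 0., 0., 0.],
                 [0., 0., 7., 0.]])
pooled = max_pool2d(fmap)
print(pooled)  # [[9. 0.]
               #  [0. 7.]]
# Shift every activation one pixel right: each stays inside its 2×2 cell,
# so the pooled output is unchanged -- mild translation invariance.
print(max_pool2d(np.roll(fmap, 1, axis=1)))  # same [[9. 0.], [0. 7.]]
```

Note the invariance is only *mild*: a shift that crosses a cell boundary does change the pooled output, which is exactly the fine-spatial-detail cost discussed in Section 5.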
3. A glance at formalism: from patches to feature maps
A 2D conv layer can be abstracted as follows: at each spatial location \((n,m)\), take a neighborhood patch and apply the same weights. If \(A\) denotes the local nonlinear transform (often an affine map plus ReLU in practice), then, for a 3×3 neighborhood, conceptually
\[
y_{n,m} = A\big(x_{n-1:n+1,\; m-1:m+1}\big).
\]
Implementations use efficient kernels, stride, padding, and multiple channels; for design motivation, the equation above is enough to state what a CNN assumes about the world: the same \(A\) is reused at every \((n,m)\).
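The patch-then-shared-weights abstraction translates almost literally into code. A deliberately naive numpy sketch (single channel, no stride or padding, loop-based for clarity rather than speed; the identity-like kernel is an illustrative choice):

```python
import numpy as np

def conv2d_patches(x, A_weights, A_bias):
    """y[n, m] = A(patch centered at (n, m)): one shared affine map + ReLU per location."""
    k = A_weights.shape[0]      # odd kernel size, e.g. 3
    r = k // 2
    h, w = x.shape
    y = np.zeros((h - 2 * r, w - 2 * r))
    for n in range(r, h - r):
        for m in range(r, w - r):
            patch = x[n - r:n + r + 1, m - r:m + r + 1]
            y[n - r, m - r] = max(0.0, np.sum(patch * A_weights) + A_bias)  # ReLU
    return y

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.zeros((3, 3)); w[1, 1] = 1.0   # identity-like kernel (illustrative)
print(conv2d_patches(x, w, 0.0))      # recovers the 2×2 interior of x
```

Real frameworks replace the double loop with batched matrix multiplies and add channels, stride, and padding, but the assumption being encoded is the one above.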
4. Going deeper: degradation and the residual form in ResNet
Krizhevsky et al.’s ImageNet result showed that deeper, larger CNNs with ReLU, Dropout, etc. are extremely powerful. But when we simply stack deeper plain networks, He et al., in Deep Residual Learning for Image Recognition (CVPR 2016, arXiv:1512.03385), observed a counter-intuitive effect: beyond a certain depth, training error gets worse for deeper nets. Since training error itself rises (with validation error following), this is not classic overfitting but an optimization difficulty, which they call degradation.
If a shallower net already achieves some representational capacity, a deeper net should at least be able to copy the shallow solution (e.g., make the added layers approximate the identity). Plain stacks do not always learn identities easily. ResNet explicitly parameterizes a residual mapping: the stacked nonlinear layers fit a correction \(\mathcal{F}\) relative to an identity path, instead of directly representing the full mapping \(\mathcal{H}\):
\[
y = \mathcal{F}(x, \{W_i\}) + x, \qquad \mathcal{H}(x) = \mathcal{F}(x) + x.
\]
Here \(x\) is the identity input on the shortcut branch; when dimensions mismatch, the paper uses a 1×1 projection shortcut to align channels and spatial sizes. Intuitively:
- If the optimal mapping is close to identity, \(\mathcal{F}\) can be pushed toward near zero, which is easier than forcing the entire \(\mathcal{H}\) through nonlinear layers alone.
- During backprop, shortcuts provide shorter, more stable gradient paths, mitigating signal decay as depth grows.
Residual block: learn a correction to the input and add the shortcut (schematic)
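The “push \(\mathcal{F}\) toward zero” argument can be made concrete. A minimal numpy sketch of a residual block (dense layers standing in for convolutions; the shapes and the optional projection `W_s` are illustrative, not the paper’s exact architecture):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2, W_s=None):
    """y = relu(F(x) + shortcut): F is a two-layer branch, shortcut is identity by default."""
    F = W2 @ relu(W1 @ x)                      # residual branch: the correction F(x)
    shortcut = x if W_s is None else W_s @ x   # projection shortcut when dims mismatch
    return relu(F + shortcut)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
# Zero residual weights: the block reduces to the identity (up to the final ReLU),
# so "copy the shallow solution" costs the optimizer nothing.
W1 = np.zeros((8, 8)); W2 = np.zeros((8, 8))
print(np.allclose(residual_block(x, W1, W2), relu(x)))  # True
```

A plain two-layer stack with zero weights would output all zeros, not the input; the shortcut is what makes “near-identity” the easy default.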
When reading the original paper, follow two threads: the degradation experiments that expose the optimization bottleneck, and how identity paths + residual branches turn “deeper is harder to train” into “deeper can be stacked stably.”
5. Limitations CNNs / ResNets still face
- Locality and receptive fields: standard convolutions are neighborhood-first; building strong dependencies between very distant pixels often needs many layers, dilated convolutions, or non-local modules—possible, but not the default prior.
- Reliance on a regular grid: 2D CNNs assume a pixel lattice; point clouds, graphs, or unstructured sets need different designs (or different backbones).
- The cost of translation invariance: pooling improves robustness to tiny shifts but can discard fine spatial detail critical for some tasks (segmentation and detection often use FPNs, dilated convs, etc.).
- Compute still scales with resolution: even with ResNet, high-resolution images remain a systems bottleneck.
6. Why did Transformers “replace” the default choice on many leaderboards?
NLP: language is a 1D discrete symbol sequence with ubiquitous long-range structure (coreference, ellipsis, cross-sentence reasoning). RNNs are hard to parallelize; CNNs used as sequence models need depth to grow effective receptive fields (as Olah and later work discuss). Self-attention can connect any two positions with direct weights in one layer, and its dominant operations parallelize well; together with large-scale pretraining, Transformers became the de facto standard for language modeling (see Understanding Transformer).
Vision: ViT-style models treat image patches as tokens and use Transformers to mix globally, competing with or beating CNNs when data, regularization, and training recipes are strong; Swin, ConvNeXt, and many hybrids revisit convolutional inductive bias. A fair summary: the community’s first architecture to try shifted from “pure CNN” toward “Transformer or hybrid,” while convolutions remain pervasive on mobile, real-time detection, embedded systems, and many production pipelines.
Relation to ResNet: Transformer blocks still rely heavily on residual connections + LayerNorm—the residual idea is broader than CNNs—but the shift from local filtering to global token mixing changes who assumes locality up front.
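Both claims above — any two positions interact in one layer, and Transformer blocks reuse the residual idea — fit in one short numpy sketch (single head, pre-norm variant, toy sizes; no learned gain/bias in the LayerNorm, all of which are simplifying assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every pair of positions interacts in one layer."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # all pairwise position scores
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # row-wise softmax
    return w @ V

def transformer_sublayer(X, Wq, Wk, Wv):
    """x + Sublayer(LayerNorm(x)): the residual shortcut, reused around token mixing."""
    return X + self_attention(layer_norm(X), Wq, Wk, Wv)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))                     # 5 tokens, 4-dim embeddings (toy)
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
print(transformer_sublayer(X, Wq, Wk, Wv).shape)    # (5, 4)
```

Contrast with the conv sketch earlier: there, position 0 reaches position 4 only after several layers of local mixing; here the `scores` matrix connects them directly at depth 1, while the `X + …` shortcut is ResNet’s contribution carried over unchanged.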
References
- Olah, C. (2014). Conv Nets: A Modular Perspective.
- Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NIPS.
- He, K., Zhang, X., Ren, S. & Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385 (CVPR 2016).
- Lin, M., Chen, Q. & Yan, S. (2013). Network In Network. ICLR.
- Vaswani, A., et al. (2017). Attention Is All You Need.
- Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT).
ResNet BibTeX (CVPR 2016)
@inproceedings{he2016deep,
title={Deep residual learning for image recognition},
author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
pages={770--778},
year={2016}
}