Activation Functions: The Bend In Deep Learning

A neuron computes w·x + b. This math creates a straight line.

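If you stack two linear layers, the math collapses. Leaving out the biases for brevity: Layer 2(Layer 1(x)) = W2(W1x) = (W2W1)x. With biases you get (W2W1)x + (W2b1 + b2), which is still one weight matrix and one bias.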

Two layers become one single linear layer. A 100-layer network without activation functions remains one straight line. You cannot use a straight line to process images or language.
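
You can check the collapse yourself. Here is a minimal sketch in plain Python, assuming two small 2x2 weight matrices with made-up values and no biases. The helper names matvec and matmul are only for this example.

```python
# Two "layers" as plain 2x2 weight matrices.
# The values are arbitrary, chosen only for illustration.
W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [[0.5, -1.0], [2.0, 0.0]]

def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

x = [1.0, -2.0]

stacked = matvec(W2, matvec(W1, x))    # Layer 2(Layer 1(x))
collapsed = matvec(matmul(W2, W1), x)  # (W2W1)x, one single layer

print(stacked)    # [3.5, -6.0]
print(collapsed)  # [3.5, -6.0]
```

Both paths give the same output, so the second layer added no new capability.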

Activation functions add a bend to the math. Each layer warps space. Stacking enough of these bends lets a neural network approximate almost any continuous shape, not just lines.
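
To see one bend in action, here is a small sketch. It assumes a hypothetical tiny network (tiny_net) with two hidden neurons whose weights are +1 and -1, chosen by hand rather than trained. With ReLU in between, the output is a V shape. Without it, the same weights cancel out to a flat line.

```python
def relu(z):
    """ReLU activation: pass positives through, clamp negatives to zero."""
    return max(0.0, z)

def tiny_net(x):
    # Hidden layer: two neurons with hand-picked weights +1 and -1, no bias.
    h1 = relu(1.0 * x)
    h2 = relu(-1.0 * x)
    # Output layer: sum the two hidden activations.
    return h1 + h2

def tiny_net_no_activation(x):
    # Same weights, but with the ReLU removed.
    return (1.0 * x) + (-1.0 * x)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
bent = [tiny_net(x) for x in xs]
flat = [tiny_net_no_activation(x) for x in xs]

print(bent)  # [2.0, 1.0, 0.0, 1.0, 2.0]
print(flat)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The bent output is |x|, a shape no single straight line can produce.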

Common activation functions:

• ReLU: The modern standard. It uses Math.max(0, z). It is fast and helps deep networks train.
• Sigmoid: Used for output probabilities between 0 and 1.
• Tanh: Maps values between -1 and 1.
• Leaky ReLU: Fixes the problem where neurons stay stuck at zero. It uses a small slope for negative values.
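
The four functions above can be sketched in a few lines of plain Python. This is an illustrative version, not a library implementation. The Leaky ReLU slope of 0.01 is an assumption (a commonly used value); the post does not specify one.

```python
import math

def relu(z):
    """max(0, z): zero for negatives, identity for positives."""
    return max(0.0, z)

def sigmoid(z):
    """Squash z into (0, 1). Branching on the sign avoids overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    """Squash z into (-1, 1)."""
    return math.tanh(z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but negatives keep a small slope (0.01 is an assumed default)."""
    return z if z > 0 else slope * z

for z in [-3.0, 0.0, 3.0]:
    print(z, relu(z), sigmoid(z), tanh(z), leaky_relu(z))
```

Notice that relu(-3.0) is exactly zero while leaky_relu(-3.0) stays slightly negative, which is the small slope that keeps a neuron from getting stuck.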

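How to choose your function:

• Hidden layers: start with ReLU.
• Neurons stuck at zero: switch to Leaky ReLU.
• Output that must be a probability between 0 and 1: use Sigmoid.
• Values that must sit between -1 and 1: use Tanh.

These choices cover most networks you will build.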

Interactive tool to see how curves work: https://dev48v.infy.uk/dl/day2-activations.html

Day 2 of DeepLearningFromZero.

Full post: https://dev.to/dev48v/activation-functions-why-a-100-layer-network-without-them-is-still-one-line-ef6