Neural Network from Scratch

Overview

In our previous blog, "Understanding Tensors in PyTorch," we explored the basics of tensors, which are fundamental and powerful data structures in neural networks. That post laid a strong foundation for learning how to build a neural network from scratch.

In this section, we will take a deeper dive into the building blocks of neural networks. We will start by understanding the dot product, a crucial mathematical operation in neural networks. Then, we will walk through coding a neural network from scratch, step by step.

We will cover the following key topics:

1. Basic Building Blocks: Understanding the essential components that make up a neural network.

2. Dot Product: Learning how the dot product works and why it is important in neural network computations.

3. Neural Network Coding: Implementing a simple neural network from scratch, including forward propagation, activation functions, and backpropagation for training.

By the end of this section, you will have a solid understanding of how neural networks operate at a fundamental level and be able to create your own basic neural network model from scratch.

Let's continue our journey into the fascinating world of neural networks and unlock the potential of deep learning!

Introduction

Neural networks are one of the most popular machine learning algorithms. They have been successfully used in many areas like image classification and time series forecasting, making them valuable in both business and research. Understanding how neural networks work is crucial because it helps diagnose issues and lays the groundwork for learning more advanced deep learning algorithms.

In this article, we will break down how a neural network works. We will go through the algorithm step-by-step and show you how to set up a simple neural network in PyTorch.

Combination of Functions

Let's start with some basic concepts. Imagine two simple linear functions:

   z1 = -1.75x - 0.1
   z2 = 0.172x + 0.15

Here, the coefficients (-1.75, -0.1, 0.172, 0.15) are chosen just for illustration. Now, we define two new functions, a1 and a2, by passing z1 and z2 through a special function called the sigmoid function:

   a1 = σ(z1)
   a2 = σ(z2)

The sigmoid function, σ(x) = 1 / (1 + e^(-x)), creates an S-shaped curve and plays a key role in neural networks.

Next, we define another function that combines a1 and a2 linearly:

   z3 = 0.25a1 + 0.5a2 + 0.2

Again, the coefficients (0.25, 0.5, 0.2) are arbitrarily chosen. By combining these functions, we create more complex functions capable of capturing intricate patterns. For example, the combined function z3 looks more complex than a1 or a2.

[Figure: plot of the combined function z3, a more complex curve than a1 or a2]

If we apply the sigmoid function to z3 and combine it with another similar function, we can create even more complex patterns. By adjusting the coefficients, we can fit our final function to complex data sets. This is the basic idea behind neural networks: combining simpler functions to represent complicated variations in data.
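
To make this concrete, here is a minimal sketch of the composition in Python, assuming the coefficients above pair up as z1 = -1.75x - 0.1 and z2 = 0.172x + 0.15:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 200)

z1 = -1.75 * x - 0.1               # first linear function
z2 = 0.172 * x + 0.15              # second linear function
a1, a2 = sigmoid(z1), sigmoid(z2)  # pass each through the sigmoid

z3 = 0.25 * a1 + 0.5 * a2 + 0.2    # linear combination of the activations
```

Plotting z3 against x (for example with matplotlib) shows the richer curve described above.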

Now, let's delve into the framework of a neural network and see how it all comes together.

Forward Pass

Now, let's introduce the concept of a forward pass in a neural network. The forward pass is the process of moving input data through the network to get an output. This involves multiple layers, each performing specific calculations on the data and passing the results to the next layer.

Here is an example:

[Figure: forward pass: training data X → layer (weights, biases) → activation function → predictions → loss → optimizer]

As shown in the diagram, we start with our training data X. This data is fed into the first layer of our neural network. Initially, this layer uses randomly assigned weights and biases. These weights and biases are the parameters that our model will learn and optimize during the training process.

The output from this layer is then passed through an activation function. The activation function introduces non-linearity into the model, allowing it to learn and represent complex patterns in the data. Common activation functions include the sigmoid, ReLU, and tanh functions.

After applying the activation function, we get a set of predictions from the neural network. These predictions are then compared to the true values (the actual labels or outcomes we want to predict) to calculate the loss. The loss is a measure of how far off our predictions are from the true values. A common loss function for classification tasks is cross-entropy loss, while mean squared error is often used for regression tasks.

The optimizer then comes into play. The optimizer uses the calculated loss to adjust the weights and biases in the neural network, with the goal of reducing the loss in future iterations. Popular optimization algorithms include gradient descent, Adam, and RMSprop. The optimizer updates the weights by computing the gradient of the loss function with respect to each weight and bias, and then making adjustments in the direction that minimizes the loss.

This sequence of passing data through the layers, applying activation functions, and calculating the loss is known as the forward pass. Together with the weight updates that follow, it is repeated for many iterations, or epochs, allowing the neural network to learn and improve its performance over time.
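
To ground this, here is a minimal sketch of one such iteration in PyTorch, assuming a toy regression dataset and a single linear layer (all names are illustrative):

```python
import torch
import torch.nn as nn

X = torch.randn(8, 4)                 # 8 samples, 4 features
y_true = torch.randn(8, 1)            # toy target values

layer = nn.Linear(4, 1)               # starts with random weights and biases
activation = nn.Sigmoid()             # introduces non-linearity
loss_fn = nn.MSELoss()                # mean squared error for regression
optimizer = torch.optim.SGD(layer.parameters(), lr=0.01)

# Forward pass: data -> layer -> activation -> predictions -> loss
y_pred = activation(layer(X))
loss = loss_fn(y_pred, y_true)

# Compute gradients and adjust the weights to reduce the loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```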

By continually optimizing the weights and biases, the neural network becomes better at making accurate predictions. This iterative process of training helps the model generalize well to new, unseen data.

How to Update Weights?

To understand how weights are updated in a neural network, let's revisit what we got from our forward pass: the loss, or error, which measures the difference between our predicted values and the true values. In the diagram below, this difference is plotted along the X-axis, while the Y-axis corresponds to the loss. Because we're squaring the differences, the loss function curve takes on a U-shape.

The objective of our optimizer is to minimize this loss, ideally bringing its value as close to zero as possible. In the real world, the loss won't be exactly zero, but we aim for values close to it, like the points L1 or L2 shown in the diagram. If we get a loss value of L1, we need to increase the difference to reach the minimum; conversely, if we get a loss value of L2, we need to decrease the difference.

[Figure: U-shaped loss curve with points L1 and L2 on either side of the minimum]

To achieve this, we need to delve into a bit of mathematics. We can place a tangent (shown as a red line) on the curve of the loss function. The slope of this tangent is the gradient, the derivative of the loss function, and following it downhill is called gradient descent. We use these gradients to update the weights: regardless of whether we're at L1 or L2, the gradient will guide us in the right direction to minimize the loss.

If you are far from the optimal minimum point, your gradient will be larger, indicating a larger error: larger errors result in larger absolute gradients. This is where the concept of the learning rate comes into play.

Let's consider the scenario where our error is at L1. Without a learning rate, we might overshoot the minimum point and jump to the other side, reaching L2, then oscillate between L1 and L2 and never reach the minimum. To handle this problem, we introduce something called the learning rate.

[Figure: weight updates scaled by the learning rate stepping down the loss curve]

   W(new) = W(old) − LR × dL/dW

The learning rate controls how much we adjust the weights with respect to the gradient. Instead of subtracting the entire gradient from the current weight, we multiply the gradient by the learning rate and then subtract this product from the weights. Typical learning rates are values like 0.01, 0.001, and so on.

Choosing the right learning rate is crucial. A learning rate that is too high can cause the model to converge too quickly to a suboptimal solution, or even diverge, while a learning rate that is too low can make the training process very slow.

In summary, the process of updating weights involves:

1. Calculating the gradient of the loss function with respect to the weights.

2. Multiplying this gradient by the learning rate.

3. Subtracting the result from the current weights to get the updated weights.

This process ensures that we make small, controlled steps towards minimizing the loss, ultimately leading to a well-trained neural network.
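
Here is a minimal sketch of these three steps on a toy loss, assuming L(w) = (w - 3)^2 so that the gradient dL/dw = 2(w - 3) is known in closed form:

```python
w = 0.0     # starting weight
lr = 0.1    # learning rate

for step in range(50):
    grad = 2 * (w - 3)   # step 1: gradient of the loss at the current weight
    w = w - lr * grad    # steps 2 and 3: scale by the learning rate, then subtract

print(round(w, 4))       # approaches 3.0, the minimum of the toy loss
```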

Chain Rule

In the previous section, we discussed how gradients and learning rates are used to update the weights in a neural network. Now, let's dive into the chain rule, a crucial concept for calculating gradients in neural networks, particularly during the backward pass.

Imagine we have an input vector X, and this is passed through a function y = f(X). This function y is dependent on X. Then, we pass y through another function to get z = g(y), which makes z dependent on y. Our goal is to find the derivative of z with respect to X, denoted as dz/dX.

The chain rule helps us achieve this. According to the chain rule, to get dz/dX, we first compute the derivative of y with respect to X, denoted as dy/dX. Next, we calculate the derivative of z with respect to y, denoted as dz/dy. Finally, we multiply these two derivatives to get dz/dX:

   dz/dX = dz/dy × dy/dX

[Figure: X → y = f(X) → z = g(y)]

Let's break this down with an example. Consider the above diagram: here, X is our input vector, y is the output of the first function f(X), and z is the output of the second function g(y).

1. Calculate dy/dX: Suppose f(X) = X^2. The derivative of y with respect to X is dy/dX = 2X.

2. Calculate dz/dy: Now, let's say g(y) = sin(y). The derivative of z with respect to y is dz/dy = cos(y).

3. Apply the Chain Rule: To find dz/dX, we multiply these two derivatives:

      dz/dX = cos(y) × 2X

However, since y = X^2, we substitute y back in:

      dz/dX = cos(X^2) × 2X
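
We can check this result numerically with PyTorch's autograd, which applies the chain rule for us during the backward pass (a minimal sketch with an arbitrary input value):

```python
import torch

X = torch.tensor(1.5, requires_grad=True)
y = X ** 2           # y = f(X) = X^2
z = torch.sin(y)     # z = g(y) = sin(y)
z.backward()         # autograd computes dz/dX via the chain rule

manual = torch.cos(torch.tensor(1.5) ** 2) * 2 * 1.5   # cos(X^2) * 2X
print(X.grad.item(), manual.item())                    # both values match
```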

This process of applying the chain rule is exactly what we do during the backward pass in a neural network. During backpropagation, we start from the output layer and work our way back to the input layer, calculating gradients at each step using the chain rule. These gradients are then used to update the weights, as discussed previously.

In summary, the chain rule allows us to decompose the derivative of a composite function into the product of derivatives of its constituent functions. This is a fundamental aspect of training neural networks, enabling us to efficiently compute the gradients necessary for weight updates.

Implementing the Backward Pass

Now that we have a solid understanding of the chain rule, let's see how it applies to our neural network, specifically in the backward pass. We will walk through the steps of calculating gradients and updating weights and biases based on our loss function.

We start with our input data X and pass it through the network layer by layer, using our initial random weights and biases. After passing through the activation functions, we obtain our predicted values y_pred. We then calculate the loss by comparing y_pred with the true values.

The backward pass involves calculating the gradients of the loss with respect to each parameter (weights and biases) and then updating these parameters to minimize the loss. Here is how it works step-by-step:

1. Calculate Loss Derivative: We start by calculating the derivative of the loss L with respect to the predicted output y_pred, denoted as dL/dy_pred. This tells us how the loss changes with changes in the predicted values.

2. Activation Layer Gradients: Next, we calculate the derivative of the predicted output y_pred with respect to the output of the previous layer l1, denoted as dy_pred/dl1. This derivative depends on the activation function used in the output layer.

3. Layer 1 Gradients: We then compute the derivative of the output of the previous layer l1 with respect to its weights w, denoted as dl1/dw. This tells us how changes in the weights affect the output of the layer.

4. Chain Rule Application: To find the final gradient of the loss with respect to the weights, we apply the chain rule:

   dL/dw = dL/dy_pred × dy_pred/dl1 × dl1/dw

This product gives us the gradient we need to update the weights.

Weight Updates:

[Figure: gradient chain used for the weight update]

5. Bias Gradients: Similarly, we need to calculate the gradients for the biases. The steps are the same as for the weights:

   dL/db = dL/dy_pred × dy_pred/dl1 × dl1/db

[Figure: gradient chain used for the bias update]

6. Update Weights and Biases: Finally, we update the weights and biases using the calculated gradients. This is typically done using an optimizer, such as stochastic gradient descent (SGD). The weights and biases are adjusted in the direction that reduces the loss:

   w(new) = w(old) − η × dL/dw

   b(new) = b(old) − η × dL/db

Here, η is the learning rate, a hyperparameter that controls the step size of the updates.

The backward pass is all about propagating the error backward through the network, calculating gradients at each step, and updating the weights and biases to minimize the loss. By doing this iteratively, we train the neural network to make better predictions.
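
Putting these steps together, here is a minimal sketch of one forward and backward pass in plain NumPy, assuming a single sigmoid layer with mean-squared-error loss (shapes and names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))        # 4 samples, 3 features
y_true = rng.random(size=(4, 1))   # toy targets
w = rng.normal(size=(3, 1))        # random initial weights
b = np.zeros(1)                    # bias
eta = 0.01                         # learning rate

# Forward pass
l1 = X @ w + b                     # linear layer output
y_pred = sigmoid(l1)               # activation
loss = np.mean((y_pred - y_true) ** 2)

# Backward pass, following steps 1-4 above
dL_dypred = 2 * (y_pred - y_true) / y_true.size   # step 1: dL/dy_pred
dypred_dl1 = y_pred * (1 - y_pred)                # step 2: sigmoid derivative
dL_dl1 = dL_dypred * dypred_dl1
dL_dw = X.T @ dL_dl1               # steps 3-4: dl1/dw = X, chained together
dL_db = dL_dl1.sum(axis=0)         # dl1/db = 1

# Step 6: update weights and biases
w = w - eta * dL_dw
b = b - eta * dL_db
```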

Understanding the Dot Product

There is one more important concept we need to cover in this blog: the dot product. It is a key idea used in neural networks to help map input data to output predictions by adapting the weights.

Imagine you have some input data X and two different sets of weights. The question is: which set of weights is more similar to the input X? This is where the dot product comes in handy.

What is the Dot Product?

The dot product is a way to multiply two vectors (like our input data and weights) to see how similar they are. It is a simple calculation where we multiply corresponding elements of the vectors and then add those products together.
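
For example, here is a minimal sketch comparing two hypothetical weight vectors against the same input X:

```python
import numpy as np

X = np.array([0.5, 1.0, -0.2])
w1 = np.array([0.4, 0.9, -0.1])    # points in a similar direction to X
w2 = np.array([-0.5, 0.1, 0.8])    # points in a different direction

print(np.dot(X, w1))   # 0.5*0.4 + 1.0*0.9 + (-0.2)*(-0.1) = 1.12
print(np.dot(X, w2))   # 0.5*(-0.5) + 1.0*0.1 + (-0.2)*0.8 = -0.31
```

The larger dot product for w1 tells us it is more aligned with the input than w2.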

Why Use the Dot Product?

In neural networks, the dot product helps us determine which set of weights is better at mapping our input data to the desired output. By comparing the dot products of different weight vectors with the input X, we can see which weights are more similar to the input.

The weights that produce a higher dot product are more aligned with the input data, meaning they are more likely to be correct. During training, we adjust the weights to maximize this alignment, which helps the network make more accurate predictions.

In summary, the dot product is a useful tool for measuring the similarity between input data and weights. By using it, we can adapt the weights to better map the inputs to the outputs, improving the performance of our neural network.

Conclusion

In this blog, we have explored the foundational concepts of neural networks from scratch. We have covered the basics of how neural networks work, including forward and backward passes, and the importance of the dot product in adjusting weights to map input data to outputs. We have also discussed key concepts like gradient descent and the chain rule, which are crucial for understanding how neural networks learn and optimize.

By breaking down these core ideas, we have built a solid foundation for understanding neural networks. In our next blog, we will take these concepts further by implementing a neural network from scratch using PyTorch. This will allow us to see these principles in action and understand how to apply them in real-world scenarios.

Stay tuned for our next blog, where we dive into the PyTorch implementation of neural networks!

Written By

Impetus Ai Solutions

Impetus is a pioneer in AI and ML, specializing in developing cutting-edge solutions that drive innovation and efficiency. Our expertise extends to product engineering, warranty management, and building robust cloud infrastructures. We leverage advanced AI and ML techniques to provide state-of-the-art technological and IT-related services, ensuring our clients stay ahead in the digital era.
