Machine Learning

Machine Learning is the field of study that gives computers the ability to learn without explicitly being programmed. The program learns from experience E with respect to task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Supervised Learning vs Unsupervised Learning

In supervised learning, we use examples to directly teach the computer. In unsupervised, we allow the computer to learn on it’s own.

Supervised Learning

The most common type of machine learning, this method involves giving your learning model the “right answers”. These come in the form of labels that explicitly detail attributes about each data point. The algorithm is then tasked with finding more “right answers” in that you are trying to predict the attributes for a new data point. This is often known as regression when used with continuous data.

Supervised models are also good at classification problems. Instead of continuous data like prices, you can have discreet data, like yes or no responses to a question. Of course, you can have more than two possible answers in a classification problem, if the responses can fall in more than two buckets.

Additionally, you can have multiple attributes which you are using to create your prediction. The algorithm will then try to cluster the examples and predict probabilities for different cases. More interesting learning algorithms (like the Support Vector Machine) can deal with an infinite amount of features (or data attributes).

Unsupervised Learning

In unsupervised learning, our data has no labels, and we ask the computer to find structure in the data. Often times, the data will be grouped into like clusters, so we call these clustering algorithms.

Another type of unsupervised learning are cocktail party algorithms, so called because the concept you are trying to capture is separating individuals at a cocktail party. If everyone is speaking at the same time, it is difficult to differentiate who is saying what. This class of algorithms can separate combined sources and output the different sources individually. This works best when clustering wouldn’t fit the data.

Learning Models and Cost Functions

Supervised Learning Models

Linear Regression (univariate, or one variable)

We’ve looked at this before, it is a supervised learning model that finds continuous real values. The data set we use to fit the model is called the training set. Equation notation includes:

  • m = Number of training examples
  • x’s = input variables / features
  • y’s = output variables / features
  • (x,y) = single training example
  • (xi,yi) = a specific training example, as denoted by row in the training set. So x3 is the x value in the third row of the data.
  • h = Function derived from learning model, called hypothesis. h is a function that maps from x to y

Input and output is relative to the question being asked. When choosing an algorithm, you must decide how to represent h

Cost Function

The cost function will allow us to fit the best regression line to our data.

hѲ (x) = Ѳ0 + Ѳ1x

Our goal is to find Ѳ0 and Ѳ1 such that the line drawn by the function fits the data. How do we choose these parameters? We use our training set to find h Ѳ(x) that is close to y for our training examples (x,y). Our training data gives us the x and y, so can we use that data to predict the values of y if given x?

Yes, we want to minimize the (h(x) -y)2. That is, we want to minimize the square of the difference between our prediction (h(x)) and actual values (y).

J( Ѳ0, Ѳ1) is the cost function that we want to minimize. This is also called a squared error function because all of these equations in this block are equal. This is a great choice for any regression problem. There are others that will work, but this is the most commonly used function for regression problems.

Hypothesis Function hѲ (x) vs Cost Function J( Ѳ0, Ѳ1)

The hypothesis is a function of X, it changes as the size of X changes. The Cost function is a parameter of Ѳ1, so it’s slope derived from Ѳ1

Gradient Descent

Gradient descent is an algorithm for minimizing Cost Function J. It is a general algorithm that is used all over machine learning, not just linear regression. Gradient descent can be used on any number of parameters, but we will focus on only two parameters at this time, Ѳ0, Ѳ1

We will start with initial guesses for Ѳ0, Ѳ1, common choice is to set both to 0. Then we keep changing Ѳ0, Ѳ1 a small about to reduce J(Ѳ0, Ѳ1). Think of ourselves standing on a hill and looking all around. After our survey, we make a small step in the direction that will most quickly lead to us dismounting the hill. This is, in essence, Gradient Descent.

Gradient Descent Algorithm

Ѳj := Ѳj – a(A/A * Ѳj)* J(Ѳ0, Ѳ1)

The first thing you will notice is the := operator, which is an assignment operator. Similar to programming, this operator takes whichever value is in the left and applies it to the value on the right of the operator.

The next thing you’ll see is a for alpha. You’re alpha is your learning rate and corresponds to the size of the steps you are taking as you descend the hill. Large alpha is aggressive and takes large steps downhill.

The middle term is the derivative term. The derivative deals with the slope of a line that touches but does not cross the cost function. If we simplify and think of one parameter, the cost function creates an upside-down paradola. This U shape has the cost of the function mapped across the graph. The minimum of the graph will be somewhere in the middle.

If we start at a point on the right side of the graph and draw a tangential line, that line will have a positive slope. The intuition you need to walk away with is that if we have a positive slope, then the derivative is positive and our Ѳ1 will be updated to be less than it is. If you start on the other side of the graph, any tangent will have a negative slope and our derivative will be negative. When we multiply this by the learning rate, we then subtract the negative term, thus increasing Ѳ1 as we move towards the local minimum.

The learning rate, then, decides how large the steps are as you increase or decrease Ѳ1 as you work towards the minimum. If your learning rate is too large, you might skip over the local minimum as you work towards it. You can either fail to converge or even diverge (cost function moves away from minimum) If your rate it too small, it will take you forever (slow gradient descent) to move from one end of the graph to the local minimum.

Important Note – As you approach minimum, the derivative moves closer to 0, which is the point of local minimum. Also note that the size of steps aren’t fixed, even though learning rate is. As you move away from local minimum, the slope of your function increases. Higher slopes mean larger steps, even as the learning rate is fixed. As you move towards local minimum, the size of your steps becomes smaller as well. This means we don’t need to decrease alpha over time to compensate, the algorithm does this for us.

It is also important to note that you are updating Ѳ0 and Ѳ1 simultaneously, updating them with new values at the same time. That is, you compute the right side for 0, then compute the right side for 1, then update the values simultaneously.

Gradient Descent for Linear Regression

Sometimes called Batch Gradient Descent, linear regression will always have only one local minimum, called the global minimum. Called batch because you sum up the differences in all training examples in each step of gradient descent. There are some algorithms that look at small subsets of the data, we’ll deal with these later.

repeat until convergence: {
θ1:= θ1−α1/mi=1m(((xi)−yi)xi)

where m is the size of the training set θ0​ a constant that will be changing simultaneously with θ1​ and xi​, yi​ are values of the given training set (data).

Matricies and Vectors

A matrix is a rectangular array of numbers written in square brackets. It has 2 dimensions and it’s dimensions are written as then number of rows x the number of columns. You can refer to specific entries in an array by first giving it a variable name (A) and then referring to it’s specific row (ith row) and column (jth column). So Aij refers to the entry in matrix A at the i row and j column.

A vector is a special case of a matrix that has only one column. Refering to elements in the vector is done by giving the vector a variable name (y) and then refering to the specific row. So y3 refers to the third item in the vector.

Important to note that your vector index can start at either 1 or 0. As a computer programmer, I will start my vectors at 0, but mathematicians might feel more comfortable labeling the first item in a vector as y1.

Lastly, as you may have noticed, it is convention to use uppercase letters when referring to matrices and lower case letters when referring to vectors.

Addition and Scalar Multiplication

When adding matricies, you simply add the corresponding rows and columns together:

You can only add matricies that are of the same dimension, in our example 2(rows) x 2(columns). The result is another matrix of the same dimension.

When multiplying matricies by scalars (numbers), you multiply each item in the matrix by the number:

Once again, the result will be a matrix with the same dimensions as the one used to multiply the scalar.

Scalar division is the same as multiplying by 1/n. You divide (or multiply) each item in the matrix and the result is a new matrix with the same dimension:

You can also take these operations and combine them together:

the above would result in a vector of [2, 12, 10.33]. You perform the multiplication first, one the left side, then the division on the right side, thus creating new vectors. Then you perform the addition/subtraction sequentially, so the first row is 3 + 0 – 1 = 2 and the second row is 12 + 0 – 0 = 12.

Matrix-vector Multiplication

When multiplying a matrix and a vector, note that we do not need equal dimensions. The result will always be a vector with the same number of rows that was in the matrix. The number of columns in the matrix must match the number of rows in the vector.

To multiply a matrix times a vector, you will multiply each row of the matrix by the column in the vector. Within this, you will multiply the first item in the first row of the matrix by the first item in the vector, then multiply the second item in the first row of the vector by the second item in the vector. Do this for all items in the row of the matrix, then sum the products and place that sum as the first item in the output vector. Continue until all spots in output vector are filled in.

Matrix by vector multiplication

Matrix – Matrix Multiplication

You can multiply matrices but they must have a match between their rows and columns. That is, one matrix must have either the same number of either rows or columns as the other matrix to multiply them together. You then want to set the matrices up in order with the matching field showing up on the left. So, if you have a 3 row and 2 column matrix and want to multiply it, you will need a matrix with exactly 2 rows, and however many columns.

What you do is take the first column of matrix B and vector-matrix multiply that with matrix A. Copy this resulting vector into the first column of your resultant matrix. Then take the next column in matrix B and multiply that by matrix A, resulting in another vector that is placed as the next column in the resultant matrix. Continue this until resultant matrix is filled in.

How to Multiply Matrices

Properties of Matrix Multiplication

Matrix multiplication is not commutative in that you can simply rearrange the matrices an still multiply them. Put simply, A * B does NOT equal B * A. You are not allowed to arbitrarily change the order of your matrices. Changing the order can lead to drastically different results, and even an output matrix of different dimension.

Matrix multiplication is associative. You CAN group them in different ways, in that if you have A * B * C, you can proceed however you like. You can (A * B) * C or you can A * (B * C), they will create the same product.

Identity Matrix

Denoted by I, the identity matrix will return anything it is multiplied by. It is typified by a series of diagonal 1’s starting from the top left and ending in the bottom right and 0’s everywhere else, for whatever dimension the matrix has. Must be square with equal dimensions. Written plainly: A * I = I * A = A . Note that the identity matrix must have the same number of rows as the matrix you are multiplying it with. This property also allows for the commutative property to apply to matrix multiplication, if one of the matrices is the identity matrix.

Inverse and Transpose

In real numbers, an inverse is a number that you can multiply by to return one from any real number. So 3 * (3-1) = 1 because 3-1 = 1/3. Similarly, 12 -1 = 1/12. Important to note that not all numbers have inverse, for instance 0-1 is undefined.

Matrix inverse occurs on m x m (square matrix, and only square) matrix and IF it has an inverse is written as A-1. When you multiply A with A-1, you get the identity matrix. You can use software to compute the inverse of a matrix. Any matrix that doesn’t have an inverse is a singular matrix, or degenerate matrix

You can also transpose a matrix by converting its columns to rows. Or, said differently, take the first row and make that entire row the first column of your new matrix. The transpose will always have opposite dimensions from the original matrix.

Multivariate Linear Regression

This will allow us to perform regression with multiple features. For example, instead of basing estimated housing prices on size alone, we can estimate the price based on number of floors, age of home, garage, etc. This should give us more prediction information to base our estimate on.

When setting up the regression, you will refer to your features (also called variables or inputs) as X1, X2, … Xi . The output variable we are trying to predict is still called Y. n is our number of features (use lowercase n). We then refer to an individual training example as X(i) , for instance, if you can refer to the features of your second example as X(2).

Using this, you can make a vector of all the features, and that is also X(2) . The superscript 2 is an index that tells you to look at the second example of the training set. You can refer to an individual feature within the training set. For example, the third feature of the second training example can be referred to with X3(2) .

Gradient Descent

Feature Scaling

If you have multiple features, it is important to make sure that the features are on a similar scale (similar range of values) to make Gradient Descent happen more quickly.

For instance, if you have data on house sizes and number of bedrooms, the house sizes will be in hundreds of square feet and the room numbers will be 1 -5 or maybe 6. These different scales of data will result in countour curves that are skinny and oval in design. Gradient descent on these contour curves will take longer as the algorithm will oscillate back and forth on the skinny, tall ovals.

You can instead scale the features by expressing them as a product of their size divided by their range. For square feet, you would take the actual size and divide it by the range of values( maximum value in training set minus minimum value in training set. So, if the house was actually 10,000 square feet, your X1 would be 1000/2000 or 0.5.

You would then scale the number of rooms by a range number (maximum minus minimum value in training set), say 5. You can also take the standard deviation of the data if it is all over the map and use that as the range number. Then, the house with 10 rooms would have an X2 of 3/5 or 0.6. You can see how this places the disparate data points into similar ranges.

In fact, your goal is to scale every feature into a approximate range of -1 < xi < 1, between -1 and 1, or between 0 and 1 in my cases. I’m not really sure when the feature scale would be negative. Remember, you want to be close enough, not exactly between those ranges. Your tolerance will be -3 < xi < 3, between -3 and 3, the closer to 1 the better.

Feature scaling should create contours that are more circular, making gradient descent easier.

Another options is mean normalization, when you would subtract by the average value before you divide by a common range. For instance, your x1 would be the actual size – the average size for the feature / common size for the feature. X1 = size – 1000 / 2000.

Your goal is to take make Xi fit between -0.5 < x < 0.5

Learning Rate

When performing gradient descent, each iteration (or batch of iterations) should lower the theta in cost function J, so a quick way to check if your gradient descent is working is to plot theta at milestones in your iteration count. Plot theta on a graph at 100 iterations, then at 200 iterations, then at 3000 iterations and so on. You should see the value of theta decrease each time, and at some point the value will stop decreasing so much. This tells you that you are reaching the point of convergence, the minimum theta for cost function J.

If your plots are going in the opposite and theta is increasing, then you want to start again using a smaller learning rate (alpha). If the values oscillate (go lower than higher), use a smaller learning rate (alpha). If it takes too long to reach a horizontal minimum in your plots, then the learning rate is a bit too slow, so you want to marginally increase alpha.

When beginning a gradient descent, try different values (.001 => .003 => .01 => .03 => .1 => .3 => 1 => 3) and plot their resultant theta over time to choose the best learning rate. This way, each alpha is 3X bigger than the others, and try to find the largest reasonable value that can result in convergence.

Features and Polynomial Regression

Given the data in your training set, it can sometimes be useful to combine or divide or multiply or subtract between the data, creating new data points and features. You want to use your intuition for this, but think what the data points truly represent and if you are asking the best questions given your data.

Another consideration is the linear form of the regression curve. It is possible that a straight line isn’t the best fit for our data. We can use other linear functions if they make more sense. Of particular note is the square root function below. It’s shape is similiar to many pricing data sets, so fitting a square root will offer lower costs than fitting a straight line. I don’t know how, but we can adjust our algorithm to use these different polynomial functions.

Normal Equation

Theta = (XT * X)-1 * XT * y

As opposed to using gradient descent to find the minimum theta, we can us the Normal Equation to solve for theta analytically. No iteration necessary. It has some other advantages and disadvantages.

To set up the normal equasion, you want to take your training data (tables of data) and convert the columns to rows (transpose) in you X matrix. You Y vector is then a vector of your training output. So, if you have housing data, each house would become a column in matrix X and the house prices would be vector Y.

Feature scaling isn’t necessary with the normal equation, so this is probably the method I prefer. You don’t need an alpha with the normal equation, and no need for iteration. However, the normal equation doesn’t work as fast if you have a lot of training features (n > 10,000).


There are times when XTX cannot be inversed, when it is called singular or degenerate. In our Octave assignements, this wont matter because the computer can handle this situation. If we had this occur and no computer to work through the data, there are 2 reasons why XTX would be non-invertible:

  1. Redundant Features – If you have 2 features that are linearly dependent on eachother (one moves as the other moves), you will have a problem. An example would be to have a house size in both meters and feet. These two columns in your table would be related and a change in one would result in a linear change in the other. Delete redundant features.
  2. Too Many Features – If you have more training features than you have training examples, you will have a problem. Suppose you have 10 training sets (m = 10) of housing data, but the data has 100 different house features (n = 100). This will cause a problem. To fix, remove some features.

Classification – Logistic Regression

For simple classification (binary classification), your data will have one of two values. These are often 0 and 1, with 0 being the negative class that does not have a feature you are looking for, while 1 is the positive class, and contains the feature.

Logistic Regression Model

We want our hypothesis to be between 0 and 1 for classifications.

Sigmoid Function (Logistic Function)

g(z) = 1/1+e-z is the sigmoid function, when combined with our hypothesis we receive this:

h(x) = 1/1+e-thetaTranspose(x)

The sigmoid function looks like an S curve:

Sigmoid function - Wikipedia

It asymtotes at both 0 and 1, making it perfect for classification.