Intro to AI

Artificial intelligence can be divided into two buckets:

Strong – A computer understanding the context of its environment and making observations based on this understanding. We do not have examples of this, as it is beyond our current capabilities.

Weak – All of our current uses of AI are examples of weak AI. It is fundamentally a pattern-matching system. It takes an input, identifies a pattern, and then matches that pattern to an expected output.

Think of weak AI as a person in a dimly lit room, with post-it notes of characters arranged on a wall, a translation book on a desk, and a mail slot in the door. When a character is slipped through the mail slot, the person then examines the character against the translation book. When the person finds a match, they then find the translation on the post-it notes and pass it back through the mail slot. You see, the person has no idea what the character means, only that they can match the input to an expected output.

Supervised vs Unsupervised Machine Learning

Supervised machine learning occurs when a data scientist labels the data that the computer model is learning from. This is often the case in binary classification problems, where the computer must sort the data into two individual buckets that are set up by the data scientist.

A common example would be repeat buyers. Businesses want to be able to target frequent purchasers directly, so they can use machine learning to allow the computer to search the customer database and classify customers as frequent/non-frequent based on criteria established by the data scientist.

Unsupervised machine learning, on the other hand, occurs when unstructured data is presented to the computer and the machine must decide into which bucket to place the data. There are often more than two possible outcomes, so this isn’t a classification problem but more of a clustering problem. The computer will cluster data records together based on similarities or patterns it finds in the data, often to the bewilderment of the researcher (the researcher might have no idea how the cluster was obtained or what patterns the computer is matching).

Reinforcement machine learning is another type of machine learning that does not fall neatly into the aforementioned buckets. It uses a built-in feedback mechanism to report the efficacy of the model's matching back to the computer. This allows the learning model to receive feedback on its decisions and update those decisions in the future.

A great example is when Netflix wanted to increase the rate of click-throughs for its recommendations. There is a huge business incentive in getting the viewer to keep watching, so the data scientists applied unsupervised machine learning techniques to the viewer statistics to determine clusters of interest for different viewer segments. If a viewer watched one of the videos from the cluster, then another video in the cluster would be recommended.

When a viewer then chooses a recommendation, a tiny vote is registered and the model learns that there is a strong link between the cluster and the recommendation. This is facilitated by the use of Q-Learning.

Q-Learning tracks interactions and increases the Q value when the user follows the recommendation. In essence, when the user clicks the recommendation, the Q value increases somewhat and the job of the model is to maximize the Q value across all the different viewer sessions.
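
As a rough illustration, here is a minimal Q-Learning sketch in Python. The states, videos, reward scheme, and parameter values are all invented for the example; Netflix's actual system is far more involved.

```python
import random
from collections import defaultdict

# Illustrative only: states are viewer clusters, actions are recommended videos,
# and the reward is 1 when the viewer clicks the recommendation.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

q_table = defaultdict(float)  # (state, action) -> Q value

def choose_action(state, actions):
    """Epsilon-greedy: usually pick the highest-Q recommendation, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def update_q(state, action, reward, next_state, actions):
    """Q-Learning update: nudge Q toward the reward plus the discounted best future value."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next - q_table[(state, action)])

# Example: a viewer in the (hypothetical) "sci-fi" cluster clicks the recommended video.
videos = ["video_42", "video_7", "video_99"]
action = choose_action("sci-fi", videos)
update_q("sci-fi", action, reward=1, next_state="sci-fi", actions=videos)
print(dict(q_table))  # the clicked recommendation's Q value has increased
```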

Common Algorithms

Think of your machine learning algorithms as knives in a chef’s drawer. Each knife has a very specific function, and while a given knife can often be pressed into service for a different task, the real efficiency comes from understanding how and when to apply the most precise tool for a particular job. The algorithms are the same: some are faster, some are more detailed, and some handle certain types of jobs better. It is important to understand what they are and how they are used so you can apply them to your business problems.

K-Nearest Neighbor (KNN) – Used for multi-class classification (not just binary), this algorithm plots new data and compares it to existing data. The more closely related the new data point is to its nearest neighbors, the stronger the classification. The Euclidean distance, the straight-line distance between two points (the square root of the sum of the squared differences across each feature), is used to measure how close two points are.

When using KNN, researchers often start with classification predictors, an early, simple classification using the most obvious data points. You place these predictors on the axes of a graph, then plot the training data points. Once the training data is plotted, the new example is placed on the graph. A value of K is chosen, which refers to the number of nearest neighbors to use for the grouping, and the resulting classification is decided by those neighbors.

For example, if trying to use KNN to match dog breeds, you could start with two data categories, hair length and weight. These categories would be placed on the axes of a graph, and you would then plot the data points. So, a chihuahua, with short hair and low weight, would sit at the bottom left of the graph, whereas a Saint Bernard (think Beethoven) would sit at the top right. Once all the dogs are plotted, the new dog you are trying to match is placed on the graph. Let's say this is a long-haired chihuahua: it would be high on the hair-length axis but low on the weight axis. A value of K is then chosen (typically a small odd number the data scientist tunes by testing a few values), let's say K = 5, and the 5 closest neighbors are used to create a cluster and categorize the new dog.
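
A small sketch of the dog example using scikit-learn's KNeighborsClassifier (assuming scikit-learn is available); the hair-length and weight numbers are made up.

```python
from sklearn.neighbors import KNeighborsClassifier

# Features: [hair length (cm), weight (kg)] -- invented values for illustration.
X_train = [
    [2, 2], [3, 3], [2, 4],       # chihuahuas: short hair, low weight
    [6, 70], [7, 75], [8, 80],    # Saint Bernards: long hair, heavy
]
y_train = ["chihuahua", "chihuahua", "chihuahua",
           "saint_bernard", "saint_bernard", "saint_bernard"]

# K = 5: classify by majority vote of the 5 nearest training points (Euclidean distance).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# A long-haired chihuahua: high on hair length, low on weight.
print(knn.predict([[7, 3]]))  # the 5 closest neighbors decide the label
```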

K-Means Clustering – This is an unsupervised machine learning algorithm that creates clusters based on what the computer sees as relationships in the data. First, the computer randomly chooses “centroids,” or example data points. The computer then finds data points that are closely related to each centroid and clusters those data points together.

If the algorithm is unable to identify a closely related cluster around a centroid, the computer will redistribute the centroid and try again, repeatedly reassigning points to their nearest centroid and moving each centroid to the center of its cluster until the assignments stop changing.
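
A minimal K-Means sketch, again assuming scikit-learn; the two blobs of random points stand in for whatever unstructured data the computer is handed.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose blobs of 2-D points the algorithm knows nothing about in advance.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[8, 8], scale=1.0, size=(50, 2)),
])

# Ask for 2 clusters: K-Means picks starting centroids, assigns each point to its
# nearest centroid, moves each centroid to the mean of its cluster, and repeats.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(kmeans.cluster_centers_)   # the final centroid positions
print(kmeans.labels_[:10])       # which cluster each point was placed in
```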

Regression Analysis – A supervised machine learning algorithm that matches predictors (also called independent variables or regressors) with expected outcomes. It is supervised because the training data is labeled with the correct output. The model is fit to that labeled data and then checked against test data to understand the relationships between the predictors and the outputs.
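
A short, hedged linear regression sketch with scikit-learn; the spend and sales figures are invented purely to show the fit-then-predict pattern.

```python
from sklearn.linear_model import LinearRegression

# Predictor: advertising spend; outcome: sales (made-up numbers).
X_train = [[10], [20], [30], [40], [50]]
y_train = [25, 45, 65, 85, 105]

model = LinearRegression().fit(X_train, y_train)
print(model.coef_, model.intercept_)   # slope and intercept learned from the labeled data
print(model.predict([[60]]))           # expected sales at a new spend level
```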

Naive Bayes Algorithm – To classify data points based on features in the data, you can use Naive Bayes (a supervised learning and classification algorithm). It assumes that all predictors are independent of each other. The algorithm calculates the probability that a data point belongs to each class by looking at each predictor individually and combining those individual probabilities.

For example, if we return to dog breeds, we can create three classes of dogs (supervised) based on three different characteristics (features). The computer can then determine the classification based on those features. So, if our features are hair length, weight, and height, our chihuahua will be in the short-hair grouping, the low-weight grouping, and the short-height grouping, respectively. Of course, some of these features are correlated (a short dog will tend to weigh less), but the Naive Bayes algorithm treats each feature independently.
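
A sketch of the three-breed example with scikit-learn's GaussianNB; the feature values are invented, and the point is only that each feature contributes to the class probability independently.

```python
from sklearn.naive_bayes import GaussianNB

# Features: [hair length (cm), weight (kg), height (cm)] -- invented values.
X_train = [
    [2, 2, 20], [3, 3, 22],       # chihuahua
    [5, 30, 60], [6, 32, 62],     # labrador
    [7, 75, 70], [8, 80, 72],     # saint bernard
]
y_train = ["chihuahua", "chihuahua", "labrador", "labrador",
           "saint_bernard", "saint_bernard"]

# Naive Bayes treats each feature independently when computing class probabilities,
# even though weight and height are clearly correlated.
nb = GaussianNB().fit(X_train, y_train)
print(nb.predict([[3, 2, 21]]))         # most likely class for a new dog
print(nb.predict_proba([[3, 2, 21]]))   # per-class probabilities
```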

Choosing the Right Algorithm

The different algorithms are useful for different aspects of machine learning projects. It is common for data scientists to use multiple algorithms in conjunction, a practice called Ensemble Modeling. There are a few different ways to create ensembles; the most popular are bagging and stacking.

Bagging and Stacking

Bagging occurs when you use several versions of the same machine learning algorithm on a project, each trained on a different random sample of the data, and combine their predictions.

Stacking, on the other hand, is when you use several different machine learning algorithms in conjunction, with a final model learning how best to combine their predictions.
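
A compact sketch of both ideas using scikit-learn's built-in ensemble classes and one of its bundled toy datasets; the particular estimators and scores are illustrative, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many copies of one algorithm (decision trees by default),
# each fit on a different bootstrap sample of the data.
bagging = BaggingClassifier(n_estimators=25, random_state=0)

# Stacking: different algorithms whose predictions are combined by a final model.
stacking = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(stacking, X, y, cv=5).mean())
```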

Challenges in Machine Learning

Bias and Variance

Bias is a regular gap between the predicted value and the actual outcome. You have bias when the predicted value and the actual value differ by a relatively constant amount. So, if you predict you will roll a 5 on a die three times, but you get 4 every time, you are off by 1 in each prediction, so you demonstrate a bias.

Variance is when the differences between the predicted values and actual values are scattered all over the place. This indicates the model is not well fit to the data. If you predict you will roll a 5 on a die three times, but you get 2, 6, and 1, then you demonstrate variance because the differences are scattered.
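
The dice example can be made concrete with a few lines of Python; the numbers mirror the rolls above.

```python
import statistics

# Bias: the average gap between prediction and outcome.
# Variance: how scattered those gaps are around their average.
predictions = [5, 5, 5]

biased_rolls = [4, 4, 4]       # consistently off by 1 -> high bias, low variance
scattered_rolls = [2, 6, 1]    # errors all over the place -> high variance

def bias(preds, actuals):
    return statistics.mean(p - a for p, a in zip(preds, actuals))

def error_variance(preds, actuals):
    return statistics.pvariance([p - a for p, a in zip(preds, actuals)])

print(bias(predictions, biased_rolls), error_variance(predictions, biased_rolls))        # 1, 0
print(bias(predictions, scattered_rolls), error_variance(predictions, scattered_rolls))  # 2, ~4.67
```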

Bias and variance are two separate challenges in ML, so understanding the root of your error is important, as the error is corrected in different ways depending on whether it stems from bias or from variance.

It is important to note that bias and variance aren’t tied together. A model can have high bias but low variance, in that the predictions are off by a predictable amount, which means the error can be corrected in a predictable way. On the other hand, you can have both high bias and high variance, meaning the model is way off. Ideally, your model will have low bias and low variance, meaning the predictions are accurate and stable (all the darts are on the bullseye). In most cases, you will face either high bias or high variance when modeling.

Bias – Variance Tradeoff

Data scientists often have to balance the cost of bias vs. variance, because tuning the model to reduce one will often have an effect on the other. When the model decreases the variance spread, it will often increase the bias. To combat this, train the machine to follow the data, choosing the most appropriate tradeoff as determined by the data itself.

Overfitting vs. Underfitting

AI systems can create rules that are too simple to capture the real patterns in the data; such a model does a mediocre job on the small training data set and does no better when applied to the larger test data set. This is called underfitting the data.

To combat this, a data scientist might add more complexity to the model. However, you then have to be careful of overfitting the model to the training data.

There is no single way to solve the problem of overfitting vs. underfitting. When training the system, the data scientist must reach a compromise between rules that are too simple and a model that is too complex.
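
One common way to see the tradeoff is to fit polynomials of different complexity to the same noisy data. This sketch (synthetic data, scikit-learn assumed) prints training and test error for a too-simple, a moderate, and a too-complex model.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Degree 1 underfits the curved relationship, degree 15 overfits the noise,
# and a moderate degree lands in between.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),  # training error
          mean_squared_error(y_test, model.predict(X_test)))    # test error
```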

Artificial Neural Networks

The algorithms discussed are great at finding patterns in data, but sometimes you simply have too much data for the algorithms to be effective. In this instance, you turn to Artificial Neural Networks.

Artificial Neural Networks are a type of machine learning, but they use a structure that mimics the human brain to tackle massive datasets that a standalone algorithm couldn’t wrap its arms around. They break the data down into smaller pieces before applying the machine learning algorithms.

Artificial neural networks are most often used for supervised learning. The network is composed of neurons that are organized into layers (from left to right):

Input Layer – The data that is fed into the model passes through the input layer, which has as many nodes as the data needs. For instance, if you are trying to categorize photos, each photo would be broken down into its pixel components. So, a 25 x 25 pixel photo has 625 pixels in total, and the input layer would have 625 individual nodes, one for each pixel.

Hidden Layers – If the network has many hidden layers, it is called a deep learning network. The more hidden layers a network has, the more easily it can identify patterns in complex data. Hidden layers have Activation Functions, decision rules that determine whether an individual neuron in the hidden layer should pass its signal on to the neurons in the next layer. Each hidden layer then feeds its output forward into the next hidden layer.

Output Layer – The output layer will only have the number of nodes needed for the classification. Remember, neural networks are most often used for supervised learning (think classification into known buckets). The output layer assigns a probability to the data. For our example above, the model takes a photo and decides if it is a dog. The output layer would have two nodes, Yes or No (binary classification), and the model would determine the probability that the photo falls in either “bucket”.

As the data moves from left to right, this is called a Feedforward Neural Network. One of the strengths of neural networks is that they are self-tuning: you can check whether they get the right answer, and the model will tune itself until it does.
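
For concreteness, here is what the 625-pixel example might look like as a feedforward network in Keras (assuming TensorFlow is installed); the layer sizes are arbitrary choices, not the "right" architecture.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(625,)),             # input layer: one node per pixel
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer with an activation function
    tf.keras.layers.Dense(64, activation="relu"),    # a second hidden layer ("deep" network)
    tf.keras.layers.Dense(2, activation="softmax"),  # output layer: probability for each bucket
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()  # shows the layers and the number of weights connecting them
```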

Connection Weights

ANNs add weights to the connections in the network. Each neuron in one layer feeds forward into every neuron in the next layer, and each of those connections has its own weight, denoted by W# next to the connection. In our earlier example, each of the 625 input nodes would connect to every neuron in the first hidden layer, and each of those connections would have its own weight (W1, W2, W3, and so on).

When first initialized, the system assigns random numbers to the weights. We then feed in training data (labeled data) for which we already know the correct output. The system adjusts the relative weights of the individual connections until we get the results we expect (the model is accurate when determining whether a photo is a dog or not). The network repeats this process over and over until it can reliably identify the patterns that determine whether the photo is a dog.

Activation Bias

It is important to note that the neural network is trying to reduce variance when it assigns and reassigns weights. Due to the bias / variance tradeoff, this means bias is being affected during this process. As data scientists, it is our job to balance bias vs variance.

ANNs also assign a bias number to each neuron. This is important to note: the bias is assigned to the neuron itself, not to the connection. The machine only adds the bias after it determines the data variance. This bias number shifts the neuron's output in order to make the predictions more accurate. The network then tunes itself to find a balance between bias and variance. Additionally, ANNs tend to overfit the data.

Learning From Mistakes

To a neural network, there is a large difference between being 95% sure and 97% sure. To handle this, neural networks have a Cost Function: a number the system uses to measure its answer against the correct answer. If the ANN's prediction is close, then the cost function will be very low.

For example, if the network is trying to determine whether a dog is in a photo, it is fed photos. If one of the photos is a cat, the network might mistake it for a dog. This error would have a cost. However, if the photo were a tree and the network mistook it for a dog, this would have a much larger cost, as the prediction is further from the actual answer. Larger mistakes mean the network needs to make more aggressive adjustments to its weights and biases.
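
A tiny sketch of one common cost function, cross-entropy, using made-up probabilities: a near-miss costs little, while a confident wrong answer costs a lot.

```python
import math

def cross_entropy(predicted_prob_dog, is_dog):
    """Cost of a single prediction: low when the predicted probability matches the label."""
    p = predicted_prob_dog if is_dog else 1 - predicted_prob_dog
    return -math.log(p)

print(cross_entropy(0.95, is_dog=True))   # close call on a real dog -> tiny cost
print(cross_entropy(0.60, is_dog=False))  # cat mistaken for a dog -> moderate cost
print(cross_entropy(0.99, is_dog=False))  # tree confidently called a dog -> large cost
```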

Gradient Descent

Correcting for wrongness can be tricky. To help, data scientists employ Gradient Descent. Compare this to throwing darts: if you throw a dart and it’s close to the bullseye, you will make minimal adjustments to your throwing technique. If, on the other hand, you hit the floor, then you need to make a massive change (start by looking at the dartboard 🙂).

This is one of the big innovations in neural networks: the Backpropagation of Errors (backprop). ANNs are feedforward, so the data is passed from left to right. If there is an error, the network needs to go back and determine what it did wrong. The network uses gradient descent to determine how wrong it is, computing, for each weight and bias, the direction and size of the change that most reduces the cost, and then uses backprop to carry those adjustments back through the layers, with the size of the adjustment matching the seriousness of the error.
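
Here is a bare-bones NumPy sketch of the whole loop, random weights, a feedforward pass, a cost, backpropagated gradients, and a gradient-descent update, on a toy made-up task (is the sum of two inputs greater than 1?).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = (X.sum(axis=1) > 1).astype(float).reshape(-1, 1)

# Randomly initialized weights and zero biases, as described above.
W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros((1, 1))

sigmoid = lambda z: 1 / (1 + np.exp(-z))
learning_rate = 0.5

for epoch in range(2000):
    # Feedforward pass (left to right).
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Cost: mean squared error between prediction and label.
    cost = np.mean((output - y) ** 2)

    # Backpropagation: push the error back through the layers to get gradients.
    d_output = 2 * (output - y) / len(X) * output * (1 - output)
    d_hidden = d_output @ W2.T * hidden * (1 - hidden)

    # Gradient descent: nudge each weight and bias downhill on the cost surface.
    W2 -= learning_rate * hidden.T @ d_output
    b2 -= learning_rate * d_output.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ d_hidden
    b1 -= learning_rate * d_hidden.sum(axis=0, keepdims=True)

print(cost)  # the cost after training; it should be far lower than at the start
```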

Building An AI System

  1. The first step when building an AI system is to determine what you want from the data. Is the problem classification, clustering, or something else?
  2. Next step is to determine the type of model you need. How much data do you have? The amount of data will determine if you can apply standard ML algorithms or if you need a neural network.
  3. If you use a neural network, the layers will be established and the network will be initialized with weights between the connections.
  4. Then the system will set the bias on every node to 0.
  5. Next step is to feed the training data into the neural network. The network will process the data and return a prediction.
  6. The network will then compare the prediction to the labeled data, adjusting the weights according to how close or far the prediction is from the answer. If incorrect, the network will use gradient descent to determine how much to change its weights and biases. It will then use backpropagation to adjust the weights and biases to lower the cost function.
  7. This process is then repeated over all of the data in the training set.
  8. Once the ANN has worked through the training set, it is then fed the test data set (unlabeled).
  9. Sometimes the network will perform well with the training set but not with the test set, an issue known as overfitting. This means the network has learned the training data too closely (including its noise) and needs to be simplified, regularized, or trained on more varied data so it can generalize to the test set. A condensed sketch of this workflow is shown below.
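
Below is a condensed, hedged sketch of these steps in Keras (TensorFlow assumed). The 25 x 25 "photos" are random synthetic arrays, so the accuracies are meaningless; the point is only to show where each step lands in code.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

# Steps 1-2: a binary classification problem (dog / not dog) with enough data for a small network.
X_train, y_train = rng.uniform(size=(1000, 625)), rng.integers(0, 2, size=1000)
X_test, y_test = rng.uniform(size=(200, 625)), rng.integers(0, 2, size=200)

# Steps 3-4: establish the layers; weights start random and biases start at zero.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(625,)),
    tf.keras.layers.Dense(64, activation="relu", bias_initializer="zeros"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Steps 5-7: feed in the training data; the cost function and gradient descent
# (via backpropagation) adjust the weights and biases over repeated passes.
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, verbose=0)

# Steps 8-9: evaluate on the held-out test set and compare against training accuracy
# to check for overfitting.
print(model.evaluate(X_train, y_train, verbose=0))
print(model.evaluate(X_test, y_test, verbose=0))
```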