"Definition of the function to read the csv and create dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
" # Load a CSV file. Definition of the function to read the csv and create dataset here\n",
"def load_csv(filename):\n",
"\tdataset = list()\n",
"\twith open(filename, 'r') as file:\n",
@ -38,9 +54,20 @@
"\treturn dataset\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preparing the data\n",
"### Conversion of certain data\n",
"extract the values of the column (here types of Iris)\n",
"calculate how many unique class values there are and store them into a set: a list with unique values\n",
"Tranform class values into numbers/integers"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
@ -48,29 +75,29 @@
"def str_column_to_float(dataset, column):\n",
"\tfor row in dataset:\n",
"\t\trow[column] = float(row[column].strip())\n",
"\n",
"\n",
"# Convert string column to integer\n",
"def str_column_to_int(dataset, column):\n",
"\tclass_values = [row[column] for row in dataset] # extract the values of the column (here the classes of the dataset, mine and rocks)\n",
"\tunique = set(class_values) # calculate how many unique class values there are and store them into a set: a list with unique values\n",
"\tlookup = dict() #create a dictionnary\n",
"\tfor i, value in enumerate(unique): # loops through the set / enumerate gives you a tuple with an index number and a value /common way to get indexes from a list\n",
"\t\tlookup[value] = i # the key of the dictonnary is the value: mine or rock; and the value is a number: 0 or 1\n",
"\t\tlookup[value] = i # the key of the dictonnary is the value: mine or rock/or types of iris; and the value is a number: 0 or 1 (or 2)\n",
"\tfor row in dataset: # loops through the rows of the dataset\n",
"\t\trow[column] = lookup[row[column]] #replaces the value of the column: rock or mine, with the index value: 0 or 1;\n",
"\t\trow[column] = lookup[row[column]] #replaces the value of the column: rock or mine/or types of iris, with the index value: 0 or (1 or 2);\n",
"\treturn lookup # the code returns the lookup table\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": []
"source": [
"### Function to create a list of folds (divide the dataset into smaller subsets)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
@ -94,9 +121,17 @@
"\treturn dataset_split #return the dataset_split, a list of folds\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Functions definitions\n",
"### Calculate accuracy in the prediction"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
@ -110,9 +145,16 @@
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a list of score for each algorithm/tree"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
@ -133,132 +175,318 @@
"\t\tactual = [row[-1] for row in fold] # list comprehension: list of actual classes from fold.\n",
"\t\taccuracy = accuracy_metric(actual, predicted) # function that compares the actual vs the predicted to give an idea of the accuracy of the prediction\n",
"\t\tscores.append(accuracy) #append the accuracy to the list of scores\n",
"\treturn scores\n",
" "
"\treturn scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the trees by using different features and figuring out where to split the data at each point"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the algorithms is running the same function many times to understand where is the best point to divide the best split point for the dataset. In order to assess it, it uses a coefficient called the Gini Coefficient.\n",
"There are three functions:\n",
"- the function for dividing the data into two based on a feature\n",
"- the function to assess whether this division is resulting in an equal divide and that return a coefficient\n",
"- the function to decide between all the different dividing point, which one is the best, which one result in the best gini coefficient"
]
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"# Split a dataset based on an attribute/feature and an attribute/feature value\n",
"# Split a dataset based on a feature and a feature value defined in build tree\n",
"# just trying many times, benefitting from speed of computer\n",
"def test_split(index, value, dataset):\n",
"\tleft, right = list(), list() # create two lists for each side\n",
"\tfor row in dataset: #iterate through each row of the dataset\n",
"\t\tif row[index] < value: #if the feature value of the current row is below the feature value given\n",
"\t\t\tleft.append(row) # append it to the left list\n",
"\tleft, right = list(), list()\n",
"\tfor row in dataset:\n",
"\t\t# compares set value to all values in that column, if it is smaller, it goes to the left\n",
"\t\t# he goes for each value through all dataset again\n",
"\t\tif row[index] < value:\n",
"\t\t\tleft.append(row)\n",
"\t\t# comparing the set value to itself, then it goes to the right\n",
"\t\telse:\n",
"\t\t\tright.append(row) # append it to the right list\n",
"\treturn left, right # return the two lists\n",
" \n",
"# Calculate the Gini index for a split dataset\n",
"\t\t\tright.append(row)\n",
"\treturn left, right"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Calculate the Gini index for a split dataset, using left/right og test split as groups\n",
"\tn_instances = float(sum([len(group) for group in groups])) #counting the total number of instances into float for divisions\n",
"\t# count all samples at split point (the dataset), converts it in a float in order to do divisions\n",
"\tn_instances = float(sum([len(group) for group in groups]))\n",
"\t# sum weighted Gini index for each group\n",
"\tgini = 0.0 #gini variable\n",
"\tfor group in groups: #for each of the group\n",
"\t\tsize = float(len(group)) #number of instances in each group\n",
"\tgini = 0.0\n",
"\tfor group in groups:\n",
"\t\tsize = float(len(group))\n",
"\t\t# avoid divide by zero\n",
"\t\tif size == 0:\n",
"\t\t\tcontinue\n",
"\t\tscore = 0.0\n",
"\t\t# score the group based on the score for each class\n",
"\t\t# count number of instances for current class in the group and divide by total size of the group\n",
"\t\tfor class_val in classes:\n",
"\t\t\tp = [row[-1] for row in group].count(class_val) / size #count the number of instances for the current class in the group and divide it by the total size of the group\n",
"\t\t\tscore += p * p #amplifying the difference (exponential ?)\n",
"\t\t# weight the group score by its relative size\n",
"\t\tgini += (1.0 - score) * (size / n_instances) # substract the score from 1 and multiply it by the relative size of the group compared to the dataset\n",
"\t\t\t# outcome lies always between 0 and 1\n",
"\t\t\t# for each row it takes the class value and counts how many times the set class value appears, divided by size of the group\n",
"\t\t\tp = [row[-1] for row in group].count(class_val) / size\n",
"\t\t\t# multiply makes it exponentially smaller; you amplify the badness of the score\n",
"\t\t\tscore += p * p\n",
"\t\t# weight the group score by its relative size (size of group divided by total size of dataset)\n",
"\tclass_values = list(set(row[-1] for row in dataset)) # creates a list of the set for the class values. Here encoded as 1 and 0 . We already did it before\n",
"\twhile len(features) < n_features: # as long as features is smaller that the actual number of desirable features= n_features\n",
"\t\tindex = randrange(len(dataset[0])-1) # create a random number between 0 and the number of columns-1= minus the class\n",
"\t\tif index not in features: # if the column names is not already in the features\n",
"\t\t\tfeatures.append(index) # append the column name\n",
"\tfor index in features: # for each column name(=index name) in features\n",
"\t\tfor row in dataset: # for each row of the dataset\n",
"\t\t\tgroups = test_split(index, row[index], dataset) # get two lists. very computationnaly heavy. Why not do the ordering first?\n",
"\t\t\tgini = gini_index(groups, class_values) # get the gini value\n",
"\t\t\tif gini < b_score: #if the gini value is smaller than b_score (b for best?). this should always be true for the first operation. Test against the best option\n",
"\t\t\t\tb_index, b_value, b_score, b_groups = index, row[index], gini, groups # update the values for the best option\n",
"\treturn {'index':b_index, 'value':b_value, 'groups':b_groups} #return the best option in the form of a dictionnary\n",
" \n",
"# Create a terminal node value\n",
"# Returns the most popular value in the group\n",
"\t# takes last element of each row (class) and returns it as a row, as it is a set, it has only 2 values\n",
"\tclass_values = list(set(row[-1] for row in dataset))\n",
"### Create the child splits or terminal nodes\n",
"\n",
"here we define a function that will build the tree, decide wether the new data point is going to go left or right or to build a new node."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Create a terminal node value = node at end of the tree = end leaf with its predicted class\n",
"def to_terminal(group):\n",
"\toutcomes = [row[-1] for row in group] #takes the class of the group elements and put it in a list\n",
"\treturn max(set(outcomes), key=outcomes.count) #return the class that appears the most. set is counting the different class and the key is counting the number of occurences of this class. SPOOKY AND DENSE\n",
" \n",
"# Create child splits for a node or make a terminal (node)\n",
"\t# returns list of classes of group\n",
"\toutcomes = [row[-1] for row in group]\n",
"\t# selects most popular class; list of outcomes is reduced to 0 or 1; key counts the amount of times 0 or 1 occurs\n",
"\t# selects class based on calculating how many times the class occurs\n",
"\tif len(left) <= min_size: # if the group is smaller or equal than the minimum size for a group\n",
"\t\tnode['left'] = to_terminal(left) #left node is a terminal node\n",
"\t# if length of left group is smaller or equal to 1\n",
"\tif len(left) <= min_size:\n",
"\t\t# it creates an end leaf\n",
"\t\tnode['left'] = to_terminal(left)\n",
"\telse:\n",
"\t\tnode['left'] = get_split(left, n_features) #create another split, another node from which to separate in two groups with a subset of a dataset\n",
"\tif row[node['index']] < node['value']: #if the feature value of the row is smaller of the feature value of the node\n",
"\t\tif isinstance(node['left'], dict): # is it a node or a terminal node(children are not node)\n",
"\t\t\treturn predict(node['left'], row) #recursion if a node. the function calls itselft on the following left node\n",
"\t# node index = column feature, it looks up value for this feature for this row in dataset\n",
"\t# compare feature value of row you're checking with feature value of node\n",
"\tif row[node['index']] < node['value']:\n",
"\t\t# is it node? \n",
"\t\tif isinstance(node['left'], dict):\n",
"\t\t\t# recursive function at the left\n",
"\t\t\treturn predict(node['left'], row)\n",
"\t\telse:\n",
"\t\t\treturn node['left'] # result if a terminal node\n",
"\t\t\t# creates final leaf at the left\n",
"\t\t\treturn node['left']\n",
"\telse:\n",
"\t\tif isinstance(node['right'], dict): # is it a node or a terminal node(children are not node)\n",
"\t\t\treturn predict(node['right'], row)#recursion if a node. the function calls itselft on the following right node\n",
"\t\t# is it node?\n",
"\t\tif isinstance(node['right'], dict):\n",
"\t\t\t# recursive function at the right\n",
"\t\t\treturn predict(node['right'], row)\n",
"\t\telse:\n",
"\t\t\treturn node['right']# result if a terminal node\n",
" \n",
"# Create a random subsample from the dataset with replacement\n",
"\t\t\t# creates final leaf at the left\n",
"\t\t\treturn node['right']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bootstrapping: doubling data to fill-in the random forest"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# Create a random subsample from the dataset with replacement, ratio is called sample_size further on\n",
"# This is called BOOTSTRAPPING: build new datasets from the original data, with the same number of rows\n",
"# with replacement: after selecting the row we put it back into the data, so it can be selected twice or more\n",
"def subsample(dataset, ratio):\n",
"\tsample = list() #creates a list\n",
"\tn_sample = round(len(dataset) * ratio) # rounds the multiplication of the length of the dataset with the sample.size: here 1. \n",
"\twhile len(sample) < n_sample: # loop up until the length of the sample is the length of n_sample\n",
"\t\tindex = randrange(len(dataset)) #take a random number from 0 up to length of the dataset\n",
"\t\tsample.append(dataset[index]) # append a sample whith this index\n",
"\treturn sample # return the list of sub-samples\n",
"\tsample = list()\n",
"\t# if it is smaller than 1, not all dataset is taken as sample - he uses the full dataset\n",
"\tn_sample = round(len(dataset) * ratio)\n",
"\twhile len(sample) < n_sample:\n",
"\t\tindex = randrange(len(dataset))\n",
"\t\tsample.append(dataset[index])\n",
"\treturn sample\n",
" \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Aggregate the prediction of several trees - the council of trees - see what is their verdict"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Make a prediction with a list of bagged trees\n",
"def bagging_predict(trees, row): \n",
"\tpredictions = [predict(tree, row) for tree in trees] # we run the prediction in each tree, this gives a list of predictions of classes/votes. THE TREES ARE VOTING.\n",
"\treturn max(set(predictions), key=predictions.count) # we count the class with the maximum votes and return it as prediction\n",
" \n",
"def bagging_predict(trees, row):\n",
"\t# asks the forest to predict class for every row in the test data, this gives list of votes\n",
"\tpredictions = [predict(tree, row) for tree in trees]\n",
"\t# it calculates amount of votes for each class, returns most popular class as prediction\n",