"Definition of the function to read the csv and create dataset"
]
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 2,
"execution_count": 28,
"metadata": {},
"metadata": {},
"outputs": [],
"outputs": [],
"source": [
"source": [
" # Load a CSV file. Definition of the function to read the csv and create dataset here\n",
"def load_csv(filename):\n",
"def load_csv(filename):\n",
"\tdataset = list()\n",
"\tdataset = list()\n",
"\twith open(filename, 'r') as file:\n",
"\twith open(filename, 'r') as file:\n",
@ -38,9 +54,20 @@
"\treturn dataset\n"
"\treturn dataset\n"
]
]
},
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preparing the data\n",
"### Conversion of certain data\n",
"extract the values of the column (here types of Iris)\n",
"calculate how many unique class values there are and store them into a set: a list with unique values\n",
"Tranform class values into numbers/integers"
]
},
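The conversion steps listed above can be sketched on a tiny, made-up sample. This is only an illustration: `encode_labels` and the sample rows are hypothetical names and values, not part of the notebook, and the labels are sorted here so that the mapping is repeatable (iterating over a bare set does not guarantee an order).

```python
def encode_labels(rows, column):
    """Replace the string class labels in `column` with integers, in place."""
    # Collect the unique labels; sorting makes the label -> integer mapping repeatable.
    unique = sorted(set(row[column] for row in rows))
    # Build the lookup table: label -> integer index.
    lookup = {label: i for i, label in enumerate(unique)}
    # Overwrite each label with its integer code.
    for row in rows:
        row[column] = lookup[row[column]]
    return lookup


# Hypothetical sample rows: one measurement plus the class label.
rows = [[5.1, "setosa"], [7.0, "versicolor"], [6.3, "virginica"], [4.9, "setosa"]]
lookup = encode_labels(rows, 1)
print(lookup)
print(rows)
```

The returned lookup table is worth keeping, since it is the only way to translate the integer predictions back into class names.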
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 3,
"execution_count": 29,
"metadata": {},
"metadata": {},
"outputs": [],
"outputs": [],
"source": [
"source": [
@ -48,29 +75,29 @@
"def str_column_to_float(dataset, column):\n",
"def str_column_to_float(dataset, column):\n",
"\tfor row in dataset:\n",
"\tfor row in dataset:\n",
"\t\trow[column] = float(row[column].strip())\n",
"\t\trow[column] = float(row[column].strip())\n",
"\n",
"\n",
"# Convert string column to integer\n",
"# Convert string column to integer\n",
"def str_column_to_int(dataset, column):\n",
"def str_column_to_int(dataset, column):\n",
"\tclass_values = [row[column] for row in dataset] # extract the values of the column (here the classes of the dataset, mine and rocks)\n",
"\tclass_values = [row[column] for row in dataset] # extract the values of the column (here the classes of the dataset, mine and rocks)\n",
"\tunique = set(class_values) # calculate how many unique class values there are and store them into a set: a list with unique values\n",
"\tunique = set(class_values) # calculate how many unique class values there are and store them into a set: a list with unique values\n",
"\tlookup = dict() #create a dictionnary\n",
"\tlookup = dict() #create a dictionnary\n",
"\tfor i, value in enumerate(unique): # loops through the set / enumerate gives you a tuple with an index number and a value /common way to get indexes from a list\n",
"\tfor i, value in enumerate(unique): # loops through the set / enumerate gives you a tuple with an index number and a value /common way to get indexes from a list\n",
"\t\tlookup[value] = i # the key of the dictonnary is the value: mine or rock; and the value is a number: 0 or 1\n",
"\t\tlookup[value] = i # the key of the dictonnary is the value: mine or rock/or types of iris; and the value is a number: 0 or 1 (or 2)\n",
"\tfor row in dataset: # loops through the rows of the dataset\n",
"\tfor row in dataset: # loops through the rows of the dataset\n",
"\t\trow[column] = lookup[row[column]] #replaces the value of the column: rock or mine, with the index value: 0 or 1;\n",
"\t\trow[column] = lookup[row[column]] #replaces the value of the column: rock or mine/or types of iris, with the index value: 0 or (1 or 2);\n",
"\treturn lookup # the code returns the lookup table\n"
"\treturn lookup # the code returns the lookup table\n"
]
]
},
},
{
{
"cell_type": "code",
"cell_type": "markdown",
"execution_count": null,
"metadata": {},
"metadata": {},
"outputs": [],
"source": [
"source": []
"### Function to create a list of folds (divide the dataset into smaller subsets)"
]
},
},
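The body of the fold-creating function is not visible in this diff, so the following is only a rough sketch of the idea, under the assumption that the rows are shuffled and then cut into equally sized folds. `make_folds` and its `seed` parameter are hypothetical and may differ from what the notebook does.

```python
from random import Random


def make_folds(rows, n_folds, seed=None):
    """Shuffle a copy of the rows and cut it into n_folds equally sized folds."""
    rng = Random(seed)          # private generator, so the global random state is untouched
    shuffled = list(rows)       # copy: the original dataset is left as it is
    rng.shuffle(shuffled)
    fold_size = len(shuffled) // n_folds  # leftover rows (if any) are dropped
    return [shuffled[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]


# Hypothetical dataset of 10 single-value rows split into 3 folds of 3 rows each.
rows = [[i] for i in range(10)]
folds = make_folds(rows, 3, seed=42)
print([len(fold) for fold in folds])
```

Each fold then serves once as test data while the remaining folds are used for training.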
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 4,
"execution_count": 30,
"metadata": {},
"metadata": {},
"outputs": [],
"outputs": [],
"source": [
"source": [
@ -94,9 +121,17 @@
"\treturn dataset_split #return the dataset_split, a list of folds\n"
"\treturn dataset_split #return the dataset_split, a list of folds\n"
]
]
},
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Functions definitions\n",
"### Calculate accuracy in the prediction"
]
},
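The accuracy function itself is cut off in this diff. As a minimal sketch, assuming accuracy means the percentage of predictions that match the actual classes, it could look like this. `accuracy_percent` is a hypothetical name, not the notebook's `accuracy_metric`.

```python
def accuracy_percent(actual, predicted):
    """Percentage of positions where the predicted class equals the actual class."""
    if not actual:
        return 0.0  # avoid dividing by zero on an empty fold
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0


# Three of the four predictions are right.
score = accuracy_percent([0, 1, 1, 0], [0, 1, 0, 0])
print(score)
```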
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 5,
"execution_count": 31,
"metadata": {},
"metadata": {},
"outputs": [],
"outputs": [],
"source": [
"source": [
@ -110,9 +145,16 @@
" "
" "
]
]
},
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a list of score for each algorithm/tree"
]
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 6,
"execution_count": 32,
"metadata": {},
"metadata": {},
"outputs": [],
"outputs": [],
"source": [
"source": [
@ -133,132 +175,318 @@
"\t\tactual = [row[-1] for row in fold] # list comprehension: list of actual classes from fold.\n",
"\t\tactual = [row[-1] for row in fold] # list comprehension: list of actual classes from fold.\n",
"\t\taccuracy = accuracy_metric(actual, predicted) # function that compares the actual vs the predicted to give an idea of the accuracy of the prediction\n",
"\t\taccuracy = accuracy_metric(actual, predicted) # function that compares the actual vs the predicted to give an idea of the accuracy of the prediction\n",
"\t\tscores.append(accuracy) #append the accuracy to the list of scores\n",
"\t\tscores.append(accuracy) #append the accuracy to the list of scores\n",
"\treturn scores\n",
"\treturn scores"
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the trees by using different features and figuring out where to split the data at each point"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the algorithms is running the same function many times to understand where is the best point to divide the best split point for the dataset. In order to assess it, it uses a coefficient called the Gini Coefficient.\n",
"There are three functions:\n",
"- the function for dividing the data into two based on a feature\n",
"- the function to assess whether this division is resulting in an equal divide and that return a coefficient\n",
"- the function to decide between all the different dividing point, which one is the best, which one result in the best gini coefficient"
]
]
},
},
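To get a feeling for the coefficient, here is a small worked sketch that follows the weighting described in the code comments (one minus the sum of squared class proportions, weighted by the relative size of each group). `weighted_gini` and the two toy splits are made up for illustration.

```python
def weighted_gini(groups, classes):
    """Gini index of a split: 0.0 means every group holds a single class."""
    n_instances = float(sum(len(group) for group in groups))
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:
            continue  # an empty group contributes nothing and would divide by zero
        labels = [row[-1] for row in group]
        # Sum of squared class proportions: 1.0 for a pure group, lower for a mixed one.
        score = sum((labels.count(c) / size) ** 2 for c in classes)
        # Weight the impurity of the group by its share of all rows.
        gini += (1.0 - score) * (size / n_instances)
    return gini


# Each row is [feature value, class]; a split is a (left, right) pair of groups.
pure_split = ([[1, 0], [2, 0]], [[3, 1], [4, 1]])
mixed_split = ([[1, 0], [2, 1]], [[3, 0], [4, 1]])
print(weighted_gini(pure_split, [0, 1]))   # 0.0
print(weighted_gini(mixed_split, [0, 1]))  # 0.5
```

With two classes, 0.5 is the worst possible value, which is why the split search keeps the candidate with the lowest coefficient.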
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 7,
"execution_count": 33,
"metadata": {},
"metadata": {},
"outputs": [],
"outputs": [],
"source": [
"source": [
"\n",
"# Split a dataset based on a feature and a feature value defined in build tree\n",
"\n",
"# just trying many times, benefitting from speed of computer\n",
"# Split a dataset based on an attribute/feature and an attribute/feature value\n",
"def test_split(index, value, dataset):\n",
"def test_split(index, value, dataset):\n",
"\tleft, right = list(), list() # create two lists for each side\n",
"\tleft, right = list(), list()\n",
"\tfor row in dataset: #iterate through each row of the dataset\n",
"\tfor row in dataset:\n",
"\t\tif row[index] < value: #if the feature value of the current row is below the feature value given\n",
"\t\t# compares set value to all values in that column, if it is smaller, it goes to the left\n",
"\t\t\tleft.append(row) # append it to the left list\n",
"\t\t# he goes for each value through all dataset again\n",
"\t\tif row[index] < value:\n",
"\t\t\tleft.append(row)\n",
"\t\t# comparing the set value to itself, then it goes to the right\n",
"\t\telse:\n",
"\t\telse:\n",
"\t\t\tright.append(row) # append it to the right list\n",
"\t\t\tright.append(row)\n",
"\treturn left, right # return the two lists\n",
"\treturn left, right"
" \n",
]
"# Calculate the Gini index for a split dataset\n",
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Calculate the Gini index for a split dataset, using left/right og test split as groups\n",
"\t# count all samples at split point (the dataset), converts it in a float in order to do divisions\n",
"\tn_instances = float(sum([len(group) for group in groups])) #counting the total number of instances into float for divisions\n",
"\tn_instances = float(sum([len(group) for group in groups]))\n",
"\t# sum weighted Gini index for each group\n",
"\t# sum weighted Gini index for each group\n",
"\tgini = 0.0 #gini variable\n",
"\tgini = 0.0\n",
"\tfor group in groups: #for each of the group\n",
"\tfor group in groups:\n",
"\t\tsize = float(len(group)) #number of instances in each group\n",
"\t\tsize = float(len(group))\n",
"\t\t# avoid divide by zero\n",
"\t\t# avoid divide by zero\n",
"\t\tif size == 0:\n",
"\t\tif size == 0:\n",
"\t\t\tcontinue\n",
"\t\t\tcontinue\n",
"\t\tscore = 0.0\n",
"\t\tscore = 0.0\n",
"\t\t# score the group based on the score for each class\n",
"\t\t# score the group based on the score for each class\n",
"\t\t# count number of instances for current class in the group and divide by total size of the group\n",
"\t\tfor class_val in classes:\n",
"\t\tfor class_val in classes:\n",
"\t\t\tp = [row[-1] for row in group].count(class_val) / size #count the number of instances for the current class in the group and divide it by the total size of the group\n",
"\t\t\t# outcome lies always between 0 and 1\n",
"\t\t\tscore += p * p #amplifying the difference (exponential ?)\n",
"\t\t\t# for each row it takes the class value and counts how many times the set class value appears, divided by size of the group\n",
"\t\t# weight the group score by its relative size\n",
"\t\t\tp = [row[-1] for row in group].count(class_val) / size\n",
"\t\tgini += (1.0 - score) * (size / n_instances) # substract the score from 1 and multiply it by the relative size of the group compared to the dataset\n",
"\t\t\t# multiply makes it exponentially smaller; you amplify the badness of the score\n",
"\t\t\tscore += p * p\n",
"\t\t# weight the group score by its relative size (size of group divided by total size of dataset)\n",
"\tclass_values = list(set(row[-1] for row in dataset)) # creates a list of the set for the class values. Here encoded as 1 and 0 . We already did it before\n",
"\t# takes last element of each row (class) and returns it as a row, as it is a set, it has only 2 values\n",
"\t\tindex = randrange(len(dataset[0])-1) # create a random number between 0 and the number of columns-1= minus the class\n",
"\t# creates list called features\n",
"\t\tif index not in features: # if the column names is not already in the features\n",
"\tfeatures = list()\n",
"\t\t\tfeatures.append(index) # append the column name\n",
"\t# as long as features list is not as long as square root of total dataset\n",
"\tfor index in features: # for each column name(=index name) in features\n",
"\twhile len(features) < n_features:\n",
"\t\tfor row in dataset: # for each row of the dataset\n",
"\t\t# creates number between 0 and nr of colums (- class)\n",
"\t\t\tgroups = test_split(index, row[index], dataset) # get two lists. very computationnaly heavy. Why not do the ordering first?\n",
"\t\tindex = randrange(len(dataset[0])-1)\n",
"\t\t\tgini = gini_index(groups, class_values) # get the gini value\n",
"\t\t# add column value if not present yet in features, creates only the index with name of the column\n",
"\t\t\tif gini < b_score: #if the gini value is smaller than b_score (b for best?). this should always be true for the first operation. Test against the best option\n",
"\t\tif index not in features:\n",
"\t\t\t\tb_index, b_value, b_score, b_groups = index, row[index], gini, groups # update the values for the best option\n",
"\t\t\tfeatures.append(index)\n",
"\treturn {'index':b_index, 'value':b_value, 'groups':b_groups} #return the best option in the form of a dictionnary\n",
"\t# for each column name in list features:\n",
" \n",
"\tfor index in features:\n",
"# Create a terminal node value\n",
"\t\tfor row in dataset:\n",
"# Returns the most popular value in the group\n",
"\t\t\t# take split point, loops through all the points, selecting 1 feature\n",
"### Create the child splits or terminal nodes\n",
"\n",
"here we define a function that will build the tree, decide wether the new data point is going to go left or right or to build a new node."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Create a terminal node value = node at end of the tree = end leaf with its predicted class\n",
"def to_terminal(group):\n",
"def to_terminal(group):\n",
"\toutcomes = [row[-1] for row in group] #takes the class of the group elements and put it in a list\n",
"\t# returns list of classes of group\n",
"\treturn max(set(outcomes), key=outcomes.count) #return the class that appears the most. set is counting the different class and the key is counting the number of occurences of this class. SPOOKY AND DENSE\n",
"\toutcomes = [row[-1] for row in group]\n",
" \n",
"\t# selects most popular class; list of outcomes is reduced to 0 or 1; key counts the amount of times 0 or 1 occurs\n",
"# Create child splits for a node or make a terminal (node)\n",
"\t# selects class based on calculating how many times the class occurs\n",
"\tif len(left) <= min_size: # if the group is smaller or equal than the minimum size for a group\n",
"\t# if length of left group is smaller or equal to 1\n",
"\t\tnode['left'] = to_terminal(left) #left node is a terminal node\n",
"\tif len(left) <= min_size:\n",
"\t\t# it creates an end leaf\n",
"\t\tnode['left'] = to_terminal(left)\n",
"\telse:\n",
"\telse:\n",
"\t\tnode['left'] = get_split(left, n_features) #create another split, another node from which to separate in two groups with a subset of a dataset\n",
" # Test here whether the group has only one class\n",
"\tif row[node['index']] < node['value']: #if the feature value of the row is smaller of the feature value of the node\n",
"\t# node index = column feature, it looks up value for this feature for this row in dataset\n",
"\t\tif isinstance(node['left'], dict): # is it a node or a terminal node(children are not node)\n",
"\t# compare feature value of row you're checking with feature value of node\n",
"\t\t\treturn predict(node['left'], row) #recursion if a node. the function calls itselft on the following left node\n",
"\tif row[node['index']] < node['value']:\n",
"\t\t# is it node? \n",
"\t\tif isinstance(node['left'], dict):\n",
"\t\t\t# recursive function at the left\n",
"\t\t\treturn predict(node['left'], row)\n",
"\t\telse:\n",
"\t\telse:\n",
"\t\t\treturn node['left'] # result if a terminal node\n",
"\t\t\t# creates final leaf at the left\n",
"\t\t\treturn node['left']\n",
"\telse:\n",
"\telse:\n",
"\t\tif isinstance(node['right'], dict): # is it a node or a terminal node(children are not node)\n",
"\t\t# is it node?\n",
"\t\t\treturn predict(node['right'], row)#recursion if a node. the function calls itselft on the following right node\n",
"\t\tif isinstance(node['right'], dict):\n",
"\t\t\t# recursive function at the right\n",
"\t\t\treturn predict(node['right'], row)\n",
"\t\telse:\n",
"\t\telse:\n",
"\t\t\treturn node['right']# result if a terminal node\n",
"\t\t\t# creates final leaf at the left\n",
" \n",
"\t\t\treturn node['right']"
"# Create a random subsample from the dataset with replacement\n",
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bootstrapping: doubling data to fill-in the random forest"
]
},
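Sampling with replacement can be sketched in a few lines. This is an illustration only: `bootstrap_sample` and its `seed` parameter are hypothetical and are there just to make the draw repeatable. Because every row is put back after being drawn, the same row can appear more than once in a sample.

```python
from random import Random


def bootstrap_sample(rows, ratio=1.0, seed=None):
    """Draw round(len(rows) * ratio) rows at random, with replacement."""
    rng = Random(seed)  # private generator so the example is repeatable
    n_sample = round(len(rows) * ratio)
    # randrange can return the same index again, so rows may repeat.
    return [rows[rng.randrange(len(rows))] for _ in range(n_sample)]


# Hypothetical dataset of 10 single-value rows.
rows = [[i] for i in range(10)]
full = bootstrap_sample(rows, ratio=1.0, seed=1)
half = bootstrap_sample(rows, ratio=0.5, seed=1)
print(len(full), len(half))
```

With a ratio of 1.0 the sample has as many rows as the dataset, yet it usually leaves some rows out and repeats others, which is what makes each tree see slightly different data.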
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# Create a random subsample from the dataset with replacement, ratio is called sample_size further on\n",
"# This is called BOOTSTRAPPING: build new datasets from the original data, with the same number of rows\n",
"# with replacement: after selecting the row we put it back into the data, so it can be selected twice or more\n",
"def subsample(dataset, ratio):\n",
"def subsample(dataset, ratio):\n",
"\tsample = list() #creates a list\n",
"\tsample = list()\n",
"\tn_sample = round(len(dataset) * ratio) # rounds the multiplication of the length of the dataset with the sample.size: here 1. \n",
"\t# if it is smaller than 1, not all dataset is taken as sample - he uses the full dataset\n",
"\twhile len(sample) < n_sample: # loop up until the length of the sample is the length of n_sample\n",
"\tn_sample = round(len(dataset) * ratio)\n",
"\t\tindex = randrange(len(dataset)) #take a random number from 0 up to length of the dataset\n",
"\twhile len(sample) < n_sample:\n",
"\t\tsample.append(dataset[index]) # append a sample whith this index\n",
"\t\tindex = randrange(len(dataset))\n",
"\treturn sample # return the list of sub-samples\n",
"\t\tsample.append(dataset[index])\n",
"\treturn sample\n",
" \n",
" \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Aggregate the prediction of several trees - the council of trees - see what is their verdict"
]
},
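The vote among the trees boils down to picking the most frequent prediction. A minimal sketch, using `collections.Counter` instead of the `max(set(...), key=...count)` idiom found in the notebook; `majority_vote` is a hypothetical helper name and the votes are made up.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class that received the most votes."""
    # most_common(1) returns a list with one (label, count) pair;
    # on a tie, the label that was seen first wins.
    return Counter(votes).most_common(1)[0][0]


# Five hypothetical trees vote on the class of one row.
verdict = majority_vote([0, 1, 1, 2, 1])
print(verdict)  # 1
```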
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Make a prediction with a list of bagged trees\n",
"# Make a prediction with a list of bagged trees\n",
"def bagging_predict(trees, row): \n",
"def bagging_predict(trees, row):\n",
"\tpredictions = [predict(tree, row) for tree in trees] # we run the prediction in each tree, this gives a list of predictions of classes/votes. THE TREES ARE VOTING.\n",
"\t# asks the forest to predict class for every row in the test data, this gives list of votes\n",
"\treturn max(set(predictions), key=predictions.count) # we count the class with the maximum votes and return it as prediction\n",
"\tpredictions = [predict(tree, row) for tree in trees]\n",
" \n",
"\t# it calculates amount of votes for each class, returns most popular class as prediction\n",