python - Understanding Shannon entropy of a data set
I'm reading Machine Learning in Action and going through the decision tree chapter. I understand that decision trees are built by splitting the data set in a way that gives structure to the branches and leaves. This gives you more information at the top of the tree and limits how many decisions you need to go through.
The book shows a function for determining the Shannon entropy of a data set:
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the unique labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
where the input data set is an array of arrays, and each array represents a potential classifiable feature:
dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
What I don't get is why the Shannon entropy function in the book only ever looks at the last element in the feature array. It looks like it's calculating the entropy of the "yes" or "no" items, and not the entropy of any of the other features?
It doesn't make sense to me, because the entropy for this data set
dataSet = [[1, 1, 'yes'], [1, 'asdfasdf', 'yes'], [1900, 0, 'no'], [0, 1, 'no'], ['ddd', 1, 'no']]
is the same as the entropy above, even though it has a lot more diverse data.
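You can check this directly: the function only ever counts the last column, so both data sets come out to roughly 0.971 bits (2 'yes' versus 3 'no'). A quick check, assuming the calcShannonEnt definition above (and its from math import log):

dataSet1 = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
dataSet2 = [[1, 1, 'yes'], [1, 'asdfasdf', 'yes'], [1900, 0, 'no'], [0, 1, 'no'], ['ddd', 1, 'no']]

print(calcShannonEnt(dataSet1))  # ~0.9710, from the 2 'yes' / 3 'no' split of the labels
print(calcShannonEnt(dataSet2))  # also ~0.9710 -- the feature columns are never looked at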
Shouldn't the other feature elements be counted in order to give the total entropy of the data set, or am I misunderstanding what the entropy calculation is supposed to do?
If anyone is curious, the full source for the book (which this code came from) is here, under the Chapter03 folder.
The potential ambiguity here is that the data set you are looking at contains both the features and the outcome variable, the outcome variable being in the last column. The problem you are trying to solve is: "do feature 1 and feature 2 help me predict the outcome?"
Another way to state it is: if I split my data according to feature 1, do I get better information on the outcome?
In this case, without splitting, the outcome variable is [ yes, yes, no, no, no ]. If you split on feature 1, you get 2 groups:

Feature 1 = 0 -> outcome is [ no, no ]
Feature 1 = 1 -> outcome is [ yes, yes, no ]
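As a rough sketch (this helper is my own, not code from the book), splitting on a feature just means grouping the outcome labels by that feature's value:

def splitOnFeature(dataSet, featureIndex):
    # hypothetical helper: group the outcome labels (last column) by one feature's value
    groups = {}
    for featVec in dataSet:
        groups.setdefault(featVec[featureIndex], []).append(featVec[-1])
    return groups

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(splitOnFeature(dataSet, 0))  # {1: ['yes', 'yes', 'no'], 0: ['no', 'no']}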
The idea here is to see whether you are better off after the split. Initially, you had a certain amount of information, described by the Shannon entropy of [ yes, yes, no, no, no ]. After the split, you have two groups, with "better information" for the group where feature 1 = 0: you know that in that case the outcome is no, and that is measured by the entropy of [ no, no ].
In other words, the approach is to figure out whether, out of the features you have available, there is one which, if used, would increase your information about what you care about, that is, the outcome variable. Tree building will greedily pick the feature with the highest information gain at each step, and then see whether it is worth splitting the resulting groups any further.
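For concreteness, here is a minimal sketch of that information-gain comparison, reusing calcShannonEnt from the question and the splitOnFeature helper above (the weighting by group size is the standard information-gain formula, not code from the book):

base = calcShannonEnt(dataSet)       # ~0.9710 for [yes, yes, no, no, no]
groups = splitOnFeature(dataSet, 0)  # {1: ['yes', 'yes', 'no'], 0: ['no', 'no']}

newEntropy = 0.0
for value, outcomes in groups.items():
    weight = float(len(outcomes)) / len(dataSet)
    # wrap each label in its own list so calcShannonEnt can read it from column -1
    newEntropy += weight * calcShannonEnt([[o] for o in outcomes])

infoGain = base - newEntropy
print(infoGain)  # ~0.42: entropy drops from ~0.971 to ~0.551 by splitting on feature 1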