Turning co-occurrence counts into co-occurrence probabilities with Cascalog
I have a table of co-occurrence counts stored on S3 (where each row is [key-a, key-b, count]), and I want to produce a co-occurrence probability matrix from it.
To do that I need to calculate the sum of counts for each key-a, and then divide each row's count by the sum for its key-a.
If I were doing this "by hand" I would make one pass over the data to produce a hash table mapping keys to totals (in LevelDB or the like), and then make a second pass over the data to do the division. That doesn't sound like the Cascalog-y way to do it.
Is there a way I can get the total into each row by doing the equivalent of a self-join?
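For comparison, the two-pass "by hand" approach described above might look like this as a plain-Python sketch (an in-memory dict stands in for LevelDB):

```python
from collections import defaultdict

# rows of [key-a, key-b, count], as in the sample data below
rows = [["foo", "bar", 3], ["bar", "foo", 3],
        ["foo", "quux", 6], ["quux", "foo", 6],
        ["bar", "quux", 2], ["quux", "bar", 2]]

# pass 1: total counts per key-a (a dict stands in for LevelDB)
totals = defaultdict(int)
for key_a, _key_b, count in rows:
    totals[key_a] += count

# pass 2: divide each row's count by its key-a total
probs = [[key_a, key_b, count / totals[key_a]]
         for key_a, key_b, count in rows]
```

This works, but it forces two sequential scans of the data and an external key-value store, which is exactly what a join inside the query would avoid.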
Sample data:
(def coocurrences
  [["foo" "bar" 3] ["bar" "foo" 3]
   ["foo" "quux" 6] ["quux" "foo" 6]
   ["bar" "quux" 2] ["quux" "bar" 2]])
Query:
(require '[cascalog.api :refer :all]
         '[cascalog.ops :as c])

;; plain Clojure functions can be used as Cascalog operations;
;; `double` keeps the result from being a Ratio
(defn div [n d] (double (/ n d)))

(let [total (<- [?key-a ?sum]
                (coocurrences ?key-a _ ?c)
                (c/sum ?c :> ?sum))]
  (?<- (stdout)
       [?key-a ?key-b ?prob]
       (coocurrences ?key-a ?key-b ?c)
       (total ?key-a ?sum)
       (div ?c ?sum :> ?prob)))
Output:
RESULTS
-----------------------
bar	foo	0.6
bar	quux	0.4
foo	bar	0.3333333333333333
foo	quux	0.6666666666666666
quux	foo	0.75
quux	bar	0.25
-----------------------
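As a sanity check, the probabilities for each ?key-a should sum to 1. A quick verification of the output above in plain Python:

```python
from collections import defaultdict

# the rows printed by the query above
output = [["bar", "foo", 0.6], ["bar", "quux", 0.4],
          ["foo", "bar", 0.3333333333333333],
          ["foo", "quux", 0.6666666666666666],
          ["quux", "foo", 0.75], ["quux", "bar", 0.25]]

# sum the probabilities per key-a
sums = defaultdict(float)
for key_a, _key_b, prob in output:
    sums[key_a] += prob

# every per-key group should sum to 1 (within floating-point tolerance)
assert all(abs(s - 1.0) < 1e-9 for s in sums.values())
```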