hadoop - Turning co-occurrence counts into co-occurrence probabilities with Cascalog


I have a table of co-occurrence counts stored on S3 (where each row is [key-a, key-b, count]), and I want to produce a co-occurrence probability matrix from it.

To do that I need to calculate the sum of counts for each key-a, and then divide each row's count by the sum for its key-a.

If I were doing this "by hand" I would make one pass over the data to produce a hash table of keys to totals (in LevelDB or something like it), and then make a second pass over the data to do the division. That doesn't sound like the Cascalog-y way to do it, though.
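Roughly, those two passes look like this in plain Clojure (just a sketch; the function names are made up and an in-memory map stands in for LevelDB):

(defn totals-by-key
  "Pass 1: accumulate the total count for each key-a."
  [rows]
  (reduce (fn [acc [key-a _ c]]
            (assoc acc key-a (+ (get acc key-a 0) c)))
          {}
          rows))

(defn co-occurrence-probabilities
  "Pass 2: divide each row's count by the total for its key-a."
  [rows]
  (let [totals (totals-by-key rows)]
    (for [[key-a key-b c] rows]
      [key-a key-b (double (/ c (totals key-a)))])))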

Is there a way I can get the total onto each row by doing the equivalent of a self-join?

Sample data:

(def coocurrences
  [["foo"  "bar"  3]
   ["bar"  "foo"  3]
   ["foo"  "quux" 6]
   ["quux" "foo"  6]
   ["bar"  "quux" 2]
   ["quux" "bar"  2]])
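(In the real job this vector is just a stand-in for the S3 data. If, say, the counts were written out as a sequence file with three fields per tuple, the generator could be a tap instead; a sketch is below, where hfs-seqfile comes from cascalog.api and the bucket/path is made up. The tap is then queried exactly like the vector.)

(def coocurrences
  ;; assumption: the counts were stored as a 3-field sequence file on S3
  (hfs-seqfile "s3n://my-bucket/cooccurrence-counts"))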

Query (div is a small helper for floating-point division; it is not part of cascalog.ops):

(require '[cascalog.api :refer :all]
         '[cascalog.ops :as c])

;; Small helper: floating-point division.
(defn div [num denom]
  (double (/ num denom)))

;; The total subquery aggregates the counts per ?key-a; joining it back
;; to coocurrences on the shared ?key-a variable is the equivalent of a
;; self-join.
(let [total (<- [?key-a ?sum]
                (coocurrences ?key-a _ ?c)
                (c/sum ?c :> ?sum))]
  (?<- (stdout) [?key-a ?key-b ?prob]
       (div ?c ?sum :> ?prob)
       (coocurrences ?key-a ?key-b ?c)
       (total ?key-a ?sum)))

Output:

RESULTS
-----------------------
bar     foo     0.6
bar     quux    0.4
foo     bar     0.3333333333333333
foo     quux    0.6666666666666666
quux    foo     0.75
quux    bar     0.25
-----------------------
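As a sanity check against the sample data: the rows with ?key-a foo have counts 3 and 6, so the total for foo is 9, giving 3/9 ≈ 0.333 for foo/bar and 6/9 ≈ 0.667 for foo/quux, which matches the foo rows in the output above.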
