hadoop - Turning co-occurrence counts into co-occurrence probabilities with Cascalog


I have a table of co-occurrence counts stored on S3 (where each row is [key-a, key-b, count]), and I want to produce a co-occurrence probability matrix from it.

To do this I need to calculate the sum of counts for each key-a, and then divide each row's count by the sum for its key-a.

If I were doing this "by hand" I would make one pass over the data to build a table mapping keys to totals (in LevelDB or the like), and then make a second pass over the data to do the division. That doesn't sound like the Cascalog-y way to do it.
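For comparison, here is a minimal in-memory sketch of that two-pass approach in plain Clojure (a hash map stands in for LevelDB; the row format is the same [key-a key-b count] as in the sample data below):

;; Sketch of the two-pass approach, with an in-memory map standing in for LevelDB.
(defn cooccurrence-probs [rows]
  ;; first pass: total count per key-a
  (let [totals (reduce (fn [acc [key-a _ c]]
                         (update-in acc [key-a] (fnil + 0) c))
                       {}
                       rows)]
    ;; second pass: divide each row's count by its key-a total
    (for [[key-a key-b c] rows]
      [key-a key-b (double (/ c (totals key-a)))])))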

Is there a way I can get the total for each row by doing the equivalent of a self-join?

Sample data:

(def coocurrences
  [["foo" "bar" 3]
   ["bar" "foo" 3]
   ["foo" "quux" 6]
   ["quux" "foo" 6]
   ["bar" "quux" 2]
   ["quux" "bar" 2]])

Query:

(require '[cascalog.api :refer :all]
         '[cascalog.ops :as c])

(let [total (<- [?key-a ?sum]
                (coocurrences ?key-a _ ?c)
                (c/sum ?c :> ?sum))]
  (?<- (stdout) [?key-a ?key-b ?prob]
       (c/div ?c ?sum :> ?prob)
       (coocurrences ?key-a ?key-b ?c)
       (total ?key-a ?sum)))

Output:

results
-----------------------
bar     foo     0.6
bar     quux    0.4
foo     bar     0.3333333333333333
foo     quux    0.6666666666666666
quux    foo     0.75
quux    bar     0.25
-----------------------
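Since the counts actually live on S3, the in-memory coocurrences vector can be swapped for a tap without touching the query itself. A sketch, assuming the [key-a key-b count] tuples were written out as a Hadoop sequence file (the bucket and path are placeholders):

;; Sketch only: placeholder S3 path, assumes the tuples are stored as a sequence file.
(def coocurrences (hfs-seqfile "s3n://my-bucket/path/to/cooccurrences"))

The rest of the query stays the same, because a tap works as a generator anywhere the in-memory vector did.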
