java - Get word count from a string in Unicode (in any language)
I want the word count of a string. It's as simple as that. The catch is that the string can be in any unpredictable language.
So, I need a function with the signature int getwordcount(string).
The following is the sample output:
getwordcount("供应商代发发货") => 7
getwordcount("this is a sentence") => 4
Any help on how to proceed would be appreciated :)
The concept of a "word" may be trivial or complex. Here is what the Apache Stanbol toolkit says:
Word tokenization: The detection of single words is required by the Stanbol Enhancer to process text. While this is trivial for most languages, it is a rather complex task for some eastern languages, e.g. Chinese, Japanese, Korean. If not otherwise configured, Stanbol will use whitespaces to tokenize words.
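The whitespace default described in the quote can be sketched in plain Java as follows. This is only an illustration of the idea, not Stanbol's code; the class name WhitespaceWordCount is made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper: counts words by splitting on runs of whitespace.
final class WhitespaceWordCount {
    // UNICODE_CHARACTER_CLASS makes \s match all Unicode whitespace,
    // e.g. the ideographic space U+3000, not only ASCII blanks.
    private static final Pattern WHITESPACE =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    static int getWordCount(String text) {
        int count = 0;
        for (String token : WHITESPACE.split(text)) {
            // split() can yield an empty leading token; skip it.
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

For a space-delimited sentence this gives the count you would expect, but 供应商代发发货 contains no whitespace, so the whole string is counted as a single word.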
So if your concept of a word is linguistic, rather than syntactic, you should use an NLP toolkit.
My preferred Java solution is Apache OpenNLP.
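If pulling in an NLP toolkit is not an option, a JDK-only stand-in is java.text.BreakIterator. To be clear, this is not OpenNLP's API and it is not a linguistic tokenizer; it is a rule-based boundary detector, and how it segments Chinese text depends on the JDK's built-in rules, so do not expect it to match a dictionary-based result. A minimal sketch, with a made-up class name:

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: counts the segments reported by the JDK's
// word BreakIterator that contain at least one letter or digit.
final class BreakIteratorWordCount {
    static int getWordCount(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int count = 0;
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            // Segments made only of spaces or punctuation are not words.
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

Unlike a plain whitespace split, this skips punctuation, so "Hello, world!" counts as two words.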
Note: I have used http://www.mdbg.net/chindict/chindict.php?page=worddict to tokenize your example. It implies that there are 4 words, not seven. I have cut and pasted the output (rather fragmented):
Original text | Simplified | Pīnyīn | English definition | Traditional | HSK
供应商 | 供应商 | gōngyìngshāng | supplier | 供應商 |
代 | 代 | dài | to substitute / act on behalf of others / replace / generation / dynasty / age / period / (historical) era / (geological) eon | |
发 | 发 | fā | to send out / show (one's feeling) / issue / develop / classifier for gunshots (rounds) | 發 | HSK 4
 | 发 | fà | hair / Taiwan pr. [fa3] | 髮 |
发货 | 发货 | fāhuò | to dispatch / send out goods | 發貨 |
These first 3 characters appear to form a single word.
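The 7 expected in the question matches the number of characters in 供应商代发发货, not the dictionary segmentation. If that per-character convention is really what is wanted, one possible sketch is to count every Han ideograph as its own word and to count runs of letters or digits as one word for everything else. The rule and the class name are my own assumptions for illustration; they do not come from Stanbol, OpenNLP or MDBG, and other unspaced scripts (Japanese kana, Thai, and so on) would need their own handling.

```java
// Hypothetical helper: one word per Han ideograph, one word per
// contiguous run of letters/digits in any other script.
final class PerIdeographWordCount {
    static int getWordCount(String text) {
        int count = 0;
        boolean inWord = false; // inside a non-Han letter/digit run?
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                count++;        // each ideograph counts on its own
                inWord = false; // and ends any run before it
            } else if (Character.isLetterOrDigit(cp)) {
                if (!inWord) {
                    count++;    // first character of a new run
                    inWord = true;
                }
            } else {
                inWord = false; // whitespace or punctuation ends a run
            }
        }
        return count;
    }
}
```

With this rule 供应商代发发货 yields 7, and a space-separated four-word English sentence yields 4.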