java - Get word count from a string in Unicode (in any language)


I want the word count of a string. It's as simple as that. The catch is that the string can be in an unpredictable language.

So I need a function with the signature int getwordcount(string), giving the following sample output:

getwordcount("供应商代发发货") => 7
getwordcount("this sentence") => 4

Any pointers on how to proceed would be appreciated :)

The concept of a "word" may be trivial or complex. Here is what the Apache Stanbol toolkit documentation says:

"Word tokenization: the detection of single words is required by the Stanbol Enhancer to process text. While this is trivial for most languages, it is a rather complex task for some eastern languages, e.g. Chinese, Japanese, Korean. If not otherwise configured, Stanbol will use whitespaces to tokenize words."
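The whitespace fallback described in the quote can be sketched in plain Java. This is only an illustration of splitting on whitespace, not Stanbol's actual code; the class name WhitespaceWordCount is hypothetical, and the method name follows the signature asked for in the question. It counts runs of non-whitespace characters, so a Chinese sentence written without spaces comes out as one "word".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts whitespace-delimited tokens.
final class WhitespaceWordCount {

    // With UNICODE_CHARACTER_CLASS, \S means "not a Unicode White_Space
    // character", so separators such as the ideographic space U+3000
    // also end a token.
    private static final Pattern TOKEN =
            Pattern.compile("\\S+", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the number of maximal runs of non-whitespace characters.
    static int getWordCount(String text) {
        Matcher matcher = TOKEN.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

Calling WhitespaceWordCount.getWordCount("供应商代发发货") returns 1 under this definition, which is why whitespace splitting alone is not enough for languages that do not separate words with spaces.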

So if your concept of a word is linguistic rather than syntactic, you should use an NLP toolkit.

My preferred Java solution is Apache's OpenNLP.
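For a quick experiment without any extra dependency, the JDK's own java.text.BreakIterator can serve as a rough stand-in. To be clear, this is not OpenNLP and not a linguistic tokenizer: it applies the JDK's built-in word-boundary rules, so its segmentation of Chinese, Japanese or Korean text may differ from what a dictionary-based tool reports. The class and method names below are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: counts segments between word boundaries that
// contain at least one letter or digit.
final class BreakIteratorWordCount {

    static int countWords(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);

        int count = 0;
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            // Boundaries also delimit spaces and punctuation; skip those.
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    // Walks the segment by code point so supplementary characters are
    // handled correctly.
    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }
}
```

For example, BreakIteratorWordCount.countWords("Hello, world!", Locale.ENGLISH) counts two words, because the comma, space and exclamation mark segments are skipped.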

Note: I have used http://www.mdbg.net/chindict/chindict.php?page=worddict to tokenize your example. It implies that there are 4 words, not seven. I have cut and pasted the result (rather fragmented):

Simplified, pīnyīn, English definition, traditional, HSK level:

供应商 gōngyìngshāng: supplier (traditional: 供應商)

代 dài: to substitute / act on behalf of others / replace / generation / dynasty / age / period / (historical) era / (geological) eon

发 fā: to send out / show (one's feeling) / issue / develop / classifier for gunshots (rounds) (traditional: 發, HSK 4)

发 fà: hair / Taiwan pr. [fa3] (traditional: 髮)

发货 fāhuò: to dispatch / send out goods (traditional: 發貨)

The first 3 characters appear to form a single word.

