C# encoding: Shift-JIS vs. UTF-8 with Html Agility Pack


I have a problem. My goal is to save text from a Japanese (Shift-JIS encoded) HTML page into a UTF-8 encoded text file, but I don't know how to re-encode the text. The HtmlNode object is encoded in Shift-JIS, and after I call the ToString() method the content is corrupted. My method so far looks like this:

public string GetPage(string url)
{
    string content = "";

    HtmlDocument page = new HtmlWeb() { AutoDetectEncoding = true }.Load(url);
    HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(@class, 'article-def')]");

    if (anchor != null)
    {
        content = anchor.InnerHtml.ToString();
    }
    return content;
}

I tried

Console.WriteLine(page.Encoding.EncodingName);

and got: Japanese (Shift-JIS). Converting the HTML to a string is what produces the error. I thought there should be a way, but since the documentation for the Html Agility Pack is sparse and I couldn't find a solution via Google, I'm here to get some hints.

Well, AutoDetectEncoding doesn't work the way you'd expect it to. From what I found looking at the source code of the Agility Pack, the property is only used when loading a local file from disk, not when loading from a URL.

So there are three options. The first is to simply set the encoding:

OverrideEncoding = Encoding.GetEncoding("shift-jis")

If you know the encoding is always the same, that's the easiest fix.

Or you can download the file locally and load it the same way; instead of the URL you'd pass the file path.

using (var client = new WebClient())
{
    client.DownloadFile(url, "20130519-oyt1t00606.htm");
}
var htmlWeb = new HtmlWeb() { AutoDetectEncoding = true };
var file = new FileInfo("20130519-oyt1t00606.htm");
HtmlDocument page = htmlWeb.Load(file.FullName);

Or you can detect the encoding from the content, like this:

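byte[] pageBytes;
using (var client = new WebClient())
{
    pageBytes = client.DownloadData(url);
}
HtmlDocument page = new HtmlDocument();
using (var ms = new MemoryStream(pageBytes))
{
    // First pass: load with the default encoding, only to read the <meta> tag.
    page.Load(ms);
    // XPath string comparison is case-sensitive; adjust the value to the page's markup.
    var metaContentType = page.DocumentNode.SelectSingleNode("//meta[@http-equiv='Content-Type']").GetAttributeValue("content", "");
    var contentType = new System.Net.Mime.ContentType(metaContentType);
    // Second pass: rewind the stream and reload with the charset the page declares.
    ms.Position = 0;
    page.Load(ms, Encoding.GetEncoding(contentType.CharSet));
}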

And finally, if the page you are querying returns the Content-Type in the response, you can get the encoding from that header instead.
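For that last case, a minimal sketch of pulling the charset out of a Content-Type header value might look like the following. The names ResponseEncoding and FromContentType and the fallback parameter are made up for illustration and are not part of the Agility Pack; the only framework calls are System.Net.Mime.ContentType and Encoding.GetEncoding.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class ResponseEncoding
{
    // Hypothetical helper: returns the encoding named in a Content-Type
    // header value such as "text/html; charset=Shift_JIS", or the given
    // fallback when the header is missing, malformed, or names no charset.
    public static Encoding FromContentType(string contentTypeHeader, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
        {
            return fallback;
        }

        try
        {
            string charset = new ContentType(contentTypeHeader).CharSet;
            if (string.IsNullOrEmpty(charset))
            {
                return fallback;
            }
            // Note: on .NET Core / .NET 5+, Shift-JIS is only available after
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
            return Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            // The header value could not be parsed as a content type.
            return fallback;
        }
        catch (ArgumentException)
        {
            // The charset name is not a known encoding.
            return fallback;
        }
    }
}
```

With WebClient, the header value should be available as client.ResponseHeaders["Content-Type"] once DownloadData has returned, and the resulting encoding can be passed to page.Load(stream, encoding).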

Your code will of course need a few more null checks than mine does. ;)
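As for the original goal of writing the result out as UTF-8: once the page has been decoded with the right encoding, the text is an ordinary .NET string, and the output encoding is simply whatever you pass when writing the file. Here is a rough sketch using only framework classes. Utf8Saver, SaveAsUtf8 and the output file name are hypothetical, and the RegisterProvider line assumes .NET Core or .NET 5+ (on .NET Framework Shift-JIS is built in and that line is not needed).

```csharp
using System.IO;
using System.Text;

public static class Utf8Saver
{
    // Hypothetical helper: decode raw bytes with the source encoding,
    // then write the resulting string as UTF-8 without a byte order mark.
    public static void SaveAsUtf8(byte[] rawBytes, Encoding sourceEncoding, string outputPath)
    {
        string text = sourceEncoding.GetString(rawBytes);
        File.WriteAllText(outputPath, text, new UTF8Encoding(false));
    }

    public static void Demo()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page
        // encodings such as Shift-JIS have to be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding shiftJis = Encoding.GetEncoding("shift_jis");

        // Stand-in for the bytes downloaded from the page.
        byte[] raw = shiftJis.GetBytes("日本語");
        string path = Path.Combine(Path.GetTempPath(), "page-utf8.txt");
        SaveAsUtf8(raw, shiftJis, path);
    }
}
```

For the string returned by GetPage, the same idea comes down to File.WriteAllText(path, content, Encoding.UTF8).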

