C# encoding: Shift-JIS vs. UTF-8 with Html Agility Pack
I have a problem. My goal is to save the text of a Japanese (Shift-JIS encoded) HTML page into a UTF-8 encoded text file, but I don't really know how to encode the text. The HtmlNode object is encoded in Shift-JIS, but after I use the ToString() method, the content is corrupted. My method so far looks like this:
public string GetPage(string url)
{
    string content = "";
    HtmlDocument page = new HtmlWeb() { AutoDetectEncoding = true }.Load(url);
    HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(@class, 'article-def')]");
    if (anchor != null)
    {
        content = anchor.InnerHtml.ToString();
    }
    return content;
}
I tried
Console.WriteLine(page.Encoding.EncodingName);
and got: Japanese (Shift-JIS). Converting the HTML to a string produces the error. I thought there should be a way, but since the documentation for Html Agility Pack is sparse and I couldn't find a solution via Google, I'm here to get some hints.
Well, AutoDetectEncoding doesn't work the way you'd expect it to. From what I found by looking at the source code of Html Agility Pack, the property is only used when loading a local file from disk, not when loading from a URL.
So there are three options. The first is to just set the encoding explicitly on the HtmlWeb object:
OverrideEncoding = Encoding.GetEncoding("shift-jis")
If you know the encoding will always be the same, that's the easiest fix.
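For the original goal (Shift-JIS page in, UTF-8 text file out), the first option could look roughly like the sketch below. The class and method names (ArticleSaver, ExtractArticle, SaveAsUtf8) are made up for illustration and are not part of Html Agility Pack. It uses "shift_jis", the name .NET reports for this code page. One assumption to be aware of: on .NET Core / .NET 5+ the Shift-JIS encoding is only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); on .NET Framework it works out of the box.

```csharp
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class ArticleSaver
{
    // Hypothetical helper: load the page with a forced Shift-JIS decoder
    // and return the markup of the article div.
    public static string GetPage(string url)
    {
        // On .NET Core / .NET 5+ register the code pages provider first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var web = new HtmlWeb { OverrideEncoding = Encoding.GetEncoding("shift_jis") };
        return ExtractArticle(web.Load(url));
    }

    // Same XPath as in the question; returns "" when the div is missing.
    public static string ExtractArticle(HtmlDocument page)
    {
        HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(@class, 'article-def')]");
        return anchor != null ? anchor.InnerHtml : "";
    }

    // A .NET string has no "source encoding" any more once it is decoded;
    // the output encoding is chosen here, at write time (UTF-8 without BOM).
    public static void SaveAsUtf8(string content, string path)
    {
        File.WriteAllText(path, content, new UTF8Encoding(false));
    }
}
```

Usage would then be something like ArticleSaver.SaveAsUtf8(ArticleSaver.GetPage(url), "article.txt"), where "article.txt" is just a placeholder path.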
Or you can download the file locally and load it the same way; instead of the URL you'd pass the file path.
using (var client = new WebClient())
{
    client.DownloadFile(url, "20130519-oyt1t00606.htm");
}
var htmlWeb = new HtmlWeb() { AutoDetectEncoding = true };
var file = new FileInfo("20130519-oyt1t00606.htm");
HtmlDocument page = htmlWeb.Load(file.FullName);
Or you can detect the encoding from the content itself, like this:
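byte[] pageBytes;
using (var client = new WebClient())
{
    pageBytes = client.DownloadData(url);
}
HtmlDocument page = new HtmlDocument();
using (var ms = new MemoryStream(pageBytes))
{
    // First pass: parse with the default encoding, only to read the <meta> tag.
    page.Load(ms);
    // Note: the XPath comparison is case-sensitive; adjust it to the page's markup.
    var metaContentType = page.DocumentNode
        .SelectSingleNode("//meta[@http-equiv='Content-Type']")
        .GetAttributeValue("content", "");
    var contentType = new System.Net.Mime.ContentType(metaContentType);
    // Second pass: rewind the stream and parse again with the declared charset.
    ms.Position = 0;
    page.Load(ms, Encoding.GetEncoding(contentType.CharSet));
}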
And finally, if the page you are querying returns the Content-Type in the response headers, you can get the encoding from there instead.
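A rough sketch of that header-based variant, assuming the server actually sends a charset parameter in its Content-Type header (many don't, hence the fallback argument). HeaderEncoding, FromContentType and Load are hypothetical names; WebClient.ResponseHeaders and System.Net.Mime.ContentType are the real framework pieces doing the work.

```csharp
using System;
using System.Net;
using System.Net.Mime;
using System.Text;
using HtmlAgilityPack;

public static class HeaderEncoding
{
    // Hypothetical helper: turn a Content-Type header value such as
    // "text/html; charset=shift_jis" into an Encoding, or return the fallback.
    public static Encoding FromContentType(string headerValue, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(headerValue)) return fallback;
        try
        {
            string charset = new ContentType(headerValue).CharSet;
            return string.IsNullOrEmpty(charset) ? fallback : Encoding.GetEncoding(charset);
        }
        catch (FormatException) { return fallback; }   // header could not be parsed
        catch (ArgumentException) { return fallback; } // charset name unknown to .NET
    }

    // Download the raw bytes, decode them with the charset from the response
    // header, and hand the decoded string to Html Agility Pack.
    public static HtmlDocument Load(string url, Encoding fallback)
    {
        using (var client = new WebClient())
        {
            byte[] bytes = client.DownloadData(url);
            string header = client.ResponseHeaders[HttpResponseHeader.ContentType];
            Encoding encoding = FromContentType(header, fallback);

            var page = new HtmlDocument();
            page.LoadHtml(encoding.GetString(bytes));
            return page;
        }
    }
}
```

Keeping the header parsing in its own method means it can be checked without any network access.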
Your code would of course need a few more null checks than mine does. ;)
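To give an idea of what those checks might look like for the meta-tag approach, here is a defensive sketch. MetaEncoding, DetectFromMeta and Load are hypothetical names, and the fallback encoding is whatever you consider a sane default for your pages. It compares the http-equiv value case-insensitively in C# rather than in the XPath expression, and it relies on the fact that SelectNodes returns null (not an empty collection) when nothing matches.

```csharp
using System;
using System.IO;
using System.Net.Mime;
using System.Text;
using HtmlAgilityPack;

public static class MetaEncoding
{
    // Look for <meta http-equiv="Content-Type" content="...; charset=...">
    // and return the declared encoding, or the fallback if anything is missing.
    public static Encoding DetectFromMeta(HtmlDocument page, Encoding fallback)
    {
        HtmlNodeCollection metas = page.DocumentNode.SelectNodes("//meta[@http-equiv]");
        if (metas == null) return fallback; // no matching nodes at all

        foreach (HtmlNode meta in metas)
        {
            string equiv = meta.GetAttributeValue("http-equiv", "");
            if (!string.Equals(equiv, "Content-Type", StringComparison.OrdinalIgnoreCase)) continue;

            string content = meta.GetAttributeValue("content", "");
            if (content.Length == 0) continue;

            try
            {
                string charset = new ContentType(content).CharSet;
                if (!string.IsNullOrEmpty(charset)) return Encoding.GetEncoding(charset);
            }
            catch (FormatException) { }   // content attribute could not be parsed
            catch (ArgumentException) { } // charset name unknown to .NET
        }
        return fallback;
    }

    // Two-pass load: parse once with the fallback to find the meta tag,
    // then rewind and parse again with the detected encoding.
    public static HtmlDocument Load(byte[] pageBytes, Encoding fallback)
    {
        var page = new HtmlDocument();
        using (var ms = new MemoryStream(pageBytes))
        {
            page.Load(ms, fallback);
            Encoding detected = DetectFromMeta(page, fallback);
            ms.Position = 0;
            page.Load(ms, detected);
        }
        return page;
    }
}
```

The node lookups afterwards (such as the article div in the question) still need their own null check, since SelectSingleNode also returns null when nothing matches.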