Python: decode French characters in an HTML email attachment


I'm trying to decode an HTML attachment from an email fetched from an IMAP server. If the HTML file contains only plain ASCII characters it works without a problem, but when it contains a French character such as é I get this: "vous \xc3\xa9t\xc3\xa9 envoy\xc3\xa9e par", and literal \n and \r sequences appear as well.

I use BeautifulSoup to search the HTML code, and a loop to check the mail (not shown in the code).

    import email
    import imaplib
    from bs4 import BeautifulSoup

    imap_server = imaplib.IMAP4_SSL("server", 993)
    imap_server.login(username, password)
    imap_server.select("test")
    # search for unseen messages and take the most recent UID
    result, data = imap_server.uid('search', None, "UNSEEN")
    latest_email_uid = data[0].split()[-1]
    result, data = imap_server.uid('fetch', latest_email_uid, '(RFC822)')
    raw_email = data[0][1]
    raw_email = str(raw_email, 'utf8')
    msg = email.message_from_string(raw_email)

I walk through the mail; if I find an HTML part, I decode it from base64 and pass it to BeautifulSoup. Then I print the result encoded as UTF-8. If I replace .encode('utf-8') with latin-1, I still get escaped special characters.

    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == 'text/html':
                # decode=1 undoes the base64 transfer encoding and returns bytes
                attachment = part.get_payload(decode=1)
                soup = BeautifulSoup(attachment)
                print(soup.prettify().encode('utf-8'))
            else:
                print("no html")

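I have tried encoding and decoding with a lot of charsets without getting a clean result. I have also tried base64.b64decode(text).decode('utf-16'), but I still get the same \xc3\xa9.

You see the escape sequences because you are encoding the text to UTF-8 or Latin-1 and then printing the resulting bytes object: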

    >>> print('\xe9')
    é
    >>> print('\xe9'.encode('utf8'))
    b'\xc3\xa9'
    >>> print('\xe9'.encode('latin1'))
    b'\xe9'
    >>> print('hello world!\n'.encode('utf8'))
    b'hello world!\n'

When printing a bytes object, Python shows the repr() representation of the value, which replaces any byte that does not represent a printable ASCII codepoint with a \x.. escape sequence; some bytes are replaced by shorter two-character escapes, such as \r and \n. This makes the representation both re-usable as a Python bytes literal and easier to log to files and terminals that are not set up for international character sets.
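As a minimal sketch of the difference (the sample text is borrowed from the question; the variable names are only illustrative), compare what print() shows for a bytes object with what it shows for the decoded text:

```python
# Bytes as they would come back from get_payload(decode=True);
# here they are built by hand so the sketch is self-contained.
raw = 'vous été envoyée par\r\n'.encode('utf-8')

as_repr = repr(raw)             # exactly what print(raw) would display
as_text = raw.decode('utf-8')   # the actual text, a str

print(as_repr)   # escaped form, with \x.. and \r\n spelled out
print(as_text)   # readable text followed by a real line break
```

The bytes themselves are fine; only their displayed representation is escaped.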

print() handles the encoding for you. Print the .prettify() output directly, without calling .encode() on it.
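The same idea applies to the payload itself. The following is a rough sketch using only the standard library; the message is a hypothetical stand-in built in place of the IMAP fetch, and the 'utf-8' fallback charset is an assumption, not something the question states. It decodes each text/html part to a str using the charset declared on that part:

```python
from email import message_from_bytes
from email.message import EmailMessage

# Hypothetical stand-in for the fetched mail: one base64-encoded HTML part.
original = EmailMessage()
original.set_content('plain-text fallback')
original.add_alternative('<p>vous été envoyée par</p>',
                         subtype='html', cte='base64')
raw_email = original.as_bytes()   # comparable to data[0][1] from the fetch

msg = message_from_bytes(raw_email)   # parse the bytes directly
html_parts = []
for part in msg.walk():
    if part.get_content_type() == 'text/html':
        payload = part.get_payload(decode=True)           # bytes
        charset = part.get_content_charset() or 'utf-8'   # assumed fallback
        html_parts.append(payload.decode(charset, errors='replace'))

for html in html_parts:
    print(html)   # a str, so print() encodes it for the terminal
```

Each decoded string could then be handed to BeautifulSoup, and the str returned by .prettify() printed as-is.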

If printing Unicode to your terminal or console does not work, and instead raises a UnicodeEncodeError, then your terminal or console is not configured to handle Unicode text properly. Consult the PrintFails page on the Python wiki to troubleshoot.
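One way to check this before printing is to ask the output stream which encoding it uses. This is only a sketch; can_print is a hypothetical helper name, and falling back to 'ascii' when the stream reports no encoding is an assumption:

```python
import sys

def can_print(text, stream=None):
    """Return (encoding, ok): whether text is encodable for the stream."""
    if stream is None:
        stream = sys.stdout
    # Some replaced streams have no encoding attribute; assume ASCII then.
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    try:
        text.encode(encoding)
        return encoding, True
    except UnicodeEncodeError:
        return encoding, False

encoding, ok = can_print('\xe9')
print('stdout encoding: %s, can encode e-acute: %s' % (encoding, ok))
```

If the check reports False, the fix belongs in the terminal or locale configuration rather than in extra .encode() calls.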

