json - Python 'ascii' codec can't encode character with requests.get
I have a Python program that crawls data from a site and returns JSON. The crawled site has a meta tag with charset = iso-8859-1. Here is the source code:
    url = 'https://www.example.com'
    source_code = requests.get(url)
    plain_text = source_code.text
After extracting the information with Beautiful Soup, I create JSON from it. The problem is that some symbols, e.g. the € symbol, are displayed as \u0080 or \x80 (in Python), so I can't use or decode them in PHP. I tried plain_text.decode('iso-8859-1') and plain_text.decode('cp1252') so that I could encode the result as UTF-8 afterwards, but every time I get the error: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128).
Edit
The new code, after @chriskoston's suggestion of using .content instead of .text:
    url = 'https://www.example.com'
    source_code = requests.get(url)
    plain_text = source_code.content
    the_sourcecode = plain_text.decode('cp1252').encode('utf-8')
    soup = BeautifulSoup(the_sourcecode, 'html.parser')
Encoding and decoding now work without an error, but the character problem is still there.
Edit 2
The solution is to use .content.decode('cp1252'):
    url = 'https://www.example.com'
    source_code = requests.get(url)
    plain_text = source_code.content.decode('cp1252')
    soup = BeautifulSoup(plain_text, 'html.parser')
Special thanks to Tomalak for the solution.
You must store the result of decode() somewhere, because it does not modify the original variable.
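To make that point concrete, here is a minimal sketch. It is written for Python 3 (the post itself appears to use Python 2), and the sample bytes are made up for illustration rather than taken from the real site.

```python
# Sample bytes as they might arrive from a cp1252-encoded page (hypothetical data).
raw = b"Preis: 5 \x80"

# decode() returns a new string; the result here is simply thrown away.
raw.decode("cp1252")
still_bytes = isinstance(raw, bytes)  # `raw` is unchanged

# Keep the return value in a variable to actually use the decoded text.
text = raw.decode("cp1252")
print(still_bytes, text)
```

Only `text` holds the decoded string; `raw` stays a bytes object no matter how often decode() is called on it.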
Another thing: decode() turns a sequence of bytes into a string. encode() is the opposite: it turns a string into a sequence of bytes.
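A small sketch of both directions, again assuming Python 3 and using made-up sample bytes. It also shows why the choice of codec matters for the € sign: byte 0x80 is € in cp1252 but a control character in ISO-8859-1.

```python
# Hypothetical sample: euro sign and o-umlaut as a cp1252 page would send them.
raw = b"\x80 \xf6"

# bytes -> str
as_cp1252 = raw.decode("cp1252")      # euro sign, space, o-umlaut
as_latin1 = raw.decode("iso-8859-1")  # 0x80 becomes the control character U+0080

# str -> bytes
utf8_bytes = as_cp1252.encode("utf-8")

# Decoding with the same codec gets the original string back.
round_trip = utf8_bytes.decode("utf-8")
print(repr(as_cp1252), repr(as_latin1), utf8_bytes)
```

This matches the symptom in the question: decoding as ISO-8859-1 leaves \x80 in the text, while cp1252 yields the actual € character.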
BeautifulSoup is happy with strings; you don't need to use encode() at all.
    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.example.com'
    response = requests.get(url)
    html = response.content.decode('cp1252')
    soup = BeautifulSoup(html, 'html.parser')
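Since the question is ultimately about producing JSON for PHP, here is a hedged sketch of that last step. It assumes Python 3 and the standard json module; the "price" key and the sample value are hypothetical stand-ins for whatever is extracted from the soup.

```python
import json

# Hypothetical value, decoded the same way as the page content above.
price = b"9,99 \x80".decode("cp1252")

# By default json.dumps escapes non-ASCII characters as \uXXXX sequences.
escaped = json.dumps({"price": price})

# With ensure_ascii=False the characters are kept literally.
literal = json.dumps({"price": price}, ensure_ascii=False)

# Encode explicitly when writing the literal form to a file or socket.
payload = literal.encode("utf-8")
print(escaped)
print(literal)
```

Both forms are valid JSON and parse back to the same string, so once the text has been decoded with the right codec, a conforming JSON parser on the PHP side should see the real € character rather than \u0080.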
Hint: when working with HTML, you might want to look at pyquery instead of BeautifulSoup.