mongodb - Python 2.x - How to work correctly with Unicode text
Okay, I'm trying to manipulate a list of texts obtained from MongoDB. Here is how I create the lists and iterate over the attribute that contains the messages:

    from pymongo import MongoClient

    client = MongoClient('localhost:27017')
    db = client['...']
    query = {'message': {'$exists': 1}, 'id': {'$exists': 1}, 'created_time': {'$exists': 1}}
    projection = {'_id': 0, 'message': 1}
    db_messages = db['dadoscoletados1'].find(query, projection)

    message_list = []
    id_list = []
    time_list = []

    for document in db_messages:
        for key, value in document.iteritems():
            if key == 'message':
                message_list.append(value)
                #print value
                word_final = status_processing(message_list)
            else:
                id_list.append(value)
                time_list.append(value)

    pair_up = zip(id_list, message_list, time_list)
What happens is the following: if I run the find in the mongo shell, the text is returned correctly in Brazilian Portuguese. If I put print message_list in my code, the text comes out in this format:

    e7\xf5es institucionais para uma seguran\xe7a p\xfablica

If I print what is inside value instead, the text comes out correct. I need the list of texts to perform various operations inside the function status_processing. One of these involves BeautifulSoup, and when execution reaches that stage BeautifulSoup exits with:

    declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos)
    TypeError: expected string or buffer
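The two symptoms described above can be reproduced without MongoDB. The following is a minimal sketch, not the original program: the sample text, the regex pattern and the names list_rejected and element_ok are assumptions made for illustration. It shows that the list still holds the same text as the single value (on Python 2, printing a list displays the repr() of each element, which is where escapes like \xe7 come from), and that a compiled regex raises TypeError when it is given a list rather than a string.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Hypothetical sample standing in for one "message" value from the query.
value = u"seguran\xe7a p\xfablica"  # the text "segurança pública"
message_list = [value]

# What `print message_list` displays is repr(message_list): the repr of
# each element. On Python 2 that repr uses escapes such as \xe7.
listing = repr(message_list)

# What `print value` displays is the element itself, which is unchanged.
element = message_list[0]

# A compiled regex rejects a list. This mirrors the failing
# xml_encoding_re.search(markup, ...) call when markup is a list.
pattern = re.compile(u"xml")  # illustrative pattern, not BeautifulSoup's
try:
    pattern.search(message_list)
    list_rejected = False
except TypeError as exc:
    list_rejected = True
    print("TypeError:", exc)

# The same search accepts a single element of the list.
element_ok = pattern.search(element) is None
print("list rejected:", list_rejected, "| element accepted:", element_ok)
```

Under these assumptions the escapes are only a display detail of the list, while the TypeError points at the argument type: a list reached a function that expects one string.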
Here is an example of how I do the processing inside the function status_processing:

    def status_processing(corpus):
        mycorpus = preprocessing.preprocessing()
        mycorpus.text = message_list
        mycorpus.initial_processing()
And here is how I implemented BeautifulSoup:

    def initial_processing(self):
        #def __init__(self, text):
        soup = BeautifulSoup(self.text, "html.parser")
        # TODO: change here if you want to save the links
        self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", soup.get_text())
        self.tokens = self.tokenizing()
What is the best approach in this situation? Should I convert the list to a str with mycorpus.text = str(message_list)? Since I need the text later and have to store it in the database again, converting with str() is not a good option: Mongo warns that the type of _id was created wrong.
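One possible direction, sketched here under assumptions rather than offered as a confirmed fix, is to clean each message on its own instead of converting the whole list with str(). The helper clean_message, the constant URL_RE and the sample messages are hypothetical; the URL pattern is the one from initial_processing.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Same URL pattern as in initial_processing.
URL_RE = re.compile(r'(http://|https://|www.)[^"\' ]+')


def clean_message(text):
    """Replace URLs in one message with a space; the result is still text."""
    # An HTML-stripping step such as
    # BeautifulSoup(text, "html.parser").get_text()
    # would also be applied here, to this single string.
    return URL_RE.sub(" ", text).strip()


# Hypothetical messages standing in for the values read from MongoDB.
message_list = [
    u"seguran\xe7a p\xfablica http://example.com/a",
    u"a\xe7\xf5es institucionais",
]

# str(message_list) yields one string holding the list's repr
# (brackets, quotes and, on Python 2, escapes), not the messages.
as_one_string = str(message_list)

# Processing element by element keeps one clean text per message.
cleaned = [clean_message(m) for m in message_list]
print(len(cleaned), "messages cleaned")
```

Because every element stays a separate text string, the cleaned list keeps the same length and order as message_list, so it can still be paired with the ids and timestamps.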