mongodb - Python 2.x - How to work correctly with Unicode text


Okay, I'm trying to manipulate a list of texts obtained from MongoDB. Here is how I create the list and iterate over the attribute that contains the messages:

    from pymongo import MongoClient

    client = MongoClient('localhost:27017')
    db = client['...']
    # Only documents that have all three fields
    query = {'message': {'$exists': 1}, 'id': {'$exists': 1}, 'created_time': {'$exists': 1}}
    projection = {'_id': 0, 'message': 1}

    db_messages = db['dadoscoletados1'].find(query, projection)

    message_list = []
    id_list = []
    time_list = []
    for document in db_messages:
        for key, value in document.iteritems():
            if key == 'message':
                message_list.append(value)
                #print value
                word_final = status_processing(message_list)
            else:
                id_list.append(value)
                time_list.append(value)

    pair_up = zip(id_list, message_list, time_list)

What happens is the following: if I run a find in the mongo shell, the text is returned correctly in Brazilian Portuguese.

If I put print message_list in my code, the text comes out in this format: e7\xf5es institucionais para uma seguran\xe7a p\xfablica. If I print what is inside value, the text comes out correctly. I need the list of texts to perform various operations inside the function status_processing. One of these involves BeautifulSoup, and when execution reaches that stage BeautifulSoup fails with:

    declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos)
    TypeError: expected string or buffer
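The difference between the two print results can be reproduced without MongoDB. Printing a list displays the repr() of each element, so in Python 2 a unicode string inside a list is shown with \x escapes even though the stored characters are intact. The following is only a minimal sketch: the sample value and the variable names escaped_view and restored are assumptions made for illustration, and the file is kept ASCII-only on purpose.

```python
# Sample value written with escapes so that this file stays ASCII-only.
message = u'seguran\xe7a p\xfablica'
message_list = [message]

# A Python 2 `print message_list` shows the repr() of each element, which is
# an escaped view of the same characters. unicode_escape produces that view.
escaped_view = message_list[0].encode('unicode_escape').decode('ascii')
print(escaped_view)  # prints: seguran\xe7a p\xfablica

# The escaped view is only a display form: decoding it gives back the
# original string, so no data was lost by putting it in a list.
restored = escaped_view.encode('ascii').decode('unicode_escape')
print(restored == message)  # prints: True
```

If this reading is right, the escapes seen when printing the list are not an encoding problem in the data itself.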

Here is an example of how I run the process inside the function status_processing:

    def status_processing(corpus):
        mycorpus = preprocessing.preprocessing()
        mycorpus.text = message_list
        mycorpus.initial_processing()

And here is how I implemented BeautifulSoup:

    # Module-level imports this method relies on
    import re
    from bs4 import BeautifulSoup

    def initial_processing(self):
        #def __init__(self, text):
        soup = BeautifulSoup(self.text, "html.parser")
        # TODO: change this if you want to keep the links
        self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", soup.get_text())
        self.tokens = self.tokenizing()

What is the best approach in this situation? Should I convert to str with mycorpus.text = str(message_list)? Since I need the text later and have to store it back in the database, converting with str() is not a good option: Mongo warns that the type of the _id created is wrong.
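The TypeError suggests that BeautifulSoup received the whole list rather than a single string. One possible approach, instead of str(message_list), is to process each message separately so that every item stays a unicode string. The sketch below is a minimal illustration under stated assumptions: clean_message is a hypothetical helper, the sample messages are made up, and the BeautifulSoup step is left as a comment so that the example runs with the standard library only.

```python
import re

# Same link-stripping pattern as in initial_processing above.
LINK_RE = re.compile(r'(http://|https://|www.)[^"\' ]+')


def clean_message(text):
    """Hypothetical helper: handle ONE message (a single string), not the list."""
    # With bs4 installed, the HTML step would be applied to this one string,
    # for example: text = BeautifulSoup(text, "html.parser").get_text()
    return LINK_RE.sub(u" ", text)


# Made-up sample data, written with escapes to keep the file ASCII-only.
message_list = [
    u'a\xe7\xf5es institucionais http://example.com/x',
    u'seguran\xe7a p\xfablica',
]

# Process item by item; every result is still a unicode string.
cleaned = [clean_message(m) for m in message_list]
print(len(cleaned))  # prints: 2
```

Because each cleaned item remains a unicode string, it could be written back to the database without going through str().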

