python - pandas: Read undelimited text file to DataFrame
I am new to pandas. Until now I've been learning pandas using CSV files and Excel spreadsheets.
Now I'm faced with converting a text file to a DataFrame. The text file is what I'd call sequential data. The format of the file is:
state name city name state name city name city name city name ...
All 50 states plus territories are listed, but the number of cities per state varies. I need to convert this to a DataFrame like
[[state name, city name1],[state name, city name2],...]
Using the pandas read_table() method, I've been able to at least read the file into a DataFrame, but I'm not sure how to get it into the correct state name/city name format.
I have a dictionary of state names and two-letter state abbreviations available. The format of the dictionary is
{'oh':'ohio', 'ky':'kentucky',...}
Is there a way I can use the dictionary, loop over the file, and separate state and city? Or is there an easier way to accomplish this?
Thank you
Edit - a sample of the text file is listed below. Also, please note that I am unable to alter the file.
alabama[edit]
auburn (auburn university)[1]
florence (university of north alabama)
jacksonville (jacksonville state university)[2]
livingston (university of west alabama)[2]
montevallo (university of montevallo)[2]
troy (troy university)[2]
tuscaloosa (university of alabama, stillman college, shelton state)[3][4]
tuskegee (tuskegee university)[5]
alaska[edit]
fairbanks (university of alaska fairbanks)[2]
arizona[edit]
flagstaff (northern arizona university)[6]
tempe (arizona state university)
tucson (university of arizona)
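To experiment with a file like this, it can be loaded as one row per line into a single-column DataFrame. The following is only a sketch: it assumes the real file has one entry per line, uses an in-memory `io.StringIO` stand-in instead of the actual file, and names the column `a` for illustration.

```python
import io

import pandas as pd

# Stand-in for the real file; with a file on disk, open(...) would be used
# in place of io.StringIO (the real file name is not known here).
sample = io.StringIO(
    "alabama[edit]\n"
    "auburn (auburn university)[1]\n"
    "florence (university of north alabama)\n"
    "alaska[edit]\n"
    "fairbanks (university of alaska fairbanks)[2]\n"
)

# Keep one row per non-empty line, all in a single column named "a".
lines = [line.strip() for line in sample if line.strip()]
df = pd.DataFrame({"a": lines})
print(df)
```

Reading the lines with plain Python avoids having to pick a delimiter for a file that has none.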
Say the column is called a. First, find the states like this:
df.a.str.contains(r'\[edit\]')
Out[25]:
0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9      True
10    False
11     True
12    False
13    False
14    False
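If the file did not carry the [edit] marker, the dictionary from the question could flag the state rows instead. This is a rough sketch under assumptions: the three-entry `abbrev` dictionary and the four sample rows are stand-ins, and a city that shares its name with a state would be misclassified.

```python
import pandas as pd

# Hypothetical excerpt of the abbreviation -> state name dictionary.
abbrev = {"al": "alabama", "ak": "alaska", "az": "arizona"}

df = pd.DataFrame({"a": [
    "alabama[edit]",
    "auburn (auburn university)[1]",
    "alaska[edit]",
    "fairbanks (university of alaska fairbanks)[2]",
]})

# Drop everything from the first "[" or "(" onward, leaving the bare name.
names = df["a"].str.replace(r"\s*[\[(].*", "", regex=True)

# A row is treated as a state when its bare name is a known state name.
is_state = names.isin(list(abbrev.values()))
print(is_state.tolist())
```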
Use cumsum to define an index per state+cities group:
csum = df.a.str.contains(r'\[edit\]').cumsum()
csum
Out[26]:
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     2
10    2
11    3
12    3
13    3
14    3
Now you can get the states and the cities:
states = df.groupby(csum).first()
states
Out[38]:
   a
1  alabama[edit]
2  alaska[edit]
3  arizona[edit]

cities = df.groupby(csum).apply(lambda g: g[1:])
cities
Out[39]:
      a
1 1   auburn (auburn university)[1]
  2   florence (university of north alabama)
  3   jacksonville (jacksonville state university)[2]
  4   livingston (university of west alabama)[2]
  5   montevallo (university of montevallo)[2]
  6   troy (troy university)[2]
  7   tuscaloosa (university of alabama, stillman co...
  8   tuskegee (tuskegee university)[5]
2 10  fairbanks (university of alaska fairbanks)[2]
3 12  flagstaff (northern arizona university)[6]
  13  tempe (arizona state university)
  14  tucson (university of arizona)
Now join the two DataFrames:
states.join(cities, rsuffix='_cities')
Out[49]:
      a              a_cities
1 1   alabama[edit]  auburn (auburn university)[1]
  2   alabama[edit]  florence (university of north alabama)
  3   alabama[edit]  jacksonville (jacksonville state university)[2]
  4   alabama[edit]  livingston (university of west alabama)[2]
  5   alabama[edit]  montevallo (university of montevallo)[2]
  6   alabama[edit]  troy (troy university)[2]
  7   alabama[edit]  tuscaloosa (university of alabama, stillman co...
  8   alabama[edit]  tuskegee (tuskegee university)[5]
2 10  alaska[edit]   fairbanks (university of alaska fairbanks)[2]
3 12  arizona[edit]  flagstaff (northern arizona university)[6]
  13  arizona[edit]  tempe (arizona state university)
  14  arizona[edit]  tucson (university of arizona)
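The joined result still carries the [edit] markers and the university text, while the question asks for plain [state name, city name] pairs. As a possible follow-up, here is a sketch of a different technique from the groupby/join above: it forward-fills the state name onto the city rows with `where` and `ffill`, then strips the extra text. The shortened sample, the column names `state` and `city`, and the cleanup patterns are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [
    "alabama[edit]",
    "auburn (auburn university)[1]",
    "florence (university of north alabama)",
    "alaska[edit]",
    "fairbanks (university of alaska fairbanks)[2]",
    "arizona[edit]",
    "tempe (arizona state university)",
]})

# True on the rows that hold a state name.
is_state = df["a"].str.contains(r"\[edit\]")

# Keep the text on state rows only, carry it down to the city rows below,
# then drop the "[edit]" marker.
state = df["a"].where(is_state).ffill().str.replace(r"\[edit\]", "", regex=True)

# On city rows, drop the "(university ...)" part and everything after it.
city = df.loc[~is_state, "a"].str.replace(r"\s*\(.*", "", regex=True)

result = pd.DataFrame({"state": state[~is_state], "city": city}).reset_index(drop=True)
pairs = result.values.tolist()
print(pairs)
```

A city line without parentheses but with a trailing citation marker would keep that marker under this pattern, so the cleanup step may need adjusting for the full file.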