python - Problems while merging two pandas dataframes with different shapes?
This should be quite simple, but I am not sure why I can't merge 2 dataframes. I have the following dfs with different shapes (one is larger and wider than the other):
df1
                 a  id
0    microsoft inc   1
1  apple computer.   2
2      google inc.   3
3              ibm   4
4     amazon, inc.   5
df2
                    b             c    d      e  id
0      (01780-500-01)  237489 - 342  api  true.   1
0       (409-6043-01)        234324  api  other   2
0            23423423           api  nan    nan   3
0   (001722-5e240-60)           nan  nan  other   4
1  (0012172-52411-60)     32423423.  nan  other   4
0   29849032-29482390           api  yes  false   5
1   329482030-23490-1           api  yes  false   5
I would like to merge df1 and df2 by the index column:
df3
                 a                   b             c    d      e  id
0    microsoft inc      (01780-500-01)  237489 - 342  api  true.   1
1  apple computer.       (409-6043-01)        234324  api  other   2
2      google inc.            23423423           api  nan    nan   3
3              ibm   (001722-5e240-60)           nan  nan  other   4
4              ibm  (0012172-52411-60)     32423423.  nan  other   4
5     amazon, inc.   29849032-29482390           api  yes  false   5
6     amazon, inc.   329482030-23490-1           api  yes  false   5
I know this can be done using merge(). I also read an excellent tutorial and tried:
In:
pd.merge(df1, df2, on=df1.id, how='outer')
Out:
IndexError: indices out-of-bounds
Then I tried:
pd.merge(df2, df1, on='id', how='outer')
and apparently it repeats the merged rows several times, like this:
                 a               b             c    d      e  index
0    microsoft inc  (01780-500-01)  237489 - 342  api  true.      1
1  apple computer.   (409-6043-01)        234324  api  other      2
2  apple computer.   (409-6043-01)        234324  api  other      2
3  apple computer.   (409-6043-01)        234324  api  other      2
4  apple computer.   (409-6043-01)        234324  api  other      2
5  apple computer.   (409-6043-01)        234324  api  other      2
6  apple computer.   (409-6043-01)        234324  api  other      2
7  apple computer.   (409-6043-01)        234324  api  other      2
8  apple computer.   (409-6043-01)        234324  api  other      2
...
I think this is related to the fact that I created a temporary index, df2['position'] = df2.index, since the indices were weird, and then removed it. So, my question is: how do I get df3?
Update
I fixed the index of df2 like this:
df2.reset_index(drop=True, inplace=True)
and now it looks like this:
                    b             c    d      e  id
0      (01780-500-01)  237489 - 342  api  true.   1
1       (409-6043-01)        234324  api  other   2
2            23423423           api  nan    nan   3
3   (001722-5e240-60)           nan  nan  other   4
4  (0012172-52411-60)     32423423.  nan  other   4
5   29849032-29482390           api  yes  false   5
6   329482030-23490-1           api  yes  false   5
I am still having the same issue. The merged rows are repeated several times.
>>> print(df2.dtypes)
b     object
c     object
d     object
e     object
id     int64
dtype: object
>>> print(df1.dtypes)
a     object
id     int64
dtype: object
Update 2
>>> print(df2['id'])
0      1
1      2
2      3
3      4
4      4
5      5
6      5
7      6
8      6
9      7
10     8
11     8
12     8
13     8
14     9
15    10
16    11
17    11
18    12
19    12
20    13
21    13
22    14
23    15
24    16
25    16
26    17
27    17
28    18
29    18
...
476  132
477  132
478  132
479  132
480  132
481  132
482  132
483  132
484  133
485  133
486  133
487  133
488  134
489  134
490  134
491  134
492  135
493  135
494  136
495  136
496  137
497  137
498  137
499  137
500  137
501  137
502  137
503  138
504  138
505  138
Name: id, dtype: int64
and
>>> print(df1['id'])
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
10   11
11    8
12   12
13    6
14    7
15    8
16    6
17   11
18   13
19   14
20   15
21   11
22    2
23   16
24   17
25   18
26    9
27   19
28   11
29   20
..
108  57
109  43
110  22
111   2
112  58
113  49
114  22
115  59
116   2
117   6
118  22
119   2
120  37
121   2
122   9
123  60
124  61
125  62
126  63
127  42
128  64
129   4
130  29
131  11
132   2
133  25
134   4
135  65
136  66
137   4
Name: id, dtype: int64
You could try setting the index to id and then using join:
import pandas as pd

# Rebuild the two sample frames from the question
df1 = pd.DataFrame([('microsoft inc', 1),
                    ('apple computer.', 2),
                    ('google inc.', 3),
                    ('ibm', 4),
                    ('amazon, inc.', 5)],
                   columns=['a', 'id'])
df2 = pd.DataFrame([('(01780-500-01)', '237489', '- 342', 'api', 1),
                    ('(409-6043-01)', '234324', ' api', 'other ', 2),
                    ('23423423', 'api', 'nan', 'nan', 3),
                    ('(001722-5e240-60)', 'nan', 'nan', 'other', 4),
                    ('(0012172-52411-60)', '32423423', ' nan', 'other', 4),
                    ('29849032-29482390', 'api', ' yes', ' false', 5),
                    ('329482030-23490-1', 'api', ' yes', ' false', 5)],
                   columns=['b', 'c', 'd', 'e', 'id'])

# Use id as the index (the join key) on both sides
df1 = df1.set_index('id')
df1.drop_duplicates(inplace=True)  # drop repeated rows from the lookup frame
df2 = df2.set_index('id')

# join aligns on the index, so no on= argument is needed
df3 = df1.join(df2, how='outer')
Since you've set the index columns (aka the join keys) for both dataframes, you wouldn't have to specify the on='id' param.
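To make the point about join keys concrete, here is a minimal sketch comparing a join on the index with a merge on the column. It uses a trimmed-down copy of the sample data (only the a, b and id columns, which is my simplification), and the names joined and merged are just illustrative.

```python
import pandas as pd

# Trimmed-down sample data (assumption: only columns a, b and id are kept)
df1 = pd.DataFrame({
    "a": ["microsoft inc", "apple computer.", "google inc.", "ibm", "amazon, inc."],
    "id": [1, 2, 3, 4, 5],
})
df2 = pd.DataFrame({
    "b": ["(01780-500-01)", "(409-6043-01)", "23423423", "(001722-5e240-60)",
          "(0012172-52411-60)", "29849032-29482390", "329482030-23490-1"],
    "id": [1, 2, 3, 4, 4, 5, 5],
})

# Join on the index: both frames are indexed by id, so no on= is needed
joined = df1.set_index("id").join(df2.set_index("id"), how="outer")

# Merge on the column: id stays a regular column in the result
merged = pd.merge(df1, df2, on="id", how="outer")

print(len(joined), len(merged))  # both have 7 rows for this data
print(merged)
```

Either way, each id in df1 is matched against every row of df2 with the same id, so ibm and amazon, inc. each appear twice.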
This is just an alternate way to solve the problem. I don't see anything wrong with pd.merge(df1, df2, on='id', how='outer'). You might want to double check the id column in both dataframes, as mentioned by @johne.
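As a rough sketch of what double checking the id column could look like: the id listings in the question show repeated values in both frames (for example 2 and 11 in df1, 132 and 137 in df2), and when a key is repeated on both sides a merge returns every pairing of the matching rows. The tiny frames left and right below are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the situation: id 2 is repeated in BOTH frames
left = pd.DataFrame({"a": ["x", "y", "z"], "id": [1, 2, 2]})
right = pd.DataFrame({"b": ["p", "q", "r", "s"], "id": [1, 2, 2, 2]})

# Count the extra occurrences of already-seen keys on each side
left_dups = int(left["id"].duplicated().sum())
right_dups = int(right["id"].duplicated().sum())
print(left_dups, right_dups)

# Many-to-many merge: id 1 gives 1 * 1 row, id 2 gives 2 * 3 rows
merged = pd.merge(left, right, on="id", how="outer")
print(len(merged))

# validate= makes pandas raise instead of silently multiplying rows
raised = False
try:
    pd.merge(left, right, on="id", how="outer", validate="one_to_many")
except pd.errors.MergeError as exc:
    raised = True
    print("left keys are not unique:", exc)

# Keeping one row per id on the lookup side gives one output row per right row
left_unique = left.drop_duplicates(subset="id")
merged_unique = pd.merge(left_unique, right, on="id", how="outer",
                         validate="one_to_many")
print(len(merged_unique))
```

Whether dropping duplicates is the right fix depends on what the repeated ids in df1 mean. If they are genuinely different rows, the repeated output is the expected result of the merge.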