python - Problems while merging two pandas dataframes with different shapes?


This is probably quite simple, but I am not sure why I can't merge two dataframes. I have the following dataframes with different shapes (one is larger and wider than the other):

df1

   a                id
0  microsoft inc    1
1  apple computer.  2
2  google inc.      3
3  ibm              4
4  amazon, inc.     5

df2

   b                   c             d    e      id
0  (01780-500-01)      237489 - 342  api  true.  1
0  (409-6043-01)       234324        api  other  2
0  23423423            api           nan  nan    3
0  (001722-5e240-60)   nan           nan  other  4
1  (0012172-52411-60)  32423423.     nan  other  4
0  29849032-29482390   api           yes  false  5
1  329482030-23490-1   api           yes  false  5

I would like to merge df1 and df2 on the id column, to get:

df3

   a                b                   c             d    e      id
0  microsoft inc    (01780-500-01)      237489 - 342  api  true.  1
1  apple computer.  (409-6043-01)       234324        api  other  2
2  google inc.      23423423            api           nan  nan    3
3  ibm              (001722-5e240-60)   nan           nan  other  4
4  ibm              (0012172-52411-60)  32423423.     nan  other  4
5  amazon, inc.     29849032-29482390   api           yes  false  5
6  amazon, inc.     329482030-23490-1   api           yes  false  5

I know this can be done using merge(). I also read this excellent tutorial and tried the following:

In:

pd.merge(df1, df2, on=df1.id, how='outer') 

Out:

IndexError: indices are out-of-bounds

Then I tried:

pd.merge(df2, df1, on='id', how='outer') 

Apparently this repeats the merged rows several times, like this:

   a                b               c             d    e      index
0  microsoft inc    (01780-500-01)  237489 - 342  api  true.  1
1  apple computer.  (409-6043-01)   234324        api  other  2
2  apple computer.  (409-6043-01)   234324        api  other  2
3  apple computer.  (409-6043-01)   234324        api  other  2
4  apple computer.  (409-6043-01)   234324        api  other  2
5  apple computer.  (409-6043-01)   234324        api  other  2
6  apple computer.  (409-6043-01)   234324        api  other  2
7  apple computer.  (409-6043-01)   234324        api  other  2
8  apple computer.  (409-6043-01)   234324        api  other  2
...

I think this is related to the fact that I created a temporary index, df2['position'] = df2.index, since the indices looked weird, and then removed it. So, my question is: how do I get df3?

Update

I fixed the index of df2 like this:

df2.reset_index(drop=True, inplace=True)

And now it looks like this:

   b                   c             d    e      id
0  (01780-500-01)      237489 - 342  api  true.  1
1  (409-6043-01)       234324        api  other  2
2  23423423            api           nan  nan    3
3  (001722-5e240-60)   nan           nan  other  4
4  (0012172-52411-60)  32423423.     nan  other  4
5  29849032-29482390   api           yes  false  5
6  329482030-23490-1   api           yes  false  5

I am still having the same issue: the merged rows are repeated several times.

>>> print(df2.dtypes)
b     object
c     object
d     object
e     object
id     int64
dtype: object

>>> print(df1.dtypes)
a     object
id     int64
dtype: object
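One possible reason the reset made no difference: when merging on a column, pandas matches rows on that column's values and does not use the row labels of either input. A minimal sketch, using made-up miniature frames (df1_small, df2_small are hypothetical stand-ins, not the data above):

```python
import pandas as pd

# Hypothetical miniature of df2, with repeated row labels like those in the question.
df2_small = pd.DataFrame(
    {"b": ["x", "y", "z"], "id": [4, 4, 5]},
    index=[0, 1, 0],
)
df1_small = pd.DataFrame({"a": ["ibm", "amazon, inc."], "id": [4, 5]})

# Merge before touching the index.
before = pd.merge(df1_small, df2_small, on="id", how="outer")

# Renumber the rows 0..n-1 and discard the old labels, then merge again.
df2_reset = df2_small.reset_index(drop=True)
after = pd.merge(df1_small, df2_reset, on="id", how="outer")

print(list(df2_reset.index))
print(before.equals(after))  # the two merge results are compared here
```

If the two results match, as they should here, the repeated rows are coming from the values in id rather than from the index.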

Update 2

>>> print(df2['id'])
0        1
1        2
2        3
3        4
4        4
5        5
6        5
7        6
8        6
9        7
10       8
11       8
12       8
13       8
14       9
15      10
16      11
17      11
18      12
19      12
20      13
21      13
22      14
23      15
24      16
25      16
26      17
27      17
28      18
29      18
      ...
476    132
477    132
478    132
479    132
480    132
481    132
482    132
483    132
484    133
485    133
486    133
487    133
488    134
489    134
490    134
491    134
492    135
493    135
494    136
495    136
496    137
497    137
498    137
499    137
500    137
501    137
502    137
503    138
504    138
505    138
Name: id, dtype: int64

And:

>>> print(df1)
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10
10     11
11      8
12     12
13      6
14      7
15      8
16      6
17     11
18     13
19     14
20     15
21     11
22      2
23     16
24     17
25     18
26      9
27     19
28     11
29     20
       ..
108    57
109    43
110    22
111     2
112    58
113    49
114    22
115    59
116     2
117     6
118    22
119     2
120    37
121     2
122     9
123    60
124    61
125    62
126    63
127    42
128    64
129     4
130    29
131    11
132     2
133    25
134     4
135    65
136    66
137     4
Name: id, dtype: int64

You could try setting the index to id and using join:

import pandas as pd

# Sample data rebuilt from the question
df1 = pd.DataFrame([('microsoft inc', 1),
                    ('apple computer.', 2),
                    ('google inc.', 3),
                    ('ibm', 4),
                    ('amazon, inc.', 5)], columns=('a', 'id'))

df2 = pd.DataFrame([('(01780-500-01)', '237489', '- 342', 'api', 1),
                    ('(409-6043-01)', '234324', ' api', 'other   ', 2),
                    ('23423423', 'api', 'nan', 'nan', 3),
                    ('(001722-5e240-60)', 'nan', 'nan', 'other', 4),
                    ('(0012172-52411-60)', '32423423', '   nan', 'other', 4),
                    ('29849032-29482390', 'api', '    yes', '     false', 5),
                    ('329482030-23490-1', 'api', '    yes', '     false', 5)],
                   columns=['b', 'c', 'd', 'e', 'id'])

# Use id as the index (the join key) on both frames
df1 = df1.set_index('id')
# drop_duplicates ignores the index, so this removes repeated values of 'a'
df1.drop_duplicates(inplace=True)
df2 = df2.set_index('id')

# join aligns on the index by default
df3 = df1.join(df2, how='outer')

Since you've set the index (i.e. the join key) on both dataframes, you wouldn't have to specify the on='id' param.
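To illustrate the point about not needing on='id', here is a small sketch with hypothetical frames (left and right are invented for the example). It compares an index-based join with the equivalent pd.merge call that uses left_index and right_index:

```python
import pandas as pd

# Hypothetical frames, both indexed by id; the right one repeats id 4.
left = pd.DataFrame({"a": ["ibm", "amazon, inc."], "id": [4, 5]}).set_index("id")
right = pd.DataFrame({"b": ["p", "q", "r"], "id": [4, 4, 5]}).set_index("id")

# join aligns on the index, so no key column has to be named.
joined = left.join(right, how="outer")

# The same operation spelled out with merge.
merged = pd.merge(left, right, left_index=True, right_index=True, how="outer")

print(joined)
print(joined.equals(merged))
```

Either spelling should give one output row per matching pair of index values, so the repeated id 4 on the right produces two rows for ibm.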

This is just an alternate way to solve the problem. I don't see anything wrong with pd.merge(df1, df2, on='id', how='outer'). You might want to double check the id column in both dataframes, as mentioned by @johne.
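One way to double check the id columns is to look for repeated keys: if an id occurs m times in one frame and n times in the other, the merge returns m * n rows for it. The sketch below uses invented frames (df1_dup and df2_dup are hypothetical), and the validate argument assumes a reasonably recent pandas version:

```python
import pandas as pd

# Hypothetical stand-ins: id 2 appears three times on the left, twice on the right.
df1_dup = pd.DataFrame({"a": ["apple computer."] * 3 + ["ibm"], "id": [2, 2, 2, 4]})
df2_dup = pd.DataFrame({"b": ["(409-6043-01)", "other"], "id": [2, 2]})

# Is the key unique? Which rows share a key with another row?
print(df1_dup["id"].is_unique)
print(df1_dup[df1_dup["id"].duplicated(keep=False)])

# Many-to-many merge: 3 * 2 rows for id 2, plus 1 row for id 4.
merged = pd.merge(df1_dup, df2_dup, on="id", how="outer")
print(len(merged))

# After dropping the repeated rows on the left, validate asks pandas to raise
# an error if the left keys are still not unique.
deduped = df1_dup.drop_duplicates()
clean = pd.merge(deduped, df2_dup, on="id", how="outer", validate="one_to_many")
print(len(clean))
```

With validate set, a merge that would multiply rows fails loudly instead of quietly returning a larger frame.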

