r - Data.Table non-equi join with arithmetic operations -


i'm trying complex self-join on (for r) large data structure (tens hundreds of millions of rows), creating new columns 1 operation i'd avoid literally add gigs of memory pressure object, since want play different join time parameters.

structure of dt_sample:

str(dt_sample) classes ‘data.table’ , 'data.frame':  50 obs. of  6 variables:  $ gateway_airport: chr  "bos" "bos" "bos" "bos" ...  $ final_airport  : chr  "ord" "bna" "ord" "rsw" ...  $ dept_utc       : posixct, format: "2016-11-17 15:09:00" "2016-11-17 21:00:00" "2016-11-17 12:40:00" ...  $ arriv_utc      : posixct, format: "2016-11-17 17:03:00" "2016-11-17 23:00:00" "2016-11-17 14:35:00" ...  $ airlines_id    : chr  "ua" "b6" "ua" "b6" ...  $ flight_number  : num  1472 1907 449 965 3839 ... 

the idea self-join on x's final_airport y's gateway_airport, , y's dept_utc within range of x's arriv_utc (in example below between +30 minutes , +8 hours inclusive).

this works, creates large structure gets filtered after merge:

result <- merge(dt_sample, dt_sample, by.x=c("final_airport"),                 by.y=("gateway_airport"))[arriv_utc.x + 1800 <= dept_utc.y                                            &arriv_utc.x  +28800 >= dept_utc.y,] 

i'd prefer use on = .() notation, not seem allow arithmetic operations on join elements , self-join seems confuse well. again, i'd prefer not add new columns. any ideas on how use data.table here?

the result of head(result) data.table 3 airports per row, origin, intermediary stop , final airport (+ info). full result sample in dput below 19 rows long.

   final_airport gateway_airport          dept_utc.x         arriv_utc.x airlines_id.x flight_number.x final_airport          dept_utc.y         arriv_utc.y 1:           iad             bos 2016-11-17 14:35:00 2016-11-17 16:18:00            ua             525           jax 2016-11-17 17:30:00 2016-11-17 19:37:00 2:           iad             bos 2016-11-17 14:35:00 2016-11-17 16:18:00            ua             525           sav 2016-11-17 17:30:00 2016-11-17 19:16:00 3:           iad             bos 2016-11-17 14:35:00 2016-11-17 16:18:00            ua             525           tys 2016-11-17 17:31:00 2016-11-17 19:10:00 4:           iad             bos 2016-11-17 14:35:00 2016-11-17 16:18:00            ua             525           den 2016-11-17 17:35:00 2016-11-17 19:36:00 5:           iad             bos 2016-11-17 14:35:00 2016-11-17 16:18:00            ua             525           gso 2016-11-17 17:40:00 2016-11-17 19:09:00 6:           iad             bos 2016-11-17 14:35:00 2016-11-17 16:18:00            ua             525           lax 2016-11-17 17:40:00 2016-11-17 20:25:00    airlines_id.y flight_number.y 1:            ac            3891 2:            ac            2736 3:            ac            2567 4:            ua             735 5:            ac            2727 6:            ua             632 

click below expand 50-row sample dput structure reproducibility:

structure(list(gateway_airport = c("bos", "bos", "bos", "bos",   "iad", "iad", "iad", "bos", "iad", "bos", "bos", "bos", "bos",   "iad", "bos", "bos", "iad", "bos", "bos", "bos", "iad", "bos",   "bos", "bos", "bos", "iad", "bos", "iad", "bos", "iad", "iad",   "iad", "bos", "iad", "bos", "bos", "bos", "iad", "iad", "bos",   "iad", "bos", "bos", "bos", "iad", "bos", "iad", "bos", "bos",   "iad"), final_airport = c("ord", "bna", "ord", "rsw", "orf",   "mco", "den", "clt", "dfw", "phl", "ord", "mia", "ord", "gso",   "jfk", "fll", "ord", "ord", "lga", "lga", "lax", "ord", "iad",   "ric", "den", "tys", "sea", "tpa", "sav", "roa", "sea", "dfw",   "phl", "mia", "iah", "phl", "lga", "jfk", "jax", "tpa", "tpa",   "iah", "dfw", "las", "sav", "iad", "lax", "lga", "sfo", "lax"  ), dept_utc = structure(c(1479395340, 1479416400, 1479386400,   1479397800, 1479420600, 1479422700, 1479404100, 1479386100, 1479384840,   1479387600, 1479378840, 1479386700, 1479402000, 1479404400, 1479403800,   1479418500, 1479421500, 1479384000, 1479420900, 1479387600, 1479404400,   1479412500, 1479411000, 1479381000, 1479412920, 1479403860, 1479395700,   1479389100, 1479398400, 1479421500, 1479415200, 1479400140, 1479415440,   1479380400, 1479406080, 1479382200, 1479413700, 1479394800, 1479403800,   1479414300, 1479423000, 1479392520, 1479411600, 1479384000, 1479403800,   1479393300, 1479391200, 1479400200, 1479397500, 1479420600), class = c("posixct",   "posixt"), tzone = ""), arriv_utc = structure(c(1479402180, 1479423600,   1479393300, 1479410880, 1479424620, 1479431160, 1479411360, 1479395520,   1479393900, 1479393360, 1479386700, 1479400020, 1479408780, 1479409740,   1479408240, 1479431340, 1479425280, 1479391860, 1479425640, 1479392100,   1479414300, 1479419280, 1479417120, 1479387600, 1479422940, 1479409800,   1479407460, 1479397800, 1479408180, 1479426180, 1479425580, 1479409500,   1479421740, 1479390420, 1479418260, 1479387900, 1479418320, 1479399360,   1479411420, 1479426420, 1479431940, 1479404880, 1479423900, 1479395340,   1479410160, 1479399480, 1479401580, 1479404640, 1479411300, 1479430860  ), class = c("posixct", "posixt"), tzone = ""), airlines_id = c("ua",   "b6", "ua", "b6", "ac", "ua", "ua", "aa", "aa", "b6", "aa", "aa",   "aa", "ac", "ei", "b6", "ua", "aa", "b6", "aa", "ua", "ua", "ua",   "b6", "ua", "ac", "b6", "ua", "b6", "ac", "ua", "aa", "b6", "aa",   "ua", "aa", "b6", "b6", "ac", "ei", "ua", "ua", "b6", "b6", "ac",   "ua", "ua", "b6", "ua", "ua"), flight_number = c(1472, 1907,   449, 965, 3839, 419, 735, 1735, 2569, 459, 1155, 1274, 1240,   2727, 5021, 1969, 511, 1404, 1331, 2126, 632, 981, 511, 1481,   448, 2567, 597, 2002, 49, 2726, 357, 1556, 1059, 1083, 1233,   543, 1231, 1308, 3891, 5290, 360, 167, 1115, 1077, 2736, 525,   470, 831, 477, 325)), .names = c("gateway_airport", "final_airport",   "dept_utc", "arriv_utc", "airlines_id", "flight_number"), class = c("data.table",   "data.frame"), row.names = c(na, -50l), .internal.selfref = <pointer: 0x2301358>)


Comments

Popular posts from this blog

account - Script error login visual studio DefaultLogin_PCore.js -

xcode - CocoaPod Storyboard error: -