node.js - Count Unique using Redis and MongoDB (HyperLogLog)
I have a collection in MongoDB. A sample doc follows -
{ "_id" : ObjectId("58114e5e43d6420b7db4e15c"), "browser" : "chrome", "name": "hyades", "country" : "in", "day" : "16-10-21", "ip" : "0.0.0.0", "class" : "a123" }
Problem statement
I should be able to group on any of the fields while fetching the distinct number of IPs.
The aggregation query -
[ {$group: {_id: '$class', ip_arr: {$addToSet: '$ip'}}}, {$project: {class: '$_id', ip: {$size: '$ip_arr'}}} ]
This gives the desired results, but it is slow. Counting the IPs using $group is slow as well. The output -
[{class: "a123",ip: 42},{class: "b123", ip: 56}..]
What I tried
I considered using HyperLogLog for this and tried the Redis implementation. I stream the entire data, projecting just the thing I group on, and PFADD it into the corresponding HyperLogLog structure in Redis.
The logic looks like -
var stream = model.find({}, {ip: 1, class: 1}).stream();
stream.on('data', function (doc) {
  // one HyperLogLog key per class
  var hash = "hll/" + doc.class;
  // one PFADD (one round-trip) per document
  client.pfadd(hash, doc.ip);
});
I tried to run this on a million-plus data points. The size of the data to be streamed was around 1 GB, with a 1 Gbps connection between the Mongo and Node servers. I had expected this code to run fast enough. However, it was pretty slow (slower than counting in MongoDB).
Another thing I thought of but didn't implement was to pre-create buckets for each class and increment them in real time as the data flows in. But the memory required to support arbitrary grouping was huge, so I had to drop the idea.
Please suggest what I might be doing wrong, or what I could improve here so that I am able to take full advantage of HyperLogLog (I am not constrained to Redis, and am open to any implementation).
Disclaimer: I've only used Redis from C++ & Python, so this may not help, but...
PFADD supports multiple arguments; in a system where I was using Redis HLL for counting unique entries, I found that batching them up and sending a single PFADD with many (of the order of 100) items in one go resulted in a significant speedup - presumably due to avoiding the Redis client round-trip.
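A rough sketch of that batching on the Node side, in case it helps. The helper name createPfaddBatcher and the sendPfadd callback are made up for illustration; sendPfadd is assumed to issue one PFADD with many members, and the exact call depends on which Redis client library and version you use.

```javascript
// Hypothetical helper: buffers members per HLL key and flushes them in batches.
// `sendPfadd(key, members)` is an assumed callback that issues ONE PFADD
// carrying many members; how it talks to Redis depends on your client library.
function createPfaddBatcher(sendPfadd, batchSize) {
  const buffers = new Map(); // key -> array of pending members

  function flushKey(key) {
    const members = buffers.get(key);
    if (members && members.length > 0) {
      buffers.delete(key);
      sendPfadd(key, members);
    }
  }

  return {
    // Queue one member; send the batch once the key has batchSize members.
    add(key, member) {
      let members = buffers.get(key);
      if (!members) {
        members = [];
        buffers.set(key, members);
      }
      members.push(member);
      if (members.length >= batchSize) flushKey(key);
    },
    // Send whatever is still buffered (call this when the stream ends).
    flushAll() {
      for (const key of Array.from(buffers.keys())) flushKey(key);
    },
  };
}
```

In the question's stream handler you would call batcher.add("hll/" + doc.class, doc.ip) on each 'data' event and batcher.flushAll() when the stream ends. A batch size of around 100 matches what worked for me, but treat it as a starting point to measure, not a recommendation.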