Feb 15, 2011

Selecting distinct entities across a large table

I faced this kind of problem some time ago.
I tried some of the solutions suggested below (in-memory sort and filtering, encoding things into keys, etc.) and benchmarked them for both latency and CPU cycles using test data of around 100K entities.
Another approach I have taken is encoding the date as an integer (days since the start of the epoch, or days since the start of the year; likewise hour of day or month, depending on how much detail you need in your output) and saving it into a property. This way you turn your date filter into an equality-only filter, which does not even need a composite index, and you can then sort or filter on other properties.
Benchmarking this last solution, I found that when the filtered result set is a small fraction of the unfiltered original set, it is an order of magnitude (or more) faster and more CPU-efficient. In the worst case, when filtering does not reduce the result set at all, latency and CPU usage were comparable to the other solutions.
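
For example, the encoding can be as simple as this (a Python sketch against the old db API; the property names and the extra 'car' equality filter are mine, just for illustration):

    from datetime import date
    from google.appengine.ext import db

    EPOCH = date(1970, 1, 1)

    def day_number(d):
        # Days since the epoch; encode months or hours the same way
        # if you need coarser or finer granularity.
        return (d - EPOCH).days

    class CarJourney(db.Model):
        car = db.StringProperty()
        start = db.DateTimeProperty()
        start_day = db.IntegerProperty()  # set to day_number(start.date()) at write time

    # Equality filters combine freely with other filters and don't use up
    # the datastore's single inequality-property slot the way a date
    # inequality would:
    q = (CarJourney.all()
         .filter('start_day =', day_number(date(2011, 2, 14)))
         .filter('car =', '123'))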

Hope this helps, or did I miss something?
Happy coding :-)


On Feb 14, 11:51 pm, Stephen Johnson wrote:
> Okay, I think I got something that might work. Reverse the StartDate and
> CarId for the key from what I said above, so the key would look like this:
> 2011:02:14:17:13:33:123, and the KEYS ONLY query then is:
>
> select __key__ where __key__ >= MAKEKEY(StartDate + CarId) && __key__ <= MAKEKEY(EndDate + CarId) order by __key__ DESC
>
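
Something like this, in Python with the old db API (a sketch; make_key stands in for MAKEKEY above, and the '\xff' suffix is just one way to build an upper bound that sorts after every car id):

    from google.appengine.ext import db

    def make_key(dt, car_id=''):
        # Zero-padded, most-significant-first timestamp, so key names
        # sort chronologically; the car id is the final segment.
        name = dt.strftime('%Y:%m:%d:%H:%M:%S') + ':' + car_id
        return db.Key.from_path('CarJourney', name)

    # CarJourney is the model class; start_date/end_date are datetimes.
    q = db.Query(CarJourney, keys_only=True)
    q.filter('__key__ >=', make_key(start_date))
    q.filter('__key__ <=', make_key(end_date, '\xff'))  # sorts past any real car id
    q.order('-__key__')
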
> Now, you can use the Async query to start processing. You're going to get
> entries that you're not interested in, but you're only getting the key field
> back and not the whole CarJourney entry, and this key/id has the Date and Car
> ID, so the first time you hit a Car ID for each Car, you have the ID for
> the latest CarJourney for that car. Once you've found all the car IDs you're
> looking for, you can abort the query, or you'll reach the end of the query
> results. As you're looping, store the KEYs of the entries you're looking
> for, and then do a batch GET on memcache to retrieve as many Car (you've got
> the car id) and CarJourney (you've got the carjourney id) entries as might
> be stored there; then, for any that you didn't get from memcache, you can
> do a batch GET on the datastore using the keys/ids that you have.
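
That loop plus the batched lookups might look roughly like this (continuing the sketch above; wanted_cars and using str(key) as the memcache key are my assumptions, not from the thread):

    from google.appengine.api import memcache
    from google.appengine.ext import db

    latest = {}  # car id -> key of that car's most recent CarJourney
    for key in q.run():                        # q is the keys-only query above
        car_id = key.name().rsplit(':', 1)[1]  # car id is the last key segment
        if car_id not in latest:
            latest[car_id] = key               # first hit is the newest (DESC order)
        if len(latest) == len(wanted_cars):    # wanted_cars: the car ids you need
            break                              # every car found; abort the scan

    # Batch GET on memcache first, then one batch GET on the datastore for misses.
    cached = memcache.get_multi([str(k) for k in latest.values()])
    misses = [k for k in latest.values() if str(k) not in cached]
    journeys = cached.values() + db.get(misses)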
>
> I think that if you memcache things appropriately and use the batch gets for
> memcache and datastore then this might just work for you.
>
> Let me know what you think. It's an interesting problem,
> Stephen
>
> On Mon, Feb 14, 2011 at 2:12 PM, Stephen Johnson wrote:
>
> > Or maybe it blocks on different result sets just not on getting the next
> > fetch block?? Hmmm. Sounds like a tough problem.
>
> > On Mon, Feb 14, 2011 at 2:09 PM, Stephen Johnson wrote:
>
> >> Are you using .asList (which I think blocks like you describe)? I
> >> thought asIterable or asIterator wasn't supposed to (if you're using Java).
>
> >> On Mon, Feb 14, 2011 at 12:38 PM, Edward Hartwell Goose <ed.go...@gmail.com> wrote:
>
> >>> Hi Calvin & Stephen,
>
> >>> Thanks for the ideas.
>
> >>> Calvin:
> >>> We can't do the filtering in memory. We potentially have a car making
> >>> a journey every 3 seconds (the car analogy isn't so good...), and we
> >>> could have up to 2,000 cars.
>
> >>> We need to be able to look back up to 2 months, so it could be up to
> >>> 1.8 billion rows in this table.
>
> >>> Stephen:
> >>> That's an interesting idea. However, the asynchronous API actually
> >>> fires the requests sequentially; it just doesn't block. (Or at least,
> >>> that's my experience.)
>
> >>> So, at the moment we fire off 1 query (which Google turns into 2) for
> >>> each site. And although the method call returns instantly, it still
> >>> takes ~5 seconds in total with basic test data. If each call takes
> >>> 12ms, we still have to wait 24 seconds for 2,000 sites.
>
> >>> So, the first call starts at time 0, the second call starts at 0+12,
> >>> the third at 0+12+12... etc. With 2,000 sites, this works out about 24
> >>> seconds. Once you've added in the overheads and getting the list of
> >>> Cars in the first place, it's too long.
>
> >>> If we could start even 100 queries simultaneously at time 0, that'd
> >>> be superb. We thought we could do it with multithreading, but that's
> >>> not allowed on App Engine.
>
> >>> Finally - I've also posted this on StackOverflow -
>
> >>> http://stackoverflow.com/questions/4993744/selecting-distinct-entitie...
>
> >>> I'll try and keep both updated.
>
> >>> Any more thoughts welcome!
> >>> Ed
>
> >>> On Feb 14, 6:47 pm, Calvin wrote:
> >>> > Can you do filtering in memory?
>
> >>> > This query would give you all of the journeys for a list of cars
> >>> > within the date range:
> >>> > carlist = ['123','333','543','753','963','1236']
> >>> > start_date = datetime.datetime(2011, 1, 30)
> >>> > end_date = datetime.datetime(2011, 2, 10)
>
> >>> > journeys = Journey.all().filter('start >', start_date).filter('start <', end_date).filter('car IN', carlist).order('-start').fetch(100)
> >>> > len(journeys)
> >>> > 43 # <- since it's less than 100 I know I've gotten them all
> >>> > then since the list is sorted I know the first entry per car is the
> >>> > most recent journey:
>
> >>> > results = {}
> >>> > for journey in journeys:
> >>> > ...   if journey.car in results:
> >>> > ...     continue
> >>> > ...   results[journey.car] = journey
>
> >>> > len(results)
> >>> > 6
>
> >>> > for result in results.values():
> >>> > ...   print("%s : %s" % (result.car, result.start))
> >>> > 753 : 2011-02-09 12:38:48.887976
> >>> > 1236 : 2011-02-06 13:59:35.221003
> >>> > 963 : 2011-02-08 14:03:54.587609
> >>> > 333 : 2011-02-09 10:40:09.466700
> >>> > 543 : 2011-02-09 15:28:53.197123
> >>> > 123 : 2011-02-09 14:09:02.680870
>
