Wednesday, October 24, 2012

Django + Google App Engine + MapReduce

If you're using Django-nonrel on Google App Engine, mapreduce will not work out of the box. I put in a bit of work to get it running. Fortunately, I was not the first. This blog post suggests some code to get you started and allows you to run a mapper on all of your entities. Unfortunately, it only allows you to map App Engine entities, not Django entities. The code below fixes that issue. It works in a similar way, but performs a Django "get" before running the mapper to convert each key into a Django entity. This adds a bit of overhead: one extra get per mapped entity.

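from django.db.models.sql.query import Query
from google.appengine.datastore import datastore_query
from mapreduce import util
from mapreduce.input_readers import AbstractDatastoreInputReader


class DjangoEntityInputReader(AbstractDatastoreInputReader):
  '''
  An input reader that takes a Django model ('app.models.Model')
  and yields entities for that model
  '''
  def _iter_key_range(self, k_range):
    # Resolve the dotted path to the model class once, without eval().
    model = util.for_name(self._entity_kind)
    # Ask the Django-nonrel backend which datastore kind backs the model.
    query = Query(model).get_compiler(using="default").build_query()
    raw_entity_kind = query.db_table
    query = k_range.make_ascending_datastore_query(
        raw_entity_kind, keys_only=True)
    for key in query.Run(config=datastore_query.QueryOptions(
        batch_size=self._batch_size)):
      # One extra Django get per key turns it into a model instance.
      yield key, model.objects.get(pk=key.id())

  @classmethod
  def _get_raw_entity_kind(cls, entity_kind):
    '''
    A bit of a hack, returns a table name based on entity kind.
    '''
    return entity_kind.replace(".models.", "_").lower()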
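
To make the table-name hack concrete, here is a small standalone sketch of the same string transformation. The function name raw_entity_kind is only for illustration, and it assumes your model uses Django's default table naming (no custom db_table in Meta).

```python
def raw_entity_kind(entity_kind):
    """Mimic the hack: 'app.models.Model' becomes 'app_model'."""
    return entity_kind.replace(".models.", "_").lower()


# Django's default table name for MyModel in the app "myapp".
print(raw_entity_kind("myapp.models.MyModel"))  # myapp_mymodel
```

If a model sets its own db_table, this guess will not match the real datastore kind.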


To use the input reader, place the DjangoEntityInputReader class in your views.py and use the following in your mapreduce.yaml:

mapreduce:
- name: My mapper
  mapper:
    input_reader: myapp.views.DjangoEntityInputReader
    handler: myapp.my_mapper
    params:
    - name: entity_kind
      default: myapp.models.MyModel
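
For reference, here is a rough sketch of what the handler might look like. Since the input reader yields a (key, entity) pair, the handler should receive that pair as its single argument. The body of my_mapper and the "processed" field are made up for illustration, and FakeEntity is only a stand-in so the sketch runs outside App Engine.

```python
def my_mapper(item):
    """Handle one (datastore key, Django model instance) pair."""
    key, entity = item
    # `processed` is a hypothetical field, used only for illustration.
    if not entity.processed:
        entity.processed = True
        entity.save()


class FakeEntity(object):
    """Stand-in for a Django model instance so the sketch runs anywhere."""

    def __init__(self):
        self.processed = False
        self.saved = 0

    def save(self):
        self.saved += 1


entity = FakeEntity()
my_mapper(("fake-key", entity))
```

Because the entity is a regular Django model instance, you save it with entity.save() instead of yielding a datastore put operation.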

That's all you need to get mapreduce up and running, but there is an additional problem. Mapreduce uses a property called "__scatter__" to split up the entities and assign them to map reduce shards. However, Django does not have the __scatter__ property, so all of the entities get assigned to a single map reduce shard, and you do not get to enjoy the massive parallelism of mapreduce. To fix this, you'll need some code of mine, which I posted here. Feel free to contact me if you have any questions.
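
To see why missing scatter samples collapse everything into one shard, here is a simplified illustration of splitting a key space from sampled keys. This is not the mapreduce library's code; the function names are hypothetical, and plain integers stand in for datastore keys.

```python
def choose_split_points(sorted_keys, shard_count):
    """Pick shard_count - 1 evenly spaced keys as shard boundaries."""
    stride = len(sorted_keys) / float(shard_count)
    return [sorted_keys[int(round(stride * i))]
            for i in range(1, shard_count)]


def split_into_ranges(sampled_keys, shard_count):
    """Return (start, end) key ranges; None means unbounded."""
    keys = sorted(sampled_keys)
    if not keys:
        # No samples at all: one range covers every entity.
        return [(None, None)]
    if len(keys) >= shard_count:
        keys = choose_split_points(keys, shard_count)
    bounds = [None] + keys + [None]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


with_samples = split_into_ranges([10, 20, 30, 40, 50, 60, 70, 80], 4)
without_samples = split_into_ranges([], 4)
```

With eight sampled keys and four shards you get four ranges. With no samples you get a single unbounded range, which is the single-shard behavior described above.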