Friday, December 9, 2011

Updating schemas in App Engine [post price increase]

Since Google's App Engine got significantly more expensive, it has become more difficult to update all entities for a model while staying under the free quota.  

There are many reasons to update all the entities for a model, including:
  • Updating a model's schema.
    • Adding new fields to existing entities.
    • Removing obsolete non-indexed properties from existing entities.
  • Adding an index for an existing property that was previously unindexed.
Previously, a developer could use mapreduce operations to update their model's schema.  Mapreduce was great: it would iterate through all my entities of a given type (about 5,000 entities or so) in under a minute.  Since mapreduce ran in minutes and developers never paid for datastore writes or reads under the old pricing scheme, this update method cost almost nothing.

Under the new pricing scheme, developers pay for datastore writes and reads.  The cost of writing an entity to the datastore depends on the number of indexed properties on that entity:

  • New Entity Put (per entity, regardless of entity size): 2 writes + 2 writes per indexed property value + 1 write per composite index value
  • Existing Entity Put (per entity): 1 write + 4 writes per modified indexed property value + 2 writes per modified composite index value

Therefore, writing just one entity might cost 20 or more write operations.  Since the free limit for datastore writes is just 50,000 operations per day, the quota can be exhausted by writing only 2,500 entities that each cost 20 write operations.  If developers run mapreduce over any sizable data set, the daily quota can be exceeded in a single mapreduce step.  What used to be a completely free operation can now cost real money.
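
As a back-of-the-envelope sketch of that arithmetic (the formulas come straight from the billing table above, but the five-indexed-property entity is just a made-up example):

def existing_put_cost(modified_indexed_values, modified_composite_values=0):
    # Existing Entity Put: 1 write + 4 per modified indexed property value
    #                      + 2 per modified composite index value.
    return 1 + 4 * modified_indexed_values + 2 * modified_composite_values

FREE_WRITE_OPS_PER_DAY = 50000
per_entity = existing_put_cost(modified_indexed_values=5)  # 21 write ops
print(FREE_WRITE_OPS_PER_DAY // per_entity)                # ~2380 entities/day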



Ideally there would be a way to query for all entities that lack a given property or whose value is unindexed, but that cannot be done in App Engine: datastore queries can only match indexed property values, so such entities never show up in query results.

To work around this problem, I have added a property "meta_version" to all of my models.  Initially meta_version is set to '1' for all my entities.  Now if I need to update my model's schema, I can query for entities whose meta_version is '1', apply the change, and bump the version to '2', spreading the work over several days.  Since meta_version is indexed, I can fetch and update just 100 entities at a time and stay under the free quota.



For example, after adding the field to your model:

from google.appengine.ext import db

class MyModel(db.Model):
    # Add this field to all models with a large number of entities.
    meta_version = db.StringProperty(default='1')

Add a view that updates the entities a batch at a time:

def update_my_model(request, num_entities=100):
    # Grab a batch of entities that are still on the old schema version.
    entity_list = MyModel.all().filter('meta_version =', '1').fetch(num_entities)
    for entity in entity_list:
        # Apply any schema changes here, then bump the version marker.
        entity.meta_version = '2'
    db.put(entity_list)
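
Rather than hitting this view by hand each day, the batch could re-queue itself.  This is only a sketch, not part of the original method: it assumes the same MyModel as above and uses the deferred library, waiting a day between batches so each day's writes stay under quota.

from google.appengine.ext import db, deferred

def update_batch(num_entities=100):
    entity_list = MyModel.all().filter('meta_version =', '1').fetch(num_entities)
    if not entity_list:
        return  # every entity is on version 2; the migration is finished
    for entity in entity_list:
        entity.meta_version = '2'
    db.put(entity_list)
    # Re-queue tomorrow so each day's writes stay under the free quota.
    deferred.defer(update_batch, num_entities, _countdown=24 * 60 * 60)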

Now you can slowly evolve your model's schema or index previously unindexed properties while staying under your free quota!  Of course, the final irony is that in order to add the meta_version field in the first place, you have to update all of those entities without the benefit of this method.
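
One hedged way out of that chicken-and-egg problem (again just a sketch, not something from the post) is to walk every entity once with a datastore cursor, persisting the cursor between runs so the initial backfill can also be spread across days:

from google.appengine.ext import db

def backfill_meta_version(cursor=None, num_entities=100):
    query = MyModel.all()
    if cursor:
        query.with_cursor(cursor)
    entity_list = query.fetch(num_entities)
    for entity in entity_list:
        entity.meta_version = '1'  # write the default out explicitly
    db.put(entity_list)
    # Store this cursor (e.g. in memcache or a singleton entity) and
    # resume from it on the next run.
    return query.cursor()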

6 comments:

  1. Hi Dan,

    It seems to me that this is a good idea no matter how fast or slow your mapreduce jobs are. You always need a way to differentiate between pre- and post-mutation entities. I have another tip, if applicable (http://www.justdoesnotwork.com/2011/09/app-engine-austerity-composite-key.html), for reducing costs on App Engine.

    Jon

  2. You can also specify a processing rate (http://code.google.com/p/appengine-mapreduce/wiki/UserGuidePython) and go as slow as 1 entity/sec in mapreduce.

  3. Jon,

    Initially, I tried to avoid tracking the version since that seemed against the principles of NoSQL, but it does seem very useful.

    Also, I looked at your blog; I think that's a neat solution for some of the more heavily used entities.

    Dan

  4. Mike,

    Thanks for pointing that out. I think even at 1/sec, I could still blow through the quota if it equates to 20 writes/second due to the indexes.

    Dan

  5. Well, I think if you change the int cast at http://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/handlers.py#685 then you should be able to specify doubles.

  6. i think you're going to run into a lot of issues if the meta numbers don't all update together, which the current code doesn't quite cover (not to say it can't)

    what if during a job scores of entities don't get updated (e.g. by an exception), and you end up with 3,000 entities at entity_meta=1 and 30,000 at entity_meta=2, when you intended for all 33,000 entities to be on version 2? you'd have to search and determine which 3,000 entities got eff'ed up. you don't specify how you are uniquely identifying the entities, so i can't comment further on that scenario

    you'd have to have some search indexing in order to filter and sort on the entity_meta property..
