Friday, December 9, 2011

Updating schemas in App Engine [post price increase]

Since Google's App Engine got significantly more expensive, it has become more difficult to update all entities for a model while staying under the free quota.  

There are many reasons to update all the entities for a model, including:
  • Updating a model's schema.
    • Adding new fields to existing entities.
    • Removing obsolete non-indexed properties from existing entities.
  • Adding an index for an existing property that was previously not indexed.
Previously, a developer could use mapreduce operations to update a model's schema.  Mapreduce was great: it would iterate through all my entities of a given type (about 5,000 or so) in under a minute.  Since mapreduce ran in minutes and developers never paid for datastore reads or writes under the old pricing scheme, this update method cost almost nothing.

Under the new pricing scheme developers pay for datastore writes and reads.  To write an entity to the datastore the cost depends on the number of indexes on that entity:

New Entity Put (per entity, regardless of entity size): 2 Writes + 2 Writes per indexed property value + 1 Write per composite index value
Existing Entity Put (per entity): 1 Write + 4 Writes per modified indexed property value + 2 Writes per modified composite index value

Therefore, writing just one entity might result in 20 or more write operations.  Since the free limit for datastore writes is just 50,000 operations per day, we can reach the limit by writing just 2,500 entities that require 20 write operations each.  If developers run mapreduce over even a modest data set, the daily quota can be exceeded in a single mapreduce step.  What used to be a completely free operation could now cost some money.
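The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical; the formulas are just the two pricing rows quoted above, and the 50,000 figure is the free daily write quota mentioned in this post.

```python
FREE_DAILY_WRITES = 50000  # free datastore write operations per day (figure from this post)


def new_entity_put_cost(indexed_values, composite_values=0):
    """Write ops for a new entity: 2 + 2 per indexed value + 1 per composite value."""
    return 2 + 2 * indexed_values + composite_values


def existing_entity_put_cost(modified_indexed_values, modified_composite_values=0):
    """Write ops for an existing entity: 1 + 4 per modified indexed value
    + 2 per modified composite value."""
    return 1 + 4 * modified_indexed_values + 2 * modified_composite_values


def entities_per_day(writes_per_entity, quota=FREE_DAILY_WRITES):
    """How many entities can be written before the free daily quota runs out."""
    return quota // writes_per_entity


# A new entity with 9 indexed property values costs 2 + 2*9 = 20 writes,
# so the free quota covers 2,500 such puts per day.
print(new_entity_put_cost(9))       # 20
print(entities_per_day(20))         # 2500
# Re-putting an existing entity where only one indexed value changed:
print(existing_entity_put_cost(1))  # 5
```

By this formula, re-putting an existing entity is much cheaper when only a single indexed value changes, which is what makes small incremental updates attractive.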



Ideally it would be nice to have a method to search for all entities that do not have a given property or are not indexed, but unfortunately that cannot be done in App Engine.

To work around this problem, I have added a property "meta_version" to all of my models.  Initially meta_version is set to 1 for all my entities.  Now if I need to update my model's schema, over several days, I can run a search for entities with their meta_version set to 1 and bump the version to 2.  Since meta_version is indexed, I can run a search for those entities and just update 100 entities at a time.  In this way I can stay under the free quota.



For example, first add the field to your model:
class MyModel(db.Model):   
    # Add this field to all models with a large number of entities.
    meta_version = db.StringProperty(default='1')

Add a view which iterates through the entities a few at a time:

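from django.http import HttpResponse

def update_my_model(request, num_entities=100):
    # Fetch a small batch of entities still at the old schema version.
    entity_list = MyModel.all().filter('meta_version =', '1').fetch(num_entities)
    for entity in entity_list:
        entity.meta_version = '2'
    db.put(entity_list)  # one batch put for the whole list
    # A Django view must return a response; report how many were updated.
    return HttpResponse('Updated %d entities' % len(entity_list))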

Now you can slowly evolve your model's schema or index old unindexed properties while staying under your free quota!  Of course the final irony is that in order to add the meta_version field to all entities you have to update all of those entities without using this method.
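To make the batching idea concrete without a datastore, here is a minimal simulation in plain Python. The entities are ordinary dicts and migrate_batch is a hypothetical helper, not part of App Engine; it only mirrors the "fetch up to N entities at the old version, bump them, repeat later" pattern described above.

```python
def migrate_batch(entities, old_version, new_version, batch_size=100):
    """Bump up to batch_size entities from old_version to new_version.

    Returns the number of entities updated; 0 means the migration is done.
    """
    batch = [e for e in entities if e["meta_version"] == old_version][:batch_size]
    for entity in batch:
        entity["meta_version"] = new_version
    return len(batch)


# Hypothetical data set: 250 entities still at version '1'.
entities = [{"id": i, "meta_version": "1"} for i in range(250)]

runs = 0
while migrate_batch(entities, "1", "2") > 0:
    runs += 1
print(runs)  # 3 runs: 100 + 100 + 50 entities
```

In a real application each run would be a separate request, spread out so that the writes stay under the daily quota.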

Saturday, January 15, 2011

Speeding Up Pages in App Engine - Part 3 [django specific]

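In past posts, we've looked at two techniques for speeding up pages. These ideas boil down to:
  • Use AppStats and prefetch reference properties to avoid a staircase of gets (Part 1).
  • Don't store and fetch things from the datastore if you don't have to (Part 2).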

The next step is to look at the actual Python code serving the requests and speed it up.

Some background:
MindWell uses Django to serve the requests.  For those unfamiliar, Django is a great project that uses the MVC (Model View Controller) architecture.  In Django you create templates containing the HTML of the page.  Normally, these templates are great because they offer code reuse and performance is not a problem.  However, on one page I found a performance bottleneck, so I resolved the issue by skipping the template system altogether.  Sometimes it is good to be different:

In MindWell there is a calendar that shows appointment dates to the user, similar to Google Calendar.  MindWell uses FullCalendar, which can take JSON objects and render them on the calendar.  In MindWell there might be 100-300 appointments in one stream of JSON objects.

Here is one such JSON object:
[
{
"title":"ClientLastName, ClientFirstName",
"start":"2011-01-11T11:00:00",
"end":"2011-01-11T11:45:00",
"allDay":false,
"url":"/Mindwell/2011/01/11/11/00/436/calendar_dos/",
"note":"",
"className":"scheduleClient",
"description":"Session Type:None
Session Result:Scheduled
DSM Code:None
Type of Payment:None
Amount Due:None
Amount Paid:None
Note:None
DOS Date and Time:2011-01-11 11:00:00
DOS Repeat:No
Repeat End Date:None
"
},
...

When I profiled the performance of the template rendering the JSON data, I found it was too slow.  Instead of rendering the objects using a template, I rendered them directly in python.  

Below is the code that renders the feed.  Normally in Django, we return the rendering of a template. Here, however, we create an HttpResponse from the text directly and set the mimetype to application/json.
def calendar_feed(request):
    if CheckUserNotAllowed():
        return CheckUserNotAllowed()
    start = request.GET.get('start', None)
    end = request.GET.get('end', None)
    if not start:
        return HttpResponse('')
    if not end:
        return HttpResponse('')
    cal_feed = CalendarFeed(request, 
        datetime.datetime.fromtimestamp(int(start)),
        datetime.datetime.fromtimestamp(int(end)))
    return HttpResponse(
        cal_feed.GetFeed(), mimetype='application/json')


The code that turns each object into a JSON object is shown below.  Note that I still call escape to prevent various Javascript and other attacks.  I want to use the <br/> tag in my JSON object so I put the <br/> tag back in.


def escape_json(text):
    """
    Escape HTML using Django, replace each single backslash with a double
    one for JavaScript, and finally put back our <br/> tags.
    """
    return escape(text).replace('\\',
        '\\\\').replace('&lt;br/&gt;', '<br/>')
        
def Javascript_Object_DOS(dos):
    """
    Converts a DOS to JSON.
    """
    return """
{
"title":"%s",
"start":"%s",
"end":"%s",
"allDay":false,
"url":"%s",
"note":"%s",
"className":"%s",
"description":"%s"
},
            """ %(
                escape_json(unicode(dos)), 
                dos.get_starttime().isoformat('T'),
                dos.get_endtime().isoformat('T'), 
                dos.get_calendar_absolute_url(),
                escape_json(dos.get_note()), 
                dos.get_class_name(), 
                escape_json(dos.get_hover_tip())
            )


By replacing the Django template system, I was able to reduce the rendering time of the JSON data by ~30%.  The rendering times in some of the bigger streams were reduced from ~650ms to ~400ms.

The bottom line is this.  While Django's template system is good, there is some overhead associated with using Django's templates. In some cases, it may be worth skipping the templates for your slower or high traffic pages.
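As an aside, hand-built JSON strings are easy to get subtly wrong (quoting, stray trailing commas). A possible alternative, not what MindWell does and not benchmarked here, is to build plain dicts and let the standard json module serialize them. The sketch below is written for Python 3 and uses html.escape in place of Django's escape; appointment_to_dict and escape_text are hypothetical names, and the field names are taken from the JSON sample above.

```python
import html
import json


def escape_text(text):
    """HTML-escape text, then restore the <br/> tags we deliberately allow."""
    return html.escape(text).replace("&lt;br/&gt;", "<br/>")


def appointment_to_dict(title, start, end, url, note, class_name, description):
    """Build one FullCalendar-style event as a plain dict."""
    return {
        "title": escape_text(title),
        "start": start,
        "end": end,
        "allDay": False,
        "url": url,
        "note": escape_text(note),
        "className": class_name,
        "description": escape_text(description),
    }


events = [
    appointment_to_dict(
        "ClientLastName, ClientFirstName",
        "2011-01-11T11:00:00",
        "2011-01-11T11:45:00",
        "/Mindwell/2011/01/11/11/00/436/calendar_dos/",
        "",
        "scheduleClient",
        "Session Result:Scheduled<br/>DOS Repeat:No <script>",
    )
]

# json.dumps takes care of quoting, backslashes and the commas between objects.
feed = json.dumps(events)
print(feed)
```

The resulting string can be passed to HttpResponse in the same way as the hand-rendered feed.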

Friday, December 31, 2010

Speeding up pages in App Engine - Part 2

So you've got AppStats configured and you've sped up some page loading times by removing obviously unnecessary queries and gets.  Also by using prefetch_refprops from Speeding up pages in App Engine - Part 1 you've removed the staircase of gets.  Now what?

In my case I realized that one of my models no longer needed to be stored.  I was storing DOSRecurr objects (which contain a reference to a DOS object, which in turn contains a reference to a ClientInfo object) in the datastore.  Fetching objects from the datastore is slow, so if you can avoid hitting the datastore, your pages will load faster.

In my case, I realized I was having to calculate all of the DOSRecurr objects every time anyway, since if a DOS changed then my DOSRecurr might have changed as well.  So I kept the same DOSRecurr model, but instead of storing it in the database, I just created them every time and then threw them away afterwards.

By keeping the DOSRecurr model I was able to keep all the logic and implementation that used the DOSRecurr model but by no longer storing it in the datastore I sped up my application significantly.

The main takeaway from this lesson is:
Don't store and fetch things from the App Engine Datastore if you don't have to!
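As a rough illustration of "compute instead of store", here is a sketch that generates the occurrences of a weekly appointment on demand. The post does not show how DOSRecurr objects are actually calculated, so weekly_occurrences, its parameters and the 45-minute default are all assumptions made for the example.

```python
import datetime


def weekly_occurrences(first_start, repeat_until, duration_minutes=45):
    """Yield (start, end) pairs for a weekly appointment, computed on demand."""
    start = first_start
    length = datetime.timedelta(minutes=duration_minutes)
    while start.date() <= repeat_until:
        yield (start, start + length)
        start += datetime.timedelta(weeks=1)


first = datetime.datetime(2011, 1, 11, 11, 0)
occurrences = list(weekly_occurrences(first, datetime.date(2011, 2, 1)))
print(len(occurrences))           # 4
print(occurrences[-1][0].date())  # 2011-02-01
```

Nothing is written to or read from storage here: the occurrences exist only for the duration of the request and are thrown away afterwards.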

Saturday, November 6, 2010

Speeding up pages in App Engine - Part 1

After adding numerous features to MindWell things had started to slow down a bit.  In this article I'll discuss various ways I sped up Mindwell and improved page loading times.  After going through these steps perhaps your users will thank you and start to feel like they're flying...



First a little information about Mindwell:
In MindWell there are three main models:
ClientInfo - which contains information about clients (encrypted of course).
DOS - date of service, essentially this is information about an individual session with a client.
DOSRecurr - this is a model for recurring appointments.  So if a client comes every week this is used to model those appointments.

DOS use a reference property to ClientInfo and DOSRecurr use a reference property to DOS.  So the hierarchy looks like:
ClientInfo <- DOS <- DOSRecurr

Techniques
First of all, use AppStats.  This will tell you exactly how much time is spent running various queries and datastore calls.  In my case I noticed the deadly staircase of gets, which can be resolved by prefetching reference properties.  Essentially, rather than issuing a separate get for each item while iterating through a list, you group the referenced keys into one batch get.  This alone gave a remarkable speedup in my application.
Below is a modified version of prefetch_refprops that ignores empty references.  Some of my objects do not set a reference to another model, so I first filter out those entities.

def prefetch_refprops(entities, *props):
    # Pair each entity only with the reference properties it actually sets.
    fields = [(entity, prop) for entity in entities for prop in props
              if prop.get_value_for_datastore(entity)]
    ref_keys = [prop.get_value_for_datastore(x) for x, prop in fields]

    # One batch get instead of a staircase of single gets; skip missing entities.
    ref_entities = dict((x.key(), x) for x in db.get(set(ref_keys)) if x)
    for (entity, prop), ref_key in zip(fields, ref_keys):
        if ref_key in ref_entities:
            prop.__set__(entity, ref_entities[ref_key])
    return entities
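To see why the batch get helps, here is a small self-contained simulation. FakeDatastore is a hypothetical in-memory stand-in that only counts round trips; it is not the App Engine API, and the dict-based appointments are made up for the example.

```python
class FakeDatastore(object):
    """In-memory stand-in for the datastore that counts round trips."""

    def __init__(self, rows):
        self.rows = rows
        self.round_trips = 0

    def get(self, keys):
        """Fetch one key or a list of keys in a single round trip."""
        self.round_trips += 1
        if isinstance(keys, list):
            return [self.rows.get(k) for k in keys]
        return self.rows.get(keys)


clients = dict(("client%d" % i, {"name": "Client %d" % i}) for i in range(3))
# Each appointment references a client by key; one has no reference set.
appointments = [{"client_key": "client0"}, {"client_key": "client1"},
                {"client_key": None}, {"client_key": "client0"}]

# Staircase: one get per appointment that has a reference.
stair = FakeDatastore(clients)
for appt in appointments:
    if appt["client_key"]:
        appt["client"] = stair.get(appt["client_key"])

# Prefetch: collect the distinct non-empty keys and fetch them in one batch.
batch = FakeDatastore(clients)
keys = sorted(set(a["client_key"] for a in appointments if a["client_key"]))
fetched = dict(zip(keys, batch.get(keys)))
for appt in appointments:
    if appt["client_key"]:
        appt["client"] = fetched[appt["client_key"]]

print(stair.round_trips, batch.round_trips)  # 3 1
```

With a real datastore each round trip costs network latency, so collapsing many gets into one is where the time is saved.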

Monday, June 14, 2010

Encrypting Fields in Google App Engine

Recently, I've been coding an application (MindWell) in Google App Engine.  One of the more difficult features to implement was storing data in an encrypted format.  This is needed because MindWell is intended for therapists, counselors and other mental health professionals, so the security of clients' data is very important.  Below is some code taken from MindWell.  For the complete code listing see: models.py

Code to create an Encrypted Field:


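import codecs
import os

from Crypto.Cipher import AES
from Crypto.Hash import SHA256
from google.appengine.ext import db
from google.appengine.ext.db import BadValueError

from secret_info import secret_passphrase


class EncryptedField(db.StringProperty):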
    data_type = str
  
    def __GetSHADigest(self, random_number = None):
        """ This function returns a sha hash of a
            random number  and the secret
            password."""
        sha = SHA256.new()
        if not random_number:
            random_number = os.urandom(16)
        # mix in a random number      
        sha.update(random_number)
        # mix in our secret password
        sha.update(secret_passphrase)
        return (sha.digest(), random_number)
      

    def encrypt(self, data):
        """Encrypts the data to be stored in the
           datastore"""
        if data is None:
            return None
        if data == 'None':
            return None
        # need to pad the data so it is
        # 16 bytes long for encryption
        mod_res = len(data) % 16
        if mod_res != 0:
            for i in range(0, 16 - mod_res):
                # pad the data with ^
                # (hopefully no one uses that as
                # the last character; if so, it
                # will be deleted)
                data += '^'  
        (sha_digest, random_number) = self.__GetSHADigest()
        alg = AES.new(sha_digest, AES.MODE_ECB)
        result = random_number + alg.encrypt(data)
        # encode the data as hex to store it in a string;
        # the result would otherwise contain characters that cannot be displayed
        ascii_text = str(result).encode('hex')
        return unicode(ascii_text)
      
    def decrypt(self, data):
        """ Decrypts the data from the
            datastore.  Basically the inverse of
            encrypt."""
        # check that either the string is None
        # or the data itself is none
        if data is None:
            return None
        if data == 'None':
            return None
        hex_decoder = codecs.getdecoder('hex')
        hex_decoded_res = hex_decoder(data)[0]
        random_number = hex_decoded_res[0:16]
        (sha_digest, random_number) = self.__GetSHADigest(random_number)
        alg = AES.new(sha_digest, AES.MODE_ECB)
        dec_res = alg.decrypt(hex_decoded_res[16:])
        #remove the ^ from the strings in case of padding
        return unicode(dec_res.rstrip('^'))
      
    def get_value_for_datastore(self, model_instance):
        """ For writing to datastore """
        data = super(EncryptedField,
                     self).get_value_for_datastore(model_instance)
        enc_res = self.encrypt(data)
        if enc_res is None:
            return None
        return str(enc_res)

    def make_value_from_datastore(self, value):
        """ For reading from datastore. """
        if value is not None:
            return str(self.decrypt(value))
        return ''

    def validate(self, value):
        if value is not None and not isinstance(value, str):
            raise BadValueError('Property %s must be convertible '
                                'to a str instance (%s)' %
                                (self.name, value))
        return super(EncryptedField, self).validate(value)

    def empty(self, value):
        return not value
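The '^' padding above silently strips any legitimate trailing '^' characters on decryption. One common way around that, shown here only as a sketch and not as part of the MindWell code, is PKCS#7-style padding, where every pad byte stores the pad length. The helper names are hypothetical, and the example is written for Python 3 bytes rather than the Python 2 strings used in the class above.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pkcs7_pad(data):
    """Pad bytes to a multiple of BLOCK_SIZE; each pad byte stores the pad length."""
    pad_len = BLOCK_SIZE - (len(data) % BLOCK_SIZE)
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(padded):
    """Remove the padding by reading the pad length from the last byte."""
    if not padded:
        raise ValueError("empty input")
    pad_len = padded[-1]
    if pad_len < 1 or pad_len > BLOCK_SIZE:
        raise ValueError("invalid padding")
    if padded[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid padding")
    return padded[:-pad_len]


original = b"ends with caret^"  # exactly 16 bytes, ending in '^'
padded = pkcs7_pad(original)
print(len(padded))                      # 32
print(pkcs7_unpad(padded) == original)  # True
```

Because input that is already block-aligned still gets a full block of padding, unpadding is unambiguous and the trailing '^' survives the round trip.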

Code to use Encrypted Field:

class ClientInfo(db.Model): 
    lastname = EncryptedField(verbose_name='Last Name')

In a separate file, secret_info.py:
secret_passphrase = '0123456789abcdef'  # placeholder; any length works, since it is hashed with SHA-256 to derive the AES key

Explanation:
Why mix in a random number? To encrypt the data more securely, a random number (sometimes called a salt) is mixed in.  This ensures that even when the same algorithm and passphrase are used, encrypting the same data twice produces different results.  So if the same piece of information is entered twice, the two copies will have different values stored in the database.  See Wikipedia for some more information about the problems of not mixing in some randomness.
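The effect of the salt on the key can be seen with the standard library alone. This sketch mirrors the logic of __GetSHADigest but uses hashlib instead of PyCrypto's SHA256 and is written for Python 3; derive_key is a hypothetical name.

```python
import hashlib
import os


def derive_key(passphrase, salt=None):
    """Return (key, salt): SHA-256 over a 16-byte salt followed by the passphrase."""
    if salt is None:
        salt = os.urandom(16)
    key = hashlib.sha256(salt + passphrase).digest()
    return key, salt


passphrase = b"0123456789abcdef"  # placeholder value from this post
key_a, salt_a = derive_key(passphrase)
key_b, salt_b = derive_key(passphrase)
key_a_again, _ = derive_key(passphrase, salt_a)

print(len(key_a))            # 32
print(key_a == key_a_again)  # True
```

Two calls with fresh salts give two different 32-byte keys for the same passphrase, while reusing the stored salt reproduces the key, which is why the salt is kept alongside the ciphertext.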

There is also a secret_passphrase used in the encryption, which is not stored in the database.  So in order to decrypt the fields, an attacker needs to guess this secret passphrase or acquire it some other way.