Measuring Django serialization memory usage
In what follows:
- ~/django/upstream represents the checkout of the Django development version we are working with. It is assumed to be already installed and available on the PYTHONPATH.
- /path/to/serializers_memory_usage represents the directory of this project.
Preparation step: Given a model with four CharFields of length 255, create 50,000 instances of it in the database with random values for these fields.
Then serialize that data using Django's serialization infrastructure while measuring the aggregate memory usage of all the model instances.
For this, the Pympler package is used. Its tracker module is directed to take periodic snapshots (every few seconds) of the memory used collectively by the instances of the model class. This measurement runs in the background while Django reads data from the DB and outputs it in the selected serialization format. Afterwards, graphs and reports are generated from the collected data.
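The background sampling idea can be sketched with the standard library alone. This is not Pympler and not the code in work.py; it is a minimal stand-in, assuming a hypothetical InstanceSampler helper that counts live instances of one class and sums their shallow sizes on a timer thread.

```python
import gc
import sys
import threading
import time


class InstanceSampler:
    """Periodically record how many instances of a class are alive.

    Hypothetical helper for illustration; sizes are shallow
    (sys.getsizeof), so referenced strings are not included.
    """

    def __init__(self, cls, period):
        self.cls = cls
        self.period = period
        self.samples = []  # (elapsed seconds, instance count, shallow bytes)
        self._t0 = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def snapshot(self):
        # gc.get_objects() lists objects tracked by the garbage collector,
        # which includes ordinary class instances.
        live = [o for o in gc.get_objects() if type(o) is self.cls]
        size = sum(sys.getsizeof(o) for o in live)
        self.samples.append((time.monotonic() - self._t0, len(live), size))

    def _run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.period):
            self.snapshot()

    def start(self):
        self._t0 = time.monotonic()
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        self.snapshot()  # one final sample after the workload ends
```

A caller would create the sampler for the model class, call start() before serializing, and call stop() afterwards. The samples list then holds the time series from which a graph could be drawn.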
This is implemented as a standalone Django project with its apps plus ad-hoc tools, rather than as a unit test in the Django test suite, because:
- The additional Pympler dependency.
- Running under the Django test machinery means the data to be serialized would have to be loaded into the DB from a fixture at the beginning of the test case. We don't want that because a) loading 50,000 records at the beginning of each run would take too much time, and b) loaddata (the Django deserializers) may have memory usage problems of its own (see ticket 12007), which would introduce noise into what we are trying to measure.
1. Install Pympler and its dependencies
We need the development version of Pympler (which will at some point become version 0.2), so get a checkout of the SVN trunk tip. The only dependency we need installed is python-matplotlib, so that Pympler can create our memory usage charts. That dependency isn't declared in Pympler's setup.py, so we install it manually (either with Python tools or our OS package manager).

    $ pip install matplotlib
    $ cd <dir for Pympler installation>
    $ pip install -e svn+http://pympler.googlecode.com/svn/trunk#egg=Pympler
2. Patch Pympler with our tweaks
    $ cd <Pympler installation dir>/pympler
    $ patch -p1 < pympler-no-instances-html-report-section.diff
    $ patch -p1 < pympler-more-memory-plots.diff
3. Create test data
First, make sure you customize the settings_postgresql_psycopg2.py file with your environment information (database name, username, password, etc.).
We use the create_data.py Python script. Open it, review it and hack it at will.

    $ cd /path/to/serializers_memory_usage
    $ export PYTHONPATH=..:.
    $ export DJANGO_SETTINGS_MODULE=settings_postgresql_psycopg2
    $ django-admin syncdb
    $ python create_data.py 50000
    Ok. 50000 instances inserted. Total now is 50000
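The contents of create_data.py are not reproduced here. As a rough sketch of the kind of values it needs, the following generates random 255-character strings for four fields. The field names and helper functions are hypothetical, and the database insert itself is left out.

```python
import random
import string

# Hypothetical field names; the real model's names may differ.
FIELD_NAMES = ("field1", "field2", "field3", "field4")
FIELD_LENGTH = 255  # matches the CharField length described above


def random_value(length=FIELD_LENGTH, rng=random):
    """Return a random alphanumeric string of the given length."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def random_rows(count, rng=random):
    """Yield `count` dicts, one per model instance to create."""
    for _ in range(count):
        yield {name: random_value(rng=rng) for name in FIELD_NAMES}
```

Yielding rows one at a time keeps the generator itself from holding all 50,000 records in memory at once.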
Save the data to a fixture for future use:
    $ django-admin dumpdata --format json --indent 2 ticket5423 > ticket5423/fixtures/data.json
4. Run the data collection with unpatched Django
We use the work.py Python script, which contains the real implementation of the tests. Open it, review it and hack it at will.

    $ cd /path/to/serializers_memory_usage
    $ export PYTHONPATH=..:.
    $ export DJANGO_SETTINGS_MODULE=settings_postgresql_psycopg2
    $ python work.py postgresql_psycopg2 json "Unpatched Django trunk" 7 > out1.json
    Sampling period: 7.0seg.
    Serializing data and tracking memory usage. Please wait
    Finished memory data collection. Saving it.
    Starting report generation (reports/json-postgres-Trunk/index.html).

7 is the number of seconds between memory usage samples taken by Pympler. An appropriately named report in HTML format will be generated under the reports/ subdir.
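The four positional arguments in the command above are, in order, the DB backend, the serialization format, a free-form label for the run, and the sampling period in seconds. The sketch below shows one way such a command line could be parsed. It is an assumption for illustration, not the parsing code in work.py, and the argument names are made up.

```python
import argparse


def parse_args(argv):
    """Parse a work.py-style command line (hypothetical argument names)."""
    parser = argparse.ArgumentParser(prog="work.py")
    parser.add_argument("backend", help="DB backend, e.g. postgresql_psycopg2")
    parser.add_argument("format", choices=("json", "xml", "yaml"),
                        help="serialization format")
    parser.add_argument("label", help="free-form description of the run")
    parser.add_argument("period", type=float,
                        help="seconds between memory usage samples")
    return parser.parse_args(argv)
```

Parsing the period as a float is a guess based on the "7.0" echoed in the sample output.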
5. Run the data collection with patched Django
For example, we can test the patch attached to ticket #5423, updated to the current trunk:

    $ cd ~/django/upstream
    $ patch -p1 < .../5423.2.diff
    $ cd /path/to/serializers_memory_usage
    $ export PYTHONPATH=..:.
    $ export DJANGO_SETTINGS_MODULE=settings_postgresql_psycopg2
    $ python work.py postgresql_psycopg2 json "#5423 patch" 7 > out2.json

Another HTML report will be generated under the reports/ subdir. Compare the memory usage graphs.
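Besides comparing the graphs by eye, the two runs can be summarized numerically. The sketch below assumes each run is available as a list of (elapsed seconds, bytes) samples. That format and both function names are hypothetical, and the reports generated by Pympler are not read by this code.

```python
def peak(samples):
    """Return the largest memory reading in a list of (seconds, bytes) samples."""
    return max(size for _, size in samples)


def peak_reduction(baseline, candidate):
    """Return the fractional drop in peak memory from baseline to candidate.

    0.25 means the candidate's peak is 25% lower; a negative value
    means the candidate peaked higher than the baseline.
    """
    base = peak(baseline)
    if base == 0:
        raise ValueError("baseline peak is zero; reduction is undefined")
    return (base - peak(candidate)) / base
```

Peak usage is only one possible summary. The shape of the curve over time, which the graphs show, says more about whether instances accumulate during serialization.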
Other serialization formats
The other two serialization formats can be tested in a similar way:
    $ python work.py postgresql_psycopg2 xml "#5423 patch" 7 > out.xml
    $ python work.py postgresql_psycopg2 yaml "#5423 patch" 7 > out.yaml
Other DB backends
You can also try against MySQL and SQLite. For that, make sure you customize the settings_*.py file first. Then, e.g. for SQLite, starting from the data.json fixture we saved in step 3:
    $ cd /path/to/serializers_memory_usage
    $ export PYTHONPATH=..:.
    $ export DJANGO_SETTINGS_MODULE=settings_sqlite3
    $ django-admin syncdb
    $ django-admin loaddata data.json
    Installed 50000 object(s) from 1 fixture(s)
Then follow with steps 4 and 5 taking care to replace 'postgresql_psycopg2' with 'sqlite'.
The same process can be used with MySQL (use 'mysql').
Alternatively, the run.sh script can be used to run the data collection:
- against the three DB backends,
- with the three serialization formats,
- and with Django both unmodified and patched,
for a total of eighteen automated runs.
Edit the run.sh script to customize things like the Django tree location, the data sampling period and the location of the Django patches, and then run it. It takes care of invoking the work.py Python script to generate the 18 reports automatically.
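The eighteen runs are the product of three backends, three formats and two Django variants. The sketch below enumerates them as work.py command lines. It is an illustration in Python rather than a copy of the shell script, the planned_runs name is hypothetical, and the labels are simply the ones used in steps 4 and 5.

```python
import itertools

BACKENDS = ("postgresql_psycopg2", "sqlite", "mysql")
FORMATS = ("json", "xml", "yaml")
# Labels taken from steps 4 and 5; the actual script may use different ones.
VARIANTS = ("Unpatched Django trunk", "#5423 patch")


def planned_runs(period=7):
    """Yield one work.py argument list per (variant, backend, format)."""
    # The variant is the outermost loop so the Django tree only needs
    # to be patched once, between the two halves of the run.
    for variant, backend, fmt in itertools.product(VARIANTS, BACKENDS, FORMATS):
        yield ["python", "work.py", backend, fmt, variant, str(period)]
```

Each yielded list could be passed to subprocess.run() after setting DJANGO_SETTINGS_MODULE for the matching backend.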
run.sh assumes Mercurial is used in the Django source tree and invokes it to restore the working copy and undo the modifications made by the patches. Tweak it if you use Git or SVN.