DSE - Simplified "bulk" insert/update/delete for Django
=======================================================

Version : 3.1.0
Author  : Thomas Weholt <thomas@weholt.org>
License : Modified BSD.
Status  : Beta
Url     : https://bitbucket.org/weholt/dse2

Background
----------

* DSE exists for one reason: to insert/update/delete lots of data -- as fast as possible.
* DSE vs. the Django ORM: the typical speed gain is around 5x-10x for updates and 3x+ for inserts.
* DSE is aware of default values specified in your Django models and will use those if no value is given for a field in an insert statement.
* DSE caches SQL statements -- inserts, updates and deletes -- and executes them when a specified number of statements has been prepared, or when it's told manually to flush the cached statements to the database. The actual SQL execution is done using the DB API cursor.executemany, which is much faster than executing SQL statements in sequence and far faster than using the Django ORM.
* DSE uses a dictionary to specify what fields to work on.
* My hope is to see something like this in the Django core.

News in 3.x
-----------

* Changes in syntax which are NOT backwards compatible, hence the version bump. That, and the nice bulk_update method.
* The add and execute methods have been removed.
* Patched models now have a property called delayed instead of dse. You can also patch specific models (new in 2.1.0).
* To insert an item, call model.delayed.insert(values).
* To update an item, call model.delayed.update(values).
* To delete an item, call model.delayed.delete(id).
* If you have to update a huge data set where the values for the fields are limited -- for instance metadata from photos or music files -- you can use the new model.delayed.bulk_update(values). Thanks to Cal Leeming [Simplicity Media Ltd] for the inspiration on this one :-). See further down for a more complete walkthrough of what happens behind the scenes.
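The cache-then-flush approach described under Background can be sketched with plain DB API calls. This is an illustrative sqlite3 example of the technique, not DSE's actual internals; the table, names and ITEM_LIMIT value are assumptions for the sketch:

```python
import sqlite3

# Illustrative only: cache rows and flush them in one
# cursor.executemany call instead of one execute per row.
conn = sqlite3.connect(":memory:")
conn.execute("create table person (name text, age integer)")

ITEM_LIMIT = 1000  # flush threshold, mirroring DSE's default of 1000 items
cache = []

def insert(values):
    cache.append(values)
    if len(cache) >= ITEM_LIMIT:
        flush()

def flush():
    if cache:
        # One executemany call for the whole batch.
        conn.executemany("insert into person (name, age) values (?, ?)", cache)
        conn.commit()
        del cache[:]

for i in range(10):
    insert(("person-%d" % i, 30 + i))
flush()

print(conn.execute("select count(*) from person").fetchone()[0])  # 10
```

Nothing reaches the database until flush() runs (or the cache hits the limit), which is exactly the behaviour the examples below rely on.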
* You can now decorate methods in your model with the dse_value_parser decorator to make all values passed to DSE be parsed by that method. If the method returns None, the values will not be added. This way you can define clean-up or validation methods for your data. NB! These methods must be classmethods! Thanks to Andre Terra for this one.
* Cursor caching is now optional. It crashed using MySQL in one scenario where several cursors were in play at the same time. Again, thanks to Andre Terra.

Installation
------------

::

    pip install dse

or::

    hg clone https://bitbucket.org/weholt/dse2

Example usage
-------------

You got a model like::

    gender = (('M', 'Male'), ('F', 'Female'))

    class Person(models.Model):
        name = models.CharField(max_length = 30)
        age = models.IntegerField(default = 30)
        sex = models.CharField(max_length = 1, choices = gender, default = 'M')

Using dse::

    import dse

    dse.patch_models() # Monkey patch all your models and expose dse on all of them.
    # To monkey patch only the Person model, do dse.patch_models(specific_models=[Person])

    with Person.delayed as d:
        for name, age, sex in (('Thomas', 36, 'M'), ('Joe', 40, 'M'), ('Jane', 28, 'F')):
            d.insert(dict(name = name, age = age, sex = sex))

Nothing will be inserted into the database before the loop is done (or you insert 1000 items). Then the items will be inserted using cursor.executemany, using plain SQL - no ORM in sight.

DSE uses the default values defined in your model::

    with Person.delayed as d:
        # Adding an item, just defining a name and using the default values from the model:
        d.insert({'name': 'John'})
        # Overriding the default values? Just specify a valid value:
        d.insert({'name': 'Thomas', 'age': 36, 'sex': 'M'})
        # Update the record with id = 1 and set its name to John.
        # This will trigger a SQL statement for this update alone,
        # since not all columns are specified:
        d.update({'id': 1, 'name': 'John'})

To use delayed execution of SQL statements::

    Person.delayed.update({'id': 2, 'name': 'Al Capone'})      # will NOT trigger anything
    Person.delayed.update({'id': 3, 'name': 'John Dillinger'}) # will NOT trigger anything
    Person.delayed.insert({'name': 'Scarface'})                # will NOT trigger anything
    Person.delayed.flush() # will execute both update statements and insert a record for "Scarface"

Say you want to update all records with some calculated value, something you couldn't find a way to do in SQL. Using DSE this is easy and fast::

    with Person.delayed as d:
        # Use Django's ORM to generate dictionaries to use in DSE: objects.all().values().
        for item in Person.objects.all().values():
            d.update(dict(id=item.get('id'), somevar=calculated_value))

I've received some questions about transaction handling. Below is a simple example, but I'm looking into other ways of handling transactions as well::

    from django.db import transaction
    import dse

    def some_method():
        with transaction.commit_on_success():
            with SomeModel.delayed as d:
                for item in somelist:
                    d.insert({'some_column': item.some_value, 'another_column': item.another_value})

You can also cache items to delete::

    with Foo.delayed as d:
        for person in Person.objects.all():
            if person.likes_perl_more_than_python:
                d.delete(person.id) # won't trigger anything
    # Here all cached deletions are executed using plain SQL, no ORM.

DSE caches ids and deletes them when 1000 items are cached or flush/close is called. It uses SQL similar to "delete from tablename where id in (<list of ids>)".

DSE features singletons (NB!
very experimental, no locking/thread support yet!)::

    import dse.singleton

    p1 = dse.singleton.Models.Person()
    p2 = dse.singleton.Models.Person()
    print p1 is p2 # should print True

    p1.insert(dict(name = 'Joe'))
    p2.flush()
    print Person.objects.all().count() # should print 1

Singletons make it possible to cache entries across sections of code and cache even more data, hitting the db less.

New in the 3.x version of DSE is the bulk_update method. It takes a dictionary of values to update and requires a value for the primary key/id of the record, but it uses the Django ORM's own update method instead of plain SQL to reduce the number of statements to execute. This is helpful when your fields can have a limited set of values, like EXIF data from photos. An example::

    with Photo.delayed as d:
        d.update({'id': 1, 'camera_model': 'Nikon', 'fnumber': 2.8, 'iso_speed': 200})
        d.update({'id': 2, 'camera_model': 'Nikon', 'fnumber': 11, 'iso_speed': 400})
        d.update({'id': 3, 'camera_model': 'Nikon', 'fnumber': 2.8, 'iso_speed': 400})
        d.update({'id': 4, 'camera_model': 'Canon', 'fnumber': 3.5, 'iso_speed': 200})
        d.update({'id': 5, 'camera_model': 'Canon', 'fnumber': 11, 'iso_speed': 800})
        d.update({'id': 6, 'camera_model': 'Pentax', 'fnumber': 11, 'iso_speed': 800})
        d.update({'id': 7, 'camera_model': 'Sony', 'fnumber': 3.5, 'iso_speed': 1600})
        # and then some thousand more lines like that

Internally DSE will construct a structure like this::

    bulk_updates = {
        'camera_model': {
            'Nikon': [1, 2, 3],
            'Canon': [4, 5],
            'Pentax': [6],
            'Sony': [7],
        },
        'fnumber': {
            2.8: [1, 3],
            11: [2, 5, 6],
            3.5: [4, 7],
        },
        'iso_speed': {
            200: [1, 4],
            400: [2, 3],
            800: [5, 6],
            1600: [7],
        }
    }

And then execute those statements using::

    # pk = the primary key field for the model, in most cases id
    for field, values in bulk_updates.iteritems():
        for value, ids in values.iteritems():
            model.objects.filter(**{"%s__in" % pk: ids}).update(**{field: value})

For huge datasets where the fields can have limited values this has a big impact on
performance. So whether to use update or bulk_update depends on the data you want to process. For instance, importing a contact list where most of the fields have almost unique values would benefit from the update method, but importing data from photos, id3 tags from your music collection etc. would process much faster using bulk_update.

Using the dse_value_parser decorator::

    from dse import *

    class YourModel(models.Model):
        # your fields

        @dse_value_parser
        def data_validation(cls, values):
            # validate your data, return None if validation fails
            return values

And that's all you have to do. Your method data_validation will be called each time you add a set of values.

Performance using DSE 2.x compared to the django orm using the django test framework
------------------------------------------------------------------------------------

Lots of updates::

    DSE : 0.116728067398 seconds.
    ORM : 15.0528171062 seconds.

Lots of inserts::

    DSE : 0.0825960636139 seconds.
    ORM : 0.830269098282 seconds.

Iterating over objects and updating them::

    DSE : 0.817410230637 seconds.
    ORM : 60.4543881416 seconds.

Deleting lots of objects::

    DSE : 0.495353937149 seconds.
    ORM : 13.3724360466 seconds.

You can see the code for the test producing these numbers here:
https://bitbucket.org/weholt/dse2/src/dbd87b0e2cdb/tests/performance_tests.py

NB! These numbers do NOT represent real-world performance, but merely show the overhead of using the Django ORM compared to DSE. My latest tests show DSE to be about 3 times faster on inserts, it uses about 14% of the time compared to the ORM when doing updates, and in some special cases DSE can speed up delete operations as well.

License
-------

Copyright (c) 2011, Thomas Augestad Weholt
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of Thomas Augestad Weholt nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Release notes
-------------

3.1.0 : Patch from rassminus; changed SQL creation to quote all references to the table name and column labels.

3.0.0 : Clean up and release.

3.0.0-BETA#3 : Clean-up/validation decorator and optional cursor caching.

3.0.0-BETA#2 : Fixed a few things reported by Fido Garcia.

3.0.0-BETA#1 : Refactoring, removal of code, new methods for insert and update, removal of the add, execute and several other methods. UPDATE code optimized.

2.1.0 : Small change; dse.patch_models can now take an optional list of models to patch, like so: dse.patch_models(specific_models=[User, Comment]).

2.0.0 : Labeled as stable. Updated docs and examples.

2.0.0-RC1 : No change in code; now released using the modified BSD license to be more compatible with Django's license use.
2.0.0-BETA#9 : Added FileExport class to ease debugging of what is processed during testing. Writes SQL data to file. See source/testsuite for usage.

2.0.0-BETA#4 : Started refactoring DSE to remove all non-Django-specific code, mostly to gain speed and simplify code.

1.0.2 : Reconnect if cursor is dead.

1.0.1 : Fixed issue #9 "Factory can eat up memory", reported by vangheem. When finding fields related to a table, only the top row is fetched.

1.0.0 : Version bump. Added unittest for issue #8.

1.0.0-RC1 : Updated README.txt.

0.9.4 : PEP8 and pyflakes.

0.9.3 : Fixed issue #7: DSE causes django-debug-toolbar to crash. Thanks to ringemup for pointing that out. Added some docstrings.

0.9.2 : Corrected typo in usage.rst and README.txt.

0.9.1 : Refactored code even more, added usage.rst, singleton support in the singleton package and some performance tests. Models are not monkey patched by default anymore; you must call dse.patch_models().

0.9.0 : Refactored code and cleaned up the tests folder. Focus on getting singleton support in before 1.0.0. And more tests.

0.8.2 : Added 'pysqlite2' to _DBMAP. Thanks to David Marble for 0.8.1 and 0.8.2.

0.8.1 : Attempt to fix quoting problems with fields on postgresql.

0.8.0 : Fixed crash when more than one database connection has been configured. No ModelFactory will be triggered.

0.7.0 : Don't remember.

0.6.0 :
  - Added support for the with-statement.
  - Added a ModelDelayedExecutor instance to each model, so you can do Model.dse.add_item instead of dse.ModelFactory.Model.add_item.
  - Renamed dse.modelfactory to dse.ModelFactory to be more style-compliant.

0.5.1 : Just some notes on transaction handling.

0.5.0 :
  - Added modelfactory. Upon first import a modelfactory will be created in the DSE module. It's basically just a helper class containing ModelDelayedExecutor instances for all models in all apps found in INSTALLED_APPS in settings.py.
  - To change the default item limit before automatic execution of cached SQL statements to 10000 instead of the default 1000: import dse; dse.ITEM_LIMIT = 10000.

0.4.0 :
  - Fixed a serious bug in mass updates. Using cursor.executemany is only possible when values for all columns are specified. If only values for a subset of the columns are specified, that statement will be executed as a separate SQL call. NOTE! Using dse.get_items() or Django's Model.objects.values() will give you all the fields.
  - Code clean-up.
  - Added custom exceptions: UpdateManyException, UpdateOneException and InsertManyException.
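As a closing illustration, the grouping step behind bulk_update (described in the walkthrough further up) can be sketched in plain Python. group_for_bulk_update is a hypothetical helper written for this sketch, not DSE's actual code:

```python
from collections import defaultdict

def group_for_bulk_update(rows, pk='id'):
    # Invert a list of row dicts into {field: {value: [ids]}} so each
    # distinct (field, value) pair becomes one UPDATE ... WHERE id IN (...).
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        row_id = row[pk]
        for field, value in row.items():
            if field != pk:
                grouped[field][value].append(row_id)
    return {field: dict(values) for field, values in grouped.items()}

rows = [
    {'id': 1, 'camera_model': 'Nikon', 'fnumber': 2.8},
    {'id': 2, 'camera_model': 'Nikon', 'fnumber': 11},
    {'id': 3, 'camera_model': 'Canon', 'fnumber': 2.8},
]

grouped = group_for_bulk_update(rows)
print(grouped['camera_model'])  # {'Nikon': [1, 2], 'Canon': [3]}
```

The executing loop then issues one queryset update per (field, value) pair, i.e. Model.objects.filter(id__in=ids).update(**{field: value}), which is why bulk_update pays off when fields take few distinct values.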