Commits

Luke Plant committed 17dfc7a

Added proper documentation of app.

Also involved changing get_allowed_value method to being public.

  • Parent commits 2d5286f

Files changed (11)

 Django Anonymizer
 ===================
 
-This app aims to help you anonymize data in a database used for development.
+This app helps you anonymize data in a database used for development of a Django
+project.
 
 It is common practice in development to use a database that is very similar in
 content to the real data. The problem is that this can lead to having copies of
-sensitive customer data on development machines and other placers (like
-automatic backups). This Django app helps by giving an easy and customizable way
-to anonymize data in your models.
+sensitive customer data on development machines. This Django app helps by
+providing an easy and customizable way to anonymize data in your models.
 
 The basic method is to go through all the models that you specify, and generate
 fake data for all the fields specified. Introspection of the models will produce
 an anonymizer that will attempt to provide sensible fake data for each field,
 leaving you to tweak for your needs.
 
-Please note that the methods provided will not provide full anonymity. Even if
-you anonymize the names and other details of your customers, there may well be
-enough data to identify them. Relationships between records in the database are
-not altered, in order to preserve the characteristic structure of data in your
-application, but this may leave you open to information leaks which might not be
-acceptable for your data. This application **should** be good enough for simpler
-policies like 'remove all real telephone numbers from the database'.
+Please note that the methods provided may not be able to give full
+anonymity. Even if you anonymize the names and other details of your customers,
+there may well be enough data to identify them. Relationships between records in
+the database are not altered, in order to preserve the characteristic structure
+of data in your application, but this may leave you open to information leaks
+which might not be acceptable for your data. This application **should** be good
+enough for simpler policies like 'remove all real telephone numbers from the
+database'.
 
-Usage:
+Quick overview (see docs for more information, either in docs/ or on
+<http://packages.python.org/django-anonymizer>).
 
 * Install using setup.py or pip/easy_install.
 
-* Add 'anonymizer' to your ``INSTALLED_APPS`` setting. It would be advisable
-  to arrange it so that this only happens on development machines, and not
-  anywhere with access to the live database.
+* Add 'anonymizer' to your ``INSTALLED_APPS`` setting.
 
-* To create some stub files for your anonymizers, do::
+* Create some stub files for your anonymizers::
 
     ./manage.py create_anonymizers app_name1 [app_name2...]
 
   This will create a file ``anonymizers.py`` in each of the apps you specify.
   (It will not overwrite existing files).
 
-  The file will contain autogenerated classes that attempt to use appropriate
-  functions for generating fake data.
-
-* Edit the generated ``anonymizers.py`` files, adjusting as necessary, and
-  adding any filtering. You can override any of the public methods defined in
-  ``anonymizer.base.Anonymizer`` in order to do filtering and other
-  customization.
-
-  The 'attributes' dictionary is the key attribute to edit. The keys are the
-  attribute names of attributes on the model that need to be set.  The values
-  are either strings or callables. If strings, they will be interpreted as a
-  function in the module ``anonymizer.replacers``. This module can be browsed
-  to find suitable functions to use to anonymize data.
-
-  If callables are used as the keys, they should have a signature compatible
-  with the callables in ``anonymizer.replacers``. You can use ``lambda *args:
-  my_constant_value`` to return a constant.
-
-  For some fields, you will want to remove them from the list of attributes, so
-  that the values will be unchanged - especially things like denormalised
-  fields. You can also override the 'alter_object' method to do any fixing that
-  may be necessary.
-
-  An example Anonymizer for django.contrib.auth.models.User might look like
-  this::
-
-      from datetime import datetime
-
-      from anonymizer import Anonymizer
-      from django.contrib.auth.models import User
-
-      class UserAnonymizer(Anonymizer):
-
-          model = User
-
-          attributes = {
-              'username':   'username',
-              'first_name': 'first_name',
-              'last_name':  'last_name',
-              'email':      'email',
-              'date_joined': 'similar_datetime'
-              # Set to today:
-              'last_login': lambda *args: datetime.now()
-          }
-
-          def alter_object(self, obj):
-              if obj.is_superuser:
-                  return False # don't change, so we can still log in.
-              super(UserAnonymizer, self).alter_object(obj)
-              # Destroy all passwords for everyone else
-              obj.set_unusable_password()
+* Edit the generated ``anonymizers.py`` files, adjusting or deleting as
+  necessary, using the functions in module ``anonymizer.replacers`` or
+  custom functions.
 
 * If you need to create anonymizers for apps that you do not control, you may
   want to move the contents of the anonymizers.py file to an app that you **do**
   will probably want to move the contents of django/contrib/auth/anonymizers.py
   into yourprojectapp/anonymizers.py)
 
-* To run the anonymizers, do::
+* Run the anonymizers::
 
     ./manage.py anonymize_data app_name1 [app_name2...]
 

anonymizer/base.py

         field_vals = set(x[0] for x in field.model._default_manager.values_list(field.name))
         self.init_values[field] = field_vals
 
-    def _get_allowed_value(self, source, field):
+    def get_allowed_value(self, source, field):
         retval = source()
         if field is None:
             return retval
         def source():
             length = random.choice(range(0, max_length + 1))
             return "".join(random.choice(general_chars) for i in xrange(length))
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     def simple_pattern(self, pattern, field=None):
         """
         ? with a random letter.
         """
         source = lambda: bothify(pattern)
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     def bool(self, field=None):
         """
         Returns a random boolean
         """
         source = lambda: bool(randrange(0, 2))
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     def integer(self, field=None):
         source = lambda: random.randint(-1000000, 1000000)
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     def positive_integer(self, field=None):
         source = lambda: random.randint(0, 1000000)
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     def small_integer(self, field=None):
         source = lambda: random.randint(-32768, +32767)
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     def positive_small_integer(self, field=None):
         source = lambda: random.randint(0, 32767)
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     def datetime(self, field=None, val=None):
         """
         else:
             source = lambda: datetime.fromtimestamp(int(val.strftime("%s")) +
                                                     randrange(-365*24*3600*2, 365*24*3600*2))
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     def date(self, field=None, val=None):
         """
         return d.date()
 
     def uk_postcode(self, field=None):
-        return self._get_allowed_value(uk_postcode, field)
+        return self.get_allowed_value(uk_postcode, field)
 
     def uk_county(self, field=None):
         source = lambda: random.choice(data.UK_COUNTIES)
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     def uk_country(self, field=None):
         source = lambda: random.choice(data.UK_COUNTRIES)
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     def lorem(self, field=None, val=None):
         """
                 return "\n".join(parts)
         else:
             source = self.faker.lorem
-        return self._get_allowed_value(source, field)
+        return self.get_allowed_value(source, field)
 
     ## Other attributes provided by 'Faker':
 
 
         def func(*args, **kwargs):
             field = kwargs.get('field', None)
-            return self._get_allowed_value(source, field)
+            return self.get_allowed_value(source, field)
         return func
 
 

anonymizer/replacers.py

 # Pre-built replacers.
 
 varchar = lambda anon, obj, field, val: anon.faker.varchar(field=field)
+varchar.__doc__ = """
+Produces random data for a varchar field.
+"""
+
 bool = lambda anon, obj, field, val: anon.faker.bool(field=field)
+bool.__doc__ = """
+Produces a random boolean value (True/False)
+"""
+
 integer = lambda anon, obj, field, val: anon.faker.integer(field=field)
+integer.__doc__ = """
+Produces a random integer (for a Django IntegerField)
+"""
+
 positive_integer = lambda anon, obj, field, val: anon.faker.positive_integer(field=field)
+positive_integer.__doc__ = """
+Produces a random positive integer (for a Django PositiveIntegerField)
+"""
+
 small_integer = lambda anon, obj, field, val: anon.faker.small_integer(field=field)
+small_integer.__doc__ = """
+Produces a random small integer (for a Django SmallIntegerField)
+"""
+
 positive_small_integer = lambda anon, obj, field, val: anon.faker.positive_small_integer(field=field)
+positive_small_integer.__doc__ = """
+Produces a positive small random integer (for a Django PositiveSmallIntegerField)
+"""
+
 datetime = lambda anon, obj, field, val: anon.faker.datetime(field=field)
+datetime.__doc__ = """
+Produces a random datetime
+"""
+
 date = lambda anon, obj, field, val: anon.faker.date(field=field)
+date.__doc__ = """
+Produces a random date
+"""
+
 uk_postcode = lambda anon, obj, field, val: anon.faker.uk_postcode(field=field)
+uk_postcode.__doc__ = """
+Produces a random UK postcode (not necessarily valid, but it will look like one).
+"""
+
 uk_country = lambda anon, obj, field, val: anon.faker.uk_country(field=field)
+uk_country.__doc__ = """
+Returns a randomly selected country that is part of the UK
+"""
+
 uk_county = lambda anon, obj, field, val: anon.faker.uk_county(field=field)
+uk_county.__doc__ = """
+Returns a randomly selected county from the UK
+"""
+
 username = lambda anon, obj, field, val: anon.faker.username(field=field)
+username.__doc__ = """
+Produces a randomly generated username
+"""
+
 first_name = lambda anon, obj, field, val: anon.faker.first_name(field=field)
+first_name.__doc__ = """
+Produces a randomly generated first name
+"""
+
 last_name = lambda anon, obj, field, val: anon.faker.last_name(field=field)
+last_name.__doc__ = """
+Produces a randomly generated last name
+"""
+
 name = lambda anon, obj, field, val: anon.faker.name(field=field)
+name.__doc__ = """
+Produces a randomly generated full name (using first name and last name)
+"""
+
 email = lambda anon, obj, field, val: anon.faker.email(field=field)
+email.__doc__ = """
+Produces a randomly generated email address.
+"""
 full_address = lambda anon, obj, field, val: anon.faker.full_address(field=field)
+full_address.__doc__ = """
+Produces a randomly generated full address, using newline characters between the lines.
+Resembles a US address
+"""
 phonenumber = lambda anon, obj, field, val: anon.faker.phonenumber(field=field)
+phonenumber.__doc__ = """
+Produces a randomly generated US-style phone number
+"""
+
 street_address = lambda anon, obj, field, val: anon.faker.street_address(field=field)
+street_address.__doc__ = """
+Produces a randomly generated street address - the first line of a full address
+"""
+
 city = lambda anon, obj, field, val: anon.faker.city(field=field)
+city.__doc__ = """
+Produces a randomly generated city name. Resembles the name of a US/UK city.
+"""
+
 state = lambda anon, obj, field, val: anon.faker.state(field=field)
+state.__doc__ = """
+Returns a randomly selected US state code
+"""
+
 zip_code = lambda anon, obj, field, val: anon.faker.zip_code(field=field)
+zip_code.__doc__ = """
+Returns a randomly generated US zip code (not necessarily valid, but will look like one).
+"""
+
 company = lambda anon, obj, field, val: anon.faker.company(field=field)
+company.__doc__ = """
+Returns a randomly generated company name
+"""
+
 lorem = lambda anon, obj, field, val: anon.faker.lorem(field=field)
+lorem.__doc__ = """
+Returns a paragraph of lorem ipsum text
+"""
 
-# These use the value of the field to return a date/datetime that is close
-# (within two years) of the original value.
 similar_datetime = lambda anon, obj, field, val: anon.faker.datetime(field=field, val=val)
+similar_datetime.__doc__ = """
+Returns a datetime that is within plus/minus two years of the original datetime
+"""
+
 similar_date = lambda anon, obj, field, val: anon.faker.date(field=field, val=val)
+similar_date.__doc__ = """
+Returns a date that is within plus/minus two years of the original date
+"""
 
-# similar_lorem produces lorem ipsum text with the same length and same pattern
-# of linebreaks as the original. If the original often takes a standard form
-# (e.g. a single word 'yes' or 'no'), this could easily fail to hide the
-# original data.
 similar_lorem = lambda anon, obj, field, val: anon.faker.lorem(field=field, val=val)
-
+similar_lorem.__doc__ = """
+Produces lorem ipsum text with the same length and same pattern of linebreaks as
+the original. If the original often takes a standard form (e.g. a single word
+'yes' or 'no'), this could easily fail to hide the original data.
+"""

docs/anonymizers.rst

+===================
+Writing Anonymizers
+===================
+
+For each model, you need a subclass of :class:`anonymizer.base.Anonymizer`. They
+can be automatically generated using the :ref:`create-anonymizers-command`
+command.
+
+The main attributes that must be set are ``model`` and ``attributes``. You can
+also override other methods to customise the process.
+
+
+.. class:: anonymizer.base.Anonymizer
+
+   .. attribute:: model
+
+      This is the Django model that will be anonymized.
+
+   .. attribute:: attributes
+
+      This should be a list of 2-tuples, where the first value is the name of an
+      attribute on the model that needs to be set (i.e. usually a field name),
+      and the second value specifies a 'replacer' - a source of faked data. The
+      replacer is either a string or a callable:
+
+      * If the replacer is a string, it will be interpreted as a function in the
+        module :mod:`anonymizer.replacers`.
+
+      * If the replacer is a callable, it should have a signature compatible
+        with the callables in :mod:`anonymizer.replacers` - see the
+        documentation in that module for writing your own replacers.
+
+      Note that the order the fields are listed can be important. If you have a
+      username field with ``unique=True``, for example, it will help if this comes
+      before other fields like name (see :class:`anonymizer.base.DjangoFaker`
+      for more details).
+
+   .. attribute:: order
+
+      Sometimes it is important that some anonymizers are run before others.  By
+      default, this value is zero, but you can set it to any other numeric
+      value. The anonymizers will be sorted by this attribute before being run.
+
+   .. method:: get_query_set(self)
+
+      Returns the QuerySet to be manipulated.
+
+      The default implementation uses the default manager for the model, and
+      returns all objects, ordered by the 'id' field if it exists. You can
+      override this method to change that.
+
+   .. method:: alter_object(self, obj)
+
+      Alters the object (but does not save).
+
+      Override this method to add custom behaviour for altering an object. This
+      is especially useful if for some fields/rows you want to add some logic
+      that cannot easily be written as a 'replacer'.
+
+   .. method:: alter_object_attribute(self, obj, attname, replacer)
+
+      Alters the attribute 'attname' on object 'obj', using the replacement
+      source 'replacer'
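
The way ``attributes`` is described above (an ordered list of 2-tuples whose second item is either a replacer name or a callable) can be illustrated with a small standalone sketch. This is not the code from ``anonymizer.base``: the ``REPLACERS`` dictionary, ``resolve_replacer``, the module-level ``alter_object`` function, and the ``Customer`` class are all hypothetical stand-ins, and the fake values are made up for the illustration.

```python
import random

# Hypothetical stand-in for the anonymizer.replacers module: name -> callable.
# Each replacer takes (anon, obj, field, val), as the docs describe.
REPLACERS = {
    "first_name": lambda anon, obj, field, val: random.choice(["Alice", "Bob"]),
    "email": lambda anon, obj, field, val: "user%d@example.com" % random.randint(0, 9999),
}


def resolve_replacer(replacer):
    """Return a callable for a replacer given as a string or as a callable."""
    if callable(replacer):
        return replacer
    return REPLACERS[replacer]


def alter_object(obj, attributes):
    """Apply each (attname, replacer) pair in list order; nothing is saved."""
    for attname, replacer in attributes:
        func = resolve_replacer(replacer)
        # This sketch has no Anonymizer or Django field, so None is passed for both.
        setattr(obj, attname, func(None, obj, None, getattr(obj, attname)))


class Customer(object):
    """Plain object standing in for a Django model instance."""

    def __init__(self):
        self.first_name = "Real"
        self.email = "real@customer.test"
        self.phone = "555-0100"
        self.notes = "keep me"


attributes = [
    ("first_name", "first_name"),       # string: looked up by name
    ("email", "email"),
    ("phone", lambda *args: "0000"),    # callable: returns a constant
    # 'notes' is not listed, so it is left unchanged
]

customer = Customer()
alter_object(customer, attributes)
```

Attributes left out of the list keep their original values, which is the behaviour the documentation recommends for things like denormalised fields.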

docs/commands.rst

+========
+Commands
+========
+
+.. _create-anonymizers-command:
+
+create_anonymizers
+------------------
+
+.. code-block:: bash
+
+  manage.py create_anonymizers <app name> [<app name 2>..]
+
+For each model in each app, default anonymizers will be generated and saved in
+``<app>/anonymizers.py``. For each field, the best guess 'replacer' will be
+used, using the types and names of the Django fields defined. But you will
+almost certainly have to edit the generated file to tweak the choices made, and
+in many cases to completely remove anonymizers for models that don't need them.
+
+Currently some fields are deliberately skipped - these include `ForeignKey`
+fields and `ManyToManyField` relations. A comment will be included to show they
+have been skipped.
+
+Some field types are also currently unsupported. The corresponding code
+generated will produce an error on import, to indicate that it must be dealt
+with before proceeding. Some of the missing fields could be added fairly easily
+(requests welcome, patches even more so).
+
+.. _anonymize-data-command:
+
+anonymize_data
+--------------
+
+.. code-block:: bash
+
+   manage.py anonymize_data <app name> [<app name 2>..]
+
+Runs all the anonymizers defined in ``<app>/anonymizers.py``. This destructively
+updates the data in your database, so be careful not to use it on a live database!
 # If extensions (or modules to document with autodoc) are in another directory,
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
-#sys.path.insert(0, os.path.abspath('.'))
+#sys.path.insert(0, os.path.abspath('../'))
+
+# To be able to find source and dependencies, we must include the virtualenv
+venv = os.path.join(os.environ['HOME'], ".virtualenvs/anonymizer/bin/activate_this.py")
+if os.path.exists(venv):
+    execfile(venv, dict(__file__=venv))
+else:
+    sys.stderr.write("Virtualenv %s not found, building docs will fail !!\n" % venv)
+    sys.stderr.write("Since the docs use the 'autodocs' extension, the sources need to be importable by the Sphinx process.\n")
+
 
 # -- General configuration -----------------------------------------------------
 

docs/djangofaker.rst

+====================================
+DjangoFaker - Django-aware fake data
+====================================
+
+.. class:: anonymizer.base.DjangoFaker
+
+   This class is used by the :mod:`anonymizer.replacers` module to generate fake
+   data. It encapsulates some knowledge about Django models and fields to
+   attempt to generate data that fits the constraints of the model.
+
+   .. method:: get_allowed_value(self, source, field)
+
+      This method is the part that encapsulates knowledge about Django models
+      and fields, and uses it to return appropriate data.
+
+      ``source`` is a callable that takes no arguments and returns a piece of
+      fake data. In some cases (such as for unique constraints), it may be
+      necessary for this callable to be called more than once, to try to get data
+      that doesn't violate constraints.
+
+      ``field`` is the Django model field for which data must be generated. If
+      ``None``, the source callable will be used without further checking.
+
+      So far, this method understands and attempts to respect:
+
+      * The ``max_length`` attribute of fields.
+
+      * The ``unique`` constraint.
+
+
+   The remaining public methods all generate different types of fake data, using
+   ``get_allowed_value`` to apply constraints. Many methods use an underlying
+   ``faker.Faker`` instance. These include the following, which have some useful
+   properties.
+
+   .. method:: name(self)
+
+      Generates a full name, using the '<first name> <last name>' pattern.
+
+   .. method:: first_name(self)
+
+      Returns a randomly selected first name
+
+   .. method:: last_name(self)
+
+      Returns a randomly selected last name
+
+   .. method:: email(self)
+
+      Returns a randomly generated email address, using the pattern
+      '<initial><last name>@<free email provider>'
+
+   .. method:: username(self)
+
+      Returns a randomly generated user name, using the pattern
+      '<initial><lastname>'.
+
+   These name-related methods have the property that the same underlying first
+   name and last name will be used until you repeat any of the methods. This
+   means that if you use these methods to generate a set of data for a user
+   model, the name/username/email address for a single object will correspond to
+   each other.
+
+   For models with unique constraints on one of these fields, there is the
+   complication that django-anonymizer will avoid setting unique fields to
+   already present or already generated values. Sometimes this means that a fake
+   data source must be used more than once to get a unique value, and this can
+   upset the state of the cycle. To avoid this, put fields with a unique
+   constraint at the start of the :attr:`~anonymizer.base.Anonymizer.attributes`
+   list.
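
The retry behaviour described for ``get_allowed_value`` can be sketched in plain Python. This is only an illustration of the idea, not the method in ``anonymizer/base.py``: the real method takes a Django field, whereas the ``max_length``, ``used_values`` and ``max_tries`` parameters here are assumptions made so the sketch runs without Django, and truncating to ``max_length`` is likewise an assumed strategy.

```python
def get_allowed_value(source, max_length=None, used_values=None, max_tries=100):
    """Call source() until the result satisfies the given constraints.

    max_length, used_values and max_tries are illustrative parameters; the
    real method reads its constraints from a Django model field.
    """
    for _ in range(max_tries):
        value = source()
        if max_length is not None:
            value = value[:max_length]  # assumed strategy: truncate to fit
        if used_values is None:
            return value  # no uniqueness requirement
        if value not in used_values:
            used_values.add(value)
            return value
        # Value already taken: loop round and ask the source again.
    raise ValueError("no unique value found after %d tries" % max_tries)


# A deterministic source, so the retry is easy to follow.
names = iter(["alice", "alice", "bob_the_builder"])
used = set()
first = get_allowed_value(lambda: next(names), max_length=8, used_values=used)
second = get_allowed_value(lambda: next(names), max_length=8, used_values=used)
```

The second call consumes two values from the source because the first candidate is already taken. That extra call is the kind of repeated use that can disturb the name/username/email cycle described above, which is why fields with a unique constraint are best listed first.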
 Welcome to Django Anonymizer's documentation!
 =============================================
 
+Django Anonymizer helps you anonymize data in a database used for development of
+a Django project.
+
+It is common practice in development to use a database that is very similar in
+content to the real data, and sometimes you need the live data on a developer's
+machine in order to reproduce a bug. The problem is that this can lead to having
+copies of sensitive customer data on development machines and other places (like
+automatic backups).
+
+This app helps by providing an easy and customizable way to anonymize data in
+your models. It is designed for fairly small databases (or databases that have
+already been reduced to a small size) - otherwise the anonymization process will
+simply take too long to execute.
+
+Please note that the methods provided may not be able to give full
+anonymity. Even if you anonymize the names and other details of your customers,
+there may well be enough data to identify them. Relationships between records in
+the database are not altered, in order to preserve the characteristic structure
+of data in your application, but this may leave you open to information leaks
+which might not be acceptable for your data. This application **should** be good
+enough for simpler policies like 'remove all real telephone numbers from the
+database'.
+
+
 Contents:
 
 .. toctree::
    :maxdepth: 2
 
+   install
+   overview
+   commands
+   anonymizers
+   replacers
+   djangofaker
+
 Indices and tables
 ==================
 
+============
+Installation
+============
+
+Add 'anonymizer' to your ``INSTALLED_APPS`` setting. It would be advisable to
+arrange it so that this only happens on development machines, and not anywhere
+with access to the live database.
+
+

docs/overview.rst

+========
+Overview
+========
+
+* Generate some anonymizers for your apps::
+
+    ./manage.py create_anonymizers app_name1 [app_name2...]
+
+  This will create a file ``anonymizers.py`` in each of the apps you specify.
+  (It will not overwrite existing files).
+
+  The file will contain autogenerated classes that attempt to use appropriate
+  functions for generating fake data.
+
+* Edit the generated ``anonymizers.py`` files.
+
+  You will mainly need to edit the
+  :attr:`~anonymizer.base.Anonymizer.attributes` list:
+
+  * Choose an appropriate 'replacer' from the :mod:`anonymizer.replacers` module.
+
+  * For some fields, you will want to remove them from the list of attributes, so
+    that the values will be unchanged - especially things like denormalised
+    fields.
+
+  * Some models may not need any anonymization, and the corresponding anonymizer
+    can be deleted.
+
+* If you need to create anonymizers for apps that you do not control, you may
+  want to move the contents of the anonymizers.py file to an app that you **do**
+  control. It doesn't matter if the anonymizer classes are for models that do
+  not correspond to the applications they are contained in.
+
+  (For example, if you want to anonymize the models in django.contrib.auth, you
+  will probably want to move the contents of ``django/contrib/auth/anonymizers.py``
+  into ``yourprojectapp/anonymizers.py``)
+
+* Run the anonymizers::
+
+    ./manage.py anonymize_data app_name1 [app_name2...]
+
+  This will DESTRUCTIVELY UPDATE all your data. Make sure you only do this on a
+  copy of your database, use at own risk, yada yada.
+
+* Note: your database may not actually delete the changed data from the disk
+  when you update fields. For Postgresql you will need to VACUUM to delete that
+  data.
+
+  And even then, your operating system may not delete the data from the
+  disk. Properly getting rid of these traces is left as an exercise to the
+  reader :-)

docs/replacers.rst

+=============================
+Replacers - fake data sources
+=============================
+
+A 'replacer' is a source of faked data. The replacers in this module can be
+referred to using a string that is simply the name of the function. Custom
+replacers can be used by defining them as callables.
+
+When run by the anonymizer, the callable will be passed the Anonymizer object,
+the object being altered, the field being altered, and the current value of the
+field. It must return random data of the appropriate type. You can use ``lambda
+*args: my_constant_value`` to return a constant.
+
+All of the replacers defined in this module use a
+:class:`anonymizer.base.DjangoFaker` instance to generate fake data, and this
+object may be of use to you in writing your own replacers.
+
+
+.. automodule:: anonymizer.replacers
+   :members:
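
A custom replacer, as described in ``docs/replacers.rst``, is any callable that accepts the Anonymizer object, the object being altered, the field, and the current value. The following is a minimal sketch under that assumption; ``random_reference_code`` and ``blank_notes`` are hypothetical names, and the 'two letters, dash, four digits' format is invented for the example.

```python
import random
import string


def random_reference_code(anon, obj, field, val):
    """Custom replacer: two random uppercase letters, a dash, four digits.

    All four arguments are accepted to match the replacer signature, although
    this sketch does not use any of them.
    """
    letters = "".join(random.choice(string.ascii_uppercase) for _ in range(2))
    digits = "".join(random.choice(string.digits) for _ in range(4))
    return "%s-%s" % (letters, digits)


# A constant replacer, in the form the documentation suggests.
blank_notes = lambda *args: ""

# Called directly here; in real use the anonymizer supplies the arguments.
code = random_reference_code(None, None, None, "AB-0001")
```

A replacer like this would be referenced in an ``attributes`` list by passing the function itself rather than a string name. Because it ignores ``field``, it does not get the ``max_length`` or ``unique`` handling that the built-in replacers obtain through ``DjangoFaker``.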