Commits

decalage  committed 4cf23ca

First version 0.01

Files changed (8)

File pywordform/LICENSE.txt

+LICENSE for the pywordform module:
+
+
+Copyright (c) 2012, Philippe Lagadec (http://www.decalage.info)
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice, this
+   list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

File pywordform/MANIFEST.in

+include install.bat
+include LICENSE.txt

File pywordform/README.txt

+pywordform:
+a python module to parse a Microsoft Word form in docx format, and
+extract all field values with their tags into a dictionary.
+
+Project website: http://www.decalage.info/python/pywordform
+
+
+
+INSTALLATION:
+
+- on Windows, launch install.bat
+- on other systems, run: python setup.py install
+
+
+
+HOW TO USE THIS MODULE:
+
+See http://www.decalage.info/python/pywordform
+See main program at the end of the module, and also docstrings.
+
+
+
+LICENSE:
+
+See LICENSE.txt.
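
To illustrate what the README describes (field tags mapped to field values in a dictionary), here is a minimal sketch written for Python 3 with the standard-library ElementTree. It is not the module itself: `SAMPLE_XML`, `make_sample_docx` and `parse_form_sketch` are hypothetical names, and the XML is a heavily simplified stand-in for what Word actually writes to `word/document.xml`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML tags (same value as NS_W in the module).
NS_W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# Assumption: a simplified document body holding two tagged content controls.
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:sdt><w:sdtPr><w:tag w:val="name"/></w:sdtPr>'
    '<w:sdtContent><w:r><w:t>Alice</w:t></w:r></w:sdtContent></w:sdt>'
    '<w:sdt><w:sdtPr><w:tag w:val="city"/></w:sdtPr>'
    '<w:sdtContent><w:r><w:t>Paris</w:t></w:r></w:sdtContent></w:sdt>'
    '</w:body></w:document>'
)


def make_sample_docx():
    """Build an in-memory zip that contains only word/document.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr('word/document.xml', SAMPLE_XML)
    buf.seek(0)
    return buf


def parse_form_sketch(fileobj):
    """Return {tag: text} for every content control that has both."""
    fields = {}
    with zipfile.ZipFile(fileobj) as zf:
        root = ET.fromstring(zf.read('word/document.xml'))
    for sdt in root.iter(NS_W + 'sdt'):
        tag_elem = sdt.find(NS_W + 'sdtPr/' + NS_W + 'tag')
        text_elem = sdt.find(NS_W + 'sdtContent/' + NS_W + 'r/' + NS_W + 't')
        if tag_elem is not None and text_elem is not None:
            fields[tag_elem.get(NS_W + 'val')] = text_elem.text
    return fields


sample_fields = parse_form_sketch(make_sample_docx())
```

With the sample above, `sample_fields` holds one entry per tagged field. Like the module, the sketch only looks at the first run directly under `sdtContent`, so block-level controls whose text sits inside a paragraph would not be picked up.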

File pywordform/__init__.py

Empty file added.

File pywordform/install.bat

+@echo off
+rem INSTALL.BAT - Easy installer for Python modules on Windows
+
+rem version 0.02 2009-02-27 Philippe Lagadec - http://www.decalage.info
+
+rem License: 
+rem This file install.bat can be freely used, modified and redistributed, as 
+rem long as credit to the author is kept intact. Please send any feedback,
+rem issues or improvements to decalage at laposte.net.
+
+rem CHANGELOG:
+rem 2007-09-04 v0.01 PL: - first version, for Python 2.3 to 2.5
+rem 2009-02-27 v0.02 PL: - added support for Python 2.6
+
+rem 1) test if python.exe is in the path:
+
+python.exe --version >NUL 2>&1
+if errorlevel 1 goto notpath
+echo Python.exe found in the path.
+python setup.py install
+if errorlevel 1 goto error
+goto end
+:NOTPATH
+
+rem 2) test for usual python.exe paths:
+
+REM Python 2.6: 
+c:\python26\python.exe --version >NUL 2>&1
+if errorlevel 1 goto notpy26
+echo Python.exe found in C:\Python26
+c:\python26\python.exe setup.py install
+if errorlevel 1 goto error
+goto end 
+:NOTPY26
+
+c:\python25\python.exe --version >NUL 2>&1
+if errorlevel 1 goto notpy25
+echo Python.exe found in C:\Python25
+c:\python25\python.exe setup.py install
+if errorlevel 1 goto error
+goto end 
+:NOTPY25
+
+c:\python24\python.exe --version >NUL 2>&1
+if errorlevel 1 goto notpy24
+echo Python.exe found in C:\Python24
+c:\python24\python.exe setup.py install
+if errorlevel 1 goto error
+goto end 
+:NOTPY24
+
+c:\python23\python.exe --version >NUL 2>&1
+if errorlevel 1 goto notpy23
+echo Python.exe found in C:\Python23
+c:\python23\python.exe setup.py install
+if errorlevel 1 goto error
+goto end 
+:NOTPY23
+
+"c:\program files\python\python.exe" --version >NUL 2>&1
+if errorlevel 1 goto notpf
+echo Python.exe found in C:\Program Files\Python
+"c:\program files\python\python.exe" setup.py install
+if errorlevel 1 goto error
+goto end 
+:NOTPF
+
+rem 3) last we just try to launch the script, if .py is associated to python.exe
+echo Python.exe not found, trying to launch setup.py directly.
+setup.py install
+if errorlevel 1 goto error
+goto end
+
+:ERROR
+echo.
+echo If the installation is not successful, try to run "python setup.py install"
+echo or simply "setup.py install" in the script directory.
+echo You can also copy files by hand into the site-packages directory of your
+echo Python installation.
+REM pause
+
+:END
+pause
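
The lookup order used by install.bat (PATH first, then a fixed list of install directories) can be restated as a short Python 3 sketch. This is only an illustration under assumptions: `find_python` and `CANDIDATE_PATHS` are hypothetical names, and the sketch checks that a file exists instead of running `python.exe --version` as the batch file does.

```python
import os
import shutil

# Fallback locations probed by install.bat, in the same order.
CANDIDATE_PATHS = [
    r'c:\python26\python.exe',
    r'c:\python25\python.exe',
    r'c:\python24\python.exe',
    r'c:\python23\python.exe',
    r'c:\program files\python\python.exe',
]


def find_python(candidates=CANDIDATE_PATHS, which=shutil.which,
                exists=os.path.isfile):
    """Return the first interpreter found, or None if nothing matches.

    `which` and `exists` are injectable so the lookup can be exercised
    without touching the real filesystem.
    """
    # Step 1: an interpreter reachable through PATH wins.
    on_path = which('python')
    if on_path:
        return on_path
    # Step 2: otherwise try the usual install directories in order.
    for path in candidates:
        if exists(path):
            return path
    # Step 3 of the batch file (relying on the .py file association)
    # has no direct equivalent here, so the caller gets None.
    return None
```

A caller could pass the returned path to `subprocess` to run `setup.py install`; when the result is `None`, the batch file's last resort is to launch `setup.py` directly.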

File pywordform/pywordform.py

+"""
+pywordform
+
+a python module to parse a Microsoft Word form in docx format, and
+extract all field values with their tags into a dictionary.
+
+Project website: http://www.decalage.info/python/pywordform
+
+Copyright (c) 2012, Philippe Lagadec (http://www.decalage.info)
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice, this
+   list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+"""
+
+__version__ = '0.01'
+
+#------------------------------------------------------------------------------
+# CHANGELOG:
+# 2012-02-17 v0.01 PL: - first version
+
+#------------------------------------------------------------------------------
+#TODO:
+# + extract multiline text fields using <w:t> and <w:br> tags
+# - recognize date fields and extract fulldate
+# - test docm files, add support if needed
+# - support legacy fields
+# - CSV output (option)
+# - more advanced parser returning a list of field objects: keep order, and
+#   get fields with no tag, extract other attributes such as title
+
+#------------------------------------------------------------------------------
+import zipfile, sys
+
+try:
+    # lxml: best performance for XML processing
+    import lxml.etree as ET
+except ImportError:
+    try:
+        # Python 2.5+: batteries included
+        import xml.etree.cElementTree as ET
+    except ImportError:
+        try:
+            # Python <2.5: standalone ElementTree install
+            import elementtree.cElementTree as ET
+        except ImportError:
+            raise ImportError("Neither lxml nor ElementTree is installed, "
+                "see http://codespeak.net/lxml "
+                "or http://effbot.org/zone/element-index.htm")
+
+
+# namespace for word XML tags:
+NS_W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
+
+# XML tags in Word forms:
+TAG_FIELD         = NS_W+'sdt'
+TAG_FIELDPROP     = NS_W+'sdtPr'
+TAG_FIELDTAG      = NS_W+'tag'
+ATTR_FIELDTAGVAL  = NS_W+'val'
+TAG_FIELD_CONTENT = NS_W+'sdtContent'
+TAG_RUN           = NS_W+'r'
+TAG_TEXT          = NS_W+'t'
+
+
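+def parse_form(filename):
+    """
+    Parse a Word form (docx) and return a dict mapping each field tag to
+    the text of that field. Fields with no tag element or no text element
+    are skipped.
+    """
+    fields = {}
+    zfile = zipfile.ZipFile(filename)
+    try:
+        # the main document body is stored in word/document.xml
+        form = zfile.read('word/document.xml')
+    finally:
+        # close the archive even if the read fails
+        zfile.close()
+    xmlroot = ET.fromstring(form)
+    for field in xmlroot.getiterator(TAG_FIELD):
+        field_tag = field.find(TAG_FIELDPROP+'/'+TAG_FIELDTAG)
+        if field_tag is not None:
+            tag = field_tag.get(ATTR_FIELDTAGVAL, None)
+            field_value = field.find(TAG_FIELD_CONTENT+'/'+TAG_RUN+'/'+TAG_TEXT)
+            if field_value is not None:
+                value = field_value.text
+                fields[tag] = value
+    return fields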
+
+
+if __name__ == '__main__':
+    fields = parse_form(sys.argv[1])
+    for tag, value in sorted(fields.items()):
+        print('%s = "%s"' % (tag, value))
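
The first TODO item in this file (multiline text fields built from `<w:t>` and `<w:br>` tags) could be approached along the following lines. This is a Python 3 sketch under assumptions, not part of the commit: `field_text` and `SAMPLE` are hypothetical names, and the sample fragment is simplified compared with real Word output.

```python
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML tags (same value as NS_W in the module).
NS_W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def field_text(sdt_content):
    """Join every <w:t> under sdt_content, turning each <w:br> into a newline."""
    parts = []
    # iter() walks descendants in document order, so text and breaks
    # come out in the order they appear in the field.
    for elem in sdt_content.iter():
        if elem.tag == NS_W + 't':
            parts.append(elem.text or '')
        elif elem.tag == NS_W + 'br':
            parts.append('\n')
    return ''.join(parts)


# Assumption: a simplified sdtContent fragment holding a two-line value.
SAMPLE = (
    '<w:sdtContent xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:t>line one</w:t><w:br/><w:t>line two</w:t></w:r>'
    '</w:sdtContent>'
)

multiline_value = field_text(ET.fromstring(SAMPLE))
```

Because the walk covers all descendants, text from any content control nested inside the field would be included as well; a stricter version would have to stop at nested `<w:sdt>` elements.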

File pywordform/sample_form.docx

Binary file added.

File pywordform/setup.py

+# Setup script for pywordform - Philippe Lagadec
+
+# History:
+# 2012-03-04 v0.01 PL: - first version
+
+import distutils.core
+
+from pywordform import __version__
+
+kw = {
+    'name': "pywordform",
+    'version': __version__,
+    'description': "a python module to parse a Microsoft Word form in docx format, and extract all field values with their tags into a dictionary.",
+    'author': 'Philippe Lagadec',
+    #'author_email': "decalage(a)laposte.net",
+    'url': "http://www.decalage.info/python/pywordform",
+    'license': "BSD (see source code or LICENSE.txt)",
+    'py_modules': ['pywordform'],
+    }
+
+
+# If we're running Python 2.3 or later, add extra information
+if hasattr(distutils.core, 'setup_keywords'):
+    if 'classifiers' in distutils.core.setup_keywords:
+        kw['classifiers'] = [
+            'Development Status :: 4 - Beta',
+            'License :: OSI Approved :: BSD License',
+            'Intended Audience :: Developers',
+            'Operating System :: OS Independent',
+            'Programming Language :: Python',
+            'Topic :: Software Development :: Libraries :: Python Modules'
+          ]
+    if 'download_url' in distutils.core.setup_keywords:
+        kw['download_url'] = "https://bitbucket.org/decalage/pywordform/downloads"
+
+distutils.core.setup(**kw)