Use defusedxml to guard against various XML vulnerabilities

#300 Merged at c94a2cf
Repository
maykinmedia
Branch
default
Repository
openpyxl
Branch
2.5
Author
  1. Alexander Schrijver
Reviewers
Description
  • Test for various XML vulnerabilities

    Add tests which show that Quadratic blowup and Billion laughs attacks are possible for openpyxl. Add tests which test for the regression of XXE-attacks for local or remote files.

  • Use defusedxml for lxml and xml.etree.(c)Elementree to guard against the billion laughs attack and quadratic blowup attack.

Comments (15)

  1. CharlieC

    Thanks but this makes defusedxml a hard dependency. Wouldn’t it make sense to make it optional? Otherwise changes upstream to the standard library and lxml would make more sense.

    The vulnerability tests seem to be duplicate.

  2. Alexander Schrijver author

    That would mean depending on if you would install defusedxml, you would be vulnerable or not to these attacks. I would prefer it to be a hard dependency. If that’s not desirable, I can make it a note in the README about the security implications and make it optional.

    I agree fixing this in the standard library/lxml would be the more sensible way to go. However, I don’t think standard library/lxml are planning on making any of these changes. Both are well aware of the risks, see:

    https://docs.python.org/2/library/xml.html#xml-vulnerabilities

    https://lxml.de/FAQ.html#how-do-i-use-lxml-safely-as-a-web-service-endpoint

    About the duplicate tests, if you mean that i’m attacking both xl/sharedStrings.xml and [ContentTypes].xml is because they use different parsers. Both lxml and ElementTree are used at the same time, if lxml is installed. If that’s not the duplicate tests, could you point out which are duplicate?

    1. CharlieC

      There might be reasons for why the libraries don’t include relevant mitigations. In general I’d also favour the most secure library. Have you run the performance tests to see if there is a noticeable difference? Have you run tox for all the relevant environments?

      Billion laughs seems to be tested twice. I think the tests only need to be in the XML part of the library. No need to add additional tests that require changing existing files as the tests are related to the parsers only.

      1. Alexander Schrijver author

        Yes there probably are good reasons they don’t adapt these issues in standard library/lxml, I haven’t found out why yet. I’m guessing it is because it would break a lot of software.

        Oh, you’re right. about the tests, i’ll change it.

      2. Alexander Schrijver author

        I have run tox. I don’t have all the necessary requirements installed. But other than that it succeeds.

        _____________________________________________________________________________________________________ summary _____________________________________________________________________________________________________
          py27: commands succeeded
        SKIPPED:  py34: InterpreterNotFound: python3.4
          py35: commands succeeded
        SKIPPED:  py36: InterpreterNotFound: python3.6
          py37: commands succeeded
        SKIPPED:  pypy: InterpreterNotFound: pypy
        SKIPPED:  pypy3: InterpreterNotFound: pypy3
          nolxml: commands succeeded
          lxml: commands succeeded
          keep_vba: commands succeeded
        SKIPPED:  keep_vba-py34: InterpreterNotFound: python3.4
          nopillow: commands succeeded
          xfail: commands succeeded
          pandas: commands succeeded
          numpy: commands succeeded
          doc: commands succeeded
          doctest: commands succeeded
          congratulations :)
        
        1. CharlieC

          py36 and py34 aren’t important but we should definitely test against pypy. Not that it brings any performance improvements with openpyxl because we’re using cElementTree but people do use it. I can run those envs locally but you could also run them on Bitbucket using pipelines.

  3. Alexander Schrijver author

    Also, about any performance hit, I did a very basic comparison. with the script below. It reads in a file with 10000 rows with 4 columns and writes it to disk. and prints the time it takes. So there is a small performance hit.

    Version without defusedxml

    ./test.py ../large-file.xlsx Sheet1
    1.2656393930001286
    1.2581246709996776
    1.2842927739993684
    1.2873246990002372
    1.2602214349999485
    1.282842953999534
    1.398842481000429
    1.265241748000335
    1.3274842709997756
    1.383017629999813

    Version with defusedxml

    ./test.py ../large-file.xlsx Sheet1
    1.437634259999868
    1.420389238000098
    1.4562232119997134
    1.4478153339996425
    1.4217176570000447
    1.4359735990001354
    1.446506668999973
    1.4246281499999895
    1.429721251999581
    1.4423400240002593

    #!/usr/bin/env python
    
    from openpyxl import load_workbook, Workbook
    
    import sys
    
    
    import time
    
    
    def main():
        if len(sys.argv) < 3:
            print('{} FILENAME SHEET_NAME'.format(sys.argv[0]))
            return
    
        filename = sys.argv[1]
        sheet_name = sys.argv[2]
    
        write_wb = Workbook(write_only=True)
    
        read_wb = load_workbook(filename=filename, read_only=True)
        try:
            read_ws = read_wb[sheet_name]
        except KeyError:
            print(
                '{sheet_name} is not an existing sheet, sheet names are {names}'.format(
                    sheet_name=sheet_name, names=', '.join(read_wb.sheetnames)
                ))
        else:
            write_ws = write_wb.create_sheet()
    
            for row in read_ws.rows:
                write_ws.append([cell.value for cell in row])
    
            write_wb.save('write_only_file.xlsx')
    
    
    if __name__ == '__main__':
        for i in range(10):
            start = time.monotonic()
            main()
            end = time.monotonic()
    
            print(end - start)
    
      1. Alexander Schrijver author

        I used the excel file from this issue as input https://bitbucket.org/openpyxl/openpyxl/issues/494/ and this code https://bitbucket.org/!api/2.0/snippets/openpyxl/Ee9zqo/428f39c9dfaf9aee1dd23523905a5eee94b18ddd/files/snippet.txt

        Old version

        python snippet.txt
        
        Versions:
        python: 3.5.2
        xlread: 1.1.0
        openpyxl: 2.5.11
        
        xlrd
            Workbook loaded 44.48s
            OptimizationData 0.16s
            Output Model 0.00s
            >>DATA>> 0.00s
            Store days 0% 0.06s
            Store days 100% 0.05s
            Total time 44.75s
        
        openpyxl
            Workbook loaded 93.34s
            OptimizationData 3.11s
            Output Model 0.00s
            >>DATA>> 0.00s
            Store days 0% 2.65s
            Store days 100% 4.50s
            Total time 103.61s
        
        openpyxl, read-only
            Workbook loaded 0.76s
            OptimizationData 17.80s
            Output Model 0.00s
            >>DATA>> 0.00s
            Store days 0% 17.40s
            Store days 100% 21.04s
            Total time 57.00s
        
        openpyxl, read-only, values only
            Workbook loaded 0.76s
            OptimizationData 25.98s
            Output Model 0.00s
            >>DATA>> 0.00s
            Store days 0% 23.28s
            Store days 100% 17.83s
            Total time 67.85s
        

        New version with defusedxml

        python snippet.txt
        
        Versions:
        python: 3.5.2
        xlread: 1.1.0
        openpyxl: 2.5.11
        
        xlrd
            Workbook loaded 44.38s
            OptimizationData 0.17s
            Output Model 0.00s
            >>DATA>> 0.00s
            Store days 0% 0.06s
            Store days 100% 0.05s
            Total time 44.67s
        
        openpyxl
            Workbook loaded 118.20s
            OptimizationData 3.75s
            Output Model 0.00s
            >>DATA>> 0.00s
            Store days 0% 2.65s
            Store days 100% 4.40s
            Total time 129.00s
        
        openpyxl, read-only
            Workbook loaded 0.90s
            OptimizationData 25.32s
            Output Model 0.00s
            >>DATA>> 0.00s
            Store days 0% 33.04s
            Store days 100% 19.31s
            Total time 78.58s
        
        openpyxl, read-only, values only
            Workbook loaded 0.92s
            OptimizationData 33.16s
            Output Model 0.00s
            >>DATA>> 0.00s
            Store days 0% 31.76s
            Store days 100% 23.93s
            Total time 89.78s
        
        1. CharlieC

          Right, so a potentially significant performance penalty. No need to test all the combinations: it’s just a big file that gives the system a good workout. xlrd already has support for defusedxml but you have to pass it in as an option. It makes much more sense to encourage people to make defusedxml an optional dependency and use it if it’s around.

  4. CharlieC

    I’ve just been looking at the Python change log for 3.8 and it looks like at least entities bug has been fixed. It also looks like possible fixes for the other bugs are available but I don’t quite understand the details. It also looks like the Python developers would welcome a positive contribution here: https://bugs.python.org/issue17239

    1. Alexander Schrijver author

      Thanks for the link.

      I made some changes to the PR I’ve made defusedxml optional, and removed the test_vulns file (and the helper functions), which had the duplicate tests.

      1. CharlieC

        Merged and I’ll probably push a release today. Thanks for taking the time to do this and keep me informed re. the standard library. I’m sure Stefan Behnel would accept a relevant PR for lxml, but we need cElementTree for parsing really big files.