remove checking SUPPORTED_FORMATS by file extension

Issue #512 closed
kelvin22 created an issue

Is it necessary/appropriate to check the supported format of the source files using the file extension?

This breaks the pattern of using urllib to load files from network sources, which saves to a tmp file (with no file extension) and passes the path on for loading:

filepath, headers = urllib.request.urlretrieve('http://example.com/file.xlsx')
load_workbook(filepath)

It also deviates from the behaviour of xlrd, and other file loading functions in pandas, petl, etc.

Would like to propose removing the check, or using something other than the file extension (perhaps the file contents) to check for format support.
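As a rough illustration of what a contents-based check could look like: an .xlsx file is a ZIP package that carries a `[Content_Types].xml` manifest, so both can be tested without looking at the file name. This is only a sketch, not what openpyxl does; the function name `looks_like_xlsx` is hypothetical, and the check cannot tell a workbook apart from other Office Open XML packages such as .docx.

```python
import zipfile


def looks_like_xlsx(path):
    """Heuristic (hypothetical helper): True if `path` is a ZIP archive
    that contains the Open Packaging Conventions manifest."""
    # is_zipfile inspects the file contents, so a missing extension is fine.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # Every OPC package (xlsx, docx, pptx) has this entry, so this is a
    # necessary condition for a workbook, not a sufficient one.
    return "[Content_Types].xml" in names
```

With a check like this, the tmp file returned by urlretrieve would pass regardless of its name.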

Comments (5)

  1. kelvin22 reporter

    Yes, but then it's necessary to locate the tmp folder and manage that as well.

    My workaround so far is fairly simple, involves adding the file extension for openpyxl, but shouldn't be necessary:

    if url.endswith('.xlsx'):
        filepath, headers = urllib.request.urlretrieve(url)
        os.rename(filepath, filepath + '.xlsx')
        filepath = filepath + '.xlsx'
    

    and later, instead of using urllib.request.urlcleanup(), I can use

    os.remove(self.filepath)
    
  2. CharlieC

    If you can provide better heuristics for weeding out invalid files, then we could make the change. I agree the current file-extension checks are far from perfect, but they are surprisingly good in practice. In the meantime, your code can probably be simplified to use a NamedTemporaryFile.
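The NamedTemporaryFile suggestion might look roughly like the following. This is a sketch under assumptions: the helper name `copy_to_named_tempfile` is made up for illustration, and the download is shown with `urllib.request.urlopen` only in comments.

```python
import shutil
import tempfile


def copy_to_named_tempfile(source, suffix=".xlsx"):
    """Copy a binary file-like object into a temp file whose name ends in `suffix`.

    Returns the path; the caller is responsible for removing the file.
    """
    # delete=False keeps the file on disk after it is closed, so it can be
    # reopened by path (required on Windows, harmless elsewhere).
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(source, tmp)
        return tmp.name


# Possible usage (not executed here; `url` is a placeholder):
#
#     with urllib.request.urlopen(url) as response:
#         filepath = copy_to_named_tempfile(response)
#     wb = load_workbook(filepath)
#     os.remove(filepath)
```

This avoids the rename step, because the temp file is created with the extension already in place.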
