Backslashes accumulate when saving/loading iteratively

Issue #153 new
Johann Petrak created an issue

When pybtex writes to a file or converts entries to a string, it performs a number of substitutions, e.g. “_” to “\_” or “~” to “\textasciitilde”.

However, when the string is parsed or the file is loaded, these substitutions are NOT undone.

This illustrates the problem:

from pybtex.database import BibliographyData, Entry, parse_string

bd = BibliographyData()

e = Entry("misc", fields=dict(url="https:some.org/a?x=%2#x", keywords="as#sfdfd%dfdf_and~too"))
bd.add_entry("key1", e)

str1 = bd.to_string("bibtex")

bd2 = parse_string(str1, "bibtex")

print("URL Original: ", bd.entries["key1"].fields["url"])
print("URL ser/deser:", bd2.entries["key1"].fields["url"])

print("KW Original: ", bd.entries["key1"].fields["keywords"])
print("KW ser/deser:", bd2.entries["key1"].fields["keywords"])

This will output:

URL Original:  https:some.org/a?x=%2#x
URL ser/deser: https:some.org/a?x=\%2\#x
KW Original:  as#sfdfd%dfdf_and~too
KW ser/deser: as\#sfdfd\%dfdf\_and\textasciitilde too
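
To make the expected behaviour concrete, here is a minimal sketch of a symmetric encode/decode pair. The substitution table is modelled only on the output shown above, and the names `ENCODE`, `encode_field` and `decode_field` are made up for illustration; none of this is pybtex's actual code or API. It assumes the original field contains no backslashes of its own.

```python
import re

# Hypothetical substitution table, modelled on the output shown above.
# The real table in pybtex is larger; this is only an illustration.
ENCODE = {
    "#": "\\#",
    "%": "\\%",
    "&": "\\&",
    "_": "\\_",
    "~": "\\textasciitilde ",
}
DECODE = {escaped: raw for raw, escaped in ENCODE.items()}
_DECODE_RE = re.compile("|".join(re.escape(e) for e in DECODE))


def encode_field(text):
    """Apply the substitutions, as done on saving."""
    return "".join(ENCODE.get(ch, ch) for ch in text)


def decode_field(text):
    """Undo the substitutions, as loading would need to do."""
    return _DECODE_RE.sub(lambda m: DECODE[m.group(0)], text)


original = "as#sfdfd%dfdf_and~too"

# Without decoding, every cycle escapes the field again.
one_way = original
for _ in range(3):
    one_way = encode_field(one_way)

# With decoding, the field survives any number of cycles.
round_trip = original
for _ in range(3):
    round_trip = decode_field(encode_field(round_trip))

print(one_way)
print(round_trip)
```

In this model the field only stays stable when loading applies the inverse of what saving did, which is the behaviour I would expect from the parser.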

Comments (5)

  1. Johann Petrak reporter

    This is really bad, because pybtex adds backslashes on each iteration of serialization and deserialization:

    import pybtex
    from pybtex.database import BibliographyData, Entry, parse_string
    
    print("Version:", pybtex.__version__)
    bd = BibliographyData()
    
    e = Entry("misc", fields=dict(url="https:some.org/a?x=%2#x", keywords="as#sfdfd%dfdf_and~too"))
    bd.add_entry("key1", e)
    
    str1 = bd.to_string("bibtex")
    
    bd2 = parse_string(str1, "bibtex")
    
    print("URL Original: ", bd.entries["key1"].fields["url"])
    print("URL ser/deser:", bd2.entries["key1"].fields["url"])
    
    print("KW Original: ", bd.entries["key1"].fields["keywords"])
    print("KW ser/deser:", bd2.entries["key1"].fields["keywords"])
    
    # do it once more
    str2 = bd2.to_string("bibtex")
    bd3 = parse_string(str2, "bibtex")
    
    print("str2=", str2)
    print("URL ser/deser twice:", bd3.entries["key1"].fields["url"])
    print("KW ser/deser twice:", bd3.entries["key1"].fields["keywords"])
    
    # Once more
    str3 = bd3.to_string("bibtex")
    bd4 = parse_string(str3, "bibtex")
    
    print("str3=", str3)
    print("URL ser/deser three times:", bd4.entries["key1"].fields["url"])
    print("KW ser/deser three times:", bd4.entries["key1"].fields["keywords"])
    

  2. Johann Petrak reporter

    The bottom line is that any field containing a character that needs backslash-escaping in LaTeX (“&”, “%”, “_” and others) will ACCUMULATE backslashes on each save/load cycle: the additional backslashes are added on saving/serialization, and nothing is undone on loading/deserialization.
    I am a bit stumped as to how such a basic problem can still be in the code.

  3. Johann Petrak reporter

    BTW, this also happens when using pybtex-convert to repeatedly read a bibtex file A, save it to B, then read B, save it to C, etc. At each iteration, any underscore or hash character in a field gains another backslash.

  4. Paul Kuberry

    @Johann Petrak, I’m currently using the following workaround:

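    # db.to_file(f_out, bib_format='bibtex')  # <- what doesn't work correctly
    astr = db.to_string(bib_format='bibtex')
    # Undo the escaping added on serialization before writing the file.
    # Backslashes are doubled in the literals to avoid invalid escape sequences.
    astr = astr.replace("\\\\", "\\")
    astr = astr.replace("\\_", "_")
    astr = astr.replace("\\&", "&")
    astr = astr.replace("\\%", "%")
    with open(f_out, "w") as f_handle:
        f_handle.write(astr)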
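
For files that have already been through several save/load cycles, a variant that collapses a whole run of backslashes in one pass may be more convenient. This is only a sketch: `collapse_escapes` is a hypothetical helper, not part of pybtex, and it assumes that only “#”, “%”, “&” and “_” are affected and that no field intentionally contains a backslash directly before one of them.

```python
import re

# One or more backslashes directly before one of the characters
# that gain a backslash on every save.
_ESCAPE_RUN = re.compile(r"\\+([#%&_])")


def collapse_escapes(text, keep_escaped=True):
    """Reduce each run of backslashes before a special character.

    keep_escaped=True leaves exactly one backslash (valid LaTeX escaping);
    keep_escaped=False removes the backslashes entirely.
    """
    replacement = r"\\\1" if keep_escaped else r"\1"
    return _ESCAPE_RUN.sub(replacement, text)


# Three accumulated backslashes before the underscore become one.
print(collapse_escapes("key\\\\\\_name"))
# The same input with the escaping removed completely.
print(collapse_escapes("key\\\\\\_name", keep_escaped=False))
```

Calling it with `keep_escaped=False` on the result of `to_string` gives roughly what the chained `replace` calls above produce, while the default keeps a single escaping backslash so the file stays usable from LaTeX.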
    
