Need to specially represent BibTeX markup

Problem

BibTeX input is getting escaped as if it's plain text and not LaTeX.

Issue ~~#81~~ asked that UTF-8 strings be encoded as BibTeX strings with latexcodec, leading to commit eff6996. This change has led to issues ~~#101~~ and #118 as well as ad hoc special cases such as the fix to issue ~~#86~~ (commit f79b6ad). Despite these hacks, issues remain with other characters, such as _.

Example

As an example, consider the following program:

#! /usr/bin/env python3

import pybtex.database

print('pybtex version:', pybtex.__version__)

bib_data = pybtex.database.parse_file("test_in.bib", bib_format='bibtex')
bib_data.to_file('test_out.bib', bib_format='bibtex')

where the file test_in.bib holds

@inbook{sidak,
 author = {Bob I. \v{S}id\'ak},
 publisher = {John Wiley \& Sons},
 title = {Na\"ive~Bayes in $O(n^2 \log \phi_2)$ Time, A 50\% Improvement over Attempt \# 1 etc.\ of Smith},
 year = 2011,
 month = mar,
 url = {http://example.com/new_books/naive%20bayes.pdf},
 chapter = 3,
 pages = {95--105},
}

Run with Python 3.6.5, it prints pybtex version: 0.22.2 and writes test_out.bib with the content

@inbook{sidak,
    author = "\v{S}id\'ak, Bob I.",
    publisher = "John Wiley \\& Sons",
    title = {Na\"ive\textasciitilde Bayes in $O(n^2 \log \phi\_2)$ Time, A 50\\% Improvement over Attempt \\# 1 etc.\ of Smith},
    year = "2011",
    month = "March",
    url = "http://example.com/new\_books/naive\%20bayes.pdf",
    chapter = "3",
    pages = "95--105"
}

This shows the following problems:

\& becomes \\&
~ becomes \textasciitilde
\phi_2 becomes \phi\_2
50\% becomes 50\\%
\# becomes \\#

(It also shows mar becoming "March", as mentioned in issues #77 and #84. And it shows that % becomes \% in the URL, as mentioned in issue ~~#86~~.)

Cause

The immediate cause is using the latexcodec encode function backwards as mentioned elsewhere.
In particular, the code

codecs.encode(value, 'ulatex+{}'.format(self.encoding))

at line 107 of /pybtex/database/output/bibtex.py is called when value holds a string extracted from a BibTeX file, that is, a string that is already encoded as LaTeX (or, more precisely, the brackets-aware version of LaTeX used by BibTeX---see, e.g., issue #98 and this for details).
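The double escaping can be reproduced without Pybtex. The following is a minimal sketch, assuming a small hand-written escaping table (LITERAL_ESCAPES and encode_literal are hypothetical names, and the table is far smaller than latexcodec's real one). It treats its input as plain text, so feeding it a string that is already LaTeX gives the same symptoms as in the output above.

```python
# Hypothetical escaping table for illustration only; latexcodec's
# real translation table is much larger.
LITERAL_ESCAPES = {
    "&": r"\&",
    "%": r"\%",
    "#": r"\#",
    "_": r"\_",
    "~": r"\textasciitilde ",
}


def encode_literal(text: str) -> str:
    """Escape every special character as if `text` were plain text."""
    return "".join(LITERAL_ESCAPES.get(ch, ch) for ch in text)


# These inputs are already LaTeX, so escaping them again corrupts them.
print(encode_literal(r"John Wiley \& Sons"))  # John Wiley \\& Sons
print(encode_literal(r"\phi_2"))              # \phi\_2
print(encode_literal(r'Na\"ive~Bayes'))       # Na\"ive\textasciitilde Bayes
```

The encoder itself is doing what it was asked to do. The error is in calling it on a value that is not plain text.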

The deeper cause is that Pybtex is attempting to represent a markup language as if it were a character set. BibTeX uses escaping and context in such a way that some characters do not represent themselves but rather formatting instructions, which makes it a markup language. Pybtex appears to be attempting to strip out all such markup and translate it to a Unicode string in which every character literally represents itself. This internal representation can then be converted back to BibTeX by escaping characters as needed. However, this is a lossy process, since the internal literal Unicode strings cannot distinguish between what was markup and what wasn't in the input BibTeX.

For example, the BibTeX \_ will become u"_". This leaves no obvious Unicode code point for the BibTeX markup _ for subscripts, so it gets mapped to u"_" as well. This clash means that there's no way to get the original BibTeX back again.
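The clash can be shown with a toy decoder. This is a sketch under the assumption that decoding simply drops the backslash from a few escaped characters (decode_to_literal is a hypothetical helper, not a Pybtex or latexcodec function).

```python
import re


def decode_to_literal(bibtex: str) -> str:
    # Toy decoder: turn \_ \& \% \# into the bare character, so that
    # every character in the result stands only for itself.
    return re.sub(r"\\([_&%#])", r"\1", bibtex)


literal_underscore = r"\phi\_2"  # an escaped, literal underscore
subscript_markup = r"\phi_2"     # underscore used as subscript markup

# Two different BibTeX inputs collapse to one internal string, so no
# encoder can tell which of them to write back out.
print(decode_to_literal(literal_underscore) == decode_to_literal(subscript_markup))
```

Since the mapping is many-to-one, any inverse has to guess, which is the lossiness described above.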

Before using latexcodec's encode function, Pybtex's BibTeX output formatter presumed the markup case. By using the encode function, it now presumes the literal case and escapes more. Switching to use both the encode and decode functions of latexcodec will assume whatever latexcodec does. For example, the latexcodec README shows that both the LaTeX inputs # and \# get decoded as the Unicode u"#" and that u"#" gets encoded as \#, meaning the literal case will be presumed. Whichever of these approaches is used, Pybtex will sometimes be wrong.

Given the immediate cause, it might be tempting to decode the LaTeX strings when parsing the BibTeX file, as suggested in issue #112. However, as explained above, there is no way of using latexcodec that won't sometimes lose information. That may be acceptable for some use cases, such as including BibTeX references in documentation, but is not a general solution. For example, Bibo needs to be able to round-trip BibTeX unchanged. Because the current version of Pybtex does not do this, Bibo uses an old version of Pybtex and does not use latexcodec.

Possible Fixes

I can think of three possible fixes:

Pybtex could store strings in the representation used by the bibliography format parsed as input while also noting the representation used. Then, the output formatters can determine whether any translation is needed at the point where the output format is known. Knowing both the input format and output format will enable better translations that won't lose information. This will work well for round trips but will require a quadratic number of converters.
Pybtex could store strings in Unicode while using a Private Use Area for markup. For example, _ could become the private code point U+E000. Getting this right is subtle and may start to look like reinventing BibTeX.
Pybtex could internally use a markup, say BibTeX. This will avoid the need to roll your own encoding, but may not be flexible enough to handle features of some other input formats. That could be addressed by using an extended form of BibTeX, or some more advanced markup.
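
To make the first two options more concrete, here is a rough sketch of each. Everything in it is hypothetical and not part of Pybtex: the MarkupString type, the write_bibtex function, the small escaping table, and the choice of U+E000 for the subscript markup. It handles only the underscore case and ignores complications such as an escaped backslash followed by an underscore.

```python
import re
from dataclasses import dataclass


# Option 1: keep the text in its source markup and record which markup it is.
@dataclass(frozen=True)
class MarkupString:  # hypothetical type, not part of Pybtex
    text: str
    markup: str  # e.g. "bibtex" or "plain"


PLAIN_TO_BIBTEX = {"&": r"\&", "%": r"\%", "#": r"\#", "_": r"\_"}


def write_bibtex(value: MarkupString) -> str:
    if value.markup == "bibtex":
        return value.text  # same format in and out: no translation needed
    if value.markup == "plain":
        return "".join(PLAIN_TO_BIBTEX.get(ch, ch) for ch in value.text)
    raise ValueError(f"no converter from {value.markup!r} to bibtex")


# Option 2: Unicode text with a Private Use Area code point for markup.
SUBSCRIPT = "\ue000"  # assumed code point standing for the markup character _


def bibtex_to_internal(text: str) -> str:
    # \_ becomes a literal underscore; a bare _ becomes the private code point.
    return re.sub(
        r"\\_|_",
        lambda m: "_" if m.group() == r"\_" else SUBSCRIPT,
        text,
    )


def internal_to_bibtex(text: str) -> str:
    # Escape literal underscores first, then restore the markup underscore.
    return text.replace("_", r"\_").replace(SUBSCRIPT, "_")


source = r"$\phi_2$ and new\_books"
print(write_bibtex(MarkupString(source, "bibtex")) == source)
print(internal_to_bibtex(bibtex_to_internal(source)) == source)
```

In both sketches the two kinds of underscore stay distinguishable, which is what makes the round trip possible. The first needs a converter per pair of formats, and the second needs a private code point per markup construct.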
