Support of rich text in cells

#224 Open
Repository
Cilyan
Branch
2.6
Repository
openpyxl
Branch
2.6

Bitbucket cannot automatically merge this request.

The commits that make up this pull request have been removed.

Bitbucket cannot automatically merge this request due to conflicts.

Review the conflicts on the Overview tab. You can then either decline the request or merge it manually on your local system using the following commands:

hg update 2.6
hg pull -r 2.6 https://bitbucket.org/Cilyan/openpyxl
hg merge e83320cf66d6
hg commit -m 'Merged in Cilyan/openpyxl/2.6 (pull request #224)'
Author
  1. Cilyan Olowen
Reviewers
Description

Dear all,

as I needed the feature, I invested some little time looking into the support of rich text in cells. This feature allows to set font style to only parts of a cell's text, in comparison with cell style (already supported) that styles the whole content.

I created the pull request in order to discuss with you, it is clearly not ready for integration. I planned on working further on it, but any advise/review comment is welcome as it's my first time hacking into openpyxl.

As cell style is already implemented and the rich text feature is nothing more than Font inside the shared strings, I tried to reuse as much as possible the existing code. Also, I did not want to break the compatibility of the Cell.value property since probably a lot of user program do not expect to receive rich text on that interface. So even if cell contains rich text, Cell.value returns a plain text version of it. At first, I used a separate internal property so that _value is also not changed. But in the end, I thought that user program that was low level enough to interface with Cell.internal_value should be able to distinguish between plain text and rich text content. Comments welcome on that decision.

The proposed implementation is based on a type named RichTextContent implemented in openpyxl.cell.rich_text. The name was selected so that it doesn't conflict with openpyxl.cell.text.RichText, although this later class is only used internally to read XML. Maybe it would be more intuitive to rename the original class and use RichText for the new class.

The RichTextContent class is responsible for holding the content of the cell when it is rich text instead of plain string. The class implements basic extend/append/add functions as well as + and += operators to present a (hopefully) nice interface to the user. The class is also responsible for its serialisation, although I did not inherit Serialisable to not clutter the namespace too much in a class that is supposed to be exposed to the user code. I modified write_shared_string to call the RichTextContent function when it recognises that the shared string is not plain text. The compatibility with IndexedList is ensured by the fact that objects have a hash and as such can be used as dict keys. However, this is limited to recognising the same object and not two objects that would contain the same content. A further improvement can be made by computing a hash from the content instead and implementing comparision functions (__hash__ and __eq__).

To have access to the rich text content and make modifications, one would use the new Cell.rich_text property. The Cell.rich_text propery always return a RichTextContent object even if the internal value is a plain string. This is done so that manipulation and mostly concatenation is made easy and transparent to the used. The property Cell.is_rich_text can then be used to query if internal content of a cell is indeed rich text or plain text. In the end, it doesn't matter much to the user, as RichTextContent is intelligent enough to produce plain text XML when it contains no styles. The only difference would be at the Cell._value level.

The most important limitation of the current implementation is that it doesn't write properly the xml:space="preserve" property when necessary. As such, space between text chunks are not observed. For this, I would need advice, because I could not properly understand how to support that attribute through the Serialisable interface of Text. Maybe by having a specialisation of NestedText? I am not sure.

I did not code yet an implementation for the reader. When the writer is finished and the Cell interface makes everyone happy, it should be straightforward to convert openpyxl.cell.text.Text to RichTextContent before placing the value in the cell, instead of calling Text.content.

Another discrepancy that would need care is the fact that if you save the file once, then change a cell's text and save it again, the new file will also contain strings that are no longer used. This happens the same for plain text in the current implementation. I think clearing the shared strings table before writing the file would be sufficient, but maybe not very efficient. Comments welcome.

>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> cell = ws["A1"]
>>> cell.rich_text
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    cell.rich_text
  File "C:\Anaconda3\envs\openpyxl\lib\site-packages\openpyxl\cell\cell.py", line 328, in rich_text
    self.__class__.__name__
AttributeError: 'Cell' object as no attribute 'rich_text'
>>> from openpyxl.styles.fonts import Font
>>> bold_red = Font(name='Calibri',size=11,bold=True,italic=False,vertAlign=None,underline='none',strike=False,color='FFFF0000')
>>> cell.rich_text = "Hello "
>>> cell.rich_text += ("world", bold_red)
>>> cell.rich_text
<openpyxl.cell.rich_text.RichTextContent object at 0x0000000003C0AF98>
>>> cell.value
'Hello world'
>>> wb.save(r"W:\RichText.xlsx")

2018-01-25 16_30_43-RichText.xlsx - Excel.png

Comments (12)

  1. CharlieC

    Thanks for this. It is interesting and, as you note, most of the code required is already in the library.

    However, I remain fairly strongly opposed to supporting the feature at all. The file format is spectacularly unsuited for text in general and sub-cell formatting in particular, which is why the XML is as awful as it is.

    If this were to be supported then I would probably drop the creation of the sharedStrings table altogether and go inline all the way. That would at least simplify things. But I really, really hate the associated XML.

    Also, as you note, there are issues for user code particularly with duplicate strings.

  2. Cilyan Olowen author

    Hi Charlie and thanks for your quick answer. I understand your point of view and as I don't know much of the format except for the rich text and shared string part, I can't really argue. They say that it improves performance. Who knows? And well, anyway, it's XML... I mean, I was expecting it to be ugly so I did not disturb me a lot :)

    I cannot presume of what cost it adds to maintainance, but maybe waiting for a proper rewrite of including the strings inside the cells (I wonder if the format allows it by the way?), users like me will be happy to see some support of the feature. On my side, I will try to finish some basic support for my needs and probably report here anyway in case other users are interested.

    You're welcome to bring your ideas/guidance meanwhile, so that my code is not too far from your standards. One day, who knows, you will maybe have a change of mind? :)

    1. CharlieC

      sharedStrings is very much a legacy element: it allows memory optimisation in the spreadsheet by using only numbers in cells. You can imagine how much this mattered on older PCs with perhaps only a couple of MB RAM. The specification allows both possibilities and applications like Google Sheets essentially do the sensible thing and do it all inline. But my god is it verbose! See 18.3.1.53 is (Rich Text Inline) in the spec.

      I suspect that the main use case is people wanting to preserve this kind of formatting when editing existing files so what it looks like to the user is probably less important than we might imagine. You suggestion for creating rich text is reasonable but the value part remains a problem when reading files: people will almost always expect to see the plaintext value but have formatting preserved. Some kind of additional attribute would probably make sense so that it is queryable and appropriately editable.

      But it really is basically an admission of failure to have this possibility in the specification at all.

  3. Cilyan Olowen author

    I updated the code following your advice (I could not find yet how to easily see the differences between versions of the pull request in Bitbucket...). I also added support for xml:space="preserve", but not sure if that's the right way. At least it works... I checked the <is> tag. The bright side is that when you decide to go for it, you would be able to just use RichTextContent.to_tree to get the XML you need (minus the si tag that could be made an argument of to_tree)

  4. CharlieC

    A better approach here would be to use something like a Paragraph type for the cell value. This could be controlled by a flag when reading files. 2.6 defaults to writing inline strings for performance reasons so serialisation wouldn’t be hard.

  5. Timothy Wall

    Yes, the XML is a mess for RTF, but that’s why you’ve got a computer reading it and writing it.

    My use case is providing a reasonable facsimile of an application’s display without having to use an additional cell for every change in display element.

  6. Andrew B.

    I was actually just looking for this functionality and stumbled on this PR – your example code looks simple enough to implement once this PR is merged 🙂

  7. Cilyan Olowen author

    I’m still tracking this, especially I wanted to add unit tests to stabilize the code. Unfortunately, for now I need to prioritize other topics. I hope to be able to get back to this soon.

  8. David Wheeler

    Paragraph/Run code exists in python-docx library - have not looked to see how readily adaptable it would be for use in openpyxl.