String matching regex in JsonLexer causes catastrophic backtracking

Issue #1361 new
Chris Howard
created an issue

The attached text file contains JSON data, downloaded from a request to google.co.uk (the text isn't actually legal JSON, but that's besides the point).

When using the JsonLexer to parse this file, the process hangs for a very long time, and the CPU of one core is maxed out.

The reason is catastrophic backtracking in the regex engine, caused by the regex used to match strings:

r'"(\\\\|\\"|[^"])*"'

I believe the problem is that the groups in the regex are not mutually exclusive, as the last group can match a backslash character.

The solution I have found is to make each group in the regex mutually exclusive, by adding a backslash character to the final group:

r'"(\\\\|\\"|[^"\\])*"'

Note that this particular regex is actually in two places (in both the simplevalue and objectvalue sections), and so both regexes should be updated.

A good description of the problem can be found here:

http://www.regular-expressions.info/catastrophic.html

After making these changes, the data in the attached file can be formatted in ~1s on my machine.

Comments (0)

  1. Log in to comment