Heavy Allocation in Emitter.analyzeScalar(String) due to Regex Overhead
Writing large yaml documents using snakeyml I’ve found GC contention resulting from integer array allocations within Pattern.matcher
.
I believe we can dramatically improve performance by refactoring the leading-zero check away from regex and into a simple utility method along these lines:
private static boolean hasLeadingZero(String scalar) {
if (scalar.length() > 1 && scalar.charAt(0) == '0') {
int secondChar = scalar.charAt(1);
return (secondChar >= '0' && secondChar <= '9') || secondChar == '_';
}
return false;
}
Comments (15)
-
reporter -
- changed status to open
-
- marked as enhancement
-
The tests fail. Please run the tests locally
./mvn clean verify
-
you only check first and second char.
-
-
assigned issue to
-
assigned issue to
-
Can you please contribute the test where you measure the memory consumption ?
-
- marked as minor
-
reporter I’ve put together a test here: https://github.com/snakeyaml/snakeyaml/pull/6 since the original PR was merged while I worked on writing the test (thank you for such a quick review!). This sort of test is tricky because garbage collectors are complex machines, and it can be challenging to test specifically what we mean to without making the build unnecessarily expensive or flaky.
-
How do we know that anything is improved ?
What was the setup for memory consumption ? -
reporter Here’s the JMH benchmark I described earlier, which does a better job of illustrating the high rate of allocation: https://github.com/snakeyaml/snakeyaml/pull/7
Benchmark results and analysis copied here from the github pr for posterity.
Before 26f5d5f8 (benchmarks run atop 98ff897):
Benchmark Mode Cnt Score Error Units EmitterBenchmark.emitScalar avgt 3 0.311 ± 0.009 us/op EmitterBenchmark.emitScalar:·gc.alloc.rate avgt 3 6697.479 ± 420.474 MB/sec EmitterBenchmark.emitScalar:·gc.alloc.rate.norm avgt 3 256.000 ± 0.001 B/op EmitterBenchmark.emitScalar:·gc.count avgt 3 108.000 counts EmitterBenchmark.emitScalar:·gc.time avgt 3 62.000 ms
After 26f5d5f8 (benchmarks run atop 8a3063b):
Benchmark Mode Cnt Score Error Units EmitterBenchmark.emitScalar avgt 3 0.220 ± 0.006 us/op EmitterBenchmark.emitScalar:·gc.alloc.rate avgt 3 4713.283 ± 303.238 MB/sec EmitterBenchmark.emitScalar:·gc.alloc.rate.norm avgt 3 128.000 ± 0.001 B/op EmitterBenchmark.emitScalar:·gc.count avgt 3 116.000 counts EmitterBenchmark.emitScalar:·gc.time avgt 3 63.000 ms
The benchmarks were run on a macbook, so cpu scaling makes the timing values a bit chaotic and I wouldn't put too much trust in the ~30% speedup shown. The important piece is
gc.alloc.rate.norm
which is the garbage collected bytes peremitScalar
invocation, which cut in half. -
reporter - changed status to resolved
Resolving this because https://bitbucket.org/snakeyaml/snakeyaml/commits/26f5d5f82668aa396ca201b524159ab45158d565 has merged.
-
How do you produce the result ? Can you please provide the complete instructions to reproduce it ?
-
reporter Using the code change in https://github.com/snakeyaml/snakeyaml/pull/7, I executed this main method in my IDE: https://github.com/snakeyaml/snakeyaml/pull/7/files#diff-0dced93986ac61628c4b3093981408316cc847f3db657af126ed671b92d1f5cdR60-R63
I opted for a simple configuration, using a separate module for benchmarks is more standard, based on this maven archetype: https://github.com/openjdk/jmh#preferred-usage-command-line
-
It will be delivered in version 2.1
- Log in to comment
I’ve created a PR here with my proposed solution: https://github.com/snakeyaml/snakeyaml/pull/5
There are a few similar regex patterns. I don’t see them in profiling data, so I opted not to attempt replacement. In a fairly large application, integer arrays within
analyzeScalar
regex account for 83% of total allocation exceeding 9 terabytes per minute.