Optimize string / byte (Base64) validation

Issue #225 resolved
Sven Döring created an issue

The OpenAPI type string / byte is a Base64 string representing an byte array.

The SRV checks the validity of those strings with a long regex in Base64Attribute: "^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"

It does its job.

Unfortunately Regex matches in Java are awfully slow. It’s actually eleven - 22 times faster to decode a Base64 String and check if an exception was thrown in case of invalidity.

But decoding is quite heavy on memory usage.

There is still an alternative. Guava offers an Base64 decoding stream. If the stream reads successfully the string is valid.

Below is some benchmarking code showing the time and memory usage for both validations.

Base64 String length Base64 RegEx match Base64 decoding stream
1.000 0.518s / 9313K 0.411s / 10849K
10.000 1.656s / 10090K 0.734s / 10849K
100.000 13.458s / 10578K 3.648s / 12268K
1.000.000 114.275s / 19271K 31.516s / 20936K
10.000.000 1302.552s / 105472k 233.884s / 87851K

Tested with

  • Java 11
  • Intel 8770k
  • the very useful Epsilon GC

the decoding stream validation is permanently faster with only marginal higher memory usage.

import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.Base64;
import java.util.Random;
import java.util.regex.Pattern;

import org.junit.Assert;
import org.junit.Test;

import com.google.common.io.BaseEncoding;

public class Base64AttributeTest {

    private static final String BASE64_STRING = createTestBase64();
    private static final String NO_BASE64_STRING = BASE64_STRING.substring(0, 5_000_000) + "@" + BASE64_STRING.substring(5_000_001);
    private static final Pattern BASE64_PATTERN = Pattern.compile("^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$");
    private static final int TEST_RUNS = 2_000;

    private static String createTestBase64() {
        final byte[] b = new byte[10_000_000];
        new Random().nextBytes(b);
        return Base64.getEncoder().encodeToString(b);
    }

    @Test
    public void testRegex() {
        for (int i = TEST_RUNS; i > 0; --i) {
            if (!BASE64_PATTERN.matcher(BASE64_STRING).matches()) {
                Assert.fail("This String is a Base64 String.");
            }
            if (BASE64_PATTERN.matcher(NO_BASE64_STRING).matches()) {
                Assert.fail("This String is not a Base64 String.");
            }
        }
    }

    @Test
    public void testDecode() {
        for (int i = TEST_RUNS; i > 0; --i) {
            if (!isBase64(BASE64_STRING)) {
                Assert.fail("This String is a Base64 String.");
            }
            if (isBase64(NO_BASE64_STRING)) {
                Assert.fail("This String is not a Base64 String.");
            }
        }
    }

    private static boolean isBase64(final String string) {
        try {
            final InputStream in = BaseEncoding.base64().decodingStream(new StringReader(string));
            int read;
            while ((read = in.read()) != -1) {
            }
            return true;
        } catch (final IOException e) {
            return false;
        }
    }
}

Comments (4)

  1. James Navin

    Thanks for raising this, and for the super-thorough perf testing.

    Out of curiosity - what is your use case that has such large Base64 encoded strings?

  2. Sven Döring reporter

    @James Navin We are sending multiple high resolution images as part of a JSON in the request body.

    These images are part of a bigger object containing additional Integer and String fields.

  3. Log in to comment