Optimize string / byte (Base64) validation even further

Issue #259 resolved
Sven Döring created an issue

In addition to #225 there is an even more simple, yet far more faster Base64 validation possible.

Base64 strings consist of a 64-character alphabet¹. Hence the name Base64.

These characters are arbitrarily arranged in blocks of size four. The last block can be filled with (at most two) '=' to achieve the proper size of four.

This regex was used in the first version of the validation:

?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

This reflects the above description.

Further optimization of the Base64 validation is easy. Simply test each character of the string to see if it is present in the Base64 alphabet.

¹ The 64 characters: 0-9 a-z A-Z + /

Comments (6)

  1. Sven Döring reporter

    The performance test shown in #225 was extended by this optimization.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.StringReader;
    import java.util.Base64;
    import java.util.Random;
    import java.util.regex.Pattern;
    
    import org.junit.Assert;
    import org.junit.Test;
    
    import com.google.common.io.BaseEncoding;
    
    public class Base64AttributePerformanceTest {
    
        private static final int TEST_RUNS = 5_000;
        private static final int TEST_BYTE_ARRAY_LENGTH = 1_000_000;
    
        private static final String BASE64_STRING = createTestBase64();
        private static final String NO_BASE64_STRING = BASE64_STRING.substring(0, TEST_BYTE_ARRAY_LENGTH / 2) + "@" + BASE64_STRING.substring(TEST_BYTE_ARRAY_LENGTH / 2 + 1);
    
        private static final Pattern BASE64_PATTERN = Pattern.compile("^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$");
        private static final boolean[] BASE64_CHARACTERS = initBase64Characters();
    
        private static boolean[] initBase64Characters() {
            final boolean[] chars = new boolean[Character.MAX_VALUE];
            chars[43] = true; // '+'
            chars[47] = true; // '/'
            for (int i = 48; i <= 57; ++i) { // '0' to '9'
                chars[i] = true;
            }
            for (int i = 65; i <= 90; ++i) { // 'A' to 'Z'
                chars[i] = true;
            }
            for (int i = 97; i <= 122; ++i) { // 'a' to 'z'
                chars[i] = true;
            }
            return chars;
        }
    
        private static String createTestBase64() {
            final byte[] b = new byte[TEST_BYTE_ARRAY_LENGTH];
            new Random(0).nextBytes(b);
            return Base64.getEncoder().encodeToString(b);
        }
    
        private static boolean isBase64Decode(final String string) {
            try {
                final InputStream in = BaseEncoding.base64().decodingStream(new StringReader(string));
                int read;
                while ((read = in.read()) != -1) {
                }
                return true;
            } catch (final IOException e) {
                return false;
            }
        }
    
        private static boolean isBase64CharCheck(final String string) {
            final int length = string.length();
            if (length == 0) {
                return true;
            }
    
            // it is expected the Base64 string has padding - therefore its length is divisible by 4
            if (length % 4 != 0) {
                return false;
            }
    
            // check for padding at the end - which could be '', '=' or '=='
            final int end = (string.charAt(length - 1) != 61) ? length :
                    (string.charAt(length - 2) != 61 ? length - 1 : length - 2);
    
            // the remaining characters are from the Base64 alphabet only
            for (int i = 0; i < end; ++i) {
                if (!BASE64_CHARACTERS[string.charAt(i)]) {
                    return false;
                }
            }
            return true;
        }
    
        @Test
        public void testRegexCheck() {
            for (int i = TEST_RUNS; i > 0; --i) {
                if (!BASE64_PATTERN.matcher(BASE64_STRING).matches()) {
                    Assert.fail("This String is a Base64 String.");
                }
                if (BASE64_PATTERN.matcher(NO_BASE64_STRING).matches()) {
                    Assert.fail("This String is not a Base64 String.");
                }
            }
        }
    
        @Test
        public void testDecode() {
            for (int i = TEST_RUNS; i > 0; --i) {
                if (!isBase64Decode(BASE64_STRING)) {
                    Assert.fail("This String is a Base64 String.");
                }
                if (isBase64Decode(NO_BASE64_STRING)) {
                    Assert.fail("This String is not a Base64 String.");
                }
            }
        }
    
        @Test
        public void testCharCheck() {
            for (int i = TEST_RUNS; i > 0; --i) {
                if (!isBase64CharCheck(BASE64_STRING)) {
                    Assert.fail("This String is a Base64 String.");
                }
                if (isBase64CharCheck(NO_BASE64_STRING)) {
                    Assert.fail("This String is not a Base64 String.");
                }
            }
        }
    }
    

    That is a benchmark with checking 5000 valid and 5000 non valid Base64 strings of different length with different approaches.


    Approach 1 is checking the string with a regular expression. This is how the SRV validated Base64 strings up until v2.4.5.
    Approach 2 is decoding the string to their bytes. As of SRV v2.4.6 the Base64 validation is done this way.
    Approach 3 is checking the each character of the string as described above.

  2. Sven Döring reporter

    The approaches were tested in speed and memory usage.

    The tests run with:

    • Java 11
    • Intel 8770k
    • Epsilon GC

    Byte array length Regex check Decode Character check
    1.000 0.678s / 10.351KB 0.464s / 13.423KB 0.329s / 8.303KB
    10.000 3.728s / 11.128KB 1.302s / 14.712KB 0.340s / 8.303KB
    100.000 32.404s / 12.023KB 8.182s / 15.354KB 0.606s / 9.457KB
    1.000.000 316.781s / 20.860KB 77.746s / 23.970KB 2.695s / 18.683KB
    10.000.000 3197.124s / 107.520KB 769.089s / 110.592‬KB 22.976s / 104.448KB

    The simple character check is every time lower in memory usage and outclasses the other approaches in performance.

  3. Sven Döring reporter

    Base64 optimization is a long running story with #225 (since v2.4.6) and #214 (since v2.7.1).

    I did a rough test with the gains each iteration had by testing complete REST requests whose JSON had a 2.5MB big Base64 file with ``OpenApiInteractionValidator#validateRequest(Request)``.

    SVR v2.4.5 (no optimization) SVR v2.4.6 (replaced Base64 regex validation) SVR v2.7.1 (removed unnecessary, 2nd Base64 validation) latest PR #158 (Base64 alphabet check)
    ~125ms ~76ms ~35ms ~16ms

  4. Log in to comment