README

The findregex project is an exploratory research project concerned with searching spaces of input-output functions using smt.

What is your user story?

The basic user story is that of a python programmer who wants to find an existing regex that transforms some string input into a boolean found/not-found output. So the user enters the string input and expected output, or many pairs of input-output specifications, and findregex finds an existing regex that does what the user asks.

Why not generate a regex from the specifications?

Generating regexes directly from specifications may result in overfitting the generated result. To avoid this, the user may have to specify an unreasonable number of input-output pairs.
We are hoping that by re-using existing code, our user may find some regex that is performing well in the real world. This existing regex may have good features that the user did not think to specify, but are important.

Why are you searching for regexes?

The ulitmate goal of our research is to find novel ways to search for functions. We have chosen regular expressions as our function type because their inputs are simple Strings, and their outputs are boolean. This turns out to be a good match for smt solvers. Though the input-output signature is simple, regexes can be used for a wide variety of real world tasks. Findregex will build upon existing research that dealt with searching String functions using smt.

What is novel about your search strategy?

Instead of brute force searching an entire database of regexes, we hope we can make findregex use an abstraction approach, taking advantage of naturally occurring clusters of similarity within the search space, and the power of smt solvers. We hope to be able to represent most or all regexes in the database as a tree or several trees, and use the roots of these trees as filters, so that if an abstracted version of an encoded regex does not satisfy the specification, none of its children can satisfy it either. For roots that satisfy, their children can act as filters, and so on until the actual regexes are tested in their native language.

If this works, we think we could search a database of n regexes in log(n) time, since at each level we are eliminating a fraction of the entire space. If we are successful, we will have demonstrated a new way to search a space of functions using smt solvers.

What is your workflow?

Our workflow should look something like this:

We scrape regexes from public sources like github, etc.
We parse the regexes using a wonderful Perl Compatable Regular Expression (PCRE) parser written by Bart Kiers.
We encode them into smt using an original encoder.
We provide a web user interface for searching the encoded regexes.

What can you say about the codebase?

Most of the code is intended to be run once or twice to obtain the artifacts needed to do research. Generally, you scrape regexes using the 'regex' package, modify the parser (if absolutely necessary) using antlr and code found in the 'pcre' package, and tweak the encoder in the 'smt' package. Code generation is contained in the 'gen' package.
Dependencies: antlr-3.5.2-complete.jar, sqlite-jdbc4-3.8.2-SNAPSHOT.jar
How to run tests: in eclipse, right click on the project root and select Run As > JUnit Test

Who do I talk to if I have more questions?

Questions should be directed to Carl Chapman or Katie Stolee.

Wiki

findregex / Home