Regular expression matching for strings

Arguably, a major feature that is lacking from nfer is the ability to match strings using anything other than equality. Specifically, one would expect to use regular expressions, since these are typically the most effective tool in a programmer’s toolbox to match strings conforming to some pattern. Additionally, one would like to be able to parse out groups from the matched strings and then assign them to map keys as new string values.

Adding this to the language should not affect complexity, I believe, because although transforming the regex into a DFA requires an exponential blowup in the number of states actually running it is linear in the length of the string.

There are many places in the code that will need to change to accommodate this feature.

An NFA/DFA framework will need to be implemented so that an NFA can be translated to a DFA and a DFA can be executed against a string. The DFA execution will be typically called from the event parsing code or from the Python/R APIs. It should probably be agnostic to how the string is delimited. There will need to be code to read a regular expression and transform it into an NFA, of course, too.
The event parsing code will need to have access to the DFAs to use to match with the understanding that there can be more than one per event per field. Same for Python/R. The matching should be done on the raw string and it should never be stored as a string value if it isn’t used as such (could be both).
- This raises the important question - should we be trying to restrict the storage of fields not just to ones that are referenced in a specification but to ones that are referenced from events with that name. This would be a memory saving enhancement, but it is arguable whether or not it is worth the trouble. Doing so would require processing the whole spec to see how fields flow throughout. Probably, this can be done by defining a simple global data flow analysis over the specification graph. Is it worth it, though? The amount of memory saved is probably not great.
- If we do restrict the field storage to only those appropriate for the event name, then we need to decide if there should be a list of all the regular expressions that need to be checked for the particular field. This probably makes sense if we go to the effort of doing the former.
Some parsing code will also need to recognize the use in the DSL. For example the following should test the str field and pull out a group and store it as an id. The group parsing code should be smart enough to cast things to numbers where applicable. If there are multiple regexes then the group numbers will have to continue from the last one, which will get pretty tricky probably.
- a :- b before c where b.str =~ /id:([^,]*),/ map { id -> $1 }

‌

typedef struct _finite_automaton { 
  bool deterministic; 
  unsigned int size; 
  fa_node *nodes; 
} finite_automaton;

typedef struct _fa_node { 
  // not sure if a map actually is ideal for a transition table - maybe something new 
  data_map transition_table; 
} fa_node;

typedef struct _dfa_match { 
  bool matched; 
  data_map *groups; 
} dfa_match;

void get_nfa_from_regex(finite_automaton nfa, char regex, int regex_length); 
void determinize_nfa(finite_automaton nfa); 
void match(dfa_match result, finite_automaton dfa, char string, int string_length);

‌

Comments (2)