Towards Efficient Matching of Regexes with Backreferences using Register Set Automata
Matching regexes (regular expressions) is a common problem in many areas of computer science, with requirements on high speed and robust performance. Regexes with backreferences allow one to express certain patterns (even beyond regular) concisely, however, since the matching is usually done by backtracking, the matching speed can degrade to a degree that constitutes a service failure or a security threat. To facilitate high-speed matching of such regexes, we propose register set automata (RSAs), an extension of register automata where registers can contain sets of symbols (from a potentially infinite alphabet) and the following operations are supported: adding input values to registers, merging or clearing registers, and testing whether a register contains a value. We show that a large class of register automata can be transformed into deterministic RSAs, which can serve as a basis for fast matching of a family of regexes with single-letter capture groups and backreferences. We also give a derivative-based algorithm for transforming a large class of regexes with backreferences to register automata and show that the time complexity of matching is linear and quadratic to the length of the input for finite and infinite alphabets respectively. Our prototype implementation of a regex matcher shows that our approach can significantly improve the robustness of state-of-the-art regex matchers on regexes with backreferences. We also study the theoretical properties of the model and show that the emptiness problem for RSAs is decidable and complete for the $\mathbf{F}_\omega$ class and that RSAs are incomparable in expressive power to other popular automata models over data words.