ZK Regex
Introduction
ZK Regex is an idea that uses zeroknowledge proofs (ZK) and regular expressions (Regex) to validate strings. The concept of zk regex was introduced through the zkemail project. I find it unique in the world of zk circuits and believe it will be very useful in attestations from web2 to web3.
In order to better understand how it works, I delved into the process of converting regular expressions to a circom circuit. Through this process, I refactored the circuit compiler into a standalone project called zkregex. This was done in an effort to improve usability and better align with the ways that regular expressions are used in conventional programming languages.
This process has given me a deeper understanding of how regular expressions work internally, specifically through NFA/DFA. By understanding its internals, others may be inspired to explore potential use cases for zk regex as a means of bridging the gap between web2 and web3.
In this post, I will first explain why zk regex is necessary and how it provides an advantage over common search via substring. Then, I will delve into the theory behind regex and how a regex circuit works to assert patterns, using examples, diagrams, and mathematical artifacts to illustrate.
Why ZK needs regex
In web3, where decentralized applications (dApps) are built on blockchain platforms like Ethereum, there is often a need to attest that a piece of content follows specific string patterns without revealing the content itself. This is typically done to prove authority from web2 in web3, but it can also serve other purposes, such as enforcing constraints for integrity.
However, submitting the whole content to a smart contract to attest to its pattern would leak the content and be counterproductive to the purposes of onchain infrastructures. This is where ZK Regex comes in.
ZK Regex allows for proving the pattern of a piece of content without revealing the content itself. This is achieved through the succinctness of zeroknowledge tech, which allows a prover to prove that they got content following a particular set of patterns without submitting the entire content.
Unlike other methods, such as substring searching, ZK Regex can prove the pattern of a piece of content without having to check every possible substring. Put simply, regular expressions offer more comprehensive encapsulation of search functionality. It can be used for both flexibility and privacy. The attester only requires the content to follow a particular set of patterns instead of any extract substrings. Thus, regex searching is better aligned with the purposes of zeroknowledge proof, enabling data to be processed at a higher degree of security and privacy.
How regex works
In a nutshell, there is a process of converting regular expression(regex) to DFA, such that the matching algorithm can run through a specified content and generate data points of various matches.
The basic workflow for converting a regex into a DFA is as follows:

Parse the expression string: The first step is to parse the regex expression string into a data structure, such as a parse tree or an abstract syntax tree (AST), representing the expression's structure. This process involves breaking down the expression into component parts, such as characters, operators, and subgroups, and representing them in a structured format.

Convert the parse tree to a NFA: The next step is to use the parse tree to construct a NFA that recognizes the language defined by the regex. An NFA is a finitestate machine with multiple transitions from a single state for a given input symbol.

Convert the NFA to a DFA: The NFA can then be converted into a DFA, a finite state machine with precisely one transition from a state for each possible input symbol. The conversion process involves using the NFA to construct a new DFA that recognizes the same language but deterministically.
Here I won’t explain the details of the process from regex to DFA. I recommend checking out this blog post to understand the theory behind the regular expression. This will focus on the explanation of converting DFA to a circom circuit.
DFA formal definition
To try to convey the features behind DFA for the regex matching algorithm, we describe its definition in a concise and standardized way.
where:
is a finite set of states. is the input alphabet, a finite set of symbols. is the transition function that takes a state and a symbol as inputs and returns a new state. is the initial state. is the set of final or accepting states.
In other words, a DFA is a directed graph where the nodes are states and the edges are labeled with input symbols that recognize a pattern.
DFA transition example
Let's use the regular expression d(ab)*c
as an example to illustrate the state transitions with the math notations defined above.
First, we can represent the DFA as a 5tuple:
, where is the initial state and is the accept state defined in . : We can represent this function using a table, where the rows represent the states and the columns represent the input symbols. The entry in row and column is the next state when the DFA is in state and reads the input symbol at row .
Using the dabc
.
stepbystep proof
Since we have reached the final state dabc
.
The following table and graph representations provide another visual perspectives to help understand the concept.
table representation
δ  

d  
a  
b  
c 
graph representation
graph LR s0((s0))  d > s1 s1((s1))  a > s2 s1((s1))  b > s2 s2  a > s2 s2  b > s2 s2  c > s3 s3((s3))
Transition fulfillment representation
In the circom circuit, what we really want to deal with are the data points, such as the number of entire matches, the number of subgroup matches, and the positions of the matches etc. Therefore, we need a convenient representation of these data points in circom circuits.
At the core, the relationships between character inputs and state transitions can be used to derive these data points.
Let's define these relationships as fulfillments
Wrapping up, we have
index  char  

1  d  1  
2  a  1  
3  b  1  
4  c  1 
With the data points from the circom execution, they can be used for the constraints at the outer layers.
For example, the table reflects the following data points.
 index 1 is the start position of the entire match, not because of
but induced from  fulfillments at
reflects the substring bc
match in the index range. That is the reveal string for the regular expression part (ab)*
.  fulfillments at
reflects the entire regular expression d(ab)*c
is matched, sinceis the accept state.
Note that in a normal regular expression algorithm, without the fulfillments at the accept state
Fulfillment representation to circuit
Using a circom language data structure states[i][j]
to record the results of
for (var i = 0; i < num_bytes; i++) {
/* test s1 fulfillment */
// matching char d
eq[0][i] = IsEqual();
eq[0][i].in[0] <== in[i];
eq[0][i].in[1] <== 100;
and[0][i] = AND();
and[0][i].a <== states[i][0];
and[0][i].b <== eq[0][i].out;
states[i+1][1] <== and[0][i].out;
/* test s2 fulfillment */
/** transition from different state **/
// matching char a
eq[1][i] = IsEqual();
eq[1][i].in[0] <== in[i];
eq[1][i].in[1] <== 97;
// matching char b
eq[2][i] = IsEqual();
eq[2][i].in[0] <== in[i];
eq[2][i].in[1] <== 98;
and[1][i] = AND();
and[1][i].a <== states[i][1];
multi_or[0][i] = MultiOR(2);
multi_or[0][i].in[0] <== eq[1][i].out;
multi_or[0][i].in[1] <== eq[2][i].out;
and[1][i].b <== multi_or[0][i].out;
/** transition from same state **/
// matching char a
eq[3][i] = IsEqual();
eq[3][i].in[0] <== in[i];
eq[3][i].in[1] <== 97;
// matching char b
eq[4][i] = IsEqual();
eq[4][i].in[0] <== in[i];
eq[4][i].in[1] <== 98;
and[2][i] = AND();
and[2][i].a <== states[i][2];
multi_or[1][i] = MultiOR(2);
multi_or[1][i].in[0] <== eq[3][i].out;
multi_or[1][i].in[1] <== eq[4][i].out;
and[2][i].b <== multi_or[1][i].out;
/** testing both transition paths **/
multi_or[2][i] = MultiOR(2);
multi_or[2][i].in[0] <== and[1][i].out;
multi_or[2][i].in[1] <== and[2][i].out;
states[i+1][2] <== multi_or[2][i].out;
/* test s3 fulfillment */
// matching char c
eq[5][i] = IsEqual();
eq[5][i].in[0] <== in[i];
eq[5][i].in[1] <== 99;
and[3][i] = AND();
and[3][i].a <== states[i][2];
and[3][i].b <== eq[5][i].out;
states[i+1][3] <== and[3][i].out;
}
Welcome
I hope this post has shed some light on the theory behind zkregex and provided practical insights into the relationship between theory and the generated circom circuit code.
If this has motivated you to explore further, that’s great! You can visit the zkregex to play around with the code. Don’t hesitate to share any suggestions or questions you might have!
I would like to give special thanks to Kelley Meck and Yush for reviewing and providing feedback, as well as to credit Sampriti Panda and Yush for their original work on the zkemail repository.