ZK Regex

tumblr_pep7kan09p1qbgdnxo1_1280.png

Introduction

ZK Regex is an idea that uses zero-knowledge proofs (ZK) and regular expressions (Regex) to validate strings. The concept of zk regex was introduced through the zk-email project. I find it unique in the world of zk circuits and believe it will be very useful in attestations from web2 to web3.

In order to better understand how it works, I delved into the process of converting regular expressions to a circom circuit. Through this process, I refactored the circuit compiler into a standalone project called zk-regex. This was done in an effort to improve usability and better align with the ways that regular expressions are used in conventional programming languages.

This process has given me a deeper understanding of how regular expressions work internally, specifically through NFA/DFA. By understanding its internals, others may be inspired to explore potential use cases for zk regex as a means of bridging the gap between web2 and web3.

In this post, I will first explain why zk regex is necessary and how it provides an advantage over common search via substring. Then, I will delve into the theory behind regex and how a regex circuit works to assert patterns, using examples, diagrams, and mathematical artifacts to illustrate.

Why ZK needs regex

In web3, where decentralized applications (dApps) are built on blockchain platforms like Ethereum, there is often a need to attest that a piece of content follows specific string patterns without revealing the content itself. This is typically done to prove authority from web2 in web3, but it can also serve other purposes, such as enforcing constraints for integrity.

However, submitting the whole content to a smart contract to attest to its pattern would leak the content and be counterproductive to the purposes of on-chain infrastructures. This is where ZK Regex comes in.

ZK Regex allows for proving the pattern of a piece of content without revealing the content itself. This is achieved through the succinctness of zero-knowledge tech, which allows a prover to prove that they got content following a particular set of patterns without submitting the entire content.

Unlike other methods, such as substring searching, ZK Regex can prove the pattern of a piece of content without having to check every possible substring. Put simply, regular expressions offer more comprehensive encapsulation of search functionality. It can be used for both flexibility and privacy. The attester only requires the content to follow a particular set of patterns instead of any extract substrings. Thus, regex searching is better aligned with the purposes of zero-knowledge proof, enabling data to be processed at a higher degree of security and privacy.

How regex works

In a nutshell, there is a process of converting regular expression(regex) to DFA, such that the matching algorithm can run through a specified content and generate data points of various matches.

The basic workflow for converting a regex into a DFA is as follows:

  1. Parse the expression string: The first step is to parse the regex expression string into a data structure, such as a parse tree or an abstract syntax tree (AST), representing the expression's structure. This process involves breaking down the expression into component parts, such as characters, operators, and sub-groups, and representing them in a structured format.

  2. Convert the parse tree to a NFA: The next step is to use the parse tree to construct a NFA that recognizes the language defined by the regex. An NFA is a finite-state machine with multiple transitions from a single state for a given input symbol.

  3. Convert the NFA to a DFA: The NFA can then be converted into a DFA, a finite state machine with precisely one transition from a state for each possible input symbol. The conversion process involves using the NFA to construct a new DFA that recognizes the same language but deterministically.

Here I won’t explain the details of the process from regex to DFA. I recommend checking out this blog post to understand the theory behind the regular expression. This will focus on the explanation of converting DFA to a circom circuit.

DFA formal definition

To try to convey the features behind DFA for the regex matching algorithm, we describe its definition in a concise and standardized way.

M=(S,Σ,δ,s0,F)

where:

In other words, a DFA is a directed graph where the nodes are states and the edges are labeled with input symbols that recognize a pattern.

DFA transition example

Let's use the regular expression d(a|b)*c as an example to illustrate the state transitions with the math notations defined above.

First, we can represent the DFA as a 5-tuple: (S,Σ,δ,s0,F), where:

Using the M defined above, we can prove that the DFA accepts the string dabc.


step-by-step proof

(1)δ(d,s0)=s1(2)δ(a,s1)=s2(3)δ(b,s1)=s2(4)δ(a,s2)=s2(5)δ(b,s2)=s2(6)δ(c,s2)=s3

Since we have reached the final state s3, the DFA accepts the string dabc.

The following table and graph representations provide another visual perspectives to help understand the concept.


table representation

δ s0 s1 s2
d s1
a s2
b s2
c s3

graph representation

graph LR

s0((s0)) -- d --> s1

s1((s1)) -- a --> s2
s1((s1)) -- b --> s2

s2 -- a --> s2
s2 -- b --> s2

s2 -- c --> s3

s3((s3))

Transition fulfillment representation

In the circom circuit, what we really want to deal with are the data points, such as the number of entire matches, the number of sub-group matches, and the positions of the matches etc. Therefore, we need a convenient representation of these data points in circom circuits.

At the core, the relationships between character inputs and state transitions can be used to derive these data points.

Let's define these relationships as fulfillments f:X×Y0,1, such that


f=1 when X=Yf=0 when XY

Wrapping up, we have f(δ(char(i),sj1),sj) to indicate whether or not a state is fulfilled with an input character. Then these indications, as the following table shows, can be used to derive the desired data points introduced above.


index char s1 s2 s3
1 d 1
2 a 1
3 b 1
4 c 1

With the data points from the circom execution, they can be used for the constraints at the outer layers.

For example, the table reflects the following data points.

Note that in a normal regular expression algorithm, without the fulfillments at the accept state s3 in this example, the fulfillments at previous states are not valid.


Fulfillment representation to circuit

Using a circom language data structure states[i][j] to record the results of f(δ(char(i),sj1),sj), here is the key section of generated circuit code based on the rules derived from the δ state transition table:

for (var i = 0; i < num_bytes; i++) {
	/* test s1 fulfillment */
	// matching char d
	eq[0][i] = IsEqual();
	eq[0][i].in[0] <== in[i];
	eq[0][i].in[1] <== 100;
	and[0][i] = AND();
	and[0][i].a <== states[i][0];
	and[0][i].b <== eq[0][i].out;
	states[i+1][1] <== and[0][i].out;

	/* test s2 fulfillment */
	/** transition from different state **/
	
	// matching char a
	eq[1][i] = IsEqual();
	eq[1][i].in[0] <== in[i];
	eq[1][i].in[1] <== 97;
	// matching char b
	eq[2][i] = IsEqual();
	eq[2][i].in[0] <== in[i];
	eq[2][i].in[1] <== 98;

	and[1][i] = AND();
	and[1][i].a <== states[i][1];
	
	multi_or[0][i] = MultiOR(2);
	multi_or[0][i].in[0] <== eq[1][i].out;
	multi_or[0][i].in[1] <== eq[2][i].out;
	
	and[1][i].b <== multi_or[0][i].out;

	/** transition from same state **/
	
	// matching char a
	eq[3][i] = IsEqual();
	eq[3][i].in[0] <== in[i];
	eq[3][i].in[1] <== 97;
	// matching char b
	eq[4][i] = IsEqual();
	eq[4][i].in[0] <== in[i];
	eq[4][i].in[1] <== 98;
	
	and[2][i] = AND();
	and[2][i].a <== states[i][2];
	
	multi_or[1][i] = MultiOR(2);
	multi_or[1][i].in[0] <== eq[3][i].out;
	multi_or[1][i].in[1] <== eq[4][i].out;
	
	and[2][i].b <== multi_or[1][i].out;

	/** testing both transition paths **/
	multi_or[2][i] = MultiOR(2);
	multi_or[2][i].in[0] <== and[1][i].out;
	multi_or[2][i].in[1] <== and[2][i].out;
	states[i+1][2] <== multi_or[2][i].out;

	/* test s3 fulfillment */
	// matching char c
	eq[5][i] = IsEqual();
	eq[5][i].in[0] <== in[i];
	eq[5][i].in[1] <== 99;
	
	and[3][i] = AND();
	and[3][i].a <== states[i][2];
	and[3][i].b <== eq[5][i].out;
	
	states[i+1][3] <== and[3][i].out;
}


Welcome

I hope this post has shed some light on the theory behind zk-regex and provided practical insights into the relationship between theory and the generated circom circuit code.

If this has motivated you to explore further, that’s great! You can visit the zk-regex to play around with the code. Don’t hesitate to share any suggestions or questions you might have!

I would like to give special thanks to Kelley Meck and Yush for reviewing and providing feedback, as well as to credit Sampriti Panda and Yush for their original work on the zk-email repository.