11 minute read

Introduction

In data processing applications like federated learning and confidential analytics, it is often necessary to prove that a function was correctly executed over inputs originating from specific sources. The authenticity of such data is typically attested via a digital signature by the data source such that any recipient can verify the signature and thereby, the origin of the data source.

Setting Summary
Multiple Data Sources transmit data to a Prover. It computes a function over the data, sending the output comp_output to a Verifier with a proof of correctness proof. Both the Prover and Verifier know the public keys of all Data Sources.

A straight-forward approach for generating SNARKs over authenticated data is executing the signature verification logic inside the SNARK circuit. However, it turns out that the arithmetic operations involved in verifying digital signatures significantly increase the complexity of the circuit in terms of number constraints, which the prover has to suffer. This impact is further amplified when a circuit deals with many signatures, such as in a federated learning scenario.


This Post

Incorporating data authentication into SNARK-based protocols faces the following challenges:

  1. Prover Efficiency: How can the prover attest to data provenance efficiently.
  2. Verifier Efficiency: How can this attestation be efficiently verified.
  3. Data Secrecy: Whether efficiency towards the prover or verifier has security limitations.

A SNARK circuit that encodes all required digital signature verification enables the prover to use the ZK property and keep secret (part of) the data, the signature, or both. But as shown above, this does not come for free, but against a computation cost the Prover has to pay. In this post, we will look at different approaches of proving functions over authenticated data, analyzing their efficiency and their implications on security.

Setting

Before getting into the different approaches we first establish the setting and notation that will be used.

Parties

In our target setting, there are 3 types of parties: DataSource, Prover, and Verifier.

Data Source. A Data Source with ID i has a signing key pair (sk, pk) produces data with signature sig.

Prover. Peggy is a Prover party that interacts with n data sources. It maintains an array of public keys pubkeys[], an array of data dataset[], and an array of signatures sigs[] of size n each, such that sigs[i] is a signature over dataset[i] under public key pubkeys[i], where i is the ID of the data source.

Verifier. Victor is a Verifier party that also knows the array pubkeys[] and wishes to obtain the output of an arbitrary function over the entire dataset, i.e., comp_output = f(dataset), where Peggy carries out this computation and communicates the result.

High-level Goals

Victor requires a convincing argument that

  1. comp_output is the correct output of f(dataset)
  2. All data items in the input dataset[] originate from data sources that Victor expect

To fulfill the first requirement, Peggy will use a SNARK system to prove that the execution of the arbitrary function f(dataset) produced the output comp_output.

In the upcoming sections we will look at various approaches of fulfilling the second requirement.

Algorithms

In each proposed approach, we describe a DataSource, Prover, and a Verifier algorithm. Therein, all data sources execute the DataSource algorithm, sending the output to Peggy. Peggy plugs those outputs into the Prover algorithm and sends the output to Victor. Finally, Victor inputs the data from Peggy into the Verifier algorithm and obtains a verification result.

Both Peggy and Victor are assumed to have pre-existing knowledge of the array pubkeys[], thus it is not communicated but are made available to their respective algorithms as part of the inputs.

Setting Summary
Output from Data Source algorithms is transmitted to Peggy. Peggy executes the Prover algorithm and sends the output to Vicky who executes the Verifier algorithm.

For a concise description of the algorithms, we devise a pseudo-code notation. As an example, the following pseudo-code describes the algorithm DataSource.

BEGIN DataSource
  IN: data, sk
  OUT: data, sig
  EXEC:
    SET sig = Sign(sk, data)
END DataSource

The algorithm DataSource takes as input (denoted via IN:) a data item data and a signing key sk. Denoted via OUT:, it outputs the same data item data and a signature sig, which is what gets sent to Peggy. Following EXEC: is how the algorithm computes sig.

This DataSource algorithm is what is implicitly used in the following approaches (unless stated otherwise).

Baseline Approach: Signatures Inside the Circuit

To establish a baseline reference, we let the signature verification be encoded in the proving circuit, where in addition to proving correctness of the function, Peggy also needs to prove validity of all of the signatures.

BEGIN Prover
  IN:  dataset[], sigs[], pubkeys[]
  OUT: comp_output, proof
  EXEC:
    SET comp_output = f(dataset[])
    SET proof = BEGIN Proof
      PRIVATE: dataset, sigs
      PUBLIC:  pubkeys, comp_output
      CONSTRAINTS:
        comp_output == f(dataset)
        FOR ALL indices i in dataset:
          Sig.verify(pubkeys[i], dataset[i], sigs[i]) == 1
      END Proof
END Prover

Peggy executes the Prover algorithm that computes f(dataset[]). The algorithm also produces a proof that:

  1. this output is correct
  2. the data set input to f had correct signatures under the array of public keys pubkeys.

Peggy sends Victor proof and the computation output comp_output, nothing else.

Victor receives proof, func_output from Peggy, and using his list of pubkeys, he executes the Verifier algorithm:

Begin Verifier
  IN: proof, comp_output, pubkeys[]
  OUT: verification_result
  EXEC:
    SET verification_result = VerifyProof(proof, comp_output, pubkeys)
End Verifier

Efficiency Analysis

As conveyed in the introduction, this approach has high cost for the Prover who need to encode all signature verification steps into the circuit. For the Verifier, however, it is a best-case scenario as the it only needs to process a succinct proof.

Security Analysis

In this setting, Peggy is able to hide all data as well as the signatures. Thus, after execution, Victor effectively learns nothing about the data, except what is leaked by the computation output func_output.

This is essentially a best case scenario for secrecy, since Peggy relies entirely on the ZK property of SNARKS for delivering the authenticity guarantees that Victor requires.

Approach I: Signatures Outside Circuit

The simplest approach for optimizing proofs over authenticated data is to not carry out verification of signatures in the circuit, but to move them outside. In this setting, Peggy generates a proof of correctness of the desired function, but not of any signature validity.

BEGIN Prover
  IN: dataset[], sigs[], pubkeys[]
  OUT: comp_output, dataset[], sigs[], proof
  EXEC:
    SET comp_output = f(dataset)
    SET proof = BEGIN Proof:
      PUBLIC: dataset[], comp_output
      CONSTRAINTS:
        comp_output == f(M)
    END Proof
END Prover

In addition to sending Victor comp_output and proof, Peggy must also send dataset and sigs such that Victor can verify authenticity of these messages, and that they are the same messages used in generating the proof.

Thus Victor receives comp_output, dataset, sigs and proof, and verifies both the proof and all signatures:

Verifier
  IN: proof, comp_output, pubkeys[], dataset[], sigs[]
  OUT: verification_result
  EXEC:
    SET proof_result = VerifyProof(dataset, comp_output, proof)
    SET sig_result  = 1
      FOR ALL indices i in dataset:
        SET sig_result = sig_result AND Sig.verify(pubkeys[i], dataset[i], sigs[i])
    SET verification_result = proof_result AND sig_result
END Verifier

Efficiency Analysis

The approach strips away all the overhead of data authenticity checks from the proof generation circuit. In this setting, Prover Efficiency is maximized since only the function f needs to be encoded in the circuit.

However, the Verifier efficiency suffers. Victor now is required to receive and verify all data and signatures, in addition to verifying the proof itself, effectively making the protocol as a whole non-succinct.

Security Analysis

Signature checks are removed from the circuit and are carried out by the Verifier external to the circuit. This step alone would be insufficient to guarantee authenticity, as there must be a way to ensure that those data whose signatures have been verified are the same data used in the proof generation. To do that, the Verifier inputs the data into the proof verification algorithm. This mechanism prevents the prover from making using of the ZK property in hiding any of the input data, since the verifier needs to learn them. Such protocol is would only make sense if Peggy has no secrecy requirements for the data, and also if the computation is too expensive for Victor, otherwise Victor can just carry out the computation himself.

In the next approach, we will see how we can bring back some level of confidentiality.

Approach II: Sign Commitment instead of Data

Moving verification of the signatures outside the circuit prevented Peggy from hiding any data. We can work-around this limitation by making use of a commitment scheme, and have the signature be over commitments of the data rather than the raw data. To this end, we modify the DataSource algorithm as follows:

Begin DataSource
  IN: data, sk
  OUT: data, sig, comm, opening
  EXEC:
    SET comm, opening = Commit(m)
    SET sig = Sign(sk, comm)
End DataSource

A data source then sends to the Prover its data, a commitment comm, a random opening that opens the commitment to data, and a signature sig over this commitment.

Peggy then executes the following Prover algorithm:

Begin Prover
  IN: dataset[], sigs[], pubkeys[], comms[], openings[]
  OUT: comp_output, comms[], sigs[], proof
  EXEC:
    SET comp_output = f(M)
    SET proof = BEGIN Proof
      PRIVATE: dataset[], openings[]
      PUBLIC: comms
      CONSTRAINTS:
        comp_output == f(dataset)
        FOR ALL indices i in dataset:
          VerifyCommitment(comms[i], openings[i], dataset[i])
    End Proof
End Prover

In addition to proving correctness of comp_output, Peggy additionally proves that all commitments for data in dataset used in producing comp_output are correct with respect to the private openings.

Peggy then sends Victor comp_output, proof, as well as the set of commitments comms and set of signatures sigs over those commitments (but not openings or dataset).

Upon receiving proof, comp_output, comms, sigs, Victor computes:

Begin Verifier
  IN: proof, comp_output, pubkeys[], comms[], sigs[]
  OUT: final_result
  EXEC:
    SET proof_result = Verifyproof(comms, comp_output, proof)
    SET sig_result  = 1
      FOR ALL indices i in dataset:
        SET sig_result = sig_result AND Sig.verify(pubkeys[i], comms[i], sigs[i])

    SET final_result = proof_result AND sig_result
End Verifier

Thus in addition to verifying the proof, Victor verifies all signatures over the received commitments.

Efficiency

Let H be a cryptographic hash function, we can construct a commitment scheme as follows:

BEGIN Commit
  IN: data
  OUT: comm, opening
  EXEC:
    SET opening = Sample a uniform value in H's input space
    SET comm = H(data || opening)
END Commit

BEGIN VerifyCommitment
  IN: data, comm, opening
  OUT: vf_result
  EXEC:
    SET vf_result = comm == H(data || opening)
END VerifyCommitment

The Prover Efficiency is now worse that Approach I, since the proving circuit needs to encode verification of all commitments, i.e., compute a hash function per message. To optimise, we can instantiate H using a circuit friendly hash function like Poseidon.

The verifier may have a slight performance gain over Approach I as he verifies signatures over fixed-size small messages, compared to the arbitrary-sizes messages. Nevertheless, the succinctness property remains sacrificed.

Security Analysis

The protocol rests on the outlined commitment scheme being (1) Hiding: Given comm, it is intractable to obtain data, and (2) Binding: it is further intractable to find opening2 != opening that makes VerifyCommitment output 1 for data2' != data. Hiding is guaranteed by randomness of the chosen opening, and Binding by the collision resistance the H.

Authenticity. Instead of verifying signatures over the direct data, Victor verifies signatures over commitments and is guaranteed of the authenticity of those commitments. Because the proof guarantees that these commitments are bound to the input data, the authenticity of commitments is transitively carried over to the input data.

Confidentiality. While the protocol is a step up from Approach I since Victor receives the authenticity guarantees without learning the data, Victor still learns a long-term commitment-signature of the data. This makes usage of the same data linkable by Victor across difference interactions with Peggy.

Approach III: Signature Aggregation

We can achieve some improvement on Verifier Efficiency if the data sources use a signature scheme that supports signature aggregation. Instead of verifying a digital signature per message, the Prover can aggregate all the signatures into a single short one which is only what needs to be verified.

BEGIN Prover
  IN: dataset[], sigs[], pubkeys[], comms[], openings[]
  OUT: comp_output, comms[], aggsig, proof
  EXEC:
    SET res = f(dataset)
    SET aggsig = Aggregate(sigs)
    SET proof = BEGIN proof
      PRIVATE: dataset[], openings[]
      PUBLIC: comms[]
      CONSTRAINTS:
        comp_outputs == f(dataset)
        FOR ALL indices i in dataset:
          VerifyCommitment(comms[i], openings[i], dataset[i])
    END proof
END Prover

Instead of sending Victor the set of signatures, Peggy sends a short aggregation of those signatures that–together with the commitments–suffice to convince Victor of the authenticity of the data.

Upon receiving comp_output, comms[], aggsig, and proof Victor computes:

BEGIN Verifier
  IN: comp_output, comms[], aggsig, pubkeys[], proof
  OUT: verification_result
  EXEC:
    SET proof_result = VerifyProof(comp_outputs, comms, proof)
    SET sig_result = VerifyAggregateSignature(pubkeys, comms, aggsig)
    SET verification_result = proof_result AND sig_result
END Verifier

Efficiency

The Prover is slightly more expensive than Approach II, since it additionally needs to aggregate the signatures (outside the proving circuit).

Compared to Approach II, Victor needs to verify an aggregate, single signature instead of many. This is not inherently more efficient, as an aggregatable signature scheme like BLS would require the Verifier compute pairing operations, which may be more expensive than verifying many signatures of a standard signature scheme.

Security

This approach has similar security as in approach II.

Conclusion

We reviewed different approaches for generating SNARK proofs over authenticated data and analyzed their computational impact on the Prover and Verifier, as well as their implications on security. Although some had positive impact on Prover Efficiency, they make sacrifices on Verifier Efficiency and/or Secrecy. A correct approach then depends on the functional and security requirements of the use case.

Leave a comment