
Verify a data contract

experimental
Last modified on 10-Jul-24

Data contracts is an experimental project in Soda Core.

As the development team explores data contracts, expect minor imperfections, inconsistencies, and limited support, compatibility, and functionality if you download and use the soda-core-contracts package.

To verify a Soda data contract is to scan the data in a data source and execute the data contract checks you defined in a contracts YAML file. Soda data contracts is available as a Python library, so you run the scan programmatically: invoke it in a CI/CD workflow when you create a new pull request, or in a data pipeline after importing or transforming new data.

When deciding when to verify a data contract, consider that contract verification works best on new data as soon as it is produced so as to limit its exposure to other systems or users who might access it. The earlier in a pipeline or workflow, the better! Further, best practice suggests that you store batches of new data in a temporary table, verify a contract on the batches, then append the data to a larger table.
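The staging pattern described above can be sketched as follows. This is a minimal illustration, not Soda's implementation: it uses Python's built-in sqlite3 module only so that the example runs anywhere (it is not one of the data sources listed on this page), the table names customers and customers_staging are hypothetical, and no_missing_emails is a stand-in for a real contract verification on the staging table.

```python
import sqlite3
from typing import Callable, Iterable, Optional, Tuple

Row = Tuple[int, Optional[str]]


def load_batch(
    conn: sqlite3.Connection,
    rows: Iterable[Row],
    verify_staging: Callable[[sqlite3.Connection], bool],
) -> bool:
    """Stage a batch, verify it, and append it to `customers` only if it passes."""
    conn.execute("CREATE TEMP TABLE customers_staging (id INTEGER, email TEXT)")
    try:
        conn.executemany("INSERT INTO customers_staging VALUES (?, ?)", rows)
        if not verify_staging(conn):
            return False  # a rejected batch never reaches the larger table
        conn.execute("INSERT INTO customers SELECT id, email FROM customers_staging")
        return True
    finally:
        # The staging table is dropped whether or not the batch passed.
        conn.execute("DROP TABLE customers_staging")
        conn.commit()


def no_missing_emails(conn: sqlite3.Connection) -> bool:
    """Hypothetical stand-in for a contract verification on the staging table."""
    rows = conn.execute(
        "SELECT COUNT(*) FROM customers_staging WHERE email IS NULL"
    ).fetchall()
    return rows[0][0] == 0
```

In a real pipeline, the verify_staging callable would run the contract verification against a contract written for the staging table and return whether the result is OK.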

✖️    Requires Soda Core Scientific
✔️    Experimentally supported in Soda Core 3.3.3 or greater for PostgreSQL, Snowflake, and Spark
✖️    Supported in Soda Core CLI
✖️    Supported in Soda Library + Soda Cloud
✖️    Supported in Soda Cloud Agreements + Soda Agent

✖️    Available as a no-code check

Prerequisites
Verify a data contract via API
Review contract verification results
About data source configurations
Verify data contracts with Spark
Validate data contracts
Add a check identity
Skip checks during contract verification
Go further

Prerequisites

  • Python 3.8 or greater
  • a code or text editor
  • your data source connection credentials and details
  • a soda-core-contracts package and a soda-core[package] installed in a virtual environment. Refer to the list of data source-specific Soda Core packages available to use.
  • a Soda data contracts YAML file; see Write a data contract

Verify a data contract via API

  1. In your code or text editor, create a new file named data_source.yml that is accessible from within your working directory in your virtual environment.
  2. To that file, add a data source configuration for Soda to connect to your data source and access the data within it to verify the contract. The example that follows is for a PostgreSQL data source; see data source configuration for further details.
    Best practice dictates that you store sensitive credential values as environment variables using uppercase and underscores for the variables.
     name: local_postgres
     type: postgres
     connection:
       host: localhost
       database: yourdatabase
       username: ${POSTGRES_USERNAME}
       password: ${POSTGRES_PASSWORD}
    

    Alternatively, you can use a YAML string or dict to define connection details; use one of the with_data_source_...(...) methods.

  3. Add the following block to your Python working environment. Replace the values of the file paths with your own data source YAML file and contract YAML file respectively.
     from soda.contracts.contract_verification import ContractVerification, ContractVerificationResult
    
     contract_verification_result: ContractVerificationResult = (
         ContractVerification.builder()
         .with_contract_yaml_file('soda/local_postgres/public/customers.yml')
         .with_data_source_yaml_file('soda/local_postgres/data_source.yml')
         .execute()
     )
    
     print(str(contract_verification_result))
    
  4. At runtime, Soda connects to your data source and verifies the contract by executing the data contract checks in your file. Use ${SCHEMA} syntax to provide any environment variable values in a contract YAML file. Soda returns the results of the verification as pass or fail check results, or indicates errors if any exist; see Review contract verification results.

Review contract verification results

Contract verification results make a distinction between two types of problems: failed checks, and execution errors.

| Output | Meaning | Action | Method |
|---|---|---|---|
| Failed checks | A failed check indicates that the values in the dataset do not match or fall within the thresholds you specified in the check. | Review the data at its source to determine the cause of the failure. | .has_failures() |
| Execution errors | An execution error means that Soda could not evaluate one or more checks in the data contract. Errors include incorrect inputs such as missing files, invalid files, connection issues, or invalid contract format, or query execution exceptions. | Use the error logs to investigate the root cause of the issue. | .has_errors() |

When Soda surfaces a failed check or an execution error, you may wish to stop the pipeline from processing the data any further. To do so, you can use the Soda data contracts API in one of two ways:

  • Append .assert_ok() to the contract verification result, which raises a SodaException when a check fails or when execution errors occur. The exception message includes a full report.
  • Test the result with if not contract_verification_result.is_ok(): and use str(contract_verification_result) to get a report.
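As a sketch of the second approach, the helper below stops a pipeline and says which kind of problem occurred. It assumes only the result methods named on this page (is_ok(), has_failures(), has_errors()) and the string report; the names gate and ContractGateError are hypothetical and not part of the Soda API.

```python
class ContractGateError(RuntimeError):
    """Raised to stop a pipeline when contract verification is not OK."""


def gate(result) -> None:
    """Raise if `result` reports failed checks or execution errors.

    `result` is expected to behave like a contract verification result:
    it exposes is_ok(), has_failures(), has_errors(), and a str() report.
    """
    if result.is_ok():
        return
    problems = []
    if result.has_errors():
        problems.append("execution errors")
    if result.has_failures():
        problems.append("failed checks")
    summary = " and ".join(problems) or "unspecified problems"
    raise ContractGateError(f"Contract verification found {summary}:\n{result}")
```

Calling gate(contract_verification_result) right after .execute() keeps the stop condition in one place. Distinguishing the two problem types in the message helps decide whether to inspect the data or the configuration first.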

About data source configurations

Soda data contracts connects to a data source to run queries and to verify schemas and data quality checks on the data stored there. Notably, it does not extract or ingest data; it only scans your data to complete contract verification. If you are using the Contract API, you only need to provide one data source configuration in the contract verification, which Soda uses to verify contracts.

Best practice dictates that you store sensitive credential values as environment variables that use uppercase and underscores, such as password: ${DATA_SOURCE_PASSWORD}. Soda data contracts uses environment variables by default; you can pass extra variables via the API using .with_variables({"DATA_SOURCE_PASSWORD": "***"}).
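Before running a verification, it can help to confirm that every ${...} placeholder in a configuration file has a value. The following is a minimal pre-flight sketch using only the standard library. It does not reproduce how Soda resolves variables: the function name missing_variables and the placeholder pattern (letters, digits, and underscores inside ${}) are assumptions for illustration.

```python
import os
import re
from typing import List, Mapping, Optional

# Assumed placeholder shape: ${NAME} with letters, digits, and underscores.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_variables(
    yaml_text: str,
    extra_variables: Optional[Mapping[str, str]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return placeholder names found in `yaml_text` that have no value.

    A name counts as resolved if it is in the environment or in
    `extra_variables` (the values you would pass via the API).
    """
    env = os.environ if environ is None else environ
    extra = extra_variables or {}
    # dict.fromkeys de-duplicates while keeping first-seen order.
    names = dict.fromkeys(_PLACEHOLDER.findall(yaml_text))
    return [name for name in names if name not in env and name not in extra]
```

Running this on the contents of data_source.yml before building the verification surfaces an unset credential as a clear list of names instead of a connection error later on.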

Verify data contracts with Spark

If you have a Spark session that may hold in-memory data frames, you can pass the session into the contract verification API to verify a data contract against those data frames without persisting and reloading them.

Use with_data_source_spark_session to pass your Spark session into the contract verification, as in the example below.

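from pyspark.sql import SparkSession
from soda.contracts.contract_verification import ContractVerification, ContractVerificationResult

# An existing Spark session that holds the data frames to verify
spark_session: SparkSession = ...

# execute() returns a ContractVerificationResult, as in the PostgreSQL example
contract_verification_result: ContractVerificationResult = (
    ContractVerification.builder()
    .with_contract_yaml_str(contract_yaml_str)
    .with_data_source_spark_session(spark_session=spark_session, data_source_name="spark_ds")
    .execute()
)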

Validate data contracts

If you wish to validate the syntax of a data contract without actually executing the contract verification, use the build method instead of execute on the contract verification builder, as in the following example.

contract_verification: ContractVerification = (
  ContractVerification.builder()
  .with_contract_yaml_file('soda/local_postgres/public/customers.yml')
  .build()
)

if contract_verification.logs.has_errors():
  print(f"The contract has syntax or semantic errors: \n{contract_verification.logs}")
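To validate several contract files in one pass, for example as an early CI step, the same logs.has_errors() test can be applied in a loop. In this sketch the function name find_invalid_contracts is hypothetical, and build_verification is an injected callable so that the helper does not depend on a specific builder configuration.

```python
from typing import Callable, Dict, Iterable


def find_invalid_contracts(
    paths: Iterable[str],
    build_verification: Callable[[str], object],
) -> Dict[str, str]:
    """Map each contract path that has syntax or semantic errors to its logs.

    `build_verification(path)` must return an object with a `logs`
    attribute that exposes has_errors(), like the built verification above.
    """
    invalid: Dict[str, str] = {}
    for path in paths:
        verification = build_verification(path)
        if verification.logs.has_errors():
            invalid[path] = str(verification.logs)
    return invalid
```

With Soda installed, build_verification could be lambda path: ContractVerification.builder().with_contract_yaml_file(path).build(), which mirrors the example above. An empty result means every file passed validation.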

Add a check identity

Add an identity to a check to correlate the check’s verification results with a check in Soda Cloud.

In a contract YAML file, every check must have a unique identity. By default, Soda generates a check identity based on the location of the checks list and two properties: type and name. This is generally enough information to correlate a data contracts check with a check in Soda Cloud.

However, if the error Duplicate check identity appears in the verification output, that indicates that two checks exist with the same type and name, or same type and no name. Where this occurs, manually change the name of one of the checks or, in the case where neither check has a name, add a name to one of the checks.

Be aware that if you change or add a name to a data contract check, Soda Cloud treats it as a new check and discards the previous check's result history; it appears as though the original check and its results have disappeared.
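The duplicate condition described above can be checked ahead of time on a parsed checks list. The sketch below is not Soda's identity algorithm: it only groups the checks in one list by their type and name, and it assumes each check is a plain dict with a "type" key and an optional "name" key. The function name and the check type names in the usage example are illustrative.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def duplicate_check_keys(
    checks: List[Dict[str, str]],
) -> List[Tuple[str, Optional[str]]]:
    """Return (type, name) pairs that occur more than once in one checks list.

    A check without a name contributes (type, None), so two unnamed checks
    of the same type are reported as duplicates.
    """
    counts = Counter((check["type"], check.get("name")) for check in checks)
    return [key for key, count in counts.items() if count > 1]


# Illustrative input: two unnamed checks of the same type collide,
# while the two named checks of another type do not.
example_checks = [
    {"type": "no_missing_values"},
    {"type": "no_missing_values"},
    {"type": "row_count", "name": "at least one row"},
    {"type": "row_count", "name": "fewer than a million rows"},
]
duplicates = duplicate_check_keys(example_checks)
```

Each reported pair points to checks that need a distinct name before verification.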

Skip checks during contract verification

During a contract verification, you can arrange to skip checks using check.skip, as in the following example, which skips every check except the schema check.

contract_verification: ContractVerification = (
    ContractVerification.builder()
    .with_data_source_yaml_file('soda/local_postgres/data_source.yml')
    .with_contract_yaml_file('soda/local_postgres/public/customers.yml')
    .build()
)

contract = contract_verification.contracts[0]
for check in contract.checks:
    if check.type != "schema":
        check.skip = True

contract_verification_result: ContractVerificationResult = contract_verification.execute()
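If you skip checks in more than one place, the loop above can be wrapped in a small helper. This sketch relies only on the type and skip attributes used in the example; the function name skip_all_except is hypothetical.

```python
from typing import Iterable, List


def skip_all_except(checks: Iterable, keep_types: Iterable[str]) -> List[str]:
    """Mark every check whose type is not in `keep_types` as skipped.

    Returns the types of the checks that were skipped, in order, so the
    caller can log what was left out of the verification.
    """
    keep = set(keep_types)
    skipped: List[str] = []
    for check in checks:
        if check.type not in keep:
            check.skip = True
            skipped.append(check.type)
    return skipped
```

For the example above, the call would be skip_all_except(contract.checks, {"schema"}), made between .build() and .execute().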

Go further

