This repository was archived by the owner on May 17, 2024. It is now read-only.

Commit 81b21dd

Merge branch 'master' into issue_518_2
2 parents 97a150a + 0b2dcec commit 81b21dd

21 files changed

Lines changed: 575 additions & 224 deletions

README.md

Lines changed: 60 additions & 87 deletions
@@ -2,136 +2,109 @@
 <img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="50%" />
 </p>
 
-# **data-diff**
+<h1 align="center">
+data-diff
+</h1>
+
+<h2 align="center">
+Develop dbt models faster by testing as you code.
+</h2>
+<h4 align="center">
+See how every change to dbt code affects the data produced in the modified model and downstream.
+</h4>
+<br>
 
 ## What is `data-diff`?
-data-diff is a **free, open-source tool** that enables data professionals to detect differences in values between any two tables.
 
-## Documentation
+data-diff is an open source package that you can use to see the impact of your dbt code changes on your dbt models as you code.
 
-[**🗎 Documentation**](https://docs.datafold.com/guides/os_data_diff) - our detailed documentation has everything you need to start diffing.
+<div align="center">
 
-### Databases we support
+![development_testing_gif](https://user-images.githubusercontent.com/1799931/236354286-d1d044cf-2168-4128-8a21-8c8ca7fd494c.gif)
 
-- PostgreSQL >=10
-- MySQL
-- Snowflake
-- BigQuery
-- Redshift
-- Oracle
-- Presto
-- Databricks
-- Trino
-- Clickhouse
-- Vertica
-- DuckDB >=0.6
-- SQLite (coming soon)
+</div>
 
-For their corresponding connection strings, check out our [detailed table](https://github.com/datafold/data-diff/blob/master/docs/supported-databases.md).
+<br>
 
-#### Looking for a database not on the list?
-If a database is not on the list, we'd still love to support it. [Please open an issue](https://github.com/datafold/data-diff/issues) to discuss it, or vote on existing requests to push them up our todo list.
+## Getting Started
 
-## Get started
+**Install `data-diff`**
 
-### Installation
-
-#### First, install `data-diff` using `pip`.
+Install `data-diff` with the command that is specific to the database you use with dbt.
 
+### Snowflake
 ```
-pip install data-diff
+pip install data-diff 'data-diff[snowflake,dbt]' -U
 ```
 
-#### Then, install one or more driver(s) specific to the database(s) you want to connect to.
-
-- `pip install 'data-diff[mysql]'`
-
-- `pip install 'data-diff[postgresql]'`
-
-- `pip install 'data-diff[snowflake]'`
-
-- `pip install 'data-diff[presto]'`
-
-- `pip install 'data-diff[oracle]'`
-
-- `pip install 'data-diff[trino]'`
-
-- `pip install 'data-diff[clickhouse]'`
-
-- `pip install 'data-diff[vertica]'`
-
-- For BigQuery, see: https://pypi.org/project/google-cloud-bigquery/
-
-_Some drivers have dependencies that cannot be installed using `pip` and still need to be installed manually._
-
-### Run your first diff
+### BigQuery
+```
+pip install data-diff 'data-diff[dbt]' google-cloud-bigquery -U
+```
 
-Once you've installed `data-diff`, you can run it from the command line.
+### Redshift
+```
+pip install data-diff 'data-diff[redshift,dbt]' -U
+```
 
+### Postgres
 ```
-data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]
+pip install data-diff 'data-diff[postgres,dbt]' -U
 ```
 
-Be sure to read [the docs](https://docs.datafold.com/reference/open_source/cli) for detailed instructions how to build one of these commands depending on your database setup.
+### Databricks
+```
+pip install data-diff 'data-diff[databricks,dbt]' -U
+```
 
-#### Code Example: Diff Tables Between Databases
-Here's an example command for your copy/pasting, taken from the screenshot above when we diffed data between Snowflake and Postgres.
+### DuckDB
+```
+pip install data-diff 'data-diff[duckdb,dbt]' -U
+```
 
+**Update a few lines in your `dbt_project.yml`**.
 ```
-data-diff \
-postgresql://<username>:'<password>'@localhost:5432/<database> \
-<table> \
-"snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
-<TABLE> \
--k activity_id \
--c activity \
--w "event_timestamp < '2022-10-10'"
+#dbt_project.yml
+vars:
+  data_diff:
+    prod_database: my_database
+    prod_schema: my_default_schema
 ```
 
-#### Code Example: Diff Tables Within a Database
+**Run your first data diff!**
 
 ```
-data-diff \
-"snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA_1>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_1> \
-<SCHEMA_2>.<TABLE_2> \
--k org_id \
--c created_at -c is_internal \
--w "org_id != 1 and org_id < 2000" \
--m test_results_%t \
---materialize-all-rows \
---table-write-limit 10000
+dbt run && data-diff --dbt
 ```
 
-In both code examples, I've used `<>` carrots to represent values that **should be replaced with your values** in the database connection strings. For the flags (`-k`, `-c`, etc.), I opted for "real" values (`org_id`, `is_internal`) to give you a more realistic view of what your command will look like.
+We recommend you get started by walking through [our simple setup instructions](https://docs.datafold.com/development_testing/open_source) which contain examples and details.
+
+Please reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) if you have any trouble whatsoever getting started!
 
-### We're here to help!
+<br><br>
 
-We're here to help! Please post any questions in [GitHub Discussions](https://github.com/datafold/data-diff/discussions).
+### Diffing between databases
 
-## How to Use
+Check out our [documentation](https://github.com/datafold/data-diff/blob/master/docs/supported-databases.md) if you're looking to compare data across databases (for example, between Postgres and Snowflake).
 
-* [Examples with dbt, joindiff, and hashdiff](https://docs.datafold.com/reference/open_source/cli#examples)
-* [Examples with Python](https://data-diff.readthedocs.io/en/latest/python-api.html)
-* [How to use with TOML configuration file](https://docs.datafold.com/reference/open_source/cli#toml-config-file)
+<br>
 
-## How to Contribute
-* Feel free to open an issue or contribute to the project by working on an existing issue.
-* Please read the [contributing guidelines](https://github.com/datafold/data-diff/blob/master/CONTRIBUTING.md) to get started.
-* To add a new database driver, check out [docs](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst).
+## Contributors
 
-Big thanks to everyone who contributed so far:
+We thank everyone who contributed so far!
 
 <a href="https://github.com/datafold/data-diff/graphs/contributors">
 <img src="https://contributors-img.web.app/image?repo=datafold/data-diff" />
 </a>
 
-## Technical Explanation
-
-Check out this [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md) of how data-diff works.
+<br>
 
 ## Analytics
+
 * [Usage Analytics & Data Privacy](https://github.com/datafold/data-diff/blob/master/docs/usage_analytics.md)
 
+<br>
+
 ## License
 
 This project is licensed under the terms of the [MIT License](https://github.com/datafold/data-diff/blob/master/LICENSE).
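
Reviewer note: the `vars` block added to the README is what the `data_diff/dbt.py` diff reads through `datadiff_variables.get("prod_database")` and `datadiff_variables.get("prod_schema")`. The sketch below is a minimal illustration of that lookup, assuming the YAML has already been parsed into a plain dictionary. The `project` dictionary and the helper `read_data_diff_vars` are hypothetical names for illustration, not part of data-diff.

```python
from typing import Any, Dict, Optional

# Hypothetical result of parsing the dbt_project.yml fragment from the README.
project: Dict[str, Any] = {
    "vars": {
        "data_diff": {
            "prod_database": "my_database",
            "prod_schema": "my_default_schema",
        }
    }
}


def read_data_diff_vars(project_dict: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Return the data_diff settings, using None for any key that is absent."""
    data_diff_vars = (project_dict.get("vars") or {}).get("data_diff") or {}
    return {
        "prod_database": data_diff_vars.get("prod_database"),
        "prod_schema": data_diff_vars.get("prod_schema"),
    }


configured = read_data_diff_vars(project)
unconfigured = read_data_diff_vars({"vars": {}})
```

Because `dict.get` returns `None` for a missing key, code that consumes these values has to tolerate `None`, which is relevant to the `dbt.py` changes in this commit.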

data_diff/__main__.py

Lines changed: 9 additions & 0 deletions
@@ -228,6 +228,13 @@ def write_usage(self, prog: str, args: str = "", prefix: Optional[str] = None) -
     metavar="PATH",
     help="Which directory to look in for the dbt_project.yml file. Default is the current working directory and its parents.",
 )
+@click.option(
+    "--select",
+    "-s",
+    default=None,
+    metavar="PATH",
+    help="select dbt resources to compare using dbt selection syntax",
+)
 def main(conf, run, **kw):
     if kw["table2"] is None and kw["database2"]:
         # Use the "database table table" form
@@ -264,6 +271,7 @@ def main(conf, run, **kw):
                 profiles_dir_override=kw["dbt_profiles_dir"],
                 project_dir_override=kw["dbt_project_dir"],
                 is_cloud=kw["cloud"],
+                dbt_selection=kw["select"],
             )
         else:
             return _data_diff(**kw)
@@ -306,6 +314,7 @@ def _data_diff(
     cloud,
     dbt_profiles_dir,
     dbt_project_dir,
+    select,
     threads1=None,
     threads2=None,
     __conf__=None,
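
Reviewer note: the third hunk adds `select` to the `_data_diff` signature even though only the dbt path uses it. A likely reason is that `main` forwards every parsed option with `_data_diff(**kw)`, so any new CLI option must also be an accepted keyword there. The sketch below illustrates that constraint with simplified, hypothetical signatures (`data_diff_old`, `data_diff_new`, and `accepts` are illustrative names, and the parameter lists are shortened).

```python
def data_diff_old(database1, table1, cloud, dbt_profiles_dir, dbt_project_dir, threads1=None):
    """Hypothetical pre-change signature: no `select` parameter."""
    return "diffed"


def data_diff_new(database1, table1, cloud, dbt_profiles_dir, dbt_project_dir, select, threads1=None):
    """Hypothetical post-change signature: accepts `select` even if it is unused."""
    return "diffed"


def accepts(func, kwargs):
    """Return True if func can be called with the given keyword arguments."""
    try:
        func(**kwargs)
        return True
    except TypeError:
        return False


# Stand-in for the option dictionary a CLI parser would hand to main(**kw).
kw = {
    "database1": "db",
    "table1": "t",
    "cloud": False,
    "dbt_profiles_dir": None,
    "dbt_project_dir": None,
    "select": None,
}

old_ok = accepts(data_diff_old, kw)  # unexpected keyword 'select' is rejected
new_ok = accepts(data_diff_new, kw)  # accepted
```

An alternative design would be to pop dbt-only options from `kw` before forwarding, but that is not what this commit does.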

data_diff/dbt.py

Lines changed: 10 additions & 23 deletions
@@ -18,17 +18,6 @@
 logger = getLogger(__name__)
 
 
-def import_dbt():
-    try:
-        from dbt_artifacts_parser.parser import parse_run_results, parse_manifest
-        from dbt.config.renderer import ProfileRenderer
-        import yaml
-    except ImportError:
-        raise RuntimeError("Could not import 'dbt' package. You can install it using: pip install 'data-diff[dbt]'.")
-
-    return parse_run_results, parse_manifest, ProfileRenderer, yaml
-
-
 from .tracking import (
     set_entrypoint_name,
     set_dbt_user_id,
@@ -55,12 +44,15 @@ class TDiffVars(pydantic.BaseModel):
 
 
 def dbt_diff(
-    profiles_dir_override: Optional[str] = None, project_dir_override: Optional[str] = None, is_cloud: bool = False
+    profiles_dir_override: Optional[str] = None,
+    project_dir_override: Optional[str] = None,
+    is_cloud: bool = False,
+    dbt_selection: Optional[str] = None,
 ) -> None:
     diff_threads = []
     set_entrypoint_name("CLI-dbt")
     dbt_parser = DbtParser(profiles_dir_override, project_dir_override)
-    models = dbt_parser.get_models()
+    models = dbt_parser.get_models(dbt_selection)
     datadiff_variables = dbt_parser.get_datadiff_variables()
     config_prod_database = datadiff_variables.get("prod_database")
     config_prod_schema = datadiff_variables.get("prod_schema")
@@ -102,11 +94,6 @@ def dbt_diff(
     else:
         dbt_parser.set_connection()
 
-    if config_prod_database is None:
-        raise ValueError(
-            "Expected a value for prod_database: OR prod_database: AND prod_schema: under \nvars:\n data_diff: "
-        )
-
     for model in models:
         diff_vars = _get_diff_vars(
             dbt_parser, config_prod_database, config_prod_schema, config_prod_custom_schema, model
@@ -163,12 +150,12 @@ def _get_diff_vars(
         prod_schema = dev_schema
 
     if dbt_parser.requires_upper:
-        dev_qualified_list = [x.upper() for x in [dev_database, dev_schema, model.alias]]
-        prod_qualified_list = [x.upper() for x in [prod_database, prod_schema, model.alias]]
+        dev_qualified_list = [x.upper() for x in [dev_database, dev_schema, model.alias] if x]
+        prod_qualified_list = [x.upper() for x in [prod_database, prod_schema, model.alias] if x]
         primary_keys = [x.upper() for x in primary_keys]
     else:
-        dev_qualified_list = [dev_database, dev_schema, model.alias]
-        prod_qualified_list = [prod_database, prod_schema, model.alias]
+        dev_qualified_list = [x for x in [dev_database, dev_schema, model.alias] if x]
+        prod_qualified_list = [x for x in [prod_database, prod_schema, model.alias] if x]
 
     datadiff_model_config = dbt_parser.get_datadiff_model_config(model.meta)
 
@@ -298,7 +285,7 @@ def _cloud_diff(diff_vars: TDiffVars, datasource_id: int, api: DatafoldAPI) -> N
     try:
         diff_id = api.create_data_diff(payload=payload)
         diff_url = f"{api.host}/datadiffs/{diff_id}/overview"
-        rich.print(f"{diff_vars.dev_path[2]}: {diff_url}")
+        rich.print(f"{diff_vars.dev_path[-1]}: {diff_url}")
 
         if diff_id is None:
             raise Exception(f"Api response did not contain a diff_id")
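
Reviewer note: the `if x` filters in `_get_diff_vars` and the switch from `dev_path[2]` to `dev_path[-1]` appear to belong together. Once empty components are dropped, a path can have fewer than three parts, so the model name is reliably the last element rather than the element at index 2. The sketch below reproduces the filtering in isolation. `qualified_path` and `name_by_fixed_index` are hypothetical helpers written for this note, not functions from data-diff.

```python
from typing import List, Optional


def qualified_path(
    database: Optional[str],
    schema: Optional[str],
    alias: str,
    requires_upper: bool = False,
) -> List[str]:
    """Build a table path, skipping None or empty components as the new code does."""
    parts = [x for x in [database, schema, alias] if x]
    return [x.upper() for x in parts] if requires_upper else parts


def name_by_fixed_index(path: List[str]) -> Optional[str]:
    """Mimic the old lookup, path[2]; return None where it would raise IndexError."""
    try:
        return path[2]
    except IndexError:
        return None


full = qualified_path("analytics", "dev", "orders")
short = qualified_path(None, "dev", "orders", requires_upper=True)

full_name = full[-1]    # same element as full[2] for a three-part path
short_name = short[-1]  # still the model name for a two-part path
old_lookup = name_by_fixed_index(short)  # the fixed index has nothing to return
```

Without the filter, calling `.upper()` on a `None` database would raise `AttributeError`, which is the other failure the new comprehension avoids.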
