-
As the title suggests, I'm interested in developing table and column-level lineage from ELT-type queries written using ibis. I'm new to ibis, but it feels like there should be plenty of opportunity to introspect a query to find the tables, columns, transformations, and other details that would allow really fine modeling of data lineage. However, I didn't see much documentation on the topic (and may easily have just missed it). Any pointers or suggestions on where to start? |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 1 reply
-
Hi @seandavi! Welcome to Ibis! You are right that there are many opportunities to introspect Ibis queries -- a lot of them are currently what I would categorize as "developer tooling" and it is definitely under-documented. I can point you at a few things that can get you started poking at things, but also, the more that you can tell us about what kind of tasks you want to do with this table and column lineage, the better positioned we'll be to build up a reasonable API around it. If I use the penguins dataset for a silly example adding together bill_length and bill_depth measurements -- note that I'm not using interactive mode here: In [1]: import ibis
...: from ibis import _
...: from ibis.common.graph import toposort
...:
...: t = ibis.examples.penguins.fetch()
In [2]: t
Out[2]:
DatabaseTable: penguins
species string
island string
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm int64
body_mass_g int64
sex string
year int64
In [3]: t.head().execute()
Out[3]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN None 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
In [4]: t = t.mutate(big_bill=_.bill_length_mm + _.bill_depth_mm)
In [5]: big_bill = t.big_bill
In [6]: big_bill.execute()
Out[6]:
0 57.8
1 56.9
2 58.3
3 NaN
4 56.0
...
339 75.6
340 61.6
341 67.8
342 69.8
343 68.9
Name: big_bill, Length: 344, dtype: float64 I have In [7]: big_bill
Out[7]:
r0 := DatabaseTable: penguins
species string
island string
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm int64
body_mass_g int64
sex string
year int64
r1 := Selection[r0]
selections:
r0
big_bill: r0.bill_length_mm + r0.bill_depth_mm
big_bill: r1.big_bill This isn't the most compact format for this information, but it shows us that we started with a table named For more granular access to the same information, we can look at the underlying operations. All Ibis expressions can be decomposed into a series of operations -- you can inspect that operation by looking at the In [8]: big_bill.op()
Out[8]: <ibis.expr.operations.generic.TableColumn object at 0x7fc2b04dd120> The op is a In [9]: big_bill.op().argnames
Out[9]: ('table', 'name') And then inspect them: In [10]: big_bill.op().table
Out[10]: <ibis.expr.operations.relations.Selection object at 0x7fc256e1eea0>
In [11]: big_bill.op().name
Out[11]: 'big_bill' The In [13]: big_bill.op().table.to_expr()
Out[13]:
r0 := DatabaseTable: penguins
species string
island string
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm int64
body_mass_g int64
sex string
year int64
Selection[r0]
selections:
r0
big_bill: r0.bill_length_mm + r0.bill_depth_mm As you might imagine, you can also inspect the If you want the entire chain of operations, you can use In [14]: toposort(big_bill.op())
Out[14]: Graph({<ibis.expr.operations.relations.DatabaseTable object at 0x7fc256e416c0>: (), <ibis.expr.operations.generic.TableColumn object at 0x7fc2b03da390>: (<ibis.expr.operations.relations.DatabaseTable object at 0x7fc256e416c0>,), <ibis.expr.operations.generic.TableColumn object at 0x7fc2b03dad40>: (<ibis.expr.operations.relations.DatabaseTable object at 0x7fc256e416c0>,), <ibis.expr.operations.numeric.Add object at 0x7fc254947c40>: (<ibis.expr.operations.generic.TableColumn object at 0x7fc2b03da390>, <ibis.expr.operations.generic.TableColumn object at 0x7fc2b03dad40>), <ibis.expr.operations.core.Alias object at 0x7fc2b0908280>: (<ibis.expr.operations.numeric.Add object at 0x7fc254947c40>,), <ibis.expr.operations.relations.Selection object at 0x7fc256e1eea0>: (<ibis.expr.operations.relations.DatabaseTable object at 0x7fc256e416c0>, <ibis.expr.operations.relations.DatabaseTable object at 0x7fc256e416c0>, <ibis.expr.operations.core.Alias object at 0x7fc2b0908280>), <ibis.expr.operations.generic.TableColumn object at 0x7fc2b04dd120>: (<ibis.expr.operations.relations.Selection object at 0x7fc256e1eea0>,)}) It's a bit of a mess to look at on the screen, but that's the lineage of the Hopefully that gives you a sense of what Ibis could do -- what information we have available for a given set of transformations. If you give us a sense of what you'd like to do and how you think you might want to go about doing it, we can see what we can put together. Almost forgot -- there's a graphviz option to visualize expressions (and so all the information present in that visualization is also accessible). In [15]: big_bill.visualize() yields |
Beta Was this translation helpful? Give feedback.
Hi @seandavi! Welcome to Ibis!
You are right that there are many opportunities to introspect Ibis queries -- a lot of them are currently what I would categorize as "developer tooling" and it is definitely under-documented.
I can point you at a few things that can get you started poking at things, but also, the more that you can tell us about what kind of tasks you want to do with this table and column lineage, the better positioned we'll be to build up a reasonable API around it.
If I use the penguins dataset for a silly example adding together bill_length and bill_depth measurements -- note that I'm not using interactive mode here: