
Commit ef6f198

docs for householding, patient dupplicates and poor data quality
1 parent 84de7eb commit ef6f198

5 files changed

Lines changed: 25 additions & 6 deletions

File tree

README.md

Lines changed: 7 additions & 4 deletions
@@ -1,6 +1,7 @@
11
# Contents
22
- [Why?](#why?)
33
- [Connectors](docs/pipes.md)
4+
- [Security](#security)
45
- [Key Zingg Concepts](#key-zingg-concepts)
56
- [Installation](docs/installation.md)
67
- [Configuration](docs/configuration.md)
@@ -14,7 +15,7 @@
1415

1516
## Why?
1617

17-
Real world data contains multiple records belonging to the same customer. These records can be in single or multiple systems and they have variations across fields which makes it hard to combine them together, especially with growing data volumes. This hurts customer analytics - establishing lifetime value, loyalty programs or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce right results when there are multiple copies of the same customer lurking in the data. No warehouse can live up to its promise if the dimension tables have duplicates.
18+
Real world data contains multiple records belonging to the same customer. These records can be in single or multiple systems and they have variations across fields which makes it hard to combine them together, especially with growing data volumes. This hurts [customer analytics](docs/bizLeaderSurvey.md) - establishing lifetime value, loyalty programs or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce right results when there are multiple copies of the same customer lurking in the data. No warehouse can live up to its promise if the dimension tables have duplicates.
1819

1920
![data silos](assets/dataSilos.png)
2021

@@ -25,16 +26,17 @@ With Zingg, the analytics engineer and the data scientist can quickly intergate
2526

2627
![# Zingg - Data Mastering At Scale with ML](/assets/dataMastering.png)
2728

28-
Zingg integrates different records of an entity like customer, supplier, product etc in same or disparate data sources. Zingg is useful for
29+
Zingg integrates different records of an entity like customer, patient, supplier, product etc in same or disparate data sources. Zingg is useful for
2930

3031
- Building unified and trusted views of customers and suppliers across multiple systems
3132
- Large Scale Entity Resolution for AML, KYC and other fraud and compliance scenarios
32-
- Deduplication and data quality
33+
- [Deduplication](docs/patient.md) and data quality
3334
- Identity Resolution
3435
- Integrating data silos during mergers and acquisitions
3536
- Data enrichment from external sources
37+
- Establishing customer [households](docs/households.md)
3638

37-
Zingg is a no code ML based tool for data unification. It scales well to enterprise data volumes. It works for English as well as Chinese, Thai, Japanese, Hindi and other languages.
39+
Zingg is a no code ML based tool for data unification. It scales well to enterprise data volumes and entity variety. It works for English as well as Chinese, Thai, Japanese, Hindi and other languages.
3840

3941
## Connectors
4042

@@ -45,6 +47,7 @@ Zingg connects, reads and writes to most on-premise and cloud data sources. Zing
4547

4648
Zingg can read and write to Snowflake, Cassandra, S3, Azure, Elastic, major RDBMSes and any other Spark supported data sources. Zingg also works with all major file formats like Parquet, Avro, JSON, XLSX, CSV, TSV etc. Read more about the Zingg [pipe](docs/pipes.md) interface.
4749

50+
##S
4851
## Key Zingg Concepts
4952

5053
For data mastering, Zingg learns 2 models from the training data.

docs/bizLeaderSurvey.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
A [survey of 510 business leaders by Dun & Bradstreet](https://www.forbes.com/sites/joemckendrick/2019/06/26/running-a-business-on-data-is-still-an-elusive-goal/?sh=2ae347d536d3) revealed that
2+
3+
- Almost 20% of executives offered too much credit to a customer, and 15% failed to sign a new customer, due to a lack of information about them
4+
- Nearly half said that their internal data is too siloed to make any sense of it
5+
- 26% stated that they doubt the accuracy of their data
6+
7+
Given that customers are our most valuable assets, it is important to invest in breaking down customer data silos and building accurate, consolidated customer data, enriching it with third party sources where possible.

docs/fuzzyMatching.md

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -314,5 +314,3 @@ Entity representation often varies across systems. The number of attributes coul
314314

315315
\*Entity Definition: An entity, as we know, is a unique thing — a person, a business, a product, a supplier, a drug, an organization. Every entity comes with its describing attributes such as name, address, date, color, shape, price, age, website, brand, model, capacity et cetera.
316316

317-
Fuzzy matching is a great technique to match non-identical data but it comes with its challenges. At Nube, we solve the problem of fuzzy matching by employing artificial intelligence. Do reach out to us if you need help in reconciling your organization’s data.
318-

docs/households.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
Householding means grouping customer data into family units. These units make financial and budgetary decisions together. From a marketer's viewpoint, householding helps to understand the relationships between individuals and to execute the optimal communication strategy for the unit. With an understanding of the household, marketers can build a combined offer package that is valuable at both the individual and the household level. Opportunities for upselling and cross selling can also be discovered. For management, householding provides a deep view into customer lifetime value, risk, compliance and reporting metrics. For operations, householding reduces the mailing costs for disclosures and other mailers. For example, the SEC allows a single mailer per household when mailing prospectuses, annual and semi-annual reports.
2+
3+
Householding, though highly desirable, is not easy to implement. Most business data is not segmented properly into first name, last name, suffixes and prefixes. Addresses are not standardised and remain unformatted. Missing components, abbreviations like St. for street and Av. for avenue, wrong zip codes and differing formats add to the complexity of implementing householding.
4+
5+
Typical rule based systems address these problems by parsing fields through lookups. Auxiliary tables with values, formats and patterns are provided as part of the software to parse the components. For example, long dictionaries of first names are supplied to help look up the first name within the name fields. Similarly, addresses and other fields are parsed and standardized. This exercise is intensive and time consuming, but because the downstream name and address matching cannot work without it, it is mandatory in the traditional system. As a result, although a vast majority of financial institutions and retailers understand the need for householding, it remains an item on the wishlist, pushed from quarter to quarter.
6+
7+
With Zingg fuzzy matching, tolerance towards typos, field concatenation, unformatted records, abbreviations, prefixes and suffixes is pretty high. Learning from data means the system can generalize and find matches at high accuracy even with the raw data. Zingg mostly sidesteps the normalizing and parsing phase. Hence it is much easier and faster to discover households.
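
To make the idea of tolerance towards prefixes, abbreviations and missing punctuation concrete, here is a minimal sketch using only Python's standard-library `difflib`. It is not how Zingg matches records (Zingg learns its models from training data); the `similarity` helper and the sample records are hypothetical and exist only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1] after lowercasing and trimming both strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


# Hypothetical raw records: unparsed name and address in a single field.
raw_a = "Mr. John Smith, 12 Main St."
raw_b = "john smith 12 main street"
raw_c = "Priya Nair, 48 Lake Av."

# The first two records describe the same person and address despite the
# prefix, casing, punctuation and abbreviation differences.
score_same = similarity(raw_a, raw_b)
score_diff = similarity(raw_a, raw_c)

print(f"same household candidate: {score_same:.2f}")
print(f"different household:      {score_diff:.2f}")
```

Even this naive character-level comparison scores the first pair well above the second without any parsing or dictionary lookups. A learned matcher goes further by weighing fields and variations based on labelled examples rather than a single fixed ratio.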

docs/patient.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
It is a truth universally acknowledged, that duplicate records are bad. They hurt analytics, increase operational overheads, make compliance a pain and increase risk. But well, there are so many challenges in the data stack, surely duplicate records can be something we can live with? How bad can it be?
2+
A recent survey by [Black Book](https://blackbookmarketresearch.newswire.com/news/improving-provider-interoperability-congruently-increasing-patient-20426295) has quantified just how bad duplicate records can be. The survey found that an average hospital spends an extra 1.5 million USD a year due to duplicate and fragmented patient records. 1.5 million USD! Lack of a master patient index is clearly a very costly affair.
3+
The survey also found that for hospitals with more than 150 beds and hundreds of thousands of records, data cleaning with data validation and normalisation took approximately 5 months.
4+
At 1.5 million USD a year, 5 months of data cleaning equals roughly 625,000 USD of duplicate data spend (1.5 million * 5/12), on top of the software and implementation costs. Surely there is a faster and much cheaper way to get there? 😉
