docs for householding, patient duplicates and poor data quality

sonalgoyal · sonalgoyal · commit ef6f198511f5 · 2021-09-10T17:25:54.000+05:30
diff --git a/README.md b/README.md
@@ -1,6 +1,7 @@
 # Contents
 - [Why?](#why?)
 - [Connectors](docs/pipes.md)
+- [Security](#security)
 - [Key Zingg Concepts](#key-zingg-concepts)
 - [Installation](docs/installation.md)
 - [Configuration](docs/configuration.md)
@@ -14,7 +15,7 @@
 
 ## Why?
 
-Real world data contains multiple records belonging to the same customer. These records can be in single or multiple systems and they have variations across fields which makes it hard to combine them together, especially with growing data volumes. This hurts customer analytics - establishing lifetime value, loyalty programs or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce right results when there are multiple copies of the same customer lurking in the data. No warehouse can live up to its promise if the dimension tables have duplicates. 
+Real world data contains multiple records belonging to the same customer. These records can be in single or multiple systems and they have variations across fields which makes it hard to combine them together, especially with growing data volumes. This hurts [customer analytics](docs/bizLeaderSurvey.md) - establishing lifetime value, loyalty programs or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce right results when there are multiple copies of the same customer lurking in the data. No warehouse can live up to its promise if the dimension tables have duplicates. 
 
 ![data silos](assets/dataSilos.png)
 
@@ -25,16 +26,17 @@ With Zingg, the analytics engineer and the data scientist can quickly intergate
 
 ![# Zingg - Data Mastering At Scale with ML](/assets/dataMastering.png)
 
-Zingg integrates different records of an entity like customer, supplier, product etc in same or disparate data sources. Zingg is useful for
+Zingg integrates different records of an entity like customer, patient, supplier, product etc in same or disparate data sources. Zingg is useful for
 
 - Building unified and trusted views of customers and suppliers across multiple systems
 - Large Scale Entity Resolution for AML, KYC and other fraud and compliance scenarios
-- Deduplication and data quality
+- [Deduplication](docs/patient.md) and data quality
 - Identity Resolution 
 - Integrating data silos during mergers and acquisitions
 - Data enrichment from external sources
+- Establishing customer [households](docs/households.md)
 
-Zingg is a no code ML based tool for data unification. It scales well to enterprise data volumes. It works for English as well as Chinese, Thai, Japanese, Hindi and other languages.   
+Zingg is a no code ML based tool for data unification. It scales well to enterprise data volumes and entity variety. It works for English as well as Chinese, Thai, Japanese, Hindi and other languages.   
 
 ## Connectors
 
@@ -45,6 +47,7 @@ Zingg connects, reads and writes to most on-premise and cloud data sources. Zing
 
 Zingg can read and write to Snowflake, Cassandra, S3, Azure, Elastic, major RDBMSes and any other Spark supported data sources. Zingg also works with all major file formats like Parquet, Avro, JSON, XLSX, CSV, TSV etc. Read more about the Zingg [pipe](docs/pipes.md) interface.  
 
+##S
 ## Key Zingg Concepts
 
 For data mastering, Zingg learns 2 models from the training data. 
diff --git a/docs/bizLeaderSurvey.md b/docs/bizLeaderSurvey.md
@@ -0,0 +1,7 @@
+A [survey of 510 business leaders by Dun & Bradstreet](https://www.forbes.com/sites/joemckendrick/2019/06/26/running-a-business-on-data-is-still-an-elusive-goal/?sh=2ae347d536d3) revealed that
+
+- Almost 20% of executives offered too much credit to a customer, and 15% failed to sign a new customer, due to a lack of information about them
+- Nearly half said that their internal data is too siloed to make any sense of it
+- 26% stated that they doubt the accuracy of their data
+
+Given that customers are our most valuable assets, it is important to invest in breaking customer data silos and building accurate, consolidated customer data, enriching it with third-party sources where possible. 
diff --git a/docs/fuzzyMatching.md b/docs/fuzzyMatching.md
@@ -314,5 +314,3 @@ Entity representation often varies across systems. The number of attributes coul
 
 \*Entity Definition: An entity, as we know, is a unique thing — a person, a business, a product, a supplier, a drug, an organization. Every entity comes with its describing attributes such as name, address, date, color, shape, price, age, website, brand, model, capacity et cetera. 
 
-Fuzzy matching is a great technique to match non-identical data but it comes with its challenges. At Nube, we solve the problem of fuzzy matching by employing artificial intelligence. Do reach out to us if you need help in reconciling your organization’s data. 
-
diff --git a/docs/households.md b/docs/households.md
@@ -0,0 +1,7 @@
+Householding means grouping customer records into family units. These units make financial and budgetary decisions together. From a marketer’s viewpoint, householding helps to understand the relationships between individuals and to execute the optimal communication strategy for the unit. With an understanding of the household, marketers can build a combined offer package that is valuable at both the individual and the household level. Opportunities for upselling and cross-selling can also be discovered. For management, householding provides a deep view into customer lifetime value, risk, compliance and reporting metrics. For operations, householding reduces the mailing costs for disclosures and other mailers. For example, the SEC allows a single mailer per household when mailing prospectuses and annual and semi-annual reports.
+
+Householding, though highly desirable, is not easy to implement. Most business data is not properly segmented into first name, last name, suffixes and prefixes. Addresses are not standardised and remain unformatted. Missing components, abbreviations like St. for street and Av. for avenue, wrong zip codes and differing formats add to the complexity of implementing householding.
+
+Typical rule-based systems address these problems by parsing with lookups. Auxiliary tables of values, formats and patterns are provided as part of the software to parse the components. For example, long dictionaries of first names are supplied to help look up the first name in name fields. Similarly, addresses and other fields are parsed and standardized. This exercise is intensive and time-consuming, but because downstream name and address matching cannot work without it, it is mandatory in the traditional system. Needless to say, although the vast majority of financial institutions and retailers understand the need for householding, it remains an item on the wishlist, pushed from quarter to quarter.
+
+With Zingg fuzzy matching, tolerance towards typos, field concatenation, unformatted records, abbreviations, prefixes and suffixes is high. Because Zingg learns from data, it can generalize and find matches with high accuracy even on raw data. Zingg mostly sidesteps the normalizing and parsing phase. Hence it is much easier and faster to discover households.
diff --git a/docs/patient.md b/docs/patient.md
@@ -0,0 +1,4 @@
+It is a truth universally acknowledged, that duplicate records are bad. They hurt analytics, increase operational overheads, make compliance a pain and increase risk. But well, there are so many challenges in the data stack, surely duplicate records can be something we can live with? How bad can it be?
+A recent survey by [Black Book](https://blackbookmarketresearch.newswire.com/news/improving-provider-interoperability-congruently-increasing-patient-20426295) has quantified just how bad duplicate records can be. The survey found that an average hospital spends an extra 1.5 million USD a year due to duplicate and fragmented patient records. 1.5 million USD! Lack of a master patient index is clearly a very costly affair.
+The survey also found that for hospitals with more than 150 beds and hundreds of thousands of records, data cleaning with data validation and normalisation took approximately 5 months.
+5 months of data cleaning equals 625,000 USD of duplicate data spend (1.5 million × 5/12), besides the software and implementation costs. Surely there is a faster and much cheaper way to get there? 😉
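
Review note: the 625,000 USD figure in `docs/patient.md` is a straight-line proration of the surveyed annual cost over the cleaning period. A minimal sketch of that arithmetic follows, assuming the cost accrues evenly across the year; the constant and function names are illustrative and not part of Zingg or the survey.

```python
# Figures quoted in docs/patient.md (from the Black Book survey as cited there).
ANNUAL_DUPLICATE_COST_USD = 1_500_000  # extra yearly spend per average hospital
CLEANING_MONTHS = 5                    # approximate duration of data cleaning


def prorated_cost(annual_cost: float, months: float) -> float:
    """Prorate an annual cost over a number of months (assumes even accrual)."""
    return annual_cost * months / 12


cost = prorated_cost(ANNUAL_DUPLICATE_COST_USD, CLEANING_MONTHS)
print(f"{cost:,.0f} USD")  # prints "625,000 USD"
```

The result matches the `1.5*5/12` expression in the added text, read in millions of USD.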
