
Accurate Data Matching Solution to Accelerate Sales Analytics for Darnytsia
Data Matching Solution for Darnytsia Aligns Data from Different Sources with up to 96% Accuracy
Business Challenge
Our long-term client, Darnytsia, needed help with two data-matching tasks behind its sales and market analytics:
- Matching the drug information between the company’s internal database and external tables received from pharmacies collaborating with Darnytsia
- Matching the pharmacies’ names and addresses between a table from the internal database and external sources
Initially, data matching was done manually, an approach that is time-consuming and prone to human error that can distort the results. The pharma giant was not satisfied with a ready-made solution it had previously tested and asked us to develop a custom matching algorithm.
Solution & Business Value
Together with Darnytsia, we developed a hybrid matching algorithm that matches records between internal and external data tables with a total error of less than 5%.
The business value from such an output for Darnytsia includes:
- Complete automation of the data matching process with 20x less time required
- Reduced time and cost for the related operations: matching tens of thousands of records takes less than 10 minutes
- 85-96% accuracy of matching
- No risk of human error, since the process is automated and involves no manual work
- Faster data analytics, decision-making, and insights generation
The project implementation took two weeks.
Technical Details
The client provided us with two Excel tables containing drug information (300-400 records each) and three tables containing pharmacy data (up to 10,000 records each). The data was inconsistent and stored in different formats, which complicated the task. The goal was to design a solution that finds matches between records despite disparate formats and typos. Using these initial datasets, we searched for suitable matching metrics and tested various approaches and available algorithms. As a result, we created a hybrid matching algorithm that measures the similarity between two data strings with several metrics and calculates a total similarity score.
The following metrics were used:
- Ratcliff/Obershelp similarity
- Levenshtein distance
- TF-IDF (term frequency-inverse document frequency)
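To illustrate how three metrics of this kind can be combined into one total similarity score, here is a minimal sketch. It is not the code delivered to the client: the equal weights, the character-trigram TF-IDF, the function names, and the sample drug strings are all assumptions made for illustration. The sketch relies only on the Python standard library, so Levenshtein distance and TF-IDF are written out by hand; in practice, libraries such as RapidFuzz and scikit-learn would typically take over those parts.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def ratcliff_similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp-style similarity in [0, 1], computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def char_ngrams(text: str, n: int = 3) -> list:
    """Character n-grams of the text, padded with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(texts: list) -> list:
    """Sparse TF-IDF vectors over character trigrams, with smoothed IDF."""
    counts_per_text = [Counter(char_ngrams(t)) for t in texts]
    doc_freq = Counter(g for counts in counts_per_text for g in counts)
    total = len(texts)
    return [
        {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
         for g, tf in counts.items()}
        for counts in counts_per_text
    ]


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise matters less."""
    return " ".join(text.lower().split())


def best_matches(internal: list, external: list,
                 weights=(1 / 3, 1 / 3, 1 / 3)) -> list:
    """For each internal record, return (internal, best external, score)."""
    left = [normalize(t) for t in internal]
    right = [normalize(t) for t in external]
    vectors = tfidf_vectors(left + right)
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    results = []
    for i, a in enumerate(left):
        scored = []
        for j, b in enumerate(right):
            score = (weights[0] * ratcliff_similarity(a, b)
                     + weights[1] * levenshtein_similarity(a, b)
                     + weights[2] * cosine(left_vecs[i], right_vecs[j]))
            scored.append((score, j))
        best_score, best_j = max(scored)
        results.append((internal[i], external[best_j], round(best_score, 3)))
    return results


# Invented sample records, not taken from the client's data.
internal_drugs = ["Paracetamol 500 mg", "Ibuprofen 200 mg", "Loratadine 10 mg"]
external_drugs = ["ibuprofen 200mg", "Loratadin 10 mg", "paracetamol 500 mg."]

for source, match, score in best_matches(internal_drugs, external_drugs):
    print(f"{source!r} -> {match!r} (score {score})")
```

This version compares every internal record with every external one, which is simple but quadratic in the number of records. A real pipeline would likely also apply a minimum score threshold, so that records without a plausible counterpart are flagged for review instead of being force-matched.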
As initially agreed with Darnytsia, the deliverable is Python code that the client can run whenever needed.
Matching algorithm tests showed the following results:
- 96% accuracy for matching two drug tables of 300-400 records each
- Above 85% accuracy for matching three pharmacy tables (names and addresses matched) of 10,000 records each
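The case study does not describe how these accuracy figures were computed. One common definition, assumed here, is the share of records whose automatically chosen match agrees with a manually verified one. The function name and the record identifiers below are hypothetical.

```python
def matching_accuracy(predicted: dict, verified: dict) -> float:
    """Share of verified records whose predicted match equals the verified one."""
    if not verified:
        raise ValueError("verified mapping must not be empty")
    correct = sum(
        1 for record_id, expected in verified.items()
        if predicted.get(record_id) == expected
    )
    return correct / len(verified)


# Hypothetical identifiers: internal record id -> matched external record id.
predicted = {"int-1": "ext-7", "int-2": "ext-3", "int-3": "ext-9", "int-4": "ext-2"}
verified = {"int-1": "ext-7", "int-2": "ext-3", "int-3": "ext-5", "int-4": "ext-2"}

print(f"accuracy: {matching_accuracy(predicted, verified):.0%}")
```

Under this definition, a record with no predicted match counts as an error, which keeps the figure conservative.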
Running the matching algorithm takes under one minute for the drug tables and under 10 minutes for the pharmacy tables. The solution is not limited to a specific data scale and can handle other data volumes.
Technologies & Tools
- Python
- difflib
- SequenceMatcher (difflib class)
- RapidFuzz
- scikit-learn (sklearn)