Accurate Data Matching Solution to Accelerate Sales Analytics for Darnytsia

Data Matching Solution for Darnytsia Aligns Data from Different Sources with Up to 96% Accuracy

Business Challenge

Our long-term client, Darnytsia, needed assistance to address a challenge related to data used for sales and market analytics:

Matching the drug information between the company’s internal database and external tables received from pharmacies collaborating with Darnytsia
Matching the pharmacies’ names and addresses between a table from the internal database and external sources

Initially, data matching was done manually, but this approach was time-consuming and prone to human error that could distort the results. The pharma giant had previously tested a ready-made solution, was not satisfied with it, and asked us to develop a custom matching algorithm.

Solution & Business Value

Together with Darnytsia, we developed a hybrid matching algorithm that matches records of data tables (internal and external) with a total error of less than 5%.

For Darnytsia, the business value of this result includes:

Complete automation of the data matching process with 20x less time required
Reduced time and cost for the related operations: matching tens of thousands of records takes less than 10 minutes
85-96% accuracy of matching
No risk of human error in the matching step, since it is automated and involves no manual work
Faster data analytics, decision-making, and insights generation

The project implementation took two weeks.

Technical Details 

The client provided us with two Excel data tables containing drug information (300-400 records each) and three tables containing pharmacy data (up to 10,000 records each). The data was inconsistent and came in different formats, which complicated the task. The goal was to design a solution that would find matches between records despite disparate formats and typos.

Using these initial datasets, we searched for suitable matching metrics and tested various approaches and available algorithms. As a result, we created a hybrid matching algorithm that looks for similarities between two data strings and calculates a total similarity score.
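To illustrate the general idea of matching records across tables despite format differences and typos, here is a minimal sketch. It is not the algorithm delivered to the client: the normalization rules, the `best_match` helper, the 0.6 cutoff, and the sample records are all assumptions made for this example, and only a single metric from the standard library's `difflib` is used.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def best_match(query: str, candidates: list, cutoff: float = 0.6):
    """Return (candidate, score) for the most similar candidate.

    The candidate is None when the best score is below the cutoff
    (the cutoff value is an illustrative assumption).
    """
    q = normalize(query)
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, q, normalize(cand)).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best_score < cutoff:
        return None, best_score
    return best, best_score


# Made-up sample records, for illustration only.
internal = ["Paracetamol 500 mg tablets", "Ibuprofen 200 mg capsules"]
external = ["PARACETAMOL tabl. 500mg", "ibuprofen caps 200 mg", "Vitamin C"]

matches = {name: best_match(name, internal) for name in external}
for name, (match, score) in matches.items():
    print(f"{name!r} -> {match!r} (score {score:.2f})")
```

Comparing every external record against every internal one is quadratic in the table sizes, which is acceptable for tables of a few hundred to a few thousand rows; larger tables would usually need some form of candidate pre-filtering.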

The following metrics were used:

Ratcliff/Obershelp similarity
Levenshtein distance
TF-IDF (term frequency-inverse document frequency)
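The case study does not describe how the three metrics were combined, so the following is only a sketch of one plausible way to build a total similarity score. The equal weights, the character-trigram TF-IDF, the smoothed IDF formula, and all function names are assumptions. To keep the example self-contained it uses only the standard library, whereas the project itself relied on RapidFuzz and Sklearn for this kind of computation.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def ratcliff_similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp-style similarity in [0, 1], as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def char_ngrams(text: str, n: int = 3) -> list:
    """Character n-grams of the text, padded with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(corpus: list, n: int = 3) -> list:
    """Sparse TF-IDF vectors (dicts) over character n-grams, one per string."""
    docs = [Counter(char_ngrams(text, n)) for text in corpus]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in doc.items()} for doc in docs]


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def hybrid_score(a: str, b: str, corpus: list, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted mean of the three similarities; `corpus` supplies IDF statistics."""
    vec_a, vec_b, *_ = tfidf_vectors([a, b, *corpus])
    scores = (
        ratcliff_similarity(a, b),
        levenshtein_similarity(a, b),
        cosine(vec_a, vec_b),
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Made-up pharmacy strings, for illustration only.
corpus = ["green pharmacy 12 main st", "city drugstore 5 park ave"]
close = hybrid_score("green pharmacy 12 main st", "green pharmcy 12 main street", corpus)
far = hybrid_score("green pharmacy 12 main st", "city drugstore 5 park ave", corpus)
print(f"close pair: {close:.2f}, distant pair: {far:.2f}")
```

Combining metrics this way lets character-level measures catch typos while the TF-IDF term down-weights fragments that appear in almost every record, such as a common word in pharmacy names. In practice the weights would be tuned on labelled examples.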

As initially agreed with Darnytsia, the deliverable is Python code that the client can run whenever needed.

Matching algorithm tests showed the following results:

96% accuracy for matching two drug tables of 300-400 records each
Above 85% accuracy for matching three pharmacy tables (names and addresses matched) of up to 10,000 records each
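The case study does not state how accuracy was measured. One common approach, sketched below under that caveat, is to compare the algorithm's output with a hand-verified reference set and report the share of records matched correctly. The function name, record identifiers, and sample values are hypothetical.

```python
def matching_accuracy(predicted: dict, expected: dict) -> float:
    """Share of reference records whose predicted match equals the verified one."""
    if not expected:
        raise ValueError("expected must contain at least one labelled record")
    correct = sum(
        1 for key, target in expected.items() if predicted.get(key) == target
    )
    return correct / len(expected)


# Hypothetical identifiers; None means "no counterpart in the internal table".
expected = {"ext-1": "int-7", "ext-2": "int-3", "ext-3": None, "ext-4": "int-9"}
predicted = {"ext-1": "int-7", "ext-2": "int-3", "ext-3": None, "ext-4": "int-2"}

accuracy = matching_accuracy(predicted, expected)
print(f"{accuracy:.0%}")  # 3 of 4 records agree with the reference
```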

The matching algorithm runs in under one minute for the drug tables and in under 10 minutes for the pharmacy tables. The solution is not tied to a specific data scale and can be applied to other data volumes.

Technologies & Tools

Python
difflib (SequenceMatcher)
RapidFuzz
scikit-learn (sklearn)

