
The pursuit of a perfect dataset

Data Cleanliness

What is required to develop an ideal solution? I frequently ask myself this when confronted with a new Data Science problem. Surprisingly, the answer has always been the same: “Data.”

The most challenging task for me in my career has always been the quest for flawless data. To be honest, I have never been able to succeed.

Does a perfect dataset exist?

But what exactly counts as perfect data? Simply put, it is data that can be split into good training and validation sets to train a model that performs well in production and stays robust over time. If your dataset was not good when you trained the model, the results in production will differ significantly from what you observed during experimentation.

Real-world data is messy, seldom structured, imbalanced, and noisy; it has missing information and is sometimes wrongly labeled. Furthermore, there may be no data at all, or only a small proportion of the available data may be worthwhile. In some cases, creating a high-quality dataset is also expensive.

A personal experience working with an imbalanced dataset

In this story, I’ll share the challenges I faced in the first project of my career as a Data Scientist.

I worked on a text classification project to classify clients’ comments as positive or negative. These comments were posted as part of our regular conversations with clients on Basecamp.

Contrary to what one might expect, the problem was anything but simple.

First of all, the definition of a negative comment was very particular.

We had to flag a comment only if it showed one of the following intents:

  • The customer is asking for an update, ETA, status, or follow-up on an ongoing project;
  • The customer gave negative feedback, or the CX team could not meet the client’s expectations on the delivered item;
  • The customer is frustrated, enraged, or confused;
  • The client is asking for urgent delivery of the project or an interim deliverable.

Additionally, even after scraping all the comments from Basecamp, we had only 3500 client comments. Annotation yielded only 540 negatives; the rest were positive.

During EDA, we found that some of the comments were as long as 100-200 words. These were clients’ feedback on delivered items. In most of them, only part of the text was negative. However, that negative part was lost to text truncation, so our model predicted the text as positive.

Increasing the truncation length only made our results worse. To overcome this problem, we tokenized the text into individual sentences and tagged each sentence again, which gave us 760 negative and 9905 positive sentences: a classic example of an imbalanced dataset. To reduce the imbalance, we added some negative texts from SST2.
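To make the truncation problem concrete, here is a minimal sketch in Python. The regex splitter and the `truncate_words` helper are hypothetical stand-ins: the article does not say which sentence tokenizer or truncation scheme the project used, and the sample comment is invented for illustration.

```python
import re

# Hypothetical stand-in for the project's sentence tokenizer (not named in
# the article): split after '.', '!' or '?' when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(comment: str) -> list[str]:
    """Split a comment into individual sentences."""
    return [s.strip() for s in _SENTENCE_END.split(comment) if s.strip()]


def truncate_words(text: str, max_words: int) -> str:
    """Keep only the first max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


# Invented example: mostly positive feedback with one negative sentence at the end.
comment = (
    "Thanks for the quick turnaround on the draft. "
    "The layout looks clean and the colours work well. "
    "However, the report is missing the Q3 numbers we asked for."
)

truncated = truncate_words(comment, 12)      # the negative part is cut off
sentences = split_into_sentences(comment)    # each sentence can be tagged separately
```

After truncation the complaint about the missing numbers never reaches the model, whereas sentence-level splitting keeps it as its own example that can be labeled negative.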

Adding this data helped us with both the small dataset size and the class imbalance. It also helped the model capture common expressions of negative sentiment.

Our final dataset contained 9905 positive and 2760 negative texts, which helped us achieve an accuracy of 92.5%.
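The counts above can be turned into class shares with a few lines of arithmetic. This is a reader's sketch, not part of the original project: the function name is made up, and the majority-class baseline is added here only as a reference point for reading the accuracy figure.

```python
def negative_share(n_positive: int, n_negative: int) -> float:
    """Fraction of the dataset that belongs to the negative class."""
    return n_negative / (n_positive + n_negative)


# Counts taken from the article.
share_before = negative_share(9905, 760)    # after sentence-level tagging, about 7.1%
share_after = negative_share(9905, 2760)    # final dataset, about 21.8%

# Difference between the two negative counts, i.e. the negatives added on top
# of the 760 tagged sentences (implied by the figures, not stated directly).
added_negatives = 2760 - 760

# Accuracy of a trivial model that always predicts "positive" on the final
# dataset, about 78.2%. Accuracy on imbalanced data is best read against this.
majority_baseline = 1 - share_after
```

Even after the additional negatives, the dataset is still roughly four-to-one positive, so an accuracy figure alone says less than per-class precision and recall would.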

Since most of our negative examples came from SST2, we followed a continuous re-training approach to refine our classification model. For the first three months, we generated a weekly summary of all messages our model predicted as positive or negative. We reviewed this summary manually and identified False Positives (FPs). These FPs were then fed back to the model for re-training, which improved its accuracy to 97%.
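The weekly review loop could be sketched roughly as follows. Everything here is illustrative: the `Example` type, the function names, and the sample messages are hypothetical, and the sketch assumes that a "false positive" means a message the model flagged as negative that a reviewer judged to be positive. The article does not spell out that definition, and the actual re-training step is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: str  # "positive" or "negative"


# Assumption: the flagged class ("negative") is what the model is detecting.
FLAGGED = "negative"


def find_false_positives(predictions, reviewed):
    """Return wrongly flagged messages, paired with the reviewer's label."""
    corrections = []
    for text, predicted in predictions:
        actual = reviewed.get(text)
        if actual is None:
            continue  # message was not reviewed this week
        if predicted == FLAGGED and actual != FLAGGED:
            corrections.append(Example(text, actual))
    return corrections


def extend_training_set(train, corrections):
    """Append corrected examples, skipping texts already in the training set."""
    seen = {ex.text for ex in train}
    return train + [ex for ex in corrections if ex.text not in seen]


# Invented data standing in for one weekly summary and its manual review.
train = [Example("Please share the ETA for the report.", "negative")]
weekly_predictions = [
    ("Great work, thank you!", "negative"),        # wrongly flagged
    ("When will this be delivered?", "negative"),  # correctly flagged
    ("Looks good to me.", "positive"),
]
reviewed = {
    "Great work, thank you!": "positive",
    "When will this be delivered?": "negative",
    "Looks good to me.": "positive",
}

corrections = find_false_positives(weekly_predictions, reviewed)
train = extend_training_set(train, corrections)
```

The extended `train` list would then be passed to whatever training routine the model uses, once per review cycle.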

Contributed by
Akshay Kalra
EX Squared Solutions India Pvt. Ltd.

Akshay is a Data Scientist at EX Squared Solutions India Pvt. Ltd. In his spare time, he likes to learn new advancements in the Machine Learning industry. He is also a fan of Volleyball and Table Tennis.
