LLMs for data classification: How Scribble built SADL for achieving breakthrough accuracy

Raj Krishnan Vijayaraj

June 8, 2023

Modern-day organizations are generating vast amounts of data that hold immense potential for making informed decisions. However, with the ever-growing volume of data, the greater challenge lies in how these organizations can generate actionable insights. Data classification plays a vital role in addressing this challenge.

Until now, organizations have relied on traditional methods for data classification, including natural language processing (NLP) and machine learning (ML), but these methods have shown limited success.

In this blog, we explore the journey of implementing SADL (Scribble Automated Data Labeller), a configurable and scalable system designed to classify data fields accurately. We also discuss the limitations of traditional classification methods, the move to large language models (LLMs), and the role of OpenAI’s GPT-3 in achieving breakthrough accuracy.

Preparation and methodology

The project utilizes a diverse dataset obtained from publicly available sources, specifically 1,300 randomly selected datasets provided by the Government of the United States. The implementation process begins with data cleaning and exploratory data analysis. Over 300 generic features are extracted from the dataset to train the model. These features include textual and numerical attributes, such as a bag of words, histograms, percentage of special characters, and maximum/minimum length. Random Forest models are initially employed for classification tasks, but limitations in accuracy and contextual information lead to the exploration of transformer models. We discuss the details of the two methods that were employed in the following sections:

Method 1

Our implementation involved training two separate Random Forest models: one for textual records and another for numerical records.

Implementation

The following are the features extracted from the textual records:

  • The bag-of-words representation, reduced to 300 features using TruncatedSVD
  • The frequency of occurrence of each item, extracted and scaled down to 20 samples
  • Percentage of special characters
  • Percentage of digits
  • Percentage of alphabets
  • Maximum length
  • Minimum length
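
As a rough illustration, the character-level features in this list could be computed for a single column as shown below. This is a minimal sketch and not the SADL implementation: the function name `text_column_features` and the definition of a "special character" (anything that is not a letter, digit, or whitespace) are assumptions made for the example. The bag-of-words and frequency features are left out.

```python
def text_column_features(values):
    """Character-level summary features for one column of text values."""
    values = [str(v) for v in values]
    joined = "".join(values)
    total = len(joined) or 1  # avoid division by zero on an empty column

    digits = sum(ch.isdigit() for ch in joined)
    letters = sum(ch.isalpha() for ch in joined)
    # Assumption: "special" means not a letter, digit, or whitespace.
    special = sum(not (ch.isalnum() or ch.isspace()) for ch in joined)

    lengths = [len(v) for v in values] or [0]
    return {
        "pct_special": 100.0 * special / total,
        "pct_digits": 100.0 * digits / total,
        "pct_alphabets": 100.0 * letters / total,
        "max_length": max(lengths),
        "min_length": min(lengths),
    }
```

Percentages are taken over all characters in the column rather than per record, which is one of several reasonable choices.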

The following are the features extracted from numerical records:

  • Maximum Value
  • Minimum Value
  • Mean Value
  • Count
  • Standard Deviation
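
The numerical features can be sketched with the standard library alone. The function name `numeric_column_features` is hypothetical, and the choice of population standard deviation is an assumption, since the variant used is not stated here.

```python
import statistics


def numeric_column_features(values):
    """Summary statistics for one non-empty column of numeric values."""
    nums = [float(v) for v in values]
    return {
        "max": max(nums),
        "min": min(nums),
        "mean": statistics.fmean(nums),
        "count": len(nums),
        # Assumption: population standard deviation (divides by N).
        "std": statistics.pstdev(nums),
    }
```

An empty column raises an error in this sketch; a real pipeline would need to decide how to handle it.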

These features were used to train the two models, with a 30-70 test-train split.

Metrics

Accuracy – Exact Match: The predicted label is exactly the same as the given label

Accuracy – At Least 1 word: At least one word is common between the predicted label and the given label
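
Both metrics are simple to express in code. The sketch below assumes case-insensitive comparison and whitespace tokenization, neither of which is specified above. The function names are illustrative.

```python
def exact_match(predicted, actual):
    """True when the two labels are identical, ignoring case and outer spaces."""
    return predicted.strip().lower() == actual.strip().lower()


def at_least_one_word(predicted, actual):
    """True when the two labels share at least one whitespace-separated word."""
    return bool(set(predicted.lower().split()) & set(actual.lower().split()))


def accuracy(pairs, match_fn):
    """Percentage of (predicted, actual) pairs for which match_fn holds."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return 100.0 * sum(match_fn(p, a) for p, a in pairs) / len(pairs)
```

For example, a prediction of "birth year" against the label "Birth Rate" fails the exact-match test but passes the at-least-one-word test, which is why the second metric is always at least as high as the first.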

Results for Text Data

| Trees | Depth | Accuracy – Exact Match | Accuracy – At Least 1 word |
| --- | --- | --- | --- |
| 25 | 25 | 18.70% | 36.21% |
| 25 | 50 | 20.24% | 47.47% |
| 25 | 75 | 21.27% | 49.05% |

Results for Numerical Data

| Trees | Depth | Accuracy – Exact Match | Accuracy – At Least 1 word |
| --- | --- | --- | --- |
| 25 | 25 | 14.05% | 35.80% |
| 25 | 50 | 13.31% | 34.90% |
| 25 | 75 | 14.05% | 35.31% |

This approach produced modest results (up to 21% exact-match accuracy and 49% at-least-one-word accuracy on text data) and could have been improved by restricting predictions to a predetermined set of labels using business rules. However, it was limited in its ability to capture contextual information.

We realized that more complex models, trained on much larger and more diverse data, would be able to generalize better.

This is why we decided to use LLMs, which are based on transformer architecture.

Transformer model

The transformer model is a type of neural network architecture that was introduced in 2017 to solve the limitations of processing sequential data. It uses a self-attention mechanism [2] to process input data in parallel and has the ability to capture long-range dependencies and contextual relationships between words in a sentence.
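
To make the self-attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation of the transformer. It deliberately omits the learned projection matrices, multiple heads, and masking that real models use. The function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

Every query attends to every key in a single pass, which is what lets the model relate words that are far apart in a sentence.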

These models are very large, need large amounts of training data, and can cost millions of dollars to train. However, many pre-trained models are publicly available, either as open-source releases or through API services.

These large models are the state of the art in many NLP tasks like sentiment analysis or text classification.

The task given to the model during inference is completely determined by the prompt and parameters that you pass in through the API. The API will respond to your prompt and you can use a series of prompts to get your task done. The next section will share details on how we used such an API-based service to complete the task.

Method 2

We chose OpenAI’s GPT-3 as it is easy to use and is more flexible with the categories.

Implementation

Hyperparameters used for the experiment

  • Model: text-davinci-003
  • Temperature: 0.1
  • Max_tokens: 3500
  • Top_p: 1
  • Frequency_penalty: 0
  • Presence_penalty: 0
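
These settings can be written as a completion request payload. The sketch below only builds the payload and does not send it. The helper name `build_completion_request` is hypothetical, and the key names follow the OpenAI completions API, with `top_p` as the nucleus-sampling parameter.

```python
def build_completion_request(prompt):
    """Assemble a completion request body using the settings listed above."""
    return {
        "model": "text-davinci-003",
        "prompt": prompt,
        "temperature": 0.1,      # low temperature for near-deterministic labels
        "max_tokens": 3500,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0,
    }
```

A low temperature suits classification, where the same column label should receive the same answer on repeated calls.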

We followed a three-level hierarchy for classification. The top level is the domain, which describes the domain of the dataset as a whole (not the column label). We finalized the following domains after multiple iterations:

  • Technology
  • Government
  • Manufacturing
  • Finance
  • Healthcare
  • Retail
  • Business

The next level is the category, which describes the category of the column label. The following categories were selected after multiple iterations:

  • Date/Time
  • Education level
  • Events
  • Healthcare
  • Location
  • Organizations
  • Other
  • People
  • Products
  • Services

The third level is the sub-category, which describes the column label in finer detail. Sub-categories were tuned to fit the diversity within each category, so the list of sub-categories varies from category to category.
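
The first two levels of the hierarchy could be driven by prompts along the following lines. The domain and category lists come from this post, but the prompt wording, the function names, and the fallback to "Other" for unrecognized answers are assumptions made for illustration. The actual SADL prompts are not shown here.

```python
DOMAINS = ["Technology", "Government", "Manufacturing", "Finance",
           "Healthcare", "Retail", "Business"]
CATEGORIES = ["Date/Time", "Education level", "Events", "Healthcare", "Location",
              "Organizations", "Other", "People", "Products", "Services"]


def build_domain_prompt(dataset_title, column_labels):
    """Level 1: ask for the domain of the whole dataset (hypothetical wording)."""
    return (
        "Pick the single best domain for this dataset from: "
        + ", ".join(DOMAINS) + ".\n"
        + f"Dataset: {dataset_title}\n"
        + "Columns: " + ", ".join(column_labels) + "\nDomain:"
    )


def build_category_prompt(column_label, domain):
    """Level 2: ask for the category of one column label (hypothetical wording)."""
    return (
        f"The dataset domain is {domain}.\n"
        "Pick the single best category for this column label from: "
        + ", ".join(CATEGORIES) + ".\n"
        + f"Column label: {column_label}\nCategory:"
    )


def normalize_choice(answer, allowed, fallback=None):
    """Map a free-text model answer onto the allowed list, else return fallback."""
    cleaned = answer.strip().strip(".").lower()
    for option in allowed:
        if option.lower() == cleaned:
            return option
    return fallback
```

Checking each answer against the fixed list keeps the output inside the hierarchy even when the model replies with extra punctuation or an unexpected label. A third-level prompt would follow the same pattern with the sub-category list for the chosen category.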

Results

The experiments and trials showed an acceptability rate of over 90% for both the domain and the category tasks. The results were manually verified on a randomly selected subset of the dataset.

The following table summarizes the results for the domain:

| Domain | Acceptability |
| --- | --- |
| Technology | 80.95% |
| Government | 96.00% |
| Manufacturing | 87.10% |
| Finance | 100% |
| Healthcare | 96.04% |
| Retail | 94.85% |
| Business | 74.60% |
| Total | 91.08% |

The following table summarizes the results for the categories:

| Category | Acceptability |
| --- | --- |
| Date/Time | 95.56% |
| Events | 97.50% |
| Healthcare | 91.38% |
| Location | 94.25% |
| Organizations | 94.89% |
| Other | 91.43% |
| People | 92.92% |
| Products | 81.33% |
| Services | 78.95% |
| Education level | 100.00% |
| Total | 91.22% |

The following table gives an example of how the GPT-3 model can be used to determine the category, sub-category, and description for a column label.

| Label | Category | Sub Category | Description |
| --- | --- | --- | --- |
| MMWR Year | Date/Time | Year | Year of the Morbidity and Mortality Weekly Report |
| Birth Rate | Healthcare | Measurement | Number of live births per 1000 people |
| Jurisdiction | Location | Region | The authority of a legal body to exercise its power over a certain area. |

Comparison of Approaches:

Language models like GPT-3 excel at text completion and achieve high accuracy thanks to their extensive training data. However, their performance may degrade on narrowly specific queries. Fine-tuning an LLM on domain-specific data is one remedy, but it can be compute-intensive. Few-shot learning, in which the prompt includes a few worked examples of the task, is another viable approach: it lets the model pick up new concepts without additional training.
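
A few-shot prompt for column labelling could be assembled as follows. The three worked examples are taken from the example table in this post. The prompt layout and the function name `build_few_shot_prompt` are assumptions for illustration only.

```python
FEW_SHOT_EXAMPLES = [
    ("MMWR Year", "Date/Time", "Year"),
    ("Birth Rate", "Healthcare", "Measurement"),
    ("Jurisdiction", "Location", "Region"),
]


def build_few_shot_prompt(column_label, examples=FEW_SHOT_EXAMPLES):
    """Prefix the new label with worked examples so the model continues the pattern."""
    lines = ["Classify each column label into a category and sub-category.", ""]
    for label, category, sub_category in examples:
        lines += [f"Label: {label}",
                  f"Category: {category}",
                  f"Sub-category: {sub_category}",
                  ""]
    # The prompt ends mid-pattern; the model is expected to fill in the rest.
    lines += [f"Label: {column_label}", "Category:"]
    return "\n".join(lines)
```

Because the examples live in the prompt, changing the label scheme means editing a list rather than retraining a model.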

Conclusion:

In conclusion, our SADL project has successfully demonstrated the power of language model-based learning in data classification. OpenAI’s GPT-3 has showcased exceptional accuracy and the ability to generate descriptions for previously unknown data. Fine-tuning models and exploring LLMs available for download are promising avenues for future improvements. As the volume of available data continues to grow, innovative and efficient techniques like LLMs are essential for extracting meaningful insights. SADL represents one such example of leveraging advanced technologies to unlock the untapped potential of data.

References:

  • United States Government. Dataset Repository. [Online].
  • Chen, Z., et al. (2018). Generating Schema Labels for Data-centric Tasks. [Online].
  • Vaswani, A., et al. (2017). Attention Is All You Need. [Online].
