Resources / Blogs / Scaling Entity Matching at The Room with Scribble Enrich and Redis

Scaling Entity Matching at The Room with Scribble Enrich and Redis

The Room’s mission is to connect top talent from around the world to meaningful opportunities. Envisioned as a technology-driven, community-centric platform to help organizations quickly find high-quality, vetted talent at scale, The Room will host tens of millions of members in its system and have a worldwide presence. At the core of the technology challenge is scaling the mathematically difficult entity-matching problem. Each entity in the system—individuals, organizations, opportunities, and content—must be matched to other entities with high accuracy and relevance, context-sensitivity, and timeliness. The application has multiple instances of this problem.

After experimentation, The Room’s ML team settled on an approach using vector space embedding of raw text data describing entities, and a combination of vector similarity matching and evolving business logic to surface similar entities. When testing the implementation, it was discovered that the core vector similarity computation was driving the compute and memory needs of the algorithm. Using Redis’ high-performance key retrieval based on nearest neighbour vector lookup, the team was able to achieve more than 15 times improvement in the core similarity computation loop without any memory overhead. In turn, this created headroom to increase the complexity of the algorithm and also deliver it on the real-time path.

Overall system

The overall flow of data is shown in the architecture diagram in Figure 1. While fairly standard, the key challenge was to put the entire production system in place with a small team, tight timelines, and the need for flexibility.

Figure 1: Data flow around The Room Intelligence Platform (TRIP)

Scribble Data’s Enrich feature store and its apps provided the overall framework for implementation of The Room Intelligence Platform which is responsible for entity matching. Enrich handled integration with approximately 15 data sources, data quality, batch, and streaming-feature engineering including vectorization, as well as integration with in-memory database backends such as Redis, cataloging, and compliance. Enrich’s pipelining and orchestration was also used for modeling, and future versions of implementations will use a standalone model database such as MLFlow.

Challenges

Profiling of the early versions of the matching code for The Room’s entity matching engine indicated that more than 95% of the time consumed and 90% of memory was spent in the vector similarity computation. This resource intensiveness hinted at the need for compute distribution and cumbersome optimizations. The three main challenges were:

  1. Cost: The N2 growth in computation as entities grew linearly (hundreds of millions of  potential matches to score) implied switching to a distributed implementation—which is people and resource heavy.
  2. Messiness: The growing sophistication in the matching algorithm and optimizations would interact in complex ways.
  3. Realtime support: Matching an entity against all other potential candidate entities was taking too long to be acceptable for real-time generation of matches.

Furthermore, it was necessary to optimize for the following key constraints:

  1. Team Size: The ML team’s small size makes it limited in addressing other problems than the challenges above.
  2. Time Constraints: The business is evolving fast, and it is important to rapidly iterate by surfacing results and testing outputs.

Redis for vector similarity

Redis made available a private-preview version of their vector similarity functionality and, to demonstrate the capability, leveraged RedisGears to support key retrieval using high-performance vector similarity lookup. The interface was quite simple, involving storing and lookup using similarity matching, with straightforward integration. Cosine similarity and Euclidean distance metrics were supported to find k-nearest neighbors (k-NN) vectors stored in Redis given a query vector. Standalone tests showed that the top-k closest match retrieval time was sub-millisecond, and the duration is almost constant over a large range of vector counts. The current test implementation has some limitations such as only supporting fixed length of vectors and ability to match all vectors stored in Redis. The production version is expected to address these issues.

Entity matching and performance comparison

The entity matching problem at The Room was modeled as a k-nearest neighbour (k-NN) retrieval on vectors representing entities. The number of candidate vectors (N) is large and expected to grow rapidly. The number of dimensions of each vector is on the order of a few thousand. The testing only focused on the core similarity score computation portions in the application and the speedup reported is based on this.

Three approaches were tested:

  1. First, as a naive baseline, a Pandas dataframe apply (…) loop that effectively does a row-wise vector-vector similarity computation (Pandas Apply).
  2. Second, an optimized version of the similarity computation using the Python scikit-learn package’s spatial.distance.cdist(…) function (Python Vectorized).
  3. Third, the Redis Gears k-NN lookup (Redis).

For approaches (1) and (2) above, a key sort was performed on the scores to get the indices of the top-k similar candidate vectors. Various values of N and k were tested. All three approaches used the same dataset of vectors. The results are shown in Table 1.

Table 1: Speedup comparison using various methods for retrieving top-k closest vectors given a query vector

We see a consistent 15 times speedup in top-k vector retrieval performance. The larger the data size, the greater the speedup. Our hypothesis is that this speedup is due to elimination of Python and Pandas overhead in compute and memory, and that this speedup can be expected in production consistently.

Redis at The Room

There are a number of value propositions in using Redis’ functionality at The Room:

  1. Online feature store: Storage of features computed on the streaming path and also the batch path for fast lookup by clients.
  2. k-Nearest vector lookup: Redis’ high-performance key lookup based on vector similarity computation addressed a core scaling challenge at The Room.
  3. Building with Redis: The Room roadmap involves building social graphs and processing streams. Redis offers additional data models and other capabilities to core Redis, including search, graph, and programmable data processing, which reduces the need for managing the data flow across separate systems.

Next steps

Redis’ low latency and high-performance vector similarity computation, integrated with Scribble Data’s Enrich feature store, is being extended to address other problems. Beyond a datastore for serving online features, there are two near-term problems that Redis Enterprise is being evaluated for, including use of graphs to identify opportunities and streams to process real-time events. The Room’s own data intelligence platform benefits from this integration as it develops state-of-the-art low-latency data applications for its members and internal staff alike.

Leave a Reply

Your email address will not be published. Required fields are marked *

Related Blogs

January 17, 2025

The Rise of Domain-Specific Knowledge Agents: A Deep-Dive

TL;DR Agents have caught enterprise fancy with clear economics and by all predictions, demand for AI agent led IT and Business transformation is likely to be a multi-year journey. Output quality, accuracy, safety, and privacy are key differentiators and crucial for driving up the consumption. Agentization depends on availability of high quality domain-specific knowledge Curating, […]

Read More
November 4, 2024

The Future of Employee Benefits: Top Trends to Watch Out for in 2025

Imagine telling an insurance executive in the 1970s that, in the not-so-distant future, they would be crafting group benefit plans that include coverage for mindfulness app subscriptions, pet insurance, or even student loan repayment assistance. They might have chuckled at the absurdity—or marveled at the complexity. Yet here we are in 2024, navigating a landscape […]

Read More
October 4, 2024

Top Insurtech Trends for 2025 and Beyond

The insurance industry stands at a crossroads. The global protection gap, a measure of uninsured risk, looms large. By 2025, it will reach $1.86 trillion. This is not just a number. It represents real people and businesses exposed to financial ruin. The old models of insurance are failing to keep pace with a rapidly changing […]

Read More