
The Rise of Domain-Specific Knowledge Agents: A Deep-Dive

TL;DR

  1. Agents have captured enterprise attention with clear economics, and by all predictions, demand for AI agent-led IT and business transformation is likely to be a multi-year journey.
  2. Output quality, accuracy, safety, and privacy are key differentiators and crucial for driving up consumption. Agentization depends on the availability of high-quality domain-specific knowledge.
  3. Curating, maintaining, and providing this knowledge to agents is a complex and continuous activity.
  4. Anticipate the development and offering of sophisticated knowledge agents.

Domain-Specific Knowledge Agents are a class of agents whose function is to build and maintain domain- and task-specific knowledge bases and deliver them to consuming functional agents, training agents, and humans. These knowledge agents enable three crucial organizational functions in the agentic future: training and evaluation of domain models, building internal knowledge-driven applications, and driving stakeholder engagement and network building. We are already seeing early versions of knowledge agents working with functional agents to deliver end-to-end services.

We look at knowledge agents and knowledge bases specifically in this post. While they work as part of a system of agents, they are themselves specialized, with internal structure and goals. Their implementation and economics make them different from functional agents and merit a separate discussion. We have written elsewhere about how they fit into the system of agents.

About Knowledge Bases

Knowledge itself is somewhat of an ambiguous word. At what point does data become useful knowledge? That is an academic question. What is more important is that we will have data in various forms of detail, abstraction, and quality, and that it will become available to the consuming entities. While humans require limited data to make accurate decisions, Gen AI systems are expected to be voracious consumers of data until new, data-efficient conceptual mechanisms are developed.

The consumption can be of very low-level datasets, or of processed data that looks closer to insights. The consumers will be both machines and humans, and often a combination of the two. For example, when training models, the consumer is purely a machine. It expects data in a structured form and can consume near-infinite amounts of input data. Humans, on the other hand, might consume data in the form of summarized reports, prepared LinkedIn posts, and similar outputs.

In all of these cases, organizations are becoming increasingly knowledge-heavy and knowledge-centric. They rely on knowledge to decide which product to deliver, how to deliver it, and even in the actual act of delivering the product. Everything is now triggered, mediated, and enabled by knowledge:

Knowledge Dimension | Example (Acme Power Inc)
Org/Context Specificity | Power plant design
Continuously Evolving | New materials and thermodynamics research published every day
Integration Support | Drive test-time compute and fine-tuning of plant design models; feed simulators and power plant applications
Internal + External Sourcing | Plant RFPs, IoT plant data

 

This trend itself is not new. It’s not that we didn’t have systems and processes to leverage knowledge in the past. We know that the knowledge intensity of industries is growing across the board, and this has been a historical process for several decades now.

What, in my opinion, is striking today is the scale at which this is happening, the number of industries that will be impacted, the scope of knowledge—shifting from financial reporting and bespoke products to covering every single function at all levels within organizations—and, finally, the speed at which all of this is unfolding.

Driving up the quality, accuracy, and reliability that are critical to adoption requires extensive, high-quality knowledge. It matters both for building domain-specific models and for the dynamic context that accompanies the metaprompt. Therefore, the data-processing pipelines, wrapped in agentic applications, that bring data to these important downstream applications will become a very significant activity. It is our assessment that, twelve to twenty-four months from now, knowledge base curation, development, management, and maintenance will become a standalone, full-time activity within organizations.

Centrality of Knowledge Bases in the Agentic Future

LLMs already have the world's knowledge, but their process of collecting, storing, and retrieving that knowledge is very lossy. We struggle to build applications like designing a power plant with LLMs because such a design must meet many accuracy, technical feasibility, usability, relevance, legality, and other real-world considerations. The LLM does not actually understand the high threshold that must be met for these real-world applications. LLMs are like untrained human minds: just as we cannot wake up one day and design a power plant, an LLM cannot perform these complex tasks with high accuracy without training. It has the potential, but it lacks the data and the required reasoning process. So domain-specific knowledge bases, meaning power-plant-specific and industry-specific knowledge in this case, are central to designing the power plant agent.

Let's detail what we need in order to design a new power plant:

  1. [Tools] Sufficient understanding of all blueprints and technical specifications
  2. [Model] Ability to model implementation tradeoffs, experiences, preferences, and implementation journeys
  3. [Summarization] Ability to understand the goals and market for power plants
  4. [Extraction] Ability to understand and extract requirements from an opportunity document
  5. [Model] Ability to design a power plant to a given set of constraints, the experiences of the organization, and the availability of partners/materials
  6. [Model] Ability to plan implementation trajectories and to model costs and risks using mathematical and simulation techniques
  7. [Process] Ability to explain the design and coordinate across the organization to get appropriate approvals
  8. [Process] Ability to generate and submit a proposal with specifications, costing, and signoffs
  9. [Process] Ability to create appropriate project coordination, resource allocation, and tracking mechanisms

Generalizing from there, LLMs are missing important knowledge along a few dimensions:

Knowledge Dimension | Detail
Task-specific, Technical | Tools, methods, and details specific to the task at hand, including configurations, interpretations, and representations
Task-specific, Consolidated | Methods that organize everything we know about the task within and across organizations, and over time
Domain-specific, Synthetic | Simulations and other developed sources that help us reason about a situation and its outcome, such as a war, a bid, or a power plant design
Organization-specific, Private | Private sources that are organization- or individual-specific, such as past bids for construction
Experiential, Realtime | Relevant ideas and experiments, such as conference presentations and papers
Spatial, Differentiated | Sources that indicate how non-English speakers across geographies think about the design, given their unique experiences
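
To make this taxonomy concrete, below is a minimal Python sketch (all names are hypothetical) of how curated knowledge items might be tagged along these dimensions so that downstream agents and pipelines can filter on them:

```python
from dataclasses import dataclass, field
from enum import Enum


class KnowledgeDimension(Enum):
    """Dimensions along which LLMs typically lack knowledge (per the table above)."""
    TASK_TECHNICAL = "task-specific, technical"
    TASK_CONSOLIDATED = "task-specific, consolidated"
    DOMAIN_SYNTHETIC = "domain-specific, synthetic"
    ORG_PRIVATE = "organization-specific, private"
    EXPERIENTIAL_REALTIME = "experiential, realtime"
    SPATIAL_DIFFERENTIATED = "spatial, differentiated"


@dataclass
class KnowledgeItem:
    """A single curated item, tagged with the dimensions it covers."""
    content: str
    source: str
    dimensions: set[KnowledgeDimension] = field(default_factory=set)


# Example: a past construction bid is private, organization-specific knowledge.
bid = KnowledgeItem(
    content="2019 bid for a 400 MW combined-cycle plant",
    source="internal/bids/2019",
    dimensions={KnowledgeDimension.ORG_PRIVATE, KnowledgeDimension.TASK_CONSOLIDATED},
)
```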

We also have second-order knowledge needs related to decisioning. First, imagine a world where every power plant designer has a power plant designing agent. The economics and dynamics of such a market are very different: intensely competitive. Given the agents' ability to compute in real time and their almost infinite memory, what should the agents do? We have the problem of determining the goals and use cases to which this agent can be effectively applied. Second, these goals will evolve with time. We have the problem of determining how these goals should evolve and why. We need market intelligence on what is happening around us, the moves made by our peers, customers, users, regulators, and so on, so that agents can adapt. Third, if we look at a slightly longer time frame, say 12-24 months, we have the problem of determining the capabilities we should build into the agent. Should it be pre-computing and pre-designing power plants and storing them, or should it be doing so on the fly? Should it invest in understanding materials research and change the production processes? The product roadmap has to be developed, maintained, and applied to effectively steer the functionality.

We have knowledge needs related to the organization as well. The organization needs to engage with the ecosystem: partners, regulators, the legal community, policymakers, employees, customers, suppliers, and so on. There is the problem of continuously messaging our ideas, approaches, and strategies to all the stakeholders with whom we have to collaborate in order to accomplish our objectives. Presumably, they will have direct and indirect feedback that also has to be incorporated. We have to discover and build our network. New partners are coming on board, and we need to incorporate them as well. We need to power this engagement system, too, in the agentic future.

If you look at the common thread across all of these different aspects, whether it is the modeling subsystem, the decisioning subsystem, or the engagement subsystem, all of them are going to be powered by a knowledge base that is real-time, continuously evolving, both internal and external, and specific to the context you are dealing with.

Structure of a Knowledge Agent System

A Knowledge Agent System fulfills this need by acquiring, transforming, and leveraging data from diverse sources to support complex organizational tasks. We break the system down into its key subsystems: Ingestion, Transformation, Knowledge Base, and the Policy and Orchestration Engine. Each section below outlines the subsystem's taxonomy, key features, and areas for further development.

1. Ingestion System

The ingestion system is responsible for discovering, acquiring, and organizing raw data from diverse sources. Its primary goal is to ensure data integrity and readiness for transformation. It looks similar to the crawling systems we have today, with some changes related to agents and agentic applications:

Feature | Description
Discovery | Continuous discovery of sources via meta-crawlers and iterative learning from existing data sources
Sources | Support for diverse sources, including internal, external, synthetic, and other models
Scoring | Scoring the sources and data for relevance, trust, and quality
Near-Real Time | The most value accrues when sources are acquired in near-real time
Differentiation | Unique and non-public sources drive differentiation, so acquisition techniques should enable them
Flexibility | Sources can be internal or external, and can be documents, APIs, datasets, or databases
Typing | Handling unstructured text, images, and conversational content such as Reddit and YouTube comments
Resolution | Resolving ambiguity in ideas, concepts, and authorship; deduplication of content
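
As a rough illustration of the scoring and flexibility features above, here is a minimal Python sketch, not a reference implementation, of a scored source record and the kind of relevance/trust gate an ingestion system might apply; all names and thresholds are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class SourceRecord:
    """A discovered source, scored before it is handed to the transformation DAG."""
    uri: str
    kind: str                 # "document" | "api" | "dataset" | "database"
    internal: bool            # internal vs external source
    relevance: float          # 0..1, how relevant to the domain/task
    trust: float              # 0..1, lineage and quality of the source
    acquired_at: datetime


def should_ingest(record: SourceRecord, min_relevance: float = 0.5, min_trust: float = 0.6) -> bool:
    """Simple gate combining the scoring thresholds set by the policy engine."""
    return record.relevance >= min_relevance and record.trust >= min_trust


reddit_thread = SourceRecord(
    uri="https://www.reddit.com/r/energy/comments/example",
    kind="document",
    internal=False,
    relevance=0.7,
    trust=0.4,
    acquired_at=datetime.now(timezone.utc),
)
print(should_ingest(reddit_thread))  # False: trust is below the policy threshold
```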

 

Open Areas of Work:

  • Developing advanced meta-crawlers for real-time source discovery.
  • Enhancing scoring algorithms for novelty, relevance, trust, and quality in conjunction with the policy engine.
  • Addressing challenges in ambiguous and conversational content.

2. Transformation System

The transformation system processes raw data to generate a trusted, scored knowledge base. It bridges content, structure, and trust gaps between input and output.

It is a DAG that takes input to output. A few complexities make it different from existing structured DAG pipelines:

  1. Nature of the DAG and how it is constructed 
  2. Nature of the DAG nodes – simple, complex, and meta
  3. Special purpose Eval nodes

DAG Type | Description & Example
Rigid | Similar to existing structured pipelines with predictable nodes and edges
Semi-Flexible/Adaptive | Adaptive within constraints such as the types of nodes and edges, their number, and datasets
Dynamic | Constructed dynamically using planning and dynamic code generation for the nodes
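
The difference between the rigid and dynamic ends of this spectrum can be sketched with a toy adjacency structure; the node names and planner logic below are purely illustrative:

```python
# A rigid DAG is declared up front as a fixed mapping of node -> upstream dependencies.
rigid_dag = {
    "extract_text": [],
    "summarize": ["extract_text"],
    "eval_quality": ["summarize"],
}


def plan_nodes(source_kind: str) -> dict[str, list[str]]:
    """Planner step of a dynamic DAG: choose transformation nodes per source type."""
    if source_kind == "video":
        return {"transcribe": [], "summarize": ["transcribe"], "eval_quality": ["summarize"]}
    return {"extract_text": [], "summarize": ["extract_text"], "eval_quality": ["summarize"]}


# The dynamic DAG is constructed at run time, once the source is known.
dynamic_dag = plan_nodes("video")
```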

The nodes within the DAGs can vary depending on the context:

Node Type | Description | Example
Compute | Imperative programmatic nodes with Python code; rigid in input, output, and logic | Simple estimator of content statistics
LLM Simple Compute | Probabilistic nodes that trigger calls to LLMs and tools to complete the goal | Summarization of content or object detection
LLM Complex Compute | Probabilistic nodes that trigger multiple calls and use multiple strategies to accomplish a goal | Monte Carlo search node or something that simulates test-time compute
LLM Planner | Meta-level node in which the LLM adds nodes and sub-DAGs to the existing DAG dynamically | Handle new, unseen sources and dynamically determine the required output and associated transformation
LLM Eval | Evaluates the quality of the output; also a meta-node that constructs an inner DAG depending on the nature of the eval to be performed | Node that applies multiple scoring strategies, including self-eval, Mechanical Turk, code evaluation, and other techniques
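
One way to picture the simpler node types is a common interface with an imperative compute node and an LLM-backed node. The sketch below is an illustration under assumptions, not an actual implementation; the class names and the stand-in fake_llm callable are invented for the example:

```python
from typing import Callable


class Node:
    """Base transformation node: takes text in, returns text out."""
    def run(self, text: str) -> str:
        raise NotImplementedError


class ComputeNode(Node):
    """Imperative node: deterministic Python logic, e.g. simple content statistics."""
    def run(self, text: str) -> str:
        return f"tokens={len(text.split())}"


class LLMNode(Node):
    """Probabilistic node: delegates the goal (e.g. summarization) to an LLM call."""
    def __init__(self, prompt: str, llm: Callable[[str], str]):
        self.prompt = prompt
        self.llm = llm  # any callable that sends a prompt to a model

    def run(self, text: str) -> str:
        return self.llm(f"{self.prompt}\n\n{text}")


# Usage with a stand-in LLM callable; a real system would call a model API here.
fake_llm = lambda prompt: "summary: ..."
nodes = [ComputeNode(), LLMNode("Summarize the following content.", fake_llm)]
outputs = [n.run("Raw ingested content about turbine materials.") for n in nodes]
```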

 

Eval nodes are special nodes that have further structure and taxonomy:

Eval Node Type | Description | Example
Compute | Create a sandbox, run the code, and evaluate the output | Evaluate an algorithm implementation from a paper
Self-Reflection | Generate and use the model's own feedback | Evaluate the quality of a LinkedIn post
Verifier | Use third-party services, manual and automated, to evaluate the output | Use math solvers and expression evaluation
Explainer | Use third-party LLMs to infer causal relationships and provide explanations for the input data | Provide justification text for a given output
Human | Sample and distribute output to humans for evaluation | Sample of textual content for quality checking
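
For instance, a self-reflection eval node could be as small as the following sketch; the prompt, the 0-10 scale, and the normalization are illustrative assumptions:

```python
def self_reflection_eval(draft: str, llm) -> float:
    """Self-reflection eval node (sketch): ask the model to critique its own output
    and return a 0..1 quality score. `llm` is any prompt -> text callable."""
    critique = llm(
        "Score the following LinkedIn post from 0 to 10 for clarity and accuracy, "
        f"and reply with only the number:\n\n{draft}"
    )
    try:
        return min(max(float(critique.strip()) / 10.0, 0.0), 1.0)
    except ValueError:
        return 0.0  # unparseable critique -> treat as a failed evaluation
```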

A lot depends on the context, and how dynamic and complex the knowledge agent system gets. It can start with fixed DAGs with simple compute nodes, and then evolve to handle more complexity.

Open Areas of Work:

  • Enhancing meta-nodes to construct DAGs dynamically.
  • Expanding eval node capabilities with human verification and self-evaluation.
  • Optimizing DAG design for efficiency and adaptability.

3. Trusted Knowledge Base

The knowledge base is the repository of verified, high-quality data that supports organizational tasks. There are several design dimensions of this knowledge base:

Feature | Description | Example
Representation | Tables, graphs, blobs, text documents, JSON, and other structured formats | Knowledge graph built from the input data
Deduplication | Content-, trust-, or usage-based reduction of the content | Collapsing graph nodes based on similarity metrics
Trust Scoring | Assess the lineage and quality of datasets to assign scores to individual items and sources | Suppress nodes from sources that are salesy in nature
Access | Provide APIs and search interfaces to surface indexed information efficiently | Write to a Neo4j database
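
The deduplication dimension can be illustrated with a toy word-level Jaccard similarity check that collapses near-duplicate entries before they reach the store; the helper names and the 0.8 threshold are assumptions:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two pieces of content."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def deduplicate(items: list[str], threshold: float = 0.8) -> list[str]:
    """Collapse near-duplicate entries before they are written to the knowledge base."""
    kept: list[str] = []
    for item in items:
        if all(jaccard(item, existing) < threshold for existing in kept):
            kept.append(item)
    return kept


docs = ["Gas turbine blade coatings for high temperatures",
        "Gas turbine blade coatings for very high temperatures",
        "IoT sensor data schema for plant telemetry"]
print(deduplicate(docs))  # collapses the two near-identical coating entries
```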

Open Areas of Work:

  • Developing indexing systems for diverse data types.
  • Building robust deduplication algorithms.
  • Enhancing access mechanisms for usability and integration.

4. Policy and Orchestration Engine

This engine ensures coherence across subsystems while optimizing resource allocation and enforcing policies. Every subsystem operates in a large design space and has many choices, including economic ones, to make. A combination of rules, learnt strategies, and resource constraints is used to control the outcomes of the various subsystems.

Feature | Description | Example
Trust Scoring | Not all knowledge is equal; source, detail, verifiability, and consistency with other sources matter | Scoring whether a YouTube channel is trustworthy
Resource Management | Determine the timing, policy, and limits of compute-intensive activities such as MCTS | Limit verification based on trust and goal
Fine-Tuned Models | Tasks such as scoring or planning can use task- and domain-specific models to reduce the time taken | Use verifier output to train a model to predict the verifier outcome
The knowledge model enables proactive, self-learning capabilities, essential for task-specific and high-accuracy outputs.

Feature | Description | Example
Task-Specific Models | Learn about sources, disambiguate data, and improve accuracy for specific tasks | Scoring whether a YouTube video was sponsored and salesy
Self-Learning | Incorporate feedback from transformation and ingestion outcomes | Rate a particular speaker or venue higher
Autonomous/Agentic Operations | Orchestrate ingestion, transformation, and feedback loops for iterative learning | Find new sources and ideas from the existing stream
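
As a hedged sketch of the self-learning loop, expensive verifier outcomes can be distilled into a cheap task-specific classifier (here using scikit-learn, assuming it is available) so the verifier is called less often; the data and labels below are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Use past verifier outcomes as labels to train a cheap task-specific model that
# predicts whether a source is promotional ("salesy"). Data here is illustrative.
texts = ["Buy our turbine upgrade kit today, limited offer!",
         "Measured creep behaviour of nickel superalloys at 900C",
         "Our platform is the #1 choice for plant operators, sign up now",
         "Thermodynamic analysis of combined-cycle efficiency limits"]
verifier_labels = [1, 0, 1, 0]   # 1 = flagged as salesy by the expensive verifier

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, verifier_labels)
print(model.predict(["Exclusive discount on monitoring software"]))
```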

Open Areas of Work:

  • Developing dynamic policies for efficient resource allocation.
  • Automating coordination between modules.
  • Expanding exclusion mechanisms for legal and ethical compliance.
  • Building task-specific models with high accuracy.
  • Enhancing feedback loops for continuous improvement.
  • Integrating real-time learning datasets.

Conclusion

Building a Knowledge Agent System is a complex but essential task for organizations aiming to thrive in a data-rich environment. Each subsystem plays a critical role, from ingestion to transformation, storage, and policy orchestration. While the architecture is robust, there remain significant open areas of research and development, particularly in self-learning, optimization, and dynamic policy enforcement. Addressing these challenges will pave the way for truly proactive, domain-specific knowledge systems.

 
