In the first article of this series, Semantic Kernel 101 in the January/February 2024 issue of CODE Magazine, I gave an overview of the concepts of Semantic Kernel (SK), Microsoft's framework for working with Large Language Models (LLMs). In the second article, Semantic Kernel 101: Part 2 in the March/April 2024 issue, I walked through hands-on examples of coding the basics with Semantic Kernel (SK). In this article, I'll create a truly useful real-world Copilot, walking through the evolution of my first professional AI application. I'll start at the beginning with the first version, showing every step our team took to get to a good result. Then I'll show you how, only a few months later, it's evolved into a much simpler and more effective app.

In the latest version, we allow the LLM to identify and call functions on its own without any of the step-by-step procedural code we used in the original version. I'll show you how we incorporated a very basic Retrieval Augmented Generation (RAG) pattern in the original version to ground the LLM's responses with customer data, and how the new version delegates even that work to the LLM.

The Original Implementation

The first time I was tasked with creating a “Copilot” system, I thought I had a pretty good understanding of LLMs and all the related technologies I'd need. In fact, I did have a pretty good conceptual understanding, but, like everyone else, neither I nor my team had much practical experience because LLMs were brand new. The goal of our new Copilot was to answer questions about a specified contract without the user being a legal expert or having to comb through the contract to find the answer. The user could just point to a contract and start asking questions in natural language. The first step would be to inject the entire text of a contract into the prompt, to give the LLM context.

We created one simple prompt, and we were well on our way. Victory!

This worked surprisingly well, and with little effort our Copilot could answer a lot of questions. We created one simple prompt, and we were well on our way. Victory!

Based on the following:
#############
{{$contract}}
#############
Answer the question:
{{$input}}
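
To make that concrete, here's a minimal sketch (not our production code) of running a template like this with SK. The contract file name and question are placeholders for illustration, the kernel is assumed to be built the same way as in the listings later in this article, and SK substitutes the KernelArguments values into the {{$contract}} and {{$input}} placeholders:

var contractText = File.ReadAllText("SampleContract1.txt"); // hypothetical file

var prompt =
    "Based on the following:\n" +
    "#############\n" +
    "{{$contract}}\n" +
    "#############\n" +
    "Answer the question:\n" +
    "{{$input}}";

// SK fills the {{$...}} placeholders from the KernelArguments before calling the LLM
var answer = await kernel.InvokePromptAsync(prompt, new KernelArguments
{
    ["contract"] = contractText,
    ["input"] = "When does the rental term end?"
});

Console.WriteLine(answer);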

LLMs were shiny and new and seemed nothing short of magic, but they're incapable of doing anything except generating text based on public data, mainly obtained from the internet, up to a certain date. By inserting the text of the contract into our prompt, we could extend the magic of the LLM to include the text of our contract, but only to a point. Our first real challenge came when we realized that there were still a lot of questions our Copilot couldn't answer. They seemed like simple questions, but they caused the Copilot to stumble. For instance, if a user asked whether a contract had expired, the LLM couldn't answer because the LLM wasn't trained on what today's date is. To solve this, we retrieved this additional data in code and injected it into the prompt:

{{$dateandtime}}
Based on the following:
#############
{{$contract}}
#############
Answer the question:
{{$input}}

We developed intricate ways of determining whether the current date and time might be required to answer a question so we could retrieve it and add it to the prompt, but we found it difficult to get right. We looked for keywords and phrases in the question, but the approach was error prone. It turned out that, at the time, the most successful way to determine whether a piece of information was needed was to ask the LLM with a prompt like this one:

Based on the following:
#############
{{$input}}
#############
Would it help to know the current date 
or time to answer the question?
Answer only with a single word YES or NO. 
Do not expand on your answer.

This approach worked pretty well, because the LLM was now provided with the current date and time when it was relevant to the user's question. If it wasn't relevant, the {{$dateandtime}} placeholder was set to an empty string and didn't affect the prompt. Of course, we hand-coded all of this and we now had to make two calls to the LLM for every question. In fact, we'd have to hand-code something similar and make an additional call for every piece of information that might be useful in answering a user's question. For instance, what's the user's current location? We could use Location Services in Windows to retrieve the user's location, but we had to ask the LLM if the location was helpful for each question.

For every bit of information, we created a prompt to ask the LLM if it was relevant to the current question and sent it. If the info was useful, we'd retrieve the data and merge it into the prompt. At one point, we thought it couldn't hurt to just always inject all this information into every prompt so we could skip all those preliminary calls, but it turned out that, aside from being slow, including a lot of extraneous information often degraded the quality of the response. The code quickly became quite ugly, but it worked! After the text of the contract and all the potentially useful information was added to the prompt, the prompt became very large, which made processing the prompt slow. In some cases, the prompt became too large to run at all, and sometimes it became quite convoluted, which made it more difficult to get good answers from the LLM. But it mostly worked!

The size of the larger contracts was a problem we had to solve. Even though the system was built to handle only one contract at a time, it could only handle contracts stored as plain text up to a certain size. One of the sample contracts our testers provided to us after our early successes was a 630-page PDF, and we realized that we had two problems. First, we had no way of inserting a PDF (a binary format) into a textual prompt template, and second, the length of the text was way too long. Even if we could extract the text from the PDF to insert it into the prompt, it put us way over the token limit allowed by the LLM.

We solved the first problem by using a free, open-source utility named PdfPig by UglyToad. I'll save the details of this for another article, as this is a pre-processing step and doesn't directly involve SK. Once we had the raw text, we needed to break it up into chunks small enough that we could stuff one or two of the most relevant sections of the contract into the prompt without hitting the token limit for the LLM.
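
The details will have to wait for that article, but to give you a rough idea, here's a minimal sketch of extracting the raw text of each page with PdfPig (the file name is hypothetical, and this uses PdfPig's standard PdfDocument API):

using System.Text;
using UglyToad.PdfPig;

// Minimal sketch: extract the raw text of every page of a PDF with PdfPig
var sb = new StringBuilder();
using (var document = PdfDocument.Open("SampleContract1.pdf")) // hypothetical file
{
    foreach (var page in document.GetPages())
    {
        sb.AppendLine(page.Text);
    }
}
var rawText = sb.ToString();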

We found that just blindly cutting a document into sections of a certain length didn't work very well. Imagine if we created the following chunks: “...MikeYeager.com herein kno” and “wn as CLIENT…”. We would never be able to determine that the company MikeYeager.com was also referred to as CLIENT in the contract, because we cut that thought in half. Chunkers need to be smart enough to respect paragraph and sentence boundaries when deciding where to start and end each chunk, so each chunk captures complete thoughts. We also found that it helps to include a little overlap from the previous and next chunk for the same reason. It's trickier than you might think, but luckily, a lot of people have already solved this problem well and we didn't have to code this from scratch.

We solved the second problem by incorporating the TextChunker class included with Semantic Kernel. By the time SK V1 was released, the TextChunker was marked for evaluation purposes only and it looks like the team may drop it in the future. Because chunking is also a pre-processing step and doesn't directly involve SK (the primary reason we suspect it will be dropped), I'll also cover that in the future article I mentioned.
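
I'll cover the details in that future article too, but as a rough illustration, here's a minimal sketch of how TextChunker can split raw text into overlapping, paragraph-aware chunks and save them to disk for the search code in Listing 1. The token sizes and overlap are illustrative values, not the ones we settled on, and because TextChunker is experimental you may need to suppress its SKEXP diagnostic (the exact ID depends on your SK version):

using Microsoft.SemanticKernel.Text;

#pragma warning disable SKEXP0050 // TextChunker is experimental; the ID may vary by SK version

// rawText holds the plain text extracted from the PDF (see the PdfPig sketch above).
// Split it into lines, then into overlapping, paragraph-aware chunks.
List<string> lines = TextChunker.SplitPlainTextLines(rawText, maxTokensPerLine: 128);
List<string> chunks = TextChunker.SplitPlainTextParagraphs(
    lines, maxTokensPerParagraph: 1024, overlapTokens: 128);

// Save each chunk to disk for the keyword search in Listing 1
// (the 4-digit suffix matches the file-name pattern used there)
for (var i = 0; i < chunks.Count; i++)
{
    File.WriteAllText(
        Path.Combine("SampleDocuments", $"SampleContract1_{i:0000}.txt"),
        chunks[i]);
}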

After creating a new text extraction and chunking utility, we could run a pre-processing step to save the plain text chunks of all the contracts to disk. Now, when the user asked a question, we first asked the LLM to extract search terms from the question with a prompt like this one:

Extract search terms suitable for a keyword 
search from the following. 
Separate each term with a comma.
###################
{{$input}}
###################

Then we wrote C# code to search the text files from a particular contract for those keywords. We took the top one or two matches and injected the text of those chunks into the prompt instead of injecting the entire contract. The RAG code to find relevant chunks of text for a contract can be found in Listing 1.

Listing 1: RAG code to find relevant chunks of the specified contract.

public string FindRelevantChunks(string contractFileName, string keyWords)
{
    var results = new List<ChunkInfo>();

    var keywords = keyWords.Split(',');

    var justStem = Path.GetFileNameWithoutExtension(contractFileName);
    var i = 0;
    var potentialFileName = Path.Combine("SampleDocuments",
      $"{justStem}_{i.ToString().PadLeft(4, '0')}.txt");
    while (File.Exists(potentialFileName))
    {
        var fileContents = File.ReadAllText(potentialFileName);

        // Count how many times the keywords appear in this chunk
        var keywordCount = 0;
        foreach (var keyword in keywords)
          keywordCount += Regex.Matches(fileContents, Regex.Escape(keyword.Trim()),
            RegexOptions.IgnoreCase).Count;

        results.Add(new ChunkInfo
        { Text = fileContents, KeywordCount = keywordCount });

        i++;
        potentialFileName = Path.Combine("SampleDocuments",
          $"{justStem}_{i.ToString().PadLeft(4, '0')}.txt");
    }

    // Return the text of the top two chunks with the most keyword matches
    return string.Join(Environment.NewLine,
      results.OrderByDescending(r => r.KeywordCount)
             .Take(2)
             .Select(r => r.Text));
}

With all the pieces in place, the Copilot could now answer most questions, even if it was a little bit slow. For example, if the user asked, "Is the contract currently in effect?", the final prompt sent to the LLM might look like Listing 2. The final original implementation with all these steps can be seen in Listing 3.

Listing 2: The final prompt sent to the LLM in the original Implementation.

The current date and time is 3/27/2024 10:18AM
Based on the following: 
#############
Address: 1234 Oak St.
City: Houston 
State: TX
ZIP: 77333
Apartment Number: 1235

Term of the Agreement:
The rental term will commence on 02/14/2020 and 
terminate on 02/13/2021. This Agreement is valid 
for a period of twelve months.

Rent Payment:
a. The monthly rent for the Apartment is set at 
$799 and is payable on or before the 1st of each 
month. The first payment will be due on 02/14/2020.
b. Rent payments should be made in check 
payable to the Landlord at the address mentioned 
above or as otherwise instructed by the Landlord.

Security Deposit:
a. The Tenant(s) shall provide a security deposit of 
$500 on or before the commencement of this 
Agreement.
b. The security deposit will be held by the 
Landlord as security against any damage, 
unpaid rent, or other charges owed by the Tenant(s)
under this Agreement.
c. The security deposit will be returned within 10 
days after the termination of this Agreement, 
subject to deductions for any unpaid rent or 
damages as outlined in Section 5.
#############
Answer the question:
Is the contract currently in effect?

Listing 3: Complete original version.

private static async Task AskQuestionsAboutContract1()
{
    var builder = Kernel.CreateBuilder();

    builder.Services
        //.AddLogging(configure => configure.AddConsole())
        .AddAzureOpenAIChatCompletion("gpt-35-turbo", _endpoint, _apiKey);

    var kernel = builder.Build();

    var semanticFunctions = kernel.ImportPluginFromPromptDirectory("Semantic");

    //TODO: Add UI to have user select the contract they want 
    //to work with and enter questions they want to ask... 
    var currentContract = "SampleContract1";
    var question = "Is the contract currently in effect?";

    var needsDateTimeResult = await 
      kernel.InvokeAsync(semanticFunctions["NeedsCurrentDateAndTime"],
        new KernelArguments { ["input"] = question });

    Console.WriteLine($"Needs Date and/or Time? {needsDateTimeResult}");

    var dateTime = string.Compare(needsDateTimeResult.ToString(), "YES",
      StringComparison.OrdinalIgnoreCase) == 0
        ? $"The current date and time is {DateTime.Now}" : string.Empty;

    Console.WriteLine($"dateTime parameter: {dateTime}");

    var searchTerms = await 
      kernel.InvokeAsync(semanticFunctions["ExtractSearchTerms"],
        new KernelArguments { ["input"] = question });

    Console.WriteLine($"Search Terms: {searchTerms}");

    var chunkSearcher = new SearchChunks();
    var relevantChunks = chunkSearcher.FindRelevantChunks(currentContract, 
      searchTerms.ToString());

    Console.WriteLine($"Relevant Chunks: {relevantChunks}");

    var result = await 
      kernel.InvokeAsync(semanticFunctions["SearchContract"],
        new KernelArguments
      {
          ["input"] = question,
          ["dateandtime"] = dateTime,
          ["contract"] = relevantChunks
      });

    Console.WriteLine();
    Console.WriteLine($"Working with contract: {currentContract}");
    Console.WriteLine(question);
    Console.WriteLine(result);
}

An Updated Approach

While writing the original version, we thought we could later use planners to reduce the number of calls we were making and the amount of hand coding we were doing. Planners are an SK concept that asks the LLM to create a set of steps to be executed to accomplish a complex task. Based on the user's question, the planner would examine both the native and semantic functions available to it and come up with a set of steps, which we could inspect and even interact with. SK then executed the steps, passing the results of one step to the next. It was a pretty good idea, and we had some success with it, but our attempts to use planners at the time found them to be error prone, and it was a lot of work to make relatively simple decisions. What we had hand-coded worked well enough and was faster, so we went with that.

Recently, my team was given the chance to re-evaluate our solution and see if we could use new advances, such as GPT-4 and automatic function calling, to improve on our Copilot. Automatic function calling, which was inspired by OpenAI's function-calling API, is similar to planners, except that we found it a little less ambitious, a little more automatic, and a little less error prone. Automatic function calling allows us to load a set of function descriptions and signatures into SK (just like we did with planners) and then, based on those descriptions, have SK automatically determine which functions (if any) to call and when. Automatic function calling is simpler than planners because it doesn't require coming up with the entire set of steps up front. Instead, it determines if and when to call a function as it runs, so it can adapt to new information and make simpler decisions. Then SK automatically runs the functions and incorporates the results. It sounded promising.

In the OpenAI version of automatic function calling, when a function is to be called, the API initiates a callback with the name of the function to be called as well as values for the parameters it determines need to be passed to the function. When a callback happens, our application is responsible for making the call and returning the results to OpenAI for further processing. With SK, however, the callbacks are handled automatically for us. SK invokes the functions, manages the parameters and return values, and injects the response into the prompt when and where it's appropriate. It was impressive, but it only worked marginally well with the GPT-35-TURBO model we were using. Though slower and a bit more expensive, the GPT-4 model proved to be quite good at this type of work and did a fine job of calling the right functions and generating a correct answer almost all the time. Even though the model is slower, the new approach made fewer calls to the LLM than our original solution, so overall, it was faster.

We turned our original inline C# code into full-fledged native SK functions to allow GPT-4 to call them automatically.

Our first task was to turn our original inline C# code into full-fledged native SK functions to allow GPT-4 to call them automatically. All we had to do was add the KernelFunction and Description attributes to our existing methods. No other changes were required.

[KernelFunction, Description("Returns the current date and time")]
public DateTime GetNow()
{
    return DateTime.Now;
}
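
For context, here's a minimal sketch of what the complete plugin class might look like. The class name DateAndTime is taken from the ImportPluginFromType<DateAndTime>() call in Listing 4, and the attributes come from Microsoft.SemanticKernel and System.ComponentModel:

using System.ComponentModel;
using Microsoft.SemanticKernel;

public class DateAndTime
{
    [KernelFunction, Description("Returns the current date and time")]
    public DateTime GetNow()
    {
        return DateTime.Now;
    }
}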

Listing 4 shows a simple example of automatic function calling in a console app with logging enabled so you can see what's going on under the hood. After loading a set of native and/or semantic functions into the kernel, we run the prompt, passing the ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions setting, indicating that SK can call any of the loaded functions, or any of the out-of-the-box functions provided by SK, on our behalf, as it sees fit. In this case, we've loaded the GetNow native function into the kernel, which has the following description: “Returns the current date and time”. If we examine the logs when we ask the question, "What is today's date?", we'll see that SK opts to call our GetNow function and automatically includes the response in the prompt, without any intervention from us, before asking the user's question to the LLM. Provided with this information in the prompt, the LLM then correctly answers the question.

Listing 4: Calling functions automatically.

private static async Task CallFunctionsAutomatically()
{
    var builder = Kernel.CreateBuilder();

    builder.Services
        .AddLogging(configure => configure.AddConsole())
        .AddAzureOpenAIChatCompletion("gpt-4", _endpoint, _apiKey);

    var kernel = builder.Build();

    var nativeFunctions = kernel.ImportPluginFromType<DateAndTime>();

    OpenAIPromptExecutionSettings settings = new()
    {
        ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions
    };

    var result = await kernel.InvokePromptAsync(
      "What is today's date?", new(settings));

    Console.WriteLine(result);
}

If we comment out the ToolCallBehavior setting, we'll see that the LLM can no longer answer the question because it can no longer call our functions or the out-of-the-box functions that come with SK. Similarly, if we uncomment the setting and change the question to "What is the capital of New Mexico?", we'll see in the logs that SK no longer calls the GetNow function, because the information is no longer relevant in answering the question. The current date and time are such common pieces of information that SK actually includes them in the out-of-the-box implementations we get for free, but we show it here as a simple but powerful illustration.

We soon realized that we no longer needed most of the semantic functions (prompts and settings) that we'd created for our original version. GPT-4 was good enough to figure out whether it needed to call our GetNow function, to generate search terms, to call FindRelevantChunks, to pass the right parameters, and to inject the results into a prompt it generated on the fly without any help from us! With a little tweaking, using only the GetNow and FindRelevantChunks native functions, we could now turn on the AutoInvokeKernelFunctions setting and ask the LLM to answer almost any user question. SK and GPT-4 now handled almost everything else. Listing 5 shows the updated code.
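
Before we get to that listing, here's how we exposed FindRelevantChunks to the model. Just like with GetNow, the only change was to add the attributes; the wording of the descriptions below is illustrative rather than our exact production text, and the body is unchanged from Listing 1. The parameter descriptions are what help GPT-4 pass the right contract file name and keywords:

[KernelFunction, Description(
  "Returns the chunks of the specified contract most relevant to the given keywords")]
public string FindRelevantChunks(
  [Description("The file name of the contract to search")] string contractFileName,
  [Description("A comma-separated list of keywords to search for")] string keyWords)
{
    // ...body unchanged from Listing 1...
}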

Listing 5: Updated version using automatic function calling and GPT-4.

private static async Task AskQuestionsAboutContract2()
{
    var builder = Kernel.CreateBuilder();

    builder.Services
        //.AddLogging(configure => configure.AddConsole())
        .AddAzureOpenAIChatCompletion("gpt-4", _endpoint, _apiKey);

    var kernel = builder.Build();

    kernel.ImportPluginFromType<DateAndTime>();
    kernel.ImportPluginFromType<SearchChunks>();

    OpenAIPromptExecutionSettings settings = new()
    {
        ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions
    };

    var currentContract = "SampleContract1";
    Console.WriteLine($"Working with contract: {currentContract}");
    var question = "Is the contract currently in effect?";
    await AskQuestion(kernel, settings, currentContract, question);

    question = "What jusrisdiction governs the contract?";
    await AskQuestion(kernel, settings, currentContract, question);

    question = "Who pays for electricity?";
    await AskQuestion(kernel, settings, currentContract, question);

    currentContract = "SampleContract2";
    Console.WriteLine($"Working with contract: {currentContract}");

    question = "How long does the depositor have to deposit materials?";
    await AskQuestion(kernel, settings, currentContract, question);

    question = "What is Tower48's maximum liability?";
    await AskQuestion(kernel, settings, currentContract, question);
}

private static async Task AskQuestion(Kernel kernel, 
    OpenAIPromptExecutionSettings settings, string currentContract, 
      string question)
{
    Console.WriteLine(question);
    var result = await kernel.InvokePromptAsync(
        $"ContractFileName: {currentContract}.Question: {question}", 
            new(settings));
    Console.WriteLine(result);
    Console.WriteLine();
}

As an example, Figure 1 shows the series of test questions posed against the first sample contract, an apartment rental agreement.

Figure 1: Test questions and results about an apartment rental agreement

Figure 2 shows questions posed against the second contract, an escrow contract for digital assets.

Figure 2: Test questions and results about an escrow contract for digital assets

Summary

In our original implementation, we did a lot of work by hand to get the GPT-35 and GPT-35-TURBO models to do their magic. We had to determine the user's intent, figure out what information was needed to answer the question, write functions to retrieve that information, inject it into the prompt, and engineer all our prompts to get good results, all so the LLM could generate a friendly and correct response. After moving to GPT-4 and automatic function calling, we got rid of the lion's share of that code, and we now get even better results with extremely basic prompts!

We now get even better results with extremely basic prompts

Looking back on this experience several projects later, the lessons learned building our first Copilot were fundamental, and we've used them in nearly every project since. Even though we're continually improving our approach to building Copilots and various other flavors of AI, and even as the LLMs and tools we use get better and better, those early lessons remain relevant, important, and useful. In the next article, I'll take a look at more advanced RAG implementations and the lessons learned creating them.