Best practices for developing a generative AI copilot for business

Since the launch of ChatGPT, I can’t remember a meeting with a prospect or customer where they didn’t ask me how they could leverage generative AI for their business. From internal efficiency and productivity to external products and services, companies are racing to implement generative AI technologies across every sector of the economy.

While GenAI is still in its early days, its capabilities are expanding quickly. From vertical search to photo editing to writing assistants, the common thread is leveraging conversational interfaces to make software more approachable and powerful. Chatbots, now rebranded as “copilots” and “assistants,” are all the rage once again, and while a set of best practices is starting to emerge, step 1 in developing a chatbot is to scope down the problem and start small.

A copilot is an orchestrator, helping a user complete many different tasks through a free-text interface. The space of possible input prompts is effectively infinite, and all of them should be handled gracefully and safely. Rather than setting out to solve every task and running the risk of falling short of user expectations, developers should start by solving a single task really well and learning along the way.

At AlphaSense, for example, we focused on earnings call summarization as our first single task: one that is well scoped but high value for our customer base and that also maps well to existing workflows in the product. Along the way, we gleaned insights into LLM development, model choice, training data generation, retrieval-augmented generation and user experience design that enabled the expansion to open chat.

LLM development: Choosing open or closed

In early 2023, the leaderboard for LLM performance was clear: OpenAI was ahead with GPT-4, but well-capitalized competitors like Anthropic and Google were determined to catch up. Open source held sparks of promise, but performance on text generation tasks was not competitive with closed models.


My experience with AI over the last decade led me to believe that open source would make a furious comeback, and that’s exactly what has happened. The open source community has driven performance up while lowering cost and latency. LLaMA, Mistral and other models offer powerful foundations for innovation, and the major cloud providers like Amazon, Google and Microsoft are largely adopting a multi-vendor approach, including support for and amplification of open source.

While open source hasn’t caught up in published performance benchmarks, it has clearly leapfrogged closed models on the set of trade-offs that any developer has to make when bringing a product into the real world. The 5 S’s of Model Selection can help developers decide which type of model is right for them:

  • Smarts: Through fine-tuning, open source models can absolutely outperform closed models on narrow tasks. This has been proven time and time again.
  • Spend: Open source is free apart from GPU time and engineering operations, which are largely fixed costs. At reasonable volumes, this will always scale more efficiently than usage-based pricing.
  • Speed: By owning the full stack, developers can continuously optimize latency, and the open source community is producing new ideas every day. Training small models with knowledge from large models can bring latency down from seconds to milliseconds.
  • Stability: Drift in performance is inherent to closed models. When the only lever of control is prompt engineering, such drift will inevitably hurt a carefully tuned product experience. On the other hand, collecting training data and regularly retraining a fixed model baseline enables systematic evaluation of model performance over time (a minimal regression harness is sketched after this list). Larger upgrades with new open source models can also be planned and evaluated like any major product release.
  • Security: Serving the model can guarantee end-to-end control of data. (Note: I would go further and say that AI safety in general is better served with a robust and thriving open source community.)
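To make the stability point concrete, here’s a minimal sketch of such a regression harness over a frozen evaluation set. The `generate` callable and the token-overlap metric are placeholders I’ve invented for illustration; a real deployment would plug in its own inference call and a task-appropriate metric such as ROUGE for summarization.

```python
# Minimal sketch of a regression harness against a fixed model baseline.
import json

def token_overlap(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the prediction (placeholder metric)."""
    ref_tokens = set(reference.lower().split())
    pred_tokens = set(prediction.lower().split())
    return len(ref_tokens & pred_tokens) / max(len(ref_tokens), 1)

def evaluate(generate, eval_set: list[dict]) -> float:
    """Average metric over a frozen set of {"prompt": ..., "reference": ...} pairs."""
    scores = [token_overlap(generate(ex["prompt"]), ex["reference"]) for ex in eval_set]
    return sum(scores) / len(scores)

def check_regression(generate, eval_path: str, baseline: float, tolerance: float = 0.02) -> float:
    """Fail the release if the candidate model falls below the stored baseline score."""
    with open(eval_path) as f:
        eval_set = [json.loads(line) for line in f]
    score = evaluate(generate, eval_set)
    if score < baseline - tolerance:
        raise RuntimeError(f"Regression detected: {score:.3f} vs. baseline {baseline:.3f}")
    return score
```

Because the evaluation set and metric are fixed, every retraining run and every candidate upgrade gets scored on the same yardstick, which is exactly the control that closed models take away.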

Closed models will play an important role in bespoke enterprise use cases and for prototyping new use cases that push the boundaries of AI capability. However, I believe open source will provide the foundation for all significant products where GenAI is core to the end-user experience.

LLM development: Training your model

To develop a high-performance LLM, commit to building the best dataset in the world for the task at hand. That may sound daunting, but consider two facts: First, best does not mean biggest. Often, state-of-the-art performance on narrow tasks can be achieved with hundreds of high-quality examples. Second, for many tasks in your enterprise or product context, your unique data assets and understanding of the problem offer a leg up on closed-model providers collecting training data to serve thousands of customers and use cases. At AlphaSense, AI engineers, product managers and financial analysts collaborate to develop annotation guidelines that define a process for curating and maintaining such datasets.
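To illustrate, here’s a sketch of what enforcing annotation guidelines over a small curated dataset might look like. The file name, fields and bounds are hypothetical, not our actual schema; the point is that a few hundred reviewed examples are a tractable asset, and guidelines make their quality checkable.

```python
# Sketch: validating a curated fine-tuning set against annotation guidelines.
# Field names and bounds are illustrative only.
import json

REQUIRED_FIELDS = {"document_id", "instruction", "reference_summary", "annotator", "reviewed"}

def validate_example(example: dict) -> list[str]:
    """Return a list of guideline violations for one training example."""
    problems = []
    missing = REQUIRED_FIELDS - example.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if example.get("reviewed") is not True:
        problems.append("example has not passed second-annotator review")
    summary = example.get("reference_summary", "")
    if not (50 <= len(summary.split()) <= 300):  # length bounds from the guidelines
        problems.append("summary length outside guideline bounds")
    return problems

with open("earnings_summaries.jsonl") as f:  # hypothetical dataset file
    examples = [json.loads(line) for line in f]

bad = {i: p for i, ex in enumerate(examples) if (p := validate_example(ex))}
print(f"{len(examples) - len(bad)} of {len(examples)} examples pass guideline checks")
```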

Distillation is a critical tool to optimize this investment in high-quality training data. Open source models are available in multiple sizes from 70 billion+ parameters to 34 billion, 13 billion, 7 billion, 3 billion and smaller. For many narrow tasks, smaller models can achieve sufficient “smarts” at significantly better “spend” and “speed.” Distillation is the process of training a large model with high-quality human-generated training data and then asking that model to generate orders of magnitude more synthetic data to train smaller models. Multiple models with different performance, cost and latency characteristics provide great flexibility to optimize user experience in production.
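Here’s a minimal sketch of that loop: a teacher model labels a pool of unlabeled documents, a cheap quality gate filters the outputs, and the surviving pairs become synthetic training data for a smaller student. `teacher_generate` stands in for whatever large-model inference call your stack exposes.

```python
# Distillation sketch: the teacher's outputs become the student's training data.
import json

def keep(summary: str) -> bool:
    """Cheap quality gate; real filters would be task-specific."""
    return 50 <= len(summary.split()) <= 300

def build_synthetic_set(teacher_generate, documents: list[str], out_path: str) -> int:
    """Expand a small human-labeled seed set by orders of magnitude."""
    kept = 0
    with open(out_path, "w") as f:
        for doc in documents:
            summary = teacher_generate(f"Summarize the following earnings call:\n{doc}")
            if keep(summary):
                f.write(json.dumps({"input": doc, "target": summary}) + "\n")
                kept += 1
    return kept
```

The resulting file then feeds the same fine-tuning pipeline used for the teacher, trading a little “smarts” for much better “spend” and “speed.”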

RAG: Retrieval-augmented generation

When developing products with LLMs, developers quickly learn that the output of these systems is only as good as the quality of the input. ChatGPT, which is trained on the entire internet, inherits both the benefits (access to all published human knowledge) and the downsides (misleading, copyrighted, unsafe content) of the open internet.

In a business context, that level of risk may not be acceptable for customers making critical decisions every day, in which case developers can turn to retrieval-augmented generation, or RAG. RAG grounds the LLM in authoritative content by asking it to reason only over information retrieved from a database rather than reproduce knowledge from its training dataset. Current LLMs can effectively process thousands of words as input context for RAG, but nearly every real-life application must process many orders of magnitude more content than that. For example, AlphaSense’s database contains hundreds of billions of words. As a result, the task of retrieving the right context to feed the LLM is a critical step.
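In code, the core RAG loop is short; the hard part is making the retrieval step good. A minimal sketch, where `search` and `llm` are placeholders for your retrieval system and model call:

```python
# Minimal RAG sketch: retrieve a handful of passages, then instruct the
# model to answer only from them.
def answer_with_rag(question: str, search, llm, k: int = 5) -> str:
    passages = search(question, top_k=k)  # each passage: {"id": ..., "text": ...}
    context = "\n\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    prompt = (
        "Answer the question using only the sources below. "
        "Cite source ids in brackets. If the sources do not contain "
        "the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```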

Expect to invest more in building the information retrieval system than in training the LLM. As both keyword-based and vector-based retrieval systems have limitations today, a hybrid approach is best for most use cases. I believe grounding LLMs will be the most dynamic area of GenAI research over the next few years.
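One common way to combine the two is reciprocal rank fusion, which merges ranked lists without requiring their scores to be comparable. A sketch, assuming `keyword_search` and `vector_search` each return a ranked list of document ids:

```python
# Hybrid retrieval sketch: merge keyword (e.g., BM25) and vector results
# with reciprocal rank fusion, one simple and widely used fusion rule.
from collections import defaultdict

def hybrid_search(query: str, keyword_search, vector_search, k: int = 10, c: int = 60):
    fused = defaultdict(float)
    for results in (keyword_search(query), vector_search(query)):
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] += 1.0 / (c + rank)  # reciprocal rank fusion score
    return sorted(fused, key=fused.get, reverse=True)[:k]
```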

User experience and design: Integrate chat seamlessly

From a design perspective, a chatbot should fit seamlessly into the rest of an existing platform; it shouldn’t feel like an add-on. It should add unique value and leverage existing design patterns where they make sense. Guardrails should help a user understand how to use the system and its limitations; they should handle user input that can’t or shouldn’t be answered; and they should provide for automatic injection of application context. Here are three key points of integration to consider:

  1. Chat vs. GUI: For the most common workflows, users would prefer not to chat. Graphical user interfaces were invented because they are a great way to guide users through complex workflows. Chat is a fantastic solution for the long tail when a user needs to provide difficult-to-anticipate context in order to solve their problem. Be thoughtful about when and where to trigger chat in an app.
  2. Setting context: As discussed above, a limitation of LLMs today is how much context they can hold, and a retrieval-based conversation can quickly grow to millions of words. Traditional search controls and filters are a fantastic solution to this problem. Users can set the context for a conversation and know that it’s fixed over time or adjust it along the way. This can reduce cognitive load while increasing the probability of delivering accurate and useful responses in conversation.
  3. Auditability: Ensure that any GenAI output is cited to the original source documents and is auditable in context (a minimal citation check is sketched after this list). Speed of verification is a key barrier to trust and adoption of GenAI systems in a business context, so invest in this workflow.
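On that last point, a minimal auditability check can verify that every citation in an answer maps back to a passage that was actually retrieved, so a reviewer can jump straight to the source. The bracketed [id] citation format matches the RAG sketch above and is my assumption, not a standard:

```python
# Auditability sketch: confirm that cited ids correspond to retrieved passages.
import re

def audit_citations(answer: str, retrieved_ids: set[str]) -> dict:
    cited = set(re.findall(r"\[([^\]]+)\]", answer))
    return {
        "cited": sorted(cited),
        "unverifiable": sorted(cited - retrieved_ids),     # cites nothing we retrieved
        "uncited_sources": sorted(retrieved_ids - cited),  # retrieved but never cited
    }
```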

The release of ChatGPT alerted the world to the arrival of GenAI and demonstrated the potential for the next generation of AI-powered apps. As more companies and developers create, scale and implement AI chat applications, it’s important to keep these best practices in mind and focus on alignment between your tech and business strategies to build an innovative product with real, long-term impact and value. Focusing on completing one task well while looking for opportunities to expand a chatbot’s functionality will help set a developer up for success.