LlamaIndex adds private data to large language models

Last fall, after playing around with OpenAI’s GPT-3 text-generating AI model (the predecessor to GPT-4), former Uber research scientist Jerry Liu discovered what he describes as “limitations” around the model’s ability to work with private data (e.g., personal files). To address this, he launched an open source project, LlamaIndex, designed to unlock the capabilities and use cases of large language models (LLMs) like GPT-3 and GPT-4.

“LLMs offer incredible capabilities for knowledge extraction and reasoning — they can perform question-answering, summarization and insight extraction and even sequential decision making with an external environment,” Liu told TechCrunch in an email interview. “But LLMs have limits.”

As the project grew in popularity (to the tune of 200,000 monthly downloads), Liu joined forces with Simon Suo, one of his old colleagues at Uber, to turn LlamaIndex into a fully fledged company. Today, LlamaIndex (the company) offers a framework to assist developers in leveraging the capabilities of LLMs on top of their personal or organizational data.

“LlamaIndex [helps] developers manage their data for LLM applications,” Liu said. “Our toolkit contains the most depth in this aspect, and we make it easy to integrate with other tools the developer is using.”

The LlamaIndex framework allows developers to connect LLMs to data from files such as PDFs and PowerPoints, apps such as Notion and Slack, and databases such as Postgres and MongoDB. The framework includes connectors to ingest data sources and data formats, as well as ways to structure data so that it can be easily used with LLMs.
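To make the ingestion idea concrete, here is a minimal Python sketch of what a connector and a structuring step might look like. It is not LlamaIndex's actual API: the `Document` type and the `load_from_records` and `chunk` helpers are hypothetical names invented for illustration, and a real connector would read from a file, app or database rather than an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A unit of ingested text plus where it came from (hypothetical type)."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_from_records(records, source):
    """Hypothetical connector: turn raw (name, text) pairs into Documents."""
    return [
        Document(text=text, metadata={"source": source, "name": name})
        for name, text in records
    ]


def chunk(doc, max_words=50):
    """Split a Document into word-bounded pieces small enough for a prompt."""
    words = doc.text.split()
    return [
        Document(
            text=" ".join(words[i:i + max_words]),
            metadata={**doc.metadata, "chunk": i // max_words},
        )
        for i in range(0, len(words), max_words)
    ]


# Stand-in for data pulled from a file store; the content is made up.
docs = load_from_records(
    [("notes.txt", "alpha beta gamma delta epsilon")], source="files"
)
chunks = [piece for doc in docs for piece in chunk(doc, max_words=2)]
```

Each chunk keeps the metadata of the document it came from, so a later retrieval step could report which source a piece of context was drawn from.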

In addition, LlamaIndex features a data retrieval and query interface that lets developers feed in any LLM input prompt to get back — as Liu describes it — “context and knowledge-augmented” output.
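A rough sketch of how a retrieval-and-query step of this kind can work in general is shown below. It is an assumption-laden toy, not LlamaIndex's implementation: the `retrieve` and `build_prompt` functions are hypothetical, word overlap stands in for whatever ranking method a real system would use, and the sample text is invented. The resulting prompt would then be sent to an LLM of the developer's choice.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, top_k=2):
    """Rank chunks by word overlap with the query (toy relevance score)."""
    query_words = tokenize(query)
    ranked = sorted(
        chunks,
        key=lambda c: len(query_words & tokenize(c)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, chunks, top_k=2):
    """Prepend the most relevant private data to the user's question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Made-up private data standing in for indexed company documents.
chunks = [
    "The Q3 sales report shows revenue grew 12 percent.",
    "The office closes at 6 pm on Fridays.",
    "Revenue growth in Q3 was driven by the enterprise plan.",
]
question = "What drove revenue growth in Q3?"
prompt = build_prompt(question, chunks)
```

Only the two chunks that share words with the question end up in the prompt, which is the sense in which the model's output becomes augmented with context it was never trained on.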

“There are other LLM application frameworks out there that offer basic building blocks for LLM applications and agents,” Liu said. “What’s specific to LlamaIndex is that we focus on connecting your data sources with LLMs, and we have extensive tools around data ingestion, data management and indexing and data retrieval with respect to LLM applications.”

The prospect of augmenting LLMs in this way wooed investors, who pledged $8.5 million toward LlamaIndex in a recently closed seed funding round. Greylock led the round, with participation from angel investors including Jack Altman, Lenny Rachitsky and Charles Xie.

So what will LlamaIndex spend the money on? Liu says that it’ll be used to build an “enterprise solution” atop the open source LlamaIndex project, set to launch later this year. One capability will allow customers to use “production-grade” data connectors to parse and transport large volumes of data, while another, related capability will let them index “domain-specific” data.

“LlamaIndex is not tied to a specific piece of technology, so that we can continue to be used with LLMs as the technology evolves,” Liu said. “The AI industry is moving so quickly that any initial stacks that are emerging will likely change in the course of the next few months.”