How to Train Generative AI Using Your Company’s Data
Many organizations are experimenting with ChatGPT and other large language models, and have found them valuable for expressing sophisticated ideas in plain language. However, many users soon realize that these generative AI services are trained only on public internet data, so they can't accurately respond to questions or prompts about a company's own knowledge or content. This has created a growing need to train generative AI models on the company's data.
Leveraging an organization’s proprietary knowledge is essential to its ability to innovate and compete in today’s dynamic environment. But how can companies train generative AI using their data to help them express complex organization-specific ideas in articulate language? Let’s find out.
3 Ways to Train Generative AI Using Company Data
There are several approaches through which you can incorporate company data into a generative AI model. These include:
1. Training a Large Language Model (LLM) from Scratch
One way to train generative AI using your company's data is by creating and training the model from scratch. However, this approach isn't common, since it demands a massive amount of high-quality data to train an LLM, and most companies simply don't have that much data. It also requires access to well-trained data scientists and substantial computing power, making it extremely costly.
However, some companies with access to that scale of data and talent have embraced this approach. For instance, Bloomberg announced that it created BloombergGPT, an LLM trained largely on its own finance-related content, which also powers natural-language interaction with its data terminal.
The company has more than 40 years' worth of finance-related documents, news, and data. Combined with a large volume of internet text and financial filings, this makes Bloomberg one of the leaders in terms of the data it has generated and accumulated over the years. In total, Bloomberg's data specialists trained a 50-billion-parameter LLM from scratch on a corpus of more than 700 billion tokens, roughly 350 billion words. It's no secret that only a few companies have access to such resources.
2. Fine-Tuning Existing LLMs
Another way to train generative AI using your company's data is by fine-tuning an existing large language model. As noted above, the first option is resource-intensive, and only a few companies have those resources. The second approach is far less demanding: instead of building an LLM from scratch, you modify an existing one, adding domain-specific content to a model that's already trained on language-based interaction and general knowledge.
This method involves adjusting some parameters of the base model. It therefore requires considerably less data, typically hundreds or thousands of records rather than the millions or billions needed for the first approach. It also demands less computing power than building an LLM from scratch, though it's still resource-intensive.
One company that adopted this approach is Google, which fine-tuned its Med-PaLM 2 model for medical knowledge. The project began with Google's general-purpose PaLM 2 LLM, which the company then retrained on carefully curated data from medical datasets. The resulting model answered questions in the style of the United States medical licensing exam with roughly 85% accuracy, nearly 20 percentage points better than the previous version of the system.
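Before any fine-tuning run, the company data has to be packaged as training examples. As a concrete illustration, hosted fine-tuning services such as OpenAI's typically expect JSONL input, with one JSON object per line pairing a prompt with the desired response. The sketch below is a minimal, hypothetical example (the sample records and the system message are invented for illustration; the field layout follows OpenAI's chat fine-tuning format):

```python
import json

# Hypothetical company Q&A records to fine-tune on.
records = [
    {"question": "What is our standard warranty period?",
     "answer": "All hardware products carry a 3-year limited warranty."},
    {"question": "Which regions does the Enterprise plan cover?",
     "answer": "The Enterprise plan covers North America, the EU, and APAC."},
]

def to_finetune_jsonl(records, path):
    """Write records as JSONL in the chat fine-tuning format:
    one JSON object per line, each holding a messages array."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            example = {"messages": [
                {"role": "system", "content": "You are a helpful company assistant."},
                {"role": "user", "content": rec["question"]},
                {"role": "assistant", "content": rec["answer"]},
            ]}
            f.write(json.dumps(example) + "\n")

to_finetune_jsonl(records, "train.jsonl")
```

The resulting file is then uploaded to the fine-tuning service or fed to an open-source training script; the heavy lifting happens there, not in your data-preparation code.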
3. Prompt-Tuning an Existing LLM
Prompt-tuning an existing large language model is arguably the most common approach to customizing an LLM with a company's data, particularly for companies that aren't cloud vendors themselves. It involves freezing the original model and steering it through prompts that contain industry-specific knowledge. Once the prompts are tuned, the generative AI model can respond to questions associated with that knowledge. This method is the least resource-intensive and most computationally efficient of the three. It also doesn't demand a massive amount of data for the new content domain. However, it can be technically complex.
One company that has adopted this approach to training generative AI is Morgan Stanley. The company leveraged this approach to tune OpenAI's GPT-4 model, using a meticulously curated set of 100,000 documents covering investment processes, general business, and investing knowledge. Its main objective was to give the organization's financial advisors easily accessible, accurate knowledge on the critical issues they encounter in their advising role. The new prompt-trained model runs in a private cloud accessible only to the company's employees.
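Morgan Stanley's exact pipeline isn't public, but the general pattern behind this kind of prompt-based customization, often called retrieval-augmented generation, can be sketched in a few lines: retrieve the company documents most relevant to a question, then pack them into the prompt sent to the frozen base model. The example below uses a toy word-overlap score in place of a real embedding model, and the documents and function names are purely illustrative:

```python
def score(query, doc):
    """Toy relevance score: fraction of query words found in the document.
    A production system would use vector embeddings instead."""
    q_words = {w.strip(".,?") for w in query.lower().split()}
    d_words = {w.strip(".,?") for w in doc.lower().split()}
    return len(q_words & d_words) / len(q_words)

def build_prompt(query, documents, top_k=2):
    """Select the top_k most relevant documents and embed them in the
    prompt sent to the (unmodified) base model."""
    ranked = sorted(documents, key=lambda d: score(query, d), reverse=True)
    context = "\n---\n".join(ranked[:top_k])
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

docs = [
    "Our advisory fee schedule is reviewed every January.",
    "Client portfolios are rebalanced quarterly by the advisory team.",
    "The cafeteria is open from 8 am to 3 pm on weekdays.",
]
prompt = build_prompt("When are client portfolios rebalanced?", docs)
```

Because the base model is never modified, the company's knowledge stays in its own document store, and updating the system is a matter of updating documents rather than retraining anything.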
Factors to Consider When Training Generative AI Using Your Company’s Data
Here are some factors to consider when training generative AI using your company’s data:
Data Quality
A generative AI model is only as accurate as the data used to train it. If the training data is poor, the model's output will be poor. Errors in training data may be merely problematic for most companies but can be fatal in healthcare applications. You must therefore ensure the training data is of high quality by rigorously curating and vetting your company content before model training.
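As a simple illustration of pre-training curation, the checks below drop records that are blank, too short to be informative, or exact duplicates. The thresholds and sample records are purely illustrative; real pipelines add many more checks (PII scrubbing, factual review, deduplication across near-matches):

```python
def curate(records, min_words=5):
    """Filter raw training records: drop blanks, very short texts,
    and exact duplicates. Thresholds here are illustrative."""
    seen = set()
    kept = []
    for text in records:
        cleaned = " ".join(text.split())        # normalize whitespace
        if len(cleaned.split()) < min_words:    # too short to be useful
            continue
        key = cleaned.lower()
        if key in seen:                         # exact duplicate
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept

raw = [
    "Refunds are processed within 14 business days of approval.",
    "refunds are processed within 14 business days of approval.",
    "TODO",
    "   ",
    "Support tickets are triaged by severity every morning.",
]
clean = curate(raw)  # only the two substantive, distinct records survive
```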
Data Governance
Training generative AI requires a lot of data. If not governed appropriately, this data can be incorrectly labeled, which may raise significant issues in the later stages. Therefore, implementing robust data governance practices around collecting, tagging, and cleaning data is crucial. Also, train content curators on labeling and formatting source documents correctly.
Continuous Monitoring
Business knowledge, data, and the models themselves all change fast. You must therefore continually monitor the models' output for false information, inaccuracies, and errors. Establishing rigorous quality assurance processes is vital to ensure your generative AI solutions produce accurate and unbiased output.
Manage Risks and Set Policies
Generative AI models come with risks, especially around legal issues, security, improper use, and bias. You must therefore train the model with these aspects in mind. Also, clearly outline the model's limitations and capabilities, and the policies that apply when using it. This will help manage the associated risks.
Final Thoughts
Training generative AI using your company's data can be challenging. There are three main ways to do it: building an LLM from scratch, fine-tuning an existing LLM, and prompt-tuning an existing LLM. Each option has its benefits and challenges, so companies should pick the approach that best suits their needs and resources. Whichever you choose, you must consider several factors to ensure effective training: data quality, data governance, continuous monitoring for accuracy and relevance, and clear risk-management policies. This way, you can train generative AI using your company's data while minimizing its negative implications.