The researchers extracted and cleaned the text, removing duplicates and irrelevant content. The result, they write, is a vast source of professionally written, factual content that has gone largely untapped for AI training: less than 0.1 percent of BeanCounter's content appears in existing Common Crawl–based datasets.
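The paper's full cleaning pipeline isn't reproduced here, but a minimal sketch of hash-based exact deduplication, one common ingredient of such pipelines, might look like the following (the `clean_text` normalization rules are illustrative assumptions, not the authors' method):

```python
import hashlib
import re

def clean_text(text: str) -> str:
    """Illustrative normalization: collapse runs of whitespace and trim the
    ends. The authors' actual cleaning rules are not shown in the article."""
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(documents):
    """Keep the first occurrence of each document, keyed by a hash of its
    normalized text, so exact duplicates are dropped in a single pass."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(clean_text(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "Revenue grew 4% year over year.",
    "Revenue grew  4% year over year. ",  # duplicate after normalization
    "Operating margin held steady.",
]
print(len(deduplicate(docs)))  # 2
```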
They have made BeanCounter publicly available through the Hugging Face Hub, an online platform for sharing and collaborating on machine-learning models, datasets, and tools, allowing other researchers and organizations to use it in their own efforts to build safer AI systems.
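For readers who want to try it, pulling a dataset from the Hub follows the standard `datasets` workflow. A minimal sketch, assuming a placeholder repository ID (check the Hub for BeanCounter's actual one):

```python
from datasets import load_dataset

# Placeholder repository ID; look up the real BeanCounter repo on the Hub.
ds = load_dataset("your-org/BeanCounter", split="train", streaming=True)

# Stream a few records rather than downloading the full corpus up front.
for i, record in enumerate(ds):
    print(record)  # each record holds the document text plus metadata
    if i == 2:
        break
```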
To measure the value the new dataset can bring to companies, Wang and Levy conducted finance-specific experiments using two tasks: named entity recognition (NER) and Financial PhraseBank. NER is a technique for identifying and categorizing key information, such as names, dates, and locations, in text: for example, discerning when the word apple refers to the technology company Apple and identifying Cupertino as the city where its headquarters is located. Financial PhraseBank is a sentiment classification task involving nearly 5,000 sentences from financial news, each labeled "positive," "neutral," or "negative" for its likely effect on a stock's price.
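Both tasks are easy to poke at with off-the-shelf tooling. A minimal sketch using the Hugging Face `datasets` and `transformers` libraries; the NER checkpoint chosen here is an illustrative public model, not necessarily the one the researchers used:

```python
from datasets import load_dataset
from transformers import pipeline

# Financial PhraseBank on the Hub; "sentences_allagree" keeps only sentences
# where all annotators agreed on the label (0 negative, 1 neutral, 2 positive).
fpb = load_dataset("financial_phrasebank", "sentences_allagree",
                   split="train", trust_remote_code=True)
print(fpb[0])  # {'sentence': ..., 'label': ...}

# NER with an off-the-shelf model (an illustrative choice, not the paper's).
ner = pipeline("ner", model="dslim/bert-base-NER",
               aggregation_strategy="simple")
print(ner("Apple is headquartered in Cupertino."))
# -> entities tagged ORG (Apple) and LOC (Cupertino)
```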
The researchers conducted continued pretraining (extending a model's initial training on domain-specific data to improve performance on a task) on two existing small AI models, Pythia-1.4B from the nonprofit research group EleutherAI and Phi-1.5 from Microsoft, using BeanCounter's data. They then compared the performance of those continually pretrained models against the original versions that had not been exposed to the specialized dataset. The results were striking: models continually pretrained on BeanCounter showed an 18–33 percent reduction in toxic content generation while at the same time improving their performance on NER and Financial PhraseBank by up to about 4 percent.
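Continued pretraining is mechanically the same as ordinary causal-language-model training, just starting from released weights. A minimal sketch with the Hugging Face `Trainer`, where the BeanCounter repository ID, the `text` column name, and all hyperparameters are assumptions for illustration, not the paper's setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/pythia-1.4b"  # or "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Pythia ships without a pad token

# Placeholder repo ID and column name; adjust to the actual dataset schema.
corpus = load_dataset("your-org/BeanCounter", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = corpus.map(tokenize, batched=True,
                       remove_columns=corpus.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pythia-beancounter",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-5),  # illustrative hyperparameters
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```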
Wang and Levy also analyzed how demographic groups are represented in their dataset. They found that BeanCounter's business documents mention demographic groups at similar rates to Common Crawl but in less toxic ways. For instance, when the word Asian appeared in BeanCounter documents, the surrounding text was about 72 percent less toxic, on average, than in general web content. (To measure toxicity, they relied on Perspective, a state-of-the-art classifier for detecting toxic language.) This pattern held across nearly all the demographic descriptors they examined. Potentially sensitive topics appear to be discussed in more professional and measured ways in BeanCounter's source material, the researchers write.
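Perspective is queried over a simple REST endpoint. A minimal sketch of scoring one snippet for toxicity, assuming you have an API key from Google Cloud (the sampling of demographic-term contexts from each corpus is omitted):

```python
import requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # obtained from Google Cloud
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity(text: str) -> float:
    """Return Perspective's TOXICITY summary score (0 to 1) for a snippet."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=10)
    response.raise_for_status()
    scores = response.json()["attributeScores"]
    return scores["TOXICITY"]["summaryScore"]["value"]

# Averaging such scores over contexts that mention a demographic term,
# corpus by corpus, yields comparisons like the 72 percent figure above.
print(toxicity("The Asian market segment grew 12 percent this quarter."))
```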
BeanCounter can be used to complement existing data sources, Levy explains. It’s big enough that, on its own, it could pretrain a model such as OpenAI’s GPT-4o mini. And while it’s too small to pretrain, say, Meta’s largest 405-billion-parameter Llama models, it could be helpful as part of the “annealing” stage of pretraining during which Meta taps high-quality data from lengthy documents to improve its models’ performance.
BeanCounter can also be used to evaluate LLMs, Levy says. Its data are grounded in facts and carry time stamps, so it can test whether a model or AI system gives answers that are not just accurate but were correct and relevant at a particular point in time.
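One plausible way to use those time stamps in evaluation is to restrict the evidence to documents filed before a cutoff date and score a model's answer against that snapshot. A toy sketch, with field names that are assumptions rather than BeanCounter's actual schema:

```python
from datetime import date

# Toy records in the spirit of BeanCounter's time-stamped filings
# (field names are assumptions, not the dataset's actual schema).
records = [
    {"text": "Acme Corp. names Jane Doe as CFO.", "filed": date(2019, 3, 14)},
    {"text": "Acme Corp. CFO Jane Doe resigns.", "filed": date(2022, 8, 2)},
]

def facts_as_of(cutoff: date):
    """Keep only documents filed on or before the cutoff, so a model's
    answer can be checked against what was true at that point in time."""
    return [r for r in records if r["filed"] <= cutoff]

# A question like "Who was Acme's CFO in 2020?" should be scored against
# the 2019 filing, not the 2022 resignation.
print(facts_as_of(date(2020, 1, 1)))
```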
The importance of high-quality, reliable training data will almost certainly grow as AI systems become increasingly integrated into decision-making across industries. BeanCounter demonstrates that carefully curated, domain-specific datasets can lead to AI models that are both more capable and more ethically aligned, write Wang and Levy. This suggests a potential pathway for developing specialized AI systems in other professional fields, such as law or medicine, where accuracy and professional conduct are paramount.
The researchers envision a future where AI systems could learn from professional rather than social sources and deliver more reliable and unbiased insights while being more efficient and economical than their larger, general-purpose counterparts—kind of like getting investment advice from a financial adviser instead of relying on the Twitterverse.