
Snorkel AI extends data curation beyond labeling for Generative AI
Snorkel AI has announced new capabilities to help organizations curate and prepare data for generative AI, expanding beyond its primary function of data labeling for machine learning (ML) and artificial intelligence (AI). As VentureBeat reports, Snorkel AI is a data platform that assists organizations with the data side of AI. Alex Ratner, CEO and co-founder of Snorkel AI, said that data labeling remains important for predictive AI tasks and that, in the long run, he expects much of the enterprise value from AI to come from more traditional predictive AI. Generative AI still needs feedback, however, and Snorkel AI's new tools are designed to help organizations assemble, curate, and develop that feedback programmatically, in a faster and better-managed way.
The role of data labeling
Data labeling has long been a critical step in helping data scientists prepare data for ML and AI. In November 2022, Snorkel Flow was updated with features that let organizations accelerate the data labeling process by using large language models (LLMs) to get started. The new GenFlow service goes one step further by supporting the building of generative AI applications, while Snorkel Foundry helps organizations build customized LLMs.
Ensuring good data for Generative AI
“How you curate, sample, filter, and clean data ends up having a tremendous impact on the resulting foundation model that you get out,” Ratner said in an exclusive interview with VentureBeat. One issue that general-purpose generative AI tools face is the risk of hallucination, where responses are not accurate. To address this issue, multiple vendors are exploring Retrieval Augmented Generation (RAG), in which results are generated from retrieved sources and those sources are cited. However, if there are no sources, it becomes a data problem that Snorkel Foundry can solve with its data curation capabilities. With Snorkel Foundry, organizations can point the service at a data repository to get the right mix of data to meet business objectives, reduce bias, and lower the risk of hallucination.
Beyond labeling with GenFlow
After pre-training an LLM, the next step is to fine-tune it to generate an optimal output. For non-generative AI, the kind of work Snorkel Flow supports, classifying data with tags is how it gets labeled properly. For generative AI outputs, however, traditional labeling is not what is needed. This is where the GenFlow service comes in: it provides the tooling and management capability to give feedback and filter out poor-quality data points, helping generative AI produce an optimal output.
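The idea of giving feedback programmatically and filtering out poor-quality data points can be illustrated with a minimal Python sketch. This is not GenFlow's API, which the article does not describe; the check functions, thresholds, and example records below are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical record shape: one prompt/response pair per example.
Example = Dict[str, str]
Check = Callable[[Example], bool]


def not_empty(ex: Example) -> bool:
    """Reject responses that are blank."""
    return bool(ex["response"].strip())


def not_refusal(ex: Example) -> bool:
    """Reject responses containing an (assumed) refusal phrase."""
    return "i cannot help" not in ex["response"].lower()


def not_too_short(ex: Example) -> bool:
    """Reject responses shorter than an (assumed) 5-word minimum."""
    return len(ex["response"].split()) >= 5


CHECKS: List[Check] = [not_empty, not_refusal, not_too_short]


def filter_examples(examples: List[Example],
                    checks: List[Check] = CHECKS) -> List[Example]:
    """Keep only the examples that pass every feedback check."""
    return [ex for ex in examples if all(check(ex) for check in checks)]


examples = [
    {"prompt": "Summarize the report.",
     "response": "The report covers quarterly revenue and hiring plans."},
    {"prompt": "Summarize the report.", "response": ""},
    {"prompt": "Summarize the report.",
     "response": "Sorry, I cannot help with that request today."},
    {"prompt": "Summarize the report.", "response": "It is fine."},
]

kept = filter_examples(examples)
print(len(kept), "of", len(examples), "examples kept")
```

Writing feedback as small functions rather than hand-applied tags means the same rules can be rerun whenever the data changes, which is the sense of "programmatic" used here.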
Advice for enterprises
Ratner believes that most of the data organizations hold is likely to be unstructured. Snorkel Foundry includes data sampling functions that let users heuristically identify which data is relevant and compose the right balance of content to feed into an ML training routine. Ratner explained that "most enterprises don't have perfectly curated data," and Snorkel AI is helping them programmatically organize, curate, and optimize their mixture of data.
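As a rough illustration of heuristic relevance tagging followed by mixture sampling, here is a small Python sketch. It does not reflect how Snorkel Foundry is implemented; the keyword rules, domain names, and target weights are assumptions made up for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List


def tag_domain(doc: str) -> str:
    """Hypothetical keyword heuristic assigning a document to a domain."""
    text = doc.lower()
    if "invoice" in text or "payment" in text:
        return "finance"
    if "patient" in text or "diagnosis" in text:
        return "medical"
    return "other"


def sample_mixture(docs: List[str], weights: Dict[str, float],
                   total: int, seed: int = 0) -> List[str]:
    """Draw about `total` documents, split across domains by `weights`.

    Domains missing from `weights` are left out entirely; a domain with
    too few documents contributes everything it has.
    """
    rng = random.Random(seed)
    buckets: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        buckets[tag_domain(doc)].append(doc)

    sample: List[str] = []
    for domain, weight in weights.items():
        k = min(round(total * weight), len(buckets[domain]))
        sample.extend(rng.sample(buckets[domain], k))
    return sample


docs = (
    [f"Invoice {i}: payment due" for i in range(4)]
    + [f"Patient {i}: diagnosis pending" for i in range(4)]
    + [f"Lunch menu for day {i}" for i in range(4)]
)

# Assumed target mix: half finance, half medical, no off-topic documents.
mixture = sample_mixture(docs, {"finance": 0.5, "medical": 0.5}, total=4)
print([tag_domain(d) for d in mixture])
```

The weights make the intended balance explicit, so changing the mix for a different business objective is a one-line edit rather than a new manual curation pass.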
Final thoughts
Generative AI has brought a new challenge to data curation: the risk of hallucination. Snorkel AI's new tools aim to tackle it through better data curation and feedback in different forms. Enterprises must ensure that they have good data for generative AI and an optimal mix of data for ML training routines. With Snorkel AI's tools, organizations can assemble, curate, and develop feedback programmatically, and so get more out of their AI investments.