Why AI startups are taking data into their own hands


For a week this summer, Taylor and his roommate strapped GoPro cameras to their foreheads while they painted, sculpted and did housework. They were training an AI vision model, carefully syncing their footage so the system could get multiple angles of the same behavior. It was hard work in many ways, but it paid well – and it left Taylor free to spend much of the day making art.

“We woke up, did our regular routine, and then strapped cameras on our heads and synced the times together,” he told me. “Then we’d make our breakfast and clean the dishes. Then we’d go our separate ways and do art.”

They were assigned to produce five hours of synchronized footage per day, but Taylor quickly learned that he had to allocate seven hours a day to work, allowing enough time for breaks and physical recovery.

“It’ll give you a headache,” he said. “You take it off and you just have a red square on your forehead.”

Taylor, who asked not to give his last name, was working as a data freelancer for Turing Labs, an AI company that connected him with TechCrunch. Turing’s goal was not to teach AI how to paint, but to teach it more abstract skills around sequential problem-solving and visual reasoning. Unlike a large language model, Turing’s vision model will be trained entirely on video — and much of it will be collected directly by Turing.

In addition to artists like Taylor, Turing is making deals with chefs, construction workers and electricians — anyone who works with their hands. Turing’s Chief AGI Officer Sudarshan Sivaraman told TechCrunch that manual collection is the only way to get a diverse enough dataset.

“We’re doing this for a variety of blue-collar jobs, so that we have a variety of data in the pre-training phase,” Sivaraman told TechCrunch. “After we capture all this information, the models will be able to understand how a particular task is performed.”


Turing’s work on vision models is part of a growing shift in how AI companies work with data. Where training sets were once freely scraped from the web or gathered from underpaid annotators, companies are now paying top dollar for carefully curated data.

With the raw power of AI already established, companies are looking to proprietary training data as a competitive advantage. And instead of farming the job out to contractors, they often take on the job themselves.

Email company Fyxer, which uses AI models to sort emails and draft replies, is one example.

After some initial experimentation, founder Richard Hollingsworth discovered the best approach was to use an array of small models with tightly focused training data. Unlike Turing, Fyxer is building on someone else’s base model — but the underlying intuition is the same.

“We realized that data quality, not quantity, is the thing that really defines performance,” Hollingsworth told me.

In practical terms, this means some unconventional personnel choices. In the early days, Fyxer engineers and managers were sometimes outnumbered four to one by the executive assistants needed to train the model, Hollingsworth says.

“We used a lot of experienced executive assistants, because we needed training on the basics of whether to reply to an email,” he told TechCrunch. “It’s a very people-oriented problem. Great people are very hard to find.”

Data collection never slowed down, but over time Hollingsworth became more selective about datasets, preferring smaller, more tightly curated sets when it came to post-training. As he puts it, “The quality of information, not the quantity, is what really defines effectiveness.”

This is especially true when using synthetic data, which magnifies both the scope of possible training scenarios and the impact of any errors in the original dataset. For perspective, Turing estimates that 75% to 80% of its data is synthetic, extrapolated from the original GoPro footage. That makes it all the more important to keep the original dataset as high-quality as possible.

“If the pre-training data itself isn’t of good quality, then what you do with the synthetic data isn’t going to be of good quality either,” says Sivaraman.

Beyond quality concerns, there is a strong competitive rationale for keeping data collection in-house. For Fyxer, the hard work of data collection is one of the company’s best moats against the competition. As Hollingsworth sees it, anyone can build an open source model into their product — but not everyone can source the expert annotations needed to train it into a viable product.

“We believe the best way to do this is with data,” he told TechCrunch, “by building custom models, with high-quality, human-led data training.”
