Training AI Where Data Doesn’t Exist
Opportunities for new companies exist in economic sectors where data is sparse. Start-ups that pair cutting-edge methods with creative wedge products can turn that scarcity into a durable advantage.
AI’s most impressive results have so far been concentrated in sectors with robust, digitized data. Financial markets generate petabytes of structured transaction data daily. Consumer internet platforms log billions of behavioral signals per hour. Healthcare’s electronic medical record systems, however imperfect, provide massive, labeled datasets for clinical NLP. In these environments, the AI playbook is well established: collect data, train models, deploy, iterate.
But a significant share of the global economy operates in data deserts – sectors where the raw material for AI training either doesn’t exist in digital form, is too fragmented to aggregate, or is locked behind privacy and regulatory constraints. A few examples: construction, one of the world’s largest industries at $12 trillion globally, has a digitization rate as low as 1.4%. Drug discovery for rare diseases suffers from a fundamental scarcity of patient data – by definition, the populations are small and geographically dispersed. Legal services involve high-value domain expertise, but confidentiality requirements prevent the kind of open data sharing that accelerates model training in other fields.
These are massive markets, and companies that figure out how to build performant AI with sparse data will capture value – precisely because the problem is hard.
The Data Wall
Data scarcity will become more pressing over the next few years. Epoch AI has projected with 80% confidence that high-quality public text data will be effectively exhausted between 2026 and 2028. By April 2025, 74% of newly created web pages already contained AI-generated content, creating a recursive contamination loop in which models increasingly train on the outputs of other models, a dynamic that can lead to model collapse. These constraints affect every AI company, but they are felt most acutely in verticals where native data was scarce to begin with.
Consider the scale of the gap: the latest frontier models cost over $100 million to train; Dario Amodei, CEO of Anthropic, says that models currently in development are approaching a cost of $1 billion. These budgets assume abundant, high-quality training data. In areas like construction or rare disease research, that assumption fails.
Filling the Gap: How Companies Supplement Sparse Data
The companies winning in data-sparse sectors are engineering around this constraint with increasingly sophisticated methods.
Synthetic data generation is the most prominent approach. The synthetic data market grew to roughly $2 billion in 2025 and is projected to reach $10 billion by 2033; Gartner forecasts that synthetic data will be more widely used for AI training than real-world datasets by 2030. Products like NVIDIA’s Cosmos allow companies to generate physically accurate synthetic environments for robotics and autonomous systems. In life sciences, synthetic patient data and virtual cell models are compressing drug development timelines without requiring access to large real-world patient cohorts. But synthetic data has its drawbacks. Recent research has shown that even a small fraction of synthetic content in training data can trigger model collapse. The most effective practitioners therefore treat synthetic data as a supplement to real-world data, not a replacement.
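To make the "supplement, not replacement" point concrete, here is a minimal sketch of one common guardrail: capping the share of synthetic examples that reach the training set. The 20% cap, the data shapes, and the function names are illustrative assumptions, not a recipe from any of the vendors or studies cited above.

```python
import random

def mix_training_set(real_examples, synthetic_examples, max_synthetic_frac=0.2):
    """Cap synthetic examples at `max_synthetic_frac` of the final training set."""
    # Largest synthetic count S such that S / (len(real) + S) <= max_synthetic_frac.
    max_synth = int(len(real_examples) * max_synthetic_frac / (1 - max_synthetic_frac))
    synth = random.sample(synthetic_examples, min(max_synth, len(synthetic_examples)))
    # Tag provenance so the cap stays auditable downstream.
    mixed = [(x, "real") for x in real_examples] + [(x, "synthetic") for x in synth]
    random.shuffle(mixed)
    return mixed

# Toy usage: 800 real records and 1,000 generated records -> at most 20% synthetic.
train = mix_training_set(list(range(800)), list(range(1000)))
print(sum(1 for _, src in train if src == "synthetic") / len(train))
```

Tracking provenance alongside each example, as the tuples above do, is what makes a cap like this enforceable once the data flows into a training pipeline.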
Federated learning addresses a different dimension of the problem – data that exists but can’t be shared. In healthcare, financial services, and government, privacy regulations make centralized data aggregation impractical or illegal. Federated learning allows multiple institutions to train a shared model without exchanging raw data. The European Data Protection Supervisor endorsed federated learning in 2025 as a mechanism for GDPR-compliant AI development. While the federated learning market is still early (~$100 million in 2025), it’s projected to grow by an order of magnitude to $1.6 billion by 2035. In rare disease research specifically, federated approaches have enabled hospitals to collaboratively train diagnostic models across institutions without sharing sensitive patient records – a breakthrough for conditions where no single institution has enough data to train a model alone.
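For readers who want to see the mechanics, here is a minimal federated averaging (FedAvg) sketch in plain NumPy: three simulated "hospitals" jointly fit a shared linear model by exchanging only model weights, never raw records. The data, model, and hyperparameters are toy assumptions; production deployments use frameworks such as Flower or TensorFlow Federated, plus secure aggregation and differential privacy.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])

# Each site holds a small private dataset that never leaves the institution.
sites = []
for _ in range(3):
    X = rng.normal(size=(40, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=40)
    sites.append((X, y))

w = np.zeros(3)                                   # shared global model
for _ in range(50):                               # federated rounds
    local_weights, sizes = [], []
    for X, y in sites:
        w_local = w.copy()
        for _ in range(5):                        # local gradient steps, on-site
            grad = 2 * X.T @ (X @ w_local - y) / len(y)
            w_local -= 0.05 * grad
        local_weights.append(w_local)             # only the weights are shared
        sizes.append(len(y))
    # Server aggregates: size-weighted average of local models (FedAvg).
    w = np.average(local_weights, axis=0, weights=sizes)

print("federated estimate:", np.round(w, 2))      # approaches [2.0, -1.0, 0.5]
```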
Few-shot and zero-shot learning techniques have also matured considerably. Instead of requiring millions of labeled examples, these approaches teach models to generalize from a handful of in-context demonstrations. In practice, few-shot prompting can lift accuracy from near zero to 90% on domain-specific tasks simply by showing the model targeted examples. For sectors where labeled data is expensive to create – such as legal document classification, construction defect identification, or specialty insurance claim adjudication – these methods lower the data floor required to build useful products.
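A sketch of what this looks like in practice: the few labeled examples live directly in the prompt rather than in a training set. The defect categories, the example observations, and the commented-out client call below are hypothetical placeholders; the structure of the prompt is the point.

```python
# Hypothetical few-shot prompt for construction-defect triage.
FEW_SHOT_EXAMPLES = [
    ("Hairline cracks along the slab edge, no displacement.", "cosmetic"),
    ("Rebar exposed on the third-floor column after formwork removal.", "structural"),
    ("Ponding water on the roof membrane near drain D-4.", "waterproofing"),
]

def build_prompt(new_observation: str) -> str:
    lines = ["Classify each site observation as cosmetic, structural, or waterproofing.\n"]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Observation: {text}\nLabel: {label}\n")
    lines.append(f"Observation: {new_observation}\nLabel:")
    return "\n".join(lines)

# response = llm.complete(build_prompt("Damp patches on basement wall B-2"))  # placeholder client
print(build_prompt("Efflorescence and damp patches on basement wall B-2"))
```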
Simulation and digital twins represent a fourth category. The Sparse Identification of Nonlinear Dynamics (SINDy) method has gained major traction in the last year or two. It recovers governing equations from minimal time-series data, enabling manufacturers and chemical engineers to build digital twin simulators of complex systems without decades of historical data. In construction and heavy industry, combining physics-based models with AI creates environments where models can train on simulated scenarios that would take years to observe in the real world.
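To illustrate the core idea, here is a self-contained, from-scratch sketch of SINDy-style identification: fit a sparse combination of candidate terms to numerically estimated derivatives of a short time series. The toy dynamics (logistic growth), the threshold, and the candidate library are assumptions for illustration; real applications would use a dedicated package such as pysindy and noise-robust differentiation.

```python
import numpy as np

# --- Simulate a short "observed" time series: dx/dt = r*x*(1 - x/K) ---
r, K, dt = 0.8, 10.0, 0.01
t = np.arange(0, 10, dt)
x = np.empty_like(t)
x[0] = 0.5
for i in range(1, len(t)):                       # forward-Euler "measurements"
    x[i] = x[i - 1] + dt * r * x[i - 1] * (1 - x[i - 1] / K)

dxdt = np.gradient(x, dt)                        # numerical derivative estimate

# --- Candidate library Theta(x) = [1, x, x^2, x^3] ---
theta = np.column_stack([np.ones_like(x), x, x**2, x**3])

# --- Sequentially thresholded least squares (the core of SINDy) ---
xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
for _ in range(10):
    small = np.abs(xi) < 0.05                    # prune near-zero terms
    xi[small] = 0.0
    big = ~small
    xi[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]

print("coefficients for [1, x, x^2, x^3]:", np.round(xi, 3))
# Recovers roughly [0, 0.8, -0.08, 0], i.e. dx/dt ≈ 0.8*x - 0.08*x^2
```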
Examples of Building in Data Deserts
I’ve mentioned Chai Discovery in the drug development sector before – the OpenAI-backed company raised a $130 million Series B in December 2025 at a $1.3 billion valuation. Chai builds foundation models that predict molecular interactions – training on sparse molecular datasets by leveraging transfer learning and physics-informed architectures that encode known chemical properties into the model itself. Chai’s approach exemplifies how foundation models can compensate for data scarcity in biotech by embedding domain knowledge directly into model architecture.
Construction – the sector with some of the lowest digitization rates of any major industry – is seeing its own AI emergence. Buildots recently raised a $45 million Series D at a $300 million valuation. Their approach is instructive: rather than relying on historical construction data that doesn’t exist in digital form, the company generates its own training data by processing 360-degree images captured from hard hats on active job sites. Each project creates a digital record where none existed before – and each new project makes the AI that follows that much better. The market is early but growing: the global construction tech AI market reached $1.6 billion in 2025, with a CAGR above 30% expected through 2035.
The Data Flywheel Advantage
In data-sparse sectors, the data infrastructure flywheel is even more powerful, because the barrier to generating the initial data is higher. A company like Buildots uses construction management software as a wedge to create the datasets that no competitor can replicate without equivalent distribution across active job sites. Every customer engagement generates proprietary training data that improves the product, which attracts more customers, which generates more data.
This dynamic is the most important structural advantage in data-sparse sectors. In data-rich environments, many competitors can train on similar public datasets. In data-sparse environments, the company that solves the distribution problem first – and uses that distribution to generate proprietary data – builds a moat that compounds with every customer.
Looking Ahead
Keep an eye out for companies in these data-sparse sectors, like construction, rare disease, specialty insurance, and agriculture. These are the areas where the underlying data problems are the hardest to solve – but therein lies the opportunity.
These markets share characteristics that make them ripe for AI disruption: they are enormous, their incumbents are under-digitized, and the technical toolkit for building in sparse-data environments has never been stronger. But each also has a buyer base that is comparatively resistant to change, so companies that build for these areas will have to be creative in designing wedge products that overcome that inertia. If they succeed, we may well find that the hardest data problems produce the most defensible, valuable companies.

