Massive Datasets Are Built, Not Found

The goal of most successful digital businesses is to build a data pipeline, rather than build a product and capitalize on the 'exhaust'.

datasets data pipelines analytics agentic ai
ProcurePlus (Menacon)

Observation

Large datasets are not discovered in a finished state. They are built through continuous collection and refinement.

Shift

Previously, the perceived limitation was access to public datasets. Now, the real limitation is control over the data generating process. Agentic AI makes it easy to generate and update data at scale.

Perhaps the goal of the product is to build data.

The shift is from data as exhaust to data as steam.

Implication

Data-first creation.

Data mines (closer to explosive mines than gold mines): touch them in a unique way and you trigger an explosion of new data, shaped by who interacted and how.

Principle

The dataset is not the asset. The collection and refinement loop is.

Next step

  • Prioritise improving the pipeline
  • Introduce feedback signals into dataset growth
  • Write field notes on both of the above
  • Identify further products which are self-recursive and data-enabled, or act as data-builders for a meta product (often involving human assessment, and assessment of human assessment)