LUMI AI Factory launches Dataset-as-a-Service to make data easier to access and use alongside its supercomputer for faster research.

The LUMI AI Factory, a leading service infrastructure and support center designed for AI innovation across Europe, has introduced Dataset-as-a-Service (DaaS), which brings data and compute closer together to address the expanding demands of AI and data-intensive research.

By placing carefully chosen, high-quality datasets in close technical and physical proximity to the processing capacity of the LUMI supercomputer, the service addresses a significant obstacle in the development of artificial intelligence. This configuration reduces latency and removes the traditional, time-consuming overhead of transferring huge volumes of data over external networks before training can start.

The new solution makes datasets discoverable and instantly accessible on the LUMI supercomputer by combining metadata, access rights, and data locations into a unified whole. This is particularly significant for AI research, because training models requires large amounts of data, and the physical proximity of data to compute has a big impact on workflow reproducibility and performance.
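To illustrate what combining metadata, access rights, and data locations in one record could look like, here is a minimal Python sketch. It is not the LUMI AI Factory's actual implementation or API: the `DatasetRecord` class, the `find_datasets` function, the project IDs, and the storage paths are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """One catalog entry. Every field name here is a hypothetical assumption."""
    name: str
    description: str             # descriptive metadata
    allowed_projects: frozenset  # access rights, as a set of project IDs
    path: str                    # data location on storage next to the compute
    keywords: tuple = ()         # extra metadata used for discovery


def find_datasets(catalog, project_id, keyword):
    """Return the records that match the keyword and that the project may read."""
    keyword = keyword.lower()
    return [
        record
        for record in catalog
        if project_id in record.allowed_projects
        and (
            keyword in record.description.lower()
            or keyword in (k.lower() for k in record.keywords)
        )
    ]


# Placeholder catalog entries; names, projects, and paths are invented.
CATALOG = [
    DatasetRecord(
        name="example-web-index",
        description="Structured web documents",
        allowed_projects=frozenset({"project_a"}),
        path="/placeholder/storage/example-web-index",
        keywords=("web", "llm"),
    ),
    DatasetRecord(
        name="example-images",
        description="Labelled images",
        allowed_projects=frozenset({"project_b"}),
        path="/placeholder/storage/example-images",
        keywords=("vision",),
    ),
]

# A single lookup answers what exists, who may use it, and where it is stored.
for record in find_datasets(CATALOG, "project_a", "web"):
    print(record.name, record.path)
```

Because discovery, authorization, and location are resolved in a single lookup, a training job in this sketch can read the returned path directly instead of first copying the data from elsewhere.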

The Open Web Search Index is one of the top resources hosted by this infrastructure. With its petabyte-sized collection of over 1,000 datasets, it gives developers structured web document data for training large language models (LLMs) without their having to search the web separately.

Initially, the service is offered in a pre-productized form, with LUMI specialists actively assisting with certain tasks. Access to the data is free, provided the applicant is registered as a user and is involved in a research project in academia or industry that has obtained a compute allocation on the LUMI supercomputer.

DaaS from LUMI AI Factory generates value for both data producers and data users. For data users, DaaS simplifies the search for AI-ready datasets and removes the difficulty of moving a sizable dataset to another location before analysis. For data producers, the service provides a defined publication path, making datasets discoverable and accessible for wider use in a controlled, standardized manner.