Compute vs. Datasets: Experts Debate India’s AI Investment
In response to a Lok Sabha query on the ‘India AI Mission,’ the Ministry of Electronics and Information Technology (MEITY) recently disclosed a detailed breakdown of the Rs. 10,372 crore budget planned for the mission. Notably, Rs. 4,563.36 crore, roughly 44% of the entire budget, has been allocated to compute capacity.
The central question is whether dedicating 44% of India’s AI mission budget to compute is judicious. Our in-depth discussion at the MediaNama roundtable on how regulation can facilitate India’s AI ecosystem scrutinized this very question.
Nikhil Pahwa, founder of MediaNama, introduced India’s AI mission, structured around seven pivotal pillars: compute capacity; AI innovation centres, covering the development and deployment of indigenous large language models (LLMs) and domain-specific foundational models; AI dataset platforms; sectoral AI application development; skilling initiatives; startup financing; and responsible AI development.
Pahwa questioned whether the substantial investment in compute capacity is warranted, or whether resources might be directed more effectively towards other crucial areas such as datasets, skilling, or broader AI research.
The discussion highlighted concerns that compute may not be the real bottleneck in AI development. There were suggestions to pivot instead towards areas such as algorithm design or careful data curation, which could have a greater impact on India’s AI ecosystem.
C. Chaitanya, Co-Founder and CTO of Ozonetel Communications, offered a notable counterview, questioning the prevailing approach to AI model development, which leans heavily on large computational power and extensive datasets. Drawing on his own experience, he argued that while such methods have succeeded, they are not the only pathway.
“So, the question we started asking ourselves was, do we really need that much compute? The way models have been built so far, is that the only approach? Is there a better way to create a model without relying on a massive compute?”
Chaitanya suggested that different algorithms might achieve similar or better results, and argued that India’s mathematical prowess should be leveraged to build such models, potentially showing that the compute-centric approach is only one of several paths.
“Our goal is to build a language model without using GPUs to prove that we don’t need that much compute.”
Chaitanya emphasized the critical nature of datasets over compute in shaping effective AI models. Presently, the government has set aside Rs. 199.55 crore for datasets, just 1.92% of the total budget.
“The data sets are the most important because as an AI engineer, I know it has nothing to do with anything else, it is purely all about data sets; focus should be on data sets instead.”
The underlying argument is that data is the bedrock of AI: better and more diverse datasets are essential for building robust models and driving AI forward.
“Because what is AI? It’s prompting. And what is prompting? To use it effectively, you need to know English. If you’re not proficient in English, you can’t fully utilize AI. But why is that the case? If I speak Telugu, shouldn’t I be able to interact with AI in Telugu? Isn’t that the problem AI is meant to solve? The issue is, it can’t solve this unless it has the right datasets.”
Investments in compute, by contrast, are fleeting, depreciating as they are used, whereas dataset investments create durable, reusable assets that keep contributing value.
Umang, an attendee and AI developer based in San Francisco, discussed shifts within AI research. Historically, compute mattered most during model pre-training and training. Recent trends, however, shift the emphasis to “test time” or “inference time” computation, where the significant computational burden is now incurred each time a user interacts with the model.
“So from my perspective, I think the need for compute is going to grow even larger. 44% is probably the bare minimum.”
Umang explained how AI expenses divide into CAPEX (capital expenditure) and OPEX (operational expenditure): model training has traditionally counted as CAPEX, while inference demands fall under OPEX, a distinction that matters more as inference-time compute grows.
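The CAPEX/OPEX distinction can be sketched with a toy cost model. All figures below are invented for illustration, and the function name is hypothetical; the point is only that a one-time training bill stays fixed while inference spend scales with usage.

```python
# Hypothetical illustration (all figures invented) of how AI spending
# splits into CAPEX (one-time training) and OPEX (per-query inference).

def ai_cost_breakdown(training_cost: float,
                      cost_per_query: float,
                      queries: int) -> dict:
    """Return an illustrative CAPEX/OPEX split for a given usage level."""
    capex = training_cost              # one-time model training spend
    opex = cost_per_query * queries    # recurring inference spend
    return {"capex": capex, "opex": opex, "total": capex + opex}

# With light usage, training dominates; with heavy usage, inference
# (OPEX) can dwarf the original training bill.
light = ai_cost_breakdown(training_cost=1_000_000,
                          cost_per_query=0.002, queries=10_000)
heavy = ai_cost_breakdown(training_cost=1_000_000,
                          cost_per_query=0.002, queries=1_000_000_000)
```

Under these made-up numbers, light usage adds only Rs. 20 of OPEX to a Rs. 10 lakh training cost, while a billion queries add Rs. 20 lakh, which is the shift Umang described: as models are deployed at scale, the recurring inference bill, not training, drives compute demand.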
Kesava Reddy from E2E Networks underscored the importance of vast, high-quality datasets to the longevity and performance of AI systems. He suggested the government prioritize dataset acquisition and open existing datasets for AI training and research to spur growth.
“The government can make money with the current datasets they have. For instance, traffic data and health data, like X-rays and scan data, could be used for AI training, including visual models for autonomous driving. These datasets can be provided to others for research and development purposes.”
MEITY’s recent announcement regarding the India AI Mission platform describes intentions to make public sector datasets AI-ready. However, details on data quality assurance, curation, or anonymization procedures to safeguard personal information and comply with privacy regulations remain unspecified.
Ajay Kumar from Triumvir Law raised the question of the government’s role in building domestic compute capacity so that public infrastructure does not depend on foreign data centres, potentially supported by private sector incentives such as tax breaks.
Vivek Abrahim from Salesforce expressed concern about the lack of clear direction in governmental compute investments. He noted that various state-level initiatives are independently pursuing compute infrastructure, and called for greater clarity and coordination at the national level to promote synergy.
Bharath from Takshashila Institution reiterated that the issue is not just the size of the compute budget but how effectively it is spent, critiquing the current tender-based model for resource allocation.
An attendee asked about the implications of treating the compute budget as a preliminary funding stage, anticipating that rising demand might outpace supply and lead to inefficiencies.