Inside India's Aadhar, The World’s Biggest Biometrics Database

India’s Unique Identification project, also known as Aadhar, earlier this week finished capturing demographic and biometric data of over half a billion residents, making it the largest biometric project of its kind in the world.

It’s been a multi-year effort not without its critics among privacy and security advocates and others. The latest development this week concerned the method that Aadhar is using to capture, store and manage the data, and the role a startup from the U.S. called MongoDB may be playing in it.

MongoDB, a NoSQL database startup, last year raised funding from In-Q-Tel, an independent non-profit venture backed by the CIA and other U.S. intelligence agencies.

During the past few days, several reports in the Indian media have quoted political parties and activists raising questions about whether sensitive data is being compromised by Aadhar, which is headed by Infosys co-founder Nandan Nilekani.

Some of the reports have linked the controversy with MongoDB.

Governments across the world are raising concerns over spying by the National Security Agency, and anything even remotely associated with U.S. government intelligence agencies is enough to cause uproar. Moreover, with general elections set to be held next year, political rhetoric is at an all-time high in India.

Still, the timing of these allegations couldn’t have been worse, at least for the ambitious identification project, which is waiting for a parliament bill to be passed this year to be established as a fully constitutional authority.

I took a tour of Aadhar’s offices in Bangalore. According to officials I spoke to, while some have alleged large contracts that include sharing data with MongoDB, Aadhar is in fact using MongoDB's open source code in a way that doesn’t touch sensitive data. The meeting also offered an opportunity to understand how the biggest biometrics database on earth functions and how it deals with concerns about security and privacy.

Moreover, the Unique Identification Authority of India (UIDAI) denied allegations of sharing Indian residents’ data with any U.S. agencies.

What Aadhar means for India

To put Aadhar in context: in India, more than half a billion people have no official ID of any kind, which makes it impossible for them to receive government aid, open a bank account, get a loan, get a driving license, and so on. The database project, which is now enrolling over one million Indian residents a day, is scheduled to sign up about 1.2 billion people by the end of next year, making it the biggest biometrics database on earth.

One of the biggest advantages of having a 12-digit Aadhar number is that the government can link bank accounts of the country’s poor with it, and directly transfer cash benefits and other subsidies. Already, nearly 40 million bank accounts in India have been linked with Aadhar.

According to research firm CLSA, more than 40% of the Indian government’s $250 billion worth of subsidies and other benefits meant for the poor will be lost to corruption over the next few years. Aadhar is intended to remove the middlemen and curb that corruption by enabling direct cash transfers to those who need government subsidies.

But several think-tanks and activists, including the Bangalore-based Centre for Internet & Society, have been raising concerns about privacy issues and even questioning the effectiveness of the entire project.

Inside the biggest biometrics database on earth

I had been trying to get a meeting with officials at Aadhar to understand the security aspects, the progress so far and their reaction to the MongoDB allegations.

They finally agreed to meet on Friday at their headquarters in one of Bangalore’s southern suburbs, across the road from the India headquarters of both Intel and Cisco. From outside, Aadhar’s technology center, which stores all residents’ data (now totalling 5 petabytes in size), does not look like a government building at all. It could pass for one of the buildings housing Intel or Cisco nearby.

Inside, I walked into a room with about a dozen television screens at its center. Some twenty young engineers looked ahead feverishly, typing on their keyboards and checking the movement of data packets storing information. The setting looked like a very sophisticated command center. The television screens showed the journey of these data packets (each around 5MB in size) from the time they are logged at one of the 30,000 enrollment centers around the country, through at least three stages of validation. Validation includes running duplication checks on each profile to ensure there is not more than one Aadhar number for the same person.

So, for every new enrollment, a ‘de-duplication’ check is done against all existing profiles, which currently number over half a billion.
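To make the idea of a de-duplication check concrete, here is a minimal sketch in Python. It is not Aadhar's actual matching method, which the officials did not describe. The feature vectors, the cosine similarity measure, the 0.95 threshold and the names `find_duplicate` and `existing` are all assumptions made for illustration. A real system working against half a billion profiles would also need indexing rather than the linear scan shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_duplicate(new_template, existing, threshold=0.95):
    """Return the ID of the first stored profile that matches, else None.

    `existing` maps a hypothetical profile ID to a stored feature vector.
    The threshold is an illustrative value, not a real operating point.
    """
    for profile_id, template in existing.items():
        if cosine_similarity(new_template, template) >= threshold:
            return profile_id
    return None


# Two already-enrolled (made-up) profiles.
existing = {
    "id-001": [0.9, 0.1, 0.3],
    "id-002": [0.2, 0.8, 0.5],
}

# A near-copy of id-001 is flagged; a clearly different vector is not.
repeat = find_duplicate([0.88, 0.12, 0.31], existing)
fresh = find_duplicate([0.1, 0.2, 0.95], existing)
print(repeat, fresh)
```

The point of the sketch is the shape of the check: every new enrollment is compared with what is already stored, and a match blocks the issue of a second number.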

Srikanth Nadhamuni, a former Intel engineer who helped set up Aadhar’s technology platform in September 2010, and is now running Khosla Labs in Bangalore, tells me that these data packets are stored behind 2048-bit encryption and capable of self-destruction if any unauthorized access is attempted.

Dealing with the MongoDB controversy

So why did Aadhar engage with MongoDB in the first place and will it continue working with the startup?

Sudhir Narayana, assistant director general at Aadhar’s technology center, told me that MongoDB was among several database products, apart from MySQL, Hadoop and HBase, originally procured for running the database search. Unlike MySQL, which could only store demographic data, MongoDB was able to store pictures.

However, Aadhar has been slowly shifting most of its database-related work to MySQL, after realizing that MongoDB was not able to cope with massive chunks of data running to millions of packets.

They have already started using ‘database sharding’: a process where data packets are stored across different machines to ensure the system does not crash as volumes rise.

This has helped Aadhar reduce its dependency on MongoDB and instead use MySQL for storing most of the data.
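As a rough illustration of what sharding means, the Python sketch below spreads packets across machines by hashing each packet's ID. The officials did not say how Aadhar assigns packets to machines, so the hash-based scheme, the shard count of 4 and the names `shard_for` and `distribute` are assumptions chosen for the example.

```python
import hashlib


def shard_for(packet_id: str, num_shards: int) -> int:
    """Map a packet ID to a shard number in the range 0..num_shards-1.

    SHA-256 gives a stable result, so the same ID always lands on the
    same shard no matter which process computes it.
    """
    digest = hashlib.sha256(packet_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(packet_ids, num_shards):
    """Group packet IDs by the shard each one is assigned to."""
    shards = {i: [] for i in range(num_shards)}
    for pid in packet_ids:
        shards[shard_for(pid, num_shards)].append(pid)
    return shards


# 1,000 made-up packet IDs spread over 4 hypothetical machines.
packets = [f"packet-{n}" for n in range(1000)]
shards = distribute(packets, num_shards=4)

for shard_id, members in shards.items():
    print(shard_id, len(members))
```

Because no single machine holds everything, each one handles only a fraction of the load as volumes rise. A simple modulo scheme like this one moves most keys when the shard count changes, which is why production systems often use other schemes, such as consistent hashing.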

Ashok Dalwai, deputy director general of the tech center, told me that MongoDB has no access to any biometric data.

“We believe in using open source technologies to avoid any vendor lock-in, but that doesn’t mean we are, in any way, compromising security,” Dalwai said.

When contacted, a MongoDB spokesperson redirected me to this announcement about the company’s funding involving In-Q-Tel.

More importantly, UIDAI started using MongoDB’s open source software well before the startup received any funding from In-Q-Tel. As this Crunchbase entry shows, MongoDB received venture round funding of $7.7 million from Red Hat, Intel Capital and In-Q-Tel only in 2012.

So what lies ahead for Aadhar?

Despite all the controversies surrounding it, Aadhar is on track to enroll over 1.2 billion Indian residents by the end of 2014, the officials added. This will create a database of about 15 petabytes in size.

Currently, the project is enrolling around one million residents in the country a day. Narayana told me that he’s confident of achieving around two million enrollments a day from next year, and that will help bring the remaining 700 million people into the database.
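A quick back-of-the-envelope check of those numbers, sketched in Python. The 700 million figure and the two daily rates come from the officials. Treating the rate as constant with no days off is a simplifying assumption, and `days_to_enroll` is a helper name made up for this example.

```python
import math


def days_to_enroll(remaining: int, per_day: int) -> int:
    """Whole days needed to enroll `remaining` people at a constant daily rate."""
    return math.ceil(remaining / per_day)


remaining = 700_000_000  # people still to be enrolled, per Narayana

days_at_current = days_to_enroll(remaining, 1_000_000)  # today's pace
days_at_target = days_to_enroll(remaining, 2_000_000)   # hoped-for pace

print(days_at_current, days_at_target)
```

At the current pace the remaining enrollments would take 700 days, or nearly two years. At two million a day they would take 350 days, just under a year, which shows why the higher rate matters for a target at the end of 2014.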
