Reimagining Public Data to Advance AI: Maximizing Insights from Federal Data

Clifton_Roberts · 08-16-2019

The U.S. Federal Data Strategy is well-aligned with Intel’s recommendations that new regulatory initiatives should be comprehensive and technology-neutral, support the free flow of data, and promote access to data while liberating data responsibly. Intel has been, and remains, committed to helping federal agencies accelerate their efforts to make datasets publicly available, unleashing innovative, trusted, and inclusive AI.

Pragmatically Sorting Raw Data Sets

A significant challenge for the federal government is that data is stored across numerous agencies, resulting in data silos. A mandate to migrate siloed federal data to a single location is impractical because satisfying it would consume vast amounts of resources.

To confront this challenge head-on, we recommend that agencies view siloed data not as a barrier but as a unique opportunity, approached through the pragmatic lens of Secure Federated Machine Learning (SFML). In Figure 1 below, enormous effort goes into migrating data from existing locations to a single location. In Figure 2, SFML brings the processing to the data source for training and inferencing, rather than requiring that agencies migrate data to one place.

Figure 1

Figure 2

Such data federation ensures privacy and security of both data and machine learning models, offering additional confidence in post-deployment insights. SFML ensures that (i) data remains in place and compute moves to the data, and (ii) compute and data are protected at the hardware level from confidentiality and data integrity attacks.
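To make point (i) concrete, here is a minimal sketch of federated averaging, one common way to train a shared model while the data stays put. It is an illustration under stated assumptions, not a description of any specific SFML product: the silo contents, the one-parameter linear model, and the function names `local_update` and `federated_average` are all hypothetical, and the hardware-level protections in point (ii) are not modeled here.

```python
from typing import List, Tuple

# Each silo holds (x, y) records that never leave the owning agency.
Silo = List[Tuple[float, float]]


def local_update(w: float, data: Silo, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient-descent steps for the model y = w * x on one silo."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(w: float, silos: List[Silo], rounds: int = 50) -> float:
    """Each round: send w to every silo, collect the locally trained weights,
    and combine them as an average weighted by silo size."""
    total = sum(len(s) for s in silos)
    for _ in range(rounds):
        updates = [(local_update(w, s), len(s)) for s in silos]
        w = sum(w_i * n for w_i, n in updates) / total
    return w


# Hypothetical silos; both happen to follow y = 3x.
silo_a = [(1.0, 3.0), (2.0, 6.0)]
silo_b = [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)]

w_global = federated_average(0.0, [silo_a, silo_b])
print(round(w_global, 4))
```

Only the model weight crosses agency boundaries in this sketch; the raw records are read solely inside `local_update`. A real deployment would also need to protect the weights in transit and during aggregation.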

Numerous SFML studies have illustrated its effectiveness, allowing for faster deployment & testing of smarter models, lower latency, and less power consumption. Furthermore, SFML employs a combination of privacy-by-design techniques to ensure data de-identification, data protection, and insights security. As the commercial sector plans to leverage federal data for algorithmic expression(s), the business interests of those participants must be protected through security techniques ingrained at the lowest level of hardware: silicon!

Avoiding Replication of Biases

The paradox of data is that while data reflects societal issues, societal issues can also influence the data collected, which may then carry inherent racial, gender, economic, regional, and other types of bias. To date, nearly 200 human biases have been defined and classified, any one of which can affect how we make decisions. As such, there is a risk that the consumption, processing, and sharing of such data will lead to discriminatory insights that under-represent the macrocosm, which could erode trust between humans and machines that learn. Such bias could be particularly amplified in SFML, where learning occurs on each data node rather than on the combined dataset.

Constructing datasets and optimizing AI algorithms to avoid bias are interdependent functions. Continued investment by federal agencies in fostering access to large and reliable datasets is essential to the development and deployment of innovative, trusted, and inclusive AI. Specifically, greater diversity in datasets will reduce the risk of unintended bias. Diversity in the teams working to curate and then release datasets for AI can also address training bias. A diverse, inclusive team of individuals with different skills and approaches to the curation and release of datasets supports more holistic and ethical designs of AI algorithms.

Ethical Liberation of Datasets

One growing technology trend is the increase in mechanisms for data collection and creation. Personal data is not just collected from individuals who provide it for particular uses, but also observed and gathered by sensors in connected devices, and derived or created through further automated processing. In fact, the percentage of data coming directly from individuals is decreasing compared to the information that is collected in our increasingly connected society and inferred through machine learning technologies.

Ethical data processing is built on privacy. Given the increasing amount of data collected, processed, and inferred in the artificial intelligence space, strong encryption and de-identification (full anonymization) techniques serve to protect individuals’ privacy while achieving higher levels of security. Nonetheless, achieving de-identification will require increasingly complex practices because re-identification will be increasingly possible in deep learning-driven environments. Differential privacy (DP) techniques have emerged in recent years as viable solutions to minimize privacy risks, adding “noise” to scramble personal data. Furthermore, in the academic and research community, homomorphic encryption (HE) seems particularly promising because it allows computation on encrypted data, enabling AI tasks without the need to expose personal information.
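As a rough illustration of how DP "noise" works in practice, the sketch below releases a count using the Laplace mechanism, one standard DP technique. The records, the `epsilon` value, and the helper names `laplace_noise` and `private_count` are assumptions made for this example only; a production system would also need to track the privacy budget across queries.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
ages = [23, 35, 41, 52, 67, 29, 71, 44]  # hypothetical records

noisy = private_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng)
print(noisy)  # the true count is 2; the released value is perturbed
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means a more accurate but less private answer. Any single release is deliberately inexact, while averages over many independent releases converge on the true value, which is why repeated queries have to be budgeted.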

Thus, along with government investment in the development of international standards for algorithmic explainability and in promoting diversity in datasets, increased investment, research & development, and transparency of outcome-based studies about de-identification techniques like DP and HE are in the U.S. national interest to further drive innovations in the marketplace. Finally, in practice, data de-identification techniques should be uniformly applied across agencies to ensure that (i) data is anonymized and (ii) combining anonymized data across silos does not result in the re-identification of subjects.
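Point (ii) is easy to underestimate, so here is a small sketch of how two releases that each look safe can become identifying once combined. It uses k-anonymity (the size of the smallest group of records sharing the same quasi-identifier values) as an illustrative yardstick; that measure, the column names, the shared pseudonymous `pid` key, and the toy records are assumptions for this example and do not come from any agency dataset.

```python
from collections import Counter
from typing import Dict, List, Sequence

Row = Dict[str, str]


def k_anonymity(rows: List[Row], quasi_ids: Sequence[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(groups.values())


def join_on(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Inner-join two releases on a shared key (here a hypothetical pseudonymous ID)."""
    index = {r[key]: r for r in right}
    return [{**a, **index[a[key]]} for a in left if a[key] in index]


agency_a = [
    {"pid": "1", "zip3": "941", "age_band": "30-39"},
    {"pid": "2", "zip3": "941", "age_band": "30-39"},
    {"pid": "3", "zip3": "941", "age_band": "40-49"},
    {"pid": "4", "zip3": "941", "age_band": "40-49"},
]
agency_b = [
    {"pid": "1", "zip3": "941", "sex": "F"},
    {"pid": "2", "zip3": "941", "sex": "M"},
    {"pid": "3", "zip3": "941", "sex": "F"},
    {"pid": "4", "zip3": "941", "sex": "M"},
]

k_a = k_anonymity(agency_a, ["zip3", "age_band"])
k_b = k_anonymity(agency_b, ["zip3", "sex"])
combined = join_on(agency_a, agency_b, "pid")
k_combined = k_anonymity(combined, ["zip3", "age_band", "sex"])
print(k_a, k_b, k_combined)  # prints "2 2 1"
```

Each release on its own hides every person in a group of two, but after the join every record is unique on the combined attributes. A check of this kind, run against the plausible joins rather than against each silo separately, is one way agencies could test whether uniformly applied de-identification actually holds up across silos.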

Standards and Frameworks Implementation

Data is a complex field because it is subject to a variety of regulatory regimes, including those covering privacy, data sovereignty, localization, and cross-border transfers. This type of subject-matter variance cultivates disharmony in common regulatory approaches and stifles the development of regulations that rely on voluntary standards in evolving technical requirements for AI data use cases.

Access to large, reliable datasets is essential to the development and deployment of robust and trusted AI. Standards and guidelines play an important role in developing approaches for access to AI datasets, but they need to be carefully defined for different use-case contexts and must account for common ethical concerns, including privacy regulations. Data-related standards, such as metadata and format interoperability standards, can:

make public sources of information available in structured and accessible databases;

create reliable datasets (while employing data de-identification techniques) for use by all AI developers to test automated solutions and benchmark program quality; and

foster incentives for data sharing between the public and private sectors and among industry players.

In similar areas highly affected by privacy regulations, international standards have been successful in defining useful mechanisms to support a harmonized approach to regulatory objectives. Examples of such standards include the Do Not Track standard (under W3C) and anonymous signatures and authentication standards (under ISO/IEC JTC 1/SC 27).

Re-imagining Data

There is a tremendous amount of data generated daily that must be stored, secured, and organized. More importantly, the value that all of this data represents is nearly immeasurable, and that value comes from analysis and the resulting insights. As the U.S. Federal Government embraces data as a strategic asset, society may experience the next great business opportunity, societal advancement, or scientific discovery.