Personal Data and DataRobot

Is personal data useful for AI predictions?

Personal data is usually not useful for creating predictive models of human behavior. The aim of a predictive model is to understand what is likely to happen in the future based on past behavior, and for that purpose it is rarely relevant to identify the specific person who behaved in a certain way. What is relevant is understanding the characteristics related to that person.

For example, consider a business that would like to use our automated machine learning to help recommend restaurants to its users. The business has access to user information such as name, date of birth, address, and preferences. To provide a recommendation, the user’s name is irrelevant, but date of birth and address are useful. However, converting these to age and zip code can both anonymize the data and improve model performance. It is easier to learn from 200 people who are 25 years old than from two people born on April 1, 1996, or 20 people born in April 1996.

Generalized data is more useful from a modeling perspective and is not personal data, so long as the data points, in combination, couldn’t identify someone. For example:

  • Instead of full address or precise geolocation > use zip code
  • Instead of telephone number > use area code
  • Instead of birthdate > use year or age
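
The substitutions above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the DataRobot product: the field names, the `generalize` and `smallest_group` helpers, and the sample records are all hypothetical, and the area code logic assumes a 10-digit US phone number stored as digits. The second helper is one simple way to check the "in combination" condition, by finding the smallest group of rows that share the same generalized values.

```python
from collections import Counter
from datetime import date


def generalize(record, today):
    """Return a copy of `record` with identifying fields replaced by coarser ones."""
    born = record["birthdate"]
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    return {
        "age": age,                          # instead of birthdate
        "zip_code": record["zip_code"],      # instead of full street address
        "area_code": record["phone"][:3],    # assumes a 10-digit US number
        "preference": record["preference"],  # name is dropped entirely
    }


def smallest_group(rows, keys):
    """Size of the smallest group of rows sharing the same values for `keys`."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(counts.values())


# Hypothetical sample data.
users = [
    {"name": "A. Example", "birthdate": date(1996, 4, 1), "street": "1 Main St",
     "zip_code": "02110", "phone": "6175550101", "preference": "thai"},
    {"name": "B. Example", "birthdate": date(1996, 4, 17), "street": "9 Elm St",
     "zip_code": "02110", "phone": "6175550102", "preference": "pizza"},
]

rows = [generalize(u, date(2021, 6, 1)) for u in users]
group_size = smallest_group(rows, ["age", "zip_code", "area_code"])
```

A small `group_size` (for example, 1) suggests that some combination of generalized values still points to a single person, in which case the fields may need to be generalized further before the data is treated as non-personal.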

Why is the use of some types of personal data in DataRobot restricted?

As noted above, personal data is usually not useful for predictive modeling. The types of personal data we prohibit customers from ingesting are highly regulated under various laws such as the GDPR, HIPAA, and the Gramm-Leach-Bliley Act. Not only are the requirements for handling such personal data stricter, so are the penalties if these types of personal data are compromised. It is an unnecessary risk to allow the ingestion of such personal data into our product where it is not useful for creating predictive models.

We do not want DataRobot or our customers to take this risk, whether in an on-premise or a SaaS deployment.

What types of personal data can customers use in DataRobot?

Our product does not have any technical restrictions preventing it from ingesting personal data.

However, our Master Subscription Agreement (MSA) prohibits SaaS customers from ingesting any financial or PCI data; any data regulated by the Health Insurance Portability and Accountability Act (HIPAA); social security numbers, driver’s license numbers, or other government ID numbers; any sensitive personal data as defined by the GDPR; personal data of individuals under 16 years old; and information subject to regulation or protection under the Gramm-Leach-Bliley Act, the Children’s Online Privacy Protection Act, or similar foreign or domestic laws. It is the customer’s responsibility to ensure their users follow this prohibition.

Customers may import any other type of non-sensitive personal data.

What if customers need to create predictive models using personal data?

For some industries, certain highly regulated types of personal data could be beneficial for AI modeling. For example:

  • Medical or healthcare data
  • Banking or financial data
  • Payment card data

In these scenarios, tokenization can be a useful tool. This process consists of replacing the key data points in the dataset with non-sensitive tokens and using the tokenized dataset in the SaaS platform. Once the prediction is made, the customer can re-identify the personal data using their key, outside of the DataRobot product. For example:

  • Instead of medical condition > token #1
  • Instead of birth date > token #2
  • Instead of bank account number > token #3

If tokenization isn’t an option, customers can explore using DataRobot’s on-premise platform, which has no limitations on customer personal data.

What customer data does DataRobot have access to? 

In the SaaS platform, we access the data a customer uploads only with the customer’s prior permission, and only to provide customer support. For on-premise deployments, we have no access to customer data.

Where is customer data stored and processed?

SaaS customer data is hosted at our AWS data centers in the US and Ireland. Customer data may be processed by employees located in the US, or by employees of our international subsidiary companies. For a list of our subprocessors and our subsidiaries, please visit here.

What about user data?

We collect certain information about our customers’ usage of the platform, such as technical logs, account and login information, user engagement, volume of data uploaded, number of models deployed, and feature usage.

The collection of user data is necessary for us to improve our product, develop new functionalities, perform diagnostics, and review performance trends. This data also helps us help our customers. With user data, we can provide performance analysis to ensure our customers are getting the most out of their subscription.

Does DataRobot offer contractual terms around information security and privacy?

On request, we can provide an exhibit to our MSA that details our information security controls in our SaaS platform. We also offer a Data Processing Exhibit for customers that opt to use personal data.