Thou shalt always anonymise.

Nov 21, 2018

Protecting personal information starts with basic ID encryption.

Tables are a wonder of information management. Their simple structure, with rows as records and columns as fields, effortlessly serves countless everyday needs. Unsurprisingly, they underpin a few bulwarks of the information age. Relational database management systems (RDBMSs), for example, are perhaps the single largest enabler of digital services around. And, stripped down to their core, they are nothing more than engines that manage data across tables and their relationships.

In order for any of that to work, tabular logic relies on keys. Residing in ‘special’ columns, keys are unique identifiers that establish the entity to which a particular record refers. With them, tables can be merged and dispersed information integrated across sources within large, comprehensive schemas. Alas, keys also open the door to serious threats to data privacy.

In a telco environment, for example, OLTP and OLAP systems function on the basis of subscriber-centric data: a client’s phone number (MSISDN), device identifier (IMEI) or SIM subscriber identity (IMSI) typically serves as the system-wide key. While that is a must to run day-to-day operations and generate valuable intelligence, it also exposes clients’ data to ‘snooping’. When database keys are directly linkable to known individuals, malicious agents can easily collate pieces of sensitive datasets and explicitly attribute them to data subjects.

At the very least, then, privacy-protective systems must embrace key encryption. That is not sufficient to guarantee bullet-proof anonymisation, because creative data manipulation can reveal data subjects without relying on known IDs, but it does help to mitigate blatant privacy intrusions.

The concept of masking unique identifiers is not complex per se. However, creating an entire data pipeline that can retain functionality (performing hassle-free set intersections) while using encrypted keys can prove a bit more difficult. That is especially so when, as is often the case, privacy is at risk from the very moment primary information is generated, before it can be derived into higher-order by-products.

“Creating an entire data pipeline that can retain functionality while using encrypted keys can prove challenging.”

One solution is never to store any data with unencrypted identifiers, from the outset. That ensures that information can flow around without being directly attributable to identifiable subjects. Key decryption, when properly authorised and justified, can be served through on-demand mechanisms.
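As a rough illustration of masking identifiers at source, the sketch below derives a stable pseudonym from each ID with a keyed hash (HMAC-SHA256) and keeps the reverse mapping in a lookup table, so that re-identification is possible only through an explicit, authorised call. Keyed hashing is swapped in here for the ‘encryption’ discussed above, as one simple option among several. The `Pseudonymiser` class, its `authorised` flag and the in-memory vault are hypothetical names invented for this example; a real system would hold the key and the mapping in access-controlled storage.

```python
import hashlib
import hmac


class Pseudonymiser:
    """Illustrative only: keyed-hash pseudonyms plus a guarded reverse lookup."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        # pseudonym -> original ID; stands in for an access-controlled store
        self._vault = {}

    def pseudonymise(self, identifier: str) -> str:
        # Same key + same ID always yields the same token, so joins still work.
        token = hmac.new(
            self._key, identifier.encode("utf-8"), hashlib.sha256
        ).hexdigest()
        self._vault[token] = identifier
        return token

    def reidentify(self, token: str, authorised: bool) -> str:
        # On-demand reversal; the boolean is a placeholder for a real
        # authorisation and audit process.
        if not authorised:
            raise PermissionError("re-identification requires authorisation")
        return self._vault[token]


masker = Pseudonymiser(b"hypothetical-secret-key")
token = masker.pseudonymise("447700900001")  # made-up MSISDN
```

Records would then be stored against `token` rather than the raw number, and `masker.reidentify(token, authorised=True)` would be the only route back to the subscriber.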

That solves one problem, but creates another, potentially larger one. A critical source of analytical value creation comes from the integration (and subsequent analyses) of disparate datasets, often generated by different systems. If each system applies its own encryption methods to what initially were universally utilisable IDs, how can data possibly be matched across sources later?

Two obvious alternatives are possible:

  1. encrypted sources must be ‘downgraded’ (have their keys decrypted) before they can be merged with sources that have not been safeguarded yet;
  2. non-encrypted sources, and their underlying systems, must be reworked to apply the same key encryption routines already present in other components.
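The second alternative can be sketched in a few lines: when two sources mask their IDs with the same keyed routine, the masked keys still line up and the tables can be joined without exposing the raw numbers; when the keys differ, nothing matches. The table contents, the key values and the helper names (`pseudonymise`, `mask_table`) are made up for illustration, and HMAC-SHA256 is an assumed stand-in for whatever routine a real system uses.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, key: bytes) -> str:
    """Deterministic keyed hash of an identifier (assumed masking routine)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_table(rows, key):
    """Replace each raw ID with its pseudonym; rows are (id, value) pairs."""
    return {pseudonymise(raw_id, key): value for raw_id, value in rows}


# Two hypothetical source systems keyed by (made-up) MSISDNs.
billing = [("447700900001", 31.5), ("447700900002", 12.0)]
network = [("447700900002", "4G"), ("447700900003", "5G")]

shared_key = b"hypothetical-shared-key"
other_key = b"hypothetical-other-key"

masked_billing = mask_table(billing, shared_key)
masked_network_same = mask_table(network, shared_key)
masked_network_other = mask_table(network, other_key)

# Same routine and key on both sides: the common subscriber still matches.
joined = {
    token: (masked_billing[token], masked_network_same[token])
    for token in masked_billing.keys() & masked_network_same.keys()
}

# Different keys: the intersection is empty, so the join is lost.
unmatched = masked_billing.keys() & masked_network_other.keys()
```

Here `joined` holds a single masked subscriber with values from both systems, while `unmatched` is empty, which is exactly the matching problem described above.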

Either option works easily enough within a single firm. However, different data controllers, each interested in promoting cross-company data enrichment but not willing to compromise on privacy, often want to merge datasets that remain ‘invisible’ to all sides.

In the real world, parties sometimes resort to a trusted third party, which computes a fixed function on all parties’ private input multisets. This unconditional trust is fraught with security risks, though: the trusted party may be dishonest or compromised. Privacy-preserving techniques and protocols allow for computation over multisets by mutually distrustful parties: no party learns more information about other parties’ private input sets than what can be deduced from the result of the computation[1].
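To give a feel for how two parties might find their common IDs without a trusted intermediary, here is a toy sketch of a commutative-exponentiation (Diffie-Hellman-style) private set intersection. It is not the protocol from [1], which is more general; it is a simpler, well-known construction chosen for brevity. Each party raises hashed IDs to its own secret exponent, and because exponentiation commutes, doubly-blinded values match exactly when the underlying IDs match. The modulus, function names and sample numbers are assumptions for illustration, both parties are simulated in one function, and the parameters are far too weak for real use.

```python
import hashlib
import math
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. Illustration only, NOT secure.
P = 2**127 - 1


def hash_to_group(identifier: str) -> int:
    """Map an ID to a non-zero element modulo P."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret_exponent() -> int:
    """Random exponent coprime to P - 1, so blinding is a one-to-one map."""
    while True:
        e = secrets.randbelow(P - 3) + 2  # value in [2, P - 2]
        if math.gcd(e, P - 1) == 1:
            return e


def blind(values, exponent):
    return [pow(v, exponent, P) for v in values]


def private_intersection(ids_a, ids_b):
    """Simulates both parties locally; in practice they run on separate machines."""
    a = new_secret_exponent()  # known only to party A
    b = new_secret_exponent()  # known only to party B

    # A -> B: A's hashed IDs, blinded with a.
    a_once = blind([hash_to_group(x) for x in ids_a], a)

    # B -> A: A's values blinded again with b (order kept), plus B's own
    # hashed IDs blinded with b and sorted to hide their original order.
    a_twice = blind(a_once, b)
    b_once = sorted(blind([hash_to_group(y) for y in ids_b], b))

    # A: blind B's values with a; H(x)^(a*b) == H(y)^(b*a) iff x == y
    # (barring hash collisions).
    b_twice = set(blind(b_once, a))
    return {x for x, t in zip(ids_a, a_twice) if t in b_twice}


common = private_intersection(
    ["447700900001", "447700900002", "447700900003"],  # made-up MSISDNs
    ["447700900002", "447700900003", "447700900004"],
)
```

In this sketch party A ends up with `common`, the IDs both sides hold, while neither side sees the other's remaining IDs in the clear. A production system would rely on a vetted cryptographic library and standardised group parameters rather than this hand-rolled arithmetic.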

The privacy engineering domain is in its infancy. The holy grail of information management, which is to produce maximum intelligence while imposing minimal risk of privacy infringement, may not be upon us yet. But, with all the research being done, it is fast approaching. In the meantime, responsible companies must get on with some basics: anonymise IDs at source, for all sources, and establish privacy-preserving capabilities to merge their data with other players’. For what will still be a long journey ahead, that would be a great start.

[1] Kissner, Song, 2006.