GDPR – Federated Search Technology Pattern

The EU General Data Protection Regulation (GDPR) requires that European citizens’ personally identifiable information is controlled, secured, available on request and able to be erased by 25 May 2018. Additionally, customers have the right to easily change their marketing contact preferences within a reasonable time. For most organisations this is a challenge, because they are dealing with legacy systems under departmental control and the only data warehouse is provisioned for analytical purposes.

One way of achieving compliance is to create an enterprise-wide Data Lake containing all the organisation’s customer information, sourced from the different business units’ operational systems and from core systems such as email, intranet and shared drives. The Data Lake also contains all the original systems’ metadata (data about the data) plus provenance information such as back-links to the systems of record in the individual business units. These back-links to the operational systems allow the right to erasure to be exercised if a customer requests it. The metadata contains the creation and last-accessed dates from the system of record, along with security information that allows the correct access controls to be applied to the Data Lake.
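As a rough illustration of why the back-links matter, the sketch below models a Data Lake record that carries its provenance and source access controls, and resolves an erasure request to the systems of record that must act on it. The record shape, field names and system names are assumptions made for this example, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LakeRecord:
    """One customer record copied into the Data Lake (hypothetical shape)."""
    customer_id: str
    payload: dict
    source_system: str        # system of record the data came from
    source_key: str           # key in the system of record (the back-link)
    created: date             # creation date taken from the system of record
    last_accessed: date       # last-accessed date taken from the system of record
    allowed_roles: frozenset  # access controls carried over from the source


def erasure_targets(records, customer_id):
    """Return the (source_system, source_key) pairs to erase for one customer."""
    return sorted(
        {(r.source_system, r.source_key) for r in records if r.customer_id == customer_id}
    )


def visible_to(records, role):
    """Keep only the lake records that the given role is allowed to read."""
    return [r for r in records if role in r.allowed_roles]


# Illustrative data only.
LAKE = [
    LakeRecord("c-1", {"email": "a@example.com"}, "crm-retail", "R-100",
               date(2015, 3, 1), date(2018, 1, 10), frozenset({"retail-agent"})),
    LakeRecord("c-1", {"iban": "redacted"}, "billing", "B-7",
               date(2016, 6, 5), date(2017, 12, 2), frozenset({"finance"})),
    LakeRecord("c-2", {"email": "b@example.com"}, "crm-retail", "R-101",
               date(2017, 9, 9), date(2018, 2, 1), frozenset({"retail-agent"})),
]

print(erasure_targets(LAKE, "c-1"))
```

Deleting the lake copy alone would not be enough; the list of back-links is what lets each business unit remove the data from its own system of record.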

Ensuring a single view of the customer is much easier if all the enterprise data is within the Data Lake, including data from third-party SaaS (Software-as-a-Service) providers such as Salesforce or Microsoft Dynamics. The addition of Master Data Management technologies from Informatica or IBM can provide data cleansing either before the data reaches the Data Lake or during a search on the Data Lake. The Data Lake model also allows enterprises to resolve the issues around marketing contact preferences, which can be different in each customer relationship management system or account. Allowing customers to change their contact status or contact channels becomes easier if they can be found in the Data Lake.

However, creating a Data Lake can be very challenging if the security or data models are heavily embedded within the operational systems, or if local jurisdictional systems have to be used for access control and monitoring.

The alternative model is Federated Search, which for some organisations is a better solution. Federated Search keeps data sharing to a minimum because it uses a ‘merge on query’ approach to inter-departmental data, which potentially allows greater compliance with ‘privacy-by-design’ constraints on the systems. Additionally, a Federated Search can cross the organisational boundary into external data processors’ systems in real time.

The Federated Search model requires each departmental operational system to provide a full-text search index, either by re-using an existing index technology or by deploying a bespoke search capability using, for instance, Apache Solr or Elasticsearch. An Enterprise Search Service provides a central service or portal from which queries can be made. The Enterprise Search Service cascades any customer lookup query down to the departmental federated query engines, which then search for the data in their local index. If customer data is found, a specific query is then constructed and run against the operational system. The returned data is correlated, matched and linked to provide a single view of the customer’s data.

The actual search uses the security credentials of the user requesting the information from the Enterprise Search Service, so the security controls and logging are preserved. In addition, the business unit data owner retains real-time control over access to the data and can see the data access patterns within their existing context. Another benefit is that the local index protects the operational system from unexpected load or logging, because the queries issued by the federated query engines can be optimised for extracting specific information rather than for searching.
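The cascade described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a reference design: the class and function names are hypothetical, the local index is a plain dictionary standing in for a real search index, and matching records on an identical email address is a simplifying assumption (real matching and linking is considerably harder).

```python
class DepartmentEngine:
    """Hypothetical departmental federated query engine."""

    def __init__(self, name, index, system_of_record, readers):
        self.name = name
        self.index = index                        # local index: term -> record keys
        self.system_of_record = system_of_record  # record key -> record dict
        self.readers = readers                    # users permitted by local controls
        self.access_log = []                      # stays with the business unit

    def search(self, user, term):
        # Every request is logged locally under the requesting user's identity.
        self.access_log.append((user, term))
        if user not in self.readers:
            return []                             # local access control is preserved
        # Step 1: look the term up in the local index only.
        keys = self.index.get(term.lower(), [])
        # Step 2: targeted lookup by key on the operational system, not a search.
        return [dict(self.system_of_record[k], department=self.name) for k in keys]


def enterprise_search(engines, user, term):
    """Cascade a customer lookup to every engine and merge the results on query."""
    merged = {}
    for engine in engines:
        for record in engine.search(user, term):
            # Assumption: records with the same email belong to the same customer.
            view = merged.setdefault(
                record["email"], {"email": record["email"], "sources": []}
            )
            view["sources"].append(record["department"])
            for field, value in record.items():
                if field != "department":
                    view.setdefault(field, value)
    return list(merged.values())


# Illustrative data only.
retail = DepartmentEngine(
    "retail",
    {"jane": ["R-1"]},
    {"R-1": {"email": "jane@example.com", "name": "Jane Doe"}},
    {"alice"},
)
billing = DepartmentEngine(
    "billing",
    {"jane": ["B-9"]},
    {"B-9": {"email": "jane@example.com", "postcode": "AB1 2CD"}},
    {"alice", "bob"},
)

print(enterprise_search([retail, billing], "alice", "Jane"))
print(enterprise_search([retail, billing], "bob", "Jane"))
```

In this sketch the user "alice" receives a single merged view drawn from both departments, while "bob" sees only the billing data, and the retail unit still records his attempt in its own log. Nothing is copied to a central store; the merged view exists only for the duration of the query.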

From a delivery perspective, the Federated Search option means the organisation is not running a big programme at its centre, with the attendant issues of communication, governance, funding and additional dependencies on an already stretched enterprise. The individual business units have the freedom to define their indexing technologies and subsequent queries, and only need to comply with a well-defined API for data query and for security authentication and authorisation information. The system owner for Active Directory (or an equivalent identity and access management service) is not required to consolidate permissions from various systems. The Security Operations Centre does not need to take on new feeds from a new system and try to correlate them with those of the existing operational systems to determine access patterns.
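One way to picture the "well-defined API" is as a small contract that every business unit implements however it likes. The sketch below uses Python's `typing.Protocol` for this; the contract name, its two methods and the example units are all hypothetical, and a real API would more likely be defined as a network service than as a Python interface.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class CustomerSearchAPI(Protocol):
    """Hypothetical contract that each business unit agrees to implement."""

    def authorise(self, user: str) -> bool:
        ...

    def find_customer(self, user: str, term: str) -> list:
        ...


class InMemoryIndexUnit:
    """Stand-in for a unit that deployed its own search index."""

    def __init__(self, docs, readers):
        self._docs = docs
        self._readers = readers

    def authorise(self, user):
        return user in self._readers

    def find_customer(self, user, term):
        if not self.authorise(user):
            return []
        return [d for d in self._docs if term.lower() in d["name"].lower()]


class KeyLookupUnit:
    """Stand-in for a unit that reuses an existing exact-match lookup."""

    def __init__(self, rows, readers):
        self._rows = rows          # lower-cased name -> record
        self._readers = readers

    def authorise(self, user):
        return user in self._readers

    def find_customer(self, user, term):
        if not self.authorise(user):
            return []
        row = self._rows.get(term.lower())
        return [row] if row else []


def on_board(registry, name, unit):
    """Register a unit only if it exposes the agreed methods."""
    # isinstance() on a runtime-checkable Protocol checks that the methods exist;
    # it does not verify their signatures or behaviour.
    if not isinstance(unit, CustomerSearchAPI):
        raise TypeError(f"{name} does not implement the Customer Search API")
    registry[name] = unit


registry = {}
on_board(registry, "retail", InMemoryIndexUnit([{"name": "Jane Doe"}], {"alice"}))
on_board(registry, "billing", KeyLookupUnit({"jane doe": {"name": "Jane Doe"}}, {"alice"}))
```

The two units share no implementation and could sit on entirely different technologies; the central programme only needs to own the contract and the on-boarding check.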

The central technology programme is therefore responsible for defining the Customer Search API, the Federation Services and the API definitions, alongside an on-boarding plan that matches the pace of the overall organisation and of the individual business units.

Data Lakes are a powerful technology for organisations to deploy; however, with the deadline for GDPR compliance (25 May 2018) looming, some organisations may need to take a more expedient approach.

For more information about GDPR, please see the UK Information Commissioner’s Office website:

I encourage you to watch the video and to use the comments for feedback: suggestions, improvements, alternative approaches and critique.