Challenges with Cloning, Scrambling and Anonymizing Employee Data in SAP-HR – part 2
In my previous blog I sketched a framework to deal with cloning, scrambling and anonymizing employee data in SAP-HR. It builds on the experience that Adessa has gained in managing HR GDPR on SAP-HR. In this blog I want to build upon this framework and delve into the details and challenges of cloning.
Please note : although the technical solutions are written for SAP-HR, a large part of the blog is platform-agnostic.
In the “pre-GDPR” era, it was much less important to know exactly which employee data was stored where. You could simply select the SAP (sub-)modules for which you wanted to clone data from the production to the test or development environment. The cloning tool then “assessed” which SAP tables to copy in order to get a set of test data. Typically, any decent cloning tool would do some rudimentary scrambling to make identification of the employee in both Dev and Prod “a little more difficult”.
Those days are gone. With GDPR you have to make sure that sensitive and identifying data is accessible only to those who process that data to fulfil legal or contractual obligations. Typically, developers and testers are not among them.
But, the “deeper” you go in scrambling, the more challenges arise, especially when you scramble the identifiers (e.g. personnel number, system user-id, e-mail address).
This blog will cover:
- How to find out which employee data we store
- Some data challenges when scrambling
- Interfacing challenges when scrambling
- Scrambling types in a cloning tool
- Features to look for when selecting a cloning tool, if you do not have a “clone and scrambling” tool yet
Note: this blog is part of a series on HR GDPR. Here’s how you can visit the other parts:
- HR GDPR: What HR needs to know to manage the unique operational demands of GDPR
- HR GDPR Compliance with SAP-HR: Analysing the logs matters
- HR GDPR : Retention, Blocking and Deletion of Employee Data in SAP-HR
- HR GDPR : Cloning, Scrambling and Anonymization of Employee Data in SAP HR – Part 1
Start by finding out where the sensitive employee data is stored
When you buy the right cloning tool, the tool “knows” which tables and fields to copy. However, it is your job to tell the tool which fields to scramble. Before you can instruct the tool to scramble certain fields (identifying and/or sensitive data, see my previous blog) you need to know all tables and fields the company is using. Then, out of this total view, you select the fields which need to be scrambled.
We used the following methods to track which tables / fields in SAP-HR are in use:
- Functional analysis, where we consulted the HR-department to discover the functionalities (submodules of SAP-HR), infotypes and programs / tables in use.
- Technical analysis to check the tables in use directly in the database with tooling:
- Infotypes in use in Personnel Administration: program RPDINF01
- Objects, relations and infotypes in use in Organizational Management: a customer program created by Adessa Group
- Customer specific tables containing a personnel number or an HR-object: program RSCRDOMA
The outcome of the technical analysis was used to check the completeness of the functional analysis. In case a table / field was used according to the technical analysis, but not according to the functional analysis, that table / field was mostly obsolete (not in use anymore). Obsolete information will not be cloned to test and development environments.
After filtering out the obsolete information, we looked at which tables / fields contained sensitive data or data that could lead to identification of employees. These were the fields we had to scramble.
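The reconciliation described above boils down to a few set operations. The following is a minimal sketch in Python, not the tooling we actually used: the inventories, the `ZHR_OLD` table and the function name `reconcile` are assumptions made purely for illustration.

```python
# Illustrative sketch: every (table, field) pair below is example data.

def reconcile(functional, technical, sensitive):
    """Compare two inventories of (table, field) pairs.

    functional: fields the HR department says are in use
    technical:  fields that actually contain data in the database
    sensitive:  fields classified as sensitive or identifying
    """
    # Filled in the database but unknown to HR: probably obsolete,
    # to be reviewed before excluding them from the clone.
    obsolete_candidates = technical - functional
    # Confirmed by both analyses: candidates for cloning.
    in_use = technical & functional
    # Only the sensitive or identifying part of that set needs scrambling.
    to_scramble = in_use & sensitive
    return obsolete_candidates, in_use, to_scramble


functional = {("PA0002", "GBDAT"), ("PA0002", "NACHN"), ("PA0001", "PLANS")}
technical = functional | {("ZHR_OLD", "PERNR")}  # hypothetical customer table
sensitive = {("PA0002", "GBDAT"), ("PA0002", "NACHN"), ("ZHR_OLD", "PERNR")}

obsolete, in_use, to_scramble = reconcile(functional, technical, sensitive)
print(sorted(obsolete))
print(sorted(to_scramble))
```

In practice the "obsolete" set is a list of candidates to discuss with HR, not an automatic verdict.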
Data challenges while scrambling
Due to its inherent complexity, the customer decided not to scramble Organizational Management.
But we still faced plenty of other challenges when scrambling:
Scrambling related fields. Individual fields in Infotypes are easy to scramble. But what do you do with combined fields, where field #1 and field #2 together determine the value of field #3? For example: the address in the Netherlands. The postal code is determined by street, house number and town. Another example: the social security number in Belgium. The SSN is partly determined by gender and date of birth. One possibility to solve this is by using “Random Table with N columns” (see further under “Scrambling types in a cloning tool”).
Scrambling duplicated fields. The SAP-HR data model has a lot of duplication of data. If we scramble a field in Personnel Administration, is it also scrambled in all other SAP-HR tables? E.g. the “Date of Birth” occurs in an Infotype, in the payroll results (“clusters”) and in the tax office file. When we scramble “Date of Birth” in the Infotype, does the cloning tool scramble it automatically in the other related tables of SAP-HR as well? One possibility to solve this is by scrambling all fields using the same domain in the same way (see further under “Features to look for when selecting a cloning tool”). The tool detects all the different fields (fields with a different name) using the same domain, say “Date of Birth”. When this field is scrambled to value A in the infotype of a certain employee, the same value A is also proposed when scrambling the “Date of Birth” in the payroll cluster and in the tax office file. By checking the domain name, the scrambling tool is able to detect that fields, although named differently, have the same meaning and should be scrambled to the same value.
Scrambling higher ranked officials. The higher the employee is in the hierarchy, the higher the remuneration. Even when anonymized, higher ranked employees can often be identified by their remuneration or type of contract alone. Do you by default exclude these employees from cloning, such that they can only be cloned in specific cases and after permission? But this excludes them from “negative testing” and makes controls on totals more difficult, because by default they are not in the test system.
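The domain-based approach can be illustrated with a small Python sketch. It is not how any particular cloning tool works internally: the table, field and domain names, the class `DomainScrambler` and the date range are all hypothetical. The idea is simply that the scrambled value is looked up by domain and original value, not by field name.

```python
import random
from datetime import date, timedelta

# Hypothetical mapping from (table, field) to a shared domain name.
FIELD_DOMAIN = {
    ("INFOTYPE", "BIRTH_DATE"): "DATE_OF_BIRTH",
    ("PAYROLL_CLUSTER", "DOB"): "DATE_OF_BIRTH",
    ("TAX_FILE", "BORN_ON"): "DATE_OF_BIRTH",
}


class DomainScrambler:
    """Returns the same scrambled value for the same (domain, original value)."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._cache = {}

    def scramble_date(self, table, field, original):
        domain = FIELD_DOMAIN[(table, field)]
        key = (domain, original)
        if key not in self._cache:
            # Draw a random date from 1960-01-01 up to late 1999 (assumed range).
            days = self._rng.randrange(14600)
            self._cache[key] = date(1960, 1, 1) + timedelta(days=days)
        return self._cache[key]


scrambler = DomainScrambler(seed=42)
real = date(1975, 3, 9)
a = scrambler.scramble_date("INFOTYPE", "BIRTH_DATE", real)
b = scrambler.scramble_date("PAYROLL_CLUSTER", "DOB", real)
c = scrambler.scramble_date("TAX_FILE", "BORN_ON", real)
print(a == b == c)  # prints True
```

A side effect of this simple keying is that two employees with the same real date of birth also share the scrambled one; keying on personnel number plus domain would avoid that.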
Scrambling clusters. When scrambling e.g. “Date of Birth” or city in Personnel Administration, it could be, that the days off (or the pay) should be recalculated. Sometimes the leave quota depend on the Date of Birth. Or the holiday calendar varies per region and as such changes when the city is scrambled. Do you want to recalculate clusters? For larger customers this takes several days. Or do you accept a certain incoherence between Personnel Administration and Time / Payroll cluster? And how much cluster history do you copy anyway?
Maintaining the coherence between Infotypes. If you scramble PA-information in e.g. IT0001 Org.Assignment, this has consequences for the allowed values in e.g. IT0007 Work Schedule or IT0008 Basic Pay. One possibility to solve this is by using “Random Table with N columns” (see further under “Scrambling types in a cloning tool”).
Maintaining the coherence between Organizational Management and Personnel Administration. PA can inherit values from OM for example in IT0001 Organizational Assignment and IT0008 Basic Pay. A position is mostly an identifier, because there is only one employee per position. But if you change the position in PA, PA and OM may not be coherent anymore. Without being exhaustive, several alternatives exist:
- During the cloning and scrambling operation the integration between PA and OM is switched off on the receiving system. After the cloning operation, you switch on the integration and run the SAP-standard integration programs.
- When scrambling you use “Random Table with N columns” (see further under “Scrambling types in a cloning tool”) to scramble the position by randomly picking between similar employees e.g. within the same employee group / subgroup.
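The second alternative can be sketched as follows. This is one possible reading of "picking between similar employees", written as a permutation so that each position is still held by exactly one employee; the record layout and the function name `shuffle_positions` are assumptions, not features of a specific tool.

```python
import random
from collections import defaultdict


def shuffle_positions(employees, seed=None):
    """Redistribute positions among employees of the same group / subgroup."""
    rng = random.Random(seed)
    # Collect the positions held within each (group, subgroup) bucket.
    buckets = defaultdict(list)
    for emp in employees:
        buckets[(emp["group"], emp["subgroup"])].append(emp["position"])
    for positions in buckets.values():
        rng.shuffle(positions)
    # Hand the shuffled positions back out in the original employee order.
    result = []
    for emp in employees:
        key = (emp["group"], emp["subgroup"])
        result.append({**emp, "position": buckets[key].pop()})
    return result


employees = [
    {"pernr": 1, "group": "1", "subgroup": "A", "position": "P100"},
    {"pernr": 2, "group": "1", "subgroup": "A", "position": "P101"},
    {"pernr": 3, "group": "1", "subgroup": "A", "position": "P102"},
    {"pernr": 4, "group": "2", "subgroup": "B", "position": "P200"},
]
scrambled = shuffle_positions(employees, seed=1)
```

Because positions only move within a bucket, every position still exists in Organizational Management; a bucket containing a single employee (like group 2 / subgroup B above) is not scrambled at all and would need separate treatment.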
And finally, why do you want to clone “everything” anyway? Reasons not to clone everything:
- If you clone less information of a smaller number of employees, the impact of a data breach is likely to be smaller as well.
- What functionality do you want to test? Do you need to clone all information of all employees to test this functionality or is a limited set of tables sufficient?
- Some parts of the SAP-HR system function very stably and are rarely part of an incident, so why clone them to a test system?
- Some tables reveal a lot of personal information, but they are not needed for testing: PA-change logs, tables containing SAP user-id information, …
- Some parts are very complex to clone and scramble: information related to workflows, …
- The information is not needed, because it is obsolete / not in use anymore.
And a final remark.
A clone of a production system is not a guarantee for a perfect test population. Even for large customers, many realistic test situations can be created which do not (yet) exist in production. E.g. during one payroll implementation we consistently used a test population of only 70 out of 6000 employees. The first and only parallel run for these 6000 employees had only 50 errors, most of them upload errors. Cloning a productive environment is sometimes also an excuse for not knowing what you need to test.
Interfacing challenges while scrambling
Some of the interface challenges we face during scrambling are:
- How do you deal with scrambling in delta interfaces, i.e. interfaces which basically only send the changes (e.g. PU12)? Do you scramble all interface history? Or do you have no history in the cloned test system? Our customer did not have any delta interfaces inside its SAP-HR landscape.
- When you scramble, do you keep employee A in system #1 connected to employee A in system #2? In principle… When scrambling you also need to scramble personnel numbers and user-ids (including the SAP user-id), because these are identifiers. The impact is that, after scrambling, the personnel number of employee A in system #1 will be connected to employee B in system #2.
For example, consider SAP-HR, system #1, interfacing to SuccessFactors, system #2:
- Do you scramble personnel number and user-id’s? Due to the complexity, the customer decided not to scramble personnel number and user-id. However, then there is still identifying data in the SAP-HR (and SuccessFactors) system.
- Do you scramble the data in SAP-HR and SuccessFactors exactly the same way? Or do you scramble the master, say SAP-HR, first and then interface all the scrambled employees to the slave, say SF? But does the receiving system (e.g. its data model) accept this file? Completely emptying a receiving system and then importing the interface file is not always possible.
- An alternative in test and development is to split testing the interface from testing the functionality of the sending and receiving systems. When you test new functionality in the sending or receiving system, you test within that system, and the receiving system does not import the interface file. When you test the interface itself, it is often sufficient to study the file produced. Less frequently, you need to import the interface file into the receiving system to see that the interface is working. Only when adding completely new data fields to the interface may it be necessary to create the interface file and import it into the receiving system.
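The linkage problem with personnel numbers can be made concrete with a short Python sketch. It assumes both systems can be scrambled with one shared mapping table, which, as noted above, is not always feasible; the record layout, the number range and the helper names are hypothetical.

```python
import random


def build_pernr_mapping(pernrs, seed=None):
    """Map each original personnel number to a unique replacement number."""
    rng = random.Random(seed)
    originals = sorted(set(pernrs))
    # Draw unique replacements from a range assumed to be unused in production.
    replacements = rng.sample(range(90_000_000, 100_000_000), len(originals))
    return dict(zip(originals, replacements))


def apply_mapping(records, mapping):
    """Return copies of the records with the personnel number replaced."""
    return [{**rec, "pernr": mapping[rec["pernr"]]} for rec in records]


def linked_count(system_1, system_2):
    """Count personnel numbers that occur in both systems."""
    return len({r["pernr"] for r in system_1} & {r["pernr"] for r in system_2})


sap_hr = [{"pernr": 1001}, {"pernr": 1002}, {"pernr": 1003}]
successfactors = [{"pernr": 1001}, {"pernr": 1002}, {"pernr": 1003}]
mapping = build_pernr_mapping([r["pernr"] for r in sap_hr], seed=7)

# Scrambling only the source breaks every link to the target system.
only_source = linked_count(apply_mapping(sap_hr, mapping), successfactors)
# Applying the same mapping on both sides keeps the records aligned.
both = linked_count(
    apply_mapping(sap_hr, mapping), apply_mapping(successfactors, mapping)
)
print(only_source, both)  # prints 0 3
```

Note that such a mapping table makes the scrambling reversible, so it should never leave the productive environment.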
The graphics below show the challenge with interfaces when scrambling personnel number and name on the source system SAP-HR.
In this graphic, personnel number and name are aligned between SAP-source and SF-target.
Imagine the challenges which occur in a realistic interface landscape, such as the one below.
Scrambling types in a cloning tool
The following scrambling options can be thought of when replacing a value from production:
- Delete Value. The value in the specific field is deleted or set to its blank initial value. This option could be used for very sensitive data like medical data.
- Fixed Value. The value in the specific field is replaced by the same constant value for all records to be scrambled.
- Manual 1 To 1 Mapping. The value in the corresponding field is converted to a new value according to a conversion table. This conversion is the same for all records and all values.
- Number Conversion and Value Range Table. The value in the specific field is replaced by a random number. Sometimes you can choose an interval. This means the random number will lie within a certain interval. For example personnel numbers could be scrambled this way.
- Random Table with One column. You specify a set of values in a table. When replacing the real value by a random value, the system takes one of the values specified in the table. E.g. when replacing the last name, you specify a table with e.g. 100 names. After scrambling all employees will have one of these 100 last names.
- Random Table with N columns. It works similarly to VLOOKUP in Excel and makes it possible to maintain consistency between fields. For example: you have the social security number, which depends upon date of birth and gender. When scrambling the SSN, the system randomly chooses one of e.g. 500 valid rows. When scrambling date of birth, the system picks the date of birth from that same row, but from another column. And the same for gender. The result is that after scrambling, the SSN is valid for that date of birth and gender. It could work similarly for the Dutch postal code, which is the result of street, house number and town.
- Random Table for Time Periods. This is a method to scramble for example dates. When scrambling dates from a productive system, day, month and year are randomly chosen from three intervals. One interval is for day, one for month and one for year.
- Multiple Pass Scrambling. This allows you to scramble data which has already been scrambled. E.g. first you scramble the first and last name. After that you create the e-mail address.
- Custom Mapping. Without action required from the supplier you can write your own logic to scramble data.
- The scrambling types above are a brief extract from information at support.sap.com/tdms
- Without claiming completeness, other suppliers of cloning tools include EPI-USE Labs, Accenture and Spinifex IT.
- Some scrambling types are not available in all the mentioned tools or require a lot of programming in user-exits.
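A few of the scrambling types listed above can be sketched in plain Python. This is an illustration of the concepts only, not the implementation of any of the tools mentioned: the lookup tables, the fake SSN values, the `example.com` e-mail domain and the function `scramble_employee` are all invented for the example.

```python
import random

# Hypothetical lookup tables; a real set-up would hold far more rows.
FIRST_NAMES = ["Anna", "Bram", "Chloe", "Daan"]
LAST_NAMES = ["Jansen", "Peeters", "Maes", "Smit"]
# Each row is assumed to be valid as a whole (SSN matches birth date and gender).
CONSISTENT_ROWS = [
    {"ssn": "FAKE-0001", "birth_date": "1980-01-15", "gender": "F"},
    {"ssn": "FAKE-0002", "birth_date": "1975-06-30", "gender": "M"},
    {"ssn": "FAKE-0003", "birth_date": "1990-11-02", "gender": "F"},
]


def scramble_employee(emp, rng):
    out = dict(emp)
    # Delete Value: blank out very sensitive data.
    out["medical_note"] = ""
    # Random Table with One column: pick each name from its own list.
    out["first_name"] = rng.choice(FIRST_NAMES)
    out["last_name"] = rng.choice(LAST_NAMES)
    # Random Table with N columns: take all related fields from one row.
    out.update(rng.choice(CONSISTENT_ROWS))
    # Multiple Pass Scrambling: derive the e-mail from the scrambled names.
    out["email"] = f'{out["first_name"]}.{out["last_name"]}@example.com'.lower()
    return out


employee = {
    "pernr": 1001,
    "first_name": "Real",
    "last_name": "Person",
    "ssn": "REAL-SSN",
    "birth_date": "1970-01-01",
    "gender": "M",
    "medical_note": "confidential",
    "email": "real.person@company.example",
}
scrambled = scramble_employee(employee, random.Random(3))
```

The personnel number is deliberately left untouched here; scrambling it raises the interfacing questions discussed earlier.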
Features to look for when selecting a cloning tool
When the customer does not yet have a cloning and scrambling tool, this paragraph presents some points to consider before selecting a tool:
- If more than one SAP-module is in use: does the cloning tool only clone data for SAP-HR or also for other SAP-modules (see also question 1)? For which modules does the tool work without major customizing? Only a few cloning tools work for more than one SAP-module.
- Does the cloning tool work for SAP-HR only, or for both SAP-HR and SAP SuccessFactors?
- For how many countries will the cloning tool work (see also question 4)? E.g. in SAP Payroll many tables are country dependent. If a cloning tool works for an SAP-standard payroll in France, it is not certain that it also works for Belgium. How quickly will the supplier respond to requests to add SAP-standard, but country-specific, functionality?
- How difficult is it to add customer-specific infotypes and tables to the cloning tool (see also question 5)? By default no cloning tool will copy customer-specific tables. Can you add these tables yourself (maybe after a training) or do you have to call the supplier to do this?
- How many scrambling types does the tool support (see “Scrambling types in a cloning tool”)?
- Can the tool scramble the personnel number?
- Can the tool scramble information in the time and payroll results (“cluster”)?
- How easy is it to add your own scrambling routines or do you have to ask the supplier to program this?
- Will the cloning tool scramble related / duplicated fields in the same way or do you have to program this? When I scramble a certain field in the infotype, is it also automatically scrambled in the other tables in SAP-HR? E.g. the field date of birth occurs in an infotype, in the payroll results (“clusters”) and in the file for the tax office. Some tools can scramble similar fields the same way by looking at the field name or the domain used.
- Does scrambling take place at the source system, before transmission to the receiving system, or at the receiving system (e.g. in test or development)? Scrambling at the source system is preferable if the source is the productive system. Then unscrambled, real data is only available in the productive environment and never leaves it.
- Most of the tools for cloning and scrambling are licensed based either upon the number of employees or upon the size of your database. It could pay off to first cleanse your productive data and reduce its size before buying the tool.
Feel free to reach out with specific questions : Rudolf Von Stein : firstname.lastname@example.org