Anonymizing Customer Data for Development or Testing

Anonymizing data is tricky, and getting it wrong can land you in trouble, as AOL found when it released search data a while back. I would try to create test data from scratch at all costs before converting existing customer data. Behavioral analysis, combined with other data points that you might not consider sensitive, can make it possible to work out who the data belonged to. I would rather be safe than sorry.
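
As a rough illustration of building test data from scratch, here is a minimal Ruby sketch. The name pools, field names, and the `synthetic_customers` helper are all made up for this example; a real schema would need its own fields.

```ruby
require "date"

# Hypothetical name pools; any fixed lists would do.
FIRST_NAMES = %w[Alex Sam Jordan Taylor Morgan].freeze
LAST_NAMES  = %w[Smith Jones Brown Lee Garcia].freeze

# Build `count` synthetic customers that never touch production data.
# A seeded Random keeps the output reproducible between runs.
def synthetic_customers(count, seed: 42)
  rng = Random.new(seed)
  (1..count).map do |id|
    {
      id: id,
      name: "#{FIRST_NAMES.sample(random: rng)} #{LAST_NAMES.sample(random: rng)}",
      email: "user#{id}@example.com", # example.com is reserved for documentation use
      signup_on: Date.new(2020, 1, 1) + rng.rand(365)
    }
  end
end
```

Calling `synthetic_customers(1000)` would give a thousand rows to load into a development database, with no link back to any real customer.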

Remove sensitive data from development copies of production database (is there a gem for this?)

Looks like this will do what you're looking for:

http://sunitparekh.github.io/data-anonymization/

MS SQL Server - depersonalise data

You can use a Table type:

Create Type [dbo].[Columns] AS TABLE(
[name] [sysname] NOT NULL
)
GO

Create Proc Anonymise(
@table sysname
, @Columns [dbo].[Columns] READONLY
) as
begin
set nocount on

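--Checks (not implemented here):
--If @table not in sys.tables => error
--If @Columns empty => error
--@Columns not in sys.columns for @table => error
--Column type not char/varchar or type xxx => error

Declare @list nvarchar(max), @sql nvarchar(max)
--Build "[col] = dbo.AnonymiseFunction([col]), ..."; quotename() guards against odd or malicious names.
--dbo.AnonymiseFunction is a placeholder for your own scalar function (scalar UDFs must be schema-qualified).
Select @list = coalesce(@list+N', ', N'')+quotename(name)+N' = dbo.AnonymiseFunction('+quotename(name)+N')' From @Columns

Set @sql = N'Update '+quotename(@table)+N' Set '+@list
print @sql
Exec sp_executesql @sql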
end
GO

And execute:

Declare @cols [dbo].[Columns]
Insert Into @cols Values ('x'), ('y')
Exec Anonymise 'table', @cols

You can also replace the @Columns [dbo].[Columns] parameter with a comma-separated nvarchar(max) string.
In that case, declare a table variable at the beginning of the proc:

declare @columns Table(name sysname)

and use a CTE to split the string into the @columns table.

Split string: http://blogs.msdn.com/b/amitjet/archive/2009/12/11/sql-server-comma-separated-string-to-table.aspx
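
To make the split-validate-build flow easier to follow, here is the same logic sketched in Ruby rather than T-SQL. It is only an illustration: `build_anonymise_sql` and `quote_name` are hypothetical helpers, the list of known columns is passed in instead of being read from sys.columns, and `dbo.AnonymiseFunction` remains a placeholder name.

```ruby
# Bracket-quote an identifier, doubling any closing bracket inside it.
def quote_name(identifier)
  "[#{identifier.gsub(']', ']]')}]"
end

# Split "x, y" into column names, check them against the known columns,
# and build the UPDATE statement that would be executed.
def build_anonymise_sql(table, column_csv, known_columns, function: "dbo.AnonymiseFunction")
  columns = column_csv.split(",").map(&:strip).reject(&:empty?)
  raise ArgumentError, "no columns given" if columns.empty?

  unknown = columns - known_columns
  raise ArgumentError, "unknown columns: #{unknown.join(', ')}" unless unknown.empty?

  assignments = columns.map { |c| "#{quote_name(c)} = #{function}(#{quote_name(c)})" }
  "UPDATE #{quote_name(table)} SET #{assignments.join(', ')}"
end
```

Rejecting unknown column names before building the statement matters because the statement is assembled by string concatenation.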

Performance anonymize production data

I solved it like this, with a single update_all call. It reduced the time from 160 seconds to 0.09 seconds:

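# Compute one password hash and salt in memory, then reuse them for every user.
u = User.last
u.password = '123456'
# One UPDATE statement for the whole table; the array form lets Rails quote the bound values.
User.update_all(["email = CONCAT('user', ID, '@example.com'), crypted_password = ?, password_salt = ?", u.crypted_password, u.password_salt])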

Recommendation for maintaining dev database

In my experience, the best approach has been a centralized database and data set for each environment: Development, Testing+Integration, and Production.

  • Development: let the developers do whatever they want with it. If production-like data is required, obfuscate or remove sensitive data. The more lightweight this database is, the easier it is to move, maintain, and back up.
  • Testing: use it to simulate the production environment, and let the testers input and retrieve all the data they want, but only through your application interfaces. This environment also lets you test your deployments before sending them to production; you don't want a bad DB installer to leave the production app in an unusable state. If required, you can load this environment with production data, but obfuscate or remove sensitive data here too. You could use high volumes to spot performance issues before they reach production.
  • Production: leave your production data and environment alone. You don't want sensitive data to end up in the wrong hands, or a DB misconfiguration to let developers change data accidentally.
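
One possible way to obfuscate sensitive fields before copying rows into a Development or Testing database is keyed hashing, sketched below in Ruby. The field names (`name`, `email`, `phone`) and both helpers are assumptions made for illustration, not part of any particular schema. Note that this only masks direct identifiers; it does not stop someone from recognizing a customer through behavior patterns in the remaining data.

```ruby
require "openssl"

# Replace a sensitive value with a stable pseudonym. The same input and key
# always give the same token, so references between tables still line up.
def pseudonymize(value, key)
  digest = OpenSSL::Digest.new("SHA256")
  OpenSSL::HMAC.hexdigest(digest, key, value.to_s.downcase)[0, 16]
end

# Return a copy of the row with direct identifiers masked or dropped.
def obfuscate_customer(row, key)
  token = pseudonymize(row[:email], key)
  row.merge(
    name: "Customer #{token[0, 8]}",
    email: "#{token}@example.com",
    phone: nil # drop fields that development work does not need
  )
end
```

The key should stay with whoever produces the copy; anyone holding it can test guesses against the tokens.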

