K. Pommerening, M. Miller,
I. Schmidtmann, J. Michaelis

Institut für Medizinische Statistik und Dokumentation
der Johannes-Gutenberg-Universität,
Mainz, Germany
`
`Pseudonyms for Cancer Registries
`
Abstract: In order to conform to the rigid German legislation on data privacy and security we developed a new concept of data flow and data storage for population-based cancer registries. A special trusted office generates a pseudonym for each case by a cryptographic procedure. This office also handles the notification of cases and communicates with the reporting physicians. It passes pseudonymous records to the registration office for permanent storage. The registration office links the records according to the pseudonyms. Starting from a requirements analysis we show how to construct the pseudonyms; we then show that they meet the requirements. We discuss how the pseudonyms have to be protected by cryptographic and organizational means. A pilot study showed that the proposed procedure gives acceptable synonym and homonym error rates. The methods described are not restricted to cancer registration and may serve as a model for comparable applications in medical informatics.
`
`Keywords: Cancer Registry, Data Protection, Data Encryption, Pseudonyms,
`Record Linkage.
`
`1. Introduction
`
Until recently, the rigid German legislation on data privacy and data security has hindered comprehensive cancer registration in major parts of Germany. The new European directive on data protection [1] may pose further difficulties. The basic premise states that permanent storage of an individual's medical data together with his/her identification data is allowed on the basis of informed consent only. However, many cancer patients nowadays [...] the nature [...] and, therefore, [...] registry. Hence, it is desirable that physicians should have the right to notify incident cases without obtaining informed consent in order to assure the necessary completeness of cancer registration. Notification without informed consent is regarded as violation of an individual's constitutional right to data privacy, unless it is compensated by anonymity.
A cancer registry, however, needs identification data for record linkage, to identify multiple notifications of the same individual, and to record follow-up information on individuals. On the other hand, scientific analysis of the registry data is generally performed anonymously and does not include any reference to individual identification data.

To minimize the violation of data privacy we developed a new organizational and technical concept for cancer registries which has been approved by data-protection officials and incorporated into the corresponding German federal legislation [2]. In our concept the registry is separated into two offices with complementary functions. The concept makes extensive use of data encryption and provides data privacy by pseudonymous data storage. This mode of data storage allows record linkage by matching of pseudonyms and does not interfere with the scientific requirements of a cancer registry. In certain cases a controlled re-identification of records might be necessary to obtain follow-up information about cases. The concept includes provisions for achieving this.

A pilot study was initiated in 1992 to explore the possibilities for running a population-based cancer registry in Rheinland-Pfalz (Rhineland-Palatinate) on the basis of this concept [3-5]. The results show that the proposed compromise between research interests and privacy issues is practicable and sound. Further overviews have been given in [6-8]. The concept has also been adopted for the pilot phase of the cancer registry of Niedersachsen (Lower Saxony) [9].

The cryptographic concept of pseudonymity can be adapted to other situations where a fundamental conflict between the goals of privacy and public interest needs to be solved, e.g., to control the efficiency of health care [10, 11].
`
`
2. Pseudonyms

Pseudonyms are distinct, unlinkable identities that an individual assumes in order to hide his or her true identity. In information technology pseudonyms control the matching of data while preserving privacy. A pseudonym belongs to one person only (henceforth called 'the owner') but does not reveal the identity of that person. If only the owner can uncover the pseudonym, it is called 'untraceable'. This concept was introduced into cryptology by Chaum [12]; it is useful to protect privacy in electronic banking, electronic elections, and other electronic transactions. Possible (but not yet realized) applications in the medical domain are anonymous electronic prescriptions [10] or the settlement of accounts between physicians and insurance companies [11].

Cancer registries need a distinct kind of pseudonyms which must satisfy the following requirements:
1. The registry must be able to recognize multiple notifications of the same case (record linkage).
2. The record linkage procedure should minimize synonym and homonym errors (see section 6) to yield sufficient data quality.
3. Collaborating registries should be able to match their records.
4. In certain controlled circumstances the uncovering of a pseudonym should be possible for obtaining additional information, e.g., within the scope of case-control studies.
5. The owner should not be able to uncover his own pseudonym.

This last point derives from the right to notify a case without informing the patient about his disease. It implies that the owner should not generate his pseudonym; instead, we need a trusted institution that generates the pseudonyms.

To satisfy the first requirement the pseudonym should be generated by an algorithmic procedure that can be reproduced. The preferred method is hashing [13, par. 6.4]. Since the hash values should not reveal any information about the original data, we use a cryptographic hash function [14, chap. 14]. Since no one except the trusted institution should be able to generate a pseudonym for cancer registry, the procedure should depend on a secret key which is kept by the trusted institution. Such a pseudonym can by no means be uncovered; the key-dependent procedure even prevents unauthorized trial encryption, at least from outside.

This kind of pseudonym does not meet requirement 2; the reason is lack of fault tolerance: the encryption process cannot compensate for slight variations in the identification data, e.g., mistakes in spelling the name. This is not a problem when machine-readable identification data on patient cards can be used; but this is not always the case. Certain notifying institutions, such as pathologists, may not have access to the patient card. Old data (from the time before the introduction of patient cards) should also be linked. In any case, requirement 2 conflicts with complete anonymity; the model has to provide a balance between these two conflicting goals. What we need is a concept of error detection and error correction for encrypted data. Finding an optimal solution is an interesting problem for further research. As a first solution we divide the 'one-way' part of the pseudonym into a set of 'linkage data' that satisfy requirements 1, 2 and 5.

In order to meet requirement 4 we add a second part to the pseudonym. This part derives from the identification data of the patient by encryption; the key is known only to the trusted institution. For reasons to be discussed later we use asymmetric encryption with two keys (see section 5.1).

The reason for requirement 3 is that the German Federal States will have separate registries. To enable anonymous data matching between these registries they could use a common cryptographic key, but this is not advisable: a secret loses its value if shared among too many parties. Therefore, for inter-registry linking we propose a re-encryption of the first part of the pseudonym with a temporary (one-time) key (for details, see section 5.3).

Our concept of pseudonymity in cancer registry needs an organizational framework that is described in the next section.

3. Organizational structure of registry

The cancer registry consists of two separate offices at separate locations. The first office (trusted office, "Vertrauensstelle") basically serves for the notification and generates the pseudonyms. The second office (registration office, "Registerstelle") links the records and stores data permanently.

3.1. Identity Data and Epidemiological Data

In the following we distinguish between identity data and epidemiological data. Identity data are:
- surname, former surname(s), given name(s),
- address,
- date of birth, date of death,
- date of diagnosis,
- notifying physician or health-care institution.

Epidemiological data are those data that are needed in every meaningful statistical evaluation of the registry data:
- gender,
- census code of place of residence,
- professional group,
- year of birth, year of death,
- year of diagnosis,
- date of notification,
- tumor classification,
- further medical data.

3.2. The Trusted Office

The trusted office accepts incoming reports from physicians or hospital-based cancer registries. These reports are checked for completeness and plausibility. If necessary, this office obtains additional information from the reporting physicians. It codes the reported diseases according to classification schemes such as ICD-9 and ICD-10. Thereafter, it assigns a pseudonym to the record, and sends the pseudonymous record to the registration office. After a short period of time, when any discrepancies are cleared, the trusted office deletes the records in its database. Death certificates are also sent to the trusted office and handled in the same way as notification forms.
`Meth. Inform. Med, Vol. 35, No.2, 1996
`
`
`
`
`
`
`
`
`
Fig. 1 Organizational structure and information flow. [Figure not reproduced. Legible labels: Physician; Hospital-based registry; Health care institution; Public health department (death certificates); Trusted office (encrypts identification data, forwards reports); implausible data; Registration office (stores pseudonyms, epidemiological data).]
`
The trusted office is directed by a physician and, therefore, is subject to professional discretion in addition to data-protection laws. It is trusted by all other parties, hence the German name "Vertrauensstelle". Nevertheless, the decryption key (the 'private' key of the asymmetric encryption procedure, henceforth called 're-identification key') is held in a second trusted institution outside the cancer registry. There are several sensible choices for this institution; in the following we call it the 'supervising office'. The separate handling of the re-identification key emphasizes the 'separation of informational powers' and makes clear that decryption (= re-identification) is an exceptional process. Moreover, it gives additional security in case of a compromised encryption key.

3.3. The Registration Office

The registration office receives pseudonymous data only. With these data it performs record linkage and detects duplicate notifications; then it stores the pseudonyms and the epidemiological data permanently. If the record linkage reveals any inconsistencies, these are reported back to the trusted office which, in turn, may sort out any discrepancies by contacting the reporting physicians. In the same way the office links a death certificate to an existing patient record. Figure 1 illustrates the data flow. Only the registration office stores records permanently.

3.4. Epidemiological Studies

The pseudonymous records serve for routine analyses of the cancer registry as well as for epidemiological studies. Figure 2 illustrates the procedure for a cohort study: if a well-defined cohort (e.g., occupationally exposed employees of a company) is to be analyzed for the occurrence of cancer, a sequence number is assigned to each individual member of the cohort and possibly also to non-exposed controls. These sequence numbers serve as simple temporary pseudonyms for the study. A research institute (which could also be the registry) obtains a record for each individual containing the sequence number and the exposure data. A record containing the sequence number and personal identification data is sent to the trusted office in parallel. This office generates the pseudonym and sends it to the registration office, together with the sequence number. The registration office performs the record linkage and generates a record which contains the sequence number and the epidemiological data stored in the registry. Thereafter, epidemiological data and exposure data may be linked for further analysis by using the sequence number. This procedure ensures that for the purpose of the study nobody sees which cohort members were diseased.

A corresponding procedure applies to case-control studies if only the epidemiological data which are kept in the registry are needed for such a study. If it is necessary to obtain additional information from the diseased patients, the identification data may be decrypted using the re-identification key which is kept in the supervising office (see section 3.2). Re-identification has to be approved by an ethics committee and is done in the supervising office; technically this could also be realized with a portable PC operated by an employee of the supervising office. The decrypted identification data are then given to the trusted office. In some cases the necessary data can be retrieved from the notifying institution. If it is necessary to contact the patient for an additional inquiry, the trusted office has to obtain informed consent from the patient via the notifying or treating physician whose identity is stored as part of the (encrypted) identification data of the patient (see section 3.1).

Fig. 2 Record linkage for cohort studies. [Figure not reproduced. Legible labels: Source of cohort; Trusted office; Registration office; Research institute. Items passed between them: sequence number, identification data, exposure data, pseudonym, epidemiological data.]

4. A Registry Model

Since a strict formalization of the procedures of the previous section in the sense of [15] would be too technical for this paper, we only give a systematic verbal (semi-formal) description and the access matrix of the registry model; some of the less relevant details are given in a slightly simplified form. Every assumption of the model should be critically examined as to whether it is sound. For instance, can a party do things it is not supposed to do? What can two or more parties achieve through collaboration? The model will not give absolute security but will show where additional (organizational) means should be provided. The organizational framework has to guarantee the model assumptions and fill the security gaps that the cryptographic procedures leave open.

In discussing the security of the model we assume that the cryptographic algorithms are secure and that they are implemented in a secure way. The first assumption is justified by using state-of-the-art cryptographic techniques. The second assumption is more problematic and needs careful organizational measures.
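The two questions raised above (what a single party can do, and what collaborating parties can achieve) can be explored mechanically once the inference rules of section 4.2 are written down. The following is a minimal sketch under stated assumptions: the rule table transcribes the inferences listed in section 4.2, while the function name and the example knowledge set for the registration office are illustrative choices, not part of the paper. It covers only direct inferences; the indirect attacks of section 4.3 (trial encryption, statistical attacks) are outside its scope.

```python
# Inference rules of section 4.2 as (required key or None, source, target).
# Two-way inferences are entered once per direction.
RULES = [
    (None, "id", "ld_h"),      # hashing needs no key
    ("k_e", "id", "ps"),       # asymmetric encryption
    ("k_re", "ps", "id"),      # re-identification
    ("k_ld", "ld_h", "ld_l"), ("k_ld", "ld_l", "ld_h"),
    ("k_st", "ld_l", "ld_s"), ("k_st", "ld_s", "ld_l"),
    ("k_x", "ld_h", "ld_x"), ("k_x", "ld_x", "ld_h"),
]


def closure(known):
    """Return every item derivable from the given data and keys."""
    derived = set(known)
    changed = True
    while changed:
        changed = False
        for key, source, target in RULES:
            if source in derived and target not in derived and (key is None or key in derived):
                derived.add(target)
                changed = True
    return derived


# Assumed knowledge of the registration office, read off section 4.1.
registration_office = {"ps", "ld_l", "ld_s", "ep", "sq", "k_st"}

print("id" in closure(registration_office))             # no direct path to the identity
print("id" in closure(registration_office | {"k_re"}))  # collusion with the key holder
```

A knowledge set is 'closed' in the sense of section 4.2 exactly when `closure` returns it unchanged.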
`
4.1. Data and Parties

In the semi-formal description of the model we speak of the patient, the cooperating registry, the sequence number etc., although in reality there are several instances of each of these classes.

The knowledge (or data) in our model consists of the following parts:
- The identity data (see 3.1).
- The pseudonym:
  - the encrypted identity (see 5.1),
  - the linkage data (see 5.3); they occur in 'pure hash' format, in 'linkage' format, in 'storage' format, and in 'exchange' format (see Fig. 5).
- The epidemiological data (see 3.1).
- The sequence number, a temporary pseudonym for a research project as in 3.4.
- The encryption key for asymmetric encryption of identification data.
- The re-identification key for re-identification of identity data.
- The linkage data key for generating the linkage data (see 5.3).
- The storage key for permanent storage of the linkage data (see 5.3).
- The exchange key for inter-registry record linkage (see 5.4).

Moreover, we have the identification data of the notifying institution for clearing discrepancies, for obtaining follow-up information, for reporting follow-up information in the case where the notifying institution is a clinical cancer registry, and for compensating the reporting physician for his notification. The trusted office also stores other administrative data.

The relevant parties for our model are the following; for each of these parties we have to define what knowledge it has or transfers and which other parties it trusts:
- The patient has access to his own data, but only via his treating physician.
- The notifying institution knows the data of its own patients:
  - The treating physician notifies the registry of his patients and can be asked by the trusted office about them.
  - Other health-care institutions which also send notifications are clinical cancer registries, after-care institutions, and Public Health offices.
- The trusted office sees all the data except the re-identification key and the storage key. It permanently stores only the encryption key and the linkage data key.
- The supervising office keeps the re-identification key and sees the identity data of re-identified cases.
- The registration office sees the pseudonym, the epidemiological data, the sequence number, the storage key, and also stores these data permanently (except the sequence number).
- The cooperating registry:
  - The trusted office sees the exchange key and the pseudonyms, even in pure hash format.
  - The registration office sees the linkage data in its own linkage format. In case of a match it gets the full registry data, which is the aim of the linking procedure.
- The research institute gets the sequence number and the epidemiological data as well as the exposure data which are outside the scope of the registry model (see 3.4).
- The outsider (any person or institution other than those listed above) has access only to communication paths and perhaps to storage media, if these leave the registration office, say, in case of a hardware defect.

The bank where the notifying physician has his account is ignored. Only a very small amount of information can be gained by observing the financial transfers, e.g., that a certain physician has a cancer patient at a certain time.

In the following we discuss only the parts of the model that are relevant for the pseudonymity aspect. For example, data on storage and communication media should be useless for the outsider; this is achieved by encryption of all communication paths and all storage media. In particular, the notifying institutions should communicate with the trusted office in a secure manner, i.e., using encrypted data transfer. Henceforth, we assume that the outsider can gain data access only through collaboration with some other institution, and leave the security of communication and storage outside the scope of this paper.

4.2. The Access Matrix

Figure 3 gives the access matrix of the registry model. We have to show that no party can get additional information by inferencing, in other words, that the access matrix as shown in Fig. 3 is complete. Since the model involves cryptographic keys, i.e., data that imply access to other data, the question is what subsets of the set of data in the access matrix are 'closed' with respect to inferencing. This gives only a 'naive' proof of security; there are indirect ways for getting additional information (see section 4.3).

We have a single inference that needs no key:

$id \rightarrow ld_h$,

where the symbols are taken from Fig. 3 and the arrow denotes the inference. In other words: whoever has the identification data can derive the linkage data in pure hash format, because the hash algorithm is publicly known and needs no key. The complete list of key-dependent inferences is as follows:

$k_e$: $id \rightarrow ps$,
$k_{re}$: $ps \rightarrow id$,
$k_{ld}$: $ld_h \leftrightarrow ld_l$,
$k_{st}$: $ld_l \leftrightarrow ld_s$,
$k_x$: $ld_h \leftrightarrow ld_x$.

Therefore, the access matrix is complete. The only way to infer the identification data $id$ is by knowledge of $ps$ and $k_{re}$, the encrypted identification data and the re-identification key. Hence this can only be done by the supervising office.

Fig. 3 Access matrix of the registry model. (1) only own patients; (2) only re-identified cases; (3) in its own linkage format. Legend: s = sees (and temporarily stores); k = keeps (= permanently stores); d = can derive. [Matrix entries not recoverable. Legible row labels (parties): Patient; Notifying institution; Trusted office; Supervising office; Registration office; Cooperating trusted office; Research institute; Outsider. Legible column labels (data): Pseudonym (encrypted identity) [ps]; Linkage data (pure hash format) [ld_h]; Linkage data (linkage format) [ld_l]; Linkage data (storage format) [ld_s]; Linkage data (exchange format) [ld_x]; Epidemiological data [ep]; Sequence number [sq]; Re-identification key [k_re]; Encryption key [k_e]; Linkage data key [k_ld]; Storage key [k_st]; Exchange key [k_x].]

4.3. Indirect Ways for Re-identification

The goal of the registry model is to make unauthorized re-identification as difficult as possible. However, what is possible, if the access matrix is guaranteed by the implementation of the model? The multitude and nature of indirect ways for making inferences about the data cannot be completely delineated. This is the main difficulty in proving the validity of any security model formally. Some relevant methods that should be considered are:
- trial encryption (guessed plain-text attack),
- data matching with outside sources [16],
- statistical attacks [16],
- covert channels [17],
- social engineering (voluntary or forced collaboration).

The outsider sees none of the data. He could gain access only by collaboration with another party.

The research institute sees the epidemiological data and could try an unauthorized matching with an external data source. This danger is inherent in the granularity of the epidemiological data and cannot be made smaller by any model whatsoever. Therefore, the release of subsets of epidemiological data is restricted according to a specific project.

The cooperating registration office only sees the linkage data in its own linkage format. It could try a statistical attack to find out some frequent names or use distribution anomalies of birth data. But this will hardly suffice to identify even a single case other than those that this registry has among its own records.

The cooperating trusted office sees the linkage data even in pure hash format and could perform a trial encryption. However, it is trusted by definition.

The registration office could try illegal data matching with the epidemiological data and a statistical attack on the linkage data in linkage format.

The supervising office sees the identity data of re-identified cases. However, it is also trusted, and it gets only few data.

The trusted office sees the identification data and the epidemiological data, but it is trusted by definition.

The notifying institution and the patient get no knowledge of data they should not know. They know their own data only.

The question what a party can do that has unauthorized knowledge of an additional piece of data, say, by collaborating with another party, can be answered by the analysis in section 4.2. Covert channels could be exploited, for instance, by faking notifications; we come back to this in section 7.1. Unauthorized matching with epidemiological data is only possible for an employee of the registration office or of the research institute; the trusted office that also sees the epidemiological data sees the identity anyway.

5. Encryption Procedures

Encryption of identifying data is performed by using different techniques which are suited for different purposes. A detailed technical description of the basic algorithms is given in [14]. As a basis to assess the performance of the procedures one has to take an expected number of 50,000 notifications each year for Rheinland-Pfalz. The efficiency of the procedures also suffices for larger registries.

5.1. Asymmetric Encryption of Identification Data

Asymmetric encryption techniques use two different keys for encryption and decryption, often called 'public key' and 'private key'. This notation, however, does not fit in the present context. Therefore we speak of 'encryption key' and 're-identification key'. Knowledge of one of the keys does not help in any way to derive the other.

The identity data of each incoming record are encrypted in the trusted office using the encryption key, see Fig. 4. If, under special circumstances (as in 3.4), the decryption of some identification data becomes necessary, the registration office sends the encrypted identity data back to the trusted office that initiates the re-identification, see section 3.4.

The most suitable asymmetric encryption method, according to the state-of-the-art, is the RSA algorithm [14, 18, 19]. It uses the mathematical operation of modular exponentiation, $x \mapsto x^e \bmod n$; character strings are treated as numbers according to their bit patterns and decomposed into blocks such that each block represents a number smaller than $n$. The modulus $n$ is a very large number. The exponent $e$ is the encryption key. The re-identification key $d$ has a size similar to $n$ and the property that $x^{ed} \equiv x \pmod{n}$. Thus, modular exponentiation with $d$ is the inverse operation of modular exponentiation with $e$. Deriving e from n and d requires decomposition of n into its prime factors, a task that is computationally infeasible if n is large enough. Experts recommend a key length of >700 bits [20]. Since in a cancer registry data are stored for a long time, one should rather choose a key length of >1,000 bits to be prepared for possible technological progress. For performance reasons, instead of RSA one could use a hybrid encryption method [19, section VI.7] such as RSA + DES or PGP (RSA + IDEA) [14, section 17.9]. This makes sense as soon as the data to be encrypted are longer than a single RSA block. DES and IDEA are symmetric encryption procedures, meaning that encryption and decryption use the same key. The exact description is too complicated to be given here; we refer to [14, 17]. They are several orders of magnitude faster than all known asymmetric procedures but do not fit directly to our model which relies on asymmetric encryption. Therefore, a hybrid combination with RSA has to be used.

If an employee of the registration office gains knowledge of the encryption key, or if an outsider gains knowledge of the encryption key and access to the registered data, he could perform a trial encryption ('chosen plain-text attack') with the corresponding identity data. In order to prevent this possible misuse, each record is complemented by a random number before encryption. As shown in Fig. 4, this random number is kept in the encrypted part of the record.
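The effect of the random number can be seen in a toy version of the scheme. The sketch below is an illustration under loudly stated assumptions: it uses textbook RSA with two Mersenne primes so that it runs without a key generator (such primes are unsuitable for real keys), a 16-byte random prefix in place of a standardized padding scheme, and hypothetical function names. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import secrets

# Toy key material (assumption): Mersenne primes, chosen only for brevity.
P = 2**521 - 1
Q = 2**607 - 1
N = P * Q                          # modulus n
E = 65537                          # encryption key e
D = pow(E, -1, (P - 1) * (Q - 1))  # re-identification key d
NONCE_BYTES = 16


def encrypt_identity(identity: str) -> int:
    """Prefix the record with a fresh random number, then compute x^e mod n."""
    nonce = secrets.token_bytes(NONCE_BYTES)
    # The leading 0x01 byte keeps leading zero bytes of the nonce from being lost.
    block = b"\x01" + nonce + identity.encode("utf-8")
    x = int.from_bytes(block, "big")
    if x >= N:
        raise ValueError("identity data too long for a single RSA block")
    return pow(x, E, N)


def reidentify(ciphertext: int) -> str:
    """Compute c^d mod n and strip the marker byte and the random number."""
    x = pow(ciphertext, D, N)
    block = x.to_bytes((x.bit_length() + 7) // 8, "big")
    return block[1 + NONCE_BYTES:].decode("utf-8")


record = "Meier, Anna, 14.3.1950"
c1 = encrypt_identity(record)
c2 = encrypt_identity(record)
print(c1 != c2)                   # trial encryption of a guessed record will not reproduce c1
print(reidentify(c1) == record)   # the holder of d still recovers the identity
```

Because the two ciphertexts of the same record differ, knowing the encryption key no longer allows an attacker to confirm a guessed identity by re-encrypting it.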
`
5.2. Key Management

The keys have to be generated in a secure manner under special organizational precautions, e.g., in the supervising office. The encryption key is kept in the trusted office. It does not necessarily have to be kept secret because the encryption is randomized (see section 5.1). Therefore, there is no need for a cryptographic token, like a smart card, to hold this key. But a smart card is desirable as access-control token. It could then also hold the key. On the other hand, the 'need to know' principle says that it is better to keep the key secret.

There are two cases where a change of the encryption and re-identification keys becomes necessary:
- The current keys are compromised; at least there is suspicion that an unauthorized person has got the keys.
- The progress of cryptanalysis or the performance of hardware has advanced to such an extent that the chosen key length can no longer be assumed to be sufficient.

In these cases a new, more secure pair of encryption and re-identification keys has to be generated and used. This could be done by decrypting and then re-encrypting all the stored records in the trusted office. However, the German BSI ('Bundesamt für Sicherheit in der Informationstechnik', Federal Office for Security in Information Technology) proposed a more efficient method: define the new encryption method to be the composition of the old one and the "over-encryption" with the new key, thereby avoiding even a temporary exposure of the plain-text data; the future decryption key is the composition of the old and the new keys. Over-encryption of the old records can be done in the registration office under special security precautions. An analogous procedure also applies in case the chosen encryption method is invalidated by new research results.

An alternative method to handle key changes without temporarily generating plain text was proposed by Miller [21]. It eliminates the need of superimposing the old and new encryption procedures and keeping the old key. On the other hand, it works only with a slightly restricted version of the RSA algorithm.
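The over-encryption idea can be sketched with two toy RSA key pairs. This is an illustration of the composition only, not the BSI proposal in detail: the Mersenne-prime key material, the `KeyPair` type and the function names are assumptions, the new modulus is simply chosen larger than the old one so that every old ciphertext fits into one new block, and the randomization of section 5.1 is omitted for brevity.

```python
from typing import NamedTuple


class KeyPair(NamedTuple):
    n: int  # modulus
    e: int  # encryption key
    d: int  # re-identification key


def make_keypair(p: int, q: int, e: int = 65537) -> KeyPair:
    """Build a textbook RSA key pair from two given primes (toy helper)."""
    return KeyPair(p * q, e, pow(e, -1, (p - 1) * (q - 1)))


# Toy keys (assumption): Mersenne primes, with the new modulus larger than the old one.
OLD = make_keypair(2**89 - 1, 2**107 - 1)
NEW = make_keypair(2**127 - 1, 2**521 - 1)


def over_encrypt(old_ciphertext: int) -> int:
    """Wrap a stored ciphertext with the new encryption key; no plain text appears."""
    if old_ciphertext >= NEW.n:
        raise ValueError("old ciphertext does not fit into a block of the new key")
    return pow(old_ciphertext, NEW.e, NEW.n)


def decrypt_composed(ciphertext: int) -> int:
    """Future decryption: remove the new layer first, then the old one."""
    inner = pow(ciphertext, NEW.d, NEW.n)
    return pow(inner, OLD.d, OLD.n)


plain = 123456789                            # stands for an encoded identity block
stored = pow(plain, OLD.e, OLD.n)            # record as stored under the old key
upgraded = over_encrypt(stored)              # done without any decryption key
print(decrypt_composed(upgraded) == plain)
```

Note that `over_encrypt` needs only the new encryption key, which is why the step can be carried out where the records are stored, whereas decryption now requires both re-identification keys.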
`
5.3. Linkage Data and Anonymous Data Matching

To generate the linkage data we extract the following components from the identity data: name(s), surname(s), phonetic codes, the name code of the former GDR, day and month of birth. Then these components are separately encrypted, in a first step by using a one-way hash function [14], in a subsequent step by using a symmetric encryption algorithm [14] with the 'linkage-data key'; then they are in 'linkage [...]

Fig. 4 Asymmetric encryption of identification data. [Figure not reproduced. It shows an example record whose identity data (Müller-Lüdenscheid, Marie-Luise, Beispielstraße 123, 45678 Musterstadt, 21.7.1966, 28.2.1995) are replaced in the trusted office by a ciphertext such as "3j&kl98abx?b", while the epidemiological data pass unchanged to the registration office.]
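The two-step generation of linkage data described in section 5.3 (one-way hash per component, then a removable symmetric encryption with the linkage-data key) can be sketched as follows. Everything concrete here is an assumption for illustration: SHA-256 as the hash, a toy four-round Feistel construction built from HMAC as a stand-in for a real block cipher such as DES or IDEA, the normalisation rule, and all names. Phonetic codes and the GDR name code are left out.

```python
import hashlib
import hmac


def pure_hash(component: str) -> bytes:
    """Step 1: normalise one component and apply a one-way hash ('pure hash' format)."""
    return hashlib.sha256(component.strip().upper().encode("utf-8")).digest()


def _round(key: bytes, index: int, half: bytes) -> bytes:
    """Keyed round function of the toy Feistel cipher."""
    return hmac.new(key, bytes([index]) + half, hashlib.sha256).digest()[:16]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_block(key: bytes, block: bytes) -> bytes:
    """Step 2: deterministic, invertible keyed transform of a 32-byte block."""
    left, right = block[:16], block[16:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right


def decrypt_block(key: bytes, block: bytes) -> bytes:
    """Inverse of encrypt_block; recovers the pure hash value."""
    left, right = block[:16], block[16:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right


def linkage_data(components: dict, linkage_key: bytes) -> dict:
    """Hash and encrypt every component separately ('linkage' format)."""
    return {name: encrypt_block(linkage_key, pure_hash(value))
            for name, value in components.items()}


key = b"hypothetical-linkage-data-key"
record = {"surname": "Meier", "name": "Anna", "birth_day_month": "14.03."}
ld = linkage_data(record, key)
print(decrypt_block(key, ld["surname"]) == pure_hash("Meier"))
```

Encrypting the components separately is what allows partial agreement between two notifications to be detected, and the invertible second step is what lets the key holder return to the pure hash values, as the inter-registry exchange in section 5.4 requires.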
`
`
`
`
`
From left to right the security increases: The clear-text format shows the full information; the pure hash format allows trial encryption and record linkage; the linkage format allows record linkage only; and the storage format gives complete anonymity.

For record linkage the registration office compares the linkage data and other unencrypted identifying data of a new case with all the stored records. In case of small differences, if there is reasonable evidence of a match, the case is reported back to the trusted office that tries to clarify the case. In very few exceptional cases this procedure could necessitate a re-identification as in section 3.4.
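The comparison step can be sketched as a component-wise test on the encrypted linkage data. The decision rule below (full agreement is a match; at most one disagreeing component with at least two agreeing ones is a possible match to be reported back) is an illustrative assumption, not the rule used in the pilot study, and the component names and values are made up.

```python
def compare(new_case: dict, stored: dict) -> str:
    """Classify a pair of linkage-data records by component-wise equality."""
    shared = [name for name in new_case if name in stored]
    agreeing = sum(1 for name in shared if new_case[name] == stored[name])
    if shared and agreeing == len(shared):
        return "match"
    if agreeing >= 2 and agreeing >= len(shared) - 1:
        return "possible match"  # reported back to the trusted office
    return "no match"


def link(new_case: dict, registry: list) -> list:
    """Compare a new case with all stored records; keep everything but non-matches."""
    results = []
    for index, stored in enumerate(registry):
        outcome = compare(new_case, stored)
        if outcome != "no match":
            results.append((index, outcome))
    return results


# Opaque strings stand for encrypted components in linkage format.
registry = [
    {"surname": "a91f", "name": "07c2", "birth": "e5d0"},
    {"surname": "a91f", "name": "33bb", "birth": "e5d0"},
    {"surname": "5c10", "name": "9e44", "birth": "1a2b"},
]
new_case = {"surname": "a91f", "name": "07c2", "birth": "e5d0"}
print(link(new_case, registry))
```

Since only equality of encrypted components is tested, the registration office never needs the clear-text names to carry out this step.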
`
5.4. Inter-registry Matching

From time to time, e.g., once per year, the collaborating registries are allowed to link their records in order to detect common notifications, e.g., caused by change of residence, or notifications by a treating physician and a hospital in the hinterland of the other registry.

For this purpose two registries A and B agree upon a temporary one-time 'exchange' key. Registration office A transfers a file with the linkage data to its trusted office which removes the encryption, getting the 'pure' hash values, and encrypts these with the exchange key. Then it sends them to the trusted office of registry B, which removes the exchange encryption and does the usual linkage-data encryption for i