`a National Address File
`
`Peter Christen1 ?, Tim Churches2, and Alan Willmore2
`
`1 Department of Computer Science, Australian National University,
`Canberra ACT 0200, Australia, peter.christen@anu.edu.au
`2 Centre for Epidemiology and Research, New South Wales Department of Health,
`Locked Mail Bag 961, North Sydney NSW 2059, Australia,
`ftchur,awillg@doh.health.nsw.gov.au
`
`Abstract. It is estimated that between 80% and 90% of governmental
`and business data collections contain address information. Geocoding {
`the process of assigning geographic coordinates to addresses { is becom-
`ing increasingly important in many application areas that involve the
`analysis and mining of such data. In many cases, address records are
`captured and/or stored in a free-form or inconsistent manner. This fact
`complicates the task of robustly matching such addresses to spatially-
`annotated reference data. In this paper we describe a geocoding system
`that is based on a comprehensive high-quality geocoded national address
`database. It uses a learning address parser based on hidden Markov mod-
`els to separate free-form addresses into components, and a rule-based
`matching engine to determine the best set of candidate matches to a
`reference (cid:12)le. The geocoding software modules are implemented (as part
`of the Febrl open source data linkage system) in the object-oriented lan-
`guage Python, which allows rapid prototype development and testing.
`
`Keywords: Data mining preprocessing, geocoding, spatial data analysis,
`data linkage, data cleaning, indexing, G-NAF, hidden Markov model.
`
`1 Geocoding
`
`Increasingly, many data mining and data analysis projects need information
`from multiple data sources to be integrated, matched, combined or linked in
`order to enrich the available data and to allow more detailed analysis. The
`aim of such linkages is to merge all records relating to the same entity, such
`as a patient, customer or business. Most of the time the linkage (or matching)
`process is challenged by the lack of a common unique entity identi(cid:12)er, and thus
`becomes non-trivial [3, 8, 15]. In such cases, the available partially identifying
`information { like names, addresses, and dates of birth { is used to decide if
`two (or more) records correspond to the same entity. This process is compute
`intensive, and linking todays large data collections becomes increasingly di(cid:14)cult
`using traditional linkage techniques.
`
`? Corresponding author
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 1 of 13
`
`
`
`A special case of linkage is geocoding, the matching of a data source with
`geocoded reference data (which is made of cleaned and standardised records
`containing address information plus their geographical location). The US Federal
`Geographic Data Committee estimates that geographic location is a key feature
`in 80% to 90% of governmental data collections [14]. In many cases, addresses
`are the key to spatially enable data. The aim of geocoding is to generate a
`geographical location (longitude and latitude) from street address information
`in the data source. Once geocoded, the data can be used for further processing,
`in spatial data mining [6] projects, and it can be visualised and combined with
`other data using Geographical Information Systems (GIS).
`The applications of spatial data analysis and mining are widespread. In the
`health sector, for example, geocoded data can be used to (cid:12)nd local clusters of
`disease. Environmental health studies often rely on GIS and geocoding software
`to map areas of potential exposure and to locate where people live in relation to
`these areas. Geocoded data can also help in the planning of new health resources,
`e.g. additional health care providers can be allocated close to where there is an
`increased need for services. An overview of geographical health issues is given
`in [1]. When combined with a street navigation system, accurate geocoded data
`can assist emergency services (cid:12)nd the location of a reported emergency (for
`example, if a caller reports an incomplete or unclear address).
`Geocoded customer data, combined with additional demographic data, can
`help businesses to better plan marketing and future expansion, and the analysis
`of historical geocoded data, for example, can show changes in their customer
`base. Within census, geocoding can be used to assign people or households to
`small area units, for example census collection districts, which are then the basis
`of further statistical analysis.
`There are two basic scenarios for geocoding user data. In the (cid:12)rst, a user
`wants to automatically geocode a data set. The geocoding system should (cid:12)nd
`the best possible match for each record in the user data set without human in-
`tervention. Each record needs to be attributed with the corresponding location
`plus a match status which indicates the accuracy of the match obtained (for
`example an exact address match, or a street level match, or a postcode level
`match). This scenario might become problematic if the user data is not of high
`quality, and contains records with missing, incorrect or out-of-date address in-
`formation. Typographical errors are common with addresses, especially when
`they are recorded over the telephone or from hand-written forms. As reported
`in [11], a match rate of 70% successfully geocoded records is often considered
`an acceptable result. In the second scenario a user wants to geocode a single
`address that may be incomplete, erroneous or unformatted. The system should
`return the location if an exact match can be found, or alternatively a list of
`possible matches, together with a matching status and a likelihood rating. This
`geocoding of a single record should be done in (near) real time (i.e. less than a
`couple of seconds response time) and be available via a suitable user interface
`(e.g. a Web site).
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 2 of 13
`
`
`
`.
`1
`
`. 3
`
`.2
`
`.
`
`5
`
`. 4
`
` .
`10
` .
`8
`
`.
`
`6
`
`12
` .
`
` .
`13
` .
`11
`
`. 7
`
` .
`9
`
`Fig. 1. Example geocoding using property parcel centres (numbers 1 to 7) and street
`reference (cid:12)le centreline (dashed line and numbers 8 to 13, with the dotted lines corre-
`sponding to a global street o(cid:11)set).
`
`Standard data (or record) linkage techniques [3, 8, 15], where the aim is to link
`(or match) together all records belonging to the same entity, normally classify
`compared record pairs into one of the three classes links, non-links and possi-
`ble links, with the latter class containing those record pairs for which human
`oversight, also known as clerical review, is needed to decide their (cid:12)nal linkage
`status. Often no additional information is available so the clerical review process
`becomes one of applying human intuition, experience or common sense to the
`decision based on available data. This is similar to the second geocoding scenario
`described above, where the user is presented with a selection of possible matches
`(sorted according to their matching status and likelihood rating).
`Many GIS software packages provide for street level geocoding. As a recent
`study shows [2], substantial di(cid:11)erences in positional error exist between addresses
`which are geocoded using street reference (cid:12)les (containing geographic centreline
`coordinates, street numbers and names, and postcodes) and the corresponding
`true locations. The use of point property parcel coordinates (i.e. the centres or
`centroids of properties), derived from cadastral data, is expected to signi(cid:12)cantly
`reduce these positional errors. Figure 1 gives an illustrative example. Even small
`discrepancies in geocoding can result in addresses being assigned to, for example,
`di(cid:11)erent census collection districts, which can have huge implications when doing
`small area analysis. A comprehensive property based database is now available
`for Australia: the Geocoded National Address File (G-NAF). It is presented in
`details in Section 1.1.
`We give an overview of our geocoding system in Section 2. The two central
`technical issues for a geocoding system are (1) the accurate and e(cid:14)cient matching
`of user input addresses with the address information stored in the geocoded
`reference data, and (2) the e(cid:14)cient retrieval of the address location (longitude
`and latitude) of the matched geocoded records. In order to achieve accurate
`match results, addresses both in the user data set and the geocoded reference
`data need to be cleaned and standardised in the same way. We cover this issue
`in more details in Section 2.1. Address locations can e(cid:14)ciently be retrieved from
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 3 of 13
`
`
`
`1
`
`n
`
`1
`
`1
`
`n
`
`1
`
`n
`
`n
`
`Locality Alias
`
`Adress Alias
`
`Adress Site
`Geocode
`
`Locality
`Geocode
`
`1 1
`
`Locality
`
`1 n
`
`Adress Detail
`
`n 1
`
`Adress Site
`
`Street Locality
`Alias
`
`1n
`
`1
`
`n
`
`Street Locality
`Geocode
`
`n
`
`1n
`
`Street
`
`1
`
`Fig. 2. Simpli(cid:12)ed G-NAF data model (10 main (cid:12)les only). Links 1{n denote one-to-
`many, and links 1{1 denote one-to-one relationships.
`
`the geocoded reference data by converting the traditional database tables (or
`(cid:12)les) into inverted indexes, as presented in Section 2.2. The geocode matching
`engine is the topic of Section 3, with some initial experimental results presented
`in Section 4, and conclusions and an outlook to future work is given in Section 5.
`
`1.1 G-NAF { A Geocoded National Address File
`
`In many countries geographical data is collected by various state and territory
`agencies. In Australia, for example, each state and territory have their own gov-
`ernmental agency that collect data to be used for land planning, as well as
`property, infrastructure or resource management. Additionally, national organ-
`isations like post and telecommunications, electoral rolls and statistics bureaus
`collect their own data. All these data sets are collected for speci(cid:12)c purposes,
`have varying content and are stored in di(cid:11)erent formats.
`The need for a nation-wide, standardised and high-quality geocoded data set
`has been recognised in Australia since 1990 [11], and after years of planning,
`collaborations and development the G-NAF was (cid:12)rst released in March 2004.
`Approximately 32 million address records from 13 organisations were used in a
`(cid:12)ve-phase cleaning and integration process, resulting in a database consisting of
`22 normalised (cid:12)les (or tables). Figure 2 shows the simpli(cid:12)ed data model of the
`10 main G-NAF (cid:12)les.
`G-NAF is based on a hierarchical model, which stores information about ad-
`dress sites separately from locations and streets. It is possible to have multiple
`geocoded locations for a single address, and vice versa, and aliases are available
`at various levels. Three geocode (cid:12)les contain location (longitude and latitude) in-
`formation for di(cid:11)erent levels. If an exact address match can be found, its location
`can be retrieved from the ADDRESS SITE GEOCODE (cid:12)le. If there is only a match
`on street level (but not street number), the STREET LOCALITY GEOCODE (cid:12)le will
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 4 of 13
`
`
`
`Table 1. Characteristics of the 10 main G-NAF (cid:12)les (NSW data only).
`
`G-NAF data (cid:12)le
`
`Numbers of records
`and attributes
`
`Keys (persistent
`identi(cid:12)ers)
`
`ADDRESS ALIAS
`
`289,788 / 6
`
`ADDRESS DETAIL
`
`4,145,365 / 28
`
`ADDRESS SITE
`ADDRESS SITE GEOCODE
`LOCALITY
`LOCALITY ALIAS
`
`LOCALITY GEOCODE
`STREET
`STREET LOCALITY ALIAS
`
`4,096,507 / 6
`3,336,778 / 12
`5,017 / 7
`700 / 5
`
`4,978 / 11
`58,083 / 6
`5,584 / 6
`
`STREET LOCALITY GEOCODE
`
`128,609 / 13
`
`PRINCIPAL PID
`ALIAS PID
`GNAF PID
`LOCALITY PID
`STREET PID
`ADDRESS SITE PID
`ADDRESS SITE PID
`ADDRESS SITE PID
`LOCALITY PID
`LOCALITY PID
`ALIAS PID
`LOCALITY PID
`STREET PID
`STREET PID
`LOCALITY PID
`STREET PID
`LOCALITY PID
`
`provide an overall street geocode. Finally, if no street level match can be found
`the LOCALITY GEOCODE (cid:12)le contains geocode information for localities (e.g. towns
`and suburbs). Both the STREET LOCALITY GEOCODE and LOCALITY GEOCODE (cid:12)les
`also contain information about the extent of streets and localities.
`For our project we only used the G-NAF records covering the Australian state
`of New South Wales (NSW), containing around 4 million address, 60,000 street
`and 5,000 locality records. Table 1 gives an overview of the size and content of
`the 10 main G-NAF data (cid:12)les used.
`
`2 System Overview
`
`The geocoding system presented in this paper is part of the Febrl (Freely Ex-
`tensible Biomedical Record Linkage) data linkage system [3, 7], that contains
`modules to clean and standardise data sets which can contain names, addresses
`and dates; and link and deduplicate such cleaned data. An overview of the Febrl
`geocoding system is shown in Figure 3. The geocoding process can be split into
`the preprocessing of the G-NAF data (cid:12)les (which is described in detail in Sec-
`tions 2.1 and 2.2), and the matching with user-supplied addresses as presented
`in Section 3.
`The preprocessing step takes the G-NAF data (cid:12)les and uses the Febrl address
`cleaning and standardisation routines to convert the detailed address values
`(like street names, types and su(cid:14)xes, house numbers and su(cid:14)xes, (cid:13)at types
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 5 of 13
`
`
`
`G−NAF data
`files
`
`AustPost data
`GIS data
`
`User data
`file
`
`Geocoding module
`
`Web server module
`
`Febrl clean
`and
`standardise
`
`Web interface
`input data
`
`Febrl clean
`and
`standardise
`
`Build inverted
`indexes
`
`Inverted index
`data files
`
`Febrl geocode
`match engine
`
`Geocoded
`Web data
`
`Process−GNAF module
`
`Geocoded
`user data file
`
`Fig. 3. Overview of the Febrl geocoding system.
`
`and numbers, locality names, postcodes, etc.) into a form which makes them
`consistent with the user data after Febrl standardisation. Note that the G-NAF
`data (cid:12)les already come in a highly standardised form, but the (cid:12)ner details, for
`example how whitespaces within locality names are treated, make the di(cid:11)erence
`between successful or failed matching. The cleaned and standardised reference
`records are then inserted into a number of inverted index data structures.
`Additional data used in the preprocessing step are a postcode-suburb look-
`up table which is publicly available, and which can be used to impute missing
`postcodes or suburb values in the G-NAF locality (cid:12)les; and a table extracted from
`a commercial GIS system containing postcode and suburb boundary information,
`which is used to create neighbouring region look-up tables.
`The geocode matching engine takes as input the inverted indexes and the
`raw user data, which is cleaned and standardised before geocoding is attempted.
`As shown in Figure 3, the user data can either be loaded from a data (cid:12)le,
`geocoded and then stored back into a data (cid:12)le, or it can be passed as one or
`more address(es) to the geocoding system and returned via a Web interface.
`The complete Febrl system, including the geocoding and Web server mod-
`ules, is implemented in the object-oriented open source language Python 1, which
`allows rapid prototype development and testing.
`
`2.1 Probabilistic Address Cleaning and Standardisation
`
`The (cid:12)rst crucial step when processing both the geocoded reference (cid:12)les and the
`user data is the cleaning and standardisation of the data (i.e. addresses) used
`for geocoding. It is commonly accepted that real world data collections contain
`erroneous, incomplete and incorrectly formatted information. Data cleaning and
`standardisation are important preprocessing steps for successful data linkage and
`before including such data in a data warehouse for further analysis [13]. Data may
`be recorded or captured in various, possibly obsolete, formats and data items may
`
`1 http://www.python.org
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 6 of 13
`
`
`
`be missing, out-of-date, or contain errors. The cleaning and standardisation of
`addresses is especially important for data linkage and geocoding so that accurate
`matching results can be achieved.
`The main task of cleaning and standardising addresses is the conversion of
`the raw input data into well de(cid:12)ned, consistent forms and the resolution of
`inconsistencies in the way address values are represented or encoded. Rule-based
`data cleaning and standardisation is currently used by many commercial systems
`and is cumbersome to set up and maintain, and often needs adjustments for
`new data sets. We have recently developed (and implemented within Febrl ) new
`probabilistic techniques [4] based on hidden Markov models (HMMs) [12] which
`achieved better address standardisation accuracy and are easier to set-up and
`maintain compared to popular commercial linkage software.
`A HMM is a probabilistic (cid:12)nite-state machine consisting of a set of obser-
`vation or output symbols, a (cid:12)nite set of discrete, hidden (unobserved) states, a
`matrix of transition probabilities between those hidden states, and a matrix of
`probabilities with which each hidden state emits an observation symbol [12] (this
`emission matrix is also called an observation matrix ). In our case, the hidden
`states of the HMM correspond to the output (cid:12)elds of the standardised addresses.
`The Febrl approach to address cleaning and standardisation consist of the
`following three steps.
`
`1. The user input addresses are cleaned. This involves converting all letters to
`lower-case, removing certain characters (like punctuations), and converting
`various sub-strings into their canonical form, for example ’c/-’, ’c/o’ and
`’c.of ’ would all be replaced with ’care of ’. These replacements are based on
`user-speci(cid:12)ed and domain speci(cid:12)c substitution tables. Note that these sub-
`stitution tables can also contain common misspellings for street and locality
`names, for example, and thus help to increase the matching quality.
`2. The cleaned input strings are split into a list of words, numbers and charac-
`ters, using whitespace marks as delimiters. Look-up tables and some hard-
`coded rules are then used to assign one or more tags to the elements in this
`list. These tags will be the observation symbols in the HMM used in the next
`step.
`3. The list of tags is given to a HMM, and assuming that each tag (obser-
`vation symbol) has been emitted by one of the hidden states, the Viterbi
`algorithm [12] will (cid:12)nd the most likely path through the HMM, and the cor-
`responding hidden states will give the assignment of the elements from the
`input list to the output (cid:12)elds.
`
`Consider for example the address ’73 Miller St, NORTH SYDENY 2060’, which
`will be cleaned (SYDENY corrected to sydney), split into a list of words and
`numbers, and tagged in steps one and two. The resulting lists of words/numbers
`and tags looks as follows.
`
`[’73’, ’miller’, ’street’, ’north sydney’, ’2060’]
`[’NU’, ’UN’,
`’WT’,
`’LN’
`, ’PC’
`]
`
`with ’NU’ being the tag for numbers, ’UN’ the tag for unknown words (not found
`in any look-up table or covered by any rule), ’WT’ the tag for a word found in
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 7 of 13
`
`
`
`the wayfare (street) type look-up table, ’LN’ the tag for a sequence of words
`found to be a locality name, and ’PC’ the tag for a known postcode.
`In the third step the tag list is given to a HMM (which has previously been
`trained using similar address training data), and the Viterbi algorithm will re-
`turn the most likely path through the HMM which will correspond to the fol-
`lowing sequence of output (cid:12)elds.
`
`’street number’: ’73’
`’street name’: ’miller’
`’street type’: ’street’
`’locality name’: ’north sydney’
`’postcode’: ’2060’
`
`Details about how to e(cid:14)ciently train the HMMs for address (as well as name)
`standardisation, and experiments with real-world data are given in [4]. Training
`of the HMMs is quick and does not require any specialised skills. For addresses,
`our HMM approach produced equal or better standardisation accuracies than a
`widely-used rule-based system.
`
`2.2 Processing the G-NAF Files
`
`Processing the G-NAF data (cid:12)les consists of two steps, the (cid:12)rst being the cleaning
`and standardisation as described above, and the second being the building of
`inverted indexes. Such an inverted index is a keyed hash-table in which the
`keys are the values from the cleaned G-NAF data (cid:12)les, and the entries in the
`hash-table are sets with the corresponding PIDs (persistent identi(cid:12)ers) of the
`values. For example, assume there are four records in the LOCALITY (cid:12)le with the
`following content (the (cid:12)rst line is a header-line with the attribute names).
`
`locality_pid,
`60310919,
`60709845,
`60309156,
`61560124,
`
`locality_name,
`sydney,
`north_sydney,
`north_sydney,
`the_rocks,
`
`state_abbrev,
`nsw,
`nsw,
`nsw,
`nsw,
`
`postcode
`2000
`2059
`2060
`2000
`
`The inverted indexes for the three attributes locality name, state abbrev and
`postcode then are (square brackets denote lists and round brackets denote sets):
`
`locality_name_index = [’north_sydney’:(60709845,60309156),
`’sydney’:(60310919),
`’the_rocks’:(61560124)]
`
`state_abbrev_index = [’nsw’:(60310919,60709845,60309156,61560124)]
`
`postcode_index = [’2000’:(60310919,61560124),
`’2059’:(60709845),
`’2060’:(60309156)]
`
`The matching engine then (cid:12)nds intersections of the inverted index sets for the
`values in a given record. For example, a postcode value ’2000’ would result in
`a set of PIDs (60310919,61560124), and when intersected with the PIDs for
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 8 of 13
`
`
`
`Table 2. G-NAF attributes used for geocode matching.
`
`G-NAF data (cid:12)le
`
`Attributes used
`
`ADDRESS DETAIL
`
`LOCALITY ALIAS
`LOCALITY
`STREET
`STREET LOCALITY ALIAS
`
`(cid:13)at number pre(cid:12)x, (cid:13)at number, (cid:13)at number su(cid:14)x,
`(cid:13)at type, level number, level type, building name,
`location description, number (cid:12)rst pre(cid:12)x,
`number (cid:12)rst, number (cid:12)rst su(cid:14)x, number last pre(cid:12)x,
`number last, number last su(cid:14)x, lot number pre(cid:12)x,
`lot number, lot number su(cid:14)x
`locality name, postcode, state abbrev
`locality name, postcode, state abbrev
`street name, street type, street su(cid:14)x
`street name, street type, street su(cid:14)x
`
`locality name value ’the rocks’, would result in the single PID set (61560124)
`which corresponds to the original record. The location of this PID can then
`be look-up in the corresponding G-NAF geocode index. Table 2 shows the 23
`attributes for which inverted indexes are built.
`
`2.3 Additional Data Files
`
`Additional information is used in the Febrl geocoding system during the prepro-
`cessing step to verify and correct (if possible) postcode and locality name values,
`and in the matching engine to enable searching for matches in neighbouring re-
`gions (postcodes and suburbs) if no exact match can be found.
`Australia Post publishes a look-up table containing postcode and suburb in-
`formation2, which can be used when processing the G-NAF locality (cid:12)les to verify
`and even correct wrong or missing postcodes and suburb names. For example,
`if a postcode is missing in a record, the Australia Post look-up table can be
`used to (cid:12)nd the o(cid:14)cial postcode(s) of the suburb in this record, and if this is
`a unique postcode it can be safely imputed into the record. Similarly, missing
`suburb names can be imputed if they correspond to a unique postcode.
`Other look-up tables are used to (cid:12)nd neighbouring regions for postcodes and
`suburbs, i.e. for a given region these tables contain all its neighbours. These
`look-up tables are created using geographical data extracted from a commercial
`GIS system, and integrated into the Febrl geocode matching engine.
`Look-up tables of both direct and indirect neighbours (i.e. neighbours of di-
`rect neighbours) are used in the geocode matching engine to (cid:12)nd matches in
`addresses where no exact postcode or suburb match can be found. Experience
`shows that people often record di(cid:11)erent postcode or suburb values if a neigh-
`bouring postcode or suburb has a higher perceived social status (e.g. ’Double
`Bay’ and ’Edgecli(cid:11) ’ ), or if they live close to the border of such regions.
`
`2 http://www.auspost.com.au/postcodes/
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 9 of 13
`
`
`
`3 Geocode Matching Engine
`
`Febrl ’s geocode matching engine is based on the G-NAF inverted index data,
`and takes a rule-based approach to (cid:12)nd an exact match or alternatively one or
`more approximate matches. Its input is a cleaned and standardised user record.
`The matching engine tries to (cid:12)nd an exact match (cid:12)rst, but if none can be
`found it extends its search to neighbouring postcode and suburb regions. First
`direct neighbouring regions (level 1) are searched, then direct and indirect neigh-
`bouring regions (level 2), until either an exact match or a set of approximate
`matches can been found. In the latter case, either a weighted average location
`over all the found matches is returned, or a ranked (according to a likelihood
`rating) list of possible matches. The following steps explain in more detail (but
`still on a high conceptual level) how the matching engine works.
`
`1. Find the set of address level matches (using street number and su(cid:14)x) and
`the set of street level matches (using street name and type).
`2. Find common matches between street and address levels (using set intersec-
`tion).
`3. Set the neighbour search level to 0 (no neighbouring regions are searched).
`4. Find the locality match set (using locality name, quali(cid:12)er and postcode)
`according to the current value of the neighbour search level. Postcode infor-
`mation is only used if no other locality information is available.
`5. Find common matches between locality and address level, and between lo-
`cality and street level (using set intersections).
`6. If no matches between locality and address, and locality and street were
`found, increase the neighbour level (up to a maximum of 2) and jump back
`to step 4.
`7. If matching records have been found, try to re(cid:12)ne the match set using the
`postcode value (only if the postcode has not been used for the locality
`matches in step 4), as well as unit, (cid:13)at and building (or property) infor-
`mation (if such information is available in the record).
`8. If matches between street and address, or locality and address have been
`found, get their coordinates from the address geocode index. If only one
`match has been found, or if all found matches have the same location (this
`might be due to several G-NAF records corresponding to the same building)
`return the found location (longitude and latitude) together with an ’exact
`address match’ status. If more than one match with di(cid:11)erent locations
`have been found then calculate the average location and return it together
`with an ’average address match’ status.
`9. If no address level match has been found use the street level match set. If
`only one match has been found or if all matches have the same location
`return the found location together with an ’exact street match’ status.
`If several street matches with di(cid:11)erent location were found return a ’many
`street match’ status and the list of found PIDs.
`10. If no street level match has been found use the locality level match set. If only
`one match has been found or if all matches have the same location return the
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 10 of 13
`
`
`
`Table 3. Matching results for geocoding 10,000 free-form LPI address records.
`
`Match status
`
`Number of records
`
`Percentage
`
`Exact address level match
`Average address level match
`Exact street level match
`Many street level match
`Exact locality level match
`Many locality level match
`No match
`
`7,288
`213
`1,290
`154
`917
`135
`3
`
`72.87 %
`2.13 %
`12.90 %
`1.54 %
`9.17 %
`1.35 %
`0.03 %
`
`found location together with an ’exact locality match’ status. If several
`locality matches with di(cid:11)erent location were found return a ’many locality
`match’ status and the list of found PIDs.
`11. If no match was found return a ’no match’ status.
`
`Geocoding of multiple addresses is an iterative process where each record is (cid:12)rst
`cleaned and standardised, then geocoded and written into an output data set
`with coordinates and a match status added to each record.
`
`4 Experimental Results
`
`We have run experiments with geocoding various data sets. In this section we
`present initial results of geocoding a NSW Land and Property Information data
`set containing 10,000 randomly selected free-form addresses (from a data set
`containing around 2.7 million records). Table 3 shows the matching results. A
`total of 94.94% exact matches could be found at di(cid:11)erent levels. A closer analysis
`of the results showed that for 456 records no exact address match was found due
`to missing coordinates in the ADDRESS SITE GEOCODE (cid:12)le (i.e. our G-NAF data
`set did not have coordinates for these addresses). With better quality of future
`G-NAF releases we can therefore expect improved matching qualities.
`Using a SUN Enterprise 450 shared memory (SMP) server with four 480
`MHz Ultra-SPARC II processors and 4 Giga Bytes of main memory, it took
`23 minutes and 50 seconds to geocode the 10,000 address records, which is an
`average of 143 milli-seconds per record.
`
`5 Conclusions and Future Work
`
`In this paper we have described a geocoding system based on a geocoded national
`address (cid:12)le. We are currently evaluating and improving this system using raw
`uncleaned addresses taken from various administrative health related data sets.
`We are also planning to compare the accuracy of our geocoding system with
`commercial street level based GIS systems, and similar to [2] we expect more
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 11 of 13
`
`
`
`accurate results. We are also fully integrating our geocoding system into the
`Febrl data linkage system [3, 7] and will publish it under an open source software
`license later this year.
`Our main future e(cid:11)orts will be directed towards the re(cid:12)nement of the geocode
`matching engine to achieve more accurate matching results, as well as improving
`the performance of the matching engine (i.e. reducing the time needed to match
`a record). Three other areas of future work include:
`
`{ The Febrl standardisation routines currently return (cid:12)elds (or attributes)
`which are di(cid:11)erent from the ones available in G-NAF. This makes it necessary
`to map Febrl (cid:12)elds to G-NAF (cid:12)elds within the geocode matching engine.
`Better would be if the Febrl standardisation returns the same (cid:12)elds as the
`ones available in G-NAF, resulting in explicit (cid:12)eld by (cid:12)eld comparisons. We
`are planning to modify the necessary Febrl standardisation routines.
`{ Currently both the G-NAF preprocessing and indexing, as well as the geocode
`matching engine work in a sequential fashion only. Due to the large data (cid:12)les
`involved parallel processing becomes desirable. In the preprocessing step, the
`G-NAF data (cid:12)les can be processed independently or in a blocking fashion
`distributed over a number of processors, with only the (cid:12)nal inverted indexes
`that need to be merged. Geocoding of a large user data (cid:12)le can easily be
`done in parallel as the cleaning, standardisation and matching of each record
`is independent from all others. An additional advantage of parallelisation is
`the increased amount of main memory available on many parallel platforms.
`We are planning to explore such parallelisation techniques and implement
`them into the Febrl system to allow faster geocoding of larger data sets. Ad-
`ditional performance improvements can be achieved by pro(cid:12)ling and then
`replacing the core computational routines in the matching engine with C or
`C++ code.
`{ Geocoding uses identifying information (i.e. addresses) which raises privacy
`and con(cid:12)dentiality issues. Organisations that collect sensitive health data
`(e.g. cancer registries) cannot send their data to a geocoding service as this
`results in the loss of privacy for individuals involved. Methods are desirable
`which allow for privacy preserving geocoding of addresses. We aim to develop
`such methods based on techniques recently developed for blindfolded data
`linkage [5, 9, 10].
`
`Acknowledgements
`
`This work is funded by the NSW Department of Health, Centre for Epidemiology
`and Research. The authors would like to thank David Horgan (student at the
`University of Queensland) who worked on a (cid:12)rst version of this system while he
`was a summer student at the ANU. The authors also wish to thank PSMA for
`providing the G-NAF data (cid:12)les.
`
`Google, Exhibit 1016
`IPR2022-00742
`Page 12 of 13
`
`
`
`References
`
`1. Boulos, M.N.K.: Towards evidence-based, GIS-