Researching Quality Issues on the National Geographic Database at Statistics Canada
By Glen Hohlmann and Robert Parenteau

Introduction
Over the past decade, numerous public and private organizations throughout the world have begun creating large regional, national, and international spatial data infrastructures. Street network files containing accurate road attributes such as name, type, and address ranges are a major component of such infrastructure projects. These files have become increasingly important because they support a growing number of GIS applications, including site selection, address geocoding, emergency services, and other mapping activities.
      Statistics Canada (STC) and Elections Canada (EC) have jointly produced a national street network database to support their respective census, electoral, and digital cartographic needs. This represents a major commitment of time, money and resources on the part of both organizations. The National Geographic Database (NGD) was designed to be centrally maintained and shared by both organizations.
      The importance of the NGD led to early recognition of the benefits of designing and implementing a high-quality issue-tracking system that would be separate from the standard quality control of production. To date, a number of different issues have been logged by the system. Many of these were dealt with immediately, including eliminating sliver polygons that were created by integrating data from different sources. However, there have been instances where the issues were much larger in scope, and more research was necessary before corrective measures could be employed. This article briefly discusses some of the current research being done at Statistics Canada to address quality issues that concern the NGD.

Vertical Integration
Built from a variety of sources, the NGD represents a major data integration project. As a result, there have been numerous quality challenges. One such challenge has been the integration of EC and STC road data with thematic layers taken from NRCan's National Topographic Database (NTDB) and Digital Chart of the World (DCW). In some cases, differences in scale and registration, together with shifts in geometry between the various data sources, mean that features such as railroads and bodies of water are not in their correct position relative to roads. For example, in Figure 1(a), both roads and hydrology are mapped at a scale of 1:50,000 and lie in the correct position relative to each other. However, in Figure 1(b), the same roads from a 1:50,000 source are combined with hydrographic data from a 1:250,000 source. As a result of generalization, displacement rules, and positional accuracy related to scale, the hydrographic feature has been mapped as a linear feature (water course) instead of as an area feature (water body), and has shifted from the west side to the east side of the road. From this point forward, such anomalies are referred to as "vertical integration errors."
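The side-of-road shift described for Figure 1(b) can be illustrated with a simple geometric test. The following is only a minimal sketch, not the procedure used on the NGD: it assumes planar coordinates, and the function names `side_of_polyline` and `side_flipped` are hypothetical. It reports which side of a road polyline a feature point falls on, so that the same feature taken from two sources can be compared.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def side_of_polyline(line: List[Point], p: Point) -> int:
    """Return +1 if p lies left of the nearest segment of `line`
    (relative to the digitizing direction), -1 if right, 0 if on it."""
    best_d2 = float("inf")
    best_cross = 0.0
    for (ax, ay), (bx, by) in zip(line, line[1:]):
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        if seg_len2 == 0:
            continue  # skip zero-length segments
        # Project p onto the segment, clamping to its end points.
        t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / seg_len2
        t = max(0.0, min(1.0, t))
        qx, qy = ax + t * dx, ay + t * dy
        d2 = (p[0] - qx) ** 2 + (p[1] - qy) ** 2
        if d2 < best_d2:
            best_d2 = d2
            # Sign of the 2D cross product gives the side of the segment.
            best_cross = dx * (p[1] - ay) - dy * (p[0] - ax)
    return (best_cross > 0) - (best_cross < 0)


def side_flipped(road: List[Point], feature_a: Point, feature_b: Point) -> bool:
    """True if two versions of a feature fall on opposite sides of the road."""
    sa = side_of_polyline(road, feature_a)
    sb = side_of_polyline(road, feature_b)
    return sa != 0 and sb != 0 and sa != sb


# A road digitized south to north; the water feature sits west of it in one
# source and east of it in the other.
road = [(0.0, 0.0), (0.0, 10.0)]
print(side_flipped(road, (-1.0, 5.0), (1.0, 5.0)))
```

A test of this kind only flags candidates; whether a flip is a genuine vertical integration error would still need to be confirmed against the source maps.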
      Because features such as railroads and water courses are often used by enumerators in the field as references for orientation, it is important for them to lie in the correct position relative to the road network. Vertical integration errors are especially significant when census boundary limits are meant to follow such features as shorelines, rivers, or railroad lines. A study was therefore undertaken to determine the extent of vertical integration errors on the NGD.
      This study found that approximately 15 percent of the areas sampled contained vertical integration errors. Most of these were the result of the integration of EC/STC roads relative to railroads. This led to a decision to verify the entire NGD and repair all road/railroad vertical integration errors (Figures 2 and 3). This has recently been completed, and work has now begun on correcting road/water-body vertical integration errors as well.

Testing for Absolute Positional Accuracy
Statistics Canada's mission to enumerate and profile Canada's people and its institutions does not require a high level of positional accuracy in geographic products. Therefore, maps are primarily designed to show the relative position of elements. However, external clients who wish to integrate census boundary data with their own digital files have raised concerns regarding the positional accuracy of products obtained from Statistics Canada.
      During the compilation of the NGD, an attempt was made to geometrically adjust all roads to information from the NTDB (50K and 250K) or DCW, which were used for reference. It is therefore expected that these geometrically matched arcs will have positional accuracies similar to the corresponding reference data used during construction of the database. It should be noted that reference sources were selected on a tile-by-tile basis and were dependent upon a variety of factors such as population size, geographical location (whether urban or rural), and the availability of NTDB/DCW data in EC/STC holdings. The positional accuracy of arcs that could not be matched because they were not present on the reference data is, however, unknown. Many of these arcs were digitized on-screen from paper maps provided by staff out in the field. Although these roads are highly valuable and accurate in both attribute information and position relative to other features, their absolute positional accuracy is questionable.
      An investigation of the absolute positional accuracy of roads on the NGD is underway. This information will provide Statistics Canada and Elections Canada with a clearer understanding of the positional accuracy of this product, which will be extremely useful when trying to integrate the NGD with other data sets. This will also give users a better sense of the overall limitations of the product, thus allowing them to make decisions with greater confidence.
      An extension (RMSEr2.avx) for ArcView that calculates the spatial accuracy of two data sets using the National Standard for Spatial Data Accuracy (NSSDA) test methodology was used to conduct the accuracy assessment. This extension tests the accuracy of a check theme against a reference theme of higher accuracy by identifying ground control points (GCPs) on the two themes. First, a position on the reference theme is selected; second, the corresponding position on the check theme is selected. The extension records the number of test points, point coordinates, and cumulative statistics including the Root Mean Square (RMS) error and the NSSDA value. The NSSDA value is reported at a confidence level of 95 percent.
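The statistics described above can be sketched in a few lines of Python. This is not the code of the ArcView extension; it is a minimal illustration that assumes planar coordinates in meters, paired points in matching order, and the standard NSSDA horizontal multiplier of 1.7308, which applies when the x and y error components are of similar size. The function name `accuracy_stats` is hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# NSSDA multiplier that converts a radial RMSE to horizontal accuracy at the
# 95 percent confidence level (assumes RMSE_x and RMSE_y are roughly equal).
NSSDA_HORIZONTAL_FACTOR = 1.7308


def accuracy_stats(reference: List[Point], check: List[Point]) -> Dict[str, float]:
    """Compare paired control points from a reference theme and a check theme."""
    if not reference or len(reference) != len(check):
        raise ValueError("reference and check must be non-empty and the same length")
    # Squared radial error for each pair of corresponding points.
    squared = [
        (cx - rx) ** 2 + (cy - ry) ** 2
        for (rx, ry), (cx, cy) in zip(reference, check)
    ]
    n = len(squared)
    rmse_r = math.sqrt(sum(squared) / n)
    return {
        "n": n,
        "mean_difference": sum(math.sqrt(s) for s in squared) / n,
        "rmse_r": rmse_r,
        "nssda_95": NSSDA_HORIZONTAL_FACTOR * rmse_r,
    }


# Two illustrative (made-up) control points: one is off by 5 m, one matches.
stats = accuracy_stats([(0.0, 0.0), (10.0, 0.0)], [(3.0, 4.0), (10.0, 0.0)])
print(stats)
```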
      A pilot study was conducted using a reference data set provided by the Regional Municipality of Halifax, Nova Scotia. Two independent tests were carried out. The first tested the positional accuracy of NGD arcs that were rubber-sheeted to match 50k NTDB data. The second tested the accuracy of road updates to the NGD that could not be rubber-sheeted. A total of 30 well-distributed and easily identifiable test points (GCPs) were selected each time the test was conducted (Figures 4 and 5).
      The results from the accuracy assessments are summarized here in Figure 6. The average positional difference between the Halifax data set and various rubber-sheeted NGD arcs was found to be approximately six meters, with an RMS error of nine meters and an NSSDA value of 16 meters. These figures are similar to values obtained when NTDB 50k data was compared to the Halifax data set. This is a substantial improvement over past Statistics Canada Street Network File (SNF) coverages that were not rubber-sheeted, which were found to have an average positional difference of approximately 25 meters from 50k NTDB data.
      The positional difference is substantial, however, between the Halifax data and NGD roads that were not rubber-sheeted. The average positional difference was found to be almost 200 meters, with an RMS error of 77 meters and an NSSDA value of 133 meters. These positional differences were highly variable, ranging from a low of only five meters to a high of 316 meters. As a result, a large standard deviation was obtained, which underlines the highly unpredictable positional accuracy of these roads.
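The reported NSSDA values can be related to the reported RMS errors with a short worked calculation. This assumes the extension applies the standard NSSDA horizontal multiplier of 1.7308 at the 95 percent confidence level, which the article does not state explicitly.

```latex
\[
\text{NSSDA}_{95\%} = 1.7308 \times \text{RMSE}_r
\]
\[
1.7308 \times 9\ \text{m} \approx 15.6\ \text{m} \approx 16\ \text{m},
\qquad
1.7308 \times 77\ \text{m} \approx 133.3\ \text{m} \approx 133\ \text{m}
\]
```

Under that assumption, both rounded NSSDA values (16 meters and 133 meters) follow directly from the RMS errors given for the two tests.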
      Future testing will be expanded to cover more areas, accomplished by comparing the NGD to data taken from the Canadian Data Alignment Layer (CDAL). Since CDAL will be used as the framework data for the Canadian Geospatial Data Infrastructure (CGDI), knowledge of the accuracy of the NGD relative to this data will be extremely useful.

Address-quality Enhancement
Geocoding is one of the most widespread uses of GIS street network data. At Statistics Canada, street network data is used to geocode the address register, a list of dwelling addresses. This register contains millions of records and forms a crucial part of Statistics Canada's operational activities. The geocoding of such a large data set is often a difficult task. To facilitate the process, it is imperative that street network attributes such as road names and address ranges are maintained in a consistent manner.
      An address-quality enhancement project has begun updating and validating the corner civic addresses on the NGD. This process involves looking at street segments that have possible logic errors, or may be missing street addresses or other information, and then updating them by using municipal maps and the Canada Post browser. This process is further designed to improve the quality of the address range data on the NGD, which in turn will improve the quality of the geocoding of the address register to a blockface or sub-blockface.
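To show why consistent address ranges matter for geocoding to a blockface, the sketch below matches a civic address to a block face by street name, street type, range, and parity. It is an illustration only: the `BlockFace` record, its field names, and the rule that each block face carries a single-parity range are assumptions, not the NGD schema or the Statistics Canada matching logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockFace:
    # Hypothetical record layout: one side of one street segment.
    blockface_id: str
    street_name: str
    street_type: str
    from_addr: int
    to_addr: int


def match_blockface(
    civic_number: int, street_name: str, street_type: str, faces: List[BlockFace]
) -> Optional[str]:
    """Return the id of the first block face whose range contains the address."""
    for face in faces:
        if face.street_name.upper() != street_name.upper():
            continue
        if face.street_type.upper() != street_type.upper():
            continue
        low, high = sorted((face.from_addr, face.to_addr))
        # Assumed rule: odd and even numbers sit on opposite sides of the street.
        same_parity = civic_number % 2 == low % 2
        if low <= civic_number <= high and same_parity:
            return face.blockface_id
    return None


faces = [
    BlockFace("BF1", "MAIN", "ST", 101, 199),
    BlockFace("BF2", "MAIN", "ST", 100, 198),
]
print(match_blockface(150, "Main", "St", faces))
```

A missing or inverted range, or a name spelled differently from one segment to the next, would make a match like this fail, which is the kind of problem the enhancement project is meant to reduce.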
      Work on this project is done at the municipal level, whereby an automatic process identifies specific types of errors. Figure 7 shows the results of this process. Blue arcs correspond to unnamed street segments. Although most are valid (e.g., highway ramps), they are nonetheless verified. Red arcs correspond to one of many possible logic errors when road names and types are the same. (Figure 8)
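An automated check of this kind might look like the following sketch. The article does not list the actual rules, so the three checks shown (unnamed segment, incomplete address range, mixed odd/even range on one side) and the dictionary keys are hypothetical examples.

```python
from typing import Dict, List, Optional


def flag_segment(seg: Dict[str, Optional[object]]) -> List[str]:
    """Return a list of illustrative quality flags for one street segment."""
    flags: List[str] = []
    # Unnamed segments are flagged for review even though many are valid.
    if not str(seg.get("name") or "").strip():
        flags.append("unnamed")
    for side in ("left", "right"):
        low = seg.get(f"{side}_from")
        high = seg.get(f"{side}_to")
        if (low is None) != (high is None):
            # Only one end of the address range is populated.
            flags.append(f"{side}_range_incomplete")
        elif low is not None and low % 2 != high % 2:
            # Range starts odd and ends even, or the reverse.
            flags.append(f"{side}_mixed_parity")
    return flags


ramp = {"name": "", "left_from": None, "left_to": None,
        "right_from": None, "right_to": None}
suspect = {"name": "MAIN", "left_from": 101, "left_to": 198,
           "right_from": 100, "right_to": 198}
print(flag_segment(ramp), flag_segment(suspect))
```

Flags such as these would only select segments for review; the correction itself still relies on municipal maps and the Canada Post browser, as described above.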

Conclusion
The NGD forms the foundation for most of the cartographic products produced by the geography divisions of both Statistics Canada and Elections Canada. Therefore, maintaining and improving the quality of this information resource is extremely important. To ensure that this is done, a quality-tracking system has been created to investigate quality issues and request changes to the base where needed. Some issues are isolated instances and only require minor adjustments. Others are more complex and, being larger in scope, require greater research.
      Although this type of research takes time, Statistics Canada and Elections Canada are committed to improving the quality of the NGD. This is needed if the partner agencies are to take advantage of technological innovations. For example, the value of GPS technology allowing field workers to more easily relocate addresses for follow-up operations has already been proven, but the process is contingent upon the availability of a high-quality, accurate street network file.

About the Authors:
Glen Hohlmann is GIS/Informatics Technology Officer, and Robert Parenteau is Project Manager/Spatial Data Infrastructure in the Geography Division at Statistics Canada, Ottawa, Ontario. Mr. Hohlmann may be reached via e-mail at [email protected]. Mr. Parenteau may be reached via e-mail at [email protected].
