
VA Voter List data standardization and normalization

EPEC has purchased and downloaded the full statewide VA Registered Voter List (RVL), the full Voter History List (VHL) and the Monthly Update Subscription (MUS) to the voter list as of 2023-06-30. These files are provided by ELECT as comma-separated-value files, but contain numerous idiosyncrasies, formatting issues and errors.

We combined the MUS information with our baseline list to create a new statewide voter list incorporating all of the relevant changes. Because we downloaded our baseline list only the day before we received the MUS, a number of entries in the MUS had already been incorporated into our baseline dataset. There were, however, still a few significant deletions, additions, and modifications.
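
As a rough illustration of this kind of merge, the sketch below applies a batch of update records to a baseline list and counts how many updates were already reflected in the baseline. The key field `VOTER_ID`, the `ACTION` field, and its `DELETE` value are hypothetical names chosen for the example; the actual MUS layout is not described here and may differ.

```python
def apply_updates(baseline, updates, key="VOTER_ID"):
    """Apply update records to a baseline list of dict records.

    Assumptions (hypothetical, for illustration only): each record is a dict,
    `key` uniquely identifies a voter, and each update carries an "ACTION"
    field whose value is "DELETE" for removals; any other action is treated
    as an add-or-modify.
    """
    merged = {rec[key]: dict(rec) for rec in baseline}
    counts = {"added": 0, "modified": 0, "deleted": 0, "already_applied": 0}

    for upd in updates:
        record = {k: v for k, v in upd.items() if k != "ACTION"}
        rid = record[key]
        if upd["ACTION"] == "DELETE":
            if rid in merged:
                del merged[rid]
                counts["deleted"] += 1
            else:
                # The baseline was downloaded after this deletion took effect.
                counts["already_applied"] += 1
        elif merged.get(rid) == record:
            # The baseline already holds exactly this version of the record.
            counts["already_applied"] += 1
        else:
            counts["modified" if rid in merged else "added"] += 1
            merged[rid] = record

    return list(merged.values()), counts
```

Applying the same update twice leaves the result unchanged, which is the property that matters when the baseline and the update file overlap in time.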

The updated RVL and VHL are currently being processed using the following methods, among others:

  • The Statewide RVL and VHL are being split into smaller data files organized by LOCALITY_NAME, LOCALITY_PRECINCT_NAME, CONG_CODE_VALUE, STHOUSE_CODE_VALUE, STSENATE_CODE_VALUE, and CITY
  • The data has been standardized and normalized: whitespace errors have been removed, all fields have been converted to upper case, observed field name issues have been corrected, and missing fields in the VHL have been added.
    • The VHL does not contain “LOCALITY_NAME” or “PRECINCT_NAME” fields, but does reference each by code value. The missing fields have been added into the VHL after correlating with the RVL data in order to ensure commonality between the datasets, and to allow for splitting into the folder structure defined above.
    • The formatting for precinct names in the RVL is inconsistent in its use of spaces and dashes between the precinct code and name. This has been standardized to use the “ – ” separator.
    • The inconsistent use of the ampersand symbol (“&”) in county names, such as “KING & QUEEN COUNTY”, has been standardized to always use the word “AND” instead.
    • Other corrections of this kind have also been applied. We will continue to update these standardizations and error checks as we discover new issues.
  • The primary and mailing addresses from the RVL have been fed as input to an NCOA processing system (truencoa.com) and the resultant reports have been collated for each grouping as listed above.
  • The RVL fields have also been collated against version 13 of the US Dept of Transportation’s National Address Database (NAD), and the RVL entries have been augmented with information on whether a match was found, as well as the type of match. We have made our best attempt to match addresses to the RVL entries, but there are still inconsistencies and misspellings in both the NAD and RVL data that we are continuing to work to identify and improve.
    • Prior to matching to the NAD listings, the RVL primary and mailing addresses are normalized and standardized according to the US Post Office’s published list of common street suffix abbreviations.
    • Initial matches are attempted based on a strict match to either the primary or mailing address.
    • Subsequent matches use iterative relaxing of various criteria, such as ignoring the street suffix, or flipping the position of the street direction indicator. We have denoted the USDOT_MATCH_TYPE in the augmented RVL dataset to allow filtering on these different matching criteria.
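
The standardization and tiered matching steps described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the function names, the small suffix table, the match-type labels (`STRICT`, `NO_SUFFIX`, `FLIPPED_DIRECTION`, `NONE`), and the assumption that a precinct code is a run of leading digits are all hypothetical choices made for the example.

```python
import re

# Hypothetical subset of street suffix abbreviations; the full list is far longer.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD",
            "DRIVE": "DR", "LANE": "LN", "COURT": "CT"}
DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
PRECINCT_SEP = " – "  # separator used in the standardized precinct names

def clean_field(value):
    """Trim, collapse runs of whitespace, and convert to upper case."""
    return " ".join(value.split()).upper()

def standardize_locality(name):
    """Replace '&' with the word 'AND', e.g. in 'KING & QUEEN COUNTY'."""
    return clean_field(name.replace("&", " AND "))

def standardize_precinct(name):
    """Rewrite '<code><spaces/dashes><name>' with a single separator.
    Assumes the precinct code is a run of leading digits."""
    cleaned = clean_field(name)
    m = re.match(r"^(\d+)\s*[-–]*\s*(.+)$", cleaned)
    return f"{m.group(1)}{PRECINCT_SEP}{m.group(2)}" if m else cleaned

def _suffix_index(tokens):
    """Index of the street suffix: last token, or the one before a trailing direction."""
    i = len(tokens) - 1
    if i > 0 and tokens[i] in DIRECTIONS:
        i -= 1
    return i

def normalize_address(addr):
    """Clean the address and abbreviate the street suffix."""
    tokens = clean_field(addr.replace(".", " ")).split()
    i = _suffix_index(tokens)
    if i > 0:
        tokens[i] = SUFFIXES.get(tokens[i], tokens[i])
    return " ".join(tokens)

def drop_suffix(addr):
    """Relaxation: ignore the street suffix of a normalized address."""
    tokens = addr.split()
    i = _suffix_index(tokens)
    if i > 0 and tokens[i] in SUFFIXES.values():
        del tokens[i]
    return " ".join(tokens)

def flip_direction(addr):
    """Relaxation: move a direction indicator between prefix and trailing position."""
    tokens = addr.split()
    if len(tokens) >= 3:
        if tokens[1] in DIRECTIONS:
            tokens.append(tokens.pop(1))
        elif tokens[-1] in DIRECTIONS:
            tokens.insert(1, tokens.pop())
    return " ".join(tokens)

def match_address(rvl_addr, nad_addresses):
    """Return a match-type label, trying the strict match before relaxed ones."""
    target = normalize_address(rvl_addr)
    nad = {normalize_address(a) for a in nad_addresses}
    if target in nad:
        return "STRICT"
    if drop_suffix(target) in {drop_suffix(a) for a in nad}:
        return "NO_SUFFIX"
    if flip_direction(target) in nad:
        return "FLIPPED_DIRECTION"
    return "NONE"
```

Recording which tier produced the match, as the USDOT_MATCH_TYPE field does, lets a downstream user keep only strict matches or include the relaxed ones, depending on how much tolerance for false positives is acceptable.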

EPEC is working to make this value-added data available for those entities that are authorized to handle VA election information. Interested parties may contact us for more details.
