Categories
Election Data Analysis Election Integrity technical

Ballot Completion Rate for VA Absentee Voters

Below is an analysis of the VA statewide voter completion rate for absentee ballots compiled from the 2022 General Election Daily Absentee List (DAL) file downloaded from the VA Dept of Elections (“ELECT”) on 2022-11-15 17:46:21.

The DAL file records the transactions of all absentee ballots during the early voting period in VA elections. It includes records for both mail-in and in-person early voting transactions. It does not record the actual values of the voted ballots, only the “fact-of” a registered voter checking in to an early voting site, or mailing their ballot application or completed ballot to the registrar, etc. The DAL file is published daily over the course of the early voting period and is cumulative.

For the purposes of this analysis a “Completed” ballot is a ballot that has been recorded in the DAL file as reaching a state in which the ballot can be considered to be tabulate-able. A “Completed” ballot must have its “APP_STATUS” field set to “Approved” AND have the “BALLOT_STATUS” field set to (“FWAB” OR “Marked” OR “On Machine” OR “Pre-Processed”).

The “VOTER_TYPE” field was used to separate records into “Military”, “Overseas” or “Temporary(Federal-Only Ballot)” and the ballot completion rate was computed for each sub-category, as well as overall.
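The “Completed” definition and the per-category split described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the script used to produce the tables below: the APP_STATUS, BALLOT_STATUS and VOTER_TYPE names come from the DAL file, but the VOTER_ID field name is a hypothetical placeholder, and counting a voter as completed when at least one of their transactions is completed is an assumption about how the per-voter rate is aggregated.

```python
from collections import defaultdict

# Ballot statuses treated as "Completed" in the definition above.
COMPLETED_STATUSES = {"FWAB", "Marked", "On Machine", "Pre-Processed"}


def is_completed(record):
    """True when a DAL record meets the 'Completed' definition above."""
    return (record["APP_STATUS"] == "Approved"
            and record["BALLOT_STATUS"] in COMPLETED_STATUSES)


def completion_summary(records, id_field="VOTER_ID"):
    """Summarize transactions and completions per unique voter.

    id_field is a hypothetical column name. A voter is counted as
    completed if any one of their transactions is completed; that
    aggregation rule is an assumption for this sketch.
    """
    transactions = defaultdict(int)
    completed = defaultdict(bool)
    for rec in records:
        vid = rec[id_field]
        transactions[vid] += 1
        completed[vid] = completed[vid] or is_completed(rec)
    n = len(transactions)
    if n == 0:
        return {"unique_voters": 0, "avg_transactions": 0.0,
                "completion_rate": 0.0}
    return {
        "unique_voters": n,
        "avg_transactions": sum(transactions.values()) / n,
        "completion_rate": sum(completed.values()) / n,
    }
```

A sub-category figure would then come from filtering first, for example `completion_summary([r for r in records if r["VOTER_TYPE"] == "Military"])`.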

All Absentee Voters
Avg Transactions Per Voter: 1.03
Avg Completion Rate Per Voter: 91.95%
Num Of Unique Voters: 1,057,268

Military Voters (VOTER_TYPE==”Military”)
Avg Transactions Per Voter: 1.08
Avg Completion Rate Per Voter: 78.60%
Num Of Unique Voters: 9,346

Overseas Voters (VOTER_TYPE==”Overseas”)
Avg Transactions Per Voter: 1.17
Avg Completion Rate Per Voter: 63.63%
Num Of Unique Voters: 7,052

Temporary Federal Voters (VOTER_TYPE==”Temporary(Federal-Only Ballot)”)
Avg Transactions Per Voter: 1.21
Avg Completion Rate Per Voter: 61.14%
Num Of Unique Voters: 1,539

Discussion:

The data above shows a distinct statistical discrepancy in the ability of Military, Overseas, and Temporary Federal voters to complete their absentee ballots in comparison to absentee voters overall. These categories of voters are specifically reliant on the Mail-In absentee ballot process, and are demonstrably not being given the same ability to have their votes cast and counted as is provided to standard absentee voters.

This discrepancy might be due to any number of potential reasons or mechanisms, which cannot be determined from the DAL data as provided by ELECT. The discrepancy demonstrably exists, though, and it should be investigated and addressed by legislators and officials in order to remedy the comparative disenfranchisement of specific classes of VA voters.

I will note for completeness that the first discovery and observation of this discrepancy was due to the diligent work of a fellow EPEC board member. I independently validated his results and created the scripts to process the data on a statewide basis to produce these tables. As always I am happy to provide the raw data, scripts and results to interested parties that are capable of receiving and handling VA election data according to VA law and the policies of ELECT. Interested parties can contact us to request more information.

Categories
Election Data Analysis programming technical

Changes in number of detected cloned VA voter registration records Nov 2022 – July 2023

BLUF: The number of detected exact (Full Name + DOB) “clone” registration records in the VA Registered Voter List (RVL) file has decreased overall from 2022-11-23 to 2023-07-01; however, new clones are still being added to the database.

There has been a concentrated effort by various election integrity groups and public officials around the state of VA to clean up the voter rolls. Specifically the VA Department of Elections (“ELECT”) made a concerted effort in early 2023 to remove and clean up a large number (~19,000) of deceased voters and other errant records. The data below shows that these efforts have made an impact on the number of exact “clones” identified in the database, but that there is still more work to do.

BACKGROUND: This is a continuation of exploratory analysis on the existence of “cloned” records in the VA Registered Voter List. Please see previous posts here, here, here and here for background information.

As a reminder and for the purposes of this analysis, a potential “cloned” record is defined as a record in the VA Registered Voter List (RVL) where the Full Name (First + Middle + Last + Suffix) and full Date of Birth (mm/dd/yyyy) exactly match another record, but the two records have different Voter Identification Numbers. In my previous analyses I was focusing on the Active registration records that had been identified as clones. However, even if a cloned record is marked as Inactive in the database, it still holds the potential to be voted, as any interaction with the voter immediately moves the registration from Inactive to Active. Therefore the analysis below includes both Active and Inactive records.

It is important to note and emphasize that this analysis focuses specifically on exact clones, and not on any of the other potential errors that could be present in a voter database. There are a few reasons for this narrow focus:

  1. The detection of exact clones in the database requires no additional data correlations and can operate directly on the data provided from ELECT. It is easily defined and scriptable, and can be replicated by other researchers and public officials for verification.
  2. Due to item (1), the identification of exact clones is a good candidate to track over time as a proxy indicator for issues with the database.
  3. There are some rather interesting non-random distribution patterns of the already observed cloned records that I have previously discussed here, and I am interested in observing and understanding the cause of these distribution shapes.
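Because the exact-clone definition is so easy to script, a minimal sketch of it is shown here for anyone wanting to replicate the check. It is not the actual processing script: the field names (FIRST_NAME, MIDDLE_NAME, LAST_NAME, SUFFIX, DOB, VOTER_ID) are hypothetical placeholders rather than the confirmed RVL column names, and the records are assumed to have already been normalized for case and whitespace.

```python
from collections import defaultdict

# Hypothetical field names; substitute the actual RVL column names.
KEY_FIELDS = ("FIRST_NAME", "MIDDLE_NAME", "LAST_NAME", "SUFFIX", "DOB")


def find_clones(records, id_field="VOTER_ID"):
    """Return the set of voter IDs whose full name + DOB exactly match
    at least one other record that carries a different voter ID."""
    ids_by_key = defaultdict(set)
    for rec in records:
        key = tuple(rec[f] for f in KEY_FIELDS)
        ids_by_key[key].add(rec[id_field])

    clones = set()
    for ids in ids_by_key.values():
        # More than one distinct ID for the same name + DOB means clones.
        if len(ids) > 1:
            clones |= ids
    return clones
```

Grouping on the key and collecting IDs into a set means that a true duplicate (the same voter ID appearing twice) is not flagged; only matching records with different IDs are.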

DETAILS: As we now have collected multiple statewide voter registration lists, I was curious as to how the numbers of detected cloned registrant records have changed over time, specifically with respect to the REGISTRATION_DATE field that is reported in each record.

In Figure 1 below I’ve plotted the number of identified cloned records in the 2022-11-23 RVL stratified by registration date year. Figure 2 is the same plot from the 2023-06-30 RVL file we recently purchased. Figure 3 shows the number of additions, removals and net change in the number of identified cloned records in each date bin between these two datasets, based on the unique set of cloned Voter ID Numbers that fall within each bin.

Figure 1 Distribution of identified cloned records with respect to registration date in the 2022-11-23 Registered Voter List purchased from VA Department of Elections. The total number of identified (First + Middle + Last + Suffix + DOB) clones in the dataset was 2,445.
Figure 2 Distribution of identified cloned records with respect to registration date in the 2023-06-30 Registered Voter List purchased from VA Department of Elections. The total number of identified (First + Middle + Last + Suffix + DOB) clones in the dataset was 1,485.
Figure 3 Differences between the 2023-06-30 and 2022-11-23 RVL datasets.

The total number of identified clones in the 2022-11-23 dataset was 2,445. The total number of identified clones in the 2023-06-30 dataset was 1,485. We can see in Figure 3 that while there has been an overall reduction of 960 in the number of cloned records (which is good!), there are still new clones being added to the voter registration database, even as previously identified clones have been removed. This suggests that there is an ongoing process(es) or mechanism(s) that is continuously adding cloned records to the voter list database.

It is not readily apparent what causes the added cloned records. It could be any number of technology issues, such as poor input verification or coding practices, or it could be related to human error and poor procedures and/or training, or a mixture of issues.
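The additions, removals and net change per registration year come down to set differences on the cloned Voter ID Numbers. The sketch below illustrates the idea under the assumption that each snapshot has been reduced to a mapping of cloned voter ID to registration year; the function and variable names are hypothetical. Summed over all bins, the net change equals the difference in clone totals, which for the two purchased lists is 1,485 − 2,445 = −960.

```python
from collections import Counter


def clone_changes(old_clones, new_clones):
    """Compare two snapshots of cloned records.

    old_clones / new_clones: dict mapping cloned voter ID -> registration
    year (an assumed, simplified representation of each snapshot).
    Returns {year: {"added": n, "removed": n, "net": n}}.
    """
    added_ids = set(new_clones) - set(old_clones)
    removed_ids = set(old_clones) - set(new_clones)

    added = Counter(new_clones[v] for v in added_ids)
    removed = Counter(old_clones[v] for v in removed_ids)

    result = {}
    for year in sorted(set(added) | set(removed)):
        result[year] = {
            "added": added[year],
            "removed": removed[year],
            "net": added[year] - removed[year],
        }
    return result
```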

The two full datasets above (2022-11-23 and 2023-06-30) were purchased directly from the department of elections. The lists were parsed, standardized and normalized for string case, known typos, whitespace and punctuation issues, but otherwise the raw data entries were unadjusted.

We also purchased the Monthly Update Subscription (MUS) from ELECT at the time that we ordered the 2023-06-30 RVL. The MUS is generated on a monthly basis and captures the changes to the voter list over the prior month. We received the 2023-07-01 MUS, and applied the changes within it to the full 2023-06-30 RVL that we had received the day before. As the MUS contains all the changes over the previous month, and we had purchased our full dataset the prior day, we did not expect there to be many adjustments required, but there were a few. Applying the MUS to the 2023-06-30 dataset resulted in the generation of an updated 2023-07-01 dataset.

For completeness, the same plots that were generated above for the directly purchased data are repeated below for the updated RVL dataset. Figure 4 plots the number of identified cloned records in the 2023-07-01 RVL stratified by registration date year, and Figure 5 shows the differences with the 2022-11-23 dataset.

Figure 4 Distribution of identified cloned records with respect to registration date in the 2023-07-01 Registered Voter List generated from the purchased 2023-06-30 RVL and 2023-07-01 MUS files. The total number of identified (First + Middle + Last + Suffix + DOB) clones in the dataset was 1,575.
Figure 5 Differences between the 2023-07-01 and 2022-11-23 RVL datasets.

One thing that I find interesting is the difference in detected cloned entries between the purchased 2023-06-30 dataset and the 2023-07-01 dataset after the MUS entries have been applied. This is presented in Figure 6 below. We see that the application of the MUS did not remove any clones, but 90 were added. I’m not sure what this means yet, as we only have a single MUS file and it was generated so close to our full download. We will monitor and see how this progresses as we continue to receive the MUS files throughout the 2023 election cycle.

Categories
Election Data Analysis Election Forensics Election Integrity mathematics programming technical

Voter ID Number distribution patterns in VA Registered Voter List

One thing that I have been asked about repeatedly is whether there are any patterns in the assignment of voter ID numbers in the VA data. Specifically, I’ve been asked whether I’ve found any pattern similar to what AuditNY has found in the NY data. It’s not something that I have looked at in depth previously, due mostly to lack of time and because VA is set up very differently than NY, so a direct comparison or attempt to replicate the AuditNY findings in VA isn’t as straightforward as one would hope.

The NY data uses a different Voter ID number for counties vs at the state level, which is the “Rosetta Stone” that was needed for the NY team to understand the algorithms that were used to assign voter ID numbers, and in turn discover some very (ahem) “interesting” patterns in the data. VA doesn’t have such a system and only uses a single voter ID number throughout the state and local jurisdictions.

Well … while my other machine is busy crunching on the string distance computations, I figured I’d take a crack at looking at the distribution of the Voter ID numbers in the VA Registered Voter List (RVL) and just see what I find.

To start with, here is a simple scatter plot of the Voter ID numbers vs the Registration date for each record in the 2023-07-01 RVL. From the zoomed out plot it is readily apparent that there must have been a change in the algorithm that was used to assign voter identification numbers sometime around 2007, which coincides nicely with the introduction of the current Virginia Election and Registration Information System (VERIS).

From a high level, it appears that the previous assignment algorithm broke the universe of possible ID numbers up into discrete ranges and assigned IDs within those ranges, but favoring the bottom of each range. This would be a logical explanation for the banded structure we see pre-2007. The new assignment algorithm post-2007 looks to be using a much more randomized approach. Nothing strange about that. As computing systems have gotten better and security has become more of a concern over the years there have been many systems that migrated to more randomized assignments of identification numbers.

Looking at a zoomed in block of the post-2007 “randomized” ID assignments we can see some of the normal variability that we would expect to see in the election cycles. We see that we have a high density of new assignments around November of 2016 and 2020, with a low density section of assignments correlated to the COVID-19 lockdowns. There are short periods where it looks like there were lulls in the assignment of voter IDs. These are perhaps due to holidays or maintenance periods, or related to the legal requirements to “freeze” the voter rolls 30 days before any election (primaries, runoffs, etc). Note that VA now has same day voter registration as of the laws passed by the previous Democratic super-majority that went into effect in 2022, so going forward we would likely see these “blackout periods” be significantly reduced.

We can see more clearly the banded assignment structure of the pre-2007 entries by zooming in on a smaller section of the plot, as shown below. It’s harder to make out in this banded structure, but we still see similar patterns of density changes presumably due to the natural election cycles, holidays, maintenance periods, legally required registration lockout periods, etc. We can also see the “bucketing” of ID numbers into distinct bands, with the bias of numbers filling the lower section of each band.

All of that looks unremarkable and seems to make sense to me … however … if we zoom into the Voter ID range of around [900,000,000 to 920,000,000] we do see something that catches my curiosity. We see the existence of the same banded structure as above between 900,000,000 and 915,000,000 AND pre-2007, but there is another band of assignments superimposed on the entire date range of the RVL. This band does not seem to be affected by the introduction of the VERIS system (presumably), which is very interesting. There is also what looks to be a vertical high-density band between 2007 and 2010 that extends along the entire vertical axis, but we only see it once we zoom in to the VERIS transition period.

The horizontal band that extends across all date ranges only exists in the [~915,000,000 to ~920,000,000] ID range. It trails off in density pre ~1993, but it exists throughout the full registration date range. I will note that the “Motor Voter” National Voter Registration Act (NVRA) was implemented in 1993, so perhaps these are a reserved universe block for DMV (or other externally provided) registrations? (That’s a guess, but an educated one.)
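One way to put numbers behind this visual impression is to count the records that fall inside the band for each registration year. The sketch below is only an illustration: the band limits are the approximate values read off the plot, VOTER_ID is a hypothetical field name, and the REGISTRATION_DATE values are assumed to be mm/dd/yyyy strings.

```python
from collections import Counter
from datetime import datetime

# Approximate band limits as read off the plot above.
BAND_LOW, BAND_HIGH = 915_000_000, 920_000_000


def band_counts_by_year(records, id_field="VOTER_ID",
                        date_field="REGISTRATION_DATE",
                        date_format="%m/%d/%Y"):
    """Count records inside the ID band, grouped by registration year.

    id_field and date_format are assumptions about the file layout.
    """
    counts = Counter()
    for rec in records:
        if BAND_LOW <= int(rec[id_field]) <= BAND_HIGH:
            year = datetime.strptime(rec[date_field], date_format).year
            counts[year] += 1
    return dict(sorted(counts.items()))
```

If the band really does trail off before about 1993, the yearly counts returned here should drop noticeably for the earlier years.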

A plausible explanation I can imagine for the distinct high density band between 2007-2010 is that this might be related to how the VERIS system was implemented and brought into service, and there was some sort of update around 2010 that made corrections to its internal algorithms. (But that is just a guess.) That still wouldn’t entirely explain the huge change in the density of new registrants added to the rolls.

Another, or additional, explanation might be that when VERIS came online there were a number of registrants that had their Voter ID number regenerated and/or their registration date field updated as part of the rollout of the new VERIS software. Meaning that while VERIS was coming online and handling the normal amount of new real registrations, it was also moving/updating a large number of historic registrations, which would account for the higher density as VERIS became the system of record. That seems to be a poor systems administration and design choice, in my opinion, as it makes those moved registrant records inaccurate by giving them a false registration date. However, if that was the case, and VERIS was resetting registration dates as it ingested voter records into its databases, why do we see any records with pre-2007 registration dates at all? (This is, again, merely an educated guess on my part, so take it with a grain of salt.)

Incorporating the identification of cloned registrations

In attempting to incorporate some of my early duplicate record identification results on the most recent RVL data (technically they are “cloned” records, as “duplicates” would have the same voter ID numbers; this was pointed out to me a few days ago), I did a scatter plot of only those records that had an identified exact match of (Full Name + DOB) to other records in the dataset, but with unique Voter ID numbers. The scatter plot of those records is shown below, and we can see that there is a distinct ~horizontal cluster of records that aligns with the 915M – 920M ID band and pre-2007. In the post-2007 block we see the cloned records do not seem to be totally randomly distributed, but have a bias towards the lower right of the graph.

Superimposing the two plots produces the following, with the red indicating the records with identified Full Name + DOB string matches.

Zooming in to take a closer look at the 915M-920M band again, gives the following:

It is curious that there seems to be an alignment of the exact Full Name + DOB matching records with the 915M-920M, pre-2007 ID band. Post-2007 the exact cloned matches have a less structured distribution throughout the data, but they do seem to cluster around the lower right.

If the cloned records were simply due to random data entry errors, etc. I would expect to see sporadic red datapoints distributed “salt-n-pepper” style throughout the entirety of the area covered by the blue data. There might be some argument to be made for there being a bias of more of the red data points to the right side of the plot, as officials have not yet had time to “catch” or “clean-up” erroneous entries, but there is little reason to have linear features, or to have a bias for lower ID numbers in the vertical axis.

I am continuing to investigate this data, but as of right now all I can tell you is that … yes, there do seem to be interesting patterns in the way Voter IDs are assigned in VA, especially with records that have already been found and flagged to be problematic (clones).

Categories
Election Data Analysis Election Forensics Election Integrity programming technical

Preliminary Results of 2023-07-01 VA RVL duplicate detections

Below are the preliminary results from performing exact (string distance of 0) duplicate record checks on the 2023-07-01 VA Registered Voter List information. Note that these are the numbers of ordered matches discovered, not the number of individual unique registrants. Each count represents two different registration records, with unique voter IDs, that match the given criteria. Match pairs are directional, in that a pair (A,B) is counted separately from the pair (B, A). Matches are grouped into this table according to the LOCALITY_NAME of the first element of the identified pair. Some pairs can have different locality information, except for in the strictest case (3rd column), so a match might be counted in one locality while its mirror is counted in the other.

The first data column of the table below is equivalent to the criteria used by the MOU between the VA Department of Elections (ELECT) and the Department of Motor Vehicles (DMV), as discussed and documented in a previous post. There were 5,290 matches for these criteria across the state.

The second data column below is based on a match of the registrant’s Full Name and Day+Month+Year of birth information, but NOT the registrant’s listed address information. There were 1,200 matches for these criteria across the state.

The third data column is the strictest match criteria and includes the Gender and Address of the registrant. There were 208 matches for these criteria across the state.
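The counting convention described above (ordered pairs, attributed to the locality of the first record in each pair) can be sketched as follows. This is a minimal illustration rather than the script behind the table: LOCALITY_NAME is the RVL field named above, while VOTER_ID and the key field names passed in are hypothetical placeholders.

```python
from collections import Counter, defaultdict
from itertools import permutations


def ordered_match_counts(records, key_fields, id_field="VOTER_ID",
                         locality_field="LOCALITY_NAME"):
    """Count ordered exact-match pairs per locality.

    Each pair (A, B) is counted separately from (B, A), and each ordered
    pair is attributed to the locality of its first element.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in key_fields)].append(rec)

    counts = Counter()
    for group in groups.values():
        # permutations(group, 2) yields both (A, B) and (B, A).
        for first, second in permutations(group, 2):
            if first[id_field] != second[id_field]:
                counts[first[locality_field]] += 1
    return counts
```

With this convention, two matching records in the same locality add 2 to that locality, while a match that spans two localities adds 1 to each, which is why odd counts can appear in the looser columns of the table.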

I am only considering Active registrations in the table below. I previously computed similar statistics using the previously purchased 2022-11-23 RVL. I have not yet done a comparison of the two datasets, but will do so once I complete the string distance processing on this latest set.

Locality | Exact Same First+Last+DOB | Exact Same First+Middle+Last+Sfx+DOB | Exact Same First+Middle+Last+Sfx+Gender+Address+DOB
ACCOMACK COUNTY | 24 | 7 | 0
ALBEMARLE COUNTY | 116 | 57 | 22
ALEXANDRIA CITY | 60 | 13 | 0
ALLEGHANY COUNTY | 12 | 4 | 0
AMELIA COUNTY | 5 | 0 | 0
AMHERST COUNTY | 32 | 8 | 2
APPOMATTOX COUNTY | 15 | 0 | 0
ARLINGTON COUNTY | 152 | 62 | 0
AUGUSTA COUNTY | 69 | 6 | 2
BATH COUNTY | 2 | 0 | 0
BEDFORD COUNTY | 67 | 13 | 0
BLAND COUNTY | 5 | 0 | 0
BOTETOURT COUNTY | 24 | 2 | 0
BRISTOL CITY | 8 | 0 | 0
BRUNSWICK COUNTY | 17 | 1 | 0
BUCHANAN COUNTY | 11 | 5 | 0
BUCKINGHAM COUNTY | 7 | 0 | 0
BUENA VISTA CITY | 3 | 0 | 0
CAMPBELL COUNTY | 39 | 13 | 4
CAROLINE COUNTY | 18 | 0 | 0
CARROLL COUNTY | 14 | 0 | 0
CHARLES CITY COUNTY | 2 | 0 | 0
CHARLOTTE COUNTY | 11 | 1 | 0
CHARLOTTESVILLE CITY | 25 | 8 | 0
CHESAPEAKE CITY | 136 | 13 | 8
CHESTERFIELD COUNTY | 380 | 127 | 34
CLARKE COUNTY | 11 | 0 | 0
COLONIAL HEIGHTS CITY | 12 | 4 | 0
COVINGTON CITY | 5 | 0 | 0
CRAIG COUNTY | 3 | 0 | 0
CULPEPER COUNTY | 26 | 2 | 0
CUMBERLAND COUNTY | 13 | 5 | 2
DANVILLE CITY | 21 | 4 | 0
DICKENSON COUNTY | 6 | 0 | 0
DINWIDDIE COUNTY | 21 | 2 | 0
EMPORIA CITY | 9 | 3 | 0
ESSEX COUNTY | 9 | 1 | 0
FAIRFAX CITY | 12 | 2 | 0
FAIRFAX COUNTY | 550 | 188 | 14
FALLS CHURCH CITY | 12 | 8 | 0
FAUQUIER COUNTY | 41 | 2 | 0
FLOYD COUNTY | 9 | 0 | 0
FLUVANNA COUNTY | 23 | 3 | 0
FRANKLIN CITY | 8 | 4 | 2
FRANKLIN COUNTY | 20 | 2 | 2
FREDERICK COUNTY | 54 | 9 | 4
FREDERICKSBURG CITY | 14 | 2 | 0
GALAX CITY | 3 | 0 | 0
GILES COUNTY | 3 | 0 | 0
GLOUCESTER COUNTY | 24 | 2 | 0
GOOCHLAND COUNTY | 23 | 2 | 0
GRAYSON COUNTY | 13 | 1 | 0
GREENE COUNTY | 16 | 1 | 0
GREENSVILLE COUNTY | 7 | 0 | 0
HALIFAX COUNTY | 23 | 6 | 4
HAMPTON CITY | 131 | 37 | 8
HANOVER COUNTY | 73 | 10 | 0
HARRISONBURG CITY | 18 | 8 | 2
HENRICO COUNTY | 245 | 75 | 34
HENRY COUNTY | 29 | 4 | 4
HIGHLAND COUNTY | 2 | 0 | 0
HOPEWELL CITY | 33 | 7 | 4
ISLE OF WIGHT COUNTY | 27 | 1 | 0
JAMES CITY COUNTY | 31 | 1 | 0
KING AND QUEEN COUNTY | 4 | 0 | 0
KING GEORGE COUNTY | 17 | 0 | 0
KING WILLIAM COUNTY | 17 | 4 | 0
LANCASTER COUNTY | 8 | 3 | 0
LEE COUNTY | 38 | 14 | 2
LEXINGTON CITY | 8 | 3 | 0
LOUDOUN COUNTY | 158 | 31 | 6
LOUISA COUNTY | 26 | 1 | 0
LUNENBURG COUNTY | 13 | 5 | 0
LYNCHBURG CITY | 75 | 29 | 4
MADISON COUNTY | 10 | 2 | 0
MANASSAS CITY | 23 | 3 | 2
MANASSAS PARK CITY | 4 | 1 | 0
MARTINSVILLE CITY | 9 | 0 | 0
MATHEWS COUNTY | 5 | 2 | 2
MECKLENBURG COUNTY | 25 | 3 | 0
MIDDLESEX COUNTY | 6 | 0 | 0
MONTGOMERY COUNTY | 47 | 5 | 0
NELSON COUNTY | 15 | 5 | 4
NEW KENT COUNTY | 18 | 1 | 0
NEWPORT NEWS CITY | 88 | 12 | 0
NORFOLK CITY | 111 | 15 | 0
NORTHAMPTON COUNTY | 8 | 3 | 2
NORTHUMBERLAND COUNTY | 8 | 1 | 0
NOTTOWAY COUNTY | 10 | 8 | 4
ORANGE COUNTY | 29 | 4 | 0
PAGE COUNTY | 23 | 3 | 0
PATRICK COUNTY | 10 | 1 | 0
PETERSBURG CITY | 31 | 6 | 0
PITTSYLVANIA COUNTY | 62 | 9 | 2
POQUOSON CITY | 8 | 0 | 0
PORTSMOUTH CITY | 80 | 10 | 0
POWHATAN COUNTY | 26 | 6 | 0
PRINCE EDWARD COUNTY | 20 | 5 | 0
PRINCE GEORGE COUNTY | 24 | 5 | 2
PRINCE WILLIAM COUNTY | 178 | 41 | 0
PULASKI COUNTY | 16 | 2 | 0
RADFORD CITY | 8 | 1 | 0
RAPPAHANNOCK COUNTY | 6 | 1 | 0
RICHMOND CITY | 159 | 48 | 2
RICHMOND COUNTY | 56 | 38 | 10
ROANOKE CITY | 40 | 0 | 0
ROANOKE COUNTY | 57 | 7 | 0
ROCKBRIDGE COUNTY | 17 | 2 | 0
ROCKINGHAM COUNTY | 34 | 7 | 0
RUSSELL COUNTY | 19 | 0 | 0
SALEM CITY | 11 | 1 | 0
SCOTT COUNTY | 17 | 4 | 0
SHENANDOAH COUNTY | 22 | 4 | 0
SMYTH COUNTY | 17 | 0 | 0
SOUTHAMPTON COUNTY | 18 | 6 | 0
SPOTSYLVANIA COUNTY | 86 | 22 | 4
STAFFORD COUNTY | 92 | 21 | 2
STAUNTON CITY | 15 | 0 | 0
SUFFOLK CITY | 51 | 5 | 2
SURRY COUNTY | 8 | 2 | 0
SUSSEX COUNTY | 11 | 2 | 2
TAZEWELL COUNTY | 21 | 0 | 0
VIRGINIA BEACH CITY | 244 | 25 | 2
WARREN COUNTY | 25 | 1 | 0
WASHINGTON COUNTY | 35 | 5 | 2
WAYNESBORO CITY | 13 | 0 | 0
WESTMORELAND COUNTY | 19 | 2 | 0
WILLIAMSBURG CITY | 15 | 8 | 0
WINCHESTER CITY | 11 | 2 | 0
WISE COUNTY | 24 | 5 | 0
WYTHE COUNTY | 21 | 2 | 0
YORK COUNTY | 26 | 0 | 0
Grand Total | 5290 | 1200 | 208

Categories
Election Data Analysis Election Integrity technical

VA Voter List data standardization and normalization

EPEC has purchased and downloaded the full statewide VA Registered Voter List (RVL), the full Voter History List (VHL) and the Monthly Update Subscription (MUS) to the voter list as of 2023-06-30. These files are provided by ELECT as comma-separated-value files, but contain numerous idiosyncrasies, formatting issues and errors.

We combined the MUS information with our baseline list to create a new Statewide voter list record incorporating all of the relevant changes. As we had downloaded our baseline list only the day before we received the MUS, a number of entries in the MUS had already been incorporated into our baseline dataset; however, there were a few significant deletions / adds / modifications.

The updated RVL and VHL are currently being processed using the following methods, among others:

  • The Statewide RVL and VHL are being split into smaller data files organized by LOCALITY_NAME, LOCALITY_PRECINCT_NAME, CONG_CODE_VALUE, STHOUSE_CODE_VALUE, STSENATE_CODE_VALUE, and CITY
  • The data has been standardized and normalized to remove whitespace errors, all fields have been converted to upper case, observed field name issues have been corrected, and missing fields in the VHL have been added.
    • The VHL does not contain “LOCALITY_NAME” or “PRECINCT_NAME” fields, but does reference each by code value. The missing fields have been added into the VHL after correlating with the RVL data in order to ensure commonality between the datasets, and to allow for splitting into the folder structure defined above.
    • The formatting for precinct names in the RVL is inconsistent in its use of spaces and dashes between the precinct code and name. This has been standardized to be the ” – ” separator.
    • The inconsistent use of the ampersand symbol (“&”) in county names, such as “KING & QUEEN COUNTY”, has been standardized to always use the word “AND” instead.
    • etc. We will continue to update these standardizations and error checks as we discover new issues.
  • The primary and mailing addresses from the RVL have been fed as input to an NCOA processing system (truencoa.com) and the resultant reports have been collated for each grouping as listed above.
  • The RVL fields have also been collated against version 13 of the US Dept of Transportation’s National Address Database and the RVL entries have been augmented with the information regarding whether a match was found or not, as well as the type of match. Our best attempt has been made to match addresses to the RVL entries, but there are still inconsistencies and mis-spellings in both the NAD and RVL data that we are continuing to work to identify and improve.
    • Prior to matching to the NAD listings, the RVL primary and mailing addresses are normalized and standardized according to the US Postal Service’s published list of common street suffix abbreviations.
    • Initial matches are attempted based on a Strict match to either the Primary or Mailing address
    • Subsequent matches use iterative relaxing of various criteria, such as ignoring the street suffix, or flipping the position of the street direction indicator. We have denoted the USDOT_MATCH_TYPE in the augmented RVL dataset to allow filtering on these different matching criteria.
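A few of the standardization steps listed above (whitespace cleanup, upper-casing, the ampersand replacement and the precinct separator) can be sketched as shown below. This is an illustrative sketch only, not EPEC's processing code: the function names are hypothetical, the separator is written with a plain ASCII hyphen, and the precinct code is assumed to be a leading run of digits, which may not hold for every locality.

```python
import re


def normalize_field(value):
    """Collapse runs of whitespace, trim the ends, and upper-case."""
    return re.sub(r"\s+", " ", value).strip().upper()


def normalize_locality(name):
    """Replace '&' with the word 'AND' in locality names."""
    name = normalize_field(name)
    return re.sub(r"\s*&\s*", " AND ", name)


def normalize_precinct(name):
    """Rewrite '<code><spaces/dashes><name>' as '<code> - <name>'.

    Assumes the precinct code is a leading run of digits; names that do
    not fit that pattern are returned with only the basic cleanup applied.
    """
    name = normalize_field(name)
    match = re.match(r"^(\d+)[\s\-]+(.*)$", name)
    if match:
        return f"{match.group(1)} - {match.group(2)}"
    return name
```

For example, `normalize_locality("King & Queen County")` gives `"KING AND QUEEN COUNTY"`, and `normalize_precinct("101-Main Street")` gives `"101 - MAIN STREET"`.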

EPEC is working to make this value-added data available for those entities that are authorized to handle VA election information. Interested parties may contact us for more details.