Categories
Election Data Analysis Election Forensics Election Integrity Interesting programming technical

Potential duplicate registrants in VA voter list via Hamming Distance

Using the 2022-11-23 Registered Voter List (RVL) and the 2023-01-26 Voter History List (VHL) purchased from the VA Department of Elections (ELECT) I wrote up an analysis script to check for potentially duplicated registrant records in the RVL and cross reference duplicate pairings with the VHL to identify potential duplicate votes. The details are summarized below.

Please note that I will not publish voter Personally Identifiable Information (PII) on this blog. I have substituted fictitious PII information for all examples given below, and cryptographically hashed all voter information in the downloadable results file. I will make available the detailed information to those that have the authorization to receive and process voter data upon request (contact us).

Summary of Results:

We should mathematically expect approximately 11 exact string collisions in the full RVL dataset when comparing (First Name + Middle Name + Last Name + Suffix + Full DOB), but instead we see 1982 such collisions, which is over an order of magnitude increase from the expected value. While its possible that some of these collisions are false positives, there are quite a number of them that are deserving of further scrutiny.

Method:

For every entry in the latest RVL, I performed a string distance comparison, based on Hamming distance, between every possible pair of strings of (FIRST NAME + MIDDLE NAME + LAST NAME + SUFFIX + FULL DOB).  So for the ~6M different RVL entries, we need to compute ~3.6 x 10^13 different string comparisons. A hamming distance of 0 indicates the strings being compared are identical, a hamming distance of 1 indicates that there is a single character different between the two strings, a hamming distance of 2 indicates 2 characters are different, etc.  This obviously is a very computationally intensive process and it took over two days to complete the processing, once I got the bugs worked out.  (I’ve been quietly working on this one for a while now … )

Note that the Hamming distance only compares each respective position in a string and does not account for adding or removing a character completely from a string. A metric that does include addition and subtraction is the Levenshtein Edit Distance, which is much more computationally expensive (but more rigorous) metric. The Hamming distance is related to the Levenshtein distance in that it is mathematically the upper bound on the Levenshtein distance for arbitrary strings. I haven’t yet finished making an optimized GPU accelerated version of the Levenshtein edit distance metric, but it is in the works and I will redo this analysis with the new metric once that is completed.

I aggregated all of the Hamming distance pairings that were less than or equal to 3 characters different in order to identify potential (key word) duplicated registrants, and additionally for each pairing looked at the voter history information for each registrant in the pair to determine if there was a potential (again … key word) for multiple ballots to be cast by the same person in any given election.  As we allow for more characters to be different, we potentially are including many more likely false positive matches, even if we are catching more true positives.

For example: At a Hamming distance of 4 the strings of “Dave Joseph Smith M 10/01/1981” and “Tony Joseph Smith M 10/01/1981” at the same address would produce a potential match, but so would “Davey Joseph Smith M 10/01/1981” and “David Josiph Smith M 10/02/1981”. The first pair is more likely to be a false positive due to twins, while the second is more likely to be due to typo’s, mistakes, or use of nicknames and might warrant further investigation. A much stronger potential match would be something like “David Josiph Smith M 10/01/1981” and “David Joseph Smith M 10/01/1981”, with a Hamming distance of 1 at the same address. In an attempt to limit false positives, I have clamped the Hamming distance checks to <= 3 in this analysis.

One of the drawbacks of using Hamming distance over a more complete metric such as Levenshtein, is that the Hamming distance would give a very high score, and would therefore filter out of our results, an example pairing such as: “David Joseph Smith M 10/01/1981” and “Dave Joseph Smith M 10/01/1981”. The change from “id” to “e” adds/subtracts a character making the rest of the characters in the remainder of the string shift position and also not match. A Levenshtein metric would correctly return a small distance of 2, whereas the hamming distance returns 27. (As mentioned earlier, I am working on a Levenshtein implementation, but it is not yet complete.)

Note that with the official records obtained from ELECT, and in accordance with the laws of VA, I do not have access to the social security number or drivers license numbers for each registration record, which would help in identifying and discriminating potential duplicate errors vs things like twins, etc. I only have the first name, middle name, last name, suffix, month of birth, day of birth, year of birth, gender, and address information that I can work with.  I can therefore only take things so far before someone else (with investigative authority and ability to access those other fields) would need to step in and confirm and validate these findings.

Results:

The summary totals are as follows, with detailed examples.

Hamming Distance0123
Number of Potential Duplicate Registrant Pairs1982327621864120642
Number of Potential Duplicate Ballots110324831210175872
Number of Potential Duplicate Voters5613261325274918

According to my derivations and simulations that are described in detail at the end of this article, we should only expect to see an average of 11 (+/- 3) potential duplicate pairs (a.k.a. “collisions”) at a Hamming distance of 0. This is over two orders of magnitude different than what we observe in the compiled results table above. Such a discrepancy deserves further investigation and verification.

Examples of Types of Issues Observed:

NOTE THE BELOW INFORMATION HAS HAD THE VOTER PERSONALLY IDENTIFIABLE INFORMATION (“PII”) FICTIONALIZED. WHILE THESE ARE BASED ON REAL DATA TO ILLUSTRATE THE DIFFERENT TYPES OF OBSERVATIONS, THEY DO NOT REPRESENT REAL VOTER INFORMATION.

Example #1: The following set of records has the exact match (Hamming Distance = 0) of full name and full birthdate (including year), but different address and different voter ID numbers AND there was a vote cast from each of those unique voter ID’s in the 2020 General Election.  While it’s remotely possible that two individuals share the exact same name, month, day and year of birth … it is probabilistically unlikely (see section below on mathematical derivation of probabilities if interested), and should warrant further scrutiny.

Voter Record A:

AMY BETH McVOTER 12/05/1970 F 12345 CITIZEN CT

Voter Record B:

AMY BETH McVOTER 12/05/1970 F 5678 McPUBLIC DR

Example #2: This set of records has a single character different (Hamming distance of 1) in their first name, but middle name, last name, birthdate and address are identical AND both records are associated with votes that were cast in the 2020, 2021, and 2022 November General Elections.  While it is possible that this is a pair of 23 year old twins (with same middle names) that live together, it at least bears looking into.

Voter Record A:

TAYLOR DAVID VOTER 02/16/2000 M 6543 OVERLOOK AVE NW

Voter Record B:

DAYLOR DAVID VOTER 02/16/2000 M 6543 OVERLOOK AVE NW

Example #3: This set of records has two characters different (Hamming distance of 2) in their birthdate, but name and address are identical AND the birth years are too close together for a child/parent relationship, AND both records are associated with votes that were cast in the 2020 and 2022 November General Elections. 

Voter Record A:

REGINA DESEREE MACGUFFIN 02/05/1973 F 123 POPE AVE

Voter Record B:

REGINA DESEREE MACGUFFIN 03/07/1973 F 123 POPE AVE

Example #4: This set of records has again a single character different (Hamming distance of 1) in the first name (but not the first letter this time) and the last name, birthdate and address are identical.  There were also multiple votes cast in the 2019 and 2022 November General from these registrants.

Voter Record A:

EDGARD JOHNSON 10/19/1981 M 5498 PAGELAND BLVD

Voter Record B:

EDUARD JOHNSON 10/19/1981 M 5498 PAGELAND BLVD

Example #5: This set of records has two characters different (Hamming distance of 2) in the first and middle names and the last name, birthdate, gender and address are identical.  There were also multiple votes cast in the 2021 and 2022 November General from these registrants. Again it is possible that these records represent a set of twins given the information that ELECT provides.

Voter Record A:

ALANA JAVETTE THOMPSON 01/01/2003 F 123 CHARITY LN

Voter Record B:

ALAYA YAVETTE THOMPSON 01/01/2003 F 123 CHARITY LN

Example #6: The following set of records has the exact match (Hamming Distance = 0) of full name and full birthdate (including year), and same address but different voter ID numbers.  There was no duplicated votes in the same election detected between the two ID numbers.

Voter Record A:

JAMES TIBERIUS KIRK 03/22/2223 M 1701 Enterprise Bridge

Voter Record B:

JAMES TIBERIUS KIRK 03/22/2223 M 1701 Enterprise Bridge

Example #7: The following set of records has the exact match (Hamming Distance = 0) of full name and full birthdate (including year), same address but different gender and voter ID numbers.  There was no duplicated votes in the same election detected between the two ID numbers.

Voter Record A:

MAXWELL QUAID CLINGER 11/03/2004 M 4077 MASH DR

Voter Record B:

MAXWELL QUAID CLINGER 11/03/2004 U 4077 MASH DR

Results Dataset:

A full version of the aggregated excel data is provided below, however all voter information (ID, first name, middle name, last name, dob, gender, address) have been removed and replaced by a one-way hash number, with randomized salt, based on the voter ID. The full file with specific voter information can be provided to parties authorized by ELECT to recieve and process voter information, Election Officials, or Law Enforcement upon request.

On the mathematical probability of matches:

Below I present the theory and derivation as to how I arrived at the expected value of 11 collisions (+/- 3) as mentioned above. I’ve tried to make the derivation below as digestible as possible, with accessible references, but it is admittedly still a very technical read. I think its important to “show my work” on the subject, though, and I present it here and am happy to take comments and criticism (contact).

Q: How much of a chance do we actually have of getting an exact (Hamming distance of 0) collision in the full name and full date of birth? Well, there is a similar and well known probability puzzle that asks how many random people do you need to approximately have a 50% chance of 2 of them sharing the same birthday (not including the year of birth). This is known as the “Birthday Problem” in probability theory, and rather surprisingly, the answer is that you only need about 23 people in your population sample to have a 50% probability that 2 of those people will share a day-of-year of birth. To quote the wikipedia article on the matter “… While it may seem surprising that only 23 individuals are required to reach a 50% probability of a shared birthday, this result is made more intuitive by considering that the birthday comparisons will be made between every possible pair of individuals. With 23 individuals, there are 23 × 22/2 = 253 pairs to consider, far more than half the number of days in a year.” The same mathematics of the birthday problem is the basis of the Birthday Attack cryptographic exploit, and it is therefore a well-studied problem in cryptography and cyber security.

Figure 1: The computed probability of at least two people sharing a birthday versus the number of people. A recreation of the classic “Birthday Problem”.

Now, as interesting as the toy birthday problem is as described above, it is over simplified for the problem we are looking at here. Firstly, the problem setup assumes independent and identically distributed random variable (e.g. an “IID” set of variables). While this is not exactly the case, the IID assumption provides for a computable first order estimate, and in the case of the classical birthday problem the estimate has been shown to be fairly accurate under experimental conditions.

Secondly, when we start additionally considering the year of birth, or sharing of first names, middle names and last names, things get much more complicated to compute, but the method is the same. We want to determine the probability of 2 people sharing the same First Name, Middle Name, Last Name, Suffix, Month-of-Birth, Day-of-Birth and Year-of-Birth in the population of unique registrants in the Registered Voter List. This means that in addition to the 365 day-of-birth possibilities, we need to estimate the number of possible years to cover, the number of possible first names, the number of possible middle names, the number of possible last names, the number of possible suffix strings and then include these possibilities into the same formulation as the birthday problem setup.

For determining how many years we should cover, I will simply use the average life expectancy of approximately 79 years. We can therefore update our N value of the birthday problem from 365 to 365 * 79 = 28835. When we perform the same analysis as the standard birthday problem with just this new parameter included, we end up needing 200 people in our sample population to have a 50% probability of of 2 people having a match.

Figure 2: The computed probability of at least two people sharing a birthday versus the number of people in the sample population. A recreation of the classic “Birthday Problem”, but we’ve updated the analysis to include the year of birth, and assumed the average life expectancy of 79 years. This moves the 50% crossover point to a population size of 200 from 23 for the standard Birthday Problem setup.

A similar analysis can be done with the number of names being considered, etc. For each (assumed independent and uniform) variable we add to the setup, we multiply the number of possible states (N) by the number of unique variable settings.

We can estimate the universe of possible names using the frequentist method from the RVL data itself: We know that we have 6,127,859 unique voter ID’s in the RVL, and there are 14 unique SUFFIX entries, 291,368 unique FIRST names, 405,591 unique MIDDLE names, and 465,185 unique LAST names. So multiplying out 365 x 79 x 14 x 291368 x 405591 x 465185 = 2.22 x 10^22 potential states to consider.

Now unfortunately, as we start dealing with bigger and bigger N values the ability of computers to maintain the necessary precision to carry out the mathematics for direct computation becomes harder and harder, eventually resulting in Infinite or divide-by-zero answers as the probabilities get smaller and smaller. So lets begin by first determining if we can find the 50% crossover point for the unique voter ID population size. We find that we only need 410 unique First, Middle, and Last names (each) to break the 50% probability limit.

Figure 3: The computed probability of at least two people sharing a first name, middle name, last name, suffix, month-of-birth, day-of-birth, year-of-birth versus the number of people in the sample population. This assumes the Nyears = 79, Nsuffix = 14, Nfirst = 410, Nmiddle = 410, Nlast = 410.

As we increase the number of unique (first, middle, last) names under consideration, we find that we very quickly reduce the probability to near zero (again … this is assuming an IID set of variables … more on that later). In fact we only need to assume that there are 1300 unique first names, middle names and last names before the probability drops to under 1%. This is two full orders of magnitude below the actual number of unique first names, middle names and last names (each) that we find by simple examination of the RVL file, so the actual probability of a collision under these conditions should be much, much, much lower. While not exactly zero, it is computationally indistinguishable from zero given machine precision. Note (again) that this formulation is still simplified in that it assumes a uniform distribution within the N possible states, but it still serves to give a first order approximation and sanity check.

Figure 4: The computed probability of at least two people sharing a first name, middle name, last name, suffix, month-of-birth, day-of-birth, year-of-birth versus the number of people in the sample population. This assumes the Nyears = 79, Nsuffix = 14, Nfirst = 1300, Nmiddle = 1300, Nlast = 1300.

As we start approaching the limit of computational precision we have to resort to approximation methods for computing the very small, but non-zero probability of collision given the actual number of unique first, middle and last names observed in the RVL dataset. We can use the Taylor series expansion for small powers in order to do this, and our equation for computing the probability becomes: Pb = 1 – exp(-k*(k-1) / (2 *N)).

Replicating our earlier example in Figure 4 above with Nfirst == Nmiddle == Nlast == 1300 to show the comparison of the Taylor expansion to the explicit computation produces the graphic in Figure 5 below. We see that the small value approximation is close, but slightly over-estimates the directly computed probability for IID variables.

Figure 5: The computed probability of at least two people sharing a first name, middle name, last name, suffix, month-of-birth, day-of-birth, year-of-birth versus the number of people in the sample population. This assumes the Nyears = 79, Nsuffix = 14, Nfirst = 1300, Nmiddle = 1300, Nlast = 1300.

When we perform this Taylor series approximation and look to find the number of records required in order to obtain a 50% probability that any 2 records would match given our updated universe of possible matches, we end up with requiring K = 176,000,000,000, or 176 Billion records. When we again try to evaluate the Taylor series for the explicit number of unique Voter ID’s present in the RVL file, which is just over 6M, we again obtain a number that is computationally indistinguishable from 0. (To be absolutely meticulous … its a bigger number that is indistinguishable from 0 than we previously computed, but it is still indistinguishable from zero.)

Figure 5: The computed probability of at least two people sharing a first name, middle name, last name, suffix, month-of-birth, day-of-birth, year-of-birth versus the number of people in the sample population. This assumes the Nyears = 79, Nsuffix = 14, Nfirst = 291368, Nmiddle = 405591, Nlast = 465185, and the computed 50% crossover point occurs at approximately 176 Billion samples required.

Another Implementation note: In order to explicitly code the above direct computations we also need to do some clever tricks with logarithms in order to avoid numerical overflow / underflow issues as much as possible. The formula for computing the permutations, which is N! / (N-K)! = N x (N-1) x … x (N-K+1) can have numerical issues when N becomes large. However if we take the base-10 logarithm of the equation, we can use the product and quotient rules of logarithms to compute the result and avoid numerical overflow: log10( N! / (N-K)! ) = log10(N!) – log((N-K)!) = log10(N) + log10(N-1) + … + … log10(N-K+1), which is a much more stable computation.

We can perform a similar trick in order to compute the denominator of N^k by using the power property of logarithms such that log10( N^k ) = k x log10(N).

You must of course remember to reverse the logarithm once you’ve computed the log-sums. So the final computation of Pb becomes the following:

Vnr = log10(N) + log10(N-1) + … + … log10(N-K+1), where N is the number of possible states N = 365 x Nyears x Nfirst x Nmiddle x Nlast x Nsuffix.

Vt = k x log10(N)

Pa_log10 = Vnr – Vt = log10(Pa) = log10(Vnr/Vt)

Pb = 1 – 10^(Pa)

Updating from uniform distributions to non-uniform distributions

So what happens when we take into account the fact that names and birthdays are not uniformly distributed? (e.g. the last name of “Smith” is more frequent than “Sandeval”) This fact increases the probability of a collision occurring in the dataset. This increase also makes intuitive sense as we can anecdotally observe that coincident names and birthdates, while still rare … do actually happen in real life with common names.

However, in the non-uniform case we don’t have as nearly of a nice closed set of formulas for computing the probability. What we can do instead to estimate the probability is perform a number of Monte Carlo simulations of selecting K values from the weighted possibilities, and determine how many collisions occurred in each simulation trial. By setting K equal to the number of unique Voter ID values in the RVL dataset, we can directly answer the question via simulation of “how many collisions of First+Middle+Last+Suffix+DOB should we expect when looking at the VA Registered Voter List file“?

We can determine the weightings for each variable easily enough from the distributions of unique values in the data itself.

The below MATLAB weightedCollisionSim(…) function is a program that can be used to perform this analysis. It assumes that the RVL table object is a global variable to setup the trials, and uses the MATLAB built-in randsample(…) function to perform each draw.

After 100 simulation runs, the results are that for the K=6,127,859 unique voter ID’s in the RVL, we should expect to have an average of about 11 collisions at Hamming distance of 0, with a standard deviation of roughly 3.

I will note that as a validation and verification step, the MATLAB simulation code below, when used with uniform sampling, produces similar results to what we analytically derived above.

function [p,m,s] = weightedCollisionSim(k,ntrials,varargin)
% To compute the probability the ntrials must be >> 1:
% [p,m,s] = weightedCollisionSim(k,ntrials,values1,weights1,...,values2,weights2)
% [p,m,s] = weightedCollisionSim(k,ntrials,Nvalues1,weights1,...,Nvalues2,weights2)
%
% OUTPUTS:
% p = Probability of a collision
% m = mean number of collisions
% s = standard deviation of collisions

if nargin == 0
    global rvl; % Assume the RVL is an available global var

    ntrials = 100; % Number of trials
   
    % Population size set as num of unique voter IDs in RVL
    npop = numel(unique(rvl.IDENTIFICATION_NUMBER));

    % Convert the DOB strings to datetime objects
    dob = datetime(rvl.DOB);

    % How many unique days of the year are there?
    [ud,uda,udb] = unique(day(dob,'dayofyear'));
    % How often do they occur?
    nud = accumarray(udb,1,[numel(ud),1]);
    Ndays = numel(ud);

    % How many unique years of birth are there?
    [uy,uya,uyb] = unique(year(dob));
    % How often do they occur?
    nuy = accumarray(uyb,1,[numel(uy),1]);
    Nyears = numel(uy);

    % How many unique suffix strings are there?
    [us,usa,usb] = unique(rvl.SUFFIX);
    % How often do they occur?
    nus = accumarray(usb,1,[numel(us),1]);
    Nsuffix = numel(us);

    % How many unique first names are there?
    [uf,ufa,ufb] = unique(rvl.FIRST_NAME);
    % How often do they occur?
    nuf = accumarray(ufb,1,[numel(uf),1]);
    Nfirst = numel(uf);

    % How many unique middle names are there?
    [um,uma,umb] = unique(rvl.MIDDLE_NAME);
    % How often do they occur?
    num = accumarray(umb,1,[numel(um),1])
    Nmiddle = numel(um);

    % How many unique last names are there?
    [ul,ula,ulb] = unique(rvl.LAST_NAME);
    % How often do they occur?
    nul = accumarray(ulb,1,[numel(ul),1]);
    Nlast = numel(ul);
        
    % Initializing the weighting vectors
    w0 = nus;
    w1 = nud;
    w2 = nuy;
    w3 = nuf;
    w4 = num;
    w5 = nul;

    % Recursively compute results and return
    [p,m,s] = weightedCollisionSim(npop,ntrials,1:Nsuffix,w0,1:Ndays,w1,1:Nyears,w2,...
        1:Nfirst,w3,1:Nmiddle,w4,1:Nlast,w5);
    return
end

if nargin < 2 || isempty(ntrials)
    ntrials = 1;
end

nc = zeros(ntrials,1);
for j = 1:ntrials
    fprintf('Trial %d\n',j);
    y = zeros(k,numel(varargin)/2);
    m = 1;
    for i = 1:2:numel(varargin)
        w = varargin{i+1};
        v = varargin{i};
        if ~isempty(w) && isvector(w)
            % Non-uniform weightings
            y(:,m) = randsample(v,k,true,w);
        else
            % Uniform sampling
            y(:,m) = randsample(v,k,true);
        end
        m = m+1;
    end
    [u,~,ib] = unique(y,'rows');
    nu = accumarray(ib,1,[size(u,1),1]);
    nc(j) = sum(nu > 1);
end
p = mean(nc>0);
m = mean(nc);
s = std(nc);
Categories
Election Data Analysis Election Forensics Election Integrity programming technical

Distribution of Invalid Voter Addresses and Absentee Ballots in VA 2022 General Election, with Mailing Address Substitution

Forward:

In a previous post I documented the results from an United States Postal Service (USPS) National Change of Address (NCOA) database check with the 2022 Virginia Registered Voter List (RVL) primary address records joined with the Daily Absentee List (DAL) reports of absentee ballots cast. There was a significant public reaction to the fact that over 15K RVL primary addresses associated with voters who cast ballots in the VA 2022 general election were not recognized by the USPS database as valid addresses (among other issues). I reported the data as I found it, but a common commentary was that my analysis did not account for rural voters who do not have a traditional street address, or that do not have mail receptacles and use PO boxes as their primary delivery mechanism.

The requirements for voter registration and primary address are specified by the VA Constitution, Federal and VA law, and require the following:

  1. The VA constitution (Section II-1 and II-2) and the National Voter Registration Act (NVRA) requires that registered voter primary addresses be an actual physical (and deliverable) street address. The de-facto arbiter of what defines a recognizable and/or deliverable address is the USPS, therefore a street address that is not recognized or is undeliverable according to USPS is not compliant. We should be making every effort to ensure that the primary address associated with each voter is able to be correctly recognized and translated into a deliverable address by the USPS. This may require adjusting VA’s data normalization policies such that input street addresses are correctly mapped to USPS addresses, or a legislative action in order to correct.
  2. There is also the legal restriction that registered VA voters are not allowed to use PO Boxes as their primary registered voter address. There is an exception that “protected” voters are allowed to provide a PO box address to be displayed in public records, but their actual address on file must still be a physical address.
  3. The 2022 VA GREB, section 6..2.3 states that in special cases, a rural voter may supply the name of the highway and enough detail in the comments section of the voter application that the registrar may ascertain where the physical address is. This seems (IMO, but I’m not a lawyer) in contradiction to the language in the VA constitution and in the NVRA that make the implicit requirement of deliverable addresses. Also, if the address as entered is not recognized by the USPS NCOA system, then it will constantly be generating validation errors every time it is checked against the NCOA database, which VA is currently required to do at least annually. Again, this may require adjusting VA’s data normalization policies such that input street addresses are correctly mapped to USPS addresses, or a legislative action in order to correct.
  4. Additionally the USPS is supposed to recognize “Post Box Street Address” (PBSA) locations such that when someone addresses an envelope or package to the street address of a residence that is served only by delivery to a PO box, the USPS should automatically recognize this and adjust accordingly. Per the NCOA documentation, the USPS NCOA database checks are supposed to be doing this detection and translation already.

And again, to be clear, my point is not to accuse anyone of malicious intent or wrongdoing … I am simply trying to point out that the way we are using and managing our data is discrepant with what the requirements actually are. We either need to change the data and our practices to conform to the legal requirements, or change the law to fit how we are actually (and practically) using the data.

That being said, and in order to show that there are still significant data issues even after adjusting for rural routes, etc., I’ve re-run my NCOA analysis to account for records with invalid primary addresses but that had valid mailing addresses, even if those mailing addresses are PO Boxes. Any RVL record with a primary address that was marked invalid by an NCOA check, but had a different mailing address AND that mailing address was returned from a second NCOA check as NOT-invalid (even if it was a PO Box), was replaced with the mailing address listing and NCOA results. While the requirement is still that primary addresses are supposed to be valid, this re-processing and allowing for primary OR mailing addresses to be used is more in line with the way VA has actually implemented its voter registrations across the state … even though that implementation seems to be at odds with the requirements.

The rest of the analysis is performed the same as before, and is presented in the same format for consistency. I have updated the dates and numerical results as appropriate, but have otherwise kept the format and much of the language and layout the same as the previous analysis.


BLUF (Bottom Line Up Front):

There were 2,164 ballots cast during early voting in the VA 2022 General Election where the voters’ (primary or mailing) registered address on record were flagged as “Invalid” by a National Change of Address (NCOA) database check. If we include addresses that were identified as 90-day Vacant the total rises to 4,274. (The previous analysis that used only the primary addresses returned 15,419 invalid records with associated ballots, and went up to 17,244 when including addresses flagged as vacant.)

A certified commercial provider of NCOA data verification was used to facilitate this analysis on raw data obtained from the VA Dept of Elections (“ELECT”). It is not technically possible to obtain a truly time-synchronized complete set of data for any election due to the way elections are run in VA, but we made every effort to obtain the data from the state as close in time as was possible. The NCOA database is maintained and curated by the United States Postal Service (USPS).

For those wishing to review specific entries, or to help validate these issues, and who are part of an organization that is able to receive and handle election information according to VA law and the VA Dept of Elections requirements, you may contact us to request the raw data breakdowns. We will need to validate your organization or employment and will make data available as legally allowed.

Commentary and Discussion:

I would like to be very clear: We are simply presenting the data as compiled to facilitate public discourse. We have strived to only utilize data directly obtained from authoritative sources (ELECT, the USPS via TrueNCOA provider).

The designation of “Invalid” addresses is according to the definition by the USPS and TrueNCOA, i.e. the TrueNCOA check has reported the addresses as listed in the RVL have no match in the USPS database. Invalid addresses do not include things like valid P.O. Boxes or valid rural addresses that are automatically recognized and translated by USPS to Post Box Street Address (PBSA) records.

The VA Constitution (Section II-1 and II-2) specify the requirements for voter eligibility to include that voters are required to supply a primary address for their registration record, regardless of their method of voting. VA is required by law to consistently maintain and validate these records. Based on the below analysis, the data shows that there is a small but statistically significant number of “Invalid” addresses associated for voters who cast ballots in the Nov 2022 election.

Continuing EPEC’s mission to promote voter participation, analyze election technology, and educate the public about best practices in managing election technology systems; we are providing the below analysis in order to educate and inform the public, legislators and elections officials about the existence of these discrepancies.

Details:

After receiving the results of a National Change of Address (NCOA) database check on the registration (not the temporary) addresses in the latest VA Registered Voter List (RVL). I’ve gone through and collated the flagged addresses and reconciled them with the entries in the Daily Absentee List (DAL) file records provided by the VA Dept. of Elections (“ELECT”).

The DAL file (dated 2022-12-13) provides a records of all of the voters that cast absentee (either Early In-Person or Mail-In) ballots in the election, and the RVL (dated 2022-11-23) gives all of the registered voter addresses and other pertinent information. Both datasets come directly from the VA Dept. of Elections and must be purchased. Total cost was ~$7000. The two datasets can be tied together using the voter Identification Number that is assigned to each (supposedly) unique voter by the state. Entries in the RVL should be unique to each registered voter (although there are a small number of duplicate voter IDs that I have seen … but thats for another post), whereas the DAL file can have multiple entries attributed to a single voter recording the various stages of ballot processing.

The NCOA check was performed on all addresses in the RVL file in order to detect recent moves, invalid addresses, vacant addresses, P.O. Boxes, commercial addresses, etc. The NCOA check takes multiple days to run using a commercial service provider and was executed between 2022-12-19 through 2022-12-30.

*** As noted above, in this run I substituted those primary addresses that evaluated to “invalid” with their corresponding “mailing” address records that had been recognized as “valid”, if possible. ***

Results:

Raw TrueNCOA Processing result stats on the full RVA dataset:
NCOA Processing of VA RVL RecordsFull RVL 11-23 Primary Addresses (12/30/2022)%Unique RVL Mailing Addresses (12/30/2022)%
Records Processed6,127,856
216,896
18 – Month NCOA Moves282,6694.61%8,2473.80%
48 – Month NCOA Moves163,9312.68%8,0993.73%
Moves with no Forwarding Address20,9130.34%1,4150.65%
Total NCOA Moves467,5137.63%17,7618.19%
Vacant Flag29,3150.48%22,76310.49%
DPV Updated/Address Corrected Records601,5399.82%18,5558.55%
DPV Deliverable Records5,835,23095.22%165,93976.51%
DPV Non-Deliverable Records185,7523.03%35,10016.18%
LACS Updated (Rural Address converted to Street Address)33,4970.55%2540.12%
Residential Delivery Indicator5,970,53197.43%194,34289.60%
Addresses matched to the USPS Database6,020,98398.26%201,04092.69%
Invalid Addresses107,3041.75%15,9687.36%
Expired Addresses10,8220.18%2760.13%
Business Move (B)3400.01%610.03%
Family Move (F)1168741.91%5,3902.49%
Individual Move (I)3502995.72%12,3105.68%
General Delivery Address1590.00%1700.08%
High Rise Address76492212.48%7,0973.27%
PO Box Address280400.46%158,35973.01%
Rural Route Address810.00%2690.12%
Single Family Address524332185.57%33,58215.48%
Unknown519030.85%15,5377.16%
Reporting as presented from the TrueNCOA data service. The TrueNCOA data dictionary is presented here.
Combining NCOA results of RVL Addresses with the DAL data:
Vacant Addresses:

There were 2,112 records across the state with addresses that have been flagged as (90-day) “Vacant” by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election according to the DAL file. Of those records, 1,542 were Early In-Person and 547 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location. (Note this is actually an increase over the previous results that were not adjusted for potential mailing address matches.)

P.O. Boxes (Non-protected):

There were 13,492 records across the state with addresses that have been flagged as P.O. Box Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election AND were NOT listed as protected entries according to the DAL file. (VA allows for voters who have a legal protective order to list a P.O. Box as their address of record on public documents) Of those records, 11,426 were Early In-Person and 2020 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Note, this is actually an increase over the previous results that were not adjusted for potential mailing address matches. This is not unexpected, as there are a number of PO box mailing addresses that have been substituted in for invalid primary street addresses. PO Boxes are not supposed to be allowed as registered voting addresses. (You should talk to your legislators about this discrepancy, because this catch-22 will likely need to be changed by legislative action!)

Invalid Addresses:

There were 2,164 records across the state with addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election. Of those records, 1,535 were Early In-Person and 598 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location. Note, this is actually a significant decrease over the previous results that were not adjusted for potential mailing addresses. This is not unexpected, as there were a number of invalid primary address with a valid (even if PO Box) mailing address. (The fact that the primary addresses do not validate with the USPS is it’s own issue, but we are ignoring that for this analysis as noted above.)

Invalid OR Vacant Addresses:

There were 4,274 records across the state with addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election. Of those records, 3,077 were Early In-Person and 1,143 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Record of Moves Out-of-State:

There were 797 records that had records of NCOA moves to valid out-of-state addresses before 2022-08 that also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election. Of those records, 346 were Early In-Person and 450 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Results By District:

District 01:
Invalid Addresses:

There were 213 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 01. Of those records, 170 were Early In-Person and 38 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 364 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 01. Of those records, 288 were Early In-Person and 71 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

District 02:
Invalid Addresses:

There were 183 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 02. Of those records, 147 were Early In-Person and 33 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 441 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 02. Of those records, 344 were Early In-Person and 90 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

District 03:
Invalid Addresses:

There were 37 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 03. Of those records, 27 were Early In-Person and 10 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 324 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 03. Of those records, 223 were Early In-Person and 94 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

District 04:
Invalid Addresses:

There were 116 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 04. Of those records, 104 were Early In-Person and 12 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 318 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 04. Of those records, 256 were Early In-Person and 59 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

District 05:
Invalid Addresses:

There were 383 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 05. Of those records, 299 were Early In-Person and 81 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 584 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 05. Of those records, 445 were Early In-Person and 134 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

District 06:
Invalid Addresses:

There were 202 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 06. Of those records, 144 were Early In-Person and 53 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 400 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 06. Of those records, 301 were Early In-Person and 91 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

District 07:
Invalid Addresses:

There were 243 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 07. Of those records, 169 were Early In-Person and 68 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 370 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 07. Of those records, 274 were Early In-Person and 87 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

District 08:
Invalid Addresses:

There were 165 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 08. Of those records, 69 were Early In-Person and 94 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 413 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 08. Of those records, 226 were Early In-Person and 183 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

District 09:
Invalid Addresses:

There were 328 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 09. Of those records, 257 were Early In-Person and 66 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 479 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 09. Of those records, 373 were Early In-Person and 100 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

District 10:
Invalid Addresses:

There were 174 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 10. Of those records, 104 were Early In-Person and 68 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 246 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 10. Of those records, 164 were Early In-Person and 82 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

District 11:
Invalid Addresses:

There were 122 records with registered addresses that have been flagged as “Invalid” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 11. Of those records, 46 were Early In-Person and 75 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Invalid OR Vacant Addresses:

There were 335 records with registered addresses that have been flagged as “Invalid” or “Vacant” Addresses by the NCOA check and also had an Early In-Person, Mail-In, FWAB or Provisional ballot cast in the VA 2022 General Election in District 11. Of those records, 181 were Early In-Person and 152 were Mail-In. The geographic distribution of the addresses (based on the ZIP+4), as reported by the NCOA service, is shown below, with the size of the marker proportional to the total number of counts at that ZIP+4 location.

Summary Data Files by Locality:

The complete set of graphics and statistics for each locality, and each congressional district in VA can be downloaded here as a zip file. The tabulated summary results can also be downloaded in excel, csv, or numbers format:

Categories
Election Data Analysis Election Forensics Election Integrity Interesting programming technical

Updates to Henrico CVR processing

Note: For background information, please see my introduction to Cast Vote Records processing and theory here: Statistical Detection of Irregularities via Cast Vote Records.

Since I posted my initial analysis of the Henrico CVR data, one comment was made to me by a member of the Texas election integrity group I have been working with: We have been assuming, based on vendor documentation and the laws and requirements in various states, that when a cast vote record is produced by vendor software the results are sorted by the time the ballot was recorded onto a scanner. However, when looking at the results that we’ve been getting so far and trying to figure out plausible explanations for what we were seeing, he realized it might be the case that the ordering of the CVR entries are being done by both time AND USB stick grouping (which is usually associated with a specific scanner or precinct) but then simply concatenating all of those results together.

While there isn’t enough information in the Henrico CVR files to breakout the entries by USB/Scanner, and the Henrico data has record ID numbers instead of actual timestamps, there is enough information to break out them by Precinct, District and Race, with the exception of the Central Absentee Precincts (CAP) entries where we can only break them out by district given the metadata alone. However, with some careful MATLAB magic I was able to cluster the results marked as just “CAP” into at least 5 different sub-groupings that are statistically distinct. (I used an exponential moving average to discover the boundaries between groupings, and looking at the crossover points in vote share.) I then relabeled the entries with the corresponding “CAP 1”, “CAP 2”, … , “CAP 5” labels as appropriate. My previous analysis was only broken out by Race ID and CAP/Non-CAP/Provisional category.

Processing in this manner makes the individual distributions look much cleaner, so I think this does confirm that there is not a true sequential ordering in the CVR files coming out of the vendor software packages. (If they would just give us the dang timestamps … this would be a lot easier!)

I have also added a bit more rigor to the statistics outlier detection by adding plots of the length of observed runs (e.g. how many “heads” did we get in a row?) as we move through the entries, as well as the plot of the probability of this number of consecutive tosses occurring. We compute this probability for K consecutive draws using the rules of statistical independence, which is P([a,a,a,a]) = P(a) x P(a) x P(a) x P(a) = P(a)^4. Therefore the probability of getting 4 “heads” in a row with a hypothetical 53/47 weighted coin would be .53^4 = 0.0789. There are also plotted lines for a probability 1/#Ballots for reference.

Results

The good news is that this method of slicing the data and assuming that the Vendor is simply concatenating USB drives seems to produce much tighter results that look to obey the expected IID distributions. Breaking up the data this way resulted in no plot breaking the +/- 3/sqrt(N-1) boundaries, but there still are a few interesting datapoints that we can observe.

In the plot below we have the Attorney Generals race in the 4th district from precinct 501 – Antioch. This is a district that Miyares won handily 77%/23%. We see that the top plot of the cumulative spread is nicely bounded by the +/- 3/sqrt(N-1) lines. The second plot from the top gives the vote ratio in order to compare with the work that Draza Smith, Jeff O’Donnell and others are doing with CVR’s over at Ordros.com. The second from bottom plot gives the number k of consecutive ballots (in either candidates favor) that have been seen at each moment in the counting process. And the bottom plot raises either the 77% or 23% overall probability to the k-th power to determine the probability associated with pulling that many consecutive Miyares or Herring ballots from an IID distribution. The most consecutive ballots Miyares received in a row was just over 15, which had a .77^15 = 0.0198 or 1.98% chance of occurring. The most consecutive ballots Herring received was about 4, which equates to a probability of occurrence of .23^4 = 0.0028 or 0.28% chance. The dotted line on the bottom plot is referenced at 1/N, and the solid line is referenced at 0.01%.

But let’s now take a look at another plot for the Miyares contest in another blowout locality with 84% / 16% for Miyares. The +/- 3/sqrt(N-1) limit nicely bounds our ballot distribution again. There is, however, an interesting block of 44 consecutive ballots for Miyares about halfway through the processing of ballots. This equates to .84^44 = 0.0004659 or a 0.04659% chance of occurrence from an IID distribution. Close to this peak is a run of 4 ballots for Herring which doesn’t sound like much, but given the 84% / 16% split, the probability of occurrence for that small run is .16^4 = 0.0006554 or 0.06554%!

Moving to the Lt. Governors race we see an interesting phenomenon where where Ayala received a sudden 100 consecutive votes a little over midway through the counting process. Now granted, this was a landslide district for Ayala, but this still equates to a .92^100 = 0.000239 or 0.0239% chance of occurrence.

And here’s another large block of contiguous Ayala ballots equating to about .89^84 = 0.00005607 or 0.0056% chance of occurrence.

Tests for Differential Invalidation (added 2022-09-19):

“Differential invalidation” takes place when the ballots of one candidate or position are invalidated at a higher rate than for other candidates or positions. With this dataset we know how many ballots were cast, and how many ballots had incomplete or invalid results (no recorded vote in the cvr, but the ballot record exists) for the 3 statewide races. In accordance with the techniques presented in [1] and [2], I computed the plots of the Invalidation Rate vs the Percent Vote Share for the Winner in an attempt to observe if there looks to be any evidence of Differential Invalidation ([1], ch 6). This is similar to the techniques presented in [2], which I used previously to produce my election fingerprint plots and analysis that plotted the 2D histograms of the vote share for the winner vs the turnout percentage.

The generated the invalidation rate plots for the Gov, Lt Gov and AG races statewide in VA 2021 are below. Each plot below is representing one of the statewide races, and each dot is representing the ballots from a specific precinct. The x axis is the percent vote share for the winner, and the y axis is computed as 100 – 100 * Nvotes / Nballots. All three show a small but statistically significant linear trend and evidence of differential invalidation. The linear regression trendlines have been computed and superimposed on the data points in each graph.

To echo the warning from [1]: a differential invalidation rate does not directly indicate any sort of fraud. It indicates an unfairness or inequality in the rate of incomplete or invalid ballots conditioned on candidate choice. While it could be caused by fraud, it could also be caused by confusing ballot layout, or socio-economic issues, etc.

Full Results Download

References

  • [1] Forsberg, O.J. (2020). Understanding Elections through Statistics: Polling, Prediction, and Testing (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781003019695
  • [2] Klimek, Peter & Yegorov, Yuri & Hanel, Rudolf & Thurner, Stefan. (2012). Statistical Detection of Systematic Election Irregularities. Proceedings of the National Academy of Sciences of the United States of America. 109. 16469-73. https://doi.org/10.1073/pnas.1210722109.
Categories
Election Data Analysis Election Forensics Election Integrity Interesting programming technical

CVR Analysis – Henrico County VA 2021

Update 2022-08-29 per observations by members of the Texas team I am working with, we’ve been able to figure out that (a) the vendor was simply concatenating data records from each machine and not sorting the CVR results and (b) how to mostly unwrap this affect on the data to produce much cleaner results. The results below are left up for historical reference.

For background information, please see my introduction to Cast Vote Records processing and theory here: Statistical Detection of Irregularities via Cast Vote Records. This entry will be specifically documenting the results from processing the Henrico County Virginia CVR data from the 2021 election.

As in the results from the previous post, I expanded the theoretical error bounds out to 6/sqrt(N) instead of 3/sqrt(N) in order to give a little bit of extra “wiggle room” for small fluctuations.

However the Henrico dataset could only be broken up by CAP, Non-CAP or Provisional. So be aware that the CAP curves presented below contain a combination of both early-vote and mail-in ballots.

The good news is that I’ve at least found one race that seems to not have any issues with the CVR curves staying inside the error boundaries. MemberHouseOfDelegates68thDistrict did not have any parts of the curves that broke through the error boundaries.

The bad news … is pretty much everything else doesn’t. I cannot tell you why these curves have such differences from statistical expectation, just that they do. We must have further investigation and analysis of these races to determine root cause. I’ve presented all of the races that had sufficient number of ballots below (1000 minimum for the race a whole, and 100 ballot minimum for each ballot type).

Categories
Election Data Analysis Election Forensics Election Integrity Interesting programming technical

Statistical Detection of Irregularities via Cast Vote Records

There has been a good amount of commotion regarding cast vote records (CVRs) and their importance lately. I wanted to take a minute and try and help explain why these records are so important, and how they provide a tool for statistical inspection of election data. I also want to try and dispel any misconceptions as to what they can or can’t tell us.

I have been working with other local Virginians to try and get access to complete CVRs for about 6 months (at least) in order to do this type of analysis. However, we had not had much luck in obtaining real data (although we did get a partial set from PWC primaries but it lacked the time-sequencing information) to evaluate until after Jeff O’Donnell (a.k.a. the Lone Raccoon) and Walter Dougherity did a fantastic presentation at the Mike Lindell Moment of Truth Summit on CVRs and their statistical use. That presentation seems to have broken the data logjam, and was the impetus for writing this post.

Just like the Election Fingerprint analysis I was doing earlier that highlighted statistical anomalies in election data, this CVR analysis is a statistics based technique that can help inform us as to whether or not the election data appears consistent with expectations. It only uses the official results as provided by state or local election authorities and relies on standard statistical principles and properties. Nothing more. Nothing less.

What is a cast vote record?

A cast vote record is part of the official election records that need to be maintained in order for election systems to be auditable. (see: 52 USC 21081 , NIST CVR Standard, as well as the Virginia Voting Systems Certification Standards) They can have many different formats depending on equipment vendor, but they are effectively a record of each ballot as it was recorded by the equipment. Each row in a CVR data table should represent a single ballot being cast by a voter and contain, at minimum, the time (or sequence number) when the ballot was cast, the ballot type, and the result of each race. Other data might also be included such as which precinct and machine performed the scanning/recording of the ballot, etc. Note that “cast vote records” are sometimes also called “cast voter records”, “ballot reports” or a number of other different names depending on the publication or locality. I will continue to use the “cast vote record” language in this document for consistency.

Why should we care?

The reason these records are so important, is based on statistics and … unfortunately … involves some math to fully describe. But to make this easier, let’s try first to walk through a simple thought experiment. Let’s pretend that we have a weighted, or “trick” coin, that when flipped it will land heads 53% of the time and land tails 47% of the time. We’re going to continuously flip this coin thousands of times in a row and record our results. While we can’t predict exactly which way the coin will land on any given toss, we can expect that, on average, the coin will land with the aforementioned 53/47 split.

Now because each coin toss constitutes an independent and identically distributed (IID) probability function, we can expect this sequence to obey certain properties. If as we are making our tosses, we are computing the “real-time” statistics of the percentage of head/tails results, and more specifically if we plot the spread (or difference) of those percentage results as we proceed we will see that the spread has very large swings as we first begin to toss our coin, but very quickly the variability in the spread becomes stable as more and more tosses (data) are available for us to average over. Mathematically, the boundary on these swings is inversely proportional to the square root of how many tosses are performed. In the “Moment of Truth” video on CVRs linked above, Jeff and Walter refer to this as a “Cone of Probability”, and he generates his boundary curves experimentally. He is correct. It is a cone of probability as its really just a manifestation of well-known and well-understood Poisson Noise characteristic (for the math nerds reading this). In Jeff’s work he uses the ratio of votes between candidates, while I’m using the spread (or deviation) of the vote percentages. Both metrics are valid, but using the deviation has an easy closed-form boundary curve that we don’t need to generate experimentally.

In the graphic below I have simulated 10 different trials of 10,000 tosses for a distribution that leans 53/47, which is equivalent to a 6% spread overall. Each trial had 10,000 random samples generated as either +1 or -1 values (a.k.a. a binary “Yes” or “No” vote) approximating the 53/47 split and I plotted the cumulative running spread of the results as each toss gets accumulated. The black dotted outline is the 95% confidence interval (or +/-3x the standard deviation) across the 10 trials for the Nth bin, and the red dotted outline is the 3/sqrt(n-1) analytical boundary.

So how does this apply to election data?

In a theoretically free and perfectly fair election we should see similar statistical behavior, where each coin toss is replaced with a ballot from an individual voter. In a perfect world we would have each vote be completely independent of every other vote in the sequence. In reality we have to deal with the fact that there can be small local regions of time in which perfectly legitimate correlations in the sequence of scanned ballots exist. Think of a local church who’s congregation is very uniform and they all go to the polls after Sunday mass. We would see a small trend in the data corresponding to this mass of similar thinking peoples going to the polls at the same time. But we wouldn’t expect there to be large, systematic patterns, or sharp discontinuities in the plotted results. A little bit of drift and variation is to be expected in dealing with real world election data, but persistent and distinct patterns would indicate a systemic issue.

Now we cannot isolate all of the variables in a real life example, but we should try as best as possible. To that effect, we should not mix different ballot types that are cast in different manners. We should keep our analysis focused within each sub-group of ballot type (mail-in, early-vote, day-of, etc). It is to the benefit of this analysis that the very nature of voting, and the procedures by which it occurs, is a very randomized process. Each sub-grouping has its own quasi-random process that we can consider.

While small groups (families, church groups) might travel to the in-person polls in correlated clusters, we would expect there to be fairly decent randomization of who shows up to in-person polls and when. The ordering of who stands in line before or after one another, how fast they check-in and fill out their ballot, etc, are all quasi-random processes.

Mail-in ballots have their own randomization as they depend on the timing of when individuals request, fill-out and mail their responses, as well as the logistics and mechanics of the postal service processes themselves providing a level of randomization as to the sequence of ballots being recorded. Like a dealer shuffling a deck of cards, the process of casting a mail-in vote provides an additional level of independence between samples.

No method is going to supply perfect theoretical independence from ballot to ballot in the sequence, but theres a general expectation that voting should at least be similar to an IID process.

Also … and I cannot stress this enough … while these techniques can supply indications of irregularities and discrepancies in elections data, they are not conclusive and must be coupled with in-depth investigations.

So going back to the simulation we generated above … what does a simulation look like when cheating occurs? Let’s take a very simple cheat from a random “elections” of 10,000 ballots, with votes being representative of either +1 (or “Yes”) or -1 (or “No”) as we did above. But lets also cheat by randomly selecting two different spots in the data stream to place blocks of 250 consecutive “Yes” results.

The image below shows the result of this process. The blue curve represents the true result, while the red curve represents the cheat. We see that at about 15% and 75% of the vote counted, our algorithm injected a block of “Yes” results, and the resulting cumulative curve breaks through the 3/sqrt(N-1) boundary. Now, not every instance or type of cheat will break through this boundary, and there may be real events that might explain such behavior. But looking for CVR curves that break our statistical expectations is a good way to flag items that need further investigation.

Computing the probability of a ballot run:

Section added on 2022-09-18

We can also a bit more rigor to the statistics outlier detection by computing the probability of the length of observed runs (e.g. how many “heads” did we get in a row?) occurring as we move through the sequential entries. We can compute this probability for K consecutive draws using the rules of statistical independence, which is P([a,a,a,a]) = P(a) x P(a) x P(a) x P(a) = P(a)^4. Therefore the probability of getting 4 “heads” in a row with a hypothetical 53/47 weighted coin would be .53^4 = 0.0789.

Starting with my updated analysis of 2021 Henrico County VA, I’ve started adding this computation to my plots. I have not yet re-run the Texas data below with this new addition, but will do so soon and update this page accordingly.

Real Examples

UPDATE 2022-09-18:

  • I have finally gotten my hands on some data for 2020 in VA. I will be working to analyze that data and will report what I find as soon as I can, but as we are approaching the start of early voting for 2022, my hands are pretty full at the moment so it might take me some time to complete that processing.
  • As noted in my updates to the Henrico County 2021 VA data, and in my section on computing the probability of given runs above, the Texas team noticed that we could further break apart the Travis county data into subgroups by USB stick. I will update my results below as soon as I get the time to do so.

So I haven’t gotten complete cast vote records from VA yet (… which is a whole other set of issues …), but I have gotten my Cheeto stained fingers on some data from the Travis County Texas 2020 race.

So let us first take a look at an example of a real race where everything seems to be obeying the rules as set out above. I’ve doubled my error bars from 3x to 6x of the inverse square standard (discussed above) in order to handle the quasi-IID nature of the data and give some extra margin for small fluctuating correlations.

The plot below shows the Travis County Texas 2020 BoardOfTrusteesAt_LargePosition8AustinISD race, as processed by the tabulation system and stratified by ballot type. We can see that all three ballot types start off with large variances in the computed result but very quickly coalesce and approach their final values. This is exactly what we would expect to see.

Now if I randomly shuffle the ordering of the ballots in this dataset and replot the results (below) I get a plot that looks unsurprisingly similar, which suggests that these election results were likely produced by a quasi-IID process.

Next let’s take a look at a race that does NOT conform to the statistics we’ve laid out above. (… drum-roll please … as this the one everyone’s been waiting for). Immma just leave this right here and just simply point out that all 3 ballot type plots below in the Presidential race for 2020 go outside of the expected error bars. I also note the discrete stair step pattern in the early vote numbers. It’s entirely possible that there is a rational explanation for these deviations. I would sure like to hear it, especially since we have evidence from the exact same dataset of other races that completely followed the expected boundary conditions. So I don’t think this is an issue with a faulty dataset or other technical issues.

And just for completeness, when I artificially shuffle the data for the Presidential race, and force it to be randomized, I do in fact end up with results that conform to IID statistics (below).

I will again state that while these results are highly indicative that there were irregularities and discrepancies in the election data, they are not conclusive. A further investigation must take place, and records must be preserved, in order to discover the cause of the anomalies shown.

Running through each race that had at least 1000 ballots cast and automatically detecting which races busted the 6/sqrt(n-1) boundaries produces the following tabulated results. A 1 in the right hand column indicates that the CVR data for that particular race in Travis County has crossed the error bounds. A 0 in the right hand column indicates that all data stayed within the error bound limits.

RaceCVR_OOB_Irregularity_Detected
President_VicePresident1
UnitedStatesSenator1
UnitedStatesRepresentativeDistrict101
UnitedStatesRepresentativeDistrict171
UnitedStatesRepresentativeDistrict211
UnitedStatesRepresentativeDistrict251
UnitedStatesRepresentativeDistrict350
RailroadCommissioner1
ChiefJustice_SupremeCourt1
Justice_SupremeCourt_Place6_UnexpiredTerm1
Justice_SupremeCourt_Place71
Justice_SupremeCourt_Place81
Judge_CourtOfCriminalAppeals_Place31
Judge_CourtOfCriminalAppeals_Place41
Judge_CourtOfCriminalAppeals_Place91
Member_StateBoardOfEducation_District51
Member_StateBoardOfEducation_District101
StateSenator_District210
StateSenator_District241
StateRepresentativeDistrict471
StateRepresentativeDistrict481
StateRepresentativeDistrict491
StateRepresentativeDistrict501
StateRepresentativeDistrict510
ChiefJustice_3rdCourtOfAppealsDistrict1
DistrictJudge_460thJudicialDistrict1
DistrictAttorney_53rdJudicialDistrict1
CountyJudge_UnexpiredTerm1
Judge_CountyCourtAtLawNo_91
Sheriff1
CountyTaxAssessor_Collector1
CountyCommissionerPrecinct11
CountyCommissionerPrecinct31
AustinCityCouncilDistrict20
AustinCityCouncilDistrict40
AustinCityCouncilDistrict60
AustinCityCouncilDistrict70
AustinCityCouncilDistrict101
PropositionACityOfAustin_FP__2015_1
PropositionBCityOfAustin_FP__2022_1
MayorCityOfCedarPark0
CouncilPlace2CityOfCedarPark0
CouncilPlace4CityOfCedarPark0
CouncilPlace6CityOfCedarPark0
CouncilMemberPlace2CityOfLagoVista0
CouncilMemberPlace4CityOfLagoVista0
CouncilMemberPlace6CityOfLagoVista0
CouncilMemberPlace2CityOfPflugerville0
CouncilMemberPlace4CityOfPflugerville0
CouncilMemberPlace6CityOfPflugerville0
Prop_ACityOfPflugerville_2169_0
Prop_BCityOfPflugerville_2176_0
Prop_CCityOfPflugerville_2183_0
BoardOfTrusteesDistrict2SingleMemberDistrictAISD0
BoardOfTrusteesDistrict5SingleMemberDistrictAISD0
BoardOfTrusteesAt_LargePosition8AustinISD1
BoardOfTrusteesPlace1EanesISD1
Prop_AEanesISD_2246_0
BoardOfTrusteesPlace3LeanderISD0
BoardOfTrusteesPlace4LeanderISD0
BoardOfTrusteesPlace5ManorISD1
BoardOfTrusteesPlace6ManorISD0
BoardOfTrusteesPlace7ManorISD1
BoardOfTrusteesPlace6PflugervilleISD1
BoardOfTrusteesPlace7PflugervilleISD1
BoardOfTrusteesPlace1RoundRockISD1
BoardOfTrusteesPlace2RoundRockISD0
BoardOfTrusteesPlace6RoundRockISD0
BoardOfTrusteesPlace7RoundRockISD0
BoardOfTrusteesWellsBranchCommunityLibraryDistrict0
Var1470
BoardOfTrusteesWestbankLibraryDistrict0
Var1500
Var1510
DirectorsPlace2WellsBranchMUD0
DirectorsPrecinct4BartonSprings_EdwardsAquiferConservationDistr0
PropositionAExtensionOfBoundariesCityOfLakeway_1966_0
PropositionB2_yearTermsCityOfLakeway_1973_0
PropositionCLimitOnSuccessiveYearsOfServiceCityOfLakeway_1980_0
PropositionDResidencyRequirementForCityManagerCityOfLakeway_1980
PropositionEOfficeOfTreasurerCityOfLakeway_1994_0
PropositionFOfficialBallotsCityOfLakeway_2001_0
PropositionGAuthorizingBondsCityOfLakeway_2008_0
PropositionALagoVistaISD_2253_0
PropositionBLagoVistaISD_2260_0
PropositionCLagoVistaISD_2267_0
PropositionAEmergencyServicesDistrict1_2372_0

References and further reading:

Categories
Election Data Analysis Election Forensics Election Integrity programming technical

Number of duplicated voter records in each VA locality

Computed below is the number of duplicated voter records in each locality as of the 11/06/21 VA Registered Voter List (RVL). The computation is based on performing an exact match of LAST_NAME, DOB and ADDRESS fields between records in the file.

Note: If the combination of the name “Jane Smith”, with DOB “1/1/1980”, at “12345 Some Road, Ln.” appears 3 times in the file, there are 3 counts added to the results below. If the combination appears only once, there are 0 counts added to the results below, as there is no repetition.

Additionally I’ve done an even more restrictive matching which requires exact match on FIRST, MIDDLE and LAST name, DOB and ADDRESS fields in the second graphic and list presented below.

The first, more lenient, criteria will correctly flag multiple records with the same first or middle name, but misspelled such as “Muhammad” vs “Mahammad”, but could also include occurrences of voting age twins who live together or spouses with the same DOB.

The second, more strict, criteria requires that multiple rows flagged have exactly the same spelling and punctuation for FIRST, MIDDLE, LAST, DOB and ADDRESS fields. This has less false positive, but more false negatives, as it will likely miss common misspellings between entries, etc.

There are no attempts to match for common misspellings, etc. I did do a simple cleanup for multiple contiguous whitespace elements, etc., before attempting to match.

I have summarized the data here so as not to reveal any personally identifiable information (PII) from the RVL in adherence to VA law.

Update 2022-07-13 12:30: I have sent the full information, for both the lenient and strict criteria queries, to the Prince William County and Loudoun County Registrars. The Loudoun deputy registrar has responded and stated that all but 1 of the duplications in the stricter criteria had already been caught by the elections staff, but he has not yet looked at the entries in the more lenient criteria results file. I have also attempted to contact the Henrico County, Lynchburg City, and York County registrars but have not yet received a response or request to provide them with the full data.

Update 2022-07-31 23:03: I have also heard back from the PWC Registrar (Eric Olsen). Most of the entries that I had flagged in the 11/6/2021 RVL list have already been taken care of by the PWC staff already. There were only a couple that had not yet been noticed or marked as duplicates. Also, per our discussion, I should reiterate and clarify that the titles on the plots below simply refer to duplicated entries of the data files according to the filtering choice. It is a technically accurate description and should not be read as I am asserting other than the results of the matching operation.

Locality NameNumber of repeated entries
ACCOMACK COUNTY64
ALBEMARLE COUNTY311
ALEXANDRIA CITY225
ALLEGHANY COUNTY16
AMELIA COUNTY32
AMHERST COUNTY84
APPOMATTOX COUNTY18
ARLINGTON COUNTY446
AUGUSTA COUNTY119
BATH COUNTY2
BEDFORD COUNTY170
BLAND COUNTY4
BOTETOURT COUNTY45
BRISTOL CITY22
BRUNSWICK COUNTY30
BUCHANAN COUNTY30
BUCKINGHAM COUNTY32
BUENA VISTA CITY10
CAMPBELL COUNTY82
CAROLINE COUNTY67
CARROLL COUNTY42
CHARLES CITY COUNTY18
CHARLOTTE COUNTY28
CHARLOTTESVILLE CITY80
CHESAPEAKE CITY545
CHESTERFIELD COUNTY948
CLARKE COUNTY40
COLONIAL HEIGHTS CITY18
COVINGTON CITY2
CRAIG COUNTY4
CULPEPER COUNTY114
CUMBERLAND COUNTY8
DANVILLE CITY88
DICKENSON COUNTY22
DINWIDDIE COUNTY44
EMPORIA CITY6
ESSEX COUNTY25
FAIRFAX CITY38
FAIRFAX COUNTY2962
FALLS CHURCH CITY39
FAUQUIER COUNTY203
FLOYD COUNTY28
FLUVANNA COUNTY36
FRANKLIN CITY23
FRANKLIN COUNTY84
FREDERICK COUNTY210
FREDERICKSBURG CITY54
GALAX CITY0
GILES COUNTY24
GLOUCESTER COUNTY52
GOOCHLAND COUNTY84
GRAYSON COUNTY18
GREENE COUNTY32
GREENSVILLE COUNTY16
HALIFAX COUNTY48
HAMPTON CITY285
HANOVER COUNTY316
HARRISONBURG CITY40
HENRICO COUNTY676
HENRY COUNTY74
HIGHLAND COUNTY4
HOPEWELL CITY34
ISLE OF WIGHT COUNTY98
JAMES CITY COUNTY217
KING & QUEEN COUNTY13
KING GEORGE COUNTY42
KING WILLIAM COUNTY43
LANCASTER COUNTY10
LEE COUNTY24
LEXINGTON CITY12
LOUDOUN COUNTY1245
LOUISA COUNTY74
LUNENBURG COUNTY26
LYNCHBURG CITY165
MADISON COUNTY12
MANASSAS CITY64
MANASSAS PARK CITY24
MARTINSVILLE CITY14
MATHEWS COUNTY18
MECKLENBURG COUNTY54
MIDDLESEX COUNTY12
MONTGOMERY COUNTY159
NELSON COUNTY30
NEW KENT COUNTY26
NEWPORT NEWS CITY329
NORFOLK CITY411
NORTHAMPTON COUNTY18
NORTHUMBERLAND COUNTY22
NORTON CITY6
NOTTOWAY COUNTY12
ORANGE COUNTY70
PAGE COUNTY47
PATRICK COUNTY28
PETERSBURG CITY68
PITTSYLVANIA COUNTY84
POQUOSON CITY28
PORTSMOUTH CITY186
POWHATAN COUNTY55
PRINCE EDWARD COUNTY43
PRINCE GEORGE COUNTY77
PRINCE WILLIAM COUNTY1159
PULASKI COUNTY59
RADFORD CITY14
RAPPAHANNOCK COUNTY10
RICHMOND CITY300
RICHMOND COUNTY14
ROANOKE CITY133
ROANOKE COUNTY233
ROCKBRIDGE COUNTY28
ROCKINGHAM COUNTY113
RUSSELL COUNTY28
SALEM CITY58
SCOTT COUNTY18
SHENANDOAH COUNTY48
SMYTH COUNTY40
SOUTHAMPTON COUNTY28
SPOTSYLVANIA COUNTY345
STAFFORD COUNTY410
STAUNTON CITY14
SUFFOLK CITY194
SURRY COUNTY10
SUSSEX COUNTY14
TAZEWELL COUNTY52
VIRGINIA BEACH CITY922
WARREN COUNTY46
WASHINGTON COUNTY78
WAYNESBORO CITY26
WESTMORELAND COUNTY24
WILLIAMSBURG CITY22
WINCHESTER CITY42
WISE COUNTY40
WYTHE COUNTY35
YORK COUNTY178
Locality NameNumber of repeated entries
ACCOMACK COUNTY0
ALBEMARLE COUNTY4
ALEXANDRIA CITY0
ALLEGHANY COUNTY0
AMELIA COUNTY0
AMHERST COUNTY2
APPOMATTOX COUNTY0
ARLINGTON COUNTY10
AUGUSTA COUNTY0
BATH COUNTY0
BEDFORD COUNTY4
BLAND COUNTY0
BOTETOURT COUNTY0
BRISTOL CITY0
BRUNSWICK COUNTY0
BUCHANAN COUNTY2
BUCKINGHAM COUNTY0
BUENA VISTA CITY0
CAMPBELL COUNTY2
CAROLINE COUNTY0
CARROLL COUNTY0
CHARLES CITY COUNTY0
CHARLOTTE COUNTY0
CHARLOTTESVILLE CITY0
CHESAPEAKE CITY8
CHESTERFIELD COUNTY8
CLARKE COUNTY0
COLONIAL HEIGHTS CITY0
COVINGTON CITY0
CRAIG COUNTY0
CULPEPER COUNTY0
CUMBERLAND COUNTY0
DANVILLE CITY0
DICKENSON COUNTY0
DINWIDDIE COUNTY0
EMPORIA CITY0
ESSEX COUNTY0
FAIRFAX CITY0
FAIRFAX COUNTY54
FALLS CHURCH CITY0
FAUQUIER COUNTY2
FLOYD COUNTY0
FLUVANNA COUNTY0
FRANKLIN CITY3
FRANKLIN COUNTY0
FREDERICK COUNTY6
FREDERICKSBURG CITY0
GALAX CITY0
GILES COUNTY2
GLOUCESTER COUNTY0
GOOCHLAND COUNTY0
GRAYSON COUNTY0
GREENE COUNTY0
GREENSVILLE COUNTY0
HALIFAX COUNTY0
HAMPTON CITY8
HANOVER COUNTY0
HARRISONBURG CITY0
HENRICO COUNTY24
HENRY COUNTY2
HIGHLAND COUNTY0
HOPEWELL CITY2
ISLE OF WIGHT COUNTY4
JAMES CITY COUNTY0
KING & QUEEN COUNTY0
KING GEORGE COUNTY0
KING WILLIAM COUNTY0
LANCASTER COUNTY0
LEE COUNTY0
LEXINGTON CITY0
LOUDOUN COUNTY23
LOUISA COUNTY0
LUNENBURG COUNTY0
LYNCHBURG CITY16
MADISON COUNTY0
MANASSAS CITY0
MANASSAS PARK CITY0
MARTINSVILLE CITY0
MATHEWS COUNTY2
MECKLENBURG COUNTY0
MIDDLESEX COUNTY0
MONTGOMERY COUNTY2
NELSON COUNTY0
NEW KENT COUNTY0
NEWPORT NEWS CITY0
NORFOLK CITY0
NORTHAMPTON COUNTY0
NORTHUMBERLAND COUNTY0
NORTON CITY0
NOTTOWAY COUNTY0
ORANGE COUNTY0
PAGE COUNTY0
PATRICK COUNTY0
PETERSBURG CITY0
PITTSYLVANIA COUNTY4
POQUOSON CITY0
PORTSMOUTH CITY0
POWHATAN COUNTY0
PRINCE EDWARD COUNTY0
PRINCE GEORGE COUNTY0
PRINCE WILLIAM COUNTY8
PULASKI COUNTY0
RADFORD CITY0
RAPPAHANNOCK COUNTY0
RICHMOND CITY10
RICHMOND COUNTY0
ROANOKE CITY0
ROANOKE COUNTY11
ROCKBRIDGE COUNTY0
ROCKINGHAM COUNTY4
RUSSELL COUNTY0
SALEM CITY0
SCOTT COUNTY0
SHENANDOAH COUNTY0
SMYTH COUNTY0
SOUTHAMPTON COUNTY0
SPOTSYLVANIA COUNTY4
STAFFORD COUNTY4
STAUNTON CITY0
SUFFOLK CITY4
SURRY COUNTY0
SUSSEX COUNTY0
TAZEWELL COUNTY2
VIRGINIA BEACH CITY4
WARREN COUNTY0
WASHINGTON COUNTY0
WAYNESBORO CITY0
WESTMORELAND COUNTY2
WILLIAMSBURG CITY0
WINCHESTER CITY0
WISE COUNTY0
WYTHE COUNTY2
YORK COUNTY0
Categories
Election Data Analysis Election Forensics Election Integrity programming technical

Nationwide Nursing Home, Hospice Care and Assisted Living Facilities listing

Many election integrity investigators are looking through registration records and trying to find suspicious registrations based on the number of records attributed to a specific address as an initial way of identifying records of interest and of need of further scrutiny. This can often produce false positives for things like nursing homes, college dormitories, etc. Additionally, one of the concerns that has been raised is the risk of potential elder abuse, ID theft, manipulation or improper use of ballots for occupants of nursing home, hospice care or assisted living facilities.

According to https://npino.com : “The National Provider Identifier (NPI) is a unique identification number for covered health care providers (doctors, dentists, chiropractors, nurses and other medical staff). The NPI is a 10-digit, intelligence-free numeric identifier. This means that the numbers do not carry other information about healthcare providers, such as the state in which they live or their medical specialty. The NPI must be used in lieu of legacy provider identifiers in the HIPAA standards transactions. Covered health care providers and all health plans and health care clearing houses must use the NPIs in the administrative and financial transactions adopted under HIPAA (Health Insurance Portability and Accountability Act).”

I’ve compiled a list of every nursing home, hospice care, or assisted living facility in the country based on their current NPI code. I have mirrored and scraped the entire https://npino.com site as of 5-23-2022 and compiled the list of nationwide Nursing homes, Assisted Living and Hospice Care facilities into the below CSV file and am presenting it here in the hopes that it is useful for other researchers. I did do a small amount of regular expression based cleanup to the entries (e.x. // replacing “Ste.” with “Suite”, fixing whitespace issues, etc.) as well as manually addressing a handful of obviously incorrect addresses (e.x. // repeated/spliced street addresses, etc.).

Categories
Election Data Analysis Election Forensics Election Integrity Interesting programming technical

2021 VA Election Fingerprints

I finally had some time to put this together.

For additional background information, see here, here and here. As a reminder and summary, according to the published methods in the USAID funded National Academy of Sciences paper (here) that I based this work off of, an ideal “fair” election should look like one or two (depending on how split the electorate is) clean, compact Gaussian distributions (or “bulls-eye’s”). Other structural artifacts, while not conclusive, can be evidence and indicators of election irregularities. One such indicator with an attributed cause is that of a highly directional linear streaking, which implies an incremental fraud and linear correlation. Another known and attributed indicator is that of large distinct and extreme peaks near 100% or 0% votes for the candidate (the vertical axis) that are disconnected from the main Gaussian lobe which the authors label a sign of “extreme fraud”. In general, for free and fair elections, we expect these 2D histograms to show a predominantly uncorrelated relationship between the variables of “% Voter Turnout” (x-axis) and “% Vote Share” (y-axis).

The source data for this analysis comes directly from the VA Department of Elections servers, and was downloaded shortly after the conclusion and certification of the 2021 election results on 12/11/2021 (results file) and 12/12/2021 (turnout file). A link to the current version of these files, hosted by ELECT is here: https://apps.elections.virginia.gov/SBE_CSV/ELECTIONS/ELECTIONRESULTS/2021/2021%20November%20General%20.csv and https://apps.elections.virginia.gov/SBE_CSV/ELECTIONS/ELECTIONTURNOUT/Turnout-2021%20November%20General%20.csv. The files actually used for this analysis, as downloaded from the ELECT servers on the dates mentioned are posted at the end of this article.

Note that even though the republican (Youngkin) won in VA, the y-axis of these plots presented here was computed as the % vote share for the democratic candidate (McCauliffe) in order to more easily compare with the 2020 results. I can produce Yougkin vote share % versions as well if people are interested, and am happy to do so.

While the 2021 election fingerprints look to have less correlations between the variables as compared to 2020 data, they still look very non-gaussian and concerning. While there is no clearly observable “well-known” artifacts as called out in the NAS paper, there is definitely something irregular about the Virginia 2021 election data. Specifically, I find the per-precinct absentee [mail-in + post-election + early-in-person] plot (Figure 6) interesting as there is a diffuse background as well as a linearly correlated set of virtual precincts that show low turnout but very high vote share for the democratic candidate.

One of the nice differences about 2021 VA data is that they actually identified the distinctions between mail-in, early-in-person, and post-election vote tallies in the CAP’s this year. I have broken out the individual sub-groups as well and we can see that the absentee early-in-person (Figure 9) has a fairly diffuse distribution, while the absentee mail-in (Figure 7) and absentee post-election (Figure 8) ballots show a very high McAuliffe Vote %, and what looks to be a linear correlation.

For comparison I’ve also included the 2020 fingerprints. All of the 2020 fingerprints have been recomputed using the exact same MATLAB source code that processed the 2021 data. The archive date of the “2020 November General.csv” and “Turnout 2020 November General.csv” files used was 11/30/2020.

I welcome any and all independent reviews or assistance in making and double checking these results, and will gladly provide all collated source data and MATLAB code to anyone who is interested.

Figure 1 : VA 2021 Per locality, absentee (CAP) + physical precincts

Figure 2 : VA 2021 Per locality, physical precincts only:

Figure 3 : VA 2021 Per locality, absentee precincts only:

Figure 4 : VA 2021 Per precinct, absentee (CAP) + physical precincts:

Figure 5 : VA 2021 Per precinct, physical precincts only:

Figure 6 : VA 2021 Per precinct, absentee (CAP) precincts only:

Figure 7 : VA 2021 Per precinct, CAP precincts, mail-in ballots only:

Figure 8 : VA 2021 Per precinct, CAP precincts, post-election ballots only:

Figure 9 : VA 2021 Per precinct, CAP precincts, early-in-person ballots only:

Comparison to VA 2020 Fingerprints

Figure 10 : VA 2020 Per locality, absentee (CAP) + physical precincts:

Figure 11 : VA 2020 Per locality, physical precincts only:

Figure 12 : VA 2020 Per locality, absentee precincts only:

Figure 13 : VA 2020 Per precinct, absentee (CAP) + physical precincts:

Figure 14 : VA 2020 Per precinct, physical precincts only:

Figure 15 : VA 2020 Per precinct, absentee (CAP) precincts only:

Source Data Files:

Categories
Election Data Analysis Election Forensics Election Integrity Interesting programming technical Uncategorized

Vanishing Voter ID’s in sequential 2021 DAL files

During the 2021 election I archived multiple versions of the Statewide Daily Absentee List (DAL) files as produced by the VA Department of Elections (ELECT). As the name implies, the DAL files are a daily produced official product from ELECT that accumulates data representing the state of absentee votes over the course of the election. i.e. The data that exists in a DAL file produced on Tuesday morning should be contained in the DAL file produced on the following Wednesday along with any new entries from the events of Tuesday, etc.

Therefore, it is expected that once a Voter ID number is listed in the DAL file during an election period, subsequent DAL files *should* include a record associated with that voter ID. The status of that voter and the absentee ballot might change, but the records of the transactions during the election should still be present. I have confirmed that this is the expected behavior via discussions with multiple former and previous VA election officials.

Stepping through the snapshots of collected 2021 DAL files in chronological order, we can observe Voter IDs that mysteriously “vanish” from the DAL record. We can do this by simply mapping the existence/non-existence of unique Voter ID numbers in each file. The plot below in Figure 1 is the counts of the number of observed “vanishing” ID numbers as we move from file to file. The total number of vanishing ID numbers is 429 over the course of the 2021 election. Not a large number. But it’s 429 too many. I can think of no legitimate reason that this should occur.

Now an interesting thing to do, is to look at a few examples of how these issues manifest themselves in the data. Note that I’m hiding the personally identifiable information from the DAL file records in the screenshots below, BTW.

The first example in the screenshot below is an issue where the voter in question has a ballot that is in the “APPROVED” and “ISSUED” state, meaning that they have submitted a request for a ballot and that the ballot has been sent out. The record for this voter ID is present in the DAL file up until Oct 14th 2021, after which it completely vanishes from the DAL records. This voter ID is also not present in the RVL or VHL downloaded from the state on 11/06/2021.

This voter was apparently issued a real, live, ballot for 2021 and then was subsequently removed from the DAL and (presumably) the voter rolls + VERIS on or around the 14th Oct according to the DAL snapshots.  What happened to that ballot? What happened to the record of that ballot? The only public record of that ballot even existing, let alone the fact that it was physically issued and mailed out, was erased when the Voter ID was removed from DAL/RVL/VHL records.  Again, this removal happened in the middle of an election where that particular voter had already been issued a live ballot!

A few of these IDs actually “reappear” in the DAL records.  ID “230*****” is one example, and a screenshot of its chronological record is below.  The ballot shows as being ISSUED until Oct 14th 2021.  It then disappears from the DAL record completely until the data pull on Oct 24th, where it shows up again as DELETED.  This status is maintained until Nov 6th 2021 when it starts oscillating between “Marked” and “deleted” until it finally lands on “Marked” in the Dec 5 DAL file pull.  The entire time the Application status is in the “Approved” state for this voter ID.  From my discussions with registrars and election officials the “Marked” designation signifies that a ballot has been received by the registrar for that voter and is slated to be tabulated.

I have poked ELECT on twitter (@wwrkds) on this matter to try and get an official response, and submitted questions on this matter to Ashley Coles at ELECT, per the advice of my local board of elections chair. Her response to me is below:

I will update this post as information changes.

Categories
Election Data Analysis Election Forensics Election Integrity programming technical

Slides from Loudoun Election Integrity Meeting

Presented 2022-04-14 at Patriot Pub in Hamilton VA. Note that I mis-quoted the total registration error number off the slide deck during the voice track of my presentation. The slides are correct … my eyes and memory just suck sometimes! I said ~850K. The actual number is ~380K.

Video can be found here: https://www.patriotpuballiance.com