Month: May 2023

Potential Duplicate Registrants in VA RVL by Locality

Post author By Jonathan Lareau
Post date May 31, 2023

Previously I posted the computation of potential duplicate records based on string comparisons in the registered voter list. As a follow-up to that article, I’ve compiled statistics on the number of potential duplicate pairs for each locality in VA.

I tallied the number of registrant pairs that meet the reference match criteria defined by the MOU between ELECT and the DMV, along with the two highest-confidence (most stringent) match criteria that I computed. I also stratified the results by whether only Active registrant records were considered or both Active and Inactive records, and by whether or not the pairs crossed a locality boundary.
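As an illustration of the tally described above, here is a minimal Python sketch. It is not the script used to produce the results: the record layout, field names and sample rows are hypothetical, and "Active only" is assumed to mean that both records in a pair are Active. It groups records by an exact-match key, forms every pair within a group, and then counts the pairs under each stratification.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "first": "AMY", "middle": "BETH", "last": "MCVOTER", "suffix": "",
     "dob": "12/05/1970", "status": "Active", "locality": "NORTON CITY"},
    {"id": 2, "first": "AMY", "middle": "BETH", "last": "MCVOTER", "suffix": "",
     "dob": "12/05/1970", "status": "Active", "locality": "WISE COUNTY"},
    {"id": 3, "first": "AMY", "middle": "ANN", "last": "MCVOTER", "suffix": "",
     "dob": "12/05/1970", "status": "Inactive", "locality": "NORTON CITY"},
    {"id": 4, "first": "JOHN", "middle": "", "last": "PUBLIC", "suffix": "",
     "dob": "01/01/1980", "status": "Active", "locality": "NORTON CITY"},
]

def key_first_last_dob(r):
    """Match key equivalent to the MOU baseline: First + Last + DOB."""
    return (r["first"], r["last"], r["dob"])

def key_full_name_dob(r):
    """Stricter key: First + Middle + Last + Suffix + DOB."""
    return (r["first"], r["middle"], r["last"], r["suffix"], r["dob"])

def matching_pairs(rows, key):
    """Return every pair of records whose keys are exactly equal."""
    groups = defaultdict(list)
    for r in rows:
        groups[key(r)].append(r)
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(members, 2))
    return pairs

def stratify(pairs):
    """Count pairs overall, Active only, cross-locality, and both."""
    def both_active(a, b):
        return a["status"] == "Active" and b["status"] == "Active"
    def crosses(a, b):
        return a["locality"] != b["locality"]
    return {
        "all": len(pairs),
        "active_only": sum(1 for a, b in pairs if both_active(a, b)),
        "xloc": sum(1 for a, b in pairs if crosses(a, b)),
        "active_xloc": sum(1 for a, b in pairs if both_active(a, b) and crosses(a, b)),
    }

mou_counts = stratify(matching_pairs(records, key_first_last_dob))
full_counts = stratify(matching_pairs(records, key_full_name_dob))
print(mou_counts)
print(full_counts)
```

The third criterion (adding Gender and Street Address) would simply be another key function. How the counts are attributed to a locality and normalized into percentages is not reproduced here.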

The table below is organized into the following computed columns, and has been sorted in decreasing order according to column 5.

1. Exactly matching First + Last + DOB, which is equivalent to the MOU between ELECT and DMV.
2. Exactly matching First + Middle + Last + Suffix + DOB
3. Exactly matching First + Middle + Last + Suffix + DOB + Gender + Street Address
4. The same as #1, but filtering for only ACTIVE voter records
5. The same as #2, but filtering for only ACTIVE voter records
6. The same as #3, but filtering for only ACTIVE voter records
7. The same as #1, but filtering for only pairs that cross a locality boundary.
8. The same as #2, but filtering for only pairs that cross a locality boundary.
9. The same as #3, but filtering for only pairs that cross a locality boundary.
10. The same as #4, but filtering for only pairs that cross a locality boundary.
11. The same as #5, but filtering for only pairs that cross a locality boundary.
12. The same as #6, but filtering for only pairs that cross a locality boundary.

		1	2	3	4	5	6	7	8	9	10	11	12
LOCALITY_NAME	Num Registrant Records	Pct Same First Last Dob	Pct Same Full Name Dob	Pct Same Full Name Dob Address	Pct Same First Last Dob _ Active Only	Pct Same Full Name Dob _ Active Only	Pct Same Full Name Dob Address _ Active Only	Pct Same First Last Dob _ xLoc	Pct Same Full Name Dob _ xLoc	Pct Same Full Name Dob Address _ xLoc	Pct Same First Last Dob _ Active Only _ xLoc	Pct Same Full Name Dob _ Active Only _ xLoc	Pct Same Full Name Dob Address _ Active Only _ xLoc
NORTON CITY	2604	0.2304%	0.2304%	0.1536%	0.1920%	0.1920%	0.1536%	0.0768%	0.0768%	0.0000%	0.0384%	0.0384%	0.0000%
NOTTOWAY COUNTY	9704	0.2988%	0.2061%	0.0618%	0.2473%	0.1752%	0.0618%	0.2370%	0.1649%	0.0206%	0.1855%	0.1340%	0.0206%
RADFORD CITY	9551	0.4293%	0.2827%	0.0000%	0.2827%	0.1675%	0.0000%	0.4293%	0.2827%	0.0000%	0.2827%	0.1675%	0.0000%
HIGHLAND COUNTY	1903	0.2627%	0.1576%	0.1051%	0.2627%	0.1576%	0.1051%	0.1576%	0.0525%	0.0000%	0.1576%	0.0525%	0.0000%
WILLIAMSBURG CITY	10480	0.2195%	0.1336%	0.0000%	0.2004%	0.1336%	0.0000%	0.2004%	0.1336%	0.0000%	0.1813%	0.1336%	0.0000%
LYNCHBURG CITY	56319	0.3072%	0.1829%	0.0533%	0.2255%	0.1296%	0.0533%	0.1616%	0.0764%	0.0000%	0.1190%	0.0479%	0.0000%
EMPORIA CITY	4023	0.3480%	0.1740%	0.0000%	0.2983%	0.1243%	0.0000%	0.2486%	0.0746%	0.0000%	0.1989%	0.0249%	0.0000%
SUFFOLK CITY	71580	0.2403%	0.1229%	0.0754%	0.2249%	0.1187%	0.0754%	0.1229%	0.0307%	0.0000%	0.1104%	0.0265%	0.0000%
FALLS CHURCH CITY	11213	0.1784%	0.1338%	0.0357%	0.1516%	0.1159%	0.0178%	0.0892%	0.0624%	0.0000%	0.0803%	0.0624%	0.0000%
SUSSEX COUNTY	7149	0.2658%	0.1259%	0.0839%	0.2238%	0.1119%	0.0839%	0.1539%	0.0140%	0.0000%	0.1119%	0.0000%	0.0000%
FRANKLIN CITY	5924	0.2026%	0.1182%	0.0338%	0.1857%	0.1013%	0.0338%	0.1688%	0.0844%	0.0000%	0.1519%	0.0675%	0.0000%
APPOMATTOX COUNTY	12195	0.2542%	0.1230%	0.0328%	0.2214%	0.0902%	0.0328%	0.2050%	0.0738%	0.0000%	0.1886%	0.0574%	0.0000%
LEE COUNTY	15619	0.2497%	0.0960%	0.0128%	0.2305%	0.0832%	0.0128%	0.1473%	0.0192%	0.0000%	0.1409%	0.0192%	0.0000%
ALBEMARLE COUNTY	84889	0.1920%	0.1001%	0.0212%	0.1590%	0.0825%	0.0188%	0.1402%	0.0554%	0.0000%	0.1096%	0.0401%	0.0000%
AMHERST COUNTY	22906	0.1965%	0.0829%	0.0437%	0.1790%	0.0742%	0.0437%	0.1441%	0.0393%	0.0000%	0.1266%	0.0306%	0.0000%
PRINCE EDWARD COUNTY	13595	0.2280%	0.0883%	0.0000%	0.1912%	0.0662%	0.0000%	0.2133%	0.0883%	0.0000%	0.1765%	0.0662%	0.0000%
STAUNTON CITY	18180	0.1980%	0.0935%	0.0000%	0.1595%	0.0605%	0.0000%	0.1650%	0.0605%	0.0000%	0.1265%	0.0275%	0.0000%
NELSON COUNTY	11895	0.1765%	0.0673%	0.0168%	0.1513%	0.0588%	0.0168%	0.1261%	0.0504%	0.0000%	0.1177%	0.0420%	0.0000%
ARLINGTON COUNTY	177092	0.1378%	0.0683%	0.0113%	0.1146%	0.0576%	0.0102%	0.0870%	0.0344%	0.0000%	0.0683%	0.0260%	0.0000%
NORTHUMBERLAND COUNTY	10457	0.1339%	0.0574%	0.0191%	0.1243%	0.0574%	0.0191%	0.0956%	0.0191%	0.0000%	0.0861%	0.0191%	0.0000%
SOUTHAMPTON COUNTY	13218	0.2194%	0.0757%	0.0000%	0.1740%	0.0530%	0.0000%	0.1589%	0.0454%	0.0000%	0.1286%	0.0227%	0.0000%
HOPEWELL CITY	15825	0.2401%	0.0695%	0.0253%	0.2085%	0.0506%	0.0253%	0.1390%	0.0190%	0.0000%	0.1201%	0.0126%	0.0000%
LUNENBURG COUNTY	8097	0.1853%	0.0618%	0.0000%	0.1729%	0.0494%	0.0000%	0.1853%	0.0618%	0.0000%	0.1729%	0.0494%	0.0000%
AMELIA COUNTY	10179	0.1375%	0.0884%	0.0098%	0.0884%	0.0491%	0.0098%	0.1375%	0.0884%	0.0098%	0.0884%	0.0491%	0.0098%
RICHMOND CITY	161097	0.1707%	0.0639%	0.0000%	0.1316%	0.0490%	0.0000%	0.1459%	0.0528%	0.0000%	0.1155%	0.0416%	0.0000%
CHARLOTTESVILLE CITY	34789	0.1265%	0.0604%	0.0000%	0.1064%	0.0489%	0.0000%	0.1150%	0.0489%	0.0000%	0.0949%	0.0374%	0.0000%
LEXINGTON CITY	4211	0.2612%	0.1187%	0.0000%	0.1900%	0.0475%	0.0000%	0.2612%	0.1187%	0.0000%	0.1900%	0.0475%	0.0000%
FAIRFAX COUNTY	787727	0.1143%	0.0559%	0.0053%	0.0988%	0.0474%	0.0053%	0.0665%	0.0236%	0.0000%	0.0546%	0.0171%	0.0000%
CHARLOTTE COUNTY	8474	0.2242%	0.0708%	0.0236%	0.1652%	0.0472%	0.0236%	0.2006%	0.0472%	0.0000%	0.1416%	0.0236%	0.0000%
HARRISONBURG CITY	26443	0.1777%	0.0870%	0.0000%	0.1210%	0.0454%	0.0000%	0.1324%	0.0567%	0.0000%	0.0908%	0.0303%	0.0000%
BRUNSWICK COUNTY	11098	0.2253%	0.0631%	0.0000%	0.1982%	0.0451%	0.0000%	0.2072%	0.0451%	0.0000%	0.1802%	0.0270%	0.0000%
HAMPTON CITY	100807	0.2044%	0.0764%	0.0060%	0.1468%	0.0446%	0.0040%	0.1210%	0.0387%	0.0000%	0.0972%	0.0268%	0.0000%
WISE COUNTY	24750	0.1455%	0.0525%	0.0000%	0.1333%	0.0444%	0.0000%	0.1212%	0.0364%	0.0000%	0.1091%	0.0283%	0.0000%
WYTHE COUNTY	20950	0.1480%	0.0525%	0.0191%	0.1289%	0.0430%	0.0191%	0.1002%	0.0143%	0.0000%	0.0907%	0.0143%	0.0000%
CHESAPEAKE CITY	178005	0.1258%	0.0433%	0.0303%	0.1140%	0.0410%	0.0303%	0.0843%	0.0062%	0.0000%	0.0747%	0.0051%	0.0000%
NEWPORT NEWS CITY	124778	0.1354%	0.0537%	0.0016%	0.1122%	0.0409%	0.0016%	0.1002%	0.0313%	0.0000%	0.0850%	0.0216%	0.0000%
CUMBERLAND COUNTY	7416	0.1483%	0.0539%	0.0270%	0.1214%	0.0405%	0.0270%	0.1214%	0.0270%	0.0000%	0.0944%	0.0135%	0.0000%
PRINCE GEORGE COUNTY	24957	0.1643%	0.0401%	0.0000%	0.1322%	0.0401%	0.0000%	0.1643%	0.0401%	0.0000%	0.1322%	0.0401%	0.0000%
HALIFAX COUNTY	25086	0.1196%	0.0438%	0.0239%	0.1156%	0.0399%	0.0239%	0.0877%	0.0120%	0.0000%	0.0837%	0.0080%	0.0000%
SMYTH COUNTY	20159	0.1339%	0.0397%	0.0000%	0.1290%	0.0397%	0.0000%	0.1141%	0.0198%	0.0000%	0.1091%	0.0198%	0.0000%
FAIRFAX CITY	17825	0.1234%	0.0617%	0.0000%	0.0954%	0.0393%	0.0000%	0.1122%	0.0617%	0.0000%	0.0842%	0.0393%	0.0000%
CAMPBELL COUNTY	41318	0.1380%	0.0508%	0.0048%	0.1186%	0.0387%	0.0048%	0.1283%	0.0411%	0.0000%	0.1089%	0.0290%	0.0000%
COLONIAL HEIGHTS CITY	13066	0.0918%	0.0383%	0.0000%	0.0918%	0.0383%	0.0000%	0.0918%	0.0383%	0.0000%	0.0918%	0.0383%	0.0000%
CHESTERFIELD COUNTY	270084	0.1529%	0.0478%	0.0067%	0.1300%	0.0381%	0.0059%	0.1107%	0.0248%	0.0000%	0.0937%	0.0196%	0.0000%
PETERSBURG CITY	23740	0.1685%	0.0421%	0.0000%	0.1559%	0.0379%	0.0000%	0.1601%	0.0421%	0.0000%	0.1474%	0.0379%	0.0000%
SURRY COUNTY	5675	0.1762%	0.0352%	0.0000%	0.1410%	0.0352%	0.0000%	0.1410%	0.0000%	0.0000%	0.1057%	0.0000%	0.0000%
STAFFORD COUNTY	111261	0.1222%	0.0440%	0.0072%	0.1079%	0.0351%	0.0072%	0.1007%	0.0279%	0.0000%	0.0881%	0.0207%	0.0000%
BUCHANAN COUNTY	14836	0.0876%	0.0337%	0.0000%	0.0876%	0.0337%	0.0000%	0.0607%	0.0067%	0.0000%	0.0607%	0.0067%	0.0000%
PORTSMOUTH CITY	68381	0.1536%	0.0409%	0.0058%	0.1375%	0.0336%	0.0058%	0.1185%	0.0263%	0.0000%	0.1024%	0.0190%	0.0000%
PITTSYLVANIA COUNTY	45322	0.1677%	0.0441%	0.0044%	0.1522%	0.0331%	0.0044%	0.1324%	0.0221%	0.0000%	0.1214%	0.0154%	0.0000%
MECKLENBURG COUNTY	22996	0.1522%	0.0478%	0.0000%	0.1305%	0.0304%	0.0000%	0.1261%	0.0391%	0.0000%	0.1131%	0.0304%	0.0000%
NORTHAMPTON COUNTY	9877	0.0911%	0.0304%	0.0202%	0.0810%	0.0304%	0.0202%	0.0911%	0.0101%	0.0000%	0.0810%	0.0101%	0.0000%
PAGE COUNTY	17095	0.1872%	0.0351%	0.0000%	0.1521%	0.0292%	0.0000%	0.1521%	0.0117%	0.0000%	0.1170%	0.0058%	0.0000%
ACCOMACK COUNTY	25483	0.1216%	0.0275%	0.0000%	0.1020%	0.0275%	0.0000%	0.1138%	0.0275%	0.0000%	0.0942%	0.0275%	0.0000%
GRAYSON COUNTY	10941	0.1645%	0.0274%	0.0000%	0.1554%	0.0274%	0.0000%	0.1462%	0.0274%	0.0000%	0.1371%	0.0274%	0.0000%
ALLEGHANY COUNTY	11069	0.1355%	0.0271%	0.0000%	0.1084%	0.0271%	0.0000%	0.0994%	0.0090%	0.0000%	0.0723%	0.0090%	0.0000%
MATHEWS COUNTY	7378	0.0949%	0.0271%	0.0271%	0.0678%	0.0271%	0.0271%	0.0678%	0.0000%	0.0000%	0.0407%	0.0000%	0.0000%
BEDFORD COUNTY	63240	0.1233%	0.0300%	0.0063%	0.1154%	0.0269%	0.0063%	0.1012%	0.0142%	0.0000%	0.0933%	0.0111%	0.0000%
HENRICO COUNTY	240436	0.1152%	0.0299%	0.0083%	0.0998%	0.0258%	0.0083%	0.0944%	0.0175%	0.0000%	0.0807%	0.0133%	0.0000%
WAYNESBORO CITY	15561	0.1735%	0.0450%	0.0000%	0.1285%	0.0257%	0.0000%	0.1735%	0.0450%	0.0000%	0.1285%	0.0257%	0.0000%
HANOVER COUNTY	87000	0.1092%	0.0287%	0.0023%	0.1011%	0.0253%	0.0023%	0.1023%	0.0218%	0.0000%	0.0943%	0.0184%	0.0000%
CRAIG COUNTY	3972	0.1007%	0.0252%	0.0000%	0.1007%	0.0252%	0.0000%	0.1007%	0.0252%	0.0000%	0.1007%	0.0252%	0.0000%
GALAX CITY	4067	0.1229%	0.0246%	0.0000%	0.1229%	0.0246%	0.0000%	0.1229%	0.0246%	0.0000%	0.1229%	0.0246%	0.0000%
ORANGE COUNTY	28482	0.1299%	0.0351%	0.0000%	0.1194%	0.0246%	0.0000%	0.1299%	0.0351%	0.0000%	0.1194%	0.0246%	0.0000%
DANVILLE CITY	28838	0.1040%	0.0312%	0.0000%	0.0902%	0.0243%	0.0000%	0.1040%	0.0312%	0.0000%	0.0902%	0.0243%	0.0000%
CARROLL COUNTY	21163	0.1040%	0.0236%	0.0095%	0.1040%	0.0236%	0.0095%	0.0945%	0.0142%	0.0000%	0.0945%	0.0142%	0.0000%
FREDERICK COUNTY	67912	0.1075%	0.0324%	0.0088%	0.0883%	0.0236%	0.0059%	0.0898%	0.0206%	0.0000%	0.0736%	0.0147%	0.0000%
MANASSAS PARK CITY	9018	0.0665%	0.0222%	0.0000%	0.0554%	0.0222%	0.0000%	0.0444%	0.0222%	0.0000%	0.0333%	0.0222%	0.0000%
HENRY COUNTY	36539	0.1259%	0.0246%	0.0000%	0.1122%	0.0219%	0.0000%	0.0931%	0.0082%	0.0000%	0.0848%	0.0055%	0.0000%
BLAND COUNTY	4581	0.1091%	0.0218%	0.0000%	0.1091%	0.0218%	0.0000%	0.1091%	0.0218%	0.0000%	0.1091%	0.0218%	0.0000%
SPOTSYLVANIA COUNTY	105361	0.0987%	0.0247%	0.0057%	0.0873%	0.0218%	0.0057%	0.0816%	0.0095%	0.0000%	0.0702%	0.0066%	0.0000%
WINCHESTER CITY	18352	0.1035%	0.0381%	0.0000%	0.0708%	0.0218%	0.0000%	0.0926%	0.0381%	0.0000%	0.0599%	0.0218%	0.0000%
LANCASTER COUNTY	9267	0.0755%	0.0216%	0.0000%	0.0755%	0.0216%	0.0000%	0.0755%	0.0216%	0.0000%	0.0755%	0.0216%	0.0000%
KING WILLIAM COUNTY	13996	0.1286%	0.0214%	0.0000%	0.1143%	0.0214%	0.0000%	0.1286%	0.0214%	0.0000%	0.1143%	0.0214%	0.0000%
WESTMORELAND COUNTY	14233	0.1827%	0.0211%	0.0000%	0.1756%	0.0211%	0.0000%	0.1546%	0.0211%	0.0000%	0.1475%	0.0211%	0.0000%
VIRGINIA BEACH CITY	331914	0.1118%	0.0259%	0.0066%	0.0967%	0.0208%	0.0066%	0.0883%	0.0114%	0.0000%	0.0762%	0.0081%	0.0000%
POWHATAN COUNTY	24287	0.1400%	0.0371%	0.0000%	0.1153%	0.0206%	0.0000%	0.1400%	0.0371%	0.0000%	0.1153%	0.0206%	0.0000%
BOTETOURT COUNTY	26311	0.1178%	0.0190%	0.0076%	0.1102%	0.0190%	0.0076%	0.1102%	0.0114%	0.0000%	0.1026%	0.0114%	0.0000%
FLUVANNA COUNTY	21001	0.1286%	0.0238%	0.0000%	0.1190%	0.0190%	0.0000%	0.1095%	0.0238%	0.0000%	0.1000%	0.0190%	0.0000%
SCOTT COUNTY	16059	0.1121%	0.0249%	0.0000%	0.1059%	0.0187%	0.0000%	0.0996%	0.0125%	0.0000%	0.0934%	0.0062%	0.0000%
ALEXANDRIA CITY	112212	0.0820%	0.0205%	0.0000%	0.0686%	0.0178%	0.0000%	0.0784%	0.0169%	0.0000%	0.0651%	0.0143%	0.0000%
TAZEWELL COUNTY	28147	0.0995%	0.0178%	0.0142%	0.0959%	0.0178%	0.0142%	0.0853%	0.0036%	0.0000%	0.0817%	0.0036%	0.0000%
RICHMOND COUNTY	5649	0.2301%	0.0354%	0.0000%	0.1947%	0.0177%	0.0000%	0.1947%	0.0354%	0.0000%	0.1593%	0.0177%	0.0000%
ROCKINGHAM COUNTY	56817	0.0845%	0.0246%	0.0035%	0.0739%	0.0176%	0.0000%	0.0739%	0.0176%	0.0000%	0.0669%	0.0141%	0.0000%
LOUISA COUNTY	29567	0.1150%	0.0271%	0.0135%	0.1015%	0.0169%	0.0135%	0.1082%	0.0135%	0.0000%	0.0947%	0.0034%	0.0000%
LOUDOUN COUNTY	291914	0.0740%	0.0219%	0.0041%	0.0620%	0.0164%	0.0041%	0.0651%	0.0171%	0.0000%	0.0531%	0.0116%	0.0000%
RAPPAHANNOCK COUNTY	6239	0.0962%	0.0160%	0.0000%	0.0801%	0.0160%	0.0000%	0.0962%	0.0160%	0.0000%	0.0801%	0.0160%	0.0000%
JAMES CITY COUNTY	64390	0.0745%	0.0186%	0.0000%	0.0668%	0.0155%	0.0000%	0.0621%	0.0124%	0.0000%	0.0544%	0.0093%	0.0000%
PATRICK COUNTY	12862	0.0855%	0.0155%	0.0000%	0.0777%	0.0155%	0.0000%	0.0855%	0.0155%	0.0000%	0.0777%	0.0155%	0.0000%
PRINCE WILLIAM COUNTY	316530	0.0812%	0.0186%	0.0000%	0.0663%	0.0148%	0.0000%	0.0711%	0.0142%	0.0000%	0.0581%	0.0104%	0.0000%
AUGUSTA COUNTY	54993	0.1455%	0.0218%	0.0036%	0.1255%	0.0145%	0.0036%	0.1346%	0.0182%	0.0000%	0.1146%	0.0109%	0.0000%
DINWIDDIE COUNTY	20835	0.1584%	0.0384%	0.0048%	0.1152%	0.0144%	0.0048%	0.1488%	0.0288%	0.0048%	0.1152%	0.0144%	0.0048%
GOOCHLAND COUNTY	21410	0.1261%	0.0187%	0.0000%	0.1121%	0.0140%	0.0000%	0.1261%	0.0187%	0.0000%	0.1121%	0.0140%	0.0000%
MONTGOMERY COUNTY	61944	0.0936%	0.0145%	0.0000%	0.0807%	0.0129%	0.0000%	0.0904%	0.0145%	0.0000%	0.0775%	0.0129%	0.0000%
SHENANDOAH COUNTY	32304	0.0960%	0.0155%	0.0000%	0.0743%	0.0124%	0.0000%	0.0960%	0.0155%	0.0000%	0.0743%	0.0124%	0.0000%
ROANOKE COUNTY	73467	0.0953%	0.0163%	0.0027%	0.0830%	0.0123%	0.0027%	0.0817%	0.0109%	0.0000%	0.0694%	0.0068%	0.0000%
SALEM CITY	17932	0.0892%	0.0112%	0.0000%	0.0781%	0.0112%	0.0000%	0.0892%	0.0112%	0.0000%	0.0781%	0.0112%	0.0000%
NEW KENT COUNTY	19022	0.1051%	0.0210%	0.0000%	0.0894%	0.0105%	0.0000%	0.0946%	0.0210%	0.0000%	0.0789%	0.0105%	0.0000%
WASHINGTON COUNTY	39449	0.1014%	0.0152%	0.0000%	0.0887%	0.0101%	0.0000%	0.0862%	0.0051%	0.0000%	0.0786%	0.0051%	0.0000%
MADISON COUNTY	10407	0.0865%	0.0192%	0.0000%	0.0769%	0.0096%	0.0000%	0.0865%	0.0192%	0.0000%	0.0769%	0.0096%	0.0000%
NORFOLK CITY	141236	0.0984%	0.0092%	0.0000%	0.0864%	0.0085%	0.0000%	0.0899%	0.0064%	0.0000%	0.0793%	0.0057%	0.0000%
PULASKI COUNTY	23825	0.0881%	0.0126%	0.0000%	0.0756%	0.0084%	0.0000%	0.0881%	0.0126%	0.0000%	0.0756%	0.0084%	0.0000%
CLARKE COUNTY	12269	0.1060%	0.0163%	0.0000%	0.0978%	0.0082%	0.0000%	0.1060%	0.0163%	0.0000%	0.0978%	0.0082%	0.0000%
GREENE COUNTY	14926	0.1072%	0.0067%	0.0000%	0.1072%	0.0067%	0.0000%	0.1072%	0.0067%	0.0000%	0.1072%	0.0067%	0.0000%
GLOUCESTER COUNTY	30284	0.0859%	0.0066%	0.0000%	0.0859%	0.0066%	0.0000%	0.0859%	0.0066%	0.0000%	0.0859%	0.0066%	0.0000%
WARREN COUNTY	30517	0.0885%	0.0066%	0.0000%	0.0852%	0.0066%	0.0000%	0.0819%	0.0066%	0.0000%	0.0786%	0.0066%	0.0000%
ISLE OF WIGHT COUNTY	31179	0.0898%	0.0064%	0.0000%	0.0834%	0.0064%	0.0000%	0.0898%	0.0064%	0.0000%	0.0834%	0.0064%	0.0000%
ROCKBRIDGE COUNTY	16266	0.1230%	0.0123%	0.0000%	0.1045%	0.0061%	0.0000%	0.1230%	0.0123%	0.0000%	0.1045%	0.0061%	0.0000%
CULPEPER COUNTY	37117	0.0943%	0.0108%	0.0000%	0.0808%	0.0054%	0.0000%	0.0889%	0.0108%	0.0000%	0.0754%	0.0054%	0.0000%
FAUQUIER COUNTY	56396	0.0887%	0.0071%	0.0000%	0.0762%	0.0053%	0.0000%	0.0887%	0.0071%	0.0000%	0.0762%	0.0053%	0.0000%
FREDERICKSBURG CITY	19455	0.0874%	0.0051%	0.0000%	0.0720%	0.0051%	0.0000%	0.0874%	0.0051%	0.0000%	0.0720%	0.0051%	0.0000%
FRANKLIN COUNTY	39866	0.0602%	0.0050%	0.0050%	0.0502%	0.0050%	0.0050%	0.0552%	0.0000%	0.0000%	0.0452%	0.0000%	0.0000%
MANASSAS CITY	23815	0.1008%	0.0042%	0.0000%	0.0966%	0.0042%	0.0000%	0.0840%	0.0042%	0.0000%	0.0798%	0.0042%	0.0000%
YORK COUNTY	50838	0.0925%	0.0157%	0.0000%	0.0669%	0.0039%	0.0000%	0.0885%	0.0157%	0.0000%	0.0629%	0.0039%	0.0000%
BATH COUNTY	3358	0.0893%	0.0000%	0.0000%	0.0893%	0.0000%	0.0000%	0.0893%	0.0000%	0.0000%	0.0893%	0.0000%	0.0000%
BRISTOL CITY	12345	0.0729%	0.0000%	0.0000%	0.0567%	0.0000%	0.0000%	0.0567%	0.0000%	0.0000%	0.0567%	0.0000%	0.0000%
BUCKINGHAM COUNTY	11063	0.1356%	0.0271%	0.0000%	0.0904%	0.0000%	0.0000%	0.1356%	0.0271%	0.0000%	0.0904%	0.0000%	0.0000%
BUENA VISTA CITY	4432	0.0903%	0.0000%	0.0000%	0.0903%	0.0000%	0.0000%	0.0903%	0.0000%	0.0000%	0.0903%	0.0000%	0.0000%
CAROLINE COUNTY	22894	0.1005%	0.0087%	0.0000%	0.0830%	0.0000%	0.0000%	0.1005%	0.0087%	0.0000%	0.0830%	0.0000%	0.0000%
CHARLES CITY COUNTY	5720	0.0524%	0.0000%	0.0000%	0.0350%	0.0000%	0.0000%	0.0524%	0.0000%	0.0000%	0.0350%	0.0000%	0.0000%
COVINGTON CITY	3888	0.1029%	0.0000%	0.0000%	0.0772%	0.0000%	0.0000%	0.1029%	0.0000%	0.0000%	0.0772%	0.0000%	0.0000%
DICKENSON COUNTY	10144	0.1084%	0.0000%	0.0000%	0.0887%	0.0000%	0.0000%	0.1084%	0.0000%	0.0000%	0.0887%	0.0000%	0.0000%
ESSEX COUNTY	8318	0.1443%	0.0000%	0.0000%	0.1443%	0.0000%	0.0000%	0.1443%	0.0000%	0.0000%	0.1443%	0.0000%	0.0000%
FLOYD COUNTY	11852	0.0759%	0.0000%	0.0000%	0.0759%	0.0000%	0.0000%	0.0759%	0.0000%	0.0000%	0.0759%	0.0000%	0.0000%
GILES COUNTY	12093	0.0413%	0.0000%	0.0000%	0.0331%	0.0000%	0.0000%	0.0413%	0.0000%	0.0000%	0.0331%	0.0000%	0.0000%
GREENSVILLE COUNTY	6435	0.1709%	0.0155%	0.0000%	0.1399%	0.0000%	0.0000%	0.1709%	0.0155%	0.0000%	0.1399%	0.0000%	0.0000%
KING AND QUEEN COUNTY	5403	0.0740%	0.0000%	0.0000%	0.0740%	0.0000%	0.0000%	0.0740%	0.0000%	0.0000%	0.0740%	0.0000%	0.0000%
KING GEORGE COUNTY	19780	0.1314%	0.0000%	0.0000%	0.0910%	0.0000%	0.0000%	0.1314%	0.0000%	0.0000%	0.0910%	0.0000%	0.0000%
MARTINSVILLE CITY	9070	0.0992%	0.0000%	0.0000%	0.0882%	0.0000%	0.0000%	0.0992%	0.0000%	0.0000%	0.0882%	0.0000%	0.0000%
MIDDLESEX COUNTY	8746	0.1029%	0.0114%	0.0000%	0.0800%	0.0000%	0.0000%	0.1029%	0.0114%	0.0000%	0.0800%	0.0000%	0.0000%
POQUOSON CITY	9635	0.0934%	0.0000%	0.0000%	0.0934%	0.0000%	0.0000%	0.0934%	0.0000%	0.0000%	0.0934%	0.0000%	0.0000%
ROANOKE CITY	66083	0.0817%	0.0015%	0.0000%	0.0666%	0.0000%	0.0000%	0.0817%	0.0015%	0.0000%	0.0666%	0.0000%	0.0000%
RUSSELL COUNTY	19240	0.1091%	0.0000%	0.0000%	0.1040%	0.0000%	0.0000%	0.1091%	0.0000%	0.0000%	0.1040%	0.0000%	0.0000%


Potential duplicate registrants in VA voter list

Post author By Jonathan Lareau
Post date May 27, 2023

I previously documented the use of the Hamming string distance measure to identify candidate pairs of duplicate registrants in voter lists. While that was a good first attempt at quantifying the number of potential duplicates in the voter rolls, the Hamming distance metric is less than ideal for reasons discussed below and in the previous article. I have since updated the processing functions to use the more complete Levenshtein distance (LD) metric and made some improvements to the parsers and other code utilities, but otherwise the analysis followed the same procedure, which is described below.

Using the 2022-11-23 Registered Voter List (RVL) and the 2023-01-26 Voter History List (VHL) purchased from the VA Department of Elections (ELECT) I wrote up an analysis script to check for potentially duplicated registrant records in the RVL and cross reference duplicate pairings with the VHL to identify potential duplicate votes. The details are summarized below.

Please note that I will not publish voter Personally Identifiable Information (PII) on this blog. I have substituted fictitious PII for all examples given below, and cryptographically hashed all voter information in the downloadable results file. I will make the detailed information available upon request to those who have the authorization to receive and process voter data (contact us).

Summary of Results:

As a baseline, there were 6,464 records for STATUS=’Active’ registrants that adhered to the definition of a “duplicate” when Social Security Number (SSN) is not available, as defined by the MOU between DMV and ELECT (section 7.3) of having the same First Name + Last Name + Full Date of Birth (DOB). I’ve included a copy of the MOU between the VA DMV and ELECT at the end of this article for reference. It should be noted that most records held by DMV and ELECT have a SSN associated with them (or at least they should). SSN information is not distributed as part of the data purchased from ELECT, however, so this is the appropriate standard baseline for this work.

Upgrading our definition of a potential duplicate to [First + Middle + Last + Suffix + DOB] and using a LevenshteinDistance=0 drops the number of potential duplicates to 1,982, with each identified registrant in a pair having an exactly matching string result and unique voter ID numbers.

According to my derivations and simulations that are described in detail here, we should only expect to see an average of 11 (+/- 3) potential duplicate pairs (a.k.a. “collisions”) at a distance of 0. The observed count is more than two orders of magnitude larger than that expectation. Such a discrepancy deserves further investigation and verification.

Allowing for a single string difference by setting LevenshteinDistance<=1 increases the pool of potential duplicates to 5,568. While this relaxation of the filter does allow us to find certain issues (described below) it also increases our chances of finding false positives as well. The LD metric results should not be viewed as a final determination, but as simply a useful tool to make an initial pass through the data and find candidate matches that still require further review, verification and validation.

Increasing to LevenshteinDistance<=2 brings the number of potential duplicate pairs up to 32,610 (28,884 when restricted to Active records). When we increase to LD <= 3 we get an explosion to 183,130 potential duplicate pairs (164,302 Active only).

Method:

For every entry in the latest RVL, I performed a string distance comparison, based on Levenshtein distance, between every possible pair of strings of (FIRST NAME + MIDDLE NAME + LAST NAME + SUFFIX + FULL DOB). For the ~6M different RVL entries, we therefore need to compute ~3.8 x 10^13 different string comparisons, and each string comparison can require upwards of 75 x 75 individual character comparisons, meaning the total number of character operations is on the order of 2 x 10^17 (roughly 200 quadrillion), not including logging and I/O.
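The order-of-magnitude estimate above can be reproduced with a few lines of arithmetic. This sketch assumes a round figure of exactly 6 million records and a 75-character upper bound on string length; the ~3.8 x 10^13 quoted in the text corresponds to a record count slightly above 6 million. It also shows that counting each unordered pair only once would roughly halve the work.

```python
n_records = 6_000_000   # "~6M" in the text; assumed round figure
max_len = 75            # assumed upper bound on the combined string length

# Comparing every record against every record (N^2), versus each
# unordered pair exactly once (N * (N - 1) / 2).
ordered_pairs = n_records ** 2
unordered_pairs = n_records * (n_records - 1) // 2

# A full Levenshtein table for two 75-character strings has 75 x 75 cells.
cells_per_comparison = max_len * max_len

total_ops = ordered_pairs * cells_per_comparison

print(f"ordered pairs:   {ordered_pairs:.3e}")
print(f"unordered pairs: {unordered_pairs:.3e}")
print(f"character ops:   {total_ops:.3e}")
```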

A distance of 0 indicates that the strings being compared are identical. A distance of 1 indicates that a single character can be changed, inserted or removed to convert one string into the other. A distance of 2 indicates that 2 modifications are required, etc.

Example: The string pair of “ALISHA” –> “ALISHIA” has an LD of 1, corresponding to the addition of an “I” before the final “A”.
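For readers who want to try the distance measure themselves, here is a minimal sketch of the standard dynamic-programming Levenshtein distance in Python. It is a textbook formulation written for illustration, not the optimized routine used in the analysis.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

print(levenshtein("ALISHA", "ALISHIA"))
```

Only two rows of the table are kept in memory at a time, so the memory use grows with the shorter string rather than with the product of both lengths.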

I aggregated all of the Levenshtein distance pairings that were less than or equal to 3 characters different in order to identify potential (key word) duplicated registrants, and additionally for each pairing looked at the voter history information for each registrant in the pair to determine if there was a potential (again … key word) for multiple ballots to be cast by the same person in any given election. As we allow for more characters to be different, we potentially are including many more likely false positive matches, even if we are catching more true positives.
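The cross-reference against voter history can be pictured as a set intersection: for each candidate pair, look up the elections credited to each voter ID and keep the ones that appear for both. The sketch below assumes a hypothetical in-memory layout (a dictionary from voter ID to a set of election labels) and made-up IDs; the real VHL format and the exact counting rules behind the reported ballot totals are not reproduced here.

```python
# Hypothetical voter history: voter ID -> elections with a recorded vote.
history = {
    "A1": {"2020 General", "2022 General"},
    "B2": {"2020 General"},
    "C3": {"2021 General"},
}

# Hypothetical candidate pairs produced by the string-distance pass.
candidate_pairs = [("A1", "B2"), ("A1", "C3")]

def shared_elections(history, id_a, id_b):
    """Elections in which both voter IDs have a recorded vote."""
    return history.get(id_a, set()) & history.get(id_b, set())

# Keep only the pairs with at least one election in common.
flagged = {}
for id_a, id_b in candidate_pairs:
    common = shared_elections(history, id_a, id_b)
    if common:
        flagged[(id_a, id_b)] = sorted(common)

print(flagged)
```

A flagged pair is still only a candidate for review, exactly as with the string matches themselves.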

For example: At a distance of 4 the strings of “Dave Joseph Smith M 10/01/1981” and “Tony Joseph Smith M 10/01/1981” at the same address would produce a potential match, but so would “Davey Joseph Smith M 10/01/1981” and “David Josiph Smith M 10/02/1981”. The first pair is more likely to be a false positive due to twins, while the second is more likely to be due to typos, mistakes, or use of nicknames and might warrant further investigation. A much stronger potential match would be something like “David Josiph Smith M 10/01/1981” and “David Joseph Smith M 10/01/1981”, with a distance of 1 at the same address. In an attempt to limit false positives, I have clamped the distance checks to <= 3 in this analysis.

The Levenshtein distance measure is importantly able to identify potential insertions or deletions as well as character changes, which is an improvement over the Hamming distance measure. This is illustrated by the following pairing: “David Joseph Smith M 10/01/1981” and “Dave Joseph Smith M 10/01/1981”. The change from “id” to “e” in the first name removes a character, shifting every subsequent character in the string by one position. The Levenshtein metric correctly returns a small distance of 2, whereas the Hamming distance returns 27.
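The contrast between the two measures can be checked directly. Hamming distance is strictly defined only for strings of equal length, so this sketch assumes the comparison is made position by position over the overlapping part of the two strings; counting the unmatched trailing character as well would give one more. The Levenshtein function here is a compact memoized recursion chosen for brevity, not for speed.

```python
from functools import lru_cache

def hamming_overlap(a: str, b: str) -> int:
    """Count positions that differ, over the overlapping length only
    (an assumption: zip stops at the end of the shorter string)."""
    return sum(1 for ca, cb in zip(a, b) if ca != cb)

def levenshtein(a: str, b: str) -> int:
    """Edit distance via memoized recursion on prefix lengths."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j  # insert the remaining j characters
        if j == 0:
            return i  # delete the remaining i characters
        return min(
            d(i - 1, j) + 1,                           # deletion
            d(i, j - 1) + 1,                           # insertion
            d(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # substitution
        )
    return d(len(a), len(b))

rec_a = "David Joseph Smith M 10/01/1981"
rec_b = "Dave Joseph Smith M 10/01/1981"

print(hamming_overlap(rec_a, rec_b), levenshtein(rec_a, rec_b))
```

Once the first name loses a character, almost every later position is misaligned, which is why the positional count balloons while the edit distance stays small.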

Note that with the official records obtained from ELECT, and in accordance with the laws of VA, I do not have access to the social security number or driver’s license number for each registration record, which would help in distinguishing true duplicate errors from things like twins, etc. I only have the first name, middle name, last name, suffix, month of birth, day of birth, year of birth, gender, and address information to work with. I can therefore only take things so far before someone else (with investigative authority and the ability to access those other fields) would need to step in and confirm and validate these findings.

Results:

The summary totals are as follows, with detailed examples.

	DMV_ELECT MOU Standard	LD <= 0	LD <= 1	LD <= 2	LD <= 3
Number of Potential Duplicate Registrant Pairs	7,586 (0.12%)	2,472 (0.04%)	6,620 (0.11%)	32,610 (0.53%)	183,130 (2.99%)
Number of Potential Duplicate Registrant Pairs (Active Only)	6,464 (0.11%)	1,982 (0.03%)	5,568 (0.10%)	28,884 (0.50%)	164,302 (2.85%)
Number of Potential Duplicate Ballots	6,362	112	3,576	37,028	236,254
Number of Potential Duplicate Ballots (Active Only)	6,228	110	3,542	36,434	232,394

Examples of Types of Issues Observed:

NOTE THE BELOW INFORMATION HAS HAD THE VOTER PERSONALLY IDENTIFIABLE INFORMATION (“PII”) FICTIONALIZED. WHILE THESE ARE BASED ON REAL DATA TO ILLUSTRATE THE DIFFERENT TYPES OF OBSERVATIONS, THEY DO NOT REPRESENT REAL VOTER INFORMATION.

Example #1: The following set of records has the exact match (distance = 0) of full name and full birthdate (including year), but different address and different voter ID numbers AND there was a vote cast from each of those unique voter ID’s in the 2020 General Election. While it’s remotely possible that two individuals share the exact same name, month, day and year of birth … it is probabilistically unlikely (see here), and should warrant further scrutiny.

Voter Record A:

AMY BETH McVOTER 12/05/1970 F 12345 CITIZEN CT

Voter Record B:

AMY BETH McVOTER 12/05/1970 F 5678 McPUBLIC DR

Example #2: This set of records has a single character different (distance of 1) in their first name, but middle name, last name, birthdate and address are identical AND both records are associated with votes that were cast in the 2020, 2021, and 2022 November General Elections. While it is possible that this is a pair of 23 year old twins (with same middle names) that live together, it at least bears looking into.

Voter Record A:

TAYLOR DAVID VOTER 02/16/2000 M 6543 OVERLOOK AVE NW

Voter Record B:

DAYLOR DAVID VOTER 02/16/2000 M 6543 OVERLOOK AVE NW

Example #3: This set of records has two characters different (distance of 2) in their birthdate, but name and address are identical AND the birth years are too close together for a child/parent relationship, AND both records are associated with votes that were cast in the 2020 and 2022 November General Elections.

Voter Record A:

REGINA DESEREE MACGUFFIN 02/05/1973 F 123 POPE AVE

Voter Record B:

REGINA DESEREE MACGUFFIN 03/07/1973 F 123 POPE AVE

Example #4: This set of records has again a single character different (distance of 1) in the first name (but not the first letter this time) and the last name, birthdate and address are identical. There were also multiple votes cast in the 2019 and 2022 November General from these registrants.

Voter Record A:

EDGARD JOHNSON 10/19/1981 M 5498 PAGELAND BLVD

Voter Record B:

EDUARD JOHNSON 10/19/1981 M 5498 PAGELAND BLVD

Example #5: This set of records has two characters different (distance of 2) in the first and middle names and the last name, birthdate, gender and address are identical. There were also multiple votes cast in the 2021 and 2022 November General from these registrants. Again it is possible that these records represent a set of twins given the information that ELECT provides.

Voter Record A:

ALANA JAVETTE THOMPSON 01/01/2003 F 123 CHARITY LN

Voter Record B:

ALAYA YAVETTE THOMPSON 01/01/2003 F 123 CHARITY LN

Example #6: The following set of records has the exact match (distance = 0) of full name and full birthdate (including year), and the same address but different voter ID numbers. There were no duplicated votes in the same election detected between the two ID numbers.

Voter Record A:

JAMES TIBERIUS KIRK 03/22/2223 M 1701 Enterprise Bridge

Voter Record B:

JAMES TIBERIUS KIRK 03/22/2223 M 1701 Enterprise Bridge

Example #7: The following set of records has the exact match (distance = 0) of full name and full birthdate (including year), the same address but different gender and voter ID numbers. There were no duplicated votes in the same election detected between the two ID numbers.

Voter Record A:

MAXWELL QUAID CLINGER 11/03/2004 M 4077 MASH DR

Voter Record B:

MAXWELL QUAID CLINGER 11/03/2004 U 4077 MASH DR

Example #8: The following set of records has a single punctuation character different, with the same address but different voter ID numbers. There were no duplicated votes in the same election detected between the two ID numbers.

Voter Record A:

JOHN JACOB JINGLHIEMER-SCHMIDT 06/29/1997 M 12345 JACOBS RD

Voter Record B:

JOHN JACOB JINGLHIEMER SCHMIDT 06/29/1997 M 12345 JACOBS RD

Results Dataset:

A full version of the aggregated data is provided below as a CSV file. However, all voter information (ID, first name, middle name, last name, DOB, gender, address) has been removed and replaced by a one-way hash value, with randomized salt, based on the voter ID. The full file with specific voter information can be provided upon request to parties authorized by ELECT to receive and process voter information, Election Officials, or Law Enforcement.
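To illustrate what a salted one-way hash of a voter ID looks like, here is a minimal Python sketch. The hash function actually used for the published file is not stated, so SHA-256 and a 16-byte random salt are assumptions made for this example, and the voter IDs shown are made up.

```python
import hashlib
import secrets

def make_hasher():
    """Return a function that maps a voter ID to a salted SHA-256 digest.
    The salt is generated once, kept private, and never published."""
    salt = secrets.token_bytes(16)  # assumed salt size

    def hash_id(voter_id: str) -> str:
        return hashlib.sha256(salt + voter_id.encode("utf-8")).hexdigest()

    return hash_id

hash_id = make_hasher()

# The same ID always maps to the same token within one run, so pairs in
# the results file can still be linked without exposing the ID itself.
token_a = hash_id("000123456")
token_b = hash_id("000123456")
token_c = hash_id("000123457")
print(token_a == token_b, token_a == token_c)
```

Because the salt is random and private, the published tokens cannot be reproduced by simply hashing a list of known voter IDs.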

20221123-VA-RVL-String-Distance.csv

The MOU between the VA Department of Elections (ELECT) and the VA Department of Motor Vehicles (DMV) is also provided below for reference. Section 7.3 defines the minimal standards for determining a match when no social security number is present.

MOU-between-DMV-and-The-Virginia-Department-of-Elections-2021 Download


Derivation of Expected number of String Collisions in VA Registered Voter Data

Post author By Jonathan Lareau
Post date May 27, 2023

Below I present the theory and derivation of how I arrived at the expected value of 11 collisions (+/- 3) mentioned in my posts discussing string distance analysis (here and here). I’ve tried to make the derivation below as digestible as possible, with accessible references, but it is admittedly still a very technical read. I think it’s important to “show my work” on the subject, though, so I present it here and am happy to take comments and criticism (contact).

Q: How much of a chance do we actually have of getting an exact (Hamming distance of 0) collision in the full name and full date of birth? Well, there is a similar and well-known probability puzzle that asks how many randomly chosen people you need in order to have approximately a 50% chance that 2 of them share the same birthday (not including the year of birth). This is known as the “Birthday Problem” in probability theory, and rather surprisingly, the answer is that you only need about 23 people in your population sample to have a 50% probability that 2 of those people will share a day-of-year of birth. To quote the Wikipedia article on the matter “… While it may seem surprising that only 23 individuals are required to reach a 50% probability of a shared birthday, this result is made more intuitive by considering that the birthday comparisons will be made between every possible pair of individuals. With 23 individuals, there are 23 × 22/2 = 253 pairs to consider, far more than half the number of days in a year.” The same mathematics of the birthday problem is the basis of the Birthday Attack cryptographic exploit, and it is therefore a well-studied problem in cryptography and cyber security.
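The classic result is easy to verify numerically. The sketch below computes the exact probability under the usual simplifying assumptions (365 equally likely birthdays, independent people) by multiplying out the chance that each successive person misses all earlier birthdays.

```python
def prob_shared(n: int, slots: int = 365) -> float:
    """Probability that at least two of n people share one of `slots`
    equally likely values (the classic birthday problem)."""
    p_all_unique = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k values already taken.
        p_all_unique *= (slots - k) / slots
    return 1.0 - p_all_unique

# Number of distinct pairs among 23 people, as in the quoted passage.
pairs_among_23 = 23 * 22 // 2

print(pairs_among_23)
print(round(prob_shared(22), 4), round(prob_shared(23), 4))
```

The probability crosses 50% between 22 and 23 people, which is where the familiar figure of 23 comes from.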

Figure 1: The computed probability of at least two people sharing a birthday versus the number of people. A recreation of the classic “Birthday Problem”.
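
As a minimal sketch of the direct computation behind the classic result (not the code used to generate the figure), the probability can be built up as a running product under the uniform, IID assumption. The variable names here are illustrative only.

```matlab
% Direct computation of the classic birthday problem (uniform, IID assumption).
N = 365;                 % number of equally likely birthdays
k = 1:100;               % group sizes to evaluate
% P(no shared birthday among k people) = prod_{i=0}^{k-1} (1 - i/N)
pNoMatch = cumprod(1 - (k - 1) / N);
pMatch = 1 - pNoMatch;   % P(at least one shared birthday)
kHalf = find(pMatch >= 0.5, 1);   % smallest group size reaching 50%
fprintf('Smallest group with P >= 0.5: %d (P = %.4f)\n', kHalf, pMatch(kHalf));
% prints: Smallest group with P >= 0.5: 23 (P = 0.5073)
```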

Now, as interesting as the toy birthday problem described above is, it is oversimplified for the problem we are looking at here. Firstly, the problem setup assumes independent and identically distributed random variables (i.e. an “IID” set of variables). While this is not exactly the case for real data, the IID assumption provides for a computable first order estimate, and in the case of the classical birthday problem the estimate has been shown to be fairly accurate under experimental conditions.

Secondly, when we start additionally considering the year of birth, or sharing of first names, middle names and last names, things get much more complicated to compute, but the method is the same. We want to determine the probability of 2 people sharing the same First Name, Middle Name, Last Name, Suffix, Month-of-Birth, Day-of-Birth and Year-of-Birth in the population of unique registrants in the Registered Voter List. This means that in addition to the 365 day-of-birth possibilities, we need to estimate the number of possible years to cover, the number of possible first names, the number of possible middle names, the number of possible last names, the number of possible suffix strings and then include these possibilities into the same formulation as the birthday problem setup.

For determining how many years we should cover, I will simply use the average life expectancy of approximately 79 years. We can therefore update our N value of the birthday problem from 365 to 365 * 79 = 28835. When we perform the same analysis as the standard birthday problem with just this new parameter included, we end up needing approximately 200 people in our sample population to have a 50% probability of 2 people having a match.

Figure 2: The computed probability of at least two people sharing a birthday versus the number of people in the sample population. A recreation of the classic “Birthday Problem”, but we’ve updated the analysis to include the year of birth, and assumed the average life expectancy of 79 years. This moves the 50% crossover point to a population size of 200 from 23 for the standard Birthday Problem setup. [Edit: On 2025-05-13 this plot was corrected to the correct plot. The previous version had repeated the plot of Figure 1.]

A similar analysis can be done with the number of names being considered, etc. For each (assumed independent and uniform) variable we add to the setup, we multiply the number of possible states (N) by the number of unique variable settings.

We can estimate the universe of possible names using the frequentist method from the RVL data itself: We know that we have 6,127,859 unique voter IDs in the RVL, and there are 14 unique SUFFIX entries, 291,368 unique FIRST names, 405,591 unique MIDDLE names, and 465,185 unique LAST names. Multiplying these out gives 365 x 79 x 14 x 291368 x 405591 x 465185 = 2.22 x 10^22 potential states to consider.
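
The arithmetic above can be reproduced with a short sketch that shows how the state count N grows as each variable is added. The counts are the ones quoted in the text, and the labels are illustrative only.

```matlab
% Counts as quoted in the text (the 79-year span is the post's assumption).
factorNames  = {'days', 'years', 'suffix', 'first', 'middle', 'last'};
factorCounts = [365, 79, 14, 291368, 405591, 465185];
runningN = cumprod(factorCounts);   % state-space size after adding each variable
for i = 1:numel(factorCounts)
    fprintf('%-7s x%-7d -> N = %.3e\n', factorNames{i}, factorCounts(i), runningN(i));
end
N = runningN(end);                  % roughly 2.22e22 possible states
```

Note that the final value is larger than 2^53, so it is held only approximately in double precision. That is fine for an order-of-magnitude estimate like this one.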

Now unfortunately, as we start dealing with bigger and bigger N values, it becomes harder and harder for computers to maintain the precision needed to carry out the direct computation, eventually resulting in Infinite or divide-by-zero answers as the probabilities get smaller and smaller. So let’s begin by first determining where the 50% crossover point falls when the sample population is fixed at the number of unique voter IDs. We find that the probability of at least one collision falls to roughly 50% once we assume only about 410 unique First, Middle, and Last names (each).

Figure 3: The computed probability of at least two people sharing a first name, middle name, last name, suffix, month-of-birth, day-of-birth, year-of-birth versus the number of people in the sample population. This assumes the Nyears = 79, Nsuffix = 14, Nfirst = 410, Nmiddle = 410, Nlast = 410.

As we increase the number of unique (first, middle, last) names under consideration, we find that we very quickly reduce the probability to near zero (again … this is assuming an IID set of variables … more on that later). In fact we only need to assume that there are 1300 unique first names, middle names and last names before the probability drops to under 1%. This is two full orders of magnitude below the actual number of unique first names, middle names and last names (each) that we find by simple examination of the RVL file, so the actual probability of a collision under these conditions should be much, much, much lower. While not exactly zero, it is computationally indistinguishable from zero given machine precision. Note (again) that this formulation is still simplified in that it assumes a uniform distribution within the N possible states, but it still serves to give a first order approximation and sanity check.

Figure 4: The computed probability of at least two people sharing a first name, middle name, last name, suffix, month-of-birth, day-of-birth, year-of-birth versus the number of people in the sample population. This assumes the Nyears = 79, Nsuffix = 14, Nfirst = 1300, Nmiddle = 1300, Nlast = 1300.

As we start approaching the limit of computational precision we have to resort to approximation methods for computing the very small, but non-zero probability of collision given the actual number of unique first, middle and last names observed in the RVL dataset. We can use the first-order Taylor series expansion of the exponential (valid when k is small relative to N) in order to do this, and our equation for computing the probability becomes: Pb = 1 – exp(-k*(k-1) / (2 *N)).
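
For readers who want the intermediate step, here is the standard argument behind this approximation, written with the same k and N. Pa denotes the probability of no collision and Pb the probability of at least one. It relies on 1 - x ≈ exp(-x) for small x and is offered as a sketch of the usual textbook derivation.

```latex
\begin{align*}
P_a &= \prod_{i=0}^{k-1}\left(1 - \frac{i}{N}\right)
     \approx \prod_{i=0}^{k-1} e^{-i/N}
     = \exp\!\left(-\frac{1}{N}\sum_{i=0}^{k-1} i\right)
     = \exp\!\left(-\frac{k(k-1)}{2N}\right), \\
P_b &= 1 - P_a \approx 1 - \exp\!\left(-\frac{k(k-1)}{2N}\right).
\end{align*}
```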

Replicating our earlier example in Figure 4 above with Nfirst == Nmiddle == Nlast == 1300 to show the comparison of the Taylor expansion to the explicit computation produces the graphic in Figure 5 below. We see that the small value approximation is close, but slightly over-estimates the directly computed probability for IID variables.

Figure 5: The computed probability of at least two people sharing a first name, middle name, last name, suffix, month-of-birth, day-of-birth, year-of-birth versus the number of people in the sample population. This assumes the Nyears = 79, Nsuffix = 14, Nfirst = 1300, Nmiddle = 1300, Nlast = 1300.

When we perform this Taylor series approximation and look for the number of records required in order to obtain a 50% probability that any 2 records would match given our updated universe of possible matches, we end up requiring K = 176,000,000,000, or 176 Billion records. When we again try to evaluate the Taylor series for the explicit number of unique Voter IDs present in the RVL file, which is just over 6M, we again obtain a number that is computationally indistinguishable from 0. (To be absolutely meticulous … it’s a bigger number that is indistinguishable from 0 than we previously computed, but it is still indistinguishable from zero.)
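
A minimal sketch of evaluating the approximation numerically is given below, assuming the state count and population size quoted in the text. It uses the built-in expm1 so that a very small probability is not lost to rounding when subtracting from 1, and it inverts the same formula (treating k*(k-1) as k^2) to estimate the 50% crossover. This is an illustration, not the code used for the figures.

```matlab
% Small-probability approximation Pb ~ 1 - exp(-k*(k-1)/(2*N)), evaluated
% with expm1 so that tiny probabilities are not rounded away to zero.
N = 365 * 79 * 14 * 291368 * 405591 * 465185;   % state count quoted in the text
k = 6127859;                                     % unique voter IDs in the RVL
pbApprox = -expm1(-k * (k - 1) / (2 * N));       % equals 1 - exp(-x)
% Invert the same formula (with k*(k-1) ~ k^2) for the 50% crossover size.
kHalf = sqrt(2 * N * log(2));
fprintf('Approximate Pb at k = %d: %.3e\n', k, pbApprox);
fprintf('Approximate k for Pb = 0.5: %.3e\n', kHalf);
```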

Another implementation note: In order to explicitly code the above direct computations we also need some clever tricks with logarithms in order to avoid numerical overflow / underflow issues as much as possible. The formula for computing the permutations, N! / (N-K)! = N x (N-1) x … x (N-K+1), can have numerical issues when N becomes large. However, if we take the base-10 logarithm of the equation, we can use the product and quotient rules of logarithms to compute the result and avoid numerical overflow: log10( N! / (N-K)! ) = log10(N!) – log10((N-K)!) = log10(N) + log10(N-1) + … + log10(N-K+1), which is a much more stable computation.

We can perform a similar trick in order to compute the denominator of N^K by using the power property of logarithms such that log10( N^K ) = K x log10(N).

You must of course remember to reverse the logarithm once you’ve computed the log-sums. So the final computation of Pb becomes the following:

Vnr = log10(N) + log10(N-1) + … + log10(N-K+1), where N is the number of possible states, N = 365 x Nyears x Nfirst x Nmiddle x Nlast x Nsuffix, and K is the number of records.

Vt = K x log10(N)

Pa_log10 = Vnr – Vt = log10(Pa), where Pa = [N! / (N-K)!] / N^K is the probability of no collision.

Pb = 1 – 10^(Pa_log10)
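
The log-sum steps above can be sketched as follows, using the same names (Vnr, Vt, Pa_log10, Pb). The classic birthday setup is used as a sanity check, since its answer is known. This is an illustrative sketch rather than the code used for the figures.

```matlab
% Log-domain evaluation of Pb = 1 - [N!/(N-K)!] / N^K, following the steps above.
N = 365;  K = 23;                     % classic birthday setup as a sanity check
Vnr = sum(log10(N - (0:K-1)));        % log10(N) + log10(N-1) + ... + log10(N-K+1)
Vt  = K * log10(N);                   % log10(N^K)
Pa_log10 = Vnr - Vt;                  % log10 of P(no collision)
Pb = 1 - 10^Pa_log10;                 % P(at least one collision)
fprintf('Pb for N = %d, K = %d: %.4f\n', N, K, Pb);
% prints: Pb for N = 365, K = 23: 0.5073
```

One caveat with this sketch: for N near 2.22 x 10^22 the spacing between adjacent double-precision values is in the millions, so the individual terms N-1, N-2, … are no longer distinguishable from N. That is one more reason to fall back on the approximation for very large N.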

Updating from uniform distributions to non-uniform distributions

So what happens when we take into account the fact that names and birthdays are not uniformly distributed? (e.g. the last name of “Smith” is more frequent than “Sandeval”) This fact increases the probability of a collision occurring in the dataset. This increase also makes intuitive sense as we can anecdotally observe that coincident names and birthdates, while still rare … do actually happen in real life with common names.

However, in the non-uniform case we don’t have nearly as nice a closed-form set of formulas for computing the probability. What we can do instead to estimate the probability is perform a number of Monte Carlo simulations of selecting K values from the weighted possibilities, and determine how many collisions occurred in each simulation trial. By setting K equal to the number of unique Voter ID values in the RVL dataset, we can directly answer, via simulation, the question of “how many collisions of First+Middle+Last+Suffix+DOB should we expect when looking at the VA Registered Voter List file?”

We can determine the weightings for each variable easily enough from the distributions of unique values in the data itself.
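
As a small illustration of how such weightings can be derived, the sketch below counts occurrences of each unique value in a toy column and normalizes the counts. The names in the list are hypothetical stand-ins, not values taken from the RVL.

```matlab
% Toy stand-in for one RVL column (hypothetical values, not real data).
lastNames = {'SMITH'; 'JONES'; 'SMITH'; 'LEE'; 'SMITH'; 'JONES'};
[uniqueNames, ~, idx] = unique(lastNames);                 % idx maps each record to its unique value
counts  = accumarray(idx(:), 1, [numel(uniqueNames), 1]);  % occurrences of each unique value
weights = counts / sum(counts);                            % empirical probability of each value
for i = 1:numel(uniqueNames)
    fprintf('%-6s count = %d, weight = %.3f\n', uniqueNames{i}, counts(i), weights(i));
end
```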

The MATLAB weightedCollisionSim(…) function below is a program that can be used to perform this analysis. It assumes that the RVL table object is a global variable when setting up the trials, and uses the MATLAB built-in randsample(…) function to perform each draw.

After 100 simulation runs, the results are that for the K=6,127,859 unique voter IDs in the RVL, we should expect to have an average of about 11 collisions at Hamming distance of 0, with a standard deviation of roughly 3.

I will note that as a validation and verification step, the MATLAB simulation code below, when used with uniform sampling, produces similar results to what we analytically derived above.

function [p,m,s] = weightedCollisionSim(k,ntrials,varargin)
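% Monte Carlo estimate of collisions among k weighted random draws.
% Called with no arguments, it builds the weights from the global RVL table.
% To compute the probability, ntrials must be >> 1:
% [p,m,s] = weightedCollisionSim(k,ntrials,values1,weights1,...,valuesN,weightsN)
% [p,m,s] = weightedCollisionSim(k,ntrials,Nvalues1,weights1,...,NvaluesN,weightsN)
%
% OUTPUTS:
% p = Probability of a collision (fraction of trials with at least one)
% m = mean number of collisions per trial
% s = standard deviation of collisions per trial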

if nargin == 0
    global rvl; % Assume the RVL is an available global var

    ntrials = 100; % Number of trials
   
    % Population size set as num of unique voter IDs in RVL
    npop = numel(unique(rvl.IDENTIFICATION_NUMBER));

    % Convert the DOB strings to datetime objects
    dob = datetime(rvl.DOB);

    % How many unique days of the year are there?
    [ud,uda,udb] = unique(day(dob,'dayofyear'));
    % How often do they occur?
    nud = accumarray(udb,1,[numel(ud),1]);
    Ndays = numel(ud);

    % How many unique years of birth are there?
    [uy,uya,uyb] = unique(year(dob));
    % How often do they occur?
    nuy = accumarray(uyb,1,[numel(uy),1]);
    Nyears = numel(uy);

    % How many unique suffix strings are there?
    [us,usa,usb] = unique(rvl.SUFFIX);
    % How often do they occur?
    nus = accumarray(usb,1,[numel(us),1]);
    Nsuffix = numel(us);

    % How many unique first names are there?
    [uf,ufa,ufb] = unique(rvl.FIRST_NAME);
    % How often do they occur?
    nuf = accumarray(ufb,1,[numel(uf),1]);
    Nfirst = numel(uf);

    % How many unique middle names are there?
    [um,uma,umb] = unique(rvl.MIDDLE_NAME);
    % How often do they occur?
    num = accumarray(umb,1,[numel(um),1]);
    Nmiddle = numel(um);

    % How many unique last names are there?
    [ul,ula,ulb] = unique(rvl.LAST_NAME);
    % How often do they occur?
    nul = accumarray(ulb,1,[numel(ul),1]);
    Nlast = numel(ul);
        
    % Initializing the weighting vectors
    w0 = nus;
    w1 = nud;
    w2 = nuy;
    w3 = nuf;
    w4 = num;
    w5 = nul;

    % Recursively compute results and return
    [p,m,s] = weightedCollisionSim(npop,ntrials,1:Nsuffix,w0,1:Ndays,w1,1:Nyears,w2,...
        1:Nfirst,w3,1:Nmiddle,w4,1:Nlast,w5);
    return
end

if nargin < 2 || isempty(ntrials)
    ntrials = 1;
end

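nc = zeros(ntrials,1); % Number of collisions found in each trial
for j = 1:ntrials
    fprintf('Trial %d\n',j);
    % One row per simulated registrant, one column per variable
    y = zeros(k,numel(varargin)/2);
    col = 1; % Column index (kept separate from the output m)
    for i = 1:2:numel(varargin)
        v = varargin{i};
        w = varargin{i+1};
        if ~isempty(w) && isvector(w)
            % Non-uniform weightings
            y(:,col) = randsample(v,k,true,w);
        else
            % Uniform sampling
            y(:,col) = randsample(v,k,true);
        end
        col = col+1;
    end
    % Count the distinct simulated records that occur more than once
    [u,~,ib] = unique(y,'rows');
    nu = accumarray(ib,1,[size(u,1),1]);
    nc(j) = sum(nu > 1);
end
p = mean(nc>0); % Fraction of trials with at least one collision
m = mean(nc);   % Mean number of collisions per trial
s = std(nc);    % Standard deviation of collisions per trial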
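
As a rough illustration of the uniform-sampling validation mentioned above, the following self-contained sketch compares a simulated collision probability against the direct formula for the classic birthday setup. It does not call weightedCollisionSim or randsample, and the trial count is an arbitrary choice for illustration.

```matlab
% Self-contained uniform-sampling check against the analytic birthday result.
N = 365;  k = 23;  ntrials = 4000;
hasCollision = false(ntrials, 1);
for t = 1:ntrials
    draws = randi(N, k, 1);                        % k uniform draws with replacement
    hasCollision(t) = numel(unique(draws)) < k;    % true if any state repeats
end
pSim = mean(hasCollision);                         % simulated P(at least one collision)
pExact = 1 - prod(1 - (0:k-1) / N);                % direct formula
fprintf('Simulated P = %.3f, analytic P = %.3f\n', pSim, pExact);
```

The simulated value varies from run to run, but with a few thousand trials it should land within a couple of percentage points of the analytic value.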
