Reviewing the Quality of “Big Data” in automatic data systems: An Example


Tom Koch


In recent decades there has been extraordinary growth in, and acceptance of, automatic data systems that collect official and popular reports of epidemic occurrence. While different systems employ one or another proprietary algorithm to collect and parse disease reports, all include, at a minimum, spatial locators, the date of a report, and the number of individual cases reported. These systems have become increasingly vital in both the study of individual epidemics and the exposition of expanding epidemics in real time. To date, however, there has been little analysis of the nature and quality of the data collected in these “big-net” programs or the degree to which redundancies and uncertainties may limit their utility. Here data on the 2009 H1N1 Type-A influenza epidemic gathered by a single system are parsed to determine where problems exist and how they might be rectified.


How to Cite
KOCH, Tom. Reviewing the Quality of “Big Data” in automatic data systems: An Example. Medical Research Archives, [S.l.], v. 8, n. 9, sep. 2020. ISSN 2375-1924. Date accessed: 27 oct. 2020.
Review Articles


