Data resuscitation – extracting machine-readable information from a million paper runsheets using computer vision and human ingenuity
Verma, A; Blanchard, IE; Potvin, G; Buckeridge, D; de Montigny, L
Introduction: Traditionally, paramedics documented clinical information on paper “runsheets.” Some services continue to use paper, while others that have transitioned to electronic reporting have archives that are not machine-readable. The result is the effective loss of data invaluable for research and quality improvement, especially to assess temporal changes. In the case of one Canadian EMS system, 1 031 346 scanned runsheets have been archived for the 2015–2020 period, of which only 20% have been processed by human abstractors. Processing the remaining files would require 7.8 person-years at a cost of $230 000.
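The backlog figures above can be turned into per-sheet terms with simple arithmetic. The sketch below is only a rough check that treats the 20% figure as exact; the variable names are illustrative and not part of the project.

```python
# Figures quoted in the Introduction.
TOTAL_SHEETS = 1_031_346      # scanned runsheets archived for 2015-2020
FRACTION_DONE = 0.20          # share already processed by human abstractors
PERSON_YEARS = 7.8            # estimated effort to abstract the remainder
MANUAL_COST = 230_000         # estimated cost of that effort, in dollars

# Sheets still to be processed, assuming the 20% figure is exact.
remaining = round(TOTAL_SHEETS * (1 - FRACTION_DONE))

# Implied manual throughput and unit cost.
sheets_per_person_year = remaining / PERSON_YEARS
cost_per_sheet = MANUAL_COST / remaining

print(f"remaining sheets:        {remaining:,}")
print(f"sheets per person-year:  {sheets_per_person_year:,.0f}")
print(f"manual cost per sheet:   ${cost_per_sheet:.2f}")
```

Under these assumptions, about 825 000 sheets remain, which implies roughly 106 000 sheets per person-year and about $0.28 per sheet for manual abstraction.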
Objective: To describe the process and estimate the cost and accuracy of extracting vital signs and clinical interventions from digital images of runsheets using machine learning (ML).
Methods: After each scan was aligned to a baseline image using computer-vision algorithms, the sections showing vital signs (25 freehand boxes and 65 checkboxes) and paramedical interventions (51 checkboxes) were cropped out. We manually labeled the checkbox images for 1000 random runsheets and used a human-in-the-loop strategy to label the vital-sign images. Using 80% of the human-labeled data, we fit a multi-label convolutional neural network (CNN) to the possible recorded values for each of the vital signs, and one single-label CNN for checkboxes. We used the remaining 20% of data to measure accuracy.
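As an illustration of two of the steps described above, the following minimal sketch crops fixed regions from an already-aligned scan and makes a reproducible 80/20 split. It is not the project's code: the template coordinates, the field names, the toy image, and the choice to split at the runsheet level are all assumptions made for the example, and the alignment step itself is not shown.

```python
import random

def crop(image, box):
    """Return the sub-image inside box = (top, left, height, width)."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]

def split_runsheets(sheet_ids, train_fraction=0.8, seed=0):
    """Shuffle runsheet IDs reproducibly and split them into train/test lists."""
    ids = sorted(sheet_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]

# Hypothetical template: field name -> (top, left, height, width) in pixels.
# Once every scan is aligned to the baseline image, the same coordinates
# can be reused for every runsheet.
TEMPLATE = {"pulse_box": (0, 0, 2, 3), "oxygen_checkbox": (2, 3, 2, 2)}

# Toy 4x5 "aligned scan"; each pixel value encodes its own row and column.
page = [[10 * r + c for c in range(5)] for r in range(4)]
crops = {name: crop(page, box) for name, box in TEMPLATE.items()}

# 1000 labeled runsheets -> 800 for fitting, 200 held out for accuracy.
train_ids, test_ids = split_runsheets(range(1000))
```

Splitting by runsheet rather than by individual box keeps all crops from one form on the same side of the split, which is one way to avoid leaking a paramedic's handwriting into the test set; the abstract does not say which unit was used.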
Results: The project was completed in four months at a cost of $29 000. The full process was implemented in an automated, reusable pipeline (Apache Airflow); image preparation took 24 hours of computation, while fitting each model took 5.5 hours. Vital-sign digits were extracted with an accuracy of 96–99% depending on the vital sign. Excluding empty boxes (nearly always correctly classified), checkbox accuracy was 99%.
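Reporting checkbox accuracy with empty boxes excluded matters because empty boxes dominate the form and would inflate the overall figure. The sketch below shows one way to compute both numbers; it assumes that "empty" refers to the human label, and the label names and toy data are hypothetical.

```python
def accuracy(y_true, y_pred, exclude=None):
    """Share of correct predictions, skipping items whose true label is `exclude`."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != exclude]
    if not pairs:
        raise ValueError("no boxes left to score")
    return sum(t == p for t, p in pairs) / len(pairs)

# Toy held-out set: 6 empty boxes and 4 checked boxes (hypothetical labels).
truth = ["empty"] * 6 + ["checked"] * 4
pred = ["empty"] * 6 + ["checked", "checked", "checked", "empty"]

overall = accuracy(truth, pred)                        # all 10 boxes
marked_only = accuracy(truth, pred, exclude="empty")   # the 4 checked boxes only

print(f"overall: {overall:.2f}, excluding empty: {marked_only:.2f}")
```

In this toy case the overall accuracy is 0.90 while the accuracy on marked boxes is 0.75, which shows how a large share of easy empty boxes can mask errors on the boxes that carry information.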
Conclusion: It is feasible to efficiently and accurately extract machine-readable data from handwritten numbers and checkboxes on paper-based runsheets. The process did not require specialized hardware, but did require expertise in ML and in EMS data collection. The project did not process other sections of the runsheet, such as chief complaint, which will be the focus of future research.