OpenTox Virtual Conference 2021 Session 15
Towards Automating Information Extraction with FIDDLE: From Text Annotation to Interoperable Information Extraction via Machine Learning
Systematic review, already a cornerstone of evidence-based medicine, has recently gained significant popularity in several other disciplines including environmental health and evidence-based toxicology. One critical and time-consuming process that must occur during systematic review is the extraction of relevant qualitative and quantitative raw data from the free text of scientific documents. The specific data types extracted differ among disciplines, but within a given scientific domain, certain data points are extracted repeatedly for each review that is conducted.
To that end, Sciome is conducting research and development of a semi-automated data extraction platform for use in this context. We are focusing our research on three specific aims. First, we are developing a “software 2.0” version of our PDF text extraction software, which uses deep learning, image processing, and natural language processing (NLP) to convert binary PDF documents into machine-readable raw text. Second, we have developed a web-based platform designed to allow users to efficiently annotate text with the entities, groups, and relations of interest for a given data domain. Finally, we are using the resulting datasets, which can be exported in a number of popular annotation formats, to build high-quality neural machine learning models for automated information extraction and normalization.
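As a concrete illustration of the kind of export described above, the sketch below serializes entity and relation annotations to BRAT standoff format, one widely used annotation format. This is a minimal, hypothetical example rather than FIDDLE's actual export code; the entity types ("Chemical", "Endpoint"), the relation type, and the sample sentence are assumptions made for illustration.

```python
# Minimal sketch: serializing span annotations to BRAT standoff (.ann) lines.
# Entity/relation types and the example text are hypothetical.

from dataclasses import dataclass

@dataclass
class Entity:
    id: str      # e.g. "T1"
    label: str   # annotation type, e.g. "Chemical"
    start: int   # character offset into the source text (inclusive)
    end: int     # character offset (exclusive)
    text: str    # surface string covered by the span

@dataclass
class Relation:
    id: str      # e.g. "R1"
    label: str   # relation type, e.g. "Measures"
    arg1: str    # id of the first entity argument
    arg2: str    # id of the second entity argument

def to_brat(entities: list[Entity], relations: list[Relation]) -> str:
    """Render annotations as BRAT standoff lines."""
    lines = [f"{e.id}\t{e.label} {e.start} {e.end}\t{e.text}" for e in entities]
    lines += [f"{r.id}\t{r.label} Arg1:{r.arg1} Arg2:{r.arg2}" for r in relations]
    return "\n".join(lines)

if __name__ == "__main__":
    text = "Rats were exposed to benzene and liver weight was recorded."
    ents = [
        Entity("T1", "Chemical", 21, 28, text[21:28]),   # "benzene"
        Entity("T2", "Endpoint", 33, 45, text[33:45]),   # "liver weight"
    ]
    rels = [Relation("R1", "Measures", "T2", "T1")]
    print(to_brat(ents, rels))
```

Datasets exported this way can be consumed directly by most open-source NER and relation-extraction training pipelines, which is what makes a character-offset standoff representation a convenient interchange point between annotation and model building.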
Because accurate data extraction is a challenging problem and current methods rarely achieve 100% accuracy, all of the resulting methods will be integrated into a “human-in-the-loop” system that combines machine and human intelligence in a manner that is superior to using either in isolation. The system will: highlight extracted terms in a PDF; automatically populate extraction forms with the extracted data; allow humans to intervene and correct the results; and learn from those corrections to continually update the model. Since extraction workflows vary among organizations and users, the system allows easy import and export of data at several intermediate steps. Furthermore, by defining and supporting standardized interfaces for the various information processing tasks, our system is designed to facilitate the incorporation of extraction components developed by external providers and academic research groups. Our overarching goal is to translate emerging semi-automated extraction technologies out of the lab and into practical software.
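To make the idea of standardized, pluggable extraction components more concrete, the sketch below shows one way such an interface could look: each component returns candidate spans with character offsets, which the human-in-the-loop front end can then highlight and route for review. The Extractor protocol, Span fields, and the toy regex-based dose extractor are hypothetical assumptions for illustration, not FIDDLE's actual API.

```python
# Illustrative sketch of a pluggable extraction-component interface.
# All names here are hypothetical; this is not FIDDLE's real interface.

import re
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Span:
    label: str   # entity type, e.g. "Dose"
    start: int   # character offset in the document text
    end: int
    text: str

class Extractor(Protocol):
    def extract(self, text: str) -> list[Span]:
        """Return candidate spans; a human reviewer accepts or corrects them."""
        ...

class RegexDoseExtractor:
    """Toy baseline component: finds dose-like mentions such as '5 mg/kg'."""
    _pattern = re.compile(r"\b\d+(?:\.\d+)?\s*mg/kg\b")

    def extract(self, text: str) -> list[Span]:
        return [
            Span("Dose", m.start(), m.end(), m.group())
            for m in self._pattern.finditer(text)
        ]

def run_pipeline(text: str, components: list[Extractor]) -> list[Span]:
    """Collect suggestions from every registered component for human review."""
    spans: list[Span] = []
    for component in components:
        spans.extend(component.extract(text))
    return spans

if __name__ == "__main__":
    doc = "Animals received 5 mg/kg of the test article daily for 28 days."
    for span in run_pipeline(doc, [RegexDoseExtractor()]):
        print(span)
```

Under an interface of this shape, an externally developed neural model and a simple rule-based baseline can be registered side by side, and reviewer corrections collected against their suggestions can feed back into retraining without changing the surrounding workflow.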