Clinical NLP datasets

Sunday, 8 March 2020 at 22:58

Datasets useful for clinical Natural Language Processing (NLP)

For the past year I have been working on NLP models for information extraction from clinical texts. Finding annotated, in-domain datasets that are actually available has been challenging, so below I list some resources I have found useful, grouped by data source and tagged by NLP task, in the hope that these may be useful to other lonely souls navigating the desolate oceans of big data.

Data source: Biomedical articles/papers

These datasets contain text that is quite different from clinical text. However, they are publicly available and, in my experience, their vocabulary overlaps with the clinical vocabulary of interest.

  • Genia: Part of Speech, Parsing

  • CRAFT: Part of Speech, Parsing

  • Elsevier: Part of Speech, Parsing. Medicine publications (421 sentences). The tags used may be non-standard.

  • MedMentions: Named Entities

Data source: Clinical data

None of the clinical datasets below are publicly available. Most can be requested by researchers by following the instructions on the corresponding dataset webpage.

  • Thyme/MiPACQ Treebank: Parsing, Named Entities

  • i2b2: Named Entities, Relations, Negation + Uncertainty

  • Fan et al: Parsing. This is a subset of the i2b2 data annotated for constituency parsing, as described in the corresponding paper. NOTE: Click on files and download wordfreak_annotation_files.zip, since the default download is a single file. The downloaded files contain annotations with text offsets that need to be linked to the corresponding files in the i2b2 dataset.

  • n2c2: Named Entities. An extension to the i2b2 dataset that links entity spans to UMLS concepts, released as part of the 2019 n2c2 challenge on clinical concept normalisation. The data should be available on request (at some point in 2020).

  • Bioscope: Negation + Uncertainty. See the paper introducing the dataset. The dataset is constructed from 3 different sources: abstracts from the Genia corpus, full scientific articles, and clinical free text. The clinical part consists of the radiology report corpus from the CMC clinical coding challenge.

  • MIMIC-III: Raw text, Document labels. In addition to a large amount of raw clinical text, MIMIC contains documents annotated with ICD-9 codes.

  • MIMIC-CXR: Raw text. This release contains radiology reports with their corresponding images. Some relevant code resources are available on GitHub.
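For the Fan et al annotations above, linking offsets back to the i2b2 text amounts to slicing the source document with each annotation's character offsets. Below is a minimal sketch of that step, not the actual loading code: the Span type, the attach_text helper and the toy record are all hypothetical, and parsing the WordFreak files into (start, end, label) triples is assumed to have happened already.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical standoff annotation: character offsets plus a label."""
    start: int  # inclusive character offset into the source document
    end: int    # exclusive character offset
    label: str  # annotation label, e.g. a constituent tag


def attach_text(document: str, spans: list[Span]) -> list[tuple[Span, str]]:
    """Pair each standoff span with the text it covers in the source document."""
    linked = []
    for span in spans:
        # Offsets that fall outside the text usually mean the wrong source file.
        if not (0 <= span.start < span.end <= len(document)):
            raise ValueError(f"span {span} falls outside the document")
        linked.append((span, document[span.start:span.end]))
    return linked


# Toy stand-ins: a real record would be read from the i2b2 release,
# and the spans parsed from the downloaded annotation files.
record = "Patient denies chest pain."
annotations = [Span(0, 7, "NP"), Span(15, 25, "NP")]

for span, text in attach_text(record, annotations):
    print(span.label, repr(text))
```

Failing loudly on out-of-range offsets is a deliberate choice here: it is a cheap way to notice that an annotation file has been paired with the wrong i2b2 document.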

Additional resources can be found on the Clinical NLP Workshop page.

Thanks to Hang Dong for his feedback and for pointing out additional resources.