A collection of Python scripts for the automatic search of biomedical publications, followed by the extraction and processing of information about their authors, using the PubMed database.
The goal of this project is to identify individual actors in the public debate of scientific issues and to determine their position within science. It specifically aims to establish:
- whether an actor can be considered a contributing expert in a given subject area, as demonstrated through published research papers
- whether an actor can be considered a contributing expert who, however, has not published research papers in the given subject area
See: Python Installation und Ausführung (setup and run instructions, in German)
- Add topics (`topics.csv`)
- Add actors in a CSV file
- Run `get_actors_publications.py`
  - Collects all publications for every actor (via PubMed)
  - Filters publications (via `journal_ranking.csv`)
  - Generates an author file for every actor (one row = one publication)
- Review the generated author files
  - Particularly the values `Active` and `Authorship Confidence`
- Adjust the input files if necessary
  - Adjust the search string for the topic/actors
  - In case of automatic authorship rating: improve the `Location` and `Institution` lists
  - On changes: execute `get_actors_publications.py` again
- Run the evaluation script
  - Uses all author files
  - Generates a complete list with all authors and their metrics (one row = one author)
Both scripts ask for the topic (see `topics.csv`) upon launch, and whether publications should be retrieved using the author's name or ORCID.
The INPUT folder contains all actor lists and the journal ranking. The most important file here is `topics.csv`, which serves as the configuration file for the scripts.
File format: csv
First row: column names
Following rows: one row = one topic

- `short` – an acronym for the topic, used upon script launch
- `search string` – used to check whether a research paper was published in the given subject area: a PubMed search is initiated that combines the publication ID with this search string, and a hit means the publication belongs to the topic (see the code, and the sketch after this list, for further details)
- `actors list file` – file name of the list of actors in the given subject area
File format: csv
First row: column names
Following rows: one row = one actor

- `AktID` – unique ID of an actor
- `Name` – precise name of the person
- `Position` (optional)
- `Institution` (optional)
- `Label` (optional, e.g. doctor, expert, researcher)
- `InstitutionList` – list of institutions at which the actor published research papers (for automatic authorship evaluation)
- `LocationList` – list of known cities/countries in which the actor published research papers (for automatic authorship evaluation)
- `PubMedSearch` – search string used to identify the person in PubMed; a simple string or a complex query
`journal_ranking.csv` – an export from the Scimago Journal & Country Rank portal.
- `author_[topic]/` subfolder – contains all author files for the specified topic
- `result_[topic].csv` – contains the final evaluation for all authors of the specified topic
File format: csv
First row: column names
Following rows: one row = one publication

- `Active` – 0 or 1; controls whether the publication is considered in the complete evaluation of a subject area
- `Authorship Confidence` – result of the automatic authorship evaluation (see below)
- `ID (LINK)` – ID of the publication in the PubMed database; it is a hyperlink and can be opened with Ctrl+Click
- `Topic` – acronym of the subject area, set if the publication could be assigned to the subject area (via the subject area search string)
- `Title` – title of the publication
- `Citations` – number of citations in other publications
- `Date` – date of publication
- `Author Position` – first/middle/last
- `Co-Author count` – number of coauthors
Because the data in the PubMed database are inconsistent, the algorithm cannot always guarantee that the person (the actor) under consideration is in fact the author of a found research paper rather than, for instance, a person with similar initials.
To address this issue, the script `get_actors_publications.py` provides an automatic evaluation of authorship. It can be turned on or off upon launch.
Several indicators increase the confidence in the authorship. When an indicator applies, its value is added to the `Authorship Confidence` score:

- `+0.4` – the first name matches
- `+0.3` – an institution matches
- `+0.2` – a location matches
The institution and location data are taken from the list of actors: the more data the list contains, the more meaningful the `Authorship Confidence` value becomes.
The `Authorship Confidence` value has a direct impact on the `Active` value: if `Authorship Confidence` equals zero, `Active` is also set to zero. In this case, the authorship needs to be reviewed manually. If the person under consideration is indeed the author of the publication, add the institution and location to the list of actors; this yields a better `Authorship Confidence` value in the next run of the script.
- Conception: Prof. Dr. Markus Lehmkuhl (KIT & FU Berlin), Dr. Evgeniya Boklage (FU Berlin)
- Implementation: Yannick Milhahn (TU Berlin & FU Berlin)
Distributed under GPLv3 License. See LICENSE for more information.