Analyse Short Astrophysical Texts
This tool is part of a pilot study demonstrating how a research text can be matched to a pool of analysis tools described using an appropriate ontology and stored in a knowledge graph. Based on the wavelength range and the type of phenomenon extracted from the text, it suggests possible follow-up analysis tools from MMODA to study the identified astronomical objects.
Pipeline
The analysis process consists of three main steps:
1. Entity extraction
2. Text vectorization
3. Follow-up tool prediction
Entity Extraction
To extract entities from the input text, we implemented two methods:
- Regular Expressions (REGEX)
  - ontology of telescopes and observatories.
  - IVOA ontology of astronomical source types.
  - patterns inspired by typical source name formats.
- Language Model (astroBERT)
  - A language model trained to perform Named-Entity Recognition (NER) on astrophysical texts.
Extracted entities:
- Right Ascension (R.A.) and Declination (Dec.)
- Names of astronomical sources
- Classes of astronomical sources (extracted via REGEX and by querying the source names on SIMBAD, TNS, and FINK)
- Telescopes, instruments, observatories, surveys
- Wavelengths (extracted via astroBERT only)
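The REGEX route can be illustrated with a minimal sketch. The two patterns below are hypothetical and written only for this example; they are not the patterns the tool ships with, and they cover just sexagesimal coordinates and a few common name prefixes.

```python
import re

# Hypothetical patterns for illustration; the tool's real patterns may differ.
# Sexagesimal coordinates such as "RA = 19:13:03.5, Dec = +19:46:24.2".
RA_DEC = re.compile(
    r"\bR\.?A\.?\s*[=:]?\s*(?P<ra>\d{1,2}[:\s]\d{2}[:\s]\d{2}(?:\.\d+)?)"
    r".{0,20}?"
    r"Dec\.?\s*[=:]?\s*(?P<dec>[+-]?\d{1,2}[:\s]\d{2}[:\s]\d{2}(?:\.\d+)?)",
    re.IGNORECASE,
)

# Source names made of a prefix and a designation, e.g. "GRB 221009A".
SOURCE_NAME = re.compile(r"\b(?:GRB|SN|AT)\s?\d{4,6}[A-Za-z]{0,4}\b")


def extract_entities(text):
    """Return detected (R.A., Dec.) pairs and source names from a text."""
    positions = [(m.group("ra"), m.group("dec")) for m in RA_DEC.finditer(text)]
    names = sorted(set(SOURCE_NAME.findall(text)))
    return {"positions": positions, "source_names": names}
```

In the actual pipeline, the detected names would then be queried on SIMBAD, TNS, and FINK to obtain source classes; that step is omitted here.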
Text Vectorization
After entity extraction, each text is embedded into a vector of size 59. This vector includes:
- 41 source classes
  - selected from 226 IVOA classes, using the hierarchy of classes and subclasses developed by IVOA, see this file.
- 9 telescope types
  - Radio
  - Infrared
  - Optical
  - Ultraviolet
  - X-ray
  - Gamma-ray
  - Cosmic-ray
  - Gravitational wave
  - Neutrino telescopes
- 9 MMODA tools
  - the MMODA tools directly linked to telescopes (9 tools as of December 2024).
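The layout of the 59-element vector can be sketched as below. This is an illustration under stated assumptions: the ordering of the three blocks, the binary (multi-hot) encoding, and the placeholder names for source classes and MMODA tools are all assumed here and are not taken from the tool's code.

```python
# Assumed layout: [41 source classes | 9 telescope types | 9 MMODA tools].
# Source class and tool names are placeholders, not the tool's real labels.
SOURCE_CLASSES = [f"source_class_{i}" for i in range(41)]
TELESCOPE_TYPES = [
    "radio", "infrared", "optical", "ultraviolet", "x-ray",
    "gamma-ray", "cosmic-ray", "gravitational-wave", "neutrino",
]
MMODA_TOOLS = [f"mmoda_tool_{i}" for i in range(9)]

LABELS = SOURCE_CLASSES + TELESCOPE_TYPES + MMODA_TOOLS
INDEX = {label: i for i, label in enumerate(LABELS)}


def vectorize(detected_labels):
    """Encode detected labels as a 59-element 0/1 vector (assumed encoding)."""
    vector = [0] * len(LABELS)
    for label in detected_labels:
        if label in INDEX:  # labels outside the vocabulary are ignored
            vector[INDEX[label]] = 1
    return vector
```

For example, a text mentioning one source class, a gamma-ray telescope, and one MMODA tool would yield a vector with exactly three entries set to 1.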
Follow-up Tool Prediction
To suggest a follow-up analysis tool, we developed a Convolutional Neural Network (CNN) trained on vectorized texts.
The training pairs correspond to (first, follow-up) texts from ATels and GCN Circulars.
Based on the CNN output vector we generate direct links to the relevant MMODA tools.
Input
The tool accepts the following inputs:
- A selector for the origin of the text: ATel, GCN Circular or other.
- The corresponding ATel or GCN Circular number to fetch the text from the online archive.
- Alternatively, a short astrophysical text can be provided directly.
If a custom text is given, it takes precedence and the tool will skip fetching from external sources.
In both cases, a unique identifier (e.g. the circular number) is required to label the input and structure the outputs.
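The precedence rule described above can be sketched as follows. The function name `resolve_input` and the `fetch_text` callable are hypothetical and introduced only for this illustration; the tool's own fetching code is not shown here.

```python
def resolve_input(origin, identifier, custom_text=None, fetch_text=None):
    """Pick the text to analyse; a custom text wins over an archive lookup.

    `fetch_text` is a hypothetical callable (origin, identifier) -> str that
    stands in for the tool's archive access.
    """
    if identifier is None or str(identifier).strip() == "":
        raise ValueError("A unique identifier is required to label the input.")

    # A custom text takes precedence: nothing is fetched.
    if custom_text is not None and custom_text.strip():
        return {"id": str(identifier), "origin": origin,
                "text": custom_text, "fetched": False}

    # Only ATels and GCN Circulars can be fetched by number.
    if origin not in ("ATel", "GCN Circular"):
        raise ValueError("Texts of other origin must be provided directly.")
    if fetch_text is None:
        raise ValueError("No fetcher supplied for the archive lookup.")

    return {"id": str(identifier), "origin": origin,
            "text": fetch_text(origin, identifier), "fetched": True}
```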
Output
The tool produces 9 tables:
- table_astrobert_results: All entities detected by astroBERT.
- table_source_classes: All detected source classes.
- table_source_positions: All sources with known positions and all detected positions.
- table_sources: All detected source names.
- table_telescopes: All detected telescopes, instruments, observatories, and surveys.
- table_unknown_sources: Source names that could not be found in SIMBAD, TNS, or FINK.
- table_vectorized_text: Contains:
  a) the input vector of the CNN as the vectorized text
  b) the output vector of the CNN
- table_vectorized_url: All generated MMODA tool URL vectors based on the CNN output vector. The URL vectors are obtained as follows:
  a) Each URL vector represents a single astrophysical source, i.e., the corresponding source classes are encoded as 1s at the appropriate indices in a 59-sized vector.
  b) For each astrophysical source, we create different URL vectors corresponding to different instruments/tools from MMODA or telescope types. For tools that are not part of the 59-sized vector, we create URL vectors that have values of 1 only at the positions corresponding to the telescope type. Examples:
    1) For SPI-ACS, the values of the vector at the positions of SPI-ACS, INTEGRAL and gamma-ray are 1.
    2) For Auger, the value of the vector at the position of cosmic-ray is 1.
- table_vectorized_url_scores: Scores for each possible MMODA URL generated.
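Each URL score, shown in table_vectorized_url_scores, is computed as the dot product between the normalized CNN output vector and the normalized URL vector, where each vector is divided by its norm as returned by numpy.linalg.norm:

$$\mathrm{score} = \frac{\mathbf{v}_{\mathrm{CNN}}}{\lVert \mathbf{v}_{\mathrm{CNN}} \rVert} \cdot \frac{\mathbf{v}_{\mathrm{URL}}}{\lVert \mathbf{v}_{\mathrm{URL}} \rVert}$$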
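The score computation can be sketched with NumPy as follows. The function name `url_score` is hypothetical, and returning 0.0 for an all-zero vector is an assumption made here to avoid a division by zero; the tool may handle that case differently.

```python
import numpy as np


def url_score(cnn_output, url_vector):
    """Dot product of the two vectors after dividing each by its norm."""
    cnn = np.asarray(cnn_output, dtype=float)
    url = np.asarray(url_vector, dtype=float)

    cnn_norm = np.linalg.norm(cnn)
    url_norm = np.linalg.norm(url)

    # Assumption: an all-zero vector gets a score of 0.0 instead of NaN.
    if cnn_norm == 0.0 or url_norm == 0.0:
        return 0.0

    return float(np.dot(cnn / cnn_norm, url / url_norm))
```

With the default settings of numpy.linalg.norm this is the cosine similarity of the two vectors, so a URL vector whose non-zero positions coincide with the strongest CNN outputs scores close to 1.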
For Developers
A) New tools on the MMODA platform
To include tools newly added to the MMODA platform, modify the JSON file that links a telescope type to a MMODA tool. If a tool is removed from the MMODA platform, update the same file. However, be very careful with tools that have a direct connection to an instrument (see aux_functions.py), since changes to these have not been tested.
C) Change the number of instruments or telescope types in the input/output vector
This modification requires more extensive changes, in the following files: aux_functions.py, pipeline_vectorize_text.py, and predict_vectorised_text.py. In addition, the CNN should be retrained on the new types of vectors.