Analyse Short Astrophysics Texts

This tool is part of a pilot study demonstrating how a research text can be matched to a pool of analysis tools described using an appropriate ontology and stored in a knowledge graph. Based on the wavelength range and the type of phenomenon extracted from the text, it suggests possible follow-up analysis tools from MMODA to study the identified astronomical objects.

Pipeline

The analysis process consists of three main steps:

  1. Entity extraction
  2. Text vectorization
  3. Follow-up tool prediction

Entity Extraction

To extract entities from the input text, we implemented two methods:

  1. Regular Expressions (REGEX):

    a) ontology of telescopes and observatories.

    b) IVOA ontology of astronomical source types.

    c) patterns inspired by typical source name formats.

  2. Language Model (astroBERT): a language model trained to perform Named-Entity Recognition (NER) on astrophysical texts.
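As an illustration of the REGEX method, the following minimal sketch extracts sexagesimal coordinates and two common source-name formats from a text. The patterns and the function name `extract_entities` are assumptions made for this example; they are not the expressions used by the tool.

```python
import re

# Hypothetical patterns, for illustration only (not the tool's actual expressions).
# Sexagesimal R.A. such as "05:34:31.9" or "05h34m31.9s"; the lookbehind
# rejects signed values so that declinations are not picked up as R.A.
RA_PATTERN = re.compile(
    r"(?<![+\-\d:.])\d{1,2}[h:]\s?\d{1,2}[m:]\s?\d{1,2}(?:\.\d+)?s?"
)
# Sexagesimal Dec. such as "+22:00:52" or "-05d23m28s"; the sign is required here.
DEC_PATTERN = re.compile(r"[+-]\d{1,2}[d:]\s?\d{1,2}[m:]\s?\d{1,2}(?:\.\d+)?s?")
# Source names in two typical formats, e.g. "GRB 221009A" or "SN 2023ixf".
NAME_PATTERN = re.compile(r"\b(?:GRB\s?\d{6}[A-Z]?|SN\s?\d{4}[A-Za-z]{1,3})\b")


def extract_entities(text: str) -> dict:
    """Return the R.A., Dec. and source-name strings matched in `text`."""
    return {
        "ra": RA_PATTERN.findall(text),
        "dec": DEC_PATTERN.findall(text),
        "names": NAME_PATTERN.findall(text),
    }
```

Patterns this simple will also match unrelated strings such as timestamps, which is one reason to cross-check the extracted names against catalogues.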

Extracted entities:

  • Right Ascension (R.A.) and Declination (Dec.)
  • Names of astronomical sources
  • Classes of astronomical sources (extracted via REGEX and by querying the source names on SIMBAD, TNS, and FINK)
  • Telescopes, instruments, observatories, surveys
  • Wavelengths (extracted via astroBERT only)

Text Vectorization

After entity extraction, each text is embedded into a vector of size 59. This vector includes:

  1. 41 source classes, selected from 226 IVOA classes, using the hierarchy of classes and subclasses developed by IVOA, see this file.
  2. 9 telescope types: Radio, Infrared, Optical, Ultraviolet, X-ray, Gamma-ray, Cosmic-ray, Gravitational wave, Neutrino telescopes.
  3. 9 MMODA tools: the MMODA tools directly linked to telescopes (9 tools as of December 2024).
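The layout of the 59-element vector (41 + 9 + 9) can be sketched as follows. This is a minimal illustration under stated assumptions: the source-class and tool labels are placeholders, the ordering of the three blocks is assumed, and a binary encoding is assumed. The function name `vectorize` is hypothetical.

```python
import numpy as np

# Placeholder labels: the real 41 source classes and 9 MMODA tools are not listed here.
SOURCE_CLASSES = [f"source_class_{i:02d}" for i in range(41)]
TELESCOPE_TYPES = [
    "radio", "infrared", "optical", "ultraviolet", "x-ray",
    "gamma-ray", "cosmic-ray", "gravitational-wave", "neutrino",
]
MMODA_TOOLS = [f"mmoda_tool_{i}" for i in range(9)]

# Assumed ordering: source classes, then telescope types, then MMODA tools.
LABELS = SOURCE_CLASSES + TELESCOPE_TYPES + MMODA_TOOLS
INDEX = {label: i for i, label in enumerate(LABELS)}


def vectorize(detected_labels):
    """Encode detected labels as a 59-element vector with 1s at matching indices."""
    vector = np.zeros(len(LABELS))
    for label in detected_labels:
        if label in INDEX:  # labels outside the 59 slots are ignored
            vector[INDEX[label]] = 1.0
    return vector
```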

Follow-up Tool Prediction

To suggest a follow-up analysis tool, we developed a Convolutional Neural Network (CNN) trained on vectorized texts.
The training pairs correspond to (first, follow-up) texts from ATels and GCN Circulars.

Based on the CNN output vector, we generate direct links to the relevant MMODA tools.


Input

The tool accepts the following inputs:

  • A selector for the origin of the text: ATel, GCN Circular or other.
  • The corresponding ATel or GCN Circular number to fetch the text from the online archive.
  • Alternatively, a short astrophysical text can be provided directly.
    If a custom text is given, it takes precedence and the tool will skip fetching from external sources.

In both cases, a unique identifier (e.g. the circular number) is required to label the input and structure the outputs.
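The precedence rule for the inputs can be sketched as below. The function `resolve_input_text` and the `fetch` callable are hypothetical names introduced for this example; how the tool actually retrieves ATels and GCN Circulars is not shown here.

```python
from typing import Callable, Optional


def resolve_input_text(
    origin: str,
    number: str,
    custom_text: Optional[str],
    fetch: Callable[[str, str], str],
) -> str:
    """Return the text to analyse; a non-empty custom text skips the fetch."""
    if custom_text and custom_text.strip():
        # A custom text takes precedence over the online archive.
        return custom_text
    if origin.lower() == "other":
        # Assumption: origin "other" has no archive, so a text must be supplied.
        raise ValueError("A custom text is required when the origin is 'other'.")
    # Otherwise fetch the ATel or GCN Circular by its number.
    return fetch(origin, number)
```

Passing the fetcher as an argument keeps the sketch independent of any particular archive client and makes the precedence rule easy to test.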


Output

The tool produces 9 tables:

  1. table_astrobert_results — All entities detected by astroBERT.
  2. table_source_classes — All detected source classes.
  3. table_source_positions — All sources with known positions and all detected positions.
  4. table_sources — All detected source names.
  5. table_telescopes — All detected telescopes, instruments, observatories, and surveys.
  6. table_unknown_sources — Source names that could not be found in SIMBAD, TNS, or FINK.
  7. table_vectorized_text — contains:

    a) the input vector of the CNN as the vectorized text

    b) the output vector of the CNN

  8. table_vectorized_url — All generated MMODA tool URL vectors based on the CNN output vector. The URL vectors are obtained as follows:

    a) Each URL vector represents a single astrophysical source, i.e., the corresponding source classes are encoded as 1s at the appropriate indices in a 59-sized vector.

    b) For each astrophysical source, we create different URL vectors corresponding to different instruments/tools from MMODA or telescope types. For tools that are not part of the 59-sized vector, the URL vector has values of 1 only at the positions corresponding to the telescope type. Two examples: (1) for SPI-ACS, the vector is 1 at the positions of SPI-ACS, INTEGRAL and gamma-ray; (2) for Auger, the vector is 1 only at the position of cosmic-ray.

  9. table_vectorized_url_scores — Scores for each possible MMODA URL generated.
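The construction of URL vectors described for table_vectorized_url can be sketched as follows, using the SPI-ACS and Auger examples. This is an illustration under stated assumptions: the label list is shortened to six hypothetical entries instead of 59, the source-class labels are invented for the example, and the source-class and tool entries are assumed to be combined in a single vector.

```python
import numpy as np

# Hypothetical, shortened label list; the real vector has 59 entries.
LABELS = ["pulsar", "gamma-ray-burst", "gamma-ray", "cosmic-ray", "INTEGRAL", "SPI-ACS"]
INDEX = {label: i for i, label in enumerate(LABELS)}

# Positions set to 1 for each tool, following the two examples in the text.
TOOL_POSITIONS = {
    "SPI-ACS": ["SPI-ACS", "INTEGRAL", "gamma-ray"],
    "Auger": ["cosmic-ray"],  # not in the vector itself, so only its telescope type
}


def url_vector(source_classes, tool):
    """Build one URL vector for a single source and a single tool."""
    vector = np.zeros(len(LABELS))
    for label in list(source_classes) + TOOL_POSITIONS[tool]:
        vector[INDEX[label]] = 1.0
    return vector
```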

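Each URL score, shown in table_vectorized_url_scores, is computed as the dot product between the normalized CNN output vector and the normalized URL vector:

    score = (v_CNN / ||v_CNN||) · (v_URL / ||v_URL||)

where v_CNN is the CNN output vector, v_URL is the URL vector, and ||·|| is the norm computed with numpy.linalg.norm.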
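A minimal sketch of this score in Python is given below. The function name `url_score` and the handling of all-zero vectors are assumptions made for the example; with the default Euclidean norm of `numpy.linalg.norm`, the result is the cosine similarity of the two vectors.

```python
import numpy as np


def url_score(cnn_output, url_vector):
    """Dot product of the two vectors after dividing each by its norm."""
    cnn_output = np.asarray(cnn_output, dtype=float)
    url_vector = np.asarray(url_vector, dtype=float)
    norm_product = np.linalg.norm(cnn_output) * np.linalg.norm(url_vector)
    if norm_product == 0.0:
        # Assumption: an all-zero vector gets a score of 0 instead of NaN.
        return 0.0
    return float(np.dot(cnn_output, url_vector) / norm_product)
```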


For Developers

A) New tools on the MMODA platform

To include newly added tools on the MMODA platform, modify the JSON file that links a telescope type to an MMODA tool. If a tool is removed from the MMODA platform, the same file should be updated. Take particular care with tools that have a direct connection to an instrument (see aux_functions.py), since changing them has not been tested.
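As a rough illustration of such an update, the sketch below edits a mapping from telescope type to a list of tools. The JSON content, its schema, and the helper names `add_tool` and `remove_tool` are all hypothetical; the actual file in the repository may be structured differently.

```python
import json

# Hypothetical content: the real file's name and schema may differ.
EXAMPLE_MAPPING = """
{
  "gamma-ray": ["SPI-ACS"],
  "cosmic-ray": ["Auger"]
}
"""


def add_tool(mapping, telescope_type, tool):
    """Link `tool` to `telescope_type`, creating the entry if needed."""
    tools = mapping.setdefault(telescope_type, [])
    if tool not in tools:
        tools.append(tool)
    return mapping


def remove_tool(mapping, tool):
    """Remove `tool` from every telescope type it is linked to."""
    for tools in mapping.values():
        if tool in tools:
            tools.remove(tool)
    return mapping


mapping = json.loads(EXAMPLE_MAPPING)
```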

C) Change the number of instruments or telescope types in the input/output vector

This modification requires more extensive changes in the following files: aux_functions.py, pipeline_vectorize_text.py, and predict_vectorised_text.py. In addition, the CNN must be retrained on the new vector layout.