Getting started with pyglotaran

Pyglotaran is an open-source modeling framework, written in Python and designed for global and target analysis of time-resolved spectroscopy data. It provides a flexible framework for analyzing complex spectrotemporal datasets, supporting a wide range of kinetic models including sequential, parallel, and target analysis schemes.

This getting-started notebook exists to get you started with using pyglotaran, hopefully in 20 minutes or less.

What you are viewing now may be a static rendering of an otherwise interactive notebook. This guide is the most useful if you either follow along in a new notebook, or download the original notebook from the repository.

Click here for more details on notebooks.

A Python notebook, also known as Jupyter Notebook, is an interactive computational environment, where you can run code, explore data and present your results, all in a single file (with the file extension .ipynb).

There are three main ways to run a Jupyter Notebook:

[local] Using Jupyter Notebook Directly
[local|github] Using VS Code with the Python and Jupyter extensions
[cloud] Using Google Colab

For the purpose of this guide, it is assumed you already know how to work with notebooks; else there are plenty of tutorials online.

Preface

If you are going through this guide, you most likely have some dataset burning a hole in your pocket.

Please rest assured, we’ll get to that. But first, we’d like to take you through a typically modeling workflow.

Plotting raw data

The pyglotaran framework has built-in functionality to create useful plots. They are part of the pyglotaran_extras package, and if you followed the installation instructions, you should already have this installed.

In there, we have a number of plotting functions we can import and use in our notebook. Let’s start with plot_data_overview.

[2]:

from pyglotaran_extras import plot_data_overview

plot_data_overview(my_dataset);

../../_images/notebooks_getting_started_getting_started_6_0.svg

Isn’t it pretty? We see our 2D intensity map with the time coordinate (labeled Time (ps)) along the x-axis and the spectral coordinate (labeled Wavelength (nm)) along the y-axis.

Just below that we see our singular value decomposition (SVD) of the data matrix, with the first four (4) singular vectors plotted.

The left most plot (data. LSV) shows the left singular vectors representing the vector’s evolution along the time coordinate.
The right most plot (data. RSV) shows the right singular vectors reflecting the spectral shapes associated with that evolution.
The middle plot shows the first 10 singular values, on a logarithmic scale.

From this we can deduce the rank of our data matrix is three (3), which is how many decay components we’ll need for our model later.

Q: What? How do you know that? And what’s a S.V.D.?

… A: (click here)

Very simply put the Singular Value Decomposition (SVD) is a mathematical technique used to decompose a matrix (our data) into a number of left and right singular vectors and their corresponding ‘weights’ (the singular values). It allows us to quickly visualize the ‘rank’ of the matrix (which gives us a rough approximation of how many decay components we might need in our model to fit the data).

More precisely, the SVD decomposed a (data) matrix into three other matrices. It is often used in signal processing and statistics to identify patterns in data. Specifically, SVD decomposes a matrix (A) into three matrices (U), (Σ), and (V^T) such that (A = UΣV^T). Here, (U) and (V) are orthogonal matrices, and (Σ) is a diagonal matrix containing the singular values. The left singular vectors (columns of (U)) represent the time coordinates, while the right singular vectors (columns of (V)) represent the spectral coordinates. The singular values in (Σ) provide a measure of the importance of each corresponding singular vector.

Put simply: if we take the first left singular vector (LSV), multiply it by the first singular value (SV), and then multiply by the first right singular vector (RSV), we obtain the first approximation of the data matrix. Repeating this process for the second and third singular vectors provides an even better approximation, especially since the rank of this particular matrix appears to be 3. By summing the products of all the left and right singular vectors, each weighted by their corresponding singular value, we can reconstruct the original data matrix.

That’s more than you need to know about Singular Value Decompositions at this time.

Starting a project

Once we have decided that our data is good enough to attempt to model it using pyglotaran we can start our adventure.

To start using pyglotaran in your analysis, you only have to import the Project class and open a project (folder).

[3]:

from glotaran.project import Project

my_project = Project.open("my_project")

If the project does not already exist this will create a new project and its folder structure for you. In our case we had only the models + parameters folders and the data + results folder were created when opening the project.

[4]:

# This so called shell command allows us to 'list' (ls) the content of the project folder.
%ls my_project

data/  models/  parameters/  project.gta  results/

Import the data into your project

As long as your data can be transformed into a well structured xarray Dataset you can import it directly into your project.

Q: Well structured?

A: …

By that we mean that your data is of type xarray.Dataset with a data Data variable (or xarray.DataArray) and has the coordinates time and spectral.

This will then save your data as NetCDF (.nc) file into the data folder inside of your project with the name that you gave it (here my_project/data/my_data.nc).

[5]:

my_project.import_data(my_dataset, dataset_name="my_data")

After importing, our my_project is aware of the data that we named my_data when importing.

Working with models

⚠️ Please note that the exact way in which a model is defined may still change slightly in future versions of pyglotaran. But no worries, there will always be a clear procedure to upgrade any existing models you may have created in the meantime.

After importing our data into the project, to analyse them, we need a model (or analysis scheme).

If it does not already exists, create a file called my_model.yaml in your projects’ models folder and fill it with the following content.

📝 Don’t let this file extension (.yaml or .yml) scare you, it’s just another plain text file, which you can open with literally any text editor.

In our case the file already exists, so we can just show you the content, which you can then copy paste.

[6]:

my_project.show_model_definition("my_model")
# copy the model definition from the output below

[6]:

initial_concentration:
  input:
    compartments: [s1, s2, s3]
    parameters: [input.1, input.0, input.0]

k_matrix:
  k1:
    matrix:
      (s2, s1): kinetic.1
      (s3, s2): kinetic.2
      (s3, s3): kinetic.3

megacomplex:
  m1:
    k_matrix: [k1]
    type: decay

irf:
  irf1:
    type: gaussian
    center: irf.center
    width: irf.width

dataset:
  my_data:
    initial_concentration: input
    megacomplex: [m1]
    irf: irf1

⬆️ copy into models/my_model.yaml

The above model reads (from bottom to top) as:

We have a dataset named my_data, which we model with a single (kinetic) megacomplex m1, with initial_concentration vector input and instrument response function (IRF) irf1.
The IRF irf1 is defined as being of type gaussian with its center location at irf.center and its width to be irf.width
The megacomplex m1 is composed of just a single kinetic matrix k1 of type decay (short for exponential decays)
This k_matrix is composed of just 3 rate constants (kinetic.1, kinetic.2, kinetic.3) describing the sequential kinetic scheme: “s1->s2->s3->ground”
It is sequential because the initial_concentration input defines all of the input (1) going to s1, and none of it going to s2 or s3.

You can check your model for problems with the validate method.

[7]:

my_project.validate("my_model")

[7]:

Your model is valid.

Working with parameters

A model by itself is not sufficient, we also need starting values for the parameters we define in the model.

For this, we use a parameters file. Create a file called my_parameters.yaml in your parameters folder with the following content.

Again, in our case the file already exists, so we just show the content for you to copy.

[8]:

my_project.show_parameters_definition("my_parameters")

[8]:

input:
  - ["1", 1, { "vary": False }]
  - ["0", 0, { "vary": False }]

kinetic: [0.51, 0.31, 0.11]

irf:
  - ["center", 0.31]
  - ["width", 0.11]

This reads as:

There are two input parameters input.1 with (fixed) value 1, and input.0 with (fixed) value 0.
There are 3 kinetics rates, with starting values 0.51, 0.31, 0.11.
There are 2 IRF related parameters irf.center with starting value 0.31 and irf.width with starting value 0.11.

All parameters are implicitly ‘free’, unless specified with { "vary": False } to be fixed.

You can use the validate method, which can check for missing parameters.

[9]:

my_project.validate("my_model", "my_parameters")

[9]:

Your model is valid.

Since not all problems in the model can be detected automatically it is wise to visually inspect the model.

For this purpose, you can just load the model and inspect its markdown rendered version.

[10]:

my_project.load_model("my_model")

[10]:

Model

Dataset Groups

default
- Label: default
- Residual Function: variable_projection

K Matrix

k1
- Label: k1
- Matrix: {(‘s2’, ‘s1’): ‘kinetic.1’, (‘s3’, ‘s2’): ‘kinetic.2’, (‘s3’, ‘s3’): ‘kinetic.3’}

Megacomplex

m1
- Label: m1
- Dimension: time
- Type: decay
- K Matrix: [‘k1’]

Initial Concentration

input
- Label: input
- Compartments: [‘s1’, ‘s2’, ‘s3’]
- Parameters: [‘input.1’, ‘input.0’, ‘input.0’]
- Exclude From Normalize: []

Irf

irf1
- Label: irf1
- Normalize: True
- Backsweep: False
- Type: gaussian
- Center: irf.center
- Width: irf.width

Dataset

my_data
- Label: my_data
- Group: default
- Force Index Dependent: False
- Megacomplex: [‘m1’]
- Initial Concentration: input
- Irf: irf1

The same way you should inspect your parameters.

[11]:

my_project.load_parameters("my_parameters")

[11]:

input:

Label

Value

Standard Error

t-value

Minimum

Maximum

Vary

Non-Negative

Expression

1

1.000e+00

nan

nan

-inf

inf

False

False

None

0

0.000e+00

nan

nan

-inf

inf

False

False

None
irf:

Label

Value

Standard Error

t-value

Minimum

Maximum

Vary

Non-Negative

Expression

center

3.100e-01

nan

nan

-inf

inf

True

False

None

width

1.100e-01

nan

nan

-inf

inf

True

False

None

kinetic:

Label	Value	Standard Error	t-value	Minimum	Maximum	Vary	Non-Negative	Expression
1	5.100e-01	nan	nan	-inf	inf	True	False	`None`
2	3.100e-01	nan	nan	-inf	inf	True	False	`None`
3	1.100e-01	nan	nan	-inf	inf	True	False	`None`

Optimizing - fitting our data

Now we have all the components in place to start our analysis, optimizing the parameters while minimizing the residual, i.e. fitting our model to our data.

[12]:

result = my_project.optimize("my_model", "my_parameters")
result

   Iteration     Total nfev        Cost      Cost reduction    Step norm     Optimality
       0              1         1.1173e+04                                    1.73e+06
       1              2         1.4989e+01      1.12e+04       1.96e-02       1.27e+04
       2              3         7.5689e+00      7.42e+00       5.86e-03       1.01e+03
       3              4         7.5650e+00      3.84e-03       1.68e-05       6.26e-02
       4              5         7.5650e+00      1.51e-11       4.41e-09       6.23e-06
Both `ftol` and `xtol` termination conditions are satisfied.
Function evaluations 5, initial cost 1.1173e+04, final cost 7.5650e+00, first-order optimality 6.23e-06.

[12]:

Optimization Result
Number of residual evaluation	5
Number of residuals	151200
Number of free parameters	5
Number of conditionally linear parameters	216
Degrees of freedom	150979
Chi Square	1.51e+01
Reduced Chi Square	1.00e-04
Root Mean Square Error (RMSE)	1.00e-02

Model

Dataset Groups

default
- Label: default
- Residual Function: variable_projection

K Matrix

k1
- Label: k1
- Matrix: {(‘s2’, ‘s1’): ‘kinetic.1(5.00e-01±6.79e-05, t-value: 7364, initial: 5.10e-01)’, (‘s3’, ‘s2’): ‘kinetic.2(3.00e-01±3.94e-05, t-value: 7620, initial: 3.10e-01)’, (‘s3’, ‘s3’): ‘kinetic.3(1.00e-01±4.23e-06, t-value: 23642, initial: 1.10e-01)’}

Megacomplex

m1
- Label: m1
- Dimension: time
- Type: decay
- K Matrix: [‘k1’]

Initial Concentration

input
- Label: input
- Compartments: [‘s1’, ‘s2’, ‘s3’]
- Parameters: [‘input.1(1.00e+00, fixed)’, ‘input.0(0.00e+00, fixed)’, ‘input.0(0.00e+00, fixed)’]
- Exclude From Normalize: []

Irf

irf1
- Label: irf1
- Normalize: True
- Backsweep: False
- Type: gaussian
- Center: irf.center(3.00e-01±5.04e-06, t-value: 59496, initial: 3.10e-01)
- Width: irf.width(1.00e-01±6.73e-06, t-value: 14864, initial: 1.10e-01)

Dataset

my_data
- Label: my_data
- Group: default
- Force Index Dependent: False
- Megacomplex: [‘m1’]
- Initial Concentration: input
- Irf: irf1

What you see from the optimization ‘log’ is that the optimization took about 5 iterations to converge, going from a cost of 11.2k down to 7.56.

The Optimization Result table gives us a nice ‘statistics’ overview and if we click on details we can view the optimized model.

Since we are satisfied that our fit has converged, we’ll proceed to look at the outcome in more detail.

Speaking of outcomes, please note that each time you run an optimization the result will be saved in the project’s results folder.

You may occasionally want to clean these up, especially with larger projects or datasets.

[13]:

# To view the saved results, look at the content of the project's results folder
%ls "my_project/results"

my_model_run_0000/

One way to look at our results in more detail, is to query the optimized_parameters object in the results.

This will also print (if it can be computed) the standard error for each estimated parameter.

[14]:

result.optimized_parameters

[14]:

input:

Label

Value

Standard Error

t-value

Minimum

Maximum

Vary

Non-Negative

Expression

1

1.000e+00

nan

nan

-inf

inf

False

False

None

0

0.000e+00

nan

nan

-inf

inf

False

False

None

irf:

Label	Value	Standard Error	t-value	Minimum	Maximum	Vary	Non-Negative	Expression
center	3.000e-01	5.042e-06	59496	-inf	inf	True	False	`None`
width	1.000e-01	6.728e-06	14864	-inf	inf	True	False	`None`

kinetic:

Label	Value	Standard Error	t-value	Minimum	Maximum	Vary	Non-Negative	Expression
1	5.000e-01	6.790e-05	7364	-inf	inf	True	False	`None`
2	3.000e-01	3.937e-05	7620	-inf	inf	True	False	`None`
3	1.000e-01	4.230e-06	23642	-inf	inf	True	False	`None`

You can further inspect the result by accessing its data attribute. In our example it only contains a single my_data dataset, but it could contain many datasets in a multi-dataset analysis.

[15]:

result.data

Finally, although knowing your optimization converged, and the optimized parameters seem reasonable is half the battle, the real question is of course … what does it LOOK like. Let’s plot!

Visualize the Result

The results can be visualized in a similar way as the dataset, using a function from the pyglotaran_extras part of the framework, in this case plot_overview.

[16]:

from pyglotaran_extras import plot_overview

plot_overview(result);

../../_images/notebooks_getting_started_getting_started_39_0.svg

In this overview you see the following.

first row:

[left] The concentrations corresponding to the components in the kinetic scheme
[middle] The species associated spectra (SAS), in this case equivalent to the evolution associated spectra (EAS)
[right] The decay associated spectra (DAS)

second row:

[left] The residual matrix, with the IRF (dispersion) curve plotted on top
[middle] The normalized SAS
[right] The normalized DAS

third row: the SVD of the residual matrix, with

[left] The first 2 components (black, red) of the left singular vectors (time)
[left] The first 2 components (black, red) of the right singular vectors (spectral)
[right] The first 10 singular values on a logarithmic scale.

forth row: the SVD of the data matrix, with

[left] The first 4 components (black, red, blue, green) of the left singular vectors (time)
[left] The first 4 components (black, red, blue, green) of the right singular vectors (spectral)
[right] The first 10 singular values on a logarithmic scale.

Conclusion

In this guide we showed how to

plot your data (once it’s in the right format)
start a project, import the data into it
work with models (and a model .yaml file)
specify starting values for our parameters (in a parameters .yaml file)
analyze your data
plot the results

Welcome to the future of global target analysis, we hope you like it here!

And now for something completely different

Right, right. We were going to talk about your data. That .csv file did end up burning a hole in your pocket, didn’t it? 😉

Well, there is a reason we have standerdized around xarray’s Datasets, we know what to expect.

We don’t know that with .csv. 🙈

Some people save something as .csv, but it is space or tab delimited.
Some people put in a header, of a single line, two lines, 4 lines.
Some people put in a footer, of many lines.
Some people pad their data matrix with a zero, some don’t.
Some people include their spectral and time coordinates, some don’t.
Some do, but call it something completely different.
Some people use monotonic increasing coordinates (as one should), but some don’t 😱.

In short, you can never be sure with .csv.

All of that being said … if you had some fairly clean .csv data lying around you wanted to import, you could use the pandas library to read the data and then create an xarray Dataset out of it.

How, is left as an exercise to the reader, but we left some tips below.

Click to reveal tips!

Assuming a csv file with the timepoints in the first row, and the spectral coordinate (e.g. wavelengths) in the fist column.

-2.0,0.0,2.0,10.0,100.0,1000.0
420,0,10,15,5,1,0
520,1,100,25,7,2,0
620,0,25,15,2,1,0

Then this bit of Python code could read in your data.

import pandas as pd
import numpy as np
import xarray as xr
from pathlib import Path
filepath = Path(r"file_that_burned_a_hole_in_your_pocket.csv")
# Load the CSV file into a pandas DataFrame
df = pd.read_csv(filepath, delimiter=',')
# Convert index and columns to numeric, ignoring any non-numeric values
df.index = pd.to_numeric(df.index, errors='coerce')
df.columns = pd.to_numeric(df.columns, errors='coerce')
# Remove any rows or columns that couldn't be converted to numeric
df = df.loc[df.index.notnull(), df.columns.notnull()]
# Extract the coordinates (assuming they were in the .csv file)
timepoints = np.array(df.columns.values).astype(float)
wavelengths = np.array(df.index.values).astype(float)
dataset = xr.DataArray(
        df.values.T,
        dims=["time", "spectral"],
        coords={"time": timepoints, "spectral": wavelengths},
    ).to_dataset(name="data")
dataset

# pro tip, try to plot it to see if it was read in correctly.
plot_data_overview(dataset,linlog=True);

At some point we will add a more robust and battle tested csv_to_dataset function to the pyglotaran_extras package, but even then the above continues to hold true.

Getting started with pyglotaran

Preface

Working with data

Plotting raw data

Starting a project

Import the data into your project

Working with models

Working with parameters

Model

Dataset Groups

K Matrix

Megacomplex

Initial Concentration

Irf

Dataset

Optimizing - fitting our data

Model

Dataset Groups

K Matrix

Megacomplex

Initial Concentration

Irf

Dataset

Visualize the Result

Conclusion

And now for something completely different