Getting started with pyglotaran
Pyglotaran is an open-source modeling framework, written in Python and designed for global and target analysis of time-resolved spectroscopy data. It provides a flexible framework for analyzing complex spectrotemporal datasets, supporting a wide range of kinetic models including sequential, parallel, and target analysis schemes.
This getting-started notebook exists to get you started with using pyglotaran, hopefully in 20 minutes or less.
What you are viewing now may be a static rendering of an otherwise interactive notebook. This guide is the most useful if you either follow along in a new notebook, or download the original notebook from the repository.
Click here for more details on notebooks.
A Python notebook, also known as Jupyter Notebook, is an interactive computational environment, where you can run code, explore data and present your results, all in a single file (with the file extension .ipynb).
There are three main ways to run a Jupyter Notebook:
[local]Using Jupyter Notebook Directly[local|github]Using VS Code with the Python and Jupyter extensions[cloud]Using Google Colab
For the purpose of this guide, it is assumed you already know how to work with notebooks; else there are plenty of tutorials online.
Preface
If you are going through this guide, you most likely have some dataset burning a hole in your pocket.
Please rest assured, we’ll get to that. But first, we’d like to take you through a typically modeling workflow.
Working with data
Let’s start with the premise that we already have some data imported in our notebook.
To make this guide self-contained, we’ll make use of some simulated data from the glotaran.testing package
[1]:
# For the purpose of illustrating the workflow, we will just use
# some simulated data from the glotaran.testing package.
from glotaran.testing.simulated_data.sequential_spectral_decay import DATASET as my_dataset
# ending the cell with the variable name will display the content of the variable
my_dataset
[1]:
<xarray.Dataset> Size: 1MB
Dimensions: (time: 2100, spectral: 72)
Coordinates:
* time (time) float64 17kB -1.0 -0.99 -0.98 -0.97 ... 19.97 19.98 19.99
* spectral (spectral) float64 576B 600.0 601.4 602.8 ... 696.6 698.0 699.4
Data variables:
data (time, spectral) float64 1MB 0.002468 0.005216 ... 2.556 2.323
Attributes:
source_path: dataset_1.ncLike all data in pyglotaran, the dataset is an xarray.Dataset, for which you can find more information on the xarray hompage.
From the output cell, we can quickly see that the dataset my_dataset has the following properties:
It has the
Dimensions:timeandspectral(and they must be named like that)For these data the time coordinate starts at
-1.0and runs until19.99,For these data the spectral coordinate starts at
600and runs until699.4.The dataset (currently) has a single Data variable called
datawith (2100 time x 72 spectral)=151200 datapoints, later more variables may be added to the dataset.
Towards the end of this notebook you will find out how to read in your data and transform it into an xarray.Dataset, but for now, let’s continue.
Plotting raw data
The pyglotaran framework has built-in functionality to create useful plots. They are part of the pyglotaran_extras package, and if you followed the installation instructions, you should already have this installed.
In there, we have a number of plotting functions we can import and use in our notebook. Let’s start with plot_data_overview.
[2]:
from pyglotaran_extras import plot_data_overview
plot_data_overview(my_dataset);
Isn’t it pretty? We see our 2D intensity map with the time coordinate (labeled Time (ps)) along the x-axis and the spectral coordinate (labeled Wavelength (nm)) along the y-axis.
Just below that we see our singular value decomposition (SVD) of the data matrix, with the first four (4) singular vectors plotted.
The left most plot (data. LSV) shows the left singular vectors representing the vector’s evolution along the time coordinate.
The right most plot (data. RSV) shows the right singular vectors reflecting the spectral shapes associated with that evolution.
The middle plot shows the first 10 singular values, on a logarithmic scale.
From this we can deduce the rank of our data matrix is three (3), which is how many decay components we’ll need for our model later.
Q: What? How do you know that? And what’s a S.V.D.?
… A: (click here)
Very simply put the Singular Value Decomposition (SVD) is a mathematical technique used to decompose a matrix (our data) into a number of left and right singular vectors and their corresponding ‘weights’ (the singular values). It allows us to quickly visualize the ‘rank’ of the matrix (which gives us a rough approximation of how many decay components we might need in our model to fit the data).
More precisely, the SVD decomposed a (data) matrix into three other matrices. It is often used in signal processing and statistics to identify patterns in data. Specifically, SVD decomposes a matrix (A) into three matrices (U), (Σ), and (V^T) such that (A = UΣV^T). Here, (U) and (V) are orthogonal matrices, and (Σ) is a diagonal matrix containing the singular values. The left singular vectors (columns of (U)) represent the time coordinates, while the right singular vectors (columns of (V)) represent the spectral coordinates. The singular values in (Σ) provide a measure of the importance of each corresponding singular vector.
Put simply: if we take the first left singular vector (LSV), multiply it by the first singular value (SV), and then multiply by the first right singular vector (RSV), we obtain the first approximation of the data matrix. Repeating this process for the second and third singular vectors provides an even better approximation, especially since the rank of this particular matrix appears to be 3. By summing the products of all the left and right singular vectors, each weighted by their corresponding singular value, we can reconstruct the original data matrix.
That’s more than you need to know about Singular Value Decompositions at this time.
Starting a project
Once we have decided that our data is good enough to attempt to model it using pyglotaran we can start our adventure.
To start using pyglotaran in your analysis, you only have to import the Project class and open a project (folder).
[3]:
from glotaran.project import Project
my_project = Project.open("my_project")
If the project does not already exist this will create a new project and its folder structure for you. In our case we had only the models + parameters folders and the data + results folder were created when opening the project.
[4]:
# This so called shell command allows us to 'list' (ls) the content of the project folder.
%ls my_project
data/ models/ parameters/ project.gta results/
Import the data into your project
As long as your data can be transformed into a well structured xarray Dataset you can import it directly into your project.
Q: Well structured?
A: …
By that we mean that your data is of type xarray.Dataset with a data Data variable (or xarray.DataArray) and has the coordinates time and spectral.
This will then save your data as NetCDF (.nc) file into the data folder inside of your project with the name that you gave it (here my_project/data/my_data.nc).
[5]:
my_project.import_data(my_dataset, dataset_name="my_data")
After importing, our my_project is aware of the data that we named my_data when importing.
Working with models
⚠️ Please note that the exact way in which a model is defined may still change slightly in future versions of pyglotaran. But no worries, there will always be a clear procedure to upgrade any existing models you may have created in the meantime.
After importing our data into the project, to analyse them, we need a model (or analysis scheme).
If it does not already exists, create a file called my_model.yaml in your projects’ models folder and fill it with the following content.
📝 Don’t let this file extension (
.yamlor.yml) scare you, it’s just another plain text file, which you can open with literally any text editor.
In our case the file already exists, so we can just show you the content, which you can then copy paste.
[6]:
my_project.show_model_definition("my_model")
# copy the model definition from the output below
[6]:
initial_concentration:
input:
compartments: [s1, s2, s3]
parameters: [input.1, input.0, input.0]
k_matrix:
k1:
matrix:
(s2, s1): kinetic.1
(s3, s2): kinetic.2
(s3, s3): kinetic.3
megacomplex:
m1:
k_matrix: [k1]
type: decay
irf:
irf1:
type: gaussian
center: irf.center
width: irf.width
dataset:
my_data:
initial_concentration: input
megacomplex: [m1]
irf: irf1
⬆️ copy into models/my_model.yaml
The above model reads (from bottom to top) as:
We have a dataset named
my_data, which we model with a single (kinetic) megacomplexm1, with initial_concentration vectorinputand instrument response function (IRF)irf1.The IRF
irf1is defined as being of typegaussianwith its center location atirf.centerand its width to beirf.widthThe megacomplex
m1is composed of just a single kinetic matrixk1of typedecay(short for exponential decays)This k_matrix is composed of just 3 rate constants (
kinetic.1,kinetic.2,kinetic.3) describing the sequential kinetic scheme: “s1->s2->s3->ground”It is sequential because the initial_concentration
inputdefines all of the input (1) going tos1, and none of it going tos2ors3.
You can check your model for problems with the validate method.
[7]:
my_project.validate("my_model")
[7]:
Your model is valid.
Working with parameters
A model by itself is not sufficient, we also need starting values for the parameters we define in the model.
For this, we use a parameters file. Create a file called my_parameters.yaml in your parameters folder with the following content.
Again, in our case the file already exists, so we just show the content for you to copy.
[8]:
my_project.show_parameters_definition("my_parameters")
[8]:
input:
- ["1", 1, { "vary": False }]
- ["0", 0, { "vary": False }]
kinetic: [0.51, 0.31, 0.11]
irf:
- ["center", 0.31]
- ["width", 0.11]
This reads as:
There are two
inputparametersinput.1with (fixed) value 1, andinput.0with (fixed) value 0.There are 3 kinetics rates, with starting values
0.51,0.31,0.11.There are 2 IRF related parameters
irf.centerwith starting value0.31andirf.widthwith starting value0.11.
All parameters are implicitly ‘free’, unless specified with { "vary": False } to be fixed.
You can use the validate method, which can check for missing parameters.
[9]:
my_project.validate("my_model", "my_parameters")
[9]:
Your model is valid.
Since not all problems in the model can be detected automatically it is wise to visually inspect the model.
For this purpose, you can just load the model and inspect its markdown rendered version.
[10]:
my_project.load_model("my_model")
[10]:
Model
Dataset Groups
default
Label: default
Residual Function: variable_projection
K Matrix
k1
Label: k1
Matrix: {(‘s2’, ‘s1’): ‘kinetic.1’, (‘s3’, ‘s2’): ‘kinetic.2’, (‘s3’, ‘s3’): ‘kinetic.3’}
Megacomplex
m1
Label: m1
Dimension: time
Type: decay
K Matrix: [‘k1’]
Initial Concentration
input
Label: input
Compartments: [‘s1’, ‘s2’, ‘s3’]
Parameters: [‘input.1’, ‘input.0’, ‘input.0’]
Exclude From Normalize: []
Irf
irf1
Label: irf1
Normalize: True
Backsweep: False
Type: gaussian
Center: irf.center
Width: irf.width
Dataset
my_data
Label: my_data
Group: default
Force Index Dependent: False
Megacomplex: [‘m1’]
Initial Concentration: input
Irf: irf1
The same way you should inspect your parameters.
[11]:
my_project.load_parameters("my_parameters")
[11]:
input:
Label
Value
Standard Error
t-value
Minimum
Maximum
Vary
Non-Negative
Expression
1
1.000e+00
nan
nan
-inf
inf
False
False
None0
0.000e+00
nan
nan
-inf
inf
False
False
Noneirf:
Label
Value
Standard Error
t-value
Minimum
Maximum
Vary
Non-Negative
Expression
center
3.100e-01
nan
nan
-inf
inf
True
False
Nonewidth
1.100e-01
nan
nan
-inf
inf
True
False
Nonekinetic:
Label
Value
Standard Error
t-value
Minimum
Maximum
Vary
Non-Negative
Expression
1
5.100e-01
nan
nan
-inf
inf
True
False
None2
3.100e-01
nan
nan
-inf
inf
True
False
None3
1.100e-01
nan
nan
-inf
inf
True
False
None
Optimizing - fitting our data
Now we have all the components in place to start our analysis, optimizing the parameters while minimizing the residual, i.e. fitting our model to our data.
[12]:
result = my_project.optimize("my_model", "my_parameters")
result
Iteration Total nfev Cost Cost reduction Step norm Optimality
0 1 1.1173e+04 1.73e+06
1 2 1.4989e+01 1.12e+04 1.96e-02 1.27e+04
2 3 7.5689e+00 7.42e+00 5.86e-03 1.01e+03
3 4 7.5650e+00 3.84e-03 1.68e-05 6.26e-02
4 5 7.5650e+00 1.51e-11 4.41e-09 6.23e-06
Both `ftol` and `xtol` termination conditions are satisfied.
Function evaluations 5, initial cost 1.1173e+04, final cost 7.5650e+00, first-order optimality 6.23e-06.
[12]:
Optimization Result |
|
|---|---|
Number of residual evaluation |
5 |
Number of residuals |
151200 |
Number of free parameters |
5 |
Number of conditionally linear parameters |
216 |
Degrees of freedom |
150979 |
Chi Square |
1.51e+01 |
Reduced Chi Square |
1.00e-04 |
Root Mean Square Error (RMSE) |
1.00e-02 |
Model
Dataset Groups
default
Label: default
Residual Function: variable_projection
K Matrix
k1
Label: k1
Matrix: {(‘s2’, ‘s1’): ‘kinetic.1(5.00e-01±6.79e-05, t-value: 7364, initial: 5.10e-01)’, (‘s3’, ‘s2’): ‘kinetic.2(3.00e-01±3.94e-05, t-value: 7620, initial: 3.10e-01)’, (‘s3’, ‘s3’): ‘kinetic.3(1.00e-01±4.23e-06, t-value: 23642, initial: 1.10e-01)’}
Megacomplex
m1
Label: m1
Dimension: time
Type: decay
K Matrix: [‘k1’]
Initial Concentration
input
Label: input
Compartments: [‘s1’, ‘s2’, ‘s3’]
Parameters: [‘input.1(1.00e+00, fixed)’, ‘input.0(0.00e+00, fixed)’, ‘input.0(0.00e+00, fixed)’]
Exclude From Normalize: []
Irf
irf1
Label: irf1
Normalize: True
Backsweep: False
Type: gaussian
Center: irf.center(3.00e-01±5.04e-06, t-value: 59496, initial: 3.10e-01)
Width: irf.width(1.00e-01±6.73e-06, t-value: 14864, initial: 1.10e-01)
Dataset
my_data
Label: my_data
Group: default
Force Index Dependent: False
Megacomplex: [‘m1’]
Initial Concentration: input
Irf: irf1
What you see from the optimization ‘log’ is that the optimization took about 5 iterations to converge, going from a cost of 11.2k down to 7.56.
The Optimization Result table gives us a nice ‘statistics’ overview and if we click on details we can view the optimized model.
Since we are satisfied that our fit has converged, we’ll proceed to look at the outcome in more detail.
Speaking of outcomes, please note that each time you run an optimization the result will be saved in the project’s results folder.
You may occasionally want to clean these up, especially with larger projects or datasets.
[13]:
# To view the saved results, look at the content of the project's results folder
%ls "my_project/results"
my_model_run_0000/
One way to look at our results in more detail, is to query the optimized_parameters object in the results.
This will also print (if it can be computed) the standard error for each estimated parameter.
[14]:
result.optimized_parameters
[14]:
input:
Label
Value
Standard Error
t-value
Minimum
Maximum
Vary
Non-Negative
Expression
1
1.000e+00
nan
nan
-inf
inf
False
False
None0
0.000e+00
nan
nan
-inf
inf
False
False
Noneirf:
Label
Value
Standard Error
t-value
Minimum
Maximum
Vary
Non-Negative
Expression
center
3.000e-01
5.042e-06
59496
-inf
inf
True
False
Nonewidth
1.000e-01
6.728e-06
14864
-inf
inf
True
False
Nonekinetic:
Label
Value
Standard Error
t-value
Minimum
Maximum
Vary
Non-Negative
Expression
1
5.000e-01
6.790e-05
7364
-inf
inf
True
False
None2
3.000e-01
3.937e-05
7620
-inf
inf
True
False
None3
1.000e-01
4.230e-06
23642
-inf
inf
True
False
None
You can further inspect the result by accessing its data attribute. In our example it only contains a single my_data dataset, but it could contain many datasets in a multi-dataset analysis.
[15]:
result.data
[15]:
{'my_data': <xarray.Dataset>}
my_data
<xarray.Dataset> Size: 6MB
Dimensions: (time: 2100, spectral: 72,
left_singular_value_index: 72,
singular_value_index: 72,
right_singular_value_index: 72,
clp_label: 3, species: 3,
component_m1: 3, species_m1: 3,
to_species_m1: 3, from_species_m1: 3)
Coordinates:
* time (time) float64 17kB -1.0 -0.99 ... 19.99
* spectral (spectral) float64 576B 600.0 ... 699.4
* clp_label (clp_label) <U2 24B 's1' 's2' 's3'
* species (species) <U2 24B 's1' 's2' 's3'
* component_m1 (component_m1) int64 24B 1 2 3
rate_m1 (component_m1) float64 24B 0.5 0.3 0.1
lifetime_m1 (component_m1) float64 24B 2.0 3.334 10.0
* species_m1 (species_m1) <U2 24B 's1' 's2' 's3'
initial_concentration_m1 (species_m1) float64 24B 1.0 0.0 0.0
* to_species_m1 (to_species_m1) <U2 24B 's1' 's2' 's3'
* from_species_m1 (from_species_m1) <U2 24B 's1' 's2' 's3'
Dimensions without coordinates: left_singular_value_index,
singular_value_index, right_singular_value_index
Data variables: (12/21)
data (time, spectral) float64 1MB 0.002468 .....
data_left_singular_vectors (time, left_singular_value_index) float64 1MB ...
data_singular_values (singular_value_index) float64 576B 6.57...
data_right_singular_vectors (spectral, right_singular_value_index) float64 41kB ...
residual (time, spectral) float64 1MB 0.002468 .....
matrix (time, clp_label) float64 50kB 6.111e-39...
... ...
irf_center float64 8B 0.3
irf_width float64 8B 0.1
decay_associated_spectra_m1 (spectral, component_m1) float64 2kB 31....
a_matrix_m1 (component_m1, species_m1) float64 72B 1...
k_matrix_m1 (to_species_m1, from_species_m1) float64 72B ...
k_matrix_reduced_m1 (to_species_m1, from_species_m1) float64 72B ...
Attributes:
source_path: /home/docs/checkouts/readthedocs.org/us...
model_dimension: time
global_dimension: spectral
root_mean_square_error: 0.010003331059717659
weighted_root_mean_square_error: 0.010003331059717659
dataset_scale: 1
loader: <function load_dataset at 0x79421b880ca0>Finally, although knowing your optimization converged, and the optimized parameters seem reasonable is half the battle, the real question is of course … what does it LOOK like. Let’s plot!
Visualize the Result
The results can be visualized in a similar way as the dataset, using a function from the pyglotaran_extras part of the framework, in this case plot_overview.
[16]:
from pyglotaran_extras import plot_overview
plot_overview(result);
In this overview you see the following.
first row:
[left]The concentrations corresponding to the components in the kinetic scheme[middle]The species associated spectra (SAS), in this case equivalent to the evolution associated spectra (EAS)[right]The decay associated spectra (DAS)
second row:
[left]The residual matrix, with the IRF (dispersion) curve plotted on top[middle]The normalized SAS[right]The normalized DAS
third row: the SVD of the residual matrix, with
[left]The first 2 components (black, red) of the left singular vectors (time)[left]The first 2 components (black, red) of the right singular vectors (spectral)[right]The first 10 singular values on a logarithmic scale.
forth row: the SVD of the data matrix, with
[left]The first 4 components (black, red, blue, green) of the left singular vectors (time)[left]The first 4 components (black, red, blue, green) of the right singular vectors (spectral)[right]The first 10 singular values on a logarithmic scale.
Conclusion
In this guide we showed how to
plot your data (once it’s in the right format)
start a project, import the data into it
work with models (and a model
.yamlfile)specify starting values for our parameters (in a parameters
.yamlfile)analyze your data
plot the results
Welcome to the future of global target analysis, we hope you like it here!
And now for something completely different
Right, right. We were going to talk about your data. That .csv file did end up burning a hole in your pocket, didn’t it? 😉
Well, there is a reason we have standerdized around xarray’s Datasets, we know what to expect.
We don’t know that with .csv. 🙈
Some people save something as .csv, but it is space or tab delimited.
Some people put in a header, of a single line, two lines, 4 lines.
Some people put in a footer, of many lines.
Some people pad their data matrix with a zero, some don’t.
Some people include their spectral and time coordinates, some don’t.
Some do, but call it something completely different.
Some people use monotonic increasing coordinates (as one should), but some don’t 😱.
In short, you can never be sure with .csv.
All of that being said … if you had some fairly clean .csv data lying around you wanted to import, you could use the pandas library to read the data and then create an xarray Dataset out of it.
How, is left as an exercise to the reader, but we left some tips below.
Click to reveal tips!
Assuming a csv file with the timepoints in the first row, and the spectral coordinate (e.g. wavelengths) in the fist column.
-2.0,0.0,2.0,10.0,100.0,1000.0
420,0,10,15,5,1,0
520,1,100,25,7,2,0
620,0,25,15,2,1,0
Then this bit of Python code could read in your data.
import pandas as pd
import numpy as np
import xarray as xr
from pathlib import Path
filepath = Path(r"file_that_burned_a_hole_in_your_pocket.csv")
# Load the CSV file into a pandas DataFrame
df = pd.read_csv(filepath, delimiter=',')
# Convert index and columns to numeric, ignoring any non-numeric values
df.index = pd.to_numeric(df.index, errors='coerce')
df.columns = pd.to_numeric(df.columns, errors='coerce')
# Remove any rows or columns that couldn't be converted to numeric
df = df.loc[df.index.notnull(), df.columns.notnull()]
# Extract the coordinates (assuming they were in the .csv file)
timepoints = np.array(df.columns.values).astype(float)
wavelengths = np.array(df.index.values).astype(float)
dataset = xr.DataArray(
df.values.T,
dims=["time", "spectral"],
coords={"time": timepoints, "spectral": wavelengths},
).to_dataset(name="data")
dataset
# pro tip, try to plot it to see if it was read in correctly.
plot_data_overview(dataset,linlog=True);
At some point we will add a more robust and battle tested csv_to_dataset function to the pyglotaran_extras package, but even then the above continues to hold true.