Preparing a dataset for multistate modeling with PyMSM

The first step in fitting any multistate model is to provide the sample data: the observed paths and the covariates.

Two dataset formats can serve as input:

1) a list of PathObject instances
2) a pandas DataFrame in the format used to fit the CompetingRiskModel class

1. A list of PathObject

This format is easiest to understand through an example:

# Load Rotterdam example data
from pymsm.datasets import prep_rotterdam
dataset, _ = prep_rotterdam()

# Print types
print('dataset type: {}'.format(type(dataset)))
print('elements type: {}'.format(type(dataset[0])))
dataset type: <class 'list'>
elements type: <class 'pymsm.multi_state_competing_risks_model.PathObject'>

The dataset is a list of PathObject instances. Each PathObject in the list corresponds to the observed path of a single sample (i.e., a single "patient").

Let’s look at one such object in detail:

# Display the path and covariates of one sample (list index 1314)
sample_path = dataset[1314]
sample_path.print_path()
Sample id: 1326
States: [1, 2, 3]
Transition times: [873.999987, 1672.0000989999999]
Covariates:
year      1990
age         44
meno         0
grade        3
nodes       17
pgr         40
er           7
hormon       0
chemo        1
Name: 1314, dtype: object
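
Because the dataset is an ordinary Python list, it can be summarized with standard tools. The following is a minimal sketch, not part of PyMSM, that counts how many samples followed each sequence of states. It assumes each element exposes its state sequence as a `states` attribute, as the printed path above suggests; the helper name `count_state_sequences` is hypothetical, and stand-in objects are used so the sketch runs without PyMSM installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_state_sequences(paths):
    """Count how many paths follow each sequence of states.

    Assumption: every element has a `states` list, mirroring the
    "States: [1, 2, 3]" line printed for the sample path above.
    """
    return Counter(tuple(path.states) for path in paths)


# Stand-in objects (hypothetical data) used in place of real PathObject instances.
toy_paths = [
    SimpleNamespace(states=[1, 2, 3]),
    SimpleNamespace(states=[1, 3]),
    SimpleNamespace(states=[1, 2, 3]),
]

sequence_counts = count_state_sequences(toy_paths)
for sequence, n_paths in sequence_counts.most_common():
    print(sequence, n_paths)
```

If the attribute assumption holds, the same helper could be called directly on the list returned by prep_rotterdam().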

2. A pandas DataFrame

The second option is a pandas DataFrame in the format used to fit the CompetingRiskModel class. Here is an example:

# Load EBMT dataset
from pymsm.datasets import prep_ebmt_long
competing_risk_dataset, covariate_cols, state_labels = prep_ebmt_long()
competing_risk_dataset.head()
   sample_id  origin_state  target_state  time_entry_to_origin  time_transition_to_target  match_no gender mismatch  proph_yes  year_1990-1994  year_1995-1998  agecl_<=20  agecl_>40
0          1             1             2                   0.0                       22.0                         1          0               0               1           0          0
1          1             2             0                  22.0                      995.0                         1          0               0               1           0          0
2          2             1             3                   0.0                       12.0                         1          0               0               1           0          0
3          2             3             4                  12.0                       29.0                         1          0               0               1           0          0
4          2             4             5                  29.0                      422.0                         1          0               0               1           0          0
print(competing_risk_dataset.columns)
Index(['sample_id', 'origin_state', 'target_state', 'time_entry_to_origin',
       'time_transition_to_target', 'match_no gender mismatch', 'proph_yes',
       'year_1990-1994', 'year_1995-1998', 'agecl_<=20', 'agecl_>40'],
      dtype='object')

The competing_risk_dataset must include the following columns:

'sample_id',
'origin_state',
'target_state',
'time_entry_to_origin',
'time_transition_to_target'  

The names of these columns are self-explanatory. The DataFrame can also contain any number of covariate columns.
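
To make the required columns concrete, here is a minimal sketch that rebuilds the three rows of sample 2 from the competing_risk_dataset.head() output above, starting from a state sequence. It is an illustration only, not PyMSM's own conversion code. The function name `path_to_long_rows` and its arguments are hypothetical, and it assumes that each duration is the time spent in a state before the next transition and that the clock starts at 0. Censoring and covariate columns are left out.

```python
REQUIRED_COLUMNS = (
    "sample_id",
    "origin_state",
    "target_state",
    "time_entry_to_origin",
    "time_transition_to_target",
)


def path_to_long_rows(sample_id, states, durations):
    """Turn one observed path into long-format rows (illustrative only).

    Assumption: durations[i] is the time spent in states[i] before the
    move to states[i + 1], and the clock starts at 0 on entry to the
    first state.
    """
    if len(durations) != len(states) - 1:
        raise ValueError("expected one duration per observed transition")
    rows = []
    clock = 0.0
    for origin, target, duration in zip(states, states[1:], durations):
        rows.append(
            {
                "sample_id": sample_id,
                "origin_state": origin,
                "target_state": target,
                "time_entry_to_origin": clock,
                "time_transition_to_target": clock + duration,
            }
        )
        clock += duration  # the next row starts where this one ended
    return rows


# Sample 2 from the table above: 1 -> 3 -> 4 -> 5.
rows = path_to_long_rows(sample_id=2, states=[1, 3, 4, 5], durations=[12.0, 17.0, 393.0])
for row in rows:
    print(row)
```

Each resulting dictionary has exactly the five required keys, so a list of such rows could be passed to pandas.DataFrame and joined with covariate columns on sample_id.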