Refining Cancer Survival Analysis: Constructing Age-specific Relative Survival Graphs Using Polynomial Regression to Interpolate Age-group Based Data from the Surveillance, Epidemiology, and End Results Program

Author: Vedanth Ramji
Mentor: Dr. Natarajan Ganesan
APL Global School

Abstract

The Surveillance, Epidemiology, and End Results (SEER) program by the National Cancer Institute provides relative survival by time since diagnosis graphs for different cancers at different age groups [9]. This data is invaluable in computing cancer treatment and survivorship statistics, providing critical insights into disease prognosis and patient outcomes [2]. However, a primary limitation in this data lies in the expansive age groups used (<15, <20, 15-39, 40-64, 50-64, 65-74, 65+, and 75+), which fail to accurately represent survival rates for specific ages. As cancer survival rates can differ based on the age of individuals, and considering that the age composition of cancer patients differs across populations, adjusting for age is essential when comparing cancer survival rates [12]. This study proposes a mathematical and computational approach using polynomial regression to create accurate and precise survival curves for exact ages without requiring additional cancer statistics. Polynomial regression is applied to relative survival rates for specific age groups at different times since diagnosis, generating functions that map age to relative survival rates. Then, these functions are used to create age-specific survival curves. Age-specific survival curves can provide patients and healthcare providers with more accurate survival data for developing treatment programs, while public health systems can optimize medical resource allocation.

Introduction

The Surveillance, Epidemiology, and End Results (SEER) program curates and provides cancer statistics procured from different regions in the USA, offering insights into cancer incidence, prevalence, treatment, and survival rates [1]. Cancer survival rates are essential indicators for assessing treatment efficacy, monitoring patient outcomes, and guiding healthcare decisions. Relative survival rates consider both cancer-related deaths and non-cancer-related deaths and compare the survival rates of cancer patients to those of the overall population, taking into consideration factors such as age, race, and gender [1]. This makes them a preferred metric for assessing cancer-specific survival. However, as the curves are for set age ranges, the data fails to effectively capture survival rates for specific ages. For example, the relative survival rate for acute myeloid leukemia (AML) in the age range 15-39 after one year since diagnosis is 78.9 ± 0.4%, while the relative survival rate for AML in the age range 40-64 after one year since diagnosis is 59.4 ± 0.3%. Therefore, it is challenging to determine the precise survival rate for individual ages within these age ranges.

Polynomial regression is a statistical technique for modeling non-linear relationships between variables by fitting a polynomial function to observed data points. Unlike simple linear regression, which assumes a straight-line relationship between variables, polynomial regression accommodates non-linearity by using higher-degree polynomial equations. This allows the fitted curve to follow trends that rise, fall, or bend across the range of the independent variable.

This study presents an approach that leverages polynomial regression to construct relative survival by time since diagnosis graphs for exact ages without having to collect additional cancer statistics to determine survival rates for individual ages. This is done by, first, plotting relative survival rates against different ages for each year since diagnosis. Then polynomial regression is performed on these graphs to obtain functions that map a certain age to a relative survival rate for a specific period since diagnosis. The outputs of these functions are then used to calculate the points on the relative survival by time since diagnosis graph for exact ages.

Methods

To illustrate the process of constructing age-specific survival curves, data from SEER on acute myeloid leukemia has been utilized. However, the methods outlined in the study can be used on other relative survival by time since diagnosis graphs of other cancers (see Figure 8).

Data Acquisition and Cleaning

Relative survival rates by time since diagnosis data for acute myeloid leukemia was downloaded in a comma-separated values (CSV) file from SEER [9]. The age ranges in the CSV file were: <15, <20, 15-39, 40-64, 50-64, 65-74, 65+ and 75+. Before applying polynomial regression, the raw CSV data was processed using the ‘pandas’ Python library to remove irrelevant or incomplete information (the ‘pandas’ package provides support for working with different datasets [4]). The CSV file was loaded onto a pandas DataFrame object, and rows that did not contain column headings or survival rates were removed, ensuring the data’s integrity and consistency.
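The cleaning step described above can be sketched as follows. This is a minimal illustration, not the study's actual script: the CSV layout, the title and footnote rows, and the column names `years`, `age_group`, and `survival` are assumptions made for the example, and only the two 1-year AML values quoted in the Introduction are used as data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file; the real SEER export layout may differ.
raw_csv = """Relative survival export (hypothetical layout)
Years Since Diagnosis,Age Group,Relative Survival (%)
1,15-39,78.9
1,40-64,59.4
,,
Footnote: illustrative rows only
"""

# Read every row as data and assign assumed column names.
df = pd.read_csv(
    io.StringIO(raw_csv),
    header=None,
    names=["years", "age_group", "survival"],
)

# Rows without a numeric survival rate (title, header text, blanks, footnotes) become NaN.
df["survival"] = pd.to_numeric(df["survival"], errors="coerce")

# Keep only rows that carry a survival rate, then tidy the index and dtypes.
df = df.dropna(subset=["survival"]).reset_index(drop=True)
df["years"] = df["years"].astype(int)

print(df)
```

Coercing the survival column to numeric and dropping the resulting missing values is one simple way to discard non-data rows without hard-coding their positions.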


Figure 1. Structure of the cleaned pandas DataFrame which contains relative survival for time since diagnosis of acute myeloid leukemia taken from the CSV data downloaded from SEER [9].

Figure 2. Example of a relative survival by time since diagnosis graph for acute myeloid leukemia taken from SEER [9]. This is a graphical representation of the CSV file loaded onto the pandas DataFrame, shown in Figure 1.

Selection of Age Groups for Regression

A diverse range of survival rates is required to optimize the accuracy of the curve fitting while performing polynomial regression. To achieve this, specific age groups are selected from the dataset, ensuring a wide representation of survival rates. Relative survival rates for AML at the ages 15, 39, 64, and 74 were selected. These ages correspond to the age groups <15, 15-39, 50-64, and 65-74, respectively, in the cleaned CSV data. This selection can be modified depending on the availability of data for different cancers.


Figure 3. The relative survival by time since diagnosis graph with only the age groups: <15, 15-39, 50-64, and 65-74. Regression will be applied to this graph. Graph obtained from SEER [9].

By modifying the age groups used to represent a broader spectrum of survival rates, potential fluctuations in the trend of the survival curve can be minimized. This enhances the robustness of the polynomial regression and improves the accuracy and precision of interpolated survival data.

Performing Polynomial Regression

With the selected age-specific survival rates, we apply polynomial regression to fit a polynomial curve to the data points. The polynomial regression model aims to find the best-fitting curve that maps ages to relative survival rates. In this case, polynomial regression of the second order is employed, yielding a quadratic function that can effectively approximate the underlying survival trend. However, depending on the trend of other cancers, regression to the third or even fourth order might be necessary.

To perform polynomial regression, the pandas, Scikit-learn, and NumPy Python libraries were used. The pandas package was used to load the cleaned DataFrame containing relative survival rates of the chosen age ranges: <15, 15-39, 50-64, and 65-74.

NumPy is a fundamental package for scientific computing in Python [5]. It provides support for matrices, arrays, and mathematical functions to operate on these arrays [5]. The ‘array()’ function in NumPy is used to create an array of the chosen ages (15, 39, 64, and 74). The ‘reshape()’ function with the arguments ‘reshape(-1, 1)’ is used to restructure this array into a single column. Survival rates for these ages are taken from the data provided by SEER and stored in the list ‘survival_data.’

Scikit-learn is a Python package that assists with data analysis, statistical modeling, and machine learning. It also provides support for performing different types of regression [6]. ‘LinearRegression’ is a class from the ‘sklearn.linear_model’ module in the Scikit-learn package. It fits a linear model to input data by finding the coefficients that minimize the residual sum of squares between observed and target values.

‘PolynomialFeatures’ is a class from the ‘sklearn.preprocessing’ module in Scikit-learn. It generates polynomial features (for example, x and x² for a second-order fit) from the input data, enabling the linear model to capture non-linear relationships between the features and target variables.
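The way these pieces fit together can be sketched as below. This is an illustrative outline under stated assumptions rather than the study's code: the values in `survival_data` are placeholders chosen to lie exactly on the curve y = -0.02x² + x + 70 (they are not SEER figures), and the helper `predict_survival` is a hypothetical name introduced for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Chosen ages, restructured into a single column as Scikit-learn expects.
ages = np.array([15, 39, 64, 74]).reshape(-1, 1)

# Placeholder survival rates (%), NOT SEER data: points on y = -0.02x^2 + x + 70.
survival_data = [80.5, 78.58, 52.08, 34.48]

# Expand each age x into the features [x, x^2] for a second-order fit.
poly = PolynomialFeatures(degree=2, include_bias=False)
age_features = poly.fit_transform(ages)

# Ordinary least squares on the expanded features gives the quadratic fit.
model = LinearRegression().fit(age_features, survival_data)

linear_coef, quadratic_coef = model.coef_
intercept = model.intercept_
r_squared = model.score(age_features, survival_data)


def predict_survival(age):
    """Return the fitted relative survival rate (%) for a single age."""
    return float(model.predict(poly.transform(np.array([[age]])))[0])


print(f"y = {quadratic_coef:.4f}x^2 + {linear_coef:.4f}x + {intercept:.4f}")
print(f"R-squared: {r_squared:.4f}")
```

Setting `include_bias=False` leaves the constant term to `LinearRegression`, so the intercept is reported separately from the two polynomial coefficients.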


Figure 5. Extrapolated relative survival rate by age graph for 1 year since diagnosis of acute myeloid leukemia. Graph created using Matplotlib [3].

The equation of the line of best fit from the polynomial regression can be represented as follows, where y is the relative survival rate (%) and x is age (years):

$$y = -0.0187x^{2} + 0.9386x + 72.7471$$

To assess the goodness of fit for this regression, an R-squared value was calculated. The R-squared value measures the proportion of the variation in the dependent variable (relative survival rates) that is predictable from the independent variable (age). A high R-squared value (close to 1) indicates a strong correlation between age and relative survival, supporting the accuracy of the interpolation. The R-squared value for the above regression was 0.9853. This process was repeated for all the other years since diagnosis (2 – 10 years). The results are shown in Table 1.
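The R-squared calculation can be written out directly from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is an assumption-laden illustration: the function name `r_squared` is hypothetical, the observed values are placeholders rather than SEER figures, and NumPy's `polyfit` is used here for brevity in place of the Scikit-learn workflow described earlier.

```python
import numpy as np


def r_squared(observed, predicted):
    """Proportion of variance in `observed` explained by `predicted`."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)        # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Placeholder values for illustration only; these are NOT SEER figures.
ages = np.array([15, 39, 64, 74])
observed_survival = np.array([82.0, 81.5, 56.0, 40.0])

# Second-order least-squares fit, then evaluate the curve at the same ages.
coefficients = np.polyfit(ages, observed_survival, deg=2)
fitted_survival = np.polyval(coefficients, ages)

fit_quality = r_squared(observed_survival, fitted_survival)
print(f"R-squared: {fit_quality:.4f}")
```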


Table 1. Equations of lines of best fit for relative survival by age for 1 to 10 years since diagnosis for acute myeloid leukemia.

Figure 6. Extrapolated relative survival rate by age graph for 1 to 4 years since diagnosis of acute myeloid leukemia. Graph created using Matplotlib [3].

Results

After calculating the equations of the lines of best fit in Table 1, survival rates for different times since diagnosis for specific ages can be calculated by passing different ages as inputs into the equations of the lines of best fit. Ideally, ages less than 15 and greater than 85 should be ignored as most clinical trials do not account for these ages, and notable changes in physiology occur in geriatric cases and infants [7][8]. Arbitrary ages of 25, 35, 45, and 55 were taken to illustrate age-specific survival curves, but any other age can also be chosen.

These ages were given as inputs to each of the equations calculated in Table 1. The results are shown in Table 2.
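As a worked illustration of this step, the sketch below evaluates the 1-year equation reported in the Methods at the four example ages. Only the 1-year coefficients appear in the text, so the other years are not reproduced here; the variable names are assumptions made for the example.

```python
import numpy as np

# Coefficients of the 1-year fit from the Methods, highest power first:
# y = -0.0187x^2 + 0.9386x + 72.7471
one_year_coefficients = [-0.0187, 0.9386, 72.7471]

# Example ages used to illustrate age-specific survival curves.
ages = np.array([25, 35, 45, 55])

# Evaluate the quadratic at every age in one call.
one_year_survival = np.polyval(one_year_coefficients, ages)

for age, rate in zip(ages, one_year_survival):
    print(f"Age {age}: {rate:.2f}% relative survival at 1 year")
```

Repeating the same evaluation with the coefficients for each later year would give one point per year on the survival curve for a given age.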


Table 2. Survival rates of different ages for different times since diagnosis of acute myeloid leukemia calculated using equations from Table 1.

The relative survival rates calculated in Table 2 can then be plotted against time since diagnosis for each age to produce age-specific relative survival by time since diagnosis graphs. This is shown in Figure 7.


Figure 7. Relative survival by time since diagnosis graphs for ages 25, 35, 45, and 55 for acute myeloid leukemia. Graph created using Matplotlib [3].

To assess the accuracy of the constructed survival curves, a Euclidean pairwise distance measure is used. Pairwise distance measures help quantify the similarity or dissimilarity between data points or objects. In Euclidean distance measures, the straight-line distance between two points in a Euclidean space is measured. It is calculated as the square root of the sum of squared differences between corresponding elements of two lists. The Python package ‘SciPy’ provides support for calculating Euclidean distance measures in the ‘scipy.spatial.distance’ module [11]. This was used to calculate distance measures between the constructed age-specific survival curves and survival curves from SEER, which are displayed in Table 3.
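The distance calculation can be illustrated with the `euclidean` function from ‘scipy.spatial.distance’. The two short lists below are placeholder curves invented for the example, not values from the study or from SEER.

```python
from scipy.spatial.distance import euclidean

# Placeholder survival rates (%) at three time points; NOT study or SEER values.
constructed_curve = [80.0, 60.0, 50.0]
seer_curve = [77.0, 64.0, 50.0]

# Square root of the sum of squared differences between corresponding elements:
# sqrt(3^2 + 4^2 + 0^2)
distance = euclidean(constructed_curve, seer_curve)
print(f"Euclidean distance: {distance:.1f}")
```

A smaller distance means the two curves lie closer together across the compared time points.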


Table 3. Pairwise distance measures between age-specific survival curves for AML and SEER’s survival curves.

Age-specific survival curves for other cancer types can also be constructed using the same methods. Survival curves for the age of 35 for glioblastoma, adenocarcinoma of the esophagus, acute monocytic leukemia and acute lymphocytic leukemia were constructed (see Figure 8) and their pairwise distance measures from SEER’s data for the age-range 15-39 were calculated (see Table 4).


Figure 8. Relative survival by time since diagnosis graphs for glioblastoma, adenocarcinoma (esophageal), acute monocytic leukemia and acute lymphocytic leukemia for the age of 35. Data from SEER [9]. Graph created using Matplotlib [3].

Conclusion

The results obtained from polynomial regression indicate high goodness of fit, as evidenced by the high R-squared values ranging from 0.9850 to 0.9928. This indicates a strong correlation between age and relative survival rates, supporting the accuracy of the interpolation method. The pairwise distance measures are also relatively small. Three of the four distance measures for the survival curves for AML are less than 20%, and the pairwise distance measures for glioblastoma, adenocarcinoma of the esophagus, acute monocytic leukemia and acute lymphocytic leukemia are all less than 20%, indicating that the age-specific graphs are spatially close to SEER’s data. However, at higher ages (55 years for AML) the pairwise distance is larger than 20%. This is due to interference from low relative survival rates, especially from the age group of 65-74 years. The availability of a greater number of narrower age ranges would greatly help offset the effects of low survival rates at higher ages.

The proposed methodology demonstrates a novel and effective way to construct age-specific relative survival curves from SEER’s age-group based data. By utilizing polynomial regression, accurate survival data for exact ages can be interpolated, addressing the limitations of the expansive age groups provided by SEER.

In conclusion, the presented novel polynomial regression technique provides an efficient means of constructing precise and accurate age-specific survival curves without the need for additional cancer statistics, especially for younger ages. As cancer statistics are updated, it is hoped that these graphs will eventually offer significant value to healthcare providers, patients, and public health systems. For healthcare providers, the availability of accurate survival data for specific ages aids in developing tailored treatment plans, optimizing therapy choices, and enhancing patient outcomes. Patients, too, benefit from more personalized information that empowers them to make informed decisions about their treatment journey. Moreover, public health systems can utilize this detailed survival data to allocate medical resources effectively and ensure equitable access to quality cancer care across all age groups.

In the future, it is hoped that more comprehensive survival data with narrower age ranges will be available, further enhancing the accuracy of interpolation [10].

References

  1. DeSantis, C.E., Lin, C.C., Mariotto, A.B., Siegel, R.L., Stein, K.D., Kramer, J.L., Alteri, R., Robbins, A.S. and Jemal, A. (2014), Cancer treatment and survivorship statistics, 2014. CA: A Cancer Journal for Clinicians, 64: 252-271. https://doi.org/10.3322/caac.21235
  2. Understanding Statistics Used to Guide Prognosis and Evaluate Treatment. (2010, April 8). Cancer.Net. https://www.cancer.net/navigating-cancer-care/cancer-basics/understanding-statistics-used-guide-prognosis-and-evaluate-treatment
  3. J. D. Hunter, “Matplotlib: A 2D Graphics Environment,” in Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, May-June 2007, doi: 10.1109/MCSE.2007.55.
  4. The pandas development team. (2023). pandas-dev/pandas: Pandas (v2.1.0rc0). Zenodo. https://doi.org/10.5281/zenodo.8239932
  5. Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). https://doi.org/10.1038/s41586-020-2649-2
  6. Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
  7. Zhang E, DuBois SG. Early Termination of Oncology Clinical Trials in the United States. Cancer Med. 2023; 12: 5517-5525. doi: 10.1002/cam4.5385
  8. Shenoy, P., & Harugeri, A. (2015). Elderly patients’ participation in clinical trials. Perspectives in clinical research, 6(4), 184–189. https://doi.org/10.4103/2229-3485.167099
  9. SEER*Explorer: An interactive website for SEER cancer statistics [Internet]. Surveillance Research Program, National Cancer Institute; 2023 Apr 19. [updated: 2023 Nov 16; cited 2023 Dec 14]. Available from: https://seer.cancer.gov/statistics-network/explorer/. Data source(s): SEER Incidence Data, November 2022 Submission (1975-2020), SEER 22 registries (excluding Illinois and Massachusetts). Expected Survival Life Tables by Socio-Economic Standards.
  10. Simon, R., Altman, D. Statistical aspects of prognostic factor studies in oncology. Br J Cancer 69, 979–985 (1994). https://doi.org/10.1038/bjc.1994.192
  11. Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. (2020) SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17(3), 261-272.
  12. Brenner, H., Arndt, V., Gefeller, O., & Hakulinen, T. (2004). An alternative approach to age adjustment of cancer survival rates. European Journal of Cancer, 40(15), 2317–2322. doi:10.1016/j.ejca.2004.07.007

About the author

Vedanth Ramji

Vedanth is a junior at APL Global School, Chennai, India. He is passionate about research in computational biology and creating digital healthcare solutions. Vedanth is currently a long-term student researcher at the Big Data Biology Lab at Queensland University of Technology, where he works on bioinformatics tools and conducts research on antimicrobial resistance and metagenomics.

He is also a software developer for Queromatics, a not-for-profit cancer research and consulting organization, where he developed a cancer treatment planning app – Cancerstop. Vedanth is the founder of Thaavaram, a global surveillance system to collect data, detect, monitor, and act on antimicrobial resistance in plants.