Introduction of the mini-project, Example #4:
This example addresses a number up issues.
Firstly, it circumvents the issue identifed in Example #3 which is the need to have up to date data. The data are now retrieved from the NASA.
Secondly, would like to further test mixing R and Python code. The work could be done in either language but yet, the ability to mix the best of both languages is desirable in some circumstances.
And finally, it interesting to look at the data on global warming. Of course, this is one of the most important challenges that humanity faces, if not the most important.
Data from:
https://data.giss.nasa.gov/gistemp/GISTEMP Team, 2023: GISS Surface Temperature Analysis (GISTEMP), version 4. NASA Goddard Institute for Space Studies. Dataset accessed 20YY-MM-DD at https://data.giss.nasa.gov/gistemp/.
Lenssen, N., G. Schmidt, J. Hansen, M. Menne, A. Persin, R. Ruedy, and D. Zyss, 2019: Improvements in the GISTEMP uncertainty model. J. Geophys. Res. Atmos., 124, no. 12, 6307-6326, doi:10.1029/2018JD029522.
[1] "Data retrieved 2024-11-13"
Year Jan Feb Mar Apr May ... J-D D-N DJF MAM JJA SON
0 1880 -.20 -.26 -.09 -.17 -.10 ... -.18 *** *** -.12 -.18 -.21
1 1881 -.20 -.16 .02 .04 .06 ... -.09 -.10 -.18 .04 -.08 -.19
2 1882 .16 .13 .04 -.16 -.14 ... -.11 -.09 .07 -.09 -.16 -.19
3 1883 -.29 -.37 -.12 -.18 -.18 ... -.18 -.20 -.34 -.16 -.10 -.19
4 1884 -.13 -.08 -.36 -.40 -.34 ... -.28 -.27 -.11 -.37 -.31 -.28
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
140 2020 1.18 1.24 1.18 1.12 .99 ... 1.01 1.03 1.18 1.10 .89 .98
141 2021 .81 .65 .89 .76 .79 ... .85 .84 .75 .81 .86 .95
142 2022 .91 .89 1.05 .83 .84 ... .89 .90 .89 .91 .94 .86
143 2023 .88 .98 1.23 .99 .94 ... 1.17 1.13 .89 1.06 1.16 1.41
144 2024 1.25 1.44 1.39 1.32 1.16 ... *** *** 1.35 1.29 1.25 ***
[145 rows x 19 columns]
[1] "Data retrieved 2024-11-13"
Restructure and validate this data reorginization in R. Just showing 3 sample rows
# A tibble: 3 × 13
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1880 -0.2 -0.26 -0.09 -0.17 -0.1 -0.22 -0.2 -0.11 -0.16 -0.23 -0.23 -0.19
2 1881 -0.2 -0.16 0.02 0.04 0.06 -0.19 0.01 -0.04 -0.16 -0.22 -0.19 -0.08
3 1882 0.16 0.13 0.04 -0.16 -0.14 -0.23 -0.16 -0.08 -0.15 -0.24 -0.17 -0.36
# A tibble: 1,740 × 6
Year month Anomoly_Celcius Date Month_Number color_scale
<int> <chr> <dbl> <chr> <int> <int>
1 1880 Jan -0.2 1880-01-01 1 0
2 1880 Feb -0.26 1880-02-01 2 0
3 1880 Mar -0.09 1880-03-01 3 0
4 1880 Apr -0.17 1880-04-01 4 0
5 1880 May -0.1 1880-05-01 5 0
6 1880 Jun -0.22 1880-06-01 6 0
7 1880 Jul -0.2 1880-07-01 7 0
8 1880 Aug -0.11 1880-08-01 8 0
9 1880 Sep -0.16 1880-09-01 9 0
10 1880 Oct -0.23 1880-10-01 10 0
11 1880 Nov -0.23 1880-11-01 11 0
12 1880 Dec -0.19 1880-12-01 12 0
13 1881 Jan -0.2 1881-01-01 1 1
14 1881 Feb -0.16 1881-02-01 2 1
15 1881 Mar 0.02 1881-03-01 3 1
16 1881 Apr 0.04 1881-04-01 4 1
17 1881 May 0.06 1881-05-01 5 1
18 1881 Jun -0.19 1881-06-01 6 1
19 1881 Jul 0.01 1881-07-01 7 1
20 1881 Aug -0.04 1881-08-01 8 1
21 1881 Sep -0.16 1881-09-01 9 1
22 1881 Oct -0.22 1881-10-01 10 1
23 1881 Nov -0.19 1881-11-01 11 1
24 1881 Dec -0.08 1881-12-01 12 1
25 1882 Jan 0.16 1882-01-01 1 2
26 1882 Feb 0.13 1882-02-01 2 2
27 1882 Mar 0.04 1882-03-01 3 2
28 1882 Apr -0.16 1882-04-01 4 2
29 1882 May -0.14 1882-05-01 5 2
30 1882 Jun -0.23 1882-06-01 6 2
31 1882 Jul -0.16 1882-07-01 7 2
32 1882 Aug -0.08 1882-08-01 8 2
33 1882 Sep -0.15 1882-09-01 9 2
34 1882 Oct -0.24 1882-10-01 10 2
35 1882 Nov -0.17 1882-11-01 11 2
36 1882 Dec -0.36 1882-12-01 12 2
# ℹ 1,704 more rows
The line plot shows all the available data. An approximately linear increase in temparature anomoly can be seen from around 1978.
A violin plot of all the available data. Each ‘violin’ is a summary of data points for each of the 12 month for that given year. The ‘violin’ encompasses a box-plot if one Zooms in. Zooming and other dynamic plotting features require a manual refresh after using the controls.
The variability of anomolies can be seen clearly with some years have low variability e.g. 1971 and some high e.g. 2023.
Violin Plot Since 1960
A violin plot of data limited to a minimum year. The violins are accompanied by data points dots to the left for each month.
This plot I find remarkable: the line colour is a gradient from green through orange to red and is only assigned by the year (i.e. the temperature anomoly is not considered in choosing the color).
The colored banding pattern correlates quite well with temperature anomalies. Sadly, there is no doubt that climate change is progressing year by year and this plot illustrates the phenomena quite well.
Like the previous plot but now the data are truncated by a minimum year to plot in order to give a better resolution. The line plotting function joins the points with the spline function which is a bit more appealing i think. Color coding is the same as in the “Line Plot” i.e. only colored by year. The steady progress of climate change is astonishing, and sadly, 2023 seems to represent some kind of accelerated change as it really stands out from the pack of already disturbing anomaly curves.
Line Plot Since 1975
The main task target of exploring Dashboards with R and Python mash-up is completed. However, there are really many ways of representing, analyzing and processing this data.
Some targets, given time, include:
---
title: "Example #4"
format: dashboard
---
```{r}
# use of reticulate allows variables created in python
# to be accessed by R and vice versa
library(reticulate)
# get the date for use later
sys_date <- format(Sys.Date(), "%Y-%m-%d")
# save the retrieved data so it can be easily worked on in console
# for dev purposes
file_name <- paste0('./data/eg4_data_',sys_date,'.tsv')
```
```{python}
# imports
import pandas as pd
import math
import plotly.express as px
import datapackage
import csv
import sys
import urllib.request
from urllib.error import URLError, HTTPError
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from pathlib import Path
from textwrap import wrap
# Function to get colors array of specified length
# Ranges from green through orange to red to indicate increasing temperature anomoly
def get_color_array(length):
# Create a custom color map
cmap = mcolors.LinearSegmentedColormap.from_list("list_name", ["green", "orange", "red"])
color_array = [mcolors.rgb2hex(cmap(i)[:3]) for i in np.linspace(0, 1, length)]
return color_array
```
# Data Fetch
## Row {height="40%"}
Introduction of the mini-project, Example #4:
This example addresses a number up issues.
- Firstly, it circumvents the issue identifed in Example #3 which is the need to have up to date data. The data are now
retrieved from the NASA.
- Secondly, would like to further test mixing R and Python code. The work could be done in either language but yet, the ability
to mix the best of both languages is desirable in some circumstances.
- And finally, it interesting to look at the data on global warming. Of course, this is one of the most important challenges
that humanity faces, if not the most important.
Data from:
<https://data.giss.nasa.gov/gistemp/>
- Section: *Global-mean monthly, seasonal, and annual means, 1880-present, updated through most recent month*
Citations:
- GISTEMP Team, 2023: GISS Surface Temperature Analysis (GISTEMP), version 4. NASA Goddard Institute for Space Studies. Dataset
accessed 20YY-MM-DD at <https://data.giss.nasa.gov/gistemp/>.
- Lenssen, N., G. Schmidt, J. Hansen, M. Menne, A. Persin, R. Ruedy, and D. Zyss, 2019: [Improvements in the GISTEMP uncertainty
model](https://pubs.giss.nasa.gov/abs/le05800h.html). J. Geophys. Res. Atmos., **124**, no. 12, 6307-6326,
doi:10.1029/2018JD029522.
## Row {height="10%"}
```{r}
#| title: Data Source
print(paste0("Data retrieved ", sys_date))
```
## Row {height="50%"}
```{python}
#| title: Data Extraction
url = 'https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.csv'
try:
# Attempt to open the URL
response = urllib.request.urlopen(url)
lines = [l.decode('utf-8') for l in response.readlines()]
# remove the first line - it's not needed
first = lines.pop(0)
# print("Removed line: " + first)
cr = csv.reader(lines)
# the first line beccomes the header
df = pd.DataFrame(cr, columns=next(cr))
print(df)
except HTTPError as e:
# Handle HTTPError (e.g., 404 Not Found, 500 Internal Server Error)
print(f"HTTPError: {e.code} - {e.reason}")
sys.exit("Exiting")
except URLError as e:
# Handle URLError (e.g., network connectivity issues)
print(f"URLError: {e.reason}")
sys.exit("Exiting")
except Exception as e:
# Handle other exceptions
print(f"An unexpected error occurred: {e}")
sys.exit("Exiting")
# how to gracefully exit from R, Quarto, knitr when a python exception is thrown
```
# Data Restructure
## Row {height="10%"}
```{r}
#| title: Tab information
print(paste0("Data retrieved ", sys_date))
```
## Row {height="10%"}
Restructure and validate this data reorginization in R. Just showing 3 sample rows
## Row {height="20%"}
```{r}
#| title: Source and restructured data sample
library("tidyr")
library("dplyr")
library("stringr")
# this is the python data frame coped as R data frame
df <-py$df
# remove the string ***
# it means there is no data yet
df <- df %>%
mutate_all(~replace(., grepl('\\*\\*\\*', .), ''))
write.table(df, file=file_name, quote=FALSE, sep='\t')
df <- read.table(file = file_name, sep = '\t', header = TRUE)
df <- as_tibble(df)
# just take the monthly data
dfm <- df[,1:13]
#
n_head <- 3
head(dfm,n_head)
```
## Row {height="60%"}
```{r}
# pivot long
dfml <- pivot_longer(dfm, cols = 2:13)
colnames(dfml)[colnames(dfml) == 'name'] <- 'month'
colnames(dfml)[colnames(dfml) == 'value'] <- 'Anomoly_Celcius'
dfml$Date <- dfml$month
dfml <- dfml %>% mutate(Date = case_when(
Date == 'Jan' ~ paste0(Year, '-01-01'),
Date == 'Feb' ~ paste0(Year, '-02-01'),
Date == 'Mar' ~ paste0(Year, '-03-01'),
Date == 'Apr' ~ paste0(Year, '-04-01'),
Date == 'May' ~ paste0(Year, '-05-01'),
Date == 'Jun' ~ paste0(Year, '-06-01'),
Date == 'Jul' ~ paste0(Year, '-07-01'),
Date == 'Aug' ~ paste0(Year, '-08-01'),
Date == 'Sep' ~ paste0(Year, '-09-01'),
Date == 'Oct' ~ paste0(Year, '-10-01'),
Date == 'Nov' ~ paste0(Year, '-11-01'),
Date == 'Dec' ~ paste0(Year, '-12-01'),
TRUE ~ 'Exception'))
# make a separate column for month number
dfml$Month_Number <- dfml$Date
dfml <- dfml %>% mutate(Month_Number = str_replace_all(Month_Number, "\\d{4}-", ""))
dfml <- dfml %>% mutate(Month_Number = str_replace_all(Month_Number, "-\\d{2}", ""))
dfml$Month_Number <- as.integer(dfml$Month_Number)
# make a separate column color scale
dfml$color_scale <- as.integer(dfml$Year - 1880)
dfml %>% print(n = n_head * 12)
# send R df back to python
py$dfml <- dfml
```
# Line Plot
## Row {height="10%"}
The line plot shows all the available data. An approximately linear increase in temparature anomoly can be seen from around 1978.
## Row {height="90%"}
```{python}
#| title: Line Plot
# Convert the 'Date' column to datetime64
dfml['Date'] = pd.to_datetime(dfml['Date'])
# for use in another mini-project
dfml.to_csv("../Dash/gistemp.csv",index=False)
```
```{python}
#| title: Line Plot
fig = px.line(
dfml, x="Date", y="Anomoly_Celcius"
)
# Set x-axis Date range
start_date = '1880-01-01'
end_date = '2030-01-01'
fig.update_xaxes(range=[start_date, end_date])
```
# Violin Plot
## Row {height="15%"}
A violin plot of all the available data. Each 'violin' is a summary of data points for each of the 12 month for that given year.
The 'violin' encompasses a box-plot if one Zooms in. Zooming and other dynamic plotting features require a manual refresh after
using the controls.
The variability of anomolies can be seen clearly with some years have low variability e.g. 1971 and some high e.g. 2023.
## Row {height="85%"}
```{python}
#| title: Violin plot of all available data.
fig = px.violin(dfml, y="Anomoly_Celcius", x="Year", box=True,
hover_data=[dfml.Anomoly_Celcius, dfml.Year, dfml.month] )
fig.show()
```
# Violin Since..
## Row {height="5%"}
```{python}
year_var = 1960
print("Violin Plot Since", year_var)
```
## Row {height="5%"}
A violin plot of data limited to a minimum year. The violins are accompanied by data points dots to the left for each month.
## Row {height="90%"}
```{python}
#| title: Violin Plot Since..
dfml_short = dfml[dfml['Year'] >= year_var]
fig = px.violin(dfml_short,
y="Anomoly_Celcius", x="Year", box=False,
points="all",
hover_data=[dfml_short.Anomoly_Celcius, dfml_short.Year, dfml_short.month])
fig.show()
```
# Line by Year
## Row {height="15%"}
- This plot I find remarkable: the line colour is a gradient from green through orange to red and is only assigned by the year
(i.e. the temperature anomoly is not considered in choosing the color).
- The colored banding pattern correlates quite well with temperature anomalies. Sadly, there is no doubt that climate change is
progressing year by year and this plot illustrates the phenomena quite well.
## Row {height="85%"}
```{python}
#| title: Line by Year
# very strange that line_shape='spline' results in an error here
# but with the almost identical code below it's not a problem
# Specify the desired length of the color array as
# the number of distinct years
# Generate the color array
colors = get_color_array(dfml['Year'].nunique())
fig1 = px.line(dfml,
y="Anomoly_Celcius", x="Month_Number",
line_group="Year", line_shape='hvh',
markers=True,
color='Year',
color_discrete_sequence=colors,
hover_data=[dfml.Anomoly_Celcius, dfml.Year, dfml.month]
)
# Set the x-axis resolution to label every month
# Strangely these arguments do not work when generating the initial plot
fig1.update_layout(xaxis=dict(
tickmode='linear', # Use linear tick mode
dtick=1 # Specify the interval between ticks
))
```
# Line by Year Since..
Like the previous plot but now the data are truncated by a minimum year to plot in order to give a better resolution. The line
plotting function joins the points with the spline function which is a bit more appealing i think. Color coding is the same as in
the "Line Plot" i.e. only colored by year. The steady progress of climate change is astonishing, and sadly, 2023 seems to
represent some kind of accelerated change as it really stands out from the pack of already disturbing anomaly curves.
## Row {height="10%"}
```{python}
year_var = 1975
print("Line Plot Since", year_var)
```
## Row {height="90%"}
```{python}
#| title: Line by Year Since..
dfml_short = dfml[dfml['Year'] >= year_var]
# Specify the desired length of the color array as
# the number of distinct years
# Generate the color array
colors = get_color_array(dfml_short['Year'].nunique())
fig2 = px.line(dfml_short,
y="Anomoly_Celcius", x="Month_Number", line_shape='spline',
line_group="Year",
color='Year', color_discrete_sequence=colors,
markers=True,
hover_data=[dfml_short.Anomoly_Celcius, dfml_short.Year, dfml_short.month])
# Set the x-axis resolution to label every month
# Strangely these arguments do not work when generating the initial plot
fig2.update_layout(xaxis=dict(
tickmode='linear', # Use linear tick mode
dtick=1 # Specify the interval between ticks
))
```
# Always more to do..
The main task target of exploring Dashboards with R and Python mash-up is completed. However, there are really many ways of
representing, analyzing and processing this data.
Some targets, given time, include:
(1) plots of yearly averaged data (simple correlations, heat-maps..)
(2) statistical hypothesis testing
(3) regression analysis and prediction
(4) more dynamic interaction features
(5) small restructing for robustness: locally cache the read of the source data and operate on the latest version of valid data in
the case of server or network down conditions
(6) data source change trigger
(7) and surely, more points to add......
# Code
## Row
```{python}
# wrap just the long lines over a specified number of characters
def wrap_long_lines(input_text, max_line_chars):
lines = input_text.split('\n')
wrapped_lines = []
for line in lines:
if len(line) > max_line_chars:
wrapped_lines.extend(wrap(line, width=max_line_chars))
else:
wrapped_lines.append(line)
wrapped_text = '\n'.join(wrapped_lines)
return wrapped_text
txt = Path('example_4.qmd').read_text()
# its a wrap!
# those long comments are now readable the Code tab
wrapped_txt = wrap_long_lines(txt, 130)
print(wrapped_txt)
```