Exploratory data analysis of multivariate time series data

The secret to interactive visualization with Plotly

This post describes the main types of visualization for exploratory analysis of multivariate time series and provides code snippets for each of them, built with the Plotly Python library.

A 3D render of data varying in space and time. Generated by DALL-E-II

Motivation

“There’s nothing like seeing for oneself”. ― Japanese proverb

Exploratory multivariate data analysis

Multivariate time series data refers to a set of observations of multiple variables measured over time. It is a type of data that is characterized by multiple variables recorded at regular intervals, such as daily, weekly, or monthly. In contrast to univariate time series data, which only contains observations for a single variable, multivariate time series data contains observations for multiple variables. For example, a multivariate time series dataset might contain observations of temperature, rainfall, and wind speed recorded daily for a particular region or multiple areas over the course of several years. Each observation in the dataset would contain values for all three variables at a given time, and the data would be ordered by time.
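To make the weather example concrete, here is a minimal sketch of such a dataset as a pandas DataFrame. The column names and the synthetic values are assumptions made up for illustration; they are not part of the dataset used later in this post.

```python
import numpy as np
import pandas as pd

# Synthetic daily weather observations (illustrative values only).
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=365, freq="D")
day = np.arange(365)

weather = pd.DataFrame(
    {
        # yearly temperature cycle plus noise
        "temperature_c": 10 + 8 * np.sin(2 * np.pi * day / 365) + rng.normal(0, 2, 365),
        # non-negative, right-skewed rainfall
        "rainfall_mm": rng.gamma(shape=1.5, scale=2.0, size=365),
        # non-negative wind speed
        "wind_speed_ms": 6 * rng.weibull(2.0, size=365),
    },
    index=dates,
)

# Each row is one time step; each column is one variable.
print(weather.head())
```

The same layout (rows ordered by time, one column per variable) is what the rest of the post assumes for the asset data.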

Exploratory multivariate data analysis aims to summarize, visualize, and understand complex relationships among multiple variables in a dataset. Its goal is to uncover patterns, relationships, and anomalies that are not evident, or are hard to find, in univariate or bivariate analyses, including those that are specific to certain variables and time periods.

Multivariate time series visualization

Visualization is an important tool for exploratory data analysis as it allows for the effective representation and communication of complex data sets. By using various types of charts, graphs, and maps, data analysts can do two major things:

  1. identify patterns and trends within the data, and make informed decisions based on that information.
  2. communicate data insights to others, making it an effective way to share findings and collaborate on data-driven projects.

Visualization is particularly relevant to multivariate time series analysis, as it allows for the simultaneous examination of multiple variables over time. By using line charts, scatter plots, or heat maps, data analysts can easily identify patterns and relationships between different variables, and understand how they change over time. This is especially useful when working with large and complex data sets, as it can be difficult to make sense of the data without some form of visual representation.

Overall, the ability to effectively visualize data can greatly enhance the efficiency and effectiveness of data analysis, and is a valuable skill for any data professional to possess.

Types of exploratory analysis

The exploratory analysis of multivariate time series can be grouped into three categories, each explored in this post through several visualization types: distributional analysis, temporal analysis, and spatial analysis.

Before diving into each category, let’s prepare the dataset for the visualisation examples.

Data

The exploratory data analysis can be applied to a wide range of data, such as environmental data, energy data, social data, financial data, and more. In this post, the financial data from the latest M6 Financial Forecasting Competition are used:

import yfinance as yf

#The M6 asset universe
assets = [
  "ABBV","ACN","AEP","AIZ","ALLE","AMAT","AMP","AMZN","AVB","AVY",
  "AXP","BDX","BF-B","BMY","BR","CARR","CDW","CE","CHTR","CNC",
  "CNP","COP","CTAS","CZR","DG","DPZ","DRE","DXC","META","FTV",
  "GOOG","GPC","HIG","HST","JPM","KR","OGN","PG","PPL","PRU",
  "PYPL","RE","ROL","ROST","UNH","URI","V","VRSK","WRK","XOM",
  "IVV","IWM","EWU","EWG","EWL","EWQ","IEUS","EWJ","EWT","MCHI",
  "INDA","EWY","EWA","EWH","EWZ","EWC","IEMG","LQD","HYG","SHY",
  "IEF","TLT","SEGA.L","IEAA.L","HIGH.L","JPEA.L","IAU","SLV","GSG","REET",
  "ICLN","IXN","IGF","IUVL.L","IUMO.L","SPMV.L","IEVL.L","IEFM.L","MVEU.L",
  "XLK","XLF","XLV","XLE","XLY","XLI","XLC","XLU","XLP","XLB","VXX"]

#Download historical data (select starting date)
starting_date = "2015-01-01"

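# auto_adjust=False keeps the separate 'Adj Close' column in recent yfinance versions
data = yf.download(assets, start=starting_date, ignore_tz=True, auto_adjust=False)
prices = data['Adj Close']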

Let’s visualize the missing data in the dataset:

The horizontal lines are likely to indicate bank holidays, while vertical lines indicate periods when a stock was not traded and is therefore absent from the dataset. DRE (Duke Realty Corp) has not been traded since September 30th, 2022; its complete absence from the data is likely an artifact of the yfinance download.

Let’s fill the missing values and transform the adjusted close prices into the target data. The target data for the forecasting part of the competition is the percentage return of the adjusted close price over a four-week period.

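import pandas as pd

def calculate_pct_returns(x: pd.Series, periods: int) -> pd.Series:
    # gross return over `periods` rows: price[t] / price[t - periods]
    return 1 + x.pct_change(periods=periods)

# drop assets without any data, then fill the remaining gaps
# (forward fill first, backward fill for gaps at the start)
prices = prices.dropna(how="all", axis=1).ffill().bfill()

# 20 trading days correspond to roughly four weeks
target_data = prices.apply(calculate_pct_returns, periods=20, axis=0).dropna()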
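As a quick sanity check of the transformation, the sketch below shows on a toy series that `1 + pct_change(periods=k)` equals the price ratio between time t and time t - k. The numbers are made up for illustration.

```python
import pandas as pd

# Toy price series (illustrative values).
prices_demo = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0])
periods = 2

# Gross return as computed in the post ...
gross_return = 1 + prices_demo.pct_change(periods=periods)
# ... and the equivalent price ratio price[t] / price[t - periods].
price_ratio = prices_demo / prices_demo.shift(periods)

print(pd.DataFrame({"gross_return": gross_return, "price_ratio": price_ratio}))
```

This is why the target values fluctuate around 1 rather than around 0: a value of 1.05 means the price rose by 5% over the window.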

Distributional analysis

Distributional analysis is a statistical technique used to examine the distribution of a set of data. It involves studying the shape, central tendency, spread, and other features of the distribution of a variable, and making inferences about the underlying population from which the data was sampled.

The goal of distributional analysis is to understand the underlying pattern of the data and how it varies across different groups or conditions.

Box plot

A box plot, also known as a box and whisker plot, is a type of data visualization that displays the distribution of a set of continuous or ordinal data. It provides a compact and informative summary of the data by displaying the median, quartiles, and outliers of the data in a single plot.

A box plot consists of a box that represents the interquartile range (IQR) of the data, which encompasses the middle 50% of the data. The box is drawn from the first quartile (25th percentile) to the third quartile (75th percentile), and the line inside the box represents the median (50th percentile) of the data. The “whiskers” extending from the box represent the range of the data, excluding outliers, which are plotted as individual points outside the whiskers.

Box plots are widely used to compare the distribution of multiple variables, to identify outliers, and to detect skewness or symmetry in the data. By visualizing the quartiles, median, and outliers, box plots provide a concise and interpretable summary of the distribution of the data.
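The quantities drawn in a box plot can be computed by hand. The sketch below assumes the common convention that whiskers extend to the most extreme points within 1.5 times the IQR of the box; the helper name `box_plot_summary` is hypothetical, and plotting libraries may use slightly different quartile definitions.

```python
import numpy as np

def box_plot_summary(values, whisker=1.5):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR convention."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside.min(),    # lowest point that is not an outlier
        "whisker_high": inside.max(),   # highest point that is not an outlier
        "outliers": outliers,
    }

summary = box_plot_summary([0.9, 0.95, 1.0, 1.02, 1.05, 1.1, 1.8])
print(summary)
```

In this toy sample, the value 1.8 lies above the upper fence and would be drawn as an individual outlier point.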

Box plot code
import numpy as np
import plotly.graph_objects as go

df = target_data.copy()
fig = go.Figure()
N = len(df.columns)     # Number of boxes
# generate an array of rainbow colors by fixing the saturation and lightness of the HSL
# representation of colour and marching around the hue.
# Plotly accepts any CSS color format, see e.g. http://www.w3schools.com/cssref/css_colors_legal.asp.
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, N)]

for i, column in enumerate(df.columns[:]):
    fig.add_trace(go.Box(y=df[column], name=column,  marker_color=c[i]))
    
# format the layout
fig.update_layout(
    legend=dict(orientation="v", yanchor="bottom", y=0.1, xanchor="right", x=1.15),
    title="Box plot",
    width=800,
    height=400,
    font=dict(family="Gravitas One", size=12, color="black"),
)
fig.update_xaxes(tickangle=-90)
fig.update_yaxes(title="Returns, [pu]")
fig.show()

Ridge/Joy plot

A Ridge (ridgeline) or Joy plot is a type of data visualization that displays a set of partially overlapping density curves, stacked vertically, to represent the distributions of continuous data across several groups. The name "joy plot" comes from the cover of Joy Division's 1979 album Unknown Pleasures, which shows a stack of radio pulsar signals drawn in this style.

The Ridge plot displays the distribution of the data over the range of values, with the height of each curve indicating the density of data points at a given value. Ridge plots can be used to visualize the distribution of a single variable and to compare the distributions of multiple variables. They are particularly useful for complex distributions whose shape is not easily summarized by a single statistic like the mean or median.

In the example below, the Ridge plot is used to compare the return densities of each asset by calendar month.
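Each curve in a ridge plot is a density estimate. The sketch below shows one way to compute such a curve, a Gaussian kernel density estimate with a normal-reference rule-of-thumb bandwidth. The function name `gaussian_kde_curve` and the bandwidth rule are assumptions for illustration; the plotting library may choose its bandwidth differently.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, bandwidth=None):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # normal-reference rule of thumb (an assumption of this sketch)
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    # one Gaussian bump per sample, averaged over all samples
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (n * bandwidth)

# Synthetic returns centred on 1.0 (illustrative values only).
rng = np.random.default_rng(1)
samples = rng.normal(loc=1.0, scale=0.05, size=500)
grid = np.linspace(0.7, 1.3, 601)
density = gaussian_kde_curve(samples, grid)

# The area under a density curve should be close to one.
area = float(density.sum() * (grid[1] - grid[0]))
print(round(area, 3))
```

Stacking one such curve per group, each with a vertical offset, gives the ridgeline layout.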

Ridge/Joy plot code
import plotly.graph_objects as go
from plotly.colors import n_colors

months = target_data.index.month.unique().to_list()
# one colour per calendar month, interpolated between two RGB endpoints
colors = n_colors('rgb(5, 200, 200)', 'rgb(200, 10, 10)', len(months), colortype='rgb')

fig = go.Figure()
# one half-violin per asset and calendar month
for column in target_data.columns.to_list()[::-1]:
    for j, month in enumerate(months):
        data_line = target_data.loc[target_data.index.month == month, column].dropna()
        fig.add_trace(go.Violin(x=data_line, legendgroup=str(month), scalegroup=str(month), line_color=colors[j], name=column))

fig.update_traces(orientation='h', side='positive', width=2, points=False)
fig.update_layout(xaxis_showgrid=False, xaxis_zeroline=False)
fig.update_layout(violingap=0, violinmode='overlay')
fig.update_layout(
    title="Ridgeline/Joy plot",
    width=800,
    height=800,
    font=dict(family="Gravitas One", size=12, color="black"),
)
fig.show()

Mean/Var plot

A Mean/Var scatter plot is a type of scatter plot where the x-axis represents the mean of the data, the y-axis represents the variable (here, the ticker), and the size of each dot shows the variance of the data. It is used to visualize the relationship between the mean and variance of a set of data, and can provide insights into the distribution of the data.

As in the Ridge plot, the points of the Mean/Var plot are grouped by month.

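Mean/Var plot code
import plotly.express as px

# reshape to long format: one row per (date, ticker) observation
long_df = (
    target_data.dropna()
    .rename_axis(index="Date")
    .reset_index()
    .melt(id_vars="Date", var_name="ticker", value_name="value")
)
long_df["Month"] = long_df["Date"].dt.month_name()

# mean and variance of the returns per ticker and calendar month
df = (
    long_df.groupby(["ticker", "Month"], sort=False)["value"]
    .agg(["mean", "var"])
    .reset_index()
)

# one colour per calendar month
n_months = 12
colors = px.colors.sample_colorscale(
    "plasma", [n / (n_months - 1) for n in range(n_months)]
)

df["Month"] = df["Month"].astype("category")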

fig = px.scatter(
    df,
    x="mean",
    y="ticker",
    color="Month",
    size="var",
    size_max=45,
    log_x=False,
    color_discrete_sequence=colors,
)

fig.update_xaxes(showline=True, linewidth=0.01, linecolor="grey", gridcolor="grey")
fig.update_yaxes(showline=True, linewidth=0.01, linecolor="grey", gridcolor="grey")
fig.update_layout(
    legend=dict(orientation="v", yanchor="bottom", y=0.05, xanchor="right", x=1.05),
    title="Mean/ Variance scatter plot",
    width=800,
    height=400,
    font=dict(family="Gravitas One", size=12, color="black"),
)
fig.show()

Q-Q plot

A Quantile-Quantile (Q-Q) plot is a graphical comparison of two distributions. It plots the quantiles of one dataset against the quantiles of another to check whether they are drawn from the same underlying distribution. If they are, the points lie close to the line y = x. If the points follow a different straight line, the two distributions have the same shape but differ in location or scale. Systematic deviations from a straight line indicate that the shapes of the distributions differ.

Q-Q plots can also be used to check if a dataset follows a specific theoretical distribution, such as a normal or exponential distribution. In that case, the quantiles of the data set are plotted against the corresponding quantiles of the theoretical distribution.
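The comparison against a theoretical normal distribution can be sketched with NumPy and the standard library alone. The helper `normal_qq_points` and the plotting positions `(i - 0.5) / n` are assumptions of this sketch; library implementations may use slightly different plotting positions.

```python
from statistics import NormalDist

import numpy as np

def normal_qq_points(sample):
    """Theoretical standard-normal quantiles paired with the sorted sample."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size
    probs = (np.arange(1, n + 1) - 0.5) / n   # plotting positions in (0, 1)
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, ordered

# Synthetic, normally distributed returns (illustrative values only).
rng = np.random.default_rng(3)
theo, samp = normal_qq_points(rng.normal(loc=1.0, scale=0.05, size=500))

# For normal data the points fall on a straight line whose slope and
# intercept approximate the standard deviation and the mean of the sample.
slope, intercept = np.polyfit(theo, samp, 1)
print(round(slope, 3), round(intercept, 3))
```

Plotting `samp` against `theo` as a scatter plot gives the Q-Q plot; heavy-tailed data would bend away from the fitted line at both ends.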

Here are some key steps for interpreting a Q-Q plot:

QQ plot code
import numpy as np
import plotly.graph_objects as go
from plotly.express.colors import sample_colorscale
from statsmodels.graphics.gofplots import qqplot

df = target_data.iloc[-365:]

x = np.linspace(0, 1, len(df.columns))
c = sample_colorscale('rainbow', list(x), colortype='rgb')


fig = go.Figure()

for i, column in enumerate(df.columns.to_list()):
    series = df[column]
    qqplot_data = qqplot(series, line='s').gca().lines
    fig.add_trace({
        'type': 'scatter',
        'x': qqplot_data[0].get_xdata(),
        'y': qqplot_data[0].get_ydata(),
        'mode': 'markers',
        'marker': {
            'color': c[i]
        },
        'legendgroup': column, 
        'name': column, 
        'showlegend': True
    })

    fig.add_trace({
        'type': 'scatter',
        'x': qqplot_data[1].get_xdata(),
        'y': qqplot_data[1].get_ydata(),
        'mode': 'lines',
        'line': {
            'color': c[i]
        },
        'legendgroup': column, 
        'name': column, 
        'showlegend': False

    })

fig['layout'].update({
    'title': 'Quantile-Quantile Plot',
    'xaxis': {
        'title': 'Theoretical Quantiles',
        'zeroline': False
    },
    'yaxis': {
        'title': 'Sample Quantiles'
    },
    'showlegend': False,
    'width': 800,
    'height': 700,
})


fig.show()

Histogram plot

A histogram is a graphical representation of data that groups data points into ranges and displays the frequency of data points within each range as bars. The x-axis of a histogram plot represents the range of values in the data, and the y-axis represents the frequency or count of data points within each range. The ranges are usually specified as bins, and the height of each bar represents the number of data points in the corresponding bin.

Histograms are used to visualize the distribution of a set of continuous or discrete data, and can provide information about the central tendency, skewness, and spread of the data. By comparing histograms of different datasets, one can gain insights into the similarities and differences between the data distributions.
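The binning behind a histogram can be reproduced with NumPy. This is a minimal sketch on synthetic returns; the number of bins and the percent normalization are choices made for illustration.

```python
import numpy as np

# Synthetic returns centred on 1.0 (illustrative values only).
rng = np.random.default_rng(42)
returns = rng.normal(loc=1.0, scale=0.05, size=1_000)

# Count the observations that fall into each of 20 equally wide bins.
counts, bin_edges = np.histogram(returns, bins=20)

# Normalize the counts so that the bar heights are percentages.
percent = 100 * counts / counts.sum()

for left, right, share in zip(bin_edges[:-1], bin_edges[1:], percent):
    print(f"[{left:.3f}, {right:.3f}): {share:5.1f}%")
```

With percent normalization the bar heights of each asset sum to 100, which makes histograms of differently sized samples comparable.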

Histogram plot
import plotly.graph_objects as go

import numpy as np

df = target_data.copy()

fig = go.Figure()
for column in df.columns:
    fig.add_trace(go.Histogram(
        x=df[column],
        cumulative_enabled=False,
        histnorm='percent',
        name=column, # name used in legend and hover labels
    ))

fig.update_layout(
    title_text='Histogram plot', # title of plot
    xaxis_title_text='Value', # xaxis label
    yaxis_title_text='Percent', # yaxis label (histnorm='percent')
)
fig.update_layout(barmode='overlay')
fig.update_xaxes(range=[0,3])
fig.show()

Temporal analysis

The goal of temporal analysis is to understand how the behavior of a variable changes over time and to identify any underlying patterns or trends in the data.

Line plot with sliding window

A line plot for temporal analysis is a type of data visualization that displays the change in a continuous or ordinal variable over time. Line plots provide a simple and intuitive way to understand how a variable changes over time, and to identify trends, patterns, and outliers in the data.

In a line plot, the x-axis is usually a time-based variable, such as date, time, or an index. The y-axis represents the value of the variable.

The example shows an advanced version of the line plot with a sliding window (range slider).

If you zoom in on the year 2022, you will see periodic variations across many assets that resemble the signal of an unstable system with increasing amplitude.

If you have high-granularity data over a long time period, line plots often exhaust the memory and crash your IDE. In that case, the plotly-resampler library can help.

Line plot with sliding window code

import plotly.express as px

df = target_data.copy()
df["Mean"] = df.mean(axis=1)
df = df.reset_index().rename(columns={"index": "Date"})

fig = px.line(
    df,
    x="Date",
    y=df.columns[1:],
    hover_data={"Date": "|%B %d, %Y"},
    title="Sliding window",
)

# range slider below the plot and quick-select buttons above it
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list(
            [
                dict(count=1, label="1m", step="month", stepmode="backward"),
                dict(count=6, label="6m", step="month", stepmode="backward"),
                dict(count=1, label="YTD", step="year", stepmode="todate"),
                dict(count=1, label="1y", step="year", stepmode="backward"),
                dict(step="all"),
            ]
        )
    ),
)
# tick label format depends on the zoom level
fig.update_layout(
    xaxis_tickformatstops=[
        dict(dtickrange=[None, 1000], value="%H:%M:%S.%L ms"),
        dict(dtickrange=[1000, 60000], value="%H:%M:%S s"),
        dict(dtickrange=[60000, 3600000], value="%H:%M m"),
        dict(dtickrange=[3600000, 86400000], value="%H:%M h"),
        dict(dtickrange=[86400000, 604800000], value="%e. %b d"),
        dict(dtickrange=[604800000, "M1"], value="%e. %b w"),
        dict(dtickrange=["M1", "M12"], value="%b '%y M"),
        dict(dtickrange=["M12", None], value="%Y Y"),
    ],
    margin=dict(l=50, r=0, t=50, b=50),
    font=dict(family="Gravitas One", size=12, color="black"),
)
fig.update_yaxes(
    range=[0.0, 3.0],
)

fig.show()

Heatmap plot

A heatmap plot is a type of data visualization that displays a matrix of data values as a grid of colored cells, where the color of each cell represents the magnitude of a specific data value. It is particularly useful for visualizing and analyzing multivariate datasets, where multiple variables are measured for each data point.


Heatmap plot code

import plotly.express as px
df = target_data.copy()
fig = px.imshow(df.T, 
                color_continuous_scale="Cividis_r", 
                origin='upper', 
                title="Heatmap plot",
                range_color=(0.5,1.75)
               )
# use a small font for the y-axis tick labels so that all tickers fit
fig.update_layout(
   yaxis = dict(
      tickfont=dict(family='Helvetica', size=8, color='black')
   ),
   font=dict(family="Gravitas One", size=12, color="black"),
)
fig.show()

Autocorrelation plots

Autocorrelation plots, also known as ACF (autocorrelation function) plots, show the correlation between a time series and lagged copies of itself, where the lag-k copy holds the values of the series k time steps earlier.

In an autocorrelation plot, the x-axis represents the lag, and the y-axis represents the correlation between the series and its lagged values. A positive autocorrelation at lag k means that values k steps apart tend to deviate from the mean in the same direction, which is typical of trends and persistent patterns. A negative autocorrelation means that they tend to deviate in opposite directions.

Partial autocorrelation (PACF) plots also relate a time series to its lagged values. Unlike autocorrelation plots, they show at each lag only the correlation that remains after the effect of all shorter lags has been removed.

For the autocorrelation plots, there are three options to use: a heatmap, a scatter plot, and a 3D surface plot.

Below, you can see these visualizations for the dataset. What is your favorite?
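The heatmap and 3D snippets below read two tables, `df_acf` and `df_pacf`, that are not defined in the post. Judging by how they are used, they hold one row per lag and one column per ticker. The sketch below shows one way such tables could be built; the helper names `sample_acf`, `sample_pacf` and `lag_table` are hypothetical, and the NumPy implementations are stand-ins that could be swapped for the `acf` and `pacf` functions of statsmodels, whose estimators may differ slightly.

```python
import numpy as np
import pandas as pd

def sample_acf(values, nlags):
    """Sample autocorrelation r_0..r_nlags (biased estimator, r_0 = 1)."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

def sample_pacf(values, nlags):
    """Partial autocorrelation via the Durbin-Levinson recursion."""
    r = sample_acf(values, nlags)
    pacf_values = np.ones(nlags + 1)
    phi_prev = np.array([])          # coefficients of the order k-1 model
    for k in range(1, nlags + 1):
        num = r[k] - np.dot(phi_prev, r[k - 1 : 0 : -1])
        den = 1.0 - np.dot(phi_prev, r[1:k])
        phi_kk = num / den           # partial autocorrelation at lag k
        phi_prev = np.append(phi_prev - phi_kk * phi_prev[::-1], phi_kk)
        pacf_values[k] = phi_kk
    return pacf_values

def lag_table(data, func, nlags):
    """Apply `func` to every column; rows are lags, columns are variables."""
    return pd.DataFrame(
        {column: func(data[column].to_numpy(), nlags) for column in data.columns},
        index=pd.RangeIndex(nlags + 1, name="lag"),
    )

# Synthetic AR(1) series as a stand-in for `target_data`.
rng = np.random.default_rng(0)
noise = rng.normal(size=(500, 3))
ar1 = np.zeros_like(noise)
for t in range(1, len(noise)):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]
demo = pd.DataFrame(ar1, columns=["A", "B", "C"])

df_acf = lag_table(demo, sample_acf, nlags=10)
df_pacf = lag_table(demo, sample_pacf, nlags=10)
print(df_acf.round(2).head())
```

For an AR(1) process the ACF decays gradually with the lag, while the PACF drops to roughly zero after lag 1, which is the kind of contrast the two plots are meant to reveal.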

Heatmap autocorrelation plot

Heatmap autocorrelation plot code
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# df_acf and df_pacf hold the ACF and PACF values:
# one row per lag, one column per ticker
fig = make_subplots(
    2,
    1,
    shared_xaxes=True,
    shared_yaxes=False,
    subplot_titles=(
        "Plot 1",
        "Plot 2",
    ),
    vertical_spacing=0.1,
)
names = {"Plot 1": "Autocorrelation (ACF)", "Plot 2": "Partial Autocorrelation (PACF)"}

# ACF heatmap: tickers on the y-axis, lags on the x-axis
fig.add_trace(
    go.Heatmap(
        z=df_acf.T,
        y=df_acf.T.index,
        x=df_acf.T.columns,
        colorscale="Rainbow",
        coloraxis="coloraxis1",
    ),
    row=1,
    col=1,
)

# PACF heatmap on the same colour axis
fig.add_trace(
    go.Heatmap(
        z=df_pacf.T,
        y=df_pacf.T.index,
        x=df_pacf.T.columns,
        colorscale="Rainbow",
        coloraxis="coloraxis1",
    ),
    row=2,
    col=1,
)

fig.update_layout(
    height=800,
    width=800,
    title_text="ACF and PACF",
    xaxis2_title="Time lag, [days]",
    yaxis_title="Ticker, [idx]",
    yaxis2_title="Ticker, [idx]",
    legend_title="Legend",
    showlegend=False,
    margin=dict(l=50, r=0, t=50, b=50),
    font=dict(family="Gravitas One", size=12, color="black"),
    yaxis1_nticks=40,
    yaxis2_nticks=40,
    yaxis1=dict(tickfont=dict(family="Helvetica", size=8, color="black")),
    yaxis2=dict(tickfont=dict(family="Helvetica", size=8, color="black")),
    coloraxis1_colorbar=dict(
        thickness=20.0,
        title="",
    ),
)

# replace the placeholder subplot titles
fig.for_each_annotation(lambda a: a.update(text=names[a.text]))
fig.show()

Scatter autocorrelation plot

Scatter autocorrelation plot code

modified from here

import numpy as np
from plotly.express.colors import sample_colorscale
from plotly.subplots import make_subplots
from statsmodels.tsa.stattools import acf, pacf

x = np.linspace(0, 1, len(target_data.columns))
c = sample_colorscale('rainbow', list(x), colortype='rgb')

def rgb_to_rgba(color, alpha=0.05):
    # "rgb(r, g, b)" -> "rgba(r, g, b, alpha)"
    return "rgba" + color[3:-1] + f", {alpha})"

fig = make_subplots(2, 1, shared_xaxes=True, shared_yaxes=False, subplot_titles=("Plot 1", "Plot 2",), vertical_spacing=0.1,)
names = {'Plot 1':'Autocorrelation (ACF)', 'Plot 2':'Partial Autocorrelation (PACF)'}
nlags = 50

for j, func in zip(range(1,3), [acf, pacf]):
    for i, column in enumerate(target_data.columns.to_list()):
        series = target_data[column]
        # correlations and their 95% confidence intervals
        corr, confint = func(series, alpha=0.05, nlags=nlags)
        lags = np.arange(len(corr))
        # confidence band centred on zero
        lower_y = confint[:, 0] - corr
        upper_y = confint[:, 1] - corr
        fig.add_scatter(x=lags, y=corr, mode='markers', marker_color=c[i],
                        marker_size=12, name=column, row=j, col=1)
        # all stems in one trace; None breaks the line between two lags
        stem_x = [v for lag in lags.tolist() for v in (lag, lag, None)]
        stem_y = [v for value in corr.tolist() for v in (0, value, None)]
        fig.add_scatter(x=stem_x, y=stem_y, mode='lines', line_color='#3f3f3f', line_width=0.1, name=column, row=j, col=1)
        fig.add_scatter(x=lags, y=upper_y, mode='lines', line_color=rgb_to_rgba(c[i]), name=column, row=j, col=1)
        fig.add_scatter(x=lags, y=lower_y, mode='lines', fillcolor=rgb_to_rgba(c[i]), name=column,
                        fill='tonexty', line_color=rgb_to_rgba(c[i]), row=j, col=1)
fig.update_traces(showlegend=False)
fig.update_xaxes(range=[-1,50])
fig.update_yaxes(zerolinecolor='#000000')
fig.update_layout(
    title_text="ACF and PACF",
    xaxis2_title="Time lag, [days]",
    yaxis_title="ACF, [pu]",
    yaxis2_title="PACF, [pu]",
    legend_title="Legend",
    showlegend = False,
    width=800,
    height=800,
    font=dict(family="Gravitas One", size=12, color="black"),
)
fig.for_each_annotation(lambda a: a.update(text = names[a.text]))
fig.show()

3D autocorrelation plot

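3D autocorrelation plot code
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# df_acf and df_pacf hold the ACF and PACF values:
# one row per lag, one column per ticker
fig = make_subplots(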
    1,
    2,
    shared_xaxes=False,
    shared_yaxes=False,
    subplot_titles=(
        "Plot 1",
        "Plot 2",
    ),
    # horizontal_spacing=0.5,
    specs=[[{"type": "surface"}, {"type": "surface"}]],
)
names = {"Plot 1": "Autocorrelation (ACF)", "Plot 2": "Partial Autocorrelation (PACF)"}


fig.add_trace(
    go.Surface(z=df_acf.values, x=df_acf.columns, y=df_acf.index, showscale=True),
    row=1,
    col=1,
)

fig.add_trace(
    go.Surface(z=df_pacf.values, x=df_pacf.columns, y=df_pacf.index, showscale=True),
    row=1,
    col=2,
)

fig.update_layout(
    title="3d ACF and PACF",
    autosize=True,
    width=1200,
    height=600,
    margin=dict(l=65, r=50, b=105, t=90),
    scene=dict(
        xaxis_title="Ticker, [idx]",
        yaxis_title="Time lag, [days]",
        zaxis_title="ACF",
    ),
    scene2=dict(
        xaxis_title="Ticker, [idx]",
        yaxis_title="Time lag, [days]",
        zaxis_title="PACF",
    ),
    font=dict(family="Gravitas One", size=12, color="black"),
)
fig.update_traces(
    contours_z=dict(
        show=True, usecolormap=True, highlightcolor="limegreen", project_z=True
    )
)
fig.for_each_annotation(lambda a: a.update(text=names[a.text]))

fig.show()

Scatter polar plot

A scatter polar plot is a type of data visualization that combines a scatter plot and a polar plot. It is used to display the relationship between two variables, where one variable is represented by a radial distance from the origin, and the other variable is represented by an angle around the origin. Scatter polar plots are useful for displaying how a quantity varies along a cyclic variable, such as wind speed against wind direction or a value against the day of the year, and for displaying data that is naturally expressed in polar coordinates.

Scatter polar code
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import FunctionTransformer

def sin_transformer(period):
    return FunctionTransformer(lambda x: np.sin(x / period * np.pi / 2) * 360)

def cos_transformer(period):
    return FunctionTransformer(lambda x: np.cos(x / period * np.pi / 2) * 360)

test_df = target_data.dropna().copy()
test_df["dayofweek"] = test_df.index.dayofweek
test_df["dayofyear"] = test_df.index.dayofyear

test_df["sin_dayofyear"] = sin_transformer(365).fit_transform(test_df["dayofyear"])
test_df["cos_dayofyear"] = cos_transformer(365).fit_transform(test_df["dayofyear"])
test_df = test_df[test_df.index.year == 2022]
names = {'Plot 1':'Cos(f)', 'Plot 2':'Sin(f)'}

fig = make_subplots(
    1,
    2,
    shared_xaxes=False,
    shared_yaxes=False,
    subplot_titles=(
        "Plot 1",
        "Plot 2",
    ),
    horizontal_spacing=0.1,
    specs=[[{"type": "scatterpolar"}, {"type": "scatterpolar"}]],

)

for column in target_data.columns.to_list():
    fig.add_trace(
        go.Scatterpolar(
            r=test_df[column],
            theta=test_df["cos_dayofyear"],
            mode="lines",
            name=column,
        ),
        row=1,
        col=1,
    )
    fig.add_trace(
        go.Scatterpolar(
            r=test_df[column],
            theta=test_df["sin_dayofyear"],
            mode="lines",
            name=column,
        ),
        row=1,
        col=2,
    )

fig.update_layout(    
    title_text="Scatter polar",
    showlegend=False, 
    width=1000,
    height=600,
    font=dict(family="Gravitas One", size=12, color="black"),)
fig.for_each_annotation(lambda a: a.update(text = names[a.text]))

fig.show()

Lagged scatter plot

A lagged scatter plot (lag plot) is a type of data visualization that shows the relationship between a time series and its own past. Each point pairs the value of the series at a given time step (y-axis) with the value k time steps earlier (x-axis), where k is the lag.

Here are some key steps for interpreting a lagged scatter plot:

  1. Identify the trend: Look for overall patterns in the data. If the data points form a clear diagonal line from the bottom left to the top right of the plot, it indicates a positive relationship between the values and their lags. If the data points form a clear diagonal line from the top left to the bottom right, it indicates a negative relationship.

  2. Identify the strength of the relationship: The strength of the relationship between the values and their lags can be estimated by the degree of clustering of the data points. A tight cluster of data points indicates a strong relationship, while a dispersed scatter of data points indicates a weak relationship.

  3. Identify the seasonality: If the data points form clear clusters at regular intervals, it indicates a seasonal pattern in the data. The number of clusters can be used to determine the frequency of the seasonality.

  4. Identify the appropriate number of lags: The appropriate number of lags to include in a time-series model can be determined by the number of lags that have a significant relationship with the values. A significant relationship can be determined by statistical tests or by visual inspection of the autocorrelation plot.
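The points of a lag plot, and the strength of the relationship they show, can be computed directly with pandas. The sketch below uses a synthetic sine wave with a period of 12 steps; the helper name `lag_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic seasonal series with a period of 12 time steps.
t = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * t / 12))

def lag_pairs(s, lag):
    """Pairs (value `lag` steps earlier, current value) as plotted in a lag plot."""
    return pd.DataFrame({"lagged": s.shift(lag), "actual": s}).dropna()

# Correlation between the series and its lagged copy for a few lags.
lag_corr = {}
for lag in (1, 6, 12):
    pairs = lag_pairs(series, lag)
    lag_corr[lag] = pairs["lagged"].corr(pairs["actual"])

print({lag: round(value, 3) for lag, value in lag_corr.items()})
```

At lag 12 (one full period) the points fall on a rising diagonal, at lag 6 (half a period) on a falling diagonal, and at lag 1 they form an elongated ellipse, which matches the interpretation steps listed above.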

Lagged scatter plot code
import itertools
import numpy as np
from plotly.express.colors import sample_colorscale
from plotly.subplots import make_subplots

df = target_data.iloc[-20:]
x = np.linspace(0, 1, len(df.columns))
c = sample_colorscale('rainbow', list(x), colortype='rgb')
locs = [i for i in itertools.product(range(1,4), repeat=2)]

fig = make_subplots(3, 3, shared_xaxes=True, shared_yaxes=True, 
                    subplot_titles=[f"Lag {i}" for i in range(1,10)], 
                    vertical_spacing=0.05, horizontal_spacing=0.05,)

for i, column in enumerate(df.columns.to_list()):
    series = df[column]  
    for lag in range(1, 10):
        lag_series = series.shift(lag)
        fig.add_scatter(x=lag_series.values[lag:], y=series.values[lag:], 
                        mode='markers', marker_color=c[i], 
                        legendgroup=column, 
                        name=column, 
                        marker_size=6, row=locs[lag-1][0], col=locs[lag-1][1],
                        showlegend=True if lag==1 else False)
fig.update_layout(
    title="Lag plot",
    width=800,
    height=800,
    font=dict(family="Gravitas One", size=12, color="black"),
)

fig.update_yaxes(title="Actual", row=2, col=1)
fig.update_xaxes(title="Shifted", row=3, col=2)
fig.show()

Seasonal decomposition plot

Seasonal decomposition is a statistical technique used to separate a time-series data into its components: trend, seasonality, and residuals. The goal of seasonal decomposition is to isolate the trend and seasonality of a time-series data so that they can be modeled and forecasted separately.

A seasonal decomposition plot is a visual representation of the results of the seasonal decomposition of a time series. It typically shows the original time series, the estimated trend, the estimated seasonality, and the residuals. The example in this section uses the classical moving-average decomposition from statsmodels (seasonal_decompose); STL (Seasonal and Trend decomposition using Loess) is a related but different method.

Here are some key features of a seasonal decomposition plot:

By analyzing the trend, seasonality, and residuals, a seasonal decomposition plot can provide valuable information about the structure of the time-series data, which can be used to make informed decisions about how to model and forecast the data.
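The idea behind an additive decomposition can be sketched in a few lines of pandas. This is a simplified illustration, not the statsmodels implementation: the helper `additive_decompose` is hypothetical, and it uses a plain centered moving average, which is symmetric only for odd periods (for an even period such as 20, library implementations typically apply a weighted filter instead).

```python
import numpy as np
import pandas as pd

def additive_decompose(series, period):
    """Classical additive decomposition: observed = trend + seasonal + resid."""
    # trend: centered moving average over one full period
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # seasonal: average detrended value at each position within the period
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.to_numpy()[position], index=series.index)
    # residual: whatever trend and seasonality do not explain
    resid = series - trend - seasonal
    return pd.DataFrame(
        {"observed": series, "trend": trend, "seasonal": seasonal, "resid": resid}
    )

# Synthetic series: linear trend + repeating 5-step pattern + noise.
rng = np.random.default_rng(1)
t = np.arange(200)
pattern = np.array([0.3, -0.1, -0.4, 0.0, 0.2])
observed = pd.Series(
    0.01 * t + pattern[t % 5] + rng.normal(scale=0.02, size=200),
    index=pd.bdate_range("2022-01-03", periods=200),
)

parts = additive_decompose(observed, period=5)
print(parts.dropna().head())
```

The first and last few rows of the trend and residual are empty because the centered window does not fit there; the recovered seasonal component should be close to the pattern that was used to generate the data.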

Seasonal decomposition plot code
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from statsmodels.tsa.seasonal import seasonal_decompose

# decompose every asset separately; the result maps ticker -> DecomposeResult
# (a dict comprehension stands in for the undefined apply_to_dataframe helper)
df_tsa = {
    column: seasonal_decompose(target_data[column], model="additive", period=20)
    for column in target_data.columns
}
results = ["observed", "trend", "resid", "seasonal"]

fig = make_subplots(
    4,
    1,
    horizontal_spacing=0.0,
    shared_xaxes=True,
    shared_yaxes=False,
    subplot_titles=(
        "Observed",
        "Trend",
        "Residuals",
        "Seasonal",
    ),
)

for idx, attr in zip(list(range(1, 5)), results):
    data = pd.concat(
        [
            getattr(df_tsa[column], attr)
            for column in target_data.columns.to_list()[::-1]
        ],
        axis=1,
    )
    data.columns = target_data.columns.to_list()[::-1]
    fig.add_trace(
        go.Heatmap(
            z=data.T.values,  # rows = tickers, columns = dates
            x=data.index,
            y=data.columns,
            name=attr,
            coloraxis="coloraxis",
        ),
        row=idx,
        col=1,
    )

fig.update_layout(
    title="Seasonal decomposition",
    showlegend=True,
    width=800,
    height=800,
    coloraxis=dict(colorscale="Plasma", colorbar_x=1.02, colorbar_thickness=20),
    margin=dict(l=50, r=0, t=50, b=50),
    font=dict(family="Gravitas One", size=12, color="black"),
    yaxis1=dict(tickfont=dict(family="Helvetica", size=8, color="black")),
    yaxis2=dict(tickfont=dict(family="Helvetica", size=8, color="black")),
    yaxis3=dict(tickfont=dict(family="Helvetica", size=8, color="black")),
    yaxis4=dict(tickfont=dict(family="Helvetica", size=8, color="black")),
)

fig.show()

Spatial analysis

The goal of spatial analysis is to understand the relationships and patterns between the variables in a dataset.

Correlation plot

A correlation plot is a type of plot used to visualize the relationship between pairs of variables. It is often used to assess the strength and direction of the relationship between the variables. The goal of a correlation plot is to determine whether there is a relationship between the variables and, if so, to characterize the nature of that relationship.

The strength of the relationship between two variables can be assessed by the correlation coefficient, which measures the strength and direction of their linear relationship. A correlation coefficient of +1 indicates a perfect positive linear relationship, meaning that as one variable increases, the other variable also increases. A correlation coefficient of -1 indicates a perfect negative linear relationship, meaning that as one variable increases, the other variable decreases. A correlation coefficient of 0 indicates that there is no linear relationship between the variables, although a non-linear relationship may still exist.
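The three reference values of the correlation coefficient can be illustrated on a toy table. This sketch assumes the Pearson coefficient, which is what pandas computes by default in `DataFrame.corr`; the helper `pearson` and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

base = np.linspace(-1.0, 1.0, 201)
demo = pd.DataFrame(
    {
        "A": base,
        "B": 2.0 * base + 1.0,   # perfect positive linear relationship with A
        "C": -0.5 * base,        # perfect negative linear relationship with A
        "D": base**2,            # fully determined by A, but not linearly
    }
)

corr = demo.corr()
print(corr.round(3))
```

Column D shows why a coefficient near zero does not rule out a relationship: D is a deterministic function of A, yet their linear correlation vanishes.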

Correlation plot code
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

df = target_data.copy()
corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

data = go.Heatmap(
        z=corr.mask(mask),
        x=corr.columns,
        y=corr.columns,
        colorscale=px.colors.diverging.RdBu,
        zmin=-1,
        zmax=1,
)

layout = go.Layout(
    title_text='Asset Correlation Matrix', 
    title_x=0.5, 
    width=600, 
    height=600,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
      yaxis_autorange='reversed'
)

fig=go.Figure(data=[data], layout=layout)
fig.update_layout(
   yaxis = dict(
      tickfont=dict(family='Helvetica', size=6, color='black')
      ),
   xaxis = dict(
      tickfont=dict(family='Helvetica', size=6, color='black')
      )
)
fig.show()

PPscore plot

The Predictive Power Score (PPS) is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two variables. The score ranges from 0 (no predictive power) to 1 (perfect predictive power) and can be used as an alternative to the correlation matrix. It is calculated by using one variable to predict the target variable. In Python, the PPS can be calculated with the ppscore library.

PPscore plot code
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import ppscore as pps
matrix_df = pps.matrix(target_data.dropna())[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')

mask = np.triu(np.ones_like(matrix_df, dtype=bool))
data = go.Heatmap(
        z=matrix_df.mask(mask),
        x=matrix_df.columns,
        y=matrix_df.columns,
        colorscale=px.colors.sequential.Blues,
        zmin=0,
        zmax=1,
)

layout = go.Layout(
    title_text='Asset PPscore Matrix', 
    title_x=0.5, 
    width=600, 
    height=600,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
      yaxis_autorange='reversed'
)

fig=go.Figure(data=[data], layout=layout)
fig.update_layout(
   yaxis = dict(
      tickfont=dict(family='Helvetica', size=6, color='black')
      ),
   xaxis = dict(
      tickfont=dict(family='Helvetica', size=6, color='black')
      )
)
fig.show()

Further reading

These are some of the most popular figures for multivariate visualization, but other types of figures can also be used depending on the specific data and analysis needs.
