## Getting started with Time Series Forecasting with Prophet

Facebook Prophet is the simplest way to get started with time series analysis in Python. Please refer to this post to learn how to install Prophet on Ubuntu.

We will have a look at official example with one complete program.

```python
################################################################################################
#   name:   01_fbprophet_getting_started.py
#   desc:   Official tutorial
#   date:   NA
#   Author: NA
################################################################################################
import pandas as pd
from fbprophet import Prophet

print('*** Program Started ***')

### Importing data as pandas dataframe
df = pd.read_csv('example_wp_log_peyton_manning.csv')
print(df.head())

### Creating a new Prophet object and fitting the input dataframe. Fitting should take 1-5 seconds
m = Prophet()
m.fit(df)
print("type of m", type(m))

### Extending the dataframe to future dates
future = m.make_future_dataframe(periods=365)
print("type of future", type(future))

### The predict method will assign each row in future a predicted value, which it names yhat
forecast = m.predict(future)
print("type of forecast", type(forecast))
# print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())

### Plotting forecast
fig1 = m.plot(forecast)
# fig1.show()
fig1.savefig('01_fbprophet_getting_started-01.png')

### Plotting forecast components
fig2 = m.plot_components(forecast)
fig2.savefig('01_fbprophet_getting_started-02.png')

### Saving output csv
forecast.to_csv('example_wp_log_peyton_manning_output.csv', sep=',')
print('*** Program Completed ***')
```

Here is the output on terminal

```
$ python3.6 01_fbprophet_getting_started.py
*** Program Started ***
           ds         y
0  2007-12-10  9.590761
1  2007-12-11  8.519590
2  2007-12-12  8.183677
3  2007-12-13  8.072467
4  2007-12-14  7.893572
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Initial log joint probability = -19.4685
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
  99       7975.37    0.00149529       224.247           1           1      128
 186       7992.27   5.72063e-05       157.088   5.678e-07       0.001      261  LS failed, Hessian reset
 199       7993.26   0.000312701       314.644      0.1004           1      277
 299       7997.05     0.0015387       170.701           1           1      408
 321       7998.61    0.00020668       308.573    1.22e-06       0.001      478  LS failed, Hessian reset
 369       8000.52   2.98767e-05       97.9518   2.746e-07       0.001      566  LS failed, Hessian reset
 399       8000.98   0.000153501       134.602      0.7945      0.7945      601
 457       8001.99   0.000276407       292.083   2.159e-06       0.001      718  LS failed, Hessian reset
 499       8002.58   0.000699641       197.602           1           1      770
 550       8003.07   5.79234e-05       181.032   3.403e-07       0.001      874  LS failed, Hessian reset
 599       8003.43   0.000218596       78.2273      0.7213      0.7213      928
 695       8004.08   3.66526e-05        116.76   2.994e-07       0.001     1095  LS failed, Hessian reset
 699       8004.11   0.000537041       111.615           1           1     1099
 788        8004.7   3.21305e-06       76.4964   4.987e-08       0.001     1259  LS failed, Hessian reset
 797        8004.7    6.1457e-07       61.1166      0.6741      0.6741     1270
Optimization terminated normally:
  Convergence detected: relative gradient magnitude is below tolerance
type of m <class 'fbprophet.forecaster.Prophet'>
type of future <class 'pandas.core.frame.DataFrame'>
type of forecast <class 'pandas.core.frame.DataFrame'>
*** Program Completed ***
```

Let us try to run the same program using the column names time and value instead of ds and y. When I tried this, I got the following error.

```
$ python3.6 01_fbprophet_getting_started.py
*** Program Started ***
         time     value
0  2007-12-10  9.590761
1  2007-12-11  8.519590
2  2007-12-12  8.183677
3  2007-12-13  8.072467
4  2007-12-14  7.893572
Traceback (most recent call last):
  File "01_fbprophet_getting_started.py", line 19, in <module>
    m.fit(df)
  File "/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py", line 1016, in fit
    "Dataframe must have columns 'ds' and 'y' with the dates and "
ValueError: Dataframe must have columns 'ds' and 'y' with the dates and values respectively.
```
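One way around this error (a minimal sketch using only pandas; the time and value column names are taken from the failing run above, and the values here are just the first few sample rows) is to rename the columns back to ds and y before calling fit:

```python
import pandas as pd

# Hypothetical frame whose columns are named 'time' and 'value'
df = pd.DataFrame({
    "time": ["2007-12-10", "2007-12-11", "2007-12-12"],
    "value": [9.590761, 8.519590, 8.183677],
})

# Prophet requires the columns to be literally 'ds' and 'y',
# so rename them before calling m.fit(df)
df = df.rename(columns={"time": "ds", "value": "y"})
print(df.columns.tolist())
```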

### Daily Seasonality

You might have observed the following message in the output:

```
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
```

To get rid of this message, pass daily_seasonality=True to the Prophet object. It will look like below.

```python
m = Prophet(daily_seasonality=True)
```

## How to Install the Facebook Prophet Library on Ubuntu

Prophet is a forecasting library developed and open sourced by Facebook. Following is the simple command to install it.

```
$ sudo python3.6 -m pip install fbprophet
Collecting fbprophet
  Downloading https://files.pythonhosted.org/packages/9b/a1/fef4ce00acbc28e75c0d33f60c9777527c4295656903b00ac4c9525cef7f/fbprophet-0.4.post2.tar.gz (45kB)
    100% |████████████████████████████████| 51kB 354kB/s
Collecting Cython>=0.22 (from fbprophet)
  Downloading https://files.pythonhosted.org/packages/e1/fd/711507fa396064bf716493861d6955af45369d2c470548e34af20b79d4d4/Cython-0.29.6-cp36-cp36m-manylinux1_x86_64.whl (2.1MB)
    100% |████████████████████████████████| 2.1MB 316kB/s
Collecting pystan>=2.14 (from fbprophet)
  Downloading https://files.pythonhosted.org/packages/17/77/dd86797a7e7fccca117233c6d50cc171e0c2b2f5a0cd2a8d9753ee09b7be/pystan-2.18.1.0-cp36-cp36m-manylinux1_x86_64.whl (50.0MB)
    100% |████████████████████████████████| 50.0MB 312kB/s
Requirement already satisfied: numpy>=1.10.0 in /usr/local/lib/python3.6/site-packages (from fbprophet) (1.14.0)
Requirement already satisfied: pandas>=0.20.1 in /usr/local/lib/python3.6/site-packages (from fbprophet) (0.22.0)
Requirement already satisfied: matplotlib>=2.0.0 in /usr/local/lib/python3.6/site-packages (from fbprophet) (2.1.2)
Collecting lunardate>=0.1.5 (from fbprophet)
  Downloading https://files.pythonhosted.org/packages/4e/7e/377a3cbba646ec0cf79433ef858881d809a3b87eb887b0901cb83c66a758/lunardate-0.2.0-py3-none-any.whl
Collecting convertdate>=2.1.2 (from fbprophet)
  Downloading https://files.pythonhosted.org/packages/74/83/d0fa07078f4d4ae473a89d7d521aafc66d82641ea0af0ef04a47052e8f17/convertdate-2.1.3-py2.py3-none-any.whl
Collecting holidays>=0.9.5 (from fbprophet)
  Downloading https://files.pythonhosted.org/packages/16/09/c882bee98acfa310933b654697405260ec7657c78430a14e785ef0f1314b/holidays-0.9.10.tar.gz (73kB)
    100% |████████████████████████████████| 81kB 370kB/s
Collecting setuptools-git>=1.2 (from fbprophet)
  Downloading https://files.pythonhosted.org/packages/05/97/dd99fa9c0d9627a7b3c103a00f1566d8193aca8d473884ed258cca82b06f/setuptools_git-1.2-py2.py3-none-any.whl
Requirement already satisfied: python-dateutil>=2 in /usr/local/lib/python3.6/site-packages (from pandas>=0.20.1->fbprophet) (2.6.1)
Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.6/site-packages (from pandas>=0.20.1->fbprophet) (2017.3)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/site-packages (from matplotlib>=2.0.0->fbprophet) (0.10.0)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/site-packages (from matplotlib>=2.0.0->fbprophet) (1.11.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/site-packages (from matplotlib>=2.0.0->fbprophet) (2.2.0)
Collecting ephem<3.8,>=3.7.5.3 (from convertdate>=2.1.2->fbprophet)
  Downloading https://files.pythonhosted.org/packages/c3/2c/9e1a815add6c222a0d4bf7c644e095471a934a39bc90c201f9550a8f7f14/ephem-3.7.6.0.tar.gz (739kB)
    100% |████████████████████████████████| 747kB 246kB/s
Installing collected packages: Cython, pystan, lunardate, ephem, convertdate, holidays, setuptools-git, fbprophet
  Running setup.py install for ephem ... done
  Running setup.py install for holidays ... done
  Running setup.py install for fbprophet ... done
Successfully installed Cython-0.29.6 convertdate-2.1.3 ephem-3.7.6.0 fbprophet-0.4.post2 holidays-0.9.10 lunardate-0.2.0 pystan-2.18.1.0 setuptools-git-1.2
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
```

It takes some time considering the number of dependencies. You can verify the installation by starting a python3.6 console, importing the module, and checking its version.

```
$ python3.6
Python 3.6.4 (default, Jan 13 2018, 12:02:51)
[GCC 5.4.0 20160609] on linux
>>> import fbprophet
>>> fbprophet.__version__
'0.4'
>>>
```

Please note that while using Facebook Prophet, the correct way to import the library is as below:

```python
from fbprophet import Prophet
```

## What is correlation and how to find correlation using python

When two sets of data are strongly linked together, we say they have a high correlation.

• Correlation is positive when the values increase together.
• Correlation is negative when one value decreases as the other increases.

In common usage, correlation most often refers to how close two variables are to having a linear relationship with each other. Here are sample values and shapes for correlation:

#### Pearson’s correlation coefficient

This is the most commonly used correlation coefficient.

The population correlation coefficient ρX,Y between two random variables X and Y with expected values μX and μY and standard deviations σX and σY is defined as

${\rho}_{X,Y} = \frac{\mathrm{cov}(X,Y)}{{\sigma}_{X}{\sigma}_{Y}} = \frac{E[(X-{\mu}_{X})(Y-{\mu}_{Y})]}{{\sigma}_{X}{\sigma}_{Y}}$
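This definition can be checked numerically. The sketch below (using the same illustrative a/b sample values as the code later in this section) computes the covariance-over-standard-deviations form by hand and compares it with numpy's built-in np.corrcoef:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([9, 9, 10, 8, 9, 10], dtype=float)

# Population definition: rho = cov(X, Y) / (sigma_X * sigma_Y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

# Should agree with numpy's built-in correlation matrix
rho_np = np.corrcoef(x, y)[0, 1]
print(rho, rho_np)
```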

#### Pearson’s correlation coefficient using Python

When calculated using scipy (stats.pearsonr), it returns Pearson's correlation coefficient and the 2-tailed p-value.

When calculated using numpy (np.corrcoef), it returns the correlation coefficient matrix of the variables.

#### Spearman’s rank correlation coefficient

Spearman’s rank correlation coefficient, or Spearman’s rho, is named after Charles Spearman and is often denoted by the Greek letter ρ (rho). The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables.
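Since Spearman's rho is just Pearson's coefficient applied to ranks, it can be verified directly (a sketch using scipy's rankdata, pearsonr and spearmanr, with the same illustrative values as the code below):

```python
from scipy import stats

a = [10, 20, 30, 40, 50, 60]
b = [9, 9, 10, 8, 9, 10]

# Spearman's rho = Pearson correlation of the rank-transformed data
rank_a = stats.rankdata(a)
rank_b = stats.rankdata(b)
rho_manual = stats.pearsonr(rank_a, rank_b)[0]

rho_scipy = stats.spearmanr(a, b)[0]
print(rho_manual, rho_scipy)
```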

#### Kendall rank correlation coefficient

The Kendall rank correlation coefficient, commonly referred to as Kendall’s tau coefficient (after the Greek letter τ), is a statistic used to measure the ordinal association between two measured quantities. A tau test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.
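For tie-free data, tau reduces to (concordant pairs − discordant pairs) divided by the total number of pairs. A quick sketch with hypothetical tie-free rankings, checked against scipy:

```python
from itertools import combinations
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7]
y = [3, 1, 4, 2, 7, 5, 6]  # hypothetical tie-free ranking

# tau = (concordant - discordant) / total pairs (valid when there are no ties)
concordant = discordant = 0
for (i, j) in combinations(range(len(x)), 2):
    s = (x[i] - x[j]) * (y[i] - y[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1
n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs

tau_scipy = stats.kendalltau(x, y)[0]
print(tau_manual, tau_scipy)
```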

#### Python code for calculating Pearson’s, Spearman’s and Kendall’s coefficients

```python
################################################################################################
#   name:   correlation_coefficient_01.py
#   desc:   correlation coefficient
#   date:   2018-12-22
#   Author: conquistadorjd
################################################################################################
import numpy as np
from scipy import stats

a = [10, 20, 30, 40, 50, 60]
b = [9, 9, 10, 8, 9, 10]

# Using scipy to calculate Pearson's coefficient
pearsonr_val = stats.pearsonr(a, b)
print('pearsonr_val : ', pearsonr_val)
# pearsonr_val :  (0.2130214807490179, 0.6853010393640564)

# Using numpy
corrcoef_val = np.corrcoef(a, b)
print('corrcoef_val : ', corrcoef_val)
# corrcoef_val :  [[1.         0.21302148]
#                  [0.21302148 1.        ]]

# Using scipy to calculate Spearman's rho
spearmanr_val = stats.spearmanr(a, b)
print('spearmanr_val : ', spearmanr_val)
# spearmanr_val :  SpearmanrResult(correlation=0.24688535993934707, pvalue=0.6371960853462737)

# Using scipy to calculate Kendall's tau
kendalltau_val = stats.kendalltau(a, b)
print('kendalltau_val : ', kendalltau_val)
# kendalltau_val :  KendalltauResult(correlation=0.2335496832484569, pvalue=0.5374525191136282)

print('*** Program ended ***')
```

Correlation can have a value:

• 1 is a perfect positive correlation
• 0 is no correlation (the values don’t seem linked at all)
• -1 is a perfect negative correlation

Important points to be noted:

• Correlation is not causation.
• Pearson’s coefficient works only if there is a linear relationship between the two variables.

## How to Find Mean, Median and Mode Using Python

Before calculating mean, median and mode, let us look at types of data and their characteristics. At a very high level, data can be classified as categorical or quantitative. Both can be further classified as below.

| | Difference | Order | Similar Interval | Meaningful Zero |
|---|---|---|---|---|
| Categorical: Nominal (Cities) | Yes | – | – | – |
| Categorical: Ordinal (Temp.) | Yes | Yes | – | – |
| Quantitative: Interval | Yes | Yes | Yes | – |
| Quantitative: Ratio | Yes | Yes | Yes | Yes |

Not all of these types of data have all characteristics:

| | Mode | Median | Mean |
|---|---|---|---|
| Nominal | Yes | – | – |
| Ordinal | Yes | – | – |
| Interval | Yes | Yes | Yes |
| Ratio | Yes | Yes | Yes |

Mean: Mean is nothing but the average. It can be calculated in plain Python or by using numpy.

Median: The middle value of the observations when ordered from low to high.

Mode: The most commonly occurring observation.

```python
################################################################################################
#   name:   descriptive_statistics_01.py
#   desc:   mean, median and mode
#   date:   2018-12-22
#   Author: conquistadorjd
################################################################################################
import numpy as np
from scipy import stats

input_data = input('Input elements separated by comma :')

# Convert input into a list of integers
input_list = list(map(int, input_data.split(',')))
print("input_list", input_list, type(input_list))

# Mean calculation using simple python
mean = sum(input_list) / len(input_list)
print('mean', mean)

# Mean calculation using numpy
mean = np.mean(input_list)
print('mean', mean)

# Median calculation using numpy
median = np.median(input_list)
print('median', median)

# Mode calculation using scipy
mode = stats.mode(input_list)
print('mode', mode)
```

## Resources for Learning Statistics

Various online resources (online courses, text books) are freely available on the internet.

There are so many online courses that you might get overwhelmed by the sheer numbers.

## Introduction of Set theory

A set is a collection of objects (called members or elements) that is regarded as a single object. To indicate that an object x is a member of a set A, one writes x ∊ A, while x ∉ A indicates that x is not a member of A.

• A ∪ B, read “A union B” or “the union of A and B”
• A ∩ B, read “A intersection B” or “the intersection of A and B”
• U is called the universal set
• A′ or U − A is called the complement of A
• Cartesian product: Let A and B be two sets. The Cartesian product of A and B, denoted by A × B, is the set of all ordered pairs (a, b), where a belongs to A and b belongs to B.

A × B = {(a, b) | a ∈ A ∧ b ∈ B}

• Two sets A and B are equal if every element in A is also in B and every element in B is also in A; symbolically, x ∊ A implies x ∊ B and vice versa.
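The Cartesian product definition above can be illustrated with a set comprehension (a sketch with small hypothetical sets A and B):

```python
from itertools import product

A = {1, 2}
B = {"x", "y"}

# A x B = {(a, b) | a in A and b in B}
cartesian = {(a, b) for a in A for b in B}
print(sorted(cartesian))

# itertools.product yields the same pairs
assert cartesian == set(product(A, B))
print(len(cartesian))  # |A x B| = |A| * |B| = 4
```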

## Series and Progression – Arithmetic, Geometric and Harmonic

Let us clarify a few terms first.

A sequence is a set of numbers written in a particular order.

A series is what we obtain from a sequence by adding all the terms together. Please note it is not the sequence of numbers, it is the sum of the numbers in the sequence.

A progression has a specific formula to calculate its nth term, whereas a sequence can be based on a logical rule like “a group of prime numbers”.

Here is an example of a sequence

${u}_{1},{u}_{2},{u}_{3,.....,}{u}_{n}$

Example of a series

${u}_{1}+{u}_{2}+{u}_{3}+.....+{u}_{n}$

### Arithmetic Progression

An arithmetic sequence is a sequence of numbers in which each term after the first is obtained by adding a constant (d) to the preceding term.

An arithmetic progression is given by the following formula

$a,a+d,a+2d,a+3d,.....$

where a = the first term, d = the common difference.

Some of the important formulae: the nth term is ${a}_{n}=a+(n-1)d$ and the sum of the first n terms is ${S}_{n}=\frac{n}{2}(2a+(n-1)d)$.
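The standard nth-term and sum formulae for an arithmetic progression can be verified numerically (a sketch with illustrative values a = 3, d = 4, n = 10):

```python
a, d, n = 3, 4, 10  # illustrative first term, common difference, length

seq = [a + k * d for k in range(n)]   # a, a+d, a+2d, ...
nth = a + (n - 1) * d                 # nth-term formula
s_n = n * (2 * a + (n - 1) * d) // 2  # sum formula S_n = n/2 * (2a + (n-1)d)

print(seq[-1], nth)   # both 39
print(sum(seq), s_n)  # both 210
```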

### Geometric Progression

A geometric progression, or GP, is a sequence where each new term after the first is obtained by multiplying the preceding term by a constant r, called the common ratio.

A geometric progression is given by

$a,ar,a{r}^{2},a{r}^{3},.....$

where a is the first term and r is the common ratio. The nth term is ${a}_{n}=a{r}^{n-1}$ and, for r ≠ 1, the sum of the first n terms is ${S}_{n}=\frac{a({r}^{n}-1)}{r-1}$.
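Similarly, the GP nth-term and sum formulae can be checked (a sketch with illustrative values a = 2, r = 3, n = 8):

```python
a, r, n = 2, 3, 8  # illustrative first term, common ratio, length

seq = [a * r**k for k in range(n)]  # a, ar, ar^2, ...
nth = a * r**(n - 1)                # nth-term formula
s_n = a * (r**n - 1) // (r - 1)     # sum formula, valid for r != 1

print(seq[-1], nth)   # both 4374
print(sum(seq), s_n)  # both 6560
```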

### Harmonic Progression

A harmonic progression is closely related to an arithmetic progression. Non-zero numbers

${a}_{1},{a}_{2},{a}_{3},...,{a}_{n}$

are in Harmonic Progression(HP) if

$\frac{1}{{a}_{1}},\frac{1}{{a}_{2}},\frac{1}{{a}_{3}},...,\frac{1}{{a}_{n}}$

are in Arithmetic progression
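This definition is easy to check in Python: take the reciprocals of a hypothetical HP and confirm they have a single common difference (a sketch using exact fractions to avoid floating-point noise):

```python
from fractions import Fraction

# A hypothetical HP: reciprocals of the AP 1, 2, 3, ..., 6
hp = [Fraction(1, k) for k in range(1, 7)]

# The reciprocals should form an arithmetic progression
recips = [1 / h for h in hp]
diffs = {recips[i + 1] - recips[i] for i in range(len(recips) - 1)}
print(diffs)  # a single common difference, so this is an HP
```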

### Python program to identify type of progression


```python
################################################################################################
#   name:   progression.py
#   desc:   identify type of progression
#   date:   2018-09-08
################################################################################################

def type_of_progression(input_sequence):

    # Arithmetic: differences between consecutive terms are constant
    delta = []
    current = 0
    while current < len(input_sequence) - 1:
        delta.append(input_sequence[current + 1] - input_sequence[current])
        current = current + 1
    if len(set(delta)) == 1:
        return "arithmetic"

    # Geometric: ratios between consecutive terms are constant
    delta = []
    current = 0
    while current < len(input_sequence) - 1:
        delta.append(input_sequence[current + 1] / input_sequence[current])
        current = current + 1
    if len(set(delta)) == 1:
        return "geometric"

    # Harmonic: differences between reciprocals of consecutive terms are constant
    delta = []
    current = 0
    while current < len(input_sequence) - 1:
        delta.append(1 / input_sequence[current + 1] - 1 / input_sequence[current])
        current = current + 1
    if len(set(delta)) == 1:
        return "harmonic"
    else:
        return "nothing"

print('*** Program Started ***')

input_sequence = input('Please input sequence separated by "," : ')
input_sequence = list(map(float, input_sequence.split(',')))

result = type_of_progression(input_sequence)
print("result :", result)

print('*** Program Ended ***')
```


## Resources for Learning Mathematics

Since I had decided to take a bottom-up approach to learning data science, I decided to start by refreshing and learning mathematics. I tried to search for some books to learn from, and found some really interesting and good MOOC courses and free ebooks (all legal).

MIT OpenCourseWare

eBooks

## What is Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.

Linear regression models are often fitted using the least squares approach.

If there appears to be no association between the proposed explanatory and dependent variables (i.e., the scatterplot does not indicate any increasing or decreasing trends), then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed data for the two variables.

There are many names for a regression’s dependent variable. It may be called an outcome variable, criterion variable, endogenous variable, or regressand. The independent variables can be called exogenous variables, predictor variables, or regressors.

### Linear Regression using Python

Following are the ways to do linear regression using Python:

1. statsmodels
2. scikit-learn
3. scipy

### Linear Regression using statsmodels

Here is sample code

```python
################################################################################################
#   name:       linear-regression-01-statsmodels.py
#   desc:       linear regression using statsmodels
#   date:       2018-07-14
#   Author:     conquistadorjd
#   reference:  http://www.statsmodels.org/dev/examples/notebooks/generated/ols.html
################################################################################################
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

print('*** Program started ***')

##################################### Testing different patterns
y1 = [101,102,103,104,105,106,107]
y2 = [101,100,99,98,97,96,95]
y3 = [101,102,101,102,101,102,101]
y4 = [101,103,105,107,109,111,115]
y5 = [101,103,102,105,102,107,105]
y6 = [1,2,3,4,5,6,7]
y = y5

x = np.arange(len(y))
x = x + 1
x1 = x                  # this is to preserve original x values to be used for plotting
x = sm.add_constant(x)  # This is needed as per statsmodels documentation

##################################### regression
model = sm.OLS(y, x)
results = model.fit()
# print(results.summary())
print('results.params : ', results.params)

# creating regression line
xx = x1
yy = results.params[0] + x1 * results.params[1]

plt.scatter(x1, y, s=None, marker='o', color='g', edgecolors='g', alpha=0.9, label="Jagur")
plt.plot(xx, yy)

# Saving image
plt.savefig('linear-regression-01-statsmodels.png')

# In case you dont want to save the image but just display it
plt.show()

print('*** Program ended ***')
```

and here is the output

### Linear Regression using scikit-learn

Here is the code

```python
################################################################################################
#   name:       linear-regression-02-scikit-learn.py
#   desc:       linear regression using scikit-learn
#   date:       2018-07-14
#   Author:     conquistadorjd
#   reference:  http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
################################################################################################
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from scipy import stats

print('*** Program started ***')

##################################### Testing different patterns
y1 = [101,102,103,104,105,106,107]
y2 = [101,100,99,98,97,96,95]
y3 = [101,102,101,102,101,102,101]
y4 = [101,103,105,107,109,111,115]
y5 = [101,103,102,105,102,107,105]
y6 = [1,2,3,4,5,6,7]
y = y5

x = np.arange(len(y))
x1 = np.arange(len(y))
x = x + 1                       # to ensure count is starting from 1
x = np.array(x).reshape(-1, 1)  # scikit-learn expects a 2D feature array

##################################### regression
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Coefficients: \n', regr)
m = regr.coef_[0]
b = regr.intercept_
print("slope=", m, "\nintercept=", b)

pc = stats.pearsonr(x1, y)
print(pc)

xx = x
yy = regr.predict(xx)

plt.scatter(x, y, s=None, marker='o', color='g', edgecolors='g', alpha=0.9, label="Jagur")
plt.plot(xx, yy)

# Saving image
plt.savefig('linear-regression-02-scikit-learn.png')

# In case you dont want to save the image but just display it
plt.show()

print('*** Program ended ***')
```

and output of this code is as below

### Linear Regression using scipy

Sample code

```python
################################################################################################
#   name:       linear-regression-03-scipy.py
#   desc:       linear regression using scipy
#   date:       2018-07-14
#   Author:     conquistadorjd
#   reference:  https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
################################################################################################
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

print('*** Program started ***')

##################################### Testing different patterns
y1 = [101,102,103,104,105,106,107]
y2 = [101,100,99,98,97,96,95]
y3 = [101,102,101,102,101,102,101]
y4 = [101,103,105,107,109,111,115]
y5 = [101,103,102,105,102,107,105]
y6 = [1,2,3,4,5,6,7]
y = y5

x = np.arange(len(y))
x1 = np.arange(len(y))
x = x + 1  # to ensure count is starting from 1

##################################### regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print('Coefficients: \n', slope, intercept, r_value, p_value, std_err)

pc = stats.pearsonr(x1, y)
print(pc)

plt.scatter(x, y, s=None, marker='o', color='g', edgecolors='g', alpha=0.9, label="Jagur")
plt.plot(x, intercept + slope * x, label='fitted line')

# Saving image
plt.savefig('linear-regression-03-scipy.png')

# In case you dont want to save the image but just display it
plt.show()

print('*** Program ended ***')
```

and output

If you look at the code, finding linear regression using scipy seems the shortest and easiest to understand.

## What is Correlation?

Correlation is used to indicate dependence or association: any statistical relationship, whether causal or not, between two random variables or bivariate data. It is a measure of the relationship between two mathematical variables or measured data values, and it includes the Pearson correlation coefficient as a special case. Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to how close two variables are to having a linear relationship with each other.

The strength of the linear association between two variables is quantified by the correlation coefficient.

The formula for the (sample) correlation coefficient is as below

$r = \frac{\sum ({x}_{i}-\bar{x})({y}_{i}-\bar{y})}{\sqrt{\sum ({x}_{i}-\bar{x})^{2}}\sqrt{\sum ({y}_{i}-\bar{y})^{2}}}$

• The correlation coefficient always takes a value between -1 and 1.
• A value of 1 or -1 indicates perfect correlation (all points lie along a straight line in this case).
• A correlation value close to 0 indicates no association between the variables. The closer the value of r is to 0, the greater the variation around the line of best fit.
• A positive correlation indicates a positive association between the variables (increasing values in one variable correspond to increasing values in the other variable),
• while a negative correlation indicates a negative association between the variables (increasing values in one variable correspond to decreasing values in the other variable).

The square of the correlation coefficient, r², is a useful value in linear regression. This value represents the fraction of the variation in one variable that may be explained by the other variable. Thus, if a correlation of 0.8 is observed between two variables (say, height and weight, for example), then a linear regression model attempting to explain either variable in terms of the other variable will account for 64% of the variability in the data.[1]

Since the least-squares regression line always passes through the means of x and y, the regression line may be entirely described by the means, standard deviations, and correlation of the two variables under investigation.
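Both facts (the fitted slope equals r times the ratio of standard deviations, the line passes through the means, and r² equals the explained-variance fraction) can be confirmed numerically; the sketch below uses the same illustrative y5 data as the regression examples above:

```python
import numpy as np

# Illustrative data (x fixed, y with some scatter)
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([101, 103, 102, 105, 102, 107, 105], dtype=float)

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)

# slope = r * (sy / sx), and the line passes through (mean x, mean y)
assert abs(slope - r * (y.std() / x.std())) < 1e-9
assert abs((intercept + slope * x.mean()) - y.mean()) < 1e-9

# r^2 equals the fraction of variance explained by the fit
y_hat = intercept + slope * x
ss_res = ((y - y_hat) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
print(r**2, 1 - ss_res / ss_tot)
```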

## Pearson correlation coefficient

Pearson’s correlation coefficient is a measure of the linear correlation between two variables X and Y. It has a value between +1 and −1.[2] It is obtained by dividing the covariance of the two variables by the product of their standard deviations.

Formula for the Pearson correlation coefficient:

${\rho}_{X,Y} = \frac{\mathrm{cov}(X,Y)}{{\sigma}_{X}{\sigma}_{Y}}$

## Rank correlation coefficients

### Spearman’s rank correlation coefficient

The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables.[3]

### Kendall rank correlation coefficient

The Kendall correlation between two variables will be high when observations have a similar (or identical, for a correlation of 1) rank (i.e. relative position of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different, for a correlation of −1) rank between the two variables.[4]

## Goodman and Kruskal’s gamma

Goodman and Kruskal’s gamma is a measure of rank correlation, i.e., the similarity of the orderings of the data when ranked by each of the quantities.[5]

You can find the original report here.

Now let us try to calculate these correlations using Python. You can find the code below.

```python
################################################################################################
#   name:    correlationexamples-00.py
#   desc:    Correlations
#   date:    2018-07-14
#   Author:  conquistadorjd
#   remark:  goodman_kruskal_gamma formula taken from
#            https://github.com/shilad/context-sensitive-sr/blob/master/SRSurvey/src/python/correlation.py
################################################################################################
from matplotlib import pyplot as plt
import numpy as np
from scipy import stats
from itertools import permutations

def goodman_kruskal_gamma(m, n):
    """
    compute the Goodman and Kruskal gamma rank correlation coefficient;
    this statistic ignores ties and is unsuitable when the number of ties
    in the data is high. it's also slow.
    >>> x = [2, 8, 5, 4, 2, 6, 1, 4, 5, 7, 4]
    >>> y = [3, 9, 4, 3, 1, 7, 2, 5, 6, 8, 3]
    >>> goodman_kruskal_gamma(x, y)
    0.9166666666666666
    """
    num = 0
    den = 0
    for (i, j) in permutations(range(len(m)), 2):
        m_dir = m[i] - m[j]
        n_dir = n[i] - n[j]
        sign = m_dir * n_dir
        if sign > 0:
            num += 1
            den += 1
        elif sign < 0:
            num -= 1
            den += 1
    return num / float(den)

print('*** Program Started ***')

y1 = [101,102,103,104,105,106,107]
y2 = [101,100,99,98,97,96,95]
y3 = [101,102,101,102,101,102,102]
y4 = [101,102,101,101,101,102,103]
x = np.arange(len(y1))

# One subplot per test pattern, titled with all four coefficients
for position, y in zip([221, 222, 223, 224], [y1, y2, y3, y4]):
    pc = stats.pearsonr(x, y)
    tau = stats.kendalltau(x, y)
    rho = stats.spearmanr(x, y)
    gamma = goodman_kruskal_gamma(x, y)
    plt.subplot(position)
    plt.scatter(x, y, s=None, marker='o', color='g', edgecolors='g', alpha=0.9, label="Jagur")
    plt.title('PC ' + "{:.3f}".format(pc[0]) + ' tau ' + "{:.3f}".format(tau[0])
              + ' rho ' + "{:.3f}".format(rho[0]) + ' gamma ' + "{:.3f}".format(gamma))

# Saving image
plt.savefig('correlationexamples-01.png')

# In case you dont want to save the image but just display it
plt.show()

print('*** Program ended ***')
```

The output is as below:

2. Pearson correlation coefficient. Wikipedia. https://en.wikipedia.org. Accessed July 14, 2018.
3. Spearman’s rank correlation coefficient. Wikipedia. https://en.wikipedia.org. Accessed July 14, 2018.
4. Kendall rank correlation coefficient. Wikipedia. https://en.wikipedia.org/. Accessed July 14, 2018.
5. Goodman and Kruskal’s gamma. Wikipedia. https://en.wikipedia.org/wiki/Goodman_and_Kruskal%27s_gamma. Accessed July 14, 2018.