Meher Béjaoui’s Blog

Multiple Linear Regression Analysis of Temperature Data in Albany and Sacramento

2023-12-29T00:00:00+01:00

Introduction
Exploratory data analysis: Understanding the data
Multiple linear regression

Introduction

The study of weather patterns has been of great interest to scientists, researchers and the general public for many years. In recent times, there has been a growing concern about the impacts of global warming and climate change on weather patterns and the environment. One area of particular interest is the relationship between temperature and other meteorological variables, such as humidity, precipitation, and wind speed. Understanding this relationship is important for characterizing climate patterns and predicting future weather. This study aims to investigate the relationship between temperature and the other meteorological variables in two cities in the United States, and to explore how this relationship behaves across a specific time range.

The dataset covers two prominent cities on opposite sides of the United States: Albany, the capital of New York State, and Sacramento, the capital of California. The data were collected at Albany International Airport and Sacramento Metropolitan Airport, respectively.

The data was extracted from the National Oceanic and Atmospheric Administration (NOAA) website on 5 May 2023 and is a subset of the Local Climatological Data (LCD) dataset. The original data dates back to the 1940s and 1970s for the two selected stations, but for the purpose of this report, we analyze measurements from January 1st, 2000, to December 31st, 2022.

The data was stored in two separate CSV files. Note that the daily humidity values in Sacramento are only available from January 1st, 2005. The data was parsed, cleaned and lightly pre-processed in Excel.
The units of measurement are as follows: temperature is expressed in degrees Celsius, wind speed in meters per second (m/s), precipitation in millimeters (mm), and humidity in percent (%).
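
The unit conversions done in Excel could also be written in Python. The sketch below assumes the raw NOAA export used US customary units (degrees Fahrenheit, miles per hour, inches), which this post does not state explicitly; the function names are hypothetical and are not part of the original notebook.

```python
# Hypothetical helpers; the original pre-processing was done in Excel.
# Assumption: raw values are in degrees Fahrenheit, miles per hour and inches.

def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def mph_to_ms(speed_mph: float) -> float:
    """Convert miles per hour to meters per second (1 mile = 1609.344 m)."""
    return speed_mph * 1609.344 / 3600.0


def inches_to_mm(depth_in: float) -> float:
    """Convert inches to millimeters (1 inch = 25.4 mm)."""
    return depth_in * 25.4


print(round(fahrenheit_to_celsius(51.0), 2))  # 10.56
print(round(mph_to_ms(15.3), 2))              # 6.84
print(round(inches_to_mm(1.88), 2))           # 47.75
```

The three printed values also appear in the Sacramento preview further down, which is consistent with the assumption about the raw units but does not prove it.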

The study was conducted using Python 3 in a Jupyter Notebook. To ensure clarity and reproducibility, the report includes the code and its corresponding outputs, so that readers can follow the methodology and reproduce the results.

Exploratory data analysis: Understanding the data

# Importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib.dates import MonthLocator, DateFormatter
import statsmodels.api as sm

# Reading the datasets
albany = pd.read_csv('albany.csv')
sacramento = pd.read_csv('sacramento.csv')

# Datetime formatting
sacramento['date'] = pd.to_datetime(sacramento['date'], format='%d/%m/%Y')
albany['date'] = pd.to_datetime(albany['date'], format='%d/%m/%Y')

The tail() method in the code below retrieves the last rows of the Sacramento DataFrame, which allows a quick check of the end of the DataFrame. We confirm the presence of 8334 rows and 7 columns, which include the date variable.

print (sacramento.tail(), "\n The size of the DataFrame is", sacramento.shape)

           date  averagetemperature  averagewindspeed  maximumtemperature  \
2022-12-27               10.56              6.84               12.78   
2022-12-28                6.11              1.48               10.00   
2022-12-29               10.00              3.98               12.22   
2022-12-30               13.89              8.18               16.67   
2022-12-31               13.33              7.82               15.56   

      minimumtemperature  precipitation  humidity  
              8.33          10.41     85.33  
              2.22           0.00     90.75  
              7.22          16.26     85.04  
             11.11           1.52     87.71  
             10.56          47.75     82.71   
 The size of the DataFrame is (8334, 7)

We proceed to do the same with the Albany DataFrame, and confirm the presence of 8375 rows and 7 columns.

print (albany.tail(), "\n The size of the DataFrame is", albany.shape)

           date  averagetemperature  averagewindspeed  maximumtemperature  \
2022-12-27               -2.22              2.59                1.11   
2022-12-28                0.00              3.04                5.00   
2022-12-29                1.67              2.59                8.89   
2022-12-30               10.00              3.53               13.89   
2022-12-31                9.44              3.80               11.67   

      minimumtemperature  precipitation  humidity  
             -5.56           0.00     56.88  
             -5.56           0.00     63.63  
             -5.56           0.00     61.33  
              6.11           0.00     52.79  
              6.67           2.03     81.17   
 The size of the DataFrame is (8375, 7)

In the following, the describe() method, particularly useful for exploratory data analysis, allows for a comprehensive understanding of the data’s summary statistics at a glance. These statistics provide insights into the central tendency, variability, and distribution of the data in each column of the DataFrame.

Sacramento

sacramento.describe()

	averagetemperature	averagewindspeed	maximumtemperature	minimumtemperature	precipitation	humidity
count	8334.000000	8334.000000	8334.000000	8334.000000	8334.000000	6572.000000
mean	16.758523	3.378572	24.091746	9.145821	1.173276	63.278203
std	6.681999	1.715114	8.657389	5.392560	4.735319	16.252343
min	0.560000	0.040000	3.890000	-11.110000	0.000000	16.000000
25%	11.110000	2.100000	16.670000	5.000000	0.000000	51.040000
50%	16.670000	3.170000	23.890000	9.440000	0.000000	62.130000
75%	22.780000	4.430000	31.670000	13.330000	0.000000	76.250000
max	35.560000	13.500000	53.890000	27.220000	104.650000	100.000000

In Sacramento, California, the mean daily average temperature over the study period is about 16.76 degrees Celsius, with a standard deviation of approximately 6.68. The temperature range is substantial, indicating potential fluctuations in weather conditions. Notably, the highest recorded temperature of 53.89 degrees Celsius occurred on March 19, 2018, while the lowest temperature of -11.11 degrees Celsius was recorded on February 11, 2004. The maximum in particular is implausibly high for March and may reflect a recording error that should be verified against the source data.
Likewise, the average daily precipitation in Sacramento is approximately 1.17 mm, with a standard deviation of about 4.74 mm. This aligns with the prevailing climate in this part of California, characterized by hot, arid summers and short, cold, wet winters with partly cloudy conditions. Precipitation was below the overall average of 1.17 mm on 7368 of the 8334 days in the study period, and the daily average temperature exceeded the overall mean of 16.76 degrees Celsius on 4028 days.
The descriptive measures of central tendency and variability correspond to the anticipated weather patterns in the area.

max_temp_index = sacramento['maximumtemperature'].idxmax()
date_max_temp = sacramento.loc[max_temp_index, 'date']
min_temp_index = sacramento['minimumtemperature'].idxmin()
date_min_temp = sacramento.loc[min_temp_index, 'date']
print("Date with the highest maximum temperature:", date_max_temp)
print("Date with the lowest minimum temperature:", date_min_temp)

Date with the highest maximum temperature: 2018-03-19 00:00:00
Date with the lowest minimum temperature: 2004-02-11 00:00:00

# Thresholds are the rounded overall means reported by describe()
count_below_threshold = sacramento[sacramento['precipitation'] < 1.17].shape[0]
print("Number of days with precipitations below the average precipitation amount of 1.17 mm is :", count_below_threshold)
count_exceeding_threshold = sacramento[sacramento['averagetemperature'] > 16.76].shape[0]
print("Number of days with daily average temperature exceeding the overall average temperature of 16.76 C is :", count_exceeding_threshold)

Number of days with precipitations below the average precipitation amount of 1.17 mm is : 7368
Number of days with daily average temperature exceeding the overall average temperature of 16.76 C is : 4028
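
Hard-coding the thresholds makes it easy to mix up values between the two cities. As a minimal sketch, the counts can instead be derived from the column mean directly. The helper name and the toy data are illustrative assumptions, and because the exact mean differs slightly from the rounded value, the counts on the real data could differ marginally from those printed above.

```python
import pandas as pd


def count_relative_to_mean(series: pd.Series) -> dict:
    """Count observations strictly below and strictly above the series mean.

    Missing values are ignored by mean() and never satisfy a comparison.
    """
    mean_value = float(series.mean())
    return {
        "mean": mean_value,
        "below": int((series < mean_value).sum()),
        "above": int((series > mean_value).sum()),
    }


# Toy data standing in for a precipitation column (not the real measurements)
toy_precipitation = pd.Series([0.0, 0.0, 0.0, 0.0, 10.0, None])
summary = count_relative_to_mean(toy_precipitation)
print(summary)  # {'mean': 2.0, 'below': 4, 'above': 1}
```

The toy series also shows why most days fall below the mean when a few wet days carry most of the total.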

columns_to_include = ['averagetemperature', 'averagewindspeed', 'precipitation', 'humidity']
subset_sacramento = sacramento[columns_to_include]
# pair plots between numerical columns
sns.pairplot(subset_sacramento)
plt.show()

The grid figure presented illustrates the pair plots, providing insights into the relationships, distributions, and interactions among the selected variables. These visualizations allow for a better understanding of the anticipated effects of each variable on the target temperature variable. To enhance the visual clarity of the plots, the minimum and maximum temperatures were excluded from the analysis.

An evident observation is the strong negative correlation between humidity and average temperature, with a limited number of noticeable outliers. In contrast, the relationship between precipitation and temperature does not exhibit a clear pattern, but rather displays more outlier values. Notably, low precipitation values are observed across various temperature ranges, whereas higher precipitation values tend to occur within a specific temperature range.
Furthermore, the influence of wind speed on temperature is not evident for lower wind speed values. However, as the wind speed increases, the temperature tends to stabilize within a specific temperature range. This finding suggests that wind speed may serve as a more reliable predictor of temperature within this specific range.

Overall, the pair plots provide valuable insights into the relationships and patterns among the variables, shedding light on their potential impact on the studied temperature variable.

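# numeric_only=True (pandas >= 1.5) keeps the non-numeric 'date' column out of the matrix
correlation_matrix = sacramento.corr(numeric_only=True)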
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Sacramento Correlation Matrix Heatmap")
plt.show()

The heatmap figure provides a visual representation of the correlation coefficients between pairs of variables. Each cell in the heatmap represents the correlation coefficient, with color gradients indicating the magnitude and direction of the correlation. Warmer colors, such as shades of red, indicate a positive correlation, while cooler colors, such as shades of blue, indicate a negative correlation.
The heatmap reinforces the observations made from the pair plots mentioned earlier. It visually confirms the relationships and patterns observed between variables.
Furthermore, in building our model, we decided to include only the minimum temperature variable and to exclude the maximum temperature. Both carry largely redundant information about the average temperature, so keeping both could introduce multicollinearity. By excluding the maximum temperature, we aim to prevent redundant or collinear effects in our analysis.
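
One common way to quantify this kind of redundancy is the variance inflation factor (VIF), which was not computed in the original notebook. The following is a minimal NumPy sketch on synthetic data; the column names and the simulated relationship between minimum and maximum temperature are assumptions made purely for illustration.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic stand-ins, not the real measurements
rng = np.random.default_rng(0)
t_min = rng.normal(10, 5, size=500)
t_max = t_min + 12 + rng.normal(0, 1, size=500)  # nearly collinear with t_min
wind = rng.normal(3, 1.5, size=500)              # unrelated to both

vifs = variance_inflation_factors(np.column_stack([t_min, t_max, wind]))
print({name: round(float(v), 1) for name, v in zip(["t_min", "t_max", "wind"], vifs)})
```

In this toy setup the two temperature columns receive a large VIF while wind stays close to 1, which is the pattern that motivates dropping one of the two temperature variables.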

Albany

albany.describe()

	averagetemperature	averagewindspeed	maximumtemperature	minimumtemperature	precipitation	humidity
count	8330.000000	8330.000000	8330.000000	8330.000000	8331.000000	8375.000000
mean	9.916845	3.286505	15.062441	4.494155	2.862368	68.329931
std	10.551006	1.709147	11.239033	10.239078	7.307641	12.755930
min	-21.670000	0.000000	-17.780000	-26.670000	0.000000	24.710000
25%	1.670000	2.010000	5.560000	-2.780000	0.000000	59.960000
50%	10.560000	3.080000	16.110000	4.440000	0.000000	68.250000
75%	19.440000	4.290000	25.000000	13.330000	1.780000	77.500000
max	31.110000	10.590000	37.220000	25.000000	119.130000	100.000000

In Albany, New York, the mean daily average temperature over the study period is about 9.92 degrees Celsius, with a standard deviation of approximately 10.55. The temperature range is about as wide as in Sacramento (roughly 64 to 65 degrees Celsius between the lowest and highest values in both cities), indicating potential fluctuations in weather conditions. Notably, the highest recorded temperature of 37.22 degrees Celsius occurred on July 21, 2011, while the lowest temperature of -26.67 degrees Celsius was recorded on January 24, 2005. Likewise, the average daily precipitation in Albany is approximately 2.86 mm, with a standard deviation of about 7.31 mm.
This aligns with the prevailing climate in New York, a humid continental climate with warm to hot summers and freezing, snowy winters. Precipitation was below the overall average of 2.86 mm on 6603 of the 8331 days with a precipitation record, and the daily average temperature exceeded the overall mean of 9.92 degrees Celsius on 4318 days. The descriptive measures of central tendency and variability correspond to the anticipated weather patterns in the area.

These findings suggest that, in comparison, precipitation in Sacramento was concentrated in fewer days with large amounts, whereas precipitation in Albany was spread more evenly over more days.

max_temp_index2 = albany['maximumtemperature'].idxmax()
date_max_temp2 = albany.loc[max_temp_index2, 'date']
min_temp_index2 = albany['minimumtemperature'].idxmin()
date_min_temp2 = albany.loc[min_temp_index2, 'date']
print("Date with the highest maximum temperature:", date_max_temp2)
print("Date with the lowest minimum temperature:", date_min_temp2)

Date with the highest maximum temperature: 2011-07-21 00:00:00
Date with the lowest minimum temperature: 2005-01-24 00:00:00

count_below_threshold = albany[albany['precipitation'] < 2.86].shape[0]
print("Number of days with precipitations below the average precipitation amount of 2.86 mm is :", count_below_threshold)
count_exceeding_threshold = albany[albany['averagetemperature'] > 9.92].shape[0]
print("Number of days with daily average temperature exceeding the overall average temperature of 9.92 C is :", count_exceeding_threshold)

Number of days with precipitations below the average precipitation amount of 2.86 mm is : 6603
Number of days with daily average temperature exceeding the overall average temperature of 9.92 C is : 4318

subset_albany = albany[columns_to_include]
# pair plots between numerical columns
sns.pairplot(subset_albany)
plt.show()

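# numeric_only=True (pandas >= 1.5) keeps the non-numeric 'date' column out of the matrix
correlation_matrix2 = albany.corr(numeric_only=True)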
sns.heatmap(correlation_matrix2, annot=True, cmap="coolwarm")
plt.title("Albany Correlation Matrix Heatmap")
plt.show()

The pair plots and heatmap figures for Albany, New York, reveal distinct patterns and relationships among the variables. A notable observation is the prevalence of low correlation coefficients in the heatmap, indicating weaker linear associations between the variables.
In particular, the maximum and minimum temperatures exhibit a strong positive correlation with the average temperature, suggesting a consistent relationship between these variables. However, for humidity and windspeed, there are no discernible patterns or clear correlations with the average temperature. Additionally, there are several outliers observed in the precipitation variable, indicating instances of extreme or unusual precipitation values.

The absence of clear patterns and the presence of non-linear relationships between humidity, precipitation, windspeed, and average temperature in Albany, New York, can be attributed to the specific nature of the local climate category. This climate category is described as humid continental, characterized by warm to hot summers and freezing cold snowy winters.
Within this climate category, the intricate interplay among various atmospheric conditions, such as prevailing winds and moisture sources, gives rise to diverse and dynamic weather patterns. The variability in temperature, humidity, precipitation, and windspeed is influenced by numerous factors and regional weather phenomena, including blizzards. Furthermore, the seasonal temperature extremes and the potential impact of the proximity to large bodies of water, such as the Atlantic Ocean, introduce additional variability and non-linear relationships among the variables under study.

Considering the distinct characteristics of the locality, it is reasonable to expect that the relationships may not conform to a simple linear pattern. The complex interplay of factors such as temperature inversions, air masses, and local topography contributes to the observed complexity in these relationships.

Multiple linear regression

Sacramento

sacramento.isnull().sum()

date                     0
averagetemperature       0
averagewindspeed         0
maximumtemperature       0
minimumtemperature       0
precipitation            0
humidity              1762
dtype: int64

Prior to constructing the linear regression model, we examine the dataset for missing values. We identify 1762 missing values, all in the humidity variable. Instead of discarding these rows, which represent about 21% of the dataset (1762 of 8334), we opted to impute the missing values using the mean.

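mean_humidity = sacramento['humidity'].mean()
# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
sacramento['humidity'] = sacramento['humidity'].fillna(mean_humidity)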
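
Mean imputation keeps the column mean unchanged but reduces its spread, which is worth keeping in mind when reading the humidity coefficient. The sketch below illustrates this on synthetic data; the distribution parameters and the 20% missing share are assumptions chosen only to resemble the situation described above.

```python
import numpy as np
import pandas as pd

# Synthetic humidity-like values, not the real measurements
rng = np.random.default_rng(42)
humidity = pd.Series(rng.normal(63, 16, size=1000))
humidity.iloc[:200] = np.nan  # pretend the first 20% were never recorded

imputed = humidity.fillna(humidity.mean())

print("mean before/after:", round(humidity.mean(), 2), round(imputed.mean(), 2))
print("std  before/after:", round(humidity.std(), 2), round(imputed.std(), 2))
```

The mean is identical before and after, while the standard deviation of the imputed series is smaller because 200 values now sit exactly at the mean.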

# Adding a constant column to the independent variables:
# in the context of using statsmodels for multiple linear regression, it represents the intercept term in the linear regression equation
X = sacramento[['averagewindspeed', 'minimumtemperature', 'precipitation', 'humidity']]
X = sm.add_constant(X)

# Defining the dependent variable
y = sacramento['averagetemperature']

# Fitting the OLS model
model = sm.OLS(y, X)
results = model.fit()

# Print the regression summary
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:     averagetemperature   R-squared:                       0.926
Model:                            OLS   Adj. R-squared:                  0.926
Method:                 Least Squares   F-statistic:                 2.601e+04
Date:                Sun, 18 Jun 2023   Prob (F-statistic):               0.00
Time:                        23:27:19   Log-Likelihood:                -16811.
No. Observations:                8334   AIC:                         3.363e+04
Df Residuals:                    8329   BIC:                         3.367e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 16.2143      0.133    121.559      0.000      15.953      16.476
averagewindspeed      -0.5312      0.013    -40.981      0.000      -0.557      -0.506
minimumtemperature     1.0532      0.004    245.554      0.000       1.045       1.062
precipitation         -0.0711      0.005    -15.089      0.000      -0.080      -0.062
humidity              -0.1139      0.002    -69.456      0.000      -0.117      -0.111
==============================================================================
Omnibus:                      663.002   Durbin-Watson:                   0.920
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3431.099
Skew:                          -0.189   Prob(JB):                         0.00
Kurtosis:                       6.121   Cond. No.                         439.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The OLS Regression Results provide insights into the linear regression model that was applied to the dataset. The R-squared value of 0.926 indicates that approximately 92.6% of the variability in the average temperature (dependent variable) can be accounted for by the independent variables included in the model. This suggests a robust relationship between the independent variables (averagewindspeed, minimumtemperature, precipitation, humidity) and the average temperature.
The F-statistic is 2.601e+04, with a remarkably low probability (p < 0.001). This implies that the overall regression model is statistically significant, indicating that at least one of the independent variables exhibits a significant association with the average temperature.
The coefficients represent the estimated impact of each independent variable on the average temperature while holding other variables constant. Here are the interpretations of the coefficients:

average windspeed: With each 1 m/s increase in average windspeed, the average temperature is estimated to decrease by approximately 0.5312 degrees Celsius.
minimum temperature: With each 1 degree Celsius increase in minimum temperature, the average temperature is estimated to increase by approximately 1.0532 degrees Celsius.
precipitation: With each 1 mm increase in precipitation, the average temperature is estimated to decrease by approximately 0.0711 degrees Celsius.
humidity: With each 1 percentage point increase in humidity, the average temperature is estimated to decrease by approximately 0.1139 degrees Celsius.

The p-values associated with each coefficient are very low (p < 0.001), indicating that all independent variables have a statistically significant relationship with the average temperature.
The 95% confidence intervals provide a range within which the true value of each coefficient is likely to fall. For example, the confidence interval for the average windspeed coefficient is (-0.557, -0.506), suggesting that the true effect of average windspeed on average temperature lies within this range with 95% confidence.
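
As a quick check, the interval for average windspeed can be reproduced by hand from the coefficient and its standard error. This sketch uses 1.96 as a normal approximation to the t quantile, which is an assumption that is reasonable with more than 8000 residual degrees of freedom, and it relies on the rounded standard error shown in the summary.

```python
coef = -0.5312    # averagewindspeed coefficient from the summary
std_err = 0.013   # its (rounded) standard error from the summary
z_975 = 1.96      # normal approximation to the 97.5% t quantile (assumption)

lower = coef - z_975 * std_err
upper = coef + z_975 * std_err
print(round(lower, 3), round(upper, 3))  # -0.557 -0.506
```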

Overall, the findings support the idea that the independent variables (averagewindspeed, minimumtemperature, precipitation, humidity) are significant predictors of the average temperature in the Sacramento dataset. The regression model has a high R-squared value, indicating a good fit, and the coefficients and p-values suggest that all independent variables have a meaningful impact on the average temperature.

The regression equation is:

Sacramento temperature = 16.2143 - 0.5312 × averagewindspeed + 1.0532 × minimumtemperature - 0.0711 × precipitation - 0.1139 × humidity
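
To make the equation concrete, the sketch below evaluates it for the last Sacramento row shown earlier (2022-12-31). The helper function is hypothetical and not part of the original notebook; the coefficients are copied from the summary.

```python
# Coefficients copied from the Sacramento OLS summary
SACRAMENTO_COEFS = {
    "const": 16.2143,
    "averagewindspeed": -0.5312,
    "minimumtemperature": 1.0532,
    "precipitation": -0.0711,
    "humidity": -0.1139,
}


def predict_average_temperature(coefs: dict, **features: float) -> float:
    """Evaluate the fitted linear equation for one day's feature values."""
    return coefs["const"] + sum(coefs[name] * value for name, value in features.items())


# Feature values of the last Sacramento row (2022-12-31)
predicted = predict_average_temperature(
    SACRAMENTO_COEFS,
    averagewindspeed=7.82,
    minimumtemperature=10.56,
    precipitation=47.75,
    humidity=82.71,
)
print(round(predicted, 2))  # 10.37, against an observed average of 13.33
```

The gap of about 3 degrees Celsius on this very wet day shows that a high R-squared does not rule out sizeable errors on individual days.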

# Obtain the predicted values
y_pred = results.predict(X)

# Calculate the residuals
residuals = y - y_pred

# Plot the Residuals vs. Fitted values
sns.residplot(x=y_pred, y=residuals, lowess=True, line_kws={'color': 'red'})
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()

The (Residuals vs. Fitted) plot shows that the residuals are fairly randomly scattered around the 0 residual line. Also, the residuals form a seemingly horizontal band around the residual = 0 line, which suggests that the variances of the error terms can be considered equal.
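
The visual impression of a horizontal band can be backed by a rough numerical check: sort the residuals by fitted value, split them into equal-sized groups and compare the standard deviation in each group. This is a minimal sketch on synthetic data, not a formal test and not part of the original notebook; the function name and the number of bins are assumptions.

```python
import numpy as np


def residual_spread_by_bin(fitted, residuals, n_bins=4):
    """Standard deviation of the residuals within equal-count bins of the fitted values."""
    fitted = np.asarray(fitted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    order = np.argsort(fitted)
    return [float(chunk.std(ddof=1)) for chunk in np.array_split(residuals[order], n_bins)]


# Synthetic examples, not the real residuals
rng = np.random.default_rng(3)
fitted = rng.uniform(0, 30, size=4000)
constant = rng.normal(0, 2, size=4000)                       # same spread everywhere
growing = rng.normal(0, 1, size=4000) * (0.2 + fitted / 10)  # spread grows with the fitted value

spreads_constant = residual_spread_by_bin(fitted, constant)
spreads_growing = residual_spread_by_bin(fitted, growing)
print([round(s, 2) for s in spreads_constant])
print([round(s, 2) for s in spreads_growing])
```

Similar values across bins support the equal-variance reading, while a steady increase or decrease would point to heteroscedasticity. On the real data, the same function could be called with y_pred and residuals.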

The graph shows a number of outliers that deviate significantly from the general pattern of the residuals and could be investigated further. The number of these individual points shouldn’t greatly impact the regression coefficients and overall model fit.

# Generate the normal QQ plot
sm.qqplot(residuals, line='s')
plt.title('Normal QQ Plot of Residuals')
plt.show()

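The (Normal Q-Q) plot shows whether the residuals are normally distributed. The relationship between the theoretical quantiles and the standardized residuals is approximately linear for most points, so the error terms are close to normal in the central part of the distribution. The tails depart from the line, however, which is consistent with the kurtosis of 6.121 and the Jarque-Bera result (Prob(JB) = 0.00) in the summary above, so normality holds only approximately. The presence of some outliers is confirmed and pronounced here as well.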

Albany

albany.isnull().sum()

date                   0
averagetemperature    45
averagewindspeed      45
maximumtemperature    45
minimumtemperature    45
precipitation         44
humidity               0
dtype: int64

Similar to Sacramento, we examine the dataset for missing values. Every variable except humidity has at most 45 missing values. We discard these rows, since they represent only a small portion of the dataset (about 0.5% of the 8375 rows).

albany.dropna(inplace=True)

X2 = albany[['averagewindspeed', 'minimumtemperature', 'precipitation', 'humidity']]
X2 = sm.add_constant(X2)

# Defining the dependent variable
y2 = albany['averagetemperature']

# Fitting the OLS model
model2 = sm.OLS(y2, X2)
results2 = model2.fit()

# Print the regression summary
print(results2.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:     averagetemperature   R-squared:                       0.972
Model:                            OLS   Adj. R-squared:                  0.972
Method:                 Least Squares   F-statistic:                 7.188e+04
Date:                Sun, 18 Jun 2023   Prob (F-statistic):               0.00
Time:                        23:27:28   Log-Likelihood:                -16575.
No. Observations:                8330   AIC:                         3.316e+04
Df Residuals:                    8325   BIC:                         3.320e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 12.0641      0.134     89.777      0.000      11.801      12.328
averagewindspeed      -0.3560      0.012    -29.811      0.000      -0.379      -0.333
minimumtemperature     1.0241      0.002    519.140      0.000       1.020       1.028
precipitation         -0.0116      0.003     -3.895      0.000      -0.018      -0.006
humidity              -0.0812      0.002    -45.519      0.000      -0.085      -0.078
==============================================================================
Omnibus:                      805.064   Durbin-Watson:                   1.529
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1700.446
Skew:                           0.618   Prob(JB):                         0.00
Kurtosis:                       4.836   Cond. No.                         484.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The R-squared value of 0.972 indicates that approximately 97.2% of the variability in the average temperature (dependent variable) can be accounted for by the independent variables included in the model. This suggests a strong relationship between the independent variables (averagewindspeed, minimumtemperature, precipitation, humidity) and the average temperature.
The F-statistic is 7.188e+04, with a remarkably low probability (p < 0.001). This implies that the overall regression model is highly statistically significant, indicating that at least one of the independent variables exhibits a significant association with the average temperature.
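
The F-statistic and the adjusted R-squared are both functions of R-squared and the degrees of freedom, so they can be approximately recovered from the summary. The sketch below uses the rounded R-squared of 0.972, which is why the result only roughly matches the reported 7.188e+04.

```python
r_squared = 0.972  # R-squared from the Albany summary (rounded to three decimals)
df_model = 4       # number of predictors
df_resid = 8325    # residual degrees of freedom

# F = (explained variance per predictor) / (unexplained variance per residual df)
f_statistic = (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Adjusted R-squared penalizes the number of predictors; n - 1 = df_resid + df_model
adj_r_squared = 1 - (1 - r_squared) * (df_resid + df_model) / df_resid

print(f"{f_statistic:.3e}", round(adj_r_squared, 3))  # 7.225e+04 0.972
```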

The coefficients represent the estimated impact of each independent variable on the average temperature while holding other variables constant. Here are the interpretations of the coefficients:

average windspeed: With each 1 m/s increase in average windspeed, the average temperature is estimated to decrease by approximately 0.3560 degrees Celsius.
minimum temperature: With each 1 degree Celsius increase in minimum temperature, the average temperature is estimated to increase by approximately 1.0241 degrees Celsius.
precipitation: With each 1 mm increase in precipitation, the average temperature is estimated to decrease by approximately 0.0116 degrees Celsius.
humidity: With each 1 percentage point increase in humidity, the average temperature is estimated to decrease by approximately 0.0812 degrees Celsius.

The p-values associated with each coefficient are very low (p < 0.001), indicating that all independent variables have a statistically significant relationship with the average temperature.

The 95% confidence intervals provide a range within which the true value of each coefficient is likely to fall. For example, the confidence interval for the average windspeed coefficient is (-0.379, -0.333), suggesting that the true effect of average windspeed on average temperature lies within this range with 95% confidence.

Overall, the findings support the idea that the independent variables (averagewindspeed, minimumtemperature, precipitation, humidity) are significant predictors of the average temperature in the Albany dataset. The fit is even tighter than for Sacramento (R-squared of 0.972 against 0.926), and the coefficients have the same signs in both cities.

The regression equation can be expressed as:

Albany temperature = 12.0641 - 0.3560 × averagewindspeed + 1.0241 × minimumtemperature - 0.0116 × precipitation - 0.0812 × humidity

# Obtain the predicted values
y_pred2 = results2.predict(X2)

# Calculate the residuals
residuals2 = y2 - y_pred2

# Plot the Residuals vs. Fitted values
sns.residplot(x=y_pred2, y=residuals2, lowess=True, line_kws={'color': 'red'})
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()

Also in this case, the (Residuals vs. Fitted) plot shows that the residuals are randomly scattered around the 0 residual line. The residuals form a horizontal band around the residual = 0 line, which suggests that the variances of the error terms are considered equal.

The graph shows a number of outliers that deviate significantly from the general pattern of the residuals and could be investigated further. The number of these individual points wouldn’t greatly impact the regression coefficients and overall model fit.

# Generate the normal QQ plot
sm.qqplot(residuals2, line='s')
plt.title('Normal QQ Plot of Residuals')
plt.show()

The (Normal Q-Q) plot shows that the relationship between the theoretical quantiles and the standardized residuals is approximately linear for most points. However, the distribution is skewed to the right (the summary reports a skew of 0.618), meaning that most of the data sits on the left side with a long “tail” extending out to the right. The red line shows where the points would fall if the residuals were normally distributed. The points depart upward from this line as you follow the quantiles from left to right, which shows that the upper sample quantiles are larger than the theoretical ones, that is, there is more data in the right tail than a Gaussian distribution would produce.
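
The departure from normality can also be read from the Jarque-Bera statistic in the summary, which combines skewness and kurtosis. As a short worked example, the sketch below recomputes it from the rounded values reported for Albany; small differences from the reported figure come from that rounding.

```python
n_obs = 8330      # number of observations
skew = 0.618      # sample skewness of the residuals (0 for a normal distribution)
kurtosis = 4.836  # sample kurtosis of the residuals (3 for a normal distribution)

# JB = n / 6 * (S^2 + (K - 3)^2 / 4)
jarque_bera = n_obs / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)
print(round(jarque_bera, 1))  # 1700.2, close to the reported 1700.446
```

Under normality this statistic would be small, so a value of this size agrees with the right skew visible in the plot.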

Overall, the findings derived from the linear regression analysis provide evidence supporting the existence of a relationship between temperature and meteorological variables in the United States, with a specific focus on Albany and Sacramento. Both cities exhibit statistically significant associations between temperature and the meteorological variables of average windspeed, minimum temperature, precipitation, and humidity.
By understanding and analyzing these relationships, we can enhance our understanding of weather patterns, contribute to climate studies, and improve weather predictions. These findings highlight the significance of considering multiple meteorological factors when examining temperature variations and provide valuable insights into the impacts of these variables on weather patterns in the United States.

Problem Solving Using Computational Thinking - Course Review

2021-05-07T00:00:00+01:00

Course overview and structure
Course review

For their 9th birthday, Coursera celebrated by offering their learners a collection of 9 courses to pick from and enrol in to earn a free certificate (the special offer was available through 30 April 2021). I chose Problem Solving Using Computational Thinking from the University of Michigan.

In this article, I will share some insights about the course, and what you can expect if you decide to take it on.

Course overview and structure

First, the course is taught entirely in English, but there are subtitles for other languages as well (currently French, Portuguese (European), Russian and Spanish).

In week 1, you will learn about the foundations of Computational Thinking from Associate Professor Chris Quintana, from the University of Michigan School of Education. Then, you will have the opportunity to see Computational Thinking through real world and hypothetical examples shared by three experts in weeks 2, 3 and 4.

These experts are, respectively, Associate Director Mariana Carrasco-Teja from the Michigan Institute for Computational Discovery and Engineering (airport surveillance and image analysis case study); Associate Professor Rafael Meza (epidemiology case study); and Instructional and Program Design Coordinator Darin Stockdil from the Center for Education Design, Evaluation, and Research (human trafficking case study).

The learning objectives are:

To define Computational Thinking components including abstraction, problem identification, decomposition, pattern recognition, algorithms, and evaluating solutions.
To recognize Computational Thinking concepts in practice through a series of real-world case examples.
And to develop solutions through the application of Computational Thinking concepts to real world problems (peer-graded assignment).

The course is structured in 5 weeks, with the last being a peer-graded final project. To review the learning material, do the practice quizzes and quizzes, it should take you around 2 hours per week for the first 3 weeks, and 1h15 for the 4th week (excluding the time required for discussion prompts).

That will surely depend on your own pace and learning style, and you should always devote enough time to work through the material properly. As for the last week, I found it the most challenging, and it will probably take you longer than indicated. Allow and plan for at least 3 hours of work in that week.

Course review

My overall remarks and opinions regarding the course are:

Videos are not too long or too short. They are just about the right length for you to follow and focus in every segment, take a break and get back to another video.
Some parts would require very careful attention, and perhaps repeated reviewing of the material. That is because of their complexity for a non-specialized audience.
The course introduces new concepts, ideas and technologies from a variety of fields and domains. It brings richness of content, and should broaden one’s knowledge beyond the computational thinking aspects. You learn different things in just one course.
The course is well suited and appropriate for various skill levels. Even advanced learners can consolidate their knowledge and learn something new.
There are enough practice quizzes and quizzes for a learner to test their understanding. However, some questions require the student to fill-in their answers, and that might not always be the best method.
Not a lot of in-video questions.
The videos look scripted, with prepared speeches in advance. However, some videos are not as fluid or comprehensible. If necessary, you can use the subtitles.
Estimated times for completion of quizzes are a bit off.
There are no reading materials and other resources.

Note that the case for the 4th week is optional. You will need to consent to be able to read and examine the course material. That is because the case study covers a rather delicate topic regarding hypothetical implications of Computational Thinking on the issue of Human Trafficking.

I have taken many courses from the University of Michigan. This course is true to their approach and methodology. It is well structured, with professional high quality videos and production.

The case studies are meticulously presented, and I think they bring the most value to the course. And when working on the last peer-graded assignment, you get the chance to apply and test your knowledge to the fullest.
Perhaps the one thing that could be improved is the quality of the discussion forums.

Overall, I do recommend taking Problem Solving Using Computational Thinking, and investing the time to complete the course.

Happy learning everyone!

K-Means clustering and similarity visualization of constitutions

2021-04-30T00:00:00+01:00

Introduction
Text processing and exploratory analysis
K-Means clustering
Visualizing text corpus similarity

Introduction

Constitutions hold the fundamental principles and rules that constitute the legal basis of a country. They determine the system of government, and the relationships between branches and institutions.

When written, these documents can be quite unique and distinct in many aspects, such as length and legal terminology. However, they can also share some other features, since they tend to have similar purposes.

A textual analysis of such data can be useful. We are going to apply some techniques to compare and cluster various constitutions. This work tries to see if constitutional text corpuses are indicative of the set and outlined systems of government.

To do so, we are going to use TF-IDF term weighting and K-Means clustering from scikit-learn. If you need a text analysis refresher, please check here.

# importing the libraries
import os
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

%matplotlib inline

There are 35 constitutions in our dataset. Most of the documents were retrieved from the Constitute Project in April 2021, using a simple crawler built with the Beautiful Soup library.

If you intend on getting your data from the source above, and as a general good scraping etiquette, please be nice and avoid overburdening their servers with requests.

After importing the necessary libraries, we read our documents to a DataFrame.

# Go through the files in the working directory
# If it's a text file, read it and store its name (without ".txt") and content
rows = []
for filename in os.listdir(os.getcwd()):
    if filename.endswith('.txt'):
        # The with block closes the file once it has been read
        with open(filename, encoding='latin_1') as f:
            content = f.read().replace("\n", " ")
        rows.append({"document": filename[:-4], "content": content})
# Build the DataFrame in one go (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(rows, columns=['document', 'content'])

We should check the names of the documents. Since pandas DataFrame columns are Series, we can pull them out and call .tolist() to turn them into a Python list.

print(df['document'].tolist())

['algeria_constitution', 'australia_constitution', 'austria_constitution', 'belgium_constitution', 'brazil_constitution', 'burkinafaso_constitution', 'china_constitution', 'costarica_constitution', 'ecuador_constitution', 'france_constitution', 'germany_constitution', 'india_constitution', 'japan_constitution', 'korea_constitution', 'malaysia_constitution', 'mexico_constitution', 'morocco_constitution', 'netherlands_constitution', 'nigeria_constitution', 'norway_constitution', 'pakistan_constitution', 'peru_constitution', 'portugal_constitution', 'rwanda_constitution', 'senegal_constitution', 'singapore_constitution', 'southafrica_constitution', 'spain_constitution', 'sweden_constitution', 'switzerland_constitution', 'tunisia_constitution', 'turkey_constitution', 'us_constitution', 'vietnam_constitution', 'zambia_constitution']

As we can see, the constitutions pertain to different countries from around the world, with different systems of government. Some nations from the list are monarchies, others are republics. Some have a unitary government, while others are federal. And the differences extend to the legislature as well.
It is, in fact, a diverse collection. But, we should keep in mind that most of these constitutions are translated from their respective native languages. The original meaning in each document may not be conveyed with the same degree of accuracy.

Text processing and exploratory analysis

Next, we begin the analysis.
First, we choose a list of stopwords from the Natural Language Toolkit project (nltk). These are high-frequency terms (like who and the) that we may want to filter out of documents before processing.

from nltk.corpus import stopwords
sw = stopwords.words('english')
sw.append('shall') # Add "shall" to stopwords
print ("There are", len(sw), "words in this stopwords list. The first 10 are:", sw[:10])

There are 180 words in this stopwords list. The first 10 are: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Then, we are going to tokenize our texts. NLTK provides several types of tokenizers for that purpose. We will use a custom regular expression tokenizer that keeps runs of word characters (letters, digits and underscores) and drops everything else.

from nltk import regexp_tokenize
patn = r'\w+'  # raw string, so that \w is not treated as a string escape
df['content'] = df['content'].apply(lambda w: " ".join(regexp_tokenize(w, patn)))
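To see what this pattern keeps and what it drops, here is a small sketch that uses `re.findall` from the standard library as a stand-in. For a pattern without capture groups, it should return the same matches as `regexp_tokenize`. The sample sentence is made up for illustration.

```python
import re

# Hypothetical sentence, not taken from the dataset
sample = "The President's term-of-office shall be 5 years."
tokens = re.findall(r"\w+", sample)
print(tokens)
# ['The', 'President', 's', 'term', 'of', 'office', 'shall', 'be', '5', 'years']
```

Notice that the apostrophe and the hyphens act as separators, so "President's" is split into two tokens, and that digits are kept.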

To have better insights from our text corpus, let us create 4 new columns:

df['number_words'] for the number of words in each document
df['unique_words'] for the number of unique words in each document
df['number_words_without_sw'] for the number of words that are not in the stopwords list
df['percentage'] for the percentage of stop words in each text corpus

df['unique_words'] = df['content'].apply(lambda x: len(set(x.split())))
df['number_words'] = df['content'].apply(lambda x: len(x.split()))
df['number_words_without_sw'] = df['content'].apply(lambda y: len([word for word in y.split() if word not in sw]))
df['percentage'] = 100 - (df['number_words_without_sw'] * 100 / df['number_words'])

Let us use .head() to preview the first 6 rows of our DataFrame.

df.head(6)

All of the percentage values are less than 50. We could surmise that a well-written legal document should not contain a lot of stopwords. However, that may not always be the case. Feel free to share your opinions regarding this in the comments section below, or by email.

We can push the exploration further, and check which constitutions have the most and the least unique words. Such a measure can be an indicator of the richness of the used lexicon, and its complexity as well.

max_unique_words = df['unique_words'].max()
# idxmax() and idxmin() return the index label of the row, which .loc can use directly
doc_max_unique_words = df.loc[df['unique_words'].idxmax(), 'document']
min_unique_words = df['unique_words'].min()
doc_min_unique_words = df.loc[df['unique_words'].idxmin(), 'document']

print ("The {} has the most unique words with {} words. And the {} has the fewest with only {}"
       .format(doc_max_unique_words, max_unique_words, doc_min_unique_words, min_unique_words))

The brazil_constitution has the most unique words with 5286 words. And the us_constitution has the fewest with only 1005

After the previous exploratory phase, we move to term weighting and tf-idf.

Like we did in the previous article linked above, we are going to use TfidfVectorizer from sklearn to convert the collection of documents to a matrix of TF-IDF features. That would allow us to take into account how often a term shows up.
Again, tf-idf is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. You can refer to the official sklearn documentation for the complex mathematical explanation.

tfidf_vectorizer = TfidfVectorizer(stop_words=sw, use_idf=True)

x = tfidf_vectorizer.fit_transform(df.content)
tfidfcounts = pd.DataFrame(x.toarray(),index = df.document, columns = tfidf_vectorizer.get_feature_names_out())

K-Means clustering

Let us go over a brief explanation of clustering in general before delving into K-Means clustering that we will be using.

Clustering is the process of grouping a collection of objects, such that those in the same partition (or cluster) are more similar (in some sense) to each other, than to those in other groups (clusters).
There are a lot of clustering algorithms that can be utilized, and their use is modulated by specific conditions in the use cases.

As for the K-Means algorithm, it clusters data by trying to separate the samples into n groups of equal variance. It does so by minimizing the sum of squared distances between each point and the mean (centroid) of its cluster, a quantity known as inertia. This algorithm requires the number of clusters to be specified in advance.
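The two alternating steps behind this description can be sketched in plain Python. This is a simplified illustration and not the scikit-learn implementation: the helper names, the fixed number of iterations, the hand-picked starting centroids and the toy points are all assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points]


def k_means(points, start, iterations=10):
    """Run a fixed number of Lloyd iterations from the given starting centroids."""
    centroids = list(start)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    labels = assign(points, centroids)
    # Inertia: the within-cluster sum of squared distances that k-means minimizes
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids, inertia = k_means(points, start=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the starting centroids are usually chosen at random, so different runs can give different partitions.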

Below, we set the desired number of clusters. That choice is not an easy task. There are a few ways to determine the optimal number of clusters, but for the sake of this demonstration, we will not be going through them.

# Specify number of clusters
number_of_clusters = 3
km = KMeans(n_clusters = number_of_clusters)
# Compute k-means clustering
km.fit(x)

KMeans(n_clusters=3)

After computing the k-means clustering, and getting a fitted estimator, we ought to see the top words in each cluster.
In the code below, cluster_centers_ holds the coordinates of each centroid. Then, .argsort()[:, ::-1] turns each centroid into a list of column indices sorted by descending weight. Since the columns of our vector representation are words, this gives the most relevant words for each cluster.
We use .get_feature_names_out() to map the integer feature indices back to feature names.
Finally, the for loop wraps up the work, and prints out the top words in each cluster.

order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names_out()

for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Cluster 0: state may federal law section
Cluster 1: article law national state president
Cluster 2: federal art law confederation para
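The sorting step used above may be easier to follow on a single, tiny centroid. The sketch below is a pure-Python analogue of argsort followed by a reversal. The vocabulary and the weights are hypothetical.

```python
terms = ["court", "federal", "law", "president"]  # hypothetical vocabulary
centroid = [0.10, 0.45, 0.30, 0.05]               # hypothetical centroid weights

# Column indices sorted by descending weight
order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
top_words = [terms[i] for i in order[:2]]

print(order)      # [1, 2, 0, 3]
print(top_words)  # ['federal', 'law']
```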

The top terms for each cluster are somewhat intriguing. In fact, certain words do pertain to specific systems of government.
Let us check the full results, and see how the entire dataset was partitioned.

results = pd.DataFrame()
results['document'] = df.document
results['category'] = km.labels_
results

By observing the results and the top words for each cluster, we can see that a lot of federal countries were assigned to cluster 2. However, a few were put in other clusters. Likewise, some non-federal countries were grouped in cluster 2.
The same goes for clusters 0 and 1, where similar systems of government are not always put together.

This suggests that word relevance on its own gives only a broad picture of how well the text corpus reflects the system of government.
The analysis can be improved by using other algorithms and techniques. However, K-Means clustering being fairly simple and easy to implement, can be a good starting point for further and deeper inspection.

Visualizing text corpus similarity

We can visualize the similarities based on the TF-IDF features.
To do so, we start by constructing the vectorizer as usual. We specify max_features to build a vocabulary that considers only the top max_features ordered by term frequency across the dataset.

vectorizer = TfidfVectorizer(use_idf=True, max_features=10, stop_words=sw)
X = vectorizer.fit_transform(df.content)
print (vectorizer.get_feature_names_out().tolist())

['article', 'constitution', 'court', 'federal', 'law', 'may', 'national', 'president', 'public', 'state']

We can use a combination of DataFrame.plot and matplotlib to draw a scatter plot representing the distribution of two terms on the x and y axes, and a colormap to showcase the relevance of a third term.
We can clearly see that only 10 data points had some value regarding the term federal, while the rest had a value of 0 or close.

import matplotlib.pyplot as plt

df2 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
# Create the figure and its axes once, with the desired size
fig, axi = plt.subplots(figsize=(12, 10))

# x: weight of "federal", y: weight of "president", color: weight of "state"
ax = df2.plot(kind='scatter', x= 'federal', y= 'president', s= 250, alpha= 0.6,
              c='state', colormap='viridis', ax= axi)

axi.set_title('President X Federal', fontsize=18)
axi.set_xlabel("Federal", fontsize=18)
axi.set_ylabel("President", fontsize=18)

Text(0, 0.5, 'President')

Advanced word analysis with TF-IDF

2021-04-21T00:00:00+01:00

Introduction
Term Frequency
Inverse document frequency

Introduction and basic concepts

In a previous article, we utilized CountVectorizer from scikit-learn to count words. We used bag of words analysis, where a text is represented as the bag of its words, disregarding grammar, and with no particular order. This model may capture the characteristics of the text or document.

However, there are some limitations with simple word count analysis. A better solution would be to use latent features, such as the frequency of words used in a document.

In fact, some terms will appear more often, carrying little useful knowledge about the document’s actual contents. Those very frequent words would shadow the frequencies of more uncommon yet more interesting terms.
These problems can be tackled with TF-IDF. Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency.
It is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
The TF-IDF value increases in proportion to the number of times a word appears in a document, and is offset by the number of documents in the corpus that contain the word. This adjusts for the fact that certain words appear more often than others in general.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
import re
import requests

Our texts for this notebook are some constitutions. The Tunisian one is read from a local file. For the others, we use requests to make a request and get a response with the desired text.

tn_constitution = open("constitution.txt").read().replace("\n"," ") # Tunisian Constitution
us_constitution = requests.get("https://www.gutenberg.org/cache/epub/5/pg5.txt").text[2623:] # US Constitution
jp_constitution = requests.get("https://www.gutenberg.org/cache/epub/612/pg612.txt").text[610:] # Japanese Constitution
athen_constitution = requests.get("https://www.gutenberg.org/cache/epub/26095/pg26095.txt").text[610:] # Athenian Constitution
df = pd.DataFrame([
    { "document": "Tunisian Constitution", "content": tn_constitution},
    { "document": "United States Constitution", "content": us_constitution },
    { "document": "Japanese Constitution", "content": jp_constitution},
    { "document": "Athenian Constitution", "content": athen_constitution },])

In text analysis, the raw data cannot be fed directly to most algorithms, since these expect numerical feature vectors of a fixed size rather than raw text documents of variable length.
In order to address this, there are ways to extract numerical features from text, namely:

Tokenizing: Word tokens are the basic units of text. When processing, the first step is to split strings into tokens and assign an integer id to each possible token.
Counting the occurrences of tokens in each document - how many times does a word appear in the text.
Normalizing and weighting with diminishing importance tokens that occur in the majority of documents.
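The first two steps above can be sketched with the standard library alone. This is only an illustration of the idea, not what scikit-learn runs internally: the two toy sentences are made up, and the pattern for words of at least two characters is an assumption.

```python
import re
from collections import Counter

# Hypothetical mini-corpus
docs = ["The law shall protect the people.",
        "The people elect the president."]

# 1. Tokenizing: lowercase words of two or more characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]

# Give every distinct token an integer id (alphabetical order)
all_tokens = sorted({tok for doc in tokenized for tok in doc})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

# 2. Counting: one row per document, one column per token id
rows = []
for doc in tokenized:
    counts = Counter(doc)
    rows.append([counts.get(tok, 0) for tok in vocabulary])

print(vocabulary)
print(rows)  # [[0, 1, 1, 0, 1, 1, 2], [1, 0, 1, 1, 0, 0, 2]]
```

The third step, weighting, then rescales these raw counts so that tokens found in most documents matter less.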

We can specify a tokenizer when using CountVectorizer. Below, you will find a stemming_tokenizer for reference, although we will not be using it in this work.

Stemming is a text preprocessing task that reduces related or similar forms of a word to a common base form (talking to talk, and cats to cat, for example). The tokenizer below relies on the Porter stemmer from nltk.

porter_stemmer = PorterStemmer()
def stemming_tokenizer(str_in):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_in).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

Let’s put it all together, and experiment with the CountVectorizer.

vectorizer = CountVectorizer(stop_words='english')

matrix = vectorizer.fit_transform(df.content)
counts = pd.DataFrame(matrix.toarray(), index = df.document, columns = vectorizer.get_feature_names_out())

Since our texts are all constitutions, we could have a look at some intriguing terms.
But, what else should we be checking? Which words might be the most interesting? The idxmax pandas method would return the label of the column with the maximum value, for each row. That is, we’ll get the most frequent word for each document.

counts.idxmax(axis=1)

document
Tunisian Constitution         article
United States Constitution      shall
Japanese Constitution           shall
Athenian Constitution         council
dtype: object

Now, we look at this subset of words across all documents.

counts[['people','constitution', 'rules', 'law', 'order', 'assembly', 'house', 'democracy','article','shall','council']]

Term Frequency

We’re going to take into account how often a term shows up by using the TfidfVectorizer in the same way as CountVectorizer. TfidfVectorizer converts a collection of documents to a matrix of TF-IDF features. It is equivalent to CountVectorizer followed by TfidfTransformer.

tfidf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=False)

x = tfidf_vectorizer.fit_transform(df.content)
tfidfcounts = pd.DataFrame(x.toarray(),index = df.document, columns = tfidf_vectorizer.get_feature_names_out())

Let’s check the same words as we did before!

tfidfcounts[['people','constitution', 'rules', 'law', 'order', 'assembly', 'house', 'democracy','article','shall','council']]

Notice how our numbers have shifted a bit. These are supposedly better relative indicators for the use of words, and their importance in our documents.
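Where do these shifted numbers come from? With use_idf=False and the default norm='l2', each row is simply the raw counts divided by the Euclidean length of that row. Here is a minimal sketch with made-up counts for a single document.

```python
import math

# Hypothetical raw counts for one document
counts = {"article": 3, "law": 4, "people": 0}

# Euclidean (L2) length of the count vector
norm = math.sqrt(sum(c * c for c in counts.values()))
tf = {term: c / norm for term, c in counts.items()}

print(tf)  # {'article': 0.6, 'law': 0.8, 'people': 0.0}
```

Since every row ends up with a length of 1, long and short documents become comparable.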

Inverse document frequency

By looking at the previous DataFrame, it seems like the word (shall) shows up a lot. So, even though it’s not a stopword, it should be weighted a bit less.

This is the idea behind inverse document frequency. The more documents a term shows up in, the less weight it gets in our matrix.
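As a rough illustration, the sketch below computes the smoothed idf weight described in the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, on a made-up corpus. The four tiny "documents" and the `smooth_idf` helper are hypothetical, and the final normalization of each row is left out.

```python
import math

# Hypothetical corpus: the set of distinct terms found in each of four documents
docs = [{"shall", "article", "people"},
        {"shall", "congress", "people"},
        {"shall", "emperor", "people"},
        {"shall", "council"}]


def smooth_idf(term, docs):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1, with df the number of documents containing the term."""
    n = len(docs)
    df = sum(term in d for d in docs)
    return math.log((1 + n) / (1 + df)) + 1


weights = {term: round(smooth_idf(term, docs), 3) for term in ("shall", "people", "council")}
print(weights)  # {'shall': 1.0, 'people': 1.223, 'council': 1.916}
```

A term found in every document gets the lowest possible weight of 1, while a term found in a single document gets a noticeably larger one.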

# use_idf (bool, default=True) enables inverse-document-frequency reweighting
# It is set explicitly here to highlight the contrast with the previous vectorizer
idf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)

y = idf_vectorizer.fit_transform(df.content)
idfcounts = pd.DataFrame(y.toarray(), index = df.document, columns = idf_vectorizer.get_feature_names_out())

Again, with the same subset of words across all documents.

idfcounts[['people','constitution', 'rules', 'law', 'order', 'assembly', 'house', 'democracy','article','shall','council']]

Notice how (council) increased in value because it’s an infrequent term, and (people) decreased in value because it’s quite frequent.

It is beneficial to understand how TF-IDF functions in order to obtain a deeper understanding of how machine learning algorithms work. TF-IDF allows us to associate each word in a document with a numerical value or vector, that reflects its relevance in that document.
In text analysis with machine learning, TF-IDF algorithms help extract keywords, and by determining similar documents, we are able to automatically sort them into clusters.
Besides, given a query, variations of the TF-IDF weighting are also used by search engines in scoring and ranking a document’s relevance.

Counting words in Python with scikit-learn’s CountVectorizer

2021-04-10T00:00:00+01:00

Introduction
Counting words with CountVectorizer
Counting words in multiple documents

Introduction

In a previous article, we used simple techniques to visualize and count words in a document. In this notebook, we will be using another technique. The CountVectorizer from scikit-learn is more elaborate than the Counter tool. It converts a collection of text documents to a matrix of token counts.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

The text for our work would be an English version of the Tunisian constitution.

text = open("constitution.txt").read().replace("\n"," ")

Counting words with CountVectorizer

The vectorizer produces a sparse representation of the counts. The fit_transform() method learns the vocabulary dictionary and returns the document-term matrix, as shown below. This method is equivalent to using fit() followed by transform(), but more efficiently implemented.

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform([text])
matrix # notice the size of the matrix

<1x1741 sparse matrix of type '<class 'numpy.int64'>'
	with 1741 stored elements in Compressed Sparse Row format>

The numbers in the array below represent how many times a word showed up in the text.

matrix.toarray()

array([[2, 1, 1, ..., 1, 1, 3]], dtype=int64)

If we want to know which word is which, we can use get_feature_names_out() to map feature integer indices to feature names. The order of the words in this array matches the order of the numbers in the previous array. Here, we only output the last 10 words. The last one is youth, and according to the last output value from matrix.toarray(), it appeared 3 times in the text. The word younger appeared just once!

print (vectorizer.get_feature_names_out()[1731:].tolist())

['works', 'world', 'worship', 'writing', 'written', 'year', 'years', 'young', 'younger', 'youth']

We can use DataFrames to turn the results into a human-readable format.

counts_df = pd.DataFrame(matrix.toarray(), columns = vectorizer.get_feature_names_out())
counts_df

Even more, we can get a sorted list similar to the result given by Counter. We use some pandas magic to transpose index and columns, and the result is naturally a transposed DataFrame. In fact, the used property T is an accessor to the method transpose().

counts_df.T.sort_values(by=0, ascending=False).head(8)

As seen so far, the CountVectorizer is quite useful, and it can handle a lot of preprocessing for us. That would allow us to focus on the interpretation of data for example.

So, how many times did people appear in the text?

counts_df['people']

0    108
Name: people, dtype: int64

How about law and order?

print (counts_df['law'], '\n' ,counts_df['order'])

0    106
Name: law, dtype: int64
 0    9
Name: order, dtype: int64

Counting words in multiple documents

All of that is quite good and exciting. Now, we will see how CountVectorizer handles multiple text documents.
We will be using the United States Constitution and the Athenian Constitution by Aristotle, in addition to our previous text.

To read a text file from a URL in Python, we make a request with the Requests module to get a Response object. We can read the content of the server’s response by accessing .text.

import requests
US_constitution = requests.get("https://www.gutenberg.org/cache/epub/5/pg5.txt").text[2623:] # To slice out the unwanted text
Athenian_constitution = requests.get("https://www.gutenberg.org/cache/epub/26095/pg26095.txt").text[610:]

We construct a DataFrame with the content by passing the appropriate data.

df = pd.DataFrame([
    { "document": "Tunisian Constitution", "content": text},
    { "document": "United States Constitution", "content": US_constitution },
    { "document": "Athenian Constitution", "content": Athenian_constitution },])
df

Finally, we create an organized DataFrame of the words counted in each document. This time, we feed the entire content column to the CountVectorizer instead of a single text variable.

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform(df.content)
counts = pd.DataFrame(matrix.toarray(), index = df.document, columns = vectorizer.get_feature_names_out())

counts

This is a nice feature where we can select several interesting words to check in all documents.

counts[['people','constitution', 'rules', 'law', 'order', 'assembly', 'house', 'democracy']]

In this notebook, we used CountVectorizer from sklearn to count words in multiple documents. It is more advanced than working with Counter and having to do all the text cleaning.

Visualization and analysis of legal texts

2021-04-08T00:00:00+01:00

Generating Word Clouds
Counting words
Comparing different documents

While browsing the Internet, you have probably seen a picture of a cloud filled with words of varying sizes that reflect the frequency of each word within a given text. This is referred to as a Tag Cloud or a Word Cloud. In this tutorial (see the notebook here), we will learn how to make Word Clouds in Python. This tool is useful for a visual exploration of text data.

We will use legal texts for the purpose of this tutorial, namely the Tunisian Constitution and the Tunisian Hydrocarbons Code.

As usual, we start by importing the different libraries used.
The NumPy library is used for handling large, multi-dimensional arrays and matrices.
For visualization, matplotlib is a comprehensive plotting library. It enables other libraries, such as seaborn and wordcloud, to run on its base.
The pillow library adds support for opening, manipulating, and saving many different image file formats.

from PIL import Image
from wordcloud import WordCloud, STOPWORDS
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

Generating Word Clouds

Then, we read the Hydrocarbons Code stored in a .txt file. I chose this legal document, because I have worked on it and analysed it for a thesis report I wrote. It is a complex legal text that pertains to a sensitive and important topic, that is natural resources.

We have to make some necessary text processing. First, we convert the entire text to lower case. This is important since Python strings are case sensitive. Afterwards, we use the magic of regular expressions to deal with apostrophes and other characters to be removed.
Whenever in doubt regarding regular expressions syntax, you can use this website or the official Python documentation to have the desired outcome.

hydrocarbons_code = open("hydrocarbons_code.txt", encoding='latin_1').read().replace("\n"," ")
hydrocarbons_code = hydrocarbons_code.lower()
hydrocarbons_code = re.sub(r".'|[^\w ]", " ", hydrocarbons_code)
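Here is a quick check of what this substitution does, on a short French phrase that is not taken from the Hydrocarbons Code. The first alternative of the pattern removes an apostrophe together with the character before it (typically an elided article such as l' or d'). The second one replaces anything that is neither a word character nor a space.

```python
import re

# Hypothetical sample, already containing an elision and some punctuation
sample = "l'état, c'est moi!"
cleaned = re.sub(r".'|[^\w ]", " ", sample.lower())

print(cleaned.split())  # ['état', 'est', 'moi']
```

Accented letters such as é count as word characters, so they are preserved.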

Since the text is in French, we have to make our own list (technically it’s a Python set with curly brackets here) of the words to remove from the given text. These words would not show in the Word Cloud.

If you do not speak French, most of these words are the equivalent to English articles, pronouns and conjunctions. They would not help us much in understanding the text through visuals.

stop = {'de','du','des','et','est','la','le','les','en','ou','par','au','aux','dans','une','un','pour','sur','ce','ces',
        'ne','qui','que','son','ses','sa','il','ci','a'}

In order to create an image form for the Word Cloud, we need to use a PNG file as a mask. Here, I use the map of Tunisia, just for fun.

The mask argument in the WordCloud function takes an N dimensional array (ndarray). We use Image module to open the PNG file, and we transform it to the numpy array form.
According to the WordCloud documentation, all white entries will be considered masked out, while other entries will be free to draw on. In the NumPy array of an 8-bit image, all white parts of the mask have a value of 255, whereas black pixels have a value of 0.

map_mask = np.array(Image.open("map.png"))
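If you do not have an image at hand, the idea can be tried on a synthetic mask. The sketch below builds a tiny 8-bit array by hand and counts the pixels that are left free to draw on. The array shape and the dark rectangle are made up and only stand in for a real PNG.

```python
import numpy as np

# Synthetic 4x6 grayscale mask (hypothetical): white everywhere...
mask = np.full((4, 6), 255, dtype=np.uint8)
# ...except for a black 2x3 rectangle where words may be drawn
mask[1:3, 2:5] = 0

# Every entry that is not pure white is free to draw on
drawable = mask != 255
print(drawable.sum())  # 6
```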

Now, we have a proper mask and we can make a cloud with the desired shape.
WordCloud takes several parameters, and you can create a personalised result by changing the optional arguments. Some of these are fairly self-explanatory. For the rest, you can always consult the relevant documentation. Or, you can check out the docstring of the function and see the required and optional arguments, by typing and running ?WordCloud.

hydrocarbons_cloud = WordCloud(max_words=1000, mask=map_mask, stopwords=stop , min_word_length = 3, min_font_size = 8,
                               margin=5, random_state=1, background_color="white", include_numbers=True).generate(hydrocarbons_code)

Finally, we can output the result and have an insightful and beautiful visualization.
Naturally, the most frequent words are hydrocarbons, code, article and holder, as expected in a legal document about hydrocarbons!

%matplotlib inline
plt.figure(figsize=(18,18))
ax = plt.gca()
ax.set_title("Tunisian Hydrocarbons Code Cloud",fontsize=22)
plt.imshow(hydrocarbons_cloud.recolor(colormap=sns.color_palette(palette='blend:red,brown',as_cmap=True), random_state=3),
           interpolation="bilinear")
plt.axis("off")
plt.show()
#hydrocarbons_cloud.to_file("Hydrocarbons code cloud.png") #uncomment to save the figure in PNG format

Now, let’s take a look at the Tunisian Constitution. We will use an English translation from the Constitute Project.

As before, we start by constructing a mask. For this example, it will be the Tunisian flag. For the rest, the only difference is that we use the default built-in STOPWORDS list.

flag_mask = np.array(Image.open("Flag_of_Tunisia.png"))

tunisian_constitution = open("constitution.txt").read().replace("\n"," ").lower()
tunisian_constitution = re.sub(r"[^\w ]", " ", tunisian_constitution)

constitution_cloud = WordCloud(max_words=1000, mask=flag_mask, stopwords=STOPWORDS , min_word_length = 3, min_font_size = 8,
                               margin=5, random_state=1, background_color="white", include_numbers=True).generate(tunisian_constitution)

The result looks good and gives us a useful visual anchor.

%matplotlib inline
plt.figure(figsize=(20,10))
ax = plt.gca()
ax.set_title("Tunisian Constitution Cloud",fontsize=22, y=1.04)
plt.imshow(constitution_cloud.recolor(colormap=sns.color_palette(palette='blend:red,brown',as_cmap=True)),
           interpolation="bilinear")
plt.axis("off")
plt.show()
#constitution_cloud.to_file("Tunisian Constitution Cloud.png") #uncomment to save the figure in PNG format

Counting words

However, text analysis goes beyond visualisations. We can count words using Counter from the collections module. If we are only interested in the most common words, we can use .most_common() with or without an argument specifying the number of words.

In the code below, we use list comprehension to store words appearing more than 30 times in the constitution and having at least 4 letters.

from collections import Counter
tunisian_words = [x for x in Counter(tunisian_constitution.split()).most_common(50) if x[1]>30 and len(x[0])>3 ]
tunisian_words

[('article', 173),
 ('shall', 173),
 ('assembly', 149),
 ('president', 110),
 ('people', 108),
 ('government', 100),
 ('representatives', 97),
 ('with', 97),
 ('republic', 96),
 ('state', 67),
 ('members', 62),
 ('court', 62),
 ('their', 50),
 ('draft', 48),
 ('head', 47),
 ('constitutional', 45),
 ('that', 43),
 ('laws', 42),
 ('within', 40),
 ('from', 39),
 ('right', 38),
 ('council', 37),
 ('judicial', 37),
 ('authorities', 34),
 ('local', 33),
 ('national', 32),
 ('after', 32),
 ('rights', 31),
 ('constitution', 31)]
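As a small illustration of .most_common() with and without an argument, here is a sketch on a made-up sentence (the sample text is an assumption, not taken from the constitution):

```python
from collections import Counter

# Made-up sample text, used only to illustrate Counter.
words = "the state shall protect the rights of the people of the state".split()
counts = Counter(words)

top_two = counts.most_common(2)    # the two most frequent words
everything = counts.most_common()  # every word, most frequent first

print(top_two)  # [('the', 4), ('state', 2)]
```

Words with equal counts are returned in the order they were first encountered, which is why 'state' comes before 'of' here.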

Comparing different documents

We can compare the occurrences of words in two documents as well. We will use the French constitution for comparison, despite the differences between the governance systems of the two countries as implemented in their respective constitutions.

We read and process the .txt file as we did before. Then, we store the words of each document in a list (tunisian_words and french_words).

french_constitution = open("french constitution.txt").read().replace("\n"," ").lower()
french_constitution = re.sub(r"[^\w ]", " ", french_constitution)

tunisian_words = tunisian_constitution.split(" ")
french_words = french_constitution.split(" ")

Afterwards, we construct a DataFrame by passing in the Counter collections as data entries, and setting the column labels.
In this example, we chose to drop missing values by calling .dropna() along axis 0, which drops the rows that contain missing values. That gives us a DataFrame of only the words that exist in both documents!

The result below shows the number of occurrences of each word, the total, and a percentage value indicating the prevalence of that word in the Tunisian constitution with reference to its total use in both documents.

df = pd.DataFrame({
    'Tunisian_constitution': Counter(tunisian_words),
    'French_constitution': Counter(french_words)
}).dropna(axis=0)

df['Total'] = df.Tunisian_constitution + df.French_constitution
df['Tunisian_percentage'] = (df.Tunisian_constitution / df.Total) * 100

df.head(10)
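For readers who want to see the same idea without pandas, the sketch below keeps only the words present in both counters and computes the same percentage. The two short word lists are made up and merely stand in for the processed documents.

```python
from collections import Counter

# Made-up word lists standing in for the two processed documents.
doc_a = Counter("the president shall appoint the government".split())
doc_b = Counter("the government shall answer to parliament".split())

# Key views support set operations: keep the words present in both counters.
common = sorted(doc_a.keys() & doc_b.keys())

for word in common:
    total = doc_a[word] + doc_b[word]
    share_a = doc_a[word] / total * 100  # share of the occurrences found in doc_a
    print(word, doc_a[word], doc_b[word], total, round(share_a, 1))
```

This mirrors what dropping rows with missing values achieves in the DataFrame: a word absent from either document disappears from the comparison.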

All of these common words between the two documents were used differently in each. We can check how many times they appeared by returning the sum of the values over the desired axis.

df.sum(axis = 0)

Tunisian_constitution    11950.000000
French_constitution      11769.000000
Total                    23719.000000
Tunisian_percentage      40472.171694
dtype: float64

Now, let’s look at words used ten or more times in the Tunisian constitution. We sort them by descending value.
The for loop goes through df.index and removes the words that have three characters or fewer.

for ind in df.index:
    if len(ind)<4:
        df.drop(ind, inplace = True)

df[df.Tunisian_constitution >= 10].sort_values(by='Tunisian_constitution', ascending=False)

We can use df.sum() again to compute the totals for the remaining words, those with four or more characters.

df.sum(axis = 0)

Tunisian_constitution     5109.000000
French_constitution       5125.000000
Total                    10234.000000
Tunisian_percentage      34301.487972
dtype: float64

In this notebook (using Python 3.7, pandas 1.2.1 and matplotlib 3.3.2), we have learned how to draw a Word Cloud, which is helpful for visualizing any text. We also used Counter to count words in documents. The tool worked well with pandas DataFrames, allowing us to make simple comparisons.

This might have been naive text analysis, but it is an important first step towards a more comprehensive and elaborate text analysis.

Plotting climate data using pandas

2021-03-20T00:00:00+01:00

Introduction
Data processing and time series manipulation
Plotting and styling using matplotlib and seaborn

Introduction

The data for this notebook comes from a subset of The National Centers for Environmental Information (NCEI) Daily Global Historical Climatology Network (GHCN-Daily). The GHCN-Daily is comprised of daily climate records from thousands of land surface stations across the globe.

The data (stored in a csv file) is comprised of daily climate records over the period 2005-2015, from land surface stations near Ann Arbor, Michigan, United States. Each row in the datafile corresponds to a single observation.

The provided variables are :

id : station identification code
date : date in YYYY-MM-DD format
element : indicator of element type
- TMAX : Maximum temperature (tenths of degrees C)
- TMIN : Minimum temperature (tenths of degrees C)
value : data value for element (in tenths of degrees C)

For the purpose of this notebook, we are going to plot a line chart of the record high and record low temperatures by day of the year over the period 2005-2014. Then, we overlay a scatter of the 2015 data for any points (highs and lows) for which the ten-year record (2005-2014) was broken in 2015.

Importing libraries and reading data

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib.dates import MonthLocator, DateFormatter

df = pd.read_csv('data.csv')

Data processing and time series manipulation

Since the temperature values are in tenths of a degree Celsius, we need to convert them to °C.

df['Data_Value'] = df['Data_Value']/10 #convert temperatures to °C

Next, we ensure that the Date values are interpreted as dates, and sort the entire DataFrame by date.

df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)  #convert to date type
df = df.sort_values('Date')  #sort DataFrame by date

We use the head() function to get the first n rows (5 by default), and the shape attribute for a tuple representing the dimensionality of the resulting DataFrame:

print (df.head(), "\n The size of the DataFrame is", df.shape)

                ID       Date Element  Data_Value
USW00004848 2005-01-01    TMIN         0.0
USC00207320 2005-01-01    TMAX        15.0
USC00207320 2005-01-01    TMIN        -1.1
USW00014833 2005-01-01    TMIN        -4.4
USW00014833 2005-01-01    TMAX         3.3
 The size of the DataFrame is (165085, 4)

For clarity and ease, we are going to create two distinct DataFrames to hold TMAX and TMIN values.

dftmax = df[df['Element']=='TMAX'] #get dataframe of only TMAX values
dftmin = df[df['Element']=='TMIN'] #get dataframe of only TMIN values

To have a better overview of the entire data, I decided to keep all values including those of leap years. In this cell, we use numpy.arange() to get all possible dates in such a year (2008).

observation_dates = np.arange('2008-01-01', '2009-01-01', dtype='datetime64[D]')
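As a cross-check of what this range contains, the following sketch builds the same calendar with the standard library only (the variable names are illustrative):

```python
from datetime import date, timedelta

# Build every calendar day of the leap year 2008 with the standard library.
start = date(2008, 1, 1)
end = date(2009, 1, 1)  # excluded, like the stop value of numpy.arange()
days = [start + timedelta(days=i) for i in range((end - start).days)]

month_day_labels = [d.strftime('%m-%d') for d in days]
print(len(days))                    # 366
print('02-29' in month_day_labels)  # True
```

Using a leap year as the reference guarantees a slot for 29 February, so no observation is left without a place on the x-axis.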

Now, we get the maximum temperature for each day in the period from 2005 to 2014. Remember that several values are registered for any given day in that period. The resulting Series dftmax14 has a length of 3652 (one maximum value for each day in the 10-year span).

Then, we’ll create a pandas series comprised of all TMAX values in a one-year span (tmax1y). For any single day value (1 June for example), we would get the maximum TMAX for that day for the period 2005-2014.

#extract of TMAX of each day from 2005 to 2014
dftmax14 = dftmax[dftmax['Date']<'2015-01-01'].groupby('Date')['Data_Value'].max()

#set index to month-day format instead of year-m-d
dftmax14.index = dftmax14.index.strftime('%m-%d')

#one-year summary of TMAX for the data from 2005 to 2014
tmax1y = dftmax14.groupby(dftmax14.index).max()
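The two groupby steps above can be pictured with plain dictionaries. This is only a sketch on made-up readings, assuming dates are 'YYYY-MM-DD' strings; it is not how pandas implements groupby.

```python
from collections import defaultdict

# Made-up readings: (date string, station id, TMAX in °C).
readings = [
    ('2005-06-01', 'A', 21.0), ('2005-06-01', 'B', 23.5),
    ('2006-06-01', 'A', 25.1), ('2006-06-01', 'B', 24.0),
    ('2005-06-02', 'A', 19.4),
]

# Step 1: highest value per calendar date, across stations.
daily_max = defaultdict(lambda: float('-inf'))
for day, _station, value in readings:
    daily_max[day] = max(daily_max[day], value)

# Step 2: highest value per month-day, across years.
record_high = defaultdict(lambda: float('-inf'))
for day, value in daily_max.items():
    month_day = day[5:]  # 'YYYY-MM-DD' -> 'MM-DD'
    record_high[month_day] = max(record_high[month_day], value)

print(dict(record_high))  # {'06-01': 25.1, '06-02': 19.4}
```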

We repeat the same steps to get minimum temperatures for each day in the period from 2005 to 2014.

#extract of TMIN of each day from 2005 to 2014
dftmin14 = dftmin[dftmin['Date']<'2015-01-01'].groupby('Date')['Data_Value'].min()

#set index to month-day format instead of year-m-d
dftmin14.index = dftmin14.index.strftime('%m-%d')

#one-year summary of TMIN for the data from 2005 to 2014
tmin1y = dftmin14.groupby(dftmin14.index).min()

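Similarly, we create a Series of the TMAX values in 2015 (dftmax15).
Since 2015 is not a leap year, we add a 02-29 entry equal to the existing record, so that dftmax15 has 366 values and lines up with tmax1y and observation_dates. Being equal to the record, this entry is never counted as a record breaker.
Then, we extract the dates for which TMAX values in 2015 were higher than the values of TMAX over the period 2005-2014 (observation_dates_tmax15).
Finally, we update dftmax15 to keep only those record-breaking TMAX values.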

#extract of TMAX of each day in 2015
dftmax15 = dftmax[dftmax['Date'] >= '2015-01-01'].groupby('Date')['Data_Value'].max()
#set index to month-day format instead of year-m-d
dftmax15.index = dftmax15.index.strftime('%m-%d')

dftmax15.loc['02-29'] = tmax1y['02-29']
dftmax15.sort_index(inplace=True)

observation_dates_tmax15 = observation_dates[dftmax15.values > tmax1y.values]

dftmax15 = dftmax15[dftmax15.values > tmax1y.values]
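The comparison itself is a strict "greater than" test applied day by day. A minimal sketch with made-up values keyed by month-day (not the notebook's data):

```python
# Made-up values: ten-year record highs and 2015 daily highs, keyed by 'MM-DD'.
record_high = {'01-01': 12.0, '01-02': 10.5, '01-03': 11.0}
highs_2015 = {'01-01': 9.0, '01-02': 13.2, '01-03': 11.0}

# A day breaks the record only if it is strictly above the previous maximum.
broken = {day: temp for day, temp in highs_2015.items() if temp > record_high[day]}
print(broken)  # {'01-02': 13.2}
```

Because the test is strict, a value that merely equals the record, such as the 02-29 entry assigned from tmax1y in the code above, is never selected.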

We can check the size of dftmax15, and infer that there are 37 days in 2015 that broke all TMAX values registered from 2005 to 2014.

dftmax15.size

We repeat the previous steps to get dftmin15, which holds the record-breaking TMIN values.

#extract of TMIN of each day in 2015
dftmin15 = dftmin[dftmin['Date'] >= '2015-01-01'].groupby('Date')['Data_Value'].min()
#set index to month-day format instead of year-m-d
dftmin15.index = dftmin15.index.strftime('%m-%d')

dftmin15.loc['02-29'] = tmin1y['02-29']
dftmin15.sort_index(inplace=True)

observation_dates_tmin15 = observation_dates[dftmin15.values < tmin1y.values]

dftmin15 = dftmin15[dftmin15.values < tmin1y.values]

By checking the size of dftmin15, we infer that there are 32 days in 2015 that broke all TMIN values registered from 2005 to 2014.

dftmin15.size

Plotting and styling using matplotlib and seaborn

In this cell, we use matplotlib and seaborn to create a figure, and plot line charts of the record high and record low temperatures by day of the year over the period 2005-2014, and a scatter of the 2015 data for any points (highs and lows) for which the ten-year record (2005-2014) was broken in 2015.
I made sure the visual was nice, with appropriate legends and labels, and reduced chart junk.

months = MonthLocator(range(1, 13), bymonthday=1, interval=1)
monthsFmt = DateFormatter("%b")

%matplotlib inline
sns.set_style("white") # Set the aesthetic style of the plot

fig = plt.figure(figsize=(10,5)) # Create a new figure with determined figsize parameter
ax = plt.gca() # Get the current Axes instance on the current figure

# Plots and Scatter plots of y vs. x with determined parameters
ax.scatter(observation_dates_tmax15 , dftmax15.values, s=35,
           c=sns.dark_palette("purple", n_colors=37), alpha=0.5, label='2005-2014 record high broken' )
ax.scatter(observation_dates_tmin15 , dftmin15.values, s=25,
           c=sns.dark_palette((260, 75, 60), input="husl", n_colors=32), alpha=0.9, label='2005-2014 record low broken' )

ax.plot(observation_dates, tmax1y.values, '-', color= sns.color_palette("Reds")[-2], label='High', alpha=0.7, linewidth=1)
ax.plot(observation_dates, tmin1y.values, '-', color= sns.color_palette("Blues")[-2], label='Low', alpha=0.7, linewidth=1)

ax.set_xlabel('Months')
ax.set_ylabel('Degrees (C)')

ax.set_title('Highest and lowest temperatures by day of the year over the period 2005-2014 \n and records broken in 2015 for Ann Arbor, Michigan, United States')

ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(monthsFmt)

plt.xticks(alpha=0.5)
plt.yticks(alpha=0.5)
sns.despine()
ax.legend()

maxv=tmax1y.values
minv=tmin1y.values
#To shade the area between the record high and record low temperatures for each day
ax.fill_between(observation_dates,maxv, minv, facecolor=sns.light_palette("lightgrey")[4], alpha=0.2)

#fig.savefig("temp.png", dpi=1200) #uncomment to save the figure in png format (reduce dpi to get smaller files)

This is not the only way to represent the data: you may tweak the different parameters and experiment more with the plotting libraries. You can also explore other questions, and have fun with the data and figures.

Furthermore, there are several ways to handle the data and the various DataFrames. I tried to be more explicit and went for simplicity, all whilst incorporating the Zen of Python.

If you want to learn more about data science through the Python programming language, I highly recommend the Applied Data Science with Python Specialization on Coursera.

Analysing Olympic Games medal table

2021-03-18T00:00:00+01:00

Introduction & preprocessing
Stacked bar chart
Bubble chart

Introduction

The following Jupyter Notebook uses the Olympic games medal dataset, which was derived from the Wikipedia entry on All Time Olympic Games Medals, as of the 2016 Summer Olympics and 2018 Winter Olympics. All changes in medal standings due to doping cases and medal redistributions up to and including 25 November 2020 are taken into account.
Data queried on March 18th, 2021.

Using the power of pandas, read the csv file containing the dataset (the result is a DataFrame). Then perform data cleaning and preprocessing operations to get a more readable and practical format of the DataFrame, to be used later.

import pandas as pd
df = pd.read_csv("olympic_games_medal_table.csv",index_col=0,skiprows=1, encoding='latin_1')

for col in df.columns:    # clean the labels of the raw data
    if col[:2]=="01":
        df.rename(columns={col:"Gold"+col[4:]}, inplace=True)
    if col[:2]=="02":
        df.rename(columns={col:"Silver"+col[4:]}, inplace=True)
    if col[:2]=="03":
        df.rename(columns={col:"Bronze"+col[4:]}, inplace=True)
    if col[:1]=="№":
        df.rename(columns={col:"#"+col[1:]}, inplace=True)

names_ids = df.index.str.split(r'\s\(')    # split the index on a whitespace followed by '('

df.index = names_ids.str[0]    # the [0] element is the country name (will be the new index)

df['ID'] = names_ids.str[1].str[:3]    # the [1] element is the abbreviation or ID (take first 3 characters)

spare_df = df.copy()
df = df.drop('Totals')    # remove the row with label 'Totals'
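To see what the split does to a single label, here is a sketch with the standard re module. The label below is a hypothetical example of the 'Name (CODE)' shape that the code above expects, not a row copied from the dataset.

```python
import re

# Hypothetical index label in the 'Name (CODE)' shape.
label = "United States (USA)"

parts = re.split(r'\s\(', label)  # split on a whitespace followed by '('
name = parts[0]                   # 'United States'
code = parts[1][:3]               # first three characters of 'USA)'
print(name, code)                 # United States USA
```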

Get the top ten countries that have the most medals in the summer and winter games.

most_medals = df['Combined total'].nlargest(10).index

Stacked bar chart

In this cell, we use matplotlib to visualize stacked bar charts representing the top ten countries in terms of total number of medals in the winter and summer games.

import matplotlib.pyplot as plt

%matplotlib inline

# get necessary pandas series
summer_medals = df.loc[most_medals, ['Total']]['Total']
winter_medals = df.loc[most_medals, ['Total.1']]['Total.1']

fig = plt.figure(figsize=(20, 10)) # create a figure object

# context manager for temporary styling
with plt.style.context(('seaborn-poster', {'xtick.labelsize' : 14, 'axes.labelpad':20 , 'axes.titlepad' : 20,
                        'axes.spines.top' : False, 'axes.spines.right' : False, 'axes.spines.left' : False} )):
    # create the bar plots
    bar_list = [plt.bar(range(len(most_medals)), summer_medals, width = 0.5 ,color='#d69728',
                        tick_label = most_medals, label = 'Summer'),    
         plt.bar(range(len(most_medals)), winter_medals, width = 0.5 , bottom = summer_medals ,
                 tick_label = most_medals, label = 'Winter')]
    ax = plt.gca()
    ax.set_ylabel('Number of Medals')
    ax.set_xlabel('Countries')
    ax.set_title('Number of total medals for the top ten countries')
    ax.tick_params(bottom=False, left=False, labelleft=False)

    # attach text label for each bar displaying its value
    for i in bar_list:
        for bar in i:
            height = bar.get_height()
            bottom = bar.get_y()
            if height < 50:
                y = 0.5*height+bottom*1.20
            else:
                y = 0.5*height+bottom
            plt.gca().text(bar.get_x() + bar.get_width()/2, y, str(int(height)),
                 ha='center', color='black', fontsize=20)
    plt.legend()

# save the plot as a png file. you can change the file format to pdf or any supported extension (uncomment the next line to use)
#fig.savefig("totalmedals.png", dpi=150)

Bubble chart

This chart is an example of a visualization that can be created to help understand the data. It is a bubble chart showing the value of adjusted gold medals (#total gold / #total games) vs. the rank by number of total medals won.

The size of the bubble corresponds to an adjusted value of total medals (#total medals/ #total games) won, and the color corresponds to the geolocation (European or non-European) or current status (red: no longer exists).

top_medals = spare_df['Combined total'].nlargest(11).index
bubble_df = spare_df.loc[top_medals].drop('ID', axis=1) # dataframe for top 11 winners

# to eliminate overlapping medals' count
sum_topcoun = bubble_df[1:][bubble_df.columns[~bubble_df.columns.str.contains('#')]].sum()
for col in bubble_df.columns:
    if col  in sum_topcoun.index:        
        bubble_df.loc['Totals', col] -= sum_topcoun[col]    # single .loc call avoids chained assignment

bubble_df = bubble_df.rename(index={'Totals':'Rest of the World'})

# create 2 new columns with their respective data
bubble_df['adjusted_cgold'] = (bubble_df['Gold.2'].div(bubble_df['# Combined Games'])).apply(lambda x: float('%.1f'%x))
bubble_df['Rank'] = range(1,len(bubble_df.index)+1)

# chart creation and styling
with plt.style.context(('seaborn-poster', {'xtick.labelsize' :12, 'ytick.labelsize':12,'axes.labelpad':20 ,
                                           'axes.titlepad' : 20,'axes.labelsize':15} )):
    ax2 = bubble_df.plot(x='Rank', y='adjusted_cgold', kind='scatter',
                    c=['#e4aa1a','#377eb8','#e41a1c','#4daf4a','#4daf4a','#4daf4a','#4daf4a','#4daf4a',
                    '#377eb8','#4daf4a','#4daf4a'], linewidths=2 ,
                    xticks=range(1,len(bubble_df.index)+1),
                         s=(bubble_df['Combined total'].div(bubble_df['# Combined Games']))*100, alpha=.55, figsize=[15,7])
    ax2.set_ylim(0,65)
    ax2.set_title('Adjusted total gold medals by total medals won')
    ax2.set_ylabel('Adjusted gold medals')
    for i, txt in enumerate(bubble_df.index):    # add labels inside each bubble
        ax2.annotate(txt, [bubble_df['Rank'].iloc[i], bubble_df['adjusted_cgold'].iloc[i]], ha='center',fontsize=11)

#plt.savefig("cgold.png", dpi=150)

This chart shows that the United States has the largest number of total medals in the summer and winter games, as indicated by the x-axis (Rank). Norway has the fewest combined medals among the top ten.

Based on the values of adjusted gold medals represented by the y-axis, the Soviet Union won the most gold medals relative to the number of Olympic Games in which it participated, followed by the US and Russia respectively.

The sizes of the bubbles suggest that the top 3 countries that won the most medals relative to the number of games they took part in are, respectively, the Soviet Union, the US and Russia.

To put it in perspective, compare France and China: the latter won fewer total medals overall (position on the x-axis). But taking into account the number of Olympic Games played, China won more gold medals (position on the y-axis) and more total medals (bubble size).

Governance of extractive industry in Tunisia

2021-03-03T00:00:00+01:00

Governance of the extractive industry is important to optimize resource utilization, and to ensure that the outcomes from natural resources exploitation contribute to the sustainable development of the country.

Simultaneously obtaining higher economic revenues and better social impacts is not a simple task. It may be impeded by several practical and organizational obstacles that favour present gains over sustainable development.

Governance can be improved with the right legal, institutional and administrative measures, and by applying certain best practices. These must take the form of multiple reforms carried out judiciously.

The key question addressed by this report is: how can we improve the governance of the extractive industry in Tunisia?

The report is organized as follows:

Chapter 1 addresses the legal framework governing the extractive industry in Tunisia. An analysis of the different legal texts and their specific details helps contextualize the status quo and put it into perspective.
Chapter 2 addresses the institutional and organizational frameworks of the sector. It presents the main entities that shape public interventions and strategies, and analyses the governance of the sector from this perspective.
Chapter 3 gives a broad but important overview of the current health of the sector in Tunisia, and showcases the importance of natural resources and the need for better governance, with a focus on sustainable development.
Chapter 4 deals with the challenges and opportunities in line with the sustainable development of the sector and the country. It uses different approaches and tools to enhance the structures and regulations, and the governance of the whole sector along the decision chain.

This publication was written and submitted as part of my graduation requirements at the National School of Administration of Tunis in 2020.

Python basics

2018-08-12T00:00:00+01:00

Using Python as a calculator

After installing Python, you can open a console or the command prompt and type python: a Python interpreter opens. For now, we use this notebook.
Let's try a few simple Python commands.

2+3 # A comment

20 + 11 * 3

(20 - 11) / 4

Whole numbers (such as 2, 3 and 11) are of type int, whereas decimal numbers (such as 10.0 and 3.14) are of type float.

19 / 3 # Division (/) always returns a number of type 'float'

19 // 3 # The (//) operator performs floor division

19 % 3  # The (%) operator returns the remainder of the floor division.
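The three operators are tied together by a simple identity, which the following short sketch checks for the same numbers (the variable names are illustrative):

```python
quotient = 19 // 3   # 6
remainder = 19 % 3   # 1

# The two results always satisfy: dividend == divisor * quotient + remainder
assert 19 == 3 * quotient + remainder

# divmod() returns both values at once.
print(divmod(19, 3))  # (6, 1)
print(type(19 / 3))   # <class 'float'>
```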

Powers (x^y) can be computed with the ** operator

3 ** 2
# 11 ** 12. # operations with mixed operand types give a floating-point result

The equals sign ( = ) assigns a value to a variable

hauteur = 7
base = 9
aire_du_triangle = (hauteur * base) / 2
aire_du_triangl    # let's talk about the errors shown in the output

Strings

Strings can be written in several ways:

'Bonjour' # single quotes
"Hello" # double quotes

'de l\'art' # use \' to escape the quote
# "de l'art" # or use double quotes

'"Le petit chat est mort. ", dit Agnès'
# "\"Le petit chat est mort. \", dit Agnès"

The print() function displays strings in a more readable way.

'"L\'art !" dit Michel.'
# print ('"L\'art !" dit Michel.')
# p = 'Première ligne.\nDeuxième ligne.' # \n means new line
# p
# print (p)

Use raw strings, by prefixing the string with an r, to prevent characters preceded by a backslash from being interpreted as special.

print('C:\Documents\nom')
# print(r'C:\Documents\nom') # the quotes are preceded by (r)

Use triple quotes, '''abc''' or """xyz""", to write strings that span several lines.

Ending a line with \ prevents the line break from being included in the string

print("""\
Definitions:
     -dictionary             A data structure that maps keys to values
     -function               A sequence of statements that returns a value to its caller
""")

The + operator joins (concatenates) several strings. The * operator repeats strings.

'OUI ' * 3 + 'Hurrah!'
# prefix = 'Hello Wo'
# prefix + 'rld!'

Characters can be accessed by their position. In string indexing, the first character is at position 0.

phrase = "Je m'appelle Brian"
phrase[0]
# phrase[15]

To count from the right, we use negative indices (starting at -1).

phrase[-1]
# phrase[-3]
# phrase[-18]

To get a substring:

phrase[0:2] # characters from position 0 (included) to 2 (excluded)
# phrase[13:18]
# phrase[:12]
# phrase[12:] # s[:i] + s[i:] = s
# phrase[-5:]
# phrase[20] # index too large (out of range)
# phrase[5:20] # out-of-range indices are handled silently when used in slices
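The slicing rules listed above can be checked directly. This short sketch reuses the same string and verifies the identity s[:i] + s[i:] == s for every index:

```python
phrase = "Je m'appelle Brian"

# For any index i, the two halves of a slice rebuild the original string.
for i in range(len(phrase) + 1):
    assert phrase[:i] + phrase[i:] == phrase

print(phrase[13:18])  # Brian
print(phrase[5:20])   # appelle Brian (the out-of-range end is clipped)
```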

Strings are immutable: they cannot be modified.

phrase = "Je m'appelle Brian"
phrase[1] = 'j'
# phrase[13:] = 'Stewie'
# phrase[:13] + 'Stewie !'

The len() function returns the length of a string:

p = 'I have a dream'
len(p)

Lists

A sequence of comma-separated elements placed between square brackets. The elements of a list do not have to be of the same type.

premiers  = [2, 3, 5, 7, 11]
premiers

Lists can be indexed and sliced:

premiers[0]
# premiers[-1]
# premiers[2:10]

Slicing operations return a new list containing the requested elements.

premiers[:] # a copy of the list

Lists support the same kind of operations as strings.

premiers + [13, 17, 19]
# premiers * 2

The contents of a list can be changed: lists are mutable

nbres = [1, 2, 'c', 4]
# nbres[2] = 3
nbres

nbres.append(5) # method that adds an element at the end
nbres

nbres[2:4] = [10, 100] # slice assignment
# nbres[:] = [] # remove all the values
nbres

Lists can contain other lists:

a = [1, 2, 3]
b = ['f', 'r', 'a']
res = [a, b]
# res
# res[1]
# res[0][2]

A list comprehension builds a new list in which each element is the result of an operation applied to each element of another sequence, or creates a subsequence of the elements that satisfy a given condition.
It consists of square brackets containing an expression followed by a for clause, then zero or more for or if clauses.

[x for x in range(6) if x % 2 == 0]

[(x, y) for x in [1,2,3] for y in [3,1] if x != y]
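It can help to see each comprehension written out as ordinary loops. The sketch below is the long-hand equivalent of the two examples above; the list names `evens` and `pairs` are illustrative.

```python
# Equivalent of [x for x in range(6) if x % 2 == 0]
evens = []
for x in range(6):
    if x % 2 == 0:
        evens.append(x)

# Equivalent of [(x, y) for x in [1, 2, 3] for y in [3, 1] if x != y]
pairs = []
for x in [1, 2, 3]:
    for y in [3, 1]:
        if x != y:
            pairs.append((x, y))

print(evens)  # [0, 2, 4]
print(pairs)  # [(1, 3), (2, 3), (2, 1), (3, 1)]
```

The for clauses of a comprehension nest in the order they are written, from left to right.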

Control flow

The `if` statement:

x = 11
if x % 2 == 0:
    print ('foo')
    s = 'nombre pair'
else:    
    print ('bar')    # x % 2 != 0

x = 20
if x % 5 == 0 and x % 2 == 0:
    print ('foobar')
elif x % 5 == 0:
    print ('foo')
else:
    print ('bar')

The `for` statement:

It iterates over the elements of a sequence (a list, a string, etc.) in order.

pays = ['France', 'Canada', 'Belgique', 'Suisse']
for p in pays:
    print (p, len(p))

mot = '1 chat'
for c in mot:
    print (c)

for i in range(1, 4):
    print (i ** 2)

The `while` statement:

x = 1
while x <= 5:
    print (x, end=' ') # The (end) parameter suppresses the line break, or ends the output with another character
    x += 2 # x = x + 2

Functions:

The def keyword defines the function. It is followed by the name of the function and its parameters in parentheses.

def add(n):         # compute the sum of the integers from 1 to n
    somme = 0       # a local variable
    for i in range(1, n+1):
        somme += i
    return somme
add(3)
# add(10)

# somme # is not defined 'globally'
# somme = 100        # global variable
# somme

def fib(a, b, n):
    """ print a Fibonacci sequence starting from the terms a and b, up to n """
    while a < n:
        print(a, end=' ')
        a, b = b, a+b
fib(5, 8, 2000) # even functions without a return statement
                # return a value, albeit a rather boring one. This value is called None
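To observe the None value directly, the call can be assigned to a variable. A minimal sketch, reusing the same function with a smaller bound (the name `result` is illustrative):

```python
def fib(a, b, n):
    """Print a Fibonacci-like sequence starting from a and b, up to n."""
    while a < n:
        print(a, end=' ')
        a, b = b, a + b

result = fib(5, 8, 100)  # prints: 5 8 13 21 34 55 89
print()
print(result is None)    # True
```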

Data structures

Tuples

A sequence of comma-separated elements (enclosed in parentheses when necessary)

t = 'python', 3.5, 101
t
# t[2]

n = t,  'Guido', ('2.7', 2010), [0, 90] # nested tuples
n

n[1] = 'Guido van Rossum' # raises a TypeError: tuples are immutable
# n[3][1] = 100 # the list inside the tuple can still be modified
n

vide = () # create an empty tuple
# un = 'python', # a one-element tuple needs a trailing comma

Sets

An unordered collection with no duplicate elements.

notes = {18, 15, 14, 11 ,18, 14}
notes
# fruits = {'orange', 'raisin', 'pomme' ,'kiwi' , 'orange', 'pomme'}
# fruits

a = set('abccba')
a
# a & set('atta') # sets support other operations (unions, intersections, differences, etc.)

Dictionaries

Collections of key: value pairs (since Python 3.7, they keep insertion order). They are indexed by keys, which can be of any immutable type: strings, numbers and tuples (if they contain only immutable objects).
Keys must be unique within a dictionary.

d = {} # create an empty dictionary

# d = {'Marie':15, 'Jean':2, 'Victor':9}
# d['Marie']
# del d['Victor']
# d['Charles'] = 33
d

p = dict([('Anthony', 'a'), ('Guido', [0, 'g']), ('Marie', 'm')])
# p.keys()
# p.values()
# 'guido' in p
# 'Guido' in p

Creating dictionaries with dict comprehensions.

{x: x**3 for x in range(1, 6)}

Modules

A module is a file containing definitions and statements. The file name is the module name with the suffix .py.

import <module> # import the module into the symbol table

<module>.<function> # access its functions (or constants)

The names of a <module_1> can also be imported directly into the symbol table of the importing module (<module_2>). In that case, the name <module_1> itself is not defined inside <module_2>

from <module> import <function_1>, <function_2>, <function_n>

Or

from <module> import * # import all the names of the module (discouraged)
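As a concrete instance of the placeholders above, here is a short sketch using the standard math module (chosen only as an example):

```python
import math                 # the module name is bound; use math.<name>
print(math.sqrt(16))        # 4.0

from math import pi, floor  # the names are bound directly in the current namespace
print(floor(pi))            # 3
```

The first form keeps the origin of each name visible at the call site, which is one reason the star import is discouraged.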

Summary practice exercise

Define a function somme_impairs that takes a list of numbers as its argument and returns the sum of all the odd integers. For example:
```
somme_impairs([5, -13, 3]) = -5
somme_impairs([2, 1, 1.01]) = 1
```
Write a program that takes a value entered by the user (1 <= N <= 100) and displays one of the following messages:

 - "Your value X is larger. Try again!"   # if the entered value is larger than the program's value
 - "Your value X is smaller. Try again!"  # if the entered value is smaller than the program's value
 - "Well done! It was indeed X"           # if the entered value is correct

where X is the value entered by the user.
Consider using the randint function from the random module to generate a random integer.

from random import randint
n = randint(1, 100)

For more details on any object, function or module, use: help(<name>)
