<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://meherbejaoui.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://meherbejaoui.com/" rel="alternate" type="text/html" /><updated>2026-04-12T08:04:09+01:00</updated><id>https://meherbejaoui.com/feed.xml</id><title type="html">Meher Bejaoui’s Blog</title><subtitle>Personal and educational blog about Python language, pandas library, matplotlib and other scientific libraries - by Meher Béjaoui.</subtitle><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><entry><title type="html">Multiple Linear Regression Analysis of Temperature Data in Albany and Sacramento</title><link href="https://meherbejaoui.com/python/multiregression-analysis-of-temperature-data-in-albany-and-sacramento/" rel="alternate" type="text/html" title="Multiple Linear Regression Analysis of Temperature Data in Albany and Sacramento" /><published>2023-12-29T00:00:00+01:00</published><updated>2023-12-29T00:00:00+01:00</updated><id>https://meherbejaoui.com/python/multiregression-analysis-of-temperature-data-in-albany-and-sacramento</id><content type="html" xml:base="https://meherbejaoui.com/python/multiregression-analysis-of-temperature-data-in-albany-and-sacramento/"><![CDATA[<ul>
  <li><a href="#introduction">Introduction</a></li>
  <li><a href="#exploratory-data-analysis-understanding-the-data">Exploratory data analysis: Understanding the data</a></li>
  <li><a href="#multiple-linear-regression">Multiple linear regression</a></li>
</ul>

<hr />
<h2 id="introduction">Introduction</h2>
<p>Weather patterns have long interested scientists, researchers and the general public, and concern about the effects of global warming and climate change on weather and the environment has grown in recent years. One area of particular interest is the relationship between temperature and other meteorological variables, such as humidity, precipitation, and wind speed. Understanding this relationship helps us characterize climate patterns and predict future weather.
This study investigates the relationship between temperature and the other meteorological variables at two locations in the United States, and explores how this relationship behaves over a specific time range.</p>

<p>The dataset pertains to two prominent cities located on the opposite coasts of the United States: Albany, the capital city of New York State, and Sacramento, the capital city of California State. The data were respectively collected from Albany International Airport and Sacramento Metropolitan Airport.</p>

<p>The data was extracted from the <a href="https://www.ncei.noaa.gov/cdo-web/datasets">National Oceanic and Atmospheric Administration</a> (NOAA) website on 5 May 2023 and is a subset of the Local Climatological Data (LCD) dataset. The original data ranges back to the 1940s and 1970s for the two selected stations, but for the purpose of this report, we will be analyzing measurements from January 1st, 2000, to December 31st, 2022.</p>

<p>The data was stored in two separate <code class="language-plaintext highlighter-rouge">csv</code> files. Note that the daily humidity values for Sacramento are only available from January 1st, 2005. The data was parsed, cleaned and lightly pre-processed in Excel.<br />
The variables use the following units of measurement: temperature is expressed in degrees Celsius, wind speed in meters per second (m/s), precipitation in millimeters (mm), and humidity in percent (%).</p>
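<p>As a rough illustration of the kind of unit handling involved, the sketch below converts readings to the metric units listed above. It assumes, purely for the example, that the raw readings arrive in Fahrenheit, miles per hour and inches; the column names <code class="language-plaintext highlighter-rouge">temp_f</code>, <code class="language-plaintext highlighter-rouge">wind_mph</code> and <code class="language-plaintext highlighter-rouge">precip_in</code> and the values are made up, and the actual pre-processing for this study was done in Excel.</p>

```python
import pandas as pd

# Hypothetical raw readings in imperial units (made-up values for illustration).
raw = pd.DataFrame({
    "temp_f": [51.0, 43.0, 50.0],
    "wind_mph": [15.3, 3.3, 8.9],
    "precip_in": [0.41, 0.00, 0.64],
})

# Convert to the metric units used in this report, rounded to two decimals.
metric = pd.DataFrame({
    "averagetemperature": ((raw["temp_f"] - 32) * 5 / 9).round(2),  # degrees Celsius
    "averagewindspeed": (raw["wind_mph"] * 0.44704).round(2),       # meters per second
    "precipitation": (raw["precip_in"] * 25.4).round(2),            # millimeters
})
print(metric)
```

<p>Doing the conversion in code rather than in a spreadsheet keeps the step visible and repeatable, which fits the reproducibility goal of the report.</p>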

<p>The study was conducted using Python 3 and Jupyter Notebook as the programming and development environment. To ensure clarity and reproducibility, the report includes the code and its corresponding outputs. This approach presents the analysis transparently, enabling readers to follow the methodology and reproduce the results beyond the final report.</p>

<h2 id="exploratory-data-analysis-understanding-the-data">Exploratory data analysis: Understanding the data</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Importing libraries
</span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">matplotlib.dates</span> <span class="kn">import</span> <span class="n">MonthLocator</span><span class="p">,</span> <span class="n">DateFormatter</span>
<span class="kn">import</span> <span class="nn">statsmodels.api</span> <span class="k">as</span> <span class="n">sm</span>

</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Reading the datasets
</span><span class="n">albany</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'albany.csv'</span><span class="p">)</span>
<span class="n">sacramento</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'sacramento.csv'</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Datetime formatting
</span><span class="n">sacramento</span><span class="p">[</span><span class="s">'date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">sacramento</span><span class="p">[</span><span class="s">'date'</span><span class="p">],</span> <span class="nb">format</span><span class="o">=</span><span class="s">'%d/%m/%Y'</span><span class="p">)</span>
<span class="n">albany</span><span class="p">[</span><span class="s">'date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">albany</span><span class="p">[</span><span class="s">'date'</span><span class="p">],</span> <span class="nb">format</span><span class="o">=</span><span class="s">'%d/%m/%Y'</span><span class="p">)</span>
</code></pre></div></div>
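<p>The explicit day/month/year format above matters because a date such as 05/03/2011 is ambiguous without it. The following self-contained sketch shows the same parsing on a few made-up rows, followed by a filter that keeps only the study period (January 1st, 2000 to December 31st, 2022). The sample values and the names <code class="language-plaintext highlighter-rouge">csv_text</code> and <code class="language-plaintext highlighter-rouge">study</code> are assumptions for illustration, not part of the original notebook.</p>

```python
import io
import pandas as pd

# Made-up rows standing in for one of the CSV files; dates are day/month/year.
csv_text = """date,averagetemperature
31/12/1999,8.5
01/01/2000,9.1
05/03/2011,12.4
31/12/2022,13.3
01/01/2023,7.9
"""
df = pd.read_csv(io.StringIO(csv_text))

# An explicit format prevents 05/03/2011 from being read as May 3rd.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

# Keep only the study period (both endpoints included).
in_period = df["date"].between("2000-01-01", "2022-12-31")
study = df.loc[in_period].reset_index(drop=True)
print(study)
```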

<p>The <code class="language-plaintext highlighter-rouge">tail()</code> method in the code below retrieves the last rows of the Sacramento DataFrame, which allows a quick check of the end of the DataFrame. Together with the <code class="language-plaintext highlighter-rouge">shape</code> attribute, it confirms the presence of 8334 rows and 7 columns, which include the date variable. The row index starts at 0, so the last row is labelled 8333.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span> <span class="p">(</span><span class="n">sacramento</span><span class="p">.</span><span class="n">tail</span><span class="p">(),</span> <span class="s">"</span><span class="se">\n</span><span class="s"> The size of the DataFrame is"</span><span class="p">,</span> <span class="n">sacramento</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>           date  averagetemperature  averagewindspeed  maximumtemperature  \
8329 2022-12-27               10.56              6.84               12.78   
8330 2022-12-28                6.11              1.48               10.00   
8331 2022-12-29               10.00              3.98               12.22   
8332 2022-12-30               13.89              8.18               16.67   
8333 2022-12-31               13.33              7.82               15.56   

      minimumtemperature  precipitation  humidity  
8329                8.33          10.41     85.33  
8330                2.22           0.00     90.75  
8331                7.22          16.26     85.04  
8332               11.11           1.52     87.71  
8333               10.56          47.75     82.71   
 The size of the DataFrame is (8334, 7)
</code></pre></div></div>

<hr />
<p>We proceed to do the same with the Albany DataFrame, and confirm the presence of 8375 rows and 7 columns.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span> <span class="p">(</span><span class="n">albany</span><span class="p">.</span><span class="n">tail</span><span class="p">(),</span> <span class="s">"</span><span class="se">\n</span><span class="s"> The size of the DataFrame is"</span><span class="p">,</span> <span class="n">albany</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>           date  averagetemperature  averagewindspeed  maximumtemperature  \
8370 2022-12-27               -2.22              2.59                1.11   
8371 2022-12-28                0.00              3.04                5.00   
8372 2022-12-29                1.67              2.59                8.89   
8373 2022-12-30               10.00              3.53               13.89   
8374 2022-12-31                9.44              3.80               11.67   

      minimumtemperature  precipitation  humidity  
8370               -5.56           0.00     56.88  
8371               -5.56           0.00     63.63  
8372               -5.56           0.00     61.33  
8373                6.11           0.00     52.79  
8374                6.67           2.03     81.17   
 The size of the DataFrame is (8375, 7)
</code></pre></div></div>

<hr />
<p>In the following, the <code class="language-plaintext highlighter-rouge">describe()</code> method, particularly useful for exploratory data analysis, allows for a comprehensive understanding of the data’s summary statistics at a glance. These statistics provide insights into the central tendency, variability, and distribution of the data in each column of the DataFrame.</p>

<h3 id="sacramento">Sacramento</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sacramento</span><span class="p">.</span><span class="n">describe</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>averagetemperature</th>
      <th>averagewindspeed</th>
      <th>maximumtemperature</th>
      <th>minimumtemperature</th>
      <th>precipitation</th>
      <th>humidity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>count</th>
      <td>8334.000000</td>
      <td>8334.000000</td>
      <td>8334.000000</td>
      <td>8334.000000</td>
      <td>8334.000000</td>
      <td>6572.000000</td>
    </tr>
    <tr>
      <th>mean</th>
      <td>16.758523</td>
      <td>3.378572</td>
      <td>24.091746</td>
      <td>9.145821</td>
      <td>1.173276</td>
      <td>63.278203</td>
    </tr>
    <tr>
      <th>std</th>
      <td>6.681999</td>
      <td>1.715114</td>
      <td>8.657389</td>
      <td>5.392560</td>
      <td>4.735319</td>
      <td>16.252343</td>
    </tr>
    <tr>
      <th>min</th>
      <td>0.560000</td>
      <td>0.040000</td>
      <td>3.890000</td>
      <td>-11.110000</td>
      <td>0.000000</td>
      <td>16.000000</td>
    </tr>
    <tr>
      <th>25%</th>
      <td>11.110000</td>
      <td>2.100000</td>
      <td>16.670000</td>
      <td>5.000000</td>
      <td>0.000000</td>
      <td>51.040000</td>
    </tr>
    <tr>
      <th>50%</th>
      <td>16.670000</td>
      <td>3.170000</td>
      <td>23.890000</td>
      <td>9.440000</td>
      <td>0.000000</td>
      <td>62.130000</td>
    </tr>
    <tr>
      <th>75%</th>
      <td>22.780000</td>
      <td>4.430000</td>
      <td>31.670000</td>
      <td>13.330000</td>
      <td>0.000000</td>
      <td>76.250000</td>
    </tr>
    <tr>
      <th>max</th>
      <td>35.560000</td>
      <td>13.500000</td>
      <td>53.890000</td>
      <td>27.220000</td>
      <td>104.650000</td>
      <td>100.000000</td>
    </tr>
  </tbody>
</table>
</div>

<p>In Sacramento, California, the mean daily average temperature over the study period is around 16.76 degrees Celsius, with a standard deviation of approximately 6.68. The temperature range is substantial, indicating marked fluctuations in weather conditions. Notably, the highest recorded maximum temperature of 53.89 degrees Celsius is dated March 19, 2018, while the lowest minimum temperature of -11.11 degrees Celsius is dated February 11, 2004. Both values look extreme for the area, the March maximum especially, so they are worth verifying against the source records. <br />
Likewise, the average daily precipitation in Sacramento is approximately 1.17 mm, with a standard deviation of about 4.74 mm. This aligns with the prevailing climate in California, characterized by hot, arid summers and short, cold, wet winters. Precipitation was below the overall average of 1.17 mm on 7368 of the 8334 days in the study period, while the daily average temperature exceeded the overall average of 16.76 degrees Celsius on 4028 days.<br />
The descriptive measures of central tendency and variability correspond to the anticipated weather patterns in the area.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">max_temp_index</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">[</span><span class="s">'maximumtemperature'</span><span class="p">].</span><span class="n">idxmax</span><span class="p">()</span>
<span class="n">date_max_temp</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">max_temp_index</span><span class="p">,</span> <span class="s">'date'</span><span class="p">]</span>
<span class="n">min_temp_index</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">[</span><span class="s">'minimumtemperature'</span><span class="p">].</span><span class="n">idxmin</span><span class="p">()</span>
<span class="n">date_min_temp</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">min_temp_index</span><span class="p">,</span> <span class="s">'date'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Date with the highest maximum temperature:"</span><span class="p">,</span> <span class="n">date_max_temp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Date with the lowest minimum temperature:"</span><span class="p">,</span> <span class="n">date_min_temp</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Date with the highest maximum temperature: 2018-03-19 00:00:00
Date with the lowest minimum temperature: 2004-02-11 00:00:00
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">count_below_threshold</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">[</span><span class="n">sacramento</span><span class="p">[</span><span class="s">'precipitation'</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mf">1.17</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Number of days with precipitations below the average precipitation amount of 1.17 mm is :"</span><span class="p">,</span> <span class="n">count_below_threshold</span><span class="p">)</span>
<span class="n">count_exceeding_threshold</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">[</span><span class="n">sacramento</span><span class="p">[</span><span class="s">'averagetemperature'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mf">16.76</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Number of days with daily average temperature exceeding the overall average temperature of 16.76 C is :"</span><span class="p">,</span> <span class="n">count_exceeding_threshold</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Number of days with precipitations below the average precipitation amount of 1.17 mm is : 7368
Number of days with daily average temperature exceeding the overall average temperature of 16.76 C is : 4028
</code></pre></div></div>
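<p>The thresholds in the cell above are typed in by hand after rounding. A possible alternative, sketched here on a small made-up sample rather than on the real data, is to derive each threshold from the column mean so that the number and the message cannot drift apart. The names <code class="language-plaintext highlighter-rouge">sample</code>, <code class="language-plaintext highlighter-rouge">days_below_mean_precip</code> and <code class="language-plaintext highlighter-rouge">days_above_mean_temp</code> are illustrative only.</p>

```python
import pandas as pd

# Small made-up sample; the column names mirror those used in this post.
sample = pd.DataFrame({
    "precipitation": [0.0, 0.0, 0.0, 1.0, 9.0],
    "averagetemperature": [10.0, 12.0, 14.0, 20.0, 24.0],
})

# Derive each threshold from the data instead of typing it in by hand.
mean_precip = sample["precipitation"].mean()
mean_temp = sample["averagetemperature"].mean()

# A boolean mask sums to the number of True entries.
days_below_mean_precip = int(sample["precipitation"].lt(mean_precip).sum())
days_above_mean_temp = int(sample["averagetemperature"].gt(mean_temp).sum())

print(f"Mean precipitation: {mean_precip:.2f} mm, days below it: {days_below_mean_precip}")
print(f"Mean temperature: {mean_temp:.2f} C, days above it: {days_above_mean_temp}")
```

<p>Because the unrounded mean is used, counts obtained this way can differ slightly from counts based on a rounded threshold when observations lie very close to the mean.</p>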

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">columns_to_include</span> <span class="o">=</span> <span class="p">[</span><span class="s">'averagetemperature'</span><span class="p">,</span> <span class="s">'averagewindspeed'</span><span class="p">,</span> <span class="s">'precipitation'</span><span class="p">,</span> <span class="s">'humidity'</span><span class="p">]</span>
<span class="n">subset_sacramento</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">[</span><span class="n">columns_to_include</span><span class="p">]</span>
<span class="c1"># pair plots between numerical columns
</span><span class="n">sns</span><span class="p">.</span><span class="n">pairplot</span><span class="p">(</span><span class="n">subset_sacramento</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/assets/temperatures/output_15_0.png" alt="pair plots grid figure of selected variables for Sacramento" /></p>

<p>The grid figure above shows the pair plots, which provide insights into the relationships, distributions, and interactions among the selected variables. These visualizations give a first impression of how each variable may relate to the target temperature variable. To keep the plots readable, the minimum and maximum temperatures were left out of the pair plots.</p>

<p>An evident observation is the strong negative correlation between humidity and average temperature, with a limited number of noticeable outliers. In contrast, the relationship between precipitation and temperature does not exhibit a clear pattern, but rather displays more outlier values. Notably, low precipitation values are observed across various temperature ranges, whereas higher precipitation values tend to occur within a specific temperature range.<br />
Furthermore, the influence of wind speed on temperature is not evident for lower wind speed values. However, as the wind speed increases, the temperature tends to stabilize within a specific temperature range. This finding suggests that wind speed may serve as a more reliable predictor of temperature within this specific range.</p>

<p>Overall, the pair plots provide valuable insights into the relationships and patterns among the variables, shedding light on their potential impact on the studied temperature variable.</p>

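<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Restrict the correlation matrix to numeric columns (skips the datetime 'date' column)
</span><span class="n">correlation_matrix</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">.</span><span class="n">corr</span><span class="p">(</span><span class="n">numeric_only</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>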
<span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">correlation_matrix</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s">"coolwarm"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Sacramento Correlation Matrix Heatmap"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/assets/temperatures/output_17_0.png" alt="heatmap figure for correlation coefficients for Sacramento" /></p>

<p>The heatmap figure provides a visual representation of the correlation coefficients between pairs of variables. Each cell in the heatmap represents the correlation coefficient, with color gradients indicating the magnitude and direction of the correlation. Warmer colors, such as shades of red, indicate a positive correlation, while cooler colors, such as shades of blue, indicate a negative correlation.<br />
The heatmap reinforces the observations made from the pair plots mentioned earlier. It visually confirms the relationships and patterns observed between variables.<br />
Furthermore, when building our model, we include only the minimum temperature and exclude the maximum temperature. The two variables carry largely redundant information about the average temperature, so keeping both could introduce multicollinearity.</p>
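<p>One common way to quantify the multicollinearity concern is the variance inflation factor (VIF), defined for each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on the others. The sketch below is not part of the original analysis: it uses synthetic data in which the maximum temperature closely tracks the minimum temperature, and a hypothetical helper <code class="language-plaintext highlighter-rouge">vif_table</code> written with NumPy only.</p>

```python
import numpy as np
import pandas as pd

def vif_table(frame):
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    values = {}
    for col in frame.columns:
        y = frame[col].to_numpy(dtype=float)
        others = frame.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(frame)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        values[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(values, name="VIF")

# Synthetic data (not the NOAA records): max and min temperature move together.
rng = np.random.default_rng(0)
minimum = rng.normal(9.0, 5.0, size=500)
synthetic = pd.DataFrame({
    "minimumtemperature": minimum,
    "maximumtemperature": minimum + 15.0 + rng.normal(0.0, 1.0, size=500),
    "averagewindspeed": rng.normal(3.4, 1.7, size=500),
})

# VIF with both temperature columns, then with the maximum temperature dropped.
print(vif_table(synthetic).round(2))
print(vif_table(synthetic.drop(columns="maximumtemperature")).round(2))
```

<p>In this synthetic setting the two temperature columns receive large VIF values when both are present, and the remaining columns fall back close to 1 once the maximum temperature is dropped. The statsmodels package also provides a <code class="language-plaintext highlighter-rouge">variance_inflation_factor</code> function in <code class="language-plaintext highlighter-rouge">statsmodels.stats.outliers_influence</code> that could be used instead of the hand-written helper.</p>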

<h3 id="albany">Albany</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">albany</span><span class="p">.</span><span class="n">describe</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>averagetemperature</th>
      <th>averagewindspeed</th>
      <th>maximumtemperature</th>
      <th>minimumtemperature</th>
      <th>precipitation</th>
      <th>humidity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>count</th>
      <td>8330.000000</td>
      <td>8330.000000</td>
      <td>8330.000000</td>
      <td>8330.000000</td>
      <td>8331.000000</td>
      <td>8375.000000</td>
    </tr>
    <tr>
      <th>mean</th>
      <td>9.916845</td>
      <td>3.286505</td>
      <td>15.062441</td>
      <td>4.494155</td>
      <td>2.862368</td>
      <td>68.329931</td>
    </tr>
    <tr>
      <th>std</th>
      <td>10.551006</td>
      <td>1.709147</td>
      <td>11.239033</td>
      <td>10.239078</td>
      <td>7.307641</td>
      <td>12.755930</td>
    </tr>
    <tr>
      <th>min</th>
      <td>-21.670000</td>
      <td>0.000000</td>
      <td>-17.780000</td>
      <td>-26.670000</td>
      <td>0.000000</td>
      <td>24.710000</td>
    </tr>
    <tr>
      <th>25%</th>
      <td>1.670000</td>
      <td>2.010000</td>
      <td>5.560000</td>
      <td>-2.780000</td>
      <td>0.000000</td>
      <td>59.960000</td>
    </tr>
    <tr>
      <th>50%</th>
      <td>10.560000</td>
      <td>3.080000</td>
      <td>16.110000</td>
      <td>4.440000</td>
      <td>0.000000</td>
      <td>68.250000</td>
    </tr>
    <tr>
      <th>75%</th>
      <td>19.440000</td>
      <td>4.290000</td>
      <td>25.000000</td>
      <td>13.330000</td>
      <td>1.780000</td>
      <td>77.500000</td>
    </tr>
    <tr>
      <th>max</th>
      <td>31.110000</td>
      <td>10.590000</td>
      <td>37.220000</td>
      <td>25.000000</td>
      <td>119.130000</td>
      <td>100.000000</td>
    </tr>
  </tbody>
</table>
</div>

<p>In Albany, New York, the mean daily average temperature over the study period is around 9.92 degrees Celsius, with a standard deviation of approximately 10.55. The temperature range is about as substantial as in Sacramento (roughly 64 to 65 degrees Celsius between the lowest and highest values in each city), indicating marked fluctuations in weather conditions. Notably, the highest recorded temperature of 37.22 degrees Celsius occurred on July 21, 2011, while the lowest temperature of -26.67 degrees Celsius was recorded on January 24, 2005.
Likewise, the average daily precipitation in Albany is approximately 2.86 mm, with a standard deviation of about 7.31 mm.<br />
This aligns with the prevailing climate in New York, a humid continental climate with warm to hot summers and freezing, snowy winters. Precipitation was below the overall average of 2.86 mm on 6603 of the 8331 days with a precipitation record, while the daily average temperature exceeded the overall average of 9.92 degrees Celsius on 4318 days. The descriptive measures of central tendency and variability correspond to the anticipated weather patterns in the area.</p>

<p>Taken together, these findings suggest that precipitation in Sacramento was concentrated in fewer days with heavy amounts, whereas precipitation in Albany was spread more evenly over a larger number of days.</p>
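<p>The comparison of how precipitation is distributed across days can be made more explicit with two simple summaries: the fraction of completely dry days, and the share of total precipitation contributed by the wettest days. The sketch below is illustrative only. The helper <code class="language-plaintext highlighter-rouge">precipitation_concentration</code>, its <code class="language-plaintext highlighter-rouge">top_share</code> parameter and the two short series are made up and do not come from the NOAA data.</p>

```python
import pandas as pd

def precipitation_concentration(precip, top_share=0.10):
    """Summarise how concentrated precipitation is across days."""
    precip = precip.dropna()
    # Number of days that make up the wettest `top_share` of the record.
    n_top = max(1, int(round(len(precip) * top_share)))
    wettest = precip.nlargest(n_top)
    return {
        "dry_day_fraction": float(precip.eq(0).mean()),
        "top_days_share_of_total": float(wettest.sum() / precip.sum()),
    }

# Two made-up series with the same total (20 mm over 10 days).
concentrated = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 2, 18], dtype=float)
spread_out = pd.Series([2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=float)

print(precipitation_concentration(concentrated))
print(precipitation_concentration(spread_out))
```

<p>Applied to the <code class="language-plaintext highlighter-rouge">precipitation</code> column of each city, a higher dry-day fraction and a higher share for the wettest days would both point to more concentrated precipitation.</p>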

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">max_temp_index2</span> <span class="o">=</span> <span class="n">albany</span><span class="p">[</span><span class="s">'maximumtemperature'</span><span class="p">].</span><span class="n">idxmax</span><span class="p">()</span>
<span class="n">date_max_temp2</span> <span class="o">=</span> <span class="n">albany</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">max_temp_index2</span><span class="p">,</span> <span class="s">'date'</span><span class="p">]</span>
<span class="n">min_temp_index2</span> <span class="o">=</span> <span class="n">albany</span><span class="p">[</span><span class="s">'minimumtemperature'</span><span class="p">].</span><span class="n">idxmin</span><span class="p">()</span>
<span class="n">date_min_temp2</span> <span class="o">=</span> <span class="n">albany</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">min_temp_index2</span><span class="p">,</span> <span class="s">'date'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Date with the highest maximum temperature:"</span><span class="p">,</span> <span class="n">date_max_temp2</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Date with the lowest minimum temperature:"</span><span class="p">,</span> <span class="n">date_min_temp2</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Date with the highest maximum temperature: 2011-07-21 00:00:00
Date with the lowest minimum temperature: 2005-01-24 00:00:00
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">count_below_threshold</span> <span class="o">=</span> <span class="n">albany</span><span class="p">[</span><span class="n">albany</span><span class="p">[</span><span class="s">'precipitation'</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mf">2.86</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Number of days with precipitations below the average precipitation amount of 2.86 mm is :"</span><span class="p">,</span> <span class="n">count_below_threshold</span><span class="p">)</span>
<span class="n">count_exceeding_threshold</span> <span class="o">=</span> <span class="n">albany</span><span class="p">[</span><span class="n">albany</span><span class="p">[</span><span class="s">'averagetemperature'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mf">9.92</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Number of days with daily average temperature exceeding the overall average temperature of 9.92 C is :"</span><span class="p">,</span> <span class="n">count_exceeding_threshold</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Number of days with precipitations below the average precipitation amount of 2.86 mm is : 6603
Number of days with daily average temperature exceeding the overall average temperature of 9.92 C is : 4318
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">subset_albany</span> <span class="o">=</span> <span class="n">albany</span><span class="p">[</span><span class="n">columns_to_include</span><span class="p">]</span>
<span class="c1"># pair plots between numerical columns
</span><span class="n">sns</span><span class="p">.</span><span class="n">pairplot</span><span class="p">(</span><span class="n">subset_albany</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/assets/temperatures/output_24_0.png" alt="pair plots grid figure of selected variables for Albany" /></p>

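<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Restrict the correlation matrix to numeric columns (skips the datetime 'date' column)
</span><span class="n">correlation_matrix2</span> <span class="o">=</span> <span class="n">albany</span><span class="p">.</span><span class="n">corr</span><span class="p">(</span><span class="n">numeric_only</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>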
<span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">correlation_matrix2</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s">"coolwarm"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Albany Correlation Matrix Heatmap"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/assets/temperatures/output_25_0.png" alt="heatmap figure for correlation coefficients for Albany" /></p>

<p>The pair plots and heatmap for Albany, New York, reveal distinct patterns among the variables. Most notable is the prevalence of low correlation coefficients (cooler colors in the heatmap), indicating weak associations between most pairs of variables.<br />
The exception is temperature: the maximum and minimum temperatures exhibit a strong positive correlation with the average temperature, suggesting a consistent relationship between these variables. For humidity and windspeed, however, there are no discernible patterns or clear correlations with the average temperature. The precipitation variable also shows several outliers, indicating days with extreme or unusual precipitation.</p>
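<p>One way to probe whether a weak Pearson coefficient hides a monotonic but non-linear relationship is to compare it with Spearman's rank correlation. The sketch below is only an illustration: it uses a small synthetic series (the arrays <code>x</code> and <code>y</code> and both helper functions are assumptions, not part of the NOAA analysis) and relies on NumPy alone.</p>

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman rank correlation: Pearson applied to the ranks (no ties assumed)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)

# Hypothetical data: y grows exponentially with x, so the link is
# perfectly monotonic but far from linear.
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 2.0)

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

<p>A large gap between the two coefficients hints at a relationship that a linear measure understates. When both are low, as seems to be the case for humidity and windspeed here, the association is probably weak in either sense.</p>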

<p>The absence of clear linear patterns between humidity, precipitation, windspeed, and average temperature in Albany, New York, may be explained by the local climate, which is classified as humid continental, with warm to hot summers and freezing, snowy winters.<br />
Within this climate category, the intricate interplay among various atmospheric conditions, such as prevailing winds and moisture sources, gives rise to diverse and dynamic weather patterns. The variability in temperature, humidity, precipitation, and windspeed is influenced by numerous factors and regional weather phenomena, including blizzards. Furthermore, the seasonal temperature extremes and the potential impact of the proximity to large bodies of water, such as the Atlantic Ocean, introduce additional variability and non-linear relationships among the variables under study.</p>

<p>Considering the distinct characteristics of the locality, it is reasonable to expect that the relationships may not conform to a simple linear pattern. The complex interplay of factors such as temperature inversions, air masses, and local topography contributes to the observed complexity in these relationships.</p>

<h2 id="multiple-linear-regression">Multiple linear regression</h2>
<h3 id="sacramento-1">Sacramento</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sacramento</span><span class="p">.</span><span class="n">isnull</span><span class="p">().</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date                     0
averagetemperature       0
averagewindspeed         0
maximumtemperature       0
minimumtemperature       0
precipitation            0
humidity              1762
dtype: int64
</code></pre></div></div>

<hr />
<p>Prior to constructing the linear regression model, we examine the dataset for missing values and find 1762 missing entries, all in the humidity variable. That is roughly 21% of the 8334 rows, so instead of discarding these data points we impute the missing values with the mean of the observed humidity readings.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mean_humidity</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">[</span><span class="s">'humidity'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span>
<span class="n">sacramento</span><span class="p">[</span><span class="s">'humidity'</span><span class="p">]</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">[</span><span class="s">'humidity'</span><span class="p">].</span><span class="n">fillna</span><span class="p">(</span><span class="n">mean_humidity</span><span class="p">)</span>
</code></pre></div></div>
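<p>Mean imputation keeps the column mean unchanged but shrinks its spread, which is worth keeping in mind when a sizeable share of the values is filled in. The following is a minimal sketch on made-up humidity readings (the frame <code>df</code> and the <code>humidity_missing</code> flag are hypothetical, not part of the original notebook). It also assigns the result back to the column rather than calling <code>fillna</code> in place on a column selection.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical humidity readings with gaps (not the NOAA values).
df = pd.DataFrame({"humidity": [62.0, np.nan, 71.0, 55.0, np.nan, 80.0, 67.0, np.nan]})

# Keep a flag so imputed rows can still be told apart later.
df["humidity_missing"] = df["humidity"].isna()

mean_humidity = df["humidity"].mean()   # NaN values are skipped by default
std_before = df["humidity"].std()

# Assign the result back to the column.
df["humidity"] = df["humidity"].fillna(mean_humidity)
std_after = df["humidity"].std()

print(f"mean kept at {df['humidity'].mean():.2f}")
print(f"std before: {std_before:.2f}, after: {std_after:.2f}")
```

<p>The standard deviation drops after imputation because every filled value sits exactly on the mean. The flag column makes it possible to refit the model later on the observed rows only and compare.</p>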

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Adding a constant column to the independent variables:
# in the context of using statsmodels for multiple linear regression, it represents the intercept term in the linear regression equation
</span><span class="n">X</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">[[</span><span class="s">'averagewindspeed'</span><span class="p">,</span> <span class="s">'minimumtemperature'</span><span class="p">,</span> <span class="s">'precipitation'</span><span class="p">,</span> <span class="s">'humidity'</span><span class="p">]]</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">sm</span><span class="p">.</span><span class="n">add_constant</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>

<span class="c1"># Defining the dependent variable
</span><span class="n">y</span> <span class="o">=</span> <span class="n">sacramento</span><span class="p">[</span><span class="s">'averagetemperature'</span><span class="p">]</span>

<span class="c1"># Fitting the OLS model
</span><span class="n">model</span> <span class="o">=</span> <span class="n">sm</span><span class="p">.</span><span class="n">OLS</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">X</span><span class="p">)</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">()</span>

<span class="c1"># Print the regression summary
</span><span class="k">print</span><span class="p">(</span><span class="n">results</span><span class="p">.</span><span class="n">summary</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                            OLS Regression Results                            
==============================================================================
Dep. Variable:     averagetemperature   R-squared:                       0.926
Model:                            OLS   Adj. R-squared:                  0.926
Method:                 Least Squares   F-statistic:                 2.601e+04
Date:                Sun, 18 Jun 2023   Prob (F-statistic):               0.00
Time:                        23:27:19   Log-Likelihood:                -16811.
No. Observations:                8334   AIC:                         3.363e+04
Df Residuals:                    8329   BIC:                         3.367e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P&gt;|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 16.2143      0.133    121.559      0.000      15.953      16.476
averagewindspeed      -0.5312      0.013    -40.981      0.000      -0.557      -0.506
minimumtemperature     1.0532      0.004    245.554      0.000       1.045       1.062
precipitation         -0.0711      0.005    -15.089      0.000      -0.080      -0.062
humidity              -0.1139      0.002    -69.456      0.000      -0.117      -0.111
==============================================================================
Omnibus:                      663.002   Durbin-Watson:                   0.920
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3431.099
Skew:                          -0.189   Prob(JB):                         0.00
Kurtosis:                       6.121   Cond. No.                         439.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
</code></pre></div></div>

<hr />
<p>The OLS Regression Results provide insights into the linear regression model that was fitted to the dataset. The R-squared value of 0.926 indicates that approximately 92.6% of the variability in the average temperature (dependent variable) is accounted for by the independent variables included in the model. This suggests a strong linear relationship between the independent variables (averagewindspeed, minimumtemperature, precipitation, humidity) and the average temperature. <br />
The F-statistic is 2.601e+04, with a remarkably low probability (p &lt; 0.001). This implies that the overall regression model is statistically significant, indicating that at least one of the independent variables exhibits a significant association with the average temperature. <br />
The coefficients represent the estimated impact of each independent variable on the average temperature while holding other variables constant. Here are the interpretations of the coefficients:</p>
<ul>
  <li>average windspeed: With each unit increase in average windspeed, the average temperature is estimated to decrease by approximately 0.5312 units.</li>
  <li>minimum temperature: With each unit increase in minimum temperature, the average temperature is estimated to increase by approximately 1.0532 units.</li>
  <li>precipitation: With each unit increase in precipitation, the average temperature is estimated to decrease by approximately 0.0711 units.</li>
  <li>humidity: With each unit increase in humidity, the average temperature is estimated to decrease by approximately 0.1139 units.</li>
</ul>

<p>The p-values associated with each coefficient are very low (p &lt; 0.001), indicating that all independent variables have a statistically significant relationship with the average temperature. <br />
The 95% confidence intervals provide a range within which the true value of each coefficient is likely to fall. For example, the confidence interval for the average windspeed coefficient is (-0.557, -0.506), suggesting that the true effect of average windspeed on average temperature lies within this range with 95% confidence.</p>
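<p>The interval in the table can be approximately reproduced from the coefficient and its standard error. The sketch below assumes a normal approximation to the t distribution, which is reasonable with 8329 residual degrees of freedom. The helper name <code>confidence_interval</code> is illustrative, and the standard error is the rounded value from the summary, so the match is only approximate.</p>

```python
from statistics import NormalDist

def confidence_interval(coef, std_err, level=0.95):
    """Approximate CI: coef +/- z * std_err (normal approximation to the t distribution)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return coef - z * std_err, coef + z * std_err

# Coefficient and standard error for averagewindspeed, copied from the summary table.
low, high = confidence_interval(-0.5312, 0.013)
print(f"({low:.3f}, {high:.3f})")  # prints (-0.557, -0.506)
```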

<p>Overall, the findings support the idea that the independent variables (averagewindspeed, minimumtemperature, precipitation, humidity) are significant predictors of the average temperature in the Sacramento dataset. The high R-squared value indicates a good fit, and the coefficients and p-values show that each independent variable has a statistically significant association with the average temperature.</p>
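<p>One figure in the summary that is not interpreted here is the Durbin-Watson statistic (0.920). Values near 2 are generally read as little first-order autocorrelation in the residuals, while values well below 2 point to positive autocorrelation. That is common in daily weather series and tends to make the reported standard errors optimistic. As a minimal sketch of what the statistic measures, the hypothetical helper below computes it with NumPy on synthetic residuals; the AR(1) series is an assumption for illustration only. statsmodels also provides <code>durbin_watson</code> in <code>statsmodels.stats.stattools</code>.</p>

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

rng = np.random.default_rng(0)

# Independent noise: the statistic should land close to 2.
white = rng.normal(size=5000)

# Hypothetical AR(1) residuals with coefficient 0.5: each value carries over
# half of the previous one, so neighbouring residuals are positively correlated.
ar1 = np.zeros(5000)
noise = rng.normal(size=5000)
for t in range(1, 5000):
    ar1[t] = 0.5 * ar1[t - 1] + noise[t]

print(f"white noise: {durbin_watson_stat(white):.2f}")
print(f"AR(1), rho=0.5: {durbin_watson_stat(ar1):.2f}")
```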

<p>The regression equation is:</p>

<center>
Sacramento temperature = 16.2143 - 0.5312 × averagewindspeed + 1.0532 × minimumtemperature - 0.0711 × precipitation - 0.1139 × humidity
</center>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Obtain the predicted values
</span><span class="n">y_pred</span> <span class="o">=</span> <span class="n">results</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>

<span class="c1"># Calculate the residuals
</span><span class="n">residuals</span> <span class="o">=</span> <span class="n">y</span> <span class="o">-</span> <span class="n">y_pred</span>

<span class="c1"># Plot the Residuals vs. Fitted values
</span><span class="n">sns</span><span class="p">.</span><span class="n">residplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">y_pred</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">residuals</span><span class="p">,</span> <span class="n">lowess</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">line_kws</span><span class="o">=</span><span class="p">{</span><span class="s">'color'</span><span class="p">:</span> <span class="s">'red'</span><span class="p">})</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Fitted Values'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Residuals'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Residuals vs. Fitted Values'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/assets/temperatures/output_34_0.png" alt="Residuals vs. Fitted figure for Sacramento" /></p>

<p>The (Residuals vs. Fitted) plot shows that the residuals are fairly randomly scattered around the 0 residual line. Also, the residuals form a seemingly horizontal band around the residual = 0 line, which suggests that the variances of the error terms can be considered equal.</p>

<p>The graph shows a number of outliers that deviate significantly from the general pattern of the residuals and could be investigated further. Because they are few relative to the 8334 observations, these points should not greatly affect the regression coefficients or the overall model fit.</p>
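<p>The visual check for equal variance can be complemented by an informal numeric one: sort the observations by fitted value, split them into equal-sized groups, and compare the residual spread across groups. The sketch below does this with NumPy on synthetic data. The function <code>spread_by_bin</code> and both residual series are hypothetical and only meant to show what a roughly constant versus a growing spread looks like.</p>

```python
import numpy as np

def spread_by_bin(fitted, residuals, n_bins=4):
    """Standard deviation of the residuals within equal-sized groups of fitted values."""
    order = np.argsort(fitted)
    groups = np.array_split(np.asarray(residuals)[order], n_bins)
    return [float(np.std(g, ddof=1)) for g in groups]

rng = np.random.default_rng(1)
fitted = rng.uniform(0.0, 30.0, size=4000)         # hypothetical fitted values

constant = rng.normal(scale=2.0, size=4000)         # same spread everywhere
growing = rng.normal(size=4000) * (0.2 * fitted)    # spread grows with the fitted value

print("constant variance:", np.round(spread_by_bin(fitted, constant), 2))
print("growing variance: ", np.round(spread_by_bin(fitted, growing), 2))
```

<p>With the actual model, the same function could be called on <code>y_pred</code> and <code>residuals</code>. Similar values across the groups would support the horizontal-band reading of the plot.</p>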

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Generate the normal QQ plot
</span><span class="n">sm</span><span class="p">.</span><span class="n">qqplot</span><span class="p">(</span><span class="n">residuals</span><span class="p">,</span> <span class="n">line</span><span class="o">=</span><span class="s">'s'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Normal QQ Plot of Residuals'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/assets/temperatures/output_36_0.png" alt="Normal Q-Q plot for Sacramento" /></p>

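<p>The (Normal Q-Q) plot shows whether the residuals are normally distributed. The relationship between the theoretical quantiles and the residuals is approximately linear for most points, so the error terms are close to normal in the central range. The tails are a different matter: the kurtosis of 6.121 and the Jarque-Bera probability of 0.00 in the summary indicate that the residuals are heavier-tailed than a Gaussian, so normality holds only approximately. The presence of some outliers is confirmed and pronounced here as well.</p>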
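<p>The summary table offers a numerical complement to the Q-Q plot. The Jarque-Bera statistic combines skewness and kurtosis into a single number that is close to 0 for normally distributed residuals (skew 0, kurtosis 3). As a worked example, the small helper below (its name is illustrative) recomputes the statistic from the rounded skew and kurtosis in the Sacramento summary. Because of that rounding it only approximately matches the reported 3431.099.</p>

```python
def jarque_bera(n, skew, kurtosis):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4), with K the (non-excess) kurtosis."""
    return n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)

# Values copied from the Sacramento summary table.
jb = jarque_bera(n=8334, skew=-0.189, kurtosis=6.121)
print(f"JB is about {jb:.0f}")
```

<p>Almost all of the value comes from the kurtosis term, which suggests that heavy tails rather than asymmetry drive the departure from normality in this model.</p>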

<h3 id="albany-1">Albany</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">albany</span><span class="p">.</span><span class="n">isnull</span><span class="p">().</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date                   0
averagetemperature    45
averagewindspeed      45
maximumtemperature    45
minimumtemperature    45
precipitation         44
humidity               0
dtype: int64
</code></pre></div></div>

<hr />
<p>As with Sacramento, we examine the dataset for missing values. Every variable except humidity has missing entries, at most 45 per column. Since these represent only a small portion of the dataset, we can discard the affected rows.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">albany</span><span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
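<p>Because missing values in different columns need not fall on the same rows, the per-column counts do not directly say how many rows <code>dropna</code> removes. A quick way to check before dropping is sketched below on a small made-up frame (the data and the names <code>df</code> and <code>clean</code> are hypothetical).</p>

```python
import numpy as np
import pandas as pd

# Hypothetical frame with scattered gaps (not the NOAA data).
df = pd.DataFrame({
    "averagetemperature": [10.0, np.nan, 12.5, 9.0, 11.0],
    "precipitation": [0.0, 1.2, np.nan, 0.4, 0.0],
    "humidity": [70.0, 65.0, 80.0, 75.0, 60.0],
})

# Count the rows that have a gap in any column.
rows_with_gaps = int(df.isna().any(axis=1).sum())
share = rows_with_gaps / len(df)
print(f"{rows_with_gaps} of {len(df)} rows ({share:.0%}) contain a missing value")

# Reassigning keeps the original frame available for comparison.
clean = df.dropna().reset_index(drop=True)
print(len(clean), "rows remain")
```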

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X2</span> <span class="o">=</span> <span class="n">albany</span><span class="p">[[</span><span class="s">'averagewindspeed'</span><span class="p">,</span> <span class="s">'minimumtemperature'</span><span class="p">,</span> <span class="s">'precipitation'</span><span class="p">,</span> <span class="s">'humidity'</span><span class="p">]]</span>
<span class="n">X2</span> <span class="o">=</span> <span class="n">sm</span><span class="p">.</span><span class="n">add_constant</span><span class="p">(</span><span class="n">X2</span><span class="p">)</span>

<span class="c1"># Defining the dependent variable
</span><span class="n">y2</span> <span class="o">=</span> <span class="n">albany</span><span class="p">[</span><span class="s">'averagetemperature'</span><span class="p">]</span>

<span class="c1"># Fitting the OLS model
</span><span class="n">model2</span> <span class="o">=</span> <span class="n">sm</span><span class="p">.</span><span class="n">OLS</span><span class="p">(</span><span class="n">y2</span><span class="p">,</span> <span class="n">X2</span><span class="p">)</span>
<span class="n">results2</span> <span class="o">=</span> <span class="n">model2</span><span class="p">.</span><span class="n">fit</span><span class="p">()</span>

<span class="c1"># Print the regression summary
</span><span class="k">print</span><span class="p">(</span><span class="n">results2</span><span class="p">.</span><span class="n">summary</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                            OLS Regression Results                            
==============================================================================
Dep. Variable:     averagetemperature   R-squared:                       0.972
Model:                            OLS   Adj. R-squared:                  0.972
Method:                 Least Squares   F-statistic:                 7.188e+04
Date:                Sun, 18 Jun 2023   Prob (F-statistic):               0.00
Time:                        23:27:28   Log-Likelihood:                -16575.
No. Observations:                8330   AIC:                         3.316e+04
Df Residuals:                    8325   BIC:                         3.320e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P&gt;|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 12.0641      0.134     89.777      0.000      11.801      12.328
averagewindspeed      -0.3560      0.012    -29.811      0.000      -0.379      -0.333
minimumtemperature     1.0241      0.002    519.140      0.000       1.020       1.028
precipitation         -0.0116      0.003     -3.895      0.000      -0.018      -0.006
humidity              -0.0812      0.002    -45.519      0.000      -0.085      -0.078
==============================================================================
Omnibus:                      805.064   Durbin-Watson:                   1.529
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1700.446
Skew:                           0.618   Prob(JB):                         0.00
Kurtosis:                       4.836   Cond. No.                         484.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
</code></pre></div></div>

<p>The R-squared value of 0.972 indicates that approximately 97.2% of the variability in the average temperature (dependent variable) can be accounted for by the independent variables included in the model. This suggests a strong relationship between the independent variables (averagewindspeed, minimumtemperature, precipitation, humidity) and the average temperature. <br />
The F-statistic is 7.188e+04, with a remarkably low probability (p &lt; 0.001). This implies that the overall regression model is highly statistically significant, indicating that at least one of the independent variables exhibits a significant association with the average temperature.</p>
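<p>The F-statistic and R-squared are two views of the same fit: for a model with an intercept, F is the explained variance per predictor divided by the unexplained variance per residual degree of freedom. The sketch below (the function name is illustrative) recomputes F from the rounded R-squared in the Albany summary. Since R-squared is rounded to three decimals, the result agrees with the reported 7.188e+04 only to within about one percent.</p>

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of an OLS fit with an intercept, from R-squared."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1.0 - r_squared) / df_resid)

# Rounded values from the Albany summary: R-squared 0.972, 8330 observations, 4 predictors.
f_albany = f_statistic(0.972, 8330, 4)
print(f"F is roughly {f_albany:.3e}")
```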

<p>The coefficients represent the estimated impact of each independent variable on the average temperature while holding other variables constant. Here are the interpretations of the coefficients:</p>

<ul>
  <li>average windspeed: With each unit increase in average windspeed, the average temperature is estimated to decrease by approximately 0.3560 units.</li>
  <li>minimum temperature: With each unit increase in minimum temperature, the average temperature is estimated to increase by approximately 1.0241 units.</li>
  <li>precipitation: With each unit increase in precipitation, the average temperature is estimated to decrease by approximately 0.0116 units.</li>
  <li>humidity: With each unit increase in humidity, the average temperature is estimated to decrease by approximately 0.0812 units.</li>
</ul>

<p>The p-values associated with each coefficient are very low (p &lt; 0.001), indicating that all independent variables have a statistically significant relationship with the average temperature.</p>
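<p>A p-value printed as 0.000 is a rounded value, not an exact zero. As a rough check on the weakest predictor, the sketch below derives the t-statistic and a two-sided p-value for precipitation from the table entries. It assumes a normal approximation (reasonable with thousands of observations) and uses the rounded standard error, so the t value differs slightly from the reported -3.895. The helper name <code>t_and_p</code> is illustrative.</p>

```python
from statistics import NormalDist

def t_and_p(coef, std_err):
    """t-statistic and two-sided p-value (normal approximation, fine for thousands of rows)."""
    t = coef / std_err
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p

# Precipitation row of the Albany summary (coefficient and rounded standard error).
t, p = t_and_p(-0.0116, 0.003)
print(f"t = {t:.2f}, p = {p:.5f}")
```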

<p>The 95% confidence intervals provide a range within which the true value of each coefficient is likely to fall. For example, the confidence interval for the average windspeed coefficient is (-0.379, -0.333), suggesting that the true effect of average windspeed on average temperature lies within this range with 95% confidence.</p>

<p>Overall, as with Sacramento, the model fits the Albany data well, and all four independent variables are statistically significant predictors of the average temperature.</p>

<p>The regression equation can be expressed as:</p>

<center>
Albany temperature = 12.0641 - 0.3560 × averagewindspeed + 1.0241 × minimumtemperature - 0.0116 × precipitation - 0.0812 × humidity
</center>
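<p>To make the equation concrete, the sketch below evaluates it for one made-up day and shows that raising a single input by one unit shifts the prediction by that input's coefficient. The coefficients are copied from the Albany summary. The input values and the <code>predict</code> helper are hypothetical, and the units are whatever the dataset uses.</p>

```python
# Coefficients copied from the Albany summary table.
ALBANY = {
    "const": 12.0641,
    "averagewindspeed": -0.3560,
    "minimumtemperature": 1.0241,
    "precipitation": -0.0116,
    "humidity": -0.0812,
}

def predict(coefs, **inputs):
    """Evaluate the fitted linear equation for one observation."""
    return coefs["const"] + sum(coefs[name] * value for name, value in inputs.items())

# Hypothetical day used only for illustration.
day = dict(averagewindspeed=5.0, minimumtemperature=10.0, precipitation=0.0, humidity=60.0)
base = predict(ALBANY, **day)

# Raise the windspeed by one unit and keep everything else fixed.
windier = predict(ALBANY, **{**day, "averagewindspeed": 6.0})
print(f"prediction: {base:.2f}, change per extra unit of wind: {windier - base:.4f}")
```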

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Obtain the predicted values
</span><span class="n">y_pred2</span> <span class="o">=</span> <span class="n">results2</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X2</span><span class="p">)</span>

<span class="c1"># Calculate the residuals
</span><span class="n">residuals2</span> <span class="o">=</span> <span class="n">y2</span> <span class="o">-</span> <span class="n">y_pred2</span>

<span class="c1"># Plot the Residuals vs. Fitted values
</span><span class="n">sns</span><span class="p">.</span><span class="n">residplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">y_pred2</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">residuals2</span><span class="p">,</span> <span class="n">lowess</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">line_kws</span><span class="o">=</span><span class="p">{</span><span class="s">'color'</span><span class="p">:</span> <span class="s">'red'</span><span class="p">})</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Fitted Values'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Residuals'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Residuals vs. Fitted Values'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/assets/temperatures/output_44_0.png" alt="Residuals vs. Fitted figure for Albany" /></p>

<p>In this case too, the (Residuals vs. Fitted) plot shows that the residuals are randomly scattered around the 0 residual line. The residuals form a horizontal band around the residual = 0 line, which suggests that the variances of the error terms can be considered equal.</p>

<p>The graph shows a number of outliers that deviate significantly from the general pattern of the residuals and could be investigated further. Because they are few relative to the 8330 observations, these points should not greatly affect the regression coefficients or the overall model fit.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Generate the normal QQ plot
</span><span class="n">sm</span><span class="p">.</span><span class="n">qqplot</span><span class="p">(</span><span class="n">residuals2</span><span class="p">,</span> <span class="n">line</span><span class="o">=</span><span class="s">'s'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Normal QQ Plot of Residuals'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/assets/temperatures/output_46_0.png" alt="Normal Q-Q plot for Albany" /></p>

<p>The (Normal Q-Q) plot shows that the relationship between the theoretical quantiles and the residuals is approximately linear for most points. However, the distribution is in fact <em>skewed right</em>, meaning that most of the data is concentrated on the left side with a long “tail” extending out to the right. The points depart upward from the straight red line as you follow the quantiles from left to right. The red line shows where the points would fall if the residuals were normally distributed. The upward trend of the points shows that the upper sample quantiles are greater than the theoretical quantiles, meaning that there is more data in the right tail than a Gaussian distribution would have.</p>
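<p>The right skew read from the plot is consistent with the positive Skew value (0.618) in the Albany summary. As a small illustration of how that number behaves, the sketch below computes the sample skewness with NumPy for a symmetric and a right-skewed synthetic sample. Both samples and the helper <code>sample_skewness</code> are assumptions for demonstration, not the model residuals.</p>

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: positive for a longer right tail."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

rng = np.random.default_rng(2)
symmetric = rng.normal(size=10000)
right_skewed = rng.exponential(size=10000) - 1.0   # hypothetical residuals with a long right tail

print(f"symmetric:    {sample_skewness(symmetric):+.2f}")
print(f"right-skewed: {sample_skewness(right_skewed):+.2f}")
```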

<hr />
<p>Overall, the findings derived from the linear regression analysis provide evidence supporting the existence of a relationship between temperature and meteorological variables in the two U.S. cities studied, Albany and Sacramento. Both cities exhibit statistically significant associations between average temperature and average windspeed, minimum temperature, precipitation, and humidity. <br />
By understanding and analyzing these relationships, we can enhance our understanding of weather patterns, contribute to climate studies, and improve weather predictions. These findings highlight the significance of considering multiple meteorological factors when examining temperature variations and provide valuable insights into the impacts of these variables on weather patterns in the United States.</p>]]></content><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><category term="python" /><category term="python" /><category term="pandas" /><category term="visualization" /><category term="sklearn" /><summary type="html"><![CDATA[Unlock climate insights with a data-driven exploration of Albany and Sacramento temperatures. Leveraging NOAA datasets and Python tools, delve into their distinct weather patterns through a multiregression lens]]></summary></entry><entry><title type="html">Problem Solving Using Computational Thinking - Course Review</title><link href="https://meherbejaoui.com/blog/problem-solving-using-computational-thinking-course-review/" rel="alternate" type="text/html" title="Problem Solving Using Computational Thinking - Course Review" /><published>2021-05-07T00:00:00+01:00</published><updated>2021-05-07T00:00:00+01:00</updated><id>https://meherbejaoui.com/blog/problem-solving-using-computational-thinking-course-review</id><content type="html" xml:base="https://meherbejaoui.com/blog/problem-solving-using-computational-thinking-course-review/"><![CDATA[<ul>
  <li><a href="#course-overview-and-structure">Course overview and structure</a></li>
  <li><a href="#course-review">Course review</a></li>
</ul>

<hr />
<p>For its 9th birthday, <strong>Coursera</strong> offered learners a collection of 9 courses to pick from and enrol in to earn a free certificate (the special offer was available through 30 April 2021). I chose <strong>Problem Solving Using Computational Thinking</strong> from the <strong>University of Michigan</strong>.</p>

<p>In this article, I will share some insights about the course, and what you can expect if you decide to take it on.</p>

<h2 id="course-overview-and-structure">Course overview and structure</h2>
<p>First, the course is taught entirely in English, but there are subtitles for other languages as well (currently French, Portuguese (European), Russian and Spanish).</p>

<p>In week 1, you will learn about the foundations of Computational Thinking from Associate Professor <em>Chris Quintana</em>, from the University of Michigan School of Education. Then, you will have the opportunity to see Computational Thinking through real world and hypothetical examples shared by three experts in weeks 2, 3 and 4.</p>

<p>These experts are, respectively, Associate Director <em>Mariana Carrasco-Teja</em> from the Michigan Institute for Computational Discovery and Engineering (airport surveillance and image analysis case study); Associate Professor <em>Rafael Meza</em> (epidemiology case study); and Instructional and Program Design Coordinator <em>Darin Stockdill</em> from the Center for Education Design, Evaluation, and Research (human trafficking case study).</p>

<p>The <strong>learning objectives</strong> are:</p>
<ul>
  <li>
    <p>To define Computational Thinking components including abstraction, problem identification, decomposition, pattern recognition, algorithms, and evaluating solutions.</p>
  </li>
  <li>
    <p>To recognize Computational Thinking concepts in practice through a series of real-world case examples.</p>
  </li>
  <li>
    <p>And to develop solutions through the application of Computational Thinking concepts to real world problems (peer-graded assignment).</p>
  </li>
</ul>

<p>The course is structured in 5 weeks, with the last being a peer-graded final project. Reviewing the learning material and completing the practice quizzes and graded quizzes should take around 2 hours per week for the first 3 weeks, and about 1 hour 15 minutes for the 4th week (excluding the time required for discussion prompts).</p>

<p>That will of course depend on your own pace and learning style, and you should always devote enough time to work through the material properly. As for the last week, I found it the most challenging, and it will likely take you longer than indicated. Plan for at least 3 hours of work in that week.</p>
<h2 id="course-review">Course review</h2>
<p>My overall remarks and opinions regarding the course are:</p>
<ul>
  <li>Videos are neither too long nor too short. They are just about the right length for you to follow and stay focused on each segment, take a break, and come back for another video.</li>
  <li>
    <p>Some parts would require very careful attention, and perhaps repeated reviewing of the material. That is because of their complexity for a non-specialized audience.</p>
  </li>
  <li>
    <p>The course introduces new concepts, ideas and technologies from a variety of fields and domains. It brings richness of content, and should broaden one’s knowledge beyond the computational thinking aspects. You learn different things in just one course.</p>
  </li>
  <li>
    <p>The course is well suited and appropriate for various skill levels. Even advanced learners can consolidate their knowledge and learn something new.</p>
  </li>
  <li>
    <p>There are enough practice quizzes and quizzes for a learner to test their understanding. However, some questions require the student to fill in their answers, and that might not always be the best method.</p>
  </li>
  <li>
    <p>Not a lot of in-video questions.</p>
  </li>
  <li>
    <p>The videos look scripted, with speeches prepared in advance. However, some videos are not as fluid or comprehensible. If necessary, you can use the subtitles.</p>
  </li>
  <li>
    <p>Estimated times for completion of quizzes are a bit off.</p>
  </li>
  <li>There are no reading materials or other resources.</li>
</ul>

<p>Note that the case study for the 4th week is optional. You will need to give your consent to be able to read and examine the course material. That is because the case study covers a rather delicate topic: the hypothetical implications of Computational Thinking for the issue of Human Trafficking.</p>

<hr />

<p>I have taken many courses from the University of Michigan. This course is true to their approach and methodology. It is well structured, with professional, high-quality videos and production.</p>

<p>The case studies are meticulously presented, and I think they bring the most value to the course. And when working on the last peer-graded assignment, you get the chance to apply and test your knowledge to the fullest.  <br />
Perhaps the one thing that can be improved is the quality of the discussion forums.</p>

<p>Overall, I do recommend taking <a href="https://www.coursera.org/learn/compthinking">Problem Solving Using Computational Thinking</a>, and investing the time to complete the course.</p>

<p>Happy learning everyone!</p>

<p><img src="/assets/may_2021/coursera_certificate_problem_solving_using_computational_thinking.png" alt="showing coursera certificate for problem solving using computational thinking" /></p>]]></content><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><category term="blog" /><category term="review" /><summary type="html"><![CDATA[Reviewing the course Problem Solving Using Computational Thinking from the University of Michigan on Coursera]]></summary></entry><entry><title type="html">K-Means clustering and similarity visualization of constitutions</title><link href="https://meherbejaoui.com/python/kmeans-clustering-and-similarity-visualization-of-constitutions/" rel="alternate" type="text/html" title="K-Means clustering and similarity visualization of constitutions" /><published>2021-04-30T00:00:00+01:00</published><updated>2021-04-30T00:00:00+01:00</updated><id>https://meherbejaoui.com/python/kmeans-clustering-and-similarity-visualization-of-constitutions</id><content type="html" xml:base="https://meherbejaoui.com/python/kmeans-clustering-and-similarity-visualization-of-constitutions/"><![CDATA[<ul>
  <li><a href="#introduction">Introduction</a></li>
  <li><a href="#text-processing-and-exploratory-analysis">Text processing and exploratory analysis</a></li>
  <li><a href="#k-means-clustering">K-Means clustering</a></li>
  <li><a href="#visualizing-text-corpus-similarity">Visualizing text corpus similarity</a></li>
</ul>

<hr />
<h2 id="introduction">Introduction</h2>
<p>Constitutions hold the fundamental principles and rules that constitute the legal basis of a country. They determine the system of government, and the relationships between branches and institutions.</p>

<p>When written, these documents can be quite unique and distinct in many aspects, such as length and legal terminology. However, they can also share some other features, since they tend to have similar purposes.</p>

<p>A textual analysis of such data can be useful. We are going to apply some techniques to compare and cluster various constitutions. This work examines whether the text of each constitution is indicative of the system of government it outlines.</p>

<p>To do so, we are going to use TF-IDF term weighting and K-Means clustering from scikit-learn. If you need a text analysis refresher, please check <a href="https://www.meherbejaoui.com/python/advanced-word-analysis-tfidf-with-tfidfvectorizer/">here</a>.</p>

<hr />

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># importing the libraries
</span><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">TfidfVectorizer</span>
<span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">KMeans</span>

<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
</code></pre></div></div>

<p>There are 35 constitutions in our dataset. Most of the documents were queried from the <a href="https://www.constituteproject.org/constitution/">Constitute Project</a> in April 2021, using a simple crawler that implements the <code class="language-plaintext highlighter-rouge">Beautiful Soup</code> <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">library</a>.</p>

<p>If you intend to get your data from the source above, then as a matter of good scraping etiquette, please be nice and avoid overburdening their servers with requests.</p>

<p><img src="/assets/kmeans-clustering-and-similarity-visualization-of-constitutions/constitution_crawler.png" alt="code snippet of constitution crawler" /></p>
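
<p>One simple way to follow that etiquette is to pause between consecutive requests. The sketch below is not the crawler pictured above: <code class="language-plaintext highlighter-rouge">fetch_politely</code>, <code class="language-plaintext highlighter-rouge">fake_fetch</code>, the URLs and the delay value are all hypothetical names and numbers chosen for illustration, and the fetcher is a stand-in so that the example runs offline.</p>

```python
import time

def fetch_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = {}
    for position, url in enumerate(urls):
        if position:  # no pause is needed before the very first request
            time.sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages

# Stand-in fetcher (assumption): a real crawler would download the page here
def fake_fetch(url):
    return "page for {}".format(url)

# Hypothetical URLs, for illustration only
urls = ["https://example.org/constitution/a", "https://example.org/constitution/b"]
pages = fetch_politely(urls, fake_fetch, delay_seconds=0.1)
print(len(pages))
```

<p>Passing the fetcher in as an argument keeps the throttling logic separate from the download code, so the same loop can wrap whichever HTTP client you prefer.</p>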

<p>After importing the necessary libraries, we read our documents to a <code class="language-plaintext highlighter-rouge">DataFrame</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Collect one row per text file found in the working directory
</span><span class="n">rows</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">os</span><span class="p">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">getcwd</span><span class="p">()):</span>
    <span class="k">if</span> <span class="n">filename</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.txt'</span><span class="p">):</span>
        <span class="c1"># Read the file, closing it afterwards, and replace newlines with spaces
</span>        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'latin_1'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">content</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">read</span><span class="p">().</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">" "</span><span class="p">)</span>
        <span class="n">rows</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">"document"</span><span class="p">:</span> <span class="n">filename</span><span class="p">[:</span><span class="o">-</span><span class="mi">4</span><span class="p">],</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">content</span><span class="p">})</span>
<span class="c1"># Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0)
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">rows</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'document'</span><span class="p">,</span> <span class="s">'content'</span><span class="p">])</span>
</code></pre></div></div>

<p>We should check the names of the documents. Since <code class="language-plaintext highlighter-rouge">pandas</code> <code class="language-plaintext highlighter-rouge">DataFrame</code> columns are <code class="language-plaintext highlighter-rouge">Series</code>, we can pull them out and call <code class="language-plaintext highlighter-rouge">.tolist()</code> to turn them into a Python list.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'document'</span><span class="p">].</span><span class="n">tolist</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['algeria_constitution', 'australia_constitution', 'austria_constitution', 'belgium_constitution', 'brazil_constitution', 'burkinafaso_constitution', 'china_constitution', 'costarica_constitution', 'ecuador_constitution', 'france_constitution', 'germany_constitution', 'india_constitution', 'japan_constitution', 'korea_constitution', 'malaysia_constitution', 'mexico_constitution', 'morocco_constitution', 'netherlands_constitution', 'nigeria_constitution', 'norway_constitution', 'pakistan_constitution', 'peru_constitution', 'portugal_constitution', 'rwanda_constitution', 'senegal_constitution', 'singapore_constitution', 'southafrica_constitution', 'spain_constitution', 'sweden_constitution', 'switzerland_constitution', 'tunisia_constitution', 'turkey_constitution', 'us_constitution', 'vietnam_constitution', 'zambia_constitution']
</code></pre></div></div>

<p>As we can see, the constitutions pertain to different countries from around the world, with different systems of government. Some nations from the list are monarchies, others are republics. Some have a unitary government, while others are federal. And the differences extend to the legislature as well. <br />
It is, in fact, a diverse collection. But, we should keep in mind that most of these constitutions are translated from their respective native languages. The original meaning in each document may not be conveyed with the same degree of accuracy.</p>

<hr />
<h2 id="text-processing-and-exploratory-analysis">Text processing and exploratory analysis</h2>
<p>Next, we begin the analysis. <br />
First, we choose a list of <code class="language-plaintext highlighter-rouge">stopwords</code> from the Natural Language Toolkit <a href="https://www.nltk.org/">project</a> (<code class="language-plaintext highlighter-rouge">nltk</code>). These are high-frequency terms (like <em>who</em> and <em>the</em>) that we may want to filter out of documents before processing.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">stopwords</span>
<span class="n">sw</span> <span class="o">=</span> <span class="n">stopwords</span><span class="p">.</span><span class="n">words</span><span class="p">(</span><span class="s">'english'</span><span class="p">)</span>
<span class="n">sw</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">'shall'</span><span class="p">)</span> <span class="c1"># Add "shall" to stopwords
</span><span class="k">print</span> <span class="p">(</span><span class="s">"There are"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">sw</span><span class="p">),</span> <span class="s">"words in this stopwords list. The first 10 are:"</span><span class="p">,</span> <span class="n">sw</span><span class="p">[:</span><span class="mi">10</span><span class="p">])</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>There are 180 words in this stopwords list. The first 10 are: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
</code></pre></div></div>

<p>Then, we are going to tokenize our texts. NLTK provides several types of tokenizers for that purpose. We will use a custom regular expression tokenizer that keeps runs of word characters (letters, digits and underscores) and drops everything else.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">nltk</span> <span class="kn">import</span> <span class="n">regexp_tokenize</span>
<span class="n">patn</span> <span class="o">=</span> <span class="sa">r</span><span class="s">'\w+'</span>
<span class="n">df</span><span class="p">[</span><span class="s">'content'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'content'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">w</span><span class="p">:</span> <span class="s">" "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">regexp_tokenize</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">patn</span><span class="p">)))</span>
</code></pre></div></div>
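
<p>To see what the <code class="language-plaintext highlighter-rouge">\w+</code> pattern keeps and discards, here is a small sketch on a made-up sentence. It uses <code class="language-plaintext highlighter-rouge">re.findall</code> from the standard library as a stand-in for <code class="language-plaintext highlighter-rouge">regexp_tokenize</code>; for a pattern without groups the two should return the same tokens, but treat that equivalence as an assumption rather than a guarantee.</p>

```python
import re

# Hypothetical sample text, not taken from the dataset
sample = "Art. 1: The State's power is vested in the people (see para_2)."

# \w matches letters, digits and the underscore; anything else acts as a separator
tokens = re.findall(r"\w+", sample)
print(tokens)

# Re-joining with single spaces mirrors what the apply() call does to each document
cleaned = " ".join(tokens)
print(cleaned)
```

<p>Note that the apostrophe splits <em>State's</em> into two tokens, while the underscore in <em>para_2</em> is kept.</p>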

<p>To gain better insight into our text corpus, let us create 4 new columns:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">df['number_words']</code> for the number of words in each document</li>
  <li><code class="language-plaintext highlighter-rouge">df['unique_words']</code> for the number of unique words in each document</li>
  <li><code class="language-plaintext highlighter-rouge">df['number_words_without_sw']</code> for the number of words that are not in the stopwords list</li>
  <li><code class="language-plaintext highlighter-rouge">df['percentage']</code> for the percentage of stop words in each text corpus</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">[</span><span class="s">'unique_words'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'content'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">split</span><span class="p">())))</span>
<span class="n">df</span><span class="p">[</span><span class="s">'number_words'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'content'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">split</span><span class="p">()))</span>
<span class="n">df</span><span class="p">[</span><span class="s">'number_words_without_sw'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'content'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="nb">len</span><span class="p">([</span><span class="n">word</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">y</span><span class="p">.</span><span class="n">split</span><span class="p">()</span> <span class="k">if</span> <span class="n">word</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">sw</span><span class="p">]))</span>
<span class="n">df</span><span class="p">[</span><span class="s">'percentage'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">-</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'number_words_without_sw'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">100</span> <span class="o">/</span> <span class="n">df</span><span class="p">[</span><span class="s">'number_words'</span><span class="p">])</span>
</code></pre></div></div>
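
<p>The arithmetic behind these columns can be checked by hand on a single sentence. The sketch below uses a made-up sentence and a tiny hypothetical stopword list. It also illustrates one detail worth keeping in mind: the membership test is case-sensitive, so a capitalized <em>The</em> is not matched by a lowercase stopword list unless the token is lowercased first.</p>

```python
# Hypothetical mini stopword list, all lowercase (as NLTK's English list is)
sw = ["the", "of", "and", "shall", "be"]

# Made-up sentence, for illustration only
text = "The President of the Republic shall be elected"
words = text.split()

number_words = len(words)
number_words_without_sw = len([w for w in words if w not in sw])
percentage = 100 - (number_words_without_sw * 100 / number_words)
print(number_words, number_words_without_sw, percentage)

# Same computation, but comparing lowercased tokens against the list
without_sw_lower = len([w for w in words if w.lower() not in sw])
percentage_lower = 100 - (without_sw_lower * 100 / number_words)
print(without_sw_lower, percentage_lower)
```

<p>With the case-sensitive test, 4 of the 8 words count as non-stopwords (50%); after lowercasing, only 3 do, and the stopword share rises to 62.5%.</p>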

<p>Let us use <code class="language-plaintext highlighter-rouge">.head()</code> to preview the first 6 rows of our DataFrame.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/assets/kmeans-clustering-and-similarity-visualization-of-constitutions/dfhead6.png" alt="DataFrame overview showing name of document and content and several columns" /></p>

<p>All of the <em>percentage</em> values are less than 50. In fact, we can surmise that a well-written legal document should not contain a lot of <em>stopwords</em>. However, that may not always be the case. Feel free to share your opinions on this in the comments section below, or by email.</p>

<p>We can push the exploration further, and check which constitutions have the most and the least unique words. Such a measure can be an indicator of the richness of the used lexicon, and its complexity as well.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">max_unique_words</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'unique_words'</span><span class="p">].</span><span class="nb">max</span><span class="p">()</span>
<span class="n">doc_max_unique_words</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'unique_words'</span><span class="p">].</span><span class="n">idxmax</span><span class="p">(),</span> <span class="s">'document'</span><span class="p">]</span>
<span class="n">min_unique_words</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'unique_words'</span><span class="p">].</span><span class="nb">min</span><span class="p">()</span>
<span class="n">doc_min_unique_words</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'unique_words'</span><span class="p">].</span><span class="n">idxmin</span><span class="p">(),</span> <span class="s">'document'</span><span class="p">]</span>

<span class="k">print</span> <span class="p">(</span><span class="s">"The {} has the most unique words with {} words. And the {} has the fewest with only {}"</span>
       <span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">doc_max_unique_words</span><span class="p">,</span> <span class="n">max_unique_words</span><span class="p">,</span> <span class="n">doc_min_unique_words</span><span class="p">,</span> <span class="n">min_unique_words</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The brazil_constitution has the most unique words with 5286 words. And the us_constitution has the fewest with only 1005
</code></pre></div></div>

<hr />
<p>After the previous exploratory phase, we move to term weighting and <code class="language-plaintext highlighter-rouge">tf-idf</code>.</p>

<p>As we did in the previous article linked above, we are going to use <code class="language-plaintext highlighter-rouge">TfidfVectorizer</code> from <code class="language-plaintext highlighter-rouge">sklearn</code> to convert the collection of documents to a matrix of TF-IDF features. That allows us to take into account how often a term shows up.  <br />
Again, <code class="language-plaintext highlighter-rouge">tf-idf</code> is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. You can refer to the official <code class="language-plaintext highlighter-rouge">sklearn</code> <a href="https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting">documentation</a> for the complex mathematical explanation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tfidf_vectorizer</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">(</span><span class="n">stop_words</span><span class="o">=</span><span class="n">sw</span><span class="p">,</span> <span class="n">use_idf</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">x</span> <span class="o">=</span> <span class="n">tfidf_vectorizer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
<span class="n">tfidfcounts</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">toarray</span><span class="p">(),</span><span class="n">index</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">document</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="n">tfidf_vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">())</span>
</code></pre></div></div>
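
<p>For intuition about what the vectorizer computes, the weighting can be reproduced by hand on a toy corpus. The sketch below is plain Python and not scikit-learn's implementation. It assumes the default settings described in the scikit-learn documentation: raw term counts, the smoothed idf <code class="language-plaintext highlighter-rouge">ln((1 + n) / (1 + df)) + 1</code>, and L2 normalization of each row. The three documents are made up for illustration.</p>

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and stripped of stopwords
docs = {
    "doc_a": "federal state law federal".split(),
    "doc_b": "president law article".split(),
    "doc_c": "article president president national".split(),
}

n_docs = len(docs)
vocab = sorted({term for tokens in docs.values() for term in tokens})

# Document frequency: in how many documents does each term occur?
doc_freq = {t: sum(t in tokens for tokens in docs.values()) for t in vocab}

# Smoothed idf (assumed to match scikit-learn's smooth_idf=True default)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized tf-idf weights of one document, ordered like vocab."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

weights = {name: tfidf_row(tokens) for name, tokens in docs.items()}
for name, row in weights.items():
    print(name, [round(v, 3) for v in row])
```

<p>In <em>doc_a</em>, the term <em>federal</em> gets the highest weight because it appears twice and in no other document, while <em>law</em> is discounted for appearing in two of the three documents.</p>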

<hr />
<h2 id="k-means-clustering">K-Means clustering</h2>
<p>Let us go over a brief explanation of clustering in general before delving into K-Means clustering that we will be using.</p>

<p>Clustering is the process of grouping a collection of objects, such that those in the same partition (or cluster) are more similar (in some sense) to each other, than to those in other groups (clusters). <br />
There are a lot of clustering algorithms to choose from, and the right one depends on the specifics of each use case.</p>

<p>As for the K-Means <a href="https://scikit-learn.org/stable/modules/clustering.html#k-means">algorithm</a>, it clusters data by trying to separate samples into <em>n</em> groups of equal variance. It minimizes the sum of squared distances between each cluster mean (centroid) and the points in that cluster, a criterion known as inertia.
This algorithm requires the number of clusters to be specified.</p>
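
<p>The idea can be sketched in a few lines of plain Python. This is a bare-bones version of the classic assign-then-update iteration (Lloyd's algorithm), meant only to illustrate the principle; scikit-learn's <code class="language-plaintext highlighter-rouge">KMeans</code> is more elaborate (for instance, it defaults to k-means++ initialization). The function names, the six 2-D points and the parameter values are all made up for this example.</p>

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def kmeans(points, k, n_iter=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as starting centroids
    for _ in range(n_iter):
        labels = assign(points, centroids)  # assignment step
        for c in range(k):                  # update step
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    labels = assign(points, centroids)
    # Inertia: sum of squared distances between each point and its centroid
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia

# Two hypothetical, well-separated groups of points
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids, inertia = kmeans(points, k=2)
print(labels, round(inertia, 3))
```

<p>Each pass can only lower (or keep) the inertia, which is why the procedure settles quickly on data as cleanly separated as this. On real data, the result depends on the starting centroids.</p>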

<p>Below, we set the desired number of clusters. That choice is not an easy task. There are a few ways to determine the optimal number of clusters, but for the sake of this demonstration, we will not be going through them.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Specify number of clusters
</span><span class="n">number_of_clusters</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">km</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span> <span class="o">=</span> <span class="n">number_of_clusters</span><span class="p">)</span>
<span class="c1"># Compute k-means clustering
</span><span class="n">km</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>KMeans(n_clusters=3)
</code></pre></div></div>

<p>After computing the <code class="language-plaintext highlighter-rouge">k-means</code> clustering, and getting a fitted estimator, we ought to see the top words in each cluster. <br />
In the code below, <code class="language-plaintext highlighter-rouge">cluster_centers_</code> gets the coordinates of each centroid. Then, <code class="language-plaintext highlighter-rouge">.argsort()[:, ::-1]</code> returns, for each centroid, the column indices sorted by descending weight. Since the columns of our vector representation are words, the first indices in each row point to the most <em>relevant</em> words of that cluster. <br />
We use <code class="language-plaintext highlighter-rouge">.get_feature_names_out()</code> (the replacement for <code class="language-plaintext highlighter-rouge">.get_feature_names()</code>, which was removed in scikit-learn 1.2) to get the feature names mapped from feature integer indices.  <br />
Finally, the <code class="language-plaintext highlighter-rouge">for</code> loop wraps up the work, and prints out the top words in each cluster.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">order_centroids</span> <span class="o">=</span> <span class="n">km</span><span class="p">.</span><span class="n">cluster_centers_</span><span class="p">.</span><span class="n">argsort</span><span class="p">()[:,</span> <span class="p">::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">terms</span> <span class="o">=</span> <span class="n">tfidf_vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">()</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_clusters</span><span class="p">):</span>
    <span class="n">top_words</span> <span class="o">=</span> <span class="p">[</span><span class="n">terms</span><span class="p">[</span><span class="n">ind</span><span class="p">]</span> <span class="k">for</span> <span class="n">ind</span> <span class="ow">in</span> <span class="n">order_centroids</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:</span><span class="mi">5</span><span class="p">]]</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Cluster {}: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="s">' '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">top_words</span><span class="p">)))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Cluster 0: state may federal law section
Cluster 1: article law national state president
Cluster 2: federal art law confederation para
</code></pre></div></div>

<p>The top terms for each cluster are somewhat intriguing. In fact, certain words do pertain to specific systems of government. <br />
Let us check the full results, and see how the entire dataset was partitioned.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="n">results</span><span class="p">[</span><span class="s">'document'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">document</span>
<span class="n">results</span><span class="p">[</span><span class="s">'category'</span><span class="p">]</span> <span class="o">=</span> <span class="n">km</span><span class="p">.</span><span class="n">labels_</span>
<span class="n">results</span>
</code></pre></div></div>

<p><img src="/assets/kmeans-clustering-and-similarity-visualization-of-constitutions/kmeans_results_on_constitutions.png" alt="DataFrame showing results of kmeans clustering on the dataset of constitutions" /></p>

<p>By observing the results and seeing the top words for each cluster, we can see that a lot of federal countries were assigned to cluster 2. However, a few were put in other clusters. By the same token, some non-federal countries were grouped in cluster 2. <br />
The same goes for clusters 0 and 1, where similar systems of government are not always put together.</p>

<p>This suggests that word relevance on its own gives only a rough indication of how well the text corpus reflects the system of government.  <br />
The analysis can be improved by using other algorithms and techniques. However, K-Means clustering, being fairly simple and easy to implement, can be a good starting point for deeper inspection.</p>

<hr />
<h2 id="visualizing-text-corpus-similarity">Visualizing text corpus similarity</h2>

<p>We can visualize the similarities based on the TF-IDF features.  <br />
To do so, we start by constructing the <em>vectorizer</em> as usual. We specify <code class="language-plaintext highlighter-rouge">max_features</code> to build a vocabulary that considers only the top <code class="language-plaintext highlighter-rouge">max_features</code> ordered by term frequency across the dataset.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vectorizer</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">(</span><span class="n">use_idf</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">max_features</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">stop_words</span><span class="o">=</span><span class="n">sw</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
<span class="k">print</span> <span class="p">(</span><span class="n">vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['article', 'constitution', 'court', 'federal', 'law', 'may', 'national', 'president', 'public', 'state']
</code></pre></div></div>

<p>We can use a combination of <code class="language-plaintext highlighter-rouge">DataFrame.plot</code> and <code class="language-plaintext highlighter-rouge">matplotlib</code> to draw a scatter plot representing the distribution of two terms on the <strong>x</strong> and <strong>y</strong> axes, and a <code class="language-plaintext highlighter-rouge">colormap</code> to showcase the relevance of a third term.  <br />
We can clearly see that only 10 data points have a noticeable weight for the term <em>federal</em>, while the rest have a value of 0 or close to it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">toarray</span><span class="p">(),</span> <span class="n">columns</span><span class="o">=</span><span class="n">vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">())</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">axi</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">()</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">df2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'scatter'</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span> <span class="s">'federal'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span> <span class="s">'president'</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span> <span class="mi">250</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span> <span class="mf">0.6</span><span class="p">,</span>
              <span class="n">c</span><span class="o">=</span><span class="s">'state'</span><span class="p">,</span> <span class="n">colormap</span><span class="o">=</span><span class="s">'viridis'</span><span class="p">,</span>
              <span class="n">figsize</span><span class="o">=</span> <span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="n">ax</span><span class="o">=</span> <span class="n">axi</span><span class="p">)</span>

<span class="n">axi</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'President X Federal'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
<span class="n">axi</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Federal"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
<span class="n">axi</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"President"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0, 0.5, 'President')
</code></pre></div></div>

<p><img src="/assets/kmeans-clustering-and-similarity-visualization-of-constitutions/visualizing_text_corpus_similarity.png" alt="a picture showing text corpus similarity" /></p>]]></content><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><category term="python" /><category term="python" /><category term="pandas" /><category term="sklearn" /><summary type="html"><![CDATA[Using TF-IDF term weighting, K-Means clustering from sklearn and visualizing similarities of a text corpus of constitutions]]></summary></entry><entry><title type="html">Advanced word analysis with TF-IDF</title><link href="https://meherbejaoui.com/python/advanced-word-analysis-tfidf-with-tfidfvectorizer/" rel="alternate" type="text/html" title="Advanced word analysis with TF-IDF" /><published>2021-04-21T00:00:00+01:00</published><updated>2021-04-21T00:00:00+01:00</updated><id>https://meherbejaoui.com/python/advanced-word-analysis-tfidf-with-tfidfvectorizer</id><content type="html" xml:base="https://meherbejaoui.com/python/advanced-word-analysis-tfidf-with-tfidfvectorizer/"><![CDATA[<ul>
  <li><a href="#introduction-and-basic-concepts">Introduction</a></li>
  <li><a href="#term-frequency">Term Frequency</a></li>
  <li><a href="#inverse-document-frequency">Inverse document frequency</a></li>
</ul>

<hr />
<h2 id="introduction-and-basic-concepts">Introduction and basic concepts</h2>
<p>In a previous <a href="https://www.meherbejaoui.com/python/counting-words-in-python-with-scikit-learn's-countvectorizer">article</a>, we utilized <strong>CountVectorizer</strong> from scikit-learn to count words. We used bag of words analysis, where a text is represented as the bag of its words, disregarding grammar, and with no particular order. This model may capture the characteristics of the text or document.</p>

<p>However, simple word counts have some limitations. A better solution is to use relative measures, such as how frequently a word is used in a document.</p>

<p>In fact, some terms appear very often while carrying little useful information about the document’s actual contents. Those very frequent words would overshadow the frequencies of rarer yet more interesting terms. <br />
These problems can be tackled with <a href="https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting"><strong>TF-IDF</strong></a>. <strong>Tf</strong> means term-frequency while <strong>tf–idf</strong> means term-frequency times inverse document-frequency. <br />
It is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.  <br />
The TF–IDF value increases with the number of times a word appears in a document, and is offset by the number of documents in the corpus that contain the word. This compensates for the fact that certain words appear more often than others in general.</p>
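<p>To make the weighting concrete, here is a minimal sketch in plain Python. The toy corpus and the helper names <code class="language-plaintext highlighter-rouge">idf</code> and <code class="language-plaintext highlighter-rouge">tf_idf</code> are ours, not part of any library. It assumes the smoothed IDF formula documented for scikit-learn's default settings, <em>ln((1 + n) / (1 + df)) + 1</em>, and leaves out the final normalization step.</p>

```python
import math

# Toy corpus (hypothetical, not the constitutions used in this article)
docs = [
    "the council shall meet",
    "the people shall vote",
    "the people elect the council of the people",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)


def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tf_idf(term, tokens):
    # Raw count of the term in one document, times its IDF
    return tokens.count(term) * idf(term)


idf_the = idf("the")    # "the" appears in all three documents
idf_vote = idf("vote")  # "vote" appears in a single document
print(idf_the, idf_vote)
print(tf_idf("people", tokenized[2]))
```

<p>A word found in every document keeps the minimum IDF of 1, while a word found in a single document gets a larger multiplier.</p>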

<hr />

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">TfidfVectorizer</span>
<span class="kn">from</span> <span class="nn">nltk.stem.porter</span> <span class="kn">import</span> <span class="n">PorterStemmer</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">requests</span>
</code></pre></div></div>

<p>Our texts for this notebook are four constitutions. The Tunisian constitution is read from a local file, and we use <code class="language-plaintext highlighter-rouge">requests</code> to download the other three from Project Gutenberg.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tn_constitution</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"constitution.txt"</span><span class="p">).</span><span class="n">read</span><span class="p">().</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="s">" "</span><span class="p">)</span> <span class="c1"># Tunisian Constitution
</span><span class="n">us_constitution</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://www.gutenberg.org/cache/epub/5/pg5.txt"</span><span class="p">).</span><span class="n">text</span><span class="p">[</span><span class="mi">2623</span><span class="p">:]</span> <span class="c1"># US Constitution
</span><span class="n">jp_constitution</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://www.gutenberg.org/cache/epub/612/pg612.txt"</span><span class="p">).</span><span class="n">text</span><span class="p">[</span><span class="mi">610</span><span class="p">:]</span> <span class="c1"># Japanese Constitution
</span><span class="n">athen_constitution</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://www.gutenberg.org/cache/epub/26095/pg26095.txt"</span><span class="p">).</span><span class="n">text</span><span class="p">[</span><span class="mi">610</span><span class="p">:]</span> <span class="c1"># Athenian Constitution
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([</span>
    <span class="p">{</span> <span class="s">"document"</span><span class="p">:</span> <span class="s">"Tunisian Constitution"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">tn_constitution</span><span class="p">},</span>
    <span class="p">{</span> <span class="s">"document"</span><span class="p">:</span> <span class="s">"United States Constitution"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">us_constitution</span> <span class="p">},</span>
    <span class="p">{</span> <span class="s">"document"</span><span class="p">:</span> <span class="s">"Japanese Constitution"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">jp_constitution</span><span class="p">},</span>
    <span class="p">{</span> <span class="s">"document"</span><span class="p">:</span> <span class="s">"Athenian Constitution"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">athen_constitution</span> <span class="p">},])</span>
</code></pre></div></div>

<hr />
<p>In text analysis, the raw data cannot be fed directly to most algorithms, since these expect numerical feature vectors of a fixed size rather than raw text documents of variable length. <br />
In order to address this, there are ways to extract numerical features from text, namely:</p>

<ul>
  <li><strong>Tokenizing</strong>: Word tokens are the basic units of text. The first processing step is to split strings into tokens and give an integer id to each possible token.</li>
  <li><strong>Counting</strong> the occurrences of tokens in each document, that is, how many times a word appears in the text.</li>
  <li><strong>Normalizing</strong> and weighting with diminishing importance the tokens that occur in the majority of documents.</li>
</ul>
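<p>The three steps above can be sketched with the standard library alone. This is only an illustration: the two sentences are made up, and the weighting in step 3 (dividing by the number of documents that contain the token) is a deliberately crude stand-in for the real TF-IDF weighting.</p>

```python
import re
from collections import Counter

docs = [
    "The people shall vote.",
    "The council shall meet, and the council decides.",
]

# 1. Tokenizing: lowercase, split into word tokens, assign an integer id per token
tokens_per_doc = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in docs]
all_tokens = sorted({tok for toks in tokens_per_doc for tok in toks})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}


# 2. Counting: one fixed-size vector of occurrences per document
def count_vector(tokens):
    counts = Counter(tokens)
    return [counts[tok] for tok in all_tokens]


vectors = [count_vector(toks) for toks in tokens_per_doc]

# 3. Weighting: divide each count by the number of documents containing the token
doc_freq = [sum(1 for vec in vectors if vec[i]) for i in range(len(all_tokens))]
weighted = [[c / df for c, df in zip(vec, doc_freq)] for vec in vectors]

print(vocabulary)
print(vectors)
print(weighted)
```

<p>Both documents end up as vectors of the same length, one slot per vocabulary entry, whatever the length of the original text.</p>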

<hr />

<p>We can specify a <code class="language-plaintext highlighter-rouge">tokenizer</code> when using <code class="language-plaintext highlighter-rouge">CountVectorizer</code>. Below is a <code class="language-plaintext highlighter-rouge">stemming_tokenizer</code> for reference; we will not be using it in the rest of this work.</p>

<p><strong>Stemming</strong> is a text preprocessing task that reduces related or similar forms of a word to a common base form (<em>talking</em> to <em>talk</em>, and <em>cats</em> to <em>cat</em> for example). The tokenizer relies on the <code class="language-plaintext highlighter-rouge">Porter stemmer</code> from <code class="language-plaintext highlighter-rouge">nltk</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">porter_stemmer</span> <span class="o">=</span> <span class="n">PorterStemmer</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">stemming_tokenizer</span><span class="p">(</span><span class="n">str_in</span><span class="p">):</span>
    <span class="n">words</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">"[^A-Za-z0-9\-]"</span><span class="p">,</span> <span class="s">" "</span><span class="p">,</span> <span class="n">str_in</span><span class="p">).</span><span class="n">lower</span><span class="p">().</span><span class="n">split</span><span class="p">()</span>
    <span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">porter_stemmer</span><span class="p">.</span><span class="n">stem</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">words</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">words</span>
</code></pre></div></div>

<p>Let’s put it all together, and experiment with the <code class="language-plaintext highlighter-rouge">CountVectorizer</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vectorizer</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">(</span><span class="n">stop_words</span><span class="o">=</span><span class="s">'english'</span><span class="p">)</span>

<span class="n">matrix</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
<span class="n">counts</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">matrix</span><span class="p">.</span><span class="n">toarray</span><span class="p">(),</span> <span class="n">index</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">document</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">())</span>
</code></pre></div></div>

<p>Since our texts are all constitutions, we could have a look at some intriguing terms. <br />
But what else should we be checking? Which words might be the most interesting? With <code class="language-plaintext highlighter-rouge">axis=1</code>, the <code class="language-plaintext highlighter-rouge">idxmax</code> <code class="language-plaintext highlighter-rouge">pandas</code> method returns the label of the column with the maximum value for each row. That is, we get the most frequent word in each document.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">counts</span><span class="p">.</span><span class="n">idxmax</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>document
Tunisian Constitution         article
United States Constitution      shall
Japanese Constitution           shall
Athenian Constitution         council
dtype: object
</code></pre></div></div>
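<p>For readers less familiar with <code class="language-plaintext highlighter-rouge">idxmax</code>, the same row-wise lookup can be written in plain Python. The counts below are invented for illustration and are not the real figures from the constitutions.</p>

```python
# Hypothetical counts: one row per document, one key per word
toy_counts = {
    "Doc A": {"article": 12, "shall": 7, "council": 1},
    "Doc B": {"article": 2, "shall": 30, "council": 0},
}

# For each row, keep the key (column label) holding the largest value
most_frequent = {doc: max(row, key=row.get) for doc, row in toy_counts.items()}
print(most_frequent)
```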

<p>Now, we look at a subset of words across all documents.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">counts</span><span class="p">[[</span><span class="s">'people'</span><span class="p">,</span><span class="s">'constitution'</span><span class="p">,</span> <span class="s">'rules'</span><span class="p">,</span> <span class="s">'law'</span><span class="p">,</span> <span class="s">'order'</span><span class="p">,</span> <span class="s">'assembly'</span><span class="p">,</span> <span class="s">'house'</span><span class="p">,</span> <span class="s">'democracy'</span><span class="p">,</span><span class="s">'article'</span><span class="p">,</span><span class="s">'shall'</span><span class="p">,</span><span class="s">'council'</span><span class="p">]]</span>
</code></pre></div></div>

<p><img src="/assets/tfidf042021/counts.png" alt="png showing subset of words across all documents" /></p>

<hr />
<h2 id="term-frequency">Term Frequency</h2>

<p>We’re going to take into account how often a term shows up by using the <code class="language-plaintext highlighter-rouge">TfidfVectorizer</code> in the same way as <code class="language-plaintext highlighter-rouge">CountVectorizer</code>. <code class="language-plaintext highlighter-rouge">TfidfVectorizer</code> converts a collection of documents to a matrix of TF-IDF features. It is equivalent to <code class="language-plaintext highlighter-rouge">CountVectorizer</code> followed by <code class="language-plaintext highlighter-rouge">TfidfTransformer</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tfidf_vectorizer</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">(</span><span class="n">stop_words</span><span class="o">=</span><span class="s">'english'</span><span class="p">,</span> <span class="n">use_idf</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="n">x</span> <span class="o">=</span> <span class="n">tfidf_vectorizer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
<span class="n">tfidfcounts</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">toarray</span><span class="p">(),</span><span class="n">index</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">document</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="n">tfidf_vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">())</span>
</code></pre></div></div>

<p>Let’s check the same words as we did before!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tfidfcounts</span><span class="p">[[</span><span class="s">'people'</span><span class="p">,</span><span class="s">'constitution'</span><span class="p">,</span> <span class="s">'rules'</span><span class="p">,</span> <span class="s">'law'</span><span class="p">,</span> <span class="s">'order'</span><span class="p">,</span> <span class="s">'assembly'</span><span class="p">,</span> <span class="s">'house'</span><span class="p">,</span> <span class="s">'democracy'</span><span class="p">,</span><span class="s">'article'</span><span class="p">,</span><span class="s">'shall'</span><span class="p">,</span><span class="s">'council'</span><span class="p">]]</span>
</code></pre></div></div>

<p><img src="/assets/tfidf042021/tfidfcounts.png" alt="png showing subset of words across all documents for tfidf counts" /></p>

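<p>Notice how our numbers have changed. With <code class="language-plaintext highlighter-rouge">use_idf=False</code>, the raw counts are still rescaled by the vectorizer's default <code class="language-plaintext highlighter-rouge">l2</code> normalization, so each value now depends on the other counts in the same document.
These are better relative indicators of how much a word is used, and of its importance in our documents, regardless of how long each document is.</p>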
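<p>As a rough sketch of what that normalization does, the snippet below scales a row of counts to unit Euclidean length. The helper <code class="language-plaintext highlighter-rouge">l2_normalize</code> and the two count rows are hypothetical; we are assuming the vectorizer's default <code class="language-plaintext highlighter-rouge">norm='l2'</code> setting.</p>

```python
import math


def l2_normalize(counts):
    # Divide every count by the Euclidean length of the whole row
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else list(counts)


short_doc = [3, 4, 0]    # hypothetical counts in a short text
long_doc = [30, 40, 0]   # same proportions in a text ten times longer

tf_short = l2_normalize(short_doc)
tf_long = l2_normalize(long_doc)
print(tf_short, tf_long)
```

<p>Both rows come out with the same values, which is why normalized frequencies are easier to compare across documents of very different lengths than raw counts.</p>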

<hr />
<h2 id="inverse-document-frequency">Inverse document frequency</h2>

<p>By looking at the previous DataFrame, it seems like the word (<em>shall</em>) shows up a lot. So, even though it’s not a <strong>stopword</strong>, it should be <em>weighted</em> a bit less.</p>

<p>This is inverse document frequency. The more documents a term shows up in, the less weight it gets in our matrix.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># use_idf=True is the default; it is set explicitly here to contrast with the previous vectorizer. It enables inverse-document-frequency reweighting.
</span><span class="n">idf_vectorizer</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">(</span><span class="n">stop_words</span><span class="o">=</span><span class="s">'english'</span><span class="p">,</span> <span class="n">use_idf</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">y</span> <span class="o">=</span> <span class="n">idf_vectorizer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
<span class="n">idfcounts</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">y</span><span class="p">.</span><span class="n">toarray</span><span class="p">(),</span> <span class="n">index</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">document</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="n">idf_vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">())</span>
</code></pre></div></div>

<p>Again, we look at the same subset of words across all documents.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">idfcounts</span><span class="p">[[</span><span class="s">'people'</span><span class="p">,</span><span class="s">'constitution'</span><span class="p">,</span> <span class="s">'rules'</span><span class="p">,</span> <span class="s">'law'</span><span class="p">,</span> <span class="s">'order'</span><span class="p">,</span> <span class="s">'assembly'</span><span class="p">,</span> <span class="s">'house'</span><span class="p">,</span> <span class="s">'democracy'</span><span class="p">,</span><span class="s">'article'</span><span class="p">,</span><span class="s">'shall'</span><span class="p">,</span><span class="s">'council'</span><span class="p">]]</span>
</code></pre></div></div>

<p><img src="/assets/tfidf042021/idfcounts.png" alt="png showing subset of words across all documents for idf counts" /></p>

<p>Notice how (<em>council</em>) increased in value because it’s an infrequent term across our documents, and (<em>people</em>) decreased in value because it’s quite frequent across them.</p>

<hr />
<p>It is beneficial to understand how TF-IDF functions in order to obtain a deeper understanding of how machine learning algorithms work. TF-IDF allows us to associate each word in a document with a numerical value or vector, that reflects its relevance in that document. <br />
In text analysis with machine learning, TF-IDF algorithms help extract keywords, and by determining similar documents, we are able to automatically sort them into clusters.  <br />
Besides, given a query, variations of the TF-IDF weighting are also used by search engines in scoring and ranking a document’s relevance.</p>]]></content><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><category term="python" /><category term="python" /><category term="pandas" /><category term="sklearn" /><summary type="html"><![CDATA[An explanation of text analysis using CountVectorizer and TfidfVectorizer from scikit-learn]]></summary></entry><entry><title type="html">Counting words in Python with scikit-learn’s CountVectorizer</title><link href="https://meherbejaoui.com/python/counting-words-in-python-with-scikit-learn's-countvectorizer/" rel="alternate" type="text/html" title="Counting words in Python with scikit-learn’s CountVectorizer" /><published>2021-04-10T00:00:00+01:00</published><updated>2021-04-10T00:00:00+01:00</updated><id>https://meherbejaoui.com/python/counting-words-in-python-with-scikit-learn&apos;s-countvectorizer</id><content type="html" xml:base="https://meherbejaoui.com/python/counting-words-in-python-with-scikit-learn&apos;s-countvectorizer/"><![CDATA[<ul>
  <li><a href="#introduction">Introduction</a></li>
  <li><a href="#counting-words-with-countvectorizer">Counting words with CountVectorizer</a></li>
  <li><a href="#counting-words-in-multiple-documents">Counting words in multiple documents</a></li>
</ul>

<hr />
<h2 id="introduction">Introduction</h2>
<p>In a previous <a href="https://www.meherbejaoui.com/python/visualization-and-analysis-of-legal-texts">article</a>, we used simple techniques to visualize and count words in a document. In this notebook, we will be using another technique. The <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer"><code class="language-plaintext highlighter-rouge">CountVectorizer</code></a> from scikit-learn is more elaborate than the <code class="language-plaintext highlighter-rouge">Counter</code> tool. It converts a collection of text documents to a matrix of token counts.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
</code></pre></div></div>

<p>The text for our work is an English version of the Tunisian constitution.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">text</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"constitution.txt"</span><span class="p">).</span><span class="n">read</span><span class="p">().</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="s">" "</span><span class="p">)</span>
</code></pre></div></div>

<hr />
<h2 id="counting-words-with-countvectorizer">Counting words with CountVectorizer</h2>
<p>The <strong>vectorizer</strong> produces a sparse representation of the counts. The <code class="language-plaintext highlighter-rouge">fit_transform()</code> method learns the vocabulary dictionary and returns the document-term matrix, as shown below. This method is equivalent to using <code class="language-plaintext highlighter-rouge">fit()</code> followed by <code class="language-plaintext highlighter-rouge">transform()</code>, but more efficiently implemented.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vectorizer</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">()</span>

<span class="n">matrix</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">([</span><span class="n">text</span><span class="p">])</span>
<span class="n">matrix</span> <span class="c1"># notice the size of the matrix
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;1x1741 sparse matrix of type '&lt;class 'numpy.int64'&gt;'
	with 1741 stored elements in Compressed Sparse Row format&gt;
</code></pre></div></div>
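<p>To see why fitting and transforming are two separate steps, here is a small sketch of the idea. <code class="language-plaintext highlighter-rouge">TinyCountVectorizer</code> is a hypothetical, heavily simplified stand-in that returns plain lists. It is not how scikit-learn is implemented, although it borrows the default token rule of keeping words of two or more characters.</p>

```python
import re
from collections import Counter


class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for CountVectorizer (dense output)."""

    @staticmethod
    def _tokenize(doc):
        # Lowercased words of two or more alphanumeric characters
        return re.findall(r"\b\w\w+\b", doc.lower())

    def fit(self, documents):
        # Learn the vocabulary: every distinct token gets a column index
        tokens = set()
        for doc in documents:
            tokens.update(self._tokenize(doc))
        self.vocabulary_ = {tok: i for i, tok in enumerate(sorted(tokens))}
        return self

    def transform(self, documents):
        # Count only the tokens that are already in the learned vocabulary
        rows = []
        for doc in documents:
            counts = Counter(self._tokenize(doc))
            rows.append([counts[tok] for tok in self.vocabulary_])
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


vec = TinyCountVectorizer()
fitted_rows = vec.fit_transform(["the law is the law"])
new_rows = vec.transform(["a new law"])
print(vec.vocabulary_, fitted_rows, new_rows)
```

<p>Words that were not seen during fitting, such as <em>new</em> here, have no column and are simply ignored by the later <code class="language-plaintext highlighter-rouge">transform()</code> call.</p>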

<p>The numbers in the array below represent how many times a word showed up in the text.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">matrix</span><span class="p">.</span><span class="n">toarray</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[2, 1, 1, ..., 1, 1, 3]], dtype=int64)
</code></pre></div></div>
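<p>The sparse matrix and the dense array hold the same counts. With a single document every vocabulary word appears at least once, so nothing is saved, but with several documents most cells are zero. The sketch below shows the general idea with a dictionary keyed by position. The numbers are made up, and the Compressed Sparse Row format used by the library is more compact than this.</p>

```python
# Hypothetical document-term counts: two documents, five vocabulary words
dense_rows = [
    [0, 2, 0, 0, 1],
    [3, 0, 0, 0, 0],
]

# Keep only the non-zero cells, keyed by (row, column)
sparse_cells = {
    (r, c): value
    for r, row in enumerate(dense_rows)
    for c, value in enumerate(row)
    if value
}
print(sparse_cells)
print(len(sparse_cells), "stored values instead of", sum(len(row) for row in dense_rows))
```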

<p>If we want to know which word is which, we can use <code class="language-plaintext highlighter-rouge">get_feature_names_out()</code> (named <code class="language-plaintext highlighter-rouge">get_feature_names()</code> in scikit-learn releases before 1.0) to get feature names from feature integer indices. The order of the words in this array matches the order of the numbers from the previous array.
Here, we only output the last 10 words. The last one is <em>youth</em>, and according to the last output value from <code class="language-plaintext highlighter-rouge">matrix.toarray()</code>, it appeared 3 times in the text. The word <em>younger</em> appeared just once!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">()[</span><span class="mi">1731</span><span class="p">:].</span><span class="n">tolist</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['works', 'world', 'worship', 'writing', 'written', 'year', 'years', 'young', 'younger', 'youth']
</code></pre></div></div>

<p>We can use DataFrames to turn the results into a human-readable format.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">counts_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">matrix</span><span class="p">.</span><span class="n">toarray</span><span class="p">(),</span> <span class="n">columns</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">())</span>
<span class="n">counts_df</span>
</code></pre></div></div>

<p><img src="/assets/counting_words_with_countvectorizer/dataframe_of_counts.png" alt="DataFrame showing the counts" /></p>

<hr />
<p>We can also get a sorted list similar to the result given by <code class="language-plaintext highlighter-rouge">Counter</code>. We use some <code class="language-plaintext highlighter-rouge">pandas</code> magic to transpose index and columns, and the result is naturally a transposed DataFrame. In fact, the <code class="language-plaintext highlighter-rouge">T</code> property is an accessor for the method <code class="language-plaintext highlighter-rouge">transpose()</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">counts_df</span><span class="p">.</span><span class="n">T</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">head</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/assets/counting_words_with_countvectorizer/dataframe_of_sorted_list.png" alt="DataFrame showing the sorted list of counts" /></p>
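<p>For comparison, this is roughly what the <code class="language-plaintext highlighter-rouge">Counter</code> approach looks like on a made-up sentence. Unlike <code class="language-plaintext highlighter-rouge">CountVectorizer</code>, it leaves tokenization entirely up to us; here we just split on whitespace.</p>

```python
from collections import Counter

# Hypothetical sentence, split on whitespace only
words = "the law protects the people and the people respect the law".split()

top_words = Counter(words).most_common(3)
print(top_words)
```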

<p>As seen so far, the <code class="language-plaintext highlighter-rouge">CountVectorizer</code> is quite useful, and it can handle a lot of preprocessing for us. That lets us focus on interpreting the data, for example.</p>

<p>So, how many times did <strong>people</strong> appear in the text?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">counts_df</span><span class="p">[</span><span class="s">'people'</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0    108
Name: people, dtype: int64
</code></pre></div></div>

<p>How about <strong>law</strong> and <strong>order</strong>?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span> <span class="p">(</span><span class="n">counts_df</span><span class="p">[</span><span class="s">'law'</span><span class="p">],</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span> <span class="p">,</span><span class="n">counts_df</span><span class="p">[</span><span class="s">'order'</span><span class="p">])</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0    106
Name: law, dtype: int64
 0    9
Name: order, dtype: int64
</code></pre></div></div>

<hr />
<h2 id="counting-words-in-multiple-documents">Counting words in multiple documents</h2>
<p>All of that is quite good and exciting. Now, we will see how <code class="language-plaintext highlighter-rouge">CountVectorizer</code> handles multiple text documents. <br />
We will be using the United States Constitution and the Athenian Constitution by Aristotle, in addition to our previous text.</p>

<p>To read a text file from a URL in Python, we make a request with the <code class="language-plaintext highlighter-rouge">requests</code> module to get a <code class="language-plaintext highlighter-rouge">Response</code> object. We can read the content of the server’s response by accessing <code class="language-plaintext highlighter-rouge">.text</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">requests</span>
<span class="n">US_constitution</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://www.gutenberg.org/cache/epub/5/pg5.txt"</span><span class="p">).</span><span class="n">text</span><span class="p">[</span><span class="mi">2623</span><span class="p">:]</span> <span class="c1"># To slice out the unwanted text
</span><span class="n">Athenian_constitution</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://www.gutenberg.org/cache/epub/26095/pg26095.txt"</span><span class="p">).</span><span class="n">text</span><span class="p">[</span><span class="mi">610</span><span class="p">:]</span>
</code></pre></div></div>
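<p>The hard-coded offsets above (<code class="language-plaintext highlighter-rouge">2623</code> and <code class="language-plaintext highlighter-rouge">610</code>) stop being correct if the files change. A possible alternative, sketched below, is to cut on the marker lines instead. It assumes the files contain lines starting with <code class="language-plaintext highlighter-rouge">*** START OF</code> and <code class="language-plaintext highlighter-rouge">*** END OF</code>, which should be checked against the actual downloads. The helper name <code class="language-plaintext highlighter-rouge">strip_gutenberg_boilerplate</code> and the sample string are made up for illustration.</p>

```python
def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Assumption: the file follows the usual Project Gutenberg plain-text layout.
    If a marker is missing, that end of the text is left untouched.
    """
    start_marker = "*** START OF"
    end_marker = "*** END OF"

    start = raw_text.find(start_marker)
    if start != -1:
        # Skip to the end of the marker line.
        newline = raw_text.find("\n", start)
        start = newline + 1 if newline != -1 else len(raw_text)
    else:
        start = 0

    end = raw_text.find(end_marker, start)
    if end == -1:
        end = len(raw_text)

    return raw_text[start:end].strip()


# Hypothetical miniature file, used instead of a real download.
sample = (
    "Header and licence notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "We the People of the United States\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer and licence notes\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

<p>With the real files, the same helper would be applied to the <code class="language-plaintext highlighter-rouge">.text</code> of each response in place of the slice.</p>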

<p>We construct a DataFrame with the content by passing the appropriate data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([</span>
    <span class="p">{</span> <span class="s">"document"</span><span class="p">:</span> <span class="s">"Tunisian Constitution"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">text</span><span class="p">},</span>
    <span class="p">{</span> <span class="s">"document"</span><span class="p">:</span> <span class="s">"United States Constitution"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">US_constitution</span> <span class="p">},</span>
    <span class="p">{</span> <span class="s">"document"</span><span class="p">:</span> <span class="s">"Athenian Constitution"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">Athenian_constitution</span> <span class="p">},])</span>
<span class="n">df</span>
</code></pre></div></div>

<p><img src="/assets/counting_words_with_countvectorizer/dataframe_showing_name_of_document_and_content.png" alt="DataFrame showing the name of documents and preview of their content" /></p>

<p>Finally, we create an organized DataFrame of the words counted in each document. This time, we feed the entire content column to <code class="language-plaintext highlighter-rouge">CountVectorizer</code> instead of a single text variable.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vectorizer</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">()</span>

<span class="n">matrix</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
<span class="n">counts</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">matrix</span><span class="p">.</span><span class="n">toarray</span><span class="p">(),</span> <span class="n">index</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">document</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="p">.</span><span class="n">get_feature_names_out</span><span class="p">())</span>

<span class="n">counts</span>
</code></pre></div></div>
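<p>To make the resulting table less of a black box, here is a rough standard-library sketch of the same idea: one row per document, one column per word. It assumes the default <code class="language-plaintext highlighter-rouge">CountVectorizer</code> settings (lowercasing, tokens of two or more word characters) and is not the library’s actual implementation. The two miniature documents are invented for the example.</p>

```python
import re
from collections import Counter

# Hypothetical miniature documents standing in for the three constitutions.
documents = {
    "Doc A": "The people shall elect the Assembly.",
    "Doc B": "The law protects the people; the law binds the state.",
}

# Same shape as CountVectorizer's default token pattern: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_tokens(text):
    """Lowercase the text and count its tokens."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


per_document = {name: count_tokens(text) for name, text in documents.items()}

# The shared, alphabetically sorted vocabulary becomes the column labels.
vocabulary = sorted(set().union(*per_document.values()))

# One row per document; a word missing from a document counts as 0.
rows = {name: [counts[word] for word in vocabulary]
        for name, counts in per_document.items()}

print(vocabulary)
for name, row in rows.items():
    print(name, row)
```

<p>Each row lines up with the vocabulary, which is why a single word can be compared across all documents by selecting its column.</p>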

<p><img src="/assets/counting_words_with_countvectorizer/dataframe_showing_counts_of_three_documents.png" alt="DataFrame of the words counted in the three documents" /></p>

<p>With this layout, we can select several interesting words and compare their counts across all documents.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">counts</span><span class="p">[[</span><span class="s">'people'</span><span class="p">,</span><span class="s">'constitution'</span><span class="p">,</span> <span class="s">'rules'</span><span class="p">,</span> <span class="s">'law'</span><span class="p">,</span> <span class="s">'order'</span><span class="p">,</span> <span class="s">'assembly'</span><span class="p">,</span> <span class="s">'house'</span><span class="p">,</span> <span class="s">'democracy'</span><span class="p">]]</span>
</code></pre></div></div>

<p><img src="/assets/counting_words_with_countvectorizer/dataframe_showing_counts_of_interesting_words.png" alt="DataFrame of interesting words checked in all three documents" /></p>

<hr />

<p>In this notebook, we used <code class="language-plaintext highlighter-rouge">CountVectorizer</code> from <code class="language-plaintext highlighter-rouge">sklearn</code> to count words in multiple documents. It is more convenient than working with <code class="language-plaintext highlighter-rouge">Counter</code>, which leaves all the text cleaning to us.</p>

<!-- Courtesy of embedresponsively.com -->

<div class="responsive-video-container">
    <iframe src="https://www.youtube-nocookie.com/embed/sCZ34kQvX0s" frameborder="0" webkitallowfullscreen="" mozallowfullscreen="" allowfullscreen=""></iframe>
  </div>]]></content><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><category term="python" /><category term="python" /><category term="pandas" /><category term="sklearn" /><summary type="html"><![CDATA[Using CountVectorizer to count words in multiple documents]]></summary></entry><entry><title type="html">Visualization and analysis of legal texts</title><link href="https://meherbejaoui.com/python/Visualization-and-analysis-of-legal-texts/" rel="alternate" type="text/html" title="Visualization and analysis of legal texts" /><published>2021-04-08T00:00:00+01:00</published><updated>2021-04-08T00:00:00+01:00</updated><id>https://meherbejaoui.com/python/Visualization-and-analysis-of-legal-texts</id><content type="html" xml:base="https://meherbejaoui.com/python/Visualization-and-analysis-of-legal-texts/"><![CDATA[<ul>
  <li><a href="#generating-word-clouds">Generating Word Clouds</a></li>
  <li><a href="#counting-words">Counting words</a></li>
  <li><a href="#comparing-different-documents">Comparing different documents</a></li>
</ul>

<hr />

<p>While browsing the Internet, you have probably seen a picture of a cloud filled with words of varying sizes that reflect the frequency of each word within a given text. This is referred to as a Tag Cloud or a Word Cloud. In this tutorial (see the notebook <a href="https://github.com/meherbejaoui/meherbejaoui.github.io/blob/master/assets/word_clouds/wordCloud.ipynb">here</a>), we will learn how to make Word Clouds in Python. This tool is useful for a visual exploration of text data.</p>

<p>We will use legal texts for the purpose of this tutorial, namely the Tunisian Constitution and the Tunisian Hydrocarbons Code.</p>

<hr />

<p>As usual, we start by importing the different libraries used. <br />
The <code class="language-plaintext highlighter-rouge">NumPy</code> library is used for handling large, multi-dimensional arrays and matrices. <br />
For visualization, <code class="language-plaintext highlighter-rouge">matplotlib</code> is a comprehensive plotting library. Other libraries, such as <code class="language-plaintext highlighter-rouge">seaborn</code> and <code class="language-plaintext highlighter-rouge">wordcloud</code>, build on it. <br />
The <code class="language-plaintext highlighter-rouge">pillow</code> library adds support for opening, manipulating, and saving many different image file formats.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">from</span> <span class="nn">wordcloud</span> <span class="kn">import</span> <span class="n">WordCloud</span><span class="p">,</span> <span class="n">STOPWORDS</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
</code></pre></div></div>

<hr />

<h3 id="generating-word-clouds">Generating Word Clouds</h3>
<p>Then, we read the Hydrocarbons Code stored in a <code class="language-plaintext highlighter-rouge">.txt</code> file. I chose this legal document because I analysed it for a thesis <a href="https://www.meherbejaoui.com/blog/Governance-of-extractive-industry-in-Tunisia">report</a> I wrote. It is a complex legal text that pertains to a sensitive and important topic: natural resources.</p>

<p>Some text processing is needed first. We convert the entire text to lower case. This is important because Python strings are case sensitive, so <em>Article</em> and <em>article</em> would otherwise be counted as two different words. Afterwards, we use regular expressions to replace apostrophes (together with the character before them) and other non-word characters with spaces. <br />
Whenever in doubt about regular expression syntax, you can use <a href="https://regex101.com/">this</a> website or the official Python <a href="https://docs.python.org/3/library/re.html">documentation</a> to get the desired outcome.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hydrocarbons_code</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"hydrocarbons_code.txt"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'latin_1'</span><span class="p">).</span><span class="n">read</span><span class="p">().</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="s">" "</span><span class="p">)</span>
<span class="n">hydrocarbons_code</span> <span class="o">=</span> <span class="n">hydrocarbons_code</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span>
<span class="n">hydrocarbons_code</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">".'|[^\w ]"</span><span class="p">,</span> <span class="s">" "</span><span class="p">,</span> <span class="n">hydrocarbons_code</span><span class="p">)</span>
</code></pre></div></div>
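<p>To see what the pattern does, it can help to run it on a short sentence. The two sample strings below are made up. The second one uses a typographic apostrophe, which the first alternative <code class="language-plaintext highlighter-rouge">.'</code> does not match, so the elided letter survives as a separate token.</p>

```python
import re

pattern = r".'|[^\w ]"

# Straight apostrophe: the elided article ("l'") is removed as a whole.
straight = "L'article 5 de l'état."
straight_tokens = re.sub(pattern, " ", straight.lower()).split()
print(straight_tokens)  # ['article', '5', 'de', 'état']

# Typographic apostrophe: only the apostrophe itself is replaced by a space.
curly = "l’état"
curly_tokens = re.sub(pattern, " ", curly).split()
print(curly_tokens)  # ['l', 'état']
```

<p>Leftover one-letter tokens such as <em>l</em> are not a problem for the cloud further down, since it is built with <code class="language-plaintext highlighter-rouge">min_word_length = 3</code>.</p>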

<p>Since the text is in French, we have to make our own list (technically it’s a Python <strong>set</strong> with curly brackets here) of the words to remove from the given text. These words would not show in the Word Cloud.</p>

<p>If you do not speak French, most of these words are the equivalent of English articles, pronouns and conjunctions. They would not help us much in understanding the text through visuals.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">stop</span> <span class="o">=</span> <span class="p">{</span><span class="s">'de'</span><span class="p">,</span><span class="s">'du'</span><span class="p">,</span><span class="s">'des'</span><span class="p">,</span><span class="s">'et'</span><span class="p">,</span><span class="s">'est'</span><span class="p">,</span><span class="s">'la'</span><span class="p">,</span><span class="s">'le'</span><span class="p">,</span><span class="s">'les'</span><span class="p">,</span><span class="s">'en'</span><span class="p">,</span><span class="s">'ou'</span><span class="p">,</span><span class="s">'par'</span><span class="p">,</span><span class="s">'au'</span><span class="p">,</span><span class="s">'aux'</span><span class="p">,</span><span class="s">'dans'</span><span class="p">,</span><span class="s">'une'</span><span class="p">,</span><span class="s">'un'</span><span class="p">,</span><span class="s">'pour'</span><span class="p">,</span><span class="s">'sur'</span><span class="p">,</span><span class="s">'ce'</span><span class="p">,</span><span class="s">'ces'</span><span class="p">,</span>
        <span class="s">'ne'</span><span class="p">,</span><span class="s">'qui'</span><span class="p">,</span><span class="s">'que'</span><span class="p">,</span><span class="s">'son'</span><span class="p">,</span><span class="s">'ses'</span><span class="p">,</span><span class="s">'sa'</span><span class="p">,</span><span class="s">'il'</span><span class="p">,</span><span class="s">'ci'</span><span class="p">,</span><span class="s">'a'</span><span class="p">}</span>
</code></pre></div></div>
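<p>As a small illustration of what removing stop words amounts to, the sketch below filters a made-up French phrase against a shortened version of the set. It only mimics the effect and says nothing about how <code class="language-plaintext highlighter-rouge">WordCloud</code> applies the set internally.</p>

```python
# Shortened, illustrative subset of the stop word set defined above.
stop_subset = {'de', 'du', 'des', 'et', 'la', 'le', 'les'}

# Hypothetical phrase, already lower-cased and stripped of punctuation.
words = "le titulaire du permis et les hydrocarbures".split()

# Membership tests on a set are fast, which is why a set is a good fit here.
kept = [word for word in words if word not in stop_subset]
print(kept)  # ['titulaire', 'permis', 'hydrocarbures']
```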

<p>To give the Word Cloud a shape, we need a PNG file to use as a mask. Here, I use the map of Tunisia, just for fun.</p>

<p>The <strong>mask</strong> argument of the <code class="language-plaintext highlighter-rouge">WordCloud</code> class takes an N-dimensional array (ndarray). We use the <code class="language-plaintext highlighter-rouge">Image</code> module to open the PNG file, and we convert it to a NumPy array. <br />
According to the <code class="language-plaintext highlighter-rouge">WordCloud</code> documentation, all white entries will be considered masked out, while other entries will be free to draw on. In the NumPy array, all white parts of the mask have a value of 255, whereas black parts have a value of 0.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">map_mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">Image</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"map.png"</span><span class="p">))</span>
</code></pre></div></div>
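<p>If no suitable PNG is at hand, a mask can also be built directly as an array. The following sketch creates a hypothetical rectangular mask (the sizes are arbitrary) and checks which share of it would be available for drawing, following the white-is-masked convention described above.</p>

```python
import numpy as np

# Start from an all-white canvas (255 = masked out, nothing is drawn there).
mask = np.full((200, 300), 255, dtype=np.uint8)

# Paint a black rectangle (0 = free to draw on) in the middle.
mask[50:150, 75:225] = 0

print(mask.shape, mask.dtype)

# Share of the canvas that is not white, i.e. available to the words.
drawable_share = (mask != 255).mean()
print(drawable_share)
```

<p>The same check applied to <code class="language-plaintext highlighter-rouge">map_mask</code> is a quick way to confirm that the image was loaded with a white background rather than a transparent one.</p>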

<p>Now, we have a proper mask and we can make a cloud with the desired shape. <br />
<code class="language-plaintext highlighter-rouge">WordCloud</code> takes several parameters, and you can create a personalised result by changing the optional arguments. Some of these are fairly self-explanatory. For the rest, you can always consult the relevant <a href="http://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html">documentation</a>. Or, you can check out the docstring of the function and see the required and optional arguments, by typing and running <code class="language-plaintext highlighter-rouge">?WordCloud</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hydrocarbons_cloud</span> <span class="o">=</span> <span class="n">WordCloud</span><span class="p">(</span><span class="n">max_words</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">map_mask</span><span class="p">,</span> <span class="n">stopwords</span><span class="o">=</span><span class="n">stop</span> <span class="p">,</span> <span class="n">min_word_length</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="n">min_font_size</span> <span class="o">=</span> <span class="mi">8</span><span class="p">,</span>
                               <span class="n">margin</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">background_color</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">include_numbers</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">generate</span><span class="p">(</span><span class="n">hydrocarbons_code</span><span class="p">)</span>
</code></pre></div></div>

<p>Finally, we can output the result and get an insightful and beautiful visualization. <br />
Naturally, the most frequent words are <em>hydrocarbons</em>, <em>code</em>, <em>article</em> and <em>holder</em>, as expected in a legal document about hydrocarbons!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">18</span><span class="p">,</span><span class="mi">18</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Tunisian Hydrocarbons Code Cloud"</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">22</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">hydrocarbons_cloud</span><span class="p">.</span><span class="n">recolor</span><span class="p">(</span><span class="n">colormap</span><span class="o">=</span><span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="n">palette</span><span class="o">=</span><span class="s">'blend:red,brown'</span><span class="p">,</span><span class="n">as_cmap</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">3</span><span class="p">),</span>
           <span class="n">interpolation</span><span class="o">=</span><span class="s">"bilinear"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">"off"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="c1">#hydrocarbons_cloud.to_file("Hydrocarbons code cloud.png") # uncomment to save the figure in PNG format
</span></code></pre></div></div>

<p><img src="/assets/word_clouds/output_13_0.png" alt="png showing Hydrocarbons code cloud as the Tunisian map" /></p>

<hr />
<p>Now, let’s take a look at the Tunisian Constitution. We will use an English translation from <a href="https://www.constituteproject.org/constitution/Tunisia_2014?lang=en">constitute project</a>.</p>

<p>As before, we start by constructing a mask. For this example, it will be the Tunisian flag. Otherwise, the only difference is that we use the built-in <em>STOPWORDS</em> set, since the text is in English.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">flag_mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">Image</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"Flag_of_Tunisia.png"</span><span class="p">))</span>

<span class="n">tunisian_constitution</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"constitution.txt"</span><span class="p">).</span><span class="n">read</span><span class="p">().</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="s">" "</span><span class="p">).</span><span class="n">lower</span><span class="p">()</span>
<span class="n">tunisian_constitution</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">"[^\w ]"</span><span class="p">,</span> <span class="s">" "</span><span class="p">,</span> <span class="n">tunisian_constitution</span><span class="p">)</span>

<span class="n">constitution_cloud</span> <span class="o">=</span> <span class="n">WordCloud</span><span class="p">(</span><span class="n">max_words</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">flag_mask</span><span class="p">,</span> <span class="n">stopwords</span><span class="o">=</span><span class="n">STOPWORDS</span> <span class="p">,</span> <span class="n">min_word_length</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="n">min_font_size</span> <span class="o">=</span> <span class="mi">8</span><span class="p">,</span>
                               <span class="n">margin</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">background_color</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">include_numbers</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">generate</span><span class="p">(</span><span class="n">tunisian_constitution</span><span class="p">)</span>
</code></pre></div></div>

<p>The result looks good and gives us a useful visual anchor.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Tunisian Constitution Cloud"</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">22</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="mf">1.04</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">constitution_cloud</span><span class="p">.</span><span class="n">recolor</span><span class="p">(</span><span class="n">colormap</span><span class="o">=</span><span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="n">palette</span><span class="o">=</span><span class="s">'blend:red,brown'</span><span class="p">,</span><span class="n">as_cmap</span><span class="o">=</span><span class="bp">True</span><span class="p">)),</span>
           <span class="n">interpolation</span><span class="o">=</span><span class="s">"bilinear"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">"off"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="c1">#constitution_cloud.to_file("Tunisian Constitution Cloud.png") # uncomment to save the figure in PNG format
</span></code></pre></div></div>

<p><img src="/assets/word_clouds/output_18_0.png" alt="png showing Tunisian Constitution Cloud as the Tunisian flag" /></p>

<hr />
<h3 id="counting-words">Counting words</h3>
<p>However, text analysis goes beyond visualisations. We can count words using the <a href="https://docs.python.org/3/library/collections.html#collections.Counter"><code class="language-plaintext highlighter-rouge">Counter</code></a> collection.
If we are only interested in the most common words, we can use <code class="language-plaintext highlighter-rouge">.most_common()</code>, with or without an argument specifying the number of words.</p>

<p>In the code below, we use a <em>list comprehension</em> to keep, among the 50 most common words in the constitution, those that appear more than 30 times and have at least 4 letters.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="n">tunisian_words</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">Counter</span><span class="p">(</span><span class="n">tunisian_constitution</span><span class="p">.</span><span class="n">split</span><span class="p">()).</span><span class="n">most_common</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="k">if</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">&gt;</span><span class="mi">30</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="o">&gt;</span><span class="mi">3</span> <span class="p">]</span>
<span class="n">tunisian_words</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[('article', 173),
 ('shall', 173),
 ('assembly', 149),
 ('president', 110),
 ('people', 108),
 ('government', 100),
 ('representatives', 97),
 ('with', 97),
 ('republic', 96),
 ('state', 67),
 ('members', 62),
 ('court', 62),
 ('their', 50),
 ('draft', 48),
 ('head', 47),
 ('constitutional', 45),
 ('that', 43),
 ('laws', 42),
 ('within', 40),
 ('from', 39),
 ('right', 38),
 ('council', 37),
 ('judicial', 37),
 ('authorities', 34),
 ('local', 33),
 ('national', 32),
 ('after', 32),
 ('rights', 31),
 ('constitution', 31)]
</code></pre></div></div>
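<p>The same pattern is easier to follow on a sentence short enough to count by hand. The text and the thresholds below are invented for the example.</p>

```python
from collections import Counter

# Hypothetical sentence, short enough to verify the counts by eye.
text = ("the state shall protect the rights of the people "
        "and the state shall respect the law")
counts = Counter(text.split())

# Without filtering, the short function word dominates.
top_three = counts.most_common(3)
print(top_three)  # [('the', 5), ('state', 2), ('shall', 2)]

# Keep words seen at least twice and made of at least 4 letters.
frequent = [(word, n) for word, n in counts.most_common()
            if n >= 2 and len(word) >= 4]
print(frequent)  # [('state', 2), ('shall', 2)]
```

<p>Words with equal counts come out in the order in which they were first encountered, which explains why <em>state</em> is listed before <em>shall</em>.</p>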

<hr />
<h3 id="comparing-different-documents">Comparing different documents</h3>
<p>We can compare the occurrences of words in two documents as well. We will use the French constitution for comparison, despite the differences between the governance systems of the two countries as implemented in their respective constitutions.</p>

<p>We read and process the <code class="language-plaintext highlighter-rouge">.txt</code> file as we did before. Then, we store the words of each document in a list (<em>tunisian_words</em> and <em>french_words</em>).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">french_constitution</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"french constitution.txt"</span><span class="p">).</span><span class="n">read</span><span class="p">().</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="s">" "</span><span class="p">).</span><span class="n">lower</span><span class="p">()</span>
<span class="n">french_constitution</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">"[^\w ]"</span><span class="p">,</span> <span class="s">" "</span><span class="p">,</span> <span class="n">french_constitution</span><span class="p">)</span>

<span class="n">tunisian_words</span> <span class="o">=</span> <span class="n">tunisian_constitution</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">)</span>
<span class="n">french_words</span> <span class="o">=</span> <span class="n">french_constitution</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">)</span>
</code></pre></div></div>

<p>Afterwards, we construct a DataFrame by passing in the <em>Counter</em> collections as data entries, and setting the column labels. <br />
A word that appears in only one document gets a missing value in the other column. In this example, we chose to <em>drop</em> those missing values by using <code class="language-plaintext highlighter-rouge">.dropna()</code> along axis <strong>0</strong>. That means we drop the rows that contain missing values, which leaves a DataFrame of only the words that exist in both documents!</p>

<p>The result below shows the number of occurrences of each word, the total, and a percentage value indicating the prevalence of that word in the Tunisian constitution with reference to its total use in both documents.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
    <span class="s">'Tunisian_constitution'</span><span class="p">:</span> <span class="n">Counter</span><span class="p">(</span><span class="n">tunisian_words</span><span class="p">),</span>
    <span class="s">'French_constitution'</span><span class="p">:</span> <span class="n">Counter</span><span class="p">(</span><span class="n">french_words</span><span class="p">)</span>
<span class="p">}).</span><span class="n">dropna</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="n">df</span><span class="p">[</span><span class="s">'Total'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">Tunisian_constitution</span> <span class="o">+</span> <span class="n">df</span><span class="p">.</span><span class="n">French_constitution</span>
<span class="n">df</span><span class="p">[</span><span class="s">'Tunisian_percentage'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">Tunisian_constitution</span> <span class="o">/</span> <span class="n">df</span><span class="p">.</span><span class="n">Total</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span>

<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
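<p>The logic of that cell can be reproduced without pandas, which makes the arithmetic explicit. The sketch below uses two invented one-line "constitutions"; the names <code class="language-plaintext highlighter-rouge">tunisian</code>, <code class="language-plaintext highlighter-rouge">french</code>, <code class="language-plaintext highlighter-rouge">table</code> and <code class="language-plaintext highlighter-rouge">share</code> are illustrative only.</p>

```python
from collections import Counter

# Hypothetical miniature texts standing in for the two constitutions.
tunisian = Counter("the people elect the assembly".split())
french = Counter("the people and the senate elect the president".split())

# Words present in both documents: the counterpart of dropping rows with NaN.
common = sorted(tunisian.keys() & french.keys())

# (Tunisian count, French count) for each common word.
table = {word: (tunisian[word], french[word]) for word in common}

# Share of each word's total use that comes from the Tunisian text, in percent.
share = {word: 100 * a / (a + b) for word, (a, b) in table.items()}

print(common)
print(table)
print(share)
```

<p>Here <em>the</em> appears 2 times in the first text and 3 times in the second, so its percentage is 100 × 2 / 5 = 40.</p>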

<p><img src="/assets/word_clouds/dataframe_occurrences_and_total_of_each_word_in_both_documents.png" alt="picture showing DataFrame of occurrences and total for each word in both documents" /></p>

<hr />
<p>The words common to both documents are not used equally often in each. We can total their occurrences by summing the values over the rows (axis 0). Only the first three totals are meaningful: the sum of the percentage column has no useful interpretation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tunisian_constitution    11950.000000
French_constitution      11769.000000
Total                    23719.000000
Tunisian_percentage      40472.171694
dtype: float64
</code></pre></div></div>
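<p>As a minimal sketch of the steps above, the following uses two made-up word lists (their contents are illustrative assumptions, not the text of the constitutions). It suggests that <code class="language-plaintext highlighter-rouge">dropna</code> keeps only the words present in both documents, and that the sum of the percentage column is not itself a meaningful percentage.</p>

```python
from collections import Counter

import pandas as pd

# Hypothetical word lists standing in for the two tokenized documents
tunisian_words = ["state", "people", "state", "law", "freedom"]
french_words = ["state", "law", "law", "republic"]

# One column per document; a word missing from a document becomes NaN
df = pd.DataFrame({
    "Tunisian_constitution": Counter(tunisian_words),
    "French_constitution": Counter(french_words),
}).dropna(axis=0)  # keep only the words found in both documents

df["Total"] = df.Tunisian_constitution + df.French_constitution
df["Tunisian_percentage"] = (df.Tunisian_constitution / df.Total) * 100

# Column-wise totals: useful for the counts, not for the percentage column
totals = df.sum(axis=0)
print(totals)
```

<p>Only the count columns should be read from this result; a share for the whole corpus would be computed from the summed counts instead.</p>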

<hr />
<p>Now, let’s look at the words used ten or more times in the Tunisian constitution, sorted in descending order of count. <br />
Beforehand, a <code class="language-plaintext highlighter-rouge">for</code> loop goes through <code class="language-plaintext highlighter-rouge">df.index</code> and removes the words with three characters or fewer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">ind</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">index</span><span class="p">:</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">ind</span><span class="p">)</span><span class="o">&lt;</span><span class="mi">4</span><span class="p">:</span>
        <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">ind</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>

<span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">.</span><span class="n">Tunisian_constitution</span> <span class="o">&gt;=</span> <span class="mi">10</span><span class="p">].</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">'Tunisian_constitution'</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/assets/word_clouds/dataframe_occurrences_and_total_of_word_that_appeared_at_least_ten_times.png" alt="picture showing DataFrame of occurrences and total for word that appeared at least 10 times in both documents" /></p>
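<p>The loop above can also be written as a boolean mask on the index. The sketch below is one possible alternative, using made-up counts (the words and numbers are assumptions for illustration); unlike the loop, it leaves the original DataFrame untouched.</p>

```python
import pandas as pd

# Hypothetical counts indexed by word (made-up values for illustration)
df = pd.DataFrame(
    {"Tunisian_constitution": [40, 12, 3, 25],
     "French_constitution": [35, 2, 9, 30]},
    index=["the", "state", "law", "right"],
)

# Keep only the words with four or more characters, without an explicit loop
long_words = df[df.index.str.len() >= 4]

# Words used ten or more times in the first document, most frequent first
frequent = long_words[long_words.Tunisian_constitution >= 10].sort_values(
    by="Tunisian_constitution", ascending=False
)
print(frequent)
```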

<hr />
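<p>We can use <code class="language-plaintext highlighter-rouge">df.sum()</code> again to total the occurrences of the words that remain after dropping the short ones. Because the filter on ten or more occurrences was not assigned back to <code class="language-plaintext highlighter-rouge">df</code>, the totals cover every remaining word of four or more characters.</p>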

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tunisian_constitution     5109.000000
French_constitution       5125.000000
Total                    10234.000000
Tunisian_percentage      34301.487972
dtype: float64
</code></pre></div></div>

<hr />
<p>In this notebook (using <em>Python 3.7, pandas 1.2.1</em> and <em>matplotlib 3.3.2</em>), we have learned how to draw a Word Cloud, which helps visualize any text. We also used <code class="language-plaintext highlighter-rouge">Counter</code> to count words in documents. It worked well with pandas DataFrames, allowing us to make simple comparisons.</p>

<p>This might have been naive text analysis, but it is an important first step towards a more comprehensive and elaborate text analysis.</p>]]></content><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><category term="python" /><category term="python" /><category term="pandas" /><category term="visualization" /><summary type="html"><![CDATA[Using pandas and matplotlib, to generate and style Word Clouds, count words using the Counter collection and compare different documents]]></summary></entry><entry><title type="html">Plotting climate data using pandas</title><link href="https://meherbejaoui.com/python/Temperature-broken-records/" rel="alternate" type="text/html" title="Plotting climate data using pandas" /><published>2021-03-20T00:00:00+01:00</published><updated>2021-03-20T00:00:00+01:00</updated><id>https://meherbejaoui.com/python/Temperature-broken-records</id><content type="html" xml:base="https://meherbejaoui.com/python/Temperature-broken-records/"><![CDATA[<ul>
  <li><a href="#introduction">Introduction</a></li>
  <li><a href="#data-processing-and-time-series-manipulation">Data processing and time series manipulation</a></li>
  <li><a href="#plotting-and-styling-using-matplotlib-and-seaborn">Plotting and styling using matplotlib and seaborn</a></li>
</ul>

<hr />
<h2 id="introduction">Introduction</h2>
<p>The data for this <a href="https://github.com/meherbejaoui/meherbejaoui.github.io/blob/master/assets/temperatures/TemperatureBrokenRecords.ipynb">notebook</a> comes from a subset of the National Centers for Environmental Information (NCEI) Daily Global Historical Climatology Network (GHCN-Daily). GHCN-Daily comprises daily climate records from thousands of land surface stations across the globe.</p>

<p>The data (stored in a <a href="https://github.com/meherbejaoui/meherbejaoui.github.io/blob/master/assets/temperatures/data.csv">csv file</a>) consists of daily climate records over the period 2005-2015, from land surface stations near Ann Arbor, Michigan, United States.
Each row in the data file corresponds to a single observation.</p>

<p>The provided variables are :</p>
<ul>
  <li><strong>ID</strong> : station identification code</li>
  <li><strong>Date</strong> : date in YYYY-MM-DD format</li>
  <li><strong>Element</strong> : indicator of element type
    <ul>
      <li><strong>TMAX</strong> : Maximum temperature (tenths of degrees C)</li>
      <li><strong>TMIN</strong> : Minimum temperature (tenths of degrees C)</li>
    </ul>
  </li>
  <li><strong>Data_Value</strong> : data value for element (in tenths of degrees C)</li>
</ul>

<p>For the purpose of this notebook, we are going to plot a line chart of the record high and record low temperatures by day of the year over the period 2005-2014. Then, we overlay a scatter of the 2015 data for any points (highs and lows) for which the ten-year record (2005-2014) was broken in 2015.</p>

<hr />

<p>Importing libraries and reading data</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">matplotlib.dates</span> <span class="kn">import</span> <span class="n">MonthLocator</span><span class="p">,</span> <span class="n">DateFormatter</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data.csv'</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="data-processing-and-time-series-manipulation">Data processing and time series manipulation</h2>
<p>Since the temperature values are in tenths of a degree Celsius, we need to convert them to °C.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">[</span><span class="s">'Data_Value'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Data_Value'</span><span class="p">]</span><span class="o">/</span><span class="mi">10</span> <span class="c1">#convert temperatures to °C
</span></code></pre></div></div>

<p>Next, we ensure that the Date values are parsed as <code class="language-plaintext highlighter-rouge">datetime64</code> values, and sort the entire DataFrame by date.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'Date'</span><span class="p">])</span>  <span class="c1">#convert to datetime (infer_datetime_format is deprecated since pandas 2.0)
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'Date'</span><span class="p">)</span>  <span class="c1">#sort DataFrame by date
</span></code></pre></div></div>
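<p>The two preprocessing steps above can be sketched on a few made-up rows laid out like the data file (the station codes and values are assumptions). Passing an explicit <code class="language-plaintext highlighter-rouge">format</code> is one way to make the date parsing unambiguous.</p>

```python
import pandas as pd

# Made-up observations in the same layout as the data file (tenths of °C)
sample = pd.DataFrame({
    "ID": ["A", "A", "B"],
    "Date": ["2005-01-02", "2005-01-01", "2005-01-01"],
    "Element": ["TMAX", "TMIN", "TMAX"],
    "Data_Value": [156, -11, 33],
})

sample["Data_Value"] = sample["Data_Value"] / 10  # tenths of °C to °C
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")  # explicit format
sample = sample.sort_values("Date")  # earliest observations first

print(sample.dtypes)
print(sample)
```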

<p>We use the <code class="language-plaintext highlighter-rouge">head()</code> method to get the first n rows (5 by default), and the <code class="language-plaintext highlighter-rouge">shape</code> attribute to get a tuple with the dimensions of the resulting DataFrame:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span> <span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">(),</span> <span class="s">"</span><span class="se">\n</span><span class="s"> The size of the DataFrame is"</span><span class="p">,</span> <span class="n">df</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                ID       Date Element  Data_Value
60995  USW00004848 2005-01-01    TMIN         0.0
17153  USC00207320 2005-01-01    TMAX        15.0
17155  USC00207320 2005-01-01    TMIN        -1.1
10079  USW00014833 2005-01-01    TMIN        -4.4
10073  USW00014833 2005-01-01    TMAX         3.3
 The size of the DataFrame is (165085, 4)
</code></pre></div></div>

<p>For clarity and ease, we are going to create two distinct DataFrames to hold TMAX and TMIN values.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dftmax</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'Element'</span><span class="p">]</span><span class="o">==</span><span class="s">'TMAX'</span><span class="p">]</span> <span class="c1">#get dataframe of only TMAX values
</span><span class="n">dftmin</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'Element'</span><span class="p">]</span><span class="o">==</span><span class="s">'TMIN'</span><span class="p">]</span> <span class="c1">#get dataframe of only TMIN values
</span></code></pre></div></div>

<p>To have a better overview of the entire data, I decided to keep all values, including those of 29 February in leap years. In this cell, we use <code class="language-plaintext highlighter-rouge">numpy.arange()</code> to get all 366 dates of a leap year (2008).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">observation_dates</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="s">'2008-01-01'</span><span class="p">,</span> <span class="s">'2009-01-01'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'datetime64[D]'</span><span class="p">)</span>
</code></pre></div></div>
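<p>A quick check, sketched below, confirms what this array contains. The <code class="language-plaintext highlighter-rouge">labels</code> name is introduced here only for illustration; it shows how the dates map to the month-day strings used as an index later on.</p>

```python
import numpy as np
import pandas as pd

# Every calendar day of a leap year; the stop date itself is excluded
observation_dates = np.arange('2008-01-01', '2009-01-01', dtype='datetime64[D]')

# Month-day labels matching the '%m-%d' index format used for the records
labels = pd.DatetimeIndex(observation_dates).strftime('%m-%d')

print(observation_dates.size)   # number of days in the leap year
print(observation_dates[59])    # position of 29 February
print(labels[59])
```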

<p>Now, we get the maximum temperature for each day in the period from 2005 to 2014. Remember that several values, from different stations, are registered for any given day in that period. The resulting Series <strong>dftmax14</strong> has a length of 3652 (one maximum for each day in the 10-year span, which includes two leap years).</p>

<p>Then, we’ll create a <code class="language-plaintext highlighter-rouge">pandas Series</code> holding the record TMAX for each day of a one-year span (<strong>tmax1y</strong>). For any single day (1 June for example), it holds the highest TMAX recorded on that day over the period 2005-2014.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#extract of TMAX of each day from 2005 to 2014
</span><span class="n">dftmax14</span> <span class="o">=</span> <span class="n">dftmax</span><span class="p">[</span><span class="n">dftmax</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span><span class="o">&lt;</span><span class="s">'2015-01-01'</span><span class="p">].</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Date'</span><span class="p">)[</span><span class="s">'Data_Value'</span><span class="p">].</span><span class="nb">max</span><span class="p">()</span>

<span class="c1">#set index to month-day format instead of year-m-d
</span><span class="n">dftmax14</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">dftmax14</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%m-%d'</span><span class="p">)</span>

<span class="c1">#record TMAX for each day of the year, over the data from 2005 to 2014
</span><span class="n">tmax1y</span> <span class="o">=</span> <span class="n">dftmax14</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">dftmax14</span><span class="p">.</span><span class="n">index</span><span class="p">).</span><span class="nb">max</span><span class="p">()</span>
</code></pre></div></div>
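<p>To make the two grouping steps above concrete, here is a minimal sketch on made-up readings (the dates and temperatures are assumptions; <code class="language-plaintext highlighter-rouge">daily_max</code> is a name used only in this example). The first grouping collapses stations, the second collapses years.</p>

```python
import pandas as pd

# Made-up TMAX readings: two stations on one day, then the same day a year later
dftmax = pd.DataFrame({
    "Date": pd.to_datetime(["2005-06-01", "2005-06-01", "2006-06-01", "2006-06-02"]),
    "Data_Value": [25.0, 27.2, 30.1, 22.4],
})

# Step 1: highest reading across stations for each calendar date
daily_max = dftmax.groupby("Date")["Data_Value"].max()

# Step 2: relabel by month-day, then take the highest value across years
daily_max.index = daily_max.index.strftime("%m-%d")
tmax1y = daily_max.groupby(daily_max.index).max()

print(tmax1y)
```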

<p>We repeat the same steps to get minimum temperatures for each day in the period from 2005 to 2014.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#extract of TMIN of each day from 2005 to 2014
</span><span class="n">dftmin14</span> <span class="o">=</span> <span class="n">dftmin</span><span class="p">[</span><span class="n">dftmin</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span><span class="o">&lt;</span><span class="s">'2015-01-01'</span><span class="p">].</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Date'</span><span class="p">)[</span><span class="s">'Data_Value'</span><span class="p">].</span><span class="nb">min</span><span class="p">()</span>

<span class="c1">#set index to month-day format instead of year-m-d
</span><span class="n">dftmin14</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">dftmin14</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%m-%d'</span><span class="p">)</span>

<span class="c1">#record TMIN for each day of the year, over the data from 2005 to 2014
</span><span class="n">tmin1y</span> <span class="o">=</span> <span class="n">dftmin14</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">dftmin14</span><span class="p">.</span><span class="n">index</span><span class="p">).</span><span class="nb">min</span><span class="p">()</span>
</code></pre></div></div>

<p>Similarly, we create a Series of the daily TMAX values in 2015 (<strong>dftmax15</strong>). Since 2015 is not a leap year, we add an entry for 29 February equal to the 2005-2014 record, so that both Series cover the same 366 days and this placeholder can never count as a broken record. <br />
Then, we extract the dates on which the 2015 TMAX was higher than the 2005-2014 record for that day (<strong>observation_dates_tmax15</strong>). <br />
Finally, we update <strong>dftmax15</strong> to keep only those record-breaking TMAX values.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#extract of TMAX of each day in 2015
</span><span class="n">dftmax15</span> <span class="o">=</span> <span class="n">dftmax</span><span class="p">[</span><span class="n">dftmax</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="s">'2015-01-01'</span><span class="p">].</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Date'</span><span class="p">)[</span><span class="s">'Data_Value'</span><span class="p">].</span><span class="nb">max</span><span class="p">()</span>
<span class="c1">#set index to month-day format instead of year-m-d
</span><span class="n">dftmax15</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">dftmax15</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%m-%d'</span><span class="p">)</span>

<span class="n">dftmax15</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'02-29'</span><span class="p">]</span> <span class="o">=</span> <span class="n">tmax1y</span><span class="p">[</span><span class="s">'02-29'</span><span class="p">]</span>
<span class="n">dftmax15</span><span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">observation_dates_tmax15</span> <span class="o">=</span> <span class="n">observation_dates</span><span class="p">[</span><span class="n">dftmax15</span><span class="p">.</span><span class="n">values</span> <span class="o">&gt;</span> <span class="n">tmax1y</span><span class="p">.</span><span class="n">values</span><span class="p">]</span>

<span class="n">dftmax15</span> <span class="o">=</span> <span class="n">dftmax15</span><span class="p">[</span><span class="n">dftmax15</span><span class="p">.</span><span class="n">values</span> <span class="o">&gt;</span> <span class="n">tmax1y</span><span class="p">.</span><span class="n">values</span><span class="p">]</span>
</code></pre></div></div>

<p>We can check the size of <strong>dftmax15</strong>, and infer that on 37 days in 2015 the TMAX exceeded the highest value registered for that day from 2005 to 2014.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dftmax15</span><span class="p">.</span><span class="n">size</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>37
</code></pre></div></div>
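<p>The comparison above works on <code class="language-plaintext highlighter-rouge">.values</code>, so it relies on both Series having the same length and order. As a possible alternative, the sketch below compares by index label instead; the temperatures and the names <code class="language-plaintext highlighter-rouge">tmax2015</code>, <code class="language-plaintext highlighter-rouge">aligned</code> and <code class="language-plaintext highlighter-rouge">broken</code> are assumptions made for illustration.</p>

```python
import pandas as pd

# Made-up ten-year record highs and 2015 daily highs, indexed by month-day
tmax1y = pd.Series({"02-28": 10.0, "02-29": 12.0, "03-01": 11.0})
tmax2015 = pd.Series({"02-28": 13.5, "03-01": 9.0})  # no 29 February in 2015

# Align 2015 to the record index by label; the missing day becomes NaN
aligned = tmax2015.reindex(tmax1y.index)

# A NaN compares as False, so the missing day never counts as a broken record
broken = aligned[aligned.gt(tmax1y)]
print(broken)
```

<p>With this approach, no placeholder value for 29 February is needed, at the cost of one extra <code class="language-plaintext highlighter-rouge">reindex</code> call.</p>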

<p>We repeat the same steps to get a Series <strong>dftmin15</strong> of the record-breaking TMIN values.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#extract of TMIN of each day in 2015
</span><span class="n">dftmin15</span> <span class="o">=</span> <span class="n">dftmin</span><span class="p">[</span><span class="n">dftmin</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="s">'2015-01-01'</span><span class="p">].</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Date'</span><span class="p">)[</span><span class="s">'Data_Value'</span><span class="p">].</span><span class="nb">min</span><span class="p">()</span>
<span class="c1">#set index to month-day format instead of year-m-d
</span><span class="n">dftmin15</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">dftmin15</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%m-%d'</span><span class="p">)</span>

<span class="n">dftmin15</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'02-29'</span><span class="p">]</span> <span class="o">=</span> <span class="n">tmin1y</span><span class="p">[</span><span class="s">'02-29'</span><span class="p">]</span>
<span class="n">dftmin15</span><span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">observation_dates_tmin15</span> <span class="o">=</span> <span class="n">observation_dates</span><span class="p">[</span><span class="n">dftmin15</span><span class="p">.</span><span class="n">values</span> <span class="o">&lt;</span> <span class="n">tmin1y</span><span class="p">.</span><span class="n">values</span><span class="p">]</span>

<span class="n">dftmin15</span> <span class="o">=</span> <span class="n">dftmin15</span><span class="p">[</span><span class="n">dftmin15</span><span class="p">.</span><span class="n">values</span> <span class="o">&lt;</span> <span class="n">tmin1y</span><span class="p">.</span><span class="n">values</span><span class="p">]</span>
</code></pre></div></div>

<p>By checking the size of <strong>dftmin15</strong>, we infer that on 32 days in 2015 the TMIN fell below the lowest value registered for that day from 2005 to 2014.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dftmin15</span><span class="p">.</span><span class="n">size</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>32
</code></pre></div></div>

<h2 id="plotting-and-styling-using-matplotlib-and-seaborn">Plotting and styling using matplotlib and seaborn</h2>
<p>In this cell, we use <code class="language-plaintext highlighter-rouge">matplotlib</code> and <code class="language-plaintext highlighter-rouge">seaborn</code> to create a figure, plot line charts of the record high and record low temperatures by day of the year over the period 2005-2014, and overlay a scatter of the 2015 points (highs and lows) that broke the ten-year record (2005-2014). <br />
I aimed for a clean visual, with appropriate legends and labels, and reduced chart junk.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">months</span> <span class="o">=</span> <span class="n">MonthLocator</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">13</span><span class="p">),</span> <span class="n">bymonthday</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">interval</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">monthsFmt</span> <span class="o">=</span> <span class="n">DateFormatter</span><span class="p">(</span><span class="s">"%b"</span><span class="p">)</span>

<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span><span class="s">"white"</span><span class="p">)</span> <span class="c1"># Set the aesthetic style of the plot
</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span> <span class="c1"># Create a new figure with determined figsize parameter
</span><span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">()</span> <span class="c1"># Get the current Axes instance on the current figure
</span>
<span class="c1"># Plots and Scatter plots of y vs. x with determined parameters
</span><span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">observation_dates_tmax15</span> <span class="p">,</span> <span class="n">dftmax15</span><span class="p">.</span><span class="n">values</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">35</span><span class="p">,</span>
           <span class="n">c</span><span class="o">=</span><span class="n">sns</span><span class="p">.</span><span class="n">dark_palette</span><span class="p">(</span><span class="s">"purple"</span><span class="p">,</span> <span class="n">n_colors</span><span class="o">=</span><span class="mi">37</span><span class="p">),</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'2005-2014 record high broken'</span> <span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">observation_dates_tmin15</span> <span class="p">,</span> <span class="n">dftmin15</span><span class="p">.</span><span class="n">values</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span>
           <span class="n">c</span><span class="o">=</span><span class="n">sns</span><span class="p">.</span><span class="n">dark_palette</span><span class="p">((</span><span class="mi">260</span><span class="p">,</span> <span class="mi">75</span><span class="p">,</span> <span class="mi">60</span><span class="p">),</span> <span class="nb">input</span><span class="o">=</span><span class="s">"husl"</span><span class="p">,</span> <span class="n">n_colors</span><span class="o">=</span><span class="mi">32</span><span class="p">),</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'2005-2014 record low broken'</span> <span class="p">)</span>

<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">observation_dates</span><span class="p">,</span> <span class="n">tmax1y</span><span class="p">.</span><span class="n">values</span><span class="p">,</span> <span class="s">'-'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"Reds"</span><span class="p">)[</span><span class="o">-</span><span class="mi">2</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">'High'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">observation_dates</span><span class="p">,</span> <span class="n">tmin1y</span><span class="p">.</span><span class="n">values</span><span class="p">,</span> <span class="s">'-'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"Blues"</span><span class="p">)[</span><span class="o">-</span><span class="mi">2</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">'Low'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Months'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Degrees (C)'</span><span class="p">)</span>

<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Highest and lowest temperatures by day of the year over the period 2005-2014 </span><span class="se">\n</span><span class="s"> and records broken in 2015 for Ann Arbor, Michigan, United States'</span><span class="p">)</span>

<span class="n">ax</span><span class="p">.</span><span class="n">xaxis</span><span class="p">.</span><span class="n">set_major_locator</span><span class="p">(</span><span class="n">months</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">xaxis</span><span class="p">.</span><span class="n">set_major_formatter</span><span class="p">(</span><span class="n">monthsFmt</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">yticks</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">despine</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="n">maxv</span><span class="o">=</span><span class="n">tmax1y</span><span class="p">.</span><span class="n">values</span>
<span class="n">minv</span><span class="o">=</span><span class="n">tmin1y</span><span class="p">.</span><span class="n">values</span>
<span class="c1">#To shade the area between the record high and record low temperatures for each day
</span><span class="n">ax</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">observation_dates</span><span class="p">,</span><span class="n">maxv</span><span class="p">,</span> <span class="n">minv</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="n">sns</span><span class="p">.</span><span class="n">light_palette</span><span class="p">(</span><span class="s">"lightgrey"</span><span class="p">)[</span><span class="mi">4</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>

<span class="c1">#fig.savefig("temp.png", dpi=1200) #uncomment to save the figure in png format (reduce dpi to get smaller files)
</span></code></pre></div></div>

<p><img src="/assets/temperatures/output_27_1.png" alt="png showing highest and lowest temperatures by day of the year over the period 2005-2014 and records broken in 2015 for Ann Arbor, Michigan, United States" /></p>
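<p>For readers who want to try the axis formatting without the dataset, here is a stripped-down sketch of the same idea using matplotlib only. The temperature curves are synthetic (a cosine, not real records), and the colors and the <code class="language-plaintext highlighter-rouge">Agg</code> backend are choices made for this example.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.dates import DateFormatter, MonthLocator

# Synthetic record highs and lows for each day of a leap year (not real data)
dates = np.arange('2008-01-01', '2009-01-01', dtype='datetime64[D]')
day = np.arange(dates.size)
highs = 20 - 15 * np.cos(2 * np.pi * day / dates.size)
lows = highs - 20

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(dates, highs, '-', color='firebrick', label='High', linewidth=1)
ax.plot(dates, lows, '-', color='steelblue', label='Low', linewidth=1)

# Shade the band between the two curves
ax.fill_between(dates, highs, lows, facecolor='lightgrey', alpha=0.2)

# One tick on the first day of each month, labelled with the month abbreviation
ax.xaxis.set_major_locator(MonthLocator(bymonthday=1))
ax.xaxis.set_major_formatter(DateFormatter("%b"))

ax.set_xlabel('Months')
ax.set_ylabel('Degrees (C)')
ax.legend()
```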

<hr />
<p>This is not the only way to represent the data: you may tweak the different parameters and experiment more with the plotting libraries. You can also explore other questions, and have fun with the data and figures.</p>

<p>Furthermore, there are several ways to handle the data and the various DataFrames. I tried to be more explicit and went for simplicity, all whilst incorporating the <a href="https://www.python.org/dev/peps/pep-0020/#id2">Zen of Python</a>.</p>

<p>If you want to learn more about data science through the Python programming language, I highly recommend the <a href="https://www.coursera.org/specializations/data-science-python">Applied Data Science with Python Specialization</a> on <strong>Coursera</strong>.</p>]]></content><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><category term="python" /><category term="tutorial" /><category term="python" /><category term="pandas" /><summary type="html"><![CDATA[Plotting line charts and a scatter plot of daily temperature records over the period 2005-2014, using pandas and matplotlib]]></summary></entry><entry><title type="html">Analysing Olympic Games medal table</title><link href="https://meherbejaoui.com/python/Olympic-games-analyses/" rel="alternate" type="text/html" title="Analysing Olympic Games medal table" /><published>2021-03-18T00:00:00+01:00</published><updated>2021-03-18T00:00:00+01:00</updated><id>https://meherbejaoui.com/python/Olympic-games-analyses</id><content type="html" xml:base="https://meherbejaoui.com/python/Olympic-games-analyses/"><![CDATA[<hr />
<ul>
  <li><a href="#introduction">Introduction &amp; preprocessing</a></li>
  <li><a href="#stacked-bar-chart">Stacked bar chart</a></li>
  <li><a href="#bubble-chart">Bubble chart</a></li>
</ul>

<hr />

<h3 id="introduction">Introduction</h3>

<p>The following <a href="https://github.com/meherbejaoui/meherbejaoui.github.io/blob/master/assets/olympics/OlympicGamesAnalyses.ipynb">Jupyter Notebook</a> uses the <a href="https://github.com/meherbejaoui/meherbejaoui.github.io/blob/master/assets/olympics/olympic_games_medal_table.csv">Olympic games medal dataset</a>, which was derived from the Wikipedia entry on <a href="https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table">All Time Olympic Games Medals</a>, as of the 2016 Summer Olympics and 2018 Winter Olympics. All changes in medal standings due to doping cases and medal redistributions up to and including 25 November 2020 are taken into account.<br />
<em>Data queried on March 18th, 2021.</em></p>

<hr />

<p>Using <strong><code class="language-plaintext highlighter-rouge">pandas</code></strong>, read the CSV file containing the dataset into a DataFrame. Then clean and preprocess it to obtain a more readable and practical DataFrame for later use.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"olympic_games_medal_table.csv"</span><span class="p">,</span><span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span><span class="n">skiprows</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'latin_1'</span><span class="p">)</span>

<span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>    <span class="c1"># clean the labels of the raw data
</span>    <span class="k">if</span> <span class="n">col</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s">"01"</span><span class="p">:</span>
        <span class="n">df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="n">col</span><span class="p">:</span><span class="s">"Gold"</span><span class="o">+</span><span class="n">col</span><span class="p">[</span><span class="mi">4</span><span class="p">:]},</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">col</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s">"02"</span><span class="p">:</span>
        <span class="n">df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="n">col</span><span class="p">:</span><span class="s">"Silver"</span><span class="o">+</span><span class="n">col</span><span class="p">[</span><span class="mi">4</span><span class="p">:]},</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">col</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="s">"03"</span><span class="p">:</span>
        <span class="n">df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="n">col</span><span class="p">:</span><span class="s">"Bronze"</span><span class="o">+</span><span class="n">col</span><span class="p">[</span><span class="mi">4</span><span class="p">:]},</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">col</span><span class="p">[:</span><span class="mi">1</span><span class="p">]</span><span class="o">==</span><span class="s">"№"</span><span class="p">:</span>
        <span class="n">df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="n">col</span><span class="p">:</span><span class="s">"#"</span><span class="o">+</span><span class="n">col</span><span class="p">[</span><span class="mi">1</span><span class="p">:]},</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">names_ids</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s">'\s\('</span><span class="p">)</span>    <span class="c1"># split the index on ' (' (the raw string keeps the regex escapes intact)
</span>
<span class="n">df</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">names_ids</span><span class="p">.</span><span class="nb">str</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>    <span class="c1"># the [0] element is the country name (will be the new index)
</span>
<span class="n">df</span><span class="p">[</span><span class="s">'ID'</span><span class="p">]</span> <span class="o">=</span> <span class="n">names_ids</span><span class="p">.</span><span class="nb">str</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="nb">str</span><span class="p">[:</span><span class="mi">3</span><span class="p">]</span>    <span class="c1"># the [1] element is the abbreviation or ID (take first 3 characters)
</span>
<span class="n">spare_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'Totals'</span><span class="p">)</span>    <span class="c1"># remove the row with label 'Totals'
</span></code></pre></div></div>
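
<p>To see what the cleaning step does without the CSV file at hand, here is a small sketch on a toy table. The column names, country rows and numbers are invented for illustration, and the helper <code class="language-plaintext highlighter-rouge">clean_label</code> is a hypothetical name that is not part of the notebook. It passes a function to a single <code class="language-plaintext highlighter-rouge">rename</code> call instead of renaming inside a loop, which is one possible variation on the cell above.</p>

```python
import pandas as pd

# toy table; names and numbers are invented for illustration
raw = pd.DataFrame(
    {"№ Summer": [14, 2], "01 !": [0, 5], "02 !": [0, 2], "03 !": [2, 8]},
    index=["Afghanistan (AFG)", "Algeria (ALG)"],
)

prefixes = {"01": "Gold", "02": "Silver", "03": "Bronze"}


def clean_label(col):
    """Return a readable label for one raw column name (hypothetical helper)."""
    if col[:2] in prefixes:
        return prefixes[col[:2]] + col[4:]
    if col.startswith("№"):
        return "#" + col[1:]
    return col


clean = raw.rename(columns=clean_label)  # rename also accepts a function

# split "Name (ABC)" into the name and the three-letter code
parts = clean.index.str.split(r"\s\(")
clean.index = parts.str[0]
clean["ID"] = parts.str[1].str[:3]

print(clean)
```

<p>The result has the columns <code class="language-plaintext highlighter-rouge"># Summer</code>, <code class="language-plaintext highlighter-rouge">Gold</code>, <code class="language-plaintext highlighter-rouge">Silver</code>, <code class="language-plaintext highlighter-rouge">Bronze</code> and <code class="language-plaintext highlighter-rouge">ID</code>, with the country names as the index.</p>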

<hr />
<p>Get the ten countries with the most medals across the Summer and Winter Games combined.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">most_medals</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Combined total'</span><span class="p">].</span><span class="n">nlargest</span><span class="p">(</span><span class="mi">10</span><span class="p">).</span><span class="n">index</span>
</code></pre></div></div>
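
<p>As a quick illustration of how <code class="language-plaintext highlighter-rouge">nlargest</code> behaves, the sketch below applies it to a tiny Series. The labels and totals are invented and only stand in for the <code class="language-plaintext highlighter-rouge">'Combined total'</code> column.</p>

```python
import pandas as pd

# invented totals, for illustration only
totals = pd.Series({"A": 12, "B": 40, "C": 7, "D": 25})

top_two = totals.nlargest(2)   # Series sorted from largest to smallest
top_labels = top_two.index     # the labels alone, as stored in most_medals

print(list(top_labels))
```

<p>Keeping only the index is convenient because the same labels can then be passed to <code class="language-plaintext highlighter-rouge">df.loc</code> to select any other column for those rows.</p>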

<hr />
<h3 id="stacked-bar-chart">Stacked bar chart</h3>
<p>In this cell, we use <a href="http://matplotlib.org/"><strong><code class="language-plaintext highlighter-rouge">matplotlib</code></strong></a> to draw a stacked bar chart of the top ten countries by total number of medals, split between the Summer and Winter Games.</p>

<hr />

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>

<span class="c1"># get necessary pandas series
</span><span class="n">summer_medals</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">most_medals</span><span class="p">,</span> <span class="p">[</span><span class="s">'Total'</span><span class="p">]][</span><span class="s">'Total'</span><span class="p">]</span>
<span class="n">winter_medals</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">most_medals</span><span class="p">,</span> <span class="p">[</span><span class="s">'Total.1'</span><span class="p">]][</span><span class="s">'Total.1'</span><span class="p">]</span>

<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span> <span class="c1"># create a figure object
</span>
<span class="c1"># context manager for temporary styling
</span><span class="k">with</span> <span class="n">plt</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">context</span><span class="p">((</span><span class="s">'seaborn-poster'</span><span class="p">,</span> <span class="p">{</span><span class="s">'xtick.labelsize'</span> <span class="p">:</span> <span class="mi">14</span><span class="p">,</span> <span class="s">'axes.labelpad'</span><span class="p">:</span><span class="mi">20</span> <span class="p">,</span> <span class="s">'axes.titlepad'</span> <span class="p">:</span> <span class="mi">20</span><span class="p">,</span>
                        <span class="s">'axes.spines.top'</span> <span class="p">:</span> <span class="bp">False</span><span class="p">,</span> <span class="s">'axes.spines.right'</span> <span class="p">:</span> <span class="bp">False</span><span class="p">,</span> <span class="s">'axes.spines.left'</span> <span class="p">:</span> <span class="bp">False</span><span class="p">}</span> <span class="p">)):</span>
    <span class="c1"># create the bar plots
</span>    <span class="n">bar_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">plt</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">most_medals</span><span class="p">)),</span> <span class="n">summer_medals</span><span class="p">,</span> <span class="n">width</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s">'#d69728'</span><span class="p">,</span>
                        <span class="n">tick_label</span> <span class="o">=</span> <span class="n">most_medals</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Summer'</span><span class="p">),</span>    
         <span class="n">plt</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">most_medals</span><span class="p">)),</span> <span class="n">winter_medals</span><span class="p">,</span> <span class="n">width</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="p">,</span> <span class="n">bottom</span> <span class="o">=</span> <span class="n">summer_medals</span> <span class="p">,</span>
                 <span class="n">tick_label</span> <span class="o">=</span> <span class="n">most_medals</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Winter'</span><span class="p">)]</span>
    <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">()</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Number of Medals'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Countries'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Number of total medals for the top ten countries'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">bottom</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">left</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">labelleft</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

    <span class="c1"># attach text label for each bar displaying its value
</span>    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">bar_list</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">bar</span> <span class="ow">in</span> <span class="n">i</span><span class="p">:</span>
            <span class="n">height</span> <span class="o">=</span> <span class="n">bar</span><span class="p">.</span><span class="n">get_height</span><span class="p">()</span>
            <span class="n">bottom</span> <span class="o">=</span> <span class="n">bar</span><span class="p">.</span><span class="n">get_y</span><span class="p">()</span>
            <span class="k">if</span> <span class="n">height</span> <span class="o">&lt;</span> <span class="mi">50</span><span class="p">:</span>
                <span class="n">y</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">height</span><span class="o">+</span><span class="n">bottom</span><span class="o">*</span><span class="mf">1.20</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">y</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">height</span><span class="o">+</span><span class="n">bottom</span>
            <span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">().</span><span class="n">text</span><span class="p">(</span><span class="n">bar</span><span class="p">.</span><span class="n">get_x</span><span class="p">()</span> <span class="o">+</span> <span class="n">bar</span><span class="p">.</span><span class="n">get_width</span><span class="p">()</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">height</span><span class="p">)),</span>
                 <span class="n">ha</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="c1"># save the plot as a PNG file; you can change the file format to PDF or any other supported extension (uncomment to use)
#fig.savefig("totalmedals.png", dpi=150)
</span></code></pre></div></div>

<p><img src="/assets/olympics/output_6_0.png" alt="png of stacked bar charts representing the top ten countries in terms of total number of medals in the winter and summer games" /></p>
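
<p>The stacking itself relies on the <code class="language-plaintext highlighter-rouge">bottom</code> argument of <code class="language-plaintext highlighter-rouge">bar</code>. The sketch below reproduces the idea on invented counts. Two assumptions are worth flagging: the <code class="language-plaintext highlighter-rouge">Agg</code> backend is used only so the code runs without a display, and, because Matplotlib 3.6 renamed the bundled seaborn styles with a <code class="language-plaintext highlighter-rouge">seaborn-v0_8-</code> prefix, the style name is looked up in <code class="language-plaintext highlighter-rouge">plt.style.available</code> rather than hard-coded.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

# invented counts, for illustration only
countries = ["A", "B", "C"]
summer = [30, 20, 10]
winter = [5, 8, 12]

# pick whichever poster style name this Matplotlib version ships, else fall back
candidates = ["seaborn-v0_8-poster", "seaborn-poster"]
style = next((s for s in candidates if s in plt.style.available), "default")

with plt.style.context(style):
    fig, ax = plt.subplots()
    x = range(len(countries))
    lower = ax.bar(x, summer, width=0.5, label="Summer", tick_label=countries)
    # bottom=summer makes each winter segment start where the summer one ends
    upper = ax.bar(x, winter, width=0.5, bottom=summer, label="Winter")

    # write each segment's value at its vertical centre
    for container in (lower, upper):
        for bar in container:
            y = bar.get_y() + bar.get_height() / 2
            ax.text(bar.get_x() + bar.get_width() / 2, y,
                    str(int(bar.get_height())), ha="center", va="center")
    ax.legend()
```

<p>Reading the position back with <code class="language-plaintext highlighter-rouge">get_y()</code> and <code class="language-plaintext highlighter-rouge">get_height()</code> means the label placement does not need to know which series a bar belongs to.</p>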

<hr />
<h3 id="bubble-chart">Bubble chart</h3>
<p>This bubble chart is one example of a visualization that helps in understanding the data. It plots the <em>adjusted gold medals</em> (total gold medals / total games) against the <em>rank</em> by total number of medals won.</p>

<p>The <strong>size</strong> of each bubble corresponds to the <em>adjusted total medals</em> (total medals / total games), and the <strong>color</strong> indicates the location (European or non-European) or the current status (red: no longer exists).</p>
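
<p>The two adjusted quantities are plain ratios, so they can be checked by hand. Here is a minimal sketch with two made-up countries. The column names and figures are hypothetical, and <code class="language-plaintext highlighter-rouge">Series.rank</code> is used here as a stand-in for the position-based rank of the notebook.</p>

```python
import pandas as pd

# invented figures, for illustration only
medals = pd.DataFrame(
    {"gold": [100, 60], "total": [250, 90], "games": [50, 10]},
    index=["Country A", "Country B"],
)

medals["adjusted_gold"] = (medals["gold"] / medals["games"]).round(1)    # y-axis
medals["adjusted_total"] = (medals["total"] / medals["games"]).round(1)  # bubble size
medals["rank"] = medals["total"].rank(ascending=False).astype(int)       # x-axis

print(medals)
```

<p>Country A ranks first by raw total, yet Country B comes out ahead on both adjusted values because it took part in far fewer games.</p>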

<hr />

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_medals</span> <span class="o">=</span> <span class="n">spare_df</span><span class="p">[</span><span class="s">'Combined total'</span><span class="p">].</span><span class="n">nlargest</span><span class="p">(</span><span class="mi">11</span><span class="p">).</span><span class="n">index</span>
<span class="n">bubble_df</span> <span class="o">=</span> <span class="n">spare_df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">top_medals</span><span class="p">].</span><span class="n">drop</span><span class="p">(</span><span class="s">'ID'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># dataframe for top 11 winners
</span>
<span class="c1"># to eliminate overlapping medals' count
</span><span class="n">sum_topcoun</span> <span class="o">=</span> <span class="n">bubble_df</span><span class="p">[</span><span class="mi">1</span><span class="p">:][</span><span class="n">bubble_df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="o">~</span><span class="n">bubble_df</span><span class="p">.</span><span class="n">columns</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'#'</span><span class="p">)]].</span><span class="nb">sum</span><span class="p">()</span>
<span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">bubble_df</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">col</span>  <span class="ow">in</span> <span class="n">sum_topcoun</span><span class="p">.</span><span class="n">index</span><span class="p">:</span>        
        <span class="n">bubble_df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'Totals'</span><span class="p">,</span> <span class="n">col</span><span class="p">]</span> <span class="o">-=</span> <span class="n">sum_topcoun</span><span class="p">[</span><span class="n">col</span><span class="p">]</span>

<span class="n">bubble_df</span> <span class="o">=</span> <span class="n">bubble_df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="p">{</span><span class="s">'Totals'</span><span class="p">:</span><span class="s">'Rest of the World'</span><span class="p">})</span>

<span class="c1"># create 2 new columns with their respective data
</span><span class="n">bubble_df</span><span class="p">[</span><span class="s">'adjusted_cgold'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">bubble_df</span><span class="p">[</span><span class="s">'Gold.2'</span><span class="p">].</span><span class="n">div</span><span class="p">(</span><span class="n">bubble_df</span><span class="p">[</span><span class="s">'# Combined Games'</span><span class="p">])).</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">(</span><span class="s">'%.1f'</span><span class="o">%</span><span class="n">x</span><span class="p">))</span>
<span class="n">bubble_df</span><span class="p">[</span><span class="s">'Rank'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">bubble_df</span><span class="p">.</span><span class="n">index</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># chart creation and styling
</span><span class="k">with</span> <span class="n">plt</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">context</span><span class="p">((</span><span class="s">'seaborn-poster'</span><span class="p">,</span> <span class="p">{</span><span class="s">'xtick.labelsize'</span> <span class="p">:</span><span class="mi">12</span><span class="p">,</span> <span class="s">'ytick.labelsize'</span><span class="p">:</span><span class="mi">12</span><span class="p">,</span><span class="s">'axes.labelpad'</span><span class="p">:</span><span class="mi">20</span> <span class="p">,</span>
                                           <span class="s">'axes.titlepad'</span> <span class="p">:</span> <span class="mi">20</span><span class="p">,</span><span class="s">'axes.labelsize'</span><span class="p">:</span><span class="mi">15</span><span class="p">}</span> <span class="p">)):</span>
    <span class="n">ax2</span> <span class="o">=</span> <span class="n">bubble_df</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'Rank'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'adjusted_cgold'</span><span class="p">,</span> <span class="n">kind</span><span class="o">=</span><span class="s">'scatter'</span><span class="p">,</span>
                    <span class="n">c</span><span class="o">=</span><span class="p">[</span><span class="s">'#e4aa1a'</span><span class="p">,</span><span class="s">'#377eb8'</span><span class="p">,</span><span class="s">'#e41a1c'</span><span class="p">,</span><span class="s">'#4daf4a'</span><span class="p">,</span><span class="s">'#4daf4a'</span><span class="p">,</span><span class="s">'#4daf4a'</span><span class="p">,</span><span class="s">'#4daf4a'</span><span class="p">,</span><span class="s">'#4daf4a'</span><span class="p">,</span>
                    <span class="s">'#377eb8'</span><span class="p">,</span><span class="s">'#4daf4a'</span><span class="p">,</span><span class="s">'#4daf4a'</span><span class="p">],</span> <span class="n">linewidths</span><span class="o">=</span><span class="mi">2</span> <span class="p">,</span>
                    <span class="n">xticks</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">bubble_df</span><span class="p">.</span><span class="n">index</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span>
                         <span class="n">s</span><span class="o">=</span><span class="p">(</span><span class="n">bubble_df</span><span class="p">[</span><span class="s">'Combined total'</span><span class="p">].</span><span class="n">div</span><span class="p">(</span><span class="n">bubble_df</span><span class="p">[</span><span class="s">'# Combined Games'</span><span class="p">]))</span><span class="o">*</span><span class="mi">100</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="p">.</span><span class="mi">55</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">[</span><span class="mi">15</span><span class="p">,</span><span class="mi">7</span><span class="p">])</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">65</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Adjusted total gold medals by total medals won'</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Adjusted gold medals'</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">txt</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">bubble_df</span><span class="p">.</span><span class="n">index</span><span class="p">):</span>    <span class="c1"># add labels inside each bubble
</span>        <span class="n">ax2</span><span class="p">.</span><span class="n">annotate</span><span class="p">(</span><span class="n">txt</span><span class="p">,</span> <span class="p">[</span><span class="n">bubble_df</span><span class="p">[</span><span class="s">'Rank'</span><span class="p">].</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">bubble_df</span><span class="p">[</span><span class="s">'adjusted_cgold'</span><span class="p">].</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">]],</span> <span class="n">ha</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span>

<span class="c1">#plt.savefig("cgold.png", dpi=150)
</span></code></pre></div></div>

<p><img src="/assets/olympics/output_8_0.png" alt="png showing adjusted gold medals versus the rank of countries with reference to the number of total medals won" /></p>
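
<p>The "Rest of the World" bubble comes from subtracting the listed countries from the grand-total row. The sketch below shows that step on an invented three-row table, using a single label-based assignment with <code class="language-plaintext highlighter-rouge">.loc</code> instead of a loop. The column names and numbers are assumptions made for the example.</p>

```python
import pandas as pd

# invented table: two countries plus a grand-total row
table = pd.DataFrame(
    {"Gold": [10, 4, 20], "Combined total": [30, 12, 60], "# Games": [5, 3, 8]},
    index=["Country A", "Country B", "Totals"],
)

count_cols = [c for c in table.columns if "#" not in c]  # skip the games counters
countries = table.drop(index="Totals")

# grand total minus the listed countries = everyone else
table.loc["Totals", count_cols] = (
    table.loc["Totals", count_cols] - countries[count_cols].sum()
)
table = table.rename(index={"Totals": "Rest of the World"})

print(table)
```

<p>Writing through one <code class="language-plaintext highlighter-rouge">.loc[row, columns]</code> call avoids chained indexing, which may silently operate on a copy in recent pandas versions.</p>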

<hr />
<p>This chart shows that the United States has the highest total number of medals across the Summer and Winter Games, as indicated by the x-axis (Rank). Norway has the fewest combined medals among the top ten.</p>

<p>Based on the <em>adjusted gold medals</em> shown on the y-axis, the Soviet Union won the most gold medals relative to the number of Olympic Games in which it participated, followed by the US and Russia.</p>

<p>The bubble sizes suggest that the three countries that won the <em>most medals</em> relative to the number of Games they took part in are, in order, the Soviet Union, the US and Russia.</p>

<p>To put this in perspective, compare France and China: the latter won <em>fewer</em> total medals overall (position on the x-axis). But taking into account the number of Olympic Games played, China <em>did</em> win more gold medals (position on the y-axis) and more total medals (bubble size).</p>]]></content><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><category term="python" /><category term="tutorial" /><category term="python" /><category term="pandas" /><summary type="html"><![CDATA[Analysis of Olympic Games medal table with pandas and matplotlib]]></summary></entry><entry><title type="html">Governance of extractive industry in Tunisia</title><link href="https://meherbejaoui.com/blog/Governance-of-extractive-industry-in-Tunisia/" rel="alternate" type="text/html" title="Governance of extractive industry in Tunisia" /><published>2021-03-03T00:00:00+01:00</published><updated>2021-03-03T00:00:00+01:00</updated><id>https://meherbejaoui.com/blog/Governance-of-extractive-industry-in-Tunisia</id><content type="html" xml:base="https://meherbejaoui.com/blog/Governance-of-extractive-industry-in-Tunisia/"><![CDATA[<p>Governance of the extractive industry is important to optimize resource utilization, and to ensure that the outcomes from natural resource exploitation contribute to the sustainable development of the country.</p>

<p>Achieving higher economic revenues and better social impacts at the same time is not a simple task. It may be impeded by practical and organizational obstacles that favour present gains over sustainable development.</p>

<p>Governance can be improved with the right legal, institutional and administrative measures, and by applying certain best practices. These should take the form of multiple reforms carried out judiciously.</p>

<p>The key question addressed by this report is: <em>how can we improve the governance of the extractive industry in <strong>Tunisia</strong>?</em></p>

<p>The report is organized as follows:</p>
<ul>
  <li>
    <p><strong>Chapter 1</strong> addresses the legal framework governing the extractive industry in Tunisia. An analysis of the different legal texts and their specific details helps to contextualize the status quo and put it into perspective.</p>
  </li>
  <li>
    <p><strong>Chapter 2</strong> addresses the institutional and organizational frameworks of the sector. It presents the main entities that shape public interventions and strategies, and analyses the governance of the sector from this perspective.</p>
  </li>
  <li>
    <p><strong>Chapter 3</strong> gives a broad but important overview of the current health of the sector in Tunisia, and showcases the importance of natural resources and the need for better governance, with a focus on sustainable development.</p>
  </li>
  <li>
    <p><strong>Chapter 4</strong> deals with the challenges and opportunities in line with the sustainable development of the sector and the country. It uses different approaches and tools to enhance the structures and regulations, and the governance of the whole sector along the decision chain.</p>
  </li>
</ul>

<p><em>This publication was written and submitted as part of my graduation requirements at the National School of Administration of Tunis in 2020.</em></p>

<hr />

<iframe src="/assets/governanceinTunisia.pdf" alt="pdf report titled Governance of extractive industry in Tunisia" width="800" height="750"></iframe>]]></content><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><category term="blog" /><category term="governance" /><summary type="html"><![CDATA[Thesis report about Governance of extractive industry in Tunisia by Meher Béjaoui]]></summary></entry><entry><title type="html">Python Basics</title><link href="https://meherbejaoui.com/python/bases-de-python/" rel="alternate" type="text/html" title="Python Basics" /><published>2018-08-12T00:00:00+01:00</published><updated>2018-08-12T00:00:00+01:00</updated><id>https://meherbejaoui.com/python/bases-de-python</id><content type="html" xml:base="https://meherbejaoui.com/python/bases-de-python/"><![CDATA[<hr />
<h2 id="sommaire">Sommaire</h2>

<ul>
  <li><a href="#utiliser-python-comme-une-calculatrice">Introduction to Python</a></li>
  <li><a href="#contrôle-du-flux">Control flow</a></li>
  <li><a href="#structures-de-données">Data structures</a></li>
  <li><a href="#modules">Modules</a></li>
</ul>

<hr />
<h3 id="utiliser-python-comme-une-calculatrice">Using Python as a calculator</h3>
<p>After <a href="https://docs.python.org/3/using/windows.html#installing-python">installing Python</a>, open a <em>console</em> or a <em>command prompt</em> and type <code class="language-plaintext highlighter-rouge">python</code> to start a Python interpreter.
For now, we use this <a href="/assets/IntroductionPython.ipynb">notebook</a>.<br />
Let’s try a few simple Python commands:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">2</span><span class="o">+</span><span class="mi">3</span> <span class="c1"># A comment
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">20</span> <span class="o">+</span> <span class="mi">11</span> <span class="o">*</span> <span class="mi">3</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="mi">20</span> <span class="o">-</span> <span class="mi">11</span><span class="p">)</span> <span class="o">/</span> <span class="mi">4</span>
</code></pre></div></div>

<p>Integers (such as 2, 3 and 11) have type <strong><code class="language-plaintext highlighter-rouge">int</code></strong>, while numbers with a fractional part (such as 10.0 and 3.14) have type <strong><code class="language-plaintext highlighter-rouge">float</code></strong>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">19</span> <span class="o">/</span> <span class="mi">3</span> <span class="c1"># Division (/) always returns a 'float'
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">19</span> <span class="o">//</span> <span class="mi">3</span> <span class="c1"># The (//) operator performs floor division
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">19</span> <span class="o">%</span> <span class="mi">3</span>  <span class="c1"># The (%) operator returns the remainder of the floor division
</span></code></pre></div></div>
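
<p>As a small sketch tying these three operators together (the variable names are illustrative and do not come from the original notebook), the quotient and the remainder always recombine into the original number:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a, b = 19, 3
quotient = a // b        # floor division
remainder = a % b        # remainder of the floor division
# The two results recombine into the dividend.
check = quotient * b + remainder == a
print(quotient, remainder, check)   # 6 1 True
# The built-in divmod() returns both values at once as a tuple.
print(divmod(a, b))                 # (6, 1)
</code></pre></div></div>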

<p>Powers (x<sup>y</sup>) can be computed with the <code class="language-plaintext highlighter-rouge">**</code> operator:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">3</span> <span class="o">**</span> <span class="mi">2</span>
<span class="c1"># 11 ** 12. # operations with mixed operand types give a floating-point result
</span></code></pre></div></div>
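
<p>A quick way to check the rule about mixed operand types is to look at the type of each result. This is only an illustrative sketch; the variable names are assumptions made for the example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int_power = 3 ** 2         # int ** int gives an int
mixed_power = 2 ** 3.0     # int ** float gives a float
mixed_sum = 20 + 1.5       # int + float gives a float
print(type(int_power).__name__, type(mixed_power).__name__, type(mixed_sum).__name__)
</code></pre></div></div>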

<p>The equals sign ( <code class="language-plaintext highlighter-rouge">=</code> ) assigns a value to a variable:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hauteur</span> <span class="o">=</span> <span class="mi">7</span>
<span class="n">base</span> <span class="o">=</span> <span class="mi">9</span>
<span class="n">aire_du_triangle</span> <span class="o">=</span> <span class="p">(</span><span class="n">hauteur</span> <span class="o">*</span> <span class="n">base</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span>
<span class="n">aire_du_triangl</span>    <span class="c1"># let's talk about errors: this misspelled name raises a NameError
</span></code></pre></div></div>

<hr />
<h3 id="les-chaînes-de-caractères">Strings</h3>
<p>Strings can be written in several ways:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">'Bonjour'</span> <span class="c1"># single quotes
</span><span class="s">"Hello"</span> <span class="c1"># double quotes
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">'de l</span><span class="se">\'</span><span class="s">art'</span> <span class="c1"># use \' to escape the single quote
# "de l'art" # or use double quotes instead
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">'"Le petit chat est mort. ", dit Agnès'</span>
<span class="c1"># "\"Le petit chat est mort. \", dit Agnès"
</span></code></pre></div></div>

<p>The <strong><code class="language-plaintext highlighter-rouge">print()</code></strong> function displays strings in a more readable form: it omits the enclosing quotes and interprets escape sequences.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">'"L</span><span class="se">\'</span><span class="s">art !" dit Michel.'</span>
<span class="c1"># print('"L\'art !" dit Michel.')
# p = 'Première ligne.\nDeuxième ligne.' # \n means newline
# p
# print(p)
</span></code></pre></div></div>

<p>Use <code class="language-plaintext highlighter-rouge">raw strings</code>, written by prefixing the string with an <strong><code class="language-plaintext highlighter-rouge">r</code></strong>, to prevent characters preceded by a backslash from being interpreted as special characters.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">'C:\Documents</span><span class="se">\n</span><span class="s">om'</span><span class="p">)</span>
<span class="c1"># print(r'C:\Documents\nom') # note the (r) before the opening quote
</span></code></pre></div></div>
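
<p>To see the difference without relying on the printed output, the two forms can be compared directly. This is a minimal sketch; the names <code class="language-plaintext highlighter-rouge">plain</code> and <code class="language-plaintext highlighter-rouge">raw</code> are illustrative, and the first backslash of the plain string is doubled so that only <code class="language-plaintext highlighter-rouge">\n</code> is interpreted:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plain = 'C:\\Documents\nom'    # \n is read as a newline character
raw = r'C:\Documents\nom'      # every backslash is kept literally
print('\n' in plain, '\n' in raw)   # True False
print(len(plain), len(raw))         # 15 16
</code></pre></div></div>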

<p>Use triple quotes, <code class="language-plaintext highlighter-rouge">'''abc'''</code> or <code class="language-plaintext highlighter-rouge">"""xyz"""</code>, to write strings that span several lines.</p>

<p>End a line with <strong><code class="language-plaintext highlighter-rouge">\</code></strong> to keep its line break out of the string:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"""</span><span class="se">\
</span><span class="s">Definitions:
     -dictionary             A data structure that maps keys to values
     -function               A sequence of statements that returns a value to its caller
"""</span><span class="p">)</span>
</code></pre></div></div>

<p>The <strong><code class="language-plaintext highlighter-rouge">+</code></strong> operator joins (concatenates) strings, and the <strong><code class="language-plaintext highlighter-rouge">*</code></strong> operator repeats them.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">'OUI '</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="s">'Hurrah!'</span>
<span class="c1"># prefix = 'Hello Wo'
# prefix + 'rld!'
</span></code></pre></div></div>

<p>Characters can be accessed by their position (index). The first character of a string is at position <strong>0</strong>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">phrase</span> <span class="o">=</span> <span class="s">"Je m'appelle Brian"</span>
<span class="n">phrase</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># phrase[15]
</span></code></pre></div></div>

<p>To count from the right, use negative indices, which start at <strong>-1</strong>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">phrase</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="c1"># phrase[-3]
# phrase[-18]
</span></code></pre></div></div>

<p>To get a substring, use a slice:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">phrase</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span> <span class="c1"># characters from position 0 (included) to 2 (excluded)
# phrase[13:18]
# phrase[:12]
# phrase[12:] # s[:i] + s[i:] == s
# phrase[-5:]
# phrase[20] # index too large (out of range): raises an IndexError
# phrase[5:20] # out-of-range indices are handled silently in slices
</span></code></pre></div></div>
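
<p>The two slicing properties mentioned in the comments above can be checked with a short sketch. It reuses the <code class="language-plaintext highlighter-rouge">phrase</code> string; the names <code class="language-plaintext highlighter-rouge">rebuilt</code> and <code class="language-plaintext highlighter-rouge">tail</code> are illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>phrase = "Je m'appelle Brian"
i = 12
# A prefix slice and the matching suffix slice rebuild the string.
rebuilt = phrase[:i] + phrase[i:]
print(rebuilt == phrase)    # True
# An out-of-range bound in a slice is clamped instead of raising an error.
tail = phrase[13:100]
print(tail)                 # Brian
</code></pre></div></div>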

<p>Strings are <code class="language-plaintext highlighter-rouge">immutable</code>: they cannot be modified in place.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">phrase</span> <span class="o">=</span> <span class="s">"Je m'appelle Brian"</span>
<span class="n">phrase</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="s">'j'</span>
<span class="c1"># phrase[13:] = 'Stewie'
# phrase[:13] + 'Stewie !'
</span></code></pre></div></div>
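
<p>The assignment above fails with a <code class="language-plaintext highlighter-rouge">TypeError</code>. A minimal sketch of how that error can be caught, and of the usual workaround of building a new string (the names <code class="language-plaintext highlighter-rouge">message</code> and <code class="language-plaintext highlighter-rouge">new_phrase</code> are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>phrase = "Je m'appelle Brian"
try:
    phrase[1] = 'j'              # item assignment is not allowed on a str
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# Build a new string instead of modifying the old one.
new_phrase = phrase[:13] + 'Stewie !'
print(new_phrase)                # Je m'appelle Stewie !
</code></pre></div></div>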

<p>The <strong><code class="language-plaintext highlighter-rouge">len()</code></strong> function returns the length of a string:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p</span> <span class="o">=</span> <span class="s">'I have a dream'</span>
<span class="nb">len</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
</code></pre></div></div>

<hr />
<h3 id="les-listes">Lists</h3>
<p>A list is a sequence of comma-separated items enclosed in square brackets. The items of a list do not have to be of the same type.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">premiers</span>  <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">11</span><span class="p">]</span>
<span class="n">premiers</span>
</code></pre></div></div>

<p>Lists can be indexed and sliced:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">premiers</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># premiers[-1]
# premiers[2:10]
</span></code></pre></div></div>

<p>Slice operations return a new list containing the requested items.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">premiers</span><span class="p">[:]</span> <span class="c1"># a (shallow) copy of the list
</span></code></pre></div></div>
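
<p>The following sketch illustrates why the copy matters: a slice creates a new list, whereas a plain assignment only gives a second name to the same list. The names <code class="language-plaintext highlighter-rouge">copie</code> and <code class="language-plaintext highlighter-rouge">alias</code> are illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>premiers = [2, 3, 5, 7, 11]
copie = premiers[:]      # new list with the same items
alias = premiers         # second name for the same list
copie.append(13)         # only the copy changes
alias.append(17)         # the original changes too
print(premiers)          # [2, 3, 5, 7, 11, 17]
print(copie)             # [2, 3, 5, 7, 11, 13]
</code></pre></div></div>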

<p>Lists support the same operations as strings, such as concatenation and repetition.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">premiers</span> <span class="o">+</span> <span class="p">[</span><span class="mi">13</span><span class="p">,</span> <span class="mi">17</span><span class="p">,</span> <span class="mi">19</span><span class="p">]</span>
<span class="c1"># premiers * 2
</span></code></pre></div></div>

<p>The contents of a list can be changed: lists are <code class="language-plaintext highlighter-rouge">mutable</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nbres</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="mi">4</span><span class="p">]</span>
<span class="c1"># nbres[2] = 3
</span><span class="n">nbres</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nbres</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> <span class="c1"># method that adds an item to the end of the list
</span><span class="n">nbres</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nbres</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">]</span> <span class="c1"># slice assignment
# nbres[:] = [] # remove all the items
</span><span class="n">nbres</span>
</code></pre></div></div>

<p>Lists can contain other lists:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="p">[</span><span class="s">'f'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">]</span>
<span class="n">res</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span>
<span class="c1"># res
# res[1]
# res[0][2]
</span></code></pre></div></div>

<p>A <code class="language-plaintext highlighter-rouge">list comprehension</code> builds a new list in which each item is the result of an operation applied to each item of another sequence, or creates a subsequence of the items that satisfy a given condition.<br />
It consists of square brackets containing an expression followed by a <strong><code class="language-plaintext highlighter-rouge">for</code></strong> clause, then zero or more <strong><code class="language-plaintext highlighter-rouge">for</code></strong> or <strong><code class="language-plaintext highlighter-rouge">if</code></strong> clauses.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span> <span class="k">if</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span> <span class="k">if</span> <span class="n">x</span> <span class="o">!=</span> <span class="n">y</span><span class="p">]</span>
</code></pre></div></div>
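
<p>Each comprehension above can be read as a shorthand for explicit loops. The sketch below spells out that equivalence; the names <code class="language-plaintext highlighter-rouge">evens</code>, <code class="language-plaintext highlighter-rouge">pairs</code> and their <code class="language-plaintext highlighter-rouge">_loop</code> counterparts are illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Comprehension form, as in the first example above.
evens = [x for x in range(6) if x % 2 == 0]

# Equivalent explicit loop.
evens_loop = []
for x in range(6):
    if x % 2 == 0:
        evens_loop.append(x)

# Two for clauses behave like nested loops, outermost first.
pairs = [(x, y) for x in [1, 2, 3] for y in [3, 1] if x != y]
pairs_loop = []
for x in [1, 2, 3]:
    for y in [3, 1]:
        if x != y:
            pairs_loop.append((x, y))

print(evens == evens_loop, pairs == pairs_loop)   # True True
</code></pre></div></div>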

<hr />
<h3 id="contrôle-du-flux">Control flow</h3>
<h4 id="linstruction-if-">The <code class="language-plaintext highlighter-rouge">if</code> statement:</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="mi">11</span>
<span class="k">if</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
    <span class="k">print</span> <span class="p">(</span><span class="s">'foo'</span><span class="p">)</span>
    <span class="n">s</span> <span class="o">=</span> <span class="s">'nombre pair'</span>
<span class="k">else</span><span class="p">:</span>    
    <span class="k">print</span> <span class="p">(</span><span class="s">'bar'</span><span class="p">)</span>    <span class="c1"># x % 2 != 0
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="mi">20</span>
<span class="k">if</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">5</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
    <span class="k">print</span> <span class="p">(</span><span class="s">'foobar'</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">5</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
    <span class="k">print</span> <span class="p">(</span><span class="s">'foo'</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="k">print</span> <span class="p">(</span><span class="s">'bar'</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="linstruction-for-">The <code class="language-plaintext highlighter-rouge">for</code> statement:</h4>
<p>It iterates over the items of a sequence (a list, a string, etc.) in the order in which they appear.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pays</span> <span class="o">=</span> <span class="p">[</span><span class="s">'France'</span><span class="p">,</span> <span class="s">'Canada'</span><span class="p">,</span> <span class="s">'Belgique'</span><span class="p">,</span> <span class="s">'Suisse'</span><span class="p">]</span>
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">pays</span><span class="p">:</span>
    <span class="k">print</span> <span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">p</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mot</span> <span class="o">=</span> <span class="s">'1 chat'</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">mot</span><span class="p">:</span>
    <span class="k">print</span> <span class="p">(</span><span class="n">c</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">):</span>
    <span class="k">print</span> <span class="p">(</span><span class="n">i</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="linstruction-while-">The <code class="language-plaintext highlighter-rouge">while</code> statement:</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">while</span> <span class="n">x</span> <span class="o">&lt;=</span> <span class="mi">5</span><span class="p">:</span>
    <span class="k">print</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">' '</span><span class="p">)</span> <span class="c1"># The (end) parameter replaces the trailing newline with another string (here, a space)
</span>    <span class="n">x</span> <span class="o">+=</span> <span class="mi">2</span> <span class="c1"># x = x + 2
</span></code></pre></div></div>

<h5 id="les-fonctions-">Functions:</h5>
<p>The <strong><code class="language-plaintext highlighter-rouge">def</code></strong> keyword defines a function. It is followed by the name of the function and its parameters in parentheses.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>         <span class="c1"># compute the sum of the integers from 1 to n
</span>    <span class="n">somme</span> <span class="o">=</span> <span class="mi">0</span>       <span class="c1"># a local variable
</span>    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
        <span class="n">somme</span> <span class="o">+=</span> <span class="n">i</span>
    <span class="k">return</span> <span class="n">somme</span>
<span class="n">add</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
<span class="c1"># add(10)
</span>
<span class="c1"># somme # is not defined 'globally': raises a NameError
# somme = 100        # global variable
# somme
</span></code></pre></div></div>
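
<p>The commented lines above hint at the difference between local and global names. A minimal sketch of that behaviour, reusing the <code class="language-plaintext highlighter-rouge">add</code> function (the name <code class="language-plaintext highlighter-rouge">total</code> is illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def add(n):
    somme = 0                  # local to add()
    for i in range(1, n + 1):
        somme += i
    return somme

somme = 100                    # global variable with the same name
total = add(10)                # the local variable does not touch the global one
print(total, somme)            # 55 100
</code></pre></div></div>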

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">fib</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="s">""" Print a Fibonacci-style sequence starting from the terms a and b, up to n (excluded). """</span>
    <span class="k">while</span> <span class="n">a</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">' '</span><span class="p">)</span>
        <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">b</span><span class="p">,</span> <span class="n">a</span><span class="o">+</span><span class="n">b</span>
<span class="n">fib</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">2000</span><span class="p">)</span>
<span class="c1"># Even functions without a return statement return a value,
# albeit a rather boring one. This value is called None.</span>
</code></pre></div></div>
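
<p>The <code class="language-plaintext highlighter-rouge">None</code> return value can be observed by storing the result of a call. In this sketch, <code class="language-plaintext highlighter-rouge">greet</code> is a hypothetical function introduced only for illustration:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def greet(name):
    print("Hello,", name)      # no return statement

result = greet("Brian")        # prints: Hello, Brian
print(result is None)          # True
</code></pre></div></div>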

<hr />
<h3 id="structures-de-données">Data structures</h3>
<h4 id="tuples">Tuples</h4>
<p>A tuple is a sequence of items separated by commas (and enclosed in parentheses when necessary).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">t</span> <span class="o">=</span> <span class="s">'python'</span><span class="p">,</span> <span class="mf">3.5</span><span class="p">,</span> <span class="mi">101</span>
<span class="n">t</span>
<span class="c1"># t[2]
</span>
<span class="c1"># n = t,  'Guido', ('2.7', 2010), [0, 90] # nested tuples
# n
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="s">'Guido van Rossum'</span> <span class="c1"># tuples are immutable: this raises a TypeError
# n[3][1] = 100
</span><span class="n">n</span>
</code></pre></div></div>
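
<p>The two assignments above behave differently, which the following sketch makes explicit. It defines <code class="language-plaintext highlighter-rouge">n</code> as in the commented line of the previous block; the flag <code class="language-plaintext highlighter-rouge">replaced</code> is an illustrative name:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>t = 'python', 3.5, 101
n = t, 'Guido', ('2.7', 2010), [0, 90]   # nested tuple
try:
    n[1] = 'Guido van Rossum'             # tuples do not support item assignment
except TypeError:
    replaced = False
else:
    replaced = True

n[3][1] = 100                             # the list stored inside the tuple is still mutable
print(replaced, n[3])                     # False [0, 100]
</code></pre></div></div>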

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vide</span> <span class="o">=</span> <span class="p">()</span> <span class="c1"># create an empty tuple
# un = 'python', # a one-item tuple needs the trailing comma
</span></code></pre></div></div>

<h4 id="les-ensembles-sets">Sets</h4>
<p>A set is an unordered collection with no duplicate elements.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">notes</span> <span class="o">=</span> <span class="p">{</span><span class="mi">18</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">11</span> <span class="p">,</span><span class="mi">18</span><span class="p">,</span> <span class="mi">14</span><span class="p">}</span>
<span class="n">notes</span>
<span class="c1"># fruits = {'orange', 'raisin', 'pomme' ,'kiwi' , 'orange', 'pomme'}
# fruits
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="s">'abccba'</span><span class="p">)</span>
<span class="n">a</span>
<span class="c1"># a &amp; set('atta') # sets also support other operations (union, intersection, difference, etc.)
</span></code></pre></div></div>
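
<p>A short sketch of the set operations mentioned in the comment above. The intersection is written with the method form here; the variable names are illustrative, and the results are sorted only to obtain a predictable display order:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a = set('abccba')                    # duplicates are dropped
b = set('atta')
union = a | b                        # items in a, in b, or in both
intersection = a.intersection(b)     # items present in both sets
difference = a - b                   # items in a but not in b
print(sorted(union), sorted(intersection), sorted(difference))
</code></pre></div></div>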

<h4 id="dictionnaires">Dictionaries</h4>
<p>Dictionaries are collections of <strong><code class="language-plaintext highlighter-rouge">key</code> : <code class="language-plaintext highlighter-rouge">value</code></strong> pairs; since Python 3.7 they preserve insertion order. They are indexed by <code class="language-plaintext highlighter-rouge">keys</code>, which can be of any immutable type: strings, numbers, and tuples (provided the tuples contain only <code class="language-plaintext highlighter-rouge">immutable</code> items).<br />
Keys must be unique within a dictionary.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span> <span class="o">=</span> <span class="p">{}</span> <span class="c1"># create an empty dictionary
</span>
<span class="c1"># d = {'Marie':15, 'Jean':2, 'Victor':9}
# d['Marie']
# del d['Victor']
# d['Charles'] = 33
</span><span class="n">d</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">([(</span><span class="s">'Anthony'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">),</span> <span class="p">(</span><span class="s">'Guido'</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="s">'g'</span><span class="p">]),</span> <span class="p">(</span><span class="s">'Marie'</span><span class="p">,</span> <span class="s">'m'</span><span class="p">)])</span>
<span class="c1"># p.keys()
# p.values()
# 'guido' in p
# 'Guido' in p
</span></code></pre></div></div>
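
<p>Putting the commented lines of the two blocks above together, a minimal sketch of the most common dictionary operations could look like this. The use of <code class="language-plaintext highlighter-rouge">items()</code> to loop over pairs is an addition to the original notebook, and the flag names are illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d = {'Marie': 15, 'Jean': 2, 'Victor': 9}
del d['Victor']            # remove a key and its value
d['Charles'] = 33          # add a new key
# items() yields (key, value) pairs in insertion order.
for name, value in d.items():
    print(name, value)
# Membership tests look at the keys and are case-sensitive.
has_marie = 'Marie' in d
has_lower = 'marie' in d
print(has_marie, has_lower)    # True False
</code></pre></div></div>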

<p>Dictionaries can also be built with <code class="language-plaintext highlighter-rouge">dict comprehensions</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">**</span><span class="mi">3</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">6</span><span class="p">)}</span>
</code></pre></div></div>

<hr />
<h3 id="modules">Modules</h3>
<p>A module is a file containing definitions and statements. The file name is the module name with the suffix <strong><code class="language-plaintext highlighter-rouge">.py</code></strong>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="o">&lt;</span><span class="n">module</span><span class="o">&gt;</span> <span class="c1"># import the module into the symbol table
</span>
<span class="o">&lt;</span><span class="n">module</span><span class="o">&gt;</span><span class="p">.</span><span class="o">&lt;</span><span class="n">fonction</span><span class="o">&gt;</span> <span class="c1"># access its functions (or constants)
</span></code></pre></div></div>
<p>To import names from <strong><code class="language-plaintext highlighter-rouge">&lt;module_1&gt;</code></strong> directly into the symbol table of the importing module (<strong><code class="language-plaintext highlighter-rouge">&lt;module_2&gt;</code></strong>), use the <strong><code class="language-plaintext highlighter-rouge">from</code></strong> form of the import statement. In that case, the name <strong><code class="language-plaintext highlighter-rouge">&lt;module_1&gt;</code></strong> itself is not defined inside <strong><code class="language-plaintext highlighter-rouge">&lt;module_2&gt;</code></strong>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">from</span> <span class="o">&lt;</span><span class="n">module</span><span class="o">&gt;</span> <span class="k">import</span> <span class="o">&lt;</span><span class="n">fonction_1</span><span class="o">&gt;</span><span class="p">,</span> <span class="o">&lt;</span><span class="n">fonction_2</span><span class="o">&gt;</span><span class="p">,</span> <span class="o">&lt;</span><span class="n">fonction_n</span><span class="o">&gt;</span>
</code></pre></div></div>
<p>Or:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">from</span> <span class="o">&lt;</span><span class="n">module</span><span class="o">&gt;</span> <span class="k">import</span> <span class="o">*</span> <span class="c1"># import all the names of the module (discouraged)
</span></code></pre></div></div>
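
<p>The placeholders above are not runnable as written. As a concrete sketch, here are the same import forms applied to the standard <code class="language-plaintext highlighter-rouge">math</code> module, chosen only for illustration; the names <code class="language-plaintext highlighter-rouge">area</code> and <code class="language-plaintext highlighter-rouge">root</code> are illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math                      # binds the module name
from math import sqrt, pi        # binds the selected names directly

area = math.pi * 2 ** 2          # access through the module name
root = sqrt(16)                  # no prefix needed
print(round(area, 2), root, pi == math.pi)   # 12.57 4.0 True
</code></pre></div></div>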

<hr />
<h2 id="exercice-pratique-de-synthèse">Practice exercises</h2>
<ul>
  <li>Define a function <code class="language-plaintext highlighter-rouge">somme_impairs</code> that takes a list of numbers as its argument and returns the sum of all the odd integers in it. For example:
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">somme_impairs</span><span class="p">([</span><span class="mi">5</span><span class="p">,</span> <span class="o">-</span><span class="mi">13</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span> <span class="o">==</span> <span class="o">-</span><span class="mi">5</span>
<span class="n">somme_impairs</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mf">1.01</span><span class="p">])</span> <span class="o">==</span> <span class="mi">1</span>
</code></pre></div>    </div>
  </li>
  <li>Write a program that reads a value N entered by the user (1 &lt;= N &lt;= 100), compares it with a number chosen by the program, and displays one of the following messages:</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="o">-</span> <span class="s">"Your value X is larger. Try again!"</span>  <span class="c1"># if the entered value is greater than the program's
</span> <span class="o">-</span> <span class="s">"Your value X is smaller. Try again!"</span> <span class="c1"># if the entered value is smaller than the program's
</span> <span class="o">-</span> <span class="s">"Well done! It was indeed X"</span>           <span class="c1"># if the entered value is correct
</span></code></pre></div></div>

<p>Here X is the value entered by the user.<br />
Consider using the <strong><code class="language-plaintext highlighter-rouge">randint</code></strong> function from the <strong><code class="language-plaintext highlighter-rouge">random</code></strong> module to generate a random integer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">random</span> <span class="kn">import</span> <span class="n">randint</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>  <span class="c1"># random integer with 1 &lt;= n &lt;= 100 (both ends included)</span>
</code></pre></div></div>
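<p>Once you have tried both exercises yourself, the following sketch shows one possible approach; it is not the only valid solution. The helper name <code class="language-plaintext highlighter-rouge">play</code> and its parameters <code class="language-plaintext highlighter-rouge">read</code> and <code class="language-plaintext highlighter-rouge">show</code> are hypothetical names introduced here so that the game can be run with scripted guesses instead of keyboard input. The sketch assumes that non-integer values such as <code class="language-plaintext highlighter-rouge">1.01</code> are simply ignored by <code class="language-plaintext highlighter-rouge">somme_impairs</code>, and it does not validate what the user types.</p>

```python
from random import randint


def somme_impairs(numbers):
    """Return the sum of the odd integers in `numbers`; other values are ignored."""
    total = 0
    for x in numbers:
        # Only integers can be odd, so floats such as 1.01 are skipped.
        if isinstance(x, int) and x % 2 != 0:
            total += x
    return total


def play(secret, read=input, show=print):
    """Ask for guesses until `secret` is found; return the number of attempts.

    `read` and `show` are hypothetical hooks (defaulting to input and print)
    that make it possible to run the game without a keyboard.
    """
    attempts = 0
    while True:
        x = int(read("Enter a value between 1 and 100: "))
        attempts += 1
        if x > secret:
            show(f"Your value {x} is larger. Try again!")
        elif x < secret:
            show(f"Your value {x} is smaller. Try again!")
        else:
            show(f"Well done! It was indeed {x}")
            return attempts


print(somme_impairs([5, -13, 3]))   # -5
print(somme_impairs([2, 1, 1.01]))  # 1

# Demo with scripted guesses, so that no keyboard input is needed.
scripted = iter(["50", "75", "62"])
print(play(62, read=lambda prompt: next(scripted)))  # 3 attempts
```

<p>To play interactively, call <code class="language-plaintext highlighter-rouge">play(randint(1, 100))</code>, which falls back to <code class="language-plaintext highlighter-rouge">input</code> and <code class="language-plaintext highlighter-rouge">print</code>.</p>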

<p>For more details about any object, function or module, use: <strong><code class="language-plaintext highlighter-rouge">help(&lt;name&gt;)</code></strong></p>]]></content><author><name>Meher Bejaoui</name><email>meher.bejaoui@outlook.com</email></author><category term="python" /><category term="tutorial" /><category term="python" /><summary type="html"><![CDATA[The basics of the Python language]]></summary></entry></feed>