{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "biological-greek",
   "metadata": {},
   "source": [
    "## Extract text from blog entries\n",
    "I'm using requests and beautifulsoup libraries to get all entries from my blog mariogarcia.github.io\n",
    "\n",
    "### A) Getting article list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "champion-employer",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "ROOT_URL = \"https://mariogarcia.github.io\"\n",
    "\n",
    "# getting all articles links list from archive.html page\n",
    "article_list_page = requests.get(\"{}/archive.html\".format(ROOT_URL))\n",
    "parsed_page       = BeautifulSoup(article_list_page.content, 'html.parser')\n",
    "links             = [\"{}{}\".format(ROOT_URL, entry['href']) for entry in parsed_page.select('ul.group a')]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "international-engagement",
   "metadata": {},
   "source": [
    "### B) Gathering all articles text in a DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "hazardous-refrigerator",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text\n",
       "0  \\n\\n\\n\\n\\n                                WORK...\n",
       "1  \\n\\n\\n\\n\\n                                WORK...\n",
       "2  \\n\\n\\n\\n\\n                                WORK...\n",
       "3  \\n\\n\\n\\n\\n                                WORK...\n",
       "4  \\n\\n\\n\\n\\n                                WORK..."
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# downloads html from the link passed as parameter\n",
    "# and extracts the article's full text\n",
    "def extract_text(link):\n",
    "    page       = requests.get(link)\n",
    "    html       = BeautifulSoup(page.content, 'html.parser')\n",
    "    paragraphs = [p.get_text() for p in html.select('div#main')]\n",
    "    \n",
    "    return \" \".join(paragraphs)\n",
    "\n",
    "# gather all articles text\n",
    "texts = [extract_text(link) for link in links]\n",
    "\n",
    "# create source dataframe\n",
    "df_src= pd.DataFrame(texts, columns=['text'])\n",
    "\n",
    "# show texts\n",
    "df_src.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dynamic-camel",
   "metadata": {},
   "source": [
    "## Analyzing documents\n",
    "Lets take a first look to see how documents are estructured to see which patterns I'm going to use to extract important information such as:\n",
    "\n",
    "- Title\n",
    "- Date\n",
    "- Text\n",
    "- Text length"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "irish-bedroom",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n\\n\\n\\n\\n                                WORKING IN PROGRESS\\n                             - POST\\n                        \\n\\n\\n\\n                                        Twitter\\n                                    \\n\\n\\n\\n\\n                                        Twitter\\n                                    \\n\\n\\n\\n\\n                                        Github\\n                                    \\n\\n\\n\\n2021-03-25Model Evaluation: ROC Curve and AUC\\n\\n\\n\\n\\n\\n\\n\\nIn the previous entry I was using decision functions and precision-recall curves to decide which threshold and classifier would serve best to my goal, whether it was precision or recall. In this occassion\\nI’m using the ROC curves.\\n\\n\\nThe ROC curves (ROC stands for Receiver Operating Characteristic) represents the performance of a binary classifier. It shows the relationship between false positive rates (FPR) and true positive rates (TPR). The idea is to choose the classifier that maximizes the TPR. Unlike the precision-recall curves the ideal point in a ROC curve is at the top left corner where the TPR is maximized and the FPR is minimized.\\n\\n\\nHere I’m using the same dataset as in the previous article and extending the Jupyter notebook with the ROC curves and AUC. There’re a couple of things to keep in mind to understand the following example:\\n\\n\\n\\n\\nI’m using a list of previously trained classifiers (lst variable)\\n\\n\\nI’m using a custom function that uses different decision functions whether they’re available or not (get_y_predict(…\\u200b))\\n\\n\\n\\n\\nIf you have any doubts about where the example came from you can check the notebook source code at any time.\\n\\n\\nROC\\n\\nimport matplotlib.pyplot as plt\\nfrom sklearn.metrics import roc_curve, auc\\n\\nplt.figure(figsize=(5, 5))\\n\\nfor classifier in lst:\\n    classifier_name      = type(classifier).__name__\\n\\n    # calculating prediction using decision_function() or predict_proba()\\n    y_predict            = get_y_predict(classifier, X_test)\\n\\n    # calculating the roc curves\\n    fpr, tpr, thresholds = roc_curve(y_test, y_predict)\\n    plt.plot(fpr, tpr, label=classifier_name)\\n\\nplt.plot([0, 1], [0, 1], c=\\'green\\', linestyle=\\'--\\')\\nplt.legend(loc=\"lower right\", fontsize=11)\\nplt.show()\\n\\n\\n\\nAs you can see it seems that the KNeighbors classifier is maximizing the TPR.\\n\\n\\n\\n\\n\\n\\n\\n\\nThe diagonal random line\\n\\nIn the previous and following charts, I’ve drawn a diagonal line that represents the separation between\\nbad and good classifiers.\\n\\n\\nThis means that any classifier performing close to it, can be considered as good as random, sometimes it could perform\\nbetter than random others less than random. You should be always be looking for classifiers performing consistenly above that line.\\n\\n\\n\\n\\nTo establish the goodness of classifier is oftenly used as a mesure the AUC or Area Under the Curve. The greater the area under the ROC curve the better is the classifier maximizing the TPR. The KNN classifier was the closest to the upper left corner, lets see the AUC to reasure our suspicions.\\n\\n\\nAUC\\n\\nimport matplotlib.pyplot as plt\\nfrom sklearn.metrics import roc_curve, auc\\n\\nplt.figure()\\n_, ax = plt.subplots(1, 3, figsize=(15, 4))\\ncols  = 0\\n\\nfor classifier in lst:\\n    classifier_name      = type(classifier).__name__\\n    # getting decision function prediction\\n    y_predict            = get_y_predict(classifier, X_test)\\n\\n    # calculating FPR and TPR\\n    fpr, tpr, thresholds = roc_curve(y_test, y_predict)\\n\\n    # calculating the area under the curve\\n    roc_auc              = auc(fpr, tpr)\\n\\n    ax[cols].title.set_text(\"{0} (AUC={1:.2f})\".format(classifier_name, roc_auc))\\n    ax[cols].set(xlabel=\\'False Positive Rate\\', ylabel=\\'True Positive Rate\\')\\n    ax[cols].plot(fpr, tpr, c=\\'k\\')\\n    ax[cols].plot([0, 1], [0, 1], c=\\'k\\', linestyle=\\'--\\')\\n    ax[cols].fill_between(fpr, tpr, hatch=\\'\\\\\\\\\\', color=\\'none\\', edgecolor=\\'#cccccc\\')\\n    cols+=1\\n\\nplt.show()\\n\\n\\n\\n\\n\\n\\n\\n\\nAs we see in the picture, the KNN is the one having the biggest AUC rate (0.96), so according to that it’s the one performing the best.\\n\\n\\n\\n\\nResources\\n\\n\\n\\n\\nUnderstanding the AUC ROC curve: from TowardsDataScience.com\\n\\n\\ntaiwan.ipynb: Taiwanese companies bankruptcy notebook\\n\\n\\n\\n\\n\\n\\n'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample = df_src.loc[1, 'text']\n",
    "sample"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "further-motor",
   "metadata": {},
   "source": [
    "Now I'm using **re** to start testing some regex to extract **title** and **date** afterwards from every entry in the dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "herbal-eagle",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('Model Evaluation: ROC Curve and AUC', '2021-03-25')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# regular expression with two groups -> ()\n",
    "matcher = re.search('.*(\\d{4}-\\d{2}-\\d{2})(.*)\\n*', sample)\n",
    "title   = matcher.group(2)\n",
    "date    = matcher.group(1)\n",
    "\n",
    "# showing extracted data\n",
    "title, date"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "mental-wellington",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'In the previous entry I was using decision functions and precision-recall curves to decide which threshold and classifier would serve best to my goal, whether it was precision or recall. In this occassion I’m using the ROC curves. The ROC curves (ROC stands for Receiver Operating Characteristic) represents the performance of a binary classifier. It shows the relationship between false positive rates (FPR) and true positive rates (TPR). The idea is to choose the classifier that maximizes the TPR. Unlike the precision-recall curves the ideal point in a ROC curve is at the top left corner where the TPR is maximized and the FPR is minimized. Here I’m using the same dataset as in the previous article and extending the Jupyter notebook with the ROC curves and AUC. There’re a couple of things to keep in mind to understand the following example: I’m using a list of previously trained classifiers (lst variable) I’m using a custom function that uses different decision functions whether they’re available or not (get_y_predict(…\\u200b)) If you have any doubts about where the example came from you can check the notebook source code at any time. ROC import matplotlib.pyplot as plt from sklearn.metrics import roc_curve, auc plt.figure(figsize=(5, 5)) for classifier in lst: classifier_name = type(classifier).__name__ # calculating prediction using decision_function() or predict_proba() y_predict = get_y_predict(classifier, X_test) # calculating the roc curves fpr, tpr, thresholds = roc_curve(y_test, y_predict) plt.plot(fpr, tpr, label=classifier_name) plt.plot([0, 1], [0, 1], c=\\'green\\', linestyle=\\'--\\') plt.legend(loc=\"lower right\", fontsize=11) plt.show() As you can see it seems that the KNeighbors classifier is maximizing the TPR. The diagonal random line In the previous and following charts, I’ve drawn a diagonal line that represents the separation between bad and good classifiers. This means that any classifier performing close to it, can be considered as good as random, sometimes it could perform better than random others less than random. You should be always be looking for classifiers performing consistenly above that line. To establish the goodness of classifier is oftenly used as a mesure the AUC or Area Under the Curve. The greater the area under the ROC curve the better is the classifier maximizing the TPR. The KNN classifier was the closest to the upper left corner, lets see the AUC to reasure our suspicions. AUC import matplotlib.pyplot as plt from sklearn.metrics import roc_curve, auc plt.figure() _, ax = plt.subplots(1, 3, figsize=(15, 4)) cols = 0 for classifier in lst: classifier_name = type(classifier).__name__ # getting decision function prediction y_predict = get_y_predict(classifier, X_test) # calculating FPR and TPR fpr, tpr, thresholds = roc_curve(y_test, y_predict) # calculating the area under the curve roc_auc = auc(fpr, tpr) ax[cols].title.set_text(\"{0} (AUC={1:.2f})\".format(classifier_name, roc_auc)) ax[cols].set(xlabel=\\'False Positive Rate\\', ylabel=\\'True Positive Rate\\') ax[cols].plot(fpr, tpr, c=\\'k\\') ax[cols].plot([0, 1], [0, 1], c=\\'k\\', linestyle=\\'--\\') ax[cols].fill_between(fpr, tpr, hatch=\\'\\\\\\\\\\', color=\\'none\\', edgecolor=\\'#cccccc\\') cols+=1 plt.show() As we see in the picture, the KNN is the one having the biggest AUC rate (0.96), so according to that it’s the one performing the best. Resources Understanding the AUC ROC curve: from TowardsDataScience.com taiwan.ipynb: Taiwanese companies bankruptcy notebook '"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# cleaning up excess of \\n and \\s from text\n",
    "sample = re.sub('[\\n|\\s]{1,}', ' ', sample)\n",
    "\n",
    "# removing everything until the title (included) from article's text\n",
    "text   = re.sub('^.*{} '.format(title), '', sample)\n",
    "\n",
    "# show cleaned text\n",
    "text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "coastal-blogger",
   "metadata": {},
   "outputs": [],
   "source": [
    "# copying source dataframe to avoid downloading every time\n",
    "df = df_src.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "wireless-parallel",
   "metadata": {},
   "source": [
    "## Using Pandas DataFrame to extract features from every entry"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "tamil-wheat",
   "metadata": {},
   "source": [
    "a) Extracting the **length** of every entry"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ideal-prairie",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>len</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "      <td>8038</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "      <td>4108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "      <td>4990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "      <td>7758</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "      <td>10761</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text    len\n",
       "0  \\n\\n\\n\\n\\n                                WORK...   8038\n",
       "1  \\n\\n\\n\\n\\n                                WORK...   4108\n",
       "2  \\n\\n\\n\\n\\n                                WORK...   4990\n",
       "3  \\n\\n\\n\\n\\n                                WORK...   7758\n",
       "4  \\n\\n\\n\\n\\n                                WORK...  10761"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['len'] = df['text'].str.len()\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "short-index",
   "metadata": {},
   "source": [
    "b) Extracting the **title and date** from every entry using the previous regex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "antique-judges",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>len</th>\n",
       "      <th>date</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "      <td>8038</td>\n",
       "      <td>2021-03-26</td>\n",
       "      <td>Model Evaluation: Multiclass evaluation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "      <td>4108</td>\n",
       "      <td>2021-03-25</td>\n",
       "      <td>Model Evaluation: ROC Curve and AUC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "      <td>4990</td>\n",
       "      <td>2021-03-24</td>\n",
       "      <td>Model Evaluation: Decision functions</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "      <td>7758</td>\n",
       "      <td>2021-03-16</td>\n",
       "      <td>Event Sourcing 101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>\\n\\n\\n\\n\\n                                WORK...</td>\n",
       "      <td>10761</td>\n",
       "      <td>2021-03-15</td>\n",
       "      <td>Model Evaluation: Confusion Matrix</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 text    len        date  \\\n",
       "0   \\n\\n\\n\\n\\n                                WORK...   8038  2021-03-26   \n",
       "1   \\n\\n\\n\\n\\n                                WORK...   4108  2021-03-25   \n",
       "2   \\n\\n\\n\\n\\n                                WORK...   4990  2021-03-24   \n",
       "3   \\n\\n\\n\\n\\n                                WORK...   7758  2021-03-16   \n",
       "13  \\n\\n\\n\\n\\n                                WORK...  10761  2021-03-15   \n",
       "\n",
       "                                      title  \n",
       "0   Model Evaluation: Multiclass evaluation  \n",
       "1       Model Evaluation: ROC Curve and AUC  \n",
       "2      Model Evaluation: Decision functions  \n",
       "3                        Event Sourcing 101  \n",
       "13       Model Evaluation: Confusion Matrix  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# extract_all will create a new DataFrame with the extracted data\n",
    "title_date_df = df['text']\\\n",
    "    .str.extractall(r'.*(?P<date>\\d{4}-\\d{2}-\\d{2})(?P<title>.*)[\\n]*')\\\n",
    "    .reset_index(col_fill='origin')\n",
    "\n",
    "# getting rid of not_matching and NaN entries\n",
    "title_date_df = title_date_df.where(title_date_df['match'] == 0).dropna()\n",
    "\n",
    "# merging both dataframes\n",
    "df = pd.merge(df, title_date_df, left_index=True, right_on='level_0')\n",
    "\n",
    "# removing not relevant columns once both dataframes are merged\n",
    "df = df.drop(['level_0', 'match'], axis=1)\n",
    "\n",
    "# now we got our data included in the original dataframe\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "later-window",
   "metadata": {},
   "source": [
    "c) **reordering columns**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "recorded-equality",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df[['title', 'date', 'len', 'text']]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "selective-cheese",
   "metadata": {},
   "source": [
    "d) **Cleaning article's text**: getting rid of headers, title, dates, return characters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "grand-minority",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>date</th>\n",
       "      <th>len</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Model Evaluation: Multiclass evaluation</td>\n",
       "      <td>2021-03-26</td>\n",
       "      <td>8038</td>\n",
       "      <td>So far I’ve been evaluating binary classifiers...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Model Evaluation: ROC Curve and AUC</td>\n",
       "      <td>2021-03-25</td>\n",
       "      <td>4108</td>\n",
       "      <td>In the previous entry I was using decision fun...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Model Evaluation: Decision functions</td>\n",
       "      <td>2021-03-24</td>\n",
       "      <td>4990</td>\n",
       "      <td>Another tool for evaluating a classifier are d...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Event Sourcing 101</td>\n",
       "      <td>2021-03-16</td>\n",
       "      <td>7758</td>\n",
       "      <td>What is event sourcing ? As opposed to store t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Model Evaluation: Confusion Matrix</td>\n",
       "      <td>2021-03-15</td>\n",
       "      <td>10761</td>\n",
       "      <td>The machine learning workflow usually involves...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      title        date    len  \\\n",
       "0   Model Evaluation: Multiclass evaluation  2021-03-26   8038   \n",
       "1       Model Evaluation: ROC Curve and AUC  2021-03-25   4108   \n",
       "2      Model Evaluation: Decision functions  2021-03-24   4990   \n",
       "3                        Event Sourcing 101  2021-03-16   7758   \n",
       "13       Model Evaluation: Confusion Matrix  2021-03-15  10761   \n",
       "\n",
       "                                                 text  \n",
       "0   So far I’ve been evaluating binary classifiers...  \n",
       "1   In the previous entry I was using decision fun...  \n",
       "2   Another tool for evaluating a classifier are d...  \n",
       "3   What is event sourcing ? As opposed to store t...  \n",
       "13  The machine learning workflow usually involves...  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# removing return characters\n",
    "df['text'] = df['text'].str.replace('[\\n|\\s]{1,}', ' ', regex=True)\n",
    "\n",
    "# removing everything before the text\n",
    "df['text'] = df.apply(lambda x: re.sub('^.*{} '.format(x['title']), '', x['text']).strip(), axis=1)\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "judicial-spanking",
   "metadata": {},
   "source": [
    "## Sorting DataFrame using new features\n",
    "Although it looks promising the truth is that the data we've collected so far are in their string representation, \n",
    "we need to convert text lengths to integers and the text dates to real dates in order to do a fair sorting\n",
    "of the data\n",
    "\n",
    "a) Converting **length and date to integer and dates** respectively"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "right-power",
   "metadata": {},
   "outputs": [],
   "source": [
    "# converting dates strings to datetimes\n",
    "df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')\n",
    "\n",
    "# converting strings to integers\n",
    "df['len']  = df['len'].astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "front-broad",
   "metadata": {},
   "source": [
    "b) sorting **by length** (descending)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "infrared-newark",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>date</th>\n",
       "      <th>len</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>DS - Basic stocks charts</td>\n",
       "      <td>2020-10-18</td>\n",
       "      <td>17023</td>\n",
       "      <td>A good way of practicing matplotlib is trying ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>DS - Data visualization 101</td>\n",
       "      <td>2020-10-09</td>\n",
       "      <td>16889</td>\n",
       "      <td>When trying to explain some data insights to s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Linear Regression notes</td>\n",
       "      <td>2020-11-13</td>\n",
       "      <td>16702</td>\n",
       "      <td>Classification is a great method to predict di...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>85</th>\n",
       "      <td>Property based testing</td>\n",
       "      <td>2015-11-20</td>\n",
       "      <td>14005</td>\n",
       "      <td>claims to be able to help us to fill this gap ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>DS - Research - Bike accidents in Madrid</td>\n",
       "      <td>2020-09-05</td>\n",
       "      <td>12576</td>\n",
       "      <td>DISCLAIMER ALERT: This article is intended to ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                       title       date    len  \\\n",
       "22                  DS - Basic stocks charts 2020-10-18  17023   \n",
       "30               DS - Data visualization 101 2020-10-09  16889   \n",
       "19                   Linear Regression notes 2020-11-13  16702   \n",
       "85                    Property based testing 2015-11-20  14005   \n",
       "35  DS - Research - Bike accidents in Madrid 2020-09-05  12576   \n",
       "\n",
       "                                                 text  \n",
       "22  A good way of practicing matplotlib is trying ...  \n",
       "30  When trying to explain some data insights to s...  \n",
       "19  Classification is a great method to predict di...  \n",
       "85  claims to be able to help us to fill this gap ...  \n",
       "35  DISCLAIMER ALERT: This article is intended to ...  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.sort_values('len', ascending=False).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "protective-airport",
   "metadata": {},
   "source": [
    "c) sorting **by date** (ascending)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "subjective-genius",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>date</th>\n",
       "      <th>len</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>86</th>\n",
       "      <td>High Order Functions</td>\n",
       "      <td>2015-11-10</td>\n",
       "      <td>3718</td>\n",
       "      <td>…​is a function that does at least one of the ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>85</th>\n",
       "      <td>Property based testing</td>\n",
       "      <td>2015-11-20</td>\n",
       "      <td>14005</td>\n",
       "      <td>claims to be able to help us to fill this gap ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>84</th>\n",
       "      <td>Some applicative style examples</td>\n",
       "      <td>2015-12-21</td>\n",
       "      <td>3609</td>\n",
       "      <td>I’m not going to define what an applicative fu...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>83</th>\n",
       "      <td>Java: Method Reference composition</td>\n",
       "      <td>2016-03-28</td>\n",
       "      <td>3418</td>\n",
       "      <td>Introduction Since JDK 8 there is the java.uti...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>82</th>\n",
       "      <td>Frege basics: File I/O</td>\n",
       "      <td>2016-03-29</td>\n",
       "      <td>2328</td>\n",
       "      <td>Intro All examples are based on IO.fr module f...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                 title       date    len  \\\n",
       "86                High Order Functions 2015-11-10   3718   \n",
       "85              Property based testing 2015-11-20  14005   \n",
       "84     Some applicative style examples 2015-12-21   3609   \n",
       "83  Java: Method Reference composition 2016-03-28   3418   \n",
       "82              Frege basics: File I/O 2016-03-29   2328   \n",
       "\n",
       "                                                 text  \n",
       "86  …​is a function that does at least one of the ...  \n",
       "85  claims to be able to help us to fill this gap ...  \n",
       "84  I’m not going to define what an applicative fu...  \n",
       "83  Introduction Since JDK 8 there is the java.uti...  \n",
       "82  Intro All examples are based on IO.fr module f...  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.sort_values('date').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "descending-grace",
   "metadata": {},
   "source": [
    "## Extracting more information from text\n",
    "a) Which is the mean number of characters per article ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "funky-tractor",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mean length of characters per article: 5580.05\n"
     ]
    }
   ],
   "source": [
    "mean = df['text'].str.len().mean()\n",
    "\n",
    "print(\"mean length of characters per article: {0:.2f}\".format(mean))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "inappropriate-teens",
   "metadata": {},
   "source": [
    "a) Which is the mean number of digits per article ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "moving-stanley",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mean length of digits per article 101.13\n"
     ]
    }
   ],
   "source": [
    "mean = df['text'].str.findall(r'\\d').apply(lambda x: len(x)).mean()\n",
    "\n",
    "print(\"mean length of digits per article {0:.2f}\".format(mean))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "equipped-viking",
   "metadata": {},
   "source": [
    "c) Which are the adjectives following the expression **'the most'** ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "reduced-element",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['important', 'basic', 'suitable', 'frequent', 'significant',\n",
       "       'convenient', 'used', 'representative', 'popular', 'common',\n",
       "       'famous', 'on', 'appropriate', 'about'], dtype=object)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adjectives = df['text']\\\n",
    "    .str.extractall(r'the most (?P<most>\\w{1,})')\\\n",
    "    .reset_index(level=1)\\\n",
    "    .loc[:, 'most']\\\n",
    "    .unique()\n",
    "\n",
    "adjectives"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "pressing-robert",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}