## Extract text from blog entries
I'm using the requests and BeautifulSoup libraries to fetch every entry from my blog, mariogarcia.github.io.

### A) Getting article list

In [1]:
import requests
from bs4 import BeautifulSoup

ROOT_URL = "https://mariogarcia.github.io"

# collecting the links to every article listed on the archive.html page
article_list_page = requests.get("{}/archive.html".format(ROOT_URL))
parsed_page       = BeautifulSoup(article_list_page.content, 'html.parser')
links             = ["{}{}".format(ROOT_URL, entry['href']) for entry in parsed_page.select('ul.group a')]
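
The link-building step above concatenates `ROOT_URL` and each `href`, which works as long as every `href` is root-relative. As a minimal sketch, assuming the archive page could also contain relative or absolute links (the `hrefs` below are made up for illustration, not taken from the blog), `urllib.parse.urljoin` from the standard library copes with all three cases:

```python
from urllib.parse import urljoin

ROOT_URL = "https://mariogarcia.github.io"

# hypothetical href values, only for illustration
hrefs = [
    "/blog/2021/03/roc.html",             # root-relative
    "blog/2021/03/roc.html",              # relative
    "https://example.org/external.html",  # already absolute
]

# urljoin resolves each href against the root URL
links = [urljoin(ROOT_URL, href) for href in hrefs]

for link in links:
    print(link)
```

The first two entries resolve to the same URL under `ROOT_URL`, while the absolute one is left untouched.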

### B) Gathering all articles text in a DataFrame

In [2]:
import pandas as pd

# downloads html from the link passed as parameter
# and extracts the article's full text
def extract_text(link):
    page       = requests.get(link)
    html       = BeautifulSoup(page.content, 'html.parser')
    paragraphs = [p.get_text() for p in html.select('div#main')]
    
    return " ".join(paragraphs)

# gather all articles text
texts = [extract_text(link) for link in links]

# create source dataframe
df_src = pd.DataFrame(texts, columns=['text'])

# show texts
df_src.head()

| | text |
|---|---|
| 0 | \n\n\n\n\n WORK... |
| 1 | \n\n\n\n\n WORK... |
| 2 | \n\n\n\n\n WORK... |
| 3 | \n\n\n\n\n WORK... |
| 4 | \n\n\n\n\n WORK... |


## Analyzing documents
Let's take a first look at how the documents are structured, to decide which patterns to use for extracting important information such as:

- Title
- Date
- Text
- Text length

In [3]:
sample = df_src.loc[1, 'text']
sample

'\n\n\n\n\n                                WORKING IN PROGRESS\n                             - POST\n                        \n\n\n\n                                        Twitter\n                                    \n\n\n\n\n                                        Twitter\n                                    \n\n\n\n\n                                        Github\n                                    \n\n\n\n2021-03-25Model Evaluation: ROC Curve and AUC\n\n\n\n\n\n\n\nIn the previous entry I was using decision functions and precision-recall curves to decide which threshold and classifier would serve best to my goal, whether it was precision or recall. In this occassion\nI’m using the ROC curves.\n\n\nThe ROC curves (ROC stands for Receiver Operating Characteristic) represents the performance of a binary classifier. It shows the relationship between false positive rates (FPR) and true positive rates (TPR). The idea is to choose the classifier that maximizes the TPR. Unlike the precis

Now I'm using the **re** module to test a regular expression that extracts the **title** and **date** from the sample; afterwards I'll apply it to every entry in the DataFrame.

In [4]:
import re

# regular expression with two groups -> ()
matcher = re.search(r'.*(\d{4}-\d{2}-\d{2})(.*)\n*', sample)
title   = matcher.group(2)
date    = matcher.group(1)

# showing extracted data
title, date

('Model Evaluation: ROC Curve and AUC', '2021-03-25')
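
To make the two groups easier to follow, here is a minimal sketch of the same idea with named groups. The `sample` string is a shortened, hypothetical stand-in for a scraped article, and the leading `.*` and trailing `\n*` are dropped because `search` already scans the whole string:

```python
import re

# hypothetical, shortened stand-in for one scraped article
sample = (
    "\n\n   Github\n\n\n"
    "2021-03-25Model Evaluation: ROC Curve and AUC\n\n\n"
    "In the previous entry...\n"
)

# named groups document what each pair of parentheses captures;
# '.' does not match '\n', so the title stops at the end of the line
pattern = re.compile(r'(?P<date>\d{4}-\d{2}-\d{2})(?P<title>.*)')

match = pattern.search(sample)
date  = match.group('date')
title = match.group('title')

print(date, '|', title)
```

Because `.` stops at a newline by default, the title group ends where the line ends, which is what keeps the article body out of the title.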

In [5]:
import re

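# collapsing runs of whitespace (\s already includes \n) into a single space
sample = re.sub(r'\s+', ' ', sample)

# removing everything up to and including the title from the article's text
# (re.escape keeps regex metacharacters in the title from being interpreted)
text   = re.sub('^.*{} '.format(re.escape(title)), '', sample)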

# show cleaned text
text

'In the previous entry I was using decision functions and precision-recall curves to decide which threshold and classifier would serve best to my goal, whether it was precision or recall. In this occassion I’m using the ROC curves. The ROC curves (ROC stands for Receiver Operating Characteristic) represents the performance of a binary classifier. It shows the relationship between false positive rates (FPR) and true positive rates (TPR). The idea is to choose the classifier that maximizes the TPR. Unlike the precision-recall curves the ideal point in a ROC curve is at the top left corner where the TPR is maximized and the FPR is minimized. Here I’m using the same dataset as in the previous article and extending the Jupyter notebook with the ROC curves and AUC. There’re a couple of things to keep in mind to understand the following example: I’m using a list of previously trained classifiers (lst variable) I’m using a custom function that uses different decision functions whether they’re 
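
The two clean-up steps can be wrapped in a small helper. This is only a sketch under stated assumptions: `clean_article` is a hypothetical name, the input string and its title are invented, and the non-greedy `.*?` is a deliberate variation so that a title repeated later in the body does not swallow more text than intended:

```python
import re

def clean_article(raw, title):
    """Collapse whitespace and drop everything up to (and including) the title."""
    # every run of whitespace, newlines included, becomes a single space
    flat = re.sub(r'\s+', ' ', raw)
    # re.escape makes characters such as '(' or '?' in the title match literally
    pattern = '^.*?{} '.format(re.escape(title))
    return re.sub(pattern, '', flat, count=1).strip()

# hypothetical input whose title contains regex metacharacters
raw    = "\n\n  Twitter\n\n2016-03-29Frege basics: File I/O (part 1)\n\nIntro   All examples\nare here.\n"
result = clean_article(raw, "Frege basics: File I/O (part 1)")

print(result)
```

Without `re.escape`, the parentheses in this invented title would be read as a capture group, the pattern would not match, and the header would stay in the text.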

In [6]:
# copying source dataframe to avoid downloading every time
df = df_src.copy()

## Using Pandas DataFrame to extract features from every entry

a) Extracting the **length** of every entry

In [7]:
df['len'] = df['text'].str.len()

df.head()

| | text | len |
|---|---|---|
| 0 | \n\n\n\n\n WORK... | 8038 |
| 1 | \n\n\n\n\n WORK... | 4108 |
| 2 | \n\n\n\n\n WORK... | 4990 |
| 3 | \n\n\n\n\n WORK... | 7758 |
| 4 | \n\n\n\n\n WORK... | 10761 |


b) Extracting the **title and date** from every entry using the previous regex

In [8]:
# extractall returns a new DataFrame with one row per regex match
title_date_df = df['text']\
    .str.extractall(r'.*(?P<date>\d{4}-\d{2}-\d{2})(?P<title>.*)[\n]*')\
    .reset_index(col_fill='origin')

# keeping only the first match of every article (match == 0)
title_date_df = title_date_df[title_date_df['match'] == 0]

# merging both dataframes
df = pd.merge(df, title_date_df, left_index=True, right_on='level_0')

# removing not relevant columns once both dataframes are merged
df = df.drop(['level_0', 'match'], axis=1)

# the extracted data is now part of the original dataframe
df.head()

| | text | len | date | title |
|---|---|---|---|---|
| 0 | \n\n\n\n\n WORK... | 8038 | 2021-03-26 | Model Evaluation: Multiclass evaluation |
| 1 | \n\n\n\n\n WORK... | 4108 | 2021-03-25 | Model Evaluation: ROC Curve and AUC |
| 2 | \n\n\n\n\n WORK... | 4990 | 2021-03-24 | Model Evaluation: Decision functions |
| 3 | \n\n\n\n\n WORK... | 7758 | 2021-03-16 | Event Sourcing 101 |
| 13 | \n\n\n\n\n WORK... | 10761 | 2021-03-15 | Model Evaluation: Confusion Matrix |
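
The index jumps in the result (for example `13` in the fifth row) come from `extractall` returning one row per match, so an article that mentions other dates produces several rows. As a hedged alternative, not the approach used in this notebook, `str.extract` keeps only the first match and preserves the original index, so no merge is needed. The `toy` DataFrame below is invented for illustration:

```python
import pandas as pd

# hypothetical two-row stand-in for the scraped articles
toy = pd.DataFrame({'text': [
    "\n\nGithub\n\n2021-03-25Model Evaluation: ROC Curve and AUC\n\nBody from 2020-01-01 onwards\n",
    "\n\nGithub\n\n2021-03-16Event Sourcing 101\n\nBody text\n",
]})

pattern = r'(?P<date>\d{4}-\d{2}-\d{2})(?P<title>.*)'

# extractall: one row per match, indexed by (original row, match number)
all_matches = toy['text'].str.extractall(pattern)

# extract: first match only, same index as the input
first_match = toy['text'].str.extract(pattern)

# the shared index makes a plain join enough
toy = toy.join(first_match)

print(all_matches)
print(toy[['date', 'title']])
```

Here the first article yields two rows in `all_matches` because its body contains a second date, while `first_match` has exactly one row per article.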


c) **Reordering columns**

In [9]:
df = df[['title', 'date', 'len', 'text']]

d) **Cleaning the article's text**: getting rid of headers, title, dates and newline characters

In [10]:
# collapsing runs of whitespace (\s already includes \n) into a single space
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)

# removing everything before the text; re.escape makes the title match literally
df['text'] = df.apply(lambda x: re.sub('^.*{} '.format(re.escape(x['title'])), '', x['text']).strip(), axis=1)

df.head()

| | title | date | len | text |
|---|---|---|---|---|
| 0 | Model Evaluation: Multiclass evaluation | 2021-03-26 | 8038 | So far I’ve been evaluating binary classifiers... |
| 1 | Model Evaluation: ROC Curve and AUC | 2021-03-25 | 4108 | In the previous entry I was using decision fun... |
| 2 | Model Evaluation: Decision functions | 2021-03-24 | 4990 | Another tool for evaluating a classifier are d... |
| 3 | Event Sourcing 101 | 2021-03-16 | 7758 | What is event sourcing ? As opposed to store t... |
| 13 | Model Evaluation: Confusion Matrix | 2021-03-15 | 10761 | The machine learning workflow usually involves... |


## Sorting DataFrame using new features
Although it looks promising, the dates we've collected so far are still strings,
so they need to be converted to real dates before sorting by them. The lengths returned by `str.len()`
should already be numeric; the integer cast below just makes that explicit.

a) Converting **length to integer and date to datetime**

In [11]:
# converting date strings to datetimes
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# making sure lengths are stored as integers
df['len']  = df['len'].astype(int)
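
As a small sketch of what these conversions do, assuming a column could contain a malformed date (the values below are invented), `errors='coerce'` turns unparseable entries into `NaT` instead of raising, and `sort_values` places them last by default:

```python
import pandas as pd

# hypothetical values; the last date is deliberately malformed
toy = pd.DataFrame({
    'date': ['2021-03-25', '2015-11-10', 'not-a-date'],
    'len':  ['4108', '3718', '2328'],
})

# errors='coerce' yields NaT for strings that do not fit the format
toy['date'] = pd.to_datetime(toy['date'], format='%Y-%m-%d', errors='coerce')

# string digits become real integers
toy['len'] = toy['len'].astype(int)

by_date = toy.sort_values('date')

print(toy.dtypes)
print(by_date)
```

Whether to coerce or to let the conversion fail loudly is a design choice; failing early is often preferable when every row is expected to be valid.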

b) Sorting **by length** (descending)

In [12]:
df.sort_values('len', ascending=False).head()

| | title | date | len | text |
|---|---|---|---|---|
| 22 | DS - Basic stocks charts | 2020-10-18 | 17023 | A good way of practicing matplotlib is trying ... |
| 30 | DS - Data visualization 101 | 2020-10-09 | 16889 | When trying to explain some data insights to s... |
| 19 | Linear Regression notes | 2020-11-13 | 16702 | Classification is a great method to predict di... |
| 85 | Property based testing | 2015-11-20 | 14005 | claims to be able to help us to fill this gap ... |
| 35 | DS - Research - Bike accidents in Madrid | 2020-09-05 | 12576 | DISCLAIMER ALERT: This article is intended to ... |


c) Sorting **by date** (ascending)

In [13]:
df.sort_values('date').head()

| | title | date | len | text |
|---|---|---|---|---|
| 86 | High Order Functions | 2015-11-10 | 3718 | …​is a function that does at least one of the ... |
| 85 | Property based testing | 2015-11-20 | 14005 | claims to be able to help us to fill this gap ... |
| 84 | Some applicative style examples | 2015-12-21 | 3609 | I’m not going to define what an applicative fu... |
| 83 | Java: Method Reference composition | 2016-03-28 | 3418 | Introduction Since JDK 8 there is the java.uti... |
| 82 | Frege basics: File I/O | 2016-03-29 | 2328 | Intro All examples are based on IO.fr module f... |


## Extracting more information from text
a) What is the mean number of characters per article?

In [14]:
mean = df['text'].str.len().mean()

print("mean number of characters per article: {0:.2f}".format(mean))

mean number of characters per article: 5580.05


b) What is the mean number of digits per article?

In [15]:
mean = df['text'].str.count(r'\d').mean()

print("mean number of digits per article: {0:.2f}".format(mean))

mean number of digits per article: 101.13
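
Both means rely on the vectorised `.str` accessor. As a minimal sketch on an invented three-element Series (not data from the blog), `str.len` counts characters and `str.count` counts regex matches per element:

```python
import pandas as pd

# hypothetical texts, only for illustration
texts = pd.Series(["abc 123", "no digits", "2021-03-25"])

# characters per element: 7, 9 and 10
chars_per_text = texts.str.len()

# digits per element: 3, 0 and 8
digits_per_text = texts.str.count(r'\d')

mean_chars  = chars_per_text.mean()
mean_digits = digits_per_text.mean()

print("mean characters: {0:.2f}".format(mean_chars))
print("mean digits: {0:.2f}".format(mean_digits))
```

`str.count(r'\d')` gives the same numbers as `str.findall(r'\d')` followed by a length computation, without building the intermediate lists.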


c) Which adjectives follow the expression **'the most'**?

In [16]:
adjectives = df['text']\
    .str.extractall(r'the most (?P<most>\w{1,})')\
    .reset_index(level=1)\
    .loc[:, 'most']\
    .unique()

adjectives

array(['important', 'basic', 'suitable', 'frequent', 'significant',
       'convenient', 'used', 'representative', 'popular', 'common',
       'famous', 'on', 'appropriate', 'about'], dtype=object)
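
The result includes words such as 'on' and 'about', which are not adjectives, because the pattern captures whatever word comes next. A rough sketch of one way to tighten it, using only the standard library: the sentence and the `not_adjectives` set below are hand-picked assumptions, not a general solution (a part-of-speech tagger would be the more robust choice):

```python
import re

# hypothetical sentence, not taken from the blog
text = "The most common case is also one of the most important, at the most on Fridays."

# \b avoids matching inside longer words; IGNORECASE also catches "The most"
candidates = re.findall(r'\bthe most (\w+)', text, flags=re.IGNORECASE)

# hypothetical, hand-picked words to discard
not_adjectives = {'on', 'about'}

adjectives = [word for word in candidates if word not in not_adjectives]

print(candidates)
print(adjectives)
```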