Skip to content
Snippets Groups Projects
Commit 22558744 authored by Solange Emmenegger's avatar Solange Emmenegger
Browse files

Merge branch 'renku/autosave/solange.emmenegger@hslu.ch/master/e869906d/8bf57070' into 'master'

Renku/autosave/solange.emmenegger@hslu.ch/master/e869906d/8bf57070

See merge request solange.emmenegger/adml-hslu-hs22!10
parents e869906d 9cff590a
No related branches found
No related tags found
1 merge request!10Renku/autosave/solange.emmenegger@hslu.ch/master/e869906/8bf5707
Pipeline #417248 passed with stage
in 7 minutes and 48 seconds
%% Cell type:markdown id: tags:
# Data Quality Assessment
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from wordcloud import WordCloud
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
```
%% Output
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Input In [1], in <cell line: 7>()
5 from sklearn.linear_model import LinearRegression
6 from sklearn.metrics import mean_squared_error
----> 7 from wordcloud import WordCloud
8 import warnings
9 warnings.filterwarnings("ignore")
ModuleNotFoundError: No module named 'wordcloud'
%% Cell type:code id: tags:
``` python
df = pd.read_csv("cars.csv")
df.head()
```
%% Cell type:markdown id: tags:
## Skewed Data
%% Cell type:code id: tags:
``` python
horsepower = df["Horsepower"]
horsepower.plot.hist()
```
%% Cell type:code id: tags:
``` python
print("Mean:", horsepower.mean())
print("Mode:", int(horsepower.mode()))
print("")
print("Mean - Mode = ", horsepower.mean() - int(horsepower.mode()))
```
%% Cell type:markdown id: tags:
### Log-Transform Data
%% Cell type:code id: tags:
``` python
horsepower_log = horsepower.apply(lambda x: np.log(x))
horsepower_log.plot.hist()
print("Mean:", horsepower_log.mean())
print("Mode:", int(horsepower_log.mode()))
print("\nMean - Mode = ", horsepower_log.mean() - int(horsepower_log.mode()))
```
%% Cell type:markdown id: tags:
## Boxplots
%% Cell type:markdown id: tags:
### Horsepower
%% Cell type:code id: tags:
``` python
fig, ax = plt.subplots(figsize=(12,8))
horsepower.plot.box(ax=ax)
```
%% Cell type:code id: tags:
``` python
horsepower.describe()
```
%% Cell type:code id: tags:
``` python
q3 = horsepower.describe().loc['75%']
q1 = horsepower.describe().loc['25%']
iqr = q3 - q1
upper_boundary = q3 + 1.5 * iqr
lower_boundary = q1 - 1.5 * iqr
print("Upper boundary:", upper_boundary, "Lower boundary:", lower_boundary)
```
%% Cell type:markdown id: tags:
### Year
%% Cell type:code id: tags:
``` python
year = df["Year"]
fig, ax = plt.subplots(figsize=(12,8))
year.plot.box(ax=ax)
```
%% Cell type:code id: tags:
``` python
year.describe()
```
%% Cell type:code id: tags:
``` python
q3 = year.describe().loc['75%']
q1 = year.describe().loc['25%']
iqr = q3 - q1
upper_boundary = q3 + 1.5 * iqr
lower_boundary = q1 - 1.5 * iqr
print("Upper boundary:", upper_boundary, "Lower boundary:", lower_boundary)
```
%% Cell type:markdown id: tags:
## Correlation
%% Cell type:code id: tags:
``` python
plt.subplots(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='RdYlGn_r', linewidths=0.5)
```
%% Cell type:markdown id: tags:
## Dummy Variable Trap
%% Cell type:code id: tags:
``` python
colors = pd.get_dummies(df.Color)
colors.head()
```
%% Cell type:code id: tags:
``` python
plt.subplots(figsize=(10, 8))
sns.heatmap(colors.corr(), annot=True, cmap='RdYlGn_r', linewidths=0.5)
```
%% Cell type:markdown id: tags:
### Avoid Trap
%% Cell type:code id: tags:
``` python
colors = pd.get_dummies(df.Color, drop_first=True)
colors.head()
```
%% Cell type:markdown id: tags:
## Numerical Encoding of Text
%% Cell type:markdown id: tags:
### TF-IDF
%% Cell type:code id: tags:
``` python
from sklearn.feature_extraction.text import TfidfVectorizer
```
%% Cell type:code id: tags:
``` python
corpus = [
"The Limmat flows out of the lake.",
"The bears are in the bear pit near the river.",
"The Rhône flows out of Lake Geneva.",
]
vectorizer = TfidfVectorizer()
vectors= vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names()
dense = vectors.todense().tolist()
pd.DataFrame(dense, columns=feature_names).transpose()
```
%% Cell type:markdown id: tags:
### Word Embeddings
%% Cell type:code id: tags:
``` python
from gensim.models import KeyedVectors
vectors = KeyedVectors.load("../../cc.de.300-distilled.vec", mmap="r")
vectors.syn0norm = vectors.syn0
```
%% Cell type:code id: tags:
``` python
vectors["Mann"]
```
%% Cell type:code id: tags:
``` python
most_similar = vectors.wv.most_similar(positive=["Frau", "König"], negative=["Mann"])[0]
print("König - Mann + Frau = ", most_similar)
```
%% Cell type:markdown id: tags:
## Profile Report
%% Cell type:code id: tags:
``` python
import pandas_profiling
profile = df.profile_report(html={'style':{'full_width':True}})
# Save report
profile.to_file(output_file="c")
profile
```
......
......@@ -12,3 +12,4 @@ tqdm
plotly
scipy
tensorflow
wordcloud
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment