Python Projects

Build an algorithm that detects counterfeit banknotes from their dimensions (a minimal end-to-end sketch of these steps follows the list):

  1. Complete the training dataset by imputing the missing values with a linear regression.
  2. Train the learning algorithm.
  3. Test the trained algorithm against the real banknote data.
  4. Save the trained model to a file.
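
A minimal sketch of the four steps above, assuming the file names (billets.csv, billets_production.csv) and column names used later in this notebook; the variable names and exact feature choices in this sketch are illustrative, not the notebook's code verbatim.

import pandas as pd
import pickle
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("billets.csv", sep=";")
reg_features = ["diagonal", "height_left", "height_right", "margin_up", "length"]

# 1. Impute the missing margin_low values with a linear regression on the other dimensions.
known = data.dropna(subset=["margin_low"])
imputer = LinearRegression().fit(known[reg_features], known["margin_low"])
mask = data["margin_low"].isna()
data.loc[mask, "margin_low"] = imputer.predict(data.loc[mask, reg_features])

# 2. Train a classifier on a stratified split.
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns="is_genuine"), data["is_genuine"],
    test_size=0.33, stratify=data["is_genuine"], random_state=42)
clf_features = ["height_right", "margin_low", "margin_up", "length"]
clf = LogisticRegression(random_state=0).fit(X_train[clf_features], y_train)

# 3. Evaluate on the held-out banknotes.
print(clf.score(X_test[clf_features], y_test))

# 4. Save the fitted model for the detection application.
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)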

Notebooks and programming tools:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as st
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
import random
from sklearn.model_selection import train_test_split
import pingouin as pg
from sklearn import decomposition
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

In [2]:

data = pd.read_csv("billets.csv", sep = ";")
ref = pd.read_csv("billets_production.csv", sep = ",")

In [3]:

data.info()
data.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   is_genuine    1500 non-null   bool   
 1   diagonal      1500 non-null   float64
 2   height_left   1500 non-null   float64
 3   height_right  1500 non-null   float64
 4   margin_low    1463 non-null   float64
 5   margin_up     1500 non-null   float64
 6   length        1500 non-null   float64
dtypes: bool(1), float64(6)
memory usage: 71.9 KB

Out[3]:

          diagonal  height_left  height_right   margin_low    margin_up       length
count  1500.000000  1500.000000   1500.000000  1463.000000  1500.000000   1500.00000
mean    171.958440   104.029533    103.920307     4.485967     3.151473    112.67850
std       0.305195     0.299462      0.325627     0.663813     0.231813      0.87273
min     171.040000   103.140000    102.820000     2.980000     2.270000    109.49000
25%     171.750000   103.820000    103.710000     4.015000     2.990000    112.03000
50%     171.960000   104.040000    103.920000     4.310000     3.140000    112.96000
75%     172.170000   104.230000    104.150000     4.870000     3.310000    113.34000
max     173.010000   104.880000    104.950000     6.900000     3.910000    114.44000

There are missing values; identify and count them:

In [4]:

out_index = data.index[data.isnull().any(axis=1)]
print(out_index)
print(data.isna().sum())
Int64Index([  72,   99,  151,  197,  241,  251,  284,  334,  410,  413,  445,
             481,  505,  611,  654,  675,  710,  739,  742,  780,  798,  844,
             845,  871,  895,  919,  945,  946,  981, 1076, 1121, 1176, 1303,
            1315, 1347, 1435, 1438],
           dtype='int64')
is_genuine       0
diagonal         0
height_left      0
height_right     0
margin_low      37
margin_up        0
length           0
dtype: int64

Replace these missing values by using a linear regression to predict them.

Split the data into two sets: one with the complete rows, the other with the rows to be completed.

In [5]:

data_in = data.dropna()
data_in.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1463 entries, 0 to 1499
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   is_genuine    1463 non-null   bool   
 1   diagonal      1463 non-null   float64
 2   height_left   1463 non-null   float64
 3   height_right  1463 non-null   float64
 4   margin_low    1463 non-null   float64
 5   margin_up     1463 non-null   float64
 6   length        1463 non-null   float64
dtypes: bool(1), float64(6)
memory usage: 81.4 KB

In [6]:

data_out = data[data.isnull().any(axis=1)]
data_out.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 72 to 1438
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   is_genuine    37 non-null     bool   
 1   diagonal      37 non-null     float64
 2   height_left   37 non-null     float64
 3   height_right  37 non-null     float64
 4   margin_low    0 non-null      float64
 5   margin_up     37 non-null     float64
 6   length        37 non-null     float64
dtypes: bool(1), float64(6)
memory usage: 2.1 KB

Use a linear regression to complete the data

In [7]:

X = data_in[["height_left", "height_right", "margin_up", "length", "diagonal"]]

In [8]:

y = data_in["margin_low"]

In [9]:

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

In [10]:

model.summary()

Out[10]:

Dep. Variable:    margin_low         R-squared:           0.477
Model:            OLS                Adj. R-squared:      0.476
Method:           Least Squares      F-statistic:         266.1
Date:             Tue, 04 Apr 2023   Prob (F-statistic):  2.60e-202
Time:             09:48:19           Log-Likelihood:      -1001.3
No. Observations: 1463               AIC:                 2015.
Df Residuals:     1457               BIC:                 2046.
Df Model:         5
Covariance Type:  nonrobust

                  coef   std err        t    P>|t|   [0.025   0.975]
const          22.9948     9.656    2.382    0.017    4.055   41.935
height_left     0.1841     0.045    4.113    0.000    0.096    0.272
height_right    0.2571     0.043    5.978    0.000    0.173    0.342
margin_up       0.2562     0.064    3.980    0.000    0.130    0.382
length         -0.4091     0.018  -22.627    0.000   -0.445   -0.374
diagonal       -0.1111     0.041   -2.680    0.007   -0.192   -0.030

Omnibus:        73.627   Durbin-Watson:       1.893
Prob(Omnibus):   0.000   Jarque-Bera (JB):   95.862
Skew:            0.482   Prob(JB):         1.53e-21
Kurtosis:        3.801   Cond. No.         1.94e+05

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.94e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

The coef column gives the coefficient estimated for each explanatory variable, and P>|t| indicates the significance of each variable. The residual diagnostics (Omnibus, Prob(Omnibus), Jarque-Bera) test the normality of the residuals; the reported kurtosis should be close to 3 for normally distributed residuals.
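
For reference, the same coefficients, p-values and confidence intervals can be pulled programmatically from the fitted results object; a small illustrative snippet (standard statsmodels attributes, not a cell from the original notebook):

# Coefficients, p-values and 95% confidence intervals from the fitted OLS results.
coefs = model.params
pvalues = model.pvalues
conf_int = model.conf_int()
print(pd.concat([coefs.rename("coef"), pvalues.rename("P>|t|"), conf_int], axis=1))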

In [11]:

# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

No multicollinearity in the model

In [12]:

vif

Out[12]:

      VIF Factor      features
0  590198.238883         const
1       1.138261   height_left
2       1.230115  height_right
3       1.404404     margin_up
4       1.576950        length
5       1.013613      diagonal
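
As a reminder (a standard rule of thumb, not something computed in the notebook): the VIF of regressor j is VIF_j = 1 / (1 - R_j²), where R_j² is the R² of a regression of that variable on all the other regressors; values above roughly 5-10 signal problematic multicollinearity. All explanatory variables here stay below 1.6, and the very large VIF of const is a normal artefact of the intercept column, not a sign of collinearity.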

In [13]:

ax = pg.qqplot(model.resid, dist = "norm")
#plt.savefig("01.jpg", dpi = 150)
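
The QQ plot can be complemented by a numeric normality test on the residuals; a minimal sketch using scipy.stats (imported above as st) — the Shapiro-Wilk test is my addition, it was not run in the original notebook:

# Shapiro-Wilk normality test on the OLS residuals (illustrative complement to the QQ plot).
stat, p_value = st.shapiro(model.resid)
print(f"Shapiro-Wilk statistic = {stat:.3f}, p-value = {p_value:.3g}")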

Create a linear regression model

In [14]:

regression_model = LinearRegression()

Fit the model to the known data

In [15]:

X

Out[15]:

      const  height_left  height_right  margin_up  length  diagonal
0       1.0       104.86        104.95       2.89  112.83    171.81
1       1.0       103.36        103.66       2.99  113.09    171.46
2       1.0       104.48        103.50       2.94  113.16    172.69
3       1.0       103.91        103.94       3.01  113.51    171.36
4       1.0       104.28        103.46       3.48  112.54    171.73
...     ...          ...           ...        ...     ...       ...
1495    1.0       104.38        104.17       3.09  111.28    171.75
1496    1.0       104.63        104.44       3.37  110.97    172.19
1497    1.0       104.01        104.12       3.36  111.95    171.80
1498    1.0       104.28        104.06       3.46  112.25    172.06
1499    1.0       104.15        103.82       3.37  112.07    171.47
1463 rows × 6 columns

In [16]:

regression_model.fit(X[X.columns[1:]], y)

Out[16]:

LinearRegression()

Select the observations with missing values

In [17]:

missing_data = data[data["margin_low"].isnull()]

Select the variables used to predict the missing values

In [18]:

X_missing = missing_data[["height_left", "height_right", "margin_up", "length", "diagonal"]]

Predict the missing values

In [19]:

y_missing = regression_model.predict(X_missing)

Replace the missing values in the dataset

In [20]:

data.loc[data["margin_low"].isnull(), "margin_low"] = y_missing
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   is_genuine    1500 non-null   bool   
 1   diagonal      1500 non-null   float64
 2   height_left   1500 non-null   float64
 3   height_right  1500 non-null   float64
 4   margin_low    1500 non-null   float64
 5   margin_up     1500 non-null   float64
 6   length        1500 non-null   float64
dtypes: bool(1), float64(6)
memory usage: 71.9 KB

The data are now complete

In [21]:

compar_1 = []
for i in out_index:
    compar_1.append(data["margin_low"].iloc[i])
compar_1.sort()

In [22]:

#not_index = []
#while len(not_index) < 37:
#    num = random.randint(1, 1500)
#    if num not in out_index and num not in not_index:
#        not_index.append(num)

In [23]:

#np.array(not_index)

In [24]:

not_index = [ 473,  285, 1234, 1229,  214,  392, 1481,  416,  440, 1363,  220,
       1085,  247,  553,  571, 1035,  437, 1370,  974, 1322,  792,  907,
       1231, 1100,  232,  706, 1003,  790,  803, 1369,  433, 1388,  537,
       1349, 1237,  658,  346]

In [25]:

compar_2 = []
for i in not_index:
    compar_2.append(data["margin_low"].iloc[i])
compar_2.sort()

Comparison between existing values and replaced values

In [26]:

comparaison = pd.DataFrame(compar_2, compar_1).reset_index().rename(columns = {"index" : "Remplacées", 0 : "Existantes"})
comparaison

Out[26]:

    Remplacées  Existantes
0     3.614306        3.25
1     3.746333        3.75
2     3.768554        3.77
3     3.803308        3.85
4     3.893748        3.85
5     4.058764        3.90
6     4.080414        3.93
7     4.093621        3.94
8     4.094065        4.04
9     4.127442        4.10
10    4.135034        4.11
11    4.137780        4.15
12    4.160539        4.25
13    4.160607        4.29
14    4.177420        4.31
15    4.179736        4.38
16    4.237415        4.43
17    4.249629        4.47
18    4.298047        4.51
19    4.318525        4.51
20    4.319014        4.53
21    4.341643        4.54
22    4.371811        4.60
23    4.393668        4.72
24    4.410457        4.84
25    4.439846        4.86
26    4.470650        5.06
27    4.650617        5.19
28    4.710533        5.27
29    4.778967        5.38
30    4.802145        5.40
31    5.047570        5.46
32    5.050277        5.52
33    5.067584        5.77
34    5.140043        5.80
35    5.185862        6.05
36    5.726993        6.19

Replacement validated: the smallest replaced value is larger than the smallest existing value, and the largest replaced value is smaller than the largest existing value, so the imputed values stay within the observed range.
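
As a quick statistical complement to this range check (my addition, not a cell from the original notebook), a two-sample Kolmogorov-Smirnov test can compare the distribution of the imputed values with the sampled original values, using scipy.stats already imported as st:

# Two-sample KS test: do the imputed margin_low values follow a distribution
# similar to the sampled original values? (illustrative check)
stat, p_value = st.ks_2samp(compar_1, compar_2)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")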

Description of the variables

In [27]:

desc = data.describe()

Univariate analysis of the variables with box plots

In [28]:

for col in data.columns[1:]:
    plt.figure()
    plt.boxplot(x = data[col])
    plt.title(col)
#    plt.savefig(f"02 {col} .jpg", dpi = 150)

Identification of outliers

In [29]:

# 1.5 * IQR bounds computed from the descriptive statistics
num_cols = ["diagonal", "height_left", "height_right", "margin_low", "margin_up", "length"]
q1 = desc.loc["25%", num_cols]
q3 = desc.loc["75%", num_cols]
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values falling outside the bounds (NaN everywhere else)
outliers = data[num_cols][(data[num_cols] < lower_bound) | (data[num_cols] > upper_bound)]

Study of the distributions

In [30]:

for col in data.columns[1:]:
    plt.figure()
    plt.hist(data[col], bins = 20)
    plt.title(col)
#    plt.savefig(f"03 {col} .jpg", dpi = 150)

Analysis of the relationships between the variables

In [31]:

corr = data.drop(columns = "is_genuine").corr()
plt.figure()
plt.imshow(corr, cmap = 'coolwarm')
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation = 90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Relations entre variables", fontsize = 15)
#plt.savefig("04.jpg", dpi = 150)
plt.show()

Pairplot colored by the genuine/fake label

In [32]:

figg = sns.pairplot(data, hue = "is_genuine")
figg.fig.suptitle("Affichage par variable en fonction du biais", fontsize = 20, y = 1.02);
#plt.savefig("05.jpg", dpi = 150)

The data must be split into two parts with an equivalent distribution of replaced and original values

Split into two files

In [33]:

data.head()

Out[33]:

   is_genuine  diagonal  height_left  height_right  margin_low  margin_up  length
0        True    171.81       104.86        104.95        4.52       2.89  112.83
1        True    171.46       103.36        103.66        3.77       2.99  113.09
2        True    172.69       104.48        103.50        4.40       2.94  113.16
3        True    171.36       103.91        103.94        3.62       3.01  113.51
4        True    171.73       104.28        103.46        4.04       3.48  112.54

In [34]:

X_train, X_test, y_train, y_test = train_test_split(data.drop(columns = "is_genuine", axis = 1), data["is_genuine"], test_size = 0.33, stratify = data["is_genuine"], random_state = 42)

Preview of the banknotes selected for training and testing

In [35]:

X_train.head()

Out[35]:

      diagonal  height_left  height_right  margin_low  margin_up  length
780     172.41       103.95        103.79    4.080414       3.13  113.41
1367    171.60       104.37        104.20    5.820000       3.08  112.84
1477    172.16       104.23        104.19    5.090000       3.61  112.43
1203    172.02       104.22        104.19    5.140000       3.73  110.49
1192    171.95       104.08        104.08    5.660000       3.32  110.93

In [36]:

X_test.head()

Out[36]:

      diagonal  height_left  height_right  margin_low  margin_up  length
41      172.08       104.19        103.82        3.99       3.21  113.20
426     171.91       103.99        103.50        3.41       2.92  113.02
132     171.84       103.77        103.98        4.61       2.99  113.59
1295    171.90       103.97        104.36        5.59       3.61  112.05
360     171.61       103.73        104.01        4.29       2.93  112.76

In [37]:

#X_test.to_csv("file_test.csv", index = False)

Counting the True and False labels

In [38]:

y_train = data.loc[:, data.columns == "is_genuine"]
y_train.value_counts()

Out[38]:

is_genuine
True          1000
False          500
dtype: int64

In [39]:

X_train = data.loc[:, data.columns != "is_genuine"]
X_train = sm.tools.add_constant(X_train)
X_train.head()

Out[39]:

   const  diagonal  height_left  height_right  margin_low  margin_up  length
0    1.0    171.81       104.86        104.95        4.52       2.89  112.83
1    1.0    171.46       103.36        103.66        3.77       2.99  113.09
2    1.0    172.69       104.48        103.50        4.40       2.94  113.16
3    1.0    171.36       103.91        103.94        3.62       3.01  113.51
4    1.0    171.73       104.28        103.46        4.04       3.48  112.54

Prediction with k-means using 2 clusters

In [40]:

# Number of clusters:
n_clust = 2

# K-means clustering:
km = KMeans(n_clusters = n_clust, random_state = 1994)
x_km = km.fit_transform(data[["diagonal","height_left","height_right","margin_low","margin_up","length"]])

In [41]:

# Add a column containing the cluster label
clusters_km = km.labels_

In [42]:

data["cluster_km"] = km.labels_
data["cluster_km"] = data["cluster_km"].apply(str)

In [43]:

centroids_km = km.cluster_centers_

In [44]:

# Project the individuals onto the first factorial plane (PCA):
pca_km = decomposition.PCA(n_components = 3).fit(data[["diagonal","height_left","height_right","margin_low","margin_up","length"]])
acp_km = PCA(n_components = 3).fit_transform(data[["diagonal","height_left","height_right","margin_low","margin_up","length"]])

centroids_km_projected = pca_km.transform(centroids_km)

In [45]:

# Plot:
for couleur,k in zip(["#AAAAAA", "#55AA55"],[0,1]):
    plt.scatter(acp_km[km.labels_ == k, 0], acp_km[km.labels_ == k, 1], c = couleur, s = 60, edgecolors="#FFFFFF", label = "Cluster {}".format(k))
    plt.legend()
    plt.scatter(centroids_km_projected[:, 0],centroids_km_projected[:, 1], color="blue", label="Centroïdes")
plt.title("Projection des individus et des {} centroïdes sur le premier plan factoriel".format(len(centroids_km)), fontsize = 10)
#plt.savefig("06.jpg", dpi = 150)
plt.show()

# Check the classification: confusion matrix:
km_matrix = pd.crosstab(data["is_genuine"], clusters_km)
print(km_matrix)
col_0         0    1
is_genuine          
False        19  481
True        997    3

In [46]:

plt.title("Projection des point en fonction de leur véritable nature")
sns.scatterplot(x = acp_km[:, 0], y = acp_km[:, 1], hue = data["is_genuine"]);
#plt.savefig("06b.jpg", dpi = 150)

In [47]:

# Plot:
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
sns.heatmap(km_matrix,
            annot = True,
            fmt = ".3g", 
            cmap = sns.color_palette("mako", as_cmap = True),
            linecolor = "white",
            linewidths = 0.3,
            xticklabels = ["1", "0"],
            yticklabels = ["Faux","Vrai"])
plt.xlabel("Cluster")
plt.ylabel("is_genuine")
plt.title("Matrice de confusion", fontsize = 15)

plt.subplot(1, 2, 2)
for couleur,k in zip(["#AAAAAA", "#55AA55"],[0,1]):
    plt.scatter(acp_km[km.labels_ == k, 0],acp_km[km.labels_ == k, 1],c = couleur, s = 60, edgecolors="#FFFFFF", label = "Cluster {}".format(k))
    plt.legend()
    plt.scatter(centroids_km_projected[:, 0],centroids_km_projected[:, 1], color = "blue", label="Centroïdes")
plt.title("Projection des individus et des {} centroïdes sur le premier plan factoriel".format(len(centroids_km)), fontsize = 15)
#plt.savefig("07.jpg", dpi = 150)
;

Out[47]:

''

In [48]:

data.head()

Out[48]:

  is_genuine  diagonal  height_left  height_right  margin_low  margin_up  length cluster_km
0       True    171.81       104.86        104.95        4.52       2.89  112.83          0
1       True    171.46       103.36        103.66        3.77       2.99  113.09          0
2       True    172.69       104.48        103.50        4.40       2.94  113.16          0
3       True    171.36       103.91        103.94        3.62       3.01  113.51          0
4       True    171.73       104.28        103.46        4.04       3.48  112.54          0

Logistic regression

In [49]:

# Build the model and fit it to the data
log_reg = sm.Logit(y_train, X_train).fit()
Optimization terminated successfully.
         Current function value: 0.028228
         Iterations 13

In [50]:

log_reg.summary()

Out[50]:

Dep. Variable:    is_genuine         No. Observations:  1500
Model:            Logit              Df Residuals:      1493
Method:           MLE                Df Model:          6
Date:             Tue, 04 Apr 2023   Pseudo R-squ.:     0.9557
Time:             09:48:33           Log-Likelihood:    -42.342
converged:        True               LL-Null:           -954.77
Covariance Type:  nonrobust          LLR p-value:       0.000

                  coef   std err        z    P>|z|    [0.025    0.975]
const        -204.5582   241.768   -0.846    0.398  -678.415   269.299
diagonal        0.0680     1.091    0.062    0.950    -2.071     2.207
height_left    -1.7162     1.104   -1.555    0.120    -3.880     0.447
height_right   -2.2584     1.072   -2.107    0.035    -4.359    -0.157
margin_low     -5.7756     0.937   -6.164    0.000    -7.612    -3.939
margin_up     -10.1531     2.108   -4.817    0.000   -14.284    -6.022
length          5.9129     0.846    6.991    0.000     4.255     7.571

Possibly complete quasi-separation: A fraction 0.51 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
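
Despite the quasi-separation warning, the in-sample fit of this Logit model can be inspected directly; a small sketch (pred_table is a standard method of statsmodels Logit results, but this call is my addition, not a cell from the original notebook):

# In-sample classification table of the Logit fit
# (rows: observed classes, columns: predicted classes at a 0.5 threshold).
print(log_reg.pred_table(threshold=0.5))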

In [51]:

clf = LogisticRegression(random_state = 0).fit(X_train[["height_right", "margin_low", "margin_up", "length"]], y_train.values.ravel())

In [52]:

prediction = clf.predict(X_test[["height_right", "margin_low", "margin_up", "length"]])
prediction

Out[52]:

array([ True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True,  True,  True, False,  True,  True,  True, False,
       False, False, False, False,  True, False, False,  True,  True,
       False,  True,  True, False,  True,  True, False, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True, False,  True, False,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True, False, False,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True, False, False,  True,  True,
        True,  True, False,  True, False,  True,  True, False,  True,
        True,  True,  True,  True, False,  True, False, False, False,
        True, False,  True, False,  True,  True, False,  True,  True,
       False,  True,  True, False, False, False,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True, False, False,
        True,  True, False,  True, False, False, False,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False, False,  True,  True, False,
       False,  True,  True, False, False,  True,  True,  True, False,
        True,  True,  True, False,  True,  True,  True, False,  True,
        True,  True, False,  True,  True,  True,  True,  True, False,
        True,  True,  True, False, False, False,  True,  True, False,
        True, False,  True, False,  True,  True,  True, False,  True,
        True,  True,  True, False, False,  True, False,  True, False,
        True,  True,  True,  True, False, False,  True,  True,  True,
        True, False, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True, False,  True, False,  True,  True,
        True,  True, False,  True,  True,  True,  True, False, False,
        True,  True, False,  True,  True,  True,  True, False,  True,
        True,  True, False,  True,  True, False, False, False, False,
        True, False, False, False,  True,  True,  True,  True, False,
        True,  True,  True,  True, False,  True,  True,  True, False,
       False,  True,  True,  True,  True, False,  True, False,  True,
       False, False, False, False,  True, False,  True,  True,  True,
        True, False, False, False, False,  True,  True,  True,  True,
        True,  True, False,  True, False, False,  True,  True,  True,
        True, False,  True,  True,  True, False,  True,  True, False,
        True,  True, False,  True,  True, False, False,  True,  True,
        True,  True,  True,  True, False, False,  True,  True, False,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True, False,  True,
        True,  True,  True, False,  True,  True,  True, False,  True,
        True, False,  True,  True,  True, False,  True,  True,  True,
        True, False, False, False,  True,  True,  True,  True,  True,
        True, False, False,  True,  True,  True, False,  True,  True,
        True,  True,  True, False, False,  True, False,  True, False,
        True, False,  True,  True,  True, False, False, False,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
       False, False, False, False, False, False,  True,  True, False,
        True,  True, False, False,  True,  True, False, False,  True,
        True,  True,  True,  True,  True,  True, False,  True, False,
       False, False,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True, False, False,  True,  True,  True,
        True,  True,  True,  True,  True, False, False,  True,  True])

In [53]:

X_train

Out[53]:

      const  diagonal  height_left  height_right  margin_low  margin_up  length
0       1.0    171.81       104.86        104.95        4.52       2.89  112.83
1       1.0    171.46       103.36        103.66        3.77       2.99  113.09
2       1.0    172.69       104.48        103.50        4.40       2.94  113.16
3       1.0    171.36       103.91        103.94        3.62       3.01  113.51
4       1.0    171.73       104.28        103.46        4.04       3.48  112.54
...     ...       ...          ...           ...         ...        ...     ...
1495    1.0    171.75       104.38        104.17        4.42       3.09  111.28
1496    1.0    172.19       104.63        104.44        5.27       3.37  110.97
1497    1.0    171.80       104.01        104.12        5.51       3.36  111.95
1498    1.0    172.06       104.28        104.06        5.17       3.46  112.25
1499    1.0    171.47       104.15        103.82        4.63       3.37  112.07

1500 rows × 7 columns

Confusion matrix

In [54]:

# Build the confusion matrix (conf_mat avoids shadowing the imported confusion_matrix function)
conf_mat = confusion_matrix(y_test, prediction)
print(conf_mat)
[[162   3]
 [  2 328]]
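
From the printed matrix, the headline metrics can be derived directly; a quick illustrative check based on the values above (my arithmetic, not output produced by the notebook):

# Derive accuracy, precision and recall for the "genuine" class from the confusion matrix.
tn, fp, fn, tp = conf_mat.ravel()           # 162, 3, 2, 328 with the values printed above
accuracy = (tp + tn) / (tp + tn + fp + fn)  # (328 + 162) / 495 ≈ 0.990
precision = tp / (tp + fp)                  # 328 / 331 ≈ 0.991
recall = tp / (tp + fn)                     # 328 / 330 ≈ 0.994
print(accuracy, precision, recall)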

In [55]:

# Plot:
plt.figure(figsize=(10, 6))
plt.subplot()
sns.heatmap(conf_mat,
            annot = True,
            fmt = ".3g", 
            cmap = sns.color_palette("mako", as_cmap = True),
            linecolor = "white",
            linewidths = 0.3,
            xticklabels = ["Faux","Vrai"],
            yticklabels = ["0", "1"])
plt.xlabel("Prédictions")
plt.ylabel("is_genuine")
plt.title("Matrice de confusion sur régression logistique", fontsize = 15);
#plt.savefig("08.jpg", dpi = 150)

In [56]:

ref

Out[56]:

   diagonal  height_left  height_right  margin_low  margin_up  length   id
0    171.76       104.01        103.54        5.21       3.30  111.42  A_1
1    171.87       104.17        104.13        6.00       3.31  112.09  A_2
2    172.00       104.58        104.29        4.99       3.39  111.57  A_3
3    172.49       104.55        104.34        4.44       3.03  113.20  A_4
4    171.65       103.63        103.56        3.77       3.16  113.33  A_5

In [57]:

# Predictions on unseen data:
X_test_clf = ref[["height_right","margin_low","margin_up","length"]]

ref["reg_pred"] = clf.predict(X_test_clf)
print(ref[["id","reg_pred"]])
    id  reg_pred
0  A_1     False
1  A_2     False
2  A_3     False
3  A_4      True
4  A_5      True

In [58]:

ref

Out[58]:

   diagonal  height_left  height_right  margin_low  margin_up  length   id  reg_pred
0    171.76       104.01        103.54        5.21       3.30  111.42  A_1     False
1    171.87       104.17        104.13        6.00       3.31  112.09  A_2     False
2    172.00       104.58        104.29        4.99       3.39  111.57  A_3     False
3    172.49       104.55        104.34        4.44       3.03  113.20  A_4      True
4    171.65       103.63        103.56        3.77       3.16  113.33  A_5      True

In [59]:

#file_export = ref[["diagonal", "height_left", "height_right", "margin_low", "margin_up", "length", "id"]]

In [60]:

#file_export.to_csv("ref_predict.csv", index = False)

In [61]:

# save
#with open("model.pkl", "wb") as f:
#    pickle.dump(clf,f)
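
For the detection application referenced at the end of this page, the saved model would be reloaded and reused as sketched below; this assumes model.pkl has been written by the commented-out cell above and that the same four features are supplied (my sketch, not a cell from the original notebook):

# Reload the persisted classifier and apply it to new banknote measurements.
with open("model.pkl", "rb") as f:
    loaded_clf = pickle.load(f)

new_banknotes = pd.read_csv("billets_production.csv")   # same format as the ref file above
print(loaded_clf.predict(new_banknotes[["height_right", "margin_low", "margin_up", "length"]]))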


Go to the banknote detection application.

Back to Data-Analyst