Prédire le score des présidentielles (de 2007 :-) )

Dans le très intéressant datablog d’OWNI la question est posée d’une éventuelle corrélation entre données sur le logement et le score de Sarkozy en 2007, commune par commune.

Cela nous a amené à aborder cette question en envisageant le point de vue prédictif : durant les présidentielles 2007, à partir des données rassemblées par OWNI et des scores partiels pour certaines communes, dans quelle mesure aurait-on pu prévoir le score des communes restantes?

De fait, nous avons importé les données d’OWNI dans PredictiveDB:


test=> select * from communes_owni limit 5 ;
  cp   |         commune         | dep | totallog | soclog |        socratio         | population | sarkozy 
-------+-------------------------+-----+----------+--------+-------------------------+------------+---------
 01001 | L'Abergement-Clémenciat | 01  |      329 |     18 |  5.47112462006079027400 |        804 |   61.52
 01004 | Ambérieu-en-Bugey       | 01  |     5949 |   1280 | 21.51622121364935283200 |      12696 |   52.08
 01005 | Ambérieux-en-Dombes     | 01  |      579 |     31 |  5.35405872193436960300 |       1544 |   64.03
 01007 | Ambronay                | 01  |     1049 |     67 |  6.38703527168732125800 |       2263 |   57.58
 01010 | Anglefort               | 01  |      510 |     44 |  8.62745098039215686300 |        918 |   54.88
(5 lignes)

Puis, nous avons créé une vue prédictive portant sur le score à la présidentielle de Sarkozy (colonne “sarkozy”).

On obtient une vue prédictive avec une erreur moyenne de 5.48, ce qui est plutôt bon. En 2007, les données d’owni portant sur les logements sociaux auraient été un excellent indicateur des votes.

D’autres idées de validation? de prédictions?…

PS. Le détail des performances:

  • root mean squared error (RMSE): 7.09
  • erreur absolue moyenne (MAE): 5.48
  • coefficient de corrélation: 0.67

La différence moyenne entre les scores prévus et les scores réels est d’environ 5.5 points. Il y a des variations quant à la magnitude des erreurs, mais probablement peu importantes (différence entre RMSE et MAE faible).

Categorize text in a click

Categorize all sorts of text is easy with PredictiveDB.

Quick tutorial:

1) Data: We use the four main classes of the data found here.

It’s a collection of web pages annotated with tags.

For instance, the following URL is tagged as a student’s web page (tag student):

    select * from webkbdemo order by random() limit 1;


                        url                                        |   tag   |      webpage
   ----------------------------------------------------------------+---------+---------------------------------------
     http://www.cc.gatech.edu/grads/g/Vipul.Gupta/homepage.html    | student | ****** Vipul Gupta ****** ========
                                                                               Vipul.Gupta.gif"> Vipul is a Masters Student in Computer Science at
                                                                               Georgia Tech. He is a part of Parallel Simulation Group at College of
                                                                               Computingand Medical Informatics Group at Graphics, Visualization, and
                                                                               Usability Center. Some of the projects he is involved in relate to
                                                                               registration, segmentation, matching and display of 3- D brain images
                                                                               and Time Warp for Dynamic Load Balancing on Multiprocessor
                                                                               Architectures.  for more information about him and his country.
                                                                               Matching and Display of 3-D Brain Images Warp Mechanism Registration,
                                                                               Segmentation, Matching and Display of 3-D Brain Images -!> Dynamic
                                                                               Load Balancing using Time Warp Mechanism -!> [Georgia Tech] medical
                                                                               informatics/images/welcome icon.gif"ALIGN=Middle ALT="Georgia Tech">
                                                                               [Georgia Tech] [Georgia Tech] images/jup.gif" Collision of comet
                                                                               Shoemaker-Levy 9 >      Graphics, Visualization & Usability Center
                                                                               College of Computing      Georgia Institute of Technology
                                                                               Atlanta, GA 30332-0280      vgupta@cc.gatech.edu Other information
                                                                               available via finger. 

Available tags are the following:

   demo=> select tag from webkbdemo where tag is not null group by tag;
      tag   
   ---------
    student
    project
    faculty
    course

2) Creation of the predictive view:

 demo=> insert into pdb.predictiveview VALUES ('webkbdemo','public','webkbdemo','url','tagf') ;

It’s over: from the hand-checked tags, we’ve just build an automatic text categorizer.

As usual, predictions are available with a select:

  demo=> select * from pdb.webkbdemo where url='http://www.cs.utexas.edu/users/jyluo';
                    url                  |   tag   |    error_risk     
   --------------------------------------+---------+-------------------
    http://www.cs.utexas.edu/users/jyluo | student | 20.17208413001912

Learn more: types and PredictiveDB

PredictiveDB uses a strong type system, which means that its behaviour strongly depends on the defined types. It is important to take this fact into account during the conception of your database’s schema.

For instance, looking at the fields of the table webkbdemo:

  demo=> \d webkbdemo
           Table « public.webkbdemo »
   Colonne |       Type        | Modificateurs 
  ---------+-------------------+---------------
   url     | character varying | not null
   tag     | character varying | 
   webpage | text              | 

There are three fields featuring two distinct types:

  • fields url and tag:

The type of these fields is VARCHAR. PredictiveDB deals with this data as categorical or nominal attributes. This means that, for PredictiveDB, tag, for example, will take a limited number of values, which are:

   demo=> select tag from webkbdemo where tag is not null group by tag;
      tag   
   ---------
    student
    project
    faculty
    course

In other words, PredictiveDB will not see any difference between these lines:

                                       url                                     |   tag   
   ----------------------------------------------------------------------------+---------
    http://www.das.harvard.edu/users/students/Adam_Deaton                      | student
    http://WWW.CS.Yale.EDU/homes/xiao                                          | student
    http://www.cs.rpi.edu/~eganma/index.html                                   | student
    http://cs.nyu.edu/cs/dept_info/course_home_pages/spr96/G22.2274/index.html | course

and these lines:

 url  |  tag     
------+--------
 url1 |  tag1
 url2 |  tag1
 url3 |  tag1
 url4 |  tag2

URLs http://www.das.harvard.edu/users/students/Adam_Deaton and http://WWW.CS.Yale.EDU/homes/xiao are viewed as mere labels, and their formulation does not matter. The only thing that matters is that we can distinguish from one another. Fields like red/blue/green, a/b/c, and so on, must be declared as VARCHAR.

  • field webpage:

The field message is typed as TEXT. PredictiveDB does not handle TEXT fields as VARCHAR fields: contrary to what we said before, here, the contents of fields matter.

For instance, in the two following pages:

 student | This homepage undergoing vast renovations! Stay tuned...
           This will never be done, to be totally honest...
           ===============================================================================
           Here s some graphic. Actually, this is me, only it s not terribly
           recognizable...
           ===============================================================================
           ***** Stuff that s interesting ***** Well, I ve been until last year a
           Computer Science student back at UB (the University at Buffalo). Now I m
           doing graduate work at William and Mary in Williamsburg, Virginia[...]

and

 course | ****** CIS 730 ****** Principles of Artificial Intelligence *****
          The following information is available *****     * Course Description
          * Course Material     * Assignments     * Exams     * Final Marks 

PredictiveDB will take into account the presence of the word “graduate” in the first document, but not in the second; or that the word “Marks” is specific to the second message.

Conclusion

Well-defined types are required to obtain good-quality predictions: if webpage was declared as VARCHAR in the previous examples, the examples could have been:

 student | webpage34

and

 course | webpage432

without making any difference. All the textual information (course, Assignments, Marks…) would have been ignored.

Présentation à “la Cantine”

Pour ceux qui ont manqué notre présentation à la Cantine, nos transparents sont maintenant en ligne:

New alpha-test accounts in January

Thank you very much for your interest in the alpha-test of PredictiveDB.

PredictiveDB has reached full capacity of its servers and we are working to allocate new servers and open more accounts in January.

Everyone have a happy new year.

Nouveaux comptes en janvier

Merci beaucoup de votre intérêt pour l’alpha-test de PredictiveDB.

La prédiction est coûteuse en temps de calcul et en mémoire et PredictiveDB a besoin de plus de puissance de calcul.

Nous travaillons au déploiement de nouvelles machines pour ouvrir des comptes supplémentaires dès janvier.

D’ici là, passez d’excellentes fêtes de fin d’année.

Classer du texte en une instruction

Classer tout type de texte avec PredictiveDB est aisé.

Démonstration rapide:

1) Données: nous utilisons les quatre classes principales des données présentes ici

Il s’agit d’une collection de pages web annotées à l’aide de méta-données (des tags):

L’URL ci-dessous, par exemple, est annotée comme page personnelle d’un étudiant (tag “student”):

    select * from webkbdemo order by random() limit 1;


                        url                                        |   tag   |      webpage
   ----------------------------------------------------------------+---------+---------------------------------------
     http://www.cc.gatech.edu/grads/g/Vipul.Gupta/homepage.html    | student | ****** Vipul Gupta ****** ========
                                                                               Vipul.Gupta.gif"> Vipul is a Masters Student in Computer Science at
                                                                               Georgia Tech. He is a part of Parallel Simulation Group at College of
                                                                               Computingand Medical Informatics Group at Graphics, Visualization, and
                                                                               Usability Center. Some of the projects he is involved in relate to
                                                                               registration, segmentation, matching and display of 3- D brain images
                                                                               and Time Warp for Dynamic Load Balancing on Multiprocessor
                                                                               Architectures.  for more information about him and his country.
                                                                               Matching and Display of 3-D Brain Images Warp Mechanism Registration,
                                                                               Segmentation, Matching and Display of 3-D Brain Images -!> Dynamic
                                                                               Load Balancing using Time Warp Mechanism -!> [Georgia Tech] medical
                                                                               informatics/images/welcome icon.gif"ALIGN=Middle ALT="Georgia Tech">
                                                                               [Georgia Tech] [Georgia Tech] images/jup.gif" Collision of comet
                                                                               Shoemaker-Levy 9 >      Graphics, Visualization & Usability Center
                                                                               College of Computing      Georgia Institute of Technology
                                                                               Atlanta, GA 30332-0280      vgupta@cc.gatech.edu Other information
                                                                               available via finger. 

Les tags possibles sont les suivants:

   demo=> select tag from webkbdemo where tag is not null group by tag;
      tag   
   ---------
    student
    project
    faculty
    course

2) Création de la vue prédictive

 demo=> insert into pdb.predictiveview VALUES ('webkbdemo','public','webkbdemo','url','tagf') ;

Et voilà!: à partir des annotations manuelles, nous venons de générer un classifieur automatique de texte.

Comme d’habitude, on peut consulter les prédictions à l’aide d’un select:

  demo=> select * from pdb.webkbdemo where url='http://www.cs.utexas.edu/users/jyluo';
                    url                  |   tag   |    error_risk     
   --------------------------------------+---------+-------------------
    http://www.cs.utexas.edu/users/jyluo | student | 20.17208413001912

En savoir plus: le typage dans PredictiveDB

PredictiveDB utilise un système de typage fort, c’est à dire que son comportement est fortement lié aux types définis par l’utilisateur. Il est important de prendre cela en compte pendant la conception du schéma de votre base.

Par exemple, en ce qui concerne les champs présents dans la table webkbdemo:

  demo=> \d webkbdemo
           Table « public.webkbdemo »
   Colonne |       Type        | Modificateurs 
  ---------+-------------------+---------------
   url     | character varying | not null
   tag     | character varying | 
   webpage | text              | 

On distingue trois champs avec deux types distincts:

  • champs url et tag:

Le type de ces champs est VARCHAR. PredictiveDB traite ces données comme des attributs catégoriels ou nominaux. Cela signifie que, pour PredictiveDB, tag, par exemple, prendra un nombre fini de valeurs, en l’occurence:

   demo=> select tag from webkbdemo where tag is not null group by tag;
      tag   
   ---------
    student
    project
    faculty
    course

Autrement dit, PredictiveDB ne fera aucune différence entre les lignes suivantes:

                                       url                                     |   tag   
   ----------------------------------------------------------------------------+---------
    http://www.das.harvard.edu/users/students/Adam_Deaton                      | student
    http://WWW.CS.Yale.EDU/homes/xiao                                          | student
    http://www.cs.rpi.edu/~eganma/index.html                                   | student
    http://cs.nyu.edu/cs/dept_info/course_home_pages/spr96/G22.2274/index.html | course

et celles-ci:

 url  |  tag     
------+--------
 url1 |  tag1
 url2 |  tag1
 url3 |  tag1
 url4 |  tag2

Les URLs http://www.das.harvard.edu/users/students/Adam_Deaton et http://WWW.CS.Yale.EDU/homes/xiao sont considérées comme de simples labels, dont la formulation exacte n’a aucune importance. Seul compte le fait qu’on puisse les distinguer. Ainsi, des champs de type a/b/c, bleu/vert/rouge, etc, devraient tous être déclarés comme VARCHAR.

  • champ webpage:

Le champ message est de type TEXT. PredictiveDB ne traite pas les champs TEXT comme le type VARCHAR: à l’opposé du type précédent, ici, le contenu des champs importe.

Par exemple, pour les deux pages suivantes:

 student | This homepage undergoing vast renovations! Stay tuned...
           This will never be done, to be totally honest...
           ===============================================================================
           Here s some graphic. Actually, this is me, only it s not terribly
           recognizable...
           ===============================================================================
           ***** Stuff that s interesting ***** Well, I ve been until last year a
           Computer Science student back at UB (the University at Buffalo). Now I m
           doing graduate work at William and Mary in Williamsburg, Virginia[...]

et

 course | ****** CIS 730 ****** Principles of Artificial Intelligence *****
          The following information is available *****     * Course Description
          * Course Material     * Assignments     * Exams     * Final Marks 

PredictiveDB tiendra compte de la présence du mot “graduate” dans le premier document, mais pas dans le second; de même pour le mot “Marks”, spécifique au second message.

Conclusion

Un typage cohérent est donc le préalable à de bonnes prédictions: si webpage avait été déclarée en tant que VARCHAR dans les exemples précédents, ils auraient tout aussi bien pu comprendre

 student | webpage34

et

 course | webpage432

sans que cela ne fasse la moindre différence. Toute l’information textuelle (course, Assignments, Marks…) se trouverait ignorée.

Démonstration à “La Cantine”

L’équipe de PredictiveDB sera à La Cantine le 06 Decembre de 09h30 à 12h00.

N’hésitez pas à nous rejoindre.

Un avant-goût de notre démonstration sous la forme d’une courte vidéo:

Predicting with SQL what U.S. House of Representatives Congressmen will vote

version française

As described in our previous post, PredictiveDB allows you to predict the NULL values of a table’s column and associates an error risk to them.

Predicting is an easy task, which only requires basic SQL notions.

“Easy task” means that you only need to know two instructions:

  • select
  • insert

…and nothing else.

Predictive view: select

To output predictions from a table, one needs to run a select from a special view, the predictive view

In the previous example:

select * from iris where id = 150;

queried the table itself, whereas:

select * from pdb.iris where id = 150;

queried the predictive view. In the latter, NULL values are substituted by predictions.

How to create a predictive view: insert

To create a predictive view from a table, one must fill the pdb.predictiveview table, which indexes all predictive views:

 \d pdb.predictiveview
       Column       |       Type        
 -------------------+-------------------
  view_name         | character varying 
  schema_name       | character varying 
  table_name        | character varying 
  key_column        | character varying 
  predictive_column | character varying 

An insert in this table automatically creates the matching predictive view.

An example:

Take a database containing some votes from U.S. House of Representatives Congressmen. The data set includes 16 key votes identified by the Congressional Quarterly Almanac, in 1984:

 \d vote
                 Colonne                 |       Type        
 ----------------------------------------+-------------------
  id                                     | integer           
  handicapped_infants                    | character varying 
  water_project_cost_sharing             | character varying 
  adoption_of_the_budget_resolution      | character varying 
  physician_fee_freeze                   | character varying 
  el_salvador_aid                        | character varying 
  religious_groups_in_schools            | character varying 
  anti_satellite_test_ban                | character varying 
  aid_to_nicaraguan_contras              | character varying 
  mx_missile                             | character varying 
  immigration                            | character varying 
  synfuels_corporation_cutback           | character varying 
  education_spending                     | character varying 
  superfund_right_to_sue                 | character varying 
  crime                                  | character varying 
  duty_free_exports                      | character varying 
  export_administration_act_south_africa | character varying 
  party                                  | character varying 

For instance, the congressman number 1, a republican, made the following votes:

select * from vote where id=1;

1 | n | y | n | y | y | y | n | n | n | y |  | y | y | y | n | y | republican

(Perhaps you have noticed that some votes are blank (no “y” nor “n”): this means a lack of vote — not exactly, in fact, see the source for more details. Missing values are not a problem for PredictiveDB, which handles them very well, contrary to some other approaches (neural networks, SVM).

For some congressmen, votes are known, but not their party. See for example congressman number 435:

 demo=> select id,party from vote offset 430;
  id  |   party    
 -----+------------
  431 | republican
  432 | democrat
  433 | republican
  434 | republican
  435 | 

PredictiveDB, based on votes from this member and from his peers, will allow you to predict which party he belongs to.

1) We create a predictive view thanks to an insert into the pdb.predictiveview index.

 insert into pdb.predictiveview values('vote','public','vote','id','party');

filling:

  • view_name: the predictive wiew’s name (“vote”)
  • schema_name: the target table’s schema name (“public”)
  • table_name: the name of the target table (“vote”)
  • key_column: key of the target table (“id”)
  • predictive_column: the column to predict (“party”)

We can check the declaration of the new predictive view:

       demo=>  select * from pdb.predictiveview;
       view_name | schema_name | table_name | key_column | predictive_column 
      -----------+-------------+------------+------------+-------------------
       vote      | public      | vote       | id         | party

2) As the predictive view has been created, we can now query it:

  demo=>  select * from pdb.vote offset 430;
   id  |   party    |    error_risk     
  -----+------------+-------------------
   431 | republican |                  
   432 | democrat   |                  
   433 | republican |                  
   434 | republican |                  
   435 | republican | 7.339449541284404

For PredictiveDB, congressman number 435 is a republican.

PredictiveDB has the advantage to always associate a performance measure to its predictions: here, it means that this is a fairly reliable prediction, as one has about 93% of chance of being right or about 7% of being wrong, the error risk.

(This measure is based on the supervised learning theory, we will develop these notions in a future post.)

Another example

In the above example, we noticed that one vote of congressman number 1 was missing:

select * from vote where id=1;
1 | n | y | n | y | y | y | n | n | n | y |    | y | y | y | n | y | republican
                                            ^ NULL

The missing vote is synfuels_corporation_cutback:

demo=> select synfuels_corporation_cutback from vote where id=1 ;
 synfuels_corporation_cutback 
------------------------------
         NULL

We can use PredictiveDB to speculate on the putative decision of congressman number 1, if he would have voted:

1) Create the predictive view:

 insert into pdb.predictiveview values('votescc','public','vote','id','synfuels_corporation_cutback');

2) Query predictions:

demo=> select * from pdb.votescc where id=1 ;
 id | synfuels_corporation_cutback |    error_risk     
----+------------------------------+-------------------
  1 | n                            | 30.76923076923077

For PredictiveDB, congressman number 1 would have voted “no”.

Note that this time, the result is not so certain, as about 30 percent of the predictions may be wrong. In other words, it’s a useful clue for a bet (2/3 chance of winning), but not so much to make a decision for him.

Better data is required to get better results. For instance, new examples of votes from other members may improve PredictiveDB’s performance.

Prédire le vote d’un membre du congrès en SQL

english version

Comme décrit dans l’entrée précédente, PredictiveDB permet de compléter les valeurs NULL d’une table en associant un risque d’erreur à ses prédictions.

On effectue ces prédictions très simplement, en SQL.

“Très simplement” signifie qu’il suffit de connaitre deux instructions:

  • select
  • insert

…et rien de plus.

Vue prédictive: select

Afin de consulter les prédictions d’une table, il convient d’effectuer un select sur une vue particulière, nommée vue prédictive.

Dans l’exemple précédent:

select * from iris where id = 150;

interrogeait la table elle-même, tandis que:

select * from pdb.iris where id = 150;

interrogeait la vue prédictive. Dans cette dernière, les valeurs NULL sont remplacées par des prédictions.

Comment créer une vue prédictive?: insert

Pour créer la vue prédictive d’une table, il faut renseigner la table pdb.predictiveview qui les indexe:

 \d pdb.predictiveview
       Column       |       Type        
 -------------------+-------------------
  view_name         | character varying 
  schema_name       | character varying 
  table_name        | character varying 
  key_column        | character varying 
  predictive_column | character varying 

On la complète à l’aide d’un insert, et la vue prédictive correspondante se trouvera automatiquement créée.

Un exemple

Nous disposons d’une base de données recensant certains votes des membres du congrès aux États-Unis. Il s’agit d’une table recensant 16 votes clefs identifiés par le Congressional Quarterly Almanac en 1984:

  \d vote
                 Colonne                 |       Type        
 ----------------------------------------+-------------------
  id                                     | integer           
  handicapped_infants                    | character varying 
  water_project_cost_sharing             | character varying 
  adoption_of_the_budget_resolution      | character varying 
  physician_fee_freeze                   | character varying 
  el_salvador_aid                        | character varying 
  religious_groups_in_schools            | character varying 
  anti_satellite_test_ban                | character varying 
  aid_to_nicaraguan_contras              | character varying 
  mx_missile                             | character varying 
  immigration                            | character varying 
  synfuels_corporation_cutback           | character varying 
  education_spending                     | character varying 
  superfund_right_to_sue                 | character varying 
  crime                                  | character varying 
  duty_free_exports                      | character varying 
  export_administration_act_south_africa | character varying 
  party                                  | character varying 

Par exemple, le membre du congrès numéro 1, un républicain, a effectué les votes suivants:

select * from vote where id=1;

1 | n | y | n | y | y | y | n | n | n | y |  | y | y | y | n | y | republican

(Remarque: vous avez peut-être noté qu’un vote n’a pas été renseigné (pas de “y” ni de “n”): cela traduit une absence de vote — enfin, pas tout à fait, voir la source pour plus de détails. Les valeurs manquantes ne posent aucun problème à PredictiveDB qui s’en accommode sans peine, contrairement à certaines approches (réseaux de neurones, SVM).

Or, concernant certains membres, nous connaissons leurs votes, mais pas leur parti (ex. membre n°435):

 demo=> select id,party from vote offset 430;
  id  |   party    
 -----+------------
  431 | republican
  432 | democrat
  433 | republican
  434 | republican
  435 | 

PredictiveDB, en se basant sur les votes de ce membre et sur ceux de ses pairs, va permettre de prédire son étiquette politique.

1) On crée une vue prédictive à l’aide d’un insert dans l’index des vues prédictives pdb.predictiveview:

 insert into pdb.predictiveview values('vote','public','vote','id','party');

On renseigne successivement:

  • view_name: le nom de la vue (“vote”)
  • schema_name: le nom du schéma Postgresql (“public”)
  • table_name: le nom de la table cible (“vote”)
  • key_column: la clef de la table cible (“id”)
  • predictive_column: la colonne à prédire (“party”)

Éventuellement, on peut vérifier la déclaration de la nouvelle vue prédictive:

       demo=>  select * from pdb.predictiveview;
       view_name | schema_name | table_name | key_column | predictive_column 
      -----------+-------------+------------+------------+-------------------
       vote      | public      | vote       | id         | party

2) Maintenant que la vue a été créée, on peut la consulter:

  demo=>  select * from pdb.vote offset 430;
   id  |   party    |    error_risk     
  -----+------------+-------------------
   431 | republican |                  
   432 | democrat   |                  
   433 | republican |                  
   434 | republican |                  
   435 | republican | 7.339449541284404

Selon PredictiveDB, le membre 435 est un républicain.

PredictiveDB offre l’avantage de fournir un indicateur de la qualité de ses prédictions: ici, il s’agit d’une prédiction plutôt fiable, puisqu’on a (environ) 93 chances sur 100 de ne pas se tromper.

(Cette mesure se base sur une méthodologie éprouvée développée dans le cadre de l’apprentissage supervisé, nous reviendrons sur ces notions lors d’une future entrée.)

Autre exemple

Un peu plus haut, nous remarquions qu’un vote du membre n°1 était non-renseigné (pas de “y” ni de “n”):

select * from vote where id=1;
1 | n | y | n | y | y | y | n | n | n | y |    | y | y | y | n | y | republican
                                            ^ NULL

Il s’agit du vote synfuels_corporation_cutback:

demo=> select synfuels_corporation_cutback from vote where id=1 ;
 synfuels_corporation_cutback 
------------------------------
         NULL

Or, on peut exploiter PredictiveDB pour spéculer sur la décision du membre n°1 s’il avait pris part au vote:

1) Création de la vue prédictive:

 insert into pdb.predictiveview values('votescc','public','vote','id','synfuels_corporation_cutback');

2) Consultation des prédictions:

demo=> select * from pdb.votescc where id=1 ;
 id | synfuels_corporation_cutback |    error_risk     
----+------------------------------+-------------------
  1 | n                            | 30.76923076923077

À en croire PrédictiveDB, le membre n°1 aurait voté “non”.

À noter que, cette fois, une relative incertitude plane, puisque le SGBD prédictif évalue à 30 chances sur 100 le risque de fausses prédictions. Autrement dit, c’est une indication suffisante pour un pari (2 chances sur 3 d’avoir raison), mais pas suffisamment performante pour voter à la place de l’intéressé.

Pour que PredictiveDB puisse fournir une prédiction plus fine, des données de meilleures qualités seraient nécessaires. Par exemple, de nouveaux exemples de votes d’autres membres concourraient peut-être à améliorer la qualité de ses analyses.

What is a *predictive* database?

version française

PredictiveDB is a predictive RDBMS, which means that it can predict (classify) from SQL data. What is the meaning of this?

In two words:

From a table, PredictiveDB allows you to predict the NULL values of a column.

With an example:

We will illustrate the idea behind PredictiveDB with a classical dataset: the Fisher’s Iris dataset.

A floral database contains the width and the length of petals and sepals of three iris species:

  • Iris setosa
  • Iris virginica
  • Iris versicolor

demo=> select * from iris LIMIT 25 ;
 id  | sepallength | sepalwidth | petallength | petalwidth |     species     
-----+-------------+------------+-------------+------------+-----------------
  21 |         5.4 |        3.4 |         1.7 |        0.2 | Iris-setosa
 109 |         6.7 |        2.5 |         5.8 |        1.8 | Iris-virginica
  69 |         6.2 |        2.2 |         4.5 |        1.5 | Iris-versicolor
 107 |         4.9 |        2.5 |         4.5 |        1.7 | Iris-virginica
   5 |         5.0 |        3.6 |         1.4 |        0.2 | Iris-setosa
  57 |         6.3 |        3.3 |         4.7 |        1.6 | Iris-versicolor
  30 |         4.7 |        3.2 |         1.6 |        0.2 | Iris-setosa
   9 |         4.4 |        2.9 |         1.4 |        0.2 | Iris-setosa
  41 |         5.0 |        3.5 |         1.3 |        0.3 | Iris-setosa
  28 |         5.2 |        3.5 |         1.5 |        0.2 | Iris-setosa
  66 |         6.7 |        3.1 |         4.4 |        1.4 | Iris-versicolor
  72 |         6.1 |        2.8 |         4.0 |        1.3 | Iris-versicolor
  91 |         5.5 |        2.6 |         4.4 |        1.2 | Iris-versicolor
  85 |         5.4 |        3.0 |         4.5 |        1.5 | Iris-versicolor
  13 |         4.8 |        3.0 |         1.4 |        0.1 | Iris-setosa
  10 |         4.9 |        3.1 |         1.5 |        0.1 | Iris-setosa
  51 |         7.0 |        3.2 |         4.7 |        1.4 | Iris-versicolor
 114 |         5.7 |        2.5 |         5.0 |        2.0 | Iris-virginica
  24 |         5.1 |        3.3 |         1.7 |        0.5 | Iris-setosa
 150 |         5.9 |        3.0 |         5.1 |        1.8 | 
  88 |         6.3 |        2.3 |         4.4 |        1.3 | Iris-versicolor
 139 |         6.0 |        3.0 |         4.8 |        1.8 | Iris-virginica
 148 |         6.5 |        3.0 |         5.2 |        2.0 | Iris-virginica
 127 |         6.2 |        2.8 |         4.8 |        1.8 | Iris-virginica
 102 |         5.8 |        2.7 |         5.1 |        1.9 | Iris-virginica

For instance, iris number 21, an Iris setosa, shows a sepal lenght of 5.4cm and a sepal width of 3.4cm, and so on. Note that the “species” column has not be filled for the flower number 150 (i.e. it contains a NULL value).


demo=> select * from iris where id = 150;
 id  | sepallength | sepalwidth | petallength | petalwidth | species 
-----+-------------+------------+-------------+------------+---------
 150 |         5.9 |        3.0 |         5.1 |        1.8 | 
(1 ligne)

In other words, an amateur botanist has been able to measure the petals and sepals of the flower, but could not positively identify the species. Distraught, he then uses PredictiveDB which, based on data from Fisher (no magic there), will be able to predict the species of the mysterious plant.


demo=> select * from prediris where id = 150;
 id  |    species     |     error_risk     
-----+----------------+--------------------
 150 | Iris-virginica | 5.2631578947368425
(1 ligne)

In this case, PredictiveDB considers that it is a prominent representative of the species Iris virginica:

iris

PredictiveDB associates an error risk to its predictions - a relatively low risk in this case (about 5 errors in 100 predictions).

Of course, this is an example; besides iris identification, the predicted column can be about a customer, a product, a molecule, a meta-data, a tweet, etc.