PICARIELLO DATA CHALLENGE 2022

Welcome to the official page of the Picariello Data Challenge 2022 organized by the University of Naples "Federico II" and sponsored by Eustema s.p.a.
The Challenge will be open to groups of students coming from all departments of Federico II.
Contacts:
Dr. Donato Cappetta (EUSTEMA)
Prof. Giuseppe Longo (Dip. di Fisica "Ettore Pancini")
Prof. Carlo Sansone (Dip. di Ing. Elettrica e delle Tecnologie dell'Informazione)


The Challenge

The objective of the challenge is to simulate an industrial Data Science project of the kind you would encounter working as a data scientist in a company or in an interdisciplinary research group. The challenge is about predicting the outcome, the duration in number of days and the settlement for around 300,000 lawsuits handled by several tribunals of the "Giudice di Pace" scattered around Italy.
To register for the challenge and to see the timeline, take a look at Registration and Timeline.
To know more about the data, take a look at Data.
To know more about the scoring, take a look at Scoring.
To know how to submit your results, take a look at Submission.

At the beginning of the challenge, each group will be provided with training sets containing all the input data fields and the values of the target variables, and with blind test sets for which you will have to make predictions. Given that we are simulating an industrial data science project, we are interested not only in the performance of your algorithms in predicting the target variables, but also in the completeness and correctness of your Exploratory Data Analysis (EDA) and in the quality of both your code and your report, which should explain your strategy for solving the problems and the details of the algorithms and methods used.

Prizes

  1. The winning group will receive one iPad Pro for each group member
  2. The second group will receive one 500€ Amazon gift card for each group member

Members of both groups will have the possibility of doing an internship/thesis at Eustema (hiring path)

Registration

In order to register for the challenge, you need to form a group of 3 - 4 students (the composition of the groups cannot change during the challenge) and send an email to infodatascience@unina.it containing the following information:

  • Name of your group
  • Names of the participants
  • Department and Degree

You will receive a reply containing the code to join the Challenge Team on Microsoft Teams. The Team is where the challenge will be presented, where announcements will be made, where you will find the data (when the challenge begins) and where you will have to submit your deliverables.

Timeline

  • 24/06 - 11.00: the Challenge is presented in hybrid form, both in the Caianello Conference Room in the Department of Physics and on Teams. Registrations will stay open until 15/10. Participating groups will be visible on Microsoft Teams
  • 15/10 - 9.00: the Challenge begins, data is made available on Microsoft Teams
  • 17/01 - 24.00: the Challenge closes. To have a valid submission you must upload your deliverables to your Group Folder in Microsoft Teams before this deadline
  • 01/02 - 9.00: the set of finalist groups will be selected and each group will be asked to present their work live in front of the judging commission. The quality of your presentation will be evaluated and will count toward your group's total score
  • TBA: final ceremony: a maximum of five groups will be asked to present their work, the winning groups will be announced and the awards will be given.

The Data

The data will be provided on Microsoft Teams at the beginning of the challenge and will consist of 4 .csv files:

  • Training Set for Outcome and Duration prediction: containing all the data fields and the 2 target variables
  • Blind Test Set for Outcome and Duration prediction: containing all the data fields but no target variables
  • Training Set for Settlement prediction: containing all the data fields and the target variable
  • Blind Test Set for Settlement prediction: containing all the data fields but no target variable

The following data fields are present:

  • ID: a unique identifier for each row of the dataset
  • Case identifier: a unique identifier for the case
  • Judge Identifier: a unique ID given to each Judge
  • Object: a code which identifies the type of the lawsuit
  • Date: the date on which the Judge Identifier was assigned, and hence also the date when the case started
  • Section: the judge's tribunal section
  • Value: the monetary value of the lawsuit
  • Tax Related: indicates if the case is tax related
  • Unified Contribution: the monetary cost for starting the lawsuit
  • Primary Actor: the primary actor of the lawsuit, i.e. the accusing party
  • Secondary Actor: the secondary actor of the lawsuit, i.e. an additional accusing party
  • Primary Defendant: the primary defendant of the lawsuit
  • Secondary Defendant: the secondary defendant of the lawsuit
  • Number of Lawyers: the number of lawyers involved in the lawsuit
  • Number of Legal Parties: the number of legal parties involved in the lawsuit
  • Number of Persons: the number of persons involved in the lawsuit
  • City of the judge’s office: the city in which the judge's office is located

  • Outcome: the outcome of the lawsuit (the first target variable), which can be one of the following classes:
    • 0 Accepted
    • 1 Declined
    • 2 Partially Accepted
    • 3 Ceased Matter
  • Duration: the duration of the lawsuit in number of days (the second target variable)
  • Settlement: the final settlement of the lawsuit (the third target variable)
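
As a first step of the EDA, it may help to check how the four Outcome classes are distributed. The following is a minimal sketch using only the Python standard library. It assumes that the training file uses the header names ID, Outcome and Duration (the names required for the submission files) and it reads a few made-up rows from a string instead of the real file, whose name and exact layout are not specified here.

```python
import csv
import io
from collections import Counter

# Outcome codes as listed in the field description above.
OUTCOME_LABELS = {
    0: "Accepted",
    1: "Declined",
    2: "Partially Accepted",
    3: "Ceased Matter",
}

# Made-up toy rows standing in for the real training file. Only three
# columns are shown and the header names are an assumption.
TOY_TRAINING_CSV = """ID,Outcome,Duration
1,0,120
2,1,340
3,0,95
4,3,30
"""


def outcome_distribution(csv_file):
    """Count how many rows fall into each Outcome class."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[OUTCOME_LABELS[int(row["Outcome"])]] += 1
    return counts


distribution = outcome_distribution(io.StringIO(TOY_TRAINING_CSV))
for label, count in distribution.most_common():
    print(f"{label}: {count}")
```

With the real data, the same function could be called on an opened file object instead of the in-memory string.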


Submission of Results

Soon after your group registration, and before the challenge begins, a folder accessible only to the members of your group will be created in the Files folder of the Challenge Team. In order to complete the challenge your group will have to upload the following files before the Challenge closing deadline given in the Timeline (17/01 at 24.00):

  1. A file called prediction_outcome_duration.csv containing, for all instances in the Test set, your predictions for the outcome (int) and the duration (int). The file must have a header with 3 columns named respectively ID, Outcome and Duration. A file called prediction_settlement.csv containing, for all instances in the Test set, your predictions for the settlement (float32). The file must have a header with 2 columns named respectively ID and Settlement. You must provide predictions for all the Test instances, otherwise your score will be invalidated;
  2. A file called report.pdf outlining your EDA, your architectural choices and your approach to solving the three problems;
  3. All scripts used to solve the problems. Scripts can be written in Python or R; they must reproduce your results and must be self-contained. For Python, provide a requirement.txt to install all the necessary libraries with pip, while for R provide us with the full list of requirements and package versions. We advise you to follow an object-oriented programming paradigm and to take a look at Python etiquette and R etiquette to make sure that your code is correctly commented and formatted.
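
Since a missing prediction invalidates the score, it can be worth checking coverage before writing the two prediction files. The sketch below is one possible way to do this with the Python standard library. The helper name write_predictions, the IDs and the predicted values are all made up for illustration; only the file names and header columns come from the requirements above.

```python
import csv
import os
import tempfile


def write_predictions(path, header, rows, test_ids):
    """Write a prediction file after checking it covers every test ID once."""
    predicted_ids = [row[0] for row in rows]
    if len(predicted_ids) != len(set(predicted_ids)):
        raise ValueError("duplicate IDs in predictions")
    missing = set(test_ids) - set(predicted_ids)
    if missing:
        raise ValueError(f"{len(missing)} test instances have no prediction")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


# Made-up IDs and predictions; the two blind test sets are separate files,
# so each has its own list of IDs.
outcome_test_ids = [101, 102, 103]
outcome_duration = [(101, 0, 120), (102, 1, 340), (103, 3, 30)]
settlement_test_ids = [201, 202]
settlement = [(201, 1500.0), (202, 250.5)]

# A temporary directory is used here; in practice this would be your own
# output folder.
out_dir = tempfile.mkdtemp()
outcome_path = os.path.join(out_dir, "prediction_outcome_duration.csv")
settlement_path = os.path.join(out_dir, "prediction_settlement.csv")

write_predictions(
    outcome_path, ["ID", "Outcome", "Duration"], outcome_duration, outcome_test_ids
)
write_predictions(
    settlement_path, ["ID", "Settlement"], settlement, settlement_test_ids
)
```

The helper raises a ValueError instead of silently writing an incomplete file, so a gap in the predictions is noticed before upload.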

Scoring

The challenge will be evaluated by a committee chaired by Prof. Roberta Siciliano and including faculty members of UNINA and Data Scientists from EUSTEMA. The scoring of the challenge is based on a point system in which the minimum score is 0 and the maximum is 100. The total number of points is divided among the three tasks of the challenge plus the final presentation as follows:

  • up to 50 points are awarded on the basis of your metrics on the blind test set. For a detailed breakdown of the score, see the Metrics section below.
  • up to 15 points are awarded on the basis of the quality of your report
  • up to 15 points are awarded on the basis of the quality of your code
  • an additional 20 points will be awarded for the quality of the presentation (see below).

Metrics

The following metrics will be used to evaluate your predictions:

  • Outcome classification metric: micro-averaged F1 Score. 15 points will be awarded for an F1 Score of 1.
  • Duration regression metric: Mean Absolute Error (MAE). 15 points will be awarded for a MAE of 0.
  • Settlement regression metric: Mean Absolute Error (MAE). 20 points will be awarded for a MAE of 0.
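
To evaluate your models locally on a validation split, the two metrics can be computed as in the following pure-Python sketch. The function names and the toy values are illustrative, and the way the committee converts metric values into points is not described here, so only the metrics themselves are shown.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up validation values, for illustration only.
true_outcomes = [0, 1, 2, 3, 0]
pred_outcomes = [0, 1, 1, 3, 2]
true_durations = [100, 200, 50]
pred_durations = [110, 180, 50]

f1 = micro_f1(true_outcomes, pred_outcomes)
mae = mean_absolute_error(true_durations, pred_durations)
print(f"micro-F1: {f1:.2f}, MAE: {mae:.1f}")
```

When every lawsuit has exactly one true and one predicted class, as with the Outcome target, the micro-averaged F1 Score coincides with plain accuracy: in the toy example 3 of 5 predictions are correct.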

Finalists presentation scoring

The presentations of the finalist groups will be judged by the commission and up to 20 points will be awarded on the basis of the ability of the groups to convince the commission and sell their solution.

GOOD LUCK!