Target Leakage

What Is Target Leakage?

Note: Before reading this entry, familiarize yourself with those on targets; data collection; and training, validation, and holdout.

Target leakage, sometimes called data leakage, is one of the most difficult problems to avoid when developing a machine learning model. It happens when you train your algorithm on a dataset that includes information that will not be available at prediction time, when you apply the model to data you collect in the future. Because the model effectively already knows the actual outcomes, its results on the training data will be unrealistically accurate, like bringing an answer sheet into an exam.

“Any other feature whose value would not actually be available at the time you use the model to make a prediction is a feature that can introduce leakage into your model.” – Data Skeptic
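To make the definition concrete, here is a minimal sketch in Python using scikit-learn and a synthetic dataset; the feature names (such as took_heart_medication) and the thresholds are illustrative assumptions, not drawn from any real data. A single leaked feature is enough to push cross-validated accuracy close to 100%.

```python
# Minimal sketch of target leakage on synthetic data (illustrative names only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000

# Two legitimate features that are only weakly predictive of the target.
age = rng.normal(55, 10, n)
cholesterol = rng.normal(200, 30, n)
y = (0.02 * age + 0.01 * cholesterol + rng.normal(0, 1, n) > 3.2).astype(int)

# A leaky feature: "took_heart_medication" is recorded after diagnosis,
# so it is effectively a proxy for the target itself.
took_heart_medication = (y == 1) & (rng.random(n) > 0.05)

X_clean = np.column_stack([age, cholesterol])
X_leaky = np.column_stack([age, cholesterol, took_heart_medication])

model = RandomForestClassifier(n_estimators=100, random_state=0)
print("Accuracy without leakage:", cross_val_score(model, X_clean, y, cv=5).mean())
print("Accuracy with leakage:   ", cross_val_score(model, X_leaky, y, cv=5).mean())
# The leaky model scores near-perfect accuracy in validation but would fail
# in production, where medication status is unknown at prediction time.
```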

To avoid target leakage, omit data that would not be known at the time the target outcome is observed. The following timeline shows how to avoid target leakage when predicting the outcome of a medical visit, such as whether or not a patient will be diagnosed with heart disease (marked as “target observed”). When constructing your training dataset, include data that occurs on the timeline before the “target observed” point, such as office visits, lab procedures, and diagnostic tests. Do not include data from tests or office visits that occurred after the initial heart disease diagnosis: those data were collected with the diagnosis already known, and the diagnosis is precisely what you will not know when you apply the model to future data to make a prediction.

[Figure: target leakage timeline]
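As a rough illustration of that time-based filtering, the following pandas sketch keeps only the records that occurred before each patient’s “target observed” date; the events and outcomes tables and their column names are hypothetical.

```python
# Minimal sketch: drop records collected after the target was observed,
# using hypothetical `events` and `outcomes` tables.
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-03-20", "2023-01-15", "2023-04-01"]),
    "event_type": ["office_visit", "lab", "diagnostic_test", "office_visit", "lab"],
})
outcomes = pd.DataFrame({
    "patient_id": [1, 2],
    "target_observed_date": pd.to_datetime(["2023-03-01", "2023-05-01"]),
    "heart_disease": [1, 0],
})

# Join each event to the date its patient's target was observed, then keep
# only events that happened strictly before that date.
merged = events.merge(outcomes, on="patient_id")
training_events = merged[merged["event_date"] < merged["target_observed_date"]]
print(training_events[["patient_id", "event_type", "event_date"]])
# Patient 1's diagnostic test on 2023-03-20 is dropped because it occurred
# after the heart disease diagnosis and would leak the outcome.
```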

Why Is Target Leakage Important?

Target leakage is a consistent and pervasive problem in machine learning and data science. It makes a model appear far more accurate during training and validation than it actually is, understating its true generalization error and rendering it useless for any real-world application.

The prevalence of target leakage underscores how essential deep domain knowledge is for machine learning and artificial intelligence (AI) initiatives. Business analytics professionals and others who understand how the machine learning models will be applied in practice must be involved in every aspect of a data science project, from problem specification to data collection to deployment, in order to develop models that deliver real value to the business.

Target leakage is particularly insidious because it can be either intentional or unintentional, which makes it difficult to identify. For example, Kaggle contestants have deliberately exploited sampling errors that produce target leakage in order to build highly accurate models and gain a competitive edge in data science competitions.

Target Leakage + DataRobot

Identifying and correcting target leakage requires deep domain expertise and an understanding of the business context in which you intend to apply the predictive model. There is no way to identify target leakage with 100% accuracy, so you need to understand your data thoroughly, analyze your model’s output critically, and investigate further whenever anything looks suspicious.

DataRobot includes several features that help you determine whether target leakage may be present:

  • Accuracy on the Leaderboard. If a model shows a perfect or near-perfect accuracy score, that is a red flag that calls for further investigation.
  • Feature Impact. DataRobot automatically calculates the impact each variable has on each model’s results. Features that are likely to contain target leakage score suspiciously high (a generic sketch of flagging such features follows this list).
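DataRobot’s own implementation is not shown here; the sketch below only illustrates the general idea behind such checks, using scikit-learn: a feature that can predict the target almost perfectly on its own deserves scrutiny as a possible source of leakage. The flag_suspicious_features helper and its 0.95 threshold are illustrative assumptions.

```python
# Generic sketch (not DataRobot's implementation): flag features that can
# predict the target almost perfectly on their own, a common sign of leakage.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def flag_suspicious_features(X, y, feature_names, threshold=0.95):
    """Return (name, score) pairs for features whose single-feature
    cross-validated accuracy meets `threshold` (an illustrative cutoff)."""
    suspicious = []
    for i, name in enumerate(feature_names):
        score = cross_val_score(
            DecisionTreeClassifier(max_depth=3, random_state=0),
            X[:, [i]], y, cv=5,
        ).mean()
        if score >= threshold:
            suspicious.append((name, score))
    return suspicious

# Quick demonstration on synthetic data: "billing_code" stands in for a
# leaked, post-outcome feature and gets flagged; "age" does not.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
X = np.column_stack([rng.normal(55, 10, 1000), y * 9 + rng.integers(0, 2, 1000)])
print(flag_suspicious_features(X, y, ["age", "billing_code"]))
```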
