Source Scoring Model

Updated 6 months ago by Shimon Modi

This article covers the details of how intelligence source scores are computed. The intelligence source score is a quantification of the total value returned by an intelligence source to a private enclave. We take into consideration all of the historical data within each private enclave to calculate the score. Whenever new data is submitted the score is recomputed. Now let’s jump into the details of the score.

The score is computed based on three different IOC scores:

  1. IP Overall Score.
  2. URL Overall Score.
  3. Hashes Overall Score.

Each of these individual scores can range from 0 to 100, and the overall intelligence source score is computed by averaging them.

Each IOC score is composed of two components, which are averaged:

  1. Uniqueness Score.
  2. Timeliness Score.

Let’s start with how the uniqueness score is calculated, using the IP overall score as the example. Suppose a private enclave contains 100 IPs, and 7 of them correlate with one or more of the intelligence sources A, B, and C. To compute the uniqueness score of source A, we count:

  1. The IPs found only in source A. Let’s say there are 2.
  2. The IPs found in sources A and B. Let’s say there are also 2.
  3. The IPs found in sources A, B, and C. Let’s say there are 3.

The raw uniqueness count is then 2 + 2/2 + 3/3 = 4. In other words, you can think of the uniqueness score as a weighted sum over the indicators that correlate with a source, where each indicator is weighted by:

1/(# of intelligence sources containing the indicator) 
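As a rough illustration, the weighted sum above can be sketched in Python. The function name `raw_uniqueness_count`, the dictionary layout (each indicator mapped to the set of sources that contain it), and the IP addresses are assumptions made for this example; they are not taken from the TruSTAR platform.

```python
from typing import Dict, Set


def raw_uniqueness_count(correlations: Dict[str, Set[str]], source: str) -> float:
    """Sum 1 / (# of sources containing the indicator) over every
    indicator that correlates with `source`."""
    total = 0.0
    for sources in correlations.values():
        if source in sources:
            total += 1.0 / len(sources)
    return total


# Hypothetical data mirroring the example:
# 2 IPs only in A, 2 IPs in A and B, 3 IPs in A, B, and C.
correlations = {
    "10.0.0.1": {"A"},
    "10.0.0.2": {"A"},
    "10.0.0.3": {"A", "B"},
    "10.0.0.4": {"A", "B"},
    "10.0.0.5": {"A", "B", "C"},
    "10.0.0.6": {"A", "B", "C"},
    "10.0.0.7": {"A", "B", "C"},
}

print(raw_uniqueness_count(correlations, "A"))
```

For source A this evaluates to 4, up to floating-point rounding, which matches the hand calculation of 2 + 2/2 + 3/3.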

As for the timeliness score, it takes into consideration the time difference between the updated time of the private enclave report and the source report submission time. We assume that the report submission time corresponds to the incident time. If many intelligence sources and private enclave reports contain the same indicator, we pick the first private enclave report submission and find the source report that provides the most recent enrichment. Again, you can think of timeliness as a weighted sum of correlations with a specific intelligence source, using the following weights:

1/(# days difference between the private enclave report and source report)
          

As you can see from these weights, the contribution of each correlation is inversely proportional to the time difference in days. This prioritizes enrichment that is provided in a timely manner.
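The timeliness weighting can be sketched in a similar way. This is only one possible reading of the description above: the function name `raw_timeliness_count`, the data layout, and the example dates are hypothetical. The article does not say how a difference of zero days is handled, so the `min_days` floor below is an assumption added purely to avoid division by zero.

```python
from datetime import date
from typing import Dict, List, Tuple


def raw_timeliness_count(
    correlations: Dict[str, Tuple[List[date], List[date]]],
    min_days: int = 1,
) -> float:
    """Sum 1 / (# days difference) over correlated indicators.

    Each value holds (private enclave report dates, source report dates)
    for one indicator. `min_days` is an assumption of this sketch, not
    something stated in the article.
    """
    total = 0.0
    for enclave_dates, source_dates in correlations.values():
        first_enclave = min(enclave_dates)  # first private enclave report
        latest_source = max(source_dates)   # most recent source report
        days = max(abs((first_enclave - latest_source).days), min_days)
        total += 1.0 / days
    return total


# Hypothetical indicators: one enriched 1 day apart, one 4 days apart.
example = {
    "10.0.0.1": ([date(2019, 3, 10)], [date(2019, 3, 9)]),
    "10.0.0.2": ([date(2019, 3, 10), date(2019, 3, 20)], [date(2019, 3, 6)]),
}

print(raw_timeliness_count(example))
```

Here the first indicator contributes 1/1 and the second 1/4, so enrichment that arrives closer to the enclave report counts for more.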

Once the timeliness and uniqueness counts are obtained, they are normalized with respect to the total number of extracted IOCs of the given type. For example, if the raw timeliness count and raw uniqueness count were 6 and 4 for IPs, and the total number of extracted IPs was 100, the resulting raw_timeliness_score and raw_uniqueness_score are 6/100 and 4/100, respectively.

To scale these scores to the 0-100 range, we studied all of the pairwise raw timeliness and uniqueness scores between all private enclaves and intelligence sources on the TruSTAR platform. Most scores were skewed towards small values and clustered in a tight range between 0 and 0.35. To make the data easier to interpret, we apply a logarithmic (base e) transformation. The transformed value is then rescaled to the 0-100 range using a scaling window, as follows:

100 × (log(raw_count/(# of IOCs of that type in the private enclave)) - window_start)/(window_end - window_start)

As mentioned earlier, window_start and window_end are obtained from all the data on the platform. In our current example, if window_start = -14 and window_end = -1, the computed timeliness and uniqueness scores would be 86 and 83, for an average IP score of 84.5. The same computation is performed for URLs and hashes. If we have a URL score of 50.5 and a hash score of 10, the final intelligence source score for source A would be (84.5 + 50.5 + 10)/3, or approximately 48.3.
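The arithmetic in this example can be checked with a short Python sketch. The function name `scaled_score` is hypothetical. The sketch takes window_start = -14 and window_end = -1, since that assignment is the one that reproduces the 86 and 83 quoted above, and it assumes the result is multiplied by 100 and rounded to the nearest integer.

```python
import math


def scaled_score(raw_count: float, total_iocs: int,
                 window_start: float, window_end: float) -> float:
    """Log-transform the normalized count and rescale it to 0-100."""
    normalized = raw_count / total_iocs
    position = (math.log(normalized) - window_start) / (window_end - window_start)
    return 100.0 * position


# Counts from the example: 6 (timeliness) and 4 (uniqueness) out of 100 IPs.
timeliness = scaled_score(6, 100, window_start=-14, window_end=-1)
uniqueness = scaled_score(4, 100, window_start=-14, window_end=-1)

# Rounding to whole numbers is an assumption made to match the quoted values.
ip_score = (round(timeliness) + round(uniqueness)) / 2

# URL and hash scores are the ones given in the example.
source_score = (ip_score + 50.5 + 10) / 3

print(round(timeliness), round(uniqueness), ip_score, round(source_score, 1))
```

With these inputs the timeliness and uniqueness scores round to 86 and 83, the IP score is 84.5, and the average of 84.5, 50.5, and 10 comes to about 48.3. A raw count of zero would make the logarithm undefined; the article does not describe how that case is handled, so the sketch leaves it out.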

With the above scoring methodology, you can end up with scores above 100. Based on our analysis, we have found that log-transformed scores above the window_end value (i.e. a final score above 100) could be indicative of duplicate data. In the future, we will issue a warning when this case arises.

We will update this guide as we refine the model and add more IOCs to the overall calculation. If you have more questions please email us at support@trustar.co.

