Entity Extraction FAQ

Updated 1 month ago by Elvis Hovor

This article covers known issues with how TruSTAR handles entity extraction from data sources and how TruSTAR is working to fix those issues in future releases.

What is Entity Extraction?

The TruSTAR extraction process identifies and extracts Entities from data sources such as emails, spreadsheets, and reports. This is a complex process because the data arrives in many structures and formats. Extraction is an ongoing project, and the development team continues to identify and fix known issues.

Related Topic: Entities supported by TruSTAR

Known Issues

The data extraction issues identified so far are listed below, grouped by issue type.

Issue: URL extraction issues

  • Incomplete extraction; for example, yahoo.c instead of yahoo.com.
  • URL not correctly parsed when it contains parentheses.
  • URL not correctly parsed when it contains bracketed colons "[:]".
  • Domains, including fully qualified domain names, classified as URLs.

Issue: Filename extraction issue

  • Filename not correctly parsed when it contains spaces.

Issue: Disambiguation between scripts and domains

  • Domain incorrectly classified as a Perl script; for example, myacmecompany.pl.
  • Domain incorrectly classified as a Python script; for example, myacmecompany.com.py.
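Until the bracketed-colon issue listed above is fixed, one possible workaround is to restore such URLs to plain form before submitting them. The sketch below is a minimal illustration only: refang is a hypothetical helper, not part of any TruSTAR API, and handling "[.]" alongside "[:]" is an added assumption. It has not been verified against TruSTAR's extractor.

```python
import re

# Hypothetical helper (not a TruSTAR API): rewrite bracketed separators
# back to their plain form before a URL is submitted for extraction.
_BRACKETED = [
    (re.compile(r"\[:\]"), ":"),   # http[:]//host  -> http://host
    (re.compile(r"\[\.\]"), "."),  # example[.]com  -> example.com (assumed extra case)
]


def refang(url: str) -> str:
    """Return url with bracketed colons and dots replaced by plain ones."""
    result = url.strip()
    for pattern, replacement in _BRACKETED:
        result = pattern.sub(replacement, result)
    return result


if __name__ == "__main__":
    print(refang("http[:]//myacmecompany[.]com/login"))
```

URLs that contain no bracketed separators pass through unchanged, so the helper can be applied to every URL in a submission.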

Some users have reported that IP addresses are not being extracted. TruSTAR validates IPv4 addresses and does not extract an address as an IOC if it is a private or otherwise non-public address. These address types include:

  • loopback addresses (127.x.x.x)
  • site-local addresses (10/8 prefix, 172.16/12 prefix, 192.168/16 prefix)
  • addresses with a value of 0
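To check in advance whether an address would be skipped under the rules above, the ranges can be expressed with Python's standard ipaddress module. This is a sketch of the rule as described in this article, not TruSTAR's actual implementation. The name is_skipped_ipv4 is hypothetical, and treating "value of 0" as the address 0.0.0.0 is an assumption.

```python
import ipaddress

# Ranges listed in this article; the list itself is illustrative.
EXCLUDED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),     # loopback
    ipaddress.ip_network("10.0.0.0/8"),      # site-local, 10/8 prefix
    ipaddress.ip_network("172.16.0.0/12"),   # site-local, 172.16/12 prefix
    ipaddress.ip_network("192.168.0.0/16"),  # site-local, 192.168/16 prefix
]


def is_skipped_ipv4(text: str) -> bool:
    """Return True if text is an IPv4 address in one of the listed ranges."""
    try:
        ip = ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        return False  # not a valid IPv4 address, so the rule does not apply
    if int(ip) == 0:  # assumption: "value of 0" means 0.0.0.0
        return True
    return any(ip in network for network in EXCLUDED_NETWORKS)


if __name__ == "__main__":
    for candidate in ["127.0.0.1", "172.31.255.255", "172.32.0.1", "8.8.8.8"]:
        print(candidate, is_skipped_ipv4(candidate))
```

The networks are spelled out explicitly rather than relying on the module's is_private property, which covers additional reserved ranges beyond the ones named here.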

Improving Entity Extraction

TruSTAR has upgraded its platform by converting the Compute Engine to use Apache Spark, which is engineered for high performance on large datasets. The more powerful Compute Engine lets TruSTAR handle large volumes of data extraction and normalization while carefully monitoring data quality.

TruSTAR continues to invest in improving data extraction. Key initiatives include:

  • Allowing users to submit structured indicator objects.
  • Upgrading the extraction engine by using Apache Spark to process and prioritize email submissions.
  • Improving the URL data model. This includes splitting URLs into their components, capturing each component correctly, and separating domain names from URLs.
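As an illustration of what splitting a URL into components can look like, the sketch below uses urlsplit from Python's standard library. The field names in the returned dictionary are hypothetical and do not represent TruSTAR's data model.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into separate components (illustrative field names)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname,  # None when the input has no "//host" part
        "port": parts.port,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


if __name__ == "__main__":
    print(split_url("https://www.yahoo.com:8443/news/index.html?id=7#top"))
    print(split_url("yahoo.com"))
```

A bare domain such as yahoo.com has no scheme, so urlsplit reports no hostname and places the whole string in the path. This is one simple way to tell a domain apart from a full URL.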

