Entity Extraction FAQ
This article covers known issues with how TruSTAR handles entity extraction from data sources and how TruSTAR is working to fix those issues in future releases.
What is Entity Extraction?
The TruSTAR extraction process identifies and extracts Entities from data sources, such as emails, spreadsheets, reports, etc. This is a complex process, given that the data has many structures and formats. The extraction process is an ongoing project, with the development team working to identify and fix known issues.
Related Topic: Entities supported by TruSTAR
The lists below describe extraction issues that have been identified, grouped by entity type.
URL extraction issues
- Incomplete extraction; for example, yahoo.c instead of yahoo.com.
- URLs not correctly parsed when they contain parentheses.
- Domains, including fully qualified domain names, classified as URLs.
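The parentheses issue is easy to reproduce with a naive URL regex. The sketch below is purely illustrative (it is not TruSTAR's extraction code): a pattern that excludes parentheses truncates URLs such as Wikipedia article links, while a pattern that also accepts balanced parenthesized segments captures the full URL.

```python
import re

# Naive pattern: stops at the first parenthesis, truncating the URL.
NAIVE_URL = re.compile(r"https?://[^\s()]+")

# Tolerant pattern: also accepts balanced "(...)" segments, which are
# common in e.g. Wikipedia URLs.
TOLERANT_URL = re.compile(r"https?://(?:[^\s()]|\([^\s()]*\))+")

text = "See https://en.wikipedia.org/wiki/Stuxnet_(malware) for details."
print(NAIVE_URL.search(text).group())     # truncated before "(malware)"
print(TOLERANT_URL.search(text).group())  # full URL
```

Real extractors need further heuristics (for example, a trailing ")" that closes a parenthesis opened in the surrounding prose should not be kept), which is part of what makes URL extraction hard.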
Filename extraction issue
- Filenames not correctly parsed when they contain spaces.
Disambiguation between scripts and domains
- Domains incorrectly classified as Perl scripts; for example: myacmecompany.pl
- Domains incorrectly classified as Python scripts; for example: myacmecompany.com.py
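The ambiguity arises because some country-code TLDs (.pl for Poland, .py for Paraguay) are also common script file extensions. The sketch below is a hypothetical illustration, not TruSTAR's classifier; the TLD and extension sets are tiny samples for demonstration.

```python
# Illustrative sketch of the script-vs-domain ambiguity.
KNOWN_TLDS = {"com", "net", "org", "pl", "py", "io"}
SCRIPT_EXTENSIONS = {"pl", "py", "sh", "rb"}

def classify(token: str) -> str:
    """Classify a token by its final dot-separated label."""
    label = token.rsplit(".", 1)[-1].lower()
    if label in KNOWN_TLDS & SCRIPT_EXTENSIONS:
        # ".pl" is both Perl and Poland, ".py" both Python and Paraguay;
        # extra context (surrounding text, DNS lookups) is needed.
        return "ambiguous"
    if label in SCRIPT_EXTENSIONS:
        return "filename"
    if label in KNOWN_TLDS:
        return "domain"
    return "unknown"

print(classify("myacmecompany.pl"))   # ambiguous
print(classify("backup_job.sh"))      # filename
print(classify("myacmecompany.com"))  # domain
```

A lookup table alone cannot resolve the ambiguous cases, which is why tokens like myacmecompany.pl are sometimes misclassified.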
IP address extraction
Some users have reported that IP addresses are not being extracted. This is by design: TruSTAR validates IPv4 addresses and does not extract private or reserved addresses as IOCs. These include:
- loopback addresses (127.0.0.0/8)
- site-local (private) addresses (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
- the all-zeros address (0.0.0.0)
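A minimal sketch of this kind of filter, using Python's standard ipaddress module (this illustrates the ranges listed above, not TruSTAR's exact logic):

```python
import ipaddress

def is_extractable_ipv4(value: str) -> bool:
    """Return True if the string is a public IPv4 address worth
    extracting as an IOC; skip loopback, RFC 1918 private ranges,
    and the all-zeros address."""
    try:
        ip = ipaddress.IPv4Address(value)
    except ValueError:
        return False  # not a valid IPv4 address at all
    return not (ip.is_loopback or ip.is_private or ip.is_unspecified)

print(is_extractable_ipv4("8.8.8.8"))       # True  -> extracted
print(is_extractable_ipv4("192.168.1.10"))  # False -> skipped
print(is_extractable_ipv4("127.0.0.1"))     # False -> skipped
```

Filtering these ranges avoids polluting intelligence reports with addresses that are meaningless outside the submitter's own network.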
Improving Entity Extraction
TruSTAR has upgraded its platform by converting the Compute Engine to run on Apache Spark, which is engineered for high performance on large datasets. This more powerful Compute Engine enables TruSTAR to handle massive volumes of data extraction and normalization while carefully monitoring data quality.
TruSTAR is continuing to invest in improved data extractions. Key initiatives include:
- Allowing users to submit structured indicator objects.
- Upgrading the extraction engine by using Apache Spark to process and prioritize email submissions.
- Improving the URL data model. This includes splitting URLs into their components, capturing each component correctly, and separating the concept of a domain name from that of a URL.
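The kind of decomposition described in the last initiative can be sketched with Python's standard urllib.parse module (the URL here is an invented example; this illustrates the idea, not TruSTAR's data model):

```python
from urllib.parse import urlsplit

# Splitting a URL into components so the domain can be modeled as an
# entity distinct from the full URL.
parts = urlsplit("https://evil.example.com/payload?id=42")
print(parts.scheme)    # https
print(parts.hostname)  # evil.example.com
print(parts.path)      # /payload
print(parts.query)     # id=42
```

With the URL broken apart this way, the hostname can be extracted and correlated as a domain entity in its own right, rather than being conflated with (or misclassified as) the URL that contains it.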