Observable Collection FAQ
The TruSTAR platform can identify and collect Observables from data sources such as emails, spreadsheets, reports, and other submissions. This is a complex process because data arrives in many formats and can be structured (such as JSON) or unstructured (such as emails). Improving the collection process is ongoing work, with the development team identifying and fixing known issues.
Related Topic: Observables supported by TruSTAR
The table below lists issues with data extraction that have been identified.
URL extraction issues
Incomplete extraction; for example: yahoo.c instead of yahoo.com
URL not correctly parsed when it contains parentheses
URL not correctly parsed when it contains bracketed colons "[:]"
Domains, including fully qualified domain names, classified as URLs
Filename extraction issue
Filename not correctly parsed when it contains spaces
Disambiguation between scripts and domains
Domain incorrectly classified as a Perl script; for example: myacmecompany.pl
Domain incorrectly classified as a Python script; for example: myacmecompany.com.py
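The script-versus-domain issue arises because some file extensions are also country-code top-level domains: .pl is both the Perl extension and the ccTLD for Poland, and .py is both the Python extension and the ccTLD for Paraguay. The sketch below only illustrates that ambiguity; the function name and the suffix table are hypothetical and do not represent TruSTAR's extraction logic.

```python
# Suffixes that can be read either as a script extension or as a ccTLD.
# This table is illustrative only, not an exhaustive or official list.
AMBIGUOUS_SUFFIXES = {
    "pl": ["Perl script", "domain under .pl (Poland)"],
    "py": ["Python script", "domain under .py (Paraguay)"],
}


def possible_readings(token):
    """Return every plausible interpretation of a dotted token's suffix."""
    name, dot, suffix = token.rpartition(".")
    if not dot or not name:
        # No dot, or nothing before the dot: not a filename or domain candidate.
        return []
    return AMBIGUOUS_SUFFIXES.get(suffix.lower(), [])


for token in ("myacmecompany.pl", "myacmecompany.com.py", "yahoo.com"):
    print(token, possible_readings(token))
```

A token such as myacmecompany.pl returns two readings, so the suffix alone cannot settle the classification; surrounding context is needed.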
Some users have reported that IP addresses are not being extracted. TruSTAR validates IPv4 addresses, and an address that falls in a private or other non-routable range is not extracted as an IOC. These types of IP addresses include:
- loopback addresses (127.x.x.x)
- site-local addresses (10/8 prefix, 172.16/12 prefix, 192.168/16 prefix)
- value of 0
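To check in advance whether an address would be skipped under these rules, the filter can be approximated with Python's standard ipaddress module. This is a minimal sketch, not TruSTAR's implementation: the function name is hypothetical, and reading "value of 0" as the address 0.0.0.0 is an assumption.

```python
import ipaddress

# Ranges listed above that are not extracted as IOCs.
EXCLUDED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),     # loopback
    ipaddress.ip_network("10.0.0.0/8"),      # site-local
    ipaddress.ip_network("172.16.0.0/12"),   # site-local
    ipaddress.ip_network("192.168.0.0/16"),  # site-local
]


def is_extractable_ipv4(text):
    """Return True if text is a valid IPv4 address outside the excluded ranges."""
    try:
        addr = ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        return False  # not a valid IPv4 address at all
    if int(addr) == 0:
        # Assumption: "value of 0" means the address 0.0.0.0.
        return False
    return not any(addr in net for net in EXCLUDED_NETWORKS)


for candidate in ("8.8.8.8", "127.0.0.1", "10.1.2.3", "192.168.1.1", "0.0.0.0"):
    print(candidate, is_extractable_ipv4(candidate))
```

The ranges are spelled out explicitly rather than relying on the module's is_private attribute, which covers more ranges than the ones listed here.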
TruSTAR has upgraded its platform by converting the Compute Engine to use Apache Spark, which is engineered for high performance on large datasets. The upgraded Compute Engine enables TruSTAR to handle large volumes of data extraction and normalization while monitoring data quality.
TruSTAR is continuing to invest in improving data extraction. Key initiatives include:
- Allowing users to submit structured indicator objects.
- Upgrading the extraction engine by using Apache Spark to process and prioritize email submissions.
- Improving the URL data model. This includes splitting URLs into their components, capturing each component correctly, and separating domain names from URLs as a distinct concept.
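As an illustration of what splitting a URL into components can look like, the sketch below uses Python's standard urllib.parse module. The function name and the field names in the returned dictionary are hypothetical and do not reflect TruSTAR's actual URL data model.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into separate components, keeping the domain distinct."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname,  # host only, lowercased, without the port
        "port": parts.port,        # None when the URL names no port
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


print(split_url("https://www.example.com:8443/a/b.html?x=1#top"))
```

Keeping the domain as its own field lets a domain Observable be recorded separately from the full URL. Note that urlsplit only recognizes the host when the input includes a scheme or a leading "//"; a bare string such as example.com/path is treated entirely as a path.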