Building a Do-It-Yourself Defect Discovery Practice
When SARIF became a de facto standard for security tool output, supported by commercial and OSS tools alike, development and security practitioners rejoiced. Both immediately began prototyping scripts to make defect discovery work better within their departments. Organizations began imagining in-house vulnerability management that leveraged emerging support within SCM platforms and the CSPs, both of which reliably ingest SARIF and present those findings to developers and operators within their familiar environments. With SARIF as the bridge, common use cases included adding or changing remediation guidance, affixing labels and tags to findings for better classification and search, modifying scoring, and of course, routing and ingesting results to triage and dashboard security issues.
The SARIF format brought huge wins: a reliably parsable format for tool output and an inclusive yet universal schema for findings that works across tool types, from static to dynamic analyses. For many users, this is benefit enough. It’s attractive to think these capabilities are sufficient to serve as the plumbing on which a defect discovery practice can be driven. And if that’s the case, maybe even go a step further and plan a defect discovery practice in which security collaborates with developers to configure and implement OSS scanning tools directly into the SCM and build platforms.
What problems will an organization face as it takes the approach above to defect discovery and application security? This article illuminates and addresses the hidden concerns. Whether you decide to tackle these problems head-on and do it yourself, or rely on a security platform to help you, these are the roadblocks that prevent scale.
Defect discovery tools use the complexity of their native findings formats to provide a few key benefits. Many commercial tools treat results files like a multi-file flattened database, carrying rich (and sometimes overwhelming) amounts of data that the tool’s UI, an IDE/browser, or a plugin visualizes. This allows users to easily navigate and mark up findings, and it lets the tool encode role-based workflows (for triage, suppression, and remediation validation). Results files must be processed by their respective tools in order to be read and useful to the UI or humans. Historically, customers relied on the rich information in these files to a) educate engineers on vulnerabilities and remediation, and b) drive hands-on/heads-up triage by security practitioners. Some tools base the distributed nature of their suppression and remediation-validation workflows on their findings format. Organizations came to rely on these features even while they begrudged the lock-in that accompanied them.
Consider the SARIF spec, and notice the incredible thought that has been put into supporting the richer aspects of native tool formats, such as graph information for source location context or remediation guidance. Unfortunately, that thinking hasn’t achieved broad implementation in practice, particularly among OSS tooling, and commercial tool support varies. Without the ability to accurately transcribe these details from native results formats, SARIF output often loses desirable detail or workflow capability (suppression, triage, remediation validation).
Security workflows in some organizations prevent them from completely forgoing the benefits of native tool finding formats, particularly when those findings are databased within an enterprise-featured SaaS. In these circumstances, native findings are translated to SARIF before being pushed to development or cloud platforms. Organizations have to find a way to map the unique identifier found in each SARIF finding to that of its analog in the native format.
Organizations find it valuable to maintain the relationship from SARIF finding to native finding, either through decoration within the format or through an external mapping. SARIF’s fingerprint and GUID properties may, for some tools, represent a 1:1 mapping to a unique native finding, but even then that mapping may not be apparent in later workflows, so these attributes are not themselves sufficient. Once this relationship is made, custom vulnerability management workflows use it bidirectionally to keep finding status, score, and measures in sync between security and development tooling.
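To make that mapping concrete, here is a minimal sketch (in Python) of building a lookup from SARIF GUIDs or fingerprints back to native finding IDs. It assumes the translation step stashed the native identifier in each result's properties bag under a hypothetical 'nativeId' key; substitute whatever your exporter actually emits.

```python
import json

def build_native_mapping(sarif_path: str) -> dict:
    """Map SARIF result GUIDs (or fingerprints) to native-tool finding IDs."""
    with open(sarif_path) as f:
        sarif = json.load(f)

    mapping = {}
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Prefer the stable GUID; fall back to the first fingerprint value.
            key = result.get("guid") or next(
                iter(result.get("fingerprints", {}).values()), None
            )
            # Hypothetical property written by the native-to-SARIF translator.
            native_id = result.get("properties", {}).get("nativeId")
            if key and native_id:
                mapping[key] = native_id
    return mapping
```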
Key vulnerability management use cases demand features like recognition of finding state (new, unchanged, and reported again), acknowledgement of approved suppression, and validation of remediation.
From the perspective of a SARIF document, its schema supports recording these attributes. Before SARIF output can be consumed, the format’s different sections need to be applied in context. This entails:
When consuming SARIF, platforms like GitHub or plugins to VS Code handle this logic invisibly. But if you’re writing your own vulnerability management tooling and workflow outside of those platforms, you’ll need to replicate this logic, including:
In essence, SARIF acts as a ‘bus’ that findings travel over, but producing the intended behaviors also requires stateful ETL logic to handle the workflow above. In this regard, SARIF results files are like IaC configuration: rich documents full of entities and attributes, but in need of non-trivial rendering logic.
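As a rough illustration of that stateful logic, the sketch below diffs the fingerprints of two scan runs to bucket findings as new, existing, or apparently fixed. It assumes each result carries usable fingerprints or partialFingerprints values, which (as discussed later) is not a given for every tool.

```python
import json

def fingerprint_set(sarif_path: str) -> set:
    """Collect every fingerprint value emitted in a SARIF file."""
    with open(sarif_path) as f:
        sarif = json.load(f)
    prints = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            fp = result.get("fingerprints") or result.get("partialFingerprints") or {}
            prints.update(fp.values())
    return prints

def classify(previous_path: str, current_path: str) -> dict:
    """Bucket findings by comparing the previous run against the current one."""
    previous = fingerprint_set(previous_path)
    current = fingerprint_set(current_path)
    return {
        "new": current - previous,
        "existing": current & previous,
        # Absent this run; treat as a candidate for "remediated", pending review.
        "fixed": previous - current,
    }
```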
As findings coalesce from various tools within vulnerability management pipelines, the disparate attributes of each tool will need to be normalized. Without fail, three areas consistently differ from tool to tool:
Among the types of OSS defect discovery tools, and even among similar tools within a category such as SCA or container scanning, the organization of findings by logical location (i.e., the vulnerable or defective resource) and by nature differs. For instance: Checkov, an IaC SAST scanner, emits both PASS and FAIL results. Trivy container scanning findings are organized by target image. npm findings are organized by package, but require some pre-processing to avoid duplication in their reporting. Dependabot uses yet another organization, as does Snyk. Depending on circumstances, Snyk findings land in different sections: ‘issues’ or ‘vulnerabilities’.
Post-processing SARIF output to create a consistent findings association may be necessary if your organization desires firm-wide measurement and policy but allows teams to select their own defect discovery tools. This problem grows particularly acute in the SCA, container, and IaC scanning spaces.
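A post-processing pass along these lines might look like the following sketch, which reduces each result to a normalized ‘logical location’ string. The per-tool branches are assumptions about how Trivy and Checkov structure their SARIF output; verify them against the versions you actually run.

```python
def logical_location(tool_name: str, result: dict) -> str:
    """Reduce a SARIF result to a normalized '<resource>::<rule>' key."""
    loc = (result.get("locations") or [{}])[0]
    uri = (loc.get("physicalLocation", {})
              .get("artifactLocation", {})
              .get("uri", "unknown"))

    if tool_name == "trivy":
        # Assumption: container findings carry the target image in the
        # properties bag rather than a file URI.
        resource = result.get("properties", {}).get("target", uri)
    else:
        # Checkov, npm, Dependabot, Snyk, etc.: fall back to the file URI.
        # (Checkov emits PASS and FAIL results; filter PASSes before this step.)
        resource = uri

    return f"{resource}::{result.get('ruleId', '')}"
```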
Labeling and tagging are inconsistent between tools. Not all findings bear an appropriate tag, and for those that should, not all tool output is consistently tagged. Some tools tag a CWE, a CVE, or a GHSA; some a combination thereof. Experimentation reveals that two tools within a category, such as SCA, will tag a finding carrying the same SARIF rule ID and logical location with different GHSAs. If vulnerability management relies on these tags for conditional behavior, enriching the SARIF output to reliably include them will be a must for your organization. Such enrichment might include your own labeling, such as associating a rule ID or finding with a particular policy or standard, such as ASVS or NIST-XXX.
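An enrichment step can be as simple as the sketch below, which appends missing CWE and policy tags onto each rule's properties.tags array. The CWE_BY_RULE table, its example rule ID, and the tag strings are hypothetical stand-ins for an organization-maintained mapping.

```python
# Hypothetical, organization-maintained mapping from rule IDs to the tags
# your vulnerability management pipeline relies on.
CWE_BY_RULE = {
    "js/sql-injection": ["external/cwe/cwe-089", "policy/asvs-5.3.4"],
}

def enrich_tags(run: dict) -> None:
    """Ensure each rule in a SARIF run carries the expected tags."""
    rules = run.get("tool", {}).get("driver", {}).get("rules", [])
    for rule in rules:
        tags = rule.setdefault("properties", {}).setdefault("tags", [])
        for tag in CWE_BY_RULE.get(rule.get("id", ""), []):
            if tag not in tags:
                tags.append(tag)
```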
Finally, scoring. Scoring is about risk management, and, as with politics, I like to say “all risk management is (organizationally) local.” Semgrep findings carry severity, while Checkov and Brakeman findings don’t. Almost no tool populates confidence, though Brakeman does. SCA and image scanning tools typically populate CVSS scores, but their output requires some textual processing and manipulation before it’s reliable.
Your vulnerability management pipeline is likely to need a late-stage post-processing step that collects scoring attributes from emitted SARIF and then outputs a decorated file that reliably scores each finding per your organization’s model. In my own work, post-processing was necessary for many of the OSS tools ingested. Some tools required simple transliteration (“Moderate” → “Medium”, “Error” → “Critical”); others needed adjustment, either wholesale or rule-specific (<None> → “Info”, or <RuleID XXX, “Medium”> → “Low”). You will also need to take a stand on what an “unknown” score from each tool means.
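That scoring pass might look like the sketch below, which maps whatever severity vocabulary a tool emits onto a single organizational scale and records the result in the properties bag. The mapping values echo the examples above; the “unknown” default is an illustrative policy choice, not a recommendation.

```python
# Illustrative vocabulary mapping onto the organization's own scale.
SEVERITY_MAP = {
    "moderate": "medium",
    "error": "critical",
    "warning": "medium",
    "note": "info",
}
UNKNOWN_DEFAULT = "info"  # A policy decision; yours may differ.

def normalize_severity(result: dict) -> str:
    """Decorate a SARIF result with an organization-local severity."""
    raw = (result.get("level")
           or result.get("properties", {}).get("severity")
           or "").lower()
    normalized = SEVERITY_MAP.get(raw, raw or UNKNOWN_DEFAULT)
    result.setdefault("properties", {})["orgSeverity"] = normalized
    return normalized
```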
Last, and definitely not least, is the notion of finding ‘sameness’ across tool executions. Many vulnerability management use cases (disposition as “new”, “found again”, “remediated”, “suppressed”, and so forth) are predicated on being able to tell that a finding is uniquely ‘this one’ and whether or not it’s the ‘same’ as another finding.
SARIF’s specification clearly shows the tremendous amount of effort that has been put into this concept. It calls for a stable fingerprint (between executions), as well as an array of partialFingerprints that can decorate a finding to facilitate tool- or workflow-specific behaviors. The specification devotes an appendix to the properties a fingerprint should satisfy and how downstream results management should rely on them. Cherry-picking some of its guidance:
These directives aim to make a single finding stably identifiable between executions. Still, there are specific and common situations in which the “good enough” defined above is likely to perform poorly. Experimentally, I’ve found that, from a static analysis perspective, tools’ schemes are likely to fail when:
Organizations may want to take a two-pronged approach to solving the challenges with SARIF’s fingerprinting specification and with the limitations of how defect discovery tools implement it. First, where finding instances are more likely to collide (the first bullet), post-processing tool output can assure the grouping of findings. SARIF’s ‘correlationGuid’ property is intended for this purpose, and truly unique instances can be differentiated by adding the ‘guid’ property. Logic downstream of SARIF producers can populate these fields to indicate “these three SARIF stanzas are instances of the ‘same’ finding.” In some cases the instances will possess the same fingerprint, in others differing ones. Second, where findings appear to differ due to code motion or other distinctions within the fingerprint, post-processing can produce (or simply consume) partial fingerprints that track findings through their motion, giving downstream results viewing and vulnerability management better fidelity as to ‘new’, ‘existing’, and ‘fixed.’
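A sketch of that first prong: assign every instance its own ‘guid’ and give instances judged to be the ‘same’ finding a shared ‘correlationGuid’. The grouping key used here (rule ID plus file URI) is deliberately coarse and purely illustrative; real logic would also weigh partial fingerprints and location context.

```python
import hashlib
import uuid

def correlate(results: list) -> None:
    """Populate 'guid' per instance and a shared 'correlationGuid' per group."""
    groups = {}
    for result in results:
        loc = (result.get("locations") or [{}])[0]
        uri = (loc.get("physicalLocation", {})
                  .get("artifactLocation", {})
                  .get("uri", ""))
        # Coarse grouping key: same rule in the same file is the "same" finding.
        key = hashlib.sha256(
            f"{result.get('ruleId', '')}|{uri}".encode()
        ).hexdigest()
        groups.setdefault(key, str(uuid.uuid4()))
        result["guid"] = str(uuid.uuid4())        # unique per instance
        result["correlationGuid"] = groups[key]   # shared across the group
```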
SARIF is a boon for developers and security practitioners who are trying to author vulnerability management functionality for their organization, or simply automate aspects of their existing vuln management regime. SARIF solves a lot of the problems we suffered prior to this standardization, but understandably, it leaves a lot left to solve. I’ve seen many organizations begin authoring their own vuln management code and present at conferences, as enthusiastic and hopeful as a startup founder.
These same enthusiasts almost invariably suffer a big letdown and find their implementations grind to a halt after a year or two. My hope is that, having been down this road a few times myself and with customers, the map of hazards I’ve provided above helps you avoid stepping headlong into them and leaving developers complaining about the same thing they always have: unreliable tool output.
I’m pleased and proud to have worked with Boost to tackle these problems behind the scenes. We know that with vulnerability management, like most problems, “the devil is in the details”. I’m excited to see a platform encode the experience of OSS tool maintainers and practitioners such as myself into its logic, and seamlessly solve some of these challenges out of the box.