User:Jesse/Vegas2011/TestingStaticAnalysis
From MozillaWiki
Talk title: Sticking to the Facts: Scientific Study of Static Analysis Tools
Background
- NSA "Center for Assured Software" (cas@nsa.gov)
- Purpose: "To increase the confidence that software used within DoD's critical systems is free from intentional and unintentional exploitable vulnerabilities"
- This static analysis team doesn't run the analyses itself, but instead recommends which software to use and what is "appropriate use of automation".
- The results of the study are anonymized, but the testcases are publicly available, so you can re-run the study yourself.
Talk
Advantages of static analysis
- good at finding some types of issues
- examines all parts of the software, not just parts that are hit in some execution
- automated, scalable, repeatable: can be used early and often
Disadvantages of static analysis
- the tools we're looking at don't report positive properties (lack of bugs)
- Many false positives; need human confirmation
- May report issues you consider unimportant
- bad at finding some types of flaws, especially design flaws
- potential for false sense of security: false negatives, exaggerated vendor claims
Study focus
- tools that identify and report issues in software ("code weakness analysis tools", "static application security testing tools") (presumably in contrast to formal methods proof programs)
- for C/C++ and Java
- including standard libraries but not popular third-party libraries
- including Windows-specific libraries (but not other OSes)
- looking only at results from the tools (including false positives), but not cost, speed, customization, or usability (IDE integration, reporting across builds, understandability of messages)
Study methods
- default tool configuration
- no code annotation
- synthetic testcases
- often with both flawed and non-flawed variants: the best static analysis will complain only about the flawed one
- for most CWEs
- other than design issues and meta-issues
- grouped into "weakness classes", such as "buffer handling" and "control flow management"
- with various control flow and data flow patterns, to test the "depth" of each tool's analysis (for example, a scanf-to-printf source-and-sink pattern). Templates generated all the variants for all the relevant CWEs :)
Scoring
- scoring doesn't adjust for how common or serious each flaw type is, but maybe MITRE's CWE scoring system can help.
- scoring penalties for false positives vs false negatives?
- only counts tool reports that "match" the CWE (so ignoring "possible null deref" when testing a format string vulnerability). WTF? I guess this helps with inadvertent bugs in the testcases, but it seriously lowers the measured false-positive rate and probably also the true-positive rate.
- precision: percentage of results that are true positive (TP / (TP+FP))
- recall: percentage of flaws that a tool correctly reported (sensitivity, soundness) (TP / (TP+FN)) = (TP / flaws)
- F-score: the harmonic mean of precision and recall, F = (2PR) / (P + R), which tends toward the lower of the two values.
- To avoid giving a high score to a "stupid grep" tool, only give points when a tool correctly identifies the bad version as buggy and the good version as okay.
- Discrimination rate: D = (Discriminations / Flaws)
Conclusions
- Tools are not interchangeable. Many are stronger in some areas than in others.
- So using complementary tools may make sense for important projects (or for security teams)
- Tools perform differently on different languages. Even if a tool supports both C++ and Java, it might do much better on C++.
- So if you ask "which tool will work best for C#" we don't have a clue
- Every tool failed to report a significant portion of the flaws.
- Average tool only even tried to cover 8 of 13 weakness classes
- Average tool covered only 22% of the flaws in the classes it did cover
- If you ask "How many tools found this flaw?" for all the testcases, the largest slices are "no tools" or "exactly one tool". Very few flaws were found by 4 or more tools!
- Java tools slightly worse than C++ tools? Despite/Because of not having to deal with buffer overflow testcases?
- Java tools completely fall over on discrimination (no whole-program analysis - dynamic linking assumption?)
Planned future work
- The next version of the study, which will start in October 2011, will
- update testcases based on feedback
- solicit input from vendors as to appropriate configuration (at least enabling all the weakness classes!)
- add about 5 tools
Jesse's opinions
Lame that they anonymized the names of the tools.
I think the best thing to come out of this talk is that there's now a benchmark for tools to compete on.
They seemed to have a grudge against open source. "We'd expect commercial tools to do better." Only actually tried one open-source tool for C++. Didn't even try Linux tools. Anonymized the results, except for saying which tools are open-source.