User:Jesse/Vegas2011/TestingStaticAnalysis
From MozillaWiki
Talk title: Sticking to the Facts: Scientific Study of Static Analysis Tools
Background
- NSA "Center for Assured Software" (cas@nsa.gov)
- Purpose: "To increase the confidence that software used within DoD's critical systems is free from intentional and unintentional exploitable vulnerabilities"
- This static analysis team doesn't run the analyses itself, but instead recommends which software to use and what is "appropriate use of automation".
- The results of the study are anonymized, but the testcases are publicly available, so you can re-run the study yourself.
Talk
Advantages of static analysis
- good at finding some types of issues
- examines all parts of the software, not just parts that are hit in some execution
- automated, scalable, repeatable: can be used early and often
Disadvantages of static analysis
- the tools we're looking at don't report positive properties (lack of bugs)
- Many false positives; need human confirmation
- May report issues you consider unimportant
- bad at finding some types of flaws, especially design flaws
- potential for false sense of security: false negatives, exaggerated vendor claims
Study focus
- tools that identify and report issues in software ("code weakness analysis tools", "static application security testing tools") (presumably in contrast to formal methods proof programs)
- for C/C++ and Java
- including standard libraries but not popular third-party libraries
- including Windows-specific libraries (but not other OSes)
- looking only at results from the tools (including false positives), but not cost, speed, customization, or usability (IDE integration, reporting across builds, understandability of messages)
Study methods
- default tool configuration
- no code annotation
- synthetic testcases
- often with both flawed and non-flawed variants: the best static analysis will complain only about the flawed one
- for most CWEs
- other than design issues and meta-issues
- grouped into "weakness classes", such as "buffer handling" and "control flow management"
- with various control flow and data flow patterns, to test the "depth" of each tool's analysis (for example, a scanf-to-printf source-and-sink pattern). Templates generated all the variants for all the relevant CWEs :)
Scoring
- scoring doesn't adjust for how common or serious each flaw type is, but maybe MITRE's CWE scoring system can help.
- scoring penalties for false positives vs false negatives?
- only counts tool reports that "match" the CWE (so ignoring "possible null deref" when testing a format string vulnerability). WTF? I guess this helps with inadvertent bugs in the testcases, but it seriously lowers the measured false-positive rate and probably also the true-positive rate.
- precision: percentage of results that are true positive (TP / (TP+FP))
- recall: percentage of flaws that a tool correctly reported (sensitivity, soundness) (TP / (TP+FN)) = (TP / flaws)
- F-score: the harmonic mean of precision and recall, F = (2PR) / (P + R), which tends toward the lower of the two values.
- To avoid giving a high score to a "stupid grep" tool, only give points when a tool correctly identifies the bad version as buggy and the good version as okay.
- Discrimination rate: D = (Discriminations / Flaws)
Conclusions
- Tools are not interchangeable. Many are stronger in some areas than in others.
- So using complementary tools may make sense for important projects (or for security teams)
- Tools perform differently on different languages. Even if a tool supports both C++ and Java, it might do much better on C++.
- So if you ask "which tool will work best for C#" we don't have a clue
- Every tool failed to report a significant portion of the flaws.
- Average tool only even tried to cover 8 of 13 weakness classes
- Average tool covered only 22% of the flaws in the classes it did cover
- If you ask "How many tools found this flaw?" for all the testcases, the largest slices are "no tools" or "exactly one tool". Very few flaws were found by 4 or more tools!
- Java tools slightly worse than C++ tools? Despite/Because of not having to deal with buffer overflow testcases?
- Java tools completely fall over on discrimination (no whole-program analysis - dynamic linking assumption?)
Planned future work
- The next version of the study, which will start in October 2011, will
- update testcases based on feedback
- solicit input from vendors as to appropriate configuration (at least enabling all the weakness classes!)
- add about 5 tools
Jesse's opinions
Lame that they anonymized the names of the tools.
I think the best thing to come out of this talk is that there's now a benchmark for tools to compete on.
They seemed to have a grudge against open source. "We'd expect commercial tools to do better." Only actually tried one open-source tool for C++. Didn't even try Linux tools. Anonymized the results, except for saying which tools are open-source.