Platform/Uptime

From MozillaWiki

Revision as of 07:48, 14 April 2016

Uptime is a Platform Engineering project that aims to improve Firefox's stability, i.e. reduce its crash rate, by extending and complementing existing stability work within Mozilla.

Avenues for improvement

Reactive

Reactive strategies are those that help us better identify, diagnose and fix crash-prone code once it has shipped in a Firefox build (from Nightly through to Release). Reactive strategies are based around crash reports. The following is a list of ideas.

  • Improve manual inspection of crash reports; ensure all significant crashes on all release channels are checked in a timely fashion.
  • Improve automated analysis of crash reports, e.g. to identify correlations between crashes on release and earlier channels.
  • Improve crash report aggregation and presentation, to make it easier to identify important crashes.
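
The cross-channel correlation idea above can be sketched roughly as follows. This is a minimal illustration, not Socorro's actual schema: the `CrashReport` type, channel strings, and function names are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical crash report: a crash signature plus the release channel
// the report came from.
struct CrashReport {
  std::string signature;
  std::string channel;  // "nightly", "beta", "release", ...
};

// Count occurrences of each signature on a given channel.
std::map<std::string, int> CountBySignature(
    const std::vector<CrashReport>& reports, const std::string& channel) {
  std::map<std::string, int> counts;
  for (const auto& r : reports) {
    if (r.channel == channel) {
      ++counts[r.signature];
    }
  }
  return counts;
}

// Signatures seen on both nightly and release: candidates for crashes
// that were visible on a pre-release channel before riding the trains
// into release.
std::vector<std::string> SignaturesOnBoth(
    const std::vector<CrashReport>& reports) {
  auto nightly = CountBySignature(reports, "nightly");
  auto release = CountBySignature(reports, "release");
  std::vector<std::string> both;
  for (const auto& entry : nightly) {
    if (release.count(entry.first)) {
      both.push_back(entry.first);
    }
  }
  return both;
}
```

A real implementation would weight by channel population size and crash volume rather than simply intersecting signature sets.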

Also, a non-trivial fraction of crashes may be due to users having faulty hardware.

  • Run a memory check upon crashing, probably based on some heuristic such as crash frequency, and inform the user if they have faulty memory.
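
The crash-frequency heuristic above could look roughly like this sketch. The class, thresholds, and sliding-window policy are assumptions for illustration, not existing Firefox code.

```cpp
#include <cassert>
#include <deque>

// Hypothetical heuristic: only trigger an expensive memory check if we
// have seen at least minCrashes crashes within the last windowSeconds.
class MemoryCheckHeuristic {
 public:
  MemoryCheckHeuristic(int minCrashes, long windowSeconds)
      : mMinCrashes(minCrashes), mWindowSeconds(windowSeconds) {}

  // Record a crash at the given timestamp (in seconds) and report
  // whether the recent crash rate now justifies running a memory check.
  bool OnCrash(long nowSeconds) {
    mCrashTimes.push_back(nowSeconds);
    // Drop crashes that have aged out of the sliding window.
    while (!mCrashTimes.empty() &&
           nowSeconds - mCrashTimes.front() > mWindowSeconds) {
      mCrashTimes.pop_front();
    }
    return static_cast<int>(mCrashTimes.size()) >= mMinCrashes;
  }

 private:
  int mMinCrashes;
  long mWindowSeconds;
  std::deque<long> mCrashTimes;
};
```

For example, with `MemoryCheckHeuristic h(3, 3600)`, the third crash within an hour would trigger the check, while isolated crashes would not.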

Proactive

Proactive strategies are those that help us prevent crash-prone code from shipping. The following is a list of possibilities.

  • Improve test coverage.
  • Improve fuzzing coverage.
    • Better isolate components so they can be fuzzed more easily (e.g. the JS shell).
    • Record non-reproducible fuzz crashes in rr so they can be played back reliably.
  • Extend use of static and dynamic analysis (e.g. Coverity, ASAN, TSAN, Valgrind).
  • Eliminate crash-prone code patterns.
    • Low-level, e.g. replace raw pointers with smart pointers such as UniquePtr.
    • High-level, e.g. disallow binary extensions.
  • Implement more internal verification ("extended assertions"), e.g. verify complex data structures such as compiler IR.
  • Reimplement existing C and C++ components in less crash-prone languages (e.g. JavaScript, Rust).
  • Better utilize available OS protection against malware (largely a Windows-only issue).
  • Recover from crash-causing events and continue. This is difficult in general, but may be possible in restricted cases.
  • Investigate ways to use the crystal ball (link needed) tool to predict release channel crash rates from crash data on pre-release channels.
  • Evaluate the feasibility of automatically detecting new crashes, tracing them back to recent changes around the crash location, and notifying whoever most recently changed that code.
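
The low-level pattern replacement in the list above can be illustrated with a minimal sketch. `Widget` and the factory functions are hypothetical; `std::unique_ptr` is used here as the standard-library equivalent of Gecko's `mozilla::UniquePtr`.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Widget {
  int value = 0;
};

// Crash-prone pattern: a raw owning pointer. The caller must remember
// to delete it exactly once; forgetting leaks, deleting twice crashes.
Widget* MakeWidgetRaw() {
  Widget* w = new Widget();
  w->value = 42;
  return w;
}

// Safer pattern: ownership is encoded in the type. The Widget is
// destroyed automatically when its owner goes out of scope, and
// ownership transfer requires an explicit std::move, so leaks and
// double-frees are much harder to write.
std::unique_ptr<Widget> MakeWidget() {
  auto w = std::make_unique<Widget>();
  w->value = 42;
  return w;
}
```

The payoff is that a whole class of use-after-free and double-free crashes becomes structurally impossible rather than something reviewers must catch by inspection.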

Administration

Things to be decided.

  • Regular meetings: time, frequency, content?
  • Progress tracking: high-level stability measurements, bug lists, etc., with links to each.
  • Communication: IRC, email, other?