Firefox/Projects/Startup Time Improvements/joelr notes: Difference between revisions

No edit summary
 
(21 intermediate revisions by the same user not shown)
Line 4: Line 4:


= Current status =
= Current status =
=== September 23, 2009 ===
==== x86-64 ====
I now have x86-64 Firefox up and running on Snow Leopard. Josh Aas did all the work, I just [https://bugzilla.mozilla.org/show_bug.cgi?id=518435 put on the final touches].
==== A shiny new world ====
Damon thinks I should help Josh with the 64-bit port. 64-bit Trace Monkey gives us a [http://www.bailopan.net/blog/?p=595 20% improvement in speed] which beats the 5% DWARF improvement from -fomit-frame-pointer.
I didn't know it was my birthday. I'm leaving the 32-bit world behind. 64 bits here I come!


= Previous statuses =  
= Previous statuses =  


==== Restarting ====
=== September 22, 2009 ===
 
I have two bottlenecks on my plate, [https://bugzilla.mozilla.org/show_bug.cgi?id=517045 Bug 517045] and [https://bugzilla.mozilla.org/show_bug.cgi?id=517576 Bug 517576]. Both of them may not be possible to speed up significantly, not without a large research effort. The former requires consultation with the owners and the later requires me to become a layout expert.
 
Benjamin Smedberg suggested taking care of DWARF symbols in breakpad as that will let us use the -fomit-frame-pointer gcc option. Omitting the frame pointer would result in at least a 5% overall speed up because a register will be freed on x86-32.
 
DWARF is a [https://bugzilla.mozilla.org/showdependencygraph.cgi?id=464750 tree of dependent bugs], though, and Jim Blandy is working on them. Jim is a Linux guy, though, and also has ECMAScript 5 strict mode to take care of whereas I'm strictly focused on performance optimization.
 
I'll ask management to see if I can work on DWARF with Jim.
 
=== September 21, 2009 ===
 
==== Wins ====
 
Kudos to Masayuki for fixing [https://bugzilla.mozilla.org/show_bug.cgi?id=517549 Bug 517549]. It's hard to believe but it looks like our Mac Ts [http://graphs.mozilla.org/graph.html#tests=[{%22test%22:%2216%22,%22branch%22:%221%22,%22machine%22:%22169%22},{%22test%22:%2216%22,%22branch%22:%221%22,%22machine%22:%22173%22},{%22test%22:%2216%22,%22branch%22:%221%22,%22machine%22:%22178%22}]&sel=1252833380,1253542320 went down 30%] and it was only 12.3% of my startup. I wonder what happens when [https://bugzilla.mozilla.org/show_bug.cgi?id=517576 Bug 517576] gets fixed, that's 21% of my startup!
 
One other thing I'm wondering about... What if all the bottlenecks I find will get fixed by someone else? Damon wants me to write code!!!
 
=== September 16, 2009 ===
 
==== That mysterious forking ====
 
Apparently, Firefox forks the first time after a build to rebuild caches, etc. Need to make sure to profile from the 2nd run on. This forking behavior baffled me when I started with DTrace.
 
==== I'm an idiot ====
 
I spent a month becoming an expert on DTrace, only to realize that it's not very useful to get a global picture of the bottlenecks because the overhead of tracing everything is so high!
 
Shark is a much better tool to get a global picture, specially using "Time profile all states" which allocates time spent blocking to the calling function. I overlooked this initially since I did not zero in on the right thread, the one that has ''start'' and ''XRE_main'' or ''exit''. Without this it looks like Firefox is spending most of its time waiting on threads and Mach messages.
 
==== Timing startup ====
 
A clear picture courtesy of Shark
 
<pre>
0.0% 92.8% firefox-bin start
0.0% 92.8% firefox-bin main
0.0% 91.6% XUL   XRE_main
0.0% 40.8% XUL   nsAppStartup::CreateHiddenWindow()
0.0% 34.7% XUL   ScopedXPCOMStartup::Initialize()
0.0% 6.4% XUL   ScopedXPCOMStartup::~ScopedXPCOMStartup()
0.0% 3.4% XUL   ScopedXPCOMStartup::SetWindowCreator(nsINativeAppSupport*)
0.0% 2.4% XUL   nsXPTCStubBase::Stub3()
0.0% 0.8% XUL   nsAppStartupNotifier::Observe(nsISupports*, char const*, unsigned short const*)
0.0% 0.8% HIToolbox   GetCurrentEventKeyModifiers
0.0% 0.6% XUL   nsXREDirProvider::DoStartup()
0.0% 0.5% XUL   RemoveComponentRegistries(nsIFile*, nsIFile*, int)
0.0% 0.4% XUL   NS_NewToolkitProfileService(nsIToolkitProfileService**)
0.0% 0.3% XUL   nsToolkitProfile::Lock(nsIProfileUnlocker**, nsIProfileLock**)
0.0% 0.1% XUL   WriteVersion(nsIFile*, nsCString const&, nsCString const&, nsIFile*, nsIFile*)
0.0% 0.1% XUL   nsINIParser_internal::Init(nsILocalFile*)
0.0% 0.1% XUL   LaunchChildMac
0.0% 0.4% XUL   XRE_GetBinaryPath
0.3% 0.4% XUL   XRE_CreateAppData
0.4% 0.4% XUL   NS_LogInit_P
0.0% 6.2% dyld _dyld_start
1.0% 1.0% libSystem.B.dylib _exit
</pre>
 
=== September 14, 2009 ===
 
==== Breaking startup into smaller chunks ====
 
<pre>
bsmedberg:joelr: I really think that "Ts" is still much too large of a chunk to make meaningful improvements on
bsmedberg:joelr: in order to make meaninful discoveries, I think you have to break startup down into pieces and then maybe improve each piece
joelr:bsmedberg: i'm listening
bsmedberg:well, you've got various stuff that happens up until you start the event loop
bsmedberg:breaking XRE_main into phases up until nsAppStartup::Run would be interesting
bsmedberg:I still haven't seen a graph or anything of how we break down
bsmedberg:joelr: I think stepping through in a debugger might give you a better sense of what's going on (up until Run()) than profiles would
bsmedberg:even if you broke things up into "from main() to the beginning of event loop" and "after the event loop starts" that would be more interesting
bsmedberg:joelr: and then within the first chunk, timing NS_InitXPCOM, the extension manager bits, chromereg startup
bsmedberg:joelr: I think this intersects with some of the startup-timeline stuff taras was doing
joelr:bsmedberg: what are the functions for "extension manager bits" and "chromereg startup"?
bsmedberg:joelr: basically from http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#2607 to #3453
joelr:bsmedberg: thanks for the suggestion, i'll do that
bsmedberg:joelr: #3382 to #3408 or so
</pre>
 
==== XUL reflow ====
 
<pre>
bz:joelr: so I'm still wondering about some of the reflow stuff
bz:joelr: I know it's hard, but that's why no one's really dived into it yet
bz:joelr: and I'd love to know whether we do in fact end up looping significantly in sprocket layout during startup, for example
joelr:bz: i'm still wondering about reflow myself but i'm not quite ready to tackle it yet
joelr:bz: don't know enough
</pre>
 
==== XPT_ArenaMalloc and zero-filling memory ====
 
Closed [https://bugzilla.mozilla.org/show_bug.cgi?id=516241 Bug 516241]. There's no evidence that zero-filling memset is a bottleneck. DTrace cannot latch onto the return from memset on Snow Leopard, so I can't time it. All I know is that it's called 133,245 times during startup and zero-fills 14,851,995 bytes of memory.
 
<pre>
bsmedberg:joelr: do you have evidence that the calloc is hurting? It really
doesn't sound like something that would take a lot of time
bsmedberg:there's no IO, and arena pages are roughly page-sized, right?
joelr:bsmedberg: i don't have evidence that it's hurting, not at all. i just
see it in the startup path a lot.
bsmedberg:"see it" in what way?
joelr:bsmedberg: there's no IO but arena pages are not page-sized. they are 1k
blocks whereas pages are 4k
joelr:bsmedberg: i'm dtracing startup, all functions, that's how i see it
bsmedberg:it's less than a page, then... you really shouldn't be having
memory-thrashing issues with the calloc
bsmedberg:joelr: you mean it takes time, or its just a callcount?
bsmedberg:I'm a little skeptical that the arena actually helps at all, but
that's a different matter.
joelr:bsmedberg: call count, i suppose. i'm also skeptical that arena helps at
all since mac osx has it's own memory allocation optimizations
bsmedberg:And the xpt code has no tests, so unles you have a measured
performance problem I suggest not touching it.
joelr:bsmedberg: i see (tests)... well, i'll look into it a little bit more
since i want to measure a difference vs malloc (no zero-fill). it does crash
with malloc at the moment so i won't be looking into it too much.
bsmedberg:you'd have to basically zero-fill when you allocate items from the
arena
joelr:bsmedberg: because those items expect to be zero-filled?
bsmedberg:seems so, if you're crashing with malloc...
bsmedberg:joelr: ok, well, I'm saying "you have no evidence this bug is worth
fixing, please move on"
</pre>
 
=== September 11, 2009 ===
 
==== Reflows ====
 
<pre>
taras: joelr: btw, you gave up on reflows?
taras: that makes me sad
taras: i was hoping we'd end up with some sort of tool to measure cost of laying out various xul components
joelr: taras: i did not give up, i just don't know how to go about it yet
taras: so people could profile their guis
taras: instead of going at it blindly
joelr: taras: i didn't hear your prayers, my apologies
joelr: i'll make a note on the idea of a tool that measures guis, though
joelr: taras: how do i measure? from where to where?
joelr: layout? reflow? something else?
taras: joelr: it's simple, time how long they take
taras: just see long functions take to run, ones that trigger all those fun reflow stacks
</pre>
 
==== Restarting from the ground up ====


I'm very comfortable with DTrace but still not comfortable with the Firefox code base. It will take me a while to grok reflow, for example, much less optimize it. I feel that there are startup speedups that can be achieved quicker, with a better return on investment (aka low-hanging fruit).
I'm very comfortable with DTrace but still not comfortable with the Firefox code base. It will take me a while to grok reflow, for example, much less optimize it. I feel that there are startup speedups that can be achieved quicker, with a better return on investment (aka low-hanging fruit).
Line 25: Line 180:


There are obviously interesting bits here. How can we piggyback on the ''ImageLoaderMachOCompressed'' for example?
There are obviously interesting bits here. How can we piggyback on the ''ImageLoaderMachOCompressed'' for example?
Minimizing file system access provides a good return on investment since it apparently takes a few milliseconds on mobile devices. I'm focusing on this first and foremost since we do have file system access in the midst of dynamic linker static initialization, e.g.
<pre>
20974362 861470 708476        <- xptiSortFileList(void const*, void const*, void*) = 622
20974385 . .        -> xptiSortFileList(void const*, void const*, void*)(0x3050F9C, 0x3050F7C, 0xBFFFEBC0)
20974394 . .        -> nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int)(0xBFFFE978, 0xBFFFE98C, 0x0)
20974397 113 2        <- nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int) = 24
20974400 . .        -> nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int)(0xBFFFE924, 0xBFFFE938, 0x0)
20974402 112 2        <- nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int) = 24
20974424 . .        -> nsLocalFile::GetNativeLeafName(nsACString_internal&)(0x3083600, 0xBFFFE978, 0x40)
20974427 . .        -> objc_exception_try_enter(0xBFFFE868, 0xE000002, 0x181A4)
</pre>
It may be interesting to study the startup path of Chromium and Safari.


=== September 9, 2009 ===
=== September 9, 2009 ===
Line 676: Line 846:
sudo ./cold.sh static-init.d
sudo ./cold.sh static-init.d
</pre>
</pre>
= Shark tips and tricks =
=== Expanding the call tree ===
Alt-Click does it.
=== CHUD signposts on Snow Leopard ===
There are no CHUD header files on Snow Leopard. If you're only interest is in emitting signposts, the easiest thing for you to do is switch to the syscall method instead.  That'll also mean you won't have to link against the CHUD framework. The full details are in the Shark User Guide, but in brief:
<pre>
int err = syscall(SYS_kdebug_trace, APPSDBG_CODE(DBG_MACH_CHUD, <your code number>) | <type>, arg1, arg2, arg3, arg4);
</pre>
Types are DBG_FUNC_START, DBG_FUNC_END and DBG_FUNC_NONE (for point signposts).
Don't forget to include <sys/kdebug.h> and <sys/syscall.h> for the necessary macros and function.


= DTrace tips and tricks =
= DTrace tips and tricks =
109

edits