109
edits
Joel Reymont (talk | contribs) |
Joel Reymont (talk | contribs) |
||
(44 intermediate revisions by the same user not shown) | |||
Line 4: | Line 4: | ||
= Current status = | = Current status = | ||
=== September 23, 2009 === | |||
==== x86-64 ==== | |||
I now have x86-64 Firefox up and running on Snow Leopard. Josh Aas did all the work, I just [https://bugzilla.mozilla.org/show_bug.cgi?id=518435 put on the final touches]. | |||
==== A shiny new world ==== | |||
Damon thinks I should help Josh with the 64-bit port. 64-bit Trace Monkey gives us a [http://www.bailopan.net/blog/?p=595 20% improvement in speed] which beats the 5% DWARF improvement from -fomit-frame-pointer. | |||
I didn't know it was my birthday. I'm leaving the 32-bit world behind. 64 bits here I come! | |||
= Previous statuses = | |||
=== September 22, 2009 === | |||
I have two bottlenecks on my plate, [https://bugzilla.mozilla.org/show_bug.cgi?id=517045 Bug 517045] and [https://bugzilla.mozilla.org/show_bug.cgi?id=517576 Bug 517576]. Both of them may not be possible to speed up significantly, not without a large research effort. The former requires consultation with the owners and the later requires me to become a layout expert. | |||
Benjamin Smedberg suggested taking care of DWARF symbols in breakpad as that will let us use the -fomit-frame-pointer gcc option. Omitting the frame pointer would result in at least a 5% overall speed up because a register will be freed on x86-32. | |||
DWARF is a [https://bugzilla.mozilla.org/showdependencygraph.cgi?id=464750 tree of dependent bugs], though, and Jim Blandy is working on them. Jim is a Linux guy, though, and also has ECMAScript 5 strict mode to take care of whereas I'm strictly focused on performance optimization. | |||
I'll ask management to see if I can work on DWARF with Jim. | |||
=== September 21, 2009 === | |||
==== Wins ==== | |||
Kudos to Masayuki for fixing [https://bugzilla.mozilla.org/show_bug.cgi?id=517549 Bug 517549]. It's hard to believe but it looks like our Mac Ts [http://graphs.mozilla.org/graph.html#tests=[{%22test%22:%2216%22,%22branch%22:%221%22,%22machine%22:%22169%22},{%22test%22:%2216%22,%22branch%22:%221%22,%22machine%22:%22173%22},{%22test%22:%2216%22,%22branch%22:%221%22,%22machine%22:%22178%22}]&sel=1252833380,1253542320 went down 30%] and it was only 12.3% of my startup. I wonder what happens when [https://bugzilla.mozilla.org/show_bug.cgi?id=517576 Bug 517576] gets fixed, that's 21% of my startup! | |||
One other thing I'm wondering about... What if all the bottlenecks I find will get fixed by someone else? Damon wants me to write code!!! | |||
=== September 16, 2009 === | |||
==== That mysterious forking ==== | |||
Apparently, Firefox forks the first time after a build to rebuild caches, etc. Need to make sure to profile from the 2nd run on. This forking behavior baffled me when I started with DTrace. | |||
==== I'm an idiot ==== | |||
I spent a month becoming an expert on DTrace, only to realize that it's not very useful to get a global picture of the bottlenecks because the overhead of tracing everything is so high! | |||
Shark is a much better tool to get a global picture, specially using "Time profile all states" which allocates time spent blocking to the calling function. I overlooked this initially since I did not zero in on the right thread, the one that has ''start'' and ''XRE_main'' or ''exit''. Without this it looks like Firefox is spending most of its time waiting on threads and Mach messages. | |||
==== Timing startup ==== | |||
A clear picture courtesy of Shark | |||
<pre> | |||
0.0% 92.8% firefox-bin start | |||
0.0% 92.8% firefox-bin main | |||
0.0% 91.6% XUL XRE_main | |||
0.0% 40.8% XUL nsAppStartup::CreateHiddenWindow() | |||
0.0% 34.7% XUL ScopedXPCOMStartup::Initialize() | |||
0.0% 6.4% XUL ScopedXPCOMStartup::~ScopedXPCOMStartup() | |||
0.0% 3.4% XUL ScopedXPCOMStartup::SetWindowCreator(nsINativeAppSupport*) | |||
0.0% 2.4% XUL nsXPTCStubBase::Stub3() | |||
0.0% 0.8% XUL nsAppStartupNotifier::Observe(nsISupports*, char const*, unsigned short const*) | |||
0.0% 0.8% HIToolbox GetCurrentEventKeyModifiers | |||
0.0% 0.6% XUL nsXREDirProvider::DoStartup() | |||
0.0% 0.5% XUL RemoveComponentRegistries(nsIFile*, nsIFile*, int) | |||
0.0% 0.4% XUL NS_NewToolkitProfileService(nsIToolkitProfileService**) | |||
0.0% 0.3% XUL nsToolkitProfile::Lock(nsIProfileUnlocker**, nsIProfileLock**) | |||
0.0% 0.1% XUL WriteVersion(nsIFile*, nsCString const&, nsCString const&, nsIFile*, nsIFile*) | |||
0.0% 0.1% XUL nsINIParser_internal::Init(nsILocalFile*) | |||
0.0% 0.1% XUL LaunchChildMac | |||
0.0% 0.4% XUL XRE_GetBinaryPath | |||
0.3% 0.4% XUL XRE_CreateAppData | |||
0.4% 0.4% XUL NS_LogInit_P | |||
0.0% 6.2% dyld _dyld_start | |||
1.0% 1.0% libSystem.B.dylib _exit | |||
</pre> | |||
=== September 14, 2009 === | |||
==== Breaking startup into smaller chunks ==== | |||
<pre> | |||
bsmedberg:joelr: I really think that "Ts" is still much too large of a chunk to make meaningful improvements on | |||
bsmedberg:joelr: in order to make meaninful discoveries, I think you have to break startup down into pieces and then maybe improve each piece | |||
joelr:bsmedberg: i'm listening | |||
bsmedberg:well, you've got various stuff that happens up until you start the event loop | |||
bsmedberg:breaking XRE_main into phases up until nsAppStartup::Run would be interesting | |||
bsmedberg:I still haven't seen a graph or anything of how we break down | |||
bsmedberg:joelr: I think stepping through in a debugger might give you a better sense of what's going on (up until Run()) than profiles would | |||
bsmedberg:even if you broke things up into "from main() to the beginning of event loop" and "after the event loop starts" that would be more interesting | |||
bsmedberg:joelr: and then within the first chunk, timing NS_InitXPCOM, the extension manager bits, chromereg startup | |||
bsmedberg:joelr: I think this intersects with some of the startup-timeline stuff taras was doing | |||
joelr:bsmedberg: what are the functions for "extension manager bits" and "chromereg startup"? | |||
bsmedberg:joelr: basically from http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#2607 to #3453 | |||
joelr:bsmedberg: thanks for the suggestion, i'll do that | |||
bsmedberg:joelr: #3382 to #3408 or so | |||
</pre> | |||
==== XUL reflow ==== | |||
<pre> | |||
bz:joelr: so I'm still wondering about some of the reflow stuff | |||
bz:joelr: I know it's hard, but that's why no one's really dived into it yet | |||
bz:joelr: and I'd love to know whether we do in fact end up looping significantly in sprocket layout during startup, for example | |||
joelr:bz: i'm still wondering about reflow myself but i'm not quite ready to tackle it yet | |||
joelr:bz: don't know enough | |||
</pre> | |||
==== XPT_ArenaMalloc and zero-filling memory ==== | |||
Closed [https://bugzilla.mozilla.org/show_bug.cgi?id=516241 Bug 516241]. There's no evidence that zero-filling memset is a bottleneck. DTrace cannot latch onto the return from memset on Snow Leopard, so I can't time it. All I know is that it's called 133,245 times during startup and zero-fills 14,851,995 bytes of memory. | |||
<pre> | |||
bsmedberg:joelr: do you have evidence that the calloc is hurting? It really | |||
doesn't sound like something that would take a lot of time | |||
bsmedberg:there's no IO, and arena pages are roughly page-sized, right? | |||
joelr:bsmedberg: i don't have evidence that it's hurting, not at all. i just | |||
see it in the startup path a lot. | |||
bsmedberg:"see it" in what way? | |||
joelr:bsmedberg: there's no IO but arena pages are not page-sized. they are 1k | |||
blocks whereas pages are 4k | |||
joelr:bsmedberg: i'm dtracing startup, all functions, that's how i see it | |||
bsmedberg:it's less than a page, then... you really shouldn't be having | |||
memory-thrashing issues with the calloc | |||
bsmedberg:joelr: you mean it takes time, or its just a callcount? | |||
bsmedberg:I'm a little skeptical that the arena actually helps at all, but | |||
that's a different matter. | |||
joelr:bsmedberg: call count, i suppose. i'm also skeptical that arena helps at | |||
all since mac osx has it's own memory allocation optimizations | |||
bsmedberg:And the xpt code has no tests, so unles you have a measured | |||
performance problem I suggest not touching it. | |||
joelr:bsmedberg: i see (tests)... well, i'll look into it a little bit more | |||
since i want to measure a difference vs malloc (no zero-fill). it does crash | |||
with malloc at the moment so i won't be looking into it too much. | |||
bsmedberg:you'd have to basically zero-fill when you allocate items from the | |||
arena | |||
joelr:bsmedberg: because those items expect to be zero-filled? | |||
bsmedberg:seems so, if you're crashing with malloc... | |||
bsmedberg:joelr: ok, well, I'm saying "you have no evidence this bug is worth | |||
fixing, please move on" | |||
</pre> | |||
=== September 11, 2009 === | |||
==== Reflows ==== | |||
<pre> | |||
taras: joelr: btw, you gave up on reflows? | |||
taras: that makes me sad | |||
taras: i was hoping we'd end up with some sort of tool to measure cost of laying out various xul components | |||
joelr: taras: i did not give up, i just don't know how to go about it yet | |||
taras: so people could profile their guis | |||
taras: instead of going at it blindly | |||
joelr: taras: i didn't hear your prayers, my apologies | |||
joelr: i'll make a note on the idea of a tool that measures guis, though | |||
joelr: taras: how do i measure? from where to where? | |||
joelr: layout? reflow? something else? | |||
taras: joelr: it's simple, time how long they take | |||
taras: just see long functions take to run, ones that trigger all those fun reflow stacks | |||
</pre> | |||
==== Restarting from the ground up ==== | |||
I'm very comfortable with DTrace but still not comfortable with the Firefox code base. It will take me a while to grok reflow, for example, much less optimize it. I feel that there are startup speedups that can be achieved quicker, with a better return on investment (aka low-hanging fruit). | |||
Yes, reflow is in the hot path during startup but it's not clear if it's a major source of slowness. I would like to know what happens during Firefox startup, in as much detail as possible and DTrace has granted me my wish in the form of a 400Mb+ log file. The file lists all the function calls up to the return from the JS function BrowserStartup, e.g. | |||
<pre> | |||
... | |||
6200 . . -> ImageLoaderMachO::segIsReadOnlyImport(unsigned int) const(0x8FE46820, 0x1, 0xBFFFC1C8) | |||
6202 . . -> ImageLoaderMachOCompressed::segmentCommandOffsets() const(0x8FE46820, 0x0, 0x0) | |||
6203 81 1 <- ImageLoaderMachOCompressed::segmentCommandOffsets() const = 10 | |||
6205 270 4 <- ImageLoaderMachO::segIsReadOnlyImport(unsigned int) const = 116 | |||
6206 . . -> ImageLoaderMachO::segFileOffset(unsigned int) const(0x8FE46820, 0x1, 0xBFFFC1C8) | |||
6208 . . -> ImageLoaderMachOCompressed::segmentCommandOffsets() const(0x8FE46820, 0x0, 0x0) | |||
... | |||
</pre> | |||
There are obviously interesting bits here. How can we piggyback on the ''ImageLoaderMachOCompressed'' for example? | |||
Minimizing file system access provides a good return on investment since it apparently takes a few milliseconds on mobile devices. I'm focusing on this first and foremost since we do have file system access in the midst of dynamic linker static initialization, e.g. | |||
<pre> | |||
20974362 861470 708476 <- xptiSortFileList(void const*, void const*, void*) = 622 | |||
20974385 . . -> xptiSortFileList(void const*, void const*, void*)(0x3050F9C, 0x3050F7C, 0xBFFFEBC0) | |||
20974394 . . -> nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int)(0xBFFFE978, 0xBFFFE98C, 0x0) | |||
20974397 113 2 <- nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int) = 24 | |||
20974400 . . -> nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int)(0xBFFFE924, 0xBFFFE938, 0x0) | |||
20974402 112 2 <- nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int) = 24 | |||
20974424 . . -> nsLocalFile::GetNativeLeafName(nsACString_internal&)(0x3083600, 0xBFFFE978, 0x40) | |||
20974427 . . -> objc_exception_try_enter(0xBFFFE868, 0xE000002, 0x181A4) | |||
</pre> | |||
It may be interesting to study the startup path of Chromium and Safari. | |||
=== September 9, 2009 === | |||
==== Reflow optimization ==== | |||
I've been thinking about speeding up reflows ([https://bugzilla.mozilla.org/show_bug.cgi?id=514275 Bug 514275]) but did not make progress until I got Boris Zbarsky involved. There are tons and tons of classes that reflow and I'm completely unfamiliar with that code. | |||
Boris suggested restricting invocations of PresShell::ProcessReflowCommands to those that actually do work | |||
<pre> | |||
0 != mDirtyRoots.Length() | |||
</pre> | |||
on XUL documents | |||
<pre> | |||
FrameManager()->GetRootFrame()->GetFirstChild(nsnull) && | |||
FrameManager()->GetRootFrame()->GetFirstChild(nsnull)->GetType() == nsGkAtoms::rootFrame | |||
</pre> | |||
Looking at the [https://bug514275.bugzilla.mozilla.org/attachment.cgi?id=399254 25 reflows and timings] for the "Getting involved with Mozilla..." page, it's clear that some reflows are more expensive than others. It's also interesting that there are [https://bug514275.bugzilla.mozilla.org/attachment.cgi?id=399445 just 11 reflows] for a blank page. | |||
<pre> | |||
bz: 1) try to reduce number of reflows | |||
bz: (e.g. make text controls not reflow sync in this setup if we can; reduce the number of text value sets if we can't) | |||
bz: Or see whether the reflow events we get are really needed (e.g. if the load is not done yet, why are they happening?) | |||
bz: But separately, it looks like reflows seem to be generally expensive | |||
</pre> | |||
==== Performance measurement ==== | |||
Using [http://blog.vlad1.com/2009/07/28/measuring-startup/ Vlad's benchmark], I have 5.3s cold start and 0.9s warm start. Timing using the return from [http://mxr.mozilla.org/mozilla-central/source/browser/base/content/browser.js#903 BrowserStartup (JS)] gives me almost exactly the same results as Vlad's. | |||
<pre> | |||
javascript*:::function-return | |||
/copyinstr(arg2) == "BrowserStartup"/ | |||
{ | |||
exit(1); | |||
} | |||
</pre> | |||
The difference is about 50ms which copyinstr probably accounts for. | |||
=== September 3, 2009 === | |||
On reflow... | |||
<pre> | |||
bz:so there are a few things we could talk about: | |||
bz:1) make any given reflow pass faster (this is hard) | |||
bz:2) reduce number of reflow passes | |||
bz:s/this is hard/this may be hard/ | |||
bz:So one place to start would be seeing why we're doing reflow at all | |||
bz:that is, what things cause posting of reflow events and what things trigger reflow in their post-reflow callbacks | |||
bz:and seeing whether we can reduce or eliminate them. | |||
joelr:post-reflow callbacks? | |||
bz:see comment 1 in the bug | |||
bz:firefox has about 4000 elements in its ui | |||
bz:which means about 4000 boxes to reflow | |||
bz:modulo optimizations, etc | |||
bz:BoxReflow is the reflow of a single box | |||
bz:those are the two things to look into | |||
bz:reducing number of reflow passes | |||
bz:and speeding up individual passes | |||
bz:if you have info on what parts actually take up time during reflow, we should look into that | |||
bz:I would also be interested in the number of nsIFrame::Layout calls you see during startup | |||
bz:that's a more interesting metric to me | |||
bz:esp. if we can chunk it up per reflow pass | |||
</pre> | |||
A reflow pass is ''PresShell::ProcessReflowCommands'' which can re-enter itself via post-reflow callbacks. | |||
[https://bugzilla.mozilla.org/show_bug.cgi?id=514275 Bug 514275] | |||
=== September 2, 2009 === | |||
My previous stack traces were all bogus since symbols are stripped during packaging. This is all sorted out now using this config (for Snow Leopard): | |||
<pre> | |||
export CC="gcc-4.2 -arch i386" | |||
export CXX="g++-4.2 -arch i386" | |||
ac_add_options --target=i386-apple-darwin8.0.0 | |||
ac_add_options --with-macos-sdk=/Developer/SDKs/MacOSX10.5.sdk | |||
ac_add_options --enable-macos-target=10.5 | |||
export HOST_CC="gcc-4.2" | |||
export HOST_CXX="g++-4.2" | |||
export RANLIB=ranlib | |||
export AR=ar | |||
export AS=$CC | |||
export LD=ld | |||
export STRIP="strip -x -S" | |||
export CROSS_COMPILE=1 | |||
export MOZ_DEBUG_SYMBOLS=1 | |||
. $topsrcdir/browser/config/mozconfig | |||
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/objdir-release-profiling | |||
mk_add_options MOZ_MAKE_FLAGS=-j2 | |||
ac_add_options --disable-tests | |||
ac_add_options --disable-crashreporter | |||
ac_add_options --enable-dtrace | |||
ac_add_options --enable-libxul | |||
ac_add_options --enable-optimize | |||
ac_add_options --enable-debugger-info-modules | |||
ac_add_options --disable-strip | |||
ac_add_options --disable-install-strip | |||
</pre> | |||
I'm getting 3-5s from cold start to return from ''BrowserStartup'' (JavaScript), e.g. | |||
<pre> | |||
./d firefox-bin startup.d stop-js.d | |||
dtrace -Zqw -x dynvarsize=64m -x evaltime=preinit -p 55752 -s startup.d -s stop-js.d -s sigcont.d | |||
Total: 5000.294085ms | |||
./d firefox-bin startup.d stop-js.d | |||
dtrace -Zqw -x dynvarsize=64m -x evaltime=exec -p 55855 -s startup.d -s stop-js.d -s sigcont.d | |||
Total: 3387.623720ms | |||
</pre> | |||
The hot path for those 5s looks like this: | |||
<pre> | |||
./d firefox-bin hotpath.d stop-js.d | |||
dtrace -Zqw -x dynvarsize=64m -x evaltime=preinit -p 55765 -s hotpath.d -s stop-js.d -s sigcont.d | |||
kill -CONT 55765 | |||
XUL`nsFrame::BoxReflow(nsBoxLayoutState&, nsPresContext*, nsHTMLReflowMetrics&, nsIRenderingContext*, int, int, int, int, int) 1382 | |||
firefox-bin`0x0 623 | |||
XUL`nsHTMLInputElement::GetReadOnly(int*) 53 | |||
XUL`nsOverflowContinuationTracker::Insert(nsIFrame*, unsigned int&) 48 | |||
XUL`MapAttributesIntoRule(nsMappedAttributes const*, nsRuleData*) 24 | |||
XUL`nsBlockFrame::Reflow(nsPresContext*, nsHTMLReflowMetrics&, nsHTMLReflowState const&, unsigned int&) 20 | |||
XUL`nsColumnSetFrame::ReflowChildren(nsHTMLReflowMetrics&, nsHTMLReflowState const&, unsigned int&, nsColumnSetFrame::ReflowConfig const&, int, nsCollapsingMargin*, nsColumnSetFrame::ColumnBalanceData&) 18 | |||
XUL`DocumentViewerImpl::PageHide(int) 18 | |||
XUL`nsBlockFrame::DoReflowInlineFrames(nsBlockReflowState&, nsLineLayout&, nsLineList_iterator, nsFlowAreaRect&, int&, nsFloatManager::SavedState*, int*, LineReflowStatus*, int) 14 | |||
XUL`non-virtual thunk to nsHTMLInputElement::Release() 11 | |||
... | |||
</pre> | |||
=== August 30, 2009 === | |||
Using my "Lego blocks" [http://github.com/wagerlabs/firefox-startup/tree/master DTrace toolchain] I thought I would check the "hot paths" through Firefox code during the first 20 seconds of startup. | |||
<pre> | |||
./d firefox-bin hotpath.d stop-20s.d | |||
dtrace -Zqw -x dynvarsize=64m -x evaltime=preinit -p 1478 -s hotpath.d -s stop-20s.d -s sigcont.d | |||
kill -CONT 1478 | |||
XUL`gfxContextAutoSaveRestore::Reset(gfxContext*) 2766 | |||
firefox-bin`0x0 2234 | |||
XUL`operator new(unsigned long, void*) 348 | |||
XUL`gfxImageSurface::Height() const 335 | |||
XUL`gfxPattern::Release() 60 | |||
XUL`DumpJSValue 44 | |||
XUL`gfxContextAutoSaveRestore::gfxContextAutoSaveRestore(gfxContext*) 29 | |||
XUL`gfxUserFontSet::Release() 17 | |||
XUL`gfxImageSurface::Data() const 16 | |||
XUL`gfxContextPathAutoSaveRestore::~gfxContextPathAutoSaveRestore() 8 | |||
</pre> | |||
I'm sampling at 997Hz so that's roughly 2.7s for the top result. A small script is all it took to see where ''gfxContextAutoSaveRestore::Reset'' is being called from. | |||
<pre> | |||
BEGIN | |||
{ | |||
i = 0; | |||
} | |||
pid$target::gfxContextAutoSaveRestore??Reset(gfxContext*):return | |||
/ i < 10 / | |||
{ | |||
printf("arg0: 0x%x, arg1: 0x%x\n", arg0, arg1); | |||
ustack(); | |||
i++; | |||
} | |||
pid$target::gfxContextAutoSaveRestore??Reset(gfxContext*):return | |||
/ i >= 10 / | |||
{ | |||
exit(0); | |||
} | |||
</pre> | |||
The [http://gist.github.com/177980 stack traces] (first one of ten below) include ''gfxImageSurface::Height'' and ''DumpJSValue''. | |||
<pre> | |||
./d firefox-bin autosave.d | |||
dtrace -Zqw -x dynvarsize=64m -x evaltime=preinit -p 1527 -s autosave.d -s sigcont.d | |||
kill -CONT 1527 | |||
arg0: 0x90, arg1: 0x1a71da8 | |||
XUL`gfxContextAutoSaveRestore::Reset(gfxContext*)+0x90 | |||
XUL`gfxImageSurface::Height() const+0x2d976 | |||
XUL`gfxImageSurface::Height() const+0x135b6 | |||
XUL`gfxImageSurface::Height() const+0x107b7 | |||
XUL`non-virtual thunk to nsPrintSession::QueryInterface(nsID const&, void**)+0x53d16 | |||
XUL`non-virtual thunk to nsPrintSession::QueryInterface(nsID const&, void**)+0x53f1b | |||
XUL`NS_GetComponentRegistrar_P+0x21cd | |||
XUL`NS_GetComponentRegistrar_P+0x2324 | |||
XUL`NS_GetComponentRegistrar_P+0x4964 | |||
XUL`non-virtual thunk to nsPrintSession::QueryInterface(nsID const&, void**)+0x4b3fd | |||
XUL`non-virtual thunk to nsPrintSession::QueryInterface(nsID const&, void**)+0x4b850 | |||
XUL`non-virtual thunk to nsPrintSession::QueryInterface(nsID const&, void**)+0x4a9e1 | |||
XUL`DumpJSValue+0x1dfbab | |||
XUL`DumpJSValue+0x1e0103 | |||
XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0xc9e8e | |||
XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0xb64b4 | |||
XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0xbb918 | |||
XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0x151bdf | |||
XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0x14e234 | |||
XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0x14ed99 | |||
</pre> | |||
''gfxImageSurface::Height'' is low-hanging fruit, defined as | |||
<pre> | |||
PRInt32 Height() const { return mSize.height; } | |||
</pre> | |||
I would suggest making it inline and checking startup time again. | |||
''DumpJSValue'' is trickier because it appears to be a [http://mxr.mozilla.org/mozilla-central/source/js/src/xpconnect/src/nsXPConnect.cpp#2714 JavaScript API function]. Still, it's called during the static initialization portion of dynamic linking and all that printing is not helping in a release version of Firefox. | |||
<pre> | |||
./d firefox-bin dump-js-value.d | |||
dtrace -Zqw -x dynvarsize=64m -x evaltime=preinit -p 1579 -s dump-js-value.d -s sigcont.d | |||
kill -CONT 1579 | |||
XUL`DumpJSValue+0x3917 | |||
XUL`DumpJSValue+0x3b92 | |||
XUL`0x111c146 | |||
XUL`0x111c161 | |||
dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&)+0xeb | |||
dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0xf6 | |||
dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0x9a | |||
dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&)+0x3d | |||
dyld`dyld::initializeMainExecutable()+0x86 | |||
dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**)+0x105d | |||
dyld`dyldbootstrap::start(macho_header const*, int, char const**, long)+0x30b | |||
dyld`_dyld_start+0x27 | |||
firefox-bin`0x1000 | |||
</pre> | |||
I suppose the invocation should only be compiled in in a debug version of Firefox but I'm still searching for that invocation! | |||
=== August 28, 2009 === | |||
Eric Gouriou on my prior use of ''dtrace -c...'': | |||
<pre> | |||
I am afraid that the data collected using this methodology is highly suspect. | |||
- firefox-bin gets run as root | |||
This is not a normal environment for an application. There are plenty of high-level | |||
frameworks which are not quite happy running in that environment, in good part because | |||
the root user will not have a typical home directory setup and due to differences in environment | |||
variables. | |||
This is a known limitation of dtrace -c . Feel free to add your voice via bugreporter.apple.com | |||
- A typical launch is not just fork/exec or posix_spawn | |||
Even if dtrace -c could magically launch using a regular user, it still isn't a typical launch. | |||
dtrace uses fork/exec or posix_spawn (I didn't check which one), while a typical launch | |||
(Dock, Finder, open) uses LaunchServices. It isn't all that different but you might as well | |||
address both issues together. | |||
I'd recommend reworking your script to watch the exec-success of firefox-bin and use | |||
a static probe (no need for $target then) to mark the end time. Then you can use a more normal | |||
launch method, for example open .../Minefiled.app . | |||
</pre> | |||
I did [http://github.com/wagerlabs/firefox-startup/blob/75f719f3a3cdb380071ef8c354cf52564ac4754d/d rework my cold start script] which can be used like this now: | |||
<pre> | |||
./d firefox-bin startup.d stop-main.d | |||
dtrace -Zqw -x dynvarsize=64m -x evaltime=exec -p 35331 -s sigcont.d -s startup.d -s stop-main.d | |||
kill -CONT 35331 | |||
Total: 2000.275985ms | |||
</pre> | |||
The above assumes that I launch Minefield.app after ''d'', from the dock, Finder, command line, etc. | |||
=== August 27, 2009 === | === August 27, 2009 === | ||
Line 131: | Line 605: | ||
I don't have an explanation yet but 9 seconds is a very large difference. | I don't have an explanation yet but 9 seconds is a very large difference. | ||
=== August 26, 2009 === | === August 26, 2009 === | ||
Line 374: | Line 846: | ||
sudo ./cold.sh static-init.d | sudo ./cold.sh static-init.d | ||
</pre> | </pre> | ||
= Shark tips and tricks = | |||
=== Expanding the call tree === | |||
Alt-Click does it. | |||
=== CHUD signposts on Snow Leopard === | |||
There are no CHUD header files on Snow Leopard. If you're only interest is in emitting signposts, the easiest thing for you to do is switch to the syscall method instead. That'll also mean you won't have to link against the CHUD framework. The full details are in the Shark User Guide, but in brief: | |||
<pre> | |||
int err = syscall(SYS_kdebug_trace, APPSDBG_CODE(DBG_MACH_CHUD, <your code number>) | <type>, arg1, arg2, arg3, arg4); | |||
</pre> | |||
Types are DBG_FUNC_START, DBG_FUNC_END and DBG_FUNC_NONE (for point signposts). | |||
Don't forget to include <sys/kdebug.h> and <sys/syscall.h> for the necessary macros and function. | |||
= DTrace tips and tricks = | = DTrace tips and tricks = | ||
=== DTrace overhead and proper measuring technique === | |||
[http://www.mail-archive.com/dtrace-discuss@opensolaris.org/msg03639.html I'm still not done!] | |||
=== Function entry/return vs static probes === | |||
Q: I added a couple of static probes to Firefox to measure actual work done. I could have used a pid$target probe with a function name but work is done within an if statement, which is where I placed the static probes. I'm wondering about my use, though. Is the following significantly more | |||
efficient than pid$target::FunName:entry and return? | |||
A: No, it's not significantly more efficient. Both function entry/return probes and USDT probes involve replacing an instruction in the stream with a trap instruction. The expense of the trap into the kernel is going to overwhelm any subtleties in handling between the two different types of probes. --Chad Mynhier | |||
=== Combining scripts === | |||
I find it convenient to have separate DTrace scripts for various exit probes. This way I can avoid editing the main script whenever I want to stop sampling once 10 seconds go by or ''XRE_main'' is entered or the ''BrowserReturn'' JavaScript function returns. | |||
It can be done like this | |||
<pre> | |||
#!/bin/bash | |||
cmd="./Minefield.app/Contents/MacOS/firefox-bin -no-remote -foreground -P 2" | |||
scripts="" | |||
for i in $*; do scripts="$scripts -s $i"; done | |||
sync && purge && dtrace -x dynvarsize=64m -x evaltime=exec -c "$cmd" -wZ $scripts | |||
</pre> | |||
and then | |||
<pre> | |||
./cold.sh startup.d stop-main.d | |||
Total: 11001.481323ms | |||
</pre> | |||
or | |||
<pre> | |||
./cold.sh startup.d stop-js.d | |||
Total: 7004.799264ms | |||
</pre> | |||
My scripts can be [http://github.com/wagerlabs/firefox-startup/tree/master found on GitHub]. | |||
=== No sudo === | === No sudo === |
edits