Shark User Guide (Legacy)
Figure 6-22 Data Mining Contextual Menu 168
Figure 6-23 After Focus Symbol -[SKTGraphicView drawRect:] 169
Figure 6-24 After focus and expansion 170
Figu
few disclosure triangles open, this view lets you logically follow your code paths until you reach a point where they call blocking library routines. A
Note regarding launched target processes: When launching a process (as described in Process Launch (page 122)) with Time Profile (All Thread States),
3. Start Delay: Specify a length of time that Shark should wait after being told to start collecting a profile before the collection begins. If the prog
is often a good idea to look over the routines near the top of this list and make sure that both the routines allocating memory are the ones you think
allocation and deallocation operations outside of loops, so that you can reuse the same memory buffers repeatedly without reallocating them each time t
Advanced Display Options
Each Malloc Trace records a few additional pieces of information at each allocation event. These are not displayed by default,
When you enable display of a particular type of data, it will appear in several places. First, columns displaying it will appear in the Profile Browser
Static Analysis
Most of Shark’s profiling methods limit their code analysis to those functions that appear dynamically in functions that are executed du
● PowerPC Model: Selects the PowerPC model to use when searching for and assigning problem severities.
● Intel Model: Selects the Intel model to use w
Sun’s Java virtual machine included with Mac OS X do provide an interface that Shark can use. As a result, Shark includes some special, Java-only confi
Figure 8-8 PowerPC 970 IMC (IFU) Configuration Tab 217
Figure 8-9 PowerPC 970 IMC (IDU) Configuration Tab 221
Figure 8-10 U1.5/U2 Configuration Tab 223
F
● Java Alloc Trace: This records memory allocations and the sizes of the objects allocated, and is analogous to a regular Malloc Trace (Malloc Trace (p
Event Counting and Profiling Overview
After analyzing an application using a Time Profile, you may find it informative to count system events or evensa
Counter Spreadsheet Advanced Settings (page 116)). Selecting rows in this list also selects the corresponding columns in the counter table, graphing th
e. Shortcut Result Column(s): These columns show the performance counter results after they have been processed by the math in any “shortcut” equations.
see Adding Shortcut Equations (page 119), below. For a complete description of how to write performance counter equations, including how to add them pe
The Counters Menu
When you switch to the Counters tab in a session made with timed performance counters, a Counters menu will appear in the menu bar. Yo
Performance Counter Spreadsheet Advanced Settings
With the session window in the foreground, select Window > Show Advanced Settings (Command-Shift-M), a
This drawer contains three main panels, each with many different controls that affect the presentation of results:
1. Counter Shortcut Equations: This ta
● Bars: Display results using a vertical bar chart. Bars from multiple selected columns will be superimposed over one another.
● Stacks of Bars: Displa
Adding Shortcut Equations
This section gives a brief summary of how to add new “shortcut equation” results columns to your performance counter spreadshe
Retired Document | 2012-07-23 | Copyright © 2012 Apple Inc. All Rights Reserved.
Note: The built-in L2 cache miss profile configuration is a great way to find lines in your code that access memory in ways that cause very slow L2 ca
While none of the default configurations use this capability, it is also possible to essentially record callstacks like a Time Profile simultaneously w
Although the Start button makes starting and stopping Shark quite simple, sometimes it can be impractical, or even impossible to use. For example, how
Process Attach mode, by selecting Process from the Target popup (Command-2). Now, select the “Launch...” target from the top of the process list (or us
2. Working Dir: The full path to the working directory that the application will start using. By default, this is the path where the executable is locat
Batch Mode
Batch mode queues up any sessions recorded without displaying them. Pending sessions are listed in the main Shark window. Batch mode allows m
leaving the area of interest. However, you may not always know when you will encounter the “interesting”region of your program in advance. WTF mode si
Tracing the execution around an asynchronous event, such as inter-thread communication, the arrival of a network packet, or an OS event such as a page fau
Second, the beginning of a WTF System Trace Timeline (see Figure 5-6) can appear a bit strange; different processors might first appear at vastly diffe
Sampling > Unresponsive Applications menu item (Command-Shift-A). When Unresponsive Application Triggering is enabled, Shark will automatically switch to Batc
Important: This document may not represent best practices for current development. Links to downloads and other resources may no longer be valid.
Overv
general, it is intended that you use it to collect sessions and then review your results with a graphical copy of Shark later. This section will discus
Remote Mode
A third way to use command line shark is remote mode, which works much like the remote mode supported by graphical Shark and described in In
● Time Interval: shark -I allows you to change the sampling interval for configurations that support a sampling interval. Valid times are entered the
Reports
Command line shark supports generation of textual reports, either from session files that you’ve already created, or from new sessions as they a
More Information
This section has presented some of the most common options and techniques for using command-line shark. For more detailed information o
It is important to keep in mind that many profiling techniques used by Shark employ statistical sampling in order to generate a profile. If the samplin
sprintf(label_str, "Hanoi #%d", i);
chudStartRemotePerfMonitor(label_str);
Hanoi('A','B','C',i);
chudStopRemotePe
The Towers of Hanoi test program demonstrates the need for a sampling interval that is much shorter than the time between the calls to start and stop S
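The relationship between the sampling interval and the length of a start/stop region can be illustrated with a little arithmetic. The following is a minimal sketch, not part of Shark or CHUD: the function names and the 100-sample threshold are assumptions chosen for illustration, and times are given in whole microseconds to keep the arithmetic exact.

```python
def expected_samples(region_us: int, interval_us: int) -> int:
    """Roughly how many timer samples fall inside one start/stop region."""
    if interval_us <= 0:
        raise ValueError("sampling interval must be positive")
    return region_us // interval_us


def interval_is_adequate(region_us: int, interval_us: int, min_samples: int = 100) -> bool:
    """True when the region is long enough to collect at least min_samples samples.

    min_samples is a hypothetical threshold, not a value taken from Shark.
    """
    return expected_samples(region_us, interval_us) >= min_samples


if __name__ == "__main__":
    # A 50 ms region sampled every 1 ms collects about 50 samples.
    print(expected_samples(50_000, 1_000))
    # A 0.5 ms region sampled every 1 ms will usually collect none at all.
    print(expected_samples(500, 1_000))
```

When the region between the start and stop calls is shorter than the interval, most runs record no samples for it, which is why a much shorter interval is needed for short regions.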
When used to stop profiling, chudRemoteCtrl will not return until Shark has stopped profiling. In the case of command-line shark, chudRemoteCtrl will n
Important: Shark cannot capture symbol information on the iPhone itself, so “raw” sessions recorded from an iPhone will appear in Shark labeled only b
2. It must be relevant. Optimizing functionality that is rarely used is usually counter-productive.
3. It shows up as a hot spot in a time profile. If th
● Control network profiling of shared computers: Any computers on the network (in the local domain) running Shark in “shared” network mode will automa
● Config: The currently active Sampling Configuration on the shared computer. The entries in this column are menus, just like the one in Shark’s main w
then respond to network requests to start and stop profiling. A sample transcript of a remote command line shark in “Network Sharing” mode is shown in
Mac OS X Firewall Considerations
The sharing firewall on Mac OS X can prevent Shark’s network profiling from working in either sharing or control mode.
Click the Sharing... button in the warning dialog to bring up the System Preferences window Sharing tab. Otherwise click the Ignore button to dismiss t
Often, the profile analysis windows can provide you with a very helpful view of your application’s behavior using the default settings.
If symbol lookup fails, Shark may present the missing “symbols” in two different ways. If the memory of the process is readable — for example, a binary
require debugging information to work, but it can be much more helpful if it’s available. In case you record a Shark session and discover that symbols
No matter which way you choose to get here, you will be presented with a Symbolication dialog (Figure 2-20).
Figure 6-2 Symbolication Dialog
Use this di
Shark will warn you if you select a binary that is potentially problematic. If you do happen to select an executable that isn’t a good match, the profi
● Getting Started with Shark: This introduction and Getting Started with Shark (page 17) are designed to give you an overall introduction to Shark. Aft
Figure 6-4 After Symbolication
Managing Sessions
If you have multiple sessions measuring the same application, it is possible to use Shark to compare or
When used, a new session is created from two existing ones: Session A and Session B. The first session (Session A) is given a negative scaling factor,
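One way to picture a comparison built from scaling factors is as a weighted sum of per-symbol sample counts. The sketch below is an illustration only and does not reproduce Shark's implementation: the dictionary representation, the function name, and the default factors of -1 for Session A and +1 for Session B are all assumptions (the text above only states that Session A receives a negative factor).

```python
def combine_sessions(session_a, session_b, scale_a=-1.0, scale_b=1.0):
    """Combine two {symbol: sample_count} profiles into one scaled profile.

    With the assumed defaults, a positive result means the symbol gathered
    more samples in Session B than in Session A.
    """
    combined = {}
    for symbol, samples in session_a.items():
        combined[symbol] = combined.get(symbol, 0.0) + scale_a * samples
    for symbol, samples in session_b.items():
        combined[symbol] = combined.get(symbol, 0.0) + scale_b * samples
    return combined


if __name__ == "__main__":
    before = {"foo": 10, "bar": 5}   # hypothetical Session A
    after = {"foo": 4, "baz": 3}     # hypothetical Session B
    for symbol, delta in sorted(combine_sessions(before, after).items()):
        print(symbol, delta)
```

Symbols that appear in only one session still show up in the result, so routines that were added or eliminated between runs remain visible.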
Callstack Data Mining
In order to understand how to use data mining to better understand your application, it is necessary to first understand a few fun
large routines farther down the callstack that call many other routines in the course of their execution. Once you have a clear picture of how callstac
Figure 6-7 Tree View (call tree of main, foo, bar, cos, sqrt, and baz, each annotated with Total and Self sample counts)
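The Total and Self values shown in the tree view can be derived from raw callstack samples. The following is a minimal sketch under stated assumptions: each sample is represented as a tuple of symbol names ordered from the outermost caller to the sampled function, and the sample data is invented for illustration rather than taken from the figure.

```python
from collections import Counter


def self_and_total(callstacks):
    """Return (self_counts, total_counts) for a list of callstack samples.

    Self counts samples whose innermost frame is the symbol; Total counts
    samples in which the symbol appears anywhere on the stack.
    """
    self_counts = Counter()
    total_counts = Counter()
    for stack in callstacks:
        if not stack:
            continue
        self_counts[stack[-1]] += 1
        # Use a set so that recursion does not count one sample twice.
        for symbol in set(stack):
            total_counts[symbol] += 1
    return self_counts, total_counts


if __name__ == "__main__":
    samples = [
        ("main", "foo", "bar", "cos"),
        ("main", "foo", "bar", "sqrt"),
        ("main", "baz", "bar", "sqrt"),
        ("main", "baz"),
        ("main", "baz"),
    ]
    self_counts, total_counts = self_and_total(samples)
    for symbol in sorted(total_counts):
        print(symbol, "Total:", total_counts[symbol], "Self:", self_counts[symbol])
```

A routine such as main has a large Total and a Self of zero because it only calls other routines, while leaf routines have equal Total and Self.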
in controlled ways. For example, you often won’t care about the exact places that samples occur within Mac OS X’s extensive libraries — only which of yo
a flag such as ‘–g’ with GCC or XLC, and in the process eliminating a lot of user-level code that you probably do not have control over. Samples from c
9. Focus Callers of Symbol X: Removes functions called by the specified symbol and removes callstacks that do not contain the specified symbol.
10. Focus
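The Focus Callers of Symbol X operation described in item 9 can be sketched as a filter over callstack samples. This is an illustrative sketch rather than Shark's implementation: stacks are assumed to be tuples ordered from outermost caller to innermost function, and cutting at the first occurrence of the symbol (which matters only for recursive stacks) is an assumption.

```python
def focus_callers(callstacks, symbol):
    """Keep only stacks containing symbol, trimmed so symbol is the last frame."""
    focused = []
    for stack in callstacks:
        if symbol in stack:
            # Drop everything the symbol called, keeping the symbol and its callers.
            cut = stack.index(symbol) + 1
            focused.append(tuple(stack[:cut]))
    return focused


if __name__ == "__main__":
    samples = [
        ("main", "foo", "bar", "cos"),
        ("main", "baz", "bar", "sqrt"),
        ("main", "baz"),
    ]
    for stack in focus_callers(samples, "bar"):
        print(" > ".join(stack))
```

After this filter, every remaining stack ends at the chosen symbol, so the profile shows only the paths by which that symbol was reached.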
The Perf Count Data Mining palette also supplies a global enable/disable toggle, much like the one available with conventional data mining, and check b
2. Make four shapes as shown in Figure 6-11.
Figure 6-11 Example Shapes
3. Repeat the following steps until the app becomes sluggish (takes a half second or
Counter Event List (page 252), PPC 750 (G3) Performance Counter Event List (page 263), PPC 7400 (G4) Performance Counter Event List (page 265), PPC 745
This should take 8-10 repetitions (maybe more), depending on hardware. When you are done, it should look similar to Figure 6-12.
Figure 6-12 Example Sha
This reveals a third pop-up button that you can use to target your application. Select Sketch from the list of running applications.
Figure 6-13 Sampling
High Level Analysis
The session window gives you by default a summary of all the functions that the sampler found samples in and the percentage of the s
3. Click on the callstack button on the lower right corner of the table to reveal the callstack pane, as shown in Figure 6-15. As you click on symbols o
2. Double click on the symbol -[SKTGraphicView selectAll:] in the tree view above. You will see a source window that looks like Figure 6-17.
Figure 6-17 So
3. Double-click on the yellow colored line to navigate to the function (performSelector) called here. When the new source window comes up, double-click
4. Double-click on the yellow colored line [self performSelector: sel withObject:[array objectAtIndex:i]]; and you'll get Figure 6-19:
Figure 6-19 So
5. Double-click on [self invalidateGraphic:graphic]; and you'll get Figure 6-20. This contains one line of expensive code that tests for nested obj
Introduction To Focusing
This example will take us through analyzing the behavior of drawing the selected rectangles. Here, we will develop ideas for an
5. Choose "Focus Symbol -[SKTGraphicView drawRect:]" and you will get something that looks like Figure 6-23.
Figure 6-23 After Focus Symbol -[SKT
Starting to use Shark is a relatively simple process. You only need to choose one or two items from menus and press a big “Start” button in order to st
6. Expand -[SKTGraphicView drawRect:] in the bottom outline a few times until it looks like Figure 6-24:
Figure 6-24 After focus and expansion
There a
7. Double click on -[SKTGraphic drawInView:isSelected] to see the source, as shown in Figure 6-25:
Figure 6-25 Source View: SKTGraphic drawInView:isSelec
8. Double click on line 406 on the text -[self drawHandlesInView: view] and you'll get Figure 6-26:
Figure 6-26 Source View: SKGraphic drawHandlesIn
9. Double click on line 502 in the text [self drawHandleAtPoint: ...] and it will take you to the code for [SKTGraphicview drawHandleAtPoint: ...] which
2. We're going to work with the “Heavy View” (the upper profile) for a bit. So click the and set it back to .
3. Select the first symbol in the upper
5. In the left hand outline select the symbol ripd_mark and control+click on it to bring up the data mining contextual menu. Choose "Charge Library
This example is a bit simplistic, but it shows the power of the exclusion operations to strip out unnecessary information and identify where the real c
3. Target your application and choose “Malloc Trace” instead of “Time Profile,” as with Figure 6-32.
Figure 6-32 Malloc Trace Main Window
4. Switch back to
The window should look like Figure 6-33, if you have gone through Tutorial 1 first. Otherwise, it will look similar but not exactly the same.
Figure 6-3
Graphical Analysis of a Malloc Trace
1. Click on the Chart Tab and you'll get a window that looks like Figure 6-34.
Figure 6-34 Chart View
The lower g
● Malloc Trace: If your program allocates and deallocates a lot of memory, performance can suffer and the odds of accidental memory leaks increase. Sha
2. Select the first hump just before sample 6,000 and enlarge it, as shown in Figure 6-35:
Figure 6-35 Place to Select
The yellow indicates the tenure of
3. Now use the slider on the bottom left of the window to adjust zoom. Play with this a bit. As you zoom in and out you'll see that there are multi
4. We'll finish up with another good application of this graphical analysis. Click on the call stack button to reveal the call stack for this sampl
Up until now, you have been using the configuration menu in Shark’s main window (in Figure 7-1) to select from various built-in sampling methods. Each
The Config Editor
The Configuration Editor lets you individually modify settings for any of Shark’s modules, which are called PlugIns. The properties av
● You can Rename any custom config in the list, but not built-in config files. A renamed config will be changed in the appropriate Configs folder immed
● In Advanced mode, all of the available plugins are listed with a checkbox next to each indicating whether or not it is enabled in the current config.
● Sampling Tab: The controls on this tab (see Figure 7-3) determine when to start and stop recording samples.
1. Windowed Time Facility: If enabled, Sha
column to select the performance counter mode (None, Counter, or Trigger). Only a small subset of possible counter options are available here. For more
Malloc Data Source PlugIn Editor
The Malloc data source is used for the Malloc Trace config described in Malloc Trace (page 101). It is used for collect
Mini Configuration Editors
Each configuration typically has a few parameters that are frequently modified. Shark allows you to edit these easily using t
Static Analysis Data Source PlugIn Editor
The Static Analysis data source is used by the Static Analysis default configuration, described in Static Anal
4. Processor Settings: Shark needs to know which model of processor is your target before it can examine code and find potential problems. Separate menu
Sampler Data Source PlugIn Editor
The Sampler data source provides the same functionality as the separate Sampler application and command-line tool. It
System Trace Data Source PlugIn Editor
This data source collects data for the System Trace default configuration, described in System Tracing (page 63).
All Thread States Data Source PlugIn Editor
This data source collects data for the Time Profile (All Thread States) default configuration, described in
Analysis and Viewer PlugIn Summary
All Data Source PlugIns include configuration editors. However, most of the analysis and viewer editors do not. While
● System Trace: Timeline: This can only be used with the “System Trace” data source and analysis PlugIns. It displays the Timeline tab used by System T
This view contains the following constituent parts:
1. PMC Summary Table: This table summarizes all the performance counters (PMCs) that are currently s
Shortcut Equation Terms | Description
Represents a summation of results from all processors on counter-Y. For example: pNc1 is the term that represents event
Spreadsheet Configuration Example
Because this editor is very flexible and powerful, an example can be helpful to illustrate how it might be used.
Start
Contents
Introduction 13
Overview 13
Philosophy 13
Organization of This Document 14
Getting Started with Shark 17
Main Window 17
Mini Configuration Editors 1
Note: Occasionally you may notice a small delay while Shark allocates the sample buffers it needs to record data, due to time spent in the Mac OS X vi
2. Next search the list by typing “INST” into the search field, as is shown in Figure 7-13. Select the “INST_RETIRED” entry and change the mode to “Coun
Next, enter the equation pNc3/pNc2, as is shown in Figure 7-14. This will automatically calculate the number of cycles per completed instruction, or CP
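The arithmetic behind the pNc3/pNc2 shortcut equation can be sketched outside of Shark. In the sketch below, each pNcY term is computed as the sum of counter Y across all processors, as described for shortcut equation terms. The per-processor dictionary layout, the function names, and the choice of counter 3 for cycles and counter 2 for completed instructions are assumptions inferred from the equation, not Shark's data format.

```python
def p_n_c(counts_by_processor, counter):
    """Sum one counter across all processors (the pNcY shortcut term)."""
    return sum(per_cpu.get(counter, 0) for per_cpu in counts_by_processor)


def cycles_per_instruction(counts_by_processor, cycles_counter=3, instructions_counter=2):
    """Evaluate pNc<cycles> / pNc<instructions> for one sample row."""
    instructions = p_n_c(counts_by_processor, instructions_counter)
    if instructions == 0:
        return float("nan")  # no completed instructions in this sample
    return p_n_c(counts_by_processor, cycles_counter) / instructions


if __name__ == "__main__":
    # Hypothetical counts for one sample on a two-processor machine.
    sample = [
        {2: 1000, 3: 1500},  # processor 0
        {2: 1000, 3: 2500},  # processor 1
    ]
    print(cycles_per_instruction(sample))
```

Guarding against a zero instruction count avoids a division error for samples taken while the processors were idle.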
The different CPUs and North bridge chipsets available in Macintosh systems have widely varying performance monitoring capabilities. Because there are
Once you have decided which counters you want to measure, and thought a bit about how you might want to control sampling, there are several configurati
● Sample Limit: Sets the maximum number of samples to record. Specifying a maximum of N samples will result in at most N samples being taken on a unip
● chudRecordUserSample: A sample is recorded for every call to the CHUD.framework chudRecordUserSample() function. This is analogous to using signposts
Common Elements in Performance Counter Configuration Tabs
All of the various performance counter configuration tabs have many unique elements, as the v
3. Sample Interval: This is the number of events that must occur before this PMC will trigger sampling. It is ignored unless this particular counter has
You can mark processes with Shark’s Process Marker (Figure 8-3). The Process Marker can be opened via the Sampling > Mark Process menu item. Shark disab
● Scheduler Events: Events such as context switches, “thread ready” events, and stack handoffs
● Disk I/O Events: Disk reads and writes, with optional
Shark allows you to work with multiple sampling sessions at a time, displaying a separate window for each session. This is useful for comparing two or
both count a similar but not identical list of events on the programmable processors. Full event listings are provided in Intel Core Performance Counte
bit-names in the mask list. Any bit in the list labeled *Reserved* should not be enabled. A brief summary of which bits are active for any particular e
Figure 8-6 shows the single configuration tab for the G4+ processor (the one for the G3 and G4 is virtually identical, but lacks PMCs 5–6). For the mos
Warning: If you leave branch folding disabled and exit Shark, branch folding will remain disabled. While this will not cause any correctness problems or
In addition, several additional controls are provided. Most are multiplexer controls to switch the various event pre-filtering multiplexers, but the la
8. TB Select: This is the divider used for timebase events that cause processor exceptions, and selects from four different division ratios. More inform
to count it. Please note that as long as an instruction resides in the L1 instruction cache, its match bit will remain unchanged. Hence, if the match c
Due to the very flexible and complex nature of these mechanisms, it is highly recommended that you read the pertinent sections of the PowerPC 970 Docum
2. IOP Marking: This pre-filter will limit the type of internal PowerPC microinstructions (IOPs) that are matched or sampled.
● All IOPs: (default) Any
5. Major Opcode Bits: This allows you to select marked instructions on the basis of their six major opcode bits (bits 0–5 of each PowerPC instruction).
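To make the phrase "bits 0–5" concrete: PowerPC numbers bits from the most significant end, so the six major opcode bits are the top six bits of each 32-bit instruction word. The sketch below extracts them and compares them against a pattern. The pattern string with "x" as a don't-care position is a hypothetical notation for illustration and is not the format used by Shark's configuration tab.

```python
def major_opcode(instruction: int) -> int:
    """Return the six major opcode bits (the top six bits of a 32-bit word)."""
    return (instruction >> 26) & 0x3F


def matches_major_opcode(instruction: int, pattern: str) -> bool:
    """Compare the major opcode with a 6-character pattern of '0', '1', or 'x'.

    'x' marks a bit position that is ignored (hypothetical notation).
    """
    if len(pattern) != 6:
        raise ValueError("pattern must describe exactly six bits")
    bits = format(major_opcode(instruction), "06b")
    return all(p in "xX" or p == b for p, b in zip(pattern, bits))


if __name__ == "__main__":
    mflr_r0 = 0x7C0802A6      # major opcode 31
    addi_r3_r3_1 = 0x38630001  # major opcode 14
    print(major_opcode(mflr_r0), major_opcode(addi_r3_r3_1))
    print(matches_major_opcode(addi_r3_r3_1, "0x1110"))
```

Because many PowerPC instructions share major opcode 31 and differ only in their extended opcode, a match on the major opcode alone selects a family of instructions rather than a single one.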
Note: Shark’s session files have slowly evolved and changed over time, as new features have been added that made it difficult to keep backwards-compat
● BSFL column: This lists the BSFL (Branch instruction, instruction that will be Split, First instruction in a dispatch group, and Last instruction in
● 1: Match this bit position with 1. Normally only desired if the corresponding IMRMASK bit is 1, or if you want to intentionally match nothing.
Figure
settings, we strongly suggest that you start using the “Simple” settings at first, as described in Simple Timed Samples and Counters Config Editor (pag
4. FireWire/Enet: The dedicated FireWire and Ethernet I/O ports
Figure 8-10 U1.5/U2 Configuration Tab
U3 North Bridge
This section describes how you can m
b. Write: Only store requests to memory can increment the counter.
c. Read: Only load requests from memory can increment the counter.
d. Any: All memory
e. AGP: The AGP interface
Figure 8-11 U3 Memory Configuration Tab
Figure 8-12 shows the second of U3’s two configuration tabs, the API configuration
2. Divider PopUp: This is the same as the Divider PopUp on the memory tab.
Figure 8-12 U3 API Configuration Tab
U4 (Kodiak) North Bridge
This section de
b. Write: Only store requests to memory can increment the counter.
c. Read: Only load requests from memory can increment the counter.
d. Any: All memory
Figure 8-14 shows the second of U4’s two configuration tabs, the API configuration panel. As with the memory tab, the first line of each PMC’s controls
ARM11 CPU Performance Counter Configuration
This section describes how you can make custom configurations for iOS devices with ARM11 processors. These d
1. Basic Statistics: This section of the pane contains basic information about the system at the time the session was recorded. The system’s name, the
Menu Reference
This section summarizes Shark’s commands, arranged by menu.
Shark
This menu contains the usual application-menu commands.
Where Described | De
Where Described | Description | Shortcut | Command
 | Close the frontmost window. If the frontmost window is the main control window, this will quit Shark. | Cmd-W | Close
Description | Shortcut | Command
Redo the next action. | Shift-Cmd-Z | Redo
Cut the selected text, placing it on the clipboard. | Cmd-X | Cut
Copy the selected text to the
Format
All items in this menu are standard text processing commands. Since it is generally not possible to apply custom formats to most text within Shar
Where Described | Description | Shortcut | Command
Mini Configuration Editors (page 19) | Show/Hide the mini config editor attached to the main control window. | Shift-C
Where Described | Description | Shortcut | Command
Network/iPhone Profiling (page 138) | Enable Network Profiling of other computers or iPhones, instead of local prof
Window
Along with standard window control functionality, this contains the command to show or hide the Advanced Settings drawer on the right side of eac
Menu | Where Described | Description | Shortcut | Command
Sampling | Batch Mode (page 125) | Toggles Batch mode, allowing the recording of multiple sessions before analysis be
Menu | Where Described | Description | Shortcut | Command
Data Mining | Data Mining (page 151) | Just hide the selected library(ies), without adding time to the callers. | Shift-C
Menu | Where Described | Description | Shortcut | Command
File | Session Files (page 21) | Attach a copy of the frontmost session to a new email in your default email program.
Advanced Settings menu item (Command-Shift-M). An example is depicted below in Main Window. The controls presented will vary depending upon the current
Menu | Where Described | Description | Shortcut | Command
Data Mining | Data Mining (page 151) | Hide all callstacks which contain the selected symbol(s). | Cmd-K | Remove Callstack
Menu | Where Described | Description | Shortcut | Command
Config | Mini Configuration Editors (page 19) | Show/Hide the mini config editor attached to the main control window
Code Analysis with the G5 (PPC970) Model
Shark offers several features designed to help the programmer understand instruction execution behavior on the
note that the data in the G5 Resource Utilization drawer is based on the currently selected instructions in the Code Table, or on the entire code seque
any timer interrupts that occur in it are not serviced until interrupts are reenabled in ml_restore(). It is for this reason that all of the timer samp
A more accurate picture of the kernel behavior can be seen with event sampling (Figure B-3). This is because CPU event sampling reads the SIAR (sampled
Intel’s Core processors have 2 performance counters per core. Both are programmable, and can count 111 (#1) or 112 (#2) different types of events.
Most
Valid Event-Mask Bits | PMC Number | Event Number | Performance Counter Event Name
none | 1,2 | 147 | BR_CALL_MISSP_EXEC
none | 1,2 | 139 | BR_CND_EXEC
none | 1,2 | 140 | BR_CND_MISSP_EXEC
non
Valid Event-Mask Bits | PMC Number | Event Number | Performance Counter Event Name
6 | 1,2 | 101 | BUS_TRAN_BRD
5 6 7 | 1,2 | 110 | BUS_TRAN_BURST
5 6 7 | 1,2 | 109 | BUS_TRAN_DEF
6 | 1,2 | 104 | BUS_T
Valid Event-Mask Bits | PMC Number | Event Number | Performance Counter Event Name
none | 1,2 | 215 | EMON_ESP_UOPS
0 1 | 1,2 | 218 | EMON_FUSED_UOPS_RET
0 1 | 1,2 | 7 | EMON_KNI_PREF_DISPATC
3. Alternating/Solid Table Background: For tabular session window views, such as the profile browsers and code browsers described in Profile Browser (pa
Valid Event-Mask Bits | PMC Number | Event Number | Performance Counter Event Name
none | 1,2 | 192 | INST_RETIRED
none | 1,2 | 133 | ITLB_MISS
0 1 2 3 | 1,2 | 64 | L1_CACHEABLE_DATA_READS
0 1
Valid Event-Mask Bits | PMC Number | Event Number | Performance Counter Event Name
none | 1,2 | 176 | MMX_INSTR_EXEC
0 1 2 3 4 5 | 1,2 | 179 | MMX_INSTR_TYPE_EXEC
none | 1,2 | 177 | MMX_SAT_I
Intel’s Core 2 processors have 5 performance counters per core. Two of these are fully programmable, and can count 116 (#1) or 115 (#2) different types
Valid Event-Mask BitsPMC NumberEvent NumberPerformance Counter Event Name02none1146BR_CALL_EXEC02none1147BR_CALL_MISSP_EXEC02none1139BR_CND_EXEC02none
Valid Event-Mask BitsPMC NumberEvent NumberPerformance Counter Event Name02none1145BR_RET_BAC_MISSP_EXEC02none1143BR_RET_EXEC02none1144BR_RET_MISSP_EX
Valid Event-Mask BitsPMC NumberEvent NumberPerformance Counter Event Name5 6 7199BUS_LOCK_CLOCKS (Core and BusAgents masks apply)0 6 724 5 6 7196BUS_R
Valid Event-Mask BitsPMC NumberEvent NumberPerformance Counter Event Name5 6 71106BUS_TRANS_PWR0 725 6 71103BUS_TRANS_WB026 71125BUSQ_EMPTY0 720 1 6 7
Valid Event-Mask BitsPMC NumberEvent NumberPerformance Counter Event Name020 1 3 51119EXT_SNOOP020217FP_ASSISTnone116FP_COMP_OPS_EXE0 11204FP_MMX_TRAN
Valid Event-Mask BitsPMC NumberEvent NumberPerformance Counter Event Name020 1 2 3 4166L1D_CACHE_LOCK020 1 2 3165L1D_CACHE_ST02none171L1D_M_EVICT02non
Valid Event-Mask BitsPMC NumberEvent NumberPerformance Counter Event Name020 1 2 3 6 7140L2_IFETCH020 1 2 3 4 5 6 7141L2_LD0 6 724 5 6 7136L2_LINES_IN
3. Remain in Background: Shark normally brings itself to the front when sampling completes. This means that it will be the main application while it an
Valid Event-Mask BitsPMC NumberEvent NumberPerformance Counter Event Name02none176LOAD_HIT_PRE0 2 320 21195MACHINE_NUKES0 2 3 4231170MACRO_INSTS.CISC_
Valid Event-Mask BitsPMC NumberEvent NumberPerformance Counter Event Name020 1 2 31213SEG_REG_RENAMES0 2 3 4 520 1 2 31212SEG_RENAME_STALLS02none16SEG
Valid Event-Mask BitsPMC NumberEvent NumberPerformance Counter Event Name0 720 117SSE_PRE_EXEC0 2 3 520 1175SSE_PRE_MISS020 1 314STORES BLOCKED026 715
The PowerPC 750 (G3) cores contain four independent performance counters, each of which can count 12–17 different types of events. Four commonly measur
Event Number | PMC Number(s) | Performance Counter Event Name
9 | 1, 2 | Instr Bkpt Matches
2 | 1, 2, 3, 4 | Instr Completed
4 | 1, 2, 3, 4 | Instr Dispatched
8 | 1, 2 | Instr Fetches
1
The PowerPC 7400 (G4) cores contain four independent performance counters, each of which can count 27–48 different types of events. Four commonly measu
Event NumberPMC Number(s)Performance Counter Event Name153Branch Unit LR/CTR Stall Cycles371Branch Unit Speculative Load Stall Cycles131Branch Unit Sp
Event NumberPMC Number(s)Performance Counter Event Name183dL1 Cycles241dL1 Hits221dL1 Load Hits152dL1 Load Misses111dL1 Miss Cycles > Threshold172d
Event NumberPMC Number(s)Performance Counter Event Name51EIEIO Instr421External Snoop Requests52Fall through Branches113Floating Point Instr213Full Ca
Event NumberPMC Number(s)Performance Counter Event Name361L2 Allocations441L2 Castout Snoop Hits292L2 Sectors Castout273L2 Snoop Hits133L2 Snoop Inter
1. Ask About Unsaved Sessions: With Shark, you can optionally disable the usual behavior of asking if you want to individually save each session file wh
Event NumberPMC Number(s)Performance Counter Event Name114SYNC Instr142System Register Unit Instr31, 2, 3, 4TimeBase (Lower) 0->1 bit transitions40
The PowerPC 7450 (G4+) cores contain six independent performance counters, each of which can count 20–94 different types of events. CPU cycles can be m
Event NumberPMC Number(s)Performance Counter Event Name181, 2AltiVec MFVSCR Instr Sync Cycles131, 2, 4AltiVec MTVRSAVE Instr121, 2, 4AltiVec MTVSCR In
Event NumberPMC Number(s)Performance Counter Event Name476Bus Retry from L1 Retry486Bus Retry from Prev-Adjacent426Bus TA's for Reads436Bus TA&ap
Event NumberPMC Number(s)Performance Counter Event Name531dL1 Load Hits213dL1 Load Miss Cycles372dL1 Load Misses431dL1 Load-Miss Cycles > Threshold
Event NumberPMC Number(s)Performance Counter Event Name183DTLB Misses234DTLB Search Cycles401DTLB Search Cycles > Threshold256DTQ Full351EIEIO Inst
Event NumberPMC Number(s)Performance Counter Event Name911FPSCR Renames 1/2 Busy901FPSCR Renames 1/4 Busy921FPSCR Renames 3/4 Busy931FPSCR Renames All
Event NumberPMC Number(s)Performance Counter Event Name196L1 External Interventions106L2 Castout Queue Full Cycles86L2 Castouts206L2 External Interven
Event NumberPMC Number(s)Performance Counter Event Name145, 6L3 Touch Hits376176L3 Write Queue Full Cycles731LD/ST Alias vs. CSQ721LD/ST Alias vs. FSQ
Event NumberPMC Number(s)Performance Counter Event Name292LWARX Instr302MFSPR Instr284Mispredicted Branches361MTSPR Instr01, 2, 3, 4, 5, 6Nothing51, 2
1. Source: Shark will usually find source files automatically if they are not moved between compilation and session viewing times. If you must move the
Event NumberPMC Number(s)Performance Counter Event Name224Store String/Multi Pieces272STSWI/STSWX/STMW Instr153STWCX Instr254Successful STWCX Instr214
Event NumberPMC Number(s)Performance Counter Event Name243VTE2 Line Fetches264VTE3 Line Fetches511Write-Through StoresPPC 7450 (G4+) Performance Count
The PowerPC 970 (G5) cores contain an extremely sophisticated and complex set of performance counters. Unlike the other processors used in Macintoshes,
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName1: TTM00: FPU30[FPU] fp0 estimate + fp1estimate1: TTM00: FPU3, 4, 7
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName0: TTM00: FPU1, 2, 5, 623[FPU] fp1 add, mult, sub,compare, fsel2: T
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName2: TTM11: GPS1, 2, 5, 630[GPS] Cacheable store queue full1: TTM11:
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName0: TTM11: GPS50[GPS] L2 miss on store access (R,S, I) + I=1 store o
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName1: TTM11: GPS3, 4, 7, 821[GPS] Master L2 read transactionon bus was
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName3: TTM11: GPS3, 4, 7, 825[GPS] Snoop state machinedispatched3: TTM1
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName3: TTM00: IFU432[IFU] cycles i L1 write active +nothing2: TTM00: IF
The first and most frequently used Shark configuration is the Time Profile. This produces a statistical sampling of the program’s or system’s execution
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName0: TTM00: ISU10[ISU] completion table full + crmapper full0: TTM11:
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName3: TTM00: ISU432[ISU] duration MSR(EE) = 0 +MSR(EE)=0 and interrupt
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName3: TTM00: ISU332[ISU] fx0 produced a result + fx1produced a result3
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName2: TTM11: ISU2: TTM00: ISU632[ISU] instructions dispatchedcount + g
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName0: TTM11: ISU0: LSU01, 2, 5, 618[LSU0] d erat miss side 00: LSU060[
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName1: LSU030[LSU0] marked flush from LRQshl, lhl side 0 + marked flush
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName2: LSU0532[LSU0] marked L1 d cache storemiss + larx executed 02: LS
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName3: LSU1 6|70: LSU13: LSU1 2|360[LSU1] flush from LRQ shl,lhl side0
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName0: LSU13: LSU1 2|31, 2, 5, 616[LSU1] flush unaligned load side03: L
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName0: LSU13: LSU1 2|31, 2, 5, 621[LSU1] flush unaligned store side13:
as taking an entire time quantum balances out the numerous times that it is missed entirely, providing a fairly accurate measurement of the time spent
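The balancing effect described above can be illustrated with a small deterministic simulation. This is only a sketch under stated assumptions: the workload model (a routine that runs for a fixed burst at the start of every period) and the names routine_is_running and count_sample_hits are hypothetical and are not part of Shark.

```c
#include <stdint.h>

/* Hypothetical workload: the routine of interest is on the CPU during the
 * first burst_us microseconds of every period_us microseconds. */
static int routine_is_running(uint64_t t_us, uint64_t period_us,
                              uint64_t burst_us)
{
    return (t_us % period_us) < burst_us;
}

/* Take one sample every interval_us and count how many samples land while
 * the routine is running. A sampling profiler charges one whole interval
 * to the routine for each such hit. */
uint64_t count_sample_hits(uint64_t num_samples, uint64_t interval_us,
                           uint64_t period_us, uint64_t burst_us)
{
    uint64_t hits = 0;
    for (uint64_t k = 0; k < num_samples; k++) {
        if (routine_is_running(k * interval_us, period_us, burst_us))
            hits++;
    }
    return hits;
}
```

With a 1000 µs sampling interval, a 997 µs period, and a 100 µs burst, 997 samples produce 100 hits. Charging 1000 µs per hit gives 100,000 µs, which matches the time the routine really ran during those 997,000 µs, even though most individual bursts were never sampled. Real workloads are not this regular, so actual profiles only approximate this.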
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName1: LSU13: LSU1 2|340[LSU1] L1 d cache store miss +L1 dcache entries
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName3: LSU1 2|73: LSU1 6|33: LSU1 6|71: LSU13: LSU1 2|33, 4, 7, 816[LSU
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName1: LSU13: LSU1 2|33, 4, 7, 821[LSU1] L1 dcache store side 13: LSU1
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName3: LSU13: LSU1 2|7832[LSU1] L1 reload data source +Marked L1 reload
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName3: LSU13: LSU1 2|33, 4, 7, 830[LSU1] LMQ slot 0 allocated3: LSU1 6|
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName2: LSU13: LSU1 6|31, 2, 5, 630[LSU1] LS1 reject - reload cdf ortag
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName2: LSU13: LSU1 2|31, 2, 5, 624[LSU1] SRQ store forwarding side03: L
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName2: TTM00: VMX1, 2, 629[VMX] forwarding occurred fromperm or alu or
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName45CPU Marked Instruction finish51Dispatch Successes3: LSU117dL2 Hit
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName2: TTM00: FPU16Instr Src Encode 0 (Lane 2 notset to IFU)0: ISU0: VM
sampling mechanism are spread out to affect most areas of measured execution more or less equally. In contrast, most event counting-based mechanisms, s
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName0: ISU0: IFU0: VMX2: LSU10: FPU0: ISU0: IFU0: VMX2: TTM00: FPU36Ins
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName2: TTM00: FPU46Instr Src Encode 3 (Lane 2 notset to IFU)0: ISU0: VM
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName0: ISU0: IFU0: VMX2: LSU10: FPU0: ISU0: IFU0: VMX2: TTM00: FPU66Ins
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName2: TTM00: FPU76Instr Src Encode 6 (Lane 2 notset to IFU)0: ISU0: VM
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName0: ISU0: IFU0: VMX2: LSU10: FPU0: ISU0: IFU0: VMX11Instructions Com
Byte LaneNumberTTM MuxNumberPMCNumber(s)EventNumber(s)Performance Counter EventName410Overflow from PMC3510Overflow from PMC4610Overflow from PMC5710O
The U1.5 and U2 North bridge chipsets contain four independent counters, each of which can count any one of 55 different types of events. The table list
Event Number  Performance Counter Event Name
72            Burst Read Reqs [Bus]
73            Burst Write Reqs [Bus]
65            Burst Xacts [Bus]
91            Cache Inhib. Xacts [Bus]
94            Cycles Addr Bu
Event Number  Performance Counter Event Name
98            Read Prefetch Ops [Mem]
69            Read Xacts [Bus]
97            Retries on Maxbus [Bus]
86            Single Beat Mem Reads [Bus]
75            Single Be
The U3 North bridge chipsets contain two distinct sets of counters. The first set of counters counts memory events, in a manner similar to the counters
5. Sample Limit — The maximum number of samples to record. Specifying a maximum of N samples will result in at most N samples being taken, even on a mul
Event Number  API Performance Counter Event Name
0x00          API Cycles
0xFF          Nothing
0x03          Queue Reservations
0x01          Queue Transactions
0x05          Retries
0x04          Transaction Size (by
Source Number  API Event Source Name
0x200          Master Tag: API0
0x400          Master Tag: API0 and API1
0x300          Master Tag: API1
0xA00          Master Tag: HT
0x900          Master Tag: PCI
0x800
Source Number  API Event Source Name
0x00           Synchronization Queue
0x15           Vsp Coh Rd Rq Queue
0xA0           Vsp Rd Data Queue
0x0F           Vsp Response Queue
0x0A           Vsp Target Rq Queue
0x
The U4/Kodiak North bridge chipsets contain two distinct sets of counters. The first set of counters counts memory events, in a manner similar to the c
Event Number  Memory Performance Counter Event Name
83            Issued transfer size (accumulate events, no filters)
97            Non-coherent read request [RT #24253] (count
Source Number  API Event Source Name
0x28           API Wt Data Buffer
0x10           Bypass Queue
0x01           Command Slot
0x27           GCR Rd Data Queue
0x0B           GCR Response Queue
0x08           GCR Target Rq Q
Source Number  API Event Source Name
0x0D           PCIE Coh Wt Rq Queue
0x25           PCIE Rd Data Queue
0x05           PCIE Rd Target Rq Queue
0x09           PCIE Response Queue
0x21           PCIE Wt Data Que
The ARM11 cores used in iOS devices contain three independent performance counters. The first counter can count only cycle counts, while the other two
Event Number  PMC Number(s)  Performance Counter Event Name
15            2-3            Main TLB miss
35            2-3            Procedure call instruction executed
38            2-3            Procedure return instruction exe
This table describes the changes to Shark User Guide.
Notes Date
TBD
2008-04-14 New document that explains how to analyze code performance by profiling the
menu (#8), if you would rather see the “Tree” view, which is described in Tree View (page 36) and organizes the sample groups according to the program’
Apple Inc.
Copyright © 2012 Apple Inc.
All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, i
The Edit > Find > Find command (Command-F) and the related Edit > Find > Find Next (Command-G) and Edit > Find > Find Previous (Command-Shift-G) commands are very useful
e. Symbol — The symbol where this sample was located. Most of the time, this is the name of the function or subroutine that was executing when the sample
6. Process Popup Menu — This lists all of the sampled processes, in order of descending number of samples in the profile, plus an “All” option at the top
The “Tree” view gives you an overall picture of the program calling structure. In the sample profile (Figure 2-8), the top-level function is [CelestiaO
Note on Heavy/Tree comparisons: Please note that there may not be a one-to-one correspondence between entries in “Tree” view and “Heavy” view. If you
deep callstacks being over-represented in the profile, since they are counted many times, but makes it easier to find symbols for frequently-occurring
Chart View
Click Shark’s Chart tab to explore sample data chronologically, from either a thread- or CPU-based perspective. This can help you understand
1. Callstack Chart — This chart displays the depth (y-axis) of the callstack for each sample, chronologically from left-to-right over time (x-axis). The
6. Callstack Table — This displays the functions within the callstack for the currently selected sample, with the leaf function at the top and the base o
13. View Popup Menu — This popup lets you choose to view sets of samples from different processor cores.
Advanced Chart View Settings
The first pane of th
7. Color Selection — Choose colors to use for user sample callstacks, kernel sample callstacks, and the selection area by clicking on these color wells.
F
Code Browser
Double-clicking on an entry in the Results Table or Callstack Table will open a Code Browser view for that entry, as shown in Figure 2-12.
2. Browse Buttons — You can use these buttons to maneuver through function calls. After you double-click on a function call (denoted by blue text) and go
b. Total — This optional column lists the percentage of displayed references for each instruction or source line, including called functions. To see sam
9. Source File Popup Menu — A given memory range can contain source code from more than one file because of inlining done by the compiler. You can select
a. Address Column — This displays the address of the assembly-language instruction displayed on this row. With PowerPC, this value simply increases by 4
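Because every PowerPC instruction is 4 bytes long, the address shown for a row follows directly from the address of the first instruction and the row's position. The helper below is a minimal sketch of that arithmetic; the name ppc_instruction_address is hypothetical and is not a Shark or CHUD call.

```c
#include <stdint.h>

/* PowerPC instructions have a fixed 4-byte encoding. */
enum { PPC_INSTRUCTION_BYTES = 4 };

/* Address of the instruction that sits `index` instructions after the one
 * at first_address (index 0 is the first instruction itself). */
uint64_t ppc_instruction_address(uint64_t first_address, uint64_t index)
{
    return first_address + index * PPC_INSTRUCTION_BYTES;
}
```

For example, the fourth instruction (index 3) of a routine that starts at 0x1000 is at 0x100C. The same shortcut does not apply to x86, where instruction lengths vary.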
4. Asm Help Button — Press this button to get help for the selected assembly-language instruction, as described in ISA Reference Window (page 54).
Figure
6. Show Self Column — Toggles display of the column that lists the percentage of displayed references for each instruction or source line, but not includ
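The difference between the Self and Total columns can be made concrete with a small call tree. This is a sketch under stated assumptions: the struct node layout and the functions total_samples and percent_of are hypothetical illustrations, not Shark's internal representation.

```c
#include <stddef.h>

/* Hypothetical record: samples that landed directly on a line or function,
 * plus the functions it calls. */
struct node {
    unsigned self_samples;        /* samples in this entry only */
    const struct node *children;  /* called functions, may be NULL */
    size_t child_count;
};

/* Total = self samples plus the totals of everything called from here. */
unsigned total_samples(const struct node *n)
{
    unsigned total = n->self_samples;
    for (size_t i = 0; i < n->child_count; i++)
        total += total_samples(&n->children[i]);
    return total;
}

/* Express a sample count as a percentage of the displayed samples. */
double percent_of(unsigned samples, unsigned displayed)
{
    return displayed ? 100.0 * samples / displayed : 0.0;
}
```

If an entry has 20 samples of its own and calls two functions with 30 and 50 samples, then out of 100 displayed samples its Self value is 20% and its Total value is 100%.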
5. Show G5 (PPC970) Details Drawer — (PowerPC-only) Shark can display graphs of instruction dispatch slot and functional unit utilization in an additiona
Figure 2-15 Advanced Settings for the Code Browser
Other architectures have slightly different options for items 3–5 of the Asm Browser
● Syntax — Chooses whether to display the x86 instructions in Intel assembler syntax or AT&T syntax (the default).
● Show Prefixes — If checked, instr
The ISA Reference Window provides an indexed, searchable interface to the PowerPC, IA-32 (32-bit x86), or EM64T (64-bit x86) instruction sets. The refe
Tips and Tricks
This section points out a few things that you might see while looking at a Time Profile, what they may mean, and how to optimize your c
● Chart View
● Different parts of the chart look visibly different: Different-looking areas were probably created by different code in your program as i
Shark. Please note that in Xcode you will need to adjust the build settings for the Target that you are testing and the correct (optimized) build confi
After compiling and running the reference decoder, Shark generated the session displayed in Figure 2-19. Just by pressing the “Start” and “Stop” button
Vectorization
Optimizing the Reference_IDCT() function by converting it from floating point to integer also presented another possible optimization that
Add_Block()), colorspace conversion (dither()), and pixel interpolation (conv420to422() and conv422to444()) achieved a speedup of 5.69x over the origin
Speedup  Optimization Step
1.12x    Fast floor()
1.86x    Integer IDCT
2.05x    Vector IDCT
5.69x    All Vector
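To see how much each step contributed on its own, the speedups in the table can be divided by the speedup of the step before. The sketch below assumes that every figure in the table is measured against the original, unoptimized decoder (the text states this only for the final 5.69x figure); the function name incremental_speedups is hypothetical.

```c
#include <stddef.h>

/* Convert speedups measured against the original code into the gain each
 * step adds over the step before it. The first step is compared to 1.0x. */
void incremental_speedups(const double *cumulative, double *incremental,
                          size_t n)
{
    double previous = 1.0;
    for (size_t i = 0; i < n; i++) {
        incremental[i] = cumulative[i] / previous;
        previous = cumulative[i];
    }
}
```

Applied to 1.12, 1.86, 2.05, and 5.69, this gives roughly 1.12x, 1.66x, 1.10x, and 2.78x. Under that assumption, vectorizing the remaining routines accounts for the largest single gain, while vectorizing the IDCT alone adds about 10% over the integer version.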
Shark’s System Trace configuration records an exact trace of system-level events, such as system calls, thread scheduling decisions, interrupts, and vi
and multithreading problems, because these issues frequently hinge upon managing the precise timing of interaction events properly in order to minimize
● Start Time
● Stop Time
● A backtrace of the user-space function calls (callstack) associated with each event
● Additional data customized depending on
Out of memory errors?: If you see these when starting a system trace, then just reduce the Sample Limit value until Shark is able to successfully allo
Summary View In-depth
The Summary View is the starting point for most types of analysis, and is shown in Figure 3-3. Its most salient feature is a pie c
Scheduler Summary
The Scheduler Summary tab, shown in Figure 3-4, summarizes the overall scheduling behavior of the threads running in the system during
Note on Thread IDs: Thread IDs on Mac OS X are not necessarily unique across the duration of a System Trace Session. The Thread IDs reported by the ke
Note on System Trace callstacks: In rare cases, it is not possible for System Trace to accurately determine the user callstack for the currently activ
More settings for modifying this display are available in the Advanced Settings drawer, and are described in Summary View Advanced Settings (page 71).
F
4. Callstack Data Mining — The System Call and VM Fault summaries support Shark’s data mining options, described in Data Mining (page 151), which can als
Trace View In-depth
The Trace View lists all of the events that occurred in the currently selected scope. Because events are most commonly viewed with “
● Reason — Reason that the thread tenure ended (described in Thread Run Intervals (page 79))
● Priority — Dynamic scheduling priority of the thread
Figure
occurred. Otherwise, the beginning and ending thread interval indices are listed. Because it is possible for an event to start before the beginning of
You can toggle the display of the Callstack Table, which displays the user callstack for the currently selected VM fault entry, by clicking the button
● Size — Number of bytes affected by the fault, an integral multiple of the 4096-byte system page size
Figure 3-10 Trace View: VM Faults
Timeline View In
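Since the Size value reported for a VM fault is always a whole number of 4096-byte pages, it converts directly into a page count. The helpers below are a minimal sketch of that conversion; the names is_whole_pages and pages_in_fault are hypothetical and are not part of Shark.

```c
#include <stdint.h>

/* System page size stated in the text above. */
enum { PAGE_SIZE_BYTES = 4096 };

/* Returns 1 when the size is an exact multiple of the page size. */
int is_whole_pages(uint64_t size_bytes)
{
    return size_bytes % PAGE_SIZE_BYTES == 0;
}

/* Number of pages covered by a fault of the given size. */
uint64_t pages_in_fault(uint64_t size_bytes)
{
    return size_bytes / PAGE_SIZE_BYTES;
}
```

A fault with a Size of 8192 bytes therefore touched two pages, while a value such as 5000 would not be a valid fault size under this rule.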
● Keyboard Navigation — After highlighting a Thread Run Interval by clicking on it, the Left or Right Arrow keys will take you to the previous or next r
Thread Run Intervals
Each time interval that a thread is actively running on a CPU is a thread run interval. Thread run intervals are depicted as solid
There are five basic reasons a thread will be switched out by the system to run another thread:
Blocked — The thread is waiting on a resource and has vo
MIG Message — Mach interface generator routines, which are usually only used within the kernel
Figure 3-14 Timeline View: System Calls
Calls from all of
● Arguments — The first four integer arguments
Figure 3-15 System Call Inspector
VM Faults
As is the case with almost all modern operating systems, Mac OS
Non-Zero Fill — A previously unused page not marked “zero fill on demand” was touched for the first time. Generally, this is only used in situations whe
Three of these types of faults are visible in Figure 3-16. A zero-fill fault is circled to highlight it. Clicking on a VM Fault Icon will bring up the
Clicking on an Interrupt icon will bring up the Interrupt Inspector. This inspector lists the amount of time the interrupt consumed, broken down by CPU
the amount of time spent on the CPU and time spent Waiting between the begin and end event. Since you can supply different arguments to the start and e
4. Draw Context Switch Lines — Check this to enable (default) or disable the thin gray lines that show context switches, linking the thread tenures befor
6. Label Events — These checkboxes allow you to enable or disable the display of event icons either entirely, by type group, or on an individual, type-by
Figure 3-20 Timeline View Advanced Settings Drawer
Retired Document | 2012-07-23 | Copyright © 2012 Appl
Sign Posts
Even with all of the system-level instrumentation already included in Mac OS X, you may sometimes find that it is helpful or even necessary t
● User Applications using CHUD Framework: User applications that link with CHUD.framework can simply call chudRecordSignPost(), which has the
Listing 3-2 signPostExample.c
#include <CHUD/CHUD.h>
#include <stdint.h>
/* This corresponds to the sign post defined above, LoopTimer */
#def
/*
 * Use the kernel_debug() method when in the kernel (arg5 is unused),
 * DBG_FUNC_START corresponds to chudBeginIntervalSignPost.
 */
kernel_debug(APPS_DE
It can also indicate that your threads frequently block while waiting for locks. In this case, it is possible that the short intervals are inherent to
● Multi-threaded application has only one thread running at a time: First of all, ensure you’ve performed the System Trace on a multiprocessor mac
Another possibility is that you’ve simply not given your worker threads enough work to do. Verify this theory using the tip from the summary view sugge
Not every performance problem stems from computation in a program or a program’s interaction with the operating system. For these other types of proble
5. Prefer User Callstacks — When enabled, Shark will ignore and discard any samples from threads running exclusively in the kernel. This can eliminate sp
showing you how much time your threads are blocked and how often they are running. As a result, it is a good “sanity check” technique to make sure that