===== Summary =====
  
With r227746 on a SPARC, the better scheduler depends on both the number of processes and the working set size, and whether you want to minimise total CPU time or wallclock time.
With r251496 on an E3 Xeon, it doesn't matter - they are virtually identical.
  
  
===== Details =====
  
The following represents the results of a synthetic benchmark run on
  - A 16-core [[http://docs.oracle.com/cd/E19095-01/sfv890.srvr/index.html|SunFire V890]] server ((8 dual-core 1350MHz [[http://en.wikipedia.org/wiki/UltraSPARC_IV|UltraSPARC-IV]] CPUs and 64GB RAM)) running FreeBSD 10-current((the CVS equivalent of r227746 with a few local mods provided by marius@ and mjacob@ to identify some issues with isp(4) and schizo interrupt handling)) - [[4BSD dmesg]], [[ULE dmesg]].  This is a server-grade [[http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access|NUMA]] SPARC system so the results aren't necessarily comparable with a multicore x86 desktop system.  Note that this system is no longer available so additional tests cannot be run.
  - An [[http://ark.intel.com/products/65732|Intel Xeon E3-1230v3]] (3.3GHz quad-core with hyperthreading) CPU in a [[http://www.supermicro.com/products/motherboard/Xeon/C202_C204/X9SCA-F.cfm|Supermicro X9SCA-F]] motherboard with [[http://www.kingston.com/datasheets/KVR16E11K4_32.pdf|32GB DDR3-1600 RAM]], running FreeBSD 10-current/amd64 r251496.
  
The benchmark runs multiple copies (processes) of a core that just repeatedly cycles through an array of doubles (to provide a pre-defined working set size), incrementing them.  The source code can be found at [[loop.c]].  The tests were run in single-user mode using both the 4BSD and ULE schedulers.
In all cases, the test was run 5 times with a varying number of processes and working-set sizes of 1KiB, 4MiB and 32MiB.
For the V890, 1, 2, 4, 6, 8, 10, 12, 14, 15, 16, 17, 18, 20, 24, 28, 31, 32, 33, 36, 40, 48, 56 and 64 processes were used.
For the Xeon, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 22, 24, 26, 28, 30 and 32 processes were used.
  
==== 1KiB Working Set ====
  
This is the case where everything fits into L1 cache.
1.6e9 iterations were used on the V890 and 5e9 iterations were used on the Xeon.
  
{{cpu_1.png|V890 CPU Time with 1KiB Working Set}}
{{xcpu_1.png|Xeon CPU Time with 1KiB Working Set}}
  
shows that ULE is very slightly more efficient than 4BSD and (pleasingly) that the amount of CPU time taken to perform a task is independent of the number of active processes for either scheduler.
  
{{wall_1.png|V890 Elapsed Time with 1KiB Working Set}}
{{xwall_1.png|Xeon Elapsed Time with 1KiB Working Set}}
  
shows that both schedulers are well-behaved until there are more processes than cores.
Once there are more processes than cores, 4BSD remains well-behaved, whilst ULE has significant jumps in wallclock time - i.e. the same set of tasks takes longer to run with ULE.  The following graph shows this more obviously.
  
{{eff_1.png|V890 Scheduler efficiency with 1KiB Working set}}
{{xeff_1.png|Xeon Scheduler efficiency with 1KiB Working set}}
  
This graph shows scheduler efficiency as a ratio of wallclock time to CPU time.
==== 4MiB Working Set ====
This is the case where a couple of processes fit into L2 cache and 1.2e9 iterations were used.
  
{{cpu_4.png|V890 CPU Time with 4MiB Working Set}}
{{xcpu_4.png|Xeon CPU Time with 4MiB Working Set}}
  
shows that both schedulers behave similarly for fewer than 4 processes and between 10 and 16 processes.
Beyond about 48 processes, 4BSD again takes the lead.
  
{{wall_4.png|V890 Elapsed Time with 4MiB Working Set}}
{{xwall_4.png|Xeon Elapsed Time with 4MiB Working Set}}
  
shows that both schedulers are well-behaved until there are more processes than cores.
Once there are more processes than cores, 4BSD remains well-behaved, whilst ULE has significant variations away from the diagonal slope - though unlike the 1KiB case, sometimes ULE does better than 4BSD.
  
{{eff_4.png|V890 Scheduler efficiency with 4MiB Working set}}
{{xeff_4.png|Xeon Scheduler efficiency with 4MiB Working set}}
  
The 4BSD scheduler maintains a fairly constant efficiency, with only a slight bump between 4 and 12 processes.
==== 32MiB Working Set ====
This is the case where every process is cache-busting; 5e8 iterations were used.
  
{{cpu_32.png|V890 CPU Time with 32MiB Working Set}}
{{xcpu_32.png|Xeon CPU Time with 32MiB Working Set}}
  
shows that ULE generally uses less CPU time, the exception being between 12 and 16 processes.
On the downside, the overall efficiency takes a significant hit, with a roughly 60% overhead for more than 16 processes.
  
{{wall_32.png|V890 Elapsed Time with 32MiB Working Set}}
{{xwall_32.png|Xeon Elapsed Time with 32MiB Working Set}}
  
shows that both schedulers behave similarly except between 16 and 30 processes, where ULE can run the tasks quicker.
  
{{eff_32.png|V890 Scheduler efficiency with 32MiB Working set}}
{{xeff_32.png|Xeon Scheduler efficiency with 32MiB Working set}}
  
The 4BSD scheduler maintains a fairly constant efficiency, with a slight bump between about 16 and 32 processes.
This shows the 1KiB, 4MiB and 32MiB results overlaid on the same graph.
  
{{cpu.png|V890 CPU Times}}

{{wall.png|V890 Elapsed Times}}

{{eff.png|V890 Scheduler efficiencies}}

{{xcpu.png|Xeon CPU Times}}

{{xwall.png|Xeon Elapsed Times}}

{{xeff.png|Xeon Scheduler efficiencies}}
ulevs4bsd.txt · Last modified: 2013/06/30 11:53 by peterjeremy
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International