===== Summary =====
With r227746 on a SPARC, the better scheduler depends on both the number of processes and the working set size, and whether you want to minimise total CPU time or wallclock time.
With r251496 on an E3 Xeon, it doesn't …
===== Details =====
The following represents the results of a synthetic benchmark run on:
  - A 16-core Sun Fire V890 (r227746)
  - An E3 Xeon (r251496)
The benchmark runs multiple copies (processes) of a core loop that just repeatedly cycles through an array of doubles (to provide a pre-defined working set size), incrementing them. The source code can be found at [[loop.c]].
In all cases, the test was run 5 times with a varying number of processes and working-set sizes of 1KiB, 4MiB and 32MiB.
For the V890, 1, 2, 4, 6, 8, 10, 12, 14, 15, 16, 17, 18, 20, 24, 28, 31, 32, 33, 36, 40, 48, 56 and 64 processes were used.
For the Xeon, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 22, 24, 26, 28, 30 and 32 processes were used.
==== 1KiB Working Set ====
This is the case where everything fits into L1 cache.
1.6e9 iterations were used on the V890 and 5e9 iterations were used on the Xeon.
{{cpu_1.png|V890 CPU Time with 1KiB Working Set}}
{{xcpu_1.png|Xeon CPU Time with 1KiB Working Set}}
shows that ULE is very slightly more efficient than 4BSD and (pleasingly) that the amount of CPU time taken to perform a task is independent of the number of active processes for either scheduler.
{{wall_1.png|V890 Elapsed Time with 1KiB Working Set}}
{{xwall_1.png|Xeon Elapsed Time with 1KiB Working Set}}
shows that both schedulers are well-behaved until there are more processes than cores.
Once there are more processes than cores, 4BSD remains well behaved, whilst ULE has significant jumps in wallclock time, i.e. the same set of tasks takes longer to run with ULE. The following graph shows this more obviously.
{{eff_1.png|V890 Scheduler efficiency with 1KiB Working set}}
{{eff_1.png|Xeon Scheduler efficiency with 1KiB Working set}}
This graph shows scheduler efficiency as a ratio of wallclock time to CPU time.
==== 4MiB Working Set ====
This is the case where a couple of processes fit into L2 cache and 1.2e9 iterations were used.
{{cpu_4.png|V890 CPU Time with 4MiB Working Set}}
{{xcpu_4.png|Xeon CPU Time with 4MiB Working Set}}
shows that both schedulers behave similarly for fewer than 4 processes and between 10 and 16 processes.
Beyond about 48 processes, 4BSD again takes the lead.
{{wall_4.png|V890 Elapsed Time with 4MiB Working Set}}
{{xwall_4.png|Xeon Elapsed Time with 4MiB Working Set}}
shows that both schedulers are well-behaved until there are more processes than cores.
Once there are more processes than cores, 4BSD remains well behaved, whilst ULE has significant variations away from the diagonal slope, though unlike the 1KiB case, sometimes ULE does better than 4BSD.
{{eff_4.png|V890 Scheduler efficiency with 4MiB Working set}}
{{eff_4.png|Xeon Scheduler efficiency with 4MiB Working set}}
The 4BSD scheduler maintains a fairly constant efficiency, with only a slight bump between 4 and 12 processes.
==== 32MiB Working Set ====
This is the case where every process is cache-busting; 5e8 iterations were used.
{{cpu_32.png|V890 CPU Time with 32MiB Working Set}}
{{xcpu_32.png|Xeon CPU Time with 32MiB Working Set}}
shows that ULE generally uses less CPU time, the exception being between 12 and 16 processes.
On the downside the overall efficiency takes a significant hit, with a roughly 60% overhead for more than 16 processes.
{{wall_32.png|V890 Elapsed Time with 32MiB Working Set}}
{{xwall_32.png|Xeon Elapsed Time with 32MiB Working Set}}
shows that both schedulers behave similarly except between 16 and 30 processes, where ULE can run the tasks quicker.
{{eff_32.png|V890 Scheduler efficiency with 32MiB Working set}}
{{eff_32.png|Xeon Scheduler efficiency with 32MiB Working set}}
The 4BSD scheduler maintains a fairly constant efficiency, with a slight bump between about 16 and 32 processes.
This shows the 1KiB, 4MiB and 32MiB results overlaid on the same graph.
{{cpu.png|V890 CPU Times}}

{{wall.png|V890 Elapsed Times}}

{{eff.png|V890 Scheduler efficiencies}}
{{xcpu.png|Xeon CPU Times}}

{{xwall.png|Xeon Elapsed Times}}

{{eff.png|Xeon Scheduler efficiencies}}