Measuring the Performance of 3D Graphics Architectures

As the final project for the ECS250A course, "Advanced Computer Architecture," my project partner Christian Hofsetz and I chose the topic of 3D graphics architectures.

We wanted to find a way to measure the raw performance of a 3D graphics accelerator, factoring out the influence of CPU and overall system performance as far as possible, by providing a synthetic benchmark program that can be "tweaked" by a plethora of parameters to probe implementations at the pipeline-stage level.

To reach this goal, we evaluated the model of computation employed by industry-standard rendering libraries like OpenGL and Direct3D: the standard rendering pipeline. Based on this evaluation, we tried to single out the design decisions a systems architect can make to improve the performance of the standard pipeline.

Finally, we created the above-mentioned benchmark program, portable to a wide range of systems, which aids the designer or analyst in their tasks.

Abstract

In this paper, we describe the structure of the standard graphics pipeline, as implemented by common graphics libraries like OpenGL and Direct3D. Based on this structure, we present the options a designer of graphics architectures can exploit to increase the performance of a design. To evaluate the advantages and drawbacks of these design decisions, the architect needs a benchmarking tool that can probe the performance of the graphics pipeline at the level of single stages. We propose a simple synthetic benchmark program, which uses the OpenGL graphics library to achieve this task.

The Benchmark Program

These are all the parameters the benchmark program recognizes:
-i <numIterations>
Sets the number of times the scene is rendered. The default number of iterations is 500.
-s <width> <height>
Sets the size of the rendering window to width x height. The default window size is 640 x 480.
-v <numStrips> <quadsPerStrip>
Sets the number of strips making up the sphere, and the number of quadrilaterals per strip. The first and last strip consist of quadsPerStrip triangles each, arranged as fans around the north and south pole, respectively; the other numStrips - 2 strips consist of quadsPerStrip quadrilaterals or quadsPerStrip * 2 triangles, depending on the primitive generation mode. The default number of strips is 32, the default number of quadrilaterals per strip is 64.
Primitive generation modes:
-pt
Generates the sphere as quadsPerStrip * 2 + (numStrips - 2) * quadsPerStrip * 2 independent triangles.
-pq
Generates the sphere as two triangle fans of quadsPerStrip triangles each, and numStrips - 2 quad strips of quadsPerStrip quadrilaterals each.
-pv
Generates the sphere as a vertex array, using the same geometrical primitives as if using the -pq switch.
-pl
Compiles the generated primitives into a display list for later use. The created display list also encapsulates texture image uploads (if enabled).
The default primitive generation mode is independent triangles, and no display lists.
Buffer modes:
-be
Erases the frame buffer before rendering each frame.
-bd
Enables double-buffering, to allow for smooth animations.
-bz
Disables the depth buffer for hidden surface removal.
The default buffer mode is single-buffered, no erase, depth test enabled.
Drawing modes:
-dl <numLightSources>
Enables per-vertex Phong lighting, using numLightSources point light sources. numLightSources has to be eight or less.
-dn
Enables automatic re-normalization of normal vectors. If this feature is enabled, the normals passed to OpenGL will have non-unit length to enforce re-normalization.
-dg
Enables Gouraud shading.
-dw
Switches to wireframe rendering mode.
-db
Enables backface culling.
The default drawing mode is no lighting, no automatic normalization, flat shading, filled polygons, and no backface culling.
Texture mapping modes:
-t <textureFileName>
Enables texture mapping and loads the given file in PPM file format.
Note: The PPM loading routine is quite dumb: It only reads raw (binary) PPM files, and it expects exactly one comment line between the format identifier ("P6") and the width and height fields. Luckily, this is the standard output format of XV...
-to
Enables using texture objects instead of downloading the texture image for every frame.
-tl
Enables bilinear sampling for texture minification and magnification.
-tm
Enables modulating the interpolated post-lighting polygon color with the texture color, instead of replacing the polygon color with the texture color.
The default texture mode is texturing disabled, no texture objects, nearest-neighbour sampling for both minification and magnification, and texture-replace rendering mode.
Clipping modes:
-cn
Sets up a single clipping plane which will not clip away any primitives.
-ca
Sets up a clipping plane such that no primitive is ever rendered to the frame buffer.
-c <numberOfClippingPlanes>
Sets up numberOfClippingPlanes additional clipping planes. numberOfClippingPlanes has to be six or less.
The default clipping mode is no user-defined clipping at all.

The "Standard" Benchmark Suite

These are the program parameters making up the standard benchmark suite. Note that all benchmarks are run in single-buffer mode, to factor out differing screen refresh rates. To enable double-buffering for smooth animation, the "-bd" parameter has to be given as well:
1.  Benchmark -v 64 128 -pt -be -ca
This set renders a sphere consisting of 64 strips of 128 quads each, passed to OpenGL as independent triangles. A user clipping plane is set up such that no primitive is ever rendered.
2.  Benchmark -v 64 128 -pq -be -ca
Same as 1., but the sphere is passed as a combination of triangle fans / quad strips.
3.  Benchmark -v 64 128 -pv -be -ca
Same as 1. and 2., but the sphere vertices are passed as a vertex array, and rendered using triangle fans / quad strips.
These three runs in combination test the performance of the Model Transform stage of the standard rendering pipeline.
4.  Benchmark -v 64 128 -pv -be -ca -dg -dl 8
Same as 3., but this time the sphere is lit by eight point light sources and Gouraud-shaded. Since all primitives are still clipped away, the lighting calculations should never be performed and should not influence the runtime.
5.  Benchmark -v 64 128 -pv -be -db
This run renders an unlit, unshaded white sphere using vertex arrays, triangle fans / quad strips and backface culling.
6.  Benchmark -v 64 128 -pv -be -dw
This run renders the sphere in wireframe mode. On some machines (notably all tested SGI machines) this takes about twice as long as rendering filled polygons, hinting that SGI treats lines as pixel-wide polygons; this is further supported by the fact that correct lighting and even texturing are performed in wireframe mode as well.
7.  Benchmark -v 64 128 -pv -be -db -dg -dl 8
A red, Gouraud-shaded sphere lit by eight point light sources.
8.  Benchmark -v 64 128 -pv -be -db -t globe.ppm -to
The unlit sphere again, texture-mapped with a (small) image of resolution 64x128, using texture objects and nearest-neighbour sampling.
9.  Benchmark -v 64 128 -pv -be -db -t globe.ppm -to -tl
Same as 8., this time using bilinear sampling for texture magnification and minification.
10.  Benchmark -v 64 128 -pv -be -db -t globe.ppm -to -tl -tm -dg -dl 2
Same as 9., but this time modulating the Gouraud-shaded, white sphere, lit by two point light sources, with the texture map.

Preliminary Results

We ran our benchmark on the following systems:
  1. SGI O2, MIPS R5000 CPU at 180MHz, 128MB RAM, CRM graphics board.
  2. SGI Octane, MIPS R10000 CPU at 250MHz, 128MB RAM, SI graphics board.
  3. SGI Octane, 2 MIPS R10000 CPU at 250MHz, 256MB RAM, EMXI graphics board.
    (Note: The benchmark only utilizes a single CPU.)
  4. SGI Onyx2, 4 MIPS R10000 CPUs at 195MHz, 512MB RAM, InfiniteReality2 graphics board.
    (Note: The benchmark only utilizes a single CPU.)
  5. SGI Visual Workstation 320, Intel Pentium II CPU at 400MHz, 128MB RAM, Cobalt graphics board.
  6. Windows PC, Intel Pentium III CPU at 600MHz, 128MB RAM, Viper 770 (NVidia Riva TNT2) graphics board. (Timings provided by Christian Hofsetz.)
  7. Windows PC, Intel Celeron CPU at 333MHz (66MHz FSB), 128MB RAM, CL Annihilator Pro (NVidia GeForce 256 DDR) graphics board. (Timings provided by Gerhard Prilmeier.)
  8. LINUX PC, Intel Pentium III CPU at 733MHz (100MHz FSB), 256MB RAM, NVidia GeForce 256 DDR graphics board.
  9. LINUX PC, AMD Athlon Thunderbird CPU at 800MHz (200MHz FSB), 128MB RAM, ELSA GLadiac (NVidia GeForce2 GTS) graphics board.
System         Bench 1  Bench 2  Bench 3  Bench 4  Bench 5  Bench 6  Bench 7  Bench 8  Bench 9  Bench 10
O2                8.19    21.89    22.08    22.32    28.79    17.52     7.80    19.89    19.99     12.64
Octane (1)       42.96   120.77   109.17   111.86    75.53    14.89    14.14     8.51     4.55      4.01
Octane (2)       43.41   127.42   107.50   102.44   101.87    38.88    31.70    96.79    95.88     57.64
Onyx2            39.15   111.36   113.38   110.86   113.90    61.58    72.89   106.61   111.36    109.41
Cobalt           26.85    74.29   136.24   132.98   161.29    22.08    15.48   139.66   140.06     57.94
TNT2             21.67    45.29    73.42    61.88    52.91    26.15    15.56    48.97    48.92     47.93
GeForce (1)      22.17    55.80    63.29    32.87   158.20   153.30   107.80   151.50   160.30    156.30
GeForce (2)      53.65   148.81   142.05    64.43   354.61    11.36   162.87   239.23   241.55    170.07
GeForce2 GTS     80.52   225.23   218.34    88.81   757.58    42.44   168.92   595.24   632.91    381.68
Table 1: Preliminary benchmark results. The numbers are given in frames/second; higher numbers indicate higher performance.

Downloads

You can download the complete paper as a gzipped PostScript file. The paper contains 30 pages, and the download size is approximately 170k.

You can also download the graphics benchmark itself as a zip archive, containing the C++ source code (I was able to compile it under IRIX 6.5 and Windows NT without any changes), two sample texture images in PPM format, and a makefile containing the parameters to run the "standard benchmark." The download size is approximately 600k (sorry, the second texture image is quite large).

The benchmark does its own timing using the clock() function call. Under UNIX, however, clock() only reports the CPU time consumed by the user process and its children; it does not include the time the process spent waiting for OpenGL commands to complete. This means the built-in timing is pretty much useless under this operating system. Under Windows NT, the built-in timing is quite reliable.

The time() function, on the other hand, has only one-second resolution and is therefore rather useless as well.

Luckily, UNIX features the time utility, which can be used to measure overall runtimes reliably. Use it and calculate the frames/second performance measure by hand.