22 May 2004

Refresh speed

I performed some experiments that brought home to me how much time writing to the video-card memory really takes — at least on my Soyo laptop.

The project

First, the files. I've cut down on the number of files in the project: I've dropped realexp.asm and moved the code from protmode.asm and loader.asm into boot.asm, thus eliminating three files. I also removed the macros I'm not using from macros.asm. Finally, karig.asm is gone; make.bat simply assembles boot.asm, which includes everything else. The code now just starts up protected mode and a VESA graphics mode, and I can still dump memory to the screen because the project still contains screen.asm and dump.asm.

You can download the project and have a look.

The code

I was originally playing around with the RDTSC instruction, which copies the 64-bit value of the Pentium's built-in cycle counter into the EDX and EAX registers (high half in EDX, low half in EAX). (You'd normally use this instruction as a stopwatch, to measure the number of cycles a stretch of code takes to execute.) My original code cleared the screen and then entered a loop, which would call a routine to get the RDTSC value,

		call	cls
	.1:	call	gettsc

store the value in memory (at address "tsc"),

		mov	[tsc], eax
		mov	[tsc+4], edx

dump sixteen bytes (at address "tsc") to the screen buffer,

		mov	eax, [row]
		_lit	tsc
		call	dump16
		_drop
		_drop

copy the screen buffer into the video-card RAM (and thus onto the screen),

		call	refresh

and increment the number of the row to which bytes are to be dumped.

		inc	dword [row]

If this number is less than 30, the loop repeats.

		mov	ebx, [row]
		cmp	ebx, 30
		jl	.1

Otherwise, the number is reset to zero. Then the loop repeats.

		xor	ebx, ebx
		mov	[row], ebx
		jmp	.1

The result

The result was slow enough that you could actually see the lines being overwritten on the screen, from top to bottom. It took a little less than one second for the screen to be completely overwritten, from the top row to the bottom. I timed it with my wristwatch: I counted off 100 screen fills in 71 seconds, so one screen fill took about 710 milliseconds. This means that it takes 710/30 or just under 24 milliseconds to get the RDTSC value, dump sixteen bytes to the screen buffer, and refresh the screen. This operation can be performed up to 42 times per second.
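
The wristwatch arithmetic above can be written out as a small C sketch. The function names are made up for illustration and are not part of the Karig project; the inputs are the figures quoted above (100 fills, 71 seconds, 30 rows per fill).

```c
#include <assert.h>

/* Average time for one pass through the loop, in milliseconds, given a
   stopwatch measurement: `fills` complete screen fills took `seconds`
   seconds, and each fill is `rows` passes through the loop. */
double ms_per_pass(double seconds, int fills, int rows)
{
    assert(fills > 0 && rows > 0);
    return seconds * 1000.0 / (fills * rows);
}

/* Upper bound on passes per second if the CPU did nothing else. */
double passes_per_second(double ms)
{
    assert(ms > 0.0);
    return 1000.0 / ms;
}
```

With the measured values, ms_per_pass(71.0, 100, 30) comes to roughly 23.7 milliseconds, and passes_per_second of that result is a little over 42.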

Note that refresh is called each time around the loop. I moved the call to refresh outside the loop, so that refresh is called only once every thirty times around the loop:

		call	cls
	.1:	call	gettsc

		mov	[tsc], eax
		mov	[tsc+4], edx
		mov	eax, [row]
		_lit	tsc
		call	dump16
		_drop
		_drop

		inc	dword [row]
		mov	ebx, [row]
		cmp	ebx, 30
		jl	.1

		call	refresh
		xor	ebx, ebx
		mov	[row], ebx
		jmp	.1

When I run this version, the screen changes much more quickly, which tells me that nearly all of those 24 milliseconds were being spent in refresh. I'd read that writing to video RAM is slower than writing to ordinary RAM, but this was still a revelation to me. One implication is that updating the screen 42 times a second would take up essentially all of the CPU's time and leave almost none for any other task.

What could I do?

I'd like to be able to use Karig to experiment with animated images, so I'd like to be able to have smooth animation if I can. Obviously updating the entire screen forty-two times per second is not practical. I have two options:

  • Update all of the screen, less frequently.
  • Update part of the screen, more frequently.

The current version of refresh updates all of the screen by copying the entire screen buffer to video RAM. If this is done 24 times per second (corresponding to the frame rate of a movie), then screen refresh requires about 57% of the CPU's time — meaning that refresh would be running a full 570 milliseconds out of every second that Karig is running, leaving only 430 milliseconds per second for other tasks. If refresh is called only 18.2 times per second (corresponding to the number of interrupts generated per second by a standard hardware timer inside the computer), then screen refresh takes up about 43% of the CPU's time. Possibly I could try for even lower frame rates, but if the frame rate is too low, then animation begins to look "jerky." I'd like to be able to get smooth animation on Karig if possible.

(On the other hand, I don't really need a high frame rate for the text screen. I'd need to refresh the screen only often enough to make text entry look smooth. Ten frames (240 milliseconds of refresh time) per second ought to be fine for the text screen, though of course I'll play with this a bit until I'm happy with the result. However, I'll want Karig to have two screens. The second screen would be a graphic screen, which I could use for experimenting with graphics without messing up the text screen. I'll definitely want a high refresh rate for all or part of this second screen.)
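
The percentages for the different frame rates all come from the same multiplication, sketched here in C. The function name is hypothetical, and the sketch assumes (as the text does) that one full refresh costs the whole measured 710/30 milliseconds.

```c
#include <assert.h>

/* Fraction of each second spent inside refresh, given the cost of one
   refresh in milliseconds and the number of refreshes per second.
   A result of 1.0 or more means refresh alone would saturate the CPU. */
double refresh_share(double refresh_ms, double calls_per_second)
{
    assert(refresh_ms >= 0.0 && calls_per_second >= 0.0);
    return refresh_ms * calls_per_second / 1000.0;
}
```

Calling refresh_share(710.0 / 30.0, fps) with fps set to 24, 18.2 and 10 gives about 0.57, 0.43 and 0.24 respectively, matching the figures above.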

An alternative version of refresh might update only part of the screen. If copying the entire screen buffer to video RAM takes 24 milliseconds, then copying, say, one eighth of the screen buffer to video RAM would take only three milliseconds. This version of refresh would thus take only one eighth of the time that the current version requires, so it could be called eight times as often — which means higher frame rates are possible (even if the frames are smaller). If the frame being updated takes up one eighth of the screen, then I can update it eighty times a second and still consume only about a quarter of each second refreshing the screen. (I'll probably want to play with different custom versions of refresh for different experiments.)
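
To make the idea of a partial refresh concrete, here is a minimal C sketch that copies only one rectangle from the screen buffer to the visible buffer. It is not the project's refresh routine. The struct, the function name and the parameters are all assumptions for illustration: both buffers are taken to be linear and to share the same layout, with pitch bytes per scan line and bpp bytes per pixel. In Karig the destination would be video RAM; here it is ordinary memory so the sketch can run anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical rectangle, measured in pixels from the top-left corner. */
struct rect { int x, y, w, h; };

/* Copy only rectangle `r` from the off-screen buffer `src` to the visible
   buffer `dst`, one scan line at a time. Returns the number of bytes
   copied, which is what the cost of the refresh should scale with. */
size_t refresh_rect(uint8_t *dst, const uint8_t *src,
                    size_t pitch, size_t bpp, struct rect r)
{
    assert(r.x >= 0 && r.y >= 0 && r.w >= 0 && r.h >= 0);
    assert(((size_t)r.x + (size_t)r.w) * bpp <= pitch);

    size_t row_bytes = (size_t)r.w * bpp;
    for (int row = 0; row < r.h; row++) {
        /* Same byte offset in both buffers, since they share a layout. */
        size_t off = ((size_t)r.y + (size_t)row) * pitch + (size_t)r.x * bpp;
        memcpy(dst + off, src + off, row_bytes);
    }
    return row_bytes * (size_t)r.h;
}
```

A rectangle covering one eighth of the screen copies one eighth of the bytes, which is where the estimate of three milliseconds per call comes from, assuming the cost grows in proportion to the bytes written.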

More on the RDTSC instruction

Agner Fog advises that, when using the RDTSC instruction to measure the performance of your code, it is best to precede the RDTSC with xor eax, eax and cpuid. Processors starting with the Pentium Pro can execute instructions out of order, so the RDTSC might otherwise run before the instructions ahead of it have finished. The cpuid instruction is a "serializing" instruction, meaning that, before it does its thing, it waits until every instruction issued before it has finished executing. Thus the timing isn't thrown off by earlier instructions that are still in flight.

Of course, the cpuid instruction trashes four registers — EAX, EBX, ECX, and EDX. For now, my routine here just preserves EAX (with _dup).

gettsc:
; ( -- t ) Returns time "t" as taken from the Pentium's cycle counter.
		_dup
		xor	eax, eax
		cpuid		; Trashes EAX-EDX; rdtsc then reloads EAX and EDX
		rdtsc
		ret
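
Used as a stopwatch, gettsc would be called once before and once after the code being timed, and the two readings subtracted. The arithmetic is sketched below in C rather than assembly so that it can be checked on any machine. The function names are hypothetical, and clock_hz is an assumed parameter: the clock speed of the laptop is not given here.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the two halves that RDTSC leaves in EDX (high) and EAX (low).
   Storing EAX at [tsc] and EDX at [tsc+4] produces the same 64-bit value
   in memory, because x86 is little-endian. */
uint64_t tsc_value(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Cycles elapsed between two readings. The subtraction is unsigned, so a
   carry out of the low half is handled automatically. */
uint64_t tsc_elapsed(uint64_t start, uint64_t end)
{
    return end - start;
}

/* Convert a cycle count to milliseconds for an assumed clock rate. */
double cycles_to_ms(uint64_t cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles * 1000.0 / clock_hz;
}
```

In assembly the same subtraction would be a sub on the low halves followed by an sbb on the high halves.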
