
What is your Favorite 6502 optimization/trick?


TylerBarnes


So in the spirit of learning new things, what is your favorite 6502 trick? My question stems from very recently coming across a simple, albeit limited, trick that just seemed to strike a chord with me and made me go 'wow'. 

The trick I am talking about is using the BIT opcode in place of a JMP $xxxx that only skips 2 bytes. Consider the following code, in which I want to check a state, conditionally load one of two choices, and store the result in a destination.

	LDA State 
	CMP #$01
	BEQ LabelTwo
LabelOne:
	LDA ChoiceA
	JMP Jump
LabelTwo:
	LDA ChoiceB
Jump:
	STA Destination


The 'JMP Jump' part of the code occupies 3 bytes total: 1 for the opcode and 2 for the address. We can instead replace those three bytes with the single byte of the BIT absolute opcode ($2C), which effectively skips the 'LDA ChoiceB' altogether while preserving the accumulator and the carry flag.
 

	LDA State 
	CMP #$01
	BEQ LabelTwo
LabelOne:
	LDA ChoiceA
	.db $2C
LabelTwo:
	LDA ChoiceB    ; if we came from LabelOne, these two bytes are read as BIT's operand address
Jump:
	STA Destination


What happens is that when the CPU reads the BIT opcode, it expects the next two bytes to be the low byte and then the high byte of an address. So it reads $A9 (the LDA opcode) as the low byte, and the value byte assigned to ChoiceB as the high byte. It then does the rest of the work associated with the BIT instruction (a non-destructive test against the accumulator that only sets flags). By the time the BIT has finished, it is ready to 'STA Destination' with the ChoiceA value still in the accumulator.

This is not faster, mind you. It is 1 cycle slower, but it saves two bytes. Not very broad in application, but very cool imo.

The machine code for those interested. 

; Let's assume the assembler assigned 'Jump' the address of $1234
; Also assume 'Destination' is RAM location $00 on the zero page
; Lastly, assume we have loaded ChoiceA and are taking the JMP Jump / BIT trick path

;___Disassembly____ 

; Version 1
$A9,$00		; LDA #$00  ; LDA ChoiceA
$4C,$34,$12	; JMP $1234 ; JMP Jump
$A9,$01		; LDA #$01  ; LDA ChoiceB 
$85,$00		; STA $00   ; STA Destination

;Version 2:
$A9,$00		; LDA #$00  ; LDA ChoiceA
$2C,$A9,$01	; BIT $01A9 ; the 'LDA ChoiceB' bytes become BIT's operand
$85,$00		; STA $00   ; STA Destination
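
If you use this often, many assemblers let you hide the magic byte behind a macro so the intent stays readable. A minimal sketch, assuming ca65-style macro syntax (the name 'skip2' is just illustrative):

	.macro skip2         ; skip over the next 2-byte instruction
		.byte $2C        ; BIT absolute opcode; the following two bytes become its operand
	.endmacro

	; usage, same shape as the example above:
	LDA ChoiceA
	skip2                ; falls through past 'LDA ChoiceB' without executing it
LabelTwo:
	LDA ChoiceB
	STA Destination

One caveat: the BIT still performs a read from whatever address those two operand bytes form, so be careful that it can't land on a read-sensitive register (like $2007 on the NES).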



 

Edited by TylerBarnes

That's an interesting trick, and especially curious to me, because I've honestly never thought of doing optimizations from this "perspective".

That said, these kinds of optimizations are probably the last thing I'd consider doing on a project, and by that point, saving 2 bytes here and there probably won't make a difference. You'd need to do it all over the place throughout your entire project for a notable advantage.

The downside here, and the reason you probably won't ever see me doing something like this, is that it makes the source code a lot less stable. Let's say you decide to add some more logic in the "branch" for LabelTwo: all of a sudden the LabelOne logic fails to function as intended, and it's a really easy thing to miss when returning to old code a few weeks later.

It's the same reason I tend to always include "unnecessary" CLC/SEC even when the state of the flag should already be guaranteed by a previous branch. Omitting them might give me a tiny optimization, but it also creates the risk of a hard-to-identify bug in the future.


On 1/22/2020 at 6:55 AM, Sumez said:

Yeah sorry. I just stupidly assumed 6502 implied NES programming. On Atari, every byte matters. I'd say cycle count is more central though.

You're not wrong, but on the Atari it really depends on what part of the code you're talking about. During the display kernel, cycle count is everything: you plan your entire code around the cycle count of each instruction, you waste ROM space aligning things to avoid the 1-cycle page-crossing penalty, or you waste loads of RAM pre-calculating everything you can.

But in the rest of your code, you're often doing everything you can to conserve ROM and RAM space. Many games end up at a point where the ROM is full, you want to add a small feature or fix a bug, and you have to spend an hour searching the codebase for places to free up the 8 bytes needed to make your change. My code is filled with BNEs where I originally had a JMP, just to save a byte, or with the SEC/CLC removed where possible, or with the BIT trick used to skip 2 bytes instead of a JMP.
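
For anyone who hasn't seen the branch-instead-of-jmp trick in isolation, a minimal sketch, assuming the flag state is guaranteed at that point (the 'Flag' variable and the labels are made up):

	lda #$01        ; A is known to be nonzero, so the Z flag is guaranteed clear
	sta Flag
	bne Continue    ; always taken here: 2 bytes instead of the 3 a JMP would cost
	; ...more code here...
Continue:

It only works when the branch target is within range (roughly ±127 bytes) and the flag really is guaranteed, which is exactly the kind of fragility mentioned earlier in the thread.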

The trick that first blew my mind was abusing the stack instructions to quickly move data around (i.e. first pointing the stack pointer at your data, then using PLA to load a byte into A and advance the stack pointer in one instruction). I assume this is what tepples' Popslide library does on the NES, but the first time I saw it on Atari my mind was blown (the developers of Combat used a similar but different technique of pointing the stack pointer at a set of registers and using PHP to push the zero flag into them).
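
A minimal sketch of the stack-pointer copy idea on the NES side, not Popslide itself, just the general shape. Assumptions: the data to copy has been staged in the stack page (at $0140 here, a made-up address), 'SavedSP' is a made-up zero page temp, and nothing else can push to the stack while SP is repurposed (e.g. you're inside the NMI handler):

	tsx
	stx SavedSP      ; remember the real stack pointer
	ldx #$3F
	txs              ; point SP just below the staged data at $0140
	pla              ; 4 cycles: read $0140 and advance SP in one instruction
	sta $2007        ; write to PPUDATA
	pla              ; read $0141
	sta $2007
	; ...unrolled as far as the transfer needs...
	ldx SavedSP
	txs              ; restore the real stack pointer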


I like small but clever tricks, such as doing BIT buttons to check the state of the controller without having to load it into A beforehand. It only works for A and B, but it's a nifty trick for faster A/B checking. Also, using illegal opcodes to speed up some routines: for instance, LDA #$FF, then STA $0200,x / AXS #$04 is a great way to make a fast sprite-clearing loop that saves a lot of DEX/INXes. I also unrolled it to STA $0200,x / STA $0280,x to make it even faster while still staying compact.
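
A minimal sketch of the BIT buttons idea, assuming the controller state is stored with the A button in bit 7 and the B button in bit 6 ('Buttons' and the labels are placeholders):

	bit Buttons      ; N = bit 7 (A button), V = bit 6 (B button), accumulator untouched
	bmi HandleA      ; branch if A is held
	bvs HandleB      ; branch if B is held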

Edited by Vectrex28

  • 2 weeks later...

Not that you'll ever use this unless you're crazy about doing signed math (and even then, the usage is probably still rare), but this is similar to the BIT-skip trick in the OP.

You can synthesize a set overflow instruction with BIT addr, where addr must point to a byte with bit 6 set; an RTS instruction (opcode $60 = %01100000) works nicely.

Note how the 6502 instruction set has a clear overflow instruction (clv), but not a set overflow (sev) instruction?

For some reason, setting the overflow flag in hardware depends on a high-to-low transition on the SO pin (maybe some math helper chip was supposed to interact with it, or maybe it was thought of as a way to poll for requests), but most computers, the NES/FC included, leave it unconnected or tied to a constant level.

To recap, the overflow flag is set if a CLC + ADC or SEC + SBC result falls out of signed range, i.e. outside -128 to 127.
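
A minimal sketch of synthesizing SEV this way (any byte in ROM with bit 6 set will do; an existing RTS is handy since its opcode is $60, and the labels here are made up):

SetV:	rts              ; $60 = %01100000, bit 6 is set

	bit SetV         ; copies bit 6 of the operand into V: an improvised 'sev'
	                 ; side effects: N takes bit 7 (= 0), Z reflects A AND $60
	bvs Overflowed   ; this branch is now always taken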
 

Edited by FrankenGraphics

On 1/31/2020 at 12:21 AM, Vectrex28 said:

[...] For instance, LDA #$FF then STA $0200,x AXS #$04 is a great way to make a fast sprite clearing loop which saves up on a lot of DEX/INXes. [...]

This example is equivalent to 4 DEX. If someone reading this is using an upwards-counting OAMBuffer cleanout routine, it should be:
 

lda #$ff
sta $200,x
axs #$fc ;equivalent to +4 each iteration



since AXS means X = (A AND X) - #imm

Edit: if the assembler you're using doesn't support any undocumented opcodes,  then
 

.db $CB, $fc   ; axs #$fc, emitted as the opcode byte plus its immediate operand



will do the trick.
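
Putting it together, a minimal sketch of the whole upward-clearing loop with the corrected constant (assuming the OAM shadow buffer sits at $0200-$02FF; the label is just illustrative):

	ldx #$00
	lda #$ff
ClearLoop:
	sta $0200,x
	axs #$fc         ; X = (A AND X) - $FC = X + 4 (mod 256); sets flags like a compare
	bne ClearLoop    ; X wraps back to $00 after 64 passes, ending the loop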

 

Edited by FrankenGraphics

  • 2 weeks later...
On 2/9/2020 at 4:59 PM, FrankenGraphics said:

This example is equivalent to 4 DEX. If someone reading this is using an upwards-counting OAMBuffer cleanout routine, it should be:
 


lda #$ff
sta $200,x
axs #$fc ;equivalent to +4 each iteration



since AXS means X = (A AND X) - #imm

Edit: if the assembler you're using doesn't support any undocumented opcodes,  then
 


.db $CB, $fc   ; axs #$fc, emitted as the opcode byte plus its immediate operand



will do the trick.

 

I'm actually using downwards-counting routines, out of a habit of optimising loops by doing DEX / BPL .loop instead of INX / CPX #$xx / BNE .loop.

Obviously it only works on loops of 128 iterations or fewer, but most loops shouldn't take that many, unless it's a sprite-clearing loop or the like.
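
For comparison, a minimal sketch of the two loop shapes (NUM_ITEMS and the labels are just placeholders):

	; downwards: no compare needed, DEX sets N for the branch
	ldx #NUM_ITEMS-1
Down:
	; ...loop body...
	dex
	bpl Down

	; upwards: costs an extra CPX (2 bytes, 2 cycles) every iteration
	ldx #$00
Up:
	; ...loop body...
	inx
	cpx #NUM_ITEMS
	bne Up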


  • 6 months later...

Found a cool optimization today. My first venture into successfully exploiting an illegal opcode: ASR ($4B). I know it's not too terribly exotic, but fun nonetheless.

It uses the immediate addressing mode, ASR #imm.
The instruction ANDs the immediate byte with A, and then shifts A one bit to the right. It affects N, Z, and C, but those are not relevant to the way I used it.

To start, it would help to know what code I started with and what it was doing. It is part of my collision detection routine for a 4x30 byte map (1 bit per tile).

To index into this and test the right bit, you use X/64 + (Y/8)*4 to grab the relevant byte from the 1D array, and then X/8 AND %00000111 as an index into a table that picks out the exact bit position within the selected byte.
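
For example, just plugging numbers into that formula: a point at pixel X=100, Y=50 gives 100/64 = 1 and (50/8)*4 = 6*4 = 24, so byte 24+1 = 25 of the map, and (100/8) AND %00000111 = 12 AND 7 = 4, selecting bit-mask entry 4 within that byte.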

There is a section of this calculation, the (Y/8)*4 part, that works out particularly well for the ASR opcode to optimize.

; Let's assume we have calculated X/64 and already put it in a variable called 'tmp'.
; The bits we care about are written as 'y' so we can follow their movement.
; Known zeros are written out.
;--------------------------------------------
    [X/64 calculation here]
    STA tmp        ; X/64 in tmp
    TYA            ; %yyyyyyyy
    LSR            ; %0yyyyyyy
    LSR            ; %00yyyyyy
    LSR            ; %000yyyyy ; Y/8
    ASL            ; %00yyyyy0
    ASL            ; %0yyyyy00 ; (Y/8)*4
    ADC tmp        ; add both to complete X/64+(Y/8)*4 (carry is guaranteed clear here)
    TAY            ; Y is now the byte index into the main 4x30 map
;--------------------------------------------

In effect, I noticed that what we are ultimately left with is the 5 bits we care about shifted to the right once, with bits 0 and 1 cleared:
turning %yyyyyyyy into %0yyyyy00.

So a quicker way to do this would be:
    TYA                 ; %yyyyyyyy
    AND #%11111000      ; %yyyyy000
    LSR                 ; %0yyyyy00

However, ASR happened to conveniently do both of these for me at once. It ANDs A with the immediate byte and then shifts right once.
    TYA                 ; %yyyyyyyy
    ASR #%11111000      ; %0yyyyy00

;----------------------------------------------

For those interested, I'm including the full routine here for context.


 

The code I originally wrote. 


; In the bit patterns below, bits we care about are shown with capital letters; lowercase bits are the ones we're trying to get rid of
; Known zeros will be written

CheckCollide:  ; when entering routine, X and Y pre loaded with xPos and yPos of checked point 
	TXA        ; %XXxxxxxx ; X/64
	LSR        ; %0XXxxxxx
	LSR        ; %00XXxxxx
	LSR        ; %000XXxxx
	LSR        ; %0000XXxx
	LSR        ; %00000XXx
	LSR        ; %000000XX
	STA tmp    ; %000000XX
	TYA        ; %YYYYYyyy ;(Y/8)
	LSR        ; %0YYYYYyy
	LSR        ; %00YYYYYy
	LSR        ; %000YYYYY
	ASL        ; %00YYYYY0 ;*4
	ASL        ; %0YYYYY00
	ADC tmp    ; %0YYYYYXX add both to complete X/64+(Y/8)*4
	TAY        ; Y is now Byte index into main 4x30 map
	
	TXA        ; %xxXXXxxx ;X/8
	LSR        ; %0xxXXXxx
	LSR        ; %00xxXXXx
	LSR        ; %000xxXXX
	AND #%00000111 ; %00000XXX
	TAX        ; X is index into BitMask for bit pos within the byte 
	
	LDA CollisionRAM, y  ; Load the byte player is in from 4x30 map
	AND BitMask, x       ; compare the exact bit position with lookup table 
	RTS                  ; Zero flag is clear if collision is true, for a later branch

BitMask: 
	.db %10000000
	.db %01000000
	.db %00100000
	.db %00010000
	.db %00001000
	.db %00000100
	.db %00000010
	.db %00000001



The code after refactoring. 


; In the bit patterns below, bits we care about are shown with capital letters; lowercase bits are the ones we're trying to get rid of
; Known zeros will be written

CheckCollide:        ; when entering routine, X and Y pre loaded with xPos and yPos of checked point
	TXA              ;%XXxxxxxx C=0 ;   
	ROL              ;%Xxxxxxx0 C=X
	ROL              ;%xxxxxx0X C=X
	ROL              ;%xxxxx0XX C=x
	AND #%00000011   ;%000000XX C=x
	STA tmp 

	TYA              ;%YYYYYyyy  ; (Y/8)*4
	ASR #%11111000   ;%0YYYYY00  ; illegal opcode $4B (AND with the immediate byte, then LSR); affects N,Z,C
	ADC tmp          ;%0YYYYYXX  add both to complete X/64+(Y/8)*4
	TAY              ; Y is now Byte index into main 4x30 map

	TXA              ;%xxXXXxxx  ; X/8 (Capital X are bits we care about) 
	LSR              ;%0xxXXXxx
	LSR              ;%00xxXXXx
	ASR #%00001110   ;%00000XXX
	TAX              ; X is index into BitMask for bit pos within the byte 
	
	LDA CollisionRAM, y  ; Load the byte player is in from 4x30 map
	AND BitMask, x       ; compare the exact bit position with lookup table 
	RTS                  ; Zero flag is clear if collision is true, for a later branch

 

 

 

Edited by TylerBarnes
