TylerBarnes | 184 Posted January 22, 2020 (edited)

So in the spirit of learning new things, what is your favorite 6502 trick? My question stems from very recently coming across a simple, albeit limited, trick that struck a chord with me and made me go 'wow'. The trick I am talking about is using the BIT opcode in place of a JMP $xxxx that only skips 2 bytes.

Consider the following code, in which I want to check a state, conditionally pick one of two choices, and store it in a destination.

        LDA State
        CMP #$01
        BEQ LabelTwo
    LabelOne:
        LDA ChoiceA
        JMP Jump
    LabelTwo:
        LDA ChoiceB
    Jump:
        STA Destination

The 'JMP Jump' part of the code occupies 3 bytes in total: 1 for the opcode and 2 for the address. We can instead replace these three bytes with the single byte of the BIT absolute opcode ($2C). This effectively skips the 'LDA ChoiceB' altogether while preserving the accumulator and the carry flag. (It only works because 'LDA ChoiceB' is a 2-byte instruction, an immediate load in the machine code below.)

        LDA State
        CMP #$01
        BEQ LabelTwo
    LabelOne:
        LDA ChoiceA
        .db $2C
    LabelTwo:
        LDA ChoiceB      ; If ChoiceA, these bytes are read as an address for BIT
    Jump:
        STA Destination

What happens is that when the CPU reads the BIT opcode, it expects the next two bytes to be the low byte and then the high byte of an address. So it reads $A9 (the LDA opcode) as the low byte, and then the value byte assigned to ChoiceB as the high byte. It then does the rest of the work of the BIT instruction: it ANDs the accumulator with the byte at that address to set Z, and copies bits 7 and 6 of that byte into N and V, all without changing A. By the time it has finished the BIT, it is ready to 'STA Destination' the value in the accumulator.

This is not faster, mind you. It is 1 cycle slower, but it saves two bytes. Not very broad in application, but very cool imo.

The machine code for those interested.
    ; Let's assume the assembler assigned 'Jump' the address $1234
    ; Also assume 'Destination' is RAM location $00 on zero page
    ; Lastly, assume we have just executed LDA ChoiceA and are performing the JMP Jump / BIT trick

    ;___Disassembly____
    ; Version 1:
    $A9,$00       ; LDA #$00   ; LDA ChoiceA
    $4C,$34,$12   ; JMP $1234  ; JMP Jump
    $A9,$01       ; LDA #$01   ; LDA ChoiceB
    $85,$00       ; STA $00    ; STA Destination

    ; Version 2:
    $A9,$00       ; LDA #$00   ; LDA ChoiceA
    $2C,$A9,$01   ; BIT $01A9  ; .db $2C, with the LDA ChoiceB bytes read as its operand
    $85,$00       ; STA $00    ; STA Destination

Edited January 22, 2020 by TylerBarnes
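To see both byte sequences behave the same from either entry point, here is a minimal Python sketch of a toy interpreter. It models only the four opcodes in the listing, ignores the flags, and assumes the code is loaded at $8000 (so 'Jump' lands at $8007 rather than the $1234 stand-in used in the post); those addresses are illustrative assumptions, not anything from the original code.

```python
# Toy interpreter for only the four opcodes in the listing above.
# The $8000 load address (and so the JMP target $8007) is an assumption for
# this sketch; the post itself uses $1234 as a stand-in for 'Jump'.

BASE = 0x8000

VERSION_JMP = [0xA9, 0x00,        # $8000 LabelOne: LDA #$00
               0x4C, 0x07, 0x80,  # $8002           JMP $8007
               0xA9, 0x01,        # $8005 LabelTwo: LDA #$01
               0x85, 0x00]        # $8007 Jump:     STA $00

VERSION_BIT = [0xA9, 0x00,        # $8000 LabelOne: LDA #$00
               0x2C,              # $8002           .db $2C
               0xA9, 0x01,        # $8003 LabelTwo: LDA #$01
               0x85, 0x00]        # $8005 Jump:     STA $00


def run(code, entry):
    """Run `code` from `entry` until execution falls off its end.

    Returns (value stored at $00, list of executed instructions).
    """
    mem = bytearray(0x10000)
    mem[BASE:BASE + len(code)] = bytes(code)
    mem[0x00] = 0xFF                      # sentinel: shows that STA really ran
    a, pc, end, trace = 0, entry, BASE + len(code), []
    while pc != end:
        op = mem[pc]
        if op == 0xA9:                    # LDA #imm
            a = mem[pc + 1]
            trace.append(f"LDA #${a:02X}")
            pc += 2
        elif op == 0x4C:                  # JMP abs
            pc = mem[pc + 1] | (mem[pc + 2] << 8)
            trace.append(f"JMP ${pc:04X}")
        elif op == 0x2C:                  # BIT abs: flags only (not modelled), A is kept
            addr = mem[pc + 1] | (mem[pc + 2] << 8)
            trace.append(f"BIT ${addr:04X}")
            pc += 3
        elif op == 0x85:                  # STA zp
            zp = mem[pc + 1]
            mem[zp] = a
            trace.append(f"STA ${zp:02X}")
            pc += 2
        else:
            raise ValueError(f"unmodelled opcode ${op:02X} at ${pc:04X}")
    return mem[0x00], trace


for name, code, label_two in (("JMP", VERSION_JMP, 0x8005),
                              ("BIT", VERSION_BIT, 0x8003)):
    print(name, len(code), "bytes")
    print("  via LabelOne:", run(code, BASE))
    print("  via LabelTwo:", run(code, label_two))
```

Entering at LabelOne in the BIT version executes LDA, BIT $01A9, STA, so the second LDA never runs as an instruction. One thing the sketch does not model: the real BIT performs a read of $01A9, which is harmless for RAM but worth checking if the swallowed bytes could form the address of a hardware register with read side effects.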
Sumez | 3,136 Posted January 22, 2020

That's an interesting trick, and especially curious to me, because I've honestly never thought of doing optimizations from this "perspective". That said, these kinds of optimizations are probably the last thing I'd consider doing on a project, and at that point saving 2 bytes here and there probably won't make a difference. You'd need to do it all over the place through your entire project for a notable advantage.

The downside here, and the reason you probably won't ever see me doing something like this, is that it makes the source code a lot less stable. Let's say you decide to add some more logic in the "branch" for LabelTwo: all of a sudden the LabelOne logic will fail to function as intended, and it's a really easy one to miss when returning to old code a few weeks later.

It's the same reason I tend to always include stuff like "unnecessary" CLC/SEC when the state of said flag should already be given due to a previous branch. Leaving them out might give me a tiny optimization, but it also creates a risk of a hard-to-identify bug in the future.
TylerBarnes | 184 Posted January 22, 2020 Author

20 minutes ago, Sumez said:
    You'd need to do it all over the place through your entire project for a notable advantage.

This depends on your platform. Atari carts average 4KB-8KB of ROM, so every byte can matter.
Sumez | 3,136 Posted January 22, 2020 (edited)

Yeah sorry. I just stupidly assumed 6502 implied NES programming. On Atari, every byte matters. I'd say cycle count is more central though.

Edited January 22, 2020 by Sumez
TylerBarnes | 184 Posted January 28, 2020 Author

While looking through Mario's source code, I noticed that this BIT skip is used throughout to hop over an instruction when you only want to load a different value into A but the rest of the subroutine is the same.
gauauu | 81 Posted January 29, 2020

On 1/22/2020 at 6:55 AM, Sumez said:
    Yeah sorry. I just stupidly assumed 6502 implied NES programming. On Atari, every byte matters. I'd say cycle count is more central though.

You're not wrong, but with Atari, it really depends on what part of the code you're talking about.

During the display kernel, cycle count is everything. You plan your entire code based on the cycle counts of each instruction. You waste ROM space aligning things to avoid the 1-cycle page-crossing penalty, or you waste loads of RAM pre-calculating everything you can.

But in the rest of your code, you are often doing everything you can to conserve ROM and RAM space. Many games end up at a point where your ROM is full, you want to add a small feature or fix a bug, and you have to spend an hour searching your codebase for places where you can free up the 8 bytes needed to make your changes. My code is filled with bne's where I originally had a jmp, just to save a byte. Or removing the sec/clc when possible. Or using the BIT trick to skip 2 bytes instead of using a jmp.

The trick that first blew my mind was abusing the stack instructions to quickly move data around (i.e. by first moving the stack pointer to your data, then using pla to load the byte into A and advance the stack pointer with one instruction). I assume this is what tepples' popslide library does on the NES, but the first time I saw it on Atari my mind was blown. (The developers of Combat used a similar but different technique: moving the stack pointer to a set of registers and using the php instruction to push the status flags, and with them the zero flag, into the register.)
Vectrex28 | 366 Posted January 30, 2020 (edited)

I like small but clever tricks such as doing BIT buttons to check the state of the controller without having to load it into A beforehand. BIT copies bits 7 and 6 of the byte into the N and V flags, so it only works for the two buttons stored in those bits (A and B), but it's a nifty trick for faster A/B checking.

Also, using illegal opcodes to speed up some processes. For instance, LDA #$FF followed by STA $0200,x and AXS #$04 is a great way to make a fast sprite clearing loop, which saves a lot of DEX/INXes. I also unrolled it to STA $0200,x STA $0280,x to make it even faster while still being compact.

Edited January 30, 2020 by Vectrex28
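As a rough illustration of the BIT buttons idea, the Python sketch below models which flags BIT produces. The button layout (A in bit 7, B in bit 6) is an assumption about how the controller byte was assembled, and the names `bit_flags`, `a_pressed` and `b_pressed` are made up for this example.

```python
# Model of the flags produced by the 6502 BIT instruction.
# Assumed layout of the controller byte: A button in bit 7, B button in bit 6.

BUTTON_A = 0x80
BUTTON_B = 0x40


def bit_flags(a, m):
    """Flags after BIT with accumulator `a` and memory operand `m`.

    N and V are copied straight from bits 7 and 6 of the operand, so they do
    not depend on A. Only Z depends on A (it reflects A AND operand).
    """
    return {"N": bool(m & 0x80), "V": bool(m & 0x40), "Z": (a & m) == 0}


def a_pressed(buttons, a=0x00):
    # On the CPU this is: BIT buttons / BMI pressed
    return bit_flags(a, buttons)["N"]


def b_pressed(buttons, a=0x00):
    # On the CPU this is: BIT buttons / BVS pressed
    return bit_flags(a, buttons)["V"]


print(a_pressed(BUTTON_A), b_pressed(BUTTON_A))              # A held only
print(a_pressed(BUTTON_A | BUTTON_B, a=0x37), b_pressed(BUTTON_B, a=0x37))
```

Because N and V ignore the accumulator, whatever happens to be in A at the time is left alone and does not affect the result, which is where the saving over LDA buttons / AND #mask comes from.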
FrankenGraphics | 104 Posted February 9, 2020 (edited)

Not that you'll ever use this unless you're crazy about doing signed math (and even then, the usage is probably still rare), but this is similar to the BIT skip trick in the OP. You can synthesize a set-overflow instruction with BIT addr, where addr must point to an rts instruction, i.e. a byte with the value $60 (BIT copies bit 6 of that byte into V).

Note how the 6502 instruction set has a clear-overflow instruction (clv), but no set-overflow (sev) instruction? For some reason, setting the overflow flag from outside depends on a high-to-low transition on the SO pin (maybe some math helper processor was supposed to interact with it, or maybe it was thought of as a way to poll for requests), but most computers, the NES/FC included, leave it unconnected or tied to a constant level.

To recap, the overflow flag is set if the result of a clc adc or sec sbc is outside the signed range, i.e. outside -128 to 127.

Edited February 9, 2020 by FrankenGraphics
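Both halves of that post can be checked with a short Python sketch: that an RTS opcode byte has bit 6 set (so BIT on it sets V), and that V after clc adc matches the "out of signed range" description. The helper names are invented for this example, and `adc` models binary mode only, not decimal mode.

```python
# Sketch of the two overflow facts above. Binary-mode arithmetic only.

RTS_OPCODE = 0x60  # %01100000: bit 6 is set


def bit_sets_v(m):
    """V after BIT on a byte `m`: a copy of bit 6 of that byte."""
    return bool(m & 0x40)


def adc(a, m, carry=0):
    """8-bit ADC. Returns (result, carry_out, overflow)."""
    total = a + m + carry
    result = total & 0xFF
    # V is set when both inputs share a sign and the result's sign differs.
    overflow = bool((a ^ result) & (m ^ result) & 0x80)
    return result, total > 0xFF, overflow


def signed(b):
    """Interpret an 8-bit value as two's complement."""
    return b - 256 if b & 0x80 else b


print(bit_sets_v(RTS_OPCODE))   # BIT on an RTS byte sets V
print(adc(0x7F, 0x01))          # 127 + 1 leaves the signed range
print(adc(0xFF, 0x01))          # -1 + 1 stays inside it
```

Any byte with bit 6 set would do as the BIT target; an RTS is simply convenient because nearly every program already contains one at a known address.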
FrankenGraphics | 104 Posted February 9, 2020 (edited)

On 1/31/2020 at 12:21 AM, Vectrex28 said:
    [...] For instance, LDA #$FF then STA $0200,x AXS #$04 is a great way to make a fast sprite clearing loop which saves up on a lot of DEX/INXes. [...]

This example is equivalent to 4 DEX. If someone reading this is using an upwards-counting OAMBuffer cleanout routine, it should be:

        lda #$ff
        sta $200,x
        axs #$fc    ; equivalent to +4 each iteration, since AXS means X = (A & X) - #

Edit: if the assembler you're using doesn't support undocumented opcodes, then .db $CB followed by the operand byte will do the trick.

Edited February 9, 2020 by FrankenGraphics
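The wraparound arithmetic behind axs #$fc can be checked with a small Python sketch. It assumes A holds $FF (as it does after the lda #$ff above) and a 256-byte OAM buffer of 64 sprites with 4 bytes each; `axs` and `oam_offsets` are names made up for this example.

```python
# Model of the undocumented AXS (also listed as SBX) immediate instruction:
#   X = (A AND X) - imm, with 8-bit wraparound.

def axs(a, x, imm):
    return ((a & x) - imm) & 0xFF


def oam_offsets(step_operand, start):
    """X values seen by 'sta $0200,x' over 64 passes, with A = $FF."""
    x, seen = start, []
    for _ in range(64):
        seen.append(x)
        x = axs(0xFF, x, step_operand)   # A = $FF, so A AND X is just X
    return seen


up = oam_offsets(0xFC, start=0x00)    # axs #$fc: subtracting $FC adds 4 mod 256
down = oam_offsets(0x04, start=0xFC)  # axs #$04: the same as four DEX
print(up[:4], up[-1])
print(down[:4], down[-1])
```

With A = $FF the AND is a no-op, which is why the instruction acts as a plain "add or subtract a constant from X" here. With any other value in A, the AND would also clear bits of X.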
Vectrex28 | 366 Posted February 19, 2020

On 2/9/2020 at 4:59 PM, FrankenGraphics said:
    This example is equivalent to 4 DEX. If someone reading this is using an upwards-counting OAMBuffer cleanout routine, it should be:
        lda #$ff
        sta $200,x
        axs #$fc    ; equivalent to +4 each iteration, since AXS means X = (A & X) - #
    Edit: if the assembler you're using doesn't support any undocumented opcodes, then .db $CB will do the trick.

I'm actually using downwards-counting routines because of a habit of optimising loops by doing

        DEX
        BPL .loop

instead of

        INX
        CPX #$XX
        BNE .loop

Obviously it only works on loops that take less than 128 iterations, but most loops shouldn't take that many, unless they're sprite clearing loops and the like.
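The iteration limit comes from BPL testing bit 7 of X after the DEX. A minimal Python sketch of that behaviour, assuming a loop of the shape LDX #start / .loop: body / DEX / BPL .loop (the loop shape and the function name are assumptions for illustration):

```python
# Counts how often the loop body runs for:
#       LDX #start
#   .loop:
#       ...body...
#       DEX
#       BPL .loop

def dex_bpl_iterations(start):
    x, count = start & 0xFF, 0
    while True:
        count += 1                 # the loop body runs once per pass
        x = (x - 1) & 0xFF         # DEX, with 8-bit wraparound
        if x & 0x80:               # N set: BPL is not taken, the loop ends
            return count


print(dex_bpl_iterations(0x03))   # indices 3, 2, 1, 0
print(dex_bpl_iterations(0x7F))   # indices 127 down to 0
print(dex_bpl_iterations(0x81))   # DEX gives $80, which is already negative
```

A start index of $7F walks all the way down to 0, while a start of $81 or more leaves the loop after a single pass, which is the failure mode to watch for when a table grows past the 128-entry region.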
TylerBarnes | 184 Posted August 21, 2020 Author (edited)

Found a cool optimization today. My first venture in successfully exploiting an illegal opcode. I used ASR ($4B). I know it's not too terribly exotic, but fun nonetheless. It uses the immediate addressing mode, ASR #imm. The instruction ANDs the immediate byte with A, and then shifts A one bit to the right. It affects N, Z, and C, but they are not relevant in the way I used it.

To start, it would help to know what code I started with and what it was doing. It is a part of my collision detection routine for a 4x30-byte map (1 bit per tile). To index into this and compare the bit position, you would use X/64+(Y/8)*4 to grab the relevant byte from the 1D array, and then use X/8 AND %0111 as an index into a table of masks to check the exact bit position within the selected byte. There is a section in this calculation dealing with the (Y/8)*4 that works out particularly well with the ASR opcode.

    ; Let's assume we have calculated X/64 and already put it in a variable called 'tmp'.
    ; Bits I care about are shown as capital letters, to follow their movement
    ; Known zeros are written
    ;--------------------------------------------
        ; [X/64 calculation here]
        STA tmp          ; X/64 in tmp
        TYA              ; %YYYYYyyy ; (Y/8)
        LSR              ; %0YYYYYyy
        LSR              ; %00YYYYYy
        LSR              ; %000YYYYY
        ASL              ; %00YYYYY0 ; *4
        ASL              ; %0YYYYY00
        ADC tmp          ; add both to complete X/64+(Y/8)*4
        TAY              ; Y is now the byte index into the main 4x30 map
    ;--------------------------------------------

In effect, I noticed that what we are ultimately left with is the 5 bits we care about shifted to the right once, with bits 0 and 1 cleared, turning %YYYYYyyy into %0YYYYY00. So a quicker way to do this would be:

        TYA              ; %YYYYYyyy ; (Y/8)*4
        AND #%11111000   ; %YYYYY000
        LSR              ; %0YYYYY00

However, ASR happened to conveniently do both of these for me at once. It ANDs A with the immediate byte and then shifts right once.
        TYA              ; %YYYYYyyy ; (Y/8)*4
        ASR #%11111000   ; %0YYYYY00
    ;----------------------------------------------

For those interested, I'm including the full routine here for context.

The code I originally wrote.

    ; In the bit patterns below, bits we care about are shown with capital letters; lowercase ones we are trying to get rid of
    ; Known zeros are written
    CheckCollide:        ; on entry, X and Y are preloaded with the xPos and yPos of the checked point
        TXA              ; %XXxxxxxx ; X/64
        LSR              ; %0XXxxxxx
        LSR              ; %00XXxxxx
        LSR              ; %000XXxxx
        LSR              ; %0000XXxx
        LSR              ; %00000XXx
        LSR              ; %000000XX
        STA tmp          ; %000000XX
        TYA              ; %YYYYYyyy ; (Y/8)
        LSR              ; %0YYYYYyy
        LSR              ; %00YYYYYy
        LSR              ; %000YYYYY
        ASL              ; %00YYYYY0 ; *4
        ASL              ; %0YYYYY00
        ADC tmp          ; %0YYYYYXX add both to complete X/64+(Y/8)*4
        TAY              ; Y is now the byte index into the main 4x30 map
        TXA              ; %xxXXXxxx ; X/8
        LSR              ; %0xxXXXxx
        LSR              ; %00xxXXXx
        LSR              ; %000xxXXX
        AND #%0111       ; %00000XXX
        TAX              ; X is the index into BitMask for the bit position within the byte
        LDA CollisionRAM, y   ; load the byte the player is in from the 4x30 map
        AND BitMask, x        ; test the exact bit position with the lookup table
        RTS              ; zero flag is clear if there is a collision, for a later branch

    BitMask:
        .db %10000000
        .db %01000000
        .db %00100000
        .db %00010000
        .db %00001000
        .db %00000100
        .db %00000010
        .db %00000001

The code after refactoring.
    ; In the bit patterns below, bits we care about are shown with capital letters; lowercase ones we are trying to get rid of
    ; Known zeros are written; 'c' is whatever the carry held on entry
    CheckCollide:        ; on entry, X and Y are preloaded with the xPos and yPos of the checked point
        TXA              ; %XXxxxxxx C=c ; X/64
        ROL              ; %Xxxxxxxc C=X
        ROL              ; %xxxxxxcX C=X
        ROL              ; %xxxxxcXX C=x
        AND #%00000011   ; %000000XX C=x (the incoming carry sits in bit 2 and is masked off)
        STA tmp
        TYA              ; %YYYYYyyy ; (Y/8)*4
        ASR #%11111000   ; %0YYYYY00 ; illegal opcode $4B (AND with the byte, then LSR); affects N,Z,C
        ADC tmp          ; %0YYYYYXX add both to complete X/64+(Y/8)*4
        TAY              ; Y is now the byte index into the main 4x30 map
        TXA              ; %xxXXXxxx ; X/8 (capital X are the bits we care about)
        LSR              ; %0xxXXXxx
        LSR              ; %00xxXXXx
        ASR #%00001110   ; %00000XXX
        TAX              ; X is the index into BitMask for the bit position within the byte
        LDA CollisionRAM, y   ; load the byte the player is in from the 4x30 map
        AND BitMask, x        ; test the exact bit position with the lookup table
        RTS              ; zero flag is clear if there is a collision, for a later branch

Edited September 10, 2020 by TylerBarnes
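One way to gain confidence in a refactoring like this is to brute-force it. The Python sketch below models the two index calculations and compares them for every X, Y and incoming carry. The function names are made up for this example, `asr` follows the description in the post (AND with the immediate, then shift right, with the shifted-out bit going to carry), and the ASR operands are the immediate masks %11111000 and %00001110.

```python
# Brute-force comparison of the original and refactored index calculations.

def rol(a, c):
    """ROL A: returns (new A, new carry)."""
    return ((a << 1) | c) & 0xFF, a >> 7


def asr(a, imm):
    """Undocumented ASR #imm: A = (A AND imm) >> 1, carry = bit shifted out."""
    t = a & imm
    return t >> 1, t & 1


def index_original(x, y):
    byte_index = (x >> 6) + ((y >> 3) << 2)   # X/64 + (Y/8)*4
    bit_index = (x >> 3) & 0x07               # X/8 AND %0111
    return byte_index, bit_index


def index_refactored(x, y, carry_in=0):
    a, c = x, carry_in
    for _ in range(3):                        # ROL, ROL, ROL
        a, c = rol(a, c)
    tmp = a & 0x03                            # AND #%00000011 masks off carry_in
    a, c = asr(y, 0xF8)                       # ASR #%11111000, leaves C clear
    byte_index = (a + tmp + c) & 0xFF         # ADC tmp
    a = x >> 2                                # LSR, LSR
    a, c = asr(a, 0x0E)                       # ASR #%00001110
    return byte_index, a


mismatches = [(x, y, c)
              for x in range(256) for y in range(256) for c in (0, 1)
              if index_refactored(x, y, c) != index_original(x, y)]
print("mismatches:", len(mismatches))
```

The model also makes one detail visible: the ADC needs no CLC because the first ASR clears bit 0 before shifting, so the carry it leaves behind is always 0.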