How do we debug stack overflows in MPLAB?

RadioT · Post by **RadioT** » Fri Oct 25, 2013 8:37 pm

Hi all,

We are building some software modules that call functions and can go deep on the stack with the right combination of interrupts. We are working to find the combination that is causing stack overflows.

It seems MPLAB doesn't show the stack for our 18F87J11 (or for any PIC18?). We also can't set breakpoints on it, for example "if the stack hits 25 deep, break", or even break on write.

We could work with the stack if we were able to run our code in the MPLAB simulator, but we need to access peripheral controllers, so that's out.

Any suggestions on this?

Cheers,

-Tom

Jerry Messina · Post by **Jerry Messina** » Sat Oct 26, 2013 12:37 pm

Stack issues are a real pain, and the debuggers don't help much.

Here's a little module that might help. It relies on setting the 'reset on stack error' config bit, and captures the stack contents into an array after the reset occurs. If the module detects a stack overflow, it'll automatically stop the debugger once it has captured the stack.

stackdebug.bas:


module stack_debug

// enable reset on stack fault config setting
config STVREN = ON

// the stack is 21-bits wide, so we use a 32-bit value for the stack() array
// if you have a device with < 64K of flash, you can cheat here and just use
// a 16-bit word (which saves ram). this should work ok unless you have a 
// REALLY serious stack error issue. if in doubt, then comment this out and
// just use 'type STACKENTRY_T = longword'
#if (_maxrom > $ffff) then
type STACKENTRY_T = longword
#else
type STACKENTRY_T = word
#endif

// stack fault bits (located in the STKPTR reg)
const STKUNF = 6,
      STKOVF = 7

// the pic18 stack has 32 entries, 0-31
const STACK_SIZE = 32

// set this to the beginning stack entry you wish to capture.
// normally this would be 0 to view the entire stack, but if you can't 
// spare that much ram then you can use this to skip over the beginning
// entries and just look at the later part of the stack. 
// for example, setting STARTING_ENTRY=20 will just capture the last 12
// entries (32-20=12) 
const STARTING_ENTRY = 0

// stack contents array
public dim stack(STACK_SIZE-STARTING_ENTRY) as STACKENTRY_T

// software breakpoint instruction
// this is undocumented... it assembles to an opcode of 0x00E0
// it will stop any of the hardware debuggers
public inline sub _trap()
    asm
        trap
        nop         // added for breakpoint skidding
    end asm
end sub


if (STKPTR.bits(STKOVF) = 1) then
    FSR0 = addressof(stack)         // get address of the stack array
    STKPTR = STARTING_ENTRY         // start at the beginning of the stack
    repeat
        POSTINC0 = TOSL             // read contents of the stack into the stack array
        POSTINC0 = TOSH
        if (sizeof(STACKENTRY_T) = sizeof(longword)) then
            POSTINC0 = TOSU
            POSTINC0 = 0
        endif
        STKPTR = STKPTR + 1
    until (STKPTR = 0)              // and loop until we've wrapped (32 entries)
    // the debugger will stop here
    // you can inspect the stack() array in the watch window
    _trap()
endif

// clear stack error flags
STKPTR.bits(STKOVF) = 0
STKPTR.bits(STKUNF) = 0

// and the stack cache
clear(stack)

end module

Here's a main program you can use to test it:


program stack_test

'device = 18F4520

// include this module before all others
// it MUST run before anything modifies the stack in order to capture
// the post-crash/reset info. also, make sure 'config STVREN=ON'
include "stackdebug.bas"

//
// generate a stack overflow. if everything is set correctly, this should
// cause a STVREN reset, and you should end up at the _trap() instruction
// located in stackdebug.bas, where you can inspect the stack() array using
// the debugger watch window
//
// note: the numbers here are dummies, and are ignored... we're not
// really pushing a value, just the current program counter
//
asm
    push 1
    push 2
    push 3
    push 4
    push 5
    push 6
    push 7
    push 8
    push 9
    push 10
    push 11
    push 12
    push 13
    push 14
    push 15
    push 16
    push 17
    push 18
    push 19
    push 20
    push 21
    push 22
    push 23
    push 24
    push 25
    push 26
    push 27
    push 28
    push 29
    push 30
    push 31
    push 32
    push 33
end asm

// you should never get here
while (true)
end while

end program

You have to make sure NOTHING modifies the stack/STKPTR before this module runs, and if you want to capture the whole stack it needs a fair amount of ram (32 entries x 4 bytes = 128 bytes with longword entries, or 64 bytes with word entries).
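For the original "if stack hits 25 deep, break" request, a software check along the same lines might do as a stopgap. This is only a sketch: `MAX_DEPTH` and `CheckStackDepth` are made-up names, it assumes the sub is placed in stackdebug.bas right after `_trap()` is declared, and it relies on the low five bits of STKPTR holding the current stack depth. Since `trap` is undocumented, it is probably best kept out of non-debug builds.

```basic
// hypothetical addition to stackdebug.bas, placed after _trap()
// MAX_DEPTH is an assumed threshold, pick whatever suits your code
const MAX_DEPTH = 25

// inline, so the check itself should not cost a stack level
public inline sub CheckStackDepth()
    // bits 4:0 of STKPTR hold the current stack depth
    if ((STKPTR and $1F) >= MAX_DEPTH) then
        _trap()         // stop the hardware debugger right here
    endif
end sub
```

Calling `CheckStackDepth()` at the top of the routines you suspect (ISRs especially) should halt the debugger at the point where the depth is reached, rather than after the reset.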

RadioT · Post by **RadioT** » Sat Oct 26, 2013 4:22 pm

Nice! Thanks, Jerry!

What we did last night was run a number of PUSH commands in assembler to build the stack up, then ran the rest of our application as a function call. Thus we started with a stack of say, 20, then simply watched to see where it failed.


#if debug_smallStack

	sub stackFillThenRunApp()

		while Sys.currentStackHeight < StartingStackSize - 1
			ASM
				PUSH
			end ASM
		wEnd

		while true
			runApplication()
		wend

	end sub

#endif

Then we watched the stack depth by pausing at critical points. We discovered that if we had multiple interrupts being serviced at the same time, it was possible to overflow the hardware stack. Thus we put in a condition where EUSART1 cannot be serviced while EUSART2 is being serviced. This prevented the stack going an extra 7 layers deep.
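Pausing at critical points could probably be automated with a high-water mark. A rough sketch, assuming the low five bits of STKPTR give the current depth; `MaxStackDepth` and `NoteStackDepth` are hypothetical names and not part of our application or of stackdebug.bas:

```basic
// hypothetical high-water mark for the hardware stack
public dim MaxStackDepth as byte

// inline, so taking the measurement should not add a stack level
public inline sub NoteStackDepth()
    // keep the deepest value seen so far (bits 4:0 of STKPTR = depth)
    if ((STKPTR and $1F) > MaxStackDepth) then
        MaxStackDepth = STKPTR and $1F
    endif
end sub

// module initialisation: variables are not cleared for you, so start at zero
MaxStackDepth = 0
```

The idea would be to call `NoteStackDepth()` at the start of each ISR and in the deepest subroutines, then read `MaxStackDepth` in the watch window after a long run. If an interrupt lands between the compare and the store, a larger value written by the ISR can be overwritten, so the result is best treated as approximate.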

However, the code you posted gives us a much better diagnostic tool, where we can now see exactly where the calls originated. In comparison, what we were doing was the civil engineering equivalent of testing a new bridge's maximum load by driving heavier and heavier trucks over it to see what weight causes a collapse!

Thanks,

-Tom

Jerry Messina · Post by **Jerry Messina** » Sat Oct 26, 2013 10:18 pm

RadioT wrote: "We discovered that if we had multiple interrupts being serviced at the same time, it was possible to overflow the hardware stack"

Good detective work! Shame the hardware tools don't help much with this sort of thing. I know you can get the RealIce (and I think the ICD3) to stop on a stack overflow, but you STILL can't view the stack contents.

This points out one of the pitfalls with using multiple interrupt priorities, and why it's important to always do as little as possible in the ISR. Add multi-level interrupts to the mix and it can use up a lot of resources quickly.

Of course, some folks take that to the extreme and recommend that you "just set a flag in the ISR" and handle everything in the main loop. If you're going to do that then there's no point in using the interrupt in the first place! I don't think they understand that in doing that they just turned an interrupt-driven system into a completely polled one... a complete waste of time with a lot more effort.

Anyway, glad you got it sorted out.