Best way to optimise a string array search?

Coding and general discussion relating to the compiler

Moderators: David Barker, Jerry Messina

SHughes_Fusion
Posts: 219
Joined: Wed Sep 11, 2013 1:27 pm
Location: Chesterfield

Best way to optimise a string array search?

Post by SHughes_Fusion » Mon Jun 02, 2014 11:29 am

As I've posted elsewhere on here, I'm working on a project that includes an XML parser. To do this I've set up a const array of all 'valid' XML tags, plus matching arrays holding what I term the tag 'type' and tag 'ID' - just bytes corresponding to aliases I've set up.

So, for example, the tag <Ping> has a type of XML_Command and an ID of XML_Ping in my code. The 'type' allows my code to fork to the appropriate code to handle that tag and the 'ID' allows that code to process the tag in more depth.

The way I've done this is fairly powerful as you only need one list of valid tags and one 'Extract' command to get the next tag from this list. However, it does mean that the tag matching involves matching a string against an array of around 40 strings. (Not all tags are valid in all contexts)

I'd imagine this has potential to be quite inefficient so I was wondering if anyone had any experience of the most efficient way to do a comparison against a string array?

Also, is there an explanation available of how Swordfish compares strings? That would also help me optimise the code.

I should say I don't believe there is an issue, I just always like to make my code as efficient as possible!

RangerBob
Posts: 152
Joined: Thu May 31, 2007 8:52 am
Location: Beds, UK

Re: Best way to optimise a string array search?

Post by RangerBob » Mon Jun 02, 2014 1:14 pm

Personally I've only ever really used the "Compare" function in the built-in string library. Not reinventing the wheel and all that.

You can have a look in the library to see how it works: it uses the FSRs and POSTINC registers. Seems pretty darn efficient to me.

Regards,

Rangerbob

Jerry Messina
Swordfish Developer
Posts: 1473
Joined: Fri Jan 30, 2009 6:27 pm
Location: US

Re: Best way to optimise a string array search?

Post by Jerry Messina » Mon Jun 02, 2014 5:45 pm

For a short list I usually just use a direct comparison, i.e. an "if string_var = const_string_list(i)" statement.

String comparisons are done on a character-by-character basis and terminate when a mismatch (or end of string) is found. As long as the items in the const string_list() are fairly unique this isn't too bad... most of the time is spent calculating the string array addresses.
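That character-by-character comparison with early exit can be sketched in C. This is a minimal illustration of the technique, not the compiler's actual code, and the tag names are hypothetical placeholders:

```c
#include <string.h>

/* Hypothetical tag list -- illustrative names, not from the actual project */
static const char *tag_list[] = { "Ping", "Status", "Timer1", "Timer2" };
#define TAG_COUNT ((int)(sizeof tag_list / sizeof tag_list[0]))

/* Linear search: each strcmp stops at the first mismatching character,
   so reasonably unique leading characters keep the average cost low. */
int find_tag(const char *s)
{
    for (int i = 0; i < TAG_COUNT; i++)
        if (strcmp(s, tag_list[i]) == 0)
            return i;   /* index of the matching tag */
    return -1;          /* not found */
}
```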

The string library Compare() function is some 2x-3x slower than this, since it makes copies of the strings prior to the comparison. It usually ends up using more code, too.

For long lists, instead of using the strings themselves I use a hash function on the input string and just compare the resulting hash value against a const table of precomputed entries. This typically saves code since you don't need the actual strings, and tends to be faster as long as you pick a hashing function that's not too taxing. Many times I'll use a version of the 16-bit Fletcher checksum, since it's dead easy to compute versus something like a full-blown CRC.
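The scheme can be sketched in C: hash the input once, then compare a single 16-bit value per table entry instead of a whole string. This is an assumption-laden sketch (the byte-wraparound Fletcher variant and the table contents are illustrative, not the poster's actual code):

```c
#include <stdint.h>

/* Fletcher-style 16-bit hash with byte wraparound (mod 256): two running
   8-bit sums packed into one word. */
uint16_t fletcher_hash(const char *s)
{
    uint8_t sum = 0, total = 0;
    while (*s) {
        sum   = (uint8_t)(sum + (uint8_t)*s++);
        total = (uint8_t)(total + sum);
    }
    return (uint16_t)((total << 8) | sum);
}

/* Lookup: compare the input's hash against a table of precomputed values.
   In practice the table would be generated from the real tag list. */
int find_tag_by_hash(uint16_t h, const uint16_t *table, int n)
{
    for (int i = 0; i < n; i++)
        if (table[i] == h)
            return i;
    return -1;
}
```

Each table entry costs one word compare rather than a character loop, and the tag strings themselves never need to be stored.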

Are you trying to optimize size or speed?

SHughes_Fusion
Posts: 219
Joined: Wed Sep 11, 2013 1:27 pm
Location: Chesterfield

Re: Best way to optimise a string array search?

Post by SHughes_Fusion » Tue Jun 03, 2014 10:43 am

Interesting idea, Jerry. I'm mainly thinking about optimising the speed of matching rather than code size.

My thinking is that even the shortest XML message is going to have three tags - an opening and closing root tag, plus I allow some commands to be self-closing (i.e. with the / after the tag, as opposed to needing matched pairs). However, I also support multiple commands per message, and some of the more involved commands have child tags and so on. I'd say the average message probably contains 5 or 6 tags, and I have to compare against a list of 40+ valid tags. Certain messages can contain a dozen tags; in theory, if the app was to set all system parameters in one message it could contain 22 tags (including opening and closing tags, of course).

Does your system produce unique hash values, or does it just help you narrow down and you still need to do a full compare?

Unfortunately some of my tags are very similar - for example, I have three timers on the system. I set the timers or query the value that has been set using xxxTimer and query the current timer value using xxxTime which I would imagine will slow down a string comparison quite a bit.

When it comes to it, there is no actual reason why the tags need to be in plain text - they are used by a smartphone app to talk to my hardware via Bluetooth - but it does make coding and testing easier.

I've been doing a bit of research and this is a very involved subject! I tried a simple additive hash in Excel and only got one clash so it may be easier to just do that and adjust tag names to avoid clashes. I'm currently expanding my spreadsheet to include a Fletcher implementation to see if that helps.

Jerry Messina
Swordfish Developer
Posts: 1473
Joined: Fri Jan 30, 2009 6:27 pm
Location: US

Re: Best way to optimise a string array search?

Post by Jerry Messina » Tue Jun 03, 2014 11:39 am

It's possible for two random strings to produce the same hash result, but it's not likely. The Fletcher algorithm is an additive one, but it keeps two sums, which makes it much better at generating unique results than a simple xsum. Just treat the two 8-bit sums as a single 16-bit value.

Here's the routine I use to generate the const values (I just run it in the MPLAB simulator). Try it out if you like.

Code:

program hash_mod

//
//-----------------------------------------------------------------------------
// Fletcher xsum (16-bit)
//-----------------------------------------------------------------------------
// Fletcher's checksum is described in RFC 1146
dim crc16 as word

dim
    sum as crc16.bytes(0),
    total as crc16.bytes(1)

public inline sub fletcher_xsum_init()
    crc16 = 0
end sub

public inline sub fletcher_xsum(b as WREG)
    inc(sum, b)
    inc(total, sum)
end sub

// get the crc
public inline function fletcher_xsum() as crc16
    // result = crc16
end function

//
//-----------------------------------------------------------------------------
// token hasher
//-----------------------------------------------------------------------------
//
public function _hash_token(s as FSR0) as word
    fletcher_xsum_init()

    while (INDF0 <> 0)
        fletcher_xsum(POSTINC0)
    end while

    result = fletcher_xsum()
end function

public inline function hash_token(byref src as string) as word
    result = _hash_token(@src)
end function

public inline function hash_token(byref src() as byte) as word
    result = _hash_token(@src)
end function

//
//-----------------------------------------------------------------------------
// main
//-----------------------------------------------------------------------------
//
dim s as string,
    hash as word

s = "hashme"
hash = hash_token(s)

end program

SHughes_Fusion
Posts: 219
Joined: Wed Sep 11, 2013 1:27 pm
Location: Chesterfield

Re: Best way to optimise a string array search?

Post by SHughes_Fusion » Tue Jun 03, 2014 1:18 pm

Hi Jerry,

I've implemented that in Excel with all the tag names I've defined so far and I don't get any clashes.

I had come up with what seemed a simpler option, though it would require one tag name to be changed - just adding the character position as well as the ASCII value, which helped differentiate based on string length as well.

However, looking at your implementation, I think it looks pretty much as quick to implement but far more robust in terms of the uniqueness of the results. One thing I'm not sure of: the descriptions I've found online (Wikipedia, for example) suggest that 256 is a bad choice for the modulo, recommending 255 instead. Your implementation will of course be far quicker on an 8-bit PIC, and it seems to give good hash values for my current data set, but I wonder if you've found any problems with this approach?

It also isn't clear at what point you're supposed to apply the modulo - some pages I've read just sum the values and do the modulo at the end, while others do it as they go, as you do (by implication of using byte variables).
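For reference, the textbook mod-255 form looks like this in C. This is a sketch of the standard algorithm as commonly described, not the code from this thread; note that because modular reduction distributes over addition, the reductions can also be deferred and applied once at the end, as long as the accumulators are wide enough not to overflow in between:

```c
#include <stdint.h>

/* Standard Fletcher-16: both sums reduced modulo 255 rather than relying on
   8-bit overflow (modulo 256).  Mod 255 folds carries out of the top bit
   back into the sum, which is why it's usually recommended over mod 256. */
uint16_t fletcher16_mod255(const uint8_t *data, int len)
{
    uint16_t sum = 0, total = 0;
    for (int i = 0; i < len; i++) {
        sum   = (sum + data[i]) % 255;
        total = (total + sum) % 255;
    }
    return (uint16_t)((total << 8) | sum);
}
```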

I guess I now need to look at finding a PC-based way of taking a table of tags and creating my hash table!

Thanks again.

SHughes_Fusion
Posts: 219
Joined: Wed Sep 11, 2013 1:27 pm
Location: Chesterfield

Re: Best way to optimise a string array search?

Post by SHughes_Fusion » Tue Jun 03, 2014 1:27 pm

Just had another thought.... How does Swordfish perform when comparing a value in RAM to one stored as a Const in code?

What I'm thinking is that I was planning to create the hash table and insert it into the code before compilation.

Another option would be to create it on the fly each time my system starts up. I have plenty of RAM spare so that isn't a concern.

If comparing against a table in RAM would be quicker than against one stored as Consts, then I'll follow that path. If there's no difference, then I might as well save code space and generate the table before compiling.

However, I have no idea how Swordfish goes about accessing data stored as a Const, whether it needs any extra instruction cycles etc.

Any advice?

Jerry Messina
Swordfish Developer
Posts: 1473
Joined: Fri Jan 30, 2009 6:27 pm
Location: US

Re: Best way to optimise a string array search?

Post by Jerry Messina » Tue Jun 03, 2014 2:08 pm

RE: Fletcher xsum
You're right, it's not the best implementation (or even a proper one), but I wanted something small, fast, and simple.
It does a pretty decent job on text strings - better than most of the simple xsum routines I looked at, which couldn't tell the difference between "ABE" and "ACD", for example.
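That collision is easy to demonstrate: "ABE" and "ACD" have the same byte total (65+66+69 = 65+67+68 = 200), so any order-insensitive additive checksum maps them to the same value, while the second Fletcher sum weights earlier bytes more heavily and separates them. A small C sketch of both (the byte-wraparound Fletcher variant here mirrors the Swordfish routine above):

```c
#include <stdint.h>

/* Plain additive checksum: order-insensitive, so different strings with
   the same byte total collide. */
uint8_t add_xsum(const char *s)
{
    uint8_t sum = 0;
    while (*s)
        sum = (uint8_t)(sum + (uint8_t)*s++);
    return sum;
}

/* Fletcher-style pair of sums: the running second sum effectively weights
   each byte by how early it appears, making the result order-sensitive. */
uint16_t fletcher_pair(const char *s)
{
    uint8_t sum = 0, total = 0;
    while (*s) {
        sum   = (uint8_t)(sum + (uint8_t)*s++);
        total = (uint8_t)(total + sum);
    }
    return (uint16_t)((total << 8) | sum);
}
```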


>>How does Swordfish perform when comparing a value in RAM to one stored as a Const in code?

The PIC18 is pretty limited when it comes to reading code space. It's a lot more flexible when it comes to tables in RAM, where if you really want you can optimize things using the FSR registers and the indirect access "registers"/instructions like POSTINC, POSTDEC, etc. There are some equivalent operations you can do using the TABLEPTR and code space like

Code:

public inline function TBLRD_() as TABLAT
    asm
        TBLRD*
    end asm
end function

public inline function TBLRD_POSTINC() as TABLAT
    asm
        TBLRD*+
    end asm
end function

public inline function TBLRD_POSTDEC() as TABLAT
    asm
        TBLRD*-
    end asm
end function

public inline function TBLRD_PREINC() as TABLAT
    asm
        TBLRD+*
    end asm
end function
but the FSR/RAM versions usually end up faster, at least for me.

Whether it's worth the bother/resources is up to you... you're really getting into pretty low-level stuff and RAM space is usually a pretty scarce thing in the larger programs I work with. I'd suggest trying some of the simpler ways first and see if that meets the mark. Sometimes it's easy to spend way too much time and effort optimizing every last cycle out of things only to find it has minimal effect.

SHughes_Fusion
Posts: 219
Joined: Wed Sep 11, 2013 1:27 pm
Location: Chesterfield

Re: Best way to optimise a string array search?

Post by SHughes_Fusion » Tue Jun 03, 2014 2:28 pm

To be honest, it probably takes no more coding time to do it in RAM - I can more easily write code to convert a const array into a table of hash values than I can generate the table outside of Swordfish. (I'm out of touch with PC coding, unfortunately...)
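A run-time build of the table might look like this in C. The names and tag list are hypothetical, and on the PIC the const list would live in code space while the hash table sits in RAM:

```c
#include <stdint.h>

#define NTAGS 4
/* Hypothetical const tag list (code space on the PIC) */
static const char *const tags[NTAGS] = { "Ping", "Timer1", "Timer2", "Status" };

/* RAM copy of the hash table, filled once at startup */
static uint16_t tag_hash[NTAGS];

/* Same byte-wraparound Fletcher variant as the Swordfish routine above */
static uint16_t hash_token(const char *s)
{
    uint8_t sum = 0, total = 0;
    while (*s) {
        sum   = (uint8_t)(sum + (uint8_t)*s++);
        total = (uint8_t)(total + sum);
    }
    return (uint16_t)((total << 8) | sum);
}

/* Run once at init: hash every const tag into the RAM table so that
   later lookups only touch RAM. */
void build_tag_table(void)
{
    for (int i = 0; i < NTAGS; i++)
        tag_hash[i] = hash_token(tags[i]);
}
```

The const strings are read only once at startup, so any extra cost of code-space access doesn't affect per-message lookup speed.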

I'm happy I have plenty of RAM spare - currently running at around 2.5k free - so I think I'll just aim at holding the table in RAM anyway, whether I create it externally or at run-time.

Thanks for the advice, always good to find a neat solution to a problem.
