Motivation

String equality tests can be slow, but using integer values to enumerate names for things (e.g., labels, tags, enum and event names, keywords in an input deck, etc.) can prevent plugins from extending the set of names with new values and can also lead to errors when values are written to files that must be maintained even as the set of valid values is modified.

To ameliorate the situation, SMTK provides some utility classes for string tokenization. Treating some set of fixed strings as tokens allows compact storage (the token ID can be stored in place of a potentially long string) and easy extension (since files can contain the mapping of token IDs to strings and those mappings can be reconciled at runtime).

Concepts

SMTK provides a string Token class to represent a string with an integer ID; a token can be constructed from a std::string or from a string literal like so:

#include "smtk/string/Token.h"
using smtk::string::Token;
using namespace smtk::string::literals; // for ""_token

Token a = "foo";
Token b = "bar"_token;
Token c = std::string("baz");

A token can provide its source string if it was constructed from a string to be hashed, but may throw a std::invalid_argument exception if the token was constructed by passing the hash code directly (i.e., with smtk::string::fromHash()). So, in the example above, a.data() and c.data() will return “foo” and “baz”, respectively, but b.data() will throw an exception unless “bar” was added to the string-token manager elsewhere.

Equality and inequality comparisons (== and !=) are done by comparing integer token identifiers (hash codes) and are thus fast. Ordering comparisons (less-than and greater-than) resort to string-value comparisons and thus may be slow for large strings with identical prefixes.

std::cout << "a is \"" << a.data() << "\" with id " << a.id() << "\n";
// prints: a is "foo" with id 9631199822919835226
// b.data() succeeds here only if "bar" was added to the string manager elsewhere.
std::cout << "b is \"" << b.data() << "\" with id " << b.id() << "\n";
// prints: b is "bar" with id 11474628671133349555

a == b; // This is fast since it only compares token IDs.
a != b; // This is fast since it only compares token IDs.
a < b; // This is slow since it compares underlying strings.

As noted in the example above, less-than and greater-than operators are slow because they compare the underlying strings. This preserves lexicographic ordering when you store tokens in ordered containers.
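The trade-off between fast identity tests and slow ordering tests can be illustrated with a minimal, self-contained sketch. MiniManager and MiniToken below are hypothetical stand-ins written for this example, not SMTK's classes; the sketch assumes std::hash as the hash function and ignores hash collisions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a string manager: maps token IDs back to strings.
// Hash collisions are ignored to keep the sketch short.
class MiniManager
{
public:
  static MiniManager& instance()
  {
    static MiniManager manager;
    return manager;
  }

  // Hash the string, remember it, and return its ID.
  std::size_t manage(const std::string& text)
  {
    std::size_t id = std::hash<std::string>{}(text);
    m_data[id] = text;
    return id;
  }

  // Return the string for an ID, or throw if the ID was never registered.
  const std::string& value(std::size_t id) const
  {
    auto it = m_data.find(id);
    if (it == m_data.end())
    {
      throw std::invalid_argument("No string is registered for this token ID.");
    }
    return it->second;
  }

private:
  std::unordered_map<std::size_t, std::string> m_data;
};

// Hypothetical token that stores only an integer ID.
class MiniToken
{
public:
  MiniToken(const std::string& text)
    : m_id(MiniManager::instance().manage(text))
  {
  }

  // Build a token from a bare ID without registering any string.
  static MiniToken fromHash(std::size_t id)
  {
    MiniToken token;
    token.m_id = id;
    return token;
  }

  std::size_t id() const { return m_id; }
  const std::string& data() const { return MiniManager::instance().value(m_id); }

  // Fast: integer comparisons only.
  bool operator==(const MiniToken& other) const { return m_id == other.m_id; }
  bool operator!=(const MiniToken& other) const { return m_id != other.m_id; }
  // Slow: compares the underlying strings so ordered containers sort by name.
  bool operator<(const MiniToken& other) const { return this->data() < other.data(); }

private:
  MiniToken() = default;
  std::size_t m_id = 0;
};

// Usage sketch: identity tests use IDs; ordering uses the strings.
inline void example()
{
  MiniToken a("foo");
  MiniToken b("bar");
  assert(a != b); // compares IDs only
  assert(b < a);  // compares "bar" with "foo"
  std::set<MiniToken> names{ a, b, MiniToken("baz") };
  assert(names.begin()->data() == "bar"); // set iterates in lexicographic order
}
```

Because operator< looks up both strings, a std::set of such tokens iterates in name order rather than in hash order, at the cost of a dictionary lookup and a string comparison per test. A token made with fromHash() for an unregistered ID still supports == and !=, but its data() call throws.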

Switch statements

As of SMTK 22.10, string tokens may be constructed via a constexpr literal operator named _hash. This makes it possible for you to use switch statements for string tokens, like so:

using namespace smtk::string::literals; // for ""_hash
smtk::string::Token car;
int hp; // horsepower
switch (car.id())
{
  case "camaro"_hash: hp = 90; break;
  case "mustang"_hash: hp = 86; break;
  case "super beetle"_hash: hp = 48; break;
  default: hp = -1; break;
}
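To see why a constexpr literal operator is what makes the switch statement above legal, here is a self-contained sketch that does not depend on SMTK. It assumes 64-bit FNV-1a as the hash (this document does not name SMTK's algorithm), requires C++14 or later, and uses a hypothetical suffix, _demo_hash, in a hypothetical namespace.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

namespace demo
{
// 64-bit FNV-1a, usable at compile time (C++14 or later).
// The choice of FNV-1a is an assumption made for this sketch.
constexpr std::uint64_t fnv1a(const char* text, std::size_t size)
{
  std::uint64_t hash = 14695981039346656037ull; // FNV offset basis
  for (std::size_t ii = 0; ii < size; ++ii)
  {
    hash ^= static_cast<unsigned char>(text[ii]);
    hash *= 1099511628211ull; // FNV prime
  }
  return hash;
}

// Hypothetical literal operator: "name"_demo_hash is a compile-time constant.
constexpr std::uint64_t operator""_demo_hash(const char* text, std::size_t size)
{
  return fnv1a(text, size);
}

// The argument is hashed at run time; the case labels are hashed at compile time.
inline int horsepower(const std::string& car)
{
  switch (fnv1a(car.data(), car.size()))
  {
    case "camaro"_demo_hash: return 90;
    case "mustang"_demo_hash: return 86;
    case "super beetle"_demo_hash: return 48;
    default: return -1;
  }
}

// Usage sketch.
inline void example()
{
  assert(horsepower("camaro") == 90);
  assert(horsepower("pinto") == -1); // unknown names fall through to default
}
} // namespace demo
```

Case labels must be integral constant expressions, which is why the literal operator has to be constexpr. A side benefit is that two labels hashing to the same value would be rejected by the compiler as duplicate cases rather than silently misbehaving at run time.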

String source data

Token IDs and their corresponding source strings are stored in a class-static dictionary called the string Manager, as mentioned above. This dictionary is what allows Tokens to return the original string data while only holding a token ID.

The dictionary can be serialized to and deserialized from JSON. Individual token instances are serialized simply as their integer identifier. Because platforms may tokenize strings differently, the dictionary provides a “fallback map” constructed during deserialization to translate hash codes from other platforms.
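One way such a fallback map could work is sketched below. This is an illustration under stated assumptions, not SMTK's implementation: MiniDictionary is a hypothetical type, a std::map stands in for the parsed JSON dictionary, and std::hash stands in for the local platform's hash function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical dictionary that re-hashes strings as they are read back in.
struct MiniDictionary
{
  std::unordered_map<std::size_t, std::string> strings;    // local ID -> string
  std::unordered_map<std::uint64_t, std::size_t> fallback; // serialized ID -> local ID

  // "serialized" stands in for the ID-to-string mapping parsed from a file.
  void load(const std::map<std::uint64_t, std::string>& serialized)
  {
    for (const auto& entry : serialized)
    {
      // Recompute the ID with this platform's hash function.
      std::size_t localId = std::hash<std::string>{}(entry.second);
      strings[localId] = entry.second;
      if (static_cast<std::uint64_t>(localId) != entry.first)
      {
        // The writer hashed this string differently; remember the translation.
        fallback[entry.first] = localId;
      }
    }
  }

  // Translate a token ID read from a file into a local token ID.
  // IDs with no fallback entry are assumed to already be local IDs.
  std::size_t translate(std::uint64_t serializedId) const
  {
    auto it = fallback.find(serializedId);
    return it == fallback.end() ? static_cast<std::size_t>(serializedId) : it->second;
  }
};

// Usage sketch: IDs 1 and 2 play the role of hash codes from another platform.
inline void example()
{
  MiniDictionary dictionary;
  dictionary.load({ { 1u, "foo" }, { 2u, "bar" } });
  assert(dictionary.strings.at(dictionary.translate(1u)) == "foo");
  assert(dictionary.strings.at(dictionary.translate(2u)) == "bar");
}
```

Tokens stored elsewhere in the file as bare integers can then be passed through translate() after the dictionary has been loaded, so the rest of the program only ever sees local IDs.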

Token hashing algorithm

The hash algorithm generates hashes of type std::size_t but currently supports only 32- and 64-bit platforms. Because the string manager uses a serialization helper to translate serialized hash values (this was previously required since std::hash<> implementations varied), tokens serialized on a 32-bit platform can be read on a 64-bit platform without problems. However, reading 64-bit hashes on a 32-bit platform is not currently supported; it may be in a future release, but we do not foresee a need for it.