Discussion:
What is LC's internal text format?
Ben Rubinstein via use-livecode
2018-11-12 22:35:39 UTC
This is something that I've been wondering about for a while.

My unexamined assumption had been that in the 'new' fully unicode LC, text was
held in UTF-8. However when I saved some text strings in binary I got
something like UTF-8 - but not quite. And the recent experiments with offset
suggested that LC at the least is able to distinguish a string which is fully
representable as single-byte (or perhaps ASCII?) from one which is not. And the reports of
the ingenious investigators using UTF-32 to speed up offsets, and discovering
that offset somehow managed to be case-insensitive in this case, made me
wonder whether after using textEncode(xt, "UTF-32") LC marks the string in
some way to give a clue about how to interpret it as text?

So could someone who is familiar with this bit of the engine enlighten us? In
particular:
- What is the internal format?
- Is it different on different platforms?
- Given that it appears to include a flag to indicate whether it is
single-byte text or not, are there any other attributes?
- Does saving a string in 'binary' file faithfully report the internal format?

TIA,

Ben
Monte Goulding via use-livecode
2018-11-12 23:50:00 UTC
Text strings in LiveCode are native encoded (MacRoman or ISO 8859) where possible, and where you don’t explicitly tell the engine it’s unicode (via textDecode), so that they can follow faster single-byte code paths. If you use textDecode then the engine will first check if the text can be native encoded and use native if so; otherwise it will use UTF-16 encoding.

For what it’s worth using `offset` is the wrong thing to do if you have textEncoded your strings into binary data. You want to use `byteOffset` otherwise the engine will convert your data to a string and assume native encoding. This is probably why you are getting some case insensitivity.
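For example, a minimal sketch of the difference (my own illustration, not from the thread; byte positions assume UTF-16LE with no BOM):

```livecode
on mouseUp
   put textEncode("hello world", "UTF-16LE") into tData
   put textEncode("world", "UTF-16LE") into tNeedle
   -- byteOffset searches the raw bytes: "world" starts at character 7,
   -- so its first byte should be at position 2 * 6 + 1 = 13
   put byteOffset(tNeedle, tData)
end mouseUp
```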

I haven’t been following along the offset discussion. I’ll have to take a look to see if there were some speed comparisons between offset and codepointOffset.

Cheers

Monte
_______________________________________________
use-livecode mailing list
http://lists.runrev.com/mailman/listinfo/use-livecode
Geoff Canyon via use-livecode
2018-11-13 06:15:06 UTC
Post by Monte Goulding via use-livecode
Text strings in LiveCode are native encoded (MacRoman or ISO 8859) where
possible and where you don’t explicitly tell the engine
For what it’s worth using `offset` is the wrong thing to do if you have
textEncoded your strings into binary data. You want to use `byteOffset`
otherwise the engine will convert your data to a string and assume native
encoding. This is probably why you are getting some case insensitivity.
Unless I'm misunderstanding, this hasn't been my observation. Using offset
on a string that has been textEncode()'d to UTF-32 returns values that are
4 * (the character offset - 1) + 1 -- if it were re-encoded, wouldn't it
return the actual offsets (except when it fails)? Also, 𐀁 encodes to
00010001, and routines that convert to UTF-32 and then use offset will find
five instances of that character in the UTF-32 encoding because of improper
boundaries. To see this, run this code:

on mouseUp
put textencode("𐀁","UTF-32") into X
put textencode("𐀁𐀁𐀁","UTF-32") into Y
put offset(X,Y,1)
end mouseUp

That will return 2, meaning that it found the encoding for X starting at
character 2 + 1 = 3 of Y. In other words, it found X using the last half of
the first "𐀁" and the first half of the second "𐀁"
Mark Waddingham via use-livecode
2018-11-13 06:21:35 UTC
The textEncode function generates binary data which is composed of
bytes. When you use binary data in a text function (which offset is),
the engine uses a compatibility conversion which treats the sequence of
bytes as a sequence of native characters (this preserves what happened
pre-7.0 when strings were only ever native, and as such binary and
string were essentially the same thing).

So if you textEncode a 1 (native) character string as UTF-32, you will
get a four byte string, which will then turn back into a 4 (native)
character string when passed to offset.
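A quick sketch of that conversion (assuming the host is little-endian, so "UTF-32" here means UTF-32LE):

```livecode
on mouseUp
   put textEncode("abc", "UTF-32") into tData -- 12 bytes: <97,0,0,0,98,0,0,0,99,0,0,0>
   -- offset is a text function, so tData is first converted to a 12
   -- (native) character string, NUL bytes included; "b" lands at position 5
   put offset("b", tData)
end mouseUp
```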

Warmest Regards,

Mark.
--
Mark Waddingham ~ ***@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
Geoff Canyon via use-livecode
2018-11-13 07:35:58 UTC
So then why does put textEncode("a","UTF-32") into X;put chartonum(byte 1
of X) put 97? That implies that "byte" 1 is "a", not 1100001. Likewise, put
textEncode("㍁","UTF-32") into X;put chartonum(byte 1 of X) puts 65.

I've looked in the dictionary and I don't see anything that comes close to
describing this.

gc

Mark Waddingham via use-livecode
2018-11-13 08:03:07 UTC
Post by Geoff Canyon via use-livecode
So then why does put textEncode("a","UTF-32") into X;put chartonum(byte 1
of X) put 97?
Because:

1) textEncode("a", "UTF-32") produces the byte sequence <97,0,0,0>
2) byte 1 of <97,0,0,0> is <97>
3) charToNum(<97>) first converts the byte <97> into a native string
which is "a" (as the 97 is the code for 'a' in the native encoding
table), then converts that (native) char to a number -> 97
Post by Geoff Canyon via use-livecode
That implies that "byte" 1 is "a", not 1100001.
1100001 is 97 but printed in base-2.

FWIW, I think you are confusing 'binary string' with 'binary number' -
these are not the same thing.

A 'binary string' (internally the data type is 'Data') is a sequence of
bytes (just as a 'string' is a sequence of
characters/codepoints/codeunits).

A 'binary number' is a number which has been rendered to a string with
base-2.

Bytes are like characters (and codepoints, and codeunits) in that they
are 'abstract' things - they aren't numbers, and have no direct
conversion to them - which is why we have byteToNum, numToByte,
nativeCharToNum, numToNativeChar, codepointToNum and numToCodepoint.

The charToNum and numToChar functions are actually deprecated /
considered legacy - as their function (when useUnicode is set to true)
depends on processing unicode text as binary data - which isn't how
unicode works post-7 (indeed, there was no way to fold their behavior
into the new model - hence the deprecation, and replacement with
nativeCharToNum / numToNativeChar).

You'll notice that there is no modern 'charToNum'/'numToChar' - just
'codepointToNum'/'numToCodepoint'. A codepoint is an index into the
(large - 21-bit) Unicode code table; Unicode characters can be composed
of multiple codepoints (e.g. [e, combining-acute]) and thus don't have a
'number' per se.
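A small sketch of the modern functions side by side:

```livecode
on mouseUp
   put nativeCharToNum("a") into tCode    -- 97, an index into the native table
   put numToCodepoint(960) into tPi       -- U+03C0, GREEK SMALL LETTER PI
   put codepointToNum(tPi) = 960          -- round-trips cleanly: true
end mouseUp
```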

Warmest Regards,

Mark.
Geoff Canyon via use-livecode
2018-11-13 10:06:48 UTC
I don't *think* I'm confusing binary string/data with binary numbers -- I
was just trying to illustrate that when a Latin Small Letter A (U+0061)
gets encoded, somewhere there is stored (four bytes, one of which is) a
byte 97, i.e. the bit sequence 1100001, unless computers don't work that
way anymore.

What I now see is tripping me up is the implicit cast to a character you're
saying that charToNum supports, without the corresponding cast to a number
supported in numToChar -- i.e. this fails:

put textEncode("a","UTF-32") into X;put numtochar(byte 1 of X)

while this works:

put textEncode("a","UTF-32") into X;put numtochar(bytetonum(byte 1 of X))

Thanks for the insight,

Geoff

Mark Waddingham via use-livecode
2018-11-13 10:52:52 UTC
Post by Geoff Canyon via use-livecode
I don't *think* I'm confusing binary string/data with binary numbers -- I
was just trying to illustrate that when a Latin Small Letter A (U+0061)
gets encoded, somewhere there is stored (four bytes, one of which is) a
byte 97, i.e. the bit sequence 1100001, unless computers don't work that
way anymore.
Yes - a byte is not a number, a char is not a number, a bit sequence is
not a number.

Chars have never been numbers in LC - when LC sees a char - it sees a
string and so
when such a thing is used in number context it converts it to the number
it *looks* like
i.e. "1" -> 1, but "a" -> error in number context (bearing in mind the
code for "1" is not 1).

i.e. numToChar(charToNum("1")) + 0 -> 1

The same is true for 'byte' in LC7+ (indeed, prior to that byte was a
synonym for char).
Post by Geoff Canyon via use-livecode
What I now see is tripping me up is the implicit cast to a character you're
saying that charToNum supports, without the corresponding cast to a number
put textEncode("a","UTF-32") into X;put numtochar(byte 1 of X)
Right so that shouldn't work - byte 1 of X here is <97> (a byte), bytes
get converted to native
chars in string context, so numToChar(byte 1 of X) -> numToChar(<97> as
char) -> numToChar("a")
and "a" is not a number.

You'd get exactly the same result if you did put numToChar(char 1 of
"a").

As I said, bytes are not numbers, just as chars are not numbers - bytes
do implicitly convert to
(native) chars though - so when you use a binary string in number
context, it gets treated as a
numeric string.

Put another way, just as the code for a char is not used in conversion
in number context, the
code of a byte is not used either.
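To make the 'numeric string' rule concrete (a sketch; the byte values follow from ASCII/UTF-8):

```livecode
on mouseUp
   put textEncode("12", "UTF-8") into tData -- the two bytes <49,50>
   -- in number context the bytes become the native string "12", hence 12
   put tData + 1 into tSum                  -- 13, not anything built from 49/50
   -- the code of a byte is only reachable explicitly:
   put byteToNum(byte 1 of tData)           -- 49
end mouseUp
```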

Warmest Regards,

Mark.
Jerry Jensen via use-livecode
2018-11-13 19:36:01 UTC
Yes - a byte is not a number, a char is not a number a bit sequence is not a number.
It reminds me of a clever sig line from somebody on this list.
I can’t remember who, so author please step up and take credit.
Paraphrasing The Prisoner:

“I am not a number, I am a free NaN”.
Ben Rubinstein via use-livecode
2018-11-13 11:43:08 UTC
I'm grateful for all the information, but _outraged_ that the thread that I
carefully created separate from the offset thread was so quickly hijacked for
the continuing (useful!) detailed discussion on that topic.

From recent contributions on both threads I'm getting some more insights, but
I'd really like to understand clearly what's going on. I do think that I
should have asked this question more broadly: how does the engine represent
values internally?


I believe from what I've read that the engine can distinguish the following
kinds of value:
- empty
- array
- number
- string
- binary string

From Monte I get that the internal encoding for 'string' may be MacRoman, ISO
8859 (I thought it would be CP1252), or UTF16 - presumably with some attribute
to tell the engine which one in each case.

So then my question is whether a 'binary string' is a pure blob, with no clues
as to interpretation; or whether in fact it does have some attributes to
suggest that it might be interpreted as UTF-8, UTF-32, etc.?

If there are no such attributes, how does codepointOffset operate when passed
a binary string?

If there are such attributes, how do they get set? Evidently if textEncode is
used, the engine knows that the resulting value is the requested encoding. But
what happens if the program reads a file as 'binary' - presumably the result
is a binary string, how does the engine treat it?

Is there any way at LiveCode script level to detect what a value is, in the
above terms?

And one more question: if a string, or binary string, is saved in a 'binary'
file, are the bytes stored on disk a faithful rendition of the bytes that
composed the value in memory, or an interpretation of some kind?

TIA,

Ben
Mark Waddingham via use-livecode
2018-11-13 13:31:35 UTC
Post by Ben Rubinstein via use-livecode
I'm grateful for all the information, but _outraged_ that the thread
that I carefully created separate from the offset thread was so
quickly hijacked for the continuing (useful!) detailed discussion on
that topic.
The phrase 'attempting to herd cats' springs to mind ;)
Post by Ben Rubinstein via use-livecode
From recent contributions on both threads I'm getting some more
insights, but I'd really like to understand clearly what's going on. I
do think that I should have asked this question more broadly: how does
the engine represent values internally?
The engine uses a number of distinct types 'behind the scenes'. The ones
pertinent to LCS (there are many many more which LCS never sees) are:

- nothing: a type with a single value (nothing/null)
- boolean: a type with two values true/false
- number: a type which can either store a 32-bit integer *or* a double
- string: a type which can either store a sequence of native (single
byte) codes, or a sequence of unicode (two byte - UTF-16) codes
- name: a type which stores a string, but uniques the string so that
caseless and exact equality checking is constant time
- data: a type which stores a sequence of bytes
- array: a type which stores (using a hashtable) a mapping from
'names' to any other storage value type

The LCS part of the engine then sits on top of these core types,
providing
various conversions depending on context.

All LCS syntax is actually typed - meaning that when you pass a value to
any
piece of LCS syntax, each argument is converted to the type required.

e.g. nativeCharToNum() has signature 'integer nativeCharToNum(string)',
meaning that it expects a string as input and will return a number as
output.

Some syntax is overloaded - meaning that it can act in slightly
different (but always consistent) ways depending on the type of the
arguments.

e.g. & has signatures 'string &(string, string)' and 'data &(data,
data)'.

In simple cases where there is no overload, type conversion occurs
exactly as required:

e.g. In the case of nativeCharToNum() - it has no overload, so always
expects a string
which means that the input argument will always undergo a 'convert to
string' operation.

The convert to string operation operates as follows:

- nothing -> ""
- boolean -> "true" or "false"
- number -> decimal representation of the number, using numberFormat
- string -> stays the same
- name -> uses the string the name contains
- data -> converts to a string using the native encoding
- array -> converts to empty (a very old semantic which probably does
more harm than good!)

In cases where syntax is overloaded, type conversion generally happens
in syntax-specific sequence in order to preserve consistency:

e.g. In the case of &, it can either take two data arguments, or two
string arguments. In this case,
if both arguments are data, then the result will be data. Otherwise both
arguments will be converted
to strings, and a string returned.
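A sketch of those & rules, probed with the 'is strictly' operators:

```livecode
on mouseUp
   put textEncode("ab", "UTF-8") into tData
   -- data & data stays data
   put (tData & tData) is strictly a binary string   -- true
   -- mixing data with a string converts both sides to strings
   put (tData & "!") is strictly a binary string     -- false
end mouseUp
```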
Post by Ben Rubinstein via use-livecode
From Monte I get that the internal encoding for 'string' may be
MacRoman, ISO 8859 (I thought it would be CP1252), or UTF16 -
presumably with some attribute to tell the engine which one in each
case.
Monte wasn't quite correct - on Mac it is MacRoman or UTF-16, on Windows
it is CP1252 or UTF-16, on Linux it is ISO 8859-1 or UTF-16. There is an
internal flag in a string value which says whether its character sequence
is single-byte (native) or double-byte (UTF-16).
Post by Ben Rubinstein via use-livecode
So then my question is whether a 'binary string' is a pure blob, with
no clues as to interpretation; or whether in fact it does have some
attributes to suggest that it might be interpreted as UTF8, UTF132
etc?
Data (binary string) values are pure blobs - they are sequences of bytes
- it has
no knowledge of where it came from. Indeed, that would generally be a
bad idea as you
wouldn't get repeatable semantics (i.e. a value from one codepath which
is data, might
have a different effect in context from one which is fetched from
somewhere else).

That being said, the engine does store some flags on values - but purely
for optimization.
i.e. To save later work. For example, a string value can store its
(double) numeric value in
it - which saves multiple 'convert to number' operations performed on
the same (pointer wise) string (due to the copy-on-write nature of
values, and the fact that all literals are unique names, pointer-wise
equality of values occurs a great deal).
Post by Ben Rubinstein via use-livecode
If there are no such attributes, how does codepointOffset operate when
passed a binary string?
CodepointOffset has signature 'integer codepointOffset(string)', so
when you
pass a binary string (data) value to it, the data value gets converted
to a string
by interpreting it as a sequence of bytes in the native encoding.
Post by Ben Rubinstein via use-livecode
If there are such attributes, how do they get set? Evidently if
textEncode is used, the engine knows that the resulting value is the
requested encoding. But what happens if the program reads a file as
'binary' - presumable the result is a binary string, how does the
engine treat it?
There are no attributes of that ilk. When you read a file as binary you
get data (binary
string) values - which means when you pass them to string taking
functions/commands that
data gets interpreted as a sequence of bytes in the native encoding.
This is why you must
always explicitly textEncode/textDecode data values when you know they
are not representing
native encoded text.
Post by Ben Rubinstein via use-livecode
Is there any way at LiveCode script level to detect what a value is,
in the above terms?
Yes - the 'is strictly' operators:

is strictly nothing
is strictly a boolean
is strictly an integer - a number which has internal rep 32-bit int
is strictly a real - a number which has internal rep double
is strictly a string
is strictly a binary string
is strictly an array

It should be noted that 'is strictly' reports only how that value is
stored and not anything based on the value itself. This only really
applies to 'an integer' and 'a real' - you can store an integer in a
double and all LCS arithmetic operators act on doubles.

e.g. (1+2) is strictly an integer -> false
(1+2) is strictly a real -> true

In contrast, though, *some* syntax will return numbers which are stored
internally as integers:

e.g. nativeCharToNum("a") is strictly an integer -> true

I should point out that what 'is strictly' operators return for any
given context is not stable in the sense that future engine versions
might return different things. e.g. We might optimize arithmetic in the
future (if we can figure out a way to do it without performance
penalty!) so that things which are definitely integers, are stored as
integers (e.g. 1 + 2 in the above).
Post by Ben Rubinstein via use-livecode
And one more question: if a string, or binary string, is saved in a
'binary' file, are the bytes stored on disk a faithful rendition of
the bytes that composed the value in memory, or an interpretation of
some kind?
What happens when you read or write data or string values to a file
depends on how you opened the file.

If you opened the file for binary (whether reading or writing), when you
read you will get data, when you write string values will be converted
to data via the native encoding (default rule).

If you opened the file for text, then the engine will try and determine
(using a BOM) the existing text encoding of the file. If it can't
determine it (if for example, you are opening a file for write which
doesn't exist), it will assume it is encoded as native.

Otherwise the file will have an explicit encoding associated with it
specified by you - reading from it will interpret the bytes in that
explicit encoding; while writing to it will expect string values which
will be encoded appropriately. In the latter case if you write data
values, they will first be converted to a string (assuming native
encoding) and then written as strings in the file's encoding (i.e.
default type conversion applies).

Essentially you can view files as typed streams - if you opened for
binary, read/write give/take data; if you opened for text, then read/write
give/take strings and default type conversion rules apply.
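A round-trip sketch ("dump.bin" is a hypothetical file in the defaultFolder; binary mode should preserve the bytes exactly):

```livecode
on mouseUp
   put textEncode("caf" & numToNativeChar(233), "UTF-8") into tData -- "café"
   open file "dump.bin" for binary write
   write tData to file "dump.bin"
   close file "dump.bin"
   open file "dump.bin" for binary read
   read from file "dump.bin" until EOF
   close file "dump.bin"
   put it = tData   -- true: the bytes on disk match the bytes in memory
end mouseUp
```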

Warmest Regards,

Mark.
Bob Sneidar via use-livecode
2018-11-13 16:01:23 UTC
There is a quest in World of Warcraft where the objective is actually to herd cats. It can be done, but only one cat at a time. :-)

Bob S
Post by Mark Waddingham via use-livecode
The phrase 'attempting to herd cats' springs to mind ;)
Ben Rubinstein via use-livecode
2018-11-13 19:33:56 UTC
That's really helpful - and in parts eye-opening - thanks Mark.

I have a few follow-up questions.

Does textEncode _always_ return a binary string? Or, if invoked with "CP1252",
"ISO-8859-1", "MacRoman" or "Native", does it return a string?
Post by Mark Waddingham via use-livecode
CodepointOffset has signature 'integer codepointOffset(string)', so when you
pass a binary string (data) value to it, the data value gets converted to a
string by interpreting it as a sequence of bytes in the native encoding.
OK - so one message I take is that in fact one should never invoke
codepointOffset on a binary string. Should it actually throw an error in this
case?

By the same token, probably one should only use 'byte', 'byteOffset',
'byteToNum' etc with binary strings - would it be better, to avoid confusion,
if char, offset, charToNum should refuse to operate on a binary string?
e.g. In the case of &, it can either take two data arguments, or two
string arguments. In this case, if both arguments are data, then the result
will be data. Otherwise both arguments will be converted to strings, and a
string returned.
The second message I take is that one needs to be very careful, if operating
on UTF-8 or other binary strings, to avoid 'contaminating' them e.g. by
concatenating with a simple quoted string, as this may cause it to be silently
converted to a non-binary string. (I presume that 'put "simple string"
after/before pBinaryString' will cause a conversion in the same way as "&"?
What about 'put "!" into char x of pBinaryString?)

The engine can tell whether a string is 'native' or UTF16. When the engine is
converting a binary string to 'string', does it always interpret the source as
the native 8-bit encoding, or does it have some heuristic to decide whether it
would be more plausible to interpret the source as UTF16?

Thanks again for all the detail!

Ben
Post by Ben Rubinstein via use-livecode
I'm grateful for all the information, but _outraged_ that the thread
that I carefully created separate from the offset thread was so
quickly hijacked for the continuing (useful!) detailed discussion on
that topic.
The phrase 'attempting to herd cats' springs to mind ;)
Post by Ben Rubinstein via use-livecode
From recent contributions on both threads I'm getting some more
insights, but I'd really like to understand clearly what's going on. I
do think that I should have asked this question more broadly: how does
the engine represent values internally?
The engine uses a number of distinct types 'behind the scenes'. The ones
  - nothing: a type with a single value (nothing/null)
  - boolean: a type with two values true/false
  - number: a type which can either store a 32-bit integer *or* a double
  - string: a type which can either store a sequence of native (single byte)
codes, or a sequence of unicode (two byte - UTF-16) codes
  - name: a type which stores a string, but uniques the string so that
caseless and exact equality checking is constant time
  - data: a type which stores a sequence of bytes
  - array: a type which stores (using a hashtable) a mapping from 'names' to
any other storage value type
The LCS part of the engine then sits on top of these core types, providing
various conversions depending on context.
All LCS syntax is actually typed - meaning that when you pass a value to any
piece of LCS syntax, each argument is converted to the type required.
e.g. charToNativeNum() has signature 'integer charToNativeNum(string)' meaning
that it
expects a string as input and will return a number as output.
Some syntax is overloaded - meaning that it can act in slightly different (but
always consistent) ways depending on the type of the arguments.
e.g. & has signatures 'string &(string, string)' and 'data &(data, data)'.
In simple cases where there is no overload, type conversion occurs exactly as
the signature requires.
e.g. In the case of charToNativeNum() - it has no overload, so it always
expects a string, which means that the input argument will always undergo a
'convert to string' operation. The 'convert to string' rules are:
   - nothing -> ""
   - boolean -> "true" or "false"
   - number -> decimal representation of the number, using numberFormat
   - string -> stays the same
   - name -> uses the string the name contains
   - data -> converts to a string using the native encoding
   - array -> converts to empty (a very old semantic which probably does more
harm than good!)
In cases where syntax is overloaded, type conversion generally happens in a
way determined by which overload matches.
e.g. In the case of &, it can either take two data arguments, or two string
arguments. In this case, if both arguments are data, then the result will be
data. Otherwise both arguments will be converted to strings, and a string
returned.
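To make the overload rule concrete, a minimal sketch (variable names are illustrative):

```livecode
-- & overload: data & data stays data; anything else becomes string
put textEncode("abc", "UTF-8") into tData1   -- data (binary string) value
put textEncode("def", "UTF-8") into tData2   -- data (binary string) value
put tData1 & tData2 into tBoth               -- both data: result stays data
put tData1 & "xyz" into tMixed               -- "xyz" is a string, so tData1 is
                                             -- converted to a native string first,
                                             -- and tMixed is a string
```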
Post by Ben Rubinstein via use-livecode
From Monte I get that the internal encoding for 'string' may be
MacRoman, ISO 8859 (I thought it would be CP1252), or UTF16 -
presumably with some attribute to tell the engine which one in each
case.
Monte wasn't quite correct - on Mac it is MacRoman or UTF-16, on Windows it
is CP1252 or UTF-16, on Linux it is ISO-8859-1 or UTF-16. There is an
internal flag in a string value which says whether its character sequence is
single-byte (native) or double-byte (UTF-16).
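A small sketch of how that flag surfaces in script (illustrative round-trips; per the description above, decoded text that fits the native set stays single-byte):

```livecode
-- Strings whose characters all fit the native single-byte set are stored
-- single-byte; anything else is stored as UTF-16. The flag itself is not
-- script-visible, but the fast/slow code paths follow from it.
put textDecode(textEncode("hello", "UTF-8"), "UTF-8") into tNativeable  -- fits native
put textDecode(textEncode("héllo ☃", "UTF-8"), "UTF-8") into tUnicode   -- ☃ forces UTF-16
```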
Post by Ben Rubinstein via use-livecode
So then my question is whether a 'binary string' is a pure blob, with
no clues as to interpretation; or whether in fact it does have some
attributes to suggest that it might be interpreted as UTF-8, UTF-32
etc?
Data (binary string) values are pure blobs - they are sequences of bytes
with no knowledge of where they came from. Indeed, that would generally be a
bad idea as you wouldn't get repeatable semantics (i.e. a value from one
codepath which is data might have a different effect in context from one
which is fetched from somewhere else).
That being said, the engine does store some flags on values - but purely for
optimization.
i.e. To save later work. For example, a string value can store its (double)
numeric value in
it - which saves multiple 'convert to number' operations performed on the same
(pointer wise) string (due to the copy-on-write nature of values, and the fact
that all literals are unique names, pointer-wise equality of values occurs a
great deal).
Post by Ben Rubinstein via use-livecode
If there are no such attributes, how does codepointOffset operate when
passed a binary string?
CodepointOffset has signature 'integer codepointOffset(string)', so when you
pass a binary string (data) value to it, the data value gets converted to a
string by interpreting it as a sequence of bytes in the native encoding.
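A sketch of the consequence (the exact results assume a native codepage such as CP1252/ISO-8859-1 where 0xC3 and 0xA9 map to "Ã" and "©"):

```livecode
-- Passing data to a string-taking function converts it byte-for-byte as
-- native text; no UTF-8 decoding happens.
put textEncode("é", "UTF-8") into tData        -- two bytes: 0xC3, 0xA9
put codepointOffset("é", tData)                -- 0: the bytes read as "Ã©", not "é"
put byteOffset(numToByte(195), tData)          -- 1: pure byte-level search (0xC3)
put offset("é", textDecode(tData, "UTF-8"))    -- 1: decode first, then search
```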
Post by Ben Rubinstein via use-livecode
If there are such attributes, how do they get set? Evidently if
textEncode is used, the engine knows that the resulting value is the
requested encoding. But what happens if the program reads a file as
'binary' - presumable the result is a binary string, how does the
engine treat it?
There are no attributes of that ilk. When you read a file as binary you get
data (binary
string) values - which means when you pass them to string taking
functions/commands that
data gets interpreted as a sequence of bytes in the native encoding. This is
why you must
always explicitly textEncode/textDecode data values when you know they are not
representing
native encoded text.
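For example, a minimal sketch of the safe pattern (tPath is a hypothetical file path):

```livecode
-- Read bytes, then decode explicitly with the encoding you know the file uses
put URL ("binfile:" & tPath) into tData      -- data (binary string) value
put textDecode(tData, "UTF-8") into tText    -- now a proper string value
-- Skipping the textDecode would silently reinterpret the bytes as native text
```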
Post by Ben Rubinstein via use-livecode
Is there any way at LiveCode script level to detect what a value is,
in the above terms?
Yes - via the 'is strictly' operators:
  is strictly nothing
  is strictly a boolean
  is strictly an integer - a number which has internal rep 32-bit int
  is strictly a real - a number which has internal rep double
  is strictly a string
  is strictly a binary string
  is strictly an array
It should be noted that 'is strictly' reports only how that value is stored
and not anything based on the value itself. This only really applies to 'an
integer' and 'a real' - you can store an integer in a double and all LCS
arithmetic operators act on doubles.
e.g. (1+2) is strictly an integer -> false
     (1+2) is strictly a real -> true
In contrast, though, *some* syntax will return numbers which are stored as
integers.
e.g. nativeCharToNum("a") is strictly an integer -> true
I should point out that what 'is strictly' operators return for any given
context is not stable in the sense that future engine versions might return
different things. e.g. We might optimize arithmetic in the future (if we can
figure out a way to do it without performance penalty!) so that things which
are definitely integers, are stored as integers (e.g. 1 + 2 in the above).
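The distinction at the heart of the thread can be sketched with these operators (illustrative; as noted, 'is strictly' reflects storage, which may change between engine versions):

```livecode
-- Distinguishing storage types at script level
put textEncode("hello", "UTF-8") into tData
put tData is strictly a binary string        -- true: stored as data (MCDataRef)
put textDecode(tData, "UTF-8") into tText
put tText is strictly a string               -- true: stored as a string
put tText is strictly a binary string        -- false
```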
Post by Ben Rubinstein via use-livecode
And one more question: if a string, or binary string, is saved in a
'binary' file, are the bytes stored on disk a faithful rendition of
the bytes that composed the value in memory, or an interpretation of
some kind?
What happens when you read or write data or string values to a file depends on
how you opened the file.
If you opened the file for binary (whether reading or writing), when you read
you will get data, when you write string values will be converted to data via
the native encoding (default rule).
If you opened the file for text, then the engine will try and determine (using
a BOM) the existing text encoding of the file. If it can't determine it (if
for example, you are opening a file for write which doesn't exist), it will
assume it is encoded as native.
Otherwise the file will have an explicit encoding associated with it specified
by you - reading from it will interpret the bytes in that explicit encoding;
while writing to it will expect string values which will be encoded
appropriately. In the latter case if you write data values, they will first be
converted to a string (assuming native encoding) and then written as strings
in the file's encoding (i.e. default type conversion applies).
Essentially you can view files as typed streams - if you opened for binary,
read/write give/take data; if you opened for text, then read/write give/take
strings and default type conversion rules apply.
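Conversely, for writing, one way to stay fully explicit about the bytes on disk is to encode yourself and use the binary path (a sketch; tPath is a hypothetical file path):

```livecode
-- Encode explicitly and write binary, so no default conversion rules apply
put textEncode("héllo" & return, "UTF-8") into URL ("binfile:" & tPath)
-- Reading it back symmetrically:
put textDecode(URL ("binfile:" & tPath), "UTF-8") into tText
```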
Warmest Regards,
Mark.
Monte Goulding via use-livecode
2018-11-13 23:44:38 UTC
Permalink
Post by Ben Rubinstein via use-livecode
That's really helpful - and in parts eye-opening - thanks Mark.
I have a few follow-up questions.
Does textEncode _always_ return a binary string? Or, if invoked with "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?
Internally we have different types of values. So we have MCStringRef which is the thing which either contains a buffer of native chars or a buffer of UTF-16 chars. There are others. For example, MCNumberRef will either hold a 32 bit signed int or a double. These are returned by numeric operations where there’s no string representation of a number. So:

put 1.0 into tNumber # tNumber holds an MCStringRef
put 1.0 + 0 into tNumber # tNumber holds an MCNumberRef

The return type of textEncode is an MCDataRef. This is a byte buffer, buffer size & byte count.

So:
put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef

Then if we do something like:
set the text of field "foo" to tFoo

tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move the buffer over and say it’s a native encoded string. There’s no checking to see if it’s a UTF-8 string and decoding with that etc.

Then the string is put into the field.

If you remember that mergJSON issue you reported - where mergJSON returns UTF-8 data, you were putting it into a field, and it looked funny - this is why.
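The corollary fix, presumably, is to decode explicitly before display:

```livecode
-- Decode the UTF-8 data into a string first; the field then shows the
-- intended characters rather than the bytes reinterpreted as native text
set the text of field "foo" to textDecode(tFoo, "UTF-8")
```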
Post by Ben Rubinstein via use-livecode
CodepointOffset has signature 'integer codepointOffset(string)', so when you
pass a binary string (data) value to it, the data value gets converted to a
string by interpreting it as a sequence of bytes in the native encoding.
OK - so one message I take is that in fact one should never invoke codepointOffset on a binary string. Should it actually throw an error in this case?
No, as mentioned above values can move to and from different types according to the operations performed on them and this is largely opaque to the scripter. If you do a text operation on a binary string then there’s an implicit conversion to a native encoded string. You generally want to use codepoint in 7+ where previously you used char, unless you know you are dealing with a binary string, in which case you use byte.
Post by Ben Rubinstein via use-livecode
By the same token, probably one should only use 'byte', 'byteOffset', 'byteToNum' etc with binary strings - would it be better, to avoid confusion, if char, offset, charToNum should refuse to operate on a binary string?
That would not be backwards compatible.
Post by Ben Rubinstein via use-livecode
e.g. In the case of &, it can either take two data arguments, or two
string arguments. In this case, if both arguments are data, then the result
will be data. Otherwise both arguments will be converted to strings, and a
string returned.
The second message I take is that one needs to be very careful, if operating on UTF8 or other binary strings, to avoid 'contaminating' them e.g. by concatenating with a simple quoted string, as this may cause it to be silently converted to a non-binary string. (I presume that 'put "simple string" after/before pBinaryString' will cause a conversion in the same way as "&"? What about 'put "!" into char x of pBinaryString?)
When concatenating if both left and right are binary strings (MCDataRef) then there’s no conversion of either to string however we do not currently have a way to declare a literal as a binary string (might be nice if we did!) so you would need to:

put textEncode("simple string", "UTF-8") after pBinaryString
Post by Ben Rubinstein via use-livecode
The engine can tell whether a string is 'native' or UTF16. When the engine is converting a binary string to 'string', does it always interpret the source as the native 8-bit encoding, or does it have some heuristic to decide whether it would be more plausible to interpret the source as UTF16?
No, it does not try to interpret. ICU has a charset detector that will give you a list of possible charsets along with a confidence. It could be implemented as a separate API:

get detectedTextEncodings(<binary string>, [<optional hint charset>]) -> array of charset/confidence pairs

get bestDetectedTextEncoding(<binary string>, [<optional hint charset>]) -> charset

Feel free to feature request that!

Cheers

Monte
Monte Goulding via use-livecode
2018-11-14 00:39:36 UTC
Permalink
Post by Monte Goulding via use-livecode
You generally want to use codepoint in 7+ generally where previously you used char unless you know you are dealing with a binary string and then you use byte.
Sorry! I have written codepoints here when I was thinking codeunits! Use codeunits rather than codepoints as they are a fixed number of bytes (2). Codepoints may be 2 or 4 bytes so there is a cost in figuring out the number of codepoints or the exact byte codepoint x refers to. So for chunk expressions on unicode strings use `codeunit x to y`.

Cheers

Monte
Monte Goulding via use-livecode
2018-11-14 00:49:27 UTC
Permalink
Post by Monte Goulding via use-livecode
Post by Monte Goulding via use-livecode
You generally want to use codepoint in 7+ generally where previously you used char unless you know you are dealing with a binary string and then you use byte.
Sorry! I have written codepoints here when I was thinking codeunits! Use codeunits rather than codepoints as they are a fixed number of bytes (2). Codepoints may be 2 or 4 bytes so there is a cost in figuring out the number of codepoints or the exact byte codepoint x refers to. So for chunk expressions on unicode strings use `codeunit x to y`.
Argh… sorry again… codeunits are a fixed number of bytes but that fixed number depends on whether the string is native encoded (1 byte) or UTF-16 (2 bytes)!

And for completeness codeunit/codepoint is not equivalent to char. If you really need to count graphemes then you will need to use char.
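To summarise the three granularities with a concrete (illustrative) example:

```livecode
-- char = grapheme; codepoint = Unicode scalar value; codeunit = UTF-16 unit
put the number of chars in "🙂"        -- 1 (one grapheme)
put the number of codepoints in "🙂"   -- 1 (one scalar value, U+1F642)
put the number of codeunits in "🙂"    -- 2 (a UTF-16 surrogate pair)
```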

Cheers

Monte
Ben Rubinstein via use-livecode
2018-11-20 16:33:19 UTC
Permalink
Hi Monte,

Thanks for this, sorry for delayed reply - I've been away.
Post by Monte Goulding via use-livecode
Post by Ben Rubinstein via use-livecode
Does textEncode _always_ return a binary string? Or, if invoked with
"CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?
Post by Monte Goulding via use-livecode
Internally we have different types of values. So we have MCStringRef which
is the thing which either contains a buffer of native chars or a buffer of
UTF-16 chars. There are others.
...
Post by Monte Goulding via use-livecode
The return type of textEncode is an MCDataRef. This is a byte buffer,
buffer size & byte count.
Post by Monte Goulding via use-livecode
put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef
set the text of field "foo" to tFoo
tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move
the buffer over and say it’s a native encoded string. There’s no checking to
see if it’s a UTF-8 string and decoding with that etc.

So my question would be, is this helpful? If, given any MCDataRef (i.e.
'binary string') LC makes the assumption - when it needs an MCStringRef - that
the binary string is 'native' - then I would think it will be wrong more often
than it is correct!

IIUC, the chief ways to obtain an MCDataRef are by reading a file in binary
mode, or by calling textEncode (or loading a non-file URL???). Insofar as one
could make an assumption at all, my guess is that in the first case the data
is more likely to be UTF8; and whatever is most likely in the second case,
'native' is about the least likely. (If the assumption was UTF16 it would at
least make more sense.)

Would it not be better to refuse to make an assumption, i.e. require an
explicit conversion? If you want to proceed on the assumption that a file is
'native' text, read it as text; if you know what it is, read it as binary and
use textEncode. If you used textEncode anyway (or numToByte) then obviously
you know what it is, and when you want to make a string out of it you can tell
LC how to interpret it. Wouldn't it be better to throw an error if passing an
MCDataRef where an MCStringRef is required, than introduce subtle errors by
just making (in my opinion implausible) assumptions?

And now that the thought has occurred to me - when a URL with a non-file
protocol is used as a source of value, what is the type of the value -
MCStringRef or MCDataRef?

thanks for the continuing education!

Ben
_______________________________________________
use-livecode mailing list
http://lists.runrev.com/mailman/listinfo/use-livecode
Bob Sneidar via use-livecode
2018-11-20 17:11:47 UTC
Permalink
I'm not grasping the import of the question here, but it seems to me that the question is about what happens "under the hood", in relation to the format of the data as it is exposed to any I/O. It seems to me that in this context it's academic. If there is a problem with what's going on "under the hood", that of course needs to be addressed. But if it's not affecting what the developer/user "sees" in terms of the format of the data, I don't see the point.

Bob S
Mark Wieder via use-livecode
2018-11-20 17:55:46 UTC
Permalink
Post by Ben Rubinstein via use-livecode
Would it not be better to refuse to make an assumption, i.e. require an
explicit conversion?
While I'd love to have the option of strongly typed variables at the
scripting level, I know better than to expect that this will ever happen.
--
Mark Wieder
***@gmail.com
Ben Rubinstein via use-livecode
2018-11-20 18:24:58 UTC
Permalink
This isn't about strongly typed variables though, but about when (correct)
conversion is possible.

LC throws an error if you implicitly ask it to convert the wrong kind of
string to a number - for example, add 45 to "horse". (Obviously
multiplication is fine: the answer would be "45 horses".)

LC throws an error if you implicitly ask it to convert the wrong kind of
string or number to a colour: try setting the backcolor of a control to
"horse".

LC throws an error if asked to convert a number, or the wrong kind of string,
to a boolean: try setting the hilite of a button to 45.

In all these cases, LC knows it cannot do the right thing, so it throws an
error to tell you so, rather than guessing, for example, what the truth value
of "45" is.

I'm just suggesting that it cannot know how to correctly convert binary data
into a string - so it should throw an error rather than possibly (probably?)
do the wrong thing.

Ben
Geoff Canyon via use-livecode
2018-11-20 18:31:14 UTC
Permalink
I'll chip in and point out that the implicit conversion caused significant
hiccups in figuring out the offsets issues -- several people (including me)
were fooled by the fact that conversion to UTF-32 results in binary data,
but can be transparently treated as text. Or maybe I'm
mistaken/misremembering, which reinforces the fact that it's confusing. :-)

On Tue, Nov 20, 2018 at 10:25 AM Ben Rubinstein via use-livecode <
Post by Ben Rubinstein via use-livecode
This isn't about strongly typed variables though, but about when (correct)
conversion is possible.
LC throws an error if you implicitly ask it to convert the wrong kind of
string to a number - for example, add 45 to "horse". (Obviously multiplication
is fine: the answer would be "45 horses".)
LC throws an error if you implicitly ask it to convert the wrong kind of string
or number to a colour: try setting the backcolor of a control to "horse".
LC throws an error if asked to convert a number, or the wrong kind of string,
to a boolean: try setting the hilite of a button to 45.
In all these cases, LC knows it cannot do the right thing, so it throws an
error to tell you so, rather than guessing, for example, what the truth value
of "45" is.
I'm just suggesting that it cannot know how to correctly convert binary data
into a string - so it should throw an error rather than possibly (probably?)
do the wrong thing.
Post by Mark Wieder via use-livecode
Post by Ben Rubinstein via use-livecode
Would it not be better to refuse to make an assumption, i.e. require an
explicit conversion?
While I'd love to have the option of strongly typed variables at the
scripting level, I know better than to expect that this will ever happen.
Bob Sneidar via use-livecode
2018-11-20 18:55:05 UTC
Permalink
This isn't about strongly typed variables though, but about when (correct) conversion is possible.
LC throws an error if you implicitly ask it to convert the wrong kind of string to a number - for example, add 45 to "horse". (Obviously multiplication is fine: the answer would be "45 horses".)
LC throws an error if you implicitly ask it to convert the wrong kind of string or number to a colour: try setting the backcolor of a control to "horse".
LC throws an error if asked to convert a number, or the wrong kind of string, to a boolean: try setting the hilite of a button to 45.
In all these cases, LC knows it cannot do the right thing, so it throws an error to tell you so, rather than guessing, for example, what the truth value of "45" is.
I'm just suggesting that it cannot know how to correctly convert binary data into a string - so it should throw an error rather than possibly (probably?) do the wrong thing.
Too many assumptions about the "string" would be necessary here. What if I wanted to write a utility that displayed in ascii format a sector on a disk like the old MacOS DiskEdit used to do? Certainly, much of the data would be impossible to format, but some might be discernible. I'm suggesting that there is nothing intrinsic about binary data that can absolutely identify it as the type of string or data you are expecting, whereas with typed data there is. So when it comes to binary data, it seems to me to be better to assume nothing about whether or not the data is valid.

Again, I am not terribly versed in data processing at this level, but it seems to me that referring to binary data as "typed" data is a bit of a misnomer. ALL data is in the end stored as binary data. Typing is a predefined way to structure how the data is stored, partly for efficiency, and partly to preclude the volatility of processing the wrong type of data at the machine level.

Imagine if every time an addition process was called by a microprocessor it had to do error checking to see if the values the binary data actually represented were integers. The processing overhead would be impossibly voluminous. So this is enforced by the compiler, hence data typing. It only matters to the higher level language what the binary data represents.

In the case of LC, and other non-typed languages, this is determined at run time, not at compile time. That is really the big difference. In C++, typing prevents me as a developer from trying to add 45 to "horses" before I compile the entire application; in that kind of environment, compiling a large application can (or used to) take a really long time, and debugging to find out where you went wrong could be immensely tedious. Since LC "compiles" scripts as it goes, it isn't really necessary anymore to type variables, and that frees us, the developers, to think about the flow of the application rather than get bogged down in the minutiae of programming faux pas.

I freely admit though that when it comes to this subject matter, better minds than I are more suited to the discussion.

Bob S
Lagi Pittas via use-livecode
2018-11-21 17:00:59 UTC
Permalink
Hi Mark,

I can't see any reason why not - except for time (and money).

The fact that the language has been forked (LiveCode Builder) means there
is a precedent for changes to the way the language works. I cannot see
why LCB could not be one of the "open language" variants that uses the
LiveCode hierarchy/GUI/message path etc.

My original reason for funding the Kickstarter WAS the open language - to
be able to write in Python/LC/JS and Visual FoxPro within the same
environment (we can but dream). If you have code already written, working
and debugged in one language, use it, as long as each language's source is
translated to the LC bytecode engine.

TypeScript adds static typing, classes and modules to JavaScript with a
transpiler - I bet if there were a Kickstarter to add JavaScript and
Python scripting to LiveCode we would get over a million dollars - who wants
to use Glade or PyQt or wxPython or Tkinter? How many JavaScript and
Python programmers would love to create desktop applications? I would
suggest that the coding in LC for FileMaker is much more involved -
it's worth a punt, Kevin? Mark? And please don't come back with the usual
riposte - "if you want to program in Python, use Python" - it's the
environment and the tools and the multiple deployments, and and and .......
and using all those libraries in Python and JavaScript land .....

Best Lagi




On Tue, 20 Nov 2018 at 17:56, Mark Wieder via use-livecode <
Post by Mark Wieder via use-livecode
Post by Ben Rubinstein via use-livecode
Would it not be better to refuse to make an assumption, i.e. require an
explicit conversion?
While I'd love to have the option of strongly typed variables at the
scripting level, I know better than to expect that this will ever happen.
--
Mark Wieder
_______________________________________________
use-livecode mailing list
http://lists.runrev.com/mailman/listinfo/use-livecode
Bob Sneidar via use-livecode
2018-11-26 18:39:27 UTC
Permalink
I would be concerned that if a large number of Java coders (far more than the LC coders) were to come on board, we would end up with a Java development environment, as the Java people would dominate the demand and direction of LC.

Bob S
Post by Lagi Pittas via use-livecode
How many JavaScript and
Python programmers would love to create desktop applications?
Lagi Pittas via use-livecode
2018-11-28 14:35:56 UTC
Permalink
Hi Bob,

So my rant didn't go to the bit bucket.

To answer your question: NO, that wouldn't happen - adding "side" languages
to use with the IDE introduces people to a saner way of doing things.

We could use Python and/or JavaScript for the great libraries, make
them callable with an LCB wrapper, and use LC for the stuff we need to
understand.
I never mentioned Java, and JavaScript is certainly NOT Java, thank God. For
Java they can use FFI as punishment for all the boilerplate. For Kotlin
we could make an exception. ;-)

Anyway, it's no different to having LCB as a "second" statically typed
language.

Actually, if we could make it work like Steve Wozniak's pseudo-16-bit
interpreter "Sweet 16", or the way you could switch into assembler using
$ASMMODE in Turbo Pascal or the [ ] brackets in BBC BASIC - except
switching into the "side language" - that would be the icing on the cake.

I now quite like LiveCode's non-dot-notation and I've even acclimatised
myself to "PUT"; as usual, it's "comfortable shoes". To me Pascal was the
easiest language to read, but even I can't deny that HyperTalk is MUCH
easier to read. Never bothered about being terse - only about being readable.

Regards Lagi

On Mon, 26 Nov 2018 at 18:39, Bob Sneidar via use-livecode <
Post by Bob Sneidar via use-livecode
I would be concerned that if a large number of Java coders (far more than
the LC coders) were to come on board, we would end up with a java
development environment as the java people would dominate the demand and
direction of LC.
Bob S
On Nov 21, 2018, at 09:00 , Lagi Pittas via use-livecode <
How many JavaScript and
Python programmers would love to create desktop applications?
Geoff Canyon via use-livecode
2018-11-13 17:21:31 UTC
Permalink
On Tue, Nov 13, 2018 at 3:43 AM Ben Rubinstein via use-livecode <
Post by Ben Rubinstein via use-livecode
I'm grateful for all the information, but _outraged_ that the thread that I
carefully created separate from the offset thread was so quickly hijacked for
the continuing (useful!) detailed discussion on that topic.
Nothing I said in this thread has anything to do with optimizing the
allOffsets routines; I only used examples from that discussion because they
illustrate my puzzlement on the exact topic you (in general) raised: how
data types are handled by the engine. I'd generalize the responses, to say
that it seems how the engine stores data and how it presents that data are
not identical in all cases.

Separately, it's interesting to hear that the engine (can) store(s) numeric
values for strings, as an optimization.

The above notwithstanding: sorry I outraged you; I'll exit this thread.
Mark Waddingham via use-livecode
2018-11-13 17:29:35 UTC
Permalink
Post by Geoff Canyon via use-livecode
Nothing I said in this thread has anything to do with optimizing the
allOffsets routines; I only used examples from that discussion because they
illustrate my puzzlement on the exact topic you (in general) raised: how
data types are handled by the engine. I'd generalize the responses, to say
that it seems how the engine stores data and how it presents that data are
not identical in all cases.
The best way to think about it is that the engine stores data pretty
much in the form it is presented with; however, what script sees of
data is in the form it requests. In particular, if data has been through
some operation, or mutated, then there is a good chance it won't be in
the same form it was before.

e.g. put tVar + 1 into tVar

Here tVar could start off as a string, but would end up as a number by
virtue of the fact you've performed an arithmetic operation on it.
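If I understand correctly, in recent engines (9.0 onwards) the `is strictly` operators make this change of internal representation observable from script. A sketch, assuming a 9.x engine:

```livecode
local tVar
put "5" into tVar
put (tVar is strictly a string) & return after msg   -- true: stored as a (native) string
put tVar + 1 into tVar
put (tVar is strictly an integer) & return after msg -- true: the arithmetic left a number behind
```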
Post by Geoff Canyon via use-livecode
The above notwithstanding: sorry I outraged you; I'll exit this thread.
Obviously I'm not Ben, but I *think* it was 'faux outrage' (well I hope
it was - hence my jocular comment about herding cats!) - so I don't
think there's a reason to exit...

Warmest Regards,

Mark.
--
Mark Waddingham ~ ***@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
Ben Rubinstein via use-livecode
2018-11-13 20:54:26 UTC
Permalink
For the avoidance of doubt, all my outrage is faux outrage.
Public life on both sides of the Atlantic (and around the world) has
completely exhausted capacity for real outrage.

Come back Geoff!

Ben
Post by Geoff Canyon via use-livecode
Nothing I said in this thread has anything to do with optimizing the
allOffsets routines; I only used examples from that discussion because they
illustrate my puzzlement on the exact topic you (in general) raised: how
data types are handled by the engine. I'd generalize the responses, to say
that it seems how the engine stores data and how it presents that data are
not identical in all cases.
The best way to think about it is that the engine stores data pretty much in
the form it is presented with; however, what script sees of data is in the
form it requests. In particular, if data has been through some operation, or
mutated, then there is a good chance it won't be in the same form it was before.
e.g. put tVar + 1 into tVar
Here tVar could start off as a string, but would end up as a number by virtue
of the fact you've performed an arithmetic operation on it.
Post by Geoff Canyon via use-livecode
The above notwithstanding: sorry I outraged you; I'll exit this thread.
Obviously I'm not Ben, but I *think* it was 'faux outrage' (well I hope it was
- hence my jocular comment about herding cats!) - so I don't think there's a
reason to exit...
Warmest Regards,
Mark.
Geoff Canyon via use-livecode
2018-11-13 22:29:23 UTC
Permalink
I never left, I just went silent.

But since I'm "back", I'm curious to know what the engine-types think of
Bernd's solution for fixing the UTF-32 offsets code. It seems that when
you convert both the stringToFind and the stringToSearch to UTF-32 and then
search the binary data with byteOffset, you won't find "Reykjavík" in
"Reykjavík er höfuðborg".

But if you first append "せ" to each string, then do the textEncode, then
strip the last 4 bytes, the match will work. That seems like strange voodoo
to me.
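For reference, the workaround as described might be wrapped up like this. A sketch only: `utf32Offset` is a hypothetical name, the "せ" padding is from Bernd's experiment, and the byte-to-codepoint arithmetic assumes the match lands on a 4-byte boundary (which holds when both operands are whole UTF-32 sequences):

```livecode
function utf32Offset pNeedle, pHaystack
   local tNeedle, tHaystack, tByte
   -- append a non-native character before encoding, then strip its 4 bytes
   put textEncode(pNeedle & "せ", "UTF-32") into tNeedle
   delete byte -4 to -1 of tNeedle
   put textEncode(pHaystack & "せ", "UTF-32") into tHaystack
   delete byte -4 to -1 of tHaystack
   put byteOffset(tNeedle, tHaystack) into tByte
   if tByte = 0 then return 0
   return (tByte - 1) div 4 + 1 -- convert 1-based byte position to codepoint position
end utf32Offset
```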

On Tue, Nov 13, 2018 at 12:54 PM Ben Rubinstein via use-livecode <
Post by Ben Rubinstein via use-livecode
For the avoidance of doubt, all my outrage is faux outrage.
Public life on both sides of the Atlantic (and around the world) has
completely exhausted capacity for real outrage.
Come back Geoff!
Ben
Post by Mark Waddingham via use-livecode
Post by Geoff Canyon via use-livecode
Nothing I said in this thread has anything to do with optimizing the
allOffsets routines; I only used examples from that discussion because
they illustrate my puzzlement on the exact topic you (in general) raised:
how data types are handled by the engine. I'd generalize the responses,
to say that it seems how the engine stores data and how it presents that
data are not identical in all cases.
The best way to think about it is that the engine stores data pretty
much in the form it is presented with; however, what script sees of data
is in the form it requests. In particular, if data has been through some
operation, or mutated, then there is a good chance it won't be in the
same form it was before.
e.g. put tVar + 1 into tVar
Here tVar could start off as a string, but would end up as a number by
virtue of the fact you've performed an arithmetic operation on it.
Post by Geoff Canyon via use-livecode
The above notwithstanding: sorry I outraged you; I'll exit this thread.
Obviously I'm not Ben, but I *think* it was 'faux outrage' (well I hope
it was - hence my jocular comment about herding cats!) - so I don't think
there's a reason to exit...
Warmest Regards,
Mark.