Public Report
Report From: Delphi-BCB/RTL/Delphi/Other Classes/TEncoding
Report #: 79042
Status: Open
Remove MB_ERR_INVALID_CHARS flag from TEncoding.UTF8
Project: Delphi
Build #: 14.0.3615.26342
Version: 14.0
Submitted By: Remy Lebeau (TeamB)
Report Type: Minor failure / Design problem
Date Reported: 10/27/2009 4:24:51 PM
Severity: Commonly encountered problem
Last Updated: 3/20/2012 2:24:39 AM
Platform: All platforms
Internal Tracking #: 273480
Resolution: None
Resolved in Build: None
Duplicate of: None
Description
TEncoding.UTF8 is the only encoding object that passes the MB_ERR_INVALID_CHARS flag to MultiByteToWideChar().  When a byte buffer is passed to GetCharCount() or GetChars() and the buffer ends with an incomplete character sequence (because the sequence straddles a buffer boundary), the entire decode operation fails, even though the rest of the buffer contains fully decodable sequences.

An encoding object obtained from TEncoding.GetEncoding(65001), unlike TEncoding.UTF8, does not use the MB_ERR_INVALID_CHARS flag, so GetCharCount() and GetChars() are able to process the full sequences correctly, ignoring any partial sequence at the end of the byte buffer.

MB_ERR_INVALID_CHARS should be removed from the SysUtils.TUTF8Encoding class.
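
The difference described above can be observed directly at the Win32 level. The following is a minimal sketch, not code taken from SysUtils: it assumes a Windows console project, uses the Windows unit name as in Delphi 2010, and the program and variable names are made up for illustration. It issues the same size query twice, once with MB_ERR_INVALID_CHARS and once with no flags, on a buffer that ends in the middle of a two-byte sequence.

```pascal
program FlagDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

var
  Bytes: TBytes;
  Count: Integer;
  Err: DWORD;
begin
  // 'Test ' followed by $CE, the lead byte of a two-byte UTF-8 sequence.
  SetLength(Bytes, 6);
  Bytes[0] := Ord('T');
  Bytes[1] := Ord('e');
  Bytes[2] := Ord('s');
  Bytes[3] := Ord('t');
  Bytes[4] := Ord(' ');
  Bytes[5] := $CE;

  // Strict size query: no output buffer, MB_ERR_INVALID_CHARS set.
  Count := MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
    PAnsiChar(@Bytes[0]), Length(Bytes), nil, 0);
  Err := GetLastError; // capture before any other API call can change it
  if Count = 0 then
    Writeln('strict: failed, GetLastError = ', Err)
  else
    Writeln('strict: ', Count, ' chars');

  // Lenient size query: identical call with dwFlags = 0.
  Count := MultiByteToWideChar(CP_UTF8, 0,
    PAnsiChar(@Bytes[0]), Length(Bytes), nil, 0);
  Writeln('lenient: ', Count, ' chars');
end.
```

According to the Win32 documentation, the strict call returns 0 and sets ERROR_NO_UNICODE_TRANSLATION when the input contains an invalid sequence. What the lenient call does with the trailing byte may depend on the Windows version (older versions are documented to drop it, newer ones to substitute U+FFFD), so no specific count is asserted here.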
Steps to Reproduce:
// using TEncoding.UTF8...

procedure DecodeWithUTF8; // requires SysUtils in the uses clause
var
  utf8: TBytes;
  utf16: TCharArray;
begin
  SetLength(utf8, 8);
  utf8[0] := Ord('T');
  utf8[1] := Ord('e');
  utf8[2] := Ord('s');
  utf8[3] := Ord('t');
  utf8[4] := Ord(' ');
  // UTF-8 encoding of Greek PI ($CE $A0), for example
  utf8[5] := $CE;
  utf8[6] := $A0;
  utf8[7] := 0;
  // only the first 6 bytes are passed, so the PI sequence is cut off after $CE
  utf16 := TEncoding.UTF8.GetChars(utf8, 0, 6);
  // utf16 is completely empty!
end;


// using TEncoding.GetEncoding(65001)...

procedure DecodeWithGetEncoding; // requires SysUtils in the uses clause
var
  utf8: TBytes;
  utf16: TCharArray;
  enc: TEncoding;
begin
  SetLength(utf8, 8);
  utf8[0] := Ord('T');
  utf8[1] := Ord('e');
  utf8[2] := Ord('s');
  utf8[3] := Ord('t');
  utf8[4] := Ord(' ');
  // UTF-8 encoding of Greek PI ($CE $A0), for example
  utf8[5] := $CE;
  utf8[6] := $A0;
  utf8[7] := 0;
  // GetEncoding() returns a new instance that the caller must free
  enc := TEncoding.GetEncoding(65001);
  try
    // only the first 6 bytes are passed, so the PI sequence is cut off after $CE
    utf16 := enc.GetChars(utf8, 0, 6);
  finally
    enc.Free;
  end;
  // utf16 contains 'Test ' as expected...
end;
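
For callers that receive UTF-8 in chunks, one possible way to keep using TEncoding.UTF8 is to hold back an incomplete trailing sequence and decode only the complete prefix, carrying the held-back bytes into the next chunk. The following is a minimal sketch under that assumption; IncompleteTailLength is a hypothetical helper written for this illustration and is not part of the RTL.

```pascal
program Utf8TailDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns how many bytes at the end of
// Bytes[0..Count-1] begin a multi-byte UTF-8 sequence that is not yet
// complete. Returns 0 if the data ends on a sequence boundary, or if the
// tail is malformed rather than merely truncated.
function IncompleteTailLength(const Bytes: TBytes; Count: Integer): Integer;
var
  I, Need: Integer;
begin
  Result := 0;
  I := Count - 1;
  // Step back over at most three continuation bytes (10xxxxxx).
  while (I >= 0) and (Count - I <= 3) and ((Bytes[I] and $C0) = $80) do
    Dec(I);
  if I < 0 then
    Exit;
  // Sequence length announced by the lead byte.
  if (Bytes[I] and $E0) = $C0 then
    Need := 2
  else if (Bytes[I] and $F0) = $E0 then
    Need := 3
  else if (Bytes[I] and $F8) = $F0 then
    Need := 4
  else
    Exit; // ASCII or not a lead byte: nothing to hold back
  if Count - I < Need then
    Result := Count - I;
end;

var
  Chunk: TBytes;
  Chars: TCharArray;
  Tail: Integer;
begin
  // Same six bytes as in the report: 'Test ' plus the lead byte $CE only.
  SetLength(Chunk, 6);
  Chunk[0] := Ord('T');
  Chunk[1] := Ord('e');
  Chunk[2] := Ord('s');
  Chunk[3] := Ord('t');
  Chunk[4] := Ord(' ');
  Chunk[5] := $CE;

  Tail := IncompleteTailLength(Chunk, Length(Chunk));
  // Decode only the complete prefix. The last Tail bytes should be
  // prepended to the next chunk instead of being discarded.
  Chars := TEncoding.UTF8.GetChars(Chunk, 0, Length(Chunk) - Tail);
  Writeln('held back: ', Tail, ', decoded chars: ', Length(Chars));
end.
```

For the six-byte input from the report, the helper holds back one byte ($CE), so only the five bytes of 'Test ' reach the decoder. Because the strict flag stays in effect, input that is malformed rather than truncated still fails instead of being skipped silently.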
Workarounds
None
Attachment
None
Comments

Michiel Spoor at 1/17/2013 12:07:44 AM -
Proposed solution does not look good. It silently ignores the unconvertible token.
Please NEVER silently ignore errors!

Item #111980 seems related
