Lindenberg Software

Issues in Devanagari cluster validation

Norbert Lindenberg
August 25, 2020

The Unicode Standard and the documentation of the OpenType Devanagari and Universal shaping engines don’t agree on the definition of a valid Devanagari cluster, and Devanagari cluster validation in OpenType shaping engines and in fonts produces inconsistent results.

What’s cluster validation?

In complex writing systems, it’s not always obvious in which order characters that are pronounced together or that form a cluster in rendering should be stored in a Unicode character sequence. Glyphs are often rendered in a different sequence than the corresponding sounds in spoken language are pronounced, and commonly some glyphs are shown above or below other glyphs, where the ordering is not clear. However, a defined character sequence is often important for correct processing of the text – sorting strings, searching for specific words, finding line breaks, and rendering the text using fonts.

Both the Unicode Standard and the OpenType script development documents therefore commonly define the structure of a valid cluster in a script. OpenType shaping engines for complex scripts usually validate incoming text as the first step in transforming a character sequence into a two-dimensional arrangement of glyphs, and insert a dotted circle ◌ whenever they find a character where they don’t expect it. Fonts implemented using the Graphite and Apple Advanced Typography shaping systems have to implement validation themselves.

Unfortunately the Unicode Standard and the OpenType script development documents don’t always agree with each other on the structure of valid clusters, or even have internal inconsistencies, and shaping engine implementations don’t always follow either of these documents.

What’s a valid Devanagari cluster?

Defining a cluster model for Devanagari is complicated by the fact that the script has been in use for a very long time (close to 2000 years) and for over 200 languages (SIL). Documentation about the script tends to focus on contemporary use for a few popular languages such as Hindi, Marathi, and Nepali, and on historical use for Sanskrit. From such documentation it’s easy to derive a set of character combinations that must be supported, but the documentation is not sufficient to determine which combinations should be prohibited. This document therefore compares the cluster models currently documented in the Unicode Standard and in OpenType documentation, as well as a range of implementations, to find disagreements that need to be investigated and resolved.

Characters used in and with the Devanagari script

Based on the core specification and the data for Unicode 13.0, this document uses the following Devanagari character set:

Unicode character properties for Devanagari

The Unicode Standard provides several character properties that can help describe the structure of Devanagari clusters. The Universal Shaping Engine (USE), part of OpenType rendering systems, uses these properties to define character classes, which are used in its definition of (generic Brahmic) clusters. The following table shows the Devanagari characters identified above and their properties and classes.

Code points | Characters | General category | Canonical combining class | Indic syllabic category | Indic positional category | USE subclass
0971 | ॱ | Lm | 0 | Other | NA | BASE_IND
1CF2..1CF3 | ᳲ ᳳ | Lo | 0 | Consonant_Dead | NA | BASE_IND
0950, A8F4..A8F7, A8FB, A8FD | ॐ ꣴ ꣵ ꣶ ꣷ ꣻ ꣽ | Lo | 0 | Other | NA | BASE_IND
1CED | ◌᳭ | Mn | 220 | Other | Bottom | BASE_IND
1CE2..1CE8 | ◌᳢ ◌᳣ ◌᳤ ◌᳥ ◌᳦ ◌᳧ ◌᳨ | Mn | 1 | Other | Overstruck | BASE_IND
002C, 002E, 0964..0965, 0970, 1CD3, A8F8..A8FA, A8FC | , . । ॥ ॰ ᳓ ꣸ ꣹ ꣺ ꣼ | Po | 0 | Other | NA | BASE_IND
02BC | ʼ | Lm | 0 | Other | NA | OTHER
1CE9..1CEC, 1CEE..1CF1 | ᳩ ᳪ ᳫ ᳬ ᳮ ᳯ ᳰ ᳱ | Lo | 0 | Other | NA | OTHER
A830..A835 | ꠰ ꠱ ꠲ ꠳ ꠴ ꠵ | No | 0 | Other | NA | OTHER
1CF5..1CF6 | ᳵ ᳶ | Lo | 0 | Consonant_With_Stacker | NA | CONS_WITH_STACKER
093D | ऽ | Lo | 0 | Avagraha | NA | BASE
A8F2..A8F3 | ꣲ ꣳ | Lo | 0 | Bindu | NA | BASE
0915..0939, 0958..095F, 0978..097F | क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न ऩ प फ ब भ म य र ऱ ल ळ ऴ व श ष स ह क़ ख़ ग़ ज़ ड़ ढ़ फ़ य़ ॸ ॹ ॺ ॻ ॼ ॽ ॾ ॿ | Lo | 0 | Consonant | NA | BASE
0904..0914, 0960..0961, 0972..0977, A8FE | ऄ अ आ इ ई उ ऊ ऋ ऌ ऍ ऎ ए ऐ ऑ ऒ ओ औ ॠ ॡ ॲ ॳ ॴ ॵ ॶ ॷ ꣾ | Lo | 0 | Vowel_Independent | NA | BASE
0966..096F | ० १ २ ३ ४ ५ ६ ७ ८ ९ | Nd | 0 | Number | NA | BASE
25CC | ◌ | So | 0 | Consonant_Placeholder | NA | BASE_OTHER
00A0 |   | Zs | 0 | Consonant_Placeholder | NA | BASE_OTHER
093C | ◌़ | Mn | 7 | Nukta | Bottom | CONS_MOD_BELOW
094D | ◌् | Mn | 9 | Virama | Bottom | HALANT
093F, 094E | ◌ि ◌ॎ | Mc | 0 | Vowel_Dependent | Left | VOWEL_PRE
093A, 0945..0948, 0955, A8FF | ◌ऺ ◌ॅ ◌ॆ ◌े ◌ै ◌ॕ ◌ꣿ | Mn | 0 | Vowel_Dependent | Top | VOWEL_ABOVE
0941..0944, 0956..0957, 0962..0963 | ◌ु ◌ू ◌ृ ◌ॄ ◌ॖ ◌ॗ ◌ॢ ◌ॣ | Mn | 0 | Vowel_Dependent | Bottom | VOWEL_BELOW
093B, 093E, 0940, 0949..094C, 094F | ◌ऻ ◌ा ◌ी ◌ॉ ◌ॊ ◌ो ◌ौ ◌ॏ | Mc | 0 | Vowel_Dependent | Right | VOWEL_POST
0900..0902 | ◌ऀ ◌ँ ◌ं | Mn | 0 | Bindu | Top | VOWEL_MOD_ABOVE
0951, 1CD0..1CD2, 1CDA..1CDB, 1CE0, 1CF4, 20F0, A8E0..A8F1 | ◌॑ ◌᳐ ◌᳑ ◌᳒ ◌᳚ ◌᳛ ◌᳠ ◌᳴ ◌⃰ ◌꣠ ◌꣡ ◌꣢ ◌꣣ ◌꣤ ◌꣥ ◌꣦ ◌꣧ ◌꣨ ◌꣩ ◌꣪ ◌꣫ ◌꣬ ◌꣭ ◌꣮ ◌꣯ ◌꣰ ◌꣱ | Mn | 230 | Cantillation_Mark | Top | VOWEL_MOD_ABOVE
0952, 1CD5..1CD9, 1CDC..1CDF | ◌॒ ◌᳕ ◌᳖ ◌᳗ ◌᳘ ◌᳙ ◌᳜ ◌᳝ ◌᳞ ◌᳟ | Mn | 220 | Cantillation_Mark | Bottom | VOWEL_MOD_BELOW
1CD4 | ◌᳔ | Mn | 1 | Cantillation_Mark | Overstruck | VOWEL_MOD_BELOW
1CE1 | ◌᳡ | Mc | 0 | Cantillation_Mark | Right | VOWEL_MOD_POST
0903 | ◌ः | Mc | 0 | Visarga | Right | VOWEL_MOD_POST
A838 | ꠸ | Sc | 0 | Other | NA | SYM or BASE_IND
A836..A837, A839 | ꠶ ꠷ ꠹ | So | 0 | Other | NA | SYM or BASE_IND
200D |   | Cf | 0 | Joiner | NA | ZERO_WIDTH_JOINER
200C |   | Cf | 0 | Non_Joiner | NA | ZERO_WIDTH_NON_JOINER
1CF8..1CF9 | ◌᳸ ◌᳹ | Mn | 230 | Cantillation_Mark | NA | undefined

There are some problems in this table:

We’ll investigate below what this means for validation specifications and implementations.
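Two of the columns in the property table above, general category and canonical combining class, can be spot-checked with Python's standard `unicodedata` module. The following is only a minimal sketch: the selection of code points and the helper names `properties` and `mismatches` are illustrative assumptions, and `unicodedata` does not expose the Indic syllabic or positional categories, so those columns are not covered.

```python
import unicodedata

# A few code points from the property table, with the general category and
# canonical combining class that the table lists for them.
EXPECTED = {
    0x0915: ("Lo", 0),    # consonant ka
    0x093C: ("Mn", 7),    # nukta
    0x094D: ("Mn", 9),    # virama
    0x093F: ("Mc", 0),    # vowel sign i
    0x0951: ("Mn", 230),  # udatta
    0x0952: ("Mn", 220),  # anudatta
    0x1CD4: ("Mn", 1),    # Vedic sign Yajurvedic midline svarita
}


def properties(code_point):
    """Return (general category, canonical combining class) for a code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.combining(ch)


def mismatches():
    """Return the code points whose runtime properties differ from the table."""
    return {
        cp: properties(cp)
        for cp, expected in EXPECTED.items()
        if properties(cp) != expected
    }


if __name__ == "__main__":
    for cp in EXPECTED:
        print(f"U+{cp:04X}", *properties(cp))
```

An empty result from `mismatches()` means the Unicode data bundled with the Python runtime agrees with the table for these code points.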

Devanagari clusters in the Unicode Standard

The description of Devanagari in the Unicode Standard doesn’t provide a concise definition of an orthographic Devanagari cluster. Instead, there are several separate pieces of information scattered over the chapter:

From that we can derive a regular expression for a Devanagari cluster, assuming terms relating to Indic syllabic categories mean those categories and that the undefined “svara” means all remaining categories that contain marks, and using the syntax characters of Perl/Python/Java/JavaScript regular expression patterns:

Vowel_Independent | (0930 Nukta? Joiner? Virama)? (Consonant Nukta? Virama (Joiner | Non_Joiner)?)* Consonant Nukta? (Virama | Vowel_Dependent? [0900-0902]* (Cantillation_Mark | Visarga | 1CED | [1CE2-1CE8])*)

The canonical combining classes don’t always align with this, however. In particular, the cantillation marks U+1CD4 and U+1CE2..U+1CE8 have combining class 1, so normalization would reorder them to before a nukta, which has combining class 7.

The regular expression does not capture that an independent vowel shouldn’t follow a virama. The Unicode Standard also includes three tables showing glyphs for vowel letters, atomic consonants, and consonant conjuncts that have a simple encoding as well as longer character sequences that could result in the same rendering. The longer character sequences should not be used. Validation has to check for any such character sequences and reject them.
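As a rough illustration, the regular expression derived above from the Unicode Standard can be written with Python's `re` module. This is a sketch under stated assumptions, not a reference validator: the character classes use the code point ranges from the property table in this document, the names `CLUSTER` and `is_valid_cluster` are made up for this example, and neither the restriction on independent vowels after virama nor the "do not use" sequences are checked.

```python
import re

# Character classes by Indic syllabic category, with code point ranges taken
# from the property table in this document.
VOWEL_INDEPENDENT = "[\u0904-\u0914\u0960\u0961\u0972-\u0977\uA8FE]"
CONSONANT = "[\u0915-\u0939\u0958-\u095F\u0978-\u097F]"
NUKTA = "\u093C"
VIRAMA = "\u094D"
JOINER = "\u200D"
NON_JOINER = "\u200C"
VOWEL_DEPENDENT = (
    "[\u093A\u093B\u093E-\u094C\u094E\u094F\u0955-\u0957\u0962\u0963\uA8FF]"
)
BINDU = "[\u0900-\u0902]"
# Cantillation marks, visarga, U+1CED, and U+1CE2..U+1CE8.
TRAILING_MARK = (
    "[\u0951\u0952\u1CD0-\u1CD2\u1CD4-\u1CE1\u1CF4\u1CF8\u1CF9\u20F0"
    "\uA8E0-\uA8F1\u0903\u1CED\u1CE2-\u1CE8]"
)

# Direct transcription of the pattern: an independent vowel on its own, or a
# consonant cluster with an optional leading ra sequence.
CLUSTER = re.compile(
    f"{VOWEL_INDEPENDENT}"
    f"|(?:\u0930{NUKTA}?{JOINER}?{VIRAMA})?"
    f"(?:{CONSONANT}{NUKTA}?{VIRAMA}(?:{JOINER}|{NON_JOINER})?)*"
    f"{CONSONANT}{NUKTA}?"
    f"(?:{VIRAMA}|{VOWEL_DEPENDENT}?{BINDU}*{TRAILING_MARK}*)"
)


def is_valid_cluster(text):
    """True if the whole string is one cluster according to the pattern."""
    return CLUSTER.fullmatch(text) is not None


if __name__ == "__main__":
    samples = ["\u0915\u093C\u093F", "\u0915\u093F\u093C", "\u0905\u0902"]
    for sample in samples:
        print([f"{ord(ch):04X}" for ch in sample], is_valid_cluster(sample))
```

Note that under this transcription a nukta after a dependent vowel is rejected, and so is an independent vowel followed by anusvara, because the pattern allows marks only after consonants.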

Devanagari clusters in the Devanagari shaping engine

In OpenType rendering systems, Devanagari is usually handled by the Devanagari shaping engine. The documentation for this engine provides three patterns for Devanagari syllables:

{C+[N]+<H+[<ZWNJ|ZWJ>]|<ZWNJ|ZWJ>+H>} + C+[N]+[A] + [< H+[<ZWNJ|ZWJ>] | {M}+[N]+[H]>]+[SM]+[(VD)]

[Ra+H]+V+[N]+[<[<ZWJ|ZWNJ>]+H+C|ZWJ+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]

#[Ra+H]+NBSP+[N]+[<[<ZWJ|ZWNJ>]+H+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]

The syntax of these patterns is documented, except for the “#” character. Of the nonterminals, however, only some clearly map to Unicode character categories; SM (syllable modifier signs) and VD (vedic) in particular seem to be buckets with a large variety of characters. My best guess for the meaning of the symbols that don’t directly map to Indic syllabic categories:

With that, and omitting the unexplained “#”, the patterns become:

(Consonant Nukta? (Virama (Non_Joiner | Joiner)? | (Non_Joiner | Joiner) Virama))* Consonant Nukta? 0952? (Virama (Non_Joiner | Joiner)? | Vowel_Dependent* Nukta? Virama?)? SM? VD{0,2}

(0930 Virama)? Vowel_Independent Nukta? ((Joiner | Non_Joiner)? Virama Consonant | Joiner Consonant)? (Vowel_Dependent* Nukta? Virama?)? SM? VD{0,2}

(0930 Virama)? 00A0 Nukta? ((Joiner | Non_Joiner)? Virama Consonant)? (Vowel_Dependent* Nukta? Virama?)? SM? VD{0,2}

Devanagari clusters in the Universal shaping engine

The Universal shaping engine (USE) in OpenType is a generic shaping engine designed primarily for Brahmic scripts. It is not the default engine for Devanagari, but the OpenType implementations in HarfBuzz and CoreText let a font opt into shaping by the USE by using the “dev3” script tag.

The USE’s cluster validation is quite clearly defined, as long as all characters involved have their USE classes fully defined – which, as we’ve seen above, unfortunately isn’t the case for Devanagari. My best guesses for the classes of characters where the USE class is currently undefined, wrong, or ambiguous:

Omitting classes and cluster patterns that aren’t relevant to Devanagari, the patterns defined in the USE documentation become:

BASE_IND | OTHER

CONS_WITH_STACKER? (BASE | BASE_OTHER) CONS_MOD_BELOW* (HALANT BASE CONS_MOD_BELOW*)* VOWEL_PRE* VOWEL_ABOVE* VOWEL_BELOW* VOWEL_POST* VOWEL_MOD_ABOVE* VOWEL_MOD_BELOW* VOWEL_MOD_POST*

CONS_WITH_STACKER? (BASE | BASE_OTHER) CONS_MOD_BELOW* (HALANT BASE CONS_MOD_BELOW*)* HALANT

SYM
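One way to experiment with these patterns is to map each character to a one-letter code for its USE class and then match the resulting string against a regular expression. The Python sketch below does this for a subset of the property table in this document. The one-letter codes, the table subset, and the function names are assumptions made for illustration; the two standalone patterns (BASE_IND | OTHER, and SYM) are omitted, and the two cluster patterns are merged into one expression with a shared prefix.

```python
import re

# (first code point, last code point, one-letter code for the USE class).
# Only a subset of the property table; unlisted characters get "?".
USE_CLASSES = [
    (0x0915, 0x0939, "B"),  # BASE: consonants
    (0x0904, 0x0914, "B"),  # BASE: independent vowels
    (0x0966, 0x096F, "B"),  # BASE: digits
    (0x093D, 0x093D, "B"),  # BASE: avagraha
    (0x25CC, 0x25CC, "O"),  # BASE_OTHER: dotted circle
    (0x00A0, 0x00A0, "O"),  # BASE_OTHER: no-break space
    (0x1CF5, 0x1CF6, "S"),  # CONS_WITH_STACKER
    (0x093C, 0x093C, "N"),  # CONS_MOD_BELOW: nukta
    (0x094D, 0x094D, "H"),  # HALANT: virama
    (0x093F, 0x093F, "p"),  # VOWEL_PRE
    (0x0945, 0x0948, "a"),  # VOWEL_ABOVE
    (0x0941, 0x0944, "b"),  # VOWEL_BELOW
    (0x093E, 0x093E, "t"),  # VOWEL_POST
    (0x0940, 0x0940, "t"),  # VOWEL_POST
    (0x0949, 0x094C, "t"),  # VOWEL_POST
    (0x0900, 0x0902, "A"),  # VOWEL_MOD_ABOVE: bindus
    (0x0951, 0x0951, "A"),  # VOWEL_MOD_ABOVE: udatta
    (0x0952, 0x0952, "W"),  # VOWEL_MOD_BELOW: anudatta
    (0x0903, 0x0903, "T"),  # VOWEL_MOD_POST: visarga
]

# Shared prefix of both cluster patterns, followed by either the vowel and
# vowel modifier tail or a final halant.
USE_CLUSTER = re.compile(r"S?[BO]N*(?:HBN*)*(?:p*a*b*t*A*W*T*|H)")


def use_class_string(text):
    """Map each character to its one-letter class code, or "?" if unlisted."""
    codes = []
    for ch in text:
        cp = ord(ch)
        code = next((c for lo, hi, c in USE_CLASSES if lo <= cp <= hi), "?")
        codes.append(code)
    return "".join(codes)


def is_valid_use_cluster(text):
    """True if the class string of the whole text matches the cluster pattern."""
    return USE_CLUSTER.fullmatch(use_class_string(text)) is not None
```

With this sketch, ka followed by udatta and then visarga is accepted, while the same marks in the opposite order are rejected, because VOWEL_MOD_ABOVE precedes VOWEL_MOD_POST in the pattern.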

Test environments

The following sections look at the differences between these definitions and test how different implementations handle them. The comparison tables have the following columns, some of them only in Firefox or in Safari:

When a font doesn’t support all the characters in the tested character sequence, the corresponding cells are shown in gray.

Cluster bases

The Devanagari section of the Unicode Standard describes only consonants as cluster bases; the Devanagari shaping engine adds independent vowels and no-break space; the Universal Shaping Engine also allows digits, avagraha characters, and dotted circle.

This table tests with the consonant क, the independent vowel अ, the digit ०, the DEVANAGARI SIGN AVAGRAHA ऽ, the VEDIC SIGN ARDHAVISARGA ᳲ, and the dotted circle ◌ by attaching the nukta ◌़, the pre-base vowel ◌ि, the anusvara ◌ं, and the visarga ◌ः.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
क़
कि
कं
कः
अ़
अि(✓)
अं
अः
०़(✓)
०ि(✓)(✓)
०ं(✓)
०ः
ऽ़(✓)
ऽि(✓)(✓)
ऽं(✓)
ऽः
ᳲ़
ᳲि
ᳲं
ᳲः
◌़
◌ि(✓)
◌ं
◌ः
 ़
 ि
 ं
 ः

Observations: The set of supported cluster bases varies significantly. Shaping engines support more than the Unicode Standard and the Devanagari shaping engine documentation would suggest, but support for digits and avagraha is limited. It’s not clear whether HarfBuzz’s support for ardhavisarga as a base is based on evidence that marks attach to it.

Recommendations: Cluster models for Devanagari should include consonants, independent vowels, digits, avagraha, dotted circle, and no-break space as base characters. Which Vedic signs to include as bases needs to be decided based on evidence that marks attach to them – see the section Vedic character combinations for some examples.

Repha

Repha is an above-base mark representing an initial dead consonant ra in a cluster. According to the Unicode Standard, it’s encoded as ra plus virama when followed by a consonant, but as just ra when followed by a vowel (primarily vocalic r and l), which should be encoded in its dependent form. The regular expressions for the Devanagari shaping engine, on the other hand, expect repha-vowel combinations to be encoded as ra plus virama followed by the independent vowel, and add the use of repha with no-break space. In this case, we’re not only checking whether the character sequence is accepted as valid, but also whether the sequence of ra and virama is recognized as part of a larger cluster and the repha glyph is used. A checkmark in parentheses indicates that this does not happen.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
र्क
र्अ(✓)(✓)(✓)(✓)
र्ऋ(✓)(✓)(✓)(✓)
रृ(✓)(✓)(✓)(✓)
र् (✓)(✓)(✓)(✓)
र्◌(✓)(✓)

Observations: Every implementation allows every representation of repha, including the one that the Unicode Standard prohibits. However, it seems independent vowels, no-break space, and dotted circle are sometimes treated as new clusters rather than continuations of the ra-virama cluster, as the repha glyph is not used.

Recommendations: The combination of ra and virama with an independent vowel should be treated as two separate clusters, so that the repha glyph is not used. No-break space and dotted circle should be treated as continuations of the cluster that the ra-virama sequence started, so that the repha glyph is used.

Nukta

The Unicode Standard and the USE treat nukta signs as consonant modifiers and require them to immediately follow consonants. The Devanagari shaping engine also allows them after dependent vowels, and both OpenType shaping engines allow them after independent vowels.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
क़ि
कि़(✓)
क़ा
का़
अ़

Observations: Support for nukta after dependent vowels is spotty. Using nukta in this position could be for two reasons: either users want to actually indicate that the dependent vowel has a different pronunciation, or they intend it to apply to the consonant and just don’t know that they should input it before the vowel. Only the first reason would justify support.

Recommendation: Investigate the reasons for using nukta after dependent vowels.

Virama

The Unicode Standard and the USE require a virama to follow a consonant, possibly after an intervening nukta, joiner, or non-joiner. The Devanagari shaping engine also allows it after a dependent vowel, and both OpenType shaping engines allow it after an independent vowel.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
क्
क़्
क्क
क‌्क
क‍्क
क्‌क
क्‍क
क़‌्क
क़‍्क
क़्‌क
क़्‍क
कि्
का्
अ्

Observations: Almost all implementations allow virama after both dependent and independent vowels.

Recommendations: Investigate why that’s allowed.

Unusual mark combinations

The mark combinations tested in this section are not common, but they do occur, and a consistent interpretation across standards, shaping engines, and fonts would be helpful. In clusters with multiple vowels, we also check in which order the vowels are expected.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
क्ं
क्ः
क्॑
कंः
किं्
किा
काि(✓)(✓)(✓)
किु(✓)
कुि(✓)(✓)(✓)
कुे(✓)
केु(✓)
कॅु(✓)
कुॅ(✓)

Observations: Multiple vowels are allowed if you get the order right – and that order may differ between Sangam and the shaping engines. For other unusual combinations, it’s hard to predict whether they will be accepted or not. The USE implementation in CoreText seems to align with implementations of the Devanagari shaping engine rather than with the documentation of the USE when it comes to virama combinations.

Recommendations: Investigate which of these combinations make sense in real life, support those well in one specific order, and discontinue support for others.

Udatta and Anudatta

The OpenType Devanagari shaping engine places U+0952 DEVANAGARI STRESS SIGN ANUDATTA before dependent vowels, while U+0951 DEVANAGARI STRESS SIGN UDATTA, which might be classified as either a syllable modifier sign or a vedic sign, should follow the vowel and possibly other syllable modifiers or vedic signs. The USE treats both as vowel modifiers, and places them after vowels.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
क॑ि(✓)(✓)
कि॑
क॒ि(✓)(✓)
कि॒
क॑ं
कं॑
क॒ं
कं॒
क॑ः
कः॑
क॒ः
कः॒
क॑िं◌◌◌◌(✓)◌◌(✓)
कि॑ं
किं॑(✓)
क॒िं◌◌◌◌(✓)◌◌(✓)
कि॒ं
किं॒(✓)
क॑िः◌◌◌◌(✓)◌◌(✓)
कि॑ः
किः॑(✓)
क॒िः◌◌◌◌(✓)◌◌(✓)
कि॒ः
किः॒(✓)
क॒॑
क॒॑
अ॑
अ॒

Observations: Nobody follows the Devanagari shaping engine document’s suggestion that anudatta should be allowed before vowels. There’s not enough consensus on where these marks should go otherwise – in particular, implementations of the Devanagari shaping engine require them to come after visarga, while implementations of the USE require them to precede that character. Sangam doesn’t seem to like them in any more complex cluster.

Recommendations: Treat both udatta and anudatta the same way. Keep them in the position where the implementations of the Devanagari shaping engine expect them, that is after the visarga and similar marks. This requires a modification to the USE documentation.

Vedic character combinations

This section investigates character sequences that include Vedic characters from the Devanagari and Vedic extension blocks. The samples are taken from documents in the Unicode document registry that provided attestations for the characters: Everson et al. 2007; South Asia subcommittee 2008; Sharma 2011a; Sharma 2009; Srinidhi and Sridatta 2017; Sharma 2011b.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
ये᳠
ऊ꣡꣡(✓)
रा꣢꣯(✓)
यो꣩
च꣣ना꣢꣫(✓)
२꣮
रा꣰
नृ꣢भा᳒ऽ᳒२᳒◌◌◌◌
यृ᳘
न्वो᳡
य॑ः᳢◌◌◌◌
य᳢॑ः
जऀ॑(✓)(✓)(✓)(✓)(✓)(✓)
ꣴ॑
ᳩं᳘◌◌◌◌
ᳪँ
कॎो
ए᳘꣣
ए᳘꣣
तु᳘
तो᳴
ᳵक(✓)(✓)(✓)(✓)(✓)
आ᳸

Observations: No shaping engine treated all samples as valid; the HarfBuzz Devanagari shaping engine came closest. In the second-to-last line, the glyph for क should be stacked on top of the one for U+1CF5 VEDIC SIGN JIHVAMULIYA ᳵ; it’s not clear whether the rendering failures are due to the shaping engines, or the font, or both.

Recommendations: As all samples above are attested, they should all be supported. Vedic signs that serve as bases for other signs, such as U+A8F4 DEVANAGARI SIGN DOUBLE CANDRABINDU VIRAMA ꣴ or U+1CF5 VEDIC SIGN JIHVAMULIYA ᳵ, need to be classified to allow for such use. For marks, the correct order needs to be determined, in a way that cooperates with normalization (see the next section).

Canonical-equivalent mark sequences

The Unicode Standard specifies that certain character sequences are canonical-equivalent to each other, and should therefore display identically. Equivalence is determined through three operations: decomposition, composition, and reordering of marks based on their canonical combining class. Within the Devanagari script, there’s no decomposition or composition, but some marks have canonical combining class values that enable reordering: Nukta, virama, cantillation marks, and the unclassified marks U+1CED and U+1CE2 through U+1CE8. In the following table, rows that use the same code points (in different order) are canonical-equivalent and should be displayed identically. In each case, the first row contains the sequence in normalization form C.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
क᳔़
क᳔़
क᳔्
क᳔्
क᳔॑
क᳔॑
क़्
क़्
क़॑
क़॑
क्॑
क्॑
क॒॑
क॒॑

Observations: HarfBuzz, the shaping system used in Firefox, normalizes any input string, so there’s no visible difference between the different rows with the same code points. In a number of cases, however, normalization is incompatible with the expectations of the shaping engines. The canonical combining classes used in normalization express primarily the position of a mark relative to the base, while shaping engines expect character sequences to reflect a linguistic model. The workaround Unicode sometimes recommends for such cases, adding the character U+034F COMBINING GRAPHEME JOINER to block mark reordering, is not mentioned in the Devanagari section of the Unicode Standard.
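The reordering discussed in this section can be reproduced with Python's standard `unicodedata` module. The sketch below is illustrative only; the helper names `nfc` and `code_points` are assumptions for this example. It shows marks being sorted by canonical combining class, and U+034F COMBINING GRAPHEME JOINER (class 0) keeping the typed order.

```python
import unicodedata

KA = "\u0915"
NUKTA = "\u093C"            # canonical combining class 7
VIRAMA = "\u094D"           # canonical combining class 9
UDATTA = "\u0951"           # canonical combining class 230
MIDLINE_SVARITA = "\u1CD4"  # canonical combining class 1
CGJ = "\u034F"              # COMBINING GRAPHEME JOINER, class 0


def nfc(text):
    """Return the text in normalization form C."""
    return unicodedata.normalize("NFC", text)


def code_points(text):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    # Nukta typed before the class-1 mark is moved after it.
    print(code_points(nfc(KA + NUKTA + MIDLINE_SVARITA)))        # 0915 1CD4 093C
    # Udatta typed before virama is moved after it.
    print(code_points(nfc(KA + UDATTA + VIRAMA)))                # 0915 094D 0951
    # A grapheme joiner between the marks blocks the reordering.
    print(code_points(nfc(KA + NUKTA + CGJ + MIDLINE_SVARITA)))  # 0915 093C 034F 1CD4
```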

Recommendations: The canonical combining class of a character can’t be changed, no matter what problems it causes, because of the Unicode stability policy regarding normalization. The USE needs to be updated to be compatible with normalization. Cantillation marks with position “Overstruck” have to be a separate subclass, and must be allowed before nukta, but also in later places such as after visarga.

Confusable mark sequences

This section is similar to the one above in that it tests character sequences where the visual placement doesn’t fully determine the encoding order. However, while sequences in the previous section were canonical-equivalent, sequences here are not equivalent and therefore should render differently. The Universal Shaping Engine documents the order in which marks should occur (generally left-top-bottom-right within each USE character class); the Unicode Standard and the Devanagari shaping engine don’t.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
कं़
क़ं
कुे
केु
किा
काि(✓)(✓)(✓)
कं᳔
क᳔ं
कं॒
क॒ं

Observations: The good news is that for each tested mark pair there clearly is a preferred order. The not-so-good news is that the not-preferred order for vowels is still allowed by the Devanagari shaping engines in HarfBuzz and CoreText. Sangam doesn’t seem to like any combination of anusvara and anudatta.

Recommendations: Cluster patterns should allow only one preferred order of non-spacing marks to avoid confusable sequences.

Repeated marks

In a few combinations of Vedic signs, repeated marks in Devanagari are meaningful. Most of the time, they’re spelling mistakes. In either case, they should be visible.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
कुु
कुुु◌◌◌◌
कंं
कंंं◌◌◌◌◌◌
कःः
कःःः◌◌
क़़
क़़़◌◌◌◌
क़़ि◌◌(✓)◌◌
क््
क्््◌◌◌◌◌◌◌◌◌◌
क॒॒
क॒॒॒◌◌◌◌
क꣡꣡
क꣡꣡꣡◌◌
क᳘᳘
क᳘᳘᳘◌◌

Observations: A mark repeated three times will never get you fewer dotted circles than the same mark repeated twice. Beyond that, there doesn’t seem to be any recognizable logic behind these results. Note that repeated marks are the only situation where Annapurna inserts dotted circles.

Recommendations: Cluster patterns should allow repetition of marks if it’s meaningful. Fonts should be designed to make repeated marks visible, normally by stacking them, but for certain Vedic signs by side-by-side positioning.

“Discouraged” characters

The Devanagari block of the Unicode Standard includes two characters that were intended for use with Latin, not with Devanagari, and whose use is now discouraged entirely: U+0953 Devanagari grave accent and U+0954 Devanagari acute accent. Note that U+0953 Devanagari grave accent is easily confused with U+0947 Devanagari vowel sign e.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
क॓
क॔
अ॓
अ॔
०॓
०॔
ऽ॓
ऽ॔
ᳲ॓
ᳲ॔

Observations: Apparently no implementer got the message that these characters should not be used with Devanagari – where dotted circles are inserted, they seem to reflect unsupported base characters, not the “discouraged” marks.

Recommendations: The use of these characters in Devanagari clusters should be disallowed; a dotted circle should be inserted before them.

“Do not use” character sequences

The Devanagari section of the Unicode Standard includes three tables showing glyphs for vowel letters, atomic consonants, and consonant conjuncts that have a simple encoding as well as longer character sequences that could result in the same rendering. The longer character sequences should not be used, and validation should therefore reject them. They are tested here.

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
अॆ
अा
र्इ
उु
एॅ
एॆ
एे
अॉ
आॅ
अॊ
आॆ
अो
आे
अौ
आै
अॅ
अऺ
अऻ
आऺ
अॏ
अॖ
अॗ
ख्ा
ख्‍ा
ग्ा
ग्‍ा
घ्ा
घ्‍ा
च्ा
च्‍ा
ज्ा
ज्‍ा
झ्ा
झ्‍ा
ञ्ा
ञ्‍ा
ण्ा
ण्‍ा
त्ा
त्‍ा
थ्ा
थ्‍ा
ध्ा
ध्‍ा
न्ा
न्‍ा
ऩ्ा
ऩ्‍ा
ऩ्ा
ऩ्‍ा
प्ा
प्‍ा
ब्ा
ब्‍ा
भ्ा
भ्‍ा
म्ा
म्‍ा
य्ा
य्‍ा
ल्ा
ल्‍ा
व्ा
व्‍ा
श्ा
श्‍ा
ष्ा
ष्‍ा
स्ा
स्‍ा
ख़्ा
ख़्‍ा
ख़्ा
ख़्‍ा
ग़्ा
ग़्‍ा
ग़्ा
ग़्‍ा
ज़्ा
ज़्‍ा
ज़्ा
ज़्‍ा
य़्ा
य़्‍ा
य़्ा
य़्‍ा
ॹ्ा
ॹ्‍ा
ॺ्ा
ॺ्‍ा
ॻ्ा
ॻ्‍ा
ॼ्ा
ॼ्‍ा
ॾ्ा
ॾ्‍ा
ॿ्ा
ॿ्‍ा
क्च्ा
क्च्‍ा
क्ष्ा
क्ष्‍ा
त्त्ा
त्त्‍ा
न्त्ा
न्त्‍ा

Observations: HarfBuzz reliably rejects “do not use” sequences. DirectWrite, CoreText, and Sangam allow some of them, without much recognizable logic.

Recommendations: These sequences should be reliably rejected.
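A validator could reject such sequences with a simple lookup. The following Python sketch is a hypothetical illustration that covers only a subset of the rows in the table above: a handful of the vowel-letter sequences, plus the pattern of a consonant (optionally with nukta), virama, optional zero width joiner, and vowel sign aa for some of the tested consonants. The names used are assumptions for this example, and the authoritative lists are the tables in the Unicode Standard.

```python
import re

# A few of the vowel-letter sequences from the tested rows; each has a shorter
# encoding as a single independent vowel letter.
DO_NOT_USE_VOWEL_SEQUENCES = (
    "\u0905\u093E",        # letter a + vowel sign aa
    "\u0909\u0941",        # letter u + vowel sign u
    "\u090F\u0947",        # letter e + vowel sign e
    "\u0905\u094B",        # letter a + vowel sign o
    "\u0905\u094C",        # letter a + vowel sign au
    "\u0930\u094D\u0907",  # ra + virama + letter i
)

# Some of the consonants from the tested rows, optionally with nukta, followed
# by virama, an optional zero width joiner, and vowel sign aa.
HALF_FORM_PLUS_AA = re.compile(
    "[\u0916-\u0918\u091A\u091C-\u091E\u0923-\u0925\u0927-\u092A"
    "\u092C-\u092F\u0932\u0935-\u0938]\u093C?\u094D\u200D?\u093E"
)


def contains_do_not_use_sequence(text):
    """True if the text contains one of the sequences covered by this sketch."""
    if any(seq in text for seq in DO_NOT_USE_VOWEL_SEQUENCES):
        return True
    return HALF_FORM_PLUS_AA.search(text) is not None
```

A substring search is enough here because the sequences are fixed strings or short fixed-shape patterns; a shaping engine would more likely fold this check into its cluster pattern.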

Marks without bases

Marks without bases are not valid clusters, and OpenType recommends inserting dotted circles to indicate that.
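The idea can be sketched at the character level in Python. This is a deliberate simplification and not how any particular shaping engine works: real engines operate on clusters and glyphs, while this hypothetical helper only checks whether a combining mark (general category M) has something before it that it could plausibly attach to.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"
NO_BREAK_SPACE = "\u00A0"


def insert_dotted_circles(text):
    """Insert U+25CC before each mark that has no preceding base or mark.

    Assumption for this sketch: letters, numbers, marks, symbols, and
    no-break space can carry a following mark; anything else cannot.
    """
    result = []
    can_attach = False
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M") and not can_attach:
            result.append(DOTTED_CIRCLE)
        result.append(ch)
        can_attach = ch == NO_BREAK_SPACE or category[0] in "LNMS"
    return "".join(result)
```

For example, a lone vowel sign i (U+093F) comes back as U+25CC U+093F, while ka followed by vowel sign i is returned unchanged.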

Text | Edge | Firefox dev2 | Safari dev2 | Firefox dev3 | Safari dev3 | Annapurna | Sangam | MT
ि

Observations: The shaping engines and the Sangam font insert dotted circles before the marks that have the script property value Devanagari. U+1CE2, which has the script property value Common, doesn’t get one – it’s possible that it gets redirected to the default shaping engine, which does not validate.

Recommendations: OpenType specifications should be created that clarify how text is broken into script runs and clusters, and provide a validation model that makes sense across all scripts.

References

Annapurna: Peter Martin: Annapurna SIL. Font version 1.204. Part of The Annapurna Font Family. SIL International, 2019.

Edge: Microsoft Edge. Browser version 44.19041.1.0; EdgeHTML 18.19041. Included in Microsoft Windows 10 version 2004. Microsoft, 2020.

Everson et al. 2007: Michael Everson and Peter Scharf (editors), Michel Angot, R. Chandrashekar, Malcolm Hyman, Susan Rosenfield, B. V. Venkatakrishna Sastry, Michael Witzel: Proposal to encode 55 characters for Vedic Sanskrit in the BMP of the UCS. Unicode Consortium, 2007.

Firefox: Firefox. Browser version 79.0. Mozilla, 2020.

MT: Monotype: Devanagari MT. Font version 13.0d1e3. Included in macOS 10.15.6. Apple, 2020.

Noto: Jelle Bosma: Noto Sans Devanagari Regular. Font version 2.001. Google, 2020.

OpenType Devanagari: Developing OpenType Fonts for Devanagari Script. Microsoft, dated 02/08/2018, accessed 2020-07-12.

OpenType USE: Creating and supporting OpenType fonts for the Universal Shaping Engine. Microsoft, dated 07/31/2020, accessed 2020-08-07.

Safari: Safari. Browser version 13.1.2. Included in macOS 10.15.6. Apple, 2020.

Sangam: Muthu Nedumaran: Devanagari Sangam MN. Font version 14.0d1e12. Included in macOS 10.15.6. Apple, 2020.

Sharma 2009: Shriramana Sharma: Request for encoding 1CF4 VEDIC TONE CANDRA ABOVE. Unicode Consortium, 2009.

Sharma 2011a: Shriramana Sharma: Request to annotate 1CD8 VEDIC TONE CANDRA BELOW. Unicode Consortium, 2011.

Sharma 2011b: Shriramana Sharma: Proposal to encode svara markers for the Jaiminiya Archika. Unicode Consortium, 2011.

SIL: Devanagari (Nagari). SIL International. Accessed 2020-08-19.

South Asia subcommittee 2008: South Asia subcommittee: South Asia Subcommittee Report. Unicode Consortium, 2008.

Srinidhi and Sridatta 2017: Srinidhi A and Sridatta A: Request to change the glyphs of Vedic signs Jihvamuliya and Upadhmaniya. Unicode Consortium, 2017.

Unicode: The Unicode Consortium: The Unicode Standard, Version 13.0. The Unicode Consortium, 2020. For Devanagari, in particular section 12.1 Devanagari, pages 447-472.

Unicode Normalization: The Unicode Consortium: Unicode Character Encoding Stability Policies. The Unicode Consortium. Accessed 2020-08-19.