Add 12 Indic language hyphenation patterns by santhoshtr · Pull Request #30 · typst/hypher

santhoshtr · 2026-04-01T16:47:54Z

The following languages are added:

Assamese (as)
Bengali (bn)
Gujarati (gu)
Hindi (hi)
Kannada (kn)
Malayalam (ml)
Marathi (mr)
Oriya (or)
Panjabi (pa)
Sanskrit (sa)
Tamil (ta)
Telugu (te)

As discussed in typst/typst#8033, this PR adds 12 Indic languages to Hypher. As a follow-up, I will attempt to define a hyphenation character property (in another PR).

Except for Sanskrit, I authored all of these hyphenation patterns myself. The license is permissive (MIT).

laurmaedje

Thanks for the PR! Just one small remark.

laurmaedje · 2026-04-02T08:29:48Z

-By default, this crate supports hyphenating more than 30 languages. Embedding
-automata for all these languages will add ~1.1 MiB to your binary.
+By default, this crate supports hyphenating 48 languages. Embedding
+automata for all these languages will add ~1.3 MiB to your binary.


On main, I get 1162474B if I sum all the file sizes (via find . -type f -exec stat -f '%z' {} + | awk '{sum += $1} END {print sum}'). On your branch, I get 1166722B, which is barely more. Divided by $1024^2$, both amount to rounded ~1.1 MiB. Assuming that previous number was correct, that makes sense, since the tries you added are very small.
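As a quick check of the arithmetic above, here is a minimal Rust sketch that converts the two byte totals quoted in this comment to MiB. The helper names `bytes_to_mib` and `round_to_tenth` are made up for illustration and are not part of hypher.

```rust
/// Number of bytes in one mebibyte (1024^2).
const BYTES_PER_MIB: f64 = 1024.0 * 1024.0;

/// Total size of the pattern files on main, as quoted in the comment.
const MAIN_BYTES: u64 = 1_162_474;

/// Total size of the pattern files on the PR branch, as quoted in the comment.
const BRANCH_BYTES: u64 = 1_166_722;

/// Converts a byte count to mebibytes.
/// (Hypothetical helper for illustration only.)
fn bytes_to_mib(bytes: u64) -> f64 {
    bytes as f64 / BYTES_PER_MIB
}

/// Rounds a value to one decimal place, matching the README's "~1.1 MiB" style.
/// (Hypothetical helper for illustration only.)
fn round_to_tenth(value: f64) -> f64 {
    (value * 10.0).round() / 10.0
}

/// Returns how many bytes the branch adds on top of main.
fn added_bytes() -> u64 {
    BRANCH_BYTES - MAIN_BYTES
}
```

Under these numbers, `round_to_tenth(bytes_to_mib(MAIN_BYTES))` and `round_to_tenth(bytes_to_mib(BRANCH_BYTES))` both come out at 1.1, and `added_bytes()` is 4248, which is why the README figure does not need to change.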

Did you make a different calculation?

With these new languages, this is what I see:

du -hsc *
60K   af.bin
4.0K  as.bin
4.0K  be.bin
16K   bg.bin
4.0K  bn.bin
4.0K  ca.bin
40K   cs.bin
8.0K  da.bin
204K  de.bin
4.0K  el.bin
28K   en.bin
16K   es.bin
20K   et.bin
4.0K  fi.bin
8.0K  fr.bin
8.0K  gl.bin
4.0K  gu.bin
4.0K  hi.bin
4.0K  hr.bin
348K  hu.bin
24K   is.bin
4.0K  it.bin
12K   ka.bin
4.0K  kn.bin
4.0K  ku.bin
4.0K  la.bin
8.0K  lt.bin
4.0K  ml.bin
8.0K  mn.bin
4.0K  mr.bin
64K   nl.bin
156K  no.bin
4.0K  or.bin
4.0K  pa.bin
16K   pl.bin
4.0K  pt.bin
36K   ru.bin
4.0K  sa.bin
16K   sk.bin
8.0K  sl.bin
4.0K  sq.bin
16K   sr.bin
24K   sv.bin
4.0K  ta.bin
4.0K  te.bin
4.0K  tk.bin
4.0K  tr.bin
24K   uk.bin
1.3M  total

But your method is more accurate, since du counts disk blocks rather than actual file sizes (a 200 B file will use 4K on disk, as that is one block). I will keep the number at ~1.1.
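To illustrate the block-rounding effect described here, the following Rust sketch rounds each file size up to a whole number of blocks. It assumes a 4 KiB block size, as in the 200 B example; real filesystems may use a different block size, and the function names are hypothetical.

```rust
/// Block size assumed for this illustration (4 KiB, as in the example above).
/// Actual filesystems may differ.
const ASSUMED_BLOCK_SIZE: u64 = 4096;

/// Space a file occupies on disk when storage is allocated in whole blocks:
/// the file size rounded up to the next multiple of `block_size`.
fn allocated_bytes(file_size: u64, block_size: u64) -> u64 {
    // Integer ceiling division, then scale back up to bytes.
    let blocks = (file_size + block_size - 1) / block_size;
    blocks * block_size
}

/// Sum of the actual file sizes (what summing `stat` sizes reports).
fn total_apparent(file_sizes: &[u64]) -> u64 {
    file_sizes.iter().sum()
}

/// Sum of the block-rounded sizes (roughly what `du` reports).
fn total_allocated(file_sizes: &[u64], block_size: u64) -> u64 {
    file_sizes
        .iter()
        .map(|&size| allocated_bytes(size, block_size))
        .sum()
}
```

With three hypothetical 200 B files, `total_apparent` gives 600 bytes while `total_allocated` with `ASSUMED_BLOCK_SIZE` gives 12288 bytes. Many small tries therefore inflate the du total far more than they inflate the embedded binary size.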

laurmaedje · 2026-04-02T08:36:04Z

As a follow-up, I will attempt to define a hyphenation character property (in another PR).

Just to clarify: On the Typst side, you'd be on board with keeping this internal as an automatic language-based property, right? And since which character is used is not defined by hypher, I don't think adjustments here would be necessary.

santhoshtr · 2026-04-02T10:51:40Z

On the Typst side, you'd be on board with keeping this internal as an automatic language-based property, right? And since which character is used is not defined by hypher, I don't think adjustments here would be necessary.

I am less familiar with these systems. So please correct me if I am wrong.

I was under the assumption that Hypher supplies language-based properties to Typst. For example, bounds is one such property that Hypher defines. In the CSS spec, the equivalent of bounds is hyphenate-limit-chars. Going by this logic, I assume the language-specific hyphenation character should also be in Hypher. Then Typst should read that for the language in context. Am I right?

laurmaedje · 2026-04-02T10:52:16Z

Going by this logic, I assume language specific hyphenation-character should also be in hypher. Then typst should read that for the language in context. Am I right?

Ah, that's a fair view on things. I would be fine with that!
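The division of responsibilities agreed on here can be sketched roughly as follows: the hyphenation library exposes per-language properties, and the consumer (the role Typst plays) reads them for the language in context. Everything in this sketch is hypothetical. `DemoLang` is not hypher's `Lang` type, the variants do not correspond to real languages, and the bounds and character values are placeholders rather than data from hypher or from this PR.

```rust
/// Hypothetical stand-in for a language identifier. This is NOT hypher's
/// `Lang` type; variants and values are placeholders for illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DemoLang {
    /// A language that marks a line-end break with a visible hyphen.
    WithHyphen,
    /// A language that breaks words without a visible marker.
    WithoutHyphen,
}

impl DemoLang {
    /// Minimum number of characters to keep before and after a break,
    /// comparable in spirit to CSS `hyphenate-limit-chars`.
    /// The values are placeholders.
    fn bounds(self) -> (usize, usize) {
        match self {
            DemoLang::WithHyphen => (2, 3),
            DemoLang::WithoutHyphen => (2, 2),
        }
    }

    /// Character to place at a break, if any. The values are placeholders.
    fn hyphenation_character(self) -> Option<char> {
        match self {
            DemoLang::WithHyphen => Some('-'),
            DemoLang::WithoutHyphen => None,
        }
    }
}

/// Consumer side: split `word` after `at` characters if the language's
/// bounds allow it, appending the language's break character to the left
/// part. Returns `None` when the bounds forbid a break at that position.
fn break_word(lang: DemoLang, word: &str, at: usize) -> Option<(String, String)> {
    let (min_left, min_right) = lang.bounds();
    let len = word.chars().count();
    if at < min_left || len < at + min_right {
        return None;
    }
    let mut left: String = word.chars().take(at).collect();
    let right: String = word.chars().skip(at).collect();
    if let Some(marker) = lang.hyphenation_character() {
        left.push(marker);
    }
    Some((left, right))
}
```

The point of the design is that the consumer never hard-codes a per-language table of its own: it asks the language value for both its bounds and its break character, so adding a language only touches one place.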

laurmaedje · 2026-04-02T10:52:43Z

Thank you!

santhoshtr mentioned this pull request Apr 2, 2026

Support for Malayalam (ml) Language - Hyphenation #8

Open

laurmaedje reviewed Apr 2, 2026

Add 12 Indic language hyphenation patterns

01dd84b

- Assamese (as)
- Bengali (bn)
- Gujarati (gu)
- Hindi (hi)
- Kannada (kn)
- Malayalam (ml)
- Marathi (mr)
- Oriya (or)
- Panjabi (pa)
- Sanskrit (sa)
- Tamil (ta)
- Telugu (te)

santhoshtr force-pushed the indic-patterns branch from bc9fd77 to 01dd84b · April 2, 2026 10:43

laurmaedje merged commit 8715c4c into typst:main Apr 2, 2026
3 checks passed

santhoshtr mentioned this pull request Apr 3, 2026

Add hyphenation_character method to Lang #31

Merged
