Add 12 Indic language hyphenation patterns #30

Merged
laurmaedje merged 1 commit into typst:main from santhoshtr:indic-patterns
Apr 2, 2026

@santhoshtr
Contributor

The following languages are added:

  • Assamese (as)
  • Bengali (bn)
  • Gujarati (gu)
  • Hindi (hi)
  • Kannada (kn)
  • Malayalam (ml)
  • Marathi (mr)
  • Oriya (or)
  • Panjabi (pa)
  • Sanskrit (sa)
  • Tamil (ta)
  • Telugu (te)

As discussed in typst/typst#8033, this PR adds 12 Indic languages to Hypher. As a follow-up, I will attempt to define a hyphenation character property (in another PR).

Except for Sanskrit, all of these hyphenation patterns were authored by me. The license is permissive (MIT).

Member

@laurmaedje laurmaedje left a comment

Thanks for the PR! Just one small remark.

Comment thread on README.md (outdated)
- By default, this crate supports hyphenating more than 30 languages. Embedding
- automata for all these languages will add ~1.1 MiB to your binary.
+ By default, this crate supports hyphenating 48 languages. Embedding
+ automata for all these languages will add ~1.3 MiB to your binary.
Member

On main, I get 1162474B if I sum all the file sizes (via find . -type f -exec stat -f '%z' {} + | awk '{sum += $1} END {print sum}'). On your branch, I get 1166722B, which is barely more. Divided by $1024^2$, both amount to rounded ~1.1 MiB. Assuming that previous number was correct, that makes sense, since the tries you added are very small.

Did you make a different calculation?
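
The arithmetic above can be reproduced without relying on platform-specific `stat` flags. The following is a minimal sketch, not the method used in this thread: `total_bytes` and `to_mib` are hypothetical helper names, and the two byte counts are the ones quoted in the comment.

```python
import os

def total_bytes(root: str) -> int:
    """Sum the apparent sizes (st_size) of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so that only regular files are counted.
            if not os.path.islink(path) and os.path.isfile(path):
                total += os.path.getsize(path)
    return total

def to_mib(num_bytes: int) -> float:
    """Convert a byte count to MiB (1 MiB = 1024^2 bytes)."""
    return num_bytes / 1024**2

# Byte totals quoted in the review comment.
for label, size in [("main", 1162474), ("branch", 1166722)]:
    print(f"{label}: {size} B = {to_mib(size):.1f} MiB")
```

Both totals round to 1.1 MiB, which matches the figure already in the README.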

Contributor Author

With these new languages, this is what I see:

du -hsc *
60K	af.bin
4.0K	as.bin
4.0K	be.bin
16K	bg.bin
4.0K	bn.bin
4.0K	ca.bin
40K	cs.bin
8.0K	da.bin
204K	de.bin
4.0K	el.bin
28K	en.bin
16K	es.bin
20K	et.bin
4.0K	fi.bin
8.0K	fr.bin
8.0K	gl.bin
4.0K	gu.bin
4.0K	hi.bin
4.0K	hr.bin
348K	hu.bin
24K	is.bin
4.0K	it.bin
12K	ka.bin
4.0K	kn.bin
4.0K	ku.bin
4.0K	la.bin
8.0K	lt.bin
4.0K	ml.bin
8.0K	mn.bin
4.0K	mr.bin
64K	nl.bin
156K	no.bin
4.0K	or.bin
4.0K	pa.bin
16K	pl.bin
4.0K	pt.bin
36K	ru.bin
4.0K	sa.bin
16K	sk.bin
8.0K	sl.bin
4.0K	sq.bin
16K	sr.bin
24K	sv.bin
4.0K	ta.bin
4.0K	te.bin
4.0K	tk.bin
4.0K	tr.bin
24K	uk.bin
1.3M	total

But your method is more accurate, since du counts blocks on disk and not the actual file size (a 200 B file will use 4K on disk, as that is one block). I will keep the number at ~1.1 MiB.
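
The gap between the two measurements can be illustrated with a small sketch. It assumes a 4096-byte allocation unit and that every file occupies a whole number of blocks; real filesystems may behave differently. The `allocated` helper and the sample sizes are hypothetical.

```python
def allocated(size: int, block: int = 4096) -> int:
    """Round size up to a whole number of blocks (assumed 4096 B each)."""
    return (size + block - 1) // block * block

sizes = [200, 4096, 5000]  # hypothetical file sizes in bytes

apparent = sum(sizes)                        # what summing file sizes reports
on_disk = sum(allocated(s) for s in sizes)   # what a block-based count reports

print(apparent)  # 9296
print(on_disk)   # 16384
```

With many small files, as in the listing above where most tries show up as 4.0K, the block-based total can noticeably overstate the bytes that end up embedded in a binary.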

@laurmaedje
Member

> As a follow-up, I will attempt to define a hyphenation character property (in another PR).

Just to clarify: On the Typst side, you'd be on board with keeping this internal as an automatic language-based property, right? And since which character is used is not defined by hypher, I don't think adjustments here would be necessary.

@santhoshtr
Contributor Author

> On the Typst side, you'd be on board with keeping this internal as an automatic language-based property, right? And since which character is used is not defined by hypher, I don't think adjustments here would be necessary.

I am less familiar with these systems, so please correct me if I am wrong.

I was under the assumption that Hypher supplies language-based properties to Typst. For example, bounds is one such property that Hypher defines. In the CSS spec, the equivalent of bounds is hyphenate-limit-chars. Going by this logic, I assume a language-specific hyphenation character should also be in Hypher. Then Typst should read it for the language in context. Am I right?
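
The design being proposed can be sketched as a per-language property lookup with a default fallback. This is a language-agnostic illustration only, not Hypher's or Typst's API: `LangProps`, `PROPS`, `props_for`, the language code `"xx"`, and all values are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LangProps:
    bounds: tuple[int, int]  # minimum characters before / after a break
    hyphen: str              # character inserted at a break

# Placeholder default, not taken from any real library.
DEFAULT = LangProps(bounds=(2, 2), hyphen="-")

# Placeholder table; "xx" is a made-up language code.
PROPS = {
    "xx": LangProps(bounds=(1, 2), hyphen="\u2010"),
}

def props_for(lang: str) -> LangProps:
    """Return the properties for lang, falling back to the default."""
    return PROPS.get(lang, DEFAULT)

# The consumer (the typesetter) reads both properties from one place.
props = props_for("xx")
print(props.bounds, repr(props.hyphen))
```

Keeping both properties in the same table means the consumer needs only one lookup per language, which is the argument made above for placing the hyphenation character next to bounds.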

@laurmaedje
Member

> Going by this logic, I assume a language-specific hyphenation character should also be in Hypher. Then Typst should read it for the language in context. Am I right?

Ah, that's a fair view on things. I would be fine with that!

@laurmaedje laurmaedje merged commit 8715c4c into typst:main Apr 2, 2026
3 checks passed
@laurmaedje
Member

Thank you!
