📑Coffee's Blog

A Filename for Sorting an "Archive" folder to the bottom, with CJK Support

Coffeebank🗓️October 31, 2024

A while back, I wanted to naturally sort unwanted files and folders to the bottom -- and came across this thread about filenames:

Character in filename that makes it sorted in last position in Windows

A common trick to have files sorted first in Windows Explorer is to add the _ character as prefix, as displayed in the screenshot below.

Question: Which character can I add to the filename to have a file always in last position, after all other files?

Of course, adding Z (or ZZ, ZZZ, etc.) works, but it gives weird filenames such as ZZZOtherInformation.txt..

Source: superuser.com

Some suggested using Greek or Hebrew characters. Others suggested highly unusual Latin characters or accent/graves.

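Those wouldn't fit my needs because I regularly type CJK (Chinese, Japanese, Korean) characters, and Greek and Hebrew letters have lower code points than CJK, so they would sort above my CJK filenames rather than below them. In code point order, Latin letters come first, then Greek, with CJK near the middle and the Specials block at the very end.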
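To make that ordering concrete, here is a minimal sketch that prints the code point of one sample character per script. The sample list and the helper name `toUPlus` are picked purely for illustration; they do not come from the thread.

```javascript
// Hypothetical sample: one representative character per script or block.
const samples = [
  ["Latin", "a"],
  ["Greek", "Ω"],
  ["Hebrew", "ת"],
  ["Hiragana", "に"],
  ["CJK Unified Ideographs", "中"],
  ["Hangul Syllables", "한"],
  ["Halfwidth and Fullwidth Forms", "\uFFEE"],
];

// Format the first code point of a string as U+XXXX.
function toUPlus(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

const table = samples.map(([name, ch]) => ({ name, codePoint: toUPlus(ch) }));
console.log(table);
```

Each entry in the list has a higher code point than the one before it, which is why a Greek or Hebrew prefix ends up above CJK filenames in a plain code point sort.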

So, I decided to dive into the Unicode Blocks.


Candidate 1: The Symbols

To start, U+FF00..U+FFEF - Halfwidth and Fullwidth Forms could be interesting.

  • ￫ - Halfwidth Rightwards Arrow U+FFEB
  • ￭ - Halfwidth Black Square U+FFED
  • ￮ - Halfwidth White Circle U+FFEE

At U+FFEE, the last of these sits just short of U+FFFF, which gets us pretty close to the finish line.

These all look relatively natural -- at least, compared to using unsupported characters, or letters of a non-Latin language.

Full-width and half-width characters are also commonly used in Japanese, so if a system supports Unicode text, these should also work out-of-the-box.

Testing the sort in JavaScript, we can see indeed they are sorted last:

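// JavaScript

// The three "hwfw" entries are the halfwidth forms U+FFEB, U+FFED and U+FFEE.
// Paste the entries un-normalized: NFKC turns the halfwidth forms into ordinary
// symbols, and NFC turns the compatibility ideograph into a plain BMP ideograph,
// which changes where they sort.
const test = ["c", "a", "B", "Ω greek omega", "ת hebrew", "中文 cjk", "にほんご jp", "一 cjk", "한국어 kr", "𮕩 cjk p2 exf", "嶲 cjk compat ideograph", "𰻞 cjk p3 exg biang", "鿬 cjk unified tn", "𬭶 cjk p2 exe hs", "𬺓 cjk p2 exe", "𱁬 cjk p3 exg", "㊰", "￫ hwfw", "￭ hwfw", "￮ hwfw", "🯹 symbol legacy"]

// With no comparator, sort() orders strings by UTF-16 code unit.
test.sort()

// output

["B","a","c","Ω greek omega","ת hebrew","にほんご jp","㊰","一 cjk","中文 cjk","鿬 cjk unified tn","한국어 kr","🯹 symbol legacy","𬭶 cjk p2 exe hs","𬺓 cjk p2 exe","𮕩 cjk p2 exf","嶲 cjk compat ideograph","𰻞 cjk p3 exg biang","𱁬 cjk p3 exg","￫ hwfw","￭ hwfw","￮ hwfw"]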
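One detail worth knowing about that output: the halfwidth forms land after the Plane 2 and Plane 3 ideographs even though their code points are lower. That is because `sort()` without a comparator compares UTF-16 code units, and characters beyond U+FFFF are stored as surrogate pairs whose first unit (0xD800..0xDBFF) is smaller than 0xFFEB. The sketch below contrasts the default sort with a code point comparator. The function name `compareByCodePoint` and the sample filenames are made up for illustration.

```javascript
// Compare two strings by Unicode code point instead of UTF-16 code unit.
function compareByCodePoint(a, b) {
  // Array.from iterates a string by code point, so surrogate pairs stay together.
  const pa = Array.from(a, (ch) => ch.codePointAt(0));
  const pb = Array.from(b, (ch) => ch.codePointAt(0));
  const n = Math.min(pa.length, pb.length);
  for (let i = 0; i < n; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return pa.length - pb.length; // a shorter prefix sorts first
}

// Hypothetical filenames: U+FFED (halfwidth black square) and U+2CB73 (Plane 2).
const names = ["\uFFED Archive", "\u{2CB73} Archive", "中文", "notes"];

const byCodeUnit = [...names].sort();
const byCodePoint = [...names].sort(compareByCodePoint);

console.log(byCodeUnit);  // U+FFED entry is last
console.log(byCodePoint); // U+2CB73 entry is last
```

So whether a Plane 0 symbol or a Plane 2 character "wins" the last position depends on which of these two orderings a program uses.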

Candidate 2: The Characters

As the forum thread above warned, though, symbols do not sort the same way in the operating system's file managers.

On Linux, Dolphin (KDE) and PCManFM (LXDE) both sorted the symbols to the top:

￫, ￭, ￮, a, B, c, Ω, ת, 한국어, にほんご, 中文, ㊰, 嶲, 鿬, 𮕩, 𰻞, 𬭶, 𱁬, 𬺓
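A listing with symbols first, followed by a, B, c, looks like locale-aware collation rather than raw code point order. Whether Dolphin and PCManFM actually sort this way internally is an assumption here, not something verified. As a rough sketch, `Intl.Collator` in JavaScript shows the same kind of behavior: symbols come before letters, and case is only a tie-breaker. The sample names are hypothetical.

```javascript
// Hypothetical filenames; "\uFFED" is the Halfwidth Black Square.
const names = ["c", "a", "B", "中文", "\uFFED hwfw"];

// Locale-aware comparison (English locale chosen only as an example).
const collator = new Intl.Collator("en");
const sorted = [...names].sort(collator.compare);

console.log(sorted);
```

Under this kind of collation a symbol prefix floats to the top no matter how high its code point is, which is why the rest of this post looks at characters instead.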

It turns out Unicode has multiple "planes", and so far we have only looked at Plane 0, the BMP (Basic Multilingual Plane). There are others, such as Plane 1 (SMP, Supplementary Multilingual Plane), Plane 2 (SIP, Supplementary Ideographic Plane), and Plane 3 (TIP, Tertiary Ideographic Plane).

Planes 2 and 3 contain even more CJK Ideographs, so we can't ignore them if we want to sort among files with CJK filenames.

In addition, blocks are not assigned in plane order, so some Plane 2 characters were added after Plane 3 characters (Extension I below came after Extensions G and H):

  • (Plane 2) U+2B820..U+2CEAF - CJK Unified Ideographs Extension E (𬺓)

  • (Plane 2) U+2CEB0..U+2EBEF - CJK Unified Ideographs Extension F (𮕩)

  • (Plane 3) U+30000..U+3134F - CJK Unified Ideographs Extension G (𰻞, 𱁬)

  • (Plane 3) U+31350..U+323AF - CJK Unified Ideographs Extension H

  • (Plane 2) U+2EBF0..U+2EE5F - CJK Unified Ideographs Extension I
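As a small sketch of how these ranges can be used, the snippet below works out the plane of a character (each plane spans 0x10000 code points) and looks up which of the extensions listed above it belongs to. The ranges are copied from the list; the names `planeOf` and `extensionOf` are made up for this example.

```javascript
// Block ranges copied from the list above, in the order they were added.
const extensions = [
  { name: "Extension E", start: 0x2b820, end: 0x2ceaf },
  { name: "Extension F", start: 0x2ceb0, end: 0x2ebef },
  { name: "Extension G", start: 0x30000, end: 0x3134f },
  { name: "Extension H", start: 0x31350, end: 0x323af },
  { name: "Extension I", start: 0x2ebf0, end: 0x2ee5f },
];

// Plane number of the first character in a string.
function planeOf(ch) {
  return Math.floor(ch.codePointAt(0) / 0x10000);
}

// Name of the listed extension containing the character, or null if none.
function extensionOf(ch) {
  const cp = ch.codePointAt(0);
  const hit = extensions.find((e) => cp >= e.start && cp <= e.end);
  return hit ? hit.name : null;
}

for (const ch of ["a", "\u{2CB73}", "\u{30EDE}", "\u{2EE5D}"]) {
  console.log(ch, planeOf(ch), extensionOf(ch));
}
```

Note that Extension I, the newest block in the list, has lower code points than Extensions G and H, so "newest" does not always mean "sorted last".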

The Periodic Table

The Periodic Table of Elements lists the chemical elements with their atomic numbers, names, and symbols.

Chinese, however, uses its own single-character names for these elements.

Characters were still being added as recently as 2017 -- quite new, and for our purposes, that puts them pretty far down the Unicode list!

  • 𬭛 - 107 Bh (Bohrium) U+2CB5B (Plane 2 Extension E)
  • 𬭳 - 106 Sg (Seaborgium) U+2CB73 (Plane 2 Extension E)
  • 𬭶 - 108 Hs (Hassium) U+2CB76 (Plane 2 Extension E)

Because these are recognized as official Chinese characters in the 通用规范汉字表 (Table of General Standard Chinese Characters), they should be better supported as Unicode text than more novel characters.

Note that the Simplified Chinese variants are used here. For example, 𬭳 U+2CB73 becomes 𨭎 U+28B4E in Traditional Chinese, which is all the way up in Plane 2 Extension B. That heavily reduces its effectiveness for our purposes. Consider the alternative characters below as needed.
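To see why the Traditional variant is weaker, here is a rough check that asks whether a prefix character has a higher code point than the first character of every other filename. It assumes a pure code point sort, which (as noted earlier) file managers may not follow. The helper `sortsAfterAll` and the sample filenames are hypothetical.

```javascript
// True if the prefix's first code point is higher than that of every name.
// Assumes a plain code point ordering; only first characters are compared.
function sortsAfterAll(prefix, names) {
  const p = prefix.codePointAt(0);
  return names.every((name) => name.codePointAt(0) < p);
}

// Hypothetical filenames; U+2B820 is the first code point of Extension E.
const names = ["notes.txt", "中文.txt", "\u{2B820}.txt"];

const simplified = "\u{2CB73}";  // Simplified form of the Seaborgium character
const traditional = "\u{28B4E}"; // Traditional form, in Extension B

console.log(sortsAfterAll(simplified, names));  // true
console.log(sortsAfterAll(traditional, names)); // false
```

The Traditional form loses as soon as any filename starts with a character from a later extension, while the Simplified form still comes out last in this sample.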

Other Characters

Although not officially endorsed by any government, these characters are still popular and have a significant chance of being well supported on modern systems:

𰻞 U+30EDE is the Traditional Chinese form; the Simplified variant, 𰻝 U+30EDD, is the code point just before it. It is famous as "one of the most complex Chinese characters in modern usage" (Wikipedia), and as quite the delicious bowl of noodles.

𱁬 U+3106C is Japanese Kokuji. While not an official character, it was indexed in 今昔文字鏡 and is popularly known as "the most graphically complex CJK character" (Wikipedia).

The online virality and staying power of these interesting characters lend credibility to possible support across modern systems.

Newer Updates

However, if systems do update and start sorting even newer characters, these might come in handy someday:

  • 𲎯 - ⿱欪⿱亼⿱吅⿵冂卄 - U+323AF (Plane 3 Extension H)
  • 𮹝 - ⿱龙⿰龙龙 - U+2EE5D (Plane 2 Extension I)

In the previous section, 𱁬 U+3106C was selected from Plane 3 Extension G because it was one of the few that could be rendered on my system (no doubt partly due to its fame).

As of October 2024, I can't comment on Extensions H and I, as my system can't render any. I also cannot say whether any of them are nearly as famous.


Coffee's Recommendations

Based on my discoveries, I think I will be using the following options.

These are close enough to the end of their Unicode blocks to keep their advantage, but are nicer for me personally:

Plane 0

  • ￭ - Halfwidth Black Square U+FFED
    • It looks like a bullet point, which blends in well.

Plane 2

  • 𬭳 - 106 Sg (Seaborgium) U+2CB73 (Plane 2)

Plane 3