Benchmark: Diacritics removal (+ lowercase2)

Script Preparation code:

const replaceMap = {
  Æ: 'AE',
  æ: 'ae',
  Ꜳ: 'AA',
  ꜳ: 'aa',
  ꬱ: 'aə',
  Ꜵ: 'AO',
  ꜵ: 'ao',
  Ꜷ: 'AU',
  ꜷ: 'au',
  Ꜹ: 'AV',
  Ꜻ: 'AV',
  ꜹ: 'av',
  ꜻ: 'av',
  Ꜽ: 'AY',
  ꜽ: 'ay',
  ꭁ: 'əø',
  ﬀ: 'f‌f',
  ﬃ: 'f‌f‌i',
  ﬄ: 'f‌f‌l',
  ﬁ: 'f‌i',
  ﬂ: 'f‌l',
  '℔': 'lb',
  Ƕ: 'Hv',
  ƕ: 'hv',
  ꭢ: 'ɔe',
  ﬆ: 'st',
  ﬅ: 'ſt',
  ᵫ: 'ue',
  Ỻ: 'lL',
  ỻ: 'll',
  Œ: 'OE',
  œ: 'oe',
  Ꝏ: 'OO',
  ꝏ: 'oo',
  ẞ: 'ſs',
  ß: 'ſz',
  Ꜩ: 'TZ',
  ꜩ: 'tz',
  Ꝡ: 'VY',
  ꝡ: 'vy',
}
const replacementsRegex = new RegExp(Object.keys(replaceMap).join('|'), 'g')
window.removeDiacritics = (value) => {
    return value.toLowerCase()
        .normalize('NFD')
        .replace(/[\u0300-\u036f]/g, '')
        .replace(replacementsRegex, (char) => {
            return replaceMap[char] || char
        })
}

window.oldRemoveDiacritics = function(s) {
    if (!s)
        return s;

    let r = s.toLowerCase();

    r = r.replace(/[àáâãäå]/g, 'a');
    r = r.replace(/æ/g, 'ae');
    r = r.replace(/ç/g, 'c');
    r = r.replace(/[èéêë]/g, 'e');
    r = r.replace(/[ìíîï]/g, 'i');
    r = r.replace(/ñ/g, 'n');
    r = r.replace(/[òóôõö]/g, 'o');
    r = r.replace(/œ/g, 'oe');
    r = r.replace(/[ùúûü]/g, 'u');
    r = r.replace(/[ýÿ]/g, 'y');
    r = r.replace(/\\W/g, '');

    // To prevent some weird behaviors with MacOS diacritics
    r = _(r).map(function(char) {
        return String.fromCharCode(char.charCodeAt(0));
    });

    return r.join('');
};

Tests:

Old
oldRemoveDiacritics(` Dji pou magnî do vêre, çoula m' freut nén må - æsope - robot œuf Lorem, ipsum dolor sit amet consectetur adipisicing elit. Reiciendis voluptatem dolores molestiae possimus laudantium consequuntur placeat earum neque maxime modi quasi fugiat quae inventore, illo in corporis corrupti esse a! `)
New
removeDiacritics(` Dji pou magnî do vêre, çoula m' freut nén må - æsope - robot œuf Lorem, ipsum dolor sit amet consectetur adipisicing elit. Reiciendis voluptatem dolores molestiae possimus laudantium consequuntur placeat earum neque maxime modi quasi fugiat quae inventore, illo in corporis corrupti esse a! `)


Latest run results:

No previous run results


Autogenerated LLM Summary (model llama3.2:3b, generated one year ago):


Let's break down the provided benchmark definition and test cases.

**Benchmark Definition:**

The benchmark measures the performance of two different JavaScript functions: `oldRemoveDiacritics` and `removeDiacritics`. These functions are responsible for removing diacritical marks from input strings.

**Functions:**

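1. **`oldRemoveDiacritics`**: This older implementation lowercases the input and then applies a hardcoded series of regular-expression replacements for common accented Latin letters (`àáâãäå`, `ç`, `èéêë`, and so on) plus the ligatures `æ` and `œ`. It then rebuilds the string one character at a time with `String.fromCharCode(char.charCodeAt(0))`, a step the code comment describes as a workaround for "weird behaviors with MacOS diacritics". That step calls `_(r).map(...)`, so it relies on a `_` global (presumably Underscore or Lodash) that the preparation code shown here does not define.
2. **`removeDiacritics`**: This function lowercases the input, calls `normalize('NFD')` to decompose each accented character into a base letter followed by combining marks, strips the combining marks in the range U+0300 to U+036F with a regular expression, and finally expands ligatures and digraphs using `replaceMap`.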

**Library:**

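The preparation code shown loads no external library. The `replaceMap` object is a plain data structure written by the benchmark author; it maps ligatures and digraphs (such as `æ`, `œ`, and `ß`) to multi-letter replacements rather than mapping diacritical marks. Only `removeDiacritics` uses it; `oldRemoveDiacritics` carries its own hardcoded replacements.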
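
The map-driven replacement can be sketched in isolation as follows. This is a minimal illustration, not the benchmark's code: `ligatureMap` is a hypothetical three-entry subset of `replaceMap`, and `escapeForRegex` and `expandLigatures` are helper names introduced here. The escaping step is an added precaution that the benchmark itself does not perform.

```javascript
// Hypothetical three-entry subset of the benchmark's replaceMap, for illustration only.
const ligatureMap = { 'æ': 'ae', 'œ': 'oe', 'ﬆ': 'st' };

// Escape regex metacharacters so that arbitrary keys are safe inside the alternation.
// (Assumption: the benchmark skips this because none of its keys are metacharacters.)
const escapeForRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Build one alternation pattern from the map keys, e.g. /æ|œ|ﬆ/g.
const ligatureRegex = new RegExp(
  Object.keys(ligatureMap).map(escapeForRegex).join('|'),
  'g'
);

// Replace every matched key with its mapped value in a single pass.
function expandLigatures(text) {
  return text.replace(ligatureRegex, (match) => ligatureMap[match]);
}

console.log(expandLigatures('æsope - robot œuf')); // aesope - robot oeuf
```

Building a single alternation keeps the replacement to one pass over the string, at the cost of rebuilding the pattern whenever the map changes.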

**Special JavaScript Feature/Syntax:**

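* `String.prototype.normalize()` is a built-in JavaScript method that returns a Unicode normalization form of a string. With the `'NFD'` argument it performs canonical decomposition, so a precomposed character such as `é` becomes the base letter `e` followed by the combining acute accent U+0301.
* The `[\u0300-\u036f]` regular expression pattern matches code points U+0300 to U+036F, the Combining Diacritical Marks block, which covers the accents most common in Latin-script text.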
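
The two features above combine into a short pipeline, sketched here on its own. `stripCombiningMarks` is a hypothetical helper name used only for this illustration; it omits the lowercasing and ligature steps that the benchmark's `removeDiacritics` also performs.

```javascript
// Decompose to NFD, then delete combining marks in U+0300..U+036F.
// `stripCombiningMarks` is a hypothetical name, not part of the benchmark.
function stripCombiningMarks(text) {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '');
}

// A precomposed "é" (U+00E9) becomes two code units after decomposition.
console.log('\u00e9'.normalize('NFD').length); // 2

console.log(stripCombiningMarks('magnî vêre çoula nén må')); // magni vere coula nen ma

// "æ" has no canonical decomposition, so it passes through unchanged.
console.log(stripCombiningMarks('æsope')); // æsope
```

The last line shows why a separate ligature map is still needed: decomposition alone leaves characters like `æ` untouched.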

**Options Compared:**

The benchmark compares two options:

1. **`oldRemoveDiacritics`**: This option makes one `replace` call per letter group and handles only the characters listed in its patterns.
2. **`removeDiacritics`**: This option decomposes the string with `normalize('NFD')` and then makes two `replace` calls: one to strip combining marks and one to expand ligatures from `replaceMap`.

**Pros and Cons:**

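1. **`oldRemoveDiacritics`**:
	* Pros: Simple and easy to understand; the set of handled characters is explicit in the code.
	* Cons: Handles only the characters hardcoded in its patterns, so accented letters outside that list (for example `š` or `ő`) pass through unchanged, and it makes a separate pass over the string for each pattern.
2. **`removeDiacritics`**:
	* Pros: Handles any character that NFD decomposes into a base letter plus a mark in U+0300 to U+036F, without listing each one.
	* Cons: Characters with no canonical decomposition (for example `ø` or `ł`) are left unchanged unless they appear in `replaceMap`.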

**Other Considerations:**

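* The `replaceMap` object is specific to this benchmark; reusing it elsewhere means copying the object together with the regular expression built from its keys.
* Because `removeDiacritics` lowercases the input before the `replaceMap` lookup, the uppercase keys in the map (`Æ`, `Œ`, `ẞ`, and so on) should never match.
* The final step of `oldRemoveDiacritics`, which rebuilds each character from its char code, may not be necessary for all use cases.
* The `[\u0300-\u036f]` range in `removeDiacritics` covers a single Unicode block. Combining marks in other blocks are not removed, so the pattern may need to be widened for other scripts or symbols.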
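
One way the pattern could be adjusted is with a Unicode property escape, sketched below. This is an alternative technique, not something the benchmark uses: `\p{M}` with the `u` flag matches every code point in the Mark category, and the names `stripLatinRange` and `stripAllMarks` are hypothetical.

```javascript
// The benchmark's approach: only U+0300..U+036F is removed.
const stripLatinRange = (text) =>
  text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

// Alternative (assumption, not from the benchmark): remove every Mark code point.
const stripAllMarks = (text) =>
  text.normalize('NFD').replace(/\p{M}/gu, '');

// "e" + combining acute (U+0301), then "a" + combining right arrow above (U+20D7).
const sample = 'e\u0301 a\u20d7';

console.log(stripLatinRange(sample) === 'e a\u20d7'); // true: U+20D7 survives
console.log(stripAllMarks(sample) === 'e a');         // true: both marks removed
```

The broader pattern is not automatically better: in scripts where marks carry essential meaning, such as vowel signs, removing every mark changes the text rather than simplifying it.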

**Benchmark Results:**

No results have been recorded for this benchmark: the page reports no previous runs. Neither function can be called faster on the evidence available here, and the suite would need to be run to compare them.
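
Relative speed depends on the engine and the input, so it is worth measuring locally. A rough timing loop might look like the sketch below. It assumes a global `performance.now()` (available in browsers and recent Node.js versions); `timeIt`, `candidate`, and `sampleText` are hypothetical names, and `candidate` is a simplified stand-in rather than either benchmarked function.

```javascript
// Rough wall-clock timing: run fn(input) `iterations` times and return elapsed milliseconds.
// `timeIt` is a hypothetical helper, not part of the benchmark or any library.
function timeIt(fn, input, iterations = 10000) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    fn(input);
  }
  return performance.now() - start;
}

// Simplified stand-in for a function under test.
const candidate = (s) =>
  s.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '');

const sampleText = "Dji pou magnî do vêre, çoula m' freut nén må";
const elapsedMs = timeIt(candidate, sampleText);

console.log(`${elapsedMs.toFixed(2)} ms for 10000 iterations`);
```

A loop like this gives only a coarse comparison; a benchmark suite adds warm-up runs and statistical sampling that a single timed loop lacks.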

Related benchmarks:

Diacritics removal (+ lowercase2) (version: 1)

Comparing performance of: Old vs New

Created: 5 years ago by: Registered User
