MeasureThat.net
Tokenize: 2 different methods
(version: 0)
Comparing performance of: tokenize vs tokenize 2
Created: 2 years ago by: Guest
Script Preparation code:
testString = `
Pour analyser le sujet, il faut penser à analyser :
- les 2 notions importantes du sujet
- mais aussi les autres termes du sujet (attention chaque terme est important)
Il y a plusieurs stratégies possibles d'analyse :
- On peut d'abord simplement noter ce à quoi ça nous fait penser : on note des synonymes, des expressions, des situations et des idées associées à cette notion. On part de ses représentations spontanées (ce qui nous vient immédiatement à l'esprit).
- On pense aussi au contraire de la notion, aux opposés, à des distinctions, des différences
- On essaie de trouver des exemples d'application de la notion (à quoi ça renvoie concrètement ?)
- On essaie de trouver des connaissances utiles sur cette notion (qu'est-ce qu'on a vu dans le cours à ce propos ? Y a-t-il des philosophes, des courants philosophiques, du vocabulaire technique qu'on pourrait mobiliser ?)
- On essaie de définir la notion en dégageant ses caractéristiques fondamentales et spécifiques (qui la distinguent d'autres notions)
Dans tous les cas, il faut travailler au brouillon à l'écrit, avec son stylo, et pas seulement dans sa tête, en notant bien toutes ses idées (même si on a l'impression que c'est nul : cela va au contraire nous permettre de débloquer notre pensée !).
`
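Both tests split this string on the same separator pattern (whitespace, apostrophes, and common punctuation). As a quick illustration, not part of the benchmark, of how that separator behaves on a fragment of the prepared text:

const sep = /\s|'|,|\.|\:|\?|\!|\(|\)|\[|\]/;  // the pattern used by both tokenize functions below
console.log("l'esprit, concrètement ?".split(sep).filter(w => w.length >= 5));
// => [ 'esprit', 'concrètement' ]  (empty strings and short fragments are filtered out)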
Tests:
tokenize
tokenize(testString)

function tokenize(text) {
    // Split a string into tokens: lowercase, break on whitespace and punctuation,
    // and keep only words of at least 5 characters.
    // (The trailing `|| []` is redundant: filter() always returns an array.)
    const words = text.toLowerCase().split(/\s|'|,|\.|\:|\?|\!|\(|\)|\[|\]/).filter(word => word.length >= 5) || [];
    // Build tokens, each with an associated weight
    const tokens = [];
    for (const word of words) {
        // First token type: the whole word; highest weight
        // (note: stored under the key `word`, unlike the substring tokens below)
        tokens.push({word, weight: 5});
        // Then add tokens of 5, 6 and 7 consecutive characters to detect common roots.
        // The longer the token, the higher its weight; a token that starts
        // at the beginning of the word gets a higher weight.
        if (word.length >= 5) {
            for (let i = 0; i <= word.length - 5; i++) {
                const weight = i === 0 ? 0.6 : 0.4;
                const token = word.substring(i, i + 5);
                tokens.push({token, weight});
            }
        }
        if (word.length >= 6) {
            for (let i = 0; i <= word.length - 6; i++) {
                const weight = i === 0 ? 0.8 : 0.6;
                const token = word.substring(i, i + 6);
                tokens.push({token, weight});
            }
        }
        if (word.length >= 7) {
            for (let i = 0; i <= word.length - 7; i++) {
                const weight = i === 0 ? 1 : 0.8;
                const token = word.substring(i, i + 7);
                tokens.push({token, weight});
            }
        }
    }
    return tokens;
}
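To make the weighting scheme concrete, here is a small illustration (not part of the benchmark) of what tokenize returns for a single 8-letter word; the expected output in the comments follows directly from the function above:

console.log(tokenize('Analyser'));
// => [ { word: 'analyser', weight: 5 },    // the whole word
//      { token: 'analy',   weight: 0.6 },  // 5-char windows; the word-initial window weighs more
//      { token: 'nalys',   weight: 0.4 },
//      { token: 'alyse',   weight: 0.4 },
//      { token: 'lyser',   weight: 0.4 },
//      { token: 'analys',  weight: 0.8 },  // 6-char windows
//      { token: 'nalyse',  weight: 0.6 },
//      { token: 'alyser',  weight: 0.6 },
//      { token: 'analyse', weight: 1 },    // 7-char windows
//      { token: 'nalyser', weight: 0.8 } ]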
tokenize 2
tokenize2(testString)

function tokenize2(text) {
    // Split a string into tokens: lowercase, break on whitespace and punctuation,
    // and keep only words of at least 5 characters.
    const words = text.toLowerCase().split(/\s|'|,|\.|\:|\?|\!|\(|\)|\[|\]/).filter(word => word.length >= 5) || [];
    // Variable weights by token size (5, 6 and 7 characters respectively)
    const weights = [0.4, 0.6, 0.8];
    // Build tokens, each with an associated weight
    const tokens = words.reduce((acc, word) => {
        // First token type: the whole word; highest weight
        acc.push({ word, weight: 5 });
        // Then add tokens of 5, 6 and 7 consecutive characters to detect common roots.
        // The longer the token, the higher its weight; a token that starts
        // at the beginning of the word gets a +0.2 bonus.
        const windowSizeMin = 5;
        for (let windowSize = windowSizeMin; windowSize <= windowSizeMin + weights.length - 1; windowSize++) {
            for (let i = 0; i <= word.length - windowSize; i++) {
                const weightStart = i === 0 ? 0.2 : 0;
                const weight = Math.round((weights[windowSize - windowSizeMin] + weightStart) * 10) / 10;
                const token = word.substring(i, i + windowSize);
                acc.push({ token, weight });
            }
        }
        return acc;
    }, []);
    return tokens;
}
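Because the benchmark is only meaningful if the two implementations do the same work, a quick equivalence check such as this sketch (again, not part of the test suite) can confirm they emit identical token lists:

const a = tokenize(testString);
const b = tokenize2(testString);
console.log(a.length === b.length && JSON.stringify(a) === JSON.stringify(b)); // expected: true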
Latest run results:
Run details: (Test run date: one year ago)
User agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36
Browser/OS:
Chrome 135 on Windows
Test name   | Executions per second
tokenize    | 54414.3 Ops/sec
tokenize 2  | 44489.9 Ops/sec
Autogenerated LLM Summary (model llama3.2:3b, generated one year ago):
**Benchmark Test Cases**

The two test cases, `tokenize` and `tokenize2`, run the same tokenization pipeline over the prepared French text: lowercase the input, split on whitespace and punctuation, and keep only words of at least 5 characters. For each kept word, both functions emit the whole word with weight 5, plus every 5-, 6- and 7-character substring (a sliding window) to capture common roots. Substring weights grow with window length (0.4, 0.6 and 0.8), and a window anchored at the start of the word gets a +0.2 bonus (so 0.6, 0.8 and 1.0 respectively). The two functions produce identical output; only the implementation style differs.

**Approach 1: `tokenize`**

* Iterates over the words with a `for...of` loop, pushing into a pre-declared `tokens` array.
* Unrolls the three window sizes into three separate `if`/`for` blocks with hard-coded weights.

**Approach 2: `tokenize2`**

* Builds the token list with `Array.prototype.reduce`, accumulating into `acc`.
* Generalizes the window sizes into a single nested loop driven by a `weights` array, computing each weight as `Math.round((base + startBonus) * 10) / 10`.

**Results and Interpretation**

* `tokenize` reached 54414.3 ops/sec against 44489.9 ops/sec for `tokenize 2`: the hand-unrolled version is roughly 22% faster in this run.
* Plausible causes: the `reduce` callback adds one function invocation per word, and `tokenize2` performs extra arithmetic (`Math.round`, array indexing) per token where `tokenize` uses constant literals.

**Pros and Cons**

* `tokenize`: simple and fast, but the three duplicated blocks must be kept in sync by hand if the window sizes or weights ever change.
* `tokenize 2`: more maintainable and easier to extend (adding a weight to the array adds a window size), at a measurable performance cost.

In summary, `tokenize2` is the better choice for maintainability, while the unrolled `tokenize` is the faster option on this engine when tokenization sits on a hot path.
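For readers who want to sanity-check these numbers locally, a rough timing sketch along the following lines can be used. It is illustrative only: it assumes the tokenize, tokenize2 and testString definitions above, a proper harness does more careful calibration and warm-up than this, and absolute numbers will vary by machine and engine.

function opsPerSec(fn, iterations = 20000) {
    fn(); // one warm-up call so initial compilation doesn't skew the timing
    const start = performance.now();
    for (let i = 0; i < iterations; i++) fn();
    const elapsedMs = performance.now() - start;
    return Math.round(iterations / (elapsedMs / 1000));
}

console.log('tokenize  :', opsPerSec(() => tokenize(testString)), 'ops/sec');
console.log('tokenize 2:', opsPerSec(() => tokenize2(testString)), 'ops/sec');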