MeasureThat.net
Tokenize: 2 different methods
(version: 0)
Comparing performance of: tokenize vs tokenize 2
Created: 2 years ago by: Guest
Script Preparation code:
testString = `
Pour analyser le sujet, il faut penser à analyser :
- les 2 notions importantes du sujet
- mais aussi les autres termes du sujet (attention chaque terme est important)
Il y a plusieurs stratégies possibles d'analyse :
- On peut d'abord simplement noter ce à quoi ça nous fait penser : on note des synonymes, des expressions, des situations et des idées associées à cette notion. On part de ses représentations spontanées (ce qui nous vient immédiatement à l'esprit).
- On pense aussi au contraire de la notion, aux opposés, à des distinctions, des différences
- On essaie de trouver des exemples d'application de la notion (à quoi ça renvoie concrètement ?)
- On essaie de trouver des connaissances utiles sur cette notion (qu'est-ce qu'on a vu dans le cours à ce propos ? Y a-t-il des philosophes, des courants philosophiques, du vocabulaire technique qu'on pourrait mobiliser ?)
- On essaie de définir la notion en dégageant ses caractéristiques fondamentales et spécifiques (qui la distinguent d'autres notions)
Dans tous les cas, il faut travailler au brouillon à l'écrit, avec son stylo, et pas seulement dans sa tête, en notant bien toutes ses idées (même si on a l'impression que c'est nul : cela va au contraire nous permettre de débloquer notre pensée !).
`
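Both tests split this string on the same separator pattern (whitespace, apostrophes, and common punctuation). As a quick illustration, not part of the benchmark, of how that separator behaves on a fragment of the prepared text:

const sep = /\s|'|,|\.|\:|\?|\!|\(|\)|\[|\]/;  // the pattern used by both tokenize functions below
console.log("l'esprit, concrètement ?".split(sep).filter(w => w.length >= 5));
// => [ 'esprit', 'concrètement' ]  (empty strings and short fragments are filtered out)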
Tests:
tokenize
tokenize(testString)

function tokenize(text) {
    // Split a string into tokens: lowercase, break on whitespace and punctuation,
    // and keep only words of at least 5 characters.
    // (The trailing `|| []` is redundant: filter() always returns an array.)
    const words = text.toLowerCase().split(/\s|'|,|\.|\:|\?|\!|\(|\)|\[|\]/).filter(word => word.length >= 5) || [];
    // Build tokens, each with an associated weight
    const tokens = [];
    for (const word of words) {
        // First token type: the whole word; highest weight
        // (note: stored under the key `word`, unlike the substring tokens below)
        tokens.push({word, weight: 5});
        // Then add tokens of 5, 6 and 7 consecutive characters to detect common roots.
        // The longer the token, the higher its weight; a token that starts
        // at the beginning of the word gets a higher weight.
        if (word.length >= 5) {
            for (let i = 0; i <= word.length - 5; i++) {
                const weight = i === 0 ? 0.6 : 0.4;
                const token = word.substring(i, i + 5);
                tokens.push({token, weight});
            }
        }
        if (word.length >= 6) {
            for (let i = 0; i <= word.length - 6; i++) {
                const weight = i === 0 ? 0.8 : 0.6;
                const token = word.substring(i, i + 6);
                tokens.push({token, weight});
            }
        }
        if (word.length >= 7) {
            for (let i = 0; i <= word.length - 7; i++) {
                const weight = i === 0 ? 1 : 0.8;
                const token = word.substring(i, i + 7);
                tokens.push({token, weight});
            }
        }
    }
    return tokens;
}
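To make the weighting scheme concrete, here is a small illustration (not part of the benchmark) of what tokenize returns for a single 8-letter word; the expected output in the comments follows directly from the function above:

console.log(tokenize('Analyser'));
// => [ { word: 'analyser', weight: 5 },    // the whole word
//      { token: 'analy',   weight: 0.6 },  // 5-char windows; the word-initial window weighs more
//      { token: 'nalys',   weight: 0.4 },
//      { token: 'alyse',   weight: 0.4 },
//      { token: 'lyser',   weight: 0.4 },
//      { token: 'analys',  weight: 0.8 },  // 6-char windows
//      { token: 'nalyse',  weight: 0.6 },
//      { token: 'alyser',  weight: 0.6 },
//      { token: 'analyse', weight: 1 },    // 7-char windows
//      { token: 'nalyser', weight: 0.8 } ]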
tokenize 2
tokenize2(testString)

function tokenize2(text) {
    // Split a string into tokens: lowercase, break on whitespace and punctuation,
    // and keep only words of at least 5 characters.
    const words = text.toLowerCase().split(/\s|'|,|\.|\:|\?|\!|\(|\)|\[|\]/).filter(word => word.length >= 5) || [];
    // Variable weights by token size (5, 6 and 7 characters respectively)
    const weights = [0.4, 0.6, 0.8];
    // Build tokens, each with an associated weight
    const tokens = words.reduce((acc, word) => {
        // First token type: the whole word; highest weight
        acc.push({ word, weight: 5 });
        // Then add tokens of 5, 6 and 7 consecutive characters to detect common roots.
        // The longer the token, the higher its weight; a token that starts
        // at the beginning of the word gets a +0.2 bonus.
        const windowSizeMin = 5;
        for (let windowSize = windowSizeMin; windowSize <= windowSizeMin + weights.length - 1; windowSize++) {
            for (let i = 0; i <= word.length - windowSize; i++) {
                const weightStart = i === 0 ? 0.2 : 0;
                const weight = Math.round((weights[windowSize - windowSizeMin] + weightStart) * 10) / 10;
                const token = word.substring(i, i + windowSize);
                acc.push({ token, weight });
            }
        }
        return acc;
    }, []);
    return tokens;
}
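Because the benchmark is only meaningful if the two implementations do the same work, a quick equivalence check such as this sketch (again, not part of the test suite) can confirm they emit identical token lists:

const a = tokenize(testString);
const b = tokenize2(testString);
console.log(a.length === b.length && JSON.stringify(a) === JSON.stringify(b)); // expected: true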
Latest run results:
Run details: (Test run date: one year ago)
User agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36
Browser/OS:
Chrome 135 on Windows
Test name   | Executions per second
tokenize    | 54414.3 Ops/sec
tokenize 2  | 44489.9 Ops/sec
Autogenerated LLM Summary (model llama3.2:3b, generated one year ago):
**Benchmark Test Cases**

The two test cases, `tokenize` and `tokenize2`, run the same tokenization pipeline over the prepared French text: lowercase the input, split on whitespace and punctuation, and keep only words of at least 5 characters. For each kept word, both functions emit the whole word with weight 5, plus every 5-, 6- and 7-character substring (a sliding window) to capture common roots. Substring weights grow with window length (0.4, 0.6 and 0.8), and a window anchored at the start of the word gets a +0.2 bonus (so 0.6, 0.8 and 1.0 respectively). The two functions produce identical output; only the implementation style differs.

**Approach 1: `tokenize`**

* Iterates over the words with a `for...of` loop, pushing into a pre-declared `tokens` array.
* Unrolls the three window sizes into three separate `if`/`for` blocks with hard-coded weights.

**Approach 2: `tokenize2`**

* Builds the token list with `Array.prototype.reduce`, accumulating into `acc`.
* Generalizes the window sizes into a single nested loop driven by a `weights` array, computing each weight as `Math.round((base + startBonus) * 10) / 10`.

**Results and Interpretation**

* `tokenize` reached 54414.3 ops/sec against 44489.9 ops/sec for `tokenize 2`: the hand-unrolled version is roughly 22% faster in this run.
* Plausible causes: the `reduce` callback adds one function invocation per word, and `tokenize2` performs extra arithmetic (`Math.round`, array indexing) per token where `tokenize` uses constant literals.

**Pros and Cons**

* `tokenize`: simple and fast, but the three duplicated blocks must be kept in sync by hand if the window sizes or weights ever change.
* `tokenize 2`: more maintainable and easier to extend (adding a weight to the array adds a window size), at a measurable performance cost.

In summary, `tokenize2` is the better choice for maintainability, while the unrolled `tokenize` is the faster option on this engine when tokenization sits on a hot path.
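For readers who want to sanity-check these numbers locally, a rough timing sketch along the following lines can be used. It is illustrative only: it assumes the tokenize, tokenize2 and testString definitions above, a proper harness does more careful calibration and warm-up than this, and absolute numbers will vary by machine and engine.

function opsPerSec(fn, iterations = 20000) {
    fn(); // one warm-up call so initial compilation doesn't skew the timing
    const start = performance.now();
    for (let i = 0; i < iterations; i++) fn();
    const elapsedMs = performance.now() - start;
    return Math.round(iterations / (elapsedMs / 1000));
}

console.log('tokenize  :', opsPerSec(() => tokenize(testString)), 'ops/sec');
console.log('tokenize 2:', opsPerSec(() => tokenize2(testString)), 'ops/sec');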