A point in multi-dimensional space, described by a vector
Mathematically represents the characteristics and meaning of a word, phrase or text
It encodes both semantic and contextual information
AI and LLMs enable the creation of embeddings with hundreds or even thousands of dimensions
Cosine distance is defined as one minus the cosine of the angle between two vectors (equivalently, one minus the dot product of the two vectors normalized to unit length); it ranges from 0 to 2
The shortest distance wins; we treat everything below 0.1 as a very close match
// cosine distance: 1 - cosine similarity of the two vectors
function distance(array $a, array $b): float
{
    return 1 - (dotp($a, $b) /
        sqrt(dotp($a, $a) * dotp($b, $b))
    );
}

// calculating the dot product of two vectors of equal length
function dotp(array $a, array $b): float
{
    $products = array_map(function ($da, $db) {
        return $da * $db;
    }, $a, $b);
    return (float) array_sum($products);
}
This will produce a floating-point number between 0 and 2 (0 = same direction, 1 = orthogonal, 2 = opposite)
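A quick sanity check with toy 2D vectors (my own illustrative values, not from the original examples) shows the range:

var_dump(distance([1, 0], [1, 0]));  // float(0) - identical direction
var_dump(distance([1, 0], [0, 1]));  // float(1) - orthogonal, unrelated
var_dump(distance([1, 0], [-1, 0])); // float(2) - opposite direction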
the word "Queen"
[-0.0045574773,-0.0067263762,-0.002498418,
-0.018243857,-0.01689091,0.010516719,-0.0076504247,
-0.024046184,-0.017365139,-0.012818122,0.0145058185,
0.022330591,0.014533714,-0.0029691597,-0.018801773,
0.008884814,0.043322187,0.021061333,0.029513761,
-0.008801127,0.0020712635,0.014136199,-0.005460604,
0.003598559,-0.005296716,-0.010230786,0.0072319875,
... ,-0.011262931]
encoded in 1536 dimensions, capturing meaning, context and relations
composer require openai-php/client
// fetch the embedding vector for a text from the OpenAI API
function getEmbedding(string $text): array
{
    $client = OpenAI::client(getenv('OPENAIKEY'));
    $result = $client->embeddings()->create([
        'model' => 'text-embedding-ada-002',
        'input' => $text,
        'encoding_format' => 'float',
    ]);
    return $result->toArray()['data'][0]['embedding'];
}
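A quick check of the returned vector (the 1536 dimensions match the "Queen" example above):

$vector = getEmbedding('Queen');
echo count($vector); // 1536 dimensions for text-embedding-ada-002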
Tip: cache the resulting embedding; identical input always yields the same vector, and every API call costs time and money
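The following examples all rely on an Embedding helper class that is not shown on the slides. Here is a minimal sketch of what it might look like, with a simple in-memory cache; the method names match the examples below, but the implementation details are assumptions:

class Embedding
{
    // simple in-memory cache: text => vector (an assumption;
    // a real implementation might cache to disk or a database)
    private array $cache = [];

    // fetch (and cache) the embedding vector for a text
    public function calculateEmbedding(string $text): array
    {
        if (!isset($this->cache[$text])) {
            $this->cache[$text] = getEmbedding($text);
        }
        return $this->cache[$text];
    }

    // cosine distance between two vectors (see distance() above)
    public function distance(array $a, array $b): float
    {
        return distance($a, $b);
    }

    // embed a list of keywords: keyword => vector
    public function generateDictionary(array $keywords): array
    {
        $dictionary = [];
        foreach ($keywords as $keyword) {
            $dictionary[$keyword] = $this->calculateEmbedding($keyword);
        }
        return $dictionary;
    }

    // distance of $vector to every dictionary entry, sorted ascending
    public function calculateDistances(array $vector, array $dictionary): array
    {
        $distances = [];
        foreach ($dictionary as $keyword => $entry) {
            $distances[$keyword] = $this->distance($vector, $entry);
        }
        asort($distances);
        return $distances;
    }

    // the closest dictionary entry as [keyword, distance]
    public function getNearestNeighbour(array $vector, array $dictionary): array
    {
        $distances = $this->calculateDistances($vector, $dictionary);
        $keyword = array_key_first($distances);
        return [$keyword, $distances[$keyword]];
    }
}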
$embedding = new Embedding();
$text1 = $argv[1];
$text2 = $argv[2];
$d1 = $embedding->calculateEmbedding($text1);
$d2 = $embedding->calculateEmbedding($text2);
$distance = $embedding->distance($d1, $d2);
$rating = $distance > 0.125 ? 'BAD' : 'GOOD';
printf("\n\n%s\n\n%s\n\nDistance: %s\n\nRating: %s\n\n",
$text1, $text2, $distance, $rating);
0.125 is the threshold we defined for ourselves
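For example, comparing "cat" against "kitten" should come out well below 0.125 and rate GOOD, while "cat" against "spreadsheet" should rate BAD; the exact distances depend on the model.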
$embedding = new Embedding();
$normalizeMe = $argv[1];
$dictionary = $embedding->generateDictionary([
    'rock', 'paper', 'scissors', 'lizard', 'spock',
]);
$distances = $embedding->calculateDistances(
    $embedding->calculateEmbedding($normalizeMe),
    $dictionary
); // already sorted by distance, ascending
print_r($distances);
printf('The input "%s" is normalized to "%s"'."\n\n\n",
    $normalizeMe, array_keys($distances)[0]);
This way we can normalize arbitrary data and text to our domain.
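For example, an input like "stone" or "granite" would most likely come out closest to "rock" (illustrative; the actual distances depend on the model).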
$testBias = new Embedding();
$f_p = $testBias->calculateEmbedding('female profession');
$m_p = $testBias->calculateEmbedding('male profession');
$jobs = [
    'Doctor',
    'Nurse', // and more professions
];
foreach ($jobs as $job) {
    $j_e = $testBias->calculateEmbedding($job);
    $d1 = $testBias->distance($j_e, $f_p);
    $d2 = $testBias->distance($j_e, $m_p);
    printf("%s is a profession for %s.\n",
        $job, $d1 < $d2 ? 'WOMEN' : 'MEN');
}
Biases are inherent to all AI models! It always depends on who trained them and how!
$em = new Embedding();
$theText = $argv[1];
[$keyword, $distance] = $em->getNearestNeighbour(
    $em->calculateEmbedding($theText),
    $em->generateDictionary([
        'truck', 'computer memory', 'sheep',
    ])
);
printf("\n\n\n".'"%s" probably is: %s'."\n\n\n",
    $theText, $keyword);
Takeaway: Models are trained by nerds!
Good thing: you can use embeddings to test for bias in your own models!
$embedding = new Embedding();
$rateme = $argv[1];
$distances = $embedding->calculateDistances(
    $embedding->calculateEmbedding($rateme),
    $embedding->generateDictionary([
        'joy', 'sadness', 'anger', // ... 120 more emotions
    ])
);
$filtered = array_filter($distances, function ($val) {
    return $val < 0.25;
});
printf('The input "%s" carries the sentiments "%s"',
    $rateme, implode(', ', array_keys($filtered)));
Checking the "mood" - this is how Zoom knows how a conference call went
Input parser for spoken language
finding similar products (nearest neighbour) as alternative recommendations, based on loose couplings like the order history rather than keywords or categories
creating a search by "intention", not keyword (see the sketch after this list)
or a chat feature based on your database
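To illustrate the "intention" search mentioned above, a minimal sketch reusing the Embedding class; the sample documents and query are hypothetical, and a real application would keep the vectors in a (vector) database instead of embedding them on every request:

$search = new Embedding();
// hypothetical knowledge-base entries
$documents = [
    'How to reset your password',
    'Shipping costs and delivery times',
    'Returning a damaged product',
];
$query = 'my parcel arrived broken, what now?';
// rank all documents by cosine distance to the query
$ranking = $search->calculateDistances(
    $search->calculateEmbedding($query),
    $search->generateDictionary($documents)
); // smallest distance first
print_r($ranking); // 'Returning a damaged product' should rank first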
Ollama has 3 embedding models available (REST API; see the sketch after this list)
Google Vertex-AI has 2 embedding models
Algorithm-based models, like TensorFlow, Word2Vec, Bayes, etc.
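As a sketch of how the Ollama route could look from PHP, using its documented REST endpoint (the model name nomic-embed-text is just one of the available embedding models; the details here are assumptions based on the Ollama docs):

function getOllamaEmbedding(string $text): array
{
    // POST /api/embeddings with model and prompt
    $ch = curl_init('http://localhost:11434/api/embeddings');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => ['Content-Type: application/json'],
        CURLOPT_POSTFIELDS => json_encode([
            'model' => 'nomic-embed-text',
            'prompt' => $text,
        ]),
    ]);
    $response = curl_exec($ch);
    curl_close($ch);
    // the response JSON carries the vector under "embedding"
    return json_decode($response, true)['embedding'];
}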
Barry Stahl - who gave me the inspiration and examples for this talk
SUDHAUS7 - for letting me abuse their OpenAI Key and paying for the time to research the whole topic
TYPO3 GmbH - for letting me on their stage to talk about this stuff - again!
Twitter: @FoppelFB | Mastodon: @foppel@phpc.social
fberger@sudhaus7.de | https://sudhaus7.de/