A point in multi-dimensional space, described by a vector
Mathematically represents the characteristics and meaning of a word, phrase or text
It encodes both semantic and contextual information
AI and LLMs enable the creation of embeddings with hundreds or even thousands of dimensions
Cosine distance is 1 minus the cosine of the angle between two vectors (normalized to unit length): distance(a, b) = 1 − (a·b) / (‖a‖·‖b‖), ranging from 0 (same direction) to 2 (opposite direction)
the shortest distance wins
// cosine distance: 1 minus the cosine similarity
function distance(array $a, array $b): float
{
    return 1 - (dotp($a, $b) /
        sqrt(dotp($a, $a) * dotp($b, $b))
    );
}

// calculating the dot-product of two equally long vectors
function dotp(array $a, array $b): float
{
    $products = array_map(function ($da, $db) {
        return $da * $db;
    }, $a, $b);
    return (float)array_sum($products);
}
This produces a floating-point number between 0 and 2; the smaller the value, the more similar the vectors
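A quick sanity check with toy 3-dimensional vectors (illustrative values, not real embeddings):

$a = [1.0, 0.0, 0.0];
$b = [0.0, 1.0, 0.0];
echo distance($a, $a), "\n"; // 0 – same direction
echo distance($a, $b), "\n"; // 1 – orthogonal
echo distance($a, [-1.0, 0.0, 0.0]), "\n"; // 2 – opposite direction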
the word/concept "Queen"
[-0.0045574773,-0.0067263762,-0.002498418,
-0.018243857,-0.01689091,0.010516719,-0.0076504247,
-0.024046184,-0.017365139,-0.012818122,0.0145058185,
0.022330591,0.014533714,-0.0029691597,-0.018801773,
0.008884814,0.043322187,0.021061333,0.029513761,
-0.008801127,0.0020712635,0.014136199,-0.005460604,
0.003598559,-0.005296716,-0.010230786,0.0072319875,
... ,-0.011262931]
encoded in 768 dimensions, capturing meaning, context and relations
composer require openai-php/client
ollama pull nomic-embed-text:latest
// fetch one embedding vector from the local Ollama server
// via its OpenAI-compatible API
function getEmbedding(string $text): array
{
    $client = OpenAI::factory()
        ->withBaseUri('http://localhost:11434/v1')
        ->make();
    $result = $client->embeddings()->create([
        'model' => 'nomic-embed-text',
        'dimensions' => 768,
        'input' => $text,
        'encoding_format' => 'float',
    ]);
    return $result->toArray()['data'][0]['embedding'];
}
Another good embedding model: snowflake-arctic-embed2
This is how LLMs calculate what you 'mean'
$embedding = new Embedding();
$normalizeMe = $argv[1];
$dictionary = $embedding->generateDictionary([
    'rock', 'paper', 'scissors', 'lizard', 'spock',
]);
$distances = $embedding->calculateDistances(
    $embedding->calculateEmbedding($normalizeMe),
    $dictionary
); // already sorted, closest first
print_r($distances);
printf('the Input "%s" is normalized to "%s"' . "\n\n\n",
    $normalizeMe, array_keys($distances)[0]);
This way we can normalize arbitrary data and text to our domain.
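The Embedding class itself is not shown on the slides; a minimal sketch of what it could look like, reusing getEmbedding() and distance() from the previous slides (only the method names appear above, the bodies here are assumptions):

class Embedding
{
    // embed every dictionary word once, keyed by the word
    public function generateDictionary(array $words): array
    {
        $dictionary = [];
        foreach ($words as $word) {
            $dictionary[$word] = $this->calculateEmbedding($word);
        }
        return $dictionary;
    }

    public function calculateEmbedding(string $text): array
    {
        return getEmbedding($text); // see the Ollama slide
    }

    // cosine distance to every dictionary entry, closest first
    public function calculateDistances(array $embedding, array $dictionary): array
    {
        $distances = [];
        foreach ($dictionary as $word => $wordEmbedding) {
            $distances[$word] = distance($embedding, $wordEmbedding);
        }
        asort($distances); // "already sorted", as used above
        return $distances;
    }
}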
docker run --name my-redis-container \
  -p 6378:6379 -v `pwd`/dockerRedisdata:/data \
  -d redis/redis-stack-server:latest
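A quick way to verify the container is reachable from PHP (a sketch, assuming predis/predis is installed via Composer):

$client = new Predis\Client(['port' => 6378]); // host port from the docker run above
echo $client->ping(), "\n"; // PONG when Redis Stack is up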
"Index" is here a synonym for a table and an index in an SQL database
{
uid: 123,
pid: 111,
table: 'tt_content',
text: 'lorem ipsum vitae dolor sit amet...',
slug: 'https://my.domain.de/the/path/to/the/page',
embedding: [0.45633,-0.567476,0.126775,...]
}
FT.CREATE idx:MYINDEX ON JSON PREFIX 1 MYINDEX: SCORE 1.0
SCHEMA
  $.uid AS uid NUMERIC
  $.pid AS pid NUMERIC
  $.table AS table TEXT WEIGHT 1.0 NOSTEM
  $.text AS text TEXT WEIGHT 1.0 NOSTEM
  $.slug AS slug TEXT WEIGHT 1.0 NOSTEM
  $.embedding AS vector VECTOR FLAT 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE
DIM must match the embedding model: 1536 for OpenAI text-embedding-3-small, 768 for nomic-embed-text
use Predis\Command\Argument\Search\CreateArguments;
use Predis\Command\Argument\Search\SchemaFields\NumericField;
use Predis\Command\Argument\Search\SchemaFields\TextField;
use Predis\Command\Argument\Search\SchemaFields\VectorField;

$client = new Predis\Client(['port' => 6378]);
$fields = [
    new NumericField('$.uid', 'uid'),
    new NumericField('$.pid', 'pid'),
    new TextField('$.table', 'table'),
    new TextField('$.text', 'text'),
    new TextField('$.slug', 'slug'),
    new VectorField('$.embedding', 'FLAT', [
        'TYPE', 'FLOAT32',
        'DIM', '1536',
        'DISTANCE_METRIC', 'COSINE',
    ], 'vector'),
];
$arguments = new CreateArguments();
$arguments->on('JSON')->prefix(['MYINDEX:'])->score(1.0);
$status = $client->ftcreate(
    'idx:MYINDEX', $fields, $arguments
);
var_dump($status); // "OK" on success
// $row comes from the TYPO3 database,
// the TYPO3 DataHandler or similar;
// $table is for example tt_content
$text = sprintf("%s\n%s",
    $row['header'],
    strip_tags($row['bodytext'])
);
$key = 'MYINDEX:' . $table . ':' . $row['uid'];
$set = [
    'uid' => $row['uid'],
    'pid' => $row['pid'],
    'slug' => $base . $row['slug'],
    'table' => $table,
    'text' => $text,
    'embedding' => getEmbedding($text),
];
$client = new Predis\Client(['port' => 6378]);
$client->jsonset($key, '$', json_encode($set));
use Predis\Command\Argument\Search\SearchArguments;

$searchEmbedding = getEmbedding($searchTerm);
$client = new Predis\Client(['port' => 6378]);
$searchArguments = new SearchArguments();
$searchArguments
    ->dialect(2) // DIALECT 2 is required for vector queries
    ->sortBy('__vector_score') // closest neighbour first, by definition
    ->addReturn(6, '__vector_score', 'uid', 'pid',
        'text', 'table', 'slug')
    ->limit(0, 10)
    ->params([
        'query_vector',
        pack('f1536', ...$searchEmbedding), // 1536 floats as binary FLOAT32
    ]);
$rawResult = $client->ftsearch(
    'idx:MYINDEX',
    '(*)=>[KNN 100 @vector $query_vector]',
    $searchArguments
);
[$count, $results] = normalizeRedisResult($rawResult);
normalizing the Redis result
The raw result:
Array (
[0] => 3 // number of result sets
[1] => MYINDEX:tt_content:2467 // key of the first set
[2] => Array (
[0] => __vector_score
[1] => 0.18477755785
[2] => uid
[3] => 2467
[4] => text
[5] => Contact
[6] => table
[7] => tt_content
[8] => pid
[9] => 265
[10] => slug
[11] => https://www.tcworld.info/contact
)
[3] => ... // next key
[4] => Array ( ... ) // next result set
[5] => ... // next key
[6] => Array ( ... ) // next result set
)
The normalization function:
function normalizeRedisResult(array $result): array
{
    // the first element is the number of result sets
    $count = array_shift($result);
    $return = [];
    foreach ($result as $value) {
        if (!is_array($value)) {
            // scalar entries are the Redis keys
            $key = $value;
        } else {
            // array entries are flat field/value pairs
            $set = [];
            for ($i = 0, $l = count($value); $i < $l; $i += 2) {
                $set[$value[$i]] = $value[$i + 1];
            }
            $return[$key] = $set;
        }
    }
    return [$count, $return];
}
The normalized result:
Array
(
[MYINDEX:tt_content:2509] => Array
(
[__vector_score] => 0.194359481335
[uid] => 2509
[text] => lorem ipsum vitae...
[table] => tt_content
[pid] => 273
[slug] => https://www.tcworld.info/faq
)
[MYINDEX:tt_content:2528] => Array ()
[MYINDEX:tt_content:2567] => Array ()
)
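From here, rendering the hits is a plain loop (the markup and score formatting are illustrative only):

foreach ($results as $key => $hit) {
    printf(
        '<article><a href="%s">%s</a> <small>score: %.4f</small></article>' . "\n",
        htmlspecialchars($hit['slug']),
        htmlspecialchars(substr($hit['text'], 0, 80)),
        (float)$hit['__vector_score']
    );
}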
CSS: https://picocss.com/
JavaScript: VanillaJS
Embeddings: OpenAI text-embedding-3-small
smartbrew.ai
Twitter: @FoppelFB | Mastodon: @foppel@phpc.social
fberger@sudhaus7.de | https://sudhaus7.de/