A fun site showing how floating-point bit patterns work. Here is a screenshot of an example from Wikipedia.
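For readers who want to poke at the bit patterns directly, here is a minimal Python sketch that splits a number into the sign, exponent, and fraction fields of an IEEE 754 single-precision float. The helper name `float32_bits` and the sample value 0.15625 are chosen here for illustration; they are not taken from the site.

```python
import struct

def float32_bits(x):
    """Return (sign, exponent, fraction) bit strings of x as an IEEE 754 single."""
    # Pack as a big-endian 32-bit float, then reinterpret the same 4 bytes
    # as an unsigned integer so the raw bits can be formatted.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    # 1 sign bit, 8 exponent bits, 23 fraction bits.
    return bits[0], bits[1:9], bits[9:]

sign, exponent, fraction = float32_bits(0.15625)
print(sign, exponent, fraction)
# prints: 0 01111100 01000000000000000000000
```

0.15625 is 1.01 in binary times 2^-3, so the stored exponent is 127 - 3 = 124 and the fraction field starts with `01`.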
A blog about software and making.
Another great workshop with the people at Boltmade. This time around, it was a series of exercises to refactor the deployment of a cloud app using Ansible.
At its most basic, Ansible is a way of replacing your app’s initialization shell scripts or manual deployment steps.
Deployments will become complex over time. Shell scripts hit a complexity wall where you have to give up and start over with a config management system. Ansible is simple to start with and can evolve as complexity grows over time. Extremely useful if you are deploying unusual combinations of software.
In their usual client-server setups, Puppet and Chef rely on a central deploy server, which is a single point of failure. Ansible is agentless: it just connects over SSH and starts doing stuff, so it can also provision a local server.
Deployments are not static. Ansible allows you to tie your deployment requirements to your source, making deployment repeatable and easy. By looking at your source history, you can see how your server requirements have changed over time. You can look at old versions of your application and see what the infrastructure was like back then. It also captures any workarounds required to deploy combinations of software and infrastructure versions.
Ansible is a structured way of accomplishing a deployment. It gives you a common language for how things are configured and deployed at your company. The release engineers who follow you can understand your deployment process (the playbook) because it has a standard style. This removes the need for separate documents describing how things should be set up.
It’s possible to configure Ansible to do idempotent deployments, so you can deploy again and again and get the same results. Ansible can be made smart enough to skip steps that aren’t needed. You won’t need to comment out sections (like you have to with bash scripts) while doing iterative development.
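To make the idea of an idempotent step concrete, here is a small Python sketch. It is not Ansible code (playbooks are written in YAML); it only illustrates the check-before-change pattern, and `ensure_line` is a hypothetical helper invented for this example.

```python
import tempfile
from pathlib import Path

def ensure_line(path, line):
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already correct.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already holds, so the step is skipped
    path.write_text("\n".join(existing + [line]) + "\n")
    return True

with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "app.conf"
    print(ensure_line(config, "port=8080"))  # True: the first run changes the file
    print(ensure_line(config, "port=8080"))  # False: the second run is a no-op
```

Running the step twice leaves the file in the same state as running it once, which is why nothing needs to be commented out between iterations.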
A potential downside of Ansible’s simplicity is that you can forget about temporary fixes made while getting things working, and they come back to haunt you later (e.g. hardcoded or default passwords).
Random Notes:
Google Compute Engine (GCE) was down globally for 18 minutes on the 11th. The usual culprit, a configuration error, was present, but a series of bugs also combined to propagate the bad configuration into production.
It’s fascinating how small things combine to wreak havoc on large-scale systems. Google is constantly testing, so I imagine these bugs had been quietly waiting in edge-case land until random chance brought them together. With enough scale and complexity, the improbable becomes probable.
I tried to use MinHash to group similar product listings, but it didn’t work out. I was trying to generate a list of aliases for entities by finding similar documents that have different entity names as attributes. Assuming that an entity has multiple similar documents listed under its different aliases, I should be able to generate a set of aliases by grouping similar documents together.
The data I’m working with has multiple versions of the same manufacturer.
Manufacturer | Listing Text |
---|---|
Fujifilm Canada | Fujifilm FinePix JV100 12 MP Digital Camera with 3x Optical Zoom and 2.7-Inch LCD (Black) |
FUJIFILM | Fujifilm FinePix XP10 12 MP Waterproof Digital Camera with 5x Optical Zoom and 2.7-Inch LCD (Black) |
Fujifilm Imaging Systems | Fujifilm Finepix Z700EXR Digitalkamera (12 Megapixel, 5-fach opt.Zoom, 8,9 cm Display, Bildstabilisator) silber |
FUJIFILM Electronic Imaging Europe GmbH - Firstorder | Fujifilm FINEPIX Z90 Digitalkamera (14 Megapixel, 5-fach opt. Zoom, 7,6 cm (3 Zoll) Display) silber |
Fuji Photo Film Europe GmbH | Fujifilm FINEPIX JX280 Digitalkamera (14 Megapixel, 5-fach opt. Zoom, 6,9 cm (2,7 Zoll) Display) schwarz |
While these are all FinePix listings, the manufacturer name is different for each one. This prevents me from blocking on the manufacturer’s name as a first stage when matching products to listings.
I tried using the MinHash technique to find approximately similar documents, but I don’t think there’s enough text per listing to get a good join. Using only unique tokens to generate the min hashes helped, but I’m still getting too many false negatives on the Fuji listings for this to be useful.
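For reference, a minimal MinHash sketch over unique unigram tokens might look like the following. This is not the code from my experiment; the function names, the choice of 64 hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib

def tokens(text):
    """Unique lowercase whitespace-separated tokens (unigrams) of a listing."""
    return set(text.lower().split())

def minhash_signature(token_set, num_hashes=64):
    """One minimum hash value per seeded hash function. Assumes a non-empty set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt stands in for a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in token_set
        ))
    return signature

def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = tokens("Fujifilm FinePix JV100 12 MP Digital Camera with 3x Optical Zoom and 2.7-Inch LCD (Black)")
b = tokens("Fujifilm FinePix XP10 12 MP Waterproof Digital Camera with 5x Optical Zoom and 2.7-Inch LCD (Black)")
print(estimate_similarity(minhash_signature(a), minhash_signature(b)))
```

With only a dozen or so tokens per listing, a single differing token moves the true Jaccard similarity a lot, which fits the short-text problem described above.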
Next, I’m going to generate n-grams from the model numbers in the products file and use a plain old Jaccard similarity coefficient instead. Hopefully, that will let me group listings for the same model together.
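One possible reading of that plan, sketched in Python: build character trigrams from each model number and compare the sets exactly. The normalization (lowercasing and dropping non-alphanumeric characters), the trigram size, and the function names are assumptions for this sketch, not a finished design.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams after lowercasing and dropping non-alphanumerics."""
    normalized = "".join(ch for ch in text.lower() if ch.isalnum())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def jaccard(set_a, set_b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| (0.0 if both sets are empty)."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

print(jaccard(char_ngrams("JV100"), char_ngrams("JV-100")))   # 1.0
print(jaccard(char_ngrams("Z700EXR"), char_ngrams("Z700")))   # 0.4
print(jaccard(char_ngrams("JV100"), char_ngrams("XP10")))     # 0.0
```

Because the sets are tiny, computing the exact coefficient is cheap and there is no need for the MinHash approximation at this stage.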
I’ll probably come back to this code and try it on another problem with longer texts. The examples of MinHash I’ve seen work on longer documents, where they use shingles (n-grams where n > 1) instead of just unigrams.
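As a rough illustration of what shingling means here, the Python sketch below turns a text into overlapping word n-grams. The function name `word_shingles` and the default of two words per shingle are assumptions for the example.

```python
def word_shingles(text, n=2):
    """Set of overlapping word n-grams (shingles) from a text."""
    words = text.lower().split()
    # A text with fewer than n words yields an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

print(sorted(word_shingles("Fujifilm FinePix JV100 12 MP")))
```

Each shingle carries word order, so two long documents only share shingles when they share phrases. On a short listing there are so few shingles that this mostly makes the sparsity problem worse.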