Computer Vision service
WHY Cog Services: You need to decide whether to build or buy a solution. Building a sophisticated image processing and analysis engine is costly. One alternative is to use the Computer Vision API from Microsoft.
We will do the following:
- Create a Cognitive Services account.
- Get information about the visual content found in an image.
- Generate a thumbnail of an image.
- Detect and extract printed text from an image.
- Detect and extract handwritten text from an image.
Calling the API
- get an API key
- make a POST call to the API
- parse the JSON response
Format of the URL
https://<region>.api.cognitive.microsoft.com/vision/v2.0/<resource>?[parameters]
- region - the region where you created the account, for example, westus.
- resource - the Computer Vision resource you are calling, such as analyze, describe, generateThumbnail, ocr, models, recognizeText, or tag.
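Putting those pieces together, a call ends up looking roughly like the sketch below. It assumes an account in westus, a $key variable holding a subscription key (we create both later in this article), and a hypothetical image URL:
# Sketch of the general call pattern: POST an image URL, get JSON back.
curl "https://westus.api.cognitive.microsoft.com/vision/v2.0/analyze?visualFeatures=Description" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://example.com/some-image.jpg'}" # hypothetical image URL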
Azure CLI
You can manage Cognitive Services through either the Azure portal UI or the Azure CLI
az cognitiveservices <command> # e.g. `list`
Commands:
- list, List available Azure Cognitive Services accounts.
- account show, Get the details of an Azure Cognitive Services account.
- account create, Create an Azure Cognitive Services account.
- account delete, Delete an Azure Cognitive Services account.
- keys list, List the keys of an Azure Cognitive Services account.
Create a Cognitive Service account
We need a Cognitive Services account for the Computer Vision API. The command for creating the account is as follows:
az cognitiveservices account create \
--kind ComputerVision \
--name ComputerVisionService \
--sku S1 \
--resource-group [resourceGroup] \
--location [location]
Ok, so first we need to create a resource group and decide on a location.
Create a resource group
We can simply do that by typing:
az group create \
--name resourceforcogservices \
--location westeurope
Create the cog services account
az cognitiveservices account create \
--kind ComputerVision \
--name ComputerVisionService \
--sku S1 \
--resource-group resourceforcogservices \
--location westeurope
That should have created the cog services account.
Verify account
We can easily verify that the account was created by typing:
az cognitiveservices account list \
--resource-group resourceforcogservices # returns an array of accounts
# OR
az cognitiveservices account show \
--name ComputerVisionService \
--resource-group resourceforcogservices # returns a specific account by name
Get the keys
az cognitiveservices account keys list \
--name ComputerVisionService \
--resource-group resourceforcogservices
This returns a list of keys that we can use for our REST API calls.
We can save the first key from the response to a variable that we can refer to later, like so:
key=$(az cognitiveservices account keys list \
--name ComputerVisionService \
--resource-group resourceforcogservices \
--query key1 -o tsv)
As you can see above, we pass a query asking specifically for key1.
We can see our command worked by typing:
echo $key # should print the `key1` value
Talking to the API
The time has come to talk to the API. We will attempt to analyze an image. Let's have a look at what an API URL might look like:
https://<region>.api.cognitive.microsoft.com/vision/v2.0/analyze?visualFeatures=<...>&details=<...>&language=<...>
Above we are using the `analyze` method. There are also some parameters in there that we need to set, such as `visualFeatures`, `details`, and `language`:
- details, this can have the value `Landmarks` or `Celebrities`
- visualFeatures, this is about what kind of information you want back. The `Categories` option will categorize the content of the image, like trees, buildings, and more; `Faces` will identify people's faces and give you their gender and age
There are some requirements on the images that you intend to process, namely:
- Supported image formats: JPEG, PNG, GIF, BMP.
- Image file size must be less than 4 MB.
- Image dimensions must be at least 50 x 50 pixels.
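Since we pass images by URL, a quick pre-flight check against the 4 MB limit might look like the sketch below; it assumes the image server reports a Content-Length header, and the URL is a hypothetical placeholder:
# Sketch: verify a remote image is under the 4 MB limit before sending it.
url='https://example.com/some-image.jpg' # hypothetical image URL
bytes=$(curl -sI "$url" | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}')
if [ -n "$bytes" ] && [ "$bytes" -lt 4194304 ]; then
echo "OK to send: $bytes bytes"
else
echo "Too large or size unknown"
fi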
Identifying landmarks
We need to set `visualFeatures=Categories,Description` and `details=Landmarks`. Then we need to set our key as the value of the header parameter `Ocp-Apim-Subscription-Key`.
On some systems the `jq` library is not installed. We need it before we run our curl commands below; install it first (for example, brew install jq on macOS).
curl "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/analyze?visualFeatures=Categories,Description&details=Landmarks" \ -H "Ocp-Apim-Subscription-Key: $key" \ -H "Content-Type: application/json" \ -d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/mountains.jpg'}" \ | jq '.'
curl "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/analyze?visualFeatures=Categories,Description&details=Landmarks" \ -H "Ocp-Apim-Subscription-Key: $key" \ -H "Content-Type: application/json" \ -d "{'url' : 'https://static1.squarespace.com/static/51b3dc8ee4b051b96ceb10de/t/5a3d8165e2c4836cd19a9f7b/1513980264718/?format=2500w'}" \ | jq '.'
You should get back an answer looking like this:
{
"categories": [
{
"name": "outdoor_mountain",
"score": 0.99609375,
"detail": {
"landmarks": []
}
}
],
"description": {
"tags": [
"snow",
"outdoor",
"mountain",
"nature",
"covered",
"skiing",
"man",
"flying",
"standing",
"wearing",
"side",
"air",
"slope",
"jumping",
"plane",
"red",
"hill",
"riding",
"people",
"group",
"yellow",
"board",
"doing",
"airplane"
],
"captions": [
{
"text": "a snow covered mountain",
"confidence": 0.956279380622841
}
]
},
"requestId": "3c99e0c5-0ab9-4baf-9b7d-b804a4e377f9",
"metadata": {
"width": 600,
"height": 462,
"format": "Jpeg"
}
}
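Since the output is plain JSON, jq can also filter it down to just the parts we care about. For example, this variation of the call above prints only the generated caption and its confidence (a small sketch; the jq filter is the only new piece):
# Print only the caption text and confidence from the analyze response.
curl -s "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/analyze?visualFeatures=Categories,Description&details=Landmarks" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/mountains.jpg'}" \
| jq -r '.description.captions[0] | "\(.text) (confidence: \(.confidence))"'
# Prints something like: a snow covered mountain (confidence: 0.956...)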
Checking for inappropriate content
We can also call the API to determine whether our image contains adult or otherwise inappropriate content. To accomplish that, we only need to change the parameters.
Let's change `visualFeatures` to instead say `Adult,Description`. Our call will now look like this:
curl "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/analyze?visualFeatures=Adult,Description" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/people.png'}" \
| jq '.'
The response coming back looks like this:
{
"adult": {
"isAdultContent": false,
"isRacyContent": false,
"adultScore": 0.0013711383799090981,
"racyScore": 0.0046537225134670734
},
"description": {
"tags": [
"outdoor",
"person",
"posing",
"photo",
"grass",
"group",
"standing",
"man",
"people",
"woman",
"young",
"holding",
"dress",
"white"
],
"captions": [
{
"text": "a group of people posing for a photo",
"confidence": 0.9906094762899921
}
]
},
"requestId": "",
"metadata": {
"width": 600,
"height": 462,
"format": "Png"
}
}
Let's highlight the `adult` property section specifically:
"adult": {
"isAdultContent": false,
"isRacyContent": false,
"adultScore": 0.0013711383799090981,
"racyScore": 0.0046537225134670734
}
We see the following parameters of interest:
- `isAdultContent` has the value `false`.
- `isRacyContent` is also `false`.
- `adultScore` is a very low value, close to 0. It would need to be closer to `1.0` to imply it's adult content.
- `racyScore`, same thing here, a value close to `0`.
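If you wanted to act on those scores programmatically, a small gate could compare them against a threshold. Below is a minimal sketch; the 0.5 cutoff is my own assumption for illustration, not an official recommendation:
# Sketch: flag an image when the adult or racy score crosses a threshold.
# The 0.5 cutoff is an arbitrary illustrative choice.
response=$(curl -s "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/analyze?visualFeatures=Adult" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/people.png'}")
if [ "$(echo "$response" | jq '.adult.adultScore > 0.5 or .adult.racyScore > 0.5')" = "true" ]; then
echo "flagged as adult/racy"
else
echo "image looks fine"
fi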
Generate thumbnail
Ok, so there is a thumbnail-generating service. Why do we want that? Well, let's look at the sales pitch for this service:
Computer Vision first generates a high-quality thumbnail and then analyzes the objects within the image to determine the region of interest (ROI). Computer Vision then crops the image to fit the requirements of the region of interest. The generated thumbnail can be presented using an aspect ratio that is different from the aspect ratio of the original image, depending on your needs.
Ok, so it seems to do a better job than a human at finding the interesting part of an image, neat. Let's try it out.
Ok, let's have a look at the URL we need to call to create our thumbnails:
https://<region>.api.cognitive.microsoft.com/vision/v2.0/generateThumbnail?width=<...>&height=<...>&smartCropping=<...>
The first thing we see here is that we changed the name of the service we are going to use to `generateThumbnail`. Of course, that means the parameters change too.
Parameters:
- width, required
- height, required
- smartCropping, here is an example of why you would want this enabled: with smart cropping enabled, a cropped profile picture keeps someone's face within the picture frame even when the picture has a different aspect ratio
So an actual call to this service looks like this:
curl "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/generateThumbnail?width=100&height=100&smartCropping=true" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/dog.png'}" \
-o thumbnail.jpg
This will create an image with the dimensions 100x100 and the name thumbnail.jpg.
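If you need more than one size, the same call can be run in a loop. A minimal sketch, reusing the dog image from above:
# Generate square thumbnails in several sizes from the same source image.
for size in 100 200 400; do
curl -s "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/generateThumbnail?width=${size}&height=${size}&smartCropping=true" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/dog.png'}" \
-o "thumbnail-${size}.jpg"
done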
Extract printed text from an image
The Computer Vision API contains some neat OCR algorithms that will be able to recognize text in our images.
The URL for calling this API looks like this:
https://<region>.api.cognitive.microsoft.com/vision/v2.0/ocr?language=<...>&detectOrientation=<...>
The service used is called `ocr`.
Parameters:
- language: The language code of the text to be detected in the image. The default value is unk, or unknown, which lets the service auto-detect the language of the text in the image.
- detectOrientation: When true, the service tries to detect the image orientation and correct it before further processing, for example, whether the image is upside-down.
To make the actual call, we need to perform the following POST request:
curl "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/ocr" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/ebook.png'}" \
| jq '.'
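Note that we left both query parameters off in the call above. The same request with the parameters made explicit, this time asking the service to auto-correct orientation, would look like this:
curl "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/ocr?language=unk&detectOrientation=true" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/ebook.png'}" \
| jq '.'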
The answer back looks like this:
{
"language": "en",
"textAngle": 0,
"orientation": "Up",
"regions": [
{
"boundingBox": "57,18,348,509",
"lines": [
{
"boundingBox": "335,18,70,14",
"words": [
{
"boundingBox": "335,18,70,14",
"text": "Microsoft"
}
]
},
{
"boundingBox": "67,102,286,27",
"words": [
{
"boundingBox": "67,102,60,27",
"text": "NET"
},
{
"boundingBox": "140,102,213,27",
"text": "Microservices:"
}
]
},
{
"boundingBox": "57,139,239,29",
"words": [
{
"boundingBox": "57,139,185,29",
"text": "Architecture"
},
{
"boundingBox": "253,139,43,29",
"text": "for"
}
]
},
{
"boundingBox": "58,178,291,28",
"words": [
{
"boundingBox": "58,178,209,28",
"text": "Containerized"
},
{
"boundingBox": "280,179,69,27",
"text": ".NET"
}
]
},
{
"boundingBox": "57,216,188,37",
"words": [
{
"boundingBox": "57,216,188,37",
"text": "Applications"
}
]
},
{
"boundingBox": "277,465,93,10",
"words": [
{
"boundingBox": "277,466,30,9",
"text": "Cesar"
},
{
"boundingBox": "310,466,14,9",
"text": "de"
},
{
"boundingBox": "328,465,9,10",
"text": "la"
},
{
"boundingBox": "341,466,29,9",
"text": "Torre"
}
]
},
{
"boundingBox": "277,480,65,13",
"words": [
{
"boundingBox": "277,480,17,10",
"text": "Bill"
},
{
"boundingBox": "298,480,44,13",
"text": "Wagner"
}
]
},
{
"boundingBox": "277,495,71,9",
"words": [
{
"boundingBox": "277,495,10,9",
"text": "M"
},
{
"boundingBox": "289,495,16,9",
"text": "ike"
},
{
"boundingBox": "308,495,40,9",
"text": "Rousos"
}
]
},
{
"boundingBox": "277,515,110,12",
"words": [
{
"boundingBox": "277,515,46,10",
"text": "Microsoft"
},
{
"boundingBox": "327,515,60,12",
"text": "Corporation"
}
]
}
]
}
]
}
The interesting part is the `regions` key, which is an array of found text occurrences. Let's have a look at one such item:
{
"language": "en",
"textAngle": 0,
"orientation": "Up",
"regions": [
{
"boundingBox": "57,18,348,509",
"lines": [
{
"boundingBox": "335,18,70,14",
"words": [
{
"boundingBox": "335,18,70,14",
"text": "Microsoft"
}
]
}
]
}
]
}
We see above that the response gives us the `orientation` of the picture and also `regions`, which are all the different text sections in the image. Looking deeper into the response, we have `lines`, an array with one entry per detected line, where each entry looks like this:
{
"boundingBox": "335,18,70,14",
"words": [
{
"boundingBox": "335,18,70,14",
"text": "Microsoft"
}
]
}
Above we see `words`, which contains one entry per found word at a specific position.
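To turn the whole OCR response back into readable text, we can let jq join the words line by line. A small sketch using the same call as above:
# Reassemble the OCR output into plain text, one detected line per output line.
curl -s "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/ocr" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/ebook.png'}" \
| jq -r '.regions[].lines[] | [.words[].text] | join(" ")'
# Prints lines such as: Microsoft / NET Microservices: / Architecture for ...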
Extract handwritten text
This is great to use when you have a lot of handwritten notes and want to know what they say; the service is remarkably good at recognizing handwritten text.
Let's have a look at what the URL looks like to call this service:
https://<region>.api.cognitive.microsoft.com/vision/v2.0/recognizeText?mode=<...>
Above we are using the method `recognizeText`, and we use the following parameter with it:
- `mode`, this has to have the value `Handwritten` or `Printed`
Calling this service therefore looks like this:
curl "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/recognizeText?mode=Handwritten" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/handwriting.jpg'}" \
-D -
This gives us back a response that does not contain the actual answer but rather tells us where to go once the service has finished the operation. That URL can be found in the `Operation-Location` response header. In this specific case, I got back a URL with the value:
To get the result we will need to perform the following call:
curl -H "Ocp-Apim-Subscription-Key: $key" "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/textOperations/1769e0e4-a93b-4b78-93f1-00b3bb27e1b4" | jq '.'
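If you would rather not copy the URL and poll by hand, the whole two-step flow can be scripted. A sketch, assuming bash with curl and jq available; the two-second polling interval is an arbitrary choice:
# Submit the image, capture the Operation-Location header, then poll until done.
op_url=$(curl -s -D - -o /dev/null "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/recognizeText?mode=Handwritten" \
-H "Ocp-Apim-Subscription-Key: $key" \
-H "Content-Type: application/json" \
-d "{'url' : 'https://raw.githubusercontent.com/MicrosoftDocs/mslearn-process-images-with-the-computer-vision-service/master/images/handwriting.jpg'}" \
| tr -d '\r' | awk 'tolower($1) == "operation-location:" {print $2}')
status=""
until [ "$status" = "Succeeded" ] || [ "$status" = "Failed" ]; do
sleep 2 # arbitrary polling interval
status=$(curl -s -H "Ocp-Apim-Subscription-Key: $key" "$op_url" | jq -r '.status')
done
curl -s -H "Ocp-Apim-Subscription-Key: $key" "$op_url" | jq '.'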
The response will now look like this:
{
"status": "Succeeded",
"recognitionResult": {
"lines": [
{
"boundingBox": [
126,
127,
388,
133,
387,
168,
125,
162
],
"text": "It does not matte",
"words": [
{
"boundingBox": [
110,
128,
163,
128,
164,
164,
110,
164
],
"text": "It"
},
{
"boundingBox": [
167,
128,
239,
128,
240,
164,
167,
164
],
"text": "does"
},
{
"boundingBox": [
228,
128,
300,
127,
300,
163,
228,
164
],
"text": "not"
},
{
"boundingBox": [
296,
127,
395,
127,
395,
163,
297,
163
],
"text": "matte"
}
]
},
{
"boundingBox": [
153,
180,
434,
194,
433,
230,
151,
216
],
"text": "how slowly you go",
"words": [
{
"boundingBox": [
141,
183,
200,
184,
195,
223,
136,
222
],
"text": "how"
},
{
"boundingBox": [
222,
185,
314,
187,
309,
226,
217,
224
],
"text": "slowly"
},
{
"boundingBox": [
326,
187,
382,
189,
376,
228,
321,
226
],
"text": "you"
},
{
"boundingBox": [
398,
189,
440,
190,
435,
229,
393,
228
],
"text": "go"
}
]
},
{
"boundingBox": [
126,
237,
313,
243,
312,
280,
125,
274
],
"text": "so as long as",
"words": [
{
"boundingBox": [
114,
238,
162,
239,
160,
277,
112,
276
],
"text": "so"
},
{
"boundingBox": [
162,
239,
210,
239,
208,
277,
160,
277
],
"text": "as"
},
{
"boundingBox": [
210,
239,
276,
240,
274,
278,
208,
277
],
"text": "long"
},
{
"boundingBox": [
276,
240,
321,
241,
319,
279,
274,
278
],
"text": "as"
}
]
},
{
"boundingBox": [
192,
281,
431,
293,
429,
337,
190,
325
],
"text": "you do not stop",
"words": [
{
"boundingBox": [
194,
279,
258,
283,
246,
325,
182,
321
],
"text": "you"
},
{
"boundingBox": [
271,
284,
315,
287,
303,
329,
259,
326
],
"text": "do"
},
{
"boundingBox": [
315,
287,
370,
290,
358,
333,
303,
329
],
"text": "not"
},
{
"boundingBox": [
366,
290,
440,
295,
428,
337,
354,
332
],
"text": "stop"
}
]
},
{
"boundingBox": [
156,
367,
318,
346,
325,
399,
163,
419
],
"text": "fighting",
"words": [
{
"boundingBox": [
168,
366,
323,
339,
328,
396,
174,
423
],
"text": "fighting"
}
]
}
]
}
}
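As with the OCR response, jq can strip this down to just the recognized text (a one-liner against the operation result above):
# Print only the recognized lines of handwriting.
curl -s -H "Ocp-Apim-Subscription-Key: $key" \
"https://westeurope.api.cognitive.microsoft.com/vision/v2.0/textOperations/1769e0e4-a93b-4b78-93f1-00b3bb27e1b4" \
| jq -r '.recognitionResult.lines[].text'
# Prints: It does not matte / how slowly you go / so as long as / you do not stop / fighting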