Project: Robotics Class - Pick and Place Task with a Robotic Arm

This project was a final assignment for the Robotics class (code: B3B33ROB). For the assignment, we could choose from three task options, each solving a different problem with a given robotic arm.

The first option was to control a robotic arm so that it folded a sweater lying underneath it in a specific position. Folding was defined as grabbing both sleeves and folding them across each other. Students working on this option didn’t have to implement any recognition, as the robot had no cameras; the tricky part was that the teachers grading the task held the folding protocol to high standards, so students often had to improve their implementation when they tried to submit their work. Robot: BOSCH turboscara SR 450.

The next option used a SCARA robot with a pen as its end effector. The robot had to draw the rectangle with the biggest possible area. The solution had to be derived theoretically, and although it may seem easy at first, the task was rather hard: one has to account for the robot’s movement limitations and the motion of multiple rotational joints. Robot: Mitsubishi Melfa RV-6S.

The third and last option was a pick-and-place task in which the robot had to detect and recognize cylinders and cubes of different colors and sizes, pair objects of the same color, and stack them on top of each other so that the smallest object ended up on top. Available robots: Mitsubishi RV6SDL, CRS 93, and CRS 97.

We (my classmate Alenka and I) chose the third option. It seemed the clearest one: the problems to be solved were transparent and straightforward, and there was less of a chance that starting to work on it would open Pandora’s box.

This article walks through my implementation of detection and recognition. The explanations are conceptual and do not go deep into the code.

The libraries used in my implementation were OpenCV, NumPy, pickle, math, matplotlib, and datetime.

Since the project description didn’t call for optimization or for minimizing data flow, I was a bit wasteful on purpose: simplicity, easy access to the data, and the speed of the overall development process were the top priorities.

The data used and exported by the functions were nested Python dictionaries with a very human-readable organization. The data stored in these dictionaries was like a rolling snowball: it grew as it passed through functions that made use of the existing entries and added their own.
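A minimal sketch of what such a rolling-snowball dictionary might look like (the key names here are illustrative, not the actual ones from our code):

```python
# Illustrative structure only; the real key names in our project differed.
objects = {}

# The detection stage adds basic blob statistics per object.
objects["red_0"] = {
    "color_index": 2,
    "mask_area": 5800,
    "bbox": (120, 85, 70, 68),  # x, y, width, height in pixels
}

# The recognition stage later enriches the same entry in place.
objects["red_0"].update({
    "shape": "cube",
    "size": "medium",
    "angle_deg": 12.5,
})

print(objects["red_0"])  # the entry "snowballs" as the stages run
```

Each stage only ever adds keys, so downstream functions can rely on everything the earlier stages computed.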

We were encouraged to use standard detection and recognition techniques and not get bogged down in machine learning, as the nature of the problem was rather simple. The cubes and cylinders had constant colors and constant sizes (three sizes each: small, medium, and large), and the room illumination was also roughly constant, with only a small variance throughout the day because of the windows. Additionally, it was recommended to make use of OpenCV and other libraries as much as possible rather than reinvent the wheel for, e.g., blob detection algorithms.

The detection part of the process involves finding our objects in an image captured by the camera above the workspace. There are different strategies for this problem: either you make use of the colors of the objects, or you focus on the color of the static background and extract the objects from the image by “removing” it. I would presume that the latter strategy is more versatile and more open to improvements, e.g. detecting a color we have never seen before; however, we were already a little too deep into the first strategy, and it worked sufficiently well for our purpose.

But how does our program know about the colors? Well, it is information that has to be loaded into the program externally.

But wait, how will we represent colors? We will use the HSV (hue, saturation, value) representation, which differs somewhat from RGB (red, green, blue). In essence, both represent the same colors, but the way the HSV color space is organized is more suitable for our color detection. To see why, we first have to understand the three components of each space.

I won’t be explaining these color representations here; other sources do it better. Here are the Wikipedia links for RGB and HSV. In short, HSV is better suited for the detection task because its three components describe intuitive attributes of a color, and it is much easier to compare colors under different light intensities: intensity is expressed as the “value” component, while the color itself is expressed as the “hue”.

Now that we understand the colors in our image let’s look at what we are working with.

We know what colors we are looking for, and since our implementation does not work with the background, we have to convey this knowledge to our program somehow. To do this, we implemented an automatic color-extractor script that takes a cropped image of a particular object (the image’s filename is an integer representing the index of that color) and writes the HSV pixel value from the image into a specific text file, “colors.txt”, used by the main program.

So right now we have the image, the data about the colors and we want to use that to find and distinguish the objects in the image.

Function: get_colored_objects()

Firstly, we create a binary mask (of the whole image) for each of the colors defined by the cropped images. To filter all of the pixels, we need some form of thresholding; in other words, we need to decide whether a certain pixel qualifies for a particular color, since very few pixels will exactly match the HSV value from the mentioned “colors.txt” file. These calibration constants have their own text file, “calibration_constants.txt”, with the following structure: each row corresponds to one color, with the color index at the beginning of the line. The values are: 1: color index, 2: hue symmetric threshold, 3: saturation symmetric threshold, 4: value symmetric threshold, 5: hue symmetric threshold, 6: minimal pixel count, 7: small cylinder left interval, 8: medium cylinder left interval, 9: big cylinder left interval, 10: small cube left interval, 11: medium cube left interval, 12: big cube left interval.

The variables past the 6th value will be explained in the recognition part of this article. The 6th, the minimal pixel count, creates an acceptance barrier for the algorithm: if an object detected in the binary mask has a smaller total pixel count (number of white pixels in the mask), it is ignored.

To finally extract the objects from the image, we use OpenCV’s cv2.connectedComponentsWithStats(), applied to each of the created color masks. This function takes a binary mask as input and returns the connected components: blobs of interconnected white pixels. Each blob gets an index in the returned arrays, along with its leftmost x position, topmost y position, width, height, area (pixel count), and centroid.

We then filter the blobs by the mentioned “minimal pixel count” constant from “calibration_constants.txt” and save all of the information, together with the binary masks, in the dictionary.

Function: add_cropped_images_to_dict()

So, right now we have a binary mask for each color (with the original image dimensions) and, for each blob, its color, dimensions, and position. Our next step is recognition, but before that we need to crop each blob from the mask and save it as a “mini”/cropped binary image. This functionality is managed in the function add_cropped_image_to_dict(), which takes the mentioned blob attributes and stores the cropped blob image in the dictionary alongside the blob’s other information.
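The crop itself is just NumPy slicing with the blob's bounding box. A minimal sketch (function name is illustrative, not the project's actual helper):

```python
import numpy as np

def crop_blob(mask, bbox):
    """Cut a blob out of the full-size binary mask using its bounding box.

    bbox is (x, y, width, height), as returned by connectedComponentsWithStats.
    """
    x, y, w, h = bbox
    return mask[y:y + h, x:x + w].copy()  # copy() detaches it from the big mask

# Example: a 6x5 white rectangle inside a 20x20 mask.
mask = np.zeros((20, 20), np.uint8)
mask[3:8, 4:10] = 255

mini = crop_blob(mask, (4, 3, 6, 5))
print(mini.shape)  # (5, 6): rows = height, columns = width
```

Note that NumPy indexes rows first, so the y coordinate comes before x in the slice even though the bounding box lists x first.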


Since we have all that we need, we continue with recognizing the sizes and shapes of the objects. We distinguish between cylinders and cubes, and since our camera always has a top-down view, we just need to recognize squares and circles.

There are certainly many possible strategies, most notably trying to find a ratio of circumference to area, but for our purposes we used a polygonal approximation of our binary mask. A square should be approximated by 4 points, while a circle should need at least 5 approximation points. So if a mask was approximated by 4 or fewer points, it was determined to be a square (cube); if by 5 or more, it was determined to be a circle (cylinder).

The approximation function, cv2.approxPolyDP, was imported from OpenCV. (Side note: the contour extraction often returns multiple candidate shapes; we take the one with the biggest area.)

In addition to distinguishing shapes, we also need to distinguish the 3 different sizes of each object type. As mentioned before, our “calibration_constants.txt” file also contains intervals for distinguishing object sizes based on pixel count. The pixel count in this case is computed from the output of the approximation function, to compensate for potential “holes” in the mask, which would otherwise bring the total pixel count down. The text file contains 3 left interval boundaries for the small, medium, and big sizes, for both cylinders and cubes, e.g. small circle: 0, medium circle: 5500, big circle: 13000. These values are loaded into the dictionary at the beginning.
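Classification against left interval boundaries is a simple lookup. A sketch using the example thresholds from the text:

```python
def classify_size(pixel_area, left_intervals):
    """Pick the size whose left boundary is the largest one not exceeding
    the measured pixel area."""
    size = None
    for name, left in sorted(left_intervals.items(), key=lambda kv: kv[1]):
        if pixel_area >= left:
            size = name
    return size

# Example boundaries from the article; the cube boundaries were separate.
cylinder_intervals = {"small": 0, "medium": 5500, "big": 13000}

print(classify_size(4200, cylinder_intervals))   # small
print(classify_size(9000, cylinder_intervals))   # medium
print(classify_size(20000, cylinder_intervals))  # big
```

Because only the left boundaries are stored, each interval implicitly ends where the next one begins, and the largest size is open-ended.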

The last thing that needs to be recognized is the rotation of a cube relative to the camera. The angle matters only for cubes, and cubes have 4 approximation points, so we “draw” a line between the 1st and 3rd points (the approximation points of the square are ordered sequentially), and the arctangent of this line’s slope gives us the angle we are looking for.
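A sketch of that computation. Since the 1st and 3rd points of an ordered square are opposite corners, the line between them is a diagonal, which sits 45 degrees off the edges; subtracting 45 and reducing modulo 90 (the square's symmetry) recovers an edge rotation. Our exact angle convention may have differed; this is only one plausible reading:

```python
import math

def cube_angle_deg(p1, p3):
    """Edge rotation of a square from its two diagonal corners.

    Assumption: p1 and p3 are opposite corners (1st and 3rd approximation
    points), so the connecting line is a diagonal at edge angle + 45 degrees.
    """
    dx = p3[0] - p1[0]
    dy = p3[1] - p1[1]
    diagonal = math.degrees(math.atan2(dy, dx))
    return (diagonal - 45.0) % 90.0  # square symmetry repeats every 90 degrees

# Axis-aligned unit square: diagonal (0,0)->(1,1) is at 45 degrees,
# so the edge rotation comes out as 0.
print(cube_angle_deg((0, 0), (1, 1)))
```

The modulo-90 reduction reflects the fact that a square gripped at angle θ looks identical at θ + 90, so the gripper never needs to rotate more than that.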

After comparing the pixel areas of the recognized blobs, correctly assigning them their sizes, and finding the rotation angles for the cubes, we store this data in the same dictionary we have been using up to this point. Once all of this is done, the dictionary is saved to a folder in pickle format. In the end, we didn’t import the data from the folder; instead, this Python 3 code was slightly edited to work with Python 2, which the robot uses.

Since the detection part is slow (we iterate over all the pixels in the image multiple times), we can downscale the original image, scale down our constants accordingly, and then upscale the resulting positions of the recognized objects before sending them to the robot more quickly.

Another great way to optimize this algorithm would be to use the background color to separate the objects into distinct entities first, and only then compare the pixels of each entity against the known colors. This would mean comparing fewer pixels and would also avoid duplicate detections: an object would become inactive once its color had been found, which is not the case at the moment, so duplicates may arise (and did, with the orange and red colors).

Another thing we did was apply the cv2.bilateralFilter function (similar to blurring, but edge-preserving) to remove potential pixel-level anomalies by mixing the colors/pixels so that the color of an object is consistent throughout its visible area.

This part was implemented by my classmate Alenka, and the implementation mostly consisted of 3 parts: mapping image pixels to the 3D coordinates the robot uses to move in space, creating the program that processes the recognized objects and stacks them according to color and size, and calibrating the gripping and the offsets for picking and placing the objects. I won’t go through these, but what may be interesting is how the mapping of pixels was done.

For the mapping, we need corresponding points in the image and in the 3D world. This data was collected by placing a cube in the robot’s gripper and moving it to some arbitrary point within the camera’s view, just slightly touching the table. Then you save the robot’s end-effector coordinates, move the robot away without moving the cube, and use the camera to get the cube’s position in the image. The more points you have, the more accuracy you will most likely achieve; we used 24 points, which was sufficient, though fewer would also have been enough, since the camera is in such an ideal position.

These points are then used to calculate a transformation matrix between the two coordinate systems (2D image coordinates to 2D robot coordinates; the z-axis is calibrated by hand, because the table is flat and sits at a single constant height).
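One simple way to fit such a transformation is a least-squares affine map, sketched below with synthetic correspondences (I don't recall the exact method Alenka used, so take this as one reasonable option rather than our actual code):

```python
import numpy as np

def fit_affine(img_pts, robot_pts):
    """Least-squares affine map from image pixels to robot XY coordinates.

    Solves [u v 1] @ M = [x y] for a 3x2 matrix M; with noise-free
    correspondences and >= 3 non-collinear points, the fit is exact.
    """
    img_pts = np.asarray(img_pts, float)
    robot_pts = np.asarray(robot_pts, float)
    A = np.hstack([img_pts, np.ones((len(img_pts), 1))])
    M, *_ = np.linalg.lstsq(A, robot_pts, rcond=None)
    return M

def map_point(M, uv):
    """Apply the fitted map to one image point."""
    return np.array([uv[0], uv[1], 1.0]) @ M

# Synthetic calibration: robot frame = image scaled by 0.5 and shifted.
img_pts = [(0, 0), (100, 0), (0, 100), (100, 100)]
robot_pts = [(10, 20), (60, 20), (10, 70), (60, 70)]

M = fit_affine(img_pts, robot_pts)
print(map_point(M, (50, 50)))  # maps to roughly (35, 45)
```

With an overhead camera looking straight down at a flat table, an affine map captures essentially all of the geometry; a full homography would only be needed for a tilted camera.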

Our implementation was not the best, not the fastest, and by far not the best optimized, but it was sufficient to handle the task at hand and submit our assignment.

There are a few ways our implementation could be improved or changed.

The first is probably the one already mentioned: it would be better to detect objects by first removing the background and then separating the entities in the image. This would eliminate duplicated objects (when multiple colors detect the same object) and would lead to faster detection, since we wouldn’t have to iterate over the whole image for every new color. Additionally, this change would let us get rid of the calibrated colors entirely: once we know where the objects are, we can derive each color from the separated objects themselves, e.g. by averaging their pixels.

The second improvement would be to use C++ for the computer vision tasks and utilize its speed. The port wouldn’t be that complicated, as the OpenCV library we used is available in C++ as well as in the Python we used for development.

The third and last improvement would be to use the ratio of circumference to area as the distinguishing attribute between a circle and a square. This ratio, however, would be calculated from the points of the polygonal approximation we used to recognize objects. There were occasional errors where a square was recognized as a circle because it was approximated by 5 or more points, and this solution would have fixed that.
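The standard form of this descriptor is the circularity 4πA/P², which is scale-invariant: exactly 1 for a circle and π/4 (about 0.785) for a square, leaving a comfortable gap for a threshold. A quick check with ideal shapes:

```python
import math

def circularity(perimeter, area):
    """4*pi*A / P^2: 1.0 for a circle, pi/4 (~0.785) for a square."""
    return 4.0 * math.pi * area / (perimeter ** 2)

# Unit square: perimeter 4, area 1. Unit circle: perimeter 2*pi, area pi.
square_c = circularity(perimeter=4.0, area=1.0)
circle_c = circularity(perimeter=2.0 * math.pi, area=math.pi)

# A threshold of, say, 0.9 (an assumed value) would separate the two
# regardless of how many approximation points the contour happened to get.
print(round(square_c, 3), round(circle_c, 3))
```

Because the ratio depends on the overall contour geometry rather than the vertex count, one extra approximation point on a square barely moves it, which is exactly what the point-count rule lacked.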

There are certainly many more improvements that would make this solution faster and more reliable, but those are the ones off the top of my head that would correct the errors which were occasionally present. Using a neural network would most likely also lead to more reliable object detection, especially if the camera were in a worse or changing position.

The code for the detection and recognition won’t be on GitHub, since I want to make it harder to find so that it won’t be reused by future students enrolling in this class. The code can be downloaded through the link below.

Download source code